Status: Pre-release - A sophisticated arXiv paper management and discovery system with AI-powered recommendations
arxiver is an intelligent research tool designed to help ML researchers, AI practitioners, and academics stay up-to-date with the rapidly evolving arXiv landscape. It combines semantic search, machine learning recommendations, and AI-powered summarization to streamline academic paper discovery and management.
- Semantic Search: Vector-based similarity search using ChromaDB and sentence transformers
- AI Recommendations: TensorFlow-powered models predict papers of interest based on your reading history
- Intelligent Summarization: LLM-generated concise summaries for quick paper evaluation
- Smart Paper Selection: AI-powered filtering to find the most relevant papers from large result sets
- Model Context Protocol: Enhanced MCP server with FastMCP best practices, middleware, and type safety
- Security & Logging: Comprehensive middleware for input validation, security, and request/response logging
- Modern Stack: FastAPI backend, ChromaDB vector store, UV package management, Pydantic models
- Multiple Interfaces: CLI tools, REST API, Streamlit UI, and production-ready MCP server
```
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│    arXiv API    │   │  Streamlit UI   │   │   Claude/AI     │
│                 │   │                 │   │   Assistant     │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         ▼                     ▼                     ▼
┌─────────────────────────────────────────────────────────────┐
│                       FastAPI Server                        │
│  ┌───────────┐  ┌───────────┐  ┌─────────────────────────┐  │
│  │ Ingestion │  │  Search   │  │       MCP Server        │  │
│  │ Pipeline  │  │  Engine   │  │                         │  │
│  └───────────┘  └───────────┘  └─────────────────────────┘  │
└──────────────────────────────┬──────────────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
        ┌──────────┐   ┌─────────────┐   ┌────────────┐
        │  SQLite  │   │  ChromaDB   │   │ TensorFlow │
        │ Database │   │Vector Store │   │   Models   │
        └──────────┘   └─────────────┘   └────────────┘
```
- Python: 3.11 or higher
- Git: For cloning the repository
- Optional: CUDA-compatible GPU for faster TensorFlow inference
uv is a fast Python package installer and resolver.
```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/woojay/arxiver.git
cd arxiver

# Install dependencies and create virtual environment
uv sync

# Activate the environment (if needed)
source .venv/bin/activate   # Linux/macOS
# or
.venv\Scripts\activate      # Windows
```

Using pip:

```bash
# Clone the repository
git clone https://github.com/woojay/arxiver.git
cd arxiver

# Create virtual environment
python -m venv .venv
source .venv/bin/activate   # Linux/macOS
# or
.venv\Scripts\activate      # Windows

# Install dependencies
pip install -e .

# For development
pip install -e ".[dev]"
```

Using conda:

```bash
# Create conda environment
conda create -n arxiver python=3.11
conda activate arxiver

# Clone and install
git clone https://github.com/woojay/arxiver.git
cd arxiver
pip install -e .
```

Create a `.env` file in the project root:
```bash
# Required: OpenAI API key for summarization and LLM features
OPENAI_API_KEY=your_openai_api_key_here

# Optional: Custom arXiv API settings
ARXIV_RESULTS_PER_PAGE=100
ARXIV_MAX_RESULTS=500

# Optional: Database settings
DATABASE_PATH=./data/arxiver.db
CHROMA_PERSIST_DIRECTORY=./data/chroma_db

# Optional: Model settings
MODEL_PATH=./predictor/
DEFAULT_EMBEDDING_MODEL=all-MiniLM-L6-v2
```

```bash
# Initialize the database and vector store
uv run python -c "from arxiver.database import init_db; init_db()"
```

```bash
# Using the CLI wrapper (from project root)
uv run python arxiver/main.py webserver

# Or using uvicorn directly
uv run uvicorn arxiver.main:app --reload --port 8000
```

```bash
# Ingest papers from the last 7 days
curl -X POST http://127.0.0.1:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"days": 7}'

# Or use the CLI (from project root)
uv run python arxiver/main.py ingest --days 7
```

```bash
cd ui
uv run streamlit run arxiver_ui.py --server.port 8001
```

Visit http://localhost:8001 to access the web interface.
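The variables in the `.env` example above are ordinary environment variables. As a minimal sketch, they can be read with stdlib Python alone (the `load_settings` helper is illustrative, not part of arxiver; arxiver itself presumably loads `.env` via a settings library):

```python
import os

def load_settings() -> dict:
    """Read arxiver-style settings, falling back to the documented defaults."""
    return {
        "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
        "results_per_page": int(os.environ.get("ARXIV_RESULTS_PER_PAGE", "100")),
        "max_results": int(os.environ.get("ARXIV_MAX_RESULTS", "500")),
        "database_path": os.environ.get("DATABASE_PATH", "./data/arxiver.db"),
        "chroma_dir": os.environ.get("CHROMA_PERSIST_DIRECTORY", "./data/chroma_db"),
    }

settings = load_settings()
```

Any variable left out of `.env` falls back to its default, so only `OPENAI_API_KEY` is strictly required.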
```bash
# Search for papers on transformers
curl -X POST http://127.0.0.1:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "transformer attention mechanisms", "top_k": 10}'

# Get AI-powered recommendations
curl -X POST http://127.0.0.1:8000/recommend \
  -H "Content-Type: application/json" \
  -d '{"days_back": 3}'

# Summarize a specific paper
curl -X POST http://127.0.0.1:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"paper_id": "2404.04292"}'

# Get the best papers from a search
curl -X POST http://127.0.0.1:8000/choose \
  -H "Content-Type: application/json" \
  -d '{"query": "computer vision", "top_i": 5, "search_k": 50}'
```

```python
import requests

# Search for papers
response = requests.post(
    "http://127.0.0.1:8000/query",
    json={"query": "large language models", "top_k": 5},
)
papers = response.json()

# Get recommendations
response = requests.post(
    "http://127.0.0.1:8000/recommend",
    json={"days_back": 7},
)
recommendations = response.json()
```

```bash
# Show available commands (from project root)
uv run python arxiver/main.py --help

# Ingest papers from a specific date range
uv run python arxiver/main.py ingest --days 14

# Add interested column to database (for ML training)
uv run python arxiver/main.py add-interested-column
```

arxiver includes a production-ready Model Context Protocol (MCP) server that enables AI assistants like Claude to interact with your paper database directly. The server implements FastMCP best practices with comprehensive middleware, type safety, and security features.
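AI assistants typically launch an MCP server from a client-side configuration file. As an illustration only, a Claude Desktop entry for arxiver might look like the following (the absolute path is a placeholder; README-MCP.md is the authoritative reference for the exact setup):

```json
{
  "mcpServers": {
    "arxiver": {
      "command": "uv",
      "args": ["--directory", "/path/to/arxiver", "run", "python", "arxiver/mcp_server.py"]
    }
  }
}
```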
The MCP server has been enhanced with modern FastMCP features:
- Security Middleware: Input validation, malicious pattern detection, and configurable security policies
- Logging Middleware: Comprehensive request/response logging with sanitized parameters
- Type Safety: Full Pydantic model integration with structured responses and error handling
- Performance: Execution time tracking and optimized response formatting
- Standards Compliance: Full MCP protocol compliance with enhanced error responses
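The logging and timing middleware described above can be pictured as a wrapper around every tool call. A stdlib-only sketch of the idea (the `log_calls` decorator and `search_papers` stub are illustrative; arxiver's actual server uses FastMCP's middleware hooks, not this code):

```python
import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("mcp")

def log_calls(tool):
    """Log sanitized arguments and execution time for each tool call."""
    @functools.wraps(tool)
    def wrapper(**kwargs):
        # Mask anything that looks like a secret (e.g. "api_key") before logging.
        safe = {k: ("***" if "key" in k.lower() else v) for k, v in kwargs.items()}
        start = time.perf_counter()
        result = tool(**kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1e3
        log.info("%s(%s) took %.1f ms", tool.__name__, json.dumps(safe), elapsed_ms)
        return result
    return wrapper

@log_calls
def search_papers(query: str, top_k: int = 10) -> list:
    # Illustrative stub standing in for the real semantic search tool.
    return [f"paper matching {query!r}"] * min(top_k, 3)
```

The same wrapping pattern also accommodates input validation: reject the call before invoking the tool if an argument fails a security check.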
For detailed information about the enhancements, see FASTMCP_ENHANCEMENTS.md.
Latest Updates: See CHANGELOG.md for detailed release notes and version history.
```bash
# Method 1: Direct Python execution (from project root)
uv run python arxiver/mcp_server.py

# Method 2: Using the shell script (from project root)
./run_mcp_server.sh
```

| Tool | Description | Parameters |
|---|---|---|
| `search_papers` | Semantic similarity search | `query`, `top_k` |
| `get_recommendations` | ML-powered recommendations | `days_back` |
| `summarize_paper` | Generate paper summaries | `paper_id` |
| `choose_best_papers` | AI-powered paper selection | `query`, `top_i`, `search_k` |
| `import_paper` | Import specific papers | `arxiv_id` |
| `get_paper_details` | Detailed paper information | `paper_id` |
```bash
# Search for papers (using the MCP CLI, if available)
mcp call search_papers '{"query": "reinforcement learning", "top_k": 10}'

# Get recommendations for the past week
mcp call get_recommendations '{"days_back": 7}'

# Import a specific paper
mcp call import_paper '{"arxiv_id": "2404.04292"}'
```

For detailed MCP integration instructions with Claude Desktop, see README-MCP.md.
```bash
# Install development dependencies
uv sync --dev

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=arxiver --cov-report=html

# Run specific test file
uv run pytest tests/test_database.py -v
```

```bash
# Format code
uv run black arxiver/ tests/

# Type checking
uv run mypy arxiver/

# Linting
uv run ruff check arxiver/
```

```bash
# Train interest prediction models
cd predictor
uv run python predict_interest.py

# Training creates timestamped model files;
# the latest model is automatically used for recommendations
```

| Endpoint | Method | Description | Parameters |
|---|---|---|---|
| `/ingest` | POST | Bulk ingest papers | `{"days": int}` |
| `/query` | POST | Semantic search | `{"query": str, "top_k": int}` |
| `/recommend` | POST | Get recommendations | `{"days_back": int}` |
| `/summarize` | POST | Summarize paper | `{"paper_id": str}` |
| `/choose` | POST | AI paper selection | `{"query": str, "top_i": int, "search_k": int}` |
| `/import` | POST | Import specific paper | `{"arxiv_id": str}` |
For complete API documentation, visit http://127.0.0.1:8000/docs when the server is running.
```
arxiver/
├── arxiver/                 # Main package
│   ├── __init__.py          # Package initialization
│   ├── main.py              # CLI and FastAPI server
│   ├── mcp_server.py        # MCP protocol server
│   ├── database.py          # SQLite database operations
│   ├── arxiv.py             # arXiv API integration
│   └── llm.py               # LLM and AI functionality
├── predictor/               # ML models and training
│   ├── predict_interest.py  # Model training script
│   └── model-*.keras        # Trained TensorFlow models
├── ui/                      # Streamlit web interface
│   └── arxiver_ui.py        # Streamlit application
├── tests/                   # Test suite
│   ├── test_database.py     # Database tests
│   ├── test_llm.py          # LLM functionality tests
│   └── test_mcp_tools.py    # MCP server tests
├── data/                    # Data storage (created at runtime)
│   ├── arxiver.db           # SQLite database
│   └── chroma_db/           # ChromaDB vector store
├── pyproject.toml           # Project configuration
├── README.md                # This file
├── README-MCP.md            # Detailed MCP documentation
└── run_mcp_server.sh        # MCP server startup script
```
We welcome contributions! Please open an issue on GitHub to discuss major changes before submitting a pull request.
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make your changes and add tests
- Run the test suite: `uv run pytest`
- Format code: `uv run black arxiver/ tests/`
- Commit changes: `git commit -m 'Add amazing feature'`
- Push to branch: `git push origin feature/amazing-feature`
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
If you use arxiver in your research, please cite:
```bibtex
@software{arxiver,
  title={arxiver: Intelligent arXiv Paper Discovery and Management},
  author={Woojay Poynter},
  year={2025},
  url={https://github.com/woojay/arxiver}
}
```

- Built with FastAPI, ChromaDB, and TensorFlow
- Uses sentence-transformers for semantic embeddings
- MCP integration powered by Anthropic's MCP
- Package management with uv
- Implemented Pagination: Added automatic pagination to prevent API timeouts and 500 errors with large result sets
- Safe Page Size: Uses page size of 100 to reliably fetch up to 1500 results per day without server errors
- Improved arXiv API Error Handling: Added defensive None checks in XML parsing to prevent AttributeError
- Enhanced Retry Logic: Better error logging for malformed API responses with automatic retry mechanism
- Robust Entry Parsing: Individual paper entries with missing fields are now logged and skipped instead of failing entire ingestion
- Better Debugging: Added detailed logging for request failures, pagination progress, and XML parsing errors
- Performance Optimization: Fixed recommendation endpoint with batch embedding retrieval (99.93% reduction in database queries)
- Path Configuration: Resolved relative path issues for production deployments
- Import Fixes: Corrected relative imports in vector_db module
- Error Handling: Fixed numpy array boolean ambiguity in embedding checks
- Enhanced Database Schema: Added comprehensive metadata fields (authors, categories, publication dates, etc.)
- Fixed ChromaDB Issues: Resolved vector database compatibility problems
- Improved Search: New author and category search capabilities
- Better Error Handling: Enhanced reliability and fallback mechanisms
For detailed change history, see CHANGELOG.md.
- Changelog - Complete version history and changes
- ChromaDB Issue Resolution (2025-07-19) - Vector database compatibility fixes
- Database Migration (2025-07-19) - Schema enhancement details
- Vector DB Reconstruction (2025-07-19) - Database rebuild procedures
- Prevention Measures (2025-07-19) - Safeguards to prevent future ChromaDB issues
- Critical Analysis (2025-07-19) - System failure analysis and fixes
- Comprehensive Review (2025-07-19) - Complete system review and testing documentation
Installation Problems:

```bash
# Clear uv cache if installation fails
uv cache clean

# Reinstall dependencies
rm -rf .venv
uv sync
```

Database Issues:

```bash
# Reset database (chroma_db is a directory, so -r is needed)
rm -rf data/arxiver.db data/chroma_db/
uv run python -c "from arxiver.database import init_db; init_db()"
```

ChromaDB Vector Database Issues:
- If experiencing '_type' errors or embedding failures, see ChromaDB Issue Resolution
- Complete vector database reconstruction may be required for corrupted databases
- Use the `fill-missing-embeddings` endpoint to regenerate embeddings after fixes
MCP Server Problems:
- Ensure the OpenAI API key is set in `.env`
- Check that the required ports are not in use (FastAPI: 8000; the MCP server runs separately)
- Verify ChromaDB initialization
- Ensure the database exists: `uv run python -c "from arxiver.database import init_db; init_db()"`
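To check whether the FastAPI port is already taken, a quick stdlib sketch (the `port_in_use` helper is illustrative, not part of arxiver):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success, i.e. when a listener accepted us.
        return s.connect_ex((host, port)) == 0

if port_in_use(8000):
    print("Port 8000 is busy; stop the other process or pick another port.")
```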
arXiv API Ingestion Issues:
- The system uses automatic pagination with page size of 100 to prevent 500 errors and timeouts
- Large result sets (up to 1500 per day) are fetched in multiple requests with rate limiting
- The ingestion process includes automatic retry logic with exponential backoff for network errors
- Malformed API responses are logged and skipped rather than failing the entire ingestion
- If seeing HTTPError 500 or timeouts, pagination will automatically handle the load
- If seeing AttributeError related to '.text' field, the system will retry up to 10 times
- Check logs for detailed error messages including XML parsing failures and pagination progress
- Empty result sets (0 articles) are normal and will not cause errors
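The pagination and retry behavior described above can be sketched as follows (the `fetch_page` stub, its fake 250-article feed, and the backoff cap are illustrative, not arxiver's actual ingestion code):

```python
import time

PAGE_SIZE = 100      # safe page size, per the notes above
MAX_RETRIES = 10     # matches the documented retry limit

def fetch_page(start: int, page_size: int) -> list:
    """Stand-in for one arXiv API request; arxiver parses Atom XML here."""
    total = 250  # pretend the feed holds 250 articles today
    return [f"article-{i}" for i in range(start, min(start + page_size, total))]

def fetch_with_retry(start: int, page_size: int) -> list:
    """Retry one page with capped exponential backoff on transient errors."""
    for attempt in range(MAX_RETRIES):
        try:
            return fetch_page(start, page_size)
        except Exception:
            time.sleep(min(2 ** attempt, 60))
    # Give up on this page rather than failing the entire ingestion.
    return []

def ingest(max_results: int = 1500) -> list:
    """Page through results until a short (or empty) page signals the end."""
    articles, start = [], 0
    while start < max_results:
        page = fetch_with_retry(start, PAGE_SIZE)
        articles.extend(page)
        if len(page) < PAGE_SIZE:
            break
        start += PAGE_SIZE
    return articles
```

With the stub feed of 250 articles, `ingest()` issues three paged requests (100 + 100 + 50) and stops when the short final page arrives, which is also why an empty first page simply yields zero articles without error.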
Model Training Issues:
- Ensure sufficient disk space for model files
- Check TensorFlow GPU installation if using GPU
- Verify training data exists in database
For more detailed troubleshooting, see README-MCP.md or open an issue on GitHub.
Happy researching!