
RAG System

A Retrieval-Augmented Generation (RAG) system that combines document retrieval with language model generation using LMStudio models and LangGraph orchestration.

Features

  • Document Processing: Support for PDF, TXT, Markdown, Word (.docx), and CSV files
  • Vector Storage: Qdrant integration for efficient similarity search
  • Local LLM: LMStudio integration for privacy-preserving generation
  • Workflow Orchestration: LangGraph-based pipeline management
  • Configurable: Flexible configuration system with environment variable support
  • Modular Architecture: Clean separation of concerns for maintainability

Quick Start

Prerequisites

  • Python 3.8 or higher
  • Docker (for Qdrant vector database)
  • LMStudio (for local LLM inference)

Installation

1. Set up Qdrant Vector Database

Run Qdrant using Docker with persistent storage:

# Create a directory for Qdrant data persistence
mkdir qdrant_storage

# Run Qdrant with Docker (with data persistence)
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -p 6334:6334 \
  -v ./qdrant_storage:/qdrant/storage:z \
  qdrant/qdrant

Verify Qdrant is running:

curl http://localhost:6333
# Should return: {"title":"qdrant - vector search engine","version":"1.x.x"}

2. Install and Configure LMStudio

Download and Install LMStudio:

  1. Visit the LMStudio website (https://lmstudio.ai)
  2. Download LMStudio for your operating system (Windows/macOS/Linux)
  3. Run the installer and follow the setup wizard

Download the Required Models:

This project uses Google's Gemma 3 4B Instruct model for generation and an All-MiniLM-L6-v2 model for embeddings. Download both using the lms CLI:

# Download the specific models used in this project
lms get unsloth/gemma-3-4b-it-GGUF

lms get second-state/All-MiniLM-L6-v2-Embedding-GGUF

# Verify models were downloaded
lms ls

Start the LMStudio Server via CLI:

# Load the model into memory
lms load unsloth/gemma-3-4b-it-GGUF

# Start the server on default port (1234)
lms server start

# Or start with specific configuration
lms server start --port 1234 --cors

# Check server status
lms server status

# View loaded models
lms ps

# Stop the server
lms server stop

3. Install Python Dependencies

  1. Clone the repository:
git clone <repository-url>
cd <repository>
  2. Install the RAG CLI system-wide using pipx (recommended):
# Install pipx if not already installed
# On macOS:
brew install pipx

# On other systems, see: https://pipx.pypa.io/stable/installation/

# Install the RAG CLI
pipx install .

# Ensure pipx is in your PATH
pipx ensurepath

# Restart your terminal or source your shell config
source ~/.zshrc  # or ~/.bashrc

Alternative: Virtual Environment Installation

If you prefer using a virtual environment:

# Create a virtual environment
python -m venv rag-env

# Activate it
# On macOS/Linux:
source rag-env/bin/activate
# On Windows:
# rag-env\Scripts\activate

# Install dependencies and package
pip install -r requirements.txt
pip install -e .

Configuration

Configuration Setup

  1. The system uses a unified configuration file with performance optimizations built-in:
# The main configuration file is already optimized for performance
config/rag_config.yaml
  2. Edit the configuration file to match your setup:
# config/rag_config.yaml
qdrant_host: "localhost"
qdrant_port: 6333
collection_name: "documents"

lmstudio_endpoint: "http://localhost:1234"
model_name: "unsloth/gemma-3-4b-it-GGUF"  # The specific model used in this project

embedding_model: "sentence-transformers/all-MiniLM-L6-v2"

# Performance optimizations are already configured
embedding_batch_size: 64          # Optimized for better GPU utilization
retrieval_enable_cache: true      # Enable query caching
qdrant_prefer_grpc: true          # Use gRPC for better performance

Troubleshooting Setup

If Qdrant won't start:

# Check if port is in use
# On macOS/Linux:
netstat -an | grep :6333
# On Windows:
netstat -an | findstr :6333

# Stop existing container
docker stop qdrant
docker rm qdrant

# Restart with fresh container
docker run -d --name qdrant -p 6333:6333 -p 6334:6334 -v ./qdrant_storage:/qdrant/storage:z qdrant/qdrant

If LMStudio connection fails:

  1. Ensure LMStudio server is running:
    lms server status
    lms ps  # Check if model is loaded
  2. Check if the correct model is loaded:
    lms ls  # List available models
    lms load unsloth/gemma-3-4b-it-GGUF  # Load the required model
  3. Restart the server if needed:
    lms server stop
    lms server start --port 1234 --cors
  4. Verify model endpoint:
    curl http://localhost:1234/v1/models
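
The `/v1/models` check above works because LMStudio's local server exposes an OpenAI-compatible API. A minimal sketch of building and sending a chat-completions request against it (the helper names and payload shape are illustrative, not part of this project's code):

```python
import json
import urllib.request

LMSTUDIO_ENDPOINT = "http://localhost:1234"  # default lms server port


def build_chat_request(model: str, question: str, context: str) -> dict:
    """Build an OpenAI-style chat-completions payload (illustrative helper)."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
        "temperature": 0.2,
    }


def ask(payload: dict) -> str:
    """POST the payload to the local server; requires `lms server start` running."""
    req = urllib.request.Request(
        f"{LMSTUDIO_ENDPOINT}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("unsloth/gemma-3-4b-it-GGUF",
                             "What is RAG?",
                             "RAG combines retrieval with generation.")
# ask(payload) would return the model's answer when the server is up
```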

Usage

Command Line Interface

# Interactive mode
rag-cli

# Single query
rag-cli query "What is machine learning?"

# Batch processing
rag-cli batch queries.txt --output results.json

# Custom configuration
rag-cli --config production_config.yaml query "Your question"
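
How a batch run like the one above might be plumbed internally, assuming one query per line in queries.txt (the file format and helper below are illustrative guesses, not the CLI's actual implementation):

```python
import json
from pathlib import Path


def run_batch(queries_path: str, output_path: str, answer_fn) -> int:
    """Read one query per line, answer each, write a JSON list of results.

    answer_fn stands in for the real retrieval-plus-generation pipeline.
    """
    queries = [q.strip()
               for q in Path(queries_path).read_text().splitlines()
               if q.strip()]
    results = [{"query": q, "answer": answer_fn(q)} for q in queries]
    Path(output_path).write_text(json.dumps(results, indent=2))
    return len(results)
```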

Docker Management Commands

# Start Qdrant
docker start qdrant

# Stop Qdrant
docker stop qdrant

# View Qdrant logs
docker logs qdrant

# Remove Qdrant container (keeps data in qdrant_storage/)
docker rm qdrant

# Backup Qdrant data
# On macOS/Linux:
cp -r qdrant_storage qdrant_backup
# On Windows:
xcopy qdrant_storage qdrant_backup /E /I

# Restore Qdrant data
# On macOS/Linux:
cp -r qdrant_backup/* qdrant_storage/
# On Windows:
xcopy qdrant_backup qdrant_storage /E /I /Y

Project Structure

rag_system/
├── __init__.py              # Package initialization
├── config.py                # Configuration management
├── models.py                # Data models
├── logging_config.py        # Logging setup
├── document_processing/     # Document ingestion and processing
├── storage/                 # Vector database integration
├── retrieval/               # Query processing and retrieval
├── generation/              # LMStudio integration and generation
├── orchestration/           # LangGraph workflow management
└── interfaces/              # CLI and API interfaces

Configuration

The system uses a unified configuration file that includes both basic settings and performance optimizations:

Key Configuration Sections:

  1. Qdrant Settings: Vector database connection and performance settings
  2. LMStudio Settings: Local LLM configuration with streaming and connection pooling
  3. Embedding Settings: Model configuration with GPU optimization and batch processing
  4. Retrieval Settings: Search parameters with caching and concurrency limits
  5. Document Processing: File handling with streaming and concurrent processing
  6. Performance Settings: Memory management, async configuration, and optimization flags

Configuration Methods

The system supports configuration through:

  1. YAML file: config/rag_config.yaml (recommended)
  2. Environment variables: Prefix with RAG_ (e.g., RAG_QDRANT_HOST)
  3. Programmatic configuration: Use the RAGConfig class directly
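
The second and third methods can be combined. A sketch of a RAGConfig-style dataclass whose defaults are overridden by RAG_-prefixed environment variables (field names follow the YAML keys above; the actual class in rag_system/config.py may differ):

```python
import os
from dataclasses import dataclass, fields


@dataclass
class RAGConfig:
    # Defaults mirror config/rag_config.yaml
    qdrant_host: str = "localhost"
    qdrant_port: int = 6333
    collection_name: str = "documents"
    lmstudio_endpoint: str = "http://localhost:1234"

    @classmethod
    def from_env(cls) -> "RAGConfig":
        """Override each default with RAG_<FIELD> when that variable is set."""
        kwargs = {}
        for f in fields(cls):
            raw = os.environ.get(f"RAG_{f.name.upper()}")
            if raw is not None:
                kwargs[f.name] = int(raw) if f.type is int else raw
        return cls(**kwargs)
```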

Performance Optimizations Included

The unified configuration includes built-in performance optimizations:

  • Embedding batch size: Increased to 64 for better GPU utilization
  • Query caching: LRU cache with 128 query capacity
  • gRPC connections: Enabled for Qdrant for better performance
  • Concurrent processing: Optimized limits for file processing and searches
  • Memory management: Garbage collection hints and monitoring thresholds
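
The query cache described above is conventionally a small LRU layer in front of retrieval. A minimal sketch with the stated 128-query capacity using only the standard library (the real cache lives inside the retrieval module and may be implemented differently):

```python
from functools import lru_cache


@lru_cache(maxsize=128)  # matches the configured 128-query capacity
def cached_retrieve(query: str) -> tuple:
    """Stand-in for an expensive vector search; returns a hashable tuple."""
    return (f"doc-for-{query}",)  # placeholder for real retrieved chunks
```

Repeated queries then skip the search entirely, which is where the hit/miss speedup factors in the benchmarks come from.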

Development

Installation Methods

The RAG CLI can be installed in several ways:

Method 1: pipx (Recommended for CLI tools)

# Install pipx if not already installed
# On macOS:
brew install pipx
# On other systems: https://pipx.pypa.io/stable/installation/

# Install RAG CLI
pipx install .

# Ensure pipx is in PATH
pipx ensurepath

# Restart terminal or source shell config
source ~/.zshrc  # or ~/.bashrc

Method 2: Virtual Environment

python -m venv rag-env
source rag-env/bin/activate  # On Windows: rag-env\Scripts\activate
pip install -e .

Method 3: User Installation

pip install --user .

Note: On modern Python installations (following PEP 668), system-wide pip installation may be restricted. Use pipx or virtual environments instead.

Setup Development Environment

# Install development dependencies
pip install -r requirements-dev.txt

Code Quality Checks

Before committing code, run quality checks:

# Run all quality checks (formatting, linting, type checking, tests)
python pre-commit-check.py

# Auto-fix formatting and import issues
python pre-commit-check.py --fix

# Quick check without tests
python pre-commit-check.py --skip-tests

# Show detailed output
python pre-commit-check.py --verbose

See docs/QUALITY_CHECKS.md for detailed information.

Performance Benchmarking

The system includes comprehensive benchmarking tools to measure and validate performance improvements:

Available Benchmark Scripts

  1. Quick Benchmark (quick_benchmark.py)

    • Fast performance test (< 2 minutes)
    • Tests core functionality and optimizations
    • Provides immediate performance assessment
    python quick_benchmark.py
  2. Comprehensive Benchmark (benchmark_rag_performance.py)

    • Complete performance analysis (5-10 minutes)
    • Tests all system components
    • Detailed metrics and system monitoring
    • Memory usage analysis
    python benchmark_rag_performance.py

Benchmark Results

The benchmarks test:

  • Embedding Performance: Single vs batch processing, throughput metrics
  • Retrieval Performance: Query response times, result quality
  • Cache Effectiveness: Hit/miss ratios, speedup factors
  • Batch Operations: Concurrent vs sequential processing
  • Memory Usage: Peak usage, cleanup efficiency
  • End-to-End Performance: Complete RAG pipeline timing
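
The single-versus-batch embedding comparison can be sketched as a toy harness; embed_one and embed_batch below are stand-ins for real embedding calls, with a simulated fixed per-call overhead:

```python
import time


def embed_one(text: str) -> list:
    """Stand-in for a per-item embedding call with fixed overhead."""
    time.sleep(0.001)  # simulated per-call overhead
    return [float(len(text))]


def embed_batch(texts: list) -> list:
    """Stand-in for a batched call that amortizes the overhead."""
    time.sleep(0.001)
    return [[float(len(t))] for t in texts]


texts = [f"chunk {i}" for i in range(64)]  # matches embedding_batch_size: 64

t0 = time.perf_counter()
singles = [embed_one(t) for t in texts]
single_s = time.perf_counter() - t0

t0 = time.perf_counter()
batched = embed_batch(texts)
batch_s = time.perf_counter() - t0
```

With 64 items, the batched call pays the overhead once instead of 64 times, which is the throughput gain the real benchmarks quantify.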

Running Tests

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=rag_system --cov-report=html

# Run specific test file
pytest tests/test_config.py -v

# Run specific test
pytest tests/test_config.py::TestRAGConfig::test_default_config -v

Code Style

# Format code
ruff format rag_system/ tests/

# Lint code
ruff check rag_system/ tests/

# Auto-fix linting issues
ruff check --fix rag_system/ tests/

# Type checking
pyright rag_system/