
Agentic RAG with Reinforcement Learning for Multi-Step Reasoning

A Three-Stage System for Reasoning-Enhanced Language Models

Transforming Qwen 2.5 7B into a reasoning agent through SFT, GRPO, and RAG integration



Tech Stack

Core Technologies:

  • Python 3.10+ - Primary programming language
  • Unsloth - Memory-efficient fine-tuning framework (2x faster training, 60% less memory)
  • FastAPI - High-performance web framework for model serving
  • vLLM - Fast inference engine with PagedAttention
  • PyTorch - Deep learning framework
  • Transformers - Hugging Face model library
  • LoRA/PEFT - Parameter-efficient fine-tuning

Overview

This project implements an Agentic Retrieval-Augmented Generation (RAG) system enhanced with reinforcement learning to enable multi-step reasoning. Unlike conventional "fetch and summarize" RAG pipelines, our system can plan, search, analyze, and conclude - mimicking human-like reasoning patterns.

The Problem

Traditional RAG systems follow a simple retrieve-and-summarize approach, which limits their ability to perform complex reasoning tasks. We aimed to build an agent that:

  • Plans its approach to answering questions
  • Searches for relevant information dynamically
  • Analyzes the retrieved data
  • Concludes with well-reasoned responses

Our Solution

We developed a three-stage pipeline that transforms the Qwen 2.5 7B model (4-bit quantized) into a reasoning agent:

  1. Stage 1 (SFT): Establish reasoning format through Supervised Fine-Tuning
  2. Stage 2 (GRPO): Enhance reasoning quality using Group Relative Policy Optimization
  3. Stage 3 (Agentic RAG): Deploy with web search integration for real-time knowledge

Key Features

  • Structured Reasoning: Model generates explicit reasoning traces before providing answers
  • Multi-Mode Operation:
    • Fast Chat: Quick conversational responses
    • Deep Reasoning: Step-by-step analytical thinking
    • Search Mode: RAG-enhanced responses with real-time web search
  • Reinforcement Learning: GRPO training for improved format compliance (+25%) and reasoning depth
  • Memory-Efficient Training: LoRA adapters enable training on single GPU (40GB VRAM)
  • Beautiful UI: Modern, responsive frontend with real-time streaming
  • Dynamic Tool Use: Integration with DuckDuckGo search and semantic filtering

Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

Objective: Teach the model to follow a structured reasoning format.

Configuration:

  • Base Model: unsloth/Qwen2.5-7B-bnb-4bit (4-bit quantization)
  • Dataset: Alpaca-cleaned (5,000 instruction/response pairs)
  • Technique: LoRA (Low-Rank Adaptation)
    • Rank: 32
    • Alpha: 64
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Training: 200 steps, batch size 2, gradient accumulation 4
  • Learning Rate: 2e-4 (linear scheduler)
  • Optimizer: AdamW (8-bit)

Reasoning Format:

<start_working_out>
[Model's internal reasoning process]
<end_working_out>
<SOLUTION>
[Final answer]
</SOLUTION>
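
Format compliance (reported under Results) is presumably measured with a tag check along these lines. This is a minimal sketch: `check_format` and its exact matching rules are illustrative, not the notebooks' actual code.

```python
import re

# Compliant outputs wrap reasoning and answer in the tags defined above
FORMAT_RE = re.compile(
    r"<start_working_out>(.*?)<end_working_out>\s*"
    r"<SOLUTION>(.*?)</SOLUTION>",
    re.DOTALL,
)

def check_format(output: str):
    """Return (is_compliant, reasoning, solution) for one model output."""
    m = FORMAT_RE.search(output)
    if m is None:
        return False, None, None
    return True, m.group(1).strip(), m.group(2).strip()
```

Running this over a batch of generations and averaging the boolean gives a compliance rate like the 80% figure reported below.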

Results:

  • Training loss reduced from 1.4040 to 0.7920 (a 43.6% reduction)
  • Format compliance: 80% on unseen samples
  • BERTScore F1: 0.6050
  • ROUGE-L: 0.3148

Notebook: Stage_1.ipynb Outputs: stage1_sft/ (plots, models, results, logs)


Stage 2: Group Relative Policy Optimization (GRPO)

Objective: Refine reasoning quality and enforce structured output through reinforcement learning.

Why GRPO over PPO?

  • GRPO doesn't require a separate value model → 50% less memory usage
  • More efficient for resource-constrained environments
  • Better suited for local GPU training

Reward Functions:

  1. Exact Format Match (+3.0): Correct use of reasoning tags
  2. Approximate Format Match (+0.5): Partial tag compliance
  3. Answer Quality (ROUGE):
    • Exact match: +5.0
    • High ROUGE: +3.0
  4. Instruction Following: Keyword overlap and logical connectors
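
The format rewards above can be sketched as a single scoring function. The +3.0 and +0.5 weights come from the list; the partial-credit rule (all four tags present but the layout imperfect) is an assumption, and the real trainer also adds the ROUGE and instruction-following terms.

```python
import re

# Exact layout: reasoning tags first, then the solution tags, nothing else
EXACT_RE = re.compile(
    r"^\s*<start_working_out>.*?<end_working_out>\s*"
    r"<SOLUTION>.*?</SOLUTION>\s*$",
    re.DOTALL,
)
TAGS = ["<start_working_out>", "<end_working_out>", "<SOLUTION>", "</SOLUTION>"]

def format_reward(completion: str) -> float:
    """Reward the structured reasoning format (sketch of the GRPO signal)."""
    if EXACT_RE.match(completion):
        return 3.0  # exact format match
    if all(tag in completion for tag in TAGS):
        return 0.5  # approximate match: tags present, layout imperfect (assumed rule)
    return 0.0
```

GRPO samples a group of completions per prompt and normalizes these rewards within the group, so only relative differences matter.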

Configuration:

  • Base Model: Stage 1 SFT model
  • Steps: 100 training steps
  • Batch Size: 4 (gradient accumulation 2)
  • KL Divergence: Monitored to prevent policy drift

Results:

  • Format compliance: 65% → 90% (+25 percentage points)
  • KL divergence: 12.2 → 3.7 (stable convergence)
  • Average reasoning length: +13 tokens (more detailed explanations)
  • BERTScore F1: 0.3338 → 0.3400 (+0.62%)

Notebook: Stage_2.ipynb Outputs: stage2_grpo/ (plots, models, results, logs)


Stage 3: Agentic RAG Deployment

Objective: Integrate trained models with external tools for real-time reasoning and retrieval.

System Components:

  1. Agent Orchestrator: Routes queries based on intent detection
  2. Memory Module: Rolling context window (10-turn history using deque)
  3. Tool Integration: DuckDuckGo search with semantic filtering
  4. FastAPI Backend: Serves models with LoRA adapter switching
  5. Frontend: Beautiful, responsive UI with real-time updates
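
The memory module (component 2) can be sketched with a `deque`, as described above. The class name and message shape here are illustrative, not the backend's exact code.

```python
from collections import deque

class ConversationMemory:
    """Rolling context window over the last N turns (the system uses 10)."""

    def __init__(self, max_turns: int = 10):
        # deque(maxlen=...) silently drops the oldest turn when full
        self.turns = deque(maxlen=max_turns)

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def as_messages(self) -> list:
        """Return the retained turns, oldest first, for prompt assembly."""
        return list(self.turns)
```

Because `deque` handles eviction itself, the orchestrator never has to trim history manually.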

RAG Pipeline:

User Query → Web Search (20 results) → Semantic Filtering (Top-5)
           → Context Injection → GRPO Model → Grounded Response

Semantic Filtering:

  • Uses SentenceTransformer (all-MiniLM-L6-v2) for embeddings
  • Computes cosine similarity between query and documents
  • Selects top-K most relevant snippets (default K=5)
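
The top-K step can be sketched independently of the embedding model; here plain NumPy vectors stand in for the all-MiniLM-L6-v2 embeddings, and the function name is illustrative.

```python
import numpy as np

def top_k_by_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                       # cosine similarity per document
    return np.argsort(sims)[-k:][::-1] # highest similarity first
```

With 20 search snippets and K=5, this keeps only the quarter of results most relevant to the query before context injection.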


Results & Performance

Training Dynamics

Stage 1: SFT Training Loss Curve

Stage 1 Training

Training loss decreased by 43.6%, demonstrating successful adaptation to the instruction-following format.


Stage 2: SFT vs GRPO Comparison Dashboard

GRPO Comparison

GRPO achieved 25% improvement in format compliance while maintaining semantic quality.


GRPO Training Dynamics

GRPO Training

Reward progression and KL divergence monitoring during GRPO training.

Quantitative Results

Metric                  Stage 1 (SFT)   Stage 2 (GRPO)   Improvement
Format Compliance       65.0%           90.0%            +25.0%
BERTScore (F1)          0.3338          0.3400           +0.62%
ROUGE-L                 0.1387          0.1293           -0.94%
BLEU                    0.1024          0.0817           -2.07%
Avg. Reasoning Length   364 tokens      377 tokens       +13 tokens

Note: ROUGE-L and BLEU decreased slightly because GRPO promotes longer reasoning chains, reducing n-gram overlap with reference answers. This is expected and acceptable given the +25% format compliance improvement.

Visual Analysis

Explore the comprehensive evaluation dashboards:

  • stage1_sft/plots/ - Dataset EDA, training dynamics, format compliance, error analysis, evaluation metrics, reasoning analysis
  • stage2_grpo/plots/ - SFT vs GRPO comparison dashboard, GRPO training dynamics, prompt length distribution

Project Structure

AGENT-RAG/
│
├── README.md
│   Complete project documentation
│
├── reference.ipynb
│   Initial prototype notebook containing all three stages as reference
│
├── Stage_1.ipynb
│   Stage 1 implementation: Supervised Fine-Tuning with reasoning format
│
├── Stage_2.ipynb
│   Stage 2 implementation: GRPO reinforcement learning training
│
├── backend/
│   │
│   ├── main.py
│   │   FastAPI server implementation
│   │   - Model loading (Qwen 2.5 7B with 4-bit quantization)
│   │   - LoRA adapter switching (SFT/GRPO models)
│   │   - RAG search integration with DuckDuckGo
│   │   - Semantic filtering using SentenceTransformer
│   │   - Three endpoints: /generate, /api/health, static frontend
│   │
│   ├── client.py
│   │   Test client for API calls
│   │
│   ├── test.py
│   │   Unit tests for backend functionality
│   │
│   ├── models/
│   │   │
│   │   ├── sft_alpaca_model/
│   │   │   Stage 1 trained model (LoRA adapter)
│   │   │   - adapter_model.safetensors
│   │   │   - adapter_config.json
│   │   │   - README.md
│   │   │
│   │   └── grpo_alpaca_model/
│   │       Stage 2 trained model (LoRA adapter)
│   │       - adapter_model.safetensors
│   │       - adapter_config.json
│   │       - README.md
│   │
│   └── unsloth_compiled_cache/
│       Compiled Unsloth trainer classes for faster execution
│       - UnslothSFTTrainer.py
│       - UnslothGRPOTrainer.py
│       - UnslothDPOTrainer.py
│       - And other trainer variants
│
├── frontend/
│   │
│   └── index.html
│       ZEEJAI Hyper Chat web interface
│       - Modern UI with gradient animations
│       - Three mode selector (Fast Chat, Deep Reasoning, Search Mode)
│       - Real-time message streaming
│       - Reasoning trace display
│       - Source citation display
│
├── stage1_sft/
│   │
│   ├── models/
│   │   └── checkpoints/
│   │       Training checkpoints at steps 50, 100, 150, 200
│   │       Each contains: model weights, optimizer state, scheduler state, tokenizer
│   │
│   ├── plots/
│   │   All Stage 1 visualizations and analysis
│   │   - training_dynamics.png - Loss curves and learning rate schedule
│   │   - evaluation_metrics_dashboard.png - BERTScore, ROUGE, BLEU metrics
│   │   - final_summary_dashboard_clean.png - Complete training summary
│   │   - format_compliance.png - Reasoning tag usage analysis
│   │   - reasoning_solution_analysis.png - Output structure analysis
│   │   - error_analysis.png - Common failure patterns
│   │   - dataset_eda.png - Dataset distribution analysis
│   │   - token_length_analysis.png - Input/output length statistics
│   │   - token_length_distribution.png - Length histograms
│   │   - model_comparison_table.png - Metric comparisons
│   │
│   ├── results/
│   │   Evaluation outputs on test sets
│   │   - Model predictions
│   │   - Metric scores
│   │   - Sample outputs
│   │
│   └── logs/
│       Training logs and TensorBoard events
│
├── stage2_grpo/
│   │
│   ├── models/
│   │   GRPO trained model checkpoints
│   │   Final adapter saved to backend/models/grpo_alpaca_model/
│   │
│   ├── plots/
│   │   All Stage 2 visualizations and analysis
│   │   - sft_vs_grpo_dashboard.png - Complete comparison between SFT and GRPO
│   │   - grpo_training_dynamics.png - Reward progression and KL divergence
│   │   - prompt_length_distribution.png - Training prompt statistics
│   │
│   ├── results/
│   │   GRPO evaluation outputs
│   │   - Format compliance scores
│   │   - Answer quality metrics
│   │   - Comparison with SFT baseline
│   │
│   └── logs/
│       GRPO training logs and reward histories
│
├── Agentic_RAG_with_Reinforcement_Learning_for_Multi_Step_Reasoning.pdf
│   Complete academic research report
│   - Introduction and motivation
│   - Three-stage methodology
│   - System development with code
│   - Experiments and results
│   - Visual analysis
│   - Challenges and solutions
│   - Future work
│   - References
│
├── Stage-1.pdf
│   Detailed Stage 1 analysis with extensive visualizations
│
├── stage-2.pdf
│   Detailed Stage 2 GRPO training analysis and metrics
│
└── video.mp4
    Project demonstration video
    - System architecture walkthrough
    - Live demonstration of all three modes
    - Training process visualization
    - Results analysis
    - Discussion of challenges

Installation & Setup

Prerequisites

  • Python: 3.10+
  • CUDA: 11.8+ (for GPU acceleration)
  • GPU: NVIDIA GPU with 40GB+ VRAM (tested on A100)
  • RAM: 32GB+ recommended

Step 1: Clone Repository

git clone https://github.com/yourusername/AGENT-RAG.git
cd AGENT-RAG

Step 2: Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Step 3: Install Dependencies

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install unsloth vllm fastapi uvicorn
pip install sentence-transformers scikit-learn duckduckgo-search
pip install transformers datasets peft accelerate
pip install rouge-score nltk pandas matplotlib seaborn

Step 4: Download Base Model

The notebooks will automatically download unsloth/Qwen2.5-7B-bnb-4bit on first run.


Usage

Training (Notebooks)

Stage 1: Supervised Fine-Tuning

jupyter notebook Stage_1.ipynb
  • Trains the base model with reasoning format
  • Outputs saved to stage1_sft/

Stage 2: GRPO Reinforcement Learning

jupyter notebook Stage_2.ipynb
  • Fine-tunes SFT model with reinforcement learning
  • Outputs saved to stage2_grpo/

Deployment (FastAPI + Frontend)

Start Backend Server

cd backend
python main.py

Server runs on http://localhost:8000

Access Frontend

Open frontend/index.html in your browser or navigate to http://localhost:8000

API Endpoints

Generate Response:

curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of France?",
    "mode": "reasoning"
  }'

Modes:

  • chat - Fast conversational responses (SFT model)
  • reasoning - Deep analytical thinking (GRPO model)
  • search - RAG-enhanced with web search (GRPO + DuckDuckGo)

Health Check:

curl http://localhost:8000/api/health

Technical Implementation

Model Architecture

Base Model: Qwen 2.5 7B (4-bit quantization via bitsandbytes)

LoRA Configuration:

lora_rank = 32
lora_alpha = 64
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

Chat Template:

system_prompt = """You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION>"""
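
Putting the system prompt and reasoning format together, one SFT training target might be assembled like this. This is a sketch: the notebooks apply the tokenizer's chat template rather than raw string concatenation, and `build_example` is illustrative.

```python
system_prompt = """You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION>"""

def build_example(question: str, reasoning: str, answer: str) -> str:
    """Assemble one training string in the project's reasoning format."""
    target = (
        f"<start_working_out>{reasoning}<end_working_out>"
        f"<SOLUTION>{answer}</SOLUTION>"
    )
    return f"{system_prompt}\n\n{question}\n\n{target}"
```

Training on 5,000 such pairs is what teaches the base model to emit the tags unprompted at inference time.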

RAG Implementation

Search & Retrieval:

def rag_search(query, num_results=20, top_k=5):
    # 1. Fetch results from DuckDuckGo
    results = DDGS().text(query, max_results=num_results)
    documents = [r["body"] for r in results]

    # 2. Encode documents and query
    doc_embeddings = embedding_model.encode(documents)
    query_embedding = embedding_model.encode(query)

    # 3. Compute cosine similarity between the query and each document
    similarities = cosine_similarity([query_embedding], doc_embeddings)[0]

    # 4. Select the top-K most relevant snippets
    top_indices = similarities.argsort()[-top_k:][::-1]

    # 5. Assemble the context for prompt injection and keep sources for citation
    context = "\n\n".join(documents[i] for i in top_indices)
    sources = [results[i]["href"] for i in top_indices]
    return context, sources

Memory Management

Techniques Used:

  • 4-bit Quantization: Reduces model size from 28GB to ~7GB
  • LoRA Adapters: Train only 0.2% of parameters
  • Gradient Checkpointing: Trade compute for memory
  • Dynamic LoRA Switching: Load adapters on-demand
  • Gradient Accumulation: Simulate larger batch sizes
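
As a sanity check on the last bullet, the effective batch size follows directly from the Stage 1 settings (per-device batch 2, gradient accumulation 4):

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Examples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

print(effective_batch_size(2, 4))  # → 8
```

Gradients are summed across the 4 micro-batches before each update, so memory stays at the 2-example level while the optimizer sees 8 examples per step.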

Documentation & Reports

Research Report

Agentic_RAG_with_Reinforcement_Learning_for_Multi_Step_Reasoning.pdf

Comprehensive academic report covering:

  • Introduction & motivation
  • Complete methodology
  • System development (code walkthrough)
  • Experiments & results
  • Visual analysis
  • Challenges & solutions
  • Future work & references

Stage Reports

  • Stage-1.pdf: Detailed SFT analysis with extensive visualizations
  • stage-2.pdf: GRPO training analysis and comparison metrics

Video Demonstration

video.mp4: Complete project walkthrough including:

  • Architecture overview
  • Live demonstration of all three modes
  • Training process visualization
  • Results analysis
  • Challenges discussion

Challenges & Solutions

Challenge 1: CUDA Out-of-Memory (Stage 3)

Problem: Deploying FastAPI server while notebook kernel was active caused GPU memory exhaustion (40GB VRAM exceeded).

Root Cause: vLLM's aggressive memory reservation created duplicate model instances.

Solution:

  • Implemented singleton pattern for model loading
  • Shut down notebook kernel before launching API server
  • Future: Use vLLM's LoRA adapter switching within single instance

Challenge 2: GRPO Reward Instability

Problem: Model exploited reward system by generating correct tags with nonsensical reasoning.

Root Cause: Format reward disproportionately high vs. semantic quality reward.

Solution:

  • Rebalanced reward weights (decreased format, increased ROUGE)
  • Added negative rewards for hallucinations (-1.0 for low ROUGE-L)
  • Monitored KL divergence to prevent policy drift

Challenge 3: Hyperparameter Tuning

Problem: Finding optimal balance between memory efficiency and model expressiveness.

Solution:

  • LoRA Rank 32 (tested 8, 16, 32) - 32 provided best format learning
  • Learning Rate 2e-4 (tested 1e-5 to 5e-4) - 2e-4 balanced speed and stability
  • Gradient Accumulation 4 - Enabled effective batch size 8 on single GPU

Future Work

Planned Enhancements

  1. Vector Database Integration

    • Replace in-memory cosine similarity with ChromaDB
    • Enable persistent, scalable document storage
    • Support for multi-session context
  2. Tool-Use Tokens

    • Train model to request retrieval actions autonomously
    • Move from rule-based routing to learned tool invocation
    • Enable multi-hop reasoning with dynamic planning
  3. Memory Optimization

    • Implement vLLM's PagedAttention for KV-cache management
    • Support concurrent multi-user serving
    • Dynamic batch sizing for throughput optimization
  4. Extended Training

    • Fine-tune on domain-specific datasets (medical, legal, scientific)
    • Multi-task learning across reasoning types
    • Longer context windows (8K → 32K tokens)
  5. Evaluation Framework

    • Automated reasoning quality assessment
    • Benchmark against GPT-4, Claude, DeepSeek-R1
    • Human evaluation on multi-step reasoning tasks

Contributors

Azeezulla Mohammed

DePaul University | MMOHA134@depaul.edu

Contributions:

  • Model training and execution (SFT + GRPO)
  • Infrastructure setup (NVIDIA A100 GPU cluster)
  • Technical implementation (Unsloth, vLLM pipelines)
  • System optimization and debugging

Jainilkumar Patel

DePaul University | JPATE186@depaul.edu

Contributions:

  • Result analysis and evaluation
  • Report writing and documentation
  • Qualitative assessment of model outputs
  • Visualization and presentation

References

  1. Qwen Team. (2024). Qwen2.5 Technical Report. Alibaba Cloud.
  2. Unsloth Team. (2024). Unsloth: Fast and Memory-Efficient Fine-Tuning. GitHub.
  3. Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). SOSP.
  4. Rafailov, R. et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS.
  5. Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. OpenAI.
  6. Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS.
  7. Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
  8. Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
  9. DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

For complete references, see the full research report.


Citation

If you use this project in your research, please cite:

@techreport{mohammed2024agentic,
  title={Agentic RAG with Reinforcement Learning for Multi-Step Reasoning},
  author={Mohammed, Azeezulla and Patel, Jainilkumar},
  institution={DePaul University},
  year={2024}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • Unsloth Team for efficient fine-tuning framework
  • vLLM Team for high-performance inference
  • Alibaba Cloud for Qwen 2.5 model
  • DePaul University for computational resources

Made with passion by Azeezulla Mohammed & Jainilkumar Patel

ZEEJAI Hyper Chat - Where reasoning meets generation
