A Three-Stage System for Reasoning-Enhanced Language Models
Transforming Qwen 2.5 7B into a reasoning agent through SFT, GRPO, and RAG integration
Core Technologies:
- Python 3.10+ - Primary programming language
- Unsloth - Memory-efficient fine-tuning framework (2x faster training, 60% less memory)
- FastAPI - High-performance web framework for model serving
- vLLM - Fast inference engine with PagedAttention
- PyTorch - Deep learning framework
- Transformers - Hugging Face model library
- LoRA/PEFT - Parameter-efficient fine-tuning
- Overview
- Key Features
- Project Architecture
- Three-Stage Pipeline
- Results & Performance
- Project Structure
- Installation & Setup
- Usage
- Technical Implementation
- Documentation & Reports
- Challenges & Solutions
- Future Work
- Contributors
- References
This project implements an Agentic Retrieval Augmented Generation (RAG) system enhanced with Reinforcement Learning to enable multi-step reasoning capabilities. Unlike conventional "fetch and summarize" RAG pipelines, our system can plan, search, analyze, and conclude - mimicking human-like reasoning patterns.
Traditional RAG systems follow a simple retrieve-and-summarize approach, which limits their ability to perform complex reasoning tasks. We aimed to build an agent that:
- Plans its approach to answering questions
- Searches for relevant information dynamically
- Analyzes the retrieved data
- Concludes with well-reasoned responses
We developed a three-stage pipeline that transforms the Qwen 2.5 7B model (4-bit quantized) into a reasoning agent:
- Stage 1 (SFT): Establish reasoning format through Supervised Fine-Tuning
- Stage 2 (GRPO): Enhance reasoning quality using Group Relative Policy Optimization
- Stage 3 (Agentic RAG): Deploy with web search integration for real-time knowledge
- Structured Reasoning: Model generates explicit reasoning traces before providing answers
- Multi-Mode Operation:
- Fast Chat: Quick conversational responses
- Deep Reasoning: Step-by-step analytical thinking
- Search Mode: RAG-enhanced responses with real-time web search
- Reinforcement Learning: GRPO training for improved format compliance (+25 percentage points) and reasoning depth
- Memory-Efficient Training: LoRA adapters enable training on single GPU (40GB VRAM)
- Beautiful UI: Modern, responsive frontend with real-time streaming
- Dynamic Tool Use: Integration with DuckDuckGo search and semantic filtering
Objective: Teach the model to follow a structured reasoning format.
Configuration:
- Base Model: `unsloth/Qwen2.5-7B-bnb-4bit` (4-bit quantization)
- Dataset: Alpaca-cleaned (5,000 instruction/response pairs)
- Technique: LoRA (Low-Rank Adaptation)
- Rank: 32
- Alpha: 64
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Training: 200 steps, batch size 2, gradient accumulation 4
- Learning Rate: 2e-4 (linear scheduler)
- Optimizer: AdamW (8-bit)
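The configuration above corresponds roughly to the following Unsloth setup. This is a sketch, not the notebook's exact code; `max_seq_length` is an assumed value not stated in the configuration list.

```python
from unsloth import FastLanguageModel

# Load the 4-bit base model (downloads on first run).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-bnb-4bit",
    max_seq_length=2048,  # assumed; not stated in the configuration above
    load_in_4bit=True,
)

# Attach LoRA adapters with the rank/alpha and target modules listed above.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```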
Reasoning Format:
```
<start_working_out>
[Model's internal reasoning process]
<end_working_out>
<SOLUTION>
[Final answer]
</SOLUTION>
```
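Format compliance can be measured with a simple tag check. The helper below is illustrative (not the notebook's exact code): it requires a complete working-out block followed by a tagged solution.

```python
import re

# Matches a full reasoning block followed by a tagged solution, in order.
FORMAT_RE = re.compile(
    r"<start_working_out>.*?<end_working_out>\s*<SOLUTION>.*?</SOLUTION>",
    re.DOTALL,
)

def is_format_compliant(completion: str) -> bool:
    """True if the completion follows the structured reasoning format."""
    return FORMAT_RE.search(completion) is not None
```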
Results:
- Training loss reduced from 1.4040 to 0.7920 (43.6% reduction)
- Format compliance: 80% on unseen samples
- BERTScore F1: 0.6050
- ROUGE-L: 0.3148
Notebook: Stage_1.ipynb
Outputs: stage1_sft/ (plots, models, results, logs)
Objective: Refine reasoning quality and enforce structured output through reinforcement learning.
Why GRPO over PPO?
- GRPO doesn't require a separate value model → 50% less memory usage
- More efficient for resource-constrained environments
- Better suited for local GPU training
Reward Functions:
- Exact Format Match (+3.0): Correct use of reasoning tags
- Approximate Format Match (+0.5): Partial tag compliance
- Answer Quality (ROUGE):
- Exact match: +5.0
- High ROUGE: +3.0
- Instruction Following: Keyword overlap and logical connectors
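The reward shaping above can be sketched as follows. The real notebook scores answers with `rouge-score`; here a simple token-overlap ratio stands in for ROUGE so the example stays dependency-free. The weights mirror the list above.

```python
import re

TAGS = ["<start_working_out>", "<end_working_out>", "<SOLUTION>", "</SOLUTION>"]

def format_reward(completion: str) -> float:
    """+3.0 for the full tag structure, +0.5 for partial tag compliance."""
    if re.search(r"<start_working_out>.*<end_working_out>.*<SOLUTION>.*</SOLUTION>",
                 completion, re.DOTALL):
        return 3.0
    if any(tag in completion for tag in TAGS):
        return 0.5
    return 0.0

def answer_reward(predicted: str, reference: str) -> float:
    """+5.0 for an exact match, +3.0 for high overlap (ROUGE stand-in)."""
    if predicted.strip() == reference.strip():
        return 5.0
    pred = set(predicted.lower().split())
    ref = set(reference.lower().split())
    overlap = len(pred & ref) / max(len(ref), 1)
    return 3.0 if overlap >= 0.5 else 0.0
```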
Configuration:
- Base Model: Stage 1 SFT model
- Steps: 100 training steps
- Batch Size: 4 (gradient accumulation 2)
- KL Divergence: Monitored to prevent policy drift
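The configuration above maps onto TRL's `GRPOTrainer` roughly as follows. This is a hedged sketch: `model`, `train_dataset`, and the reward functions are placeholders from the surrounding pipeline, the `beta` value is an assumption, and the actual notebook may wire this up differently via Unsloth.

```python
from trl import GRPOConfig, GRPOTrainer

args = GRPOConfig(
    output_dir="stage2_grpo/models",
    max_steps=100,                      # 100 training steps (see above)
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    beta=0.04,                          # KL penalty coefficient (assumed value)
)

trainer = GRPOTrainer(
    model=model,                        # Stage 1 SFT model (placeholder)
    args=args,
    train_dataset=train_dataset,        # placeholder
    reward_funcs=[format_reward, answer_reward],
)
trainer.train()
```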
Results:
- Format compliance: 65% → 90% (+25 percentage points)
- KL divergence: 12.2 → 3.7 (stable convergence)
- Average reasoning length: +13 tokens (more detailed explanations)
- BERTScore F1: 0.3338 → 0.3400 (+0.62%)
Notebook: Stage_2.ipynb
Outputs: stage2_grpo/ (plots, models, results, logs)
Objective: Integrate trained models with external tools for real-time reasoning and retrieval.
System Components:
- Agent Orchestrator: Routes queries based on intent detection
- Memory Module: Rolling context window (10-turn history using deque)
- Tool Integration: DuckDuckGo search with semantic filtering
- FastAPI Backend: Serves models with LoRA adapter switching
- Frontend: Beautiful, responsive UI with real-time updates
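The memory module above is essentially a bounded deque; here is a minimal sketch (class and method names are illustrative, not the repo's actual API). The oldest turn is evicted automatically once the window is full.

```python
from collections import deque

class ConversationMemory:
    """Rolling context window: keeps the most recent N conversation turns."""

    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)  # oldest turn drops off automatically

    def add(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append((user_msg, assistant_msg))

    def as_context(self) -> str:
        """Render the window as a prompt prefix for the model."""
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)
```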
RAG Pipeline:
User Query → Web Search (20 results) → Semantic Filtering (Top-5)
→ Context Injection → GRPO Model → Grounded Response
Semantic Filtering:
- Uses `SentenceTransformer("all-MiniLM-L6-v2")` for embeddings
- Computes cosine similarity between the query and each document
- Selects top-K most relevant snippets (default K=5)
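The ranking step boils down to a cosine top-K over embedding vectors. The sketch below uses plain NumPy on precomputed embeddings (SentenceTransformer produces them in the real pipeline; toy vectors suffice here).

```python
import numpy as np

def top_k_by_cosine(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5) -> list:
    """Return indices of the k documents most similar to the query, best first."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = d @ q                                   # cosine similarity per document
    return np.argsort(sims)[::-1][:k].tolist()     # descending order, top k
```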
Files:
- Backend: `backend/main.py` - FastAPI server with model loading
- Frontend: `frontend/index.html` - ZEEJAI Hyper Chat UI
- Client: `backend/client.py` - Testing client
Stage 1: SFT Training Loss Curve
Training loss decreased by 43.6%, demonstrating successful adaptation to the instruction-following format.
Stage 2: SFT vs GRPO Comparison Dashboard
GRPO achieved a 25-percentage-point improvement in format compliance while maintaining semantic quality.
GRPO Training Dynamics
Reward progression and KL divergence monitoring during GRPO training.
| Metric | Stage 1 (SFT) | Stage 2 (GRPO) | Improvement |
|---|---|---|---|
| Format Compliance | 65.0% | 90.0% | +25.0% |
| BERTScore (F1) | 0.3338 | 0.3400 | +0.62% |
| ROUGE-L | 0.1387 | 0.1293 | -0.94% |
| BLEU | 0.1024 | 0.0817 | -2.07% |
| Avg. Reasoning Length | 364 tokens | 377 tokens | +13 tokens |
Note: ROUGE-L and BLEU decreased slightly because GRPO promotes longer reasoning chains, reducing n-gram overlap with reference answers. This is expected and acceptable given the +25% format compliance improvement.
Explore the comprehensive evaluation dashboards:
- `stage1_sft/plots/` - Dataset EDA, training dynamics, format compliance, error analysis, evaluation metrics, reasoning analysis
- `stage2_grpo/plots/` - SFT vs GRPO comparison dashboard, GRPO training dynamics, prompt length distribution
AGENT-RAG/
│
├── README.md
│ Complete project documentation
│
├── reference.ipynb
│ Initial prototype notebook containing all three stages as reference
│
├── Stage_1.ipynb
│ Stage 1 implementation: Supervised Fine-Tuning with reasoning format
│
├── Stage_2.ipynb
│ Stage 2 implementation: GRPO reinforcement learning training
│
├── backend/
│ │
│ ├── main.py
│ │ FastAPI server implementation
│ │ - Model loading (Qwen 2.5 7B with 4-bit quantization)
│ │ - LoRA adapter switching (SFT/GRPO models)
│ │ - RAG search integration with DuckDuckGo
│ │ - Semantic filtering using SentenceTransformer
│ │ - Three endpoints: /generate, /api/health, static frontend
│ │
│ ├── client.py
│ │ Test client for API calls
│ │
│ ├── test.py
│ │ Unit tests for backend functionality
│ │
│ ├── models/
│ │ │
│ │ ├── sft_alpaca_model/
│ │ │ Stage 1 trained model (LoRA adapter)
│ │ │ - adapter_model.safetensors
│ │ │ - adapter_config.json
│ │ │ - README.md
│ │ │
│ │ └── grpo_alpaca_model/
│ │ Stage 2 trained model (LoRA adapter)
│ │ - adapter_model.safetensors
│ │ - adapter_config.json
│ │ - README.md
│ │
│ └── unsloth_compiled_cache/
│ Compiled Unsloth trainer classes for faster execution
│ - UnslothSFTTrainer.py
│ - UnslothGRPOTrainer.py
│ - UnslothDPOTrainer.py
│ - And other trainer variants
│
├── frontend/
│ │
│ └── index.html
│ ZEEJAI Hyper Chat web interface
│ - Modern UI with gradient animations
│ - Three mode selector (Fast Chat, Deep Reasoning, Search Mode)
│ - Real-time message streaming
│ - Reasoning trace display
│ - Source citation display
│
├── stage1_sft/
│ │
│ ├── models/
│ │ └── checkpoints/
│ │ Training checkpoints at steps 50, 100, 150, 200
│ │ Each contains: model weights, optimizer state, scheduler state, tokenizer
│ │
│ ├── plots/
│ │ All Stage 1 visualizations and analysis
│ │ - training_dynamics.png - Loss curves and learning rate schedule
│ │ - evaluation_metrics_dashboard.png - BERTScore, ROUGE, BLEU metrics
│ │ - final_summary_dashboard_clean.png - Complete training summary
│ │ - format_compliance.png - Reasoning tag usage analysis
│ │ - reasoning_solution_analysis.png - Output structure analysis
│ │ - error_analysis.png - Common failure patterns
│ │ - dataset_eda.png - Dataset distribution analysis
│ │ - token_length_analysis.png - Input/output length statistics
│ │ - token_length_distribution.png - Length histograms
│ │ - model_comparison_table.png - Metric comparisons
│ │
│ ├── results/
│ │ Evaluation outputs on test sets
│ │ - Model predictions
│ │ - Metric scores
│ │ - Sample outputs
│ │
│ └── logs/
│ Training logs and TensorBoard events
│
├── stage2_grpo/
│ │
│ ├── models/
│ │ GRPO trained model checkpoints
│ │ Final adapter saved to backend/models/grpo_alpaca_model/
│ │
│ ├── plots/
│ │ All Stage 2 visualizations and analysis
│ │ - sft_vs_grpo_dashboard.png - Complete comparison between SFT and GRPO
│ │ - grpo_training_dynamics.png - Reward progression and KL divergence
│ │ - prompt_length_distribution.png - Training prompt statistics
│ │
│ ├── results/
│ │ GRPO evaluation outputs
│ │ - Format compliance scores
│ │ - Answer quality metrics
│ │ - Comparison with SFT baseline
│ │
│ └── logs/
│ GRPO training logs and reward histories
│
├── Agentic_RAG_with_Reinforcement_Learning_for_Multi_Step_Reasoning.pdf
│ Complete academic research report
│ - Introduction and motivation
│ - Three-stage methodology
│ - System development with code
│ - Experiments and results
│ - Visual analysis
│ - Challenges and solutions
│ - Future work
│ - References
│
├── Stage-1.pdf
│ Detailed Stage 1 analysis with extensive visualizations
│
├── stage-2.pdf
│ Detailed Stage 2 GRPO training analysis and metrics
│
└── video.mp4
Project demonstration video
- System architecture walkthrough
- Live demonstration of all three modes
- Training process visualization
- Results analysis
- Discussion of challenges
- Python: 3.10+
- CUDA: 11.8+ (for GPU acceleration)
- GPU: NVIDIA GPU with 40GB+ VRAM (tested on A100)
- RAM: 32GB+ recommended
```bash
git clone https://github.com/yourusername/AGENT-RAG.git
cd AGENT-RAG

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install unsloth vllm fastapi uvicorn
pip install sentence-transformers scikit-learn duckduckgo-search
pip install transformers datasets peft accelerate
pip install rouge-score nltk pandas matplotlib seaborn
```
The notebooks automatically download `unsloth/Qwen2.5-7B-bnb-4bit` on first run.
```bash
jupyter notebook Stage_1.ipynb
```
- Trains the base model with the reasoning format
- Outputs saved to `stage1_sft/`
```bash
jupyter notebook Stage_2.ipynb
```
- Fine-tunes the SFT model with reinforcement learning
- Outputs saved to `stage2_grpo/`
```bash
cd backend
python main.py
```
The server runs on http://localhost:8000.
Open `frontend/index.html` in your browser or navigate to http://localhost:8000.
Generate Response:
```bash
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of France?",
    "mode": "reasoning"
  }'
```
Modes:
- `chat` - Fast conversational responses (SFT model)
- `reasoning` - Deep analytical thinking (GRPO model)
- `search` - RAG-enhanced with web search (GRPO + DuckDuckGo)
Health Check:
```bash
curl http://localhost:8000/api/health
```

Base Model: Qwen 2.5 7B (4-bit quantization via bitsandbytes)

LoRA Configuration:
```python
lora_rank = 32
lora_alpha = 64
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]
```
Chat Template:
```python
system_prompt = """You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION>"""
```
Search & Retrieval:
```python
from duckduckgo_search import DDGS
from sklearn.metrics.pairwise import cosine_similarity

# embedding_model: SentenceTransformer("all-MiniLM-L6-v2"), loaded once at startup

def rag_search(query, num_results=20, top_k=5):
    # 1. Fetch results from DuckDuckGo
    results = DDGS().text(query, max_results=num_results)
    documents = [r["body"] for r in results]
    # 2. Encode documents and query
    doc_embeddings = embedding_model.encode(documents)
    query_embedding = embedding_model.encode(query)
    # 3. Compute cosine similarity
    similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
    # 4. Select the top-K most relevant snippets
    top_indices = similarities.argsort()[-top_k:][::-1]
    context = "\n\n".join(documents[i] for i in top_indices)
    sources = [results[i]["href"] for i in top_indices]
    return context, sources
```
Techniques Used:
- 4-bit Quantization: Reduces model size from 28GB to ~7GB
- LoRA Adapters: Train only 0.2% of parameters
- Gradient Checkpointing: Trade compute for memory
- Dynamic LoRA Switching: Load adapters on-demand
- Gradient Accumulation: Simulate larger batch sizes
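The dynamic LoRA switching above can be sketched with PEFT's multi-adapter API. This is a hedged sketch: `base_model` is assumed to be the 4-bit Qwen 2.5 7B loaded at startup, and the adapter paths and names mirror the repo layout but are assumptions, not the server's exact code.

```python
from peft import PeftModel

# Register both adapters on the shared 4-bit base model.
model = PeftModel.from_pretrained(base_model, "models/sft_alpaca_model",
                                  adapter_name="sft")
model.load_adapter("models/grpo_alpaca_model", adapter_name="grpo")

def select_adapter(mode: str) -> None:
    """Activate the adapter for a request: 'chat' -> SFT, otherwise GRPO."""
    model.set_adapter("sft" if mode == "chat" else "grpo")
```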
Agentic_RAG_with_Reinforcement_Learning_for_Multi_Step_Reasoning.pdf
Comprehensive academic report covering:
- Introduction & motivation
- Complete methodology
- System development (code walkthrough)
- Experiments & results
- Visual analysis
- Challenges & solutions
- Future work & references
- Stage-1.pdf: Detailed SFT analysis with extensive visualizations
- stage-2.pdf: GRPO training analysis and comparison metrics
video.mp4: Complete project walkthrough including:
- Architecture overview
- Live demonstration of all three modes
- Training process visualization
- Results analysis
- Challenges discussion
Problem: Deploying FastAPI server while notebook kernel was active caused GPU memory exhaustion (40GB VRAM exceeded).
Root Cause: vLLM's aggressive memory reservation created duplicate model instances.
Solution:
- Implemented singleton pattern for model loading
- Shut down notebook kernel before launching API server
- Future: Use vLLM's LoRA adapter switching within single instance
Problem: Model exploited reward system by generating correct tags with nonsensical reasoning.
Root Cause: Format reward disproportionately high vs. semantic quality reward.
Solution:
- Rebalanced reward weights (decreased format, increased ROUGE)
- Added negative rewards for hallucinations (-1.0 for low ROUGE-L)
- Monitored KL divergence to prevent policy drift
Problem: Finding optimal balance between memory efficiency and model expressiveness.
Solution:
- LoRA Rank 32 (tested 8, 16, 32) - 32 provided best format learning
- Learning Rate 2e-4 (tested 1e-5 to 5e-4) - 2e-4 balanced speed and stability
- Gradient Accumulation 4 - Enabled effective batch size 8 on single GPU
- Vector Database Integration
- Replace in-memory cosine similarity with ChromaDB
- Enable persistent, scalable document storage
- Support for multi-session context
- Tool-Use Tokens
- Train model to request retrieval actions autonomously
- Move from rule-based routing to learned tool invocation
- Enable multi-hop reasoning with dynamic planning
- Memory Optimization
- Implement vLLM's PagedAttention for KV-cache management
- Support concurrent multi-user serving
- Dynamic batch sizing for throughput optimization
- Extended Training
- Fine-tune on domain-specific datasets (medical, legal, scientific)
- Multi-task learning across reasoning types
- Longer context windows (8K → 32K tokens)
- Evaluation Framework
- Automated reasoning quality assessment
- Benchmark against GPT-4, Claude, DeepSeek-R1
- Human evaluation on multi-step reasoning tasks
Azeezulla Mohammed - DePaul University | MMOHA134@depaul.edu
Contributions:
- Model training and execution (SFT + GRPO)
- Infrastructure setup (NVIDIA A100 GPU cluster)
- Technical implementation (Unsloth, vLLM pipelines)
- System optimization and debugging
Jainilkumar Patel - DePaul University | JPATE186@depaul.edu
Contributions:
- Result analysis and evaluation
- Report writing and documentation
- Qualitative assessment of model outputs
- Visualization and presentation
- Qwen Team. (2024). Qwen2.5 Technical Report. Alibaba Cloud.
- Unsloth Team. (2024). Unsloth: Fast and Memory-Efficient Fine-Tuning. GitHub.
- Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). SOSP.
- Rafailov, R. et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS.
- Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. (introduces GRPO)
- Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. OpenAI.
- Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS.
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
- Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
- DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
For complete references, see the full research report.
If you use this project in your research, please cite:
```bibtex
@article{mohammed2024agentic,
  title={Agentic RAG with Reinforcement Learning for Multi-Step Reasoning},
  author={Mohammed, Azeezulla and Patel, Jainilkumar},
  institution={DePaul University},
  year={2024}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
- Unsloth Team for efficient fine-tuning framework
- vLLM Team for high-performance inference
- Alibaba Cloud for Qwen 2.5 model
- DePaul University for computational resources
Made with passion by Azeezulla Mohammed & Jainilkumar Patel
ZEEJAI Hyper Chat - Where reasoning meets generation


