A Three-Stage System for Reasoning-Enhanced Language Models
Transforming Qwen 2.5 7B into a reasoning agent through SFT, GRPO, and RAG integration
Core Technologies:
- Python 3.10+ - Primary programming language
- Unsloth - Memory-efficient fine-tuning framework (2x faster training, 60% less memory)
- FastAPI - High-performance web framework for model serving
- vLLM - Fast inference engine with PagedAttention
- PyTorch - Deep learning framework
- Transformers - Hugging Face model library
- LoRA/PEFT - Parameter-efficient fine-tuning
- Overview
- Key Features
- Project Architecture
- Three-Stage Pipeline
- Results & Performance
- Project Structure
- Installation & Setup
- Usage
- Technical Implementation
- Documentation & Reports
- Challenges & Solutions
- Future Work
- Contributors
- References
This project implements an Agentic Retrieval Augmented Generation (RAG) system enhanced with Reinforcement Learning to enable multi-step reasoning capabilities. Unlike conventional "fetch and summarize" RAG pipelines, our system can plan, search, analyze, and conclude - mimicking human-like reasoning patterns.
Traditional RAG systems follow a simple retrieve-and-summarize approach, which limits their ability to perform complex reasoning tasks. We aimed to build an agent that:
- Plans its approach to answering questions
- Searches for relevant information dynamically
- Analyzes the retrieved data
- Concludes with well-reasoned responses
We developed a three-stage pipeline that transforms the Qwen 2.5 7B model (4-bit quantized) into a reasoning agent:
- Stage 1 (SFT): Establish reasoning format through Supervised Fine-Tuning
- Stage 2 (GRPO): Enhance reasoning quality using Group Relative Policy Optimization
- Stage 3 (Agentic RAG): Deploy with web search integration for real-time knowledge
- Structured Reasoning: Model generates explicit reasoning traces before providing answers
- Multi-Mode Operation:
- Fast Chat: Quick conversational responses
- Deep Reasoning: Step-by-step analytical thinking
- Search Mode: RAG-enhanced responses with real-time web search
- Reinforcement Learning: GRPO training for improved format compliance (+25 percentage points) and reasoning depth
- Memory-Efficient Training: LoRA adapters enable training on single GPU (40GB VRAM)
- Beautiful UI: Modern, responsive frontend with real-time streaming
- Dynamic Tool Use: Integration with DuckDuckGo search and semantic filtering
Objective: Teach the model to follow a structured reasoning format.
Configuration:
- Base Model: `unsloth/Qwen2.5-7B-bnb-4bit` (4-bit quantization)
- Dataset: Alpaca-cleaned (5,000 instruction/response pairs)
- Technique: LoRA (Low-Rank Adaptation)
- Rank: 32
- Alpha: 64
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Training: 200 steps, batch size 2, gradient accumulation 4
- Learning Rate: 2e-4 (linear scheduler)
- Optimizer: AdamW (8-bit)
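The configuration above corresponds roughly to the following Unsloth setup. This is a sketch, not the notebook's exact code; `max_seq_length` is an assumed value not stated in the configuration list.

```python
from unsloth import FastLanguageModel

# Load the 4-bit base model (downloads on first run).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-bnb-4bit",
    max_seq_length=2048,  # assumed; not stated in the configuration above
    load_in_4bit=True,
)

# Attach LoRA adapters with the rank/alpha and target modules listed above.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```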
Reasoning Format:
```
<start_working_out>
[Model's internal reasoning process]
<end_working_out>
<SOLUTION>
[Final answer]
</SOLUTION>
```
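Format compliance can be measured with a simple tag check. The helper below is illustrative (not the notebook's exact code): it requires a complete working-out block followed by a tagged solution.

```python
import re

# Matches a full reasoning block followed by a tagged solution, in order.
FORMAT_RE = re.compile(
    r"<start_working_out>.*?<end_working_out>\s*<SOLUTION>.*?</SOLUTION>",
    re.DOTALL,
)

def is_format_compliant(completion: str) -> bool:
    """True if the completion follows the structured reasoning format."""
    return FORMAT_RE.search(completion) is not None
```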
Results:
- Training loss reduced from 1.4040 to 0.7920 (43.6% reduction)
- Format compliance: 80% on unseen samples
- BERTScore F1: 0.6050
- ROUGE-L: 0.3148
Notebook: Stage_1.ipynb
Outputs: stage1_sft/ (plots, models, results, logs)
Objective: Refine reasoning quality and enforce structured output through reinforcement learning.
Why GRPO over PPO?
- GRPO doesn't require a separate value model → 50% less memory usage
- More efficient for resource-constrained environments
- Better suited for local GPU training
Reward Functions:
- Exact Format Match (+3.0): Correct use of reasoning tags
- Approximate Format Match (+0.5): Partial tag compliance
- Answer Quality (ROUGE):
- Exact match: +5.0
- High ROUGE: +3.0
- Instruction Following: Keyword overlap and logical connectors
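The reward shaping above can be sketched as follows. The real notebook scores answers with `rouge-score`; here a simple token-overlap ratio stands in for ROUGE so the example stays dependency-free. The weights mirror the list above.

```python
import re

TAGS = ["<start_working_out>", "<end_working_out>", "<SOLUTION>", "</SOLUTION>"]

def format_reward(completion: str) -> float:
    """+3.0 for the full tag structure, +0.5 for partial tag compliance."""
    if re.search(r"<start_working_out>.*<end_working_out>.*<SOLUTION>.*</SOLUTION>",
                 completion, re.DOTALL):
        return 3.0
    if any(tag in completion for tag in TAGS):
        return 0.5
    return 0.0

def answer_reward(predicted: str, reference: str) -> float:
    """+5.0 for an exact match, +3.0 for high overlap (ROUGE stand-in)."""
    if predicted.strip() == reference.strip():
        return 5.0
    pred = set(predicted.lower().split())
    ref = set(reference.lower().split())
    overlap = len(pred & ref) / max(len(ref), 1)
    return 3.0 if overlap >= 0.5 else 0.0
```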
Configuration:
- Base Model: Stage 1 SFT model
- Steps: 100 training steps
- Batch Size: 4 (gradient accumulation 2)
- KL Divergence: Monitored to prevent policy drift
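The configuration above maps onto TRL's `GRPOTrainer` roughly as follows. This is a hedged sketch: `model`, `train_dataset`, and the reward functions are placeholders from the surrounding pipeline, the `beta` value is an assumption, and the actual notebook may wire this up differently via Unsloth.

```python
from trl import GRPOConfig, GRPOTrainer

args = GRPOConfig(
    output_dir="stage2_grpo/models",
    max_steps=100,                      # 100 training steps (see above)
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    beta=0.04,                          # KL penalty coefficient (assumed value)
)

trainer = GRPOTrainer(
    model=model,                        # Stage 1 SFT model (placeholder)
    args=args,
    train_dataset=train_dataset,        # placeholder
    reward_funcs=[format_reward, answer_reward],
)
trainer.train()
```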
Results:
- Format compliance: 65% → 90% (+25 percentage points)
- KL divergence: 12.2 → 3.7 (stable convergence)
- Average reasoning length: +13 tokens (more detailed explanations)
- BERTScore F1: 0.3338 → 0.3400 (+0.62%)
Notebook: Stage_2.ipynb
Outputs: stage2_grpo/ (plots, models, results, logs)
Objective: Integrate trained models with external tools for real-time reasoning and retrieval.
System Components:
- Agent Orchestrator: Routes queries based on intent detection
- Memory Module: Rolling context window (10-turn history using deque)
- Tool Integration: DuckDuckGo search with semantic filtering
- FastAPI Backend: Serves models with LoRA adapter switching
- Frontend: Beautiful, responsive UI with real-time updates
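The memory module above is essentially a bounded deque; here is a minimal sketch (class and method names are illustrative, not the repo's actual API). The oldest turn is evicted automatically once the window is full.

```python
from collections import deque

class ConversationMemory:
    """Rolling context window: keeps the most recent N conversation turns."""

    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)  # oldest turn drops off automatically

    def add(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append((user_msg, assistant_msg))

    def as_context(self) -> str:
        """Render the window as a prompt prefix for the model."""
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)
```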
RAG Pipeline:
User Query → Web Search (20 results) → Semantic Filtering (Top-5)
→ Context Injection → GRPO Model → Grounded Response
Semantic Filtering:
- Uses `SentenceTransformer("all-MiniLM-L6-v2")` for embeddings
- Computes cosine similarity between the query and each document
- Selects top-K most relevant snippets (default K=5)
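The ranking step boils down to a cosine top-K over embedding vectors. The sketch below uses plain NumPy on precomputed embeddings (SentenceTransformer produces them in the real pipeline; toy vectors suffice here).

```python
import numpy as np

def top_k_by_cosine(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5) -> list:
    """Return indices of the k documents most similar to the query, best first."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = d @ q                                   # cosine similarity per document
    return np.argsort(sims)[::-1][:k].tolist()     # descending order, top k
```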
Files:
- Backend: `backend/main.py` - FastAPI server with model loading
- Frontend: `frontend/index.html` - ZEEJAI Hyper Chat UI
- Client: `backend/client.py` - Testing client
Stage 1: SFT Training Loss Curve
Training loss decreased by 43.6%, demonstrating successful adaptation to the instruction-following format.
Stage 2: SFT vs GRPO Comparison Dashboard
GRPO achieved a 25-percentage-point improvement in format compliance while maintaining semantic quality.
GRPO Training Dynamics
Reward progression and KL divergence monitoring during GRPO training.
| Metric | Stage 1 (SFT) | Stage 2 (GRPO) | Improvement |
|---|---|---|---|
| Format Compliance | 65.0% | 90.0% | +25.0% |
| BERTScore (F1) | 0.3338 | 0.3400 | +0.62% |
| ROUGE-L | 0.1387 | 0.1293 | -0.94% |
| BLEU | 0.1024 | 0.0817 | -2.07% |
| Avg. Reasoning Length | 364 tokens | 377 tokens | +13 tokens |
Note: ROUGE-L and BLEU decreased slightly because GRPO promotes longer reasoning chains, reducing n-gram overlap with reference answers. This is expected and acceptable given the +25% format compliance improvement.
Explore the comprehensive evaluation dashboards:
- `stage1_sft/plots/` - Dataset EDA, training dynamics, format compliance, error analysis, evaluation metrics, reasoning analysis
- `stage2_grpo/plots/` - SFT vs GRPO comparison dashboard, GRPO training dynamics, prompt length distribution
AGENT-RAG/
│
├── README.md
│ Complete project documentation
│
├── reference.ipynb
│ Initial prototype notebook containing all three stages as reference
│
├── Stage_1.ipynb
│ Stage 1 implementation: Supervised Fine-Tuning with reasoning format
│
├── Stage_2.ipynb
│ Stage 2 implementation: GRPO reinforcement learning training
│
├── backend/
│ │
│ ├── main.py
│ │ FastAPI server implementation
│ │ - Model loading (Qwen 2.5 7B with 4-bit quantization)
│ │ - LoRA adapter switching (SFT/GRPO models)
│ │ - RAG search integration with DuckDuckGo
│ │ - Semantic filtering using SentenceTransformer
│ │ - Three endpoints: /generate, /api/health, static frontend
│ │
│ ├── client.py
│ │ Test client for API calls
│ │
│ ├── test.py
│ │ Unit tests for backend functionality
│ │
│ ├── models/
│ │ │
│ │ ├── sft_alpaca_model/
│ │ │ Stage 1 trained model (LoRA adapter)
│ │ │ - adapter_model.safetensors
│ │ │ - adapter_config.json
│ │ │ - README.md
│ │ │
│ │ └── grpo_alpaca_model/
│ │ Stage 2 trained model (LoRA adapter)
│ │ - adapter_model.safetensors
│ │ - adapter_config.json
│ │ - README.md
│ │
│ └── unsloth_compiled_cache/
│ Compiled Unsloth trainer classes for faster execution
│ - UnslothSFTTrainer.py
│ - UnslothGRPOTrainer.py
│ - UnslothDPOTrainer.py
│ - And other trainer variants
│
├── frontend/
│ │
│ └── index.html
│ ZEEJAI Hyper Chat web interface
│ - Modern UI with gradient animations
│ - Three mode selector (Fast Chat, Deep Reasoning, Search Mode)
│ - Real-time message streaming
│ - Reasoning trace display
│ - Source citation display
│
├── stage1_sft/
│ │
│ ├── models/
│ │ └── checkpoints/
│ │ Training checkpoints at steps 50, 100, 150, 200
│ │ Each contains: model weights, optimizer state, scheduler state, tokenizer
│ │
│ ├── plots/
│ │ All Stage 1 visualizations and analysis
│ │ - training_dynamics.png - Loss curves and learning rate schedule
│ │ - evaluation_metrics_dashboard.png - BERTScore, ROUGE, BLEU metrics
│ │ - final_summary_dashboard_clean.png - Complete training summary
│ │ - format_compliance.png - Reasoning tag usage analysis
│ │ - reasoning_solution_analysis.png - Output structure analysis
│ │ - error_analysis.png - Common failure patterns
│ │ - dataset_eda.png - Dataset distribution analysis
│ │ - token_length_analysis.png - Input/output length statistics
│ │ - token_length_distribution.png - Length histograms
│ │ - model_comparison_table.png - Metric comparisons
│ │
│ ├── results/
│ │ Evaluation outputs on test sets
│ │ - Model predictions
│ │ - Metric scores
│ │ - Sample outputs
│ │
│ └── logs/
│ Training logs and TensorBoard events
│
├── stage2_grpo/
│ │
│ ├── models/
│ │ GRPO trained model checkpoints
│ │ Final adapter saved to backend/models/grpo_alpaca_model/
│ │
│ ├── plots/
│ │ All Stage 2 visualizations and analysis
│ │ - sft_vs_grpo_dashboard.png - Complete comparison between SFT and GRPO
│ │ - grpo_training_dynamics.png - Reward progression and KL divergence
│ │ - prompt_length_distribution.png - Training prompt statistics
│ │
│ ├── results/
│ │ GRPO evaluation outputs
│ │ - Format compliance scores
│ │ - Answer quality metrics
│ │ - Comparison with SFT baseline
│ │
│ └── logs/
│ GRPO training logs and reward histories
│
├── Agentic_RAG_with_Reinforcement_Learning_for_Multi_Step_Reasoning.pdf
│ Complete academic research report
│ - Introduction and motivation
│ - Three-stage methodology
│ - System development with code
│ - Experiments and results
│ - Visual analysis
│ - Challenges and solutions
│ - Future work
│ - References
│
├── Stage-1.pdf
│ Detailed Stage 1 analysis with extensive visualizations
│
├── stage-2.pdf
│ Detailed Stage 2 GRPO training analysis and metrics
│
└── video.mp4
Project demonstration video
- System architecture walkthrough
- Live demonstration of all three modes
- Training process visualization
- Results analysis
- Discussion of challenges
- Python: 3.10+
- CUDA: 11.8+ (for GPU acceleration)
- GPU: NVIDIA GPU with 40GB+ VRAM (tested on A100)
- RAM: 32GB+ recommended
```bash
git clone https://github.com/yourusername/AGENT-RAG.git
cd AGENT-RAG

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install unsloth vllm fastapi uvicorn
pip install sentence-transformers scikit-learn duckduckgo-search
pip install transformers datasets peft accelerate
pip install rouge-score nltk pandas matplotlib seaborn
```
The notebooks automatically download `unsloth/Qwen2.5-7B-bnb-4bit` on first run.
```bash
jupyter notebook Stage_1.ipynb
```
- Trains the base model with the reasoning format
- Outputs saved to `stage1_sft/`
```bash
jupyter notebook Stage_2.ipynb
```
- Fine-tunes the SFT model with reinforcement learning
- Outputs saved to `stage2_grpo/`
```bash
cd backend
python main.py
```
The server runs on http://localhost:8000.
Open `frontend/index.html` in your browser or navigate to http://localhost:8000.
Generate Response:
```bash
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of France?",
    "mode": "reasoning"
  }'
```
Modes:
- `chat` - Fast conversational responses (SFT model)
- `reasoning` - Deep analytical thinking (GRPO model)
- `search` - RAG-enhanced with web search (GRPO + DuckDuckGo)
Health Check:
```bash
curl http://localhost:8000/api/health
```

Base Model: Qwen 2.5 7B (4-bit quantization via bitsandbytes)

LoRA Configuration:
```python
lora_rank = 32
lora_alpha = 64
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]
```
Chat Template:
```python
system_prompt = """You are given a problem.
Think about the problem and provide your working out.
Place it between <start_working_out> and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION>"""
```
Search & Retrieval:
```python
from duckduckgo_search import DDGS
from sklearn.metrics.pairwise import cosine_similarity

# embedding_model: SentenceTransformer("all-MiniLM-L6-v2"), loaded once at startup

def rag_search(query, num_results=20, top_k=5):
    # 1. Fetch results from DuckDuckGo
    results = DDGS().text(query, max_results=num_results)
    documents = [r["body"] for r in results]
    # 2. Encode documents and query
    doc_embeddings = embedding_model.encode(documents)
    query_embedding = embedding_model.encode(query)
    # 3. Compute cosine similarity
    similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
    # 4. Select the top-K most relevant snippets
    top_indices = similarities.argsort()[-top_k:][::-1]
    context = "\n\n".join(documents[i] for i in top_indices)
    sources = [results[i]["href"] for i in top_indices]
    return context, sources
```
Techniques Used:
- 4-bit Quantization: Reduces model size from 28GB to ~7GB
- LoRA Adapters: Train only 0.2% of parameters
- Gradient Checkpointing: Trade compute for memory
- Dynamic LoRA Switching: Load adapters on-demand
- Gradient Accumulation: Simulate larger batch sizes
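The dynamic LoRA switching above can be sketched with PEFT's multi-adapter API. This is a hedged sketch: `base_model` is assumed to be the 4-bit Qwen 2.5 7B loaded at startup, and the adapter paths and names mirror the repo layout but are assumptions, not the server's exact code.

```python
from peft import PeftModel

# Register both adapters on the shared 4-bit base model.
model = PeftModel.from_pretrained(base_model, "models/sft_alpaca_model",
                                  adapter_name="sft")
model.load_adapter("models/grpo_alpaca_model", adapter_name="grpo")

def select_adapter(mode: str) -> None:
    """Activate the adapter for a request: 'chat' -> SFT, otherwise GRPO."""
    model.set_adapter("sft" if mode == "chat" else "grpo")
```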
Agentic_RAG_with_Reinforcement_Learning_for_Multi_Step_Reasoning.pdf
Comprehensive academic report covering:
- Introduction & motivation
- Complete methodology
- System development (code walkthrough)
- Experiments & results
- Visual analysis
- Challenges & solutions
- Future work & references
- Stage-1.pdf: Detailed SFT analysis with extensive visualizations
- stage-2.pdf: GRPO training analysis and comparison metrics
video.mp4: Complete project walkthrough including:
- Architecture overview
- Live demonstration of all three modes
- Training process visualization
- Results analysis
- Challenges discussion
Problem: Deploying FastAPI server while notebook kernel was active caused GPU memory exhaustion (40GB VRAM exceeded).
Root Cause: vLLM's aggressive memory reservation created duplicate model instances.
Solution:
- Implemented singleton pattern for model loading
- Shut down notebook kernel before launching API server
- Future: Use vLLM's LoRA adapter switching within single instance
Problem: Model exploited reward system by generating correct tags with nonsensical reasoning.
Root Cause: Format reward disproportionately high vs. semantic quality reward.
Solution:
- Rebalanced reward weights (decreased format, increased ROUGE)
- Added negative rewards for hallucinations (-1.0 for low ROUGE-L)
- Monitored KL divergence to prevent policy drift
Problem: Finding optimal balance between memory efficiency and model expressiveness.
Solution:
- LoRA Rank 32 (tested 8, 16, 32) - 32 provided best format learning
- Learning Rate 2e-4 (tested 1e-5 to 5e-4) - 2e-4 balanced speed and stability
- Gradient Accumulation 4 - Enabled effective batch size 8 on single GPU
- Vector Database Integration
- Replace in-memory cosine similarity with ChromaDB
- Enable persistent, scalable document storage
- Support for multi-session context
- Tool-Use Tokens
- Train model to request retrieval actions autonomously
- Move from rule-based routing to learned tool invocation
- Enable multi-hop reasoning with dynamic planning
- Memory Optimization
- Implement vLLM's PagedAttention for KV-cache management
- Support concurrent multi-user serving
- Dynamic batch sizing for throughput optimization
- Extended Training
- Fine-tune on domain-specific datasets (medical, legal, scientific)
- Multi-task learning across reasoning types
- Longer context windows (8K → 32K tokens)
- Evaluation Framework
- Automated reasoning quality assessment
- Benchmark against GPT-4, Claude, DeepSeek-R1
- Human evaluation on multi-step reasoning tasks
Azeezulla Mohammed - DePaul University | MMOHA134@depaul.edu
Contributions:
- Model training and execution (SFT + GRPO)
- Infrastructure setup (NVIDIA A100 GPU cluster)
- Technical implementation (Unsloth, vLLM pipelines)
- System optimization and debugging
Jainilkumar Patel - DePaul University | JPATE186@depaul.edu
Contributions:
- Result analysis and evaluation
- Report writing and documentation
- Qualitative assessment of model outputs
- Visualization and presentation
- Qwen Team. (2024). Qwen2.5 Technical Report. Alibaba Cloud.
- Unsloth Team. (2024). Unsloth: Fast and Memory-Efficient Fine-Tuning. GitHub.
- Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). SOSP.
- Rafailov, R. et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS.
- Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. (introduces GRPO)
- Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. OpenAI.
- Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS.
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
- Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
- DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
For complete references, see the full research report.
If you use this project in your research, please cite:
```bibtex
@article{mohammed2024agentic,
  title={Agentic RAG with Reinforcement Learning for Multi-Step Reasoning},
  author={Mohammed, Azeezulla and Patel, Jainilkumar},
  institution={DePaul University},
  year={2024}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
- Unsloth Team for efficient fine-tuning framework
- vLLM Team for high-performance inference
- Alibaba Cloud for Qwen 2.5 model
- DePaul University for computational resources
Made with passion by Azeezulla Mohammed & Jainilkumar Patel
ZEEJAI Hyper Chat - Where reasoning meets generation


