A Streamlit-based application for holding intelligent conversations with your documents, built on advanced RAG (Retrieval-Augmented Generation) and Google's Gemini AI.
- 📄 Multi-Format Support: PDF, Word (DOCX), PowerPoint (PPTX), HTML, Text/Markdown, and Images
- 🧠 Smart Retrieval: Uses FAISS vector store with sentence transformers for accurate document retrieval
- 💬 Conversational AI: Powered by Google's Gemini Flash for intelligent responses
- 📚 Source References: View exact document snippets used to generate answers
- 🎨 Modern UI: Clean, intuitive interface with custom styling
- 💾 Session Memory: Maintains conversation context throughout your session
- 🔍 OCR Capability: Extract text from scanned documents and images using Tesseract
- 🔄 Hybrid Search: Combines semantic search (embeddings) with BM25 keyword matching
- 🎯 Query Decomposition: Breaks complex queries into sub-questions for better accuracy
- 📊 Document Reranking: Uses cross-encoder models to rank retrieved documents by relevance
- 📝 Smart Summaries: Generate brief, detailed, or comprehensive document summaries
- ❓ Follow-up Questions: Auto-generates contextual follow-up questions
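
The Hybrid Search feature merges results from two retrievers. The source does not say how SmartBot fuses them, so the following is only a minimal sketch of one common approach, reciprocal rank fusion. The function name, the `k=60` constant, and the chunk IDs are illustrative assumptions, not taken from `rag.py`.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked highly by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results: chunk IDs ordered best-first by each retriever.
semantic_hits = ["chunk_3", "chunk_1", "chunk_7"]  # embedding similarity
keyword_hits = ["chunk_1", "chunk_9", "chunk_3"]   # BM25 keyword match

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)  # ['chunk_1', 'chunk_3', 'chunk_9', 'chunk_7']
```

Rank-based fusion avoids having to put embedding similarities and BM25 scores on a common scale, which is why it is a frequent default for this kind of merge.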
- Clone the repository
    git clone https://github.com/AjayChikate/SmartBot.git
    cd SmartBot
- Install dependencies
    pip install -r requirements.txt
- Set up environment variables
  Create a .env file in the project root:
    GOOGLE_API_KEY=your_google_api_key_here
    TESSERACT_CMD=C:/Program Files/Tesseract-OCR/tesseract.exe
    POPPLER_PATH=C:/Program Files/poppler/Library/bin
- Run the application
    streamlit run app.py

- Upload Documents: Use the sidebar to upload one or more documents in supported formats
- Enable OCR: Toggle OCR if you have scanned documents or images with text
- Process: Click "Process Documents" to build the knowledge base
- Chat: Ask questions about your documents in natural language
- View Sources: Expand the source references to see which document sections were used
- Follow-ups: Click a suggested follow-up question to dive deeper
- Summary: Generate document summaries (Brief/Detailed/Comprehensive)
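
If the app fails at startup, the three `.env` settings are a good first thing to check. The sketch below shows one way to validate them once they are in the environment. `load_settings` is a hypothetical helper, and treating the OCR paths as optional is an assumption; how `app.py` actually reads its configuration is not described here.

```python
import os


def load_settings(env=None):
    """Collect the three settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("GOOGLE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file")
    return {
        "google_api_key": api_key,
        # OCR-related paths are treated as optional here (an assumption).
        "tesseract_cmd": env.get("TESSERACT_CMD") or None,
        "poppler_path": env.get("POPPLER_PATH") or None,
    }


# Pass a plain dict to try it out without touching the real environment.
example = load_settings({
    "GOOGLE_API_KEY": "your_google_api_key_here",
    "TESSERACT_CMD": "C:/Program Files/Tesseract-OCR/tesseract.exe",
})
print(example["poppler_path"])  # None, because POPPLER_PATH was not provided
```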
SmartBot/
│
├── app.py # Main Streamlit application
├── rag.py # RAG pipeline (chunking, vectorstore, conversation chain)
├── processor.py # Document processing orchestrator
├── extraction.py # Text extraction for various file formats
├── ocr.py # OCR functionality using Tesseract
├── htmlTemplates.py # CSS and HTML templates for UI
├── requirements.txt # Project dependencies
├── .env # Environment variables (not in repo)
└── README.md
- Embedding Model: BAAI/bge-small-en-v1.5 (384 dimensions)
- Reranking Model: cross-encoder/ms-marco-MiniLM-L-6-v2
- Vector Store: FAISS
- LLM: Google Gemini Flash
- GPU Acceleration: Enable for faster embeddings
  - Requires CUDA-capable NVIDIA GPU
  - ~3-5x faster embedding generation
- Chunking Strategy:
  - Smaller chunks (200-300): Better precision
  - Larger chunks (500-600): Better context
- Retrieval Count:
  - Small documents: k=4-5
  - Large documents: k=8-10
- Reranking: Disable for <5 retrieved docs
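
To make the chunking trade-off concrete, here is a rough sketch of fixed-size chunking with overlap, plus the reranking rule of thumb from the tips above. It assumes chunk sizes are measured in characters (the tips do not give a unit), and `split_into_chunks`, `should_rerank`, and the overlap value are hypothetical names and numbers rather than the actual code in `rag.py`.

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character windows that overlap slightly."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def should_rerank(num_retrieved, min_docs=5):
    """Mirror the tip above: skip reranking for fewer than 5 retrieved docs."""
    return num_retrieved >= min_docs


sample = "x" * 2000  # stand-in for extracted document text
small = split_into_chunks(sample, chunk_size=250, overlap=25)
large = split_into_chunks(sample, chunk_size=550, overlap=25)
print(len(small), len(large))  # 9 4
```

Smaller chunks give the retriever more, narrower candidates to choose from, while larger chunks keep more surrounding context in each candidate, which is the precision versus context trade-off the tips describe.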
Contributions are welcome! Please feel free to submit a Pull Request.
⭐ If you find this project helpful, please consider giving it a star!




