A powerful RAG (Retrieval-Augmented Generation) application that allows you to upload PDF documents and ask questions about their content using AI. Built with Streamlit, LangChain, and Google Gemini.
- PDF Upload: Upload single or multiple PDF files
- Intelligent Question Answering: Ask questions about your PDF content and get accurate AI-generated answers
- Vector Search: Uses FAISS for efficient similarity search
- Local Embeddings: HuggingFace embeddings run locally (no API costs for embeddings)
- Google Gemini Integration: Powered by Google's Gemini 2.5 Flash model for answer generation
- User-Friendly Interface: Clean and intuitive Streamlit web interface
- Frontend: Streamlit
- LLM: Google Gemini 2.5 Flash (via LangChain)
- Embeddings: HuggingFace `sentence-transformers/all-MiniLM-L6-v2`
- Vector Store: FAISS
- PDF Processing: PyPDF
- Text Splitting: LangChain RecursiveCharacterTextSplitter
- Python 3.8 or higher
- Google API Key (for Gemini)
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd pdfQuery
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables

   Create a `.env` file in the project root:

   ```
   GOOGLE_API_KEY=your_google_api_key_here
   ```
To get a Google API key:
- Visit Google AI Studio
- Create a new API key
- Copy and paste it into your `.env` file
1. Start the application

   ```bash
   streamlit run main.py
   ```

2. Upload PDFs
   - Click on the sidebar menu
   - Upload one or more PDF files
   - Click "Submit & Process" to process the documents

3. Ask Questions
   - Type your question in the text input field
   - The AI will search through your PDFs and provide relevant answers
- PDF Processing: Extracts text from uploaded PDF files
- Text Chunking: Splits text into manageable chunks (1000 characters with 200 character overlap)
- Embedding Generation: Creates vector embeddings using HuggingFace models (runs locally)
- Vector Storage: Stores embeddings in FAISS index for fast similarity search
- Question Answering:
  - Converts the user question to an embedding
  - Searches for similar chunks in the FAISS index
  - Sends relevant chunks to Google Gemini for answer generation
  - Returns an AI-generated answer based on the document context
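The pipeline above can be sketched in plain Python. This is a toy illustration of the mechanics only: the real app uses LangChain's `RecursiveCharacterTextSplitter` (which also respects paragraph/sentence boundaries) instead of fixed slicing, a HuggingFace sentence-transformer model instead of the bag-of-words "embedding" below, and FAISS instead of a linear scan.

```python
import math
from collections import Counter

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Fixed-size character chunks; consecutive chunks share `overlap` characters
    # so no sentence is cut off without context on at least one side.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" standing in for the sentence-transformer.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(question: str, chunks: list[str], k: int = 4) -> list[str]:
    # What the FAISS lookup does conceptually: rank stored chunk vectors
    # by similarity to the query vector and keep the best k.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

In the real app, the `top_k` result is what gets stuffed into the Gemini prompt as context.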
```
pdfQuery/
├── main.py              # Main application file
├── requirements.txt     # Python dependencies
├── .env                 # Environment variables (not in git)
├── .gitignore           # Git ignore file
├── README.md            # This file
├── faiss_index/         # FAISS vector database (generated)
└── venv/                # Virtual environment (not in git)
```
You can modify the following parameters in main.py:
- Chunk Size: Default is 1000 characters (line 31)
- Chunk Overlap: Default is 200 characters (line 32)
- Embedding Model: Default is `sentence-transformers/all-MiniLM-L6-v2` (line 42)
- LLM Model: Default is `gemini-2.5-flash` (line 56)
- Temperature: Default is 0.3 for more focused answers (line 57)
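A rough sketch of how these knobs are typically wired with the packages from the dependency list (class and parameter names are from the LangChain integrations; the exact layout of `main.py` may differ):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI

# Chunking (lines 31-32): smaller chunks process faster but carry less context
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Local embeddings (line 42): swap model_name for any other HF sentence model
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# LLM (lines 56-57): lower temperature gives more focused, repeatable answers
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.3)
```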
1. "GOOGLE_API_KEY not found" error
   - Ensure your `.env` file exists and contains the API key
   - Verify the key is valid

2. "Please process the PDF first" error
   - Upload a PDF file first
   - Click "Submit & Process" before asking questions

3. Slow processing
   - First run downloads the HuggingFace model (~80MB)
   - Large PDFs take longer to process
   - Consider reducing chunk size for faster processing
4. Memory issues
   - For very large PDFs, consider processing them in batches
   - Reduce the chunk size or the chunk overlap (more overlap means more stored chunks)
Key dependencies include:
- `streamlit` - Web interface
- `langchain` - LLM framework
- `langchain-google-genai` - Google Gemini integration
- `langchain-huggingface` - HuggingFace embeddings
- `faiss-cpu` - Vector similarity search
- `pypdf` - PDF text extraction
- `python-dotenv` - Environment variable management

See `requirements.txt` for the complete list.
- Never commit your `.env` file or API keys to version control
- The `.gitignore` file is configured to exclude sensitive files
- Keep your Google API key secure and don't share it
- Embeddings: Run locally on CPU, no API costs
- LLM Calls: Only made when asking questions (uses Google API)
- FAISS Index: Saved locally for faster subsequent queries
- Model Caching: HuggingFace models are cached after first download
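The "saved locally" point is the standard FAISS persistence pattern in LangChain (method names from the `langchain-community` FAISS wrapper; `build_or_load_index` is a hypothetical helper, and the `faiss_index` path matches the project structure above):

```python
import os
from langchain_community.vectorstores import FAISS

def build_or_load_index(chunks, embeddings, path="faiss_index"):
    # Reuse a previously saved index when it exists; otherwise embed the
    # chunks once, save the index, and skip re-embedding on later runs.
    if os.path.isdir(path):
        return FAISS.load_local(
            path, embeddings, allow_dangerous_deserialization=True
        )
    db = FAISS.from_texts(chunks, embedding=embeddings)
    db.save_local(path)
    return db
```

`allow_dangerous_deserialization=True` is required by recent LangChain versions because the saved index is a pickle; it is safe here since the app only loads indexes it wrote itself.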
Potential improvements:
- Support for more document formats (DOCX, TXT, etc.)
- Conversation history and context
- Multiple language support
- Custom embedding models
- Export answers to file
- Advanced search filters
Built with ❤️ using Streamlit, LangChain, and Google Gemini