A Python-based Retrieval-Augmented Generation (RAG) application for chatting with GitHub repositories and profiles.
It indexes code files, builds embeddings, stores them in FAISS, and uses an LLM to answer questions with repository-grounded context.
Developers often spend a lot of time manually exploring unfamiliar repositories to:
- understand project structure
- locate entry points and important files
- explain functions/classes
- create summaries/documentation
This project solves that by letting users ask natural-language questions about a codebase and receive code-grounded answers through a chat interface.
- Accepts a GitHub repository URL or GitHub profile URL
- Clones and indexes one or more repositories
- Scans supported code files and chunks them for retrieval
- Builds repo map and symbol map metadata for better navigation
- Generates embeddings using `sentence-transformers`
- Stores vectors in FAISS
- Retrieves relevant code snippets for each question
- Uses a Groq-hosted LLM for answer generation
- Supports repo summarization, file-focused questions, and entry-point questions
- Python
- Gradio
- SentenceTransformers
- FAISS (CPU)
- Groq API (LLM inference)
- GitPython
- User enters a GitHub repo/profile URL in the UI.
- The app resolves the URL using the GitHub API.
- Repositories are cloned locally into `data/repos`.
- Code files are scanned and filtered by supported extensions.
- Files are chunked with overlap (and basic boundary-aware chunking).
- Repo metadata is created:
- repo map (file list)
- symbol map (top symbols per file)
- Chunks are embedded using a SentenceTransformer model.
- Embeddings and metadata are stored in a FAISS index (`data/faiss_index`).
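The overlap step above can be sketched roughly as follows. The function name `chunk_text`, the line-based windowing, and the default sizes are illustrative assumptions, not the project's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size line windows that overlap, so context
    spanning a chunk boundary still appears intact in at least one chunk."""
    lines = text.splitlines()
    chunks = []
    step = chunk_size - overlap  # advance by less than the window size
    for start in range(0, len(lines), step):
        window = lines[start:start + chunk_size]
        if window:
            chunks.append("\n".join(window))
        if start + chunk_size >= len(lines):
            break  # last window already reached the end of the file
    return chunks
```

The overlap means neighbouring chunks share a few lines, which helps retrieval when a relevant definition straddles a boundary.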
- User asks a question in the Gradio chat UI.
- The app detects query intent (general question, file question, symbol/function, entry point, summary).
- Relevant chunks are retrieved from FAISS (or directly from file/symbol-matched metadata).
- The app builds a token-budgeted prompt to avoid oversized requests.
- The LLM generates an answer using only the provided code context.
- The response is shown in the chat UI.
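The token-budgeted prompt step can be illustrated with a small sketch. The name `build_prompt`, the greedy packing order, and the whitespace-based token approximation are all assumptions for illustration; the real app presumably uses an actual tokenizer:

```python
def build_prompt(question: str, chunks: list[str], max_tokens: int = 3000) -> str:
    """Greedily pack retrieved chunks into the prompt until the budget is spent."""
    def approx_tokens(text: str) -> int:
        return len(text.split())  # crude word-count stand-in for a tokenizer

    budget = max_tokens - approx_tokens(question)
    context_parts = []
    for chunk in chunks:  # assumed sorted by retrieval relevance
        cost = approx_tokens(chunk)
        if cost > budget:
            break  # next chunk would exceed the budget
        context_parts.append(chunk)
        budget -= cost
    context = "\n\n---\n\n".join(context_parts)
    return f"Answer using only this code context:\n{context}\n\nQuestion: {question}"
```

Dropping lower-ranked chunks first keeps the request under the model's context limit while preserving the most relevant code.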
- Python 3.10+ (recommended)
- Git installed
- Internet access (for GitHub cloning and LLM API calls)
```bash
git clone https://github.com/mak4x13/Codebase-RAG.git
cd Codebase-RAG
```

Windows (PowerShell):

```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Create a `.env` file in the project root and add:
```env
GROQ_API_KEY=your_groq_api_key
GITHUB_TOKEN=your_github_token
```

Notes:
- `GROQ_API_KEY` is required for LLM responses.
- `GITHUB_TOKEN` is optional but recommended to avoid GitHub API rate limits.
```bash
python main.py
```

The app will:
- clear previous temporary repo/index data
- preload the embedding model
- launch the Gradio UI
Then open the local URL shown in the terminal (usually http://127.0.0.1:7860).
- Summarize the repo
- Summarize each repo
- Explain `main.py`
- Where is the entry point?
- Explain function `chunk_file`
- Prepare detailed documentation
- `main.py` - app entry point, startup cleanup, and UI launch
- `app/ui/` - Gradio UI and query routing logic
- `app/preprocessing/` - file scanning, chunking, repo preprocessing
- `app/embeddings/` - embedding model loader
- `app/vectorstore/` - FAISS storage and metadata handling
- `app/retrieval/` - vector retrieval logic
- `app/llm/` - Groq LLM client
- `app/github/` - GitHub URL/profile resolver
- `data/` - temporary cloned repositories and FAISS indices
- Very large repositories may still require narrower questions (e.g., specify a file or function).
- Symbol extraction is regex-based (lightweight) and not a full AST parser.
- Index data is temporary and reset on app startup/shutdown in the current workflow.
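To make the regex-based limitation concrete, here is a minimal sketch of what lightweight symbol extraction can look like for Python files. The pattern and the `extract_symbols` name are hypothetical; the project's own patterns and supported languages may differ:

```python
import re

# Matches `def name` or `class name` at the start of a (possibly indented) line.
# A regex like this misses dynamically created symbols and can over-match
# inside strings, which is why it is no substitute for a full AST parser.
SYMBOL_RE = re.compile(r"^\s*(?:def|class)\s+([A-Za-z_]\w*)", re.MULTILINE)

def extract_symbols(source: str) -> list[str]:
    """Return function/class names found by the regex, in source order."""
    return SYMBOL_RE.findall(source)
```

This trade-off keeps indexing fast and dependency-free at the cost of occasional false positives and misses.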