A Retrieval-Augmented Generation (RAG) system built with FastAPI that can:
- Crawl websites and store textual content
- Index the crawled data into vector embeddings
- Answer questions grounded in the retrieved context
- Explicitly respond with "not found in crawled content" if information is missing
REPOSITORY STRUCTURE
```
RAG_Service/
├─ app/
│  ├─ crawler.py
│  ├─ indexer.py
│  ├─ qa.py
│  ├─ embedder.py
│  └─ chunker.py
├─ data/
│  └─ crawls/          # auto-created crawl data
├─ images/             # screenshots for README
├─ main.py             # FastAPI entrypoint
├─ requirements.txt
├─ README.md
└─ .gitignore
```
Architecture Overview: "crawl → clean → chunk → embed → vector index → retrieve → grounded prompt → answer"
- Core Components:
Crawler (crawler.py) – Respects robots.txt, limits crawl depth and page count, normalizes HTML to text, and stores page content.

Indexer (indexer.py) – Splits documents into 800-character chunks with 200-character overlap, embeds text using SentenceTransformers, and builds a FAISS index.

Question Answering (qa.py) – Retrieves the top-k most similar chunks, constructs a grounded prompt, and generates answers using OpenAI or a fallback snippet.
- API (main.py) – Exposes three endpoints:
  - POST /crawl → crawl a domain
  - POST /index → build or update the FAISS index
  - POST /ask → ask a question and retrieve a grounded answer
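The Indexer's sliding-window chunking (800-character chunks with 200-character overlap) can be sketched as follows. This is an illustrative minimal version; the function and parameter names are hypothetical and may not match the actual chunker.py API:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks using a sliding window.

    Consecutive chunks share `overlap` characters so that sentences
    cut at a chunk boundary still appear whole in a neighbouring chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap  # window advances by 600 chars by default
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```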
This project demonstrates a complete RAG pipeline:
- Crawl text data from a given website.
- Chunk and embed the crawled data into a FAISS vector index.
- Retrieve relevant chunks when a user asks a question and generate a grounded answer.
- Crawl web pages within a given domain
- Store structured text and metadata (URL, title, content)
- Chunk text into overlapping sections for semantic embedding
- Embed chunks using the all-MiniLM-L6-v2 model
- Store embeddings in FAISS for efficient retrieval
- Query system with a "not found in crawled content" safeguard
- FastAPI backend with Swagger UI interface
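At retrieval time, a FAISS flat index performs brute-force nearest-neighbour search over the stored embeddings. The idea can be illustrated without FAISS itself using NumPy; this sketch mirrors what faiss.IndexFlatL2 does internally, with hypothetical function names:

```python
import numpy as np

def top_k_similar(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5):
    """Return indices and L2 distances of the k chunks closest to the query.

    This is the brute-force search that faiss.IndexFlatL2 performs,
    written out with NumPy for clarity.
    """
    dists = np.linalg.norm(chunk_vecs - query_vec, axis=1)  # L2 distance to each chunk
    order = np.argsort(dists)[:k]                           # k smallest distances
    return order, dists[order]
```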
STEPS TO IMPLEMENT
- Clone the repository:
```
git clone https://github.com/AR0910/RAG_Service.git
cd RAG_Service
```
- Create a Virtual Environment:
python -m venv .venv
source .venv/bin/activate # (Linux/Mac)
.venv\Scripts\activate # (Windows)
- Install Requirements:
pip install -r requirements.txt
- Run the Server:
uvicorn main:app --reload
- Open http://127.0.0.1:8000/docs in a browser for interactive API testing via the Swagger UI.
- In /crawl, provide the website to be crawled, for example:

```json
{
  "start_url": "https://www.python.org/",
  "max_pages": 10,
  "max_depth": 2,
  "crawl_delay_ms": 200
}
```
- In /index, edit the request as in the following example:

```json
{
  "source": "data/crawls/crawl_20251013_210000",
  "chunk_size": 800,
  "chunk_overlap": 100,
  "embedding_model": "all-MiniLM-L6-v2"
}
```
- In /ask, submit a query, for example:

```json
{
  "question": "What is Python used for?",
  "top_k": 5
}
```
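The same three requests can also be sent from a small Python client instead of the Swagger UI. This is an illustrative sketch using only the standard library; it assumes the server is running locally on the default port:

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8000"

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON body to the running FastAPI server and return the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example payloads mirroring the Swagger UI requests above.
crawl_payload = {
    "start_url": "https://www.python.org/",
    "max_pages": 10,
    "max_depth": 2,
    "crawl_delay_ms": 200,
}
ask_payload = {"question": "What is Python used for?", "top_k": 5}

# With the server running:
# crawl_result = post_json("/crawl", crawl_payload)
# answer = post_json("/ask", ask_payload)
```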
OVERALL WORKING
Step 1: Crawl → POST to /crawl with a start_url → pages are saved automatically to data/crawls/crawl_
Step 2: Index → POST to /index with the path returned in Step 1
Step 3: Ask → POST to /ask with your question → if the answer is not found, the response is "not found in crawled content"
TOOLING AND PROMPTS
| Component | Model / Library | Purpose |
|---|---|---|
| Embeddings | text-embedding-3-small (OpenAI) | Convert text chunks and questions into vector representations for similarity search. |
| Question Answering | gpt-4o-mini (OpenAI) | Generate grounded answers from retrieved context. |
LIBRARIES AND FRAMEWORKS
- FastAPI – API server for /crawl, /index, /ask endpoints.
- aiohttp – Asynchronous HTTP requests for crawling pages.
- BeautifulSoup4 – HTML parsing and link extraction.
- readability-lxml – Extract main textual content from HTML pages.
- tldextract – Determine the registrable domain for domain-limited crawling.
- FAISS – Vector indexing for similarity search.
- Pydantic – Request validation and data modeling.
- OpenAI Python SDK – API calls for embeddings and chat generation.
- Python 3.10+ – Project language environment.
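As an illustration of the crawler's robots.txt compliance mentioned above, a minimal check can be written with the standard library's urllib.robotparser. The function and user-agent names are hypothetical, and the actual crawler.py logic may differ:

```python
from urllib.robotparser import RobotFileParser

def allowed(url: str, robots_txt: str, agent: str = "RAGCrawler") -> bool:
    """Check whether robots.txt rules permit `agent` to fetch `url`.

    The robots.txt body is passed in as a string, so this check works
    on content already downloaded by the crawler.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```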
PROMPTS USED IN CODE

```
Context: [1] (source_url_1) snippet text 1...
[2] (source_url_2) snippet text 2...

Question: {user_question}

Instructions:
- Answer using ONLY the above context.
- Cite sources using [1], [2], etc.
- If the context does not contain enough information, respond exactly: 'not found in crawled content'.
```
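In code, a prompt of this shape can be assembled from the retrieved (source URL, snippet) pairs with a small helper. This is a sketch with hypothetical names, not the exact qa.py implementation:

```python
def build_prompt(snippets: list[tuple[str, str]], question: str) -> str:
    """Assemble a grounded prompt from retrieved (source_url, text) pairs."""
    # Number each snippet so the model can cite [1], [2], ...
    context = "\n".join(
        f"[{i}] ({url}) {text}" for i, (url, text) in enumerate(snippets, start=1)
    )
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Instructions:\n"
        "- Answer using ONLY the above context.\n"
        "- Cite sources using [1], [2], etc.\n"
        "- If the context does not contain enough information, respond exactly: "
        "'not found in crawled content'."
    )
```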
SCREENSHOT EXAMPLES
Screenshots are included in the /images folder.