A Retrieval-Augmented Generation (RAG) system built with FastAPI that can:
- Crawl websites and store textual content
- Index the crawled data into vector embeddings
- Answer questions grounded in the retrieved context
- Explicitly respond with "not found in crawled content" if information is missing
REPOSITORY STRUCTURE
```
RAG_Service/
├─ app/
│  ├─ crawler.py
│  ├─ indexer.py
│  ├─ qa.py
│  ├─ embedder.py
│  └─ chunker.py
├─ data/
│  └─ crawls/          # auto-created crawl data
├─ images/             # screenshots for README
├─ main.py             # FastAPI entrypoint
├─ requirements.txt
├─ README.md
└─ .gitignore
```
Architecture Overview: "crawl → clean → chunk → embed → vector index → retrieve → grounded prompt → answer"
- Core Components:
Crawler (crawler.py) – Respects robots.txt, limits crawl depth and page count, normalizes HTML to text, and stores page content.

Indexer (indexer.py) – Splits documents into 800-character chunks with 200-character overlap, embeds text using SentenceTransformers, and builds a FAISS index.

Question Answering (qa.py) – Retrieves the top-k most similar chunks, constructs a grounded prompt, and generates answers using OpenAI or a fallback snippet.
- API (main.py) – Exposes three endpoints:
  - POST /crawl → crawl a domain
  - POST /index → build or update the FAISS index
  - POST /ask → ask a question and retrieve a grounded answer
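The Indexer's sliding-window chunking (800-character chunks with 200-character overlap) can be sketched as follows. This is an illustrative minimal version; the function and parameter names are hypothetical and may not match the actual chunker.py API:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks using a sliding window.

    Consecutive chunks share `overlap` characters so that sentences
    cut at a chunk boundary still appear whole in a neighbouring chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap  # window advances by 600 chars by default
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```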
This project demonstrates a complete RAG pipeline:
- Crawl text data from a given website.
- Chunk and embed the crawled data into a FAISS vector index.
- Retrieve relevant chunks when a user asks a question and generate a grounded answer.
- Crawl web pages within a given domain
- Store structured text and metadata (URL, title, content)
- Chunk text into overlapping sections for semantic embedding
- Embed chunks using the all-MiniLM-L6-v2 model
- Store embeddings in FAISS for efficient retrieval
- Query system with a "not found in crawled content" safeguard
- FastAPI backend with Swagger UI interface
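At retrieval time, a FAISS flat index performs brute-force nearest-neighbour search over the stored embeddings. The idea can be illustrated without FAISS itself using NumPy; this sketch mirrors what faiss.IndexFlatL2 does internally, with hypothetical function names:

```python
import numpy as np

def top_k_similar(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5):
    """Return indices and L2 distances of the k chunks closest to the query.

    This is the brute-force search that faiss.IndexFlatL2 performs,
    written out with NumPy for clarity.
    """
    dists = np.linalg.norm(chunk_vecs - query_vec, axis=1)  # L2 distance to each chunk
    order = np.argsort(dists)[:k]                           # k smallest distances
    return order, dists[order]
```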
STEPS TO IMPLEMENT
- Clone the repository:
```
git clone https://github.com/AR0910/RAG_Service.git
cd RAG_Service
```
- Create a Virtual Environment:
python -m venv .venv
source .venv/bin/activate # (Linux/Mac)
.venv\Scripts\activate # (Windows)
- Install Requirements:
pip install -r requirements.txt
- Run the Server:
uvicorn main:app --reload
- Open http://127.0.0.1:8000/docs in a browser for interactive API testing via the Swagger UI.
- In /crawl, provide the website to be crawled, for example:

```json
{
  "start_url": "https://www.python.org/",
  "max_pages": 10,
  "max_depth": 2,
  "crawl_delay_ms": 200
}
```
- In /index, edit the request as in the following example:

```json
{
  "source": "data/crawls/crawl_20251013_210000",
  "chunk_size": 800,
  "chunk_overlap": 100,
  "embedding_model": "all-MiniLM-L6-v2"
}
```
- In /ask, submit a query, for example:

```json
{
  "question": "What is Python used for?",
  "top_k": 5
}
```
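The same three requests can also be sent from a small Python client instead of the Swagger UI. This is an illustrative sketch using only the standard library; it assumes the server is running locally on the default port:

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8000"

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON body to the running FastAPI server and return the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example payloads mirroring the Swagger UI requests above.
crawl_payload = {
    "start_url": "https://www.python.org/",
    "max_pages": 10,
    "max_depth": 2,
    "crawl_delay_ms": 200,
}
ask_payload = {"question": "What is Python used for?", "top_k": 5}

# With the server running:
# crawl_result = post_json("/crawl", crawl_payload)
# answer = post_json("/ask", ask_payload)
```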
OVERALL WORKING
Step 1: Crawl → POST to /crawl with a start_url → pages are saved automatically to data/crawls/crawl_
Step 2: Index → POST to /index with the path returned in Step 1
Step 3: Ask → POST to /ask with your question → if the answer is not found, the response is "not found in crawled content"
TOOLING AND PROMPTS
| Component | Model / Library | Purpose |
|---|---|---|
| Embeddings | text-embedding-3-small (OpenAI) | Convert text chunks and questions into vector representations for similarity search. |
| Question Answering | gpt-4o-mini (OpenAI) | Generate grounded answers from retrieved context. |
LIBRARIES AND FRAMEWORKS
- FastAPI – API server for /crawl, /index, /ask endpoints.
- aiohttp – Asynchronous HTTP requests for crawling pages.
- BeautifulSoup4 – HTML parsing and link extraction.
- readability-lxml – Extract main textual content from HTML pages.
- tldextract – Determine the registrable domain for domain-limited crawling.
- FAISS – Vector indexing for similarity search.
- Pydantic – Request validation and data modeling.
- OpenAI Python SDK – API calls for embeddings and chat generation.
- Python 3.10+ – Project language environment.
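As an illustration of the crawler's robots.txt compliance mentioned above, a minimal check can be written with the standard library's urllib.robotparser. The function and user-agent names are hypothetical, and the actual crawler.py logic may differ:

```python
from urllib.robotparser import RobotFileParser

def allowed(url: str, robots_txt: str, agent: str = "RAGCrawler") -> bool:
    """Check whether robots.txt rules permit `agent` to fetch `url`.

    The robots.txt body is passed in as a string, so this check works
    on content already downloaded by the crawler.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```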
PROMPTS USED IN CODE

```
Context: [1] (source_url_1) snippet text 1...
[2] (source_url_2) snippet text 2...

Question: {user_question}

Instructions:
- Answer using ONLY the above context.
- Cite sources using [1], [2], etc.
- If the context does not contain enough information, respond exactly: 'not found in crawled content'.
```
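In code, a prompt of this shape can be assembled from the retrieved (source URL, snippet) pairs with a small helper. This is a sketch with hypothetical names, not the exact qa.py implementation:

```python
def build_prompt(snippets: list[tuple[str, str]], question: str) -> str:
    """Assemble a grounded prompt from retrieved (source_url, text) pairs."""
    # Number each snippet so the model can cite [1], [2], ...
    context = "\n".join(
        f"[{i}] ({url}) {text}" for i, (url, text) in enumerate(snippets, start=1)
    )
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Instructions:\n"
        "- Answer using ONLY the above context.\n"
        "- Cite sources using [1], [2], etc.\n"
        "- If the context does not contain enough information, respond exactly: "
        "'not found in crawled content'."
    )
```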
SCREENSHOT EXAMPLES
Screenshots are included in the /images folder.