AR0910/RAG_Service

RAG_Service

A Retrieval-Augmented Generation (RAG) system built with FastAPI that can:

  • Crawl websites and store textual content
  • Index the crawled data into vector embeddings
  • Answer questions grounded in the retrieved context
  • Explicitly respond with "not found in crawled content" if information is missing

REPOSITORY STRUCTURE

RAG_Service/
│
├─ app/
│  ├─ crawler.py
│  ├─ indexer.py
│  ├─ qa.py
│  ├─ embedder.py
│  └─ chunker.py
│
├─ data/
│  └─ crawls/          # auto-created crawl data
│
├─ images/             # screenshots for README
├─ main.py             # FastAPI entrypoint
├─ requirements.txt
├─ README.md
└─ .gitignore


Architecture Overview: "crawl → clean → chunk → embed → vector index → retrieve → grounded prompt → answer"

Core Components:

  • Crawler (crawler.py) – Respects robots.txt, limits crawl depth/pages, normalizes HTML to text, and stores page content.
  • Indexer (indexer.py) – Splits documents into 800-character chunks with 200-character overlap, embeds text using SentenceTransformers, and builds a FAISS index.
  • Question-Answer (qa.py) – Retrieves the top-k most similar chunks, constructs a grounded prompt, and generates answers using OpenAI or a fallback snippet.
  • API (main.py) – Exposes three endpoints:
      • POST /crawl → crawl a domain
      • POST /index → build or update the FAISS index
      • POST /ask → ask a question and retrieve a grounded answer

Overview

This project demonstrates a complete RAG pipeline:

  1. Crawl text data from a given website.
  2. Chunk and embed the crawled data into a FAISS vector index.
  3. Retrieve relevant chunks when a user asks a question and generate a grounded answer.
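The retrieval step (3) can be illustrated without the full stack. The snippet below is a plain-NumPy stand-in for FAISS: it ranks chunk embeddings against a question embedding by cosine similarity, which is what the index does at much larger scale. The toy 3-dimensional vectors are made up for illustration.

```python
# Pure-NumPy stand-in for the FAISS retrieval step: given embedded
# chunks and an embedded question, return indices of the top-k
# most similar chunks by cosine similarity.
import numpy as np

def top_k_chunks(chunk_vecs: np.ndarray, query_vec: np.ndarray, k: int = 5):
    # Normalize rows so the dot product equals cosine similarity.
    chunks = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = chunks @ query
    # Highest scores first.
    return np.argsort(scores)[::-1][:k]

# Toy example: 3-dimensional "embeddings" for four chunks.
vecs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
print(top_k_chunks(vecs, np.array([1.0, 0.05, 0.0]), k=2))  # → [0 1]
```

FAISS replaces the brute-force matrix product with an optimized index, but the ranking semantics are the same.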

Features

  1. Crawl web pages within a given domain
  2. Store structured text and metadata (URL, title, content)
  3. Chunk text into overlapping sections for semantic embedding
  4. Embed chunks using all-MiniLM-L6-v2 model
  5. Store embeddings in FAISS for efficient retrieval
  6. Query system with “not enough information” safeguard
  7. FastAPI backend with Swagger UI interface
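The overlapping-chunk behavior from feature 3 can be sketched in a few lines. This is an illustrative character-based chunker using the 800/200 figures quoted for the indexer, not the repository's chunker.py itself.

```python
# Illustrative overlapping character chunker (800-char chunks,
# 200-char overlap, matching the numbers quoted for the indexer).
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200):
    step = chunk_size - overlap  # each chunk starts 600 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

parts = chunk_text("a" * 2000, chunk_size=800, overlap=200)
print(len(parts), [len(p) for p in parts])  # → 3 [800, 800, 800]
```

The overlap means each chunk repeats the tail of its predecessor, so a sentence straddling a chunk boundary still appears whole in at least one chunk.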

STEPS TO IMPLEMENT

  1. Clone the repository:
     git clone https://github.com/AR0910/RAG_Service.git
     cd RAG_Service
  2. Create a virtual environment:
     python -m venv .venv
     source .venv/bin/activate    # (Linux/Mac)
     .venv\Scripts\activate       # (Windows)
  3. Install the requirements:
     pip install -r requirements.txt
  4. Run the server:
     uvicorn main:app --reload
  5. Open http://127.0.0.1:8000/docs for interactive API testing via the Swagger UI.

  • In /crawl, supply the website to be crawled, for example:

    {
      "start_url": "https://www.python.org/",
      "max_pages": 10,
      "max_depth": 2,
      "crawl_delay_ms": 200
    }

  • In /index, edit the request as in the following example:

    {
      "source": "data/crawls/crawl_20251013_210000",
      "chunk_size": 800,
      "chunk_overlap": 100,
      "embedding_model": "all-MiniLM-L6-v2"
    }

  • In /ask, submit a query, such as:

    {
      "question": "What is Python used for?",
      "top_k": 5
    }
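The three example payloads above can also be driven from a script instead of the Swagger UI. The sketch below uses only the standard library and assumes the uvicorn server from the previous section is running on 127.0.0.1:8000; the exact response keys depend on the service.

```python
# Driving the three endpoints with the stdlib urllib client.
# Assumes the server is running locally on port 8000.
import json
import urllib.request

BASE = "http://127.0.0.1:8000"

def post(path: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

crawl_body = {"start_url": "https://www.python.org/", "max_pages": 10,
              "max_depth": 2, "crawl_delay_ms": 200}
ask_body = {"question": "What is Python used for?", "top_k": 5}

# With the server running:
#   post("/crawl", crawl_body)
#   post("/index", {"source": "data/crawls/...", "chunk_size": 800,
#                   "chunk_overlap": 100,
#                   "embedding_model": "all-MiniLM-L6-v2"})
#   post("/ask", ask_body)
```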


OVERALL WORKING

Step 1: Crawl – POST to /crawl with a start_url; pages are saved automatically under data/crawls/crawl_

Step 2: Index – POST to /index with the path returned in Step 1.

Step 3: Ask – POST to /ask with your question; if the answer is not found, the response is "not found in crawled content".


TOOLING AND PROMPTS

Component            Model / Library                    Purpose
Embeddings           text-embedding-3-small (OpenAI)    Convert text chunks and questions into vector representations for similarity search.
Question Answering   gpt-4o-mini (OpenAI)               Generate grounded answers from retrieved context.

Libraries and Frameworks

  • FastAPI – API server for /crawl, /index, /ask endpoints.
  • aiohttp – Asynchronous HTTP requests for crawling pages.
  • BeautifulSoup4 – HTML parsing and link extraction.
  • readability-lxml – Extract main textual content from HTML pages.
  • tldextract – Determine the registrable domain for domain-limited crawling.
  • FAISS – Vector indexing for similarity search.
  • Pydantic – Request validation and data modeling.
  • OpenAI Python SDK – API calls for embeddings and chat generation.
  • Python 3.10+ – Project language environment.
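The tldextract bullet deserves a concrete illustration: the crawler must decide whether a discovered link stays inside the target domain. The snippet below is a simplified stdlib approximation that compares hostnames; the real tldextract-based check would additionally collapse subdomains to the registrable domain (so docs.python.org and www.python.org could count as the same site).

```python
# Simplified stand-in for the tldextract-based domain check:
# keep the crawl inside the start URL's exact host. tldextract
# would compare registrable domains instead of full hostnames.
from urllib.parse import urlparse

def same_host(start_url: str, candidate_url: str) -> bool:
    return urlparse(start_url).hostname == urlparse(candidate_url).hostname

print(same_host("https://www.python.org/", "https://www.python.org/about/"))  # → True
print(same_host("https://www.python.org/", "https://docs.python.org/3/"))     # → False
```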

PROMPTS USED IN CODE

Context: [1] (source_url_1) snippet text 1...

[2] (source_url_2) snippet text 2...

Question: {user_question}

Instructions:

  • Answer using ONLY the above context.
  • Cite sources using [1], [2], etc.
  • If the context does not contain enough information, respond exactly: 'not found in crawled content'.
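The prompt template above can be assembled from retrieved chunks as sketched below. The chunk dictionary shape (source_url, text) is an assumption for illustration; the repository's qa.py defines the real structure.

```python
# Assembling the grounded prompt shown above from retrieved chunks.
# Chunk dict keys (source_url, text) are assumed for illustration.
def build_prompt(chunks: list[dict], question: str) -> str:
    context = "\n\n".join(
        f"[{i}] ({c['source_url']}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        f"Context: {context}\n\n"
        f"Question: {question}\n\n"
        "Instructions:\n"
        "- Answer using ONLY the above context.\n"
        "- Cite sources using [1], [2], etc.\n"
        "- If the context does not contain enough information, "
        "respond exactly: 'not found in crawled content'."
    )

prompt = build_prompt(
    [{"source_url": "https://www.python.org/",
      "text": "Python is a programming language..."}],
    "What is Python used for?",
)
print(prompt.splitlines()[0])
```

Numbering the context snippets is what lets the model cite [1], [2], etc., and the explicit fallback instruction is what produces the "not found in crawled content" safeguard.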

SCREENSHOT EXAMPLES

Screenshots are included in the /images folder.
