A production-oriented Retrieval-Augmented Generation (RAG) system built over the official FastAPI documentation. The project is built on a single, opinionated thesis: retrieval quality determines answer quality. Every architectural decision — from the custom crawler to the metadata schema — is a direct consequence of that belief.
Off-the-shelf RAG pipelines apply generic document loaders and token-window chunking to documentation. This destroys the structural information that makes documentation useful: section hierarchy, the co-location of code examples with their prose explanations, and the canonical ordering of pages.
The result is a retrieval index that returns fragments — disconnected sentences, orphaned code blocks, partial definitions — producing answers that hallucinate or contradict the source.
This project solves that by treating documentation structure as a first-class engineering constraint.
flowchart TD
A[FastAPI Docs Website] -->|HTTP GET| B[Canonical URL Crawler]
B -->|url.json - 104 ordered URLs| C[Page Scraper]
C -->|Structured per-page JSON| D[Hierarchy Reconstructor]
D -->|h1 / h2 / h3 section tree| E[Structure-Aware Chunker]
E -->|~300 enriched chunks| F[Metadata Schema]
F -->|section_path, depth, url, has_code| G[HuggingFace Embeddings\nall-mpnet-base-v2, 768-dim]
G -->|Dense vectors| H[Pinecone Vector Index\ncosine similarity, serverless AWS]
H -->|Top-k retrieval| I[Context Assembler]
I -->|Augmented prompt| J[Groq LLM\nLlama-3.1-8b-instant]
J -->|Grounded answer + citations| K[Client]
style A fill:#1a1a2e,color:#e0e0e0
style B fill:#16213e,color:#e0e0e0
style C fill:#16213e,color:#e0e0e0
style D fill:#0f3460,color:#e0e0e0
style E fill:#0f3460,color:#e0e0e0
style F fill:#0f3460,color:#e0e0e0
style G fill:#533483,color:#e0e0e0
style H fill:#533483,color:#e0e0e0
style I fill:#533483,color:#e0e0e0
style J fill:#e94560,color:#ffffff
style K fill:#e94560,color:#ffffff
| # | Decision | Rationale |
|---|---|---|
| 1 | Canonical URL crawler over sitemap or DFS | FastAPI docs use <link rel="canonical"> + <link rel="next"> to define a canonical reading order. Following these guarantees a stable, deduplicated, ordered URL list — a prerequisite for deterministic hierarchy reconstruction. |
| 2 | Typed DOM walker over get_text() or markdownify |
Classifies every node (heading, text, list_item, code) by tag and CSS class. Separates tabbed code panels from inline highlights, strips admonition noise, and preserves indentation with CODE / /CODE sentinels — all silently lost by flat-text extraction. |
| 3 | Section-boundary chunking over token windows | Each chunk maps to one semantic unit: a heading plus its prose and code. Keeps explanations and examples co-located so retrieval returns complete context, not orphaned fragments. |
| 4 | Rich metadata schema (section_path, depth, url, has_code) |
Enables filtered retrieval beyond embedding similarity — e.g., restrict to code-heavy chunks, or surface the exact breadcrumb for citation grounding. Smaller LLMs respond better with tighter, scoped context. |
| 5 | all-mpnet-base-v2 + Pinecone serverless |
Top-performing dense retrieval model without GPU inference; cosine metric is correct for normalized embeddings. Serverless tier removes pod-sizing overhead during the research phase. |
| 6 | Provisional LLM (Llama-3.1-8b-instant) |
LLM selection deferred until retrieval is validated. Benchmarking generation quality against a weak retrieval baseline produces misleading signals; final model choice gates on RAGAS results. |
| Component | Decision |
|---|---|
| Canonical URL crawler | No sitemap; uses <link rel="canonical"> + <link rel="next"> traversal |
| Typed DOM parser | Walks descendants; classifies each node by tag and CSS class |
| Code block normalizer | Strips whitespace noise, preserves indentation, adds sentinels |
| Metadata schema | section_path, depth, URL, code flag per chunk |
Standard library components (LangChain, Pinecone SDK, HuggingFace) handle embedding, vector storage, and LLM invocation.
- 104 FastAPI documentation pages crawled, parsed, and indexed with zero loss of code blocks or section hierarchy
- ~300 structure-aligned chunks produced, each carrying full section-path metadata for citation grounding
- Retrieval operates over 768-dimensional dense embeddings with cosine similarity in a serverless vector index — no infrastructure to manage during development
- Answers include source URLs and section breadcrumbs, making the system auditable rather than opaque
- Python 3.10+
- A Pinecone account and API key (free tier is sufficient)
- A Groq API key (free tier is sufficient)
# Clone the repository
git clone https://github.com/your-username/doc-struct-rag.git
cd doc-struct-rag
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # Linux / macOS
venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txtCreate a .env file in the project root:
PINECONE_API_KEY=your_pinecone_api_key
HUGGINGFACE_API_KEY=your_huggingface_api_key
GROQ_API_KEY=your_groq_api_key
python -m scraping.url_crawl
# Produces: url.json (104 ordered FastAPI documentation URLs)Open scraping_notebook.ipynb and run the pipeline cell, or use the module directly:
from scraping.scraping_pipeline import scrape_all_pages
import json
with open("url.json") as f:
url_list = json.load(f)
scrape_all_pages("https://fastapi.tiangolo.com/", url_list)
# Produces: document.json (structured, hierarchy-aware page content)The rag.py module handles embedding and Pinecone upsert. Uncomment the upsert loop in rag.py and run:
python rag.py
# Embeds chunks with all-mpnet-base-v2 and upserts into Pineconedoc-struct-rag/
├── scraping/
│ ├── url_crawl.py # Canonical-aware documentation crawler
│ ├── scrape_page.py # Typed DOM parser with code normalization
│ ├── scraping_pipeline.py # Orchestrates multi-page scrape to document.json
│ └── webscrape.py # Generic page text extractor (prototype utility)
├── rag.py # Embedding, Pinecone indexing, and generation chain
├── main.py # FastAPI REST interface (POST /api/rag)
├── requirements.txt
└── scraping_notebook.ipynb # Exploratory pipeline notebook
| Item | Status | Notes |
|---|---|---|
| Evaluation with RAGAS | Active | Measuring context precision, answer faithfulness, and recall against ground-truth QA pairs |
| Metadata-filtered retrieval | Active | Implementing has_code and depth filters in Pinecone query calls |
| Multi-turn conversation | Planned | Stateful chatbot with session-level context window management |
| LLM benchmarking | Planned | Comparing Llama-3.1-8b vs. larger models on faithfulness vs. latency; final model selection gates on RAGAS results |
| Dockerized deployment | Planned | Single container for API server; Pinecone remains external managed service |
| Usage analytics | Planned | Query logging and retrieval latency tracking per request |
| Code refactor | Planned | Separating pipeline stages into modular CLI commands; removing prototype scripts from production path |
| Layer | Technology |
|---|---|
| Crawling + Parsing | Python, Requests, BeautifulSoup4 |
| Chunking | Custom hierarchy-aware strategy |
| Embeddings | sentence-transformers/all-mpnet-base-v2 (HuggingFace) |
| Vector Database | Pinecone Serverless (AWS us-east-1, cosine) |
| LLM | Llama-3.1-8b-instant via Groq |
| Orchestration | LangChain (PromptTemplate, ChatModel, chain composition) |
| API | FastAPI + Uvicorn |
| Environment | python-dotenv |
MIT