English-language documentation for developers who want to understand, run, and extend the project.
This repository contains a complete Retrieval-Augmented Generation (RAG) stack that answers questions about Datapizza-AI by combining local FAQ markdown files with the official documentation indexed via MCP. The chatbot uses Google Gemini for embeddings and generation, Streamlit for the UI, and Qdrant as the vector database. When no relevant information is found, it falls back to the exact sentence “Non sono ancora state fatte domande a riguardo.” (Italian for “No questions have been asked about this yet.”).
- Streamlit front end with multilingual UI and debugging tools
- Semantic retrieval powered by Google embeddings (`gemini-embedding-001` by default)
- Query rewriting to improve recall on ambiguous prompts
- Qdrant vector store with dynamic dimension detection
- Google Gemini 2.5 Flash for response generation with conversation memory
- Dual interface: modern web experience and terminal chatbot
- Modular pipelines for ingestion, retrieval, and generator orchestration
- Optional integration with the official docs through the MCP server
Markdown FAQs → TextParser → NodeSplitter → ChunkEmbedder → Qdrant vector store
User query → ToolRewriter → Embedder → Qdrant retrieval → Prompt template → Gemini 2.5 Flash + Memory
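The last steps of this flow assemble the retrieved chunks into a prompt before calling Gemini. The actual template is not reproduced in this README; the sketch below (template wording and the `build_prompt` helper are illustrative assumptions) shows the general shape, including the instruction that produces the exact fallback sentence:

```python
# Illustrative prompt assembly; the real template lives in the chatbot code.
PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context is empty or irrelevant, reply exactly:
"Non sono ancora state fatte domande a riguardo."

Context:
{context}

Question: {question}
"""

def build_prompt(context_chunks, question):
    # Separate chunks so the model can tell document boundaries apart.
    context = "\n---\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

print(build_prompt(["Datapizza-AI is a modular RAG framework."], "What is Datapizza-AI?"))
```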
The system is split into FAQ ingestion, official documentation ingestion via MCP, and a retrieval/generation stack that merges both sources when available.
```mermaid
graph TD
    A1["Markdown FAQ<br/>(datapizza_faq.md, FAQ_Video.md, Scripts/*.md)"] --> P1[TextParser]
    P1 --> S1["Node/Recursive Splitter"]
    S1 --> E1["ChunkEmbedder<br/>Google Gemini"]
    E1 --> Q1["Qdrant collection:<br/>datapizzai_faq"]
```
```mermaid
graph TD
    B1["GitHub repo<br/>(datapizza-ai/docs)"] --> P2[TextParser]
    P2 --> S2[RecursiveSplitter]
    S2 --> E2["ChunkEmbedder<br/>OpenAI text-embedding-3-small"]
    E2 --> Q2["Qdrant collection:<br/>datapizza_official_docs"]
```
```mermaid
graph TD
    U["User question"] --> RW["ToolRewriter<br/>(Gemini)"]
    RW --> GE["Google Embedder"]
    GE --> VR1["Qdrant search<br/>datapizzai_faq"]
    U -.->|MCP| EMB["OpenAI Embedding"]
    EMB --> VR2["Qdrant search<br/>datapizza_official_docs"]
    VR1 --> CTX["Context builder"]
    VR2 --> CTX
    CTX --> PR["Prompt Template"]
    PR --> LLM["Gemini 2.5 Flash"]
    LLM --> OUT["Final answer"]
    MEM[(Memory)] <--> LLM
    APP["Streamlit app.py"] --> OUT
    TEST["test_mcp_retriever.py"] --> VR2
```
```bash
# Web UI quick start
# Activate the virtual environment
source rag/bin/activate

# Configure .env with your Google and OpenAI keys

# Start Qdrant in another terminal
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

# Launch the Streamlit UI
./run_web.sh
# or
streamlit run app.py
```

```bash
# Terminal chatbot
source rag/bin/activate
# Set GOOGLE_API_KEY in .env
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
python chatbot_faq.py
# or
./run_chatbot.sh
```

```bash
# Full setup from scratch
source rag/bin/activate
pip install -r requirements.txt
cp .env.example .env
# add your API keys
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
python test_setup.py   # validate environment
python ingest_faq.py   # index FAQ files
./run_web.sh           # Streamlit UI
# or
python chatbot_faq.py  # terminal mode
```

Example conversation:

```text
You: What makes Datapizza-AI different from LangChain?
Bot: The main difference is the abstraction level of each module. LangChain
     keeps you on rails, while Datapizza-AI exposes low-level knobs so you can
     tune each step of your RAG workflow...

You: Does it support open-source models?
Bot: Yes. The framework works with Llama models as documented in the official
     guides, where you can run a client or server locally...

You: What is photosynthesis?
Bot: Non sono ancora state fatte domande a riguardo.
```
```text
datapizzaAI-RAG/
├── app.py                     # Streamlit front end
├── chatbot_faq.py             # FAQ chatbot (Gemini + DagPipeline)
├── chatbot_enhanced.py        # FAQ + MCP docs chatbot
├── ingest_faq.py              # FAQ ingestion script
├── official_docs_retriever.py # MCP retriever helper
├── datapizza_faq.md           # Main FAQ file
├── FAQ_Video.md               # FAQs extracted from video tutorials
├── Scripts/*.md               # English transcripts, auto-tagged
├── qdrant_config.py           # Qdrant helpers and collection setup
├── run_web.sh / run_chatbot.sh
├── tests/*.py                 # Setup, chatbot, and MCP tests
├── README.md, START_HERE.md, USAGE_GUIDE.md, etc.
└── requirements.txt
```
- Datapizza-AI for pipelines, modules, and clients
- Streamlit for the user interface
- Google Gemini for embeddings and LLM responses
- Qdrant for similarity search
- Python 3.13 or newer
Builds an IngestionPipeline that reads the markdown FAQ files, splits their content into semantically meaningful chunks, and generates embeddings with Google Gemini. English scripts under Scripts/ are included automatically with metadata (language="en", type="scripts"), and everything is stored in the `datapizzai_faq` Qdrant collection. The script detects embedding dimensionality at runtime, so the vector store is always created with the correct size.
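The runtime dimension detection boils down to embedding a probe string and measuring the vector. The sketch below is a simplified illustration, not the project's actual code: `detect_embedding_dim` is a hypothetical helper, and the lambda stands in for the real Gemini embedder.

```python
from typing import Callable, List

def detect_embedding_dim(embed_fn: Callable[[str], List[float]]) -> int:
    """Embed a short probe string and return the vector length.

    Measuring instead of hard-coding means that swapping the embedding
    model never creates a Qdrant collection with the wrong vector size.
    """
    probe_vector = embed_fn("dimension probe")
    return len(probe_vector)

# Stub embedder standing in for Google Gemini:
fake_embed = lambda text: [0.0] * 768
print(detect_embedding_dim(fake_embed))  # 768
```

The detected size would then be passed to the collection-creation call so the vector store always matches the model in use.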
Implements a DagPipeline chatbot with query rewriting, vector retrieval, Gemini generation, and conversation memory. If no relevant information is returned, the answer falls back to “Non sono ancora state fatte domande a riguardo.” The class exposes parameters for `k`, `score_threshold`, maximum chunk size, and debug mode.
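The threshold-plus-fallback behavior can be approximated in a few lines. This is a sketch, not the project's actual code: the chunk shape (dicts with `text` and `score` fields) and the `answer_or_fallback` helper are assumptions for illustration.

```python
FALLBACK_ANSWER = "Non sono ancora state fatte domande a riguardo."

def answer_or_fallback(chunks, score_threshold=0.5):
    """Keep only chunks at or above the similarity threshold; if none
    survive, return the exact fallback sentence instead of guessing."""
    relevant = [c for c in chunks if c["score"] >= score_threshold]
    if not relevant:
        return FALLBACK_ANSWER
    # In the real pipeline the surviving chunks would feed the LLM prompt;
    # here we just concatenate them to show the control flow.
    return " ".join(c["text"] for c in relevant)

# No chunk clears the 0.5 threshold, so the fallback fires:
print(answer_or_fallback([{"text": "off-topic", "score": 0.31}]))
# Non sono ancora state fatte domande a riguardo.
```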
Streamlit front end with multilingual support (Italian, English, German). It offers configuration toggles, statistics, debugging panels for retrieved chunks, and an optional hook for the official documentation if MCP indexing is available.
Extended chatbot that merges FAQ chunks and documentation chunks retrieved through the MCP server and the datapizza_official_docs collection. It manages language-specific fallbacks, asynchronous calls, and fine-grained debug traces.
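Merging the two sources amounts to ranking all chunks together and dropping duplicates. A minimal sketch of that step (the `merge_contexts` helper and the chunk shape are assumptions, not the project's actual code):

```python
def merge_contexts(faq_chunks, docs_chunks, max_chunks=10):
    """Merge FAQ and official-docs results into one ranked context,
    skipping duplicate texts and keeping the best-scoring chunks."""
    seen, merged = set(), []
    all_chunks = sorted(faq_chunks + docs_chunks,
                        key=lambda c: c["score"], reverse=True)
    for chunk in all_chunks:
        if chunk["text"] not in seen:
            seen.add(chunk["text"])
            merged.append(chunk)
    return merged[:max_chunks]

faq = [{"text": "Install with pip.", "score": 0.9}]
docs = [{"text": "Install with pip.", "score": 0.8},
        {"text": "See the client guide.", "score": 0.7}]
print([c["text"] for c in merge_contexts(faq, docs)])
# ['Install with pip.', 'See the client guide.']
```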
You can tweak the chatbot behavior in `chatbot_faq.py`:

```python
k = 10                 # number of chunks retrieved
score_threshold = 0.5  # minimum similarity score
max_char = 2000        # chunk size for NodeSplitter
```

To switch models, set environment variables (`FAQ_EMBEDDING_MODEL`, `FAQ_EMBEDDING_DIM`) or instantiate alternative clients such as `OpenAIClient` or `AnthropicClient`.
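Reading those environment variables might look like the following. This is an illustrative sketch, not the project's actual code: only the variable names come from this README, and the defaults shown are assumptions.

```python
import os

def load_embedding_config(env=os.environ):
    """Read the embedding model and dimension from the environment.

    A dimension of None signals "detect at runtime", matching the
    dynamic dimension detection described for ingest_faq.py.
    """
    model = env.get("FAQ_EMBEDDING_MODEL", "gemini-embedding-001")
    dim = int(env.get("FAQ_EMBEDDING_DIM", "0")) or None
    return model, dim

print(load_embedding_config({"FAQ_EMBEDDING_DIM": "1536"}))
# ('gemini-embedding-001', 1536)
```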
- “GOOGLE_API_KEY not found”: create `.env`, set `GOOGLE_API_KEY`, and reload the shell.
- “Connection refused” when calling Qdrant: ensure the Docker container is running with `docker ps | grep qdrant`.
- “Collection not found”: run `python ingest_faq.py`. If you changed embedding dimensions, delete the existing collection first.
- Bot always responds with the fallback: verify the ingestion logs, lower `score_threshold`, and confirm that embeddings were created with the same model used at inference time.
This project is meant as a working example. Feel free to add FAQs, experiment with new models, improve prompts, or extend the UI. Pull requests and issue reports are welcome.
Sample project demonstrating the Datapizza-AI framework. Refer to the repository license for details.
Built as an example integration for the Datapizza-AI framework.