Turn raw audio into searchable, explainable knowledge — with exact timestamps.
This project is an end‑to‑end audio intelligence pipeline that batch‑transcribes MP3 (or video audio) files, translates them into clean English, splits them into precise timestamped semantic chunks, and enables grounded question answering using a modern RAG (Retrieval‑Augmented Generation) architecture.
Designed for courses, podcasts, lectures, meetings, and interviews, this system lets you ask natural-language questions and get answers grounded in the transcripts, each backed by exact video/audio time references.
- 🎙️ High‑quality Gemini transcription (multilingual → clean English)
- ⏱️ Timestamped semantic chunking for traceable answers
- 🧠 Local embeddings (bge‑m3 via Ollama) — fast, private, cost‑efficient
- 📐 Cosine similarity retrieval using scikit‑learn
- 📄 Strictly grounded AI responses (answers only from audio content)
- 🧩 Structured JSON pipelines ready for RAG, search, and notes
- ⚡ Batch‑friendly & modular design
This system follows a production‑grade Audio‑RAG workflow:
- Audio ingestion & transcription
- Timestamped JSON segmentation
- Local embedding generation
- Vector storage
- Semantic retrieval
- Gemini‑powered grounded answering
Audio / Video Files
↓
[ Gemini Transcription ]
↓
Structured JSON Transcripts
(title + timestamped segments)
↓
[ Local Embedding Generator ]
(bge-m3 via Ollama)
↓
Embedded Chunks (JSON)
↓
Vector Store (joblib)
↓
User Query
↓
Semantic Retrieval (Cosine Similarity)
↓
Relevant Context Chunks
↓
[ Gemini RAG Answer Generator ]
↓
Grounded Answer with Timestamps
- Accepts MP3 or extracted audio from videos
- Designed for batch processing
- Uses Google Gemini for:
  - High-accuracy speech-to-text
  - Automatic translation into clean English
- Handles noisy or conversational audio gracefully
Each file is converted into a fully structured JSON document:
{
  "title": "Introduction to Web Development",
  "segments": [
    {
      "start": 0.0,
      "end": 6.5,
      "text": "Guys, in today's video I will give you an exercise..."
    }
  ]
}

This makes the pipeline LLM-friendly, debuggable, and future-proof.
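
As a rough illustration of how a downstream step might consume this document, the sketch below parses one transcript into typed records and checks that no segment ends before it starts. The class and function names (`Segment`, `Transcript`, `parse_transcript`) are hypothetical, and the timestamps are assumed to be seconds; neither is confirmed by the project.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # assumed to be seconds from the beginning of the audio
    end: float
    text: str


@dataclass
class Transcript:
    title: str
    segments: List[Segment]


def parse_transcript(raw: str) -> Transcript:
    """Parse one transcript JSON document and check basic invariants."""
    data = json.loads(raw)
    segments = [
        Segment(float(s["start"]), float(s["end"]), s["text"].strip())
        for s in data["segments"]
    ]
    for seg in segments:
        if seg.end < seg.start:
            raise ValueError(f"segment ends before it starts: {seg}")
    return Transcript(title=data["title"], segments=segments)


example = """
{
  "title": "Introduction to Web Development",
  "segments": [
    {"start": 0.0, "end": 6.5,
     "text": "Guys, in today's video I will give you an exercise..."}
  ]
}
"""
transcript = parse_transcript(example)
print(transcript.title, len(transcript.segments))
```

Failing early on a malformed segment keeps bad timestamps from reaching the embedding and retrieval steps, where they would be harder to trace.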
- Splits transcripts into semantic chunks
- Preserves:
  - Start time
  - End time
  - Spoken meaning
These timestamps become your explainability backbone.
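
The project does not spell out how chunk boundaries are chosen, so the following is only a minimal stand-in: it greedily merges consecutive segments up to a word budget, which is a size-based heuristic rather than true semantic chunking. What it does show is how start and end times can be carried through a merge. The function name `merge_segments` and the `max_words` parameter are hypothetical.

```python
from typing import Dict, List


def merge_segments(segments: List[Dict], max_words: int = 60) -> List[Dict]:
    """Greedily merge consecutive segments into chunks of at most max_words.

    A single segment longer than max_words becomes its own chunk.
    """
    chunks: List[Dict] = []
    current: List[Dict] = []
    words = 0
    for seg in segments:
        n = len(seg["text"].split())
        # Close the running chunk if adding this segment would exceed the budget.
        if current and words + n > max_words:
            chunks.append(_close(current))
            current, words = [], 0
        current.append(seg)
        words += n
    if current:
        chunks.append(_close(current))
    return chunks


def _close(group: List[Dict]) -> Dict:
    # The chunk spans from the first segment's start to the last segment's end.
    return {
        "start": group[0]["start"],
        "end": group[-1]["end"],
        "text": " ".join(s["text"].strip() for s in group),
    }


segments = [
    {"start": 0.0, "end": 2.0, "text": "one two three"},
    {"start": 2.0, "end": 4.5, "text": "four five"},
    {"start": 4.5, "end": 7.0, "text": "six seven eight"},
]
chunks = merge_segments(segments, max_words=5)
```

With a budget of five words, the first two segments merge into one chunk covering 0.0 to 4.5 and the third stays on its own, so every chunk still maps back to a contiguous time range.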
- Uses bge-m3 embeddings via Ollama (local server)
- Benefits:
  - 🔒 Fully local & private
  - ⚡ Fast inference
  - 💰 Zero per-query cost
Each chunk is mapped to one high-dimensional vector.
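
A minimal sketch of the embedding step, assuming Ollama is listening on its default local port and that the `bge-m3` model has already been pulled. It calls Ollama's `/api/embeddings` endpoint using only the standard library; the helper names (`http_post_json`, `embed_text`, `embed_chunks`) are hypothetical and the project's own script may be organised differently.

```python
import json
import urllib.request
from typing import Callable, Dict, List

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def http_post_json(url: str, payload: Dict) -> Dict:
    """POST a JSON body and decode the JSON response (standard library only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def embed_text(
    text: str,
    model: str = "bge-m3",
    post: Callable[[str, Dict], Dict] = http_post_json,
) -> List[float]:
    """Return the embedding vector for one piece of text."""
    reply = post(OLLAMA_URL, {"model": model, "prompt": text})
    return [float(x) for x in reply["embedding"]]


def embed_chunks(chunks: List[Dict], **kwargs) -> List[Dict]:
    """Return copies of the chunks with an added 'embedding' field."""
    return [
        {**chunk, "embedding": embed_text(chunk["text"], **kwargs)}
        for chunk in chunks
    ]
```

The `post` parameter exists so the HTTP call can be replaced with a stub in tests; in normal use, `embed_chunks(chunks)` sends one request per chunk to the local server.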
- Implemented using:
  - NumPy arrays
  - Joblib serialization
- Stores:
  - Embeddings
  - Chunk text
  - Start / end timestamps
  - Source title
The store is easy to swap for FAISS, Milvus, or Pinecone later.
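
One possible layout for such a store is sketched below: an embedding matrix plus a list of metadata rows kept in the same order. To stay dependency-free, the sketch uses plain lists and the standard-library `pickle` module as a stand-in for NumPy arrays and Joblib; with Joblib, `joblib.dump(store, path)` and `joblib.load(path)` would replace the two file helpers. The function names and the dictionary keys are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, List


def build_store(embedded_chunks: List[Dict], title: str) -> Dict:
    """Split chunk records into an embedding matrix and aligned metadata rows."""
    return {
        "embeddings": [list(c["embedding"]) for c in embedded_chunks],
        "metadata": [
            {"title": title, "start": c["start"], "end": c["end"], "text": c["text"]}
            for c in embedded_chunks
        ],
    }


def save_store(store: Dict, path: Path) -> None:
    # Stand-in for joblib.dump(store, path).
    with path.open("wb") as handle:
        pickle.dump(store, handle)


def load_store(path: Path) -> Dict:
    # Stand-in for joblib.load(path).
    with path.open("rb") as handle:
        return pickle.load(handle)


store = build_store(
    [{"start": 0.0, "end": 6.5, "text": "intro", "embedding": [0.1, 0.2]}],
    title="Introduction to Web Development",
)
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "store.pkl"
    save_store(store, path)
    restored = load_store(path)
```

Keeping row `i` of the embeddings aligned with row `i` of the metadata is what lets a similarity score be turned back into a title and a time range.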
Users can ask questions like:
“Where does the course transition from HTML to CSS?”
- Query is embedded using the same bge‑m3 model
- Cosine similarity (sklearn) ranks all chunks
- Top‑k most relevant segments are retrieved
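
The ranking step can be sketched without any dependencies, as below. The project itself uses scikit-learn's cosine similarity over NumPy arrays; this pure-Python version computes the same quantity for illustration only. The names `cosine` and `top_k`, the chunk field names, and the toy two-dimensional vectors are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float], chunks: List[Dict], k: int = 3
) -> List[Tuple[float, Dict]]:
    """Score every chunk against the query and return the k best, highest first."""
    scored = [(cosine(query_vec, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: real embeddings come from the same model used for the chunks.
chunks = [
    {"title": "A", "start": 0.0, "end": 5.0, "text": "html basics", "embedding": [1.0, 0.0]},
    {"title": "B", "start": 5.0, "end": 9.0, "text": "css intro", "embedding": [0.0, 1.0]},
    {"title": "C", "start": 9.0, "end": 12.0, "text": "html to css", "embedding": [1.0, 1.0]},
]
results = top_k([0.2, 1.0], chunks, k=2)
```

Here the query vector points mostly along the second axis, so chunk B ranks first and chunk C second. Because every chunk is scored, this brute-force approach is linear in the number of chunks, which is usually fine for a single course or podcast archive.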
- Relevant chunks are merged
- Gemini prompt is constructed with strict rules:
  - Use only retrieved chunks
  - Provide timestamp references
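
A sketch of how such a prompt might be assembled from the retrieved chunks. The exact wording the project sends to Gemini is not shown in this document, so the rule text, the `build_prompt` name, and the numbered context format are assumptions; the third rule (admitting when the context lacks an answer) is an addition for illustration and not one of the two rules listed above.

```python
from typing import Dict, List


def build_prompt(question: str, chunks: List[Dict]) -> str:
    """Assemble a grounded prompt; each chunk is labelled with its source and time range."""
    context_blocks = [
        f"[{i}] Video: {c['title']} | Time: {c['start']} -> {c['end']}\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    ]
    rules = (
        "Answer using ONLY the context below.\n"
        "Cite the video title and time range for every claim.\n"
        # Assumed extra rule, not taken from the project:
        "If the context does not contain the answer, say so."
    )
    context = "\n\n".join(context_blocks)
    return f"{rules}\n\nContext:\n{context}\n\nQuestion: {question}"


prompt = build_prompt(
    "Where does the course transition from HTML to CSS?",
    [{"title": "Introduction to CSS", "start": 240.166, "end": 244.206,
      "text": "We have finished HTML and now move on to CSS."}],
)
```

Putting the title and time range on the same line as each chunk gives the model the exact strings it needs to quote back as references.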
- Uses Google Gemini 2.5 Flash
- Generates:
  - Concise explanation
  - Multiple corroborating references
  - Exact time ranges
The course has concluded HTML, having covered almost everything. In the concluding video, the instructor discussed some miscellaneous topics to help with web development and ensure a complete understanding of HTML. After this, the course transitions into CSS.
- Video: HTML Conclusion: Entities, Tags, and Best Practices
  Time: 8.32 → 8.41
  Confirms HTML is completed and CSS is next.
- Video: HTML Conclusion: Entities, Tags, and Best Practices
  Time: 0.0 → 0.12
  Explains that HTML is almost fully covered and will conclude here.
- Video: Introduction to CSS: Styling Web Pages and Beyond
  Time: 240.166 → 244.206
  Explicit transition from HTML to CSS.
- Python 3.9+
- Ollama (running locally)
- Google Gemini API key
Create a .env file:
GOOGLE_API_KEY=your_api_key_here
Then run the query script:
python rag_query.py
| Component | Technology |
|---|---|
| LLM | Google Gemini 2.5 Flash |
| Embeddings | bge-m3 (Ollama – Local) |
| Language | Python |
| Vector Store | NumPy + Joblib |
| Similarity | Cosine Similarity (scikit-learn) |
| API Security | python-dotenv |
| Data Format | JSON |
- Online course Q&A systems
- Podcast and interview analysis
- Meeting intelligence tools
- Lecture‑based RAG assistants
- Video summarization with citations