Gemini Audio Transcription & Timestamped RAG Pipeline

Turn raw audio into searchable, explainable knowledge — with exact timestamps.

This project is an end‑to‑end audio intelligence pipeline that batch‑transcribes MP3 (or video audio) files, translates them into clean English, splits them into precise timestamped semantic chunks, and enables grounded question answering using a modern RAG (Retrieval‑Augmented Generation) architecture.

Designed for courses, podcasts, lectures, meetings, and interviews, this system lets you ask natural‑language questions and get accurate answers with exact video/audio time references, grounded in the source audio rather than guesswork.


✨ What Makes This Special

  • 🎙️ High‑quality Gemini transcription (multilingual → clean English)
  • ⏱️ Timestamped semantic chunking for traceable answers
  • 🧠 Local embeddings (bge‑m3 via Ollama) — fast, private, cost‑efficient
  • 📐 Cosine similarity retrieval using scikit‑learn
  • 📄 Strictly grounded AI responses (answers only from audio content)
  • 🧩 Structured JSON pipelines ready for RAG, search, and notes
  • Batch‑friendly & modular design

🏗️ High‑Level Architecture (Audio RAG)

This system follows a production‑grade Audio‑RAG workflow:

  1. Audio ingestion & transcription
  2. Timestamped JSON segmentation
  3. Local embedding generation
  4. Vector storage
  5. Semantic retrieval
  6. Gemini‑powered grounded answering

🔄 Pipeline Overview

Audio / Video Files
        ↓
[ Gemini Transcription ]
        ↓
Structured JSON Transcripts
(title + timestamped segments)
        ↓
[ Local Embedding Generator ]
(bge-m3 via Ollama)
        ↓
Embedded Chunks (JSON)
        ↓
Vector Store (joblib)
        ↓
User Query
        ↓
Semantic Retrieval (Cosine Similarity)
        ↓
Relevant Context Chunks
        ↓
[ Gemini RAG Answer Generator ]
        ↓
Grounded Answer with Timestamps

🧩 Detailed Stage‑by‑Stage Explanation

1️⃣ Audio / Video Input

  • Accepts MP3 or extracted audio from videos
  • Designed for batch processing

2️⃣ Gemini Transcription Layer

  • Uses Google Gemini for:

    • High‑accuracy speech‑to‑text
    • Automatic translation into clean English
  • Handles noisy or conversational audio gracefully
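
The transcription stage can be sketched as follows. This is an illustrative use of the google-generativeai SDK, not the repository's exact code: the prompt wording, model name, and the `parse_transcript_json` helper are assumptions. The SDK import lives inside the function so the sketch stays runnable without the package installed.

```python
import json
import re

def parse_transcript_json(model_text: str) -> dict:
    """Strip an optional markdown code fence from model output and parse the JSON transcript."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", model_text.strip())
    return json.loads(cleaned)

def transcribe(audio_path: str, model_name: str = "gemini-2.5-flash") -> dict:
    """Upload an MP3 and ask Gemini for a timestamped English transcript."""
    import google.generativeai as genai  # assumes GOOGLE_API_KEY is configured
    audio = genai.upload_file(audio_path)
    model = genai.GenerativeModel(model_name)
    prompt = (
        "Transcribe this audio, translate it into clean English, and return "
        'JSON of the form {"title": ..., "segments": [{"start": ..., '
        '"end": ..., "text": ...}]} with timestamps in seconds.'
    )
    response = model.generate_content([prompt, audio])
    return parse_transcript_json(response.text)
```

Parsing the model's text through a fence-stripping helper keeps the stage robust when the model wraps its JSON in a markdown code block.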


3️⃣ Structured JSON Transcript Generation

Each file is converted into a fully structured JSON document:

{
  "title": "Introduction to Web Development",
  "segments": [
    {
      "start": 0.0,
      "end": 6.5,
      "text": "Guys, in today's video I will give you an exercise..."
    }
  ]
}

This makes the pipeline LLM‑friendly, debuggable, and future‑proof.


4️⃣ Timestamped Chunking Engine

  • Splits transcripts into semantic chunks

  • Preserves:

    • Start time
    • End time
    • Spoken meaning

These timestamps become your explainability backbone.
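
One way to sketch this stage (assumed logic, not necessarily the repository's exact chunking rules) is to greedily merge consecutive segments until a chunk reaches a target duration, carrying the first segment's start and the last segment's end forward:

```python
def chunk_segments(segments, max_seconds=60.0):
    """Merge consecutive timestamped segments into larger chunks.

    Each chunk keeps the start of its first segment and the end of its
    last one, so every chunk remains traceable to the audio timeline.
    """
    chunks, current = [], None
    for seg in segments:
        if current is None:
            current = {"start": seg["start"], "end": seg["end"], "text": seg["text"]}
        else:
            current["end"] = seg["end"]
            current["text"] += " " + seg["text"]
        # Close the chunk once it spans at least max_seconds of audio.
        if current["end"] - current["start"] >= max_seconds:
            chunks.append(current)
            current = None
    if current is not None:
        chunks.append(current)
    return chunks
```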


5️⃣ Local Embedding Generator

  • Uses bge‑m3 embeddings via Ollama (local server)

  • Benefits:

    • 🔒 Fully local & private
    • ⚡ Fast inference
    • 💰 Zero per‑query cost

Each chunk → high‑dimensional vector.
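
A minimal sketch of this stage, assuming Ollama's default local port and its `/api/embeddings` endpoint (the function names here are illustrative):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's default port

def embedding_payload(text: str, model: str = "bge-m3") -> dict:
    """Request body for Ollama's embeddings endpoint."""
    return {"model": model, "prompt": text}

def embed(text: str) -> list:
    """Embed one chunk of text with the locally served bge-m3 model."""
    resp = requests.post(OLLAMA_URL, json=embedding_payload(text), timeout=60)
    resp.raise_for_status()
    return resp.json()["embedding"]
```

Because the model runs behind a local HTTP server, the same `embed` call serves both the indexing pass and later query embedding.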


6️⃣ Vector Store

  • Implemented using:

    • NumPy arrays
    • Joblib serialization
  • Stores:

    • Embeddings
    • Chunk text
    • Start / end timestamps
    • Source title

Easy to swap with FAISS, Milvus, or Pinecone later.


7️⃣ User Query

Users can ask questions like:

“Where does the course transition from HTML to CSS?”


8️⃣ Semantic Retrieval

  • Query is embedded using the same bge‑m3 model
  • Cosine similarity (sklearn) ranks all chunks
  • Top‑k most relevant segments are retrieved
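
The retrieval step above can be sketched with scikit-learn's `cosine_similarity` (a minimal version; the real ranking code may differ):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query_vec, chunk_vecs, k=3):
    """Rank all chunks by cosine similarity to the query; return top-k indices."""
    sims = cosine_similarity(np.asarray([query_vec]), np.asarray(chunk_vecs))[0]
    # argsort is ascending, so reverse before taking the top k.
    return np.argsort(sims)[::-1][:k].tolist()
```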

9️⃣ Context Builder

  • Relevant chunks are merged

  • Gemini prompt is constructed with strict rules:

    • Use only retrieved chunks
    • Provide timestamp references
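
A sketch of the context-building step; the rule wording is illustrative, but the key idea matches the pipeline: the model may only answer from the retrieved chunks and must cite their timestamps.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks and their timestamps."""
    context = "\n\n".join(
        f'[{c["title"]} | {c["start"]} -> {c["end"]}]\n{c["text"]}' for c in chunks
    )
    return (
        "Answer using ONLY the context below. Cite the title and the "
        "start -> end timestamps of every chunk you rely on. If the answer "
        "is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```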

🔟 Gemini RAG Answer Generator

  • Uses Google Gemini 2.5 Flash

  • Generates:

    • Concise explanation
    • Multiple corroborating references
    • Exact time ranges

🧪 Example Output

📌 Explanation

The course has concluded HTML, having covered almost everything. In the concluding video, the instructor discussed some miscellaneous topics to help with web development and ensure a complete understanding of HTML. After this, the course transitions into CSS.

⏱️ Where it is taught

  • Video: HTML Conclusion: Entities, Tags, and Best Practices
    Time: 8.32 → 8.41
    Confirms HTML is completed and CSS is next.

  • Video: HTML Conclusion: Entities, Tags, and Best Practices
    Time: 0.0 → 0.12
    Explains that HTML is almost fully covered and will conclude here.

  • Video: Introduction to CSS: Styling Web Pages and Beyond
    Time: 240.166 → 244.206
    Explicit transition from HTML to CSS.


⚙️ Setup & Usage

Requirements

  • Python 3.9+
  • Ollama (running locally)
  • Google Gemini API key

Environment Variables

Create a .env file:

GOOGLE_API_KEY=your_api_key_here

Run

python rag_query.py

🧰 Tech Stack

| Component    | Technology                        |
| ------------ | --------------------------------- |
| LLM          | Google Gemini 2.5 Flash           |
| Embeddings   | bge-m3 (Ollama – Local)           |
| Language     | Python                            |
| Vector Store | NumPy + Joblib                    |
| Similarity   | Cosine Similarity (scikit-learn)  |
| API Security | python-dotenv                     |
| Data Format  | JSON                              |

🎯 Ideal Use Cases

  • Online course Q&A systems
  • Podcast and interview analysis
  • Meeting intelligence tools
  • Lecture‑based RAG assistants
  • Video summarization with citations

