Gemini Audio Transcription & Timestamped RAG Pipeline

Turn raw audio into searchable, explainable knowledge — with exact timestamps.

This project is an end‑to‑end audio intelligence pipeline that batch‑transcribes MP3 (or video audio) files, translates them into clean English, splits them into precise timestamped semantic chunks, and enables grounded question answering using a modern RAG (Retrieval‑Augmented Generation) architecture.

Designed for courses, podcasts, lectures, meetings, and interviews, this system lets you ask natural‑language questions and get accurate answers with exact video/audio time references, grounded in the source audio rather than guesswork.


✨ What Makes This Special

  • 🎙️ High‑quality Gemini transcription (multilingual → clean English)
  • ⏱️ Timestamped semantic chunking for traceable answers
  • 🧠 Local embeddings (bge‑m3 via Ollama) — fast, private, cost‑efficient
  • 📐 Cosine similarity retrieval using scikit‑learn
  • 📄 Strictly grounded AI responses (answers only from audio content)
  • 🧩 Structured JSON pipelines ready for RAG, search, and notes
  • Batch‑friendly & modular design

🏗️ High‑Level Architecture (Audio RAG)

This system follows a production‑grade Audio‑RAG workflow:

  1. Audio ingestion & transcription
  2. Timestamped JSON segmentation
  3. Local embedding generation
  4. Vector storage
  5. Semantic retrieval
  6. Gemini‑powered grounded answering

🔄 Pipeline Overview

Audio / Video Files
        ↓
[ Gemini Transcription ]
        ↓
Structured JSON Transcripts
(title + timestamped segments)
        ↓
[ Local Embedding Generator ]
(bge-m3 via Ollama)
        ↓
Embedded Chunks (JSON)
        ↓
Vector Store (joblib)
        ↓
User Query
        ↓
Semantic Retrieval (Cosine Similarity)
        ↓
Relevant Context Chunks
        ↓
[ Gemini RAG Answer Generator ]
        ↓
Grounded Answer with Timestamps

🧩 Detailed Stage‑by‑Stage Explanation

1️⃣ Audio / Video Input

  • Accepts MP3 or extracted audio from videos
  • Designed for batch processing

2️⃣ Gemini Transcription Layer

  • Uses Google Gemini for:

    • High‑accuracy speech‑to‑text
    • Automatic translation into clean English
  • Handles noisy or conversational audio gracefully
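
The transcription stage can be sketched as follows. This is an illustrative use of the google-generativeai SDK, not the repository's exact code: the prompt wording, model name, and the `parse_transcript_json` helper are assumptions. The SDK import lives inside the function so the sketch stays runnable without the package installed.

```python
import json
import re

def parse_transcript_json(model_text: str) -> dict:
    """Strip an optional markdown code fence from model output and parse the JSON transcript."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", model_text.strip())
    return json.loads(cleaned)

def transcribe(audio_path: str, model_name: str = "gemini-2.5-flash") -> dict:
    """Upload an MP3 and ask Gemini for a timestamped English transcript."""
    import google.generativeai as genai  # assumes GOOGLE_API_KEY is configured
    audio = genai.upload_file(audio_path)
    model = genai.GenerativeModel(model_name)
    prompt = (
        "Transcribe this audio, translate it into clean English, and return "
        'JSON of the form {"title": ..., "segments": [{"start": ..., '
        '"end": ..., "text": ...}]} with timestamps in seconds.'
    )
    response = model.generate_content([prompt, audio])
    return parse_transcript_json(response.text)
```

Parsing the model's text through a fence-stripping helper keeps the stage robust when the model wraps its JSON in a markdown code block.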


3️⃣ Structured JSON Transcript Generation

Each file is converted into a fully structured JSON document:

{
  "title": "Introduction to Web Development",
  "segments": [
    {
      "start": 0.0,
      "end": 6.5,
      "text": "Guys, in today's video I will give you an exercise..."
    }
  ]
}

This makes the pipeline LLM‑friendly, debuggable, and future‑proof.


4️⃣ Timestamped Chunking Engine

  • Splits transcripts into semantic chunks

  • Preserves:

    • Start time
    • End time
    • Spoken meaning

These timestamps become your explainability backbone.
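
One way to sketch this stage (assumed logic, not necessarily the repository's exact chunking rules) is to greedily merge consecutive segments until a chunk reaches a target duration, carrying the first segment's start and the last segment's end forward:

```python
def chunk_segments(segments, max_seconds=60.0):
    """Merge consecutive timestamped segments into larger chunks.

    Each chunk keeps the start of its first segment and the end of its
    last one, so every chunk remains traceable to the audio timeline.
    """
    chunks, current = [], None
    for seg in segments:
        if current is None:
            current = {"start": seg["start"], "end": seg["end"], "text": seg["text"]}
        else:
            current["end"] = seg["end"]
            current["text"] += " " + seg["text"]
        # Close the chunk once it spans at least max_seconds of audio.
        if current["end"] - current["start"] >= max_seconds:
            chunks.append(current)
            current = None
    if current is not None:
        chunks.append(current)
    return chunks
```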


5️⃣ Local Embedding Generator

  • Uses bge‑m3 embeddings via Ollama (local server)

  • Benefits:

    • 🔒 Fully local & private
    • ⚡ Fast inference
    • 💰 Zero per‑query cost

Each chunk → high‑dimensional vector.
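
A minimal sketch of this stage, assuming Ollama's default local port and its `/api/embeddings` endpoint (the function names here are illustrative):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's default port

def embedding_payload(text: str, model: str = "bge-m3") -> dict:
    """Request body for Ollama's embeddings endpoint."""
    return {"model": model, "prompt": text}

def embed(text: str) -> list:
    """Embed one chunk of text with the locally served bge-m3 model."""
    resp = requests.post(OLLAMA_URL, json=embedding_payload(text), timeout=60)
    resp.raise_for_status()
    return resp.json()["embedding"]
```

Because the model runs behind a local HTTP server, the same `embed` call serves both the indexing pass and later query embedding.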


6️⃣ Vector Store

  • Implemented using:

    • NumPy arrays
    • Joblib serialization
  • Stores:

    • Embeddings
    • Chunk text
    • Start / end timestamps
    • Source title

Easy to swap with FAISS, Milvus, or Pinecone later.


7️⃣ User Query

Users can ask questions like:

“Where does the course transition from HTML to CSS?”


8️⃣ Semantic Retrieval

  • Query is embedded using the same bge‑m3 model
  • Cosine similarity (sklearn) ranks all chunks
  • Top‑k most relevant segments are retrieved
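
The retrieval step above can be sketched with scikit-learn's `cosine_similarity` (a minimal version; the real ranking code may differ):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query_vec, chunk_vecs, k=3):
    """Rank all chunks by cosine similarity to the query; return top-k indices."""
    sims = cosine_similarity(np.asarray([query_vec]), np.asarray(chunk_vecs))[0]
    # argsort is ascending, so reverse before taking the top k.
    return np.argsort(sims)[::-1][:k].tolist()
```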

9️⃣ Context Builder

  • Relevant chunks are merged

  • Gemini prompt is constructed with strict rules:

    • Use only retrieved chunks
    • Provide timestamp references
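
A sketch of the context-building step; the rule wording is illustrative, but the key idea matches the pipeline: the model may only answer from the retrieved chunks and must cite their timestamps.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks and their timestamps."""
    context = "\n\n".join(
        f'[{c["title"]} | {c["start"]} -> {c["end"]}]\n{c["text"]}' for c in chunks
    )
    return (
        "Answer using ONLY the context below. Cite the title and the "
        "start -> end timestamps of every chunk you rely on. If the answer "
        "is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```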

🔟 Gemini RAG Answer Generator

  • Uses Google Gemini 2.5 Flash

  • Generates:

    • Concise explanation
    • Multiple corroborating references
    • Exact time ranges

🧪 Example Output

📌 Explanation

The course has concluded HTML, having covered almost everything. In the concluding video, the instructor discussed some miscellaneous topics to help with web development and ensure a complete understanding of HTML. After this, the course transitions into CSS.

⏱️ Where it is taught

  • Video: HTML Conclusion: Entities, Tags, and Best Practices
    Time: 8.32 → 8.41
    Confirms HTML is completed and CSS is next.

  • Video: HTML Conclusion: Entities, Tags, and Best Practices
    Time: 0.0 → 0.12
    Explains that HTML is almost fully covered and will conclude here.

  • Video: Introduction to CSS: Styling Web Pages and Beyond
    Time: 240.166 → 244.206
    Explicit transition from HTML to CSS.


⚙️ Setup & Usage

Requirements

  • Python 3.9+
  • Ollama (running locally)
  • Google Gemini API key

Environment Variables

Create a .env file:

GOOGLE_API_KEY=your_api_key_here

Run

python rag_query.py

🧰 Tech Stack

| Component    | Technology                        |
| ------------ | --------------------------------- |
| LLM          | Google Gemini 2.5 Flash           |
| Embeddings   | bge-m3 (Ollama – Local)           |
| Language     | Python                            |
| Vector Store | NumPy + Joblib                    |
| Similarity   | Cosine Similarity (scikit-learn)  |
| API Security | python-dotenv                     |
| Data Format  | JSON                              |

🎯 Ideal Use Cases

  • Online course Q&A systems
  • Podcast and interview analysis
  • Meeting intelligence tools
  • Lecture‑based RAG assistants
  • Video summarization with citations

