This project focuses on specializing OpenAI's Whisper Large-V3 model for Hindi Automatic Speech Recognition (ASR) using Parameter-Efficient Fine-Tuning (PEFT).
The model was fine-tuned on the Google FLEURS dataset with QLoRA (Quantized Low-Rank Adaptation) on a single NVIDIA A100 GPU. The goal was to bridge the gap between general-purpose, English-centric models and specialized Indic models.
The final model was rigorously benchmarked against:
- OpenAI Baseline: The original Whisper Large V3.
- Bhashini: The open-source champion (AI4Bharat).
- Sarvam AI: The commercial state-of-the-art service.
| Rank | Model | Type | WER (Word Error Rate) | Verdict |
|---|---|---|---|---|
| 🥇 1 | Sarvam AI | Commercial SOTA | 12.16% | Gold Standard (Proprietary Data) |
| 🥈 2 | Bhashini (Vasista) | Open-Source SOTA | 19.52% | The Benchmark to Beat |
| 🥉 3 | Ours (Fine-Tuned) | QLoRA Adapter | 20.75% | Virtual Tie with SOTA |
| ❌ 4 | Base Whisper | Baseline | 30.18% | Significant Failure Rate |
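For reference, WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal self-contained sketch of the metric (the actual benchmark script likely uses a library such as `jiwer`, which also handles normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a five-word reference -> WER = 0.2
print(wer("मैं घर जा रहा हूँ", "मैं घर जा रही हूँ"))  # 0.2
```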
The following charts visualize the stability of the fine-tuning process on the A100 GPU.
- **Training Loss:** consistent descent, 0.9 → 0.1
- **Learning Schedule:** optimized decay for stability
- **GPU Utilization:** high efficiency (~95% load)
- **VRAM Usage:** memory capacity management
- 31% Improvement: We successfully reduced the error rate from 30.18% (Base) to 20.75% (Ours). This proves that the fine-tuning pipeline effectively adapted the model to Hindi acoustics without "catastrophic forgetting."
- Parity with Bhashini: A WER of 20.75% places this model within roughly 1.2 percentage points of AI4Bharat's Bhashini. This is a significant engineering result: efficient fine-tuning on a single GPU can rival models trained by large research labs.
- The Data Gap: The remaining gap to Sarvam (12.16%) is attributed to data scale. Sarvam utilizes 10,000+ hours of proprietary audio, whereas this project utilized the open-source FLEURS dataset (~12 hours).
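The headline numbers above follow directly from the table and can be checked with a couple of lines of arithmetic:

```python
base_wer, ours_wer, bhashini_wer = 30.18, 20.75, 19.52

# Relative WER reduction vs. the base model (the "31%" headline figure)
improvement = (base_wer - ours_wer) / base_wer
print(f"Relative improvement over base: {improvement:.1%}")  # -> 31.2%

# Absolute gap to Bhashini, in percentage points
print(f"Gap to Bhashini: {ours_wer - bhashini_wer:.2f} pp")  # -> 1.23 pp
```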
- Hardware: NVIDIA GPU (A100 Recommended for efficient training; T4/A10G compatible with adjustments).
- OS: Linux (tested on Ubuntu/CentOS) / Windows (WSL2).
- Python: 3.10+.
```bash
# Clone the repository
git clone https://github.com/yourusername/indic-asr.git
cd indic-asr/ASR

# Install dependencies
# (See requirements.txt for the full list, including pyannote.audio, peft, and bitsandbytes)
pip install -r requirements.txt
```

> **Tip:** A100 users: ensure you have `flash-attn` or SDPA attention enabled in PyTorch for maximum speed.

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
> **Warning:** Do not commit your API keys to GitHub! Use environment variables or a local `.env` file.
Ensure src/config.py has valid keys for benchmarking:
```python
HF_TOKEN = os.getenv("HF_TOKEN")              # Required for model download & pyannote
SARVAM_API_KEY = os.getenv("SARVAM_API_KEY")  # Required for the benchmark comparison
WANDB_KEY = os.getenv("WANDB_KEY")            # Optional, for experiment tracking
```

We use the Google FLEURS (Hindi) dataset. The data-preparation script downloads the dataset in streaming mode, resamples all audio to 16 kHz, filters out invalid files, and prepares it for the trainer.
```bash
python src/data_prep.py
```

**Output:** `data_processed/` directory containing ~2,100 curated training samples.
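For illustration, the 16 kHz resampling step can be sketched with a naive linear-interpolation resampler (the real pipeline more likely casts audio via the `datasets` library's `Audio` feature, which resamples on the fly):

```python
import numpy as np

TARGET_SR = 16_000  # Whisper's expected sample rate

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Naive linear-interpolation resampler -- illustration only."""
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(new_t, old_t, audio)

# One second of 48 kHz audio becomes exactly 16,000 samples
one_sec = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 48_000, endpoint=False))
print(resample_linear(one_sec, 48_000).shape)  # (16000,)
```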
We use a QLoRA configuration optimized for the A100 to maximize model capacity while minimizing memory usage.
- Rank (r): 64 (High capacity)
- Alpha: 128
- Precision: 4-bit NF4 (Frozen Base) + Float16 (Trainable Adapter)
- Epochs: 5
- Results: Loss dropped from 0.91 to 0.10.
```bash
python src/train.py
```

**Output:** `models/final_adapter/` containing the trained LoRA weights.
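A back-of-the-envelope calculation shows why rank-64 adapters stay small. Assuming LoRA targets the `q_proj` and `v_proj` matrices of every attention block (a common PEFT default for Whisper; the exact target modules are set in `src/train.py`), and using Whisper Large-V3's dimensions (`d_model = 1280`, 32 encoder layers, 32 decoder layers with self- and cross-attention):

```python
d_model, rank = 1280, 64
attn_blocks = 32 + 32 + 32          # encoder self-attn + decoder self- and cross-attn
adapted_matrices = attn_blocks * 2  # q_proj and v_proj per attention block

# Each adapted d x d linear gains A (d x r) and B (r x d): r * (d + d) params
params_per_matrix = rank * (d_model + d_model)
trainable = adapted_matrices * params_per_matrix
print(f"{trainable:,} trainable params (~{trainable / 1.54e9:.1%} of the 1.54B base)")
```

Roughly 31M trainable parameters, about 2% of the frozen 4-bit base model, which is what makes single-GPU training feasible.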
We perform a 4-Way Benchmark on 100 unseen samples from the test set. This script handles the complexity of:
- Running the local LoRA adapter.
- Running the Base model.
- Manually fixing Bhashini's legacy config (Float16 casting).
- Calling the Sarvam AI API for comparison.
```bash
python src/benchmark.py
```

**Output:** `results/benchmark_ultimate.csv` plus final WER scores printed to the console.
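Aggregating the per-sample results into a per-model score is a simple group-and-average. A sketch with the standard library (the column names here are assumptions; check the header `benchmark.py` actually writes):

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical excerpt of results/benchmark_ultimate.csv
SAMPLE = """model,sample_id,wer
ours,0,0.18
ours,1,0.24
base,0,0.29
base,1,0.33
"""

per_model = defaultdict(list)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    per_model[row["model"]].append(float(row["wer"]))

for model, scores in per_model.items():
    print(f"{model}: mean WER {mean(scores):.2%} over {len(scores)} samples")
```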
To transcribe a specific file with Speaker Diarization:
```bash
python src/inference.py
```

(Modify the script to point to your audio file.)
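Combining diarization with ASR boils down to labelling each timestamped transcript chunk with the speaker whose diarization turn overlaps it the most. An illustrative sketch of that merge step (`src/inference.py` combines Whisper timestamps with pyannote turns in the same spirit, but this function and its data shapes are assumptions):

```python
def assign_speakers(asr_chunks, speaker_turns):
    """Label each ASR chunk with the speaker whose turn overlaps it most.
    asr_chunks: list of (start_s, end_s, text); speaker_turns: (start_s, end_s, label)."""
    labelled = []
    for start, end, text in asr_chunks:
        def overlap(turn):
            t_start, t_end, _ = turn
            return max(0.0, min(end, t_end) - max(start, t_start))
        best = max(speaker_turns, key=overlap)
        labelled.append((best[2], text))
    return labelled

chunks = [(0.0, 2.0, "नमस्ते"), (2.5, 5.0, "आप कैसे हैं?")]
turns = [(0.0, 2.2, "SPEAKER_00"), (2.2, 5.0, "SPEAKER_01")]
print(assign_speakers(chunks, turns))
# [('SPEAKER_00', 'नमस्ते'), ('SPEAKER_01', 'आप कैसे हैं?')]
```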
```
ASR/
├── src/
│   ├── config.py        # Central configuration & keys
│   ├── data_prep.py     # Dataset downloader & processor
│   ├── train.py         # Main QLoRA training loop
│   ├── benchmark.py     # 4-way comparison & evaluation script
│   └── inference.py     # Inference pipeline with diarization
├── requirements.txt     # Python dependencies
├── setup.sh             # Environment setup helper
├── run_all.sh           # Master script to run the full pipeline
└── README.md            # Project documentation
```
- Base Model: OpenAI Whisper Large-V3
- Dataset: Google FLEURS (Hugging Face)
- Benchmark Comparison: AI4Bharat (Bhashini), Sarvam AI
- Infrastructure: IIT Kharagpur A100 Cluster



