A simple ASR pipeline that uses Shallow Fusion to fix domain-specific words (hotwords) in lecture audio. It catches terms Whisper usually misses (e.g. "eigenvalue" transcribed as "icon value") by combining phonetic matching with a local language model.
The code listens to audio in chunks and transcribes them via Whisper. If Whisper is unsure about a word, the system:
- Scans `hotwords.txt` for similar-sounding terms.
- Uses GPT-2 to check whether a candidate hotword actually makes sense in the current sentence.
- Swaps the word if the combined confidence (ASR + LM) is higher.
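The combined-confidence decision above can be sketched as follows. This is a minimal illustration, not the actual `fusion_processor.py` API: the log-probabilities, the `LM_WEIGHT` value, and the function names are all placeholders.

```python
import math

# Illustrative interpolation weight; the real value would be tuned per domain.
LM_WEIGHT = 0.3

def fused_score(asr_logprob: float, lm_logprob: float, lm_weight: float = LM_WEIGHT) -> float:
    """Shallow fusion: interpolate ASR and LM log-probabilities."""
    return asr_logprob + lm_weight * lm_logprob

def pick_word(original: tuple, candidate: tuple) -> str:
    """Keep the ASR output unless the hotword candidate scores higher overall."""
    orig_word, orig_score = original
    cand_word, cand_score = candidate
    return cand_word if cand_score > orig_score else orig_word

# Hypothetical numbers: Whisper slightly prefers "icon value", but GPT-2 finds
# it far less plausible than "eigenvalue" in a linear-algebra sentence.
original = ("icon value", fused_score(math.log(0.40), math.log(0.001)))
candidate = ("eigenvalue", fused_score(math.log(0.25), math.log(0.2)))
print(pick_word(original, candidate))  # → eigenvalue
```

The point of the interpolation is that neither signal decides alone: a hotword with weak acoustic support can still win if the language model strongly prefers it in context.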
- Install dependencies:

  ```
  pip install sounddevice openai-whisper transformers jellyfish Metaphone numpy torch
  ```
- Put whatever jargon you need in `hotwords.txt`.
- Run the live listener:

  ```
  python3 main.py
  ```
- `main.py`: Entry point for the live audio stream.
- `fusion_processor.py`: The actual rescoring logic.
- `phonetic_matcher.py`: Metaphone + Levenshtein fuzzy matching.
- `asr_engine.py` / `lm_rescorer.py`: Model wrappers for Whisper and GPT-2.
- `test_fusion.py`: Quick script to verify the rescoring logic without needing a mic.
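The fuzzy-matching idea behind `phonetic_matcher.py` can be sketched with a plain edit-distance scan. This sketch uses only the standard library; the real module also compares Metaphone codes (via `jellyfish`/`Metaphone`), and `closest_hotword` plus the 0.5 threshold here are illustrative assumptions, not the project's actual names.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def closest_hotword(word: str, hotwords: list, max_ratio: float = 0.5):
    """Return the hotword within a normalized edit-distance threshold, if any."""
    best, best_ratio = None, max_ratio
    for hw in hotwords:
        ratio = levenshtein(word.lower(), hw.lower()) / max(len(word), len(hw))
        if ratio < best_ratio:
            best, best_ratio = hw, ratio
    return best

hotwords = ["eigenvalue", "Fourier", "Laplacian"]
print(closest_hotword("icon value", hotwords))  # → eigenvalue
```

Normalizing the distance by the longer string's length keeps the threshold meaningful for both short and long hotwords; a phonetic pre-filter (Metaphone) on top of this reduces false matches between words that are spelled similarly but sound different.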