A comprehensive benchmarking platform for automatic speech recognition (ASR) systems. This tool provides controlled audio degradation, audio enhancement, loudness normalization, multiple evaluation metrics, and comparative analysis across languages and model architectures.
Try the interactive visualization dashboard online:
👉 ASR.lab Demo on Hugging Face Spaces
ASR.lab enables systematic evaluation of speech recognition engines under various acoustic conditions. It supports multiple ASR frameworks, applies configurable audio degradations, tests audio enhancement algorithms, and generates detailed performance reports with interactive visualizations.
- Multi-Engine Support: Compare performance across different ASR frameworks (Whisper, Wav2Vec2, NeMo, Vosk, SeamlessM4T, etc.)
- Audio Degradation: Apply controlled acoustic degradations (reverb, noise, compression) via VST3 plugins
- Audio Enhancement: Test denoising/enhancement algorithms (Demucs; DeepFilterNet available via the optional extra: `pip install .[deepfilter]`) on degraded audio
- Loudness Normalization: Grid search across different LUFS normalization levels (EBU R128 compliant)
- Evaluation Metrics: WER, CER, MER, WIL, WIP for comprehensive transcription analysis
- Interactive Reports: HTML reports with sortable tables, Plotly visualizations, and multi-filter dropdowns
- Multilingual: Support for multiple languages and language-specific models
- Extensible: Plugin architecture for adding new engines and metrics
- Grid Search: Automatic Cartesian product of all test parameters (degradation × enhancement × normalization)
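The Cartesian product behind the grid search can be sketched with `itertools.product`; the option names below are illustrative placeholders, not ASR.lab's actual identifiers:

```python
from itertools import product

# Every combination of degradation x enhancement x normalization x engine
# becomes one benchmark run (hypothetical option names for illustration).
degradations = ["original", "cathedral"]
enhancements = ["none", "demucs"]
normalizations = ["no_norm", "broadcast"]
engines = ["whisper-tiny", "wav2vec2-fr"]

runs = list(product(degradations, enhancements, normalizations, engines))
print(len(runs))  # 2 * 2 * 2 * 2 = 16 runs
```

Adding one more option at any stage multiplies the run count, so grids grow quickly.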
The benchmark pipeline processes audio in the following order:
Original Audio → Degradation (VST3) → Enhancement (Demucs) → Normalization (LUFS) → ASR Engine → Metrics
Each stage is optional and configurable. Multiple options at each stage create a grid search.
| Engine | Status | Notes |
|---|---|---|
| Whisper (OpenAI) | ✅ Tested | Multilingual, long-form transcription support |
| Wav2Vec 2.0 (Meta) | ✅ Tested | Language-specific fine-tuning, outputs normalized to lowercase |
| SeamlessM4T (Meta) | ✅ Tested | Detects v2 models, selects appropriate model class, and caps generation tokens (max_new_tokens=256) |
| NeMo (NVIDIA) | ✅ Tested | Windows support via runtime SIGKILL compatibility patch; runtime setup handled automatically |
| Vosk | ✅ Tested | Offline recognition; models auto-downloaded and extracted when needed |
| HuBERT (Meta) | | Uses Wav2Vec2 tokenizer fallback |
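The extensible plugin architecture implies a common engine contract. The sketch below is hypothetical (the actual base class in src/asr_lab may look different); `EchoEngine` is a dummy stand-in used only to show the shape:

```python
from abc import ABC, abstractmethod

class ASREngine(ABC):
    """Hypothetical sketch of an engine plugin interface."""

    @abstractmethod
    def load(self, model_id: str) -> None:
        """Load model weights (downloading them if necessary)."""

    @abstractmethod
    def transcribe(self, audio_path: str, lang: str) -> str:
        """Return the hypothesis transcription for one audio file."""

class EchoEngine(ASREngine):
    """Trivial stand-in engine, only here to demonstrate the contract."""

    def load(self, model_id: str) -> None:
        self.model_id = model_id

    def transcribe(self, audio_path: str, lang: str) -> str:
        return f"[{self.model_id}:{lang}] {audio_path}"
```

A real engine would wrap a framework (Whisper, Vosk, ...) behind the same two methods, letting the runner treat all engines uniformly.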
| Metric | Name | Description |
|---|---|---|
| WER | Word Error Rate | Standard ASR metric: (S+D+I)/N |
| CER | Character Error Rate | Better for CJK languages |
| MER | Match Error Rate | Bounded version of WER (0-1) |
| WIL | Word Information Lost | Proportion of information lost |
| WIP | Word Information Preserved | Complement of WIL |
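As a reference for the (S+D+I)/N formula above, WER is the word-level Levenshtein distance divided by the reference length. This is a minimal illustration, not necessarily the implementation ASR.lab uses:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance: (S+D+I)/N."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words
```

CER is the same computation over characters instead of words, which is why it suits CJK languages without whitespace-delimited words.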
Text normalization is applied systematically as a grid search dimension. Each transcription generates two results:
| Preset | Description |
|---|---|
| raw | Raw text, no transformation |
| normalized | Lowercase + punctuation removed + whitespace normalized |
The normalized preset applies:
- Lowercase conversion: Case-insensitive comparison
- Punctuation removal: Ignores punctuation differences
- Whitespace normalization: Collapses multiple spaces, trims
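The three steps above amount to a few lines of Python. This is a minimal sketch of the described behavior, not necessarily the preset's actual implementation:

```python
import re
import string

def normalize(text: str) -> str:
    """Apply the 'normalized' preset: lowercase, strip punctuation,
    collapse runs of whitespace, and trim the ends."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Hello,   World!"))  # -> "hello world"
```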
In the interactive report, use the "Texte" dropdown to:
- View both raw and normalized results side-by-side
- Filter to raw only to see exact ASR output
- Filter to normalized only for standard ASR evaluation
Visual Encoding in Interactive Reports:
- Symbol = Degradation type (circle = original, diamond = reverb, etc.)
- Color = Engine (whisper, nemo, etc.)
- Size = Text normalization (Large = Normalized, Small = Raw)
- Python 3.12 or higher
- Optional CUDA-capable GPU
```shell
git clone https://github.com/berangerthomas/ASR.lab.git
cd ASR.lab
uv sync
```

Benchmarks are defined in YAML configuration files located in configs/.
See configs/example.yaml for a complete configuration reference.
```yaml
benchmark:
  name: "my_benchmark"
  description: "Description of the benchmark"

data:
  audio_source_dir: "data/audio"  # Directory, file, or glob pattern
  processed_dir: "data/processed"

audio_processing:
  sample_rate: 16000
  channels: 1

# Loudness normalization (grid search)
normalizations:
  - name: "broadcast"
    enabled: true
    method: "lufs"
    target_loudness: -23.0
  - name: "no_norm"
    enabled: true
    method: "none"

# Audio degradation via VST3 plugins
degradations:
  vst_plugin_path: "path/to/reverb.vst3"
  presets:
    - name: "cathedral"
      preset_name: "Cathedral"

# Audio enhancement/denoising
enhancements:
  - name: "demucs"
    enabled: true
    method: "demucs"
    model_name: "htdemucs"

engines:
  whisper:
    - id: "whisper-tiny"
      model_id: "openai/whisper-tiny"
      enabled: true
      chunk_length_s: 30
  wav2vec2:
    - id: "wav2vec2-fr"
      model_id: "facebook/wav2vec2-large-xlsr-53-french"
      enabled: true

metrics:
  - name: "wer"
    enabled: true
  - name: "cer"
    enabled: true
```

Run the benchmark:

```shell
python main.py run -c configs/default.yaml
```

This will:
- Run transcriptions with all configured engines
- Compute metrics for both raw and normalized text (grid search dimension)
- Generate an interactive HTML report with dropdown filters including text normalization
The generated report_interactive.html includes:
- Interactive Metric Normalization: Toggle lowercase, punctuation removal, and contraction expansion - metrics update instantly without regenerating the report
- Multi-filter dropdowns: Filter by engine, degradation, enhancement, and normalization
- Sortable tables: Click column headers to sort
- Side-by-side diff: Compare reference and hypothesis with error highlighting
- Multiple visualizations: Scatter plots, heatmaps, box plots
Results are saved to results/reports/<config>/:
- report_interactive.html: Interactive HTML report with embedded metrics variants
- results.csv: Results in CSV format
- raw_results.json: JSON backup with reference and hypothesis text
Open results/reports/<config>/report_interactive.html in a web browser. The report includes:
- Interactive Plotly charts with multi-filter dropdowns (engine, degradation, enhancement, normalization)
- Sortable summary table (click column headers)
- Side-by-side transcription comparison with diff highlighting
- Performance metrics (WER, CER, processing time)
Manifest files are JSON files listing audio samples and their reference transcriptions.
The audio_source_dir config value supports three modes:
| Mode | Example | Behavior |
|---|---|---|
| Directory | audio_source_dir: "data/audio" | Loads all *.json files in the directory |
| Single file | audio_source_dir: "data/audio/manifest_en.json" | Loads that file only |
| Glob pattern | audio_source_dir: "data/audio/manifest_fr*.json" | Loads all matching .json files |
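The three modes can be resolved with pathlib, as in this illustrative sketch (the actual loader may behave differently in edge cases):

```python
from pathlib import Path

def resolve_manifests(audio_source_dir: str) -> list[Path]:
    """Resolve the audio_source_dir setting to a list of manifest files:
    a directory (all *.json inside), a single file, or a glob pattern."""
    p = Path(audio_source_dir)
    if p.is_dir():
        return sorted(p.glob("*.json"))   # every manifest in the directory
    if p.is_file():
        return [p]                        # exactly this manifest
    # otherwise treat the last component as a glob pattern
    return sorted(p.parent.glob(p.name))
```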
Each manifest is a JSON array:

```json
[
  {
    "audio_filepath": "sample1.wav",
    "text": "This is a sample transcription.",
    "lang": "en"
  },
  {
    "audio_filepath": "sample2.wav",
    "text": "Ceci est une transcription.",
    "lang": "fr"
  }
]
```

Required fields:

- audio_filepath: Path to the audio file (relative to the manifest location, or absolute)
- text: Reference transcription (ground truth)
- lang: Language code (ISO 639-1: en, fr, de, es, etc.)
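A minimal sketch of loading a manifest and checking the required fields (the real loader may validate and report errors differently):

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"audio_filepath", "text", "lang"}

def load_manifest(path: str) -> list[dict]:
    """Parse a manifest JSON array and verify each entry has the
    required fields; raise ValueError on the first incomplete entry."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    for i, entry in enumerate(entries):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing fields: {sorted(missing)}")
    return entries
```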
Audio format requirements:
- WAV format recommended
- 16kHz sample rate (or configure in audio_processing.sample_rate)
- Mono channel recommended
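These recommendations can be checked with the stdlib wave module, as a small sketch (the pipeline can also resample to audio_processing.sample_rate rather than rejecting files):

```python
import wave

def check_wav(path: str, expected_rate: int = 16000) -> bool:
    """Return True if the WAV file matches the recommended
    format: mono, at the expected sample rate."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels() == 1 and wf.getframerate() == expected_rate
```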
Most models automatically use CUDA if available. Check GPU usage:
```python
import torch
print(torch.cuda.is_available())
```

If you run out of GPU memory, reduce the batch size or use smaller models. For Whisper, use the tiny or base variants.
Ensure audio files are:
- WAV format
- 16kHz sample rate (or configure accordingly)
- Mono channel
Engine-specific prerequisites are handled automatically when running a benchmark:
- Vosk: Models listed in the configuration (model_path) are downloaded and extracted on the fly if not already present.
- NeMo on Windows: The signal.SIGKILL compatibility patch is applied at runtime (exp_manager.py).
Both mechanisms are implemented in src/asr_lab/setup (utilities: engine_setup, nemo_patch, vosk_setup). The benchmark runner calls ensure_engines_ready(...) before engine initialization so most manual setup steps are no longer necessary.
See LICENSE file for details.
Contributions are welcome. Please submit pull requests or open issues for bugs and feature requests.