
ASR.lab


A comprehensive benchmarking platform for automatic speech recognition (ASR) systems. This tool provides controlled audio degradation, audio enhancement, loudness normalization, multiple evaluation metrics, and comparative analysis across languages and model architectures.

Live Demo

Try the interactive visualization dashboard online:
👉 ASR.lab Demo on Hugging Face Spaces

Overview

ASR.lab enables systematic evaluation of speech recognition engines under various acoustic conditions. It supports multiple ASR frameworks, applies configurable audio degradations, tests audio enhancement algorithms, and generates detailed performance reports with interactive visualizations.

Features

  • Multi-Engine Support: Compare performance across different ASR frameworks (Whisper, Wav2Vec2, NeMo, Vosk, SeamlessM4T, etc.)
  • Audio Degradation: Apply controlled acoustic degradations (reverb, noise, compression) via VST3 plugins
  • Audio Enhancement: Test denoising/enhancement algorithms (Demucs; DeepFilterNet available via optional extras: pip install .[deepfilter]) on degraded audio
  • Loudness Normalization: Grid search across different LUFS normalization levels (EBU R128 compliant)
  • Evaluation Metrics: WER, CER, MER, WIL, WIP for comprehensive transcription analysis
  • Interactive Reports: HTML reports with sortable tables, Plotly visualizations, and multi-filter dropdowns
  • Multilingual: Support for multiple languages and language-specific models
  • Extensible: Plugin architecture for adding new engines and metrics
  • Grid Search: Automatic Cartesian product of all test parameters (degradation × enhancement × normalization)

Processing Pipeline

The benchmark pipeline processes audio in the following order:

Original Audio → Degradation (VST3) → Enhancement (Demucs) → Normalization (LUFS) → ASR Engine → Metrics

Each stage is optional and configurable. Multiple options at each stage create a grid search.
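The grid search over stages can be sketched with a Cartesian product. This is an illustrative sketch, not ASR.lab's actual runner; the parameter values are hypothetical placeholders:

```python
from itertools import product

# Hypothetical option lists; real values come from the YAML configuration.
degradations = ["original", "cathedral"]
enhancements = ["none", "demucs"]
normalizations = ["no_norm", "broadcast"]

# Every combination of options becomes one benchmark run.
runs = list(product(degradations, enhancements, normalizations))
print(len(runs))  # 2 x 2 x 2 = 8 runs per engine and audio file
```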

Supported ASR Engines

| Engine | Status | Notes |
|---|---|---|
| Whisper (OpenAI) | ✅ Tested | Multilingual, long-form transcription support |
| Wav2Vec 2.0 (Meta) | ✅ Tested | Language-specific fine-tuning; outputs normalized to lowercase |
| SeamlessM4T (Meta) | ✅ Tested | Detects v2 models, selects the appropriate model class, and caps generation tokens (`max_new_tokens=256`) |
| NeMo (NVIDIA) | ✅ Tested | Windows support via runtime SIGKILL compatibility patch; runtime setup handled automatically |
| Vosk | ✅ Tested | Offline recognition; models auto-downloaded and extracted when needed |
| HuBERT (Meta) | ⚠️ Experimental | Uses Wav2Vec2 tokenizer fallback |

Evaluation Metrics

| Metric | Name | Description |
|---|---|---|
| WER | Word Error Rate | Standard ASR metric: (S+D+I)/N |
| CER | Character Error Rate | Better suited to CJK languages |
| MER | Match Error Rate | Bounded version of WER (0-1) |
| WIL | Word Information Lost | Proportion of word information lost |
| WIP | Word Information Preserved | Complement of WIL |
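The WER formula (S+D+I)/N can be illustrated in pure Python with a word-level edit distance. This is a minimal sketch of the metric's definition, not ASR.lab's actual metric backend:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, computed by dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") over 6 reference words ≈ 0.167
print(wer("the cat sat on the mat", "the cat sat on mat"))
```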

Text Normalization (Grid Search Dimension)

Text normalization is applied systematically as a grid search dimension. Each transcription generates two results:

| Preset | Description |
|---|---|
| raw | Raw text, no transformation |
| normalized | Lowercase + punctuation removed + whitespace normalized |

Normalized preset applies:

  • Lowercase conversion: Case-insensitive comparison
  • Punctuation removal: Ignores punctuation differences
  • Whitespace normalization: Collapses multiple spaces, trims
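The three steps of the normalized preset can be sketched in a few lines of standard-library Python. The function name `normalize_text` is a hypothetical illustration, not ASR.lab's actual implementation:

```python
import re
import string

def normalize_text(text: str) -> str:
    """Sketch of the 'normalized' preset: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()                                          # case-insensitive comparison
    text = text.translate(str.maketrans("", "", string.punctuation))  # ignore punctuation
    return re.sub(r"\s+", " ", text).strip()                     # collapse spaces, trim

print(normalize_text("Hello,   World!"))  # → "hello world"
```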

In the interactive report, use the "Texte" dropdown to:

  • View both raw and normalized results side-by-side
  • Filter to raw only to see exact ASR output
  • Filter to normalized only for standard ASR evaluation

Visual Encoding in Interactive Reports:

  • Symbol = Degradation type (circle = original, diamond = reverb, etc.)
  • Color = Engine (whisper, nemo, etc.)
  • Size = Text normalization (Large = Normalized, Small = Raw)

Installation

Requirements

  • Python 3.12 or higher
  • Optional CUDA-capable GPU

Basic Installation

```bash
git clone https://github.com/berangerthomas/ASR.lab.git
cd ASR.lab
uv sync
```

Configuration

Benchmarks are defined in YAML configuration files located in configs/.

See configs/example.yaml for a complete configuration reference.

Basic Configuration Structure

```yaml
benchmark:
  name: "my_benchmark"
  description: "Description of the benchmark"

data:
  audio_source_dir: "data/audio"      # Directory, file, or glob pattern
  processed_dir: "data/processed"

audio_processing:
  sample_rate: 16000
  channels: 1

# Loudness normalization (grid search)
normalizations:
  - name: "broadcast"
    enabled: true
    method: "lufs"
    target_loudness: -23.0
  - name: "no_norm"
    enabled: true
    method: "none"

# Audio degradation via VST3 plugins
degradations:
  vst_plugin_path: "path/to/reverb.vst3"
  presets:
    - name: "cathedral"
      preset_name: "Cathedral"

# Audio enhancement/denoising
enhancements:
  - name: "demucs"
    enabled: true
    method: "demucs"
    model_name: "htdemucs"

engines:
  whisper:
    - id: "whisper-tiny"
      model_id: "openai/whisper-tiny"
      enabled: true
      chunk_length_s: 30

  wav2vec2:
    - id: "wav2vec2-fr"
      model_id: "facebook/wav2vec2-large-xlsr-53-french"
      enabled: true

metrics:
  - name: "wer"
    enabled: true
  - name: "cer"
    enabled: true
```

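Once a configuration like the one above is parsed (e.g. with `yaml.safe_load`), each section yields plain dicts and lists, and the `enabled` flags can be filtered with ordinary Python. The helper below is an illustrative sketch under that assumption, not ASR.lab's actual config loader:

```python
def enabled_engines(config: dict) -> list[dict]:
    """Collect every engine variant whose 'enabled' flag is true (illustrative helper)."""
    selected = []
    for family, variants in config.get("engines", {}).items():
        for variant in variants:
            if variant.get("enabled", False):
                selected.append({"family": family, **variant})
    return selected

config = {
    "engines": {
        "whisper": [{"id": "whisper-tiny", "model_id": "openai/whisper-tiny", "enabled": True}],
        "wav2vec2": [{"id": "wav2vec2-fr", "model_id": "facebook/wav2vec2-large-xlsr-53-french", "enabled": False}],
    }
}
print([e["id"] for e in enabled_engines(config)])  # → ['whisper-tiny']
```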
Usage

Running a Benchmark

```bash
python main.py run -c configs/default.yaml
```

This will:

  1. Run transcriptions with all configured engines
  2. Compute metrics for both raw and normalized text (grid search dimension)
  3. Generate an interactive HTML report with dropdown filters including text normalization

Interactive Report Features

The generated report_interactive.html includes:

  • Interactive Metric Normalization: Toggle lowercase, punctuation removal, and contraction expansion; metrics update instantly without regenerating the report
  • Multi-filter dropdowns: Filter by engine, degradation, enhancement, and normalization
  • Sortable tables: Click column headers to sort
  • Side-by-side diff: Compare reference and hypothesis with error highlighting
  • Multiple visualizations: Scatter plots, heatmaps, box plots

Output

Results are saved to results/reports/<config>/:

  • report_interactive.html: Interactive HTML report with embedded metrics variants
  • results.csv: Results in CSV format
  • raw_results.json: JSON backup with reference and hypothesis text

Viewing Results

Open results/reports/<config>/report_interactive.html in a web browser. The report includes:

  • Interactive Plotly charts with multi-filter dropdowns (engine, degradation, enhancement, normalization)
  • Sortable summary table (click column headers)
  • Side-by-side transcription comparison with diff highlighting
  • Performance metrics (WER, CER, processing time)

Data Organization

Manifest Files

Manifest files are JSON files listing audio samples and their reference transcriptions.

The audio_source_dir config value supports three modes:

| Mode | Example | Behavior |
|---|---|---|
| Directory | `audio_source_dir: "data/audio"` | Loads all `*.json` files in the directory |
| Single file | `audio_source_dir: "data/audio/manifest_en.json"` | Loads that file only |
| Glob pattern | `audio_source_dir: "data/audio/manifest_fr*.json"` | Loads all matching `.json` files |

Each manifest is a JSON array:

```json
[
  {
    "audio_filepath": "sample1.wav",
    "text": "This is a sample transcription.",
    "lang": "en"
  },
  {
    "audio_filepath": "sample2.wav",
    "text": "Ceci est une transcription.",
    "lang": "fr"
  }
]
```

Required fields:

  • audio_filepath: Path to audio file (relative to manifest location or absolute)
  • text: Reference transcription (ground truth)
  • lang: Language code (ISO 639-1: en, fr, de, es, etc.)
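Checking these required fields takes only the standard-library `json` module. The `load_manifest` helper below is a hypothetical sketch of such validation, not ASR.lab's actual manifest loader:

```python
import json

REQUIRED_FIELDS = {"audio_filepath", "text", "lang"}

def load_manifest(raw: str) -> list[dict]:
    """Parse a manifest JSON array and verify each entry has the required fields (illustrative)."""
    entries = json.loads(raw)
    for i, entry in enumerate(entries):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing fields: {sorted(missing)}")
    return entries

manifest = '[{"audio_filepath": "sample1.wav", "text": "This is a sample transcription.", "lang": "en"}]'
print(len(load_manifest(manifest)))  # → 1
```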

Audio format requirements:

  • WAV format recommended
  • 16kHz sample rate (or configure in audio_processing.sample_rate)
  • Mono channel recommended

Performance Considerations

GPU Acceleration

Most models automatically use CUDA if available. Check GPU usage:

```python
import torch
print(torch.cuda.is_available())
```

Troubleshooting

CUDA Out of Memory

Reduce batch size or use smaller models. For Whisper, use tiny or base variants.

Audio Format Errors

Ensure audio files are:

  • WAV format
  • 16kHz sample rate (or configure accordingly)
  • Mono channel

Vosk & NeMo — Automatic Setup

Engine-specific prerequisites are handled automatically when running a benchmark:

  • Vosk: Models listed in the configuration (model_path) are downloaded and extracted on the fly if not already present.
  • NeMo on Windows: The signal.SIGKILL compatibility patch is applied at runtime (exp_manager.py).

Both mechanisms are implemented in src/asr_lab/setup (utilities: engine_setup, nemo_patch, vosk_setup). The benchmark runner calls ensure_engines_ready(...) before engine initialization so most manual setup steps are no longer necessary.

License

See LICENSE file for details.

Contributing

Contributions are welcome. Please submit pull requests or open issues for bugs and feature requests.
