Multimodal models are often evaluated on downstream tasks (captioning, QA, retrieval) without a clear understanding of whether their internal representations actually align audio and visual information meaningfully. A model might perform well on VQA by relying on text priors while barely using the audio signal.
AV-Align gives you tools to measure and improve this alignment directly. Instead of asking "does the model get the right answer?", we ask "does the model actually connect what it hears to what it sees?"
Three complementary evaluation strategies:

- Retrieval metrics — Given an audio clip, can you find the matching visual content? Measures how well representations capture cross-modal correspondence.
- Probing tasks — Train a simple linear classifier on frozen features. If a linear model can distinguish matched from mismatched audio-visual pairs, the features encode alignment information.
- Temporal alignment — Does the temporal structure of the audio features mirror the visual sequence? Measured via diagonal dominance of the cross-modal similarity matrix.
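To make the first strategy concrete, here is a minimal sketch of audio-to-visual R@K using cosine similarity. `recall_at_k` is a hypothetical helper for illustration, not part of the avalign API:

```python
import numpy as np

def recall_at_k(audio, visual, k=1):
    """Audio-to-visual R@K: fraction of audio queries whose true
    visual partner (same row index) ranks in the top-k by cosine
    similarity. Illustrative sketch, not avalign's implementation."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    sim = a @ v.T                                   # (N, N) similarity matrix
    topk = np.argsort(-sim, axis=1)[:, :k]          # best-k visual indices per query
    hits = (topk == np.arange(len(a))[:, None]).any(axis=1)
    return hits.mean()

# Perfectly aligned toy features: each audio vector equals its visual pair.
feats = np.eye(4)
print(recall_at_k(feats, feats, k=1))  # → 1.0
```

The library's own `retrieval_metrics` (shown in the quickstart below) reports both directions, a2v and v2a, for several K at once.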
Plus a contrastive training module for learning better-aligned representations from paired audio-visual data.
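The contrastive objective is InfoNCE over the batch. Assuming a standard symmetric formulation (a sketch of the idea, not avalign's actual training code):

```python
import numpy as np

def info_nce(audio, visual, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired features: row i of
    `audio` and row i of `visual` are positives, all other rows serve as
    in-batch negatives. Illustrative sketch, not avalign's code."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = (a @ v.T) / temperature          # (N, N) scaled similarities
    idx = np.arange(len(a))

    def xent(l):
        # cross-entropy with the diagonal (matched pairs) as targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the audio-to-visual and visual-to-audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

batch = np.eye(4)
loss = info_nce(batch, batch)  # near zero for perfectly aligned pairs
```

Minimizing this loss pulls matched audio-visual pairs together in the shared space while pushing mismatched in-batch pairs apart.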
```python
from avalign.metrics import retrieval_metrics
from avalign.models import AudioFeatureExtractor, VisualFeatureExtractor

audio_ext = AudioFeatureExtractor("facebook/hubert-base-ls960")
visual_ext = VisualFeatureExtractor("openai/clip-vit-base-patch32")

audio_feats = audio_ext(audio_batch).mean(dim=1)    # pool over time steps
visual_feats = visual_ext(frame_batch).mean(dim=1)  # pool over frames

results = retrieval_metrics(audio_feats, visual_feats, k_values=[1, 5, 10])
# {'a2v_R@1': 0.34, 'a2v_R@5': 0.72, 'v2a_R@1': 0.31, ...}
```

Install from source:

```shell
git clone https://github.com/frankdoer/av-align.git
cd av-align
pip install -e ".[models]"
```

Train a contrastive alignment model on paired data:

```shell
python scripts/train_alignment.py \
    --data-dir ./data/paired_av_data \
    --audio-model facebook/hubert-base-ls960 \
    --visual-model openai/clip-vit-base-patch32 \
    --epochs 30
```

Evaluate a trained checkpoint:

```shell
python scripts/evaluate_alignment.py \
    --data-dir ./data/test \
    --checkpoint outputs/alignment/best_model.pt \
    --output results.json
```

```
   Audio                          Visual
     │                              │
     ▼                              ▼
┌──────────────┐            ┌──────────────┐
│   HuBERT /   │            │    CLIP /    │
│   Whisper    │            │    DINOv2    │
│   (frozen)   │            │   (frozen)   │
└──────┬───────┘            └──────┬───────┘
       │                           │
       ▼                           ▼
┌──────────────┐            ┌──────────────┐
│  Projection  │            │  Projection  │
│     Head     │            │     Head     │
└──────┬───────┘            └──────┬───────┘
       │                           │
       └──────────┬────────────────┘
                  │
         ┌────────▼────────┐
         │  Shared Space   │
         │   (InfoNCE)     │
         └────────┬────────┘
                  │
      ┌───────────┼───────────┐
      │           │           │
  Retrieval    Probing    Temporal
   Metrics      Tasks     Alignment
```
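The projection heads in the diagram can be thought of as small MLPs mapping each frozen encoder's features into the shared space. The dimensions below (768 → 512 → 256) are illustrative assumptions, not avalign's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def projection_head(x, w1, b1, w2, b2):
    """Two-layer MLP projecting encoder features into the shared
    embedding space, L2-normalized so cosine similarity is a dot
    product. Dimensions here are illustrative, not avalign's config."""
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU hidden layer
    z = h @ w2 + b2
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# e.g. HuBERT features (768-d) projected into a 256-d shared space
w1, b1 = rng.normal(size=(768, 512)) * 0.02, np.zeros(512)
w2, b2 = rng.normal(size=(512, 256)) * 0.02, np.zeros(256)
audio_z = projection_head(rng.normal(size=(8, 768)), w1, b1, w2, b2)
print(audio_z.shape)  # (8, 256)
```

Keeping the encoders frozen and training only these heads is what makes the contrastive stage cheap to run.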
| Metric | What it measures | Range |
|---|---|---|
| R@K (retrieval) | Cross-modal matching accuracy | 0–1 (↑) |
| AV matching probe | Linear separability of matched/unmatched pairs | 0–1 (↑) |
| Temporal alignment | Diagonal dominance of similarity matrix | 0–1 (↑) |
| Mutual information | Statistical dependency between modalities | 0+ (↑) |
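As a concrete reading of the temporal-alignment row, diagonal dominance can be computed as the average softmax mass that the frame-by-frame similarity matrix places on its diagonal. A hypothetical sketch of that idea, not avalign's exact formula:

```python
import numpy as np

def diagonal_dominance(audio_seq, visual_seq, temperature=0.1):
    """Average softmax mass on the diagonal of the frame-by-frame
    cross-modal similarity matrix. Close to 1.0 means each audio frame
    matches only its co-occurring visual frame. Hypothetical sketch,
    not avalign's exact formula."""
    a = audio_seq / np.linalg.norm(audio_seq, axis=1, keepdims=True)
    v = visual_seq / np.linalg.norm(visual_seq, axis=1, keepdims=True)
    sim = (a @ v.T) / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    return np.diag(p).mean()

seq = np.eye(6)  # perfectly synchronized toy sequences
score = diagonal_dominance(seq, seq)  # close to 1.0
```

Temporally shuffling one modality drives the score toward zero, which is what makes it a useful check that the model tracks synchrony rather than clip-level gist.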
```
av-align/
├── avalign/
│   ├── metrics/   # Alignment, retrieval, MI metrics + viz
│   ├── models/    # Feature extractors + contrastive model
│   ├── probes/    # Linear probing tasks
│   └── data/      # Dataset loaders
├── scripts/       # Training and evaluation scripts
└── tests/         # Unit tests
```
BSD-3-Clause — see LICENSE.