Multimodal models are often evaluated on downstream tasks (captioning, QA, retrieval) without a clear understanding of whether their internal representations actually align audio and visual information meaningfully. A model might perform well on VQA by relying on text priors while barely using the audio signal.
AV-Align gives you tools to measure and improve this alignment directly. Instead of asking "does the model get the right answer?", we ask "does the model actually connect what it hears to what it sees?"
Three complementary evaluation strategies:

- Retrieval metrics — Given an audio clip, can you find the matching visual content? Measures how well representations capture cross-modal correspondence.
- Probing tasks — Train a simple linear classifier on frozen features. If a linear model can distinguish matched from mismatched audio-visual pairs, the features encode alignment information.
- Temporal alignment — Does the temporal structure of the audio features mirror the visual sequence? Measured via diagonal dominance of the cross-modal similarity matrix.
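To make the first strategy concrete, here is a minimal sketch of audio-to-visual R@K using cosine similarity. `recall_at_k` is a hypothetical helper for illustration, not part of the avalign API:

```python
import numpy as np

def recall_at_k(audio, visual, k=1):
    """Audio-to-visual R@K: fraction of audio queries whose true
    visual partner (same row index) ranks in the top-k by cosine
    similarity. Illustrative sketch, not avalign's implementation."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    sim = a @ v.T                                   # (N, N) similarity matrix
    topk = np.argsort(-sim, axis=1)[:, :k]          # best-k visual indices per query
    hits = (topk == np.arange(len(a))[:, None]).any(axis=1)
    return hits.mean()

# Perfectly aligned toy features: each audio vector equals its visual pair.
feats = np.eye(4)
print(recall_at_k(feats, feats, k=1))  # → 1.0
```

The library's own `retrieval_metrics` (shown in the quickstart below) reports both directions, a2v and v2a, for several K at once.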
Plus a contrastive training module for learning better-aligned representations from paired audio-visual data.
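The contrastive objective is InfoNCE over the batch. Assuming a standard symmetric formulation (a sketch of the idea, not avalign's actual training code):

```python
import numpy as np

def info_nce(audio, visual, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired features: row i of
    `audio` and row i of `visual` are positives, all other rows serve as
    in-batch negatives. Illustrative sketch, not avalign's code."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = (a @ v.T) / temperature          # (N, N) scaled similarities
    idx = np.arange(len(a))

    def xent(l):
        # cross-entropy with the diagonal (matched pairs) as targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the audio-to-visual and visual-to-audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

batch = np.eye(4)
loss = info_nce(batch, batch)  # near zero for perfectly aligned pairs
```

Minimizing this loss pulls matched audio-visual pairs together in the shared space while pushing mismatched in-batch pairs apart.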
```python
from avalign.metrics import retrieval_metrics
from avalign.models import AudioFeatureExtractor, VisualFeatureExtractor

audio_ext = AudioFeatureExtractor("facebook/hubert-base-ls960")
visual_ext = VisualFeatureExtractor("openai/clip-vit-base-patch32")

audio_feats = audio_ext(audio_batch).mean(dim=1)    # pool over time steps
visual_feats = visual_ext(frame_batch).mean(dim=1)  # pool over frames

results = retrieval_metrics(audio_feats, visual_feats, k_values=[1, 5, 10])
# {'a2v_R@1': 0.34, 'a2v_R@5': 0.72, 'v2a_R@1': 0.31, ...}
```

Install from source:

```shell
git clone https://github.com/frankdoer/av-align.git
cd av-align
pip install -e ".[models]"
```

Train a contrastive alignment model on paired data:

```shell
python scripts/train_alignment.py \
    --data-dir ./data/paired_av_data \
    --audio-model facebook/hubert-base-ls960 \
    --visual-model openai/clip-vit-base-patch32 \
    --epochs 30
```

Evaluate a trained checkpoint:

```shell
python scripts/evaluate_alignment.py \
    --data-dir ./data/test \
    --checkpoint outputs/alignment/best_model.pt \
    --output results.json
```

```
   Audio                          Visual
     │                              │
     ▼                              ▼
┌──────────────┐            ┌──────────────┐
│   HuBERT /   │            │    CLIP /    │
│   Whisper    │            │    DINOv2    │
│   (frozen)   │            │   (frozen)   │
└──────┬───────┘            └──────┬───────┘
       │                           │
       ▼                           ▼
┌──────────────┐            ┌──────────────┐
│  Projection  │            │  Projection  │
│     Head     │            │     Head     │
└──────┬───────┘            └──────┬───────┘
       │                           │
       └──────────┬────────────────┘
                  │
         ┌────────▼────────┐
         │  Shared Space   │
         │   (InfoNCE)     │
         └────────┬────────┘
                  │
      ┌───────────┼───────────┐
      │           │           │
  Retrieval    Probing    Temporal
   Metrics      Tasks     Alignment
```
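The projection heads in the diagram can be thought of as small MLPs mapping each frozen encoder's features into the shared space. The dimensions below (768 → 512 → 256) are illustrative assumptions, not avalign's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def projection_head(x, w1, b1, w2, b2):
    """Two-layer MLP projecting encoder features into the shared
    embedding space, L2-normalized so cosine similarity is a dot
    product. Dimensions here are illustrative, not avalign's config."""
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU hidden layer
    z = h @ w2 + b2
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# e.g. HuBERT features (768-d) projected into a 256-d shared space
w1, b1 = rng.normal(size=(768, 512)) * 0.02, np.zeros(512)
w2, b2 = rng.normal(size=(512, 256)) * 0.02, np.zeros(256)
audio_z = projection_head(rng.normal(size=(8, 768)), w1, b1, w2, b2)
print(audio_z.shape)  # (8, 256)
```

Keeping the encoders frozen and training only these heads is what makes the contrastive stage cheap to run.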
| Metric | What it measures | Range |
|---|---|---|
| R@K (retrieval) | Cross-modal matching accuracy | 0–1 (↑) |
| AV matching probe | Linear separability of matched/unmatched pairs | 0–1 (↑) |
| Temporal alignment | Diagonal dominance of similarity matrix | 0–1 (↑) |
| Mutual information | Statistical dependency between modalities | 0+ (↑) |
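As a concrete reading of the temporal-alignment row, diagonal dominance can be computed as the average softmax mass that the frame-by-frame similarity matrix places on its diagonal. A hypothetical sketch of that idea, not avalign's exact formula:

```python
import numpy as np

def diagonal_dominance(audio_seq, visual_seq, temperature=0.1):
    """Average softmax mass on the diagonal of the frame-by-frame
    cross-modal similarity matrix. Close to 1.0 means each audio frame
    matches only its co-occurring visual frame. Hypothetical sketch,
    not avalign's exact formula."""
    a = audio_seq / np.linalg.norm(audio_seq, axis=1, keepdims=True)
    v = visual_seq / np.linalg.norm(visual_seq, axis=1, keepdims=True)
    sim = (a @ v.T) / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    return np.diag(p).mean()

seq = np.eye(6)  # perfectly synchronized toy sequences
score = diagonal_dominance(seq, seq)  # close to 1.0
```

Temporally shuffling one modality drives the score toward zero, which is what makes it a useful check that the model tracks synchrony rather than clip-level gist.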
```
av-align/
├── avalign/
│   ├── metrics/   # Alignment, retrieval, MI metrics + viz
│   ├── models/    # Feature extractors + contrastive model
│   ├── probes/    # Linear probing tasks
│   └── data/      # Dataset loaders
├── scripts/       # Training and evaluation scripts
└── tests/         # Unit tests
```
BSD-3-Clause — see LICENSE.