
AV-Align — Audio-Visual Alignment Toolkit

The Problem

Multimodal models are often evaluated on downstream tasks (captioning, QA, retrieval) without a clear understanding of whether their internal representations actually align audio and visual information meaningfully. A model might perform well on VQA by relying on text priors while barely using the audio signal.

AV-Align gives you tools to measure and improve this alignment directly. Instead of asking "does the model get the right answer?", we ask "does the model actually connect what it hears to what it sees?"

Our Approach

Three complementary evaluation strategies:

  1. Retrieval metrics — Given an audio clip, can you find the matching visual content? Measures how well representations capture cross-modal correspondence.

  2. Probing tasks — Train a simple linear classifier on frozen features. If a linear model can distinguish matched from mismatched audio-visual pairs, the features encode alignment information.

  3. Temporal alignment — Does the temporal structure of audio features mirror the visual sequence? Measured via diagonal dominance of the cross-modal similarity matrix.
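
The diagonal-dominance idea in strategy 3 can be sketched in a few lines: build the frame-by-frame cross-modal similarity matrix and ask how often each audio frame's best visual match is the co-occurring frame. This is an illustrative numpy sketch, not the toolkit's actual implementation, which may score dominance differently:

```python
import numpy as np

def diagonal_dominance(audio_seq, visual_seq):
    """Fraction of time steps whose best cross-modal match lies on the diagonal.

    audio_seq, visual_seq: (T, D) L2-normalised feature sequences of equal
    length T. Returns a score in [0, 1]; 1.0 means every audio frame is most
    similar to the visual frame at the same time step.
    """
    sim = audio_seq @ visual_seq.T               # (T, T) cosine similarity matrix
    best_match = sim.argmax(axis=1)              # best visual frame per audio frame
    return float((best_match == np.arange(len(sim))).mean())

# Perfectly aligned toy sequences score 1.0
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
print(diagonal_dominance(feats, feats))  # → 1.0
```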

Plus a contrastive training module for learning better-aligned representations from paired audio-visual data.
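
The contrastive objective behind that module is a symmetric InfoNCE loss over in-batch positives. Here is a forward-pass sketch in numpy to show the shape of the computation; the temperature value, batching, and actual training code in avalign may differ:

```python
import numpy as np

def info_nce(audio, visual, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (audio_i, visual_i) embeddings.

    audio, visual: (B, D) L2-normalised embeddings; pair i is the positive,
    all other in-batch pairs serve as negatives.
    """
    logits = audio @ visual.T / temperature       # (B, B) similarity logits

    def xent_diag(l):
        # Cross-entropy where the correct class for row i is column i
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the audio→visual and visual→audio directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

The loss approaches zero when each embedding is far more similar to its own partner than to any other pair in the batch.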

Show Me

from avalign.metrics import retrieval_metrics
from avalign.models import AudioFeatureExtractor, VisualFeatureExtractor

audio_ext = AudioFeatureExtractor("facebook/hubert-base-ls960")
visual_ext = VisualFeatureExtractor("openai/clip-vit-base-patch32")

audio_feats = audio_ext(audio_batch).mean(dim=1)    # pool temporal
visual_feats = visual_ext(frame_batch).mean(dim=1)  # pool frames

results = retrieval_metrics(audio_feats, visual_feats, k_values=[1, 5, 10])
# {'a2v_R@1': 0.34, 'a2v_R@5': 0.72, 'v2a_R@1': 0.31, ...}
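
Conceptually, `a2v_R@K` is recall-at-K over the cosine-similarity matrix: the fraction of audio queries whose true visual partner lands in the top K candidates. A self-contained sketch of that computation (illustrative only, not the library's code):

```python
import numpy as np

def recall_at_k(audio, visual, k):
    """Audio→visual R@K: fraction of audio queries whose true visual
    partner (same index) appears among the k most similar candidates."""
    audio = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    visual = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    sim = audio @ visual.T                           # (N, N) cosine similarities
    topk = np.argsort(-sim, axis=1)[:, :k]           # indices of k best matches
    hits = (topk == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean())
```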

Getting Started

git clone https://github.com/frankdoer/av-align.git
cd av-align
pip install -e ".[models]"

Train Alignment Model

python scripts/train_alignment.py \
  --data-dir ./data/paired_av_data \
  --audio-model facebook/hubert-base-ls960 \
  --visual-model openai/clip-vit-base-patch32 \
  --epochs 30

Run Evaluation

python scripts/evaluate_alignment.py \
  --data-dir ./data/test \
  --checkpoint outputs/alignment/best_model.pt \
  --output results.json

How It Works

                    Audio                    Visual
                      │                        │
                      ▼                        ▼
              ┌──────────────┐         ┌──────────────┐
              │  HuBERT /    │         │  CLIP /      │
              │  Whisper     │         │  DINOv2      │
              │  (frozen)    │         │  (frozen)    │
              └──────┬───────┘         └──────┬───────┘
                     │                        │
                     ▼                        ▼
              ┌──────────────┐         ┌──────────────┐
              │  Projection  │         │  Projection  │
              │  Head        │         │  Head        │
              └──────┬───────┘         └──────┬───────┘
                     │                        │
                     └────────┬───────────────┘
                              │
                     ┌────────▼────────┐
                     │  Shared Space   │
                     │  (InfoNCE)      │
                     └────────┬────────┘
                              │
                  ┌───────────┼───────────┐
                  │           │           │
              Retrieval   Probing    Temporal
              Metrics     Tasks     Alignment
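
Each projection head in the diagram can be as simple as a two-layer MLP that maps frozen backbone features into the shared space, followed by L2 normalisation so dot products become cosine similarities. A numpy forward-pass sketch with illustrative dimensions — the toolkit's actual heads are trainable modules and may be configured differently:

```python
import numpy as np

class ProjectionHead:
    """Linear → GELU → Linear → L2-normalise; e.g. 768-d backbone → 256-d shared space."""

    def __init__(self, d_in=768, d_hidden=512, d_out=256, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=d_in ** -0.5, size=(d_in, d_hidden))
        self.w2 = rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d_out))

    def __call__(self, x):
        h = x @ self.w1
        # GELU activation (tanh approximation)
        h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h ** 3)))
        z = h @ self.w2
        # Unit-norm output so similarities in the shared space are cosines
        return z / np.linalg.norm(z, axis=-1, keepdims=True)
```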

Evaluation Dimensions

| Metric             | What it measures                               | Range    |
|--------------------|------------------------------------------------|----------|
| R@K (retrieval)    | Cross-modal matching accuracy                  | 0–1 (↑) |
| AV matching probe  | Linear separability of matched/unmatched pairs | 0–1 (↑) |
| Temporal alignment | Diagonal dominance of similarity matrix        | 0–1 (↑) |
| Mutual information | Statistical dependency between modalities      | 0+ (↑)  |
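
The mutual-information metric quantifies statistical dependency between the two modalities. As a rough illustration of the idea, here is a crude histogram-based plug-in estimator over 1-D feature projections; a real implementation would more likely use a k-NN or neural estimator (e.g. MINE), so treat this purely as a sketch:

```python
import numpy as np

def binned_mi(x, y, bins=8):
    """Histogram-based mutual information estimate (in nats) between two
    1-D variables, e.g. scalar projections of audio and visual features."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # empirical joint distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal over x bins
    py = pxy.sum(axis=0, keepdims=True)       # marginal over y bins
    nz = pxy > 0                              # avoid log(0)
    # KL divergence of the joint from the product of marginals
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```

A variable carries high MI with itself and near-zero MI with an independent one, which is the sanity check to run before trusting the estimator on real features.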

Project Structure

av-align/
├── avalign/
│   ├── metrics/       # Alignment, retrieval, MI metrics + viz
│   ├── models/        # Feature extractors + contrastive model
│   ├── probes/        # Linear probing tasks
│   └── data/          # Dataset loaders
├── scripts/           # Training and evaluation scripts
└── tests/             # Unit tests

License

BSD-3-Clause — see LICENSE.
