This project demonstrates forced alignment using Wav2Vec2 to align phoneme sequences to audio data. Designed for researchers, linguists, and developers working with speech processing, phoneme alignment, TTS, and prosody modeling.
- Batch Processing: Efficient forced alignment across multiple audio files.
- Time-Aligned Segments: Outputs alignment results as JSON with per-phoneme timing.
- Flexible Dataset: Supports any dataset with `.wav` audio and `.txt` phoneme (IPA) transcriptions.
- Visualization: Utilities for visualizing alignments.
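Per-phoneme timing is written out as JSON. The exact schema is defined by `aligner.py`; the record below is an illustrative assumption (field names `phoneme`, `start`, `end` are mine, not the project's confirmed format):

```python
import json

# Hypothetical alignment output for one utterance; the real schema
# produced by aligner.py may differ (field names are assumptions).
example = """
[
  {"phoneme": "h", "start": 0.12, "end": 0.18},
  {"phoneme": "e", "start": 0.18, "end": 0.25},
  {"phoneme": "l", "start": 0.25, "end": 0.33}
]
"""

segments = json.loads(example)
for seg in segments:
    # Duration of each phoneme in seconds.
    print(f"{seg['phoneme']}: {seg['end'] - seg['start']:.2f}s")
```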
We recommend starting with Phonemized-VCTK (srinathnr/TTS_DATASET), a turn-key dataset ready for forced alignment and TTS research.
Folder layout:

```
TTS_DATASET/
├── train/
│   ├── wav/<spk>/
│   └── phonemized/<spk>/
├── validation/
│   ├── wav/<spk>/
│   └── phonemized/<spk>/
└── test/
    ├── wav/<spk>/
    └── phonemized/<spk>/
```
Each .txt contains whitespace-separated IPA tokens matching the audio file.
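Since every `.txt` must match a `.wav` with the same file stem, a quick sanity check can catch unpaired files before alignment. This is a sketch based only on the layout above; the helper name is mine, not part of the project:

```python
from pathlib import Path

def find_unpaired(split_dir):
    """Return stems present under wav/ or phonemized/ but not both.

    Assumes the documented layout: <split>/wav/<spk>/*.wav and
    <split>/phonemized/<spk>/*.txt with matching file stems.
    """
    split_dir = Path(split_dir)
    wav_stems = {p.stem for p in split_dir.glob("wav/*/*.wav")}
    txt_stems = {p.stem for p in split_dir.glob("phonemized/*/*.txt")}
    return wav_stems ^ txt_stems  # symmetric difference

# On a complete split, e.g. find_unpaired("TTS_DATASET/train"),
# this returns an empty set.
```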
Quick Hugging Face loading:

```python
from datasets import load_dataset

ds_train = load_dataset("srinathnr/TTS_DATASET", split="train", trust_remote_code=True, streaming=True)
ds_val = load_dataset("srinathnr/TTS_DATASET", split="validation", trust_remote_code=True, streaming=True)
ds_test = load_dataset("srinathnr/TTS_DATASET", split="test", trust_remote_code=True, streaming=True)
```
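Each streamed record is a dict. The column names below mirror the common `datasets` audio convention (`audio.array`, `audio.sampling_rate`) but are an assumption for this dataset; inspect a real record with `next(iter(ds_train))` and adapt:

```python
# Mock of what one streamed record might look like; the real column
# names and nesting in TTS_DATASET are assumptions, so check an
# actual sample from next(iter(ds_train)) first.
sample = {
    "audio": {"array": [0.0, 0.01, -0.02], "sampling_rate": 16000},
    "phonemes": "h e l o",
}

waveform = sample["audio"]["array"]
sr = sample["audio"]["sampling_rate"]
ipa_tokens = sample["phonemes"].split()
print(len(waveform), sr, ipa_tokens)
```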
Project structure:

```
.
├── aligner.py         # Core forced alignment logic
├── dataloader.py      # Dataset and preprocessing
├── main.py            # Main entry point for batch alignment
├── plot_utils.py      # Plotting and visualization utilities
├── visualize_data.py  # Quick visualizations of alignments
├── requirements.txt
├── README.md
└── dataset/           # (user-supplied or from HF)
```
Download or symlink TTS_DATASET (see above) or provide your own with the required structure:
```
dataset/
├── wav/<spk>/         # Audio files (.wav)
└── phonemized/<spk>/  # Phoneme files (.txt, IPA, whitespace-separated)
```
Usage:

```shell
python main.py --config config.yaml   # run batch alignment
python visualize_data.py              # quick visualizations
```

Example: Loading & exploring a single item
```python
from pathlib import Path
import json

import soundfile as sf

root = Path("TTS_DATASET/train")

# Load audio and its whitespace-separated IPA transcription.
wav, sr = sf.read(root / "wav/p225/p225_001.wav")
ipa = (root / "phonemized/p225/p225_001.txt").read_text().strip().split()
print(f"IPA: {ipa}")

# If alignments have been generated:
segs = json.loads((root / "segments/p225/p225_001.json").read_text())
print("Aligned segments:", segs)
```

If you use this project or the dataset, please cite:
```bibtex
@misc{yours2025phonvctk,
  title        = {Phonemized-VCTK: An enriched version of VCTK with IPA, alignments and embeddings},
  author       = {Your Name},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/srinathnr/TTS_DATASET}}
}

@inproceedings{yamagishi2019cstr,
  title     = {The CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit},
  author    = {Yamagishi, Junichi and others},
  booktitle = {Proc. LREC},
  year      = {2019}
}
```

Questions? Open an issue or explore srinathnr/TTS_DATASET to get started!

