This project demonstrates forced alignment using Wav2Vec2 to align phoneme sequences to audio data. Designed for researchers, linguists, and developers working with speech processing, phoneme alignment, TTS, and prosody modeling.
- Batch Processing: Efficient forced alignment across multiple audio files.
- Time-Aligned Segments: Outputs alignment results as JSON with per-phoneme timing.
- Flexible Dataset: Supports any dataset with `.wav` audio and `.txt` phoneme (IPA) transcriptions.
- Visualization: Utilities for visualizing alignments.
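Per-phoneme timing is written out as JSON. The exact schema is defined by `aligner.py`; the record below is an illustrative assumption (field names `phoneme`, `start`, `end` are mine, not the project's confirmed format):

```python
import json

# Hypothetical alignment output for one utterance; the real schema
# produced by aligner.py may differ (field names are assumptions).
example = """
[
  {"phoneme": "h", "start": 0.12, "end": 0.18},
  {"phoneme": "e", "start": 0.18, "end": 0.25},
  {"phoneme": "l", "start": 0.25, "end": 0.33}
]
"""

segments = json.loads(example)
for seg in segments:
    # Duration of each phoneme in seconds.
    print(f"{seg['phoneme']}: {seg['end'] - seg['start']:.2f}s")
```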
We recommend starting with Phonemized-VCTK (srinathnr/TTS_DATASET), a turn-key dataset ready for forced alignment and TTS research.
Folder layout:

```
TTS_DATASET/
├── train/
│   ├── wav/<spk>/
│   └── phonemized/<spk>/
├── validation/
│   ├── wav/<spk>/
│   └── phonemized/<spk>/
└── test/
    ├── wav/<spk>/
    └── phonemized/<spk>/
```
Each .txt contains whitespace-separated IPA tokens matching the audio file.
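Since every `.txt` must match a `.wav` with the same file stem, a quick sanity check can catch unpaired files before alignment. This is a sketch based only on the layout above; the helper name is mine, not part of the project:

```python
from pathlib import Path

def find_unpaired(split_dir):
    """Return stems present under wav/ or phonemized/ but not both.

    Assumes the documented layout: <split>/wav/<spk>/*.wav and
    <split>/phonemized/<spk>/*.txt with matching file stems.
    """
    split_dir = Path(split_dir)
    wav_stems = {p.stem for p in split_dir.glob("wav/*/*.wav")}
    txt_stems = {p.stem for p in split_dir.glob("phonemized/*/*.txt")}
    return wav_stems ^ txt_stems  # symmetric difference

# On a complete split, e.g. find_unpaired("TTS_DATASET/train"),
# this returns an empty set.
```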
Quick Hugging Face loading:

```python
from datasets import load_dataset

ds_train = load_dataset("srinathnr/TTS_DATASET", split="train", trust_remote_code=True, streaming=True)
ds_val = load_dataset("srinathnr/TTS_DATASET", split="validation", trust_remote_code=True, streaming=True)
ds_test = load_dataset("srinathnr/TTS_DATASET", split="test", trust_remote_code=True, streaming=True)
```
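Each streamed record is a dict. The column names below mirror the common `datasets` audio convention (`audio.array`, `audio.sampling_rate`) but are an assumption for this dataset; inspect a real record with `next(iter(ds_train))` and adapt:

```python
# Mock of what one streamed record might look like; the real column
# names and nesting in TTS_DATASET are assumptions, so check an
# actual sample from next(iter(ds_train)) first.
sample = {
    "audio": {"array": [0.0, 0.01, -0.02], "sampling_rate": 16000},
    "phonemes": "h e l o",
}

waveform = sample["audio"]["array"]
sr = sample["audio"]["sampling_rate"]
ipa_tokens = sample["phonemes"].split()
print(len(waveform), sr, ipa_tokens)
```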
Project structure:

```
.
├── aligner.py         # Core forced alignment logic
├── dataloader.py      # Dataset and preprocessing
├── main.py            # Main entry point for batch alignment
├── plot_utils.py      # Plotting and visualization utilities
├── visualize_data.py  # Quick visualizations of alignments
├── requirements.txt
├── README.md
└── dataset/           # (user-supplied or from HF)
```
Download or symlink TTS_DATASET (see above) or provide your own with the required structure:
```
dataset/
├── wav/<spk>/         # Audio files (.wav)
└── phonemized/<spk>/  # Phoneme files (.txt, IPA, whitespace-separated)
```
Usage:

```shell
python main.py --config config.yaml   # run batch alignment
python visualize_data.py              # quick visualizations
```

Example: Loading & exploring a single item
```python
from pathlib import Path
import json

import soundfile as sf

root = Path("TTS_DATASET/train")

# Load audio and its whitespace-separated IPA transcription.
wav, sr = sf.read(root / "wav/p225/p225_001.wav")
ipa = (root / "phonemized/p225/p225_001.txt").read_text().strip().split()
print(f"IPA: {ipa}")

# If alignments have been generated:
segs = json.loads((root / "segments/p225/p225_001.json").read_text())
print("Aligned segments:", segs)
```

If you use this project or the dataset, please cite:
```bibtex
@misc{yours2025phonvctk,
  title        = {Phonemized-VCTK: An enriched version of VCTK with IPA, alignments and embeddings},
  author       = {Your Name},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/srinathnr/TTS_DATASET}}
}

@inproceedings{yamagishi2019cstr,
  title     = {The CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit},
  author    = {Yamagishi, Junichi and others},
  booktitle = {Proc. LREC},
  year      = {2019}
}
```

Questions? Open an issue or explore srinathnr/TTS_DATASET to get started!

