Srinath-N-R/text-speech-forced-alignment-IPA
Forced Alignment with Wav2Vec2 and Phonemized-VCTK

This project demonstrates forced alignment using Wav2Vec2 to align phoneme sequences to audio data. It is designed for researchers, linguists, and developers working on speech processing, phoneme alignment, TTS, and prosody modeling.


🚀 Features

  • Batch Processing: Efficient forced alignment across multiple audio files.
  • Time-Aligned Segments: Outputs alignment results as JSON with per-phoneme timing.
  • Flexible Dataset: Supports any dataset with .wav audio and .txt phoneme (IPA) transcriptions.
  • Visualization: Utilities for visualizing alignments.
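The per-phoneme timing output can be illustrated with a short sketch. The exact schema depends on `aligner.py`, so the field names below (`phoneme`, `start`, `end`, in seconds) are assumptions for illustration only:

```python
import json

# Hypothetical alignment output for one utterance; the field names are
# illustrative assumptions, not the exact schema emitted by aligner.py.
segments = [
    {"phoneme": "h", "start": 0.00, "end": 0.07},
    {"phoneme": "ə", "start": 0.07, "end": 0.15},
    {"phoneme": "l", "start": 0.15, "end": 0.24},
]

# Derive per-phoneme durations in milliseconds from the timing fields.
durations_ms = {s["phoneme"]: round((s["end"] - s["start"]) * 1000) for s in segments}
print(json.dumps(durations_ms, ensure_ascii=False))
```

Durations like these feed directly into prosody modeling, which is one of the use cases named above.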

📦 Dataset Recommendation

We recommend starting with Phonemized-VCTK (srinathnr/TTS_DATASET), a turn-key dataset ready for forced alignment and TTS research.

Folder layout:

```
TTS_DATASET/
  train/
    wav/<spk>/
    phonemized/<spk>/
  validation/
    wav/<spk>/
    phonemized/<spk>/
  test/
    wav/<spk>/
    phonemized/<spk>/
```

Each .txt contains whitespace-separated IPA tokens matching the corresponding audio file.
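Before aligning, it can be worth checking that every .wav has a matching .txt under this layout. A minimal sketch (the helper name `find_unpaired` is ours, not part of this repo):

```python
from pathlib import Path

def find_unpaired(wav_dir, txt_dir):
    """Return (wav stems with no .txt, txt stems with no .wav)."""
    wav_stems = {p.stem for p in Path(wav_dir).rglob("*.wav")}
    txt_stems = {p.stem for p in Path(txt_dir).rglob("*.txt")}
    return wav_stems - txt_stems, txt_stems - wav_stems

# Example: find_unpaired("TTS_DATASET/train/wav", "TTS_DATASET/train/phonemized")
```

Both directions matter: a stray .wav silently produces no alignment, while a stray .txt usually crashes batch processing.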

Quick Hugging Face Loading:

```python
from datasets import load_dataset

ds_train = load_dataset("srinathnr/TTS_DATASET", split="train", trust_remote_code=True, streaming=True)
ds_val = load_dataset("srinathnr/TTS_DATASET", split="validation", trust_remote_code=True, streaming=True)
ds_test = load_dataset("srinathnr/TTS_DATASET", split="test", trust_remote_code=True, streaming=True)
```

🛠️ Project Structure

```
.
├── aligner.py          # Core forced alignment logic
├── dataloader.py       # Dataset and preprocessing
├── main.py             # Main entry point for batch alignment
├── plot_utils.py       # Plotting and visualization utilities
├── visualize_data.py   # Quick visualizations of alignments
├── requirements.txt
├── README.md
└── dataset/            # (user-supplied or from HF)
```

⚡ Usage

1. Prepare Dataset

Download or symlink TTS_DATASET (see above) or provide your own with the required structure:

```
dataset/
  wav/<spk>/         # Audio files (.wav)
  phonemized/<spk>/  # Phoneme files (.txt, IPA, whitespace-separated)
```

2. Run Alignment

```bash
python main.py --config config.yaml
```
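The contents of config.yaml are not shown in this snapshot, so the sketch below is hypothetical; every key name is an assumption for illustration, and the real option names live in `main.py` and `aligner.py`:

```yaml
# Hypothetical config sketch -- all key names are illustrative assumptions.
dataset_root: TTS_DATASET/train        # directory containing wav/ and phonemized/
output_dir: TTS_DATASET/train/segments # where per-utterance JSON alignments land
model_name: facebook/wav2vec2-large-xlsr-53  # any phoneme-capable Wav2Vec2 checkpoint
sample_rate: 16000
batch_size: 8
```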

3. Visualize a Single Example

```bash
python visualize_data.py
```

📂 Explore the Dataset

Example: Loading & exploring a single item

```python
from pathlib import Path
import json

import soundfile as sf

root = Path("TTS_DATASET/train")
wav, sr = sf.read(root / "wav/p225/p225_001.wav")
ipa = (root / "phonemized/p225/p225_001.txt").read_text().strip().split()
print(f"IPA: {ipa}")

# If alignments have been generated:
segs = json.loads((root / "segments/p225/p225_001.json").read_text())
print("Aligned segments:", segs)
```
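A quick sanity check on a loaded item is that the aligned segments span the full audio. The sketch below is self-contained, using a synthetic one-second signal in place of `sf.read` output and a stand-in segments list whose field names (`start`, `end`) are assumptions about the aligner's JSON schema:

```python
import numpy as np

sr = 16000                            # sample rate; real value comes from sf.read
wav = np.zeros(sr, dtype=np.float32)  # stand-in for one second of audio

audio_dur = len(wav) / sr  # duration in seconds

# Stand-in for the aligner's JSON output; field names are assumptions.
segs = [{"phoneme": "a", "start": 0.0, "end": 0.4},
        {"phoneme": "b", "start": 0.4, "end": 1.0}]

# Fraction of the audio covered by aligned segments; far below 1.0
# usually signals a transcription/audio mismatch.
coverage = sum(s["end"] - s["start"] for s in segs) / audio_dur
print(f"audio {audio_dur:.2f}s, alignment covers {coverage:.0%}")
```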

📝 Citation

If you use this project or the dataset, please cite:

```bibtex
@misc{yours2025phonvctk,
  title        = {Phonemized-VCTK: An enriched version of VCTK with IPA, alignments and embeddings},
  author       = {Your Name},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/srinathnr/TTS_DATASET}}
}

@inproceedings{yamagishi2019cstr,
  title     = {The CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit},
  author    = {Yamagishi, Junichi and others},
  booktitle = {Proc. LREC},
  year      = {2019}
}
```



Questions? Open an issue or explore srinathnr/TTS_DATASET to get started!
