
Halleck45/OpenPronounce


This project provides tools for analyzing and improving English pronunciation using AI models. It includes a web application and a command-line interface (CLI) for comparing audio files against expected text, providing detailed feedback on phonemes and prosody, and visualizing visemes. It leverages Wav2Vec 2.0 for audio feature extraction and DTW for phoneme alignment.

See my blog post for more details about the approach.


Demo notebook

You can explore the project and test the pronunciation analysis directly in the provided Jupyter notebook:

➡️ Open in Jupyter Notebook

This notebook walks through:

  • loading an audio sample,
  • transcribing speech with Wav2Vec2,
  • comparing predicted and expected phonemes,
  • visualizing pronunciation and prosody scores.
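The phoneme-comparison step above can be illustrated with a short, self-contained sketch. The `phoneme_errors` helper below is hypothetical (it is not part of this project's API); it only shows the shape of a word-by-word comparison, producing records in the error format documented in the output section of this README.

```python
def phoneme_errors(expected, predicted):
    """Compare aligned (word, phonemes) pairs and collect mismatches.

    expected/predicted: lists of (word, phoneme_string) pairs, aligned
    by word position. Returns error records shaped like the project's
    documented `differences.errors` entries.
    """
    errors = []
    for i, ((word, exp), (_, act)) in enumerate(zip(expected, predicted)):
        if exp != act:
            errors.append({
                "position": i,       # index of the word in the expected text
                "expected": exp,     # expected phonemes
                "actual": act,       # predicted phonemes
                "word": word,        # the expected word
            })
    return errors

errors = phoneme_errors(
    [("Hello", "hæloʊ"), ("I", "aɪ")],
    [("Hello", "hɛl"), ("I", "aɪ")],
)
```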

Installation

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Usage

As a library

import audio
import speech

audio_path = "example.mp3"
expected_text = "Hello I am a developer"

sound = audio.load(audio_path)
prediction = speech.compare_audio_with_text(sound, expected_text)

print(prediction.score)

As a CLI

python cli.py <file.mp3 or file.wav> "Expected text"

Predicted output

You will get JSON with the following fields:

Phonemes

  • score: [0-100] pronunciation score

  • differences.phoneme_distance: DTW distance between expected and predicted phonemes

  • differences.phonemes: list of phonemes with their start and end times

  • differences.errors: array of errors, with:

    • position: index of the word in the expected text
    • expected: expected phonemes
    • actual: predicted phonemes
    • word: the expected word

    For example, an error can be:

    {
        "position": 0,
        "expected": "hæloʊ",
        "actual": "hɛl",
        "word": "Hell"
    }
  • transcribe: the transcription of the audio
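The differences.phoneme_distance field above is a DTW distance. As a rough illustration of the idea only (not this project's implementation), a minimal DTW over two phoneme sequences with a 0/1 local cost can be sketched like this:

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two phoneme sequences.

    Uses a 0/1 local cost (0 if the phonemes match, 1 otherwise);
    cost[i][j] holds the DTW distance between a[:i] and b[:j].
    """
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = 0.0 if a[i - 1] == b[j - 1] else 1.0
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[len(a)][len(b)]

distance = dtw_distance(list("hæloʊ"), list("hɛl"))
```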

Prosody

You also get prosody information, with:

  • prosody.f0: the fundamental frequency (pitch) contour
  • prosody.energy: the energy (loudness) contour
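As a sketch of what an energy contour represents (again, not this project's implementation), frame-wise RMS over a mono signal can be computed as follows; the frame length and hop size below are illustrative values for 16 kHz audio:

```python
import math

def energy_contour(samples, frame_len=400, hop=160):
    """Frame-wise RMS energy of a mono signal (illustrative sketch).

    frame_len=400 and hop=160 correspond to 25 ms frames with a
    10 ms hop at a 16 kHz sample rate.
    """
    contour = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        contour.append(rms)
    return contour

# 0.1 s of a full-scale 440 Hz sine at 16 kHz: constant energy,
# so the contour is flat at roughly 1/sqrt(2).
signal = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
contour = energy_contour(signal)
```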

Visemes

This application comes with a (very!) basic phoneme-to-viseme JavaScript implementation for English. A better implementation could be built using dedicated models.

import { Viseme } from "/static/viseme.js";
const mouthImage = document.getElementById('the-img-node-you-want-to-use');
const viseme = new Viseme(mouthImage);

// Play the phonemes
viseme.play(['həloʊ', 'huː', 'ɑːɹ', 'juː']);
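For orientation, the same idea can be sketched in Python: each phoneme is looked up in a table of mouth shapes. The mapping below is purely illustrative and is not the project's actual viseme table (the real one is based on the HumanBeanCMU39 set referenced at the end of this README):

```python
# Hypothetical phoneme-to-viseme lookup, for illustration only.
PHONEME_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed",
    "f": "teeth-on-lip", "v": "teeth-on-lip",
    "æ": "open", "ɑ": "open",
    "u": "rounded", "oʊ": "rounded",
}

def visemes_for(phonemes, default="rest"):
    """Map a phoneme sequence to the mouth shape shown for each phoneme."""
    return [PHONEME_TO_VISEME.get(p, default) for p in phonemes]

shapes = visemes_for(["m", "æ", "p"])
```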

Web application

FastAPI Server

Start the server:

python -m uvicorn server:app --host 0.0.0.0 --port 8000 --reload

Web application screenshot

Note: recording with your microphone is not possible in Chrome in a non-HTTPS local environment.

Streamlit Application

You can also run the application using Streamlit:

streamlit run streamlit_app.py

The Streamlit app provides:

  • Text input for the expected pronunciation
  • Audio file upload (WAV, MP3, M4A, OGG, WEBM)
  • Text-to-speech to listen to the correct pronunciation
  • Detailed analysis results with scores, transcription, and word-by-word feedback
  • Interactive charts for prosody (F0 and energy) and phoneme comparison

Contributing

Please keep the pytest suite up to date:

pytest -v

References

Visemes

The viseme images come from the HumanBeanCMU39 viseme set.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

AI-powered open-source toolkit for real-time English pronunciation feedback.
