This project provides tools for analyzing and improving English pronunciation using AI models. It includes a web application and a command-line interface (CLI) for comparing audio files against expected text, providing detailed feedback on phonemes and prosody, and visualizing visemes. It leverages Wav2Vec 2.0 for audio feature extraction and dynamic time warping (DTW) for phoneme alignment.
See my blog post for more details about the approach.
You can explore the project and test the pronunciation analysis directly in the provided Jupyter notebook:
This notebook walks through:
- loading an audio sample,
- transcribing speech with Wav2Vec2,
- comparing predicted and expected phonemes,
- visualizing pronunciation and prosody scores.
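The phoneme-comparison step relies on DTW. As a rough illustration of the idea (a simplified 0/1 substitution cost between phoneme symbols, not the project's actual implementation), a minimal DTW distance can be sketched as:

```python
def dtw_distance(expected, predicted):
    """Minimal DTW distance between two phoneme sequences (0/1 cost)."""
    n, m = len(expected), len(predicted)
    # dp[i][j] = minimal alignment cost of expected[:i] vs predicted[:j]
    dp = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if expected[i - 1] == predicted[j - 1] else 1.0
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch expected
                                  dp[i][j - 1],      # stretch predicted
                                  dp[i - 1][j - 1])  # match / substitution
    return dp[n][m]

# Identical sequences align with zero cost; mismatches accumulate cost
print(dtw_distance(['h', 'ə', 'l', 'oʊ'], ['h', 'ɛ', 'l']))  # → 2.0
```

A small distance means the predicted phonemes closely track the expected ones.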
```shell
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

```python
import audio
import speech

audio_path = "example.mp3"
expected_text = "Hello I am a developer"

sound = audio.load(audio_path)
prediction = speech.compare_audio_with_text(sound, expected_text)
print(prediction.score)
```

```shell
python cli.py <file.mp3 or file.wav> "Expected text"
```

You will get JSON with the following fields:
Phonemes

- score: [0-100] pronunciation score
- differences.phoneme_distance: DTW distance between expected and predicted phonemes
- differences.phonemes: list of phonemes with their start and end times
- differences.errors: array of errors, each with:
  - position: index of the word in the expected text
  - expected: expected phonemes
  - actual: predicted phonemes
  - word: the expected word

For example, an error can be:

```json
{ "position": 0, "expected": "hæloʊ", "actual": "hɛl", "word": "Hell" }
```

- transcribe: the transcription of the audio
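A downstream script can consume this JSON directly. A hedged sketch (the payload below is illustrative, built from the fields documented above, not real CLI output):

```python
import json

# Illustrative payload mirroring the documented fields (not real CLI output)
raw = """
{
  "score": 72,
  "transcribe": "hello i am a developer",
  "differences": {
    "phoneme_distance": 3.0,
    "errors": [
      { "position": 0, "expected": "h\u00e6lo\u028a", "actual": "h\u025bl", "word": "Hell" }
    ]
  }
}
"""

result = json.loads(raw)
print(f"score: {result['score']}/100")
for err in result["differences"]["errors"]:
    print(f"word #{err['position']} {err['word']!r}: "
          f"expected /{err['expected']}/, heard /{err['actual']}/")
```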
Prosody

You also get prosody information, with:

- prosody.f0: the fundamental frequency (pitch) contour
- prosody.energy: the energy (loudness) contour
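An energy contour of this kind is commonly computed as frame-wise RMS over the waveform. A minimal numpy sketch under that assumption (the project's actual computation may differ):

```python
import numpy as np

def energy_contour(samples, frame_length=1024, hop_length=256):
    """Frame-wise RMS energy of a mono waveform."""
    frames = [samples[start:start + frame_length]
              for start in range(0, len(samples) - frame_length + 1, hop_length)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

# A steady 440 Hz tone has a flat contour with RMS ~ amplitude / sqrt(2)
t = np.linspace(0, 1, 16000, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
contour = energy_contour(tone)
```

Loud segments show up as peaks in the contour, silences as near-zero values.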
This application comes with a (very!) basic phoneme-to-viseme JavaScript implementation for the English language. A better implementation could be built using dedicated models.
```js
import { Viseme } from "/static/viseme.js";

const mouthImage = document.getElementById('the-img-node-you-want-to-use');
const viseme = new Viseme(mouthImage);

// Play the phonemes
viseme.play(['həloʊ', 'huː', 'ɑːɹ', 'juː']);
```

Start the server:

```shell
python -m uvicorn server:app --host 0.0.0.0 --port 8000 --reload
```

Note: recording with your microphone is not possible in Chrome in a non-HTTPS local environment.
You can also run the application using Streamlit:
```shell
streamlit run streamlit_app.py
```

The Streamlit app provides:
- Text input for the expected pronunciation
- Audio file upload (WAV, MP3, M4A, OGG, WEBM)
- Text-to-speech to listen to the correct pronunciation
- Detailed analysis results with scores, transcription, and word-by-word feedback
- Interactive charts for prosody (F0 and energy) and phoneme comparison
Please keep the pytest suite up to date:

```shell
pytest -v
```

The viseme images come from the HumanBeanCMU39 viseme set.
This project is licensed under the MIT License - see the LICENSE file for details
