Single-page EFL pronunciation evaluator for noun/verb stress-shift pairs, with A2A-compatible discovery + JSON-RPC endpoints.
Try it: https://guildaidemo.talknicer.com
- Paragraph selector for all annotated paragraphs loaded from `PARAGRAPHS.txt` (currently 10, but dynamic).
- Browser microphone recording with WAV encoding.
- Optional "Bring Your Own Deepgram API Key" UI section that stores a user-supplied key in a `deepgram_api_key` browser cookie (365-day expiry).
- "native exemplar" checkbox in the UI to mark exemplar-candidate submissions for later review.
- Deepgram transcription with per-word confidence.
- Inexact token alignment (Needleman–Wunsch style) between prompt and recognized words.
- Word-level stress inference from PocketSphinx phoneme alignment and pronunciation overlap scoring.
- Confidence visualization based on `confidence_cubed = confidence ** 3` as background color.
- A2A-compatible remote agent interface (Agent Card discovery + JSON-RPC endpoint) so other agents/platforms can call it: `GET /.well-known/agent.json` and `POST /a2a` (`agent.about`, `paragraphs.count`, `paragraphs.get_text`, `pronunciation.evaluate` with optional `deepgram_api_key`).
- Production-minded observability: request/trace IDs propagated through requests, responses, and logs for run correlation.
- Health endpoint (/healthz) to support deployment/monitoring and “is it alive?” checks.
- Structured, machine-consumable outputs (clear JSON schema for words, alignments, target evaluations, and summary metrics).
- "Control-plane friendly" service boundaries: stable HTTP endpoints + discoverable capabilities designed to plug into orchestration/routing.
- UI as an operator surface: single-page workflow that makes the agent capability usable and demoable (not just a script).
- Built-in developer visibility: the app page documents the agent endpoints and example calls to ease integration and adoption.
- Synchronous persistence of every analyzed submission as two files in a mounted bucket directory.
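The inexact token alignment between prompt and recognized words could be sketched as below; this is a generic Needleman–Wunsch-style dynamic program under assumed scoring weights (`match=2`, `mismatch=-1`, `gap=-1`), not necessarily the app's actual scoring or normalization.

```python
def align_tokens(prompt, recognized, gap=-1, match=2, mismatch=-1):
    """Global alignment of two token sequences; returns (prompt_idx, recog_idx)
    pairs where None marks a gap (inserted/deleted word)."""
    n, m = len(prompt), len(recognized)
    # score[i][j] = best score aligning prompt[:i] with recognized[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if prompt[i - 1] == recognized[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback: recover the aligned index pairs from the filled matrix.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if prompt[i - 1] == recognized[j - 1] else mismatch):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None)); i -= 1
        else:
            pairs.append((None, j - 1)); j -= 1
    return list(reversed(pairs))
```

With a dropped word, the gap shows up as a `None` on the recognized side, which is what lets the UI flag missing targets rather than misaligning everything downstream.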
- Optionally set `DEEPGRAM_API_KEY` as a shared fallback key for server-side transcription. The app now prefers a non-empty `deepgram_api_key` cookie when present and falls back to this environment variable when needed.
- (Optional, but recommended for persistence testing) set a custom bucket mount directory: `export BUCKET_DIR=/bucket`
- Run the dev server (creates `.venv`, installs deps, minifies assets, starts Flask): `./devserver.sh`
- Open: http://localhost:8080
- `GET /api/paragraphs`
- `POST /api/analyze` (multipart/form-data: `paragraph_id`, `audio_wav`, optional `native_exemplar`)
- `GET /healthz`
- `GET /api/healthz` (alias for environments that only route `/api/*`)
- `GET /.well-known/agent.json`
- `POST /a2a`
Agent card discovery advertises method-level capabilities for agent.about, paragraphs.count, paragraphs.get_text, and pronunciation.evaluate, including required vs optional params and the JSON-RPC endpoint URL.
For each successful analysis, the app writes two files synchronously into BUCKET_DIR: {recording_id}.wav (original uploaded WAV bytes) and {recording_id}.json (analysis sidecar). The recording_id is a microsecond-resolution timestamp string. For native exemplar uploads (native_exemplar=true), the timestamp is suffixed with e before the file extension so pairs are easier to spot in bucket listings. Example:
/bucket/260226140321123456.wav
/bucket/260226140321123456.json
/bucket/260226140321123456e.wav # native exemplar
/bucket/260226140321123456e.json # native exemplar
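The naming scheme above can be illustrated with a small helper; `make_recording_id` is a hypothetical name for illustration, not necessarily the app's actual function.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def make_recording_id(native_exemplar: bool, now=None) -> str:
    """Microsecond-resolution timestamp id (YYMMDDHHMMSSffffff), with an 'e'
    suffix for native exemplar uploads so pairs stand out in bucket listings."""
    now = now or datetime.now(ZoneInfo("Pacific/Honolulu"))
    rid = now.strftime("%y%m%d%H%M%S%f")
    return rid + "e" if native_exemplar else rid
```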
Each sidecar file contains the following top-level structure:
{
"schema_version": 1,
"recording_id": "YYMMDDHHMMSSffffff",
"created_at_hst": "2026-02-26T14:03:21.123456-10:00",
"timezone": "Pacific/Honolulu",
"source": "web",
"request_id": "uuid-or-upstream-id",
"request_ip": "198.51.100.23",
"paragraph_id": 3,
"paragraph_text_hash": "sha256:...",
"native_exemplar": true,
"audio": {
"path": "/bucket/YYMMDDHHMMSSffffff.wav",
"bytes": 482344,
"content_type": "audio/wav",
"sample_rate_hz": 16000,
"channels": 1,
"sample_width_bytes": 2
},
"analysis_summary": {
"percent_correct": 71.43,
"total_targets": 7,
"scored_targets": 7,
"missing_targets": 0,
"unaligned_targets": 0
},
"targets": [
{
"token_index": 12,
"word_display": "record",
"word_norm": "record",
"label": "N",
"expected_stress": 1,
"inferred_stress": 2,
"status": "ok",
"correct": false,
"core_phones": {"syll1": "EH", "syll2": "AO"},
"core_durations": {"syll1": 0.07, "syll2": 0.11},
"duration_ratio": 0.636364,
"duration_ratio_log": -0.451985,
"deepgram_word_index": 13,
"deepgram_confidence": 0.93,
"deepgram_confidence_cubed": 0.804357,
"feedback": "Shift stress to syllable 1 and lengthen that vowel."
}
],
"pipeline": {
"asr_provider": "deepgram",
"asr_model": "nova-2",
"aligner": "pocketsphinx"
}
}

When `native_exemplar` is true, `recording_id` and persisted filenames append `e` before the extension (for example, `YYMMDDHHMMSSffffffe.wav` and `YYMMDDHHMMSSffffffe.json`).
`targets[*].duration_ratio` is computed as `syll1/syll2` when both values exist and `syll2 > 0`; otherwise it is null.
`targets[*].duration_ratio_log` is computed as `ln((syll1 + 1e-4)/(syll2 + 1e-4))` when both values exist; otherwise it is null.
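Those two rules might be restated as a small sketch; `duration_ratios` is an illustrative helper name.

```python
import math

def duration_ratios(syll1, syll2):
    """Returns (duration_ratio, duration_ratio_log) per the documented rules;
    each value is None when its precondition is not met."""
    ratio = None
    if syll1 is not None and syll2 is not None and syll2 > 0:
        ratio = syll1 / syll2
    log_ratio = None
    if syll1 is not None and syll2 is not None:
        # epsilon keeps the log defined for zero-length syllables
        log_ratio = math.log((syll1 + 1e-4) / (syll2 + 1e-4))
    return ratio, log_ratio
```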
The sidecar JSON is written with an atomic temp-file-and-rename pattern (.json.tmp then os.replace) to reduce partial-write risk.
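The temp-file-and-rename pattern might look like this sketch; `write_sidecar_atomically` is a hypothetical name, and the app's exact serialization options may differ.

```python
import json
import os

def write_sidecar_atomically(path: str, payload: dict) -> None:
    """Write JSON to <path>.tmp, fsync, then os.replace into place so readers
    only ever see a complete old or complete new file."""
    tmp_path = path + ".tmp"  # e.g. .../<recording_id>.json.tmp
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)
        f.flush()
        os.fsync(f.fileno())  # push bytes to disk before the rename
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
```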
The inference pipeline can learn stress decision thresholds from previously persisted sidecars in $BUCKET_DIR/*.json, reading all matching sidecars fresh for each analysis request.
- Data source: all JSON sidecars in `BUCKET_DIR` (default `/bucket`).
- Training filter: only samples where top-level `native_exemplar` is `true`, target `status` is `"ok"`, `expected_stress` is 1 or 2, and target syllable durations are usable.
- Feature: `duration_ratio_log = ln((syll1 + 1e-4)/(syll2 + 1e-4))`.
- Threshold method: for each key, fit class summaries in log-ratio space and compute a Gaussian intersection boundary using T = (μ₁σ₂ + μ₂σ₁) / (σ₁ + σ₂), where class 1 is expected stress 1 (noun pattern) and class 2 is expected stress 2 (verb pattern).
- Keys:
  - context key: `(word_norm, paragraph_id, token_index)` when available,
  - fallback key: `word_norm`.
- Minimum data guardrail: learned thresholds are used only when both classes have at least 3 exemplar samples, so each class has enough points for a valid standard deviation.
- Boundary safety fallback: if σ₁ + σ₂ = 0 (identical spread), the threshold falls back to the midpoint T = (μ₁ + μ₂) / 2.
- Decision fallback: if no usable learned threshold exists (or sidecar data is empty/corrupt), the app keeps the original heuristic: `syll1 >= syll2 => stress 1, else stress 2`.
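The threshold rule, guardrail, and fallbacks above could be sketched as follows; function names are illustrative, and the mapping from "above threshold" to stress 1 assumes stress-1 exemplars have the larger log ratios (longer first syllable).

```python
import statistics

def learn_threshold(class1_logs, class2_logs, min_per_class=3):
    """class1 = expected stress 1, class2 = expected stress 2; inputs are
    duration_ratio_log exemplar samples. Returns None when data is too thin."""
    if len(class1_logs) < min_per_class or len(class2_logs) < min_per_class:
        return None  # guardrail: caller falls back to the naive heuristic
    mu1, mu2 = statistics.mean(class1_logs), statistics.mean(class2_logs)
    s1, s2 = statistics.stdev(class1_logs), statistics.stdev(class2_logs)
    if s1 + s2 == 0:
        return (mu1 + mu2) / 2  # identical spread: midpoint fallback
    return (mu1 * s2 + mu2 * s1) / (s1 + s2)  # Gaussian intersection boundary

def infer_stress(duration_ratio_log, threshold):
    if threshold is None:
        # naive heuristic: syll1 >= syll2 <=> log ratio >= 0 => stress 1
        return 1 if duration_ratio_log >= 0 else 2
    return 1 if duration_ratio_log >= threshold else 2
```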
Target output now includes debug fields:
- `decision_method` (`"learned_threshold"` or `"naive_duration"` when inference succeeds)
- `duration_ratio_log`
- `learned_threshold` (nullable)
- `threshold_key` (nullable)
- `threshold_stats` (nullable object with `mu1`, `mu2`, `sigma1`, `sigma2`, class counts, and effective threshold)
- `decision_confidence` (nullable, 0.0–100.0, only when `decision_method` is `"learned_threshold"`)
- Read the discoverable agent card:
curl -s "$BASE_URL/.well-known/agent.json" | jq .
- Build base64 payload from the paragraph 3 WAV fixture:
AUDIO_B64=$(python - <<'PY'
import base64
from pathlib import Path
print(base64.b64encode(Path("tests/abc340c7-fc39-41f0-b1a6-3557f83b7707.wav").read_bytes()).decode())
PY
)
- Discover paragraph count before selecting ids:
curl -s -X POST "$BASE_URL/a2a" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":"p-count","method":"paragraphs.count","params":{}}' | jq .
- Fetch plain paragraph text (unannotated) for the selected id:
curl -s -X POST "$BASE_URL/a2a" \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":"p-text-3","method":"paragraphs.get_text","params":{"paragraph_id":3}}' | jq .
- Submit `pronunciation.evaluate` as an A2A client:
jq -n --arg audio "$AUDIO_B64" --arg dg "dg_live_xxx" '{jsonrpc:"2.0",id:"p3-a2a-demo",method:"pronunciation.evaluate",params:{paragraph_id:3,audio_wav_base64:$audio,deepgram_api_key:$dg}}' \
| curl -s -X POST "$BASE_URL/a2a" \
-H "Content-Type: application/json" \
-d @- | jq .

For requests that need transcription, the app resolves the Deepgram key in this order:
1. `deepgram_api_key` A2A method parameter (if provided and non-empty)
2. `deepgram_api_key` browser cookie (if provided and non-empty)
3. `DEEPGRAM_API_KEY` environment variable
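That precedence could be sketched as follows; `resolve_deepgram_key` is a hypothetical helper name for illustration.

```python
import os

def resolve_deepgram_key(a2a_param=None, cookie=None):
    """Return (key, source) per the documented precedence; a caller would log
    only the source string, never the key value."""
    candidates = (
        (a2a_param, "a2a_param"),
        (cookie, "cookie"),
        (os.environ.get("DEEPGRAM_API_KEY"), "env"),
    )
    for key, source in candidates:
        if key:  # skip None and empty strings
            return key, source
    return None, None
```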
Server logs include only the source (a2a_param, cookie, or env) and never print key values.
- Select a paragraph.
- Click Start Recording, read paragraph, click Stop Recording.
- Optionally check native exemplar if this submission should be flagged as an exemplar candidate.
- Click Submit for Analysis.
- Verify:
- confidence-colored word backgrounds,
- prominent red boxes around incorrect stress targets,
- dashed orange boxes for missing/unaligned targets,
- per-target table populated,
- Developer/A2A docs visible on same page,
- matching `{recording_id}.wav` and `{recording_id}.json` files appear in `BUCKET_DIR` (`{recording_id}` ends with `e` for native exemplar submissions).
Run unit tests:
pytest -q

Included tests cover:
- paragraph parsing and target extraction,
- sequence alignment behavior,
- confidence-cubed and deterministic background normalization,
- persistence of sidecar and WAV files and schema fields using the paragraph 3 test WAV fixture.
To quickly verify colored confidence rendering and the target table without recording live audio, submit the bundled paragraph 3 WAV fixture and then view the Results section in the browser:
curl -sS -X POST http://127.0.0.1:8080/api/analyze \
-F paragraph_id=3 \
-F native_exemplar=false \
-F audio_wav=@tests/abc340c7-fc39-41f0-b1a6-3557f83b7707.wav

This uses the same analysis pipeline as normal uploads (including Deepgram transcription and adaptive thresholding), so it is suitable for screenshot-based regression checks.
- Provide `DEEPGRAM_API_KEY` via the service environment configuration. Keep secrets out of source control.
- The `Procfile` for Google Cloud Run invocations runs `app.py` as `app:app`.
- Validate locally with `devserver.sh` before deployment.
This section describes how to connect the pronunciation evaluator to WhatsApp so users can receive a practice paragraph, record a voice message reply, and get evaluation feedback — all within WhatsApp.
A lightweight n8n workflow acts as the orchestrator:
- A user sends any text message to your WhatsApp number → n8n sends back a random paragraph (e.g., "Please read paragraph 3 aloud and reply with a voice message: ...")
- The user replies with a WhatsApp voice message → n8n downloads the OGG/Opus audio
- n8n calls a small audio conversion helper to produce a 16 kHz mono WAV
- n8n base64-encodes the WAV and POSTs it to this app's `pronunciation.evaluate` A2A endpoint, with the `paragraph_id` parsed from the outbound message text
- n8n formats the JSON result into a plain-language summary and sends it back to the user
No server-side session state is required: the paragraph number is embedded in the text message sent to the user and parsed from the Meta webhook payload's conversation context when the voice reply arrives.
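A minimal sketch of that stateless lookup, assuming the paragraph number always appears as "paragraph {N}" in the outbound prompt text (`parse_paragraph_id` is a hypothetical name):

```python
import re

def parse_paragraph_id(outbound_text: str):
    """Recover the paragraph number embedded in the prompt message the user
    is replying to; returns None if no paragraph number is present."""
    m = re.search(r"paragraph (\d+)", outbound_text, re.IGNORECASE)
    return int(m.group(1)) if m else None
```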
WhatsApp's API is gated by Meta regardless of what orchestration layer you use. You will need:
- A Meta developer account at developers.facebook.com
- A Meta Business portfolio (formerly Business Manager), verified with a real business or individual identity
- A WhatsApp Business Cloud API app created in the Meta developer console, with a verified phone number attached
Meta's verification process typically takes 1–3 business days for individual/small business accounts, though it can be faster. The WhatsApp Cloud API itself is free at low message volumes (Meta's pricing is per-conversation, with a free tier for developer testing).
n8n's WhatsApp Business Cloud trigger node registers your webhook with Meta automatically using an OAuth token — you do not need to copy webhook URLs manually — but the underlying Meta account and app setup is still required.
Either:
- Self-hosted (recommended for this use case): `docker compose up` with the official n8n Docker image. Self-hosting gives you shell access for audio conversion and no workflow execution limits. Running n8n on Cloud Run or a small GCP VM keeps everything in the same GCP environment as the main app.
- n8n Cloud: easier to start but restricts arbitrary command execution, which affects audio conversion (see below).
WhatsApp voice messages are delivered as OGG/Opus files. The pronunciation.evaluate endpoint requires 16 kHz mono WAV. Convert with ffmpeg:
ffmpeg -i input.ogg -ar 16000 -ac 1 -sample_fmt s16 output.wav

On self-hosted n8n you can run this in an Execute Command node. On n8n Cloud, or to keep the conversion cleanly separable, deploy a minimal Cloud Run function that accepts a POST with the OGG bytes and returns WAV bytes. This converter is itself a small A2A-compatible service if you want to expose it as a discoverable agent capability.
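The core of such a converter might look like the sketch below, which wraps the same ffmpeg command; it assumes ffmpeg is on PATH, and `ffmpeg_cmd`/`ogg_to_wav` are hypothetical names (the HTTP wrapper around them is omitted).

```python
import os
import subprocess
import tempfile

def ffmpeg_cmd(src: str, dst: str) -> list:
    """The conversion command: 16 kHz, mono, signed 16-bit WAV."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1",
            "-sample_fmt", "s16", dst]

def ogg_to_wav(ogg_bytes: bytes) -> bytes:
    """Round-trip OGG/Opus bytes through ffmpeg via a temp directory."""
    with tempfile.TemporaryDirectory() as d:
        src, dst = os.path.join(d, "in.ogg"), os.path.join(d, "out.wav")
        with open(src, "wb") as f:
            f.write(ogg_bytes)
        subprocess.run(ffmpeg_cmd(src, dst), check=True, capture_output=True)
        with open(dst, "rb") as f:
            return f.read()
```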
- WhatsApp Trigger node — fires on any incoming WhatsApp message to your business number
- IF node — branch on message type:
  - Text message → pick a random `paragraph_id`, fetch its text via `paragraphs.get_text`, send back "Please read paragraph {N} aloud and reply with a voice message: {text}"
  - Voice message → proceed to conversion and evaluation
- HTTP Request node — download the OGG audio from the Media URL in the webhook payload
- Audio conversion — Execute Command node (self-hosted) or HTTP Request to your conversion microservice
- Code node — base64-encode the WAV bytes; parse the paragraph_id from the last outbound message text in the webhook context
- HTTP Request node — POST to `/a2a`:
{
"jsonrpc": "2.0",
"id": "whatsapp-eval",
"method": "pronunciation.evaluate",
"params": {
"paragraph_id": 3,
"audio_wav_base64": "<base64>"
}
}

- Code node — format result: extract `score_summary.percent_correct` and per-target feedback strings
- WhatsApp Business Cloud send node — reply with the formatted summary
Results for paragraph 3: 4/7 correct (57%)
✓ record (verb) — stress correct
✓ permit (verb) — stress correct
✗ project (noun) — should be PRO-ject; you stressed pro-JECT
✗ object (noun) — should be OB-ject; you stressed ob-JECT
✓ present (verb) — stress correct
✓ conduct (noun) — stress correct
✗ increase (noun) — should be IN-crease; you stressed in-CREASE
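A Code-node formatter producing a reply in this shape might look like the sketch below; the per-target input fields mirror the documented sidecar schema (`word_display`, `label`, `correct`, `feedback`), but the exact feedback wording is an assumption.

```python
def format_summary(paragraph_id, targets):
    """Build the plain-text WhatsApp reply from per-target evaluation dicts."""
    correct = sum(1 for t in targets if t.get("correct"))
    total = len(targets)
    pct = round(100 * correct / total) if total else 0
    lines = [f"Results for paragraph {paragraph_id}: "
             f"{correct}/{total} correct ({pct}%)"]
    for t in targets:
        mark = "✓" if t.get("correct") else "✗"
        kind = "noun" if t.get("label") == "N" else "verb"
        note = "stress correct" if t.get("correct") else t.get("feedback", "")
        lines.append(f"{mark} {t['word_display']} ({kind}) — {note}")
    return "\n".join(lines)
```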