
jsalsman/guildaidemo


Syllable Stress Assessment Agent

Python 3.13 · Flask 3.1 · A2A Compatible · MIT License

Single-page EFL pronunciation evaluator for noun/verb stress-shift pairs, with A2A-compatible discovery + JSON-RPC endpoints.

Try it: https://guildaidemo.talknicer.com

Features

  • Paragraph selector for all annotated paragraphs loaded from PARAGRAPHS.txt (currently 10, but dynamic).
  • Browser microphone recording with WAV encoding.
  • Optional "Bring Your Own Deepgram API Key" UI section that stores a user-supplied key in a deepgram_api_key browser cookie (365-day expiry).
  • "native exemplar" checkbox in the UI to mark exemplar-candidate submissions for later review.
  • Deepgram transcription with per-word confidence.
  • Inexact token alignment (Needleman–Wunsch style) between prompt and recognized words.
  • Word-level stress inference from PocketSphinx phoneme alignment and pronunciation overlap scoring.
  • Confidence visualization based on confidence_cubed = confidence ** 3 as background color.
  • A2A-compatible remote agent interface (Agent Card discovery + JSON-RPC endpoint) so other agents/platforms can call it.
    • GET /.well-known/agent.json
    • POST /a2a (agent.about, paragraphs.count, paragraphs.get_text, pronunciation.evaluate with optional deepgram_api_key)
  • Production-minded observability: request/trace IDs propagated through requests, responses, and logs for run correlation.
  • Health endpoint (/healthz) to support deployment/monitoring and “is it alive?” checks.
  • Structured, machine-consumable outputs (clear JSON schema for words, alignments, target evaluations, and summary metrics).
  • "Control-plane friendly" service boundaries: stable HTTP endpoints + discoverable capabilities designed to plug into orchestration/routing.
  • UI as an operator surface: single-page workflow that makes the agent capability usable and demoable (not just a script).
  • Built-in developer visibility: the app page documents the agent endpoints and example calls to ease integration and adoption.
  • Synchronous persistence of every analyzed submission as two files in a mounted bucket directory.
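The confidence-cubed mapping behind the visualization bullet is simple enough to sketch directly. The cube itself is documented; the specific color ramp below is an illustrative assumption, not the app's actual CSS:

```python
def confidence_cubed(confidence: float) -> float:
    """Cubing widens the visual gap between high and mid ASR confidence."""
    return confidence ** 3

def confidence_background(confidence: float) -> str:
    # Hypothetical color ramp: a green tint that fades to white as
    # confidence drops. The real app's exact color mapping may differ.
    level = int(255 * (1 - confidence_cubed(confidence)))
    return f"rgb({level}, 255, {level})"
```

A word with Deepgram confidence 0.93 maps to a cubed value of about 0.80, so mid-confidence words visibly fade while near-certain words stay saturated.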

Local setup

  1. Optionally set DEEPGRAM_API_KEY as a shared fallback key for server-side transcription. The app now prefers a non-empty deepgram_api_key cookie when present and falls back to this environment variable when needed.

  2. (Optional, but recommended for persistence testing) set a custom bucket mount directory:

export BUCKET_DIR=/bucket
  3. Run the dev server (creates .venv, installs deps, minifies assets, starts Flask):
./devserver.sh
  4. Open:
http://localhost:8080

API endpoints

  • GET /api/paragraphs
  • POST /api/analyze (multipart/form-data: paragraph_id, audio_wav, optional native_exemplar)
  • GET /healthz
  • GET /api/healthz (alias for environments that only route /api/*)
  • GET /.well-known/agent.json
  • POST /a2a

Agent card discovery advertises method-level capabilities for agent.about, paragraphs.count, paragraphs.get_text, and pronunciation.evaluate, including required vs optional params and the JSON-RPC endpoint URL.

Persistence behavior (/api/analyze and /a2a)

For each successful analysis, the app writes two files synchronously into BUCKET_DIR: {recording_id}.wav (original uploaded WAV bytes) and {recording_id}.json (analysis sidecar). The recording_id is a microsecond-resolution timestamp string. For native exemplar uploads (native_exemplar=true), the timestamp is suffixed with e before the file extension so pairs are easier to spot in bucket listings. Example:

/bucket/260226140321123456.wav
/bucket/260226140321123456.json
/bucket/260226140321123456e.wav   # native exemplar
/bucket/260226140321123456e.json  # native exemplar
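A minimal sketch of how such IDs can be produced, assuming Python's zoneinfo and the Pacific/Honolulu timezone recorded in the sidecar (the helper name is illustrative):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def make_recording_id(native_exemplar: bool = False) -> str:
    """Microsecond-resolution timestamp ID; 'e' suffix marks native exemplars."""
    now = datetime.now(ZoneInfo("Pacific/Honolulu"))
    rid = now.strftime("%y%m%d%H%M%S%f")  # YYMMDDHHMMSSffffff
    return rid + "e" if native_exemplar else rid
```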

Recording JSON schema

Each sidecar file contains the following top-level structure:

{
  "schema_version": 1,
  "recording_id": "YYMMDDHHMMSSffffff",
  "created_at_hst": "2026-02-26T14:03:21.123456-10:00",
  "timezone": "Pacific/Honolulu",
  "source": "web",
  "request_id": "uuid-or-upstream-id",
  "request_ip": "198.51.100.23",
  "paragraph_id": 3,
  "paragraph_text_hash": "sha256:...",
  "native_exemplar": true,
  "audio": {
    "path": "/bucket/YYMMDDHHMMSSffffff.wav",
    "bytes": 482344,
    "content_type": "audio/wav",
    "sample_rate_hz": 16000,
    "channels": 1,
    "sample_width_bytes": 2
  },
  "analysis_summary": {
    "percent_correct": 71.43,
    "total_targets": 7,
    "scored_targets": 7,
    "missing_targets": 0,
    "unaligned_targets": 0
  },
  "targets": [
    {
      "token_index": 12,
      "word_display": "record",
      "word_norm": "record",
      "label": "N",
      "expected_stress": 1,
      "inferred_stress": 2,
      "status": "ok",
      "correct": false,
      "core_phones": {"syll1": "EH", "syll2": "AO"},
      "core_durations": {"syll1": 0.07, "syll2": 0.11},
      "duration_ratio": 0.636364,
      "duration_ratio_log": -0.451985,
      "deepgram_word_index": 13,
      "deepgram_confidence": 0.93,
      "deepgram_confidence_cubed": 0.804357,
      "feedback": "Shift stress to syllable 1 and lengthen that vowel."
    }
  ],
  "pipeline": {
    "asr_provider": "deepgram",
    "asr_model": "nova-2",
    "aligner": "pocketsphinx"
  }
}

When native_exemplar is true, recording_id and persisted filenames append e before the extension (for example, YYMMDDHHMMSSffffffe.wav and YYMMDDHHMMSSffffffe.json).

targets[*].duration_ratio is computed as syll1/syll2 when both values exist and syll2 > 0; otherwise it is null.

targets[*].duration_ratio_log is computed as ln((syll1 + 1e-4)/(syll2 + 1e-4)) when both values exist; otherwise it is null.
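These two rules can be expressed directly (a sketch; EPS is the 1e-4 smoothing constant from the formula above):

```python
import math

EPS = 1e-4  # smoothing constant used in the log-ratio formula

def duration_ratio(syll1, syll2):
    """syll1/syll2 when both values exist and syll2 > 0; otherwise None."""
    if syll1 is None or syll2 is None or syll2 <= 0:
        return None
    return syll1 / syll2

def duration_ratio_log(syll1, syll2):
    """ln((syll1 + EPS) / (syll2 + EPS)) when both values exist; otherwise None."""
    if syll1 is None or syll2 is None:
        return None
    return math.log((syll1 + EPS) / (syll2 + EPS))
```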

The sidecar JSON is written with an atomic temp-file-and-rename pattern (.json.tmp then os.replace) to reduce partial-write risk.
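The write pattern can be sketched as follows (function name is illustrative; the fsync call is an assumption beyond what the README states):

```python
import json
import os

def write_sidecar_atomic(path: str, payload: dict) -> None:
    """Write JSON to <path>.tmp, then os.replace() it into place, so readers
    never observe a partially written sidecar."""
    tmp = path + ".tmp"  # e.g. foo.json -> foo.json.tmp
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())  # push bytes to disk before the rename
    os.replace(tmp, path)  # atomic within the same filesystem on POSIX
```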

Adaptive thresholding from native exemplar sidecars

The inference pipeline can learn stress decision thresholds from previously persisted sidecars in $BUCKET_DIR/*.json, reading all matching sidecars fresh for each analysis request.

  • Data source: all JSON sidecars in BUCKET_DIR (default /bucket).
  • Training filter: only samples where top-level native_exemplar is true, target status is "ok", expected_stress is 1 or 2, and target syllable durations are usable.
  • Feature: duration_ratio_log = ln((syll1 + 1e-4)/(syll2 + 1e-4)).
  • Threshold method: for each key, fit class summaries in log-ratio space and compute a Gaussian intersection boundary using T = (μ₁σ₂ + μ₂σ₁) / (σ₁ + σ₂), where class 1 is expected stress 1 (noun pattern) and class 2 is expected stress 2 (verb pattern).
  • Keys:
    • context key: (word_norm, paragraph_id, token_index) when available,
    • fallback key: word_norm.
  • Minimum data guardrail: learned thresholds are used only when both classes have at least 3 exemplar samples, so each class has enough points for a valid standard deviation.
  • Boundary safety fallback: if σ₁ + σ₂ = 0 (identical spread), the threshold falls back to the midpoint T = (μ₁ + μ₂) / 2.
  • Decision fallback: if no usable learned threshold exists (or sidecar data is empty/corrupt), the app keeps the original heuristic: syll1 >= syll2 => stress 1 else stress 2.

Target output now includes debug fields:

  • decision_method ("learned_threshold" or "naive_duration" when inference succeeds)
  • duration_ratio_log
  • learned_threshold (nullable)
  • threshold_key (nullable)
  • threshold_stats (nullable object with mu1, mu2, sigma1, sigma2, class counts, and effective threshold)
  • decision_confidence (nullable, 0.0-100.0, only when decision_method is "learned_threshold")

A2A JSON-RPC quickstart (paragraph 3 WAV fixture)

  1. Read the discoverable agent card:
curl -s "$BASE_URL/.well-known/agent.json" | jq .
  2. Build base64 payload from the paragraph 3 WAV fixture:
AUDIO_B64=$(python - <<'PY'
import base64
from pathlib import Path
print(base64.b64encode(Path("tests/abc340c7-fc39-41f0-b1a6-3557f83b7707.wav").read_bytes()).decode())
PY
)
  3. Discover paragraph count before selecting ids:
curl -s -X POST "$BASE_URL/a2a" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"p-count","method":"paragraphs.count","params":{}}' | jq .
  4. Fetch plain paragraph text (unannotated) for the selected id:
curl -s -X POST "$BASE_URL/a2a" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"p-text-3","method":"paragraphs.get_text","params":{"paragraph_id":3}}' | jq .
  5. Submit pronunciation.evaluate as an A2A client:
jq -n --arg audio "$AUDIO_B64" --arg dg "dg_live_xxx" '{jsonrpc:"2.0",id:"p3-a2a-demo",method:"pronunciation.evaluate",params:{paragraph_id:3,audio_wav_base64:$audio,deepgram_api_key:$dg}}' \
| curl -s -X POST "$BASE_URL/a2a" \
    -H "Content-Type: application/json" \
    -d @- | jq .

Deepgram API key resolution order

For requests that need transcription, the app resolves the Deepgram key in this order:

  1. deepgram_api_key A2A method parameter (if provided and non-empty)
  2. deepgram_api_key browser cookie (if provided and non-empty)
  3. DEEPGRAM_API_KEY environment variable

Server logs include only the source (a2a_param, cookie, or env) and never print key values.
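The resolution order can be sketched as a small helper (function name is illustrative):

```python
import os

def resolve_deepgram_key(a2a_param=None, cookie=None):
    """Return (key, source) following the documented precedence.

    Only the source name should ever be logged, never the key itself.
    """
    candidates = (
        (a2a_param, "a2a_param"),
        (cookie, "cookie"),
        (os.environ.get("DEEPGRAM_API_KEY"), "env"),
    )
    for key, source in candidates:
        if key:  # skip None and empty strings
            return key, source
    return None, None
```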

Manual validation flow

  1. Select a paragraph.
  2. Click Start Recording, read paragraph, click Stop Recording.
  3. Optionally check native exemplar if this submission should be flagged as an exemplar candidate.
  4. Click Submit for Analysis.
  5. Verify:
    • confidence-colored word backgrounds,
    • prominent red boxes around incorrect stress targets,
    • dashed orange boxes for missing/unaligned targets,
    • per-target table populated,
    • Developer/A2A docs visible on same page,
    • matching {recording_id}.wav and {recording_id}.json files appear in BUCKET_DIR ({recording_id} ends with e for native exemplar submissions).

Tests

Run unit tests:

pytest -q

Included tests cover:

  • paragraph parsing and target extraction,
  • sequence alignment behavior,
  • confidence-cubed and deterministic background normalization,
  • persistence of sidecar and WAV files and schema fields using the paragraph 3 test WAV fixture.

Fixture-based visual QA

To quickly verify colored confidence rendering and the target table without recording live audio, submit the bundled paragraph 3 WAV fixture and then view the Results section in the browser:

curl -sS -X POST http://127.0.0.1:8080/api/analyze \
  -F paragraph_id=3 \
  -F native_exemplar=false \
  -F audio_wav=@tests/abc340c7-fc39-41f0-b1a6-3557f83b7707.wav

This uses the same analysis pipeline as normal uploads (including Deepgram transcription and adaptive thresholding), so it is suitable for screenshot-based regression checks.

Cloud Run notes

  • Provide DEEPGRAM_API_KEY via the service environment configuration. Keep secrets out of source control.
  • The Procfile for Google Cloud Run runs the Flask app from app.py as app:app.
  • Validate locally with devserver.sh before deployment.

Idea for multi-agent orchestration: WhatsApp Voice Message Bot via n8n

This section describes how to connect the pronunciation evaluator to WhatsApp so users can receive a practice paragraph, record a voice message reply, and get evaluation feedback — all within WhatsApp.

Architecture overview

A lightweight n8n workflow acts as the orchestrator:

  1. A user sends any text message to your WhatsApp number → n8n sends back a random paragraph (e.g., "Please read paragraph 3 aloud and reply with a voice message: ...")
  2. The user replies with a WhatsApp voice message → n8n downloads the OGG/Opus audio
  3. n8n calls a small audio conversion helper to produce a 16 kHz mono WAV
  4. n8n base64-encodes the WAV and POSTs it to this app's pronunciation.evaluate A2A endpoint with the paragraph_id parsed from the outbound message text
  5. n8n formats the JSON result into a plain-language summary and sends it back to the user

No server-side session state is required: the paragraph number is embedded in the text message sent to the user and parsed from the Meta webhook payload's conversation context when the voice reply arrives.
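Recovering the paragraph number from the earlier outbound message could look like this (a sketch; the exact prompt wording it matches is the example shown above):

```python
import re

def parse_paragraph_id(outbound_text: str):
    """Recover the paragraph number embedded in the prompt sent to the user."""
    m = re.search(r"paragraph\s+(\d+)", outbound_text, re.IGNORECASE)
    return int(m.group(1)) if m else None
```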

Prerequisites

Meta developer account and WhatsApp Business Cloud API

WhatsApp's API is gated by Meta regardless of what orchestration layer you use. You will need:

  • A Meta developer account at developers.facebook.com
  • A Meta Business portfolio (formerly Business Manager), verified with a real business or individual identity
  • A WhatsApp Business Cloud API app created in the Meta developer console, with a verified phone number attached

Meta's verification process typically takes 1–3 business days for individual/small business accounts, though it can be faster. The WhatsApp Cloud API itself is free at low message volumes (Meta's pricing is per-conversation, with a free tier for developer testing).

n8n's WhatsApp Business Cloud trigger node registers your webhook with Meta automatically using an OAuth token — you do not need to copy webhook URLs manually — but the underlying Meta account and app setup is still required.

n8n

Either:

  • Self-hosted (recommended for this use case): docker compose up with the official n8n Docker image. Self-hosting gives you shell access for audio conversion and no workflow execution limits. Running n8n on Cloud Run or a small GCP VM keeps everything in the same GCP environment as the main app.
  • n8n Cloud: easier to start but restricts arbitrary command execution, which affects audio conversion (see below).

Audio format conversion

WhatsApp voice messages are delivered as OGG/Opus files. The pronunciation.evaluate endpoint requires 16 kHz mono WAV. Convert with ffmpeg:

ffmpeg -i input.ogg -ar 16000 -ac 1 -sample_fmt s16 output.wav

On self-hosted n8n you can run this in an Execute Command node. On n8n Cloud, or to keep the conversion cleanly separable, deploy a minimal Cloud Run function that accepts a POST with the OGG bytes and returns WAV bytes. This converter is itself a small A2A-compatible service if you want to expose it as a discoverable agent capability.

n8n workflow outline

  1. WhatsApp Trigger node — fires on any incoming WhatsApp message to your business number
  2. IF node — branch on message type:
    • Text message → pick a random paragraph_id, fetch its text via paragraphs.get_text, send back "Please read paragraph {N} aloud and reply with a voice message: {text}"
    • Voice message → proceed to conversion and evaluation
  3. HTTP Request node — download the OGG audio from the Media URL in the webhook payload
  4. Audio conversion — Execute Command node (self-hosted) or HTTP Request to your conversion microservice
  5. Code node — base64-encode the WAV bytes; parse the paragraph_id from the last outbound message text in the webhook context
  6. HTTP Request node — POST to /a2a:
   {
     "jsonrpc": "2.0",
     "id": "whatsapp-eval",
     "method": "pronunciation.evaluate",
     "params": {
       "paragraph_id": 3,
       "audio_wav_base64": "<base64>"
     }
   }
  7. Code node — format result: extract analysis_summary.percent_correct and per-target feedback strings
  8. WhatsApp Business Cloud send node — reply with the formatted summary
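The final formatting step could be sketched as a small helper (field names follow the sidecar schema documented above; the message wording is illustrative):

```python
def format_summary(result: dict) -> str:
    """Render the evaluation JSON as a plain-language WhatsApp reply."""
    summary = result["analysis_summary"]
    targets = result["targets"]
    n_correct = sum(1 for t in targets if t.get("correct"))
    lines = [
        f"Results for paragraph {result['paragraph_id']}: "
        f"{n_correct}/{summary['total_targets']} correct "
        f"({round(summary['percent_correct'])}%)"
    ]
    for t in targets:
        label = "noun" if t["label"] == "N" else "verb"
        if t.get("correct"):
            lines.append(f"✓ {t['word_display']} ({label}) — stress correct")
        else:
            lines.append(f"✗ {t['word_display']} ({label}) — {t.get('feedback', '')}")
    return "\n".join(lines)
```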

Example summary message

Results for paragraph 3: 5/7 correct (71%)
✓ record (verb) — stress correct
✓ permit (verb) — stress correct
✗ project (noun) — should be PRO-ject; you stressed pro-JECT
✗ object (noun) — should be OB-ject; you stressed ob-JECT
✓ present (verb) — stress correct
✓ conduct (noun) — stress correct
✗ increase (noun) — should be IN-crease; you stressed in-CREASE

Resources