
jsalsman/guildaidemo


Syllable Stress Assessment Agent

Python 3.13 · Flask 3.1 · A2A Compatible · MIT License

Single-page EFL pronunciation evaluator for noun/verb stress-shift pairs, with A2A-compatible discovery + JSON-RPC endpoints.

Try it: https://guildaidemo.talknicer.com

Features

  • Paragraph selector for all annotated paragraphs loaded from PARAGRAPHS.txt (currently 10, but dynamic).
  • Browser microphone recording with WAV encoding.
  • Optional "Bring Your Own Deepgram API Key" UI section that stores a user-supplied key in a deepgram_api_key browser cookie (365-day expiry).
  • "native exemplar" checkbox in the UI to mark exemplar-candidate submissions for later review.
  • Deepgram transcription with per-word confidence.
  • Inexact token alignment (Needleman–Wunsch style) between prompt and recognized words.
  • Word-level stress inference from PocketSphinx phoneme alignment and pronunciation overlap scoring.
  • Confidence visualization based on confidence_cubed = confidence ** 3 as background color.
  • A2A-compatible remote agent interface (Agent Card discovery + JSON-RPC endpoint) so other agents/platforms can call it.
    • GET /.well-known/agent.json
    • POST /a2a (agent.about, paragraphs.count, paragraphs.get_text, pronunciation.evaluate with optional deepgram_api_key)
  • Production-minded observability: request/trace IDs propagated through requests, responses, and logs for run correlation.
  • Health endpoint (/healthz) to support deployment/monitoring and “is it alive?” checks.
  • Structured, machine-consumable outputs (clear JSON schema for words, alignments, target evaluations, and summary metrics).
  • "Control-plane friendly" service boundaries: stable HTTP endpoints + discoverable capabilities designed to plug into orchestration/routing.
  • UI as an operator surface: single-page workflow that makes the agent capability usable and demoable (not just a script).
  • Built-in developer visibility: the app page documents the agent endpoints and example calls to ease integration and adoption.
  • Synchronous persistence of every analyzed submission as two files in a mounted bucket directory.
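The confidence-cubed mapping behind the visualization bullet is simple enough to sketch directly. The cube itself is documented; the specific color ramp below is an illustrative assumption, not the app's actual CSS:

```python
def confidence_cubed(confidence: float) -> float:
    """Cubing widens the visual gap between high and mid ASR confidence."""
    return confidence ** 3

def confidence_background(confidence: float) -> str:
    # Hypothetical color ramp: a green tint that fades to white as
    # confidence drops. The real app's exact color mapping may differ.
    level = int(255 * (1 - confidence_cubed(confidence)))
    return f"rgb({level}, 255, {level})"
```

A word with Deepgram confidence 0.93 maps to a cubed value of about 0.80, so mid-confidence words visibly fade while near-certain words stay saturated.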

Local setup

  1. Optionally set DEEPGRAM_API_KEY as a shared fallback key for server-side transcription. The app now prefers a non-empty deepgram_api_key cookie when present and falls back to this environment variable when needed.

  2. (Optional, but recommended for persistence testing) set a custom bucket mount directory:

export BUCKET_DIR=/bucket
  3. Run the dev server (creates .venv, installs deps, minifies assets, starts Flask):
./devserver.sh
  4. Open:
http://localhost:8080

API endpoints

  • GET /api/paragraphs
  • POST /api/analyze (multipart/form-data: paragraph_id, audio_wav, optional native_exemplar)
  • GET /healthz
  • GET /api/healthz (alias for environments that only route /api/*)
  • GET /.well-known/agent.json
  • POST /a2a

Agent card discovery advertises method-level capabilities for agent.about, paragraphs.count, paragraphs.get_text, and pronunciation.evaluate, including required vs optional params and the JSON-RPC endpoint URL.

Persistence behavior (/api/analyze and /a2a)

For each successful analysis, the app writes two files synchronously into BUCKET_DIR: {recording_id}.wav (original uploaded WAV bytes) and {recording_id}.json (analysis sidecar). The recording_id is a microsecond-resolution timestamp string. For native exemplar uploads (native_exemplar=true), the timestamp is suffixed with e before the file extension so pairs are easier to spot in bucket listings. Example:

/bucket/260226140321123456.wav
/bucket/260226140321123456.json
/bucket/260226140321123456e.wav   # native exemplar
/bucket/260226140321123456e.json  # native exemplar
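A minimal sketch of how such IDs can be produced, assuming Python's zoneinfo and the Pacific/Honolulu timezone recorded in the sidecar (the helper name is illustrative):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def make_recording_id(native_exemplar: bool = False) -> str:
    """Microsecond-resolution timestamp ID; 'e' suffix marks native exemplars."""
    now = datetime.now(ZoneInfo("Pacific/Honolulu"))
    rid = now.strftime("%y%m%d%H%M%S%f")  # YYMMDDHHMMSSffffff
    return rid + "e" if native_exemplar else rid
```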

Recording JSON schema

Each sidecar file contains the following top-level structure:

{
  "schema_version": 1,
  "recording_id": "YYMMDDHHMMSSffffff",
  "created_at_hst": "2026-02-26T14:03:21.123456-10:00",
  "timezone": "Pacific/Honolulu",
  "source": "web",
  "request_id": "uuid-or-upstream-id",
  "request_ip": "198.51.100.23",
  "paragraph_id": 3,
  "paragraph_text_hash": "sha256:...",
  "native_exemplar": true,
  "audio": {
    "path": "/bucket/YYMMDDHHMMSSffffff.wav",
    "bytes": 482344,
    "content_type": "audio/wav",
    "sample_rate_hz": 16000,
    "channels": 1,
    "sample_width_bytes": 2
  },
  "analysis_summary": {
    "percent_correct": 71.43,
    "total_targets": 7,
    "scored_targets": 7,
    "missing_targets": 0,
    "unaligned_targets": 0
  },
  "targets": [
    {
      "token_index": 12,
      "word_display": "record",
      "word_norm": "record",
      "label": "N",
      "expected_stress": 1,
      "inferred_stress": 2,
      "status": "ok",
      "correct": false,
      "core_phones": {"syll1": "EH", "syll2": "AO"},
      "core_durations": {"syll1": 0.07, "syll2": 0.11},
      "duration_ratio": 0.636364,
      "duration_ratio_log": -0.451985,
      "deepgram_word_index": 13,
      "deepgram_confidence": 0.93,
      "deepgram_confidence_cubed": 0.804357,
      "feedback": "Shift stress to syllable 1 and lengthen that vowel."
    }
  ],
  "pipeline": {
    "asr_provider": "deepgram",
    "asr_model": "nova-2",
    "aligner": "pocketsphinx"
  }
}

When native_exemplar is true, recording_id and persisted filenames append e before the extension (for example, YYMMDDHHMMSSffffffe.wav and YYMMDDHHMMSSffffffe.json).

targets[*].duration_ratio is computed as syll1/syll2 when both values exist and syll2 > 0; otherwise it is null.

targets[*].duration_ratio_log is computed as ln((syll1 + 1e-4)/(syll2 + 1e-4)) when both values exist; otherwise it is null.
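These two rules can be expressed directly (a sketch; EPS is the 1e-4 smoothing constant from the formula above):

```python
import math

EPS = 1e-4  # smoothing constant used in the log-ratio formula

def duration_ratio(syll1, syll2):
    """syll1/syll2 when both values exist and syll2 > 0; otherwise None."""
    if syll1 is None or syll2 is None or syll2 <= 0:
        return None
    return syll1 / syll2

def duration_ratio_log(syll1, syll2):
    """ln((syll1 + EPS) / (syll2 + EPS)) when both values exist; otherwise None."""
    if syll1 is None or syll2 is None:
        return None
    return math.log((syll1 + EPS) / (syll2 + EPS))
```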

The sidecar JSON is written with an atomic temp-file-and-rename pattern (.json.tmp then os.replace) to reduce partial-write risk.
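The write pattern can be sketched as follows (function name is illustrative; the fsync call is an assumption beyond what the README states):

```python
import json
import os

def write_sidecar_atomic(path: str, payload: dict) -> None:
    """Write JSON to <path>.tmp, then os.replace() it into place, so readers
    never observe a partially written sidecar."""
    tmp = path + ".tmp"  # e.g. foo.json -> foo.json.tmp
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())  # push bytes to disk before the rename
    os.replace(tmp, path)  # atomic within the same filesystem on POSIX
```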

Adaptive thresholding from native exemplar sidecars

The inference pipeline can learn stress decision thresholds from previously persisted sidecars in $BUCKET_DIR/*.json, reading all matching sidecars fresh for each analysis request.

  • Data source: all JSON sidecars in BUCKET_DIR (default /bucket).
  • Training filter: only samples where top-level native_exemplar is true, target status is "ok", expected_stress is 1 or 2, and target syllable durations are usable.
  • Feature: duration_ratio_log = ln((syll1 + 1e-4)/(syll2 + 1e-4)).
  • Threshold method: for each key, fit class summaries in log-ratio space and compute a Gaussian intersection boundary using T = (μ₁σ₂ + μ₂σ₁) / (σ₁ + σ₂), where class 1 is expected stress 1 (noun pattern) and class 2 is expected stress 2 (verb pattern).
  • Keys:
    • context key: (word_norm, paragraph_id, token_index) when available,
    • fallback key: word_norm.
  • Minimum data guardrail: learned thresholds are used only when both classes have at least 3 exemplar samples, so each class has enough points for a valid standard deviation.
  • Boundary safety fallback: if σ₁ + σ₂ = 0 (identical spread), the threshold falls back to the midpoint T = (μ₁ + μ₂) / 2.
  • Decision fallback: if no usable learned threshold exists (or sidecar data is empty/corrupt), the app keeps the original heuristic: syll1 >= syll2 => stress 1 else stress 2.

Target output now includes debug fields:

  • decision_method ("learned_threshold" or "naive_duration" when inference succeeds)
  • duration_ratio_log
  • learned_threshold (nullable)
  • threshold_key (nullable)
  • threshold_stats (nullable object with mu1, mu2, sigma1, sigma2, class counts, and effective threshold)
  • decision_confidence (nullable, 0.0-100.0, only when decision_method is "learned_threshold")

A2A JSON-RPC quickstart (paragraph 3 WAV fixture)

  1. Read the discoverable agent card:
curl -s "$BASE_URL/.well-known/agent.json" | jq .
  2. Build base64 payload from the paragraph 3 WAV fixture:
AUDIO_B64=$(python - <<'PY'
import base64
from pathlib import Path
print(base64.b64encode(Path("tests/abc340c7-fc39-41f0-b1a6-3557f83b7707.wav").read_bytes()).decode())
PY
)
  3. Discover paragraph count before selecting ids:
curl -s -X POST "$BASE_URL/a2a" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"p-count","method":"paragraphs.count","params":{}}' | jq .
  4. Fetch plain paragraph text (unannotated) for the selected id:
curl -s -X POST "$BASE_URL/a2a" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":"p-text-3","method":"paragraphs.get_text","params":{"paragraph_id":3}}' | jq .
  5. Submit pronunciation.evaluate as an A2A client:
jq -n --arg audio "$AUDIO_B64" --arg dg "dg_live_xxx" '{jsonrpc:"2.0",id:"p3-a2a-demo",method:"pronunciation.evaluate",params:{paragraph_id:3,audio_wav_base64:$audio,deepgram_api_key:$dg}}' \
| curl -s -X POST "$BASE_URL/a2a" \
    -H "Content-Type: application/json" \
    -d @- | jq .

Deepgram API key resolution order

For requests that need transcription, the app resolves the Deepgram key in this order:

  1. deepgram_api_key A2A method parameter (if provided and non-empty)
  2. deepgram_api_key browser cookie (if provided and non-empty)
  3. DEEPGRAM_API_KEY environment variable

Server logs include only the source (a2a_param, cookie, or env) and never print key values.
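The resolution order can be sketched as a small helper (function name is illustrative):

```python
import os

def resolve_deepgram_key(a2a_param=None, cookie=None):
    """Return (key, source) following the documented precedence.

    Only the source name should ever be logged, never the key itself.
    """
    candidates = (
        (a2a_param, "a2a_param"),
        (cookie, "cookie"),
        (os.environ.get("DEEPGRAM_API_KEY"), "env"),
    )
    for key, source in candidates:
        if key:  # skip None and empty strings
            return key, source
    return None, None
```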

Manual validation flow

  1. Select a paragraph.
  2. Click Start Recording, read paragraph, click Stop Recording.
  3. Optionally check native exemplar if this submission should be flagged as an exemplar candidate.
  4. Click Submit for Analysis.
  5. Verify:
    • confidence-colored word backgrounds,
    • prominent red boxes around incorrect stress targets,
    • dashed orange boxes for missing/unaligned targets,
    • per-target table populated,
    • Developer/A2A docs visible on same page,
    • matching {recording_id}.wav and {recording_id}.json files appear in BUCKET_DIR ({recording_id} ends with e for native exemplar submissions).

Tests

Run unit tests:

pytest -q

Included tests cover:

  • paragraph parsing and target extraction,
  • sequence alignment behavior,
  • confidence-cubed and deterministic background normalization,
  • persistence of sidecar and WAV files and schema fields using the paragraph 3 test WAV fixture.

Fixture-based visual QA

To quickly verify colored confidence rendering and the target table without recording live audio, submit the bundled paragraph 3 WAV fixture and then view the Results section in the browser:

curl -sS -X POST http://127.0.0.1:8080/api/analyze \
  -F paragraph_id=3 \
  -F native_exemplar=false \
  -F audio_wav=@tests/abc340c7-fc39-41f0-b1a6-3557f83b7707.wav

This uses the same analysis pipeline as normal uploads (including Deepgram transcription and adaptive thresholding), so it is suitable for screenshot-based regression checks.

Cloud Run notes

  • Provide DEEPGRAM_API_KEY via the service environment configuration. Keep secrets out of source control.
  • The Procfile for Google Cloud Run runs the Flask app from app.py as app:app.
  • Validate locally with devserver.sh before deployment.

Idea for multi-agent orchestration: WhatsApp Voice Message Bot via n8n

This section describes how to connect the pronunciation evaluator to WhatsApp so users can receive a practice paragraph, record a voice message reply, and get evaluation feedback — all within WhatsApp.

Architecture overview

A lightweight n8n workflow acts as the orchestrator:

  1. A user sends any text message to your WhatsApp number → n8n sends back a random paragraph (e.g., "Please read paragraph 3 aloud and reply with a voice message: ...")
  2. The user replies with a WhatsApp voice message → n8n downloads the OGG/Opus audio
  3. n8n calls a small audio conversion helper to produce a 16 kHz mono WAV
  4. n8n base64-encodes the WAV and POSTs it to this app's pronunciation.evaluate A2A endpoint with the paragraph_id parsed from the outbound message text
  5. n8n formats the JSON result into a plain-language summary and sends it back to the user

No server-side session state is required: the paragraph number is embedded in the text message sent to the user and parsed from the Meta webhook payload's conversation context when the voice reply arrives.
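Recovering the paragraph number from the earlier outbound message could look like this (a sketch; the exact prompt wording it matches is the example shown above):

```python
import re

def parse_paragraph_id(outbound_text: str):
    """Recover the paragraph number embedded in the prompt sent to the user."""
    m = re.search(r"paragraph\s+(\d+)", outbound_text, re.IGNORECASE)
    return int(m.group(1)) if m else None
```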

Prerequisites

Meta developer account and WhatsApp Business Cloud API

WhatsApp's API is gated by Meta regardless of what orchestration layer you use. You will need:

  • A Meta developer account at developers.facebook.com
  • A Meta Business portfolio (formerly Business Manager), verified with a real business or individual identity
  • A WhatsApp Business Cloud API app created in the Meta developer console, with a verified phone number attached

Meta's verification process typically takes 1–3 business days for individual/small business accounts, though it can be faster. The WhatsApp Cloud API itself is free at low message volumes (Meta's pricing is per-conversation, with a free tier for developer testing).

n8n's WhatsApp Business Cloud trigger node registers your webhook with Meta automatically using an OAuth token — you do not need to copy webhook URLs manually — but the underlying Meta account and app setup is still required.

n8n

Either:

  • Self-hosted (recommended for this use case): docker compose up with the official n8n Docker image. Self-hosting gives you shell access for audio conversion and no workflow execution limits. Running n8n on Cloud Run or a small GCP VM keeps everything in the same GCP environment as the main app.
  • n8n Cloud: easier to start but restricts arbitrary command execution, which affects audio conversion (see below).

Audio format conversion

WhatsApp voice messages are delivered as OGG/Opus files. The pronunciation.evaluate endpoint requires 16 kHz mono WAV. Convert with ffmpeg:

ffmpeg -i input.ogg -ar 16000 -ac 1 -sample_fmt s16 output.wav

On self-hosted n8n you can run this in an Execute Command node. On n8n Cloud, or to keep the conversion cleanly separable, deploy a minimal Cloud Run function that accepts a POST with the OGG bytes and returns WAV bytes. This converter is itself a small A2A-compatible service if you want to expose it as a discoverable agent capability.

n8n workflow outline

  1. WhatsApp Trigger node — fires on any incoming WhatsApp message to your business number
  2. IF node — branch on message type:
    • Text message → pick a random paragraph_id, fetch its text via paragraphs.get_text, send back "Please read paragraph {N} aloud and reply with a voice message: {text}"
    • Voice message → proceed to conversion and evaluation
  3. HTTP Request node — download the OGG audio from the Media URL in the webhook payload
  4. Audio conversion — Execute Command node (self-hosted) or HTTP Request to your conversion microservice
  5. Code node — base64-encode the WAV bytes; parse the paragraph_id from the last outbound message text in the webhook context
  6. HTTP Request node — POST to /a2a:
   {
     "jsonrpc": "2.0",
     "id": "whatsapp-eval",
     "method": "pronunciation.evaluate",
     "params": {
       "paragraph_id": 3,
       "audio_wav_base64": "<base64>"
     }
   }
  7. Code node — format result: extract analysis_summary.percent_correct and per-target feedback strings
  8. WhatsApp Business Cloud send node — reply with the formatted summary
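The final formatting step could be sketched as a small helper (field names follow the sidecar schema documented above; the message wording is illustrative):

```python
def format_summary(result: dict) -> str:
    """Render the evaluation JSON as a plain-language WhatsApp reply."""
    summary = result["analysis_summary"]
    targets = result["targets"]
    n_correct = sum(1 for t in targets if t.get("correct"))
    lines = [
        f"Results for paragraph {result['paragraph_id']}: "
        f"{n_correct}/{summary['total_targets']} correct "
        f"({round(summary['percent_correct'])}%)"
    ]
    for t in targets:
        label = "noun" if t["label"] == "N" else "verb"
        if t.get("correct"):
            lines.append(f"✓ {t['word_display']} ({label}) — stress correct")
        else:
            lines.append(f"✗ {t['word_display']} ({label}) — {t.get('feedback', '')}")
    return "\n".join(lines)
```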

Example summary message

Results for paragraph 3: 5/7 correct (71%)
✓ record (verb) — stress correct
✓ permit (verb) — stress correct
✗ project (noun) — should be PRO-ject; you stressed pro-JECT
✗ object (noun) — should be OB-ject; you stressed ob-JECT
✓ present (verb) — stress correct
✓ conduct (noun) — stress correct
✗ increase (noun) — should be IN-crease; you stressed in-CREASE

Resources