A follow-along and improvement project based on the AI-Engineer-Skool/local-ai-transcript-app tutorial. The goal is to learn about speech-to-text (STT) and text-to-speech (TTS) models, experiment with different backends, and evolve the app for lower latency and better performance.
- Upstream repo & tutorial: github.com/AI-Engineer-Skool/local-ai-transcript-app
- 📺 Video walkthrough: YouTube – project structure and API details
This fork starts from the vanilla stack: Whisper for STT and an LLM for transcript cleaning, with a Streamlit frontend and FastAPI backend. A latency tracker shows Whisper, LLM, and total pipeline times. The roadmap includes expanding model selection (e.g. alternative STT/TTS engines) to reduce latency and improve performance while keeping the app usable locally.
Record or upload audio, then run the full pipeline to get the cleaned transcript and latency metrics.
- Initial setup: STT via Whisper (English, runs locally).
- Planned improvements: Broaden model choices for both STT and (optionally) TTS to compare latency, accuracy, and resource use; tune for faster transcription and better quality.
Features:
- 🎤 Record or upload — Browser-based voice recording or drag-and-drop upload (WAV, MP3, M4A, OGG; limit 200MB per file)
- 🔊 English Whisper speech-to-text (runs locally)
- 🤖 LLM cleaning (removes filler words, fixes errors)
- ⏱️ Latency tracker — Whisper time, LLM time, and total pipeline time
- 📊 Benchmark script — Repeatable latency measurements (STT/LLM/total) over sample audio with CSV output
- 🔌 OpenAI API-compatible (works with Ollama, LM Studio, OpenAI, or any OpenAI-compatible API)
Recent frontend updates:
- Header updated to "STT + LLM Cleaning" with the subtitle "Full transcription pipeline with real-time latency metrics".
- Input Audio section with two options side by side: Record (microphone) and Upload (file picker or drag-and-drop).
- If both recorded and uploaded audio are present, the app uses the recording and shows a short notice; otherwise it uses whichever source is available.
- Single Process Audio → action for either input; results show cleaned transcript and overall latency.
- Prerequisites: Docker Desktop, VS Code, Dev Containers extension.
- Open in container: In VS Code, use "Reopen in Container" (or `Cmd/Ctrl+Shift+P` → "Dev Containers: Reopen in Container").
- Wait ~5–10 minutes for the build and the Ollama model download. The devcontainer creates `backend/.env` and starts Ollama automatically.
Then go to How to run the app.
The app runs inside the dev container. Use Docker Compose to start the environment, then run backend and frontend in two terminals (both inside the container).
Step 1 — Start the environment
From the project root (e.g. repo root on your machine):
```bash
docker compose up -d
```

Step 2 — Enter the container

```bash
docker compose exec app bash
```

Step 3 — Start the backend (first terminal, inside container)

```bash
cd workspaces/ai-transcript-app/backend
uv sync
uv run uvicorn app:app --reload --host 0.0.0.0 --port 8000
```

Wait until you see ✅ Ready! (Whisper and LLM are loaded).
- API docs: http://localhost:8000/docs
- Health check: http://localhost:8000/api/status
Step 4 — Start the frontend (new terminal, inside container)
Open a new terminal, then:
```bash
docker compose exec app bash
cd workspaces/ai-transcript-app/frontend
export HOME=/tmp
uv run streamlit run streamlit_app.py --server.port 8501 --server.address 0.0.0.0
```

- Frontend app: http://localhost:8501
Step 5 — Use the app
Open http://localhost:8501 in your browser. Use Record or Upload to provide audio, click Process Audio →, and see the raw transcript, cleaned transcript, and latency breakdown (Whisper, LLM, total time). No extra env vars needed inside the container.
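You can also call the pipeline from a script instead of the Streamlit UI. The endpoint path below matches the benchmark script's default (`/api/full`); the raw-bytes request body, content type, and response field names are assumptions — check http://localhost:8000/docs for the actual schema.

```python
import json
from urllib import request

BACKEND_URL = "http://localhost:8000/api/full"  # benchmark script's default endpoint

def format_latency(result: dict) -> str:
    """Render the latency breakdown from a pipeline response."""
    return (f"whisper {result['transcription_time']:.2f}s | "
            f"llm {result['llm_time']:.2f}s | "
            f"total {result['total_time']:.2f}s")

def post_wav(path: str) -> dict:
    """POST raw WAV bytes to the backend and parse the JSON response.

    The raw body and 'audio/wav' content type are assumptions; the real
    endpoint may expect multipart/form-data instead (see /docs).
    """
    with open(path, "rb") as f:
        data = f.read()
    req = request.Request(BACKEND_URL, data=data,
                          headers={"Content-Type": "audio/wav"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (with the backend running):
#   result = post_wav("benchmark/audio/sample_a.wav")
#   print(result.get("cleaned_transcript"))
#   print(format_latency(result))
```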
This app is compatible with any OpenAI API-format LLM provider:
- Ollama (default - works out of the box in devcontainer)
- LM Studio (local alternative)
- OpenAI API (cloud-based)
- Any other OpenAI-compatible API
The devcontainer automatically creates backend/.env with working Ollama defaults. No configuration needed to get started.
To use a different provider, edit backend/.env:
- `LLM_BASE_URL` – API endpoint
- `LLM_API_KEY` – API key
- `LLM_MODEL` – Model name
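For example, a hypothetical `backend/.env` pointing at Ollama's OpenAI-compatible endpoint (values are illustrative — match them to your provider and model):

```env
# Ollama (local, default in the devcontainer)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_KEY=ollama          # Ollama ignores the key, but OpenAI clients require one
LLM_MODEL=gemma:2b

# OpenAI (cloud) — use values like these instead
# LLM_BASE_URL=https://api.openai.com/v1
# LLM_API_KEY=sk-...
# LLM_MODEL=gpt-4o-mini
```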
This repository currently includes a development container configuration only.
- Copy the example env file and fill in your values:

  ```bash
  cp backend/.env.example backend/.env
  ```

- Start the dev container:

  ```bash
  docker compose -f .devcontainer/docker-compose.yml up --build
  ```
- FastAPI: http://localhost:8000
- Streamlit: http://localhost:8501
Requires Docker & Docker Compose. VS Code + Dev Containers extension recommended.
To make this production-ready, the following changes would be needed:
- Dockerfile — add a `production` build stage, create a named non-root user, add an `entrypoint.sh` to start services
- docker-compose.yml — remove the workspace volume mount (bake code into the image), use `.env` instead of `.env.example`, add `restart: unless-stopped`, add an Ollama healthcheck with `condition: service_healthy`, remove the hardcoded dev UID
- Dependencies — run `uv sync --no-dev` in the prod stage to exclude dev dependencies
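A minimal sketch of what the Dockerfile changes could look like (the base image, stage name, paths, and `entrypoint.sh` are assumptions, not the repo's actual files):

```dockerfile
# --- production stage (sketch; adjust base image and paths to the repo) ---
FROM python:3.12-slim AS production

# Named non-root user instead of the dev container's hardcoded UID
RUN useradd --create-home appuser
WORKDIR /app

# Bake the code into the image instead of mounting the workspace
COPY backend/ ./backend/
COPY entrypoint.sh ./

# Install runtime dependencies only
RUN pip install uv && cd backend && uv sync --no-dev

USER appuser
ENTRYPOINT ["./entrypoint.sh"]
```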
A benchmark script measures STT (Whisper), LLM, and total pipeline latency using sample WAV files. It runs multiple passes (with one warmup run discarded), aggregates mean ± std, and writes results to a CSV.
Location: backend/benchmark/
Prerequisites: Backend must be running (see How to run the app).
Run from the backend directory (e.g. inside the dev container):
```bash
cd backend
uv run python benchmark/benchmark.py
```

Config (edit at the top of `backend/benchmark/benchmark.py`):

- `BACKEND_URL` – full-pipeline endpoint (default `http://localhost:8000/api/full`)
- `LLM_MODEL` / `WHISPER_MODEL` – match your `.env` and Whisper setup
- `AUDIO_FILES` – list of WAV files in `benchmark/audio/` (e.g. `sample_a.wav`, `sample_b.wav`, `sample_c.wav`)
- `N_RUNS` – number of runs per file (the first run is a warmup and excluded)
- `OUTPUT_CSV` – path for the results CSV (default: `benchmark/results_{LLM_MODEL}_{WHISPER_MODEL}.csv`)
Output: A CSV with columns such as `transcription_time_mean`, `llm_time_mean`, and `total_time_mean`, plus their standard deviations, per audio file and setup.
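The warmup-discarding aggregation the script performs can be sketched as follows (function names are illustrative, not the script's actual code):

```python
import statistics

def aggregate(times: list[float], warmup: int = 1) -> tuple[float, float]:
    """Mean and sample standard deviation of per-run timings,
    discarding the first `warmup` run(s), which are slowed by model loading."""
    kept = times[warmup:]
    return statistics.mean(kept), statistics.stdev(kept)

# e.g. four runs where the first (warmup) is slow due to model loading
mean, std = aggregate([10.0, 3.0, 4.0, 5.0])
```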
Benchmarked on 3 audio samples using different Whisper + LLM cleaning configurations.
⚠️ These tests were run locally.
Stronger/larger models were not evaluated due to local hardware constraints.
| Setup | Transcription (s) | LLM (s) | Total (s) | Wall Time (s) |
|---|---|---|---|---|
| whisper:base.en + llm:gemma:2b | 3.33 | 5.67 | 9.00 | 9.06 |
| whisper:base.en + llm:phi3:mini | 3.31 | 0.01 | 3.32 | 3.35 |
| whisper:small.en + llm:phi3:mini | 9.58 | 1.63 | 11.20 | 11.27 |
Values shown are mean runtime in seconds across 3 runs.
- LLM choice significantly impacts total runtime: `gemma:2b` adds substantial latency compared to `phi3:mini`.
- Whisper model size has the largest effect on transcription time: `small.en` is ~3× slower than `base.en`.
- Wall time closely matches total time, indicating minimal unmeasured overhead in the pipeline.
- `phi3:mini` introduces near-zero overhead in the base configuration, making it the fastest overall setup locally.
The extremely low LLM runtime observed with phi3:mini may indicate failed or incomplete execution due to local memory constraints. Further validation is required. For this reason, gemma:2b is retained as the default LLM in the current architecture for now.
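Pending that validation, one way to guard against a silently failed cleaning pass would be a fallback check like this (hypothetical — not in the current codebase):

```python
def validated_clean(raw: str, cleaned: str, min_ratio: float = 0.3) -> str:
    """Fall back to the raw transcript if the LLM output looks empty or
    implausibly short relative to the input (e.g. the model ran out of
    memory mid-generation). The 0.3 threshold is an arbitrary example."""
    if not cleaned.strip() or len(cleaned) < min_ratio * len(raw):
        return raw
    return cleaned
```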
- Tested locally (CPU-based inference)
- Focused on lightweight configurations suitable for on-device usage
Container won't start or is very slow:
Configure Docker Desktop resources:
- Open Docker Desktop → Settings → Resources
- Set CPUs to maximum available (8+ cores recommended)
- Set Memory to at least 16GB
- Click Apply & Restart
Expected specs: Modern laptop/desktop with 8+ CPU cores and 16GB RAM. More CPU = faster LLM responses.
Microphone not working:
- Use Chrome or Firefox (Safari may have issues)
- Check browser permissions: Settings → Privacy → Microphone
Backend fails to start:
- Check Whisper model downloads: `~/.cache/huggingface/`
- Ensure enough disk space (models are ~150MB)
LLM errors:
- Make sure Ollama service is running (it auto-starts with devcontainer)
- Check model is downloaded: Model downloads automatically during devcontainer setup
- Transcription still works without LLM (raw Whisper only)
LLM is slow:
- See "Container won't start or is very slow" section above for Docker resource configuration
- Fallback option: Switch to another model (edit `LLM_MODEL` in `backend/.env`). ⚠️ Trade-off: a 3b model is faster but significantly worse at cleaning transcripts
- Best alternative: Use a cloud API like OpenAI for instant responses with excellent quality (edit `.env`)
Cannot access localhost:8501 or localhost:8000 from host machine:
- Docker Desktop: Go to Settings → Resources → Network
- Enable "Use host networking" (may require Docker Desktop restart)
- Restart the frontend and backend servers
Port already in use:
- Backend: Change the port with `--port 8001` (and set `BACKEND_URL` in the Streamlit app if needed)
- Frontend: Use a different port, e.g. `uv run streamlit run streamlit_app.py --server.port 8502 --server.address 0.0.0.0`
