A follow-along and improvement project based on the AI-Engineer-Skool/local-ai-transcript-app tutorial. The goal is to learn about speech-to-text (STT) and text-to-speech (TTS) models, experiment with different backends, and evolve the app for lower latency and better performance.
- Upstream repo & tutorial: github.com/AI-Engineer-Skool/local-ai-transcript-app
- 📺 Video walkthrough: YouTube – project structure and API details
This fork starts from the vanilla stack: Whisper for STT and an LLM for transcript cleaning, with a Streamlit frontend and FastAPI backend. A latency tracker shows Whisper, LLM, and total pipeline times. The roadmap includes expanding model selection (e.g. alternative STT/TTS engines) to reduce latency and improve performance while keeping the app usable locally.
Record or upload audio, then run the full pipeline to get the cleaned transcript and latency metrics.
- Initial setup: STT via Whisper (English, runs locally).
- Planned improvements: Broaden model choices for both STT and (optionally) TTS to compare latency, accuracy, and resource use; tune for faster transcription and better quality.
Features:
- 🎤 Record or upload — Browser-based voice recording or drag-and-drop upload (WAV, MP3, M4A, OGG; limit 200MB per file)
- 🔊 English Whisper speech-to-text (runs locally)
- 🤖 LLM cleaning (removes filler words, fixes errors)
- ⏱️ Latency tracker — Whisper time, LLM time, and total pipeline time
- 📊 Benchmark script — Repeatable latency measurements (STT/LLM/total) over sample audio with CSV output
- 🔌 OpenAI API-compatible (works with Ollama, LM Studio, OpenAI, or any OpenAI-compatible API)
Recent frontend updates:
- Header updated to "STT + LLM Cleaning" with the subtitle "Full transcription pipeline with real-time latency metrics".
- Input Audio section with two options side by side: Record (microphone) and Upload (file picker or drag-and-drop).
- If both recorded and uploaded audio are present, the app uses the recording and shows a short notice; otherwise it uses whichever source is available.
- Single Process Audio → action for either input; results show cleaned transcript and overall latency.
- Prerequisites: Docker Desktop, VS Code, Dev Containers extension.
- Open in container: In VS Code, use "Reopen in Container" (or `Cmd/Ctrl+Shift+P` → "Dev Containers: Reopen in Container").
- Wait ~5–10 minutes for the build and the Ollama model download. The devcontainer creates `backend/.env` and starts Ollama automatically.
Then go to How to run the app.
The app runs inside the dev container. Use Docker Compose to start the environment, then run backend and frontend in two terminals (both inside the container).
Step 1 — Start the environment
From the project root (e.g. repo root on your machine):
```bash
docker compose up -d
```

Step 2 — Enter the container

```bash
docker compose exec app bash
```

Step 3 — Start the backend (first terminal, inside container)

```bash
cd workspaces/ai-transcript-app/backend
uv sync
uv run uvicorn app:app --reload --host 0.0.0.0 --port 8000
```

Wait until you see ✅ Ready! (Whisper and LLM are loaded).
- API docs: http://localhost:8000/docs
- Health check: http://localhost:8000/api/status
Step 4 — Start the frontend (new terminal, inside container)
Open a new terminal, then:
```bash
docker compose exec app bash
cd workspaces/ai-transcript-app/frontend
export HOME=/tmp
uv run streamlit run streamlit_app.py --server.port 8501 --server.address 0.0.0.0
```

- Frontend app: http://localhost:8501
Step 5 — Use the app
Open http://localhost:8501 in your browser. Use Record or Upload to provide audio, click Process Audio →, and see the raw transcript, cleaned transcript, and latency breakdown (Whisper, LLM, total time). No extra env vars needed inside the container.
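You can also call the pipeline from a script instead of the Streamlit UI. The endpoint path below matches the benchmark script's default (`/api/full`); the raw-bytes request body, content type, and response field names are assumptions — check http://localhost:8000/docs for the actual schema.

```python
import json
from urllib import request

BACKEND_URL = "http://localhost:8000/api/full"  # benchmark script's default endpoint

def format_latency(result: dict) -> str:
    """Render the latency breakdown from a pipeline response."""
    return (f"whisper {result['transcription_time']:.2f}s | "
            f"llm {result['llm_time']:.2f}s | "
            f"total {result['total_time']:.2f}s")

def post_wav(path: str) -> dict:
    """POST raw WAV bytes to the backend and parse the JSON response.

    The raw body and 'audio/wav' content type are assumptions; the real
    endpoint may expect multipart/form-data instead (see /docs).
    """
    with open(path, "rb") as f:
        data = f.read()
    req = request.Request(BACKEND_URL, data=data,
                          headers={"Content-Type": "audio/wav"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (with the backend running):
#   result = post_wav("benchmark/audio/sample_a.wav")
#   print(result.get("cleaned_transcript"))
#   print(format_latency(result))
```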
This app is compatible with any OpenAI API-format LLM provider:
- Ollama (default - works out of the box in devcontainer)
- LM Studio (local alternative)
- OpenAI API (cloud-based)
- Any other OpenAI-compatible API
The devcontainer automatically creates backend/.env with working Ollama defaults. No configuration needed to get started.
To use a different provider, edit backend/.env:
- `LLM_BASE_URL` – API endpoint
- `LLM_API_KEY` – API key
- `LLM_MODEL` – Model name
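For example, a hypothetical `backend/.env` pointing at Ollama's OpenAI-compatible endpoint (values are illustrative — match them to your provider and model):

```env
# Ollama (local, default in the devcontainer)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_KEY=ollama          # Ollama ignores the key, but OpenAI clients require one
LLM_MODEL=gemma:2b

# OpenAI (cloud) — use values like these instead
# LLM_BASE_URL=https://api.openai.com/v1
# LLM_API_KEY=sk-...
# LLM_MODEL=gpt-4o-mini
```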
This repository currently includes a development container configuration only.
- Copy the example env file and fill in your values:

  ```bash
  cp backend/.env.example backend/.env
  ```

- Start the dev container:

  ```bash
  docker compose -f .devcontainer/docker-compose.yml up --build
  ```
- FastAPI: http://localhost:8000
- Streamlit: http://localhost:8501
Requires Docker & Docker Compose. VS Code + Dev Containers extension recommended.
To make this production-ready, the following changes would be needed:
- Dockerfile — add a `production` build stage, create a named non-root user, add an `entrypoint.sh` to start services
- docker-compose.yml — remove the workspace volume mount (bake code into the image), use `.env` instead of `.env.example`, add `restart: unless-stopped`, add an Ollama healthcheck with `condition: service_healthy`, remove the hardcoded dev UID
- Dependencies — run `uv sync --no-dev` in the prod stage to exclude dev dependencies
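A minimal sketch of what the Dockerfile changes could look like (the base image, stage name, paths, and `entrypoint.sh` are assumptions, not the repo's actual files):

```dockerfile
# --- production stage (sketch; adjust base image and paths to the repo) ---
FROM python:3.12-slim AS production

# Named non-root user instead of the dev container's hardcoded UID
RUN useradd --create-home appuser
WORKDIR /app

# Bake the code into the image instead of mounting the workspace
COPY backend/ ./backend/
COPY entrypoint.sh ./

# Install runtime dependencies only
RUN pip install uv && cd backend && uv sync --no-dev

USER appuser
ENTRYPOINT ["./entrypoint.sh"]
```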
A benchmark script measures STT (Whisper), LLM, and total pipeline latency using sample WAV files. It runs multiple passes (with one warmup run discarded), aggregates mean ± std, and writes results to a CSV.
Location: backend/benchmark/
Prerequisites: Backend must be running (see How to run the app).
Run from the backend directory (e.g. inside the dev container):
```bash
cd backend
uv run python benchmark/benchmark.py
```

Config (edit at the top of `backend/benchmark/benchmark.py`):

- `BACKEND_URL` – full-pipeline endpoint (default `http://localhost:8000/api/full`)
- `LLM_MODEL` / `WHISPER_MODEL` – match your `.env` and Whisper setup
- `AUDIO_FILES` – list of WAV files in `benchmark/audio/` (e.g. `sample_a.wav`, `sample_b.wav`, `sample_c.wav`)
- `N_RUNS` – number of runs per file (the first run is a warmup and excluded)
- `OUTPUT_CSV` – path for the results CSV (default: `benchmark/results_{LLM_MODEL}_{WHISPER_MODEL}.csv`)
Output: A CSV with columns such as `transcription_time_mean`, `llm_time_mean`, and `total_time_mean`, plus their standard deviations, per audio file and setup.
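The warmup-discarding aggregation the script performs can be sketched as follows (function names are illustrative, not the script's actual code):

```python
import statistics

def aggregate(times: list[float], warmup: int = 1) -> tuple[float, float]:
    """Mean and sample standard deviation of per-run timings,
    discarding the first `warmup` run(s), which are slowed by model loading."""
    kept = times[warmup:]
    return statistics.mean(kept), statistics.stdev(kept)

# e.g. four runs where the first (warmup) is slow due to model loading
mean, std = aggregate([10.0, 3.0, 4.0, 5.0])
```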
Benchmarked on 3 audio samples using different Whisper + LLM cleaning configurations.
⚠️ These tests were run locally.
Stronger/larger models were not evaluated due to local hardware constraints.
| Setup | Transcription (s) | LLM (s) | Total (s) | Wall Time (s) |
|---|---|---|---|---|
| whisper:base.en + llm:gemma:2b | 3.33 | 5.67 | 9.00 | 9.06 |
| whisper:base.en + llm:phi3:mini | 3.31 | 0.01 | 3.32 | 3.35 |
| whisper:small.en + llm:phi3:mini | 9.58 | 1.63 | 11.20 | 11.27 |
Values shown are mean runtime in seconds across 3 runs.
- LLM choice significantly impacts total runtime: `gemma:2b` adds substantial latency compared to `phi3:mini`.
- Whisper model size has the largest effect on transcription time: `small.en` is ~3× slower than `base.en`.
- Wall time closely matches total time, indicating minimal unmeasured overhead in the pipeline.
- `phi3:mini` introduces near-zero overhead in the base configuration, making it the fastest overall setup locally.
The extremely low LLM runtime observed with phi3:mini may indicate failed or incomplete execution due to local memory constraints. Further validation is required. For this reason, gemma:2b is retained as the default LLM in the current architecture for now.
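Pending that validation, one way to guard against a silently failed cleaning pass would be a fallback check like this (hypothetical — not in the current codebase):

```python
def validated_clean(raw: str, cleaned: str, min_ratio: float = 0.3) -> str:
    """Fall back to the raw transcript if the LLM output looks empty or
    implausibly short relative to the input (e.g. the model ran out of
    memory mid-generation). The 0.3 threshold is an arbitrary example."""
    if not cleaned.strip() or len(cleaned) < min_ratio * len(raw):
        return raw
    return cleaned
```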
- Tested locally (CPU-based inference)
- Focused on lightweight configurations suitable for on-device usage
Container won't start or is very slow:
Configure Docker Desktop resources:
- Open Docker Desktop → Settings → Resources
- Set CPUs to maximum available (8+ cores recommended)
- Set Memory to at least 16GB
- Click Apply & Restart
Expected specs: Modern laptop/desktop with 8+ CPU cores and 16GB RAM. More CPU = faster LLM responses.
Microphone not working:
- Use Chrome or Firefox (Safari may have issues)
- Check browser permissions: Settings → Privacy → Microphone
Backend fails to start:
- Check Whisper model downloads: `~/.cache/huggingface/`
- Ensure enough disk space (models are ~150MB)
LLM errors:
- Make sure Ollama service is running (it auto-starts with devcontainer)
- Check model is downloaded: Model downloads automatically during devcontainer setup
- Transcription still works without LLM (raw Whisper only)
LLM is slow:
- See "Container won't start or is very slow" section above for Docker resource configuration
- Fallback option: Switch to another model (edit `LLM_MODEL` in `backend/.env`). ⚠️ Trade-off: a 3b model is faster but significantly worse at cleaning transcripts
- Best alternative: Use a cloud API like OpenAI for instant responses with excellent quality (edit `.env`)
Cannot access localhost:8501 or localhost:8000 from host machine:
- Docker Desktop: Go to Settings → Resources → Network
- Enable "Use host networking" (may require Docker Desktop restart)
- Restart the frontend and backend servers
Port already in use:
- Backend: Change the port with `--port 8001` (and set `BACKEND_URL` in the Streamlit app if needed)
- Frontend: Use a different port, e.g. `uv run streamlit run streamlit_app.py --server.port 8502 --server.address 0.0.0.0`
