CompositeVoice


An SDK for building AI voice agents — wire together any combination of input, STT, LLM, TTS, and output providers behind one unified 5-role pipeline.

import { CompositeVoice, AnthropicLLM } from '@lukeocodes/composite-voice';

// Text agent — NullInput + NullOutput auto-filled
const agent = new CompositeVoice({
  providers: [new AnthropicLLM({ apiKey: 'sk-ant-...', model: 'claude-haiku-4-5-20251001' })],
});

await agent.initialize();
// Add NativeSTT + NativeTTS for voice, or cloud providers for production audio.

Why CompositeVoice?

Building a voice agent from scratch means solving many hard problems simultaneously: microphone capture, WebSocket reconnection, turn-taking logic, interleaving STT/LLM/TTS lifecycles, and stitching together provider SDKs that share nothing in common. It is easy to spend weeks before your first working demo.

CompositeVoice handles the plumbing. You declare the pipeline; the SDK runs it.

| Feature | What it means for you |
| --- | --- |
| 5-role pipeline | Audio flows through 5 roles: input → stt → llm → tts → output. Each role is a pluggable provider. Multi-role providers (e.g., NativeSTT = input+stt) reduce boilerplate. |
| Provider-agnostic | Deepgram, AssemblyAI, Anthropic, OpenAI, Groq, Gemini, Mistral, ElevenLabs, Cartesia, or browser built-ins — mix and match freely. Swapping a provider is one constructor change. |
| Type-safe throughout | Every event payload, config option, and provider interface is fully typed. TypeScript autocomplete works end-to-end. |
| Zero-config text agent | Pass an empty providers array (or just an LLM) and the SDK defaults to a text-only agent — AnthropicLLM + NullInput + NullOutput. Add voice providers to progressively enhance. |
| Smart text routing | LLM output is split into visual and spoken streams. Code fences are buffered and never sent to TTS. Markdown is stripped for natural speech while the UI gets full formatting. |
| Event-driven | Subscribe to any stage of the pipeline: individual transcription words, LLM tokens, TTS audio chunks, queue stats, and state transitions. |
| Race-condition-free | Audio frames are buffered in a queue during STT connection. No frames are ever lost, even when the WebSocket handshake takes time. |
| Conversation memory | Multi-turn history that grows and trims automatically, included in every LLM call. |
| Eager LLM generation | Start generating a response before the user finishes speaking — cuts perceived latency noticeably. |
| Server-side proxy | Keep API keys completely off the client. Proxy middleware included for Express, Next.js, and plain Node.js — supports all providers. |
| Server-side pipelines | Run the full pipeline in Node.js, Bun, or Deno with BufferInput and NullOutput — no browser APIs required. |
| Agent providers | Collapse STT + LLM + TTS into a single connection. DeepgramAgent uses one WebSocket to the Deepgram Voice Agent API — the SDK auto-fills mic input and speaker output. |
| Extensible | Abstract base classes for all 5 roles plus BaseAgentProvider for multi-role agents. The OpenAICompatibleLLM base class means any OpenAI-compatible API works out of the box. |

Installation

pnpm add @lukeocodes/composite-voice

Node.js 18 or later is required.

Most providers use native fetch or native WebSocket — no SDKs to install. Optional peer dependencies:

pnpm add @mlc-ai/web-llm      # WebLLMLLM — in-browser inference (>=0.2.74)
pnpm add ws                   # server-side proxy WebSocket support, Node.js only (>=8.0.0)

Anthropic, OpenAI, Groq, Gemini, Mistral, Deepgram, AssemblyAI, ElevenLabs, and Cartesia providers all work with zero peer dependencies.


Quick start

Simplest setup — text agent, zero providers

Pass just an LLM (or even an empty array) and the SDK defaults to a text-only agent. NullInput and NullOutput are auto-filled — no microphone, no speaker. You interact via agent.pushText() and subscribe to LLM events.

import { CompositeVoice, AnthropicLLM } from '@lukeocodes/composite-voice';

// Just an LLM — NullInput + NullOutput auto-filled for text-only mode
const agent = new CompositeVoice({
  providers: [new AnthropicLLM({ apiKey: 'sk-ant-...', model: 'claude-haiku-4-5-20251001' })],
});

// Or even zero providers — AnthropicLLM (claude-haiku-4-5) auto-filled too
// const agent = new CompositeVoice({ providers: [] });

agent.on('llm.chunk', (e) => process.stdout.write(e.chunk));
agent.on('agent.stateChange', (e) => console.log('State:', e.state));

await agent.initialize();

Add voice — browser built-ins

Add NativeSTT and NativeTTS for a voice agent using the browser's Web Speech API and SpeechSynthesis. Works in Chrome and Edge.

import { CompositeVoice, NativeSTT, AnthropicLLM, NativeTTS } from '@lukeocodes/composite-voice';

const agent = new CompositeVoice({
  providers: [
    new NativeSTT({ language: 'en-US' }),
    new AnthropicLLM({
      apiKey: 'sk-ant-...',
      model: 'claude-haiku-4-5-20251001',
      systemPrompt: 'You are a helpful voice assistant. Keep responses brief.',
      maxTokens: 200,
    }),
    new NativeTTS(),
  ],
});

agent.on('transcription.final', (e) => console.log('You said:', e.text));
agent.on('llm.chunk', (e) => process.stdout.write(e.chunk));
agent.on('agent.stateChange', (e) => console.log('State:', e.state));

await agent.initialize();
await agent.startListening();

See Example 00 for a full runnable demo with UI.

Production setup — cloud providers, auto-filled audio I/O

Supply cloud STT and TTS providers and the SDK auto-fills MicrophoneInput and BrowserAudioOutput for you. Audio is buffered between stages, eliminating the race condition where first frames could be lost during STT WebSocket handshake.

import {
  CompositeVoice,
  DeepgramSTT,
  AnthropicLLM,
  DeepgramTTS,
} from '@lukeocodes/composite-voice';

// 3-provider config — MicrophoneInput + BrowserAudioOutput auto-filled
const agent = new CompositeVoice({
  providers: [
    new DeepgramSTT({
      apiKey: 'your-deepgram-key',
      options: {
        model: 'nova-3',
        smartFormat: true,
        interimResults: true,
        endpointing: 300,
      },
    }),
    new AnthropicLLM({
      apiKey: 'your-anthropic-key',
      model: 'claude-haiku-4-5-20251001',
      systemPrompt: 'You are a helpful voice assistant. Keep responses brief.',
      maxTokens: 200,
    }),
    new DeepgramTTS({
      apiKey: 'your-deepgram-key',
      options: { model: 'aura-2-thalia-en', encoding: 'linear16', sampleRate: 24000 },
    }),
  ],
});

await agent.initialize();
await agent.startListening();

See Example 20 for the full runnable demo.


5-role pipeline

Every voice agent follows the same 5-stage pipeline:

[InputProvider]  →  InputQueue  →  [STT]  →  [LLM]  →  [TTS]  →  OutputQueue  →  [OutputProvider]
     input             ↑           stt        llm        tts           ↑              output
                  buffers during                                 buffers during
                  STT connection                                 output setup

| Stage | Role | What it does | Example providers |
| --- | --- | --- | --- |
| 1 | input | Captures audio from a source | MicrophoneInput, BufferInput, NativeSTT |
| 2 | stt | Converts audio to text | DeepgramSTT, AssemblyAISTT, NativeSTT |
| 3 | llm | Generates a text response | AnthropicLLM, OpenAILLM, WebLLMLLM |
| 4 | tts | Converts text to audio | DeepgramTTS, ElevenLabsTTS, NativeTTS |
| 5 | output | Plays audio to a destination | BrowserAudioOutput, NullOutput, NativeTTS |

Audio frames are buffered in queues between stages. The input queue holds frames while the STT WebSocket connects, then flushes them in order — no audio is ever lost. The output queue does the same for TTS-to-speaker handoff.
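The buffering idea can be sketched in a few lines. This is illustrative, not the SDK's actual InputQueue: a bounded queue with the drop-oldest overflow strategy from the queue config, which holds frames until the consumer is ready and then flushes them in arrival order.

```typescript
// Illustrative bounded queue with a drop-oldest overflow strategy,
// mirroring the documented queue config options (maxSize, overflowStrategy).
class DropOldestQueue<T> {
  private items: T[] = [];
  constructor(private maxSize: number) {}

  // Enqueue a frame; returns how many old frames were dropped to make room.
  enqueue(item: T): number {
    let dropped = 0;
    while (this.items.length >= this.maxSize) {
      this.items.shift(); // drop the oldest buffered frame
      dropped++;
    }
    this.items.push(item);
    return dropped;
  }

  // Flush everything in arrival order once the downstream connection is ready.
  flush(): T[] {
    const out = this.items;
    this.items = [];
    return out;
  }
}
```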

The agent state machine moves through well-defined states — idle -> ready -> listening -> thinking -> speaking — emitting events at every transition. Your UI subscribes to these events; the SDK manages the lifecycle.


Provider roles

Every provider declares a roles property listing which pipeline slots it fills. The SDK resolves the full 5-role pipeline from a flat providers array.

Multi-role providers

Some providers handle multiple roles. NativeSTT manages its own microphone internally (Web Speech API), so it covers both input and stt. NativeTTS manages its own speaker output, so it covers tts and output. These are explicit providers you add when you want voice — they are not auto-filled defaults.

// Voice agent with browser built-ins — NativeSTT covers input+stt, NativeTTS covers tts+output
const agent = new CompositeVoice({
  providers: [
    new NativeSTT(),        // roles: ['input', 'stt']
    new AnthropicLLM({...}), // roles: ['llm']
    new NativeTTS(),        // roles: ['tts', 'output']
  ],
});

Explicit 5-provider config

When using providers that handle a single role each, supply all 5:

// 5-provider config — each role filled by a separate provider
const agent = new CompositeVoice({
  providers: [
    new MicrophoneInput(),        // roles: ['input']
    new DeepgramSTT({...}),       // roles: ['stt']
    new AnthropicLLM({...}),      // roles: ['llm']
    new DeepgramTTS({...}),       // roles: ['tts']
    new BrowserAudioOutput(),     // roles: ['output']
  ],
});

Defaults

When you omit certain roles, the SDK auto-fills them:

  • input + stt both uncovered → auto-fills with NullInput (text-only, no microphone)
  • tts + output both uncovered → auto-fills with NullOutput (text-only, no speaker)
  • llm uncovered → auto-fills with AnthropicLLM (claude-haiku-4-5)
  • stt provided without input → auto-fills with MicrophoneInput
  • tts provided without output → auto-fills with BrowserAudioOutput

This means the minimal config is an empty providers array (text-only agent with AnthropicLLM):

// Zero-config — AnthropicLLM + NullInput + NullOutput auto-filled
const agent = new CompositeVoice({ providers: [] });

// Or just an LLM — NullInput + NullOutput auto-filled for text-only mode
const agent = new CompositeVoice({
  providers: [new AnthropicLLM({ apiKey: '...', model: 'claude-haiku-4-5-20251001' })],
});

Providers

Audio Input

| Provider | Environment | Roles | Peer dependency |
| --- | --- | --- | --- |
| MicrophoneInput | Browser | input | None |
| BufferInput | Node/Bun/Deno | input | None |
| NativeSTT | Browser | input + stt | None |

MicrophoneInput wraps the browser's getUserMedia + AudioContext into a provider. BufferInput accepts pushed ArrayBuffer data for server-side pipelines. NativeSTT manages its own microphone internally.

Speech-to-Text (STT)

| Provider | Transport | Browser support | Peer dependency |
| --- | --- | --- | --- |
| NativeSTT | Web Speech API | Chrome, Edge | None |
| DeepgramSTT | WebSocket | All modern browsers | None |
| DeepgramFlux | WebSocket | All modern browsers | None |
| AssemblyAISTT | WebSocket | All modern browsers | None |
| ElevenLabsSTT | WebSocket | All modern browsers | None |

All STT providers emit an utteranceComplete: true flag on transcription results to signal when an utterance is ready for LLM processing. This flag is the canonical trigger for LLM generation. The speechFinal event is retained for display purposes but is deprecated as the LLM trigger — use utteranceComplete instead.
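As an illustration, a handler might gate the LLM hand-off on that flag. The payload shape below (text plus an optional utteranceComplete boolean) is an assumption for the sketch, not the SDK's exact type:

```typescript
// Illustrative gate: only hand a transcript to the LLM once the STT provider
// marks the utterance complete. The result shape here is assumed.
interface TranscriptionResult {
  text: string;
  utteranceComplete?: boolean;
}

function shouldTriggerLLM(result: TranscriptionResult): boolean {
  // Ignore partial segments and empty utterances.
  return result.utteranceComplete === true && result.text.trim().length > 0;
}
```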

NativeSTT options:

new NativeSTT({
  language: 'en-US', // BCP-47 language tag
  continuous: true, // keep listening between pauses
  interimResults: true, // emit partial results while speaking
  startTimeout: 5000, // ms before erroring if no audio detected
});

DeepgramSTT options (V1/Nova):

new DeepgramSTT({
  apiKey: 'your-key', // omit and use proxyUrl for server-side key injection
  language: 'en-US',
  options: {
    model: 'nova-3', // nova-3 = best accuracy
    smartFormat: true,
    punctuation: true,
    interimResults: true,
    endpointing: 300, // ms of silence before speech_final fires
    vadEvents: true,
  },
});

DeepgramFlux options (V2/Flux — supports eager LLM):

new DeepgramFlux({
  apiKey: 'your-key', // omit and use proxyUrl for server-side key injection
  options: {
    model: 'flux-general-en',
    eagerEotThreshold: 0.5, // enables eager end-of-turn signals
    eotThreshold: 0.7,
  },
});

AssemblyAISTT options:

new AssemblyAISTT({
  apiKey: 'your-key', // omit and use proxyUrl for server-side key injection
  language: 'en', // language code
  sampleRate: 16000, // audio sample rate in Hz
  wordBoost: ['CompositeVoice'], // boost recognition of specific words
  interimResults: true, // partial transcripts while speaking
});

ElevenLabsSTT options:

new ElevenLabsSTT({
  apiKey: 'your-key', // omit and use proxyUrl for server-side key injection
  model: 'scribe_v2_realtime', // default model
  audioFormat: 'pcm_16000', // audio format
  language: 'en', // BCP 47, ISO 639-1, or ISO 639-3
  commitStrategy: 'vad', // 'vad' (default) or 'manual'
  includeTimestamps: true, // word-level timestamps
});

Language Models (LLM)

| Provider | Transport | Peer dependency | Notes |
| --- | --- | --- | --- |
| AnthropicLLM | HTTP streaming | None | Claude models. Streams by default. Pipeline default. |
| OpenAILLM | HTTP | None | GPT models. |
| GroqLLM | HTTP | None | Groq-hosted models. Ultra-fast inference. |
| GeminiLLM | HTTP | None | Google Gemini models via OpenAI-compatible API. |
| MistralLLM | HTTP | None | Mistral open and commercial models. |
| WebLLMLLM | In-browser WebGPU | @mlc-ai/web-llm | Fully offline. No API keys. Runs entirely client-side. |
| OpenAICompatibleLLM | HTTP | None | Base class — extend for any OpenAI-compatible endpoint. |

AnthropicLLM options:

new AnthropicLLM({
  apiKey: 'your-key', // omit and use proxyUrl for server-side key injection
  model: 'claude-haiku-4-5-20251001',
  systemPrompt: 'You are a helpful voice assistant.',
  maxTokens: 200,
  temperature: 0.7,
  stream: true, // default: true
});

OpenAILLM options:

new OpenAILLM({
  apiKey: 'your-key', // omit and use proxyUrl for server-side key injection
  model: 'gpt-4o-mini',
  systemPrompt: 'You are a helpful voice assistant.',
  maxTokens: 200,
  temperature: 0.7,
});

GroqLLM options:

new GroqLLM({
  apiKey: 'your-key', // or groqApiKey; omit and use proxyUrl for proxy mode
  model: 'llama-3.3-70b-versatile', // default model
  systemPrompt: 'You are a helpful voice assistant.',
  maxTokens: 200,
});

GeminiLLM options:

new GeminiLLM({
  apiKey: 'your-key', // or geminiApiKey; omit and use proxyUrl for proxy mode
  model: 'gemini-2.0-flash', // default model
  systemPrompt: 'You are a helpful voice assistant.',
  maxTokens: 200,
});

MistralLLM options:

new MistralLLM({
  apiKey: 'your-key', // or mistralApiKey; omit and use proxyUrl for proxy mode
  model: 'mistral-small-latest', // default model
  systemPrompt: 'You are a helpful voice assistant.',
  maxTokens: 200,
});

WebLLMLLM options:

new WebLLMLLM({
  model: 'Llama-3.2-1B-Instruct-q4f16_1-MLC', // runs entirely in-browser via WebGPU
  onLoadProgress: (progress) => console.log(progress), // model download progress
  chatOpts: { context_window_size: 2048 }, // optional model config overrides
});

Text-to-Speech (TTS)

| Provider | Transport | Browser support | Peer dependency |
| --- | --- | --- | --- |
| NativeTTS | SpeechSynthesis API | All modern browsers | None |
| DeepgramTTS | WebSocket | All modern browsers | None |
| OpenAITTS | HTTP (REST) | All modern browsers | None |
| ElevenLabsTTS | WebSocket | All modern browsers | None |
| CartesiaTTS | WebSocket | All modern browsers | None |

NativeTTS options:

new NativeTTS({
  rate: 1.0, // speech rate (0.1 – 10)
  pitch: 1.0, // voice pitch (0 – 2)
  volume: 1.0, // volume (0 – 1)
  preferLocal: true, // prefer on-device voices over cloud voices
});

DeepgramTTS options:

new DeepgramTTS({
  apiKey: 'your-key', // omit and use proxyUrl for server-side key injection
  options: {
    model: 'aura-2-thalia-en',
    encoding: 'linear16',
    sampleRate: 24000,
  },
});

OpenAITTS options:

new OpenAITTS({
  apiKey: 'your-key', // omit and use proxyUrl for server-side key injection
  model: 'tts-1', // 'tts-1' (fast) or 'tts-1-hd' (quality)
  voice: 'alloy', // 'alloy' | 'echo' | 'fable' | 'onyx' | 'nova' | 'shimmer'
  responseFormat: 'mp3', // 'mp3' | 'opus' | 'aac' | 'flac' | 'wav'
  speed: 1.0, // 0.25 – 4.0
});

ElevenLabsTTS options:

new ElevenLabsTTS({
  apiKey: 'your-key', // omit and use proxyUrl for server-side key injection
  voiceId: 'your-voice-id', // required — ElevenLabs voice ID
  modelId: 'eleven_turbo_v2_5', // fast low-latency model
  stability: 0.5, // voice stability (0 – 1)
  similarityBoost: 0.75, // similarity boost (0 – 1)
  outputFormat: 'pcm_16000', // 'pcm_16000' | 'pcm_22050' | 'pcm_24000' | 'mp3_44100_128' | ...
});

CartesiaTTS options:

new CartesiaTTS({
  apiKey: 'your-key', // omit and use proxyUrl for server-side key injection
  voiceId: 'your-voice-id', // required — Cartesia voice ID
  modelId: 'sonic-2', // 'sonic-2' | 'sonic' | 'sonic-multilingual'
  language: 'en', // language code
  outputSampleRate: 16000, // sample rate in Hz
  speed: 1.0, // speech speed multiplier
  emotion: ['positivity:high'], // emotion tags for voice expression
});

Agent Providers

Agent providers collapse the three middle pipeline roles (stt + llm + tts) into a single connection. Instead of wiring up separate STT, LLM, and TTS providers, one agent provider handles the entire server-side loop over a single WebSocket. The SDK auto-fills MicrophoneInput and BrowserAudioOutput for audio I/O, so a working voice agent requires just one provider in your config.

| Provider | Transport | Roles | Peer dependency | Description |
| --- | --- | --- | --- | --- |
| DeepgramAgent | WebSocket | stt + llm + tts | None | Deepgram Voice Agent API — single WebSocket handles STT, LLM, and TTS server-side |

Audio Output

| Provider | Environment | Roles | Peer dependency |
| --- | --- | --- | --- |
| BrowserAudioOutput | Browser | output | None |
| NullOutput | Node/Bun/Deno | output | None |
| NativeTTS | Browser | tts + output | None |

BrowserAudioOutput wraps the browser's AudioContext for speaker playback. NullOutput silently discards audio for server-side pipelines. NativeTTS manages its own speaker output internally via the SpeechSynthesis API.


Configuration

Full CompositeVoice configuration reference:

const agent = new CompositeVoice({
  // Required: array of provider instances covering the 5 pipeline roles
  providers: [
    new MicrophoneInput({ sampleRate: 16000 }),
    new DeepgramSTT({ apiKey: '...' }),
    new AnthropicLLM({ apiKey: '...' }),
    new DeepgramTTS({ apiKey: '...' }),
    new BrowserAudioOutput(),
  ],

  // Audio buffer queue tuning (for separate input/STT and TTS/output providers)
  queue: {
    input: { maxSize: 2000, overflowStrategy: 'drop-oldest' },
    output: { maxSize: 500 },
  },

  // Conversation memory (disabled by default)
  conversationHistory: {
    enabled: true,
    maxTurns: 10, // 0 = unlimited; each turn = one user + assistant exchange
    maxTokens: 4000, // approximate token budget (chars/4 heuristic)
    preserveSystemMessages: true, // keep system messages during trimming (default: true)
  },

  // Eager/speculative LLM generation (requires DeepgramFlux)
  eagerLLM: {
    enabled: true,
    cancelOnTextChange: true, // restart generation if the preflight transcript was wrong
    similarityThreshold: 0.8, // word-overlap threshold for accepting speculative response
  },

  // Turn-taking: whether to pause the mic during TTS playback
  turnTaking: {
    pauseCaptureOnPlayback: 'auto', // 'auto' | true | false (default: 'auto')
    autoStrategy: 'conservative', // 'conservative' | 'aggressive' | 'detect' (default: 'conservative')
  },

  // Logging
  logging: {
    enabled: true,
    level: 'info', // 'debug' | 'info' | 'warn' | 'error'
  },

  // LLM→TTS backpressure
  pipeline: {
    maxPendingChunks: 10, // pause LLM if TTS has 10+ unprocessed chunks
  },

  // WebSocket reconnection (applies to Deepgram providers)
  reconnection: {
    enabled: true, // required — enables reconnection
    maxAttempts: 5,
    initialDelay: 1000, // ms before first retry
    maxDelay: 30000, // cap on retry interval
    backoffMultiplier: 2, // exponential backoff factor
  },

  // Automatic error recovery
  autoRecover: true,

  // Recovery strategy (when autoRecover is true)
  recovery: {
    maxAttempts: 3,
    initialDelay: 1000,
    backoffMultiplier: 2,
    maxDelay: 10000,
  },
});
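The conversationHistory.maxTokens budget above is approximate, using the chars/4 heuristic. A standalone sketch of that kind of trimming follows — Turn, approxTokens, and trimToBudget are illustrative names for this sketch, not SDK exports:

```typescript
// Illustrative history trimming using the documented chars/4 token heuristic.
interface Turn {
  user: string;
  assistant: string;
}

// Approximate token count: roughly one token per four characters.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Drop the oldest turns until the remaining history fits the token budget,
// always keeping at least the most recent turn.
function trimToBudget(turns: Turn[], maxTokens: number): Turn[] {
  const kept = [...turns];
  const cost = (t: Turn) => approxTokens(t.user) + approxTokens(t.assistant);
  let total = kept.reduce((sum, t) => sum + cost(t), 0);
  while (kept.length > 1 && total > maxTokens) {
    const oldest = kept.shift();
    if (!oldest) break;
    total -= cost(oldest);
  }
  return kept;
}
```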

Events

Subscribe to any part of the voice pipeline with a type-safe event system:

agent.on('event.name', handler); // subscribe
agent.off('event.name', handler); // unsubscribe
agent.once('event.name', handler); // fire once, then auto-unsubscribe

Agent events

| Event | Payload | Description |
| --- | --- | --- |
| agent.ready | (none) | SDK initialized and ready to start listening |
| agent.stateChange | { state, previousState } | Agent moved to a new state |
| agent.error | { error, recoverable, context? } | System-level error |

Transcription events

| Event | Payload | Description |
| --- | --- | --- |
| transcription.start | (none) | Transcription session opened |
| transcription.interim | { text, confidence? } | Partial transcript — updates word by word while the user is speaking |
| transcription.final | { text, confidence? } | Confirmed transcript segment |
| transcription.speechFinal | { text, confidence? } | Full utterance ended. The utteranceComplete flag on the transcription result is what triggers LLM processing. |
| transcription.preflight | { text, confidence? } | Early end-of-turn signal (DeepgramFlux only) |
| transcription.error | { error } | Transcription error |

LLM events

| Event | Payload | Description |
| --- | --- | --- |
| llm.start | { prompt } | LLM generation started |
| llm.chunk | { chunk, accumulated } | Text token received from the model |
| llm.complete | { text, tokensUsed? } | Full response assembled |
| llm.error | { error } | LLM error |

TTS events

| Event | Payload | Description |
| --- | --- | --- |
| tts.start | { text } | Synthesis started |
| tts.audio | { chunk } | Audio chunk ready for playback |
| tts.metadata | { metadata } | Audio format metadata |
| tts.complete | (none) | Synthesis complete (playback may still be in progress) |
| tts.error | { error } | TTS error |

Audio events

| Event | Payload | Description |
| --- | --- | --- |
| audio.capture.start | (none) | Microphone opened |
| audio.capture.stop | (none) | Microphone closed |
| audio.capture.error | { error } | Capture failure |
| audio.playback.start | (none) | Audio playback started |
| audio.playback.end | (none) | Audio playback ended |
| audio.playback.error | { error } | Playback failure |

Queue events

| Event | Payload | Description |
| --- | --- | --- |
| queue.overflow | { queueName, droppedChunks, currentSize } | Queue exceeded maxSize, chunks were dropped |
| queue.stats | { queueName, size, totalEnqueued, totalDequeued, oldestChunkAge } | Pipeline health snapshot from getQueueStats() |

agent.on('queue.overflow', (e) => {
  console.warn(`Queue "${e.queueName}" dropped ${e.droppedChunks} chunks (size: ${e.currentSize})`);
});

agent.on('queue.stats', (e) => {
  console.log(`Queue "${e.queueName}": ${e.size} buffered, ${e.totalEnqueued} total`);
});
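Those stats also lend themselves to simple health checks. A sketch over the queue.stats payload shape from the table above — the thresholds here are arbitrary examples, not SDK defaults:

```typescript
// Hypothetical health check over the documented queue.stats payload.
interface QueueStats {
  queueName: string;
  size: number;
  totalEnqueued: number;
  totalDequeued: number;
  oldestChunkAge: number; // ms
}

// Flag a queue as backed up when it is near capacity or holding stale audio.
// maxSize and maxAgeMs are example thresholds; tune them for your pipeline.
function isBackedUp(stats: QueueStats, maxSize = 2000, maxAgeMs = 5000): boolean {
  return stats.size > maxSize * 0.8 || stats.oldestChunkAge > maxAgeMs;
}
```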

Agent states

The agent moves through a well-defined state machine. Every transition emits an agent.stateChange event so your UI can always reflect what the agent is doing.

idle -> ready -> listening -> thinking -> speaking
                    ^                       |
                    |_______________________|
                              |
                           (error)
| State | Description |
| --- | --- |
| idle | Not yet initialized |
| ready | Initialized, waiting to start |
| listening | Capturing audio and transcribing |
| thinking | LLM is generating a response |
| speaking | TTS audio is playing |
| error | Recoverable error — call startListening() to retry |

agent.on('agent.stateChange', ({ state, previousState }) => {
  console.log(`${previousState} -> ${state}`);
});

The error state is recoverable. The agent does not shut down on errors — it waits for you to call startListening() again, which lets you add your own retry UI or backoff logic.
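A retry loop can reuse the same exponential backoff shape as the recovery config (initialDelay multiplied by backoffMultiplier per attempt, capped at maxDelay) before calling startListening() again. The helper below is a standalone sketch, not an SDK export:

```typescript
// Exponential backoff delay matching the recovery config semantics:
// initialDelay * backoffMultiplier^attempt, capped at maxDelay.
function retryDelay(
  attempt: number, // 0-based retry attempt
  initialDelay = 1000,
  backoffMultiplier = 2,
  maxDelay = 10000,
): number {
  return Math.min(initialDelay * backoffMultiplier ** attempt, maxDelay);
}

// Usage sketch: on entering the 'error' state, wait retryDelay(n) ms
// before calling agent.startListening() for the nth retry.
```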


Conversation history

Enable multi-turn memory so the LLM remembers previous exchanges within a session. Without this, each user utterance is sent to the LLM in isolation.

const agent = new CompositeVoice({
  providers: [stt, llm, tts],
  conversationHistory: {
    enabled: true,
    maxTurns: 10, // keep last 10 user + assistant pairs; 0 = unlimited
  },
});

Each completed turn is automatically appended and included in the next LLM call:

You:  "My name is Sam."
AI:   "Nice to meet you, Sam!"
You:  "What's my name?"
AI:   "Your name is Sam."   // LLM remembers the earlier exchange

Access and manage history programmatically:

// Retrieve the full conversation as an array of LLMMessage objects
const history = agent.getHistory();

// Wipe the history without reinitializing the agent
agent.clearHistory();

See Example 01 for a demo with a full chat-thread UI.


Eager LLM pipeline

With the DeepgramFlux provider, the SDK can begin LLM generation before the user finishes speaking. DeepgramFlux emits a preflight event — an early end-of-turn prediction — which the SDK uses to speculatively start the LLM. If the final transcript is sufficiently similar to the preflight (based on similarityThreshold), the response continues uninterrupted. If it differs significantly, generation restarts with the correct text.

Enabling eager LLM:

const agent = new CompositeVoice({
  providers: [
    new DeepgramFlux({
      apiKey: 'your-key',
      options: { model: 'flux-general-en', eagerEotThreshold: 0.5 },
    }),
    llm,
    tts,
  ],
  eagerLLM: {
    enabled: true,
    cancelOnTextChange: true, // restart if the preflight text diverged
    similarityThreshold: 0.8, // 80% word overlap required to keep response
  },
});

How it works:

User is still speaking
        |
        v
preflight fires  -->  LLM starts generating (speculative)
        |
        v
speech_final arrives
        |
        +---> text unchanged?  -->  LLM continues streaming uninterrupted
        |
        +---> text changed?    -->  LLM cancelled, restarts with correct text

The result is noticeably lower perceived latency on natural speech patterns where the end of an utterance is predictable. See Example 21 for a demo with real-time pipeline timing.
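The similarityThreshold compares the preflight transcript against the final one by word overlap. The SDK's exact metric is not spelled out here; a plausible illustration is an overlap ratio over the two word sets:

```typescript
// Illustrative word-overlap similarity between a preflight transcript and the
// final transcript. The SDK's actual formula may differ.
function wordOverlap(a: string, b: string): number {
  const wordsA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const wordsB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (wordsA.size === 0 && wordsB.size === 0) return 1;
  let shared = 0;
  for (const w of wordsA) if (wordsB.has(w)) shared++;
  // Divide by the larger set so both insertions and deletions lower the score.
  return shared / Math.max(wordsA.size, wordsB.size);
}
```

With a 0.8 threshold, a final transcript that merely appends a word to a four-word preflight would be accepted, while a rewording would trigger a restart.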


Smart text routing

LLM responses often contain markdown, code blocks, and formatting that sounds terrible when read aloud by TTS. CompositeVoice automatically splits LLM output into separate visual and spoken streams:

  • Code fences are buffered entirely and never sent to TTS — no more hearing "opening backtick backtick backtick javascript function hello..."
  • Markdown is stripped from the spoken stream (headings, bold, links, lists) while the visual stream preserves full formatting for the UI
  • Partial fences are held in a buffer until the closing fence arrives, so incomplete code blocks never leak to either stream
// Subscribe to the visual stream (full markdown + code) for your UI
voice.on('llm.chunk', ({ chunk }) => appendToUI(chunk));

// The spoken stream is handled automatically — TTS receives clean text only
// No configuration needed; smart routing is built into the pipeline

The routing is handled by LLMTextRouter and ChunkSplitter internally. The TTS provider receives pre-processed text via ttsStrip, which removes markdown syntax while preserving natural sentence structure.
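As a rough illustration of what the spoken stream receives — this is not the SDK's actual ttsStrip implementation, just the general idea:

```typescript
// Rough illustration of markdown stripping for a spoken stream.
// Not the SDK's ttsStrip; a minimal sketch of the concept.
function stripForSpeech(text: string): string {
  return text
    .replace(/`{3}[\s\S]*?`{3}/g, '') // drop fenced code blocks entirely
    .replace(/`([^`]*)`/g, '$1') // unwrap inline code
    .replace(/^#{1,6}\s+/gm, '') // remove heading markers
    .replace(/\*\*([^*]*)\*\*/g, '$1') // unwrap bold
    .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1') // links -> link text only
    .replace(/^[-*]\s+/gm, '') // remove list bullets
    .replace(/\s+/g, ' ') // collapse whitespace for natural speech
    .trim();
}
```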


Tool use / function calling

LLM providers that implement ToolAwareLLMProvider can invoke tools (function calls) during generation. Currently, AnthropicLLM supports this. Text output is streamed to TTS as usual, while tool calls are handled via the onToolCall callback. After tool execution, the LLM is called again with the tool result to generate a natural language follow-up.

const voice = new CompositeVoice({
  providers: [
    new DeepgramSTT({ proxyUrl: '/api/proxy/deepgram' }),
    new AnthropicLLM({ proxyUrl: '/api/proxy/anthropic', model: 'claude-sonnet-4-6' }),
    new DeepgramTTS({ proxyUrl: '/api/proxy/deepgram' }),
  ],
  tools: {
    definitions: [
      {
        name: 'get_weather',
        description: 'Get the current weather for a location',
        parameters: {
          type: 'object',
          properties: {
            location: { type: 'string', description: 'City name' },
          },
          required: ['location'],
        },
      },
    ],
    onToolCall: async (toolCall) => {
      // Execute the tool and return the result
      const weather = await fetchWeather(toolCall.arguments.location);
      return { result: JSON.stringify(weather) };
    },
  },
});

Note: Tool use is supported by AnthropicLLM and all OpenAI-compatible providers (OpenAILLM, GroqLLM, GeminiLLM, MistralLLM, OpenAICompatibleLLM) via the ToolAwareLLMProvider interface. WebLLMLLM does not support tools.


Barge-in

The SDK automatically interrupts the agent when the user speaks while the agent is in the thinking or speaking state. This is called barge-in — the user can cut in at any time without waiting for the agent to finish.

When barge-in is triggered, the SDK:

  1. Aborts the in-flight LLM generation (via AbortSignal)
  2. Clears the TTS output queue
  3. Resets the TTS provider
  4. Transitions back to listening state

Barge-in happens automatically when the STT provider detects speech during agent output. You can also trigger it programmatically:

// Programmatic barge-in — immediately stop the agent and return to listening
agent.stopSpeaking();

No configuration is required — barge-in is always active.


Turn-taking

Turn-taking controls whether the microphone is paused while the AI is speaking. The right strategy depends on whether your audio setup provides echo cancellation.

const agent = new CompositeVoice({
  providers: [stt, llm, tts],
  turnTaking: {
    pauseCaptureOnPlayback: 'auto',
    autoStrategy: 'conservative',
  },
});

| pauseCaptureOnPlayback | autoStrategy | Behaviour |
| --- | --- | --- |
| 'auto' (default) | 'conservative' (default) | Pauses the mic for NativeSTT (no echo cancellation); leaves it open for DeepgramSTT (relies on hardware echo cancellation). When pausing, uses the conservative strategy. |
| true | (n/a) | Always pauses the mic during TTS playback. Safe choice if you are unsure about echo cancellation. |
| false | (n/a) | Never pauses. Only suitable with reliable hardware echo cancellation. |
| 'auto' | 'detect' | Attempts to detect echo cancellation support at runtime before choosing a strategy. |
| 'auto' | 'aggressive' | When auto-detection decides to pause, uses the aggressive strategy (shorter pause windows). |

Server-side proxy

Keep API keys completely out of the browser. The proxy middleware forwards browser requests to provider APIs and injects credentials server-side. Your deployed client bundle contains zero secrets.

The proxy supports all providers: Deepgram, Anthropic, OpenAI, Groq, Gemini, Mistral, AssemblyAI, ElevenLabs, and Cartesia.

Express

import express from 'express';
import { createServer } from 'http';
import { createExpressProxy } from '@lukeocodes/composite-voice/proxy';

const app = express();
const server = createServer(app);

const proxy = createExpressProxy({
  deepgramApiKey: process.env.DEEPGRAM_API_KEY,
  anthropicApiKey: process.env.ANTHROPIC_API_KEY,
  openaiApiKey: process.env.OPENAI_API_KEY,
  groqApiKey: process.env.GROQ_API_KEY,
  geminiApiKey: process.env.GEMINI_API_KEY,
  mistralApiKey: process.env.MISTRAL_API_KEY,
  assemblyaiApiKey: process.env.ASSEMBLYAI_API_KEY,
  elevenlabsApiKey: process.env.ELEVENLABS_API_KEY,
  cartesiaApiKey: process.env.CARTESIA_API_KEY,
  pathPrefix: '/proxy',
});

app.use(proxy.middleware);
proxy.attachWebSocket(server); // required for WebSocket connections

app.use(express.static('dist'));
server.listen(3010);

Next.js App Router

// app/proxy/[...path]/route.ts
import { createNextJsProxy } from '@lukeocodes/composite-voice/proxy';

const proxy = createNextJsProxy({
  deepgramApiKey: process.env.DEEPGRAM_API_KEY,
  anthropicApiKey: process.env.ANTHROPIC_API_KEY,
});

export const GET = proxy.handler;
export const POST = proxy.handler;

Plain Node.js

import { createServer } from 'http';
import { createNodeProxy } from '@lukeocodes/composite-voice/proxy';

const proxy = createNodeProxy({
  deepgramApiKey: process.env.DEEPGRAM_API_KEY,
  anthropicApiKey: process.env.ANTHROPIC_API_KEY,
  pathPrefix: '/proxy',
});

const server = createServer(proxy.handler);
proxy.attachWebSocket(server);
server.listen(3010);

Browser side

Replace apiKey with proxyUrl in any provider config. The provider will route requests through your server instead of calling the provider API directly.

const stt = new DeepgramSTT({
  proxyUrl: `${window.location.origin}/proxy/deepgram`,
  options: { model: 'nova-3', interimResults: true, endpointing: 300 },
});

const llm = new AnthropicLLM({
  proxyUrl: `${window.location.origin}/proxy/anthropic`,
  model: 'claude-haiku-4-5-20251001',
  systemPrompt: 'You are a helpful voice assistant.',
  maxTokens: 200,
});

const tts = new DeepgramTTS({
  proxyUrl: `${window.location.origin}/proxy/deepgram`,
  options: { model: 'aura-2-thalia-en', encoding: 'linear16', sampleRate: 24000 },
});

const agent = new CompositeVoice({ providers: [stt, llm, tts] });

See Example 10 for a complete production-ready setup.

Proxy security

The proxy supports optional security middleware for rate limiting, body size limits, WebSocket message size limits, and custom authentication. All options are opt-in: when the security field is omitted, no extra checks are applied and the proxy behaves exactly as before.

const proxy = createExpressProxy({
  deepgramApiKey: process.env.DEEPGRAM_API_KEY,
  anthropicApiKey: process.env.ANTHROPIC_API_KEY,
  security: {
    rateLimit: { maxRequests: 100, windowMs: 60000 },
    maxBodySize: 1024 * 1024, // 1 MB
    maxWsMessageSize: 64 * 1024, // 64 KB
    authenticate: (req) => {
      return req.headers['x-api-token'] === process.env.APP_TOKEN;
    },
  },
});

| Option | What it does |
| --- | --- |
| rateLimit | Per-IP request throttling. Requests over the limit receive 429 Too Many Requests. |
| maxBodySize | Rejects HTTP bodies larger than the limit with 413 Payload Too Large. |
| maxWsMessageSize | Closes WebSocket connections that send messages exceeding the limit (code 1009). |
| authenticate | Custom auth function — return true to allow, false to reject with 401. Called for both HTTP requests and WebSocket upgrade requests. |
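To make the rateLimit semantics concrete, here is an illustrative fixed-window limiter. It is not the proxy's actual implementation (which may use a different algorithm); `createRateLimiter` and its shape are assumptions for the sketch only.

```typescript
// Illustrative fixed-window sketch of the { maxRequests, windowMs } semantics.
type WindowState = { windowStart: number; count: number };

function createRateLimiter(maxRequests: number, windowMs: number) {
  const perIp = new Map<string, WindowState>();

  // Returns true if the request is allowed, false if it should get a 429.
  return function allow(ip: string, now: number): boolean {
    const state = perIp.get(ip);
    if (!state || now - state.windowStart >= windowMs) {
      perIp.set(ip, { windowStart: now, count: 1 });
      return true;
    }
    state.count += 1;
    return state.count <= maxRequests;
  };
}

const allow = createRateLimiter(2, 60000);
console.log(allow('1.2.3.4', 0));     // true — first request in window
console.log(allow('1.2.3.4', 10));    // true — second, at the limit
console.log(allow('1.2.3.4', 20));    // false — over the limit
console.log(allow('1.2.3.4', 60001)); // true — window has rolled over
```

Note that each IP gets its own window, which is why the option is described as per-IP throttling.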

Server-side usage

The SDK also runs outside the browser. Use BufferInput and NullOutput in Node.js, Bun, or Deno pipelines where there is no microphone or speaker.

import {
  CompositeVoice,
  BufferInput,
  DeepgramSTT,
  AnthropicLLM,
  DeepgramTTS,
  NullOutput,
} from '@lukeocodes/composite-voice';

// Define the audio format you will push
const input = new BufferInput({
  sampleRate: 16000,
  encoding: 'linear16',
  channels: 1,
  bitDepth: 16,
});

const agent = new CompositeVoice({
  providers: [
    input,
    new DeepgramSTT({ apiKey: process.env.DEEPGRAM_API_KEY }),
    new AnthropicLLM({ apiKey: process.env.ANTHROPIC_API_KEY, model: 'claude-haiku-4-5-20251001' }),
    new DeepgramTTS({ apiKey: process.env.DEEPGRAM_API_KEY }),
    new NullOutput(),
  ],
});

await agent.initialize();
await agent.startListening();

// Push audio from any source (file, stream, WebSocket, etc.)
input.push(audioBuffer);

BufferInput and NullOutput have zero browser dependencies — no navigator, window, or AudioContext.
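For example, a pre-recorded utterance can be fed to input.push in fixed-size frames. The chunkBuffer helper and the 3,200-byte frame size below are illustrative choices, not SDK APIs; pick a chunk size that suits your latency needs.

```typescript
// Split a buffer into fixed-size frames (the last frame may be shorter).
function chunkBuffer(buf: Buffer, chunkSize: number): Buffer[] {
  const chunks: Buffer[] = [];
  for (let offset = 0; offset < buf.length; offset += chunkSize) {
    chunks.push(buf.subarray(offset, offset + chunkSize));
  }
  return chunks;
}

// 1 second of 16 kHz / 16-bit mono PCM is 32,000 bytes;
// 3,200-byte frames correspond to roughly 100 ms of audio each.
const pcm = Buffer.alloc(32000); // stand-in for audio loaded from a file or stream
const frames = chunkBuffer(pcm, 3200);

// Feed the frames to the pipeline (input is the BufferInput shown above):
// for (const frame of frames) input.push(frame);
```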


Custom providers

Every built-in provider extends an abstract base class or implements a provider interface. You can plug a custom provider into any of the 5 pipeline roles by extending or implementing the appropriate one and emitting the expected events.

Base classes and interfaces

| Base class / Interface | Role | Use for |
| --- | --- | --- |
| AudioInputProvider | input | Custom audio capture (microphone, file, stream) |
| BaseSTTProvider | stt | Any speech-to-text provider |
| BaseLLMProvider | llm | Any language model |
| OpenAICompatibleLLM | llm | Any LLM with an OpenAI-compatible API (Groq, Gemini, Mistral...) |
| BaseTTSProvider | tts | Any text-to-speech provider |
| BaseAgentProvider | stt + llm + tts | Agent providers that handle STT, LLM, and TTS in one connection |
| AudioOutputProvider | output | Custom audio playback (speakers, file, stream) |

Every custom provider must declare its roles property.

AudioInputProvider skeleton

import type {
  AudioInputProvider,
  AudioChunk,
  AudioMetadata,
  ProviderType,
  ProviderRole,
} from '@lukeocodes/composite-voice';

class MyInput implements AudioInputProvider {
  public readonly type: ProviderType = 'rest';
  public readonly roles: readonly ProviderRole[] = ['input'];

  private callback: ((chunk: AudioChunk) => void) | null = null;

  async initialize(): Promise<void> {
    /* set up your audio source */
  }
  async dispose(): Promise<void> {
    /* clean up */
  }
  isReady(): boolean {
    return true;
  }

  start(): void {
    /* begin capturing and call this.callback with AudioChunk objects */
  }
  stop(): void {
    /* stop capturing */
  }
  pause(): void {
    /* pause capturing */
  }
  resume(): void {
    /* resume capturing */
  }
  isActive(): boolean {
    return false;
  }
  onAudio(callback: (chunk: AudioChunk) => void): void {
    this.callback = callback;
  }
  getMetadata(): AudioMetadata {
    return { sampleRate: 16000, encoding: 'linear16', channels: 1, bitDepth: 16 };
  }
}

STT provider skeleton

import { BaseSTTProvider } from '@lukeocodes/composite-voice';

class MySTT extends BaseSTTProvider {
  // Declare the roles this provider covers
  public readonly roles = ['stt'] as const;

  protected async onInitialize(): Promise<void> {
    // Connect to your STT service, set up any clients or state.
  }

  protected async onDispose(): Promise<void> {
    // Clean up connections and resources.
  }

  async startCapture(): Promise<void> {
    // Stream audio from the microphone to your service, then emit results
    // via the base class helper (it handles event delivery and routing):
    this.emitTranscription({ text, isFinal: false });
    this.emitTranscription({ text, isFinal: true, utteranceComplete: true });
  }

  async stopCapture(): Promise<void> {
    // Flush and close the stream.
  }
}

LLM provider skeleton

import { BaseLLMProvider, LLMMessage, LLMGenerationOptions } from '@lukeocodes/composite-voice';

class MyLLM extends BaseLLMProvider {
  protected async onInitialize(): Promise<void> {
    // Set up your LLM client.
  }

  async *generateFromMessages(
    messages: LLMMessage[],
    options?: LLMGenerationOptions
  ): AsyncGenerator<string> {
    // Stream tokens from your model — yield each chunk as it arrives.
    for await (const token of myModelStream(messages)) {
      yield token;
    }
  }

  async *generate(prompt: string, options?: LLMGenerationOptions): AsyncGenerator<string> {
    yield* this.generateFromMessages(this.promptToMessages(prompt), options);
  }
}
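To see how the AsyncGenerator<string> contract is consumed, here is a small stand-alone sketch. fakeTokens and collect are illustrative stand-ins, not SDK exports; the pipeline consumes the stream in a similar token-by-token fashion so TTS can start before the full reply exists.

```typescript
// Minimal stand-in for a provider's token stream.
async function* fakeTokens(): AsyncGenerator<string> {
  for (const token of ['Hello', ', ', 'world', '!']) {
    yield token;
  }
}

// Consume the generator chunk by chunk; concatenating the yielded
// strings reconstructs the full reply.
async function collect(stream: AsyncGenerator<string>): Promise<string> {
  let reply = '';
  for await (const token of stream) {
    reply += token;
  }
  return reply;
}

collect(fakeTokens()).then((reply) => console.log(reply)); // "Hello, world!"
```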

OpenAI-compatible LLM

The fastest way to add a new LLM provider that has an OpenAI-compatible API:

import { OpenAICompatibleLLM, OpenAICompatibleLLMConfig } from '@lukeocodes/composite-voice';

class MyLLM extends OpenAICompatibleLLM {
  constructor(config: OpenAICompatibleLLMConfig) {
    super({ endpoint: 'https://api.my-provider.com/v1', ...config });
  }
}

// Use it exactly like any other LLM provider
const llm = new MyLLM({
  apiKey: 'your-key',
  model: 'my-model',
  systemPrompt: 'You are a helpful voice assistant.',
});

TTS provider skeleton

import { BaseTTSProvider } from '@lukeocodes/composite-voice';

class MyTTS extends BaseTTSProvider {
  public readonly roles = ['tts'] as const;

  protected async onInitialize(): Promise<void> {
    // Set up your TTS client.
  }

  async synthesize(text: string): Promise<void> {
    this.emit('tts.start', { text });
    // Stream audio chunks from your service, then emit:
    this.emit('tts.audio', { chunk: audioBuffer });
    this.emit('tts.complete');
  }
}

AudioOutputProvider skeleton

import type {
  AudioOutputProvider,
  AudioChunk,
  AudioMetadata,
  ProviderType,
  ProviderRole,
} from '@lukeocodes/composite-voice';

class MyOutput implements AudioOutputProvider {
  public readonly type: ProviderType = 'rest';
  public readonly roles: readonly ProviderRole[] = ['output'];

  async initialize(): Promise<void> {
    /* set up your audio destination */
  }
  async dispose(): Promise<void> {
    /* clean up */
  }
  isReady(): boolean {
    return true;
  }

  configure(metadata: AudioMetadata): void {
    /* configure output format */
  }
  enqueue(chunk: AudioChunk): void {
    /* write audio chunk to destination */
  }
  async flush(): Promise<void> {
    /* wait for all enqueued audio to finish */
  }
  stop(): void {
    /* stop playback */
  }
  pause(): void {
    /* pause playback */
  }
  resume(): void {
    /* resume playback */
  }
  isPlaying(): boolean {
    return false;
  }
  onPlaybackStart(callback: () => void): void {
    /* register callback */
  }
  onPlaybackEnd(callback: () => void): void {
    /* register callback */
  }
  onPlaybackError(callback: (error: Error) => void): void {
    /* register callback */
  }
}

For a full implementation guide, see CONTRIBUTING.md. The built-in providers in src/providers/ are the best reference implementations.


Examples

30 standalone Vite apps in examples/, organized by category. Each introduces a real feature or provider — no filler.

Getting started (00–06)

Browser-native providers and core SDK patterns. Only an Anthropic API key is required for most examples.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 00 | Minimum viable voice agent | Anthropic | 3000 |
| 01 | Multi-turn conversation memory | Anthropic | 3001 |
| 02 | System prompt persona configuration | Anthropic | 3002 |
| 03 | Full event timeline and debugging | Anthropic | 3003 |
| 04 | Error simulation and automatic recovery | Anthropic | 3004 |
| 05 | Turn-taking strategy visualization | Anthropic | 3005 |
| 06 | Advanced config — multi-role and 5-provider patterns with queue options | Anthropic + Deepgram | 3006 |

Production patterns (10–13)

Proxy servers, custom providers, and deployment-ready configurations.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 10 | Express proxy — API keys stay on the server | Deepgram + Anthropic | 3010 |
| 11 | Next.js App Router proxy | Anthropic | 3011 |
| 12 | Custom LLM provider (extends base class) | None | 3012 |
| 13 | Multi-language support and switching | Anthropic | 3013 |

Deepgram (20–24)

Production-quality STT and TTS with Deepgram-specific features.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 20 | Full Deepgram STT + TTS pipeline | Deepgram + Anthropic | 3020 |
| 21 | Eager/preflight speculative generation | Deepgram + Anthropic | 3021 |
| 22 | STT configuration panel (model, VAD, etc.) | Deepgram + Anthropic | 3022 |
| 23 | TTS voice gallery — preview Aura 2 voices | Deepgram + Anthropic | 3023 |
| 24 | Deepgram pipeline + conversation history | Deepgram + Anthropic | 3024 |

Deepgram Voice Agent (70)

Single-WebSocket voice agent using the Deepgram Voice Agent API.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 70 | Deepgram Voice Agent API (single-WebSocket STT+LLM+TTS) | Deepgram | 3070 |

Anthropic (30–31)

Claude model comparison and streaming configuration.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 30 | Side-by-side model comparison (Haiku/Sonnet) | Anthropic | 3030 |
| 31 | Streaming config (temperature, tokens, topP) | Anthropic | 3031 |

OpenAI (40–42)

GPT models and OpenAI TTS integration.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 40 | OpenAI LLM with browser-native STT/TTS | OpenAI | 3040 |
| 41 | OpenAI LLM + Deepgram STT/TTS production | OpenAI + Deepgram | 3041 |
| 42 | OpenAI LLM + OpenAI TTS | OpenAI | 3042 |

WebLLM (50)

Fully offline voice agent — no API keys, no server.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 50 | In-browser LLM via WebGPU (100% offline) | None | 3050 |

Groq (60)

Ultra-fast inference with Groq-hosted models.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 60 | Groq LLM + Deepgram STT/TTS | Groq + Deepgram | 3060 |

AssemblyAI (70)

Real-time transcription with word-level timing.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 70 | AssemblyAI STT + Claude + Deepgram TTS | AssemblyAI + Anthropic + Deepgram | 3070 |

ElevenLabs (80–81)

Ultra-low-latency streaming TTS and real-time STT with natural voices.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 80 | ElevenLabs TTS + Deepgram STT + Claude | ElevenLabs + Deepgram + Anthropic | 3080 |
| 81 | ElevenLabs STT (Scribe V2) standalone demo | ElevenLabs | 3081 |

Cartesia (90)

Low-latency TTS with emotion controls.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 90 | Cartesia TTS + Deepgram STT + Groq LLM | Cartesia + Deepgram + Groq | 3090 |

Gemini (100)

Google Gemini models via OpenAI-compatible API.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 100 | Gemini LLM + Deepgram STT + ElevenLabs | Gemini + Deepgram + ElevenLabs | 3100 |

Mistral (110)

Mistral open and commercial models.

| # | What it demonstrates | API keys needed | Port |
| --- | --- | --- | --- |
| 110 | Mistral LLM + Deepgram STT + ElevenLabs | Mistral + Deepgram + ElevenLabs | 3110 |

Running examples

pnpm install && pnpm build

# Getting started
pnpm example:00-minimal-voice-agent:dev          # http://localhost:3000
pnpm example:01-conversation-history:dev         # http://localhost:3001
pnpm example:02-system-persona:dev               # http://localhost:3002
pnpm example:03-event-inspector:dev              # http://localhost:3003
pnpm example:04-error-recovery:dev               # http://localhost:3004
pnpm example:05-turn-taking:dev                  # http://localhost:3005
pnpm example:06-advanced-config:dev              # http://localhost:3006

# Production patterns
pnpm example:10-proxy-server:dev                 # http://localhost:3010
pnpm example:11-nextjs-proxy:dev                 # http://localhost:3011
pnpm example:12-custom-provider:dev              # http://localhost:3012
pnpm example:13-multi-language:dev               # http://localhost:3013

# Deepgram
pnpm example:20-deepgram-pipeline:dev            # http://localhost:3020
pnpm example:21-eager-pipeline:dev               # http://localhost:3021
pnpm example:22-deepgram-options:dev             # http://localhost:3022
pnpm example:23-deepgram-voices:dev              # http://localhost:3023
pnpm example:24-deepgram-conversation-history:dev # http://localhost:3024

# Deepgram Voice Agent
pnpm example:70-deepgram-agent:dev               # http://localhost:3070

# Anthropic
pnpm example:30-anthropic-models:dev             # http://localhost:3030
pnpm example:31-anthropic-streaming-config:dev   # http://localhost:3031

# OpenAI
pnpm example:40-openai-pipeline:dev              # http://localhost:3040
pnpm example:41-openai-deepgram:dev              # http://localhost:3041
pnpm example:42-openai-tts-pipeline:dev          # http://localhost:3042

# WebLLM (offline)
pnpm example:50-webllm-pipeline:dev              # http://localhost:3050

# Groq
pnpm example:60-groq-pipeline:dev               # http://localhost:3060

# AssemblyAI
pnpm example:70-assemblyai-pipeline:dev          # http://localhost:3070

# ElevenLabs
pnpm example:80-elevenlabs-pipeline:dev          # http://localhost:3080
pnpm example:81-elevenlabs-stt:dev               # http://localhost:3081

# Cartesia
pnpm example:90-cartesia-pipeline:dev            # http://localhost:3090

# Gemini
pnpm example:100-gemini-pipeline:dev             # http://localhost:3100

# Mistral
pnpm example:110-mistral-pipeline:dev            # http://localhost:3110

Browser support

| Browser | NativeSTT | DeepgramSTT | DeepgramFlux | AssemblyAISTT | ElevenLabsSTT | NativeTTS | DeepgramTTS | OpenAITTS | ElevenLabsTTS | CartesiaTTS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chrome / Edge | Full | Full | Full | Full | Full | Full | Full | Full | Full | Full |
| Firefox | Not supported | Full | Full | Full | Full | Full | Full | Full | Full | Full |
| Safari | Limited | Full | Full | Full | Full | Full | Full | Full | Full | Full |

NativeSTT depends on the Web Speech API, which is fully supported only in Chromium-based browsers and is unreliable in Safari. All WebSocket-based providers (Deepgram, AssemblyAI, ElevenLabs, Cartesia) and REST-based providers (OpenAI) work across all modern browsers.

For cross-browser production deployments, use DeepgramSTT, AssemblyAISTT, or ElevenLabsSTT for STT, and any cloud TTS provider.
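Because Web Speech API support varies, it is worth feature-detecting before choosing NativeSTT. The speechRecognitionAvailable helper below is hypothetical, not an SDK export; it takes the global scope as a parameter so the check also works (and safely returns false) outside the browser.

```typescript
// Detect the Web Speech API (unprefixed or the webkit-prefixed variant
// that Chromium historically shipped) on a given global scope.
function speechRecognitionAvailable(scope: object): boolean {
  return 'SpeechRecognition' in scope || 'webkitSpeechRecognition' in scope;
}

// In the browser you might then pick a provider accordingly, e.g.:
// const stt = speechRecognitionAvailable(window)
//   ? new NativeSTT()
//   : new DeepgramSTT({ proxyUrl: `${window.location.origin}/proxy/deepgram` });

console.log(speechRecognitionAvailable(globalThis)); // false in Node.js
```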


Contributing

Contributions are welcome — new providers, bug fixes, documentation improvements, and feature requests. Every contribution matters, and there are ways to contribute at every experience level.

New here? Look for issues labelled good first issue.


Community

CompositeVoice is built in the open and shaped by the people who use it.

If you build something with CompositeVoice, share it in Discussions. Seeing real applications is one of the best ways to understand what the SDK does well and where it still has gaps.


License

MIT © Luke Oliff
