diff --git a/docs.json b/docs.json
index d7af84da..5cbd0ca0 100644
--- a/docs.json
+++ b/docs.json
@@ -266,6 +266,7 @@
             "server/utilities/audio/audio-buffer-processor",
             "server/utilities/audio/koala-filter",
             "server/utilities/audio/krisp-viva-filter",
+            "server/utilities/audio/krisp-viva-vad-analyzer",
             "server/utilities/audio/silero-vad-analyzer",
             "server/utilities/audio/soundfile-mixer"
           ]
diff --git a/guides/features/krisp-viva.mdx b/guides/features/krisp-viva.mdx
index b5127ac1..4bb04cf3 100644
--- a/guides/features/krisp-viva.mdx
+++ b/guides/features/krisp-viva.mdx
@@ -6,12 +6,13 @@ description: "Learn how to integrate Krisp's VIVA voice isolation and turn detec
 
 ## Overview
 
-Krisp's VIVA SDK provides two capabilities for Pipecat applications:
+Krisp's VIVA SDK provides three capabilities for Pipecat applications:
 
 - **Voice Isolation** — Filter out background noise and voices from the user's audio input stream, yielding clearer audio for fewer false interruptions and better transcription.
 - **Turn Detection** — Determine when a user has finished speaking using Krisp's streaming turn detection model, as an alternative to the [Smart Turn model](/server/utilities/turn-detection/smart-turn-overview).
+- **Voice Activity Detection** — Detect speech in audio streams using Krisp's VAD model, supporting sample rates from 8 kHz to 48 kHz.
 
-You can use either or both features together.
+You can use any combination of these features together.
 
@@ ... @@
   <Card title="KrispVivaTurn" href="/server/utilities/turn-detection/krisp-viva-turn">
     API reference for turn detection
   </Card>
+  <Card title="KrispVivaVadAnalyzer" href="/server/utilities/audio/krisp-viva-vad-analyzer">
+    API reference for voice activity detection
+  </Card>
   <Card title="Example" href="https://github.com/pipecat-ai/pipecat/tree/main/examples">
-    Complete example with voice isolation and turn detection
+    Complete example with Krisp features
   </Card>
@@ ... @@
 <Note>
-  The voice isolation and turn detection features use **different models**. Set
-  `KRISP_VIVA_FILTER_MODEL_PATH` for voice isolation and
-  `KRISP_VIVA_TURN_MODEL_PATH` for turn detection.
+  Each feature uses a **different model**. Set `KRISP_VIVA_FILTER_MODEL_PATH`
+  for voice isolation, `KRISP_VIVA_TURN_MODEL_PATH` for turn detection, and
+  `KRISP_VIVA_VAD_MODEL_PATH` for voice activity detection.
 </Note>
 
 ## Test the integration
@@ -170,3 +181,27 @@ user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
 ```
 
 See the [KrispVivaTurn reference](/server/utilities/turn-detection/krisp-viva-turn) for configuration options.
+
+## Voice Activity Detection
+
+`KrispVivaVadAnalyzer` detects speech in audio streams using Krisp's VAD model. It supports sample rates from 8 kHz to 48 kHz, making it suitable for a wide range of applications, from telephony to high-quality audio.
+
+Configure it as a VAD analyzer:
+
+```python
+from pipecat.audio.vad.krisp_viva_vad import KrispVivaVadAnalyzer
+from pipecat.audio.vad.vad_analyzer import VADParams
+from pipecat.processors.aggregators.llm_response_universal import (
+    LLMContextAggregatorPair,
+    LLMUserAggregatorParams,
+)
+
+user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
+    context,
+    user_params=LLMUserAggregatorParams(
+        vad_analyzer=KrispVivaVadAnalyzer(params=VADParams(stop_secs=0.2)),
+    ),
+)
+```
+
+See the [KrispVivaVadAnalyzer reference](/server/utilities/audio/krisp-viva-vad-analyzer) for configuration options.
diff --git a/guides/learn/speech-input.mdx b/guides/learn/speech-input.mdx
index 35d72d0f..f62d3a47 100644
--- a/guides/learn/speech-input.mdx
+++ b/guides/learn/speech-input.mdx
@@ -26,9 +26,9 @@ Custom strategies can also be implemented for specific use cases. By combining t
 
 ### What VAD Does
 
-VAD is responsible for detecting when a user starts and stops speaking. Pipecat uses the [Silero VAD](https://github.com/snakers4/silero-vad), an open-source model that runs locally on CPU with minimal overhead.
+VAD is responsible for detecting when a user starts and stops speaking. Pipecat includes [Silero VAD](https://github.com/snakers4/silero-vad), an open-source model that runs locally on CPU with minimal overhead. [Krisp VIVA VAD](/server/utilities/audio/krisp-viva-vad-analyzer) is also available for applications requiring support for higher sample rates.
 
-**Performance characteristics:**
+**Silero VAD performance characteristics:**
 
 - Processes 30+ms audio chunks in less than 1ms
 - Runs on a single CPU thread
diff --git a/server/utilities/audio/krisp-viva-vad-analyzer.mdx b/server/utilities/audio/krisp-viva-vad-analyzer.mdx
new file mode 100644
index 00000000..a1bd6cb0
--- /dev/null
+++ b/server/utilities/audio/krisp-viva-vad-analyzer.mdx
@@ -0,0 +1,104 @@
+---
+title: "KrispVivaVadAnalyzer"
+description: "Voice Activity Detection analyzer using the Krisp VIVA SDK"
+---
+
+## Overview
+
+`KrispVivaVadAnalyzer` is a Voice Activity Detection (VAD) analyzer that uses the Krisp VIVA SDK to detect speech in audio streams. It provides high-accuracy speech detection with support for multiple sample rates.
+
+## Installation
+
+```bash
+pip install "pipecat-ai[krisp]"
+```
+
+## Prerequisites
+
+You need a Krisp VIVA VAD model file (`.kef` extension). Set the model path via:
+
+- the `model_path` constructor parameter, or
+- the `KRISP_VIVA_VAD_MODEL_PATH` environment variable
+
+## Constructor Parameters
+
+<ParamField path="model_path" type="Optional[str]">
+  Path to the Krisp model file (`.kef` extension). If not provided, uses the
+  `KRISP_VIVA_VAD_MODEL_PATH` environment variable.
+</ParamField>
+
+<ParamField path="frame_duration" type="int">
+  Frame duration in milliseconds. Must be 10, 15, 20, 30, or 32 ms.
+</ParamField>
+
+<ParamField path="sample_rate" type="int">
+  Audio sample rate in Hz. Must be 8000, 16000, 32000, 44100, or 48000.
+</ParamField>
+
+<ParamField path="params" type="VADParams">
+  Voice Activity Detection parameters object:
+
+  - `confidence`: Confidence threshold for speech detection. Higher values make detection stricter. Must be between 0 and 1.
+  - `start_secs`: Time in seconds that speech must be detected before transitioning to the SPEAKING state.
+  - `stop_secs`: Time in seconds of silence required before transitioning back to the QUIET state.
+  - `min_volume`: Minimum audio volume threshold for speech detection. Must be between 0 and 1.
+</ParamField>
+
+## Usage Example
+
+```python
+from pipecat.audio.vad.krisp_viva_vad import KrispVivaVadAnalyzer
+from pipecat.audio.vad.vad_analyzer import VADParams
+from pipecat.processors.aggregators.llm_context import LLMContext
+from pipecat.processors.aggregators.llm_response_universal import (
+    LLMContextAggregatorPair,
+    LLMUserAggregatorParams,
+)
+
+context = LLMContext(messages)  # messages: your initial conversation messages
+user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
+    context,
+    user_params=LLMUserAggregatorParams(
+        vad_analyzer=KrispVivaVadAnalyzer(
+            model_path="/path/to/model.kef",
+            params=VADParams(stop_secs=0.2),
+        ),
+    ),
+)
+```
+
+## Technical Details
+
+### Sample Rate Requirements
+
+The analyzer supports five sample rates:
+
+- 8000 Hz
+- 16000 Hz
+- 32000 Hz
+- 44100 Hz
+- 48000 Hz
+
+### Model Requirements
+
+- Model files must have a `.kef` extension
+- The model path can be specified via the constructor or an environment variable
+- The model is loaded once during initialization
+
+## Notes
+
+- High-accuracy speech detection using the Krisp VIVA SDK
+- Supports multiple sample rates (8 kHz to 48 kHz)
+- Requires an external `.kef` model file
+- Thread-safe for pipeline processing
+- Automatic session management
+- Configurable frame duration
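
Reviewer note: the `VADParams` semantics documented above (`confidence` threshold, `start_secs` before entering SPEAKING, `stop_secs` of silence before returning to QUIET) can be illustrated with a small state machine. This is a hedged sketch of the QUIET/SPEAKING hysteresis those parameters describe, not Krisp's or Pipecat's actual implementation; the `SimpleVADStateMachine` class, its frame-counting details, and the local `VADParams` dataclass (distinct from `pipecat.audio.vad.vad_analyzer.VADParams`) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VADParams:
    """Illustrative stand-in for Pipecat's VADParams (not the real class)."""
    confidence: float = 0.7  # per-frame speech-probability threshold
    start_secs: float = 0.2  # sustained speech needed to enter SPEAKING
    stop_secs: float = 0.8   # sustained silence needed to return to QUIET


class SimpleVADStateMachine:
    """Sketch of QUIET/SPEAKING hysteresis driven by per-frame confidences."""

    def __init__(self, params: VADParams, frame_secs: float = 0.02):
        self.params = params
        self.state = "QUIET"
        # Convert time thresholds to whole frame counts to avoid
        # floating-point accumulation drift.
        self._start_frames = max(1, round(params.start_secs / frame_secs))
        self._stop_frames = max(1, round(params.stop_secs / frame_secs))
        self._run = 0  # consecutive speech frames (QUIET) or silence frames (SPEAKING)

    def process_frame(self, speech_confidence: float) -> str:
        is_speech = speech_confidence >= self.params.confidence
        if self.state == "QUIET":
            self._run = self._run + 1 if is_speech else 0
            if self._run >= self._start_frames:
                self.state, self._run = "SPEAKING", 0
        else:
            self._run = self._run + 1 if not is_speech else 0
            if self._run >= self._stop_frames:
                self.state, self._run = "QUIET", 0
        return self.state
```

With 20 ms frames and `start_secs=0.2`, ten consecutive high-confidence frames flip the machine to SPEAKING; a short blip of noise resets the counter, which is why raising `start_secs` suppresses false starts and raising `stop_secs` tolerates longer mid-utterance pauses.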