1 change: 1 addition & 0 deletions docs.json
@@ -266,6 +266,7 @@
"server/utilities/audio/audio-buffer-processor",
"server/utilities/audio/koala-filter",
"server/utilities/audio/krisp-viva-filter",
"server/utilities/audio/krisp-viva-vad-analyzer",
"server/utilities/audio/silero-vad-analyzer",
"server/utilities/audio/soundfile-mixer"
]
47 changes: 41 additions & 6 deletions guides/features/krisp-viva.mdx
@@ -6,12 +6,13 @@ description: "Learn how to integrate Krisp's VIVA voice isolation and turn detec

## Overview

Krisp's VIVA SDK provides two capabilities for Pipecat applications:
Krisp's VIVA SDK provides three capabilities for Pipecat applications:

- **Voice Isolation** — Filter out background noise and voices from the user's audio input stream, yielding clearer audio for fewer false interruptions and better transcription.
- **Turn Detection** — Determine when a user has finished speaking using Krisp's streaming turn detection model, as an alternative to the [Smart Turn model](/server/utilities/turn-detection/smart-turn-overview).
- **Voice Activity Detection** — Detect speech in audio streams using Krisp's VAD model, supporting sample rates from 8kHz to 48kHz.

You can use either or both features together.
You can use any combination of these features together.

<CardGroup cols={2}>
<Card
@@ -28,12 +29,19 @@ You can use either or both features together.
>
API reference for turn detection
</Card>
<Card
title="KrispVivaVadAnalyzer Reference"
icon="code"
href="/server/utilities/audio/krisp-viva-vad-analyzer"
>
API reference for voice activity detection
</Card>
<Card
title="Krisp VIVA Example"
icon="play"
href="https://github.com/pipecat-ai/pipecat/blob/main/examples/foundational/07p-interruptible-krisp-viva.py"
>
Complete example with voice isolation and turn detection
Complete example with Krisp features
</Card>
<Card
title="Krisp Developers"
@@ -102,12 +110,15 @@ KRISP_VIVA_FILTER_MODEL_PATH=/PATH_TO_UNZIPPED_MODELS/krisp-viva-tel-v2.kef

# Turn detection model path
KRISP_VIVA_TURN_MODEL_PATH=/PATH_TO_UNZIPPED_MODELS/krisp-viva-tt-v2.kef

# Voice activity detection model path (optional)
KRISP_VIVA_VAD_MODEL_PATH=/PATH_TO_UNZIPPED_MODELS/krisp-viva-vad-v2.kef
```

<Note>
The voice isolation and turn detection features use **different models**. Set
`KRISP_VIVA_FILTER_MODEL_PATH` for voice isolation and
`KRISP_VIVA_TURN_MODEL_PATH` for turn detection.
Each feature uses a **different model**. Set `KRISP_VIVA_FILTER_MODEL_PATH`
for voice isolation, `KRISP_VIVA_TURN_MODEL_PATH` for turn detection, and
`KRISP_VIVA_VAD_MODEL_PATH` for voice activity detection.
</Note>
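
Before starting a bot, it can help to confirm that each model path is set and points to a real `.kef` file. The sketch below uses only the Python standard library. `check_krisp_model_paths` is a hypothetical helper written for this guide, not part of Pipecat or the Krisp SDK, and the checks it performs are assumptions about what a sensible configuration looks like.

```python
import os
from pathlib import Path

# Environment variable names taken from the configuration above.
KRISP_MODEL_ENV_VARS = {
    "voice isolation": "KRISP_VIVA_FILTER_MODEL_PATH",
    "turn detection": "KRISP_VIVA_TURN_MODEL_PATH",
    "voice activity detection": "KRISP_VIVA_VAD_MODEL_PATH",
}


def check_krisp_model_paths(env=None):
    """Return a dict mapping each feature name to a status string.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    report = {}
    for feature, var in KRISP_MODEL_ENV_VARS.items():
        value = env.get(var)
        if not value:
            report[feature] = f"{var} is not set"
        elif not value.endswith(".kef"):
            report[feature] = f"{var} does not point to a .kef file"
        elif not Path(value).is_file():
            report[feature] = f"{var} points to a missing file"
        else:
            report[feature] = "ok"
    return report
```

Calling `check_krisp_model_paths()` with no arguments reads the current process environment. Since you can use any combination of features, an unset variable only matters for the features you actually enable.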

## Test the integration
@@ -170,3 +181,27 @@ user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
```

See the [KrispVivaTurn reference](/server/utilities/turn-detection/krisp-viva-turn) for configuration options.

## Voice Activity Detection

`KrispVivaVadAnalyzer` detects speech in audio streams using Krisp's VAD model. It supports sample rates from 8kHz to 48kHz, making it suitable for a wide range of applications including telephony and high-quality audio.

Configure it as a VAD analyzer:

```python
from pipecat.audio.vad.krisp_viva_vad import KrispVivaVadAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)

user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    user_params=LLMUserAggregatorParams(
        vad_analyzer=KrispVivaVadAnalyzer(params=VADParams(stop_secs=0.2)),
    ),
)
```

See the [KrispVivaVadAnalyzer reference](/server/utilities/audio/krisp-viva-vad-analyzer) for configuration options.
4 changes: 2 additions & 2 deletions guides/learn/speech-input.mdx
@@ -26,9 +26,9 @@ Custom strategies can also be implemented for specific use cases. By combining t

### What VAD Does

VAD is responsible for detecting when a user starts and stops speaking. Pipecat uses the [Silero VAD](https://github.com/snakers4/silero-vad), an open-source model that runs locally on CPU with minimal overhead.
VAD is responsible for detecting when a user starts and stops speaking. Pipecat includes [Silero VAD](https://github.com/snakers4/silero-vad), an open-source model that runs locally on CPU with minimal overhead. [Krisp VIVA VAD](/server/utilities/audio/krisp-viva-vad-analyzer) is also available for applications requiring support for higher sample rates.

**Performance characteristics:**
**Silero VAD performance characteristics:**

- Processes 30+ms audio chunks in less than 1ms
- Runs on a single CPU thread
104 changes: 104 additions & 0 deletions server/utilities/audio/krisp-viva-vad-analyzer.mdx
@@ -0,0 +1,104 @@
---
title: "KrispVivaVadAnalyzer"
description: "Voice Activity Detection analyzer using the Krisp VIVA SDK"
---

## Overview

`KrispVivaVadAnalyzer` is a Voice Activity Detection (VAD) analyzer that uses the Krisp VIVA SDK to detect speech in audio streams. It supports sample rates from 8 kHz to 48 kHz.

## Installation

```bash
pip install "pipecat-ai[krisp]"
```

## Prerequisites

You need a Krisp VIVA VAD model file (`.kef` extension). Set the model path via:

- The `model_path` constructor parameter, or
- The `KRISP_VIVA_VAD_MODEL_PATH` environment variable
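
The lookup order described above (explicit argument first, environment variable as the fallback) can be sketched as follows. `resolve_vad_model_path` is a hypothetical function used only to illustrate the precedence; the analyzer's actual implementation and error messages may differ.

```python
import os


def resolve_vad_model_path(model_path=None, env=None):
    """Pick the model path: explicit argument first, then the environment.

    Hypothetical helper; not the analyzer's real code.
    """
    env = os.environ if env is None else env
    path = model_path or env.get("KRISP_VIVA_VAD_MODEL_PATH")
    if not path:
        raise ValueError(
            "No model path: pass model_path or set KRISP_VIVA_VAD_MODEL_PATH"
        )
    if not path.endswith(".kef"):
        raise ValueError(f"Expected a .kef model file, got: {path}")
    return path
```

Passing `model_path` explicitly is convenient in tests or when several models live side by side; the environment variable suits deployments where the path differs per machine.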

## Constructor Parameters

<ParamField path="model_path" type="str" default="None">
Path to the Krisp model file (`.kef` extension). If not provided, uses the
`KRISP_VIVA_VAD_MODEL_PATH` environment variable.
</ParamField>

<ParamField path="frame_duration" type="int" default="10">
Frame duration in milliseconds. Must be 10, 15, 20, 30, or 32ms.
</ParamField>

<ParamField path="sample_rate" type="int" default="None">
Audio sample rate in Hz. Must be 8000, 16000, 32000, 44100, or 48000.
</ParamField>

<ParamField path="params" type="VADParams" default="VADParams()">
Voice Activity Detection parameters object
<Expandable title="properties">
<ParamField path="confidence" type="float" default="0.7">
Confidence threshold for speech detection. Higher values make detection more strict. Must
be between 0 and 1.
</ParamField>

<ParamField path="start_secs" type="float" default="0.2">
Time in seconds that speech must be detected before transitioning to SPEAKING state.
</ParamField>

<ParamField path="stop_secs" type="float" default="0.2">
Time in seconds of silence required before transitioning back to QUIET state.
</ParamField>

<ParamField path="min_volume" type="float" default="0.6">
Minimum audio volume threshold for speech detection. Must be between 0 and 1.
</ParamField>

</Expandable>
</ParamField>
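
To make the interaction between these parameters concrete, here is a simplified two-state sketch of how `confidence`, `min_volume`, `start_secs`, and `stop_secs` could drive the QUIET and SPEAKING transitions. It is an illustration under stated assumptions, not Pipecat's implementation: `SketchVadParams` and `SketchVadStateMachine` are hypothetical names, the per-frame `confidence` and `volume` inputs are assumed to come from the model and a volume estimator, and the use of `>=` comparisons is a guess.

```python
from dataclasses import dataclass


@dataclass
class SketchVadParams:
    # Defaults mirror the values documented above.
    confidence: float = 0.7
    start_secs: float = 0.2
    stop_secs: float = 0.2
    min_volume: float = 0.6


class SketchVadStateMachine:
    """Simplified QUIET/SPEAKING illustration; not Pipecat's code."""

    def __init__(self, params, frame_duration_ms=10):
        self.params = params
        frame_secs = frame_duration_ms / 1000
        # Consecutive frames needed before a state change is accepted.
        self.start_frames = max(1, round(params.start_secs / frame_secs))
        self.stop_frames = max(1, round(params.stop_secs / frame_secs))
        self.state = "QUIET"
        self._run = 0  # consecutive frames that disagree with the current state

    def update(self, confidence, volume):
        """Feed one frame's confidence and volume; return the new state."""
        is_speech = (
            confidence >= self.params.confidence
            and volume >= self.params.min_volume
        )
        if self.state == "QUIET":
            self._run = self._run + 1 if is_speech else 0
            if self._run >= self.start_frames:
                self.state, self._run = "SPEAKING", 0
        else:
            self._run = self._run + 1 if not is_speech else 0
            if self._run >= self.stop_frames:
                self.state, self._run = "QUIET", 0
        return self.state
```

With the defaults and 10 ms frames, this sketch needs 20 consecutive speech frames to enter SPEAKING and 20 consecutive non-speech frames to return to QUIET. Lowering `stop_secs` ends turns sooner but can cut off speakers who pause briefly.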

## Usage Example

```python
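from pipecat.audio.vad.krisp_viva_vad import KrispVivaVadAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)

# `messages` is assumed to be your initial list of chat messages.
context = LLMContext(messages)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
    context,
    user_params=LLMUserAggregatorParams(
        vad_analyzer=KrispVivaVadAnalyzer(
            model_path="/path/to/model.kef",
            params=VADParams(stop_secs=0.2),
        ),
    ),
)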
```

## Technical Details

### Sample Rate Requirements

The analyzer supports five sample rates:

- 8000 Hz
- 16000 Hz
- 32000 Hz
- 44100 Hz
- 48000 Hz
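
Sample rate and frame duration together determine how many samples each analyzed frame contains. The sketch below validates both against the values documented on this page and computes the frame size. `samples_per_frame` is a hypothetical helper, not part of the analyzer's API.

```python
# Values documented on this page.
SUPPORTED_SAMPLE_RATES = (8000, 16000, 32000, 44100, 48000)
SUPPORTED_FRAME_DURATIONS_MS = (10, 15, 20, 30, 32)


def samples_per_frame(sample_rate, frame_duration_ms=10):
    """Return the number of samples in one frame (may be fractional).

    Hypothetical helper for illustration only.
    """
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"Unsupported sample rate: {sample_rate}")
    if frame_duration_ms not in SUPPORTED_FRAME_DURATIONS_MS:
        raise ValueError(f"Unsupported frame duration: {frame_duration_ms}")
    return sample_rate * frame_duration_ms / 1000
```

For example, 16000 Hz with the default 10 ms frame gives 160 samples per frame. Some combinations, such as 44100 Hz with 15 ms frames, do not yield a whole number of samples; how the SDK handles those cases is not covered here.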

### Model Requirements

- Model files must have a `.kef` extension
- Model path can be specified via constructor or environment variable
- Model is loaded once during initialization

## Notes

- High-accuracy speech detection using Krisp VIVA SDK
- Supports multiple sample rates (8kHz to 48kHz)
- Requires external `.kef` model file
- Thread-safe for pipeline processing
- Automatic session management
- Configurable frame duration