Add Phonic Plugin to LiveKit agents #4980

Merged
tinalenguyen merged 13 commits into livekit:main from Phonic-Co:qiong/phonic
Mar 4, 2026

Conversation

@qionghuang6 (Contributor) commented Mar 3, 2026

Implements the Phonic plugin for the LiveKit Agents Python SDK.
(Original LiveKit Agents JS implementation: https://github.com/Phonic-Co/livekit-agents-js/tree/main/plugins/phonic)

Changes

  • Implements RealtimeSession and RealtimeModel, supporting STS and tool calls
  • Uses Phonic Python SDK
  • Adds example to examples/voice-agents

Testing

  • Tested in Agent playground, including tool use.

Demo video

@qionghuang6 qionghuang6 marked this pull request as ready for review March 3, 2026 00:36
@qionghuang6 qionghuang6 changed the title Add support for Phonic Plugin to LiveKit agents Add Phonic Plugin to LiveKit agents Mar 3, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 13 additional findings in Devin Review.


        "Phonic does not support updating instructions mid-session."
    )
    return
self._opts.instructions = instructions

🔴 Mutating shared _opts object in update_instructions affects the model and all future sessions

When update_instructions is called, it mutates self._opts.instructions at line 258. However, self._opts is assigned as a direct reference to realtime_model._opts at line 198 (self._opts = realtime_model._opts). This means the mutation leaks into the parent RealtimeModel and any future sessions created from it.

Root Cause and Impact

The RealtimeSession.__init__ at livekit-plugins/livekit-plugins-phonic/livekit/plugins/phonic/realtime/realtime_model.py:198 stores a reference to the model's _opts:

self._opts = realtime_model._opts

Then update_instructions at line 258 mutates that shared object:

self._opts.instructions = instructions

Since RealtimeModel.session() can be called multiple times (e.g., when the agent activity is updated at livekit-agents/livekit/agents/voice/agent_activity.py:556), the second session would inherit the instructions set by the first session's update_instructions call, rather than the original NOT_GIVEN default. This can lead to stale or incorrect instructions being sent to new Phonic sessions.

Expected: Each session should have its own copy of options, or instructions should be stored in a session-local variable.
Actual: Instructions set in one session mutate the shared model-level options object.

Prompt for agents
In livekit-plugins/livekit-plugins-phonic/livekit/plugins/phonic/realtime/realtime_model.py, line 198, change self._opts = realtime_model._opts to create a copy of the options (e.g., using dataclasses.replace or copy.copy) so that mutations in update_instructions (line 258) don't affect the parent RealtimeModel or other sessions. For example, change line 198 from:
  self._opts = realtime_model._opts
to:
  import copy
  self._opts = copy.copy(realtime_model._opts)
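The aliasing bug described above can be reproduced in isolation. Below is a minimal sketch with a hypothetical `_Options` dataclass standing in for the plugin's real options object (names invented for illustration):

```python
import copy
from dataclasses import dataclass
from typing import Optional

@dataclass
class _Options:
    # Stand-in for the plugin's real options object (illustrative only).
    instructions: Optional[str] = None

# Buggy pattern: the session keeps a direct reference to the model's options,
# so a session-level mutation leaks back into the model.
model_opts = _Options()
session_opts = model_opts
session_opts.instructions = "be terse"
print(model_opts.instructions)  # be terse -- leaked into the model

# Fixed pattern: shallow-copy the options per session.
model_opts = _Options()
session_opts = copy.copy(model_opts)
session_opts.instructions = "be terse"
print(model_opts.instructions)  # None -- model unaffected
```

For a dataclass-based options object, `dataclasses.replace(model_opts)` would produce a fresh copy equally well.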


qionghuang6 (Contributor, Author):

This is the pattern that the Google realtime and other realtime models use.

@tinalenguyen (Member) left a comment

thank you for the PR! i left a few comments, could you also:

  • undo the uv.lock changes
  • add the plugin to the main pyproject file here
  • add the plugin to the agents pyproject file here

        self._opts.instructions = instructions
        self._instructions_ready.set()

    async def update_chat_ctx(self, chat_ctx: llm.ChatContext) -> None:
tinalenguyen (Member):

at the start of the session, update_chat_ctx is called with a fresh chat context and i see this log:

WARNI… livekit.…ns.phonic update_chat_ctx called but no new tool call outputs to send. Phonic does not support general chat context updates.

perhaps we could add a check if the session just started?

qionghuang6 (Contributor, Author):

Yup, just added a check to log this only if the config message has already been sent to Phonic.

Comment on lines +341 to +344
    def interrupt(self) -> None:
        logger.warning(
            "interrupt() is not supported by Phonic realtime model. "
            "User interruptions are automatically handled by Phonic."
tinalenguyen (Member):

we call interrupt() when input speech is detected (relevant lines), so even if there is no ongoing speech to interrupt this log appears. adding this check should filter out those instances

Suggested change
Before:
    def interrupt(self) -> None:
        logger.warning(
            "interrupt() is not supported by Phonic realtime model. "
            "User interruptions are automatically handled by Phonic."
After:
    def interrupt(self) -> None:
        if self._current_generation:
            logger.warning(
                "interrupt() is not supported by Phonic realtime model. "
                "User interruptions are automatically handled by Phonic."
            )

devin-ai-integration[bot]

This comment was marked as resolved.

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 17 additional findings in Devin Review.


Comment on lines +566 to +572
    if self._current_generation is None and message.text:
        logger.debug("Starting new generation due to text in audio chunk")
        self._start_new_assistant_turn()

    gen = self._current_generation
    if gen is None:
        return

🟡 Audio-only chunks silently dropped when no active generation exists

In _handle_audio_chunk, a new generation is only created when message.text is truthy (line 566). If an audio chunk arrives with audio data but no text, and there is no active generation (self._current_generation is None), the audio is silently discarded at line 571-572.

Root Cause and Impact

The guard at line 566 checks:

if self._current_generation is None and message.text:

But it should also account for message.audio. When an audio-only chunk (no text) arrives without an active generation — for example if Phonic sends audio before assistant_started_speaking, or if the event was missed — the code falls through to:

gen = self._current_generation  # None
if gen is None:
    return  # <-- audio silently dropped

The audio data in message.audio is never decoded or forwarded, causing audio loss for that chunk. The condition should be message.text or message.audio to also start a generation for audio-only chunks.

Impact: Potential audio loss/gaps if audio chunks arrive without an active generation. In the normal protocol flow (assistant_started_speaking → audio_chunk), this won't trigger, but any deviation from that ordering causes silent audio drops.

Suggested change
Before:
    if self._current_generation is None and message.text:
        logger.debug("Starting new generation due to text in audio chunk")
        self._start_new_assistant_turn()

    gen = self._current_generation
    if gen is None:
        return
After:
    if self._current_generation is None and (message.text or message.audio):
        logger.debug("Starting new generation due to content in audio chunk")
        self._start_new_assistant_turn()

    gen = self._current_generation
    if gen is None:
        return
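The behavioral difference between the two guards can be shown with a small self-contained sketch (dict-based stand-ins for the Phonic message; not the plugin's actual code):

```python
def handle_chunk(chunk, current_generation, *, text_only_guard: bool):
    """Mirror the guard logic in _handle_audio_chunk (illustrative only).

    Returns (generation, dropped): the list the audio was routed into,
    or dropped=True when the chunk is silently discarded.
    """
    if text_only_guard:
        has_content = bool(chunk["text"])                     # current guard
    else:
        has_content = bool(chunk["text"] or chunk["audio"])   # suggested guard
    if current_generation is None and has_content:
        current_generation = []          # stands in for _start_new_assistant_turn()
    if current_generation is None:
        return None, True                # audio silently dropped
    current_generation.append(chunk["audio"])
    return current_generation, False

audio_only = {"text": "", "audio": b"\x01\x02"}

# Current guard: an audio-only chunk with no active generation is dropped.
gen, dropped = handle_chunk(audio_only, None, text_only_guard=True)
print(dropped)   # True

# Suggested guard: the chunk starts a generation and the audio is kept.
gen, dropped = handle_chunk(audio_only, None, text_only_guard=False)
print(dropped)   # False
```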


qionghuang6 (Contributor, Author):

This is intentional. Phonic can send silent audio chunks (or audio chunks mixed in with background noise), but we want to ignore these chunks in our integration with the LiveKit Agents SDKs.

@qionghuang6 (Contributor, Author):

Thanks @tinalenguyen. I made some quick changes. I also just reset uv.lock to what it is on main. Just to confirm: we don't want to update uv.lock here even though we changed pyproject.toml?

Also, I think the ruff CI is having some issues due to GitHub being partially down? We might need to re-run it. https://github.com/livekit/agents/actions/runs/22639600339/job/65611716048?pr=4980

@tinalenguyen (Member):

The uv.lock file is usually auto-updated from version releases, and looks like the ruff tests passed!

Just tested again, I consistently see this log when a tool is called:
_SegmentSynchronizerImpl.playback_finished called before text/audio input is done {"text_done": false, "audio_done": true}

I'm checking if this is related to any plugin changes, I don't recall seeing that initially

@tinalenguyen (Member):

@qionghuang6 I think marking the generation as done upon "assistant_finished_speaking" when a tool is called causes this behavior. The tool call starts a new generation with no text or audio; the expected behavior is that it's included in the current generation as well.

Is there an event similar to response.done to mark the end of a turn with a function call?

@qionghuang6 (Contributor, Author) commented Mar 3, 2026

EDIT: What I said below assumes the assistant speaks before calling the tool. In that case we don't get the warning; the warning happens when the assistant doesn't say anything before calling the tool, so nothing makes its way into the text channel and text_done doesn't become true by the time that generation is closed.

@tinalenguyen Are you saying that when a tool is called, it is in its own generation without audio or text? I think the current implementation uses the same generation as the existing audio and text (unless the `if self._current_generation is None:` check is not satisfied).

assistant_finished_speaking is our version of response.done; it's just that the turn is not considered done (i.e. the assistant has not finished speaking) until it has received the results of the tool call and finished speaking after receiving those results.

The current flow of events when a tool call happens is:
assistant_started_speaking -> New generation begins
audio_chunk
...
audio_chunk
tool_call -> Tool call put in function channel, we call _close_current_generation, current generation closes.
...
After the tool is executed, Livekit Agents calls update_chat_ctx which calls _start_new_assistant_turn, starting a new generation.
Agent continues speaking:
audio_chunk
...
assistant_finished_speaking.

For turns without tool calls, it is:
assistant_started_speaking -> New generation begins
audio_chunk
...
audio_chunk
assistant_finished_speaking
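The two flows above can be simulated with a toy tracker (all names illustrative, not the plugin's actual code); a turn with a tool call produces two generations, while a plain turn produces one:

```python
class TurnTracker:
    """Toy event handler mirroring the described generation lifecycle."""

    def __init__(self) -> None:
        self.current = None          # the open generation, if any
        self.closed_generations = 0

    def handle(self, event: str) -> None:
        if event == "assistant_started_speaking" and self.current is None:
            self.current = []                      # open a new generation
        elif event == "audio_chunk" and self.current is not None:
            self.current.append(event)             # route audio into it
        elif event in ("tool_call", "assistant_finished_speaking"):
            if self.current is not None:           # close the generation so
                self.closed_generations += 1       # tool results can flow back
                self.current = None
        elif event == "update_chat_ctx" and self.current is None:
            self.current = []                      # tool results start a new turn

# Turn with a tool call: two generations are opened and closed.
tool_turn = TurnTracker()
for ev in ["assistant_started_speaking", "audio_chunk", "tool_call",
           "update_chat_ctx", "audio_chunk", "assistant_finished_speaking"]:
    tool_turn.handle(ev)
print(tool_turn.closed_generations)  # 2

# Plain turn: a single generation.
plain_turn = TurnTracker()
for ev in ["assistant_started_speaking", "audio_chunk", "assistant_finished_speaking"]:
    plain_turn.handle(ev)
print(plain_turn.closed_generations)  # 1
```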

I tried logging when assistant_finished_speaking is received, and it seems that I am seeing this _SegmentSynchronizerImpl warning prior to the assistant_finished_speaking event.

Continuing to look into this.

    )

    if not gen.text_ch.closed:
        gen.text_ch.send_nowait("")
qionghuang6 (Contributor, Author):

I investigated more deeply, and I think the reason we were seeing

_SegmentSynchronizerImpl.playback_finished called before text/audio input is done {"text_done": false, "audio_done": true}

is because of this chain of events:

User says -> "Please turn on light A 5 for me"
Phonic -> assistant_started_speaking (Plugin -> starts a new generation)
Phonic -> tool_call (Plugin -> closes current generation, but no text chunks were put in the text_ch by the time it closed)

(We need to close the current generation so that tool results actually get sent back to the plugin through `update_chat_ctx`.)

However, because no text ever made it into the text channel, text_done never becomes true.

A quick fix for this is to just put an empty string in the text channel, so that the text/audio synchronizer can change the state from false to true.
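A toy model of why the empty-string push unblocks the synchronizer; the `text_done` rule below is a hypothesis matching the observed warning, not the actual livekit-agents implementation:

```python
class TextChannel:
    """Minimal stand-in for the generation's text channel (illustrative)."""

    def __init__(self) -> None:
        self.items = []
        self.closed = False

    def send_nowait(self, item: str) -> None:
        self.items.append(item)

    def close(self) -> None:
        self.closed = True

def text_done(ch: TextChannel) -> bool:
    # Hypothetical completion rule matching the observed warning: if no
    # segment ever arrives, the text side never completes.
    return ch.closed and len(ch.items) > 0

ch = TextChannel()
ch.close()
print(text_done(ch))   # False -> would trigger the playback_finished warning

ch = TextChannel()
if not ch.closed:
    ch.send_nowait("")  # the fix: an empty segment completes the text side
ch.close()
print(text_done(ch))   # True
```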

tinalenguyen (Member):

Thank you for the detailed overview! Just to confirm, the tool call results are received after the "assistant_started_speaking" event and that's why there's something like an in-between empty generation?

qionghuang6 (Contributor, Author):

Yup, the assistant_started_speaking fires before the tool_call, because assistant_started_speaking is basically our version of saying a turn has started, and tool calls need to happen within an assistant turn.

tinalenguyen (Member):

Ahh okay, I see now! Thanks again

@qionghuang6 qionghuang6 requested a review from tinalenguyen March 3, 2026 23:51
@tinalenguyen (Member) left a comment

lgtm and works great, thanks!

@tinalenguyen tinalenguyen merged commit 9a49e19 into livekit:main Mar 4, 2026
10 checks passed