Add Phonic Plugin to LiveKit agents #4980
Conversation
```python
        "Phonic does not support updating instructions mid-session."
    )
    return
self._opts.instructions = instructions
```
🔴 Mutating shared _opts object in update_instructions affects the model and all future sessions
When update_instructions is called, it mutates self._opts.instructions at line 258. However, self._opts is assigned as a direct reference to realtime_model._opts at line 198 (self._opts = realtime_model._opts). This means the mutation leaks into the parent RealtimeModel and any future sessions created from it.
Root Cause and Impact
The RealtimeSession.__init__ at livekit-plugins/livekit-plugins-phonic/livekit/plugins/phonic/realtime/realtime_model.py:198 stores a reference to the model's _opts:
```python
self._opts = realtime_model._opts
```
Then update_instructions at line 258 mutates that shared object:
```python
self._opts.instructions = instructions
```
Since RealtimeModel.session() can be called multiple times (e.g., when the agent activity is updated at livekit-agents/livekit/agents/voice/agent_activity.py:556), the second session would inherit the instructions set by the first session's update_instructions call, rather than the original NOT_GIVEN default. This can lead to stale or incorrect instructions being sent to new Phonic sessions.
Expected: Each session should have its own copy of options, or instructions should be stored in a session-local variable.
Actual: Instructions set in one session mutate the shared model-level options object.
Prompt for agents
In livekit-plugins/livekit-plugins-phonic/livekit/plugins/phonic/realtime/realtime_model.py, line 198, change `self._opts = realtime_model._opts` to create a copy of the options (e.g., using dataclasses.replace or copy.copy) so that mutations in update_instructions (line 258) don't affect the parent RealtimeModel or other sessions. For example, change line 198 from:
```python
self._opts = realtime_model._opts
```
to:
```python
import copy
self._opts = copy.copy(realtime_model._opts)
```
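The aliasing bug and the fix can be demonstrated in isolation. This is a minimal sketch; the `_Options` dataclass and the `Model`/`Session` classes below are hypothetical stand-ins for the plugin's real types, not the plugin code itself:

```python
import copy
from dataclasses import dataclass


@dataclass
class _Options:
    # Stand-in for the plugin's options object; "NOT_GIVEN" mimics the default sentinel.
    instructions: str = "NOT_GIVEN"


class Model:
    def __init__(self) -> None:
        self._opts = _Options()

    def session(self, *, shared: bool) -> "Session":
        # shared=True mirrors the buggy `self._opts = realtime_model._opts`;
        # shared=False mirrors the proposed `copy.copy(...)` fix.
        return Session(self._opts if shared else copy.copy(self._opts))


class Session:
    def __init__(self, opts: _Options) -> None:
        self._opts = opts

    def update_instructions(self, instructions: str) -> None:
        self._opts.instructions = instructions


model = Model()
buggy = model.session(shared=True)
buggy.update_instructions("be terse")
print(model._opts.instructions)   # "be terse" -- the mutation leaked into the model

model2 = Model()
fixed = model2.session(shared=False)
fixed.update_instructions("be terse")
print(model2._opts.instructions)  # "NOT_GIVEN" -- the model is unaffected
```

A shallow copy is enough here because only the top-level `instructions` field is reassigned; if options held mutable nested state that sessions modify, `dataclasses.replace` or `copy.deepcopy` would be safer.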
This is the pattern that the Google realtime and other realtime models use.
```python
self._opts.instructions = instructions
self._instructions_ready.set()


async def update_chat_ctx(self, chat_ctx: llm.ChatContext) -> None:
```
at the start of the session, update_chat_ctx is called with a fresh chat context and I see this log:
```
WARNI… livekit.…ns.phonic update_chat_ctx called but no new tool call outputs to send. Phonic does not support general chat context updates.
```
perhaps we could add a check if the session just started?
Yup, just added a check to log this only if the config message has already been sent to Phonic.
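The guard described above can be sketched as follows. This is a rough illustration, assuming a session-local `_config_sent` flag; the actual flag name and method signatures in the plugin may differ:

```python
import logging

logger = logging.getLogger("phonic-sketch")


class Session:
    def __init__(self) -> None:
        # Assumed flag: set once the initial config message has gone to Phonic.
        self._config_sent = False

    def mark_config_sent(self) -> None:
        self._config_sent = True

    def update_chat_ctx(self, has_new_tool_outputs: bool) -> bool:
        """Returns True if the 'no new tool call outputs' warning would fire."""
        if has_new_tool_outputs:
            return False  # tool outputs are forwarded normally, no warning
        if not self._config_sent:
            return False  # initial update at session start: stay quiet
        logger.warning(
            "update_chat_ctx called but no new tool call outputs to send."
        )
        return True


s = Session()
print(s.update_chat_ctx(False))  # False: session just started, no warning
s.mark_config_sent()
print(s.update_chat_ctx(False))  # True: config already sent, warning fires
```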
```python
def interrupt(self) -> None:
    logger.warning(
        "interrupt() is not supported by Phonic realtime model. "
        "User interruptions are automatically handled by Phonic."
```
we call interrupt() when input speech is detected (relevant lines), so this log appears even if there is no ongoing speech to interrupt. Adding this check should filter out those instances:
Suggested change, from:
```python
def interrupt(self) -> None:
    logger.warning(
        "interrupt() is not supported by Phonic realtime model. "
        "User interruptions are automatically handled by Phonic."
```
to:
```python
def interrupt(self) -> None:
    if self._current_generation:
        logger.warning(
            "interrupt() is not supported by Phonic realtime model. "
            "User interruptions are automatically handled by Phonic."
        )
```
```python
if self._current_generation is None and message.text:
    logger.debug("Starting new generation due to text in audio chunk")
    self._start_new_assistant_turn()

gen = self._current_generation
if gen is None:
    return
```
🟡 Audio-only chunks silently dropped when no active generation exists
In _handle_audio_chunk, a new generation is only created when message.text is truthy (line 566). If an audio chunk arrives with audio data but no text, and there is no active generation (self._current_generation is None), the audio is silently discarded at line 571-572.
Root Cause and Impact
The guard at line 566 checks:
```python
if self._current_generation is None and message.text:
```
But it should also account for message.audio. When an audio-only chunk (no text) arrives without an active generation (for example, if Phonic sends audio before assistant_started_speaking, or if the event was missed), the code falls through to:
```python
gen = self._current_generation  # None
if gen is None:
    return  # <-- audio silently dropped
```
The audio data in message.audio is never decoded or forwarded, causing audio loss for that chunk. The condition should be `message.text or message.audio` to also start a generation for audio-only chunks.
Impact: Potential audio loss/gaps if audio chunks arrive without an active generation. In the normal protocol flow (assistant_started_speaking → audio_chunk), this won't trigger, but any deviation from that ordering causes silent audio drops.
Suggested change, from:
```python
if self._current_generation is None and message.text:
    logger.debug("Starting new generation due to text in audio chunk")
    self._start_new_assistant_turn()

gen = self._current_generation
if gen is None:
    return
```
to:
```python
if self._current_generation is None and (message.text or message.audio):
    logger.debug("Starting new generation due to content in audio chunk")
    self._start_new_assistant_turn()

gen = self._current_generation
if gen is None:
    return
```
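The proposed guard condition can be checked in isolation. The `AudioChunk` dataclass below is a hypothetical stand-in for the real Phonic message type; only the truthiness logic is the point:

```python
from dataclasses import dataclass


@dataclass
class AudioChunk:
    # Stand-in for Phonic's audio chunk message; real field types may differ.
    text: str = ""
    audio: bytes = b""


def should_start_generation(current_generation, message: AudioChunk) -> bool:
    # The reviewer's proposed condition: start a generation when there is no
    # active one and the chunk carries either text or audio.
    return current_generation is None and bool(message.text or message.audio)


print(should_start_generation(None, AudioChunk(text="hi")))          # True
print(should_start_generation(None, AudioChunk(audio=b"\x00\x01")))  # True: audio-only no longer dropped
print(should_start_generation(None, AudioChunk()))                   # False: empty chunk
```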
This is intentional. Phonic can send silent audio chunks (or audio chunks mixed in with background noise), but we want to ignore these chunks in our integration with the LiveKit agent SDKs.
Thanks @tinalenguyen. I made some quick changes. I also just reset. Also, I think the ruff CI is having some issues due to GitHub being partially down? We might need to re-run it. https://github.com/livekit/agents/actions/runs/22639600339/job/65611716048?pr=4980
Just tested again: I consistently see this log when a tool is called. I'm checking if this is related to any plugin changes; I don't recall seeing that initially.
@qionghuang6 I think marking the generation as done upon

Is there an event similar to
EDIT: What I said below assumes the assistant speaks before calling the tool. In that case, we don't get the warning; the warning happens when the assistant doesn't say anything before calling the tool, so nothing makes its way into the text channel.

@tinalenguyen Are you saying that when a tool is called, it is in its own generation without audio or text? I think the current implementation uses the same generation as the existing audio and text. (Unless
The current flow of events when a tool call happens is:

For turns without tool calls, it is:

I tried logging when

Continuing to look into this.
```python
)

if not gen.text_ch.closed:
    gen.text_ch.send_nowait("")
```
I investigated more deeply, and I think the reason we were seeing
```
_SegmentSynchronizerImpl.playback_finished called before text/audio input is done {"text_done": false, "audio_done": true}
```
is because of this chain of events:

User says -> "Please turn on light A 5 for me"
Phonic -> assistant_started_speaking (Plugin -> starts a new generation)
Phonic -> tool_call (Plugin -> closes the current generation, but no text chunks were put in the text_ch by the time it closes)
(We need to close the current generation so that tool results actually get sent back to the plugin through `update_chat_ctx`)

However, because no text ever made it into the text channel, text_done never becomes true.

A quick fix for this is to just put an empty string in the text channel, so that the Text/Audio Synchronizer can change the state from false to true.
Thank you for the detailed overview! Just to confirm, the tool call results are received after the "assistant_started_speaking" event and that's why there's something like an in-between empty generation?
Yup, the assistant_started_speaking fires before the tool_call, because assistant_started_speaking is basically our version of saying a turn has started, and tool calls need to happen within an assistant turn.
Ahh okay, I see now! Thanks again
tinalenguyen left a comment:

lgtm and works great, thanks!
Implements the Phonic plugin in the LiveKit Agents Python SDK.
(Original LiveKit Agents JS implementation: https://github.com/Phonic-Co/livekit-agents-js/tree/main/plugins/phonic)
Changes
- `RealtimeSession` and `RealtimeModel`, supporting STS and tool calls
- `examples/voice-agents`

Testing
Demo video