[Inworld] Flush to drain decoder on every audio chunk from server #4983
Open
ianbbqzy wants to merge 1 commit into livekit:main
839b7ba to
daf64bb
Contributor
Author
@tinalenguyen @davidzhao PTAL, Thanks!
Member
Hi @ianbbqzy, thanks for the PR! A few notes and questions:
Could you elaborate more on this? I also noticed that the last punctuation mark is not present; I'm not sure whether that was pre-existing behavior.
The emitter's non-PCM path still routes through `_decode_task`, which on `FlushSegment` does `audio_decoder.end_input()` + `await decode_atask` + `audio_decoder = None`. This means:

- Without per-chunk `flush()`: All audio bytes accumulate in the decoder. The final `output_emitter.flush()` at line 1137 is the only `FlushSegment`. TTFB = time to receive all audio from the server, not just the first chunk.
- With per-chunk `flush()`: Each chunk triggers `FlushSegment` → decoder drains → `_flush_frame()` → `SynthesizedAudio` is emitted. TTFB = time to first chunk from the server (correct).

This is an Inworld-specific issue because Inworld's HTTP API returns one JSON line per audio chunk (each with its own WAV payload), unlike other providers, which return a single continuous raw PCM byte stream. For Inworld, each JSON line is a discrete audio segment, and delaying the flush until the end defeats the purpose of streaming.
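
To make the TTFB difference concrete, here is a minimal toy model, not the actual emitter code. `ToyEmitter`, `run`, and `chunks_before_first_audio` are hypothetical names invented for this sketch; the only assumption carried over from the description is that buffered audio is emitted solely when a flush drains the decoder.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEmitter:
    """Toy stand-in for an emitter whose decoder only drains on flush.

    Hypothetical model for illustration; not the LiveKit implementation.
    """

    _pending: list = field(default_factory=list)
    _chunks_seen: int = 0
    # Each entry: (number of chunks received when audio was emitted, audio bytes)
    emitted: list = field(default_factory=list)

    def push(self, data: bytes) -> None:
        # Bytes accumulate in the "decoder" until the segment is flushed.
        self._pending.append(data)
        self._chunks_seen += 1

    def flush(self) -> None:
        # Draining the decoder is the only point where audio is emitted.
        if self._pending:
            self.emitted.append((self._chunks_seen, b"".join(self._pending)))
            self._pending.clear()


def run(chunks, flush_per_chunk: bool) -> ToyEmitter:
    emitter = ToyEmitter()
    for chunk in chunks:
        emitter.push(chunk)
        if flush_per_chunk:
            emitter.flush()
    emitter.flush()  # final flush at end of stream (no-op if nothing is pending)
    return emitter


def chunks_before_first_audio(chunks, flush_per_chunk: bool) -> int:
    # How many server chunks had to arrive before the first audio came out.
    return run(chunks, flush_per_chunk).emitted[0][0]
```

With three chunks, the model emits its first audio only after all three have arrived when flushing once at the end, and after the first chunk when flushing per chunk, which mirrors the two cases above.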
====
Also updated the timestamp parsing logic to always add a trailing space, regardless of whether a word is at the end of a chunk. In Inworld's case, the end of a chunk may just be the end of a phrase rather than the end of a sentence.
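
A small sketch of why the trailing space matters when chunks end mid-sentence. The function names and the word-list input shape are assumptions made for illustration and do not reflect the plugin's actual parsing code; `join_without_chunk_end_space` models the presumed previous behavior.

```python
def join_with_trailing_space(chunks):
    # Every word gets a trailing space, including the last word of a chunk.
    return "".join(word + " " for chunk in chunks for word in chunk)


def join_without_chunk_end_space(chunks):
    # Presumed earlier behavior: no space after the last word of each chunk.
    return "".join(" ".join(chunk) for chunk in chunks)


# A chunk boundary that falls at the end of a phrase, not a sentence.
chunks = [["Hello", "there,"], ["how", "are", "you?"]]
print(repr(join_with_trailing_space(chunks)))
print(repr(join_without_chunk_end_space(chunks)))
```

Without the space at the chunk boundary, `there,` and `how` run together in the reassembled text.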