feat: Add Vertex AI Realtime provider implementing IRealtimeClient/IRealtimeClientSession #15553
tarekgh wants to merge 2 commits into googleapis:main
Conversation
…ealtimeClientSession

Add a Vertex AI Live API provider implementing the Microsoft.Extensions.AI Realtime abstractions (IRealtimeClient / IRealtimeClientSession), enabling real-time audio, text, image, and function-calling conversations with Vertex AI models through the standardized MEAI interface.

New files:
- PredictionServiceRealtimeClient.cs
- PredictionServiceRealtimeSession.cs
- InternalLiveTransport.cs / InternalLiveTypes.cs / InternalLiveJsonContext.cs
- BuildIRealtimeClientTest.cs / AotJsonContextTest.cs

Also includes AOT improvements to PredictionServiceChatClient.cs, avoiding JSON round-tripping for tool arguments/results.
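For readers unfamiliar with the AOT technique behind InternalLiveJsonContext.cs, a source-generated context follows the standard System.Text.Json pattern. This is a trimmed-down illustration only; the message type and its property below are hypothetical stand-ins, and the real type list lives in the PR's InternalLiveTypes.cs:

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical, minimal message type (the actual Live API types are richer).
internal sealed class LiveClientMessage
{
    // Explicit wire name, so no reflection-based naming is needed at runtime.
    [JsonPropertyName("setup")]
    public Dictionary<string, object?>? Setup { get; set; }
}

// The source generator emits serialization metadata for every listed type at
// compile time, so (de)serialization works under Native AOT without reflection.
[JsonSourceGenerationOptions(DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull)]
[JsonSerializable(typeof(LiveClientMessage))]
[JsonSerializable(typeof(Dictionary<string, object?>))]
internal partial class InternalLiveJsonContext : JsonSerializerContext
{
}
```

Serialization then goes through the typed metadata, e.g. `JsonSerializer.Serialize(msg, InternalLiveJsonContext.Default.LiveClientMessage)`.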
Code Review
This pull request introduces support for Vertex AI live models by implementing the IRealtimeClient and IRealtimeClientSession interfaces. Key additions include a new WebSocket-based transport layer, internal types for the Gemini Live API, and source-generated JSON context to ensure Native AOT compatibility. Furthermore, the PredictionServiceChatClient has been optimized to handle tool arguments and results through direct conversion between objects and Protobuf Struct/Value types, avoiding inefficient JSON string round-tripping and adding a nesting depth limit. Feedback focuses on further optimizing the transport layer by using pooled buffers and direct stream deserialization, as well as improving the normalization of tool payloads for unknown types.
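The direct object-to-Protobuf conversion described above could look roughly like the sketch below. It is only an illustration of the reviewed approach, using Google.Protobuf's well-known types; the helper name, exact type coverage, and the depth constant are assumptions, not the PR's actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Google.Protobuf.WellKnownTypes;

internal static class ProtoValueConverter
{
    private const int MaxDepth = 32; // assumed nesting limit per the review summary

    // Converts a plain .NET object graph to a Protobuf Value without
    // round-tripping through a JSON string (AOT-friendly: no reflection-based
    // JsonSerializer.Serialize call on the tool-call hot path).
    public static Value ConvertToValue(object? value, int depth = 0)
    {
        if (depth > MaxDepth)
        {
            throw new InvalidOperationException($"Nesting depth exceeds {MaxDepth}.");
        }

        switch (value)
        {
            case null: return Value.ForNull();
            case bool b: return Value.ForBool(b);
            case string s: return Value.ForString(s);
            case int i: return Value.ForNumber(i);
            case double d: return Value.ForNumber(d);
            case IDictionary<string, object?> map:
                var structValue = new Struct();
                foreach (var kvp in map)
                {
                    structValue.Fields[kvp.Key] = ConvertToValue(kvp.Value, depth + 1);
                }
                return Value.ForStruct(structValue);
            case IEnumerable<object?> list:
                return Value.ForList(list.Select(v => ConvertToValue(v, depth + 1)).ToArray());
            default:
                // Fallback for unknown POCO types; the PR normalizes these differently.
                return Value.ForString(value.ToString() ?? "");
        }
    }
}
```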
…ream deserialization

- NormalizeToolPayload: Use JsonSerializer.SerializeToElement with AIJsonUtilities.DefaultOptions for unknown POCO types instead of ToString(), consistent with PredictionServiceChatClient.
- ReceiveAsync: Use cached _receiveBuffer field instead of allocating 4KB per call, reducing GC pressure in high-frequency audio streaming.
- ReceiveAsync: Deserialize directly from MemoryStream instead of ToArray() + UTF8.GetString(), avoiding intermediate copies.
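The second and third bullets combine into a receive loop along these lines. This is a hedged, BCL-only sketch, not the PR's actual InternalLiveTransport code; the class and method shapes are illustrative:

```csharp
using System;
using System.IO;
using System.Net.WebSockets;
using System.Text.Json;
using System.Text.Json.Serialization.Metadata;
using System.Threading;
using System.Threading.Tasks;

internal sealed class LiveReceiver
{
    // Reused across calls, instead of allocating a fresh 4 KB buffer per receive.
    private readonly byte[] _receiveBuffer = new byte[4096];

    public async Task<T?> ReceiveAsync<T>(
        WebSocket socket,
        JsonTypeInfo<T> typeInfo, // source-generated metadata for AOT safety
        CancellationToken cancellationToken)
    {
        using var stream = new MemoryStream();
        WebSocketReceiveResult result;
        do
        {
            result = await socket.ReceiveAsync(
                new ArraySegment<byte>(_receiveBuffer), cancellationToken);
            stream.Write(_receiveBuffer, 0, result.Count);
        }
        while (!result.EndOfMessage);

        stream.Position = 0;
        // Deserialize straight from the stream: no ToArray()/UTF8.GetString() copies.
        return await JsonSerializer.DeserializeAsync(stream, typeInfo, cancellationToken);
    }
}
```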
I'll get to this sometime Tuesday/Wednesday. Thanks!

Thanks so much for your help on this! As noted in the description, there are a few additional changes in the SDK itself to support AOT compatibility. I'd also really appreciate any help in getting visibility or support for this PR as well: googleapis/dotnet-genai#256.
Summary
Adds a Vertex AI Live API provider implementing the Microsoft.Extensions.AI Realtime abstractions (IRealtimeClient / IRealtimeClientSession), enabling real-time audio, text, image, and function-calling conversations with Vertex AI models through the standardized MEAI interface.

This follows the same pattern as the Gemini Realtime provider PR in the dotnet-genai repository, and is consistent with the existing IChatClient implementation (PredictionServiceChatClient) in this package.

AOT Compatibility
This PR also includes cross-cutting AOT (Ahead-of-Time) compilation improvements that span the entire SDK, not just the realtime provider:
Realtime provider AOT support:
- InternalLiveJsonContext — a source-generated JsonSerializerContext with [JsonSerializable] entries for all Live API types (including Dictionary<string, object?> for function call arguments)
- Serialization goes through LiveJsonContext.Default.LiveClientMessage / LiveJsonContext.Default.LiveServerMessage
- Explicit [JsonPropertyName] attributes — no reflection-based naming
- AotJsonContextTest.cs verifying source-gen coverage, nested type auto-discovery, round-trip correctness, and DefaultIgnoreCondition.WhenWritingNull behavior

Chat client AOT improvement (PredictionServiceChatClient.cs):

Previously, Struct values were converted via Struct.Parser.ParseJson(JsonSerializer.Serialize(value)) — a serialize-then-parse pattern that relies on reflection-based JsonSerializer.Serialize. The conversion now uses direct Struct ↔ dictionary mapping, avoiding System.Text.Json entirely on the tool-call hot path. This makes the existing IChatClient more AOT-friendly as a side effect.

Usage Example
What's Included
New Files
- PredictionServiceRealtimeClient.cs — IRealtimeClient implementation that wraps a PredictionServiceClientBuilder, resolves credentials (ADC, service account JSON, scoped OAuth), builds the WebSocket connection, and creates realtime sessions via the Vertex AI Live API.
- PredictionServiceRealtimeSession.cs — IRealtimeClientSession implementation that manages the WebSocket connection, audio buffering with ActivityStart/ActivityEnd framing, message mapping, image sending via clientContent, and function call orchestration.
- InternalLiveTransport.cs — Internal WebSocket transport (Client, Live, AsyncSession) handling connection lifecycle, credential headers, binary frame send/receive, and graceful disposal with close timeouts.
- InternalLiveTypes.cs — Internal JSON-serializable types for the Vertex AI Live API protocol (client messages, server messages, blobs, function calls, schemas, etc.).
- InternalLiveJsonContext.cs — Source-generated JsonSerializerContext for AOT-safe serialization of all Live API types.
- BuildIRealtimeClientTest.cs — 42 unit tests covering client construction, session config mapping, audio commit flow, message mapping, function call handling, disposal, and edge cases.
- AotJsonContextTest.cs — 4 unit tests verifying AOT source-gen coverage, nested type discovery, and serialization round-trips.

Modified Files
- VertexAIExtensions.cs — Added BuildIRealtimeClient() / BuildIRealtimeClientAsync() extension methods on PredictionServiceClientBuilder.
- PredictionServiceChatClient.cs — Eliminated JSON round-tripping for tool arguments/results to improve AOT compatibility (direct Struct ↔ dictionary conversion).
- BuildIChatClientTest.cs — Updated tests reflecting the chat client tool-handling changes.
- Google.Cloud.VertexAI.Extensions.csproj — Version bumped to 1.0.0-beta08; added Microsoft.Bcl.AsyncInterfaces dependency for netstandard2.0/net462.
- docs/history.md — Release notes for 1.0.0-beta08.

Features

- Vertex AI does not support audioStreamEnd; automatic activity detection is always disabled in favor of explicit ActivityStart/ActivityEnd framing that reliably triggers model responses
- Images sent via clientContent with inlineData for proper multimodal conversation context
- Works with the MEAI FunctionInvokingRealtimeSession middleware; tool responses batched into a single SendToolResponseAsync call
- A SemaphoreSlim serializes all WebSocket sends, safe for concurrent middleware + caller usage
- CreateSessionAsync waits for the server's SetupComplete before returning
- Credential resolution: ADC, service account JSON (JsonCredentials), and explicit GoogleCredential — all automatically scoped with the cloud-platform OAuth scope
- ConvertJsonSchemaToGoogleSchema enforces MaxDepth=32 to prevent stack overflow from deeply nested schemas
- Usage metadata surfaced from ResponseDone messages, correctly handled even when TurnComplete and UsageMetadata arrive in the same server message
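The explicit ActivityStart/ActivityEnd framing mentioned above can be sketched as a send sequence. This is a rough illustration only: SendJsonAsync and the anonymous payload shapes are hypothetical, though the activityStart/activityEnd field names follow the Live API's realtimeInput message:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper: one user audio turn under always-manual activity detection.
async Task SendAudioTurnAsync(IEnumerable<byte[]> pcmChunks)
{
    // Open the activity window explicitly (automatic detection is disabled).
    await SendJsonAsync(new { realtimeInput = new { activityStart = new { } } });

    foreach (var chunk in pcmChunks)
    {
        await SendJsonAsync(new
        {
            realtimeInput = new
            {
                audio = new
                {
                    mimeType = "audio/pcm;rate=16000",
                    data = Convert.ToBase64String(chunk),
                },
            },
        });
    }

    // Closing the window is what triggers the model to respond.
    await SendJsonAsync(new { realtimeInput = new { activityEnd = new { } } });
}
```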
Always manual activity detection — Vertex AI does not support audioStreamEnd (confirmed by dotnet-genai's LiveConverters.cs). When automatic activity detection is enabled server-side, the server ignores manual ActivityStart/ActivityEnd signals, leaving no way to trigger a response. The provider always forces AutomaticActivityDetection.Disabled = true and uses explicit ActivityStart → audio → ActivityEnd framing regardless of the user's VoiceActivityDetection.Enabled setting. The AllowInterruption option is still respected via activityHandling.

Images via clientContent — Static images are sent via clientContent with inlineData parts (proper conversation content) rather than realtimeInput.video (designed for streaming video frames). This ensures the model properly processes images as part of the conversation context.

Tool response batching — The MEAI FunctionInvokingRealtimeSession middleware sends a separate CreateConversationItem per function result. Gemini expects all results in one SendToolResponseAsync call. The provider buffers results and flushes them as a single batch when CreateResponse arrives.

TurnComplete suppression after tool responses — After SendToolResponseAsync, the Gemini model automatically continues generating. The provider tracks this via _lastSendWasToolResponse and skips redundant triggers.

SetupComplete handshake — The CreateSessionAsync method drains the server's SetupComplete acknowledgment before returning, ensuring the session is fully ready before the caller sends audio or text.

Audio buffer cap — Audio appends are capped at 10 MB to prevent unbounded memory growth. Frames exceeding 32 KB are automatically split.

Consistent with IChatClient — The realtime provider follows the same patterns as PredictionServiceChatClient: same namespace, same builder extension methods (BuildIRealtimeClient/BuildIRealtimeClientAsync), same credential resolution, same GetService pattern exposing the underlying Client via IServiceProvider.
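The audio buffer cap decision above reduces to a small self-contained policy. The constants come from this description; the type and method names are illustrative, not the PR's actual code:

```csharp
using System;
using System.Collections.Generic;

internal static class AudioFraming
{
    public const int MaxBufferBytes = 10 * 1024 * 1024; // reject appends beyond 10 MB
    public const int MaxFrameBytes = 32 * 1024;         // split frames larger than 32 KB

    // Guard against unbounded buffer growth before accepting an append.
    public static void EnsureCapacity(int bufferedBytes, int appendBytes)
    {
        if ((long)bufferedBytes + appendBytes > MaxBufferBytes)
        {
            throw new InvalidOperationException("Audio buffer cap (10 MB) exceeded.");
        }
    }

    // Slice an oversized frame into <= 32 KB chunks without copying the data.
    public static IEnumerable<ReadOnlyMemory<byte>> Split(ReadOnlyMemory<byte> audio)
    {
        for (int offset = 0; offset < audio.Length; offset += MaxFrameBytes)
        {
            yield return audio.Slice(offset, Math.Min(MaxFrameBytes, audio.Length - offset));
        }
    }
}
```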
151 unit tests (42 new + 109 existing), covering:
- BuildLiveConnectConfig option combinations (modalities, voice, tools, transcription, VAD, max tokens)