
feat: Add Vertex AI Realtime provider implementing IRealtimeClient/IRealtimeClientSession #15553

Open
tarekgh wants to merge 2 commits into googleapis:main from tarekgh:feature/vertex-ai-realtime-provider

Conversation

@tarekgh

@tarekgh tarekgh commented Apr 6, 2026

Summary

Adds a Vertex AI Live API provider implementing the Microsoft.Extensions.AI Realtime abstractions (IRealtimeClient / IRealtimeClientSession), enabling real-time audio, text, image, and function-calling conversations with Vertex AI models through the standardized MEAI interface.

This follows the same pattern as the Gemini Realtime provider PR in the dotnet-genai repository, and is consistent with the existing IChatClient implementation (PredictionServiceChatClient) in this package.

AOT Compatibility

This PR also includes cross-cutting AOT (Ahead-of-Time) compilation improvements that span the entire SDK, not just the realtime provider:

Realtime provider AOT support:

  • Added InternalLiveJsonContext — a source-generated JsonSerializerContext with [JsonSerializable] entries for all Live API types (including Dictionary<string, object?> for function call arguments)
  • All WebSocket JSON serialization/deserialization uses the source-generated context via LiveJsonContext.Default.LiveClientMessage / LiveJsonContext.Default.LiveServerMessage
  • All internal types use explicit [JsonPropertyName] attributes — no reflection-based naming
  • Added AotJsonContextTest.cs verifying source-gen coverage, nested type auto-discovery, round-trip correctness, and DefaultIgnoreCondition.WhenWritingNull behavior
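The source-generated context pattern described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the message type and its members here are hypothetical stand-ins for the internal Live API types.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Illustrative message type standing in for the internal Live API protocol types.
internal sealed class LiveClientMessage
{
    [JsonPropertyName("setup")]
    public string? Setup { get; set; }

    [JsonPropertyName("args")]
    public Dictionary<string, object?>? Args { get; set; }
}

// Source-generated context: the generator emits serialization code at compile
// time, so no runtime reflection is needed (required for Native AOT).
[JsonSourceGenerationOptions(DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull)]
[JsonSerializable(typeof(LiveClientMessage))]
[JsonSerializable(typeof(Dictionary<string, object?>))]
internal partial class LiveJsonContext : JsonSerializerContext
{
}
```

Callers then pass the generated type-info instead of relying on reflection, e.g. `JsonSerializer.Serialize(msg, LiveJsonContext.Default.LiveClientMessage)`.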

Chat client AOT improvement (PredictionServiceChatClient.cs):

  • Eliminated JSON round-tripping for tool arguments and results: previously, Struct values were converted via Struct.Parser.ParseJson(JsonSerializer.Serialize(value)) — a serialize-then-parse pattern that relies on reflection-based JsonSerializer.Serialize. Now uses direct Struct ↔ dictionary conversion, avoiding System.Text.Json entirely on the tool-call hot path. This makes the existing IChatClient more AOT-friendly as a side effect.
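The direct conversion described above can be sketched roughly like this. This is a hypothetical helper assuming the standard `Google.Protobuf.WellKnownTypes.Value` factory methods; it is not the PR's actual implementation.

```csharp
using System.Collections.Generic;
using System.Linq;
using Google.Protobuf.WellKnownTypes;

// Hypothetical helper: each CLR value maps straight to a protobuf Value,
// with no JsonSerializer.Serialize/Struct.Parser.ParseJson round-trip.
internal static class StructConversion
{
    public static Struct ToStruct(IReadOnlyDictionary<string, object?> args)
    {
        var s = new Struct();
        foreach (var (key, value) in args)
        {
            s.Fields[key] = ToValue(value);
        }
        return s;
    }

    private static Value ToValue(object? value) => value switch
    {
        null => Value.ForNull(),
        bool b => Value.ForBool(b),
        string str => Value.ForString(str),
        int i => Value.ForNumber(i),
        double d => Value.ForNumber(d),
        IReadOnlyDictionary<string, object?> map => Value.ForStruct(ToStruct(map)),
        IEnumerable<object?> list => Value.ForList(list.Select(ToValue).ToArray()),
        // Fallback for unrecognized types; the real code may handle more cases.
        _ => Value.ForString(value.ToString() ?? ""),
    };
}
```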

Usage Example

using Google.Cloud.AIPlatform.V1;
using Google.Cloud.VertexAI.Extensions;
using Microsoft.Extensions.AI;

// Create the realtime client from a PredictionServiceClientBuilder
var builder = new PredictionServiceClientBuilder
{
    // Credentials are resolved automatically via ADC, or set explicitly:
    // JsonCredentials = File.ReadAllText("service-account.json"),
};

IRealtimeClient realtimeClient = builder.BuildIRealtimeClient(
    "gemini-live-2.5-flash-native-audio");

// Or async:
// IRealtimeClient realtimeClient = await builder.BuildIRealtimeClientAsync(
//     "gemini-live-2.5-flash-native-audio");

// Define a tool for function calling
AIFunction getWeather = AIFunctionFactory.Create(
    (string location) =>
        location.ToLowerInvariant() switch
        {
            var l when l.Contains("seattle")       => $"The weather in {location} is rainy, 55F",
            var l when l.Contains("new york")      => $"The weather in {location} is cloudy, 70F",
            var l when l.Contains("san francisco") => $"The weather in {location} is foggy, 60F",
            _                                      => $"Sorry, I don't have weather data for {location}."
        },
    "GetWeather",
    "Gets the current weather for a given location");

// Configure session options
var sessionOptions = new RealtimeSessionOptions
{
    Instructions = "You are a helpful assistant.",
    Voice = "Puck",
    OutputModalities = ["audio"],
    Tools = [getWeather],
    InputAudioFormat = new RealtimeAudioFormat("audio/pcm;rate=16000", 16000),
    TranscriptionOptions = new TranscriptionOptions(),
    VoiceActivityDetection = new VoiceActivityDetectionOptions
    {
        Enabled = true,
        AllowInterruption = true,
    },
};

// Optionally wrap with function invocation middleware
var clientBuilder = new RealtimeClientBuilder(realtimeClient)
    .UseFunctionInvocation();
IRealtimeClient wrappedClient = clientBuilder.Build();

// Create a session and start streaming
await using var session = await wrappedClient.CreateSessionAsync(sessionOptions);

// Start listening for server messages in the background
// (cancellationToken is supplied by the app, e.g. from a CancellationTokenSource)
var cancellationToken = CancellationToken.None;
_ = Task.Run(async () =>
{
    await foreach (var message in session.GetStreamingResponseAsync(cancellationToken))
    {
        switch (message)
        {
            case OutputTextAudioRealtimeServerMessage audio
                when audio.Type == RealtimeServerMessageType.OutputAudioDelta:
                // audio.Audio contains base64-encoded PCM audio;
                // PlayAudio is an app-defined playback method.
                PlayAudio(Convert.FromBase64String(audio.Audio));
                break;

            case OutputTextAudioRealtimeServerMessage text
                when text.Type == RealtimeServerMessageType.OutputAudioTranscriptionDelta:
                Console.Write(text.Text);
                break;

            case InputAudioTranscriptionRealtimeServerMessage transcription:
                Console.WriteLine($"You said: {transcription.Transcription}");
                break;

            case ResponseCreatedRealtimeServerMessage response:
                if (response.Usage != null)
                    Console.WriteLine(
                        $"Tokens - In: {response.Usage.InputTokenCount}, " +
                        $"Out: {response.Usage.OutputTokenCount}");
                break;

            case ErrorRealtimeServerMessage error:
                Console.Error.WriteLine($"Error: {error.Error?.Message}");
                break;
        }
    }
});

// Send audio from microphone (e.g., 16kHz PCM); audioBytes holds the captured PCM samples
var audioContent = new DataContent(
    $"data:audio/pcm;base64,{Convert.ToBase64String(audioBytes)}");
await session.SendAsync(
    new InputAudioBufferAppendRealtimeClientMessage(audioContent));
await session.SendAsync(
    new InputAudioBufferCommitRealtimeClientMessage());
await session.SendAsync(
    new CreateResponseRealtimeClientMessage());

// Send an image for multimodal conversation; imageBytes holds the app-supplied image data
var imageContent = new DataContent(
    $"data:image/jpeg;base64,{Convert.ToBase64String(imageBytes)}");
var imageItem = new RealtimeConversationItem(
    [imageContent], id: null, role: ChatRole.User);
await session.SendAsync(
    new CreateConversationItemRealtimeClientMessage(item: imageItem));
await session.SendAsync(
    new CreateResponseRealtimeClientMessage());

What's Included

New Files

  • PredictionServiceRealtimeClient.cs — IRealtimeClient implementation that wraps a PredictionServiceClientBuilder, resolves credentials (ADC, service account JSON, scoped OAuth), builds the WebSocket connection, and creates realtime sessions via the Vertex AI Live API.
  • PredictionServiceRealtimeSession.cs — IRealtimeClientSession implementation that manages the WebSocket connection, audio buffering with ActivityStart/ActivityEnd framing, message mapping, image sending via clientContent, and function call orchestration.
  • InternalLiveTransport.cs — Internal WebSocket transport (Client, Live, AsyncSession) handling connection lifecycle, credential headers, binary frame send/receive, and graceful disposal with close timeouts.
  • InternalLiveTypes.cs — Internal JSON-serializable types for the Vertex AI Live API protocol (client messages, server messages, blobs, function calls, schemas, etc.).
  • InternalLiveJsonContext.cs — Source-generated JsonSerializerContext for AOT-safe serialization of all Live API types.
  • BuildIRealtimeClientTest.cs — 42 unit tests covering client construction, session config mapping, audio commit flow, message mapping, function call handling, disposal, and edge cases.
  • AotJsonContextTest.cs — 4 unit tests verifying AOT source-gen coverage, nested type discovery, and serialization round-trips.

Modified Files

  • VertexAIExtensions.cs — Added BuildIRealtimeClient() / BuildIRealtimeClientAsync() extension methods on PredictionServiceClientBuilder.
  • PredictionServiceChatClient.cs — Eliminated JSON round-tripping for tool arguments/results to improve AOT compatibility (direct Struct ↔ dictionary conversion).
  • BuildIChatClientTest.cs — Updated tests reflecting the chat client tool-handling changes.
  • Google.Cloud.VertexAI.Extensions.csproj — Version bumped to 1.0.0-beta08; added Microsoft.Bcl.AsyncInterfaces dependency for netstandard2.0/net462.
  • docs/history.md — Release notes for 1.0.0-beta08.

Features

  • Audio streaming — Append/commit pattern with automatic frame splitting (32KB max), explicit ActivityStart/ActivityEnd framing
  • Manual activity detection — Vertex AI does not support audioStreamEnd; automatic activity detection is always disabled in favor of explicit ActivityStart/ActivityEnd framing that reliably triggers model responses
  • Text conversations — Send text messages and receive text/audio responses
  • Image understanding — Images sent via clientContent with inlineData for proper multimodal conversation context
  • Function calling — Full tool invocation support with the FunctionInvokingRealtimeSession middleware; tool responses batched into a single SendToolResponseAsync call
  • Transcription — Input and output audio transcription support
  • Thread-safe sends — SemaphoreSlim serializes all WebSocket sends, safe for concurrent middleware + caller usage
  • Graceful disposal — Race-safe dispose with proper exception handling for in-flight operations; WebSocket close with 5-second timeout
  • SetupComplete handshake — CreateSessionAsync waits for the server's SetupComplete before returning
  • Credential resolution — Supports ADC (Application Default Credentials), service account JSON (JsonCredentials), and explicit GoogleCredential — all automatically scoped with cloud-platform OAuth scope
  • AOT compatible — Source-generated JSON serialization for all Live API types; chat client tool path also improved for AOT
  • Schema depth guard — ConvertJsonSchemaToGoogleSchema enforces MaxDepth=32 to prevent stack overflow from deeply nested schemas
  • Usage metadata — Token counts surfaced in ResponseDone messages, correctly handled even when TurnComplete and UsageMetadata arrive in the same server message
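The 32 KB frame splitting mentioned under Audio streaming can be sketched as follows. The chunk size matches the description above; the helper name and shape are illustrative, not the provider's actual internals.

```csharp
using System;
using System.Collections.Generic;

internal static class AudioFraming
{
    // Illustrative cap matching the described 32 KB max frame size.
    private const int MaxFrameBytes = 32 * 1024;

    // Splits a buffered audio payload into frames no larger than MaxFrameBytes.
    public static IEnumerable<ReadOnlyMemory<byte>> SplitFrames(ReadOnlyMemory<byte> audio)
    {
        for (int offset = 0; offset < audio.Length; offset += MaxFrameBytes)
        {
            int length = Math.Min(MaxFrameBytes, audio.Length - offset);
            yield return audio.Slice(offset, length);
        }
    }
}
```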

Key Design Decisions

  1. Always manual activity detection — Vertex AI does not support audioStreamEnd (confirmed by dotnet-genai's LiveConverters.cs). When automatic activity detection is enabled server-side, the server ignores manual ActivityStart/ActivityEnd signals, leaving no way to trigger a response. The provider always forces AutomaticActivityDetection.Disabled = true and uses explicit ActivityStart → audio → ActivityEnd framing regardless of the user's VoiceActivityDetection.Enabled setting. The AllowInterruption option is still respected via activityHandling.

  2. Images via clientContent — Static images are sent via clientContent with inlineData parts (proper conversation content) rather than realtimeInput.video (designed for streaming video frames). This ensures the model properly processes images as part of the conversation context.

  3. Tool response batching — The MEAI FunctionInvokingRealtimeSession middleware sends separate CreateConversationItem per function result. Gemini expects all results in one SendToolResponseAsync call. The provider buffers results and flushes them as a single batch when CreateResponse arrives.

  4. TurnComplete suppression after tool responses — After SendToolResponseAsync, the Gemini model automatically continues generating. The provider tracks this via _lastSendWasToolResponse and skips redundant triggers.

  5. SetupComplete handshake — The CreateSessionAsync method drains the server's SetupComplete acknowledgment before returning, ensuring the session is fully ready before the caller sends audio or text.

  6. Audio buffer cap — Audio appends are capped at 10 MB to prevent unbounded memory growth. Frames exceeding 32 KB are automatically split.

  7. Consistent with IChatClient — The realtime provider follows the same patterns as PredictionServiceChatClient: same namespace, same builder extension methods (BuildIRealtimeClient/BuildIRealtimeClientAsync), same credential resolution, same GetService pattern exposing the underlying Client via IServiceProvider.
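Decision 3 (tool response batching) can be sketched with a minimal buffer-and-flush shape. All names here are hypothetical; the real provider works with Live API message types rather than raw strings.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: buffer per-item tool results and flush them as one
// batch when the response trigger arrives, matching Gemini's expectation of
// a single SendToolResponse call.
internal sealed class ToolResponseBatcher
{
    private readonly List<(string CallId, string ResultJson)> _pending = new();

    // Called once per function result (MEAI sends one conversation item each).
    public void Add(string callId, string resultJson) =>
        _pending.Add((callId, resultJson));

    // Called when CreateResponse arrives: emit all buffered results at once.
    public IReadOnlyList<(string CallId, string ResultJson)> Flush()
    {
        var batch = _pending.ToArray();
        _pending.Clear();
        return batch;
    }
}
```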

Test Coverage

151 unit tests (42 new + 109 existing), covering:

  • Client and session lifecycle (construction, disposal, idempotent dispose)
  • All BuildLiveConnectConfig option combinations (modalities, voice, tools, transcription, VAD, max tokens)
  • Audio commit flow (ActivityStart/ActivityEnd framing, chunking, empty buffer handling)
  • All server message mapping (audio, text, transcription, function calls, usage, GoAway, turn complete)
  • Function call handling (single/multiple results, batching, null args, ID generation)
  • Edge cases (null inputs, empty buffers, concurrent dispose, exception swallowing)
  • Media trigger flow (image → CreateResponse → clientContent with turnComplete)
  • AOT source-gen context coverage, nested type discovery, and round-trip correctness
  • Chat client tool argument/result direct conversion (replaces JSON round-trip tests)

feat: Add Vertex AI Realtime provider implementing IRealtimeClient/IRealtimeClientSession

Add a Vertex AI Live API provider implementing the Microsoft.Extensions.AI
Realtime abstractions (IRealtimeClient / IRealtimeClientSession), enabling
real-time audio, text, image, and function-calling conversations with
Vertex AI models through the standardized MEAI interface.

New files:
- PredictionServiceRealtimeClient.cs
- PredictionServiceRealtimeSession.cs
- InternalLiveTransport.cs / InternalLiveTypes.cs / InternalLiveJsonContext.cs
- BuildIRealtimeClientTest.cs / AotJsonContextTest.cs

Also includes AOT improvements to PredictionServiceChatClient.cs,
avoiding JSON round-tripping for tool arguments/results.
@tarekgh tarekgh requested a review from a team as a code owner April 6, 2026 22:50
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for Vertex AI live models by implementing the IRealtimeClient and IRealtimeClientSession interfaces. Key additions include a new WebSocket-based transport layer, internal types for the Gemini Live API, and source-generated JSON context to ensure Native AOT compatibility. Furthermore, the PredictionServiceChatClient has been optimized to handle tool arguments and results through direct conversion between objects and Protobuf Struct/Value types, avoiding inefficient JSON string round-tripping and adding a nesting depth limit. Feedback focuses on further optimizing the transport layer by using pooled buffers and direct stream deserialization, as well as improving the normalization of tool payloads for unknown types.

…ream deserialization

- NormalizeToolPayload: Use JsonSerializer.SerializeToElement with
  AIJsonUtilities.DefaultOptions for unknown POCO types instead of
  ToString(), consistent with PredictionServiceChatClient.
- ReceiveAsync: Use cached _receiveBuffer field instead of allocating
  4KB per call, reducing GC pressure in high-frequency audio streaming.
- ReceiveAsync: Deserialize directly from MemoryStream instead of
  ToArray() + UTF8.GetString(), avoiding intermediate copies.
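The suggested receive-path shape might look roughly like this. This is a hypothetical sketch of the reviewed pattern (cached buffer, deserialize straight from the stream), not the PR's actual code; in the AOT build the final call would use the source-generated context's `JsonTypeInfo<T>` rather than options.

```csharp
using System;
using System.IO;
using System.Net.WebSockets;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

internal sealed class LiveReceiver
{
    // Cached across calls instead of allocating 4KB per receive,
    // reducing GC pressure during high-frequency audio streaming.
    private readonly byte[] _receiveBuffer = new byte[4096];

    public async Task<T?> ReceiveAsync<T>(
        WebSocket socket,
        JsonSerializerOptions options,
        CancellationToken cancellationToken)
    {
        using var stream = new MemoryStream();
        WebSocketReceiveResult result;
        do
        {
            // Accumulate fragments until the full WebSocket message arrives.
            result = await socket.ReceiveAsync(
                new ArraySegment<byte>(_receiveBuffer), cancellationToken);
            stream.Write(_receiveBuffer, 0, result.Count);
        }
        while (!result.EndOfMessage);

        stream.Position = 0;
        // Deserialize directly from the stream, avoiding the
        // ToArray() + UTF8.GetString() intermediate copies.
        return JsonSerializer.Deserialize<T>(stream, options);
    }
}
```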
@amanda-tarafa amanda-tarafa self-requested a review April 6, 2026 23:20
@amanda-tarafa
Contributor

I'll get to this sometime Tuesday/Wednesday. Thanks!

@tarekgh
Author

tarekgh commented Apr 6, 2026

Thanks so much for your help on this! As noted in the description, there are a few additional changes in the SDK itself to support AOT compatibility.

I’d also really appreciate any help in getting visibility or support for this PR as well: googleapis/dotnet-genai#256.

CC @jeffhandley @stephentoub
