feat: Add Gemini Realtime provider implementing IRealtimeClient/IRealtimeClientSession by tarekgh · Pull Request #256 · googleapis/dotnet-genai

tarekgh · 2026-03-19T00:18:42Z

Summary

Adds a Gemini Live API provider implementing the Microsoft.Extensions.AI Realtime abstractions (IRealtimeClient / IRealtimeClientSession), enabling real-time audio, text, and function-calling conversations with Gemini models through the standardized MEAI interface.

This PR also updates the repository to depend on the official Microsoft.Extensions.AI.Abstractions 10.4.1 NuGet package (replacing the private 10.5.0-dev builds).

AOT Compatibility

This PR makes the Google GenAI SDK fully compatible with AOT (Ahead-of-Time) compilation. The MEAI realtime wrapper (GoogleGenAIRealtimeClient / GoogleGenAIRealtimeSession) delegates all serialization and WebSocket communication to the core SDK's Live.cs / AsyncSession / Transformers.cs / LiveConverters.cs code paths, so making the core SDK AOT-compatible avoids duplicating its WebSocket and JSON protocol code in the realtime provider.

AOT changes:

Added GenAIJsonContext — a source-generated JsonSerializerContext with [JsonSerializable] entries for 93 root types (nested property types are auto-discovered by the generator)
Wired the source-gen context into JsonConfig.JsonSerializerOptions with DefaultJsonTypeInfoResolver fallback for non-AOT scenarios (anonymous types, user-provided types)
Added compact InternalSerializerOptions for intermediate serialize-then-parse round-trips to avoid WriteIndented overhead on internal transforms
Updated all ~90 bare JsonSerializer.Serialize/Deserialize call sites across the SDK to use configured options with source-gen metadata
Added System.Text.Json PackageReference for source generator support on both netstandard2.0 and net8.0 targets
Added AotJsonContextTest.cs with 4 tests verifying context coverage, nested type auto-discovery, and serialization round-trips
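
The pattern in the list above can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual code: SampleMessage, SampleUsage, SampleJsonContext, and SampleJsonConfig are hypothetical names, and the real GenAIJsonContext registers 93 root types. It assumes System.Text.Json 8.x with its source generator. Note that constructing DefaultJsonTypeInfoResolver is expected to produce trim/AOT analysis warnings, since it is the reflection fallback.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;
using System.Text.Json.Serialization.Metadata;

// Hypothetical payload type standing in for one of the SDK's root types.
public sealed class SampleMessage
{
    public string? Text { get; set; }
    public SampleUsage? Usage { get; set; } // nested type, discovered by the generator
}

public sealed class SampleUsage
{
    public int InputTokenCount { get; set; }
}

// The source generator emits metadata for each [JsonSerializable] root type
// and for the property types reachable from it.
[JsonSerializable(typeof(SampleMessage))]
internal partial class SampleJsonContext : JsonSerializerContext
{
}

public static class SampleJsonConfig
{
    // Indented options for API-facing output.
    public static readonly JsonSerializerOptions Options = Create(writeIndented: true);

    // Compact options for internal serialize-then-parse round-trips.
    public static readonly JsonSerializerOptions InternalOptions = Create(writeIndented: false);

    private static JsonSerializerOptions Create(bool writeIndented) => new()
    {
        WriteIndented = writeIndented,
        // Source-generated metadata is consulted first; reflection is used only
        // for types the context does not know (anonymous or user-provided types).
        TypeInfoResolver = JsonTypeInfoResolver.Combine(
            SampleJsonContext.Default,
            new DefaultJsonTypeInfoResolver()),
    };
}
```

A call site then passes the configured options explicitly, for example JsonSerializer.Serialize(message, SampleJsonConfig.InternalOptions), instead of calling the bare JsonSerializer.Serialize(message) overload.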

Usage Example

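using System;
using System.Linq;
using System.Threading.Tasks;
using Google.GenAI;
using Microsoft.Extensions.AI;

// Placeholders supplied by the caller: apiKey, cancellationToken, audioBytes, and PlayAudio.

// Simple: create with API key and model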
IRealtimeClient realtimeClient = new GoogleGenAIRealtimeClient(
    apiKey, "gemini-3.1-flash-live-preview");

// Or advanced: create from an existing Google.GenAI.Client instance
// var genaiClient = new Client(apiKey: apiKey);
// IRealtimeClient realtimeClient = new GoogleGenAIRealtimeClient(genaiClient, "gemini-3.1-flash-live-preview");

// Or use the extension method
// IRealtimeClient realtimeClient = genaiClient.AsIRealtimeClient("gemini-3.1-flash-live-preview");

// Define a tool for function calling
AIFunction getWeather = AIFunctionFactory.Create(
    (string location) =>
        location.ToLowerInvariant() switch
        {
            var l when l.Contains("seattle")       => $"The weather in {location} is rainy, 55F",
            var l when l.Contains("new york")      => $"The weather in {location} is cloudy, 70F",
            var l when l.Contains("san francisco") => $"The weather in {location} is foggy, 60F",
            _                                      => $"Sorry, I don't have weather data for {location}."
        },
    "GetWeather",
    "Gets the current weather for a given location");

// Configure session options
var sessionOptions = new RealtimeSessionOptions
{
    Instructions = "You are a helpful assistant.",
    Voice = "Puck",
    OutputModalities = ["audio"],
    Tools = [getWeather],
    InputAudioFormat = new RealtimeAudioFormat("audio/pcm;rate=16000", 16000),
    TranscriptionOptions = new TranscriptionOptions(),
    VoiceActivityDetection = new VoiceActivityDetectionOptions
    {
        Enabled = true,
        AllowInterruption = true,
    },
};

// Optionally wrap with function invocation middleware
var clientBuilder = new RealtimeClientBuilder(realtimeClient)
    .UseFunctionInvocation();
IRealtimeClient wrappedClient = clientBuilder.Build();

// Create a session and start streaming
await using var session = await wrappedClient.CreateSessionAsync(sessionOptions);

// Start listening for server messages in the background
_ = Task.Run(async () =>
{
    await foreach (var message in session.GetStreamingResponseAsync(cancellationToken))
    {
        switch (message)
        {
            case OutputTextAudioRealtimeServerMessage audio
                when audio.Type == RealtimeServerMessageType.OutputAudioDelta:
                // Play audio - audio.Audio contains base64-encoded PCM audio
                PlayAudio(Convert.FromBase64String(audio.Audio));
                break;

            case OutputTextAudioRealtimeServerMessage text
                when text.Type == RealtimeServerMessageType.OutputAudioTranscriptionDelta:
                Console.Write(text.Text);
                break;

            case InputAudioTranscriptionRealtimeServerMessage transcription:
                Console.WriteLine($"You said: {transcription.Transcription}");
                break;

            case ResponseCreatedRealtimeServerMessage response:
                if (response.Usage != null)
                    Console.WriteLine($"Tokens - In: {response.Usage.InputTokenCount}, Out: {response.Usage.OutputTokenCount}");
                break;

            case ResponseOutputItemRealtimeServerMessage item:
                if (item.Item is RealtimeConversationItem convItem)
                    foreach (var content in convItem.Contents)
                        if (content is FunctionCallContent fc)
                            Console.WriteLine($"Function call: {fc.Name}({string.Join(", ", fc.Arguments?.Select(a => $"{a.Key}={a.Value}") ?? [])})");
                break;

            case ErrorRealtimeServerMessage error:
                Console.Error.WriteLine($"Error: {error.Error?.Message}");
                break;
        }
    }
});

// Send audio from microphone (e.g., 16kHz PCM)
var audioContent = new DataContent($"data:audio/pcm;base64,{Convert.ToBase64String(audioBytes)}");
await session.SendAsync(new InputAudioBufferAppendRealtimeClientMessage(audioContent));
await session.SendAsync(new InputAudioBufferCommitRealtimeClientMessage());
await session.SendAsync(new CreateResponseRealtimeClientMessage());

What's Included

New Files

GoogleGenAIRealtimeClient.cs — IRealtimeClient implementation that wraps a Google.GenAI.Client and creates realtime sessions via the Gemini Live API. Includes a convenience constructor accepting just an API key and model ID.
GoogleGenAIRealtimeSession.cs — IRealtimeClientSession implementation that manages the WebSocket connection, audio buffering, message mapping, and function call orchestration.
GoogleGenAIRealtimeTest.cs — 118 unit tests covering the full surface area.
GenAIJsonContext.cs — Source-generated JSON serialization context for AOT compatibility.
AotJsonContextTest.cs — 4 unit tests verifying AOT source-gen coverage and round-trip correctness.

Modified Files

GoogleGenAIExtensions.cs — Added AsIRealtimeClient() extension method.
Directory.Packages.props — Updated Microsoft.Extensions.AI.Abstractions from 10.5.0-dev → 10.4.1.
Google.GenAI.csproj — Added System.Text.Json PackageReference for source generator support.
JsonConfig.cs — Dual options: JsonSerializerOptions (indented, for API output) + InternalSerializerOptions (compact, for internal transforms). Both use source-gen context with reflection fallback.
Live.cs — Minor adjustment to expose AsyncSession for the realtime provider; added serialization options.
Batches.cs, Caches.cs, Files.cs, Models.cs, Operations.cs, Tunings.cs — Updated all bare JsonSerializer calls to use configured options (intermediate → InternalSerializerOptions, HTTP body/response → JsonSerializerOptions).
Transformers.cs, Common.cs, TokensConverters.cs — Updated bare JsonSerializer calls to use configured options.
All packages.lock.json files regenerated.

Features

✅ Audio streaming — Append/commit pattern with automatic frame splitting (32KB max), ActivityStart/ActivityEnd framing
✅ Voice Activity Detection (VAD) — Configurable server-side VAD or manual client-controlled boundaries
✅ Text conversations — Send text messages and receive text/audio responses
✅ Function calling — Full tool invocation support with the FunctionInvokingRealtimeSession middleware; tool responses are batched into a single SendToolResponseAsync call
✅ Transcription — Input and output audio transcription
✅ Thread-safe sends — SemaphoreSlim serializes all WebSocket sends, safe for concurrent middleware + caller usage
✅ Graceful disposal — Race-safe dispose with proper exception handling for in-flight operations
✅ SetupComplete handshake — CreateSessionAsync waits for the server's SetupComplete acknowledgment before returning, ensuring tools and modalities are fully configured before the caller sends audio or text
✅ Convenience constructor — GoogleGenAIRealtimeClient(string apiKey, string? defaultModelId) for simple setup without manually creating a Client
✅ AOT compatible — Source-generated JSON serialization for all SDK types, verified by dedicated tests
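
As a rough illustration of two of the features above (32 KB frame splitting and SemaphoreSlim-serialized sends), here is a minimal sketch using only BCL types. FramedSender and its sendFrame delegate are hypothetical stand-ins for the provider's WebSocket plumbing, and holding the lock across all frames of one buffer is an assumption, not something the PR description confirms.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; the name and the transport delegate are illustrative only.
public sealed class FramedSender : IDisposable
{
    public const int MaxFrameBytes = 32 * 1024; // 32 KB frame limit from the feature list

    private readonly Func<ReadOnlyMemory<byte>, CancellationToken, Task> _sendFrame;
    private readonly SemaphoreSlim _sendLock = new(1, 1);
    private volatile bool _disposed;

    public FramedSender(Func<ReadOnlyMemory<byte>, CancellationToken, Task> sendFrame)
        => _sendFrame = sendFrame ?? throw new ArgumentNullException(nameof(sendFrame));

    public async Task SendAudioAsync(ReadOnlyMemory<byte> audio, CancellationToken cancellationToken = default)
    {
        if (_disposed) throw new ObjectDisposedException(nameof(FramedSender));

        // Only one sender at a time may write to the underlying transport.
        await _sendLock.WaitAsync(cancellationToken).ConfigureAwait(false);
        try
        {
            // Re-check after acquiring the lock: Dispose may have run while waiting.
            if (_disposed) throw new ObjectDisposedException(nameof(FramedSender));

            // Split the buffer into frames of at most MaxFrameBytes each.
            for (int offset = 0; offset < audio.Length; offset += MaxFrameBytes)
            {
                int length = Math.Min(MaxFrameBytes, audio.Length - offset);
                await _sendFrame(audio.Slice(offset, length), cancellationToken).ConfigureAwait(false);
            }
        }
        finally
        {
            _sendLock.Release();
        }
    }

    // Only marks the sender as disposed. The semaphore is left alone so that
    // callers already waiting on it never touch a disposed lock.
    public void Dispose() => _disposed = true;
}
```

With this shape, a 70,000-byte buffer is sent as three frames of 32,768, 32,768, and 4,464 bytes, and concurrent callers never interleave frames from different buffers.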

Key Design Decisions

SetupComplete handshake — The Google SDK's ConnectAsync sends the setup config but returns immediately without waiting for the server's SetupComplete acknowledgment. Our CreateSessionAsync drains this message before returning, ensuring the session is fully ready (tools configured, modalities set). Without this, function calling fails when the user speaks immediately after connecting.
Tool response batching — The MEAI FunctionInvokingRealtimeSession middleware sends separate CreateConversationItem per function result. Gemini expects all results in one SendToolResponseAsync call. The provider buffers results and flushes them as a single batch when CreateResponse arrives.
TurnComplete suppression after tool responses — After SendToolResponseAsync, Gemini automatically continues generating. Sending client_content with turn_complete: true causes the server to close the WebSocket. The provider tracks this via _lastSendWasToolResponse and skips TurnComplete accordingly.
Null function call ID guard — If a function call arrives without an ID, a synthetic GUID is generated to ensure the call-ID-to-function-name mapping always works for the round-trip.
VAD handling — When VAD is disabled (default), the provider wraps audio commits with explicit ActivityStart/ActivityEnd framing. When enabled, the server handles speech boundary detection automatically.
Audio buffer cap — Audio appends are capped at 10 MB to prevent unbounded memory growth. Frames exceeding 32 KB are automatically split.
AOT via source-gen in core SDK — Rather than reimplementing the WebSocket+JSON protocol in the MEAI wrapper (which would duplicate ~1000 lines of code), we made the core SDK AOT-compatible by adding a JsonSerializerContext with source-generated metadata for all types. This ensures the MEAI realtime wrapper can delegate to Live.cs/AsyncSession without any reflection-based JSON calls on the hot path.
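
The tool response batching decision above can be sketched as a small buffer that collects results and flushes them in one call. This is an assumption-laden illustration: ToolResult, ToolResponseBatcher, and the sendBatch delegate are hypothetical names standing in for the provider's internal state and its SendToolResponseAsync call, and a .NET 8 target is assumed for the record type.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical result shape; the real provider works with MEAI content types.
public sealed record ToolResult(string CallId, string FunctionName, string ResultJson);

public sealed class ToolResponseBatcher
{
    private readonly Func<IReadOnlyList<ToolResult>, CancellationToken, Task> _sendBatch;
    private readonly List<ToolResult> _pending = new();
    private readonly object _gate = new();

    public ToolResponseBatcher(Func<IReadOnlyList<ToolResult>, CancellationToken, Task> sendBatch)
        => _sendBatch = sendBatch ?? throw new ArgumentNullException(nameof(sendBatch));

    // Called once per function result the middleware sends.
    public void Add(ToolResult result)
    {
        lock (_gate)
        {
            _pending.Add(result);
        }
    }

    // Called when CreateResponse arrives. Returns true when a batch was sent,
    // which lets the caller skip TurnComplete for that cycle.
    public async Task<bool> FlushAsync(CancellationToken cancellationToken = default)
    {
        ToolResult[] batch;
        lock (_gate)
        {
            if (_pending.Count == 0) return false;
            batch = _pending.ToArray();
            _pending.Clear();
        }

        // All buffered results go out in a single call.
        await _sendBatch(batch, cancellationToken).ConfigureAwait(false);
        return true;
    }
}
```

The boolean returned by FlushAsync plays the role the description assigns to _lastSendWasToolResponse: when it is true, the caller does not follow up with a turn-complete message.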

Test Coverage

122 unit tests covering:

Client and session lifecycle (construction, disposal, idempotent dispose)
All message types (audio, text, function calls, transcription, errors)
Edge cases (null args, empty buffers, concurrent dispose, exception swallowing)
Function call flow (single/multiple results, batching, flag reset after tool cycle)
VAD modes (enabled, disabled, default)
BuildLiveConnectConfig mapping (all option combinations)
AOT source-gen context coverage, nested type discovery, and round-trip correctness

google-cla · 2026-03-19T00:18:49Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

tarekgh · 2026-03-19T01:30:02Z

CC @stephentoub

…timeClientSession

- Use SendRealtimeInputAsync for all input types (text, image, audio) to avoid interleaving with SendClientContentAsync which causes WebSocket close
- Fix VAD handling: use ActivityStart/ActivityEnd framing when VAD is disabled, AudioStreamEnd when VAD is enabled for push-to-talk
- Fix image input: send as Video blob without activity framing, use minimal text trigger in CreateResponse since Gemini treats images as streaming context
- Fix function calling: convert MEAI JsonSchema to Google Schema type with proper uppercase type names (STRING, OBJECT, etc.)
- Text input auto-triggers model response without framing

…ction calling reliability, and test corrections
- Wait for SetupComplete after ConnectAsync so tools are configured before caller sends audio/text
- Add convenience constructor GoogleGenAIRealtimeClient(apiKey, defaultModelId)
- Guard against null function call IDs with synthetic GUID fallback
- Always store callId-to-functionName mapping regardless of null checks
- Fix 5 test expectations to match actual Gemini Live API behavior:
  - ParametersJsonSchema -> Parameters (Google Schema)
  - Text auto-triggers response (no turnComplete needed)
  - SendRealtimeInputAsync has no role field

- Branch on SessionKind.Transcription in BuildLiveConnectConfig to create minimal config: input transcription only, text modality, no voice/tools/instructions
- Map TranscriptionOptions.SpeechLanguage to AudioTranscriptionConfig.LanguageCodes for both transcription and conversation modes
- Add 8 tests covering transcription mode config, language mapping, VAD, and verifying conversation-oriented options are excluded

Add GenAIJsonContext (JsonSerializerContext) with source-generated metadata for all types used in serialization, enabling AOT (Ahead-of-Time) compilation support without duplicating the SDK code in provider implementations.
Changes:
- Add GenAIJsonContext.cs with [JsonSerializable] entries for 93 root types (nested types are auto-discovered by the source generator)
- Wire source-gen context into JsonConfig.JsonSerializerOptions with DefaultJsonTypeInfoResolver fallback for non-AOT scenarios
- Add compact InternalSerializerOptions for intermediate serialize-then-parse round-trips (avoids WriteIndented overhead on internal transforms)
- Update all bare JsonSerializer.Serialize/Deserialize calls (~90 sites) to use the configured options with source-gen type metadata
- Add System.Text.Json PackageReference for source generator support on both netstandard2.0 and net8.0 targets
- Add AotJsonContextTest.cs with 4 tests verifying context coverage, nested type auto-discovery, and serialization round-trips

…time-provider
# Conflicts:
# Google.GenAI/Batches.cs
# Google.GenAI/Models.cs
# Google.GenAI/Tunings.cs

…normalization
- Fix SendAsync error handling: rethrow ODE as named ObjectDisposedException, swallow WebSocketException only when disposed (not blanket catch)
- Add concurrent enumeration guard (_activeStreamingEnumeration) to GetStreamingResponseAsync to prevent multiple simultaneous readers
- Wrap DisposeAsync resources in individual try/catch with ExceptionDispatchInfo to prevent resource leaks on partial failure
- Fix CreateSessionAsync to dispose asyncSession on setup failure
- Replace shallow tool result serialization with deep NormalizeToolPayload, NormalizeToolArguments, ConvertJsonElementToToolPayload
- Use FunctionCallContent.CreateFromParsedArguments for tool call args (consistent with MEAI conventions, AOT-safe)
- Add MaxToolPayloadDepth (64) depth guard to prevent stack overflow
- Add post-lock disposed recheck for race with concurrent DisposeAsync
- Add 5 regression tests, update 3 existing error handling tests

tarekgh force-pushed the feature/gemini-realtime-provider branch 3 times, most recently from 32fa581 to 1d54288 Compare March 19, 2026 01:27

jeffhandley reviewed Mar 19, 2026

tarekgh force-pushed the feature/gemini-realtime-provider branch from 1d54288 to a5345ce Compare March 19, 2026 21:58

shivvaam0001 self-assigned this Mar 20, 2026

feat: Add Gemini Realtime provider implementing IRealtimeClient/IReal…

dc649bd

…timeClientSession

tarekgh force-pushed the feature/gemini-realtime-provider branch from a5345ce to dd1b649 Compare March 26, 2026 23:07

tarekgh force-pushed the feature/gemini-realtime-provider branch from dd1b649 to 6121fcf Compare March 26, 2026 23:12

tarekgh added 2 commits March 27, 2026 18:14

tarekgh force-pushed the feature/gemini-realtime-provider branch from 56db929 to 7d22cc8 Compare March 29, 2026 19:47

tarekgh added 2 commits April 2, 2026 16:23

Merge remote-tracking branch 'upstream/main' into feature/gemini-real…

c76d80c

tarekgh force-pushed the feature/gemini-realtime-provider branch from 3a4177f to c76d80c Compare April 2, 2026 23:27

tarekgh mentioned this pull request Apr 6, 2026

feat: Add Vertex AI Realtime provider implementing IRealtimeClient/IRealtimeClientSession googleapis/google-cloud-dotnet#15553

Open

tarekgh wants to merge 7 commits into googleapis:main from tarekgh:feature/gemini-realtime-provider
