Sync with Microsoft ONNX Runtime - 01042026#1008
Closed
ai-fw-intg wants to merge 23 commits into ovep-develop from
Conversation
…#27342)

### Description
Moves the `--build_wasm_static_lib → --build_wasm` implication from `build.py` into `build_args.py`'s post-processing, **before** the cmake generator selection. Previously, `build_args.py` chose the generator based on `args.build_wasm` (still `False`), and `build.py` only set it to `True` afterwards—too late.

- **`tools/ci_build/build_args.py`**: Set `args.build_wasm = True` when `args.build_wasm_static_lib` is set, prior to generator and cross-compilation logic.
- **`tools/ci_build/build.py`**: Remove the now-redundant identical check.

### Motivation and Context
Using `--build_wasm_static_lib` without `--build_wasm` caused cmake to use the wrong generator (e.g., Visual Studio instead of Ninja on Windows) and miss Emscripten-specific configuration, leading to build failures such as a missing `libiconv`.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com>
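The ordering problem can be sketched in a few lines of Python. This is illustrative only: `parse_build_args` and the generator strings are stand-ins, not the actual `build_args.py` code; the point is that the implication must be applied before any post-processing that reads `args.build_wasm`.

```python
import argparse

def parse_build_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--build_wasm", action="store_true")
    parser.add_argument("--build_wasm_static_lib", action="store_true")
    args = parser.parse_args(argv)

    # Apply the implication FIRST, in post-processing ...
    if args.build_wasm_static_lib:
        args.build_wasm = True

    # ... so that generator selection sees the final value. (Illustrative
    # selection logic; the real choice also depends on platform and toolchain.)
    args.cmake_generator = "Ninja" if args.build_wasm else "Visual Studio 17 2022"
    return args
```

With the old ordering, `--build_wasm_static_lib` alone would have left `args.build_wasm` as `False` at the point where the generator was chosen.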
…n MatMulNBits (microsoft#27820)

### Description
Routes fp16 `HQNBIT_CompInt8` through the fp32 MLAS path (`SQNBIT_CompInt8`) at the operator level for both 4-bit and 8-bit MatMulNBits, then removes the ~370 lines of dead HQ CompInt8 wrapper code from MLAS.

**Operator changes (matmul_nbits.cc):**
- PrePack: Uses `SQNBIT_CompInt8` for sizing/packing, pre-converts fp16 scales and bias to fp32, computes BZpCorr for asymmetric KleidiAI on ARM64.
- ComputeBPacked: Bulk fp16→fp32 conversion of A, calls `MlasQNBitGemmBatch<float>` with `SQNBIT_CompInt8`, bulk fp32→fp16 conversion of C.

**MLAS cleanup (qnbitgemm.cpp, qnbitgemm_kernel_neon.cpp):**
- Removed `HQ4BitGemm_CompInt8`, `HQ8BitGemm_CompInt8`, `HQ8BitCompInt8PerGemmWorkspace`, associated enum values, dispatch branches, workspace entries, and `HQNBIT_CompInt8` NEON kernel conditions.
- Added `HQNBIT_CompInt8` → `SQNBIT_CompInt8` redirect in `MlasIsQNBitGemmAvailable` for `GetComputeType<MLFloat16>` compatibility.

### Motivation and Context
The HQ CompInt8 kernels are wrappers that convert fp16→fp32 per-tile before calling the same SQ fp32 kernels. This change:
1. **Eliminates per-tile overhead** via bulk conversion at the operator level.
2. **Enables KleidiAI for fp16 4-bit** — previously bypassed by the `HQNBIT_CompInt8` path.
3. **Removes ~370 lines of dead wrapper code** from MLAS.
### Improvements
Measured on `Snapdragon X Elite - X1E78100 - Qualcomm Oryon CPU`

**Asymmetric:**

| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|-------------------|------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 1.28× | 1.55× | **1.26×** | 1187.5 ms |
| Qwen 1.5B | 512 | 1.14× | 1.63× | **1.55×** | 2257.2 ms |
| Qwen 3B | 256 | 1.32× | 1.82× | **1.29×** | 2351.3 ms |
| Qwen 3B | 512 | 1.38× | 1.70× | **1.28×** | 4777.2 ms |
| Qwen 7B | 256 | 1.58× | 2.26× | **1.40×** | 4094.5 ms |
| Qwen 7B | 512 | 1.49× | 2.23× | **1.52×** | 8002.6 ms |

**Symmetric:**

| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|-------------------|------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 0.95× | 1.45× | **1.67×** | 1255.5 ms |
| Qwen 1.5B | 512 | 1.04× | 1.52× | **1.55×** | 2406.7 ms |
| Qwen 3B | 256 | 1.39× | 1.88× | **1.32×** | 2215.0 ms |
| Qwen 3B | 512 | 1.42× | 1.85× | **1.31×** | 4318.3 ms |
| Qwen 7B | 256 | 1.66× | 2.58× | **1.55×** | 3564.4 ms |
| Qwen 7B | 512 | 1.57× | 2.60× | **1.64×** | 7227.9 ms |

**NOTE**: The 8-bit accuracy level 4 path shows some regression (5–25% on 1.5B/3B models, neutral on 7B) due to the bulk fp16↔fp32 conversion overhead replacing the old per-tile approach. The old HQ CompInt8 wrappers kept small tiles cache-hot, while the new unified path does full-matrix conversion passes. This trade-off is acceptable since 4-bit is the dominant quantization format (gaining 26–67%), 8-bit acc4 still outperforms acc1 by 1.7–2.2×, and the regression is most pronounced at smaller model sizes where absolute latencies are already low. A proper fix would be 8-bit KleidiAI-style kernels rather than restoring the wrapper code.
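The bulk-conversion strategy above can be sketched with NumPy. This is a sketch only: `matmul_fp16_via_fp32` is a hypothetical stand-in for ComputeBPacked, and the plain fp32 matmul stands in for the `SQNBIT_CompInt8` MLAS path.

```python
import numpy as np

def matmul_fp16_via_fp32(a_fp16: np.ndarray, b_fp32: np.ndarray) -> np.ndarray:
    # Bulk fp16 -> fp32 conversion of A in one pass over the whole matrix,
    # instead of the old per-tile conversion inside the HQ wrapper kernels.
    a_fp32 = a_fp16.astype(np.float32)
    # fp32 compute path (stand-in for MlasQNBitGemmBatch<float>).
    c_fp32 = a_fp32 @ b_fp32
    # Bulk fp32 -> fp16 conversion of C.
    return c_fp32.astype(np.float16)
```

The trade-off noted above follows directly from this shape: two full-matrix conversion passes replace many small cache-hot per-tile conversions.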
…rt. (microsoft#27825)

### Description
Support for AArch64 SME intrinsics was added in version 19.40 of MSVC, but ONNX Runtime's stated supported versions of Visual Studio 2022 include compilers older than 19.40. This patch modifies cmake/CMakeLists.txt to check the MSVC version when MSVC is the target compiler; for versions below 19.40, KleidiAI is disabled in the build.

### Motivation and Context
This issue was raised when cross-compiling 1.24 for Windows on Arm. microsoft#27304

Signed-off-by: Colm Donelan <coldon01@e135129.arm.com>
Co-authored-by: Colm Donelan <coldon01@e135129.arm.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
)

### Description
Enable ccache and vcpkg caching for Linux workflows that use `reusable_linux_build.yml`; this saves ~15–20 min on a 100% cache hit. Also parallelises tests, saving ~6 minutes. Additionally, enable vcpkg and ccache for other Linux workflows; no numbers available for comparison.

### Motivation and Context
This change reduces wasted CO2 and time.

### Known Issues
Benign - the Android workflow doesn't seem to be populating its ccache.
### Description
See below
### Motivation and Context
Summary: The vulnerability lies in ONNX Runtime's validate_package.py
script, which uses unsanitized string concatenation with os.system() to
construct shell commands. This allows attackers to inject arbitrary
shell commands via the --package_name argument, leading to potential
remote code execution. The issue affects the release validation
pipeline, which operates with elevated privileges, exposing sensitive
credentials and secrets. The root cause is the lack of input
sanitization and the use of os.system() for command execution.
Affected code locations:
- `tools/nuget/validate_package.py` line 241: `os.system("tar zxvf " + package_name)`
- `tools/nuget/validate_package.py` line 339: `os.system("copy " + full_nuget_path + " " + nupkg_copy_name)`
Suggested fix: Replace os.system() with subprocess.run() using argument
lists (no shell interpolation):
```python
import shutil
import subprocess

# Instead of: os.system("tar zxvf " + package_name)
subprocess.run(["tar", "zxvf", package_name], check=True)

# Instead of: os.system("copy " + full_nuget_path + " " + nupkg_copy_name)
shutil.copy2(full_nuget_path, nupkg_copy_name)
```
Align maxStorageBufferBindingSize down to the nearest multiple of minStorageBufferOffsetAlignment after querying device limits. This ensures that when large buffers are split into segments, each segment's byte offset satisfies WebGPU's bind group offset alignment requirement (typically 256 bytes).
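The align-down computation described above is simple integer arithmetic; a small sketch (the constant values are illustrative, since the actual limits come from the queried device):

```python
def align_down(value: int, alignment: int) -> int:
    # Round value down to the nearest multiple of alignment, so that every
    # segment byte offset derived from the aligned size is itself a multiple
    # of the bind group offset alignment.
    return value - (value % alignment)

# Illustrative: a device might report a maxStorageBufferBindingSize that is
# not a multiple of the (typically 256-byte) minStorageBufferOffsetAlignment.
max_binding_size = align_down(134217728 + 100, 256)
```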
### Description
This PR updates the pattern matching to perform multi-head attention fusion for the conformer encoder inside [Nemotron speech](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b).

<img width="550" height="976" alt="image" src="https://github.com/user-attachments/assets/a194308e-ce69-4128-9389-aae2a64b312f" />

### Motivation and Context
These changes allow the `MultiHeadAttention` op to appear in the encoder ONNX model.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…t#27823)

### Description
DmlOperatorQuantization21 was missing the tensor reshaping logic that the older DmlOperatorElementwiseQLinear already had. Scalar scale tensors get padded to 4D, but a 5D input stays 5D. DML rejects the dimension mismatch with E_INVALIDARG, and the resulting exception unwind triggers a sized-delete bug in WRL's MakeAllocator, which address sanitizer detects. The fix is to port the same logic from DmlOperatorElementwiseQLinear into this path so that the dimensions match.

### Motivation and Context
This is required to ensure the DML EP correctly handles this scenario.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description
This change addresses a problem in the DML EP where AlignToPow2 rounded up tensorByteSize to a 4-byte boundary before the data was read from the source buffer. This caused CreateCpuResource, CreateResource, WriteToFile, and the inputRawData vector construction to read 1–3 bytes past the end of the original tensor data. CreateResource and CreateCpuResource already independently align the D3D12 resource descriptor size, so they work correctly with the original (unaligned) byte count. The fix is to move the alignment to the location where it is needed.

### Motivation and Context
This is required because it addresses a crash and incorrect behavior in the DML EP.
…ft#27595) This pull request introduces support for node "layering annotations" and improves resource accounting and memory management during graph partitioning in ONNX Runtime. The changes add new mechanisms for annotating nodes, filtering nodes by annotation during partitioning, and efficiently accounting for resources in fused nodes. Several APIs are extended to support these features, and new configuration options are introduced to guide layer assignment. **Layering annotations & partitioning:** * Added `layering_annotation_` member and associated getter/setter/clear methods to the `Node` class, allowing nodes to be annotated for layer assignment. Also added a method to clear these annotations after partitioning to save memory. (`include/onnxruntime/core/graph/graph.h`) [[1]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R177-R184) [[2]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R266-R272) [[3]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R702-R703) * Extended the graph partitioning logic to support filtering nodes by their layering annotation using a `LayeringIndex`, ensuring only nodes matching the current execution provider's assignment are considered during partitioning. 
(`onnxruntime/core/framework/graph_partitioner.cc`) [[1]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bR155) [[2]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bR199-R286) [[3]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL244-R357) [[4]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL433-R545) [[5]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL451-R564) [[6]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL477-R591) * Added a new session option `kOrtSessionOptionsLayerAssignmentSettings` to configure layer assignment using annotation prefixes per device. (`include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h`) **Resource accounting improvements:** * Improved the `IResourceAccountant` interface to allow resetting and committing pending weights per node, and updated resource accounting logic to correctly sum and commit costs for all constituent nodes in fused nodes, preventing double-counting or undercounting. (`include/onnxruntime/core/framework/resource_accountant.h`, `include/onnxruntime/core/graph/indexed_sub_graph.h`, `onnxruntime/core/framework/graph_partitioner.cc`) [[1]](diffhunk://#diff-7b1c9ef14536f9a66ed370cb729b6609d12c5907b460d8f145a7ad5a401e0fb6L48-R72) [[2]](diffhunk://#diff-3f09a80586759ee33e272477c3eb96f28d9b37f1e8251d13f1211c0450945135L89-R114) [[3]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL391-L397) **API and code organization:** * Updated the `Graph` class and related APIs to propagate layering annotations during function inlining and to provide a method for removing all layering annotations after partitioning. 
(`include/onnxruntime/core/graph/graph.h`) [[1]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R1341-R1346) [[2]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R1590-R1594) * Moved the `CreateAccountants` function out of the `NodeStatsRecorder` class to the namespace level for clarity. (`include/onnxruntime/core/framework/resource_accountant.h`) These changes enable more flexible and memory-efficient graph partitioning, particularly for scenarios involving hardware-specific layer assignments and dynamic resource constraints. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…icrosoft#27699)

### Description
If the ONNX file is malformed, it could lead to an incorrect memory access. This change ensures that does not happen.

### Motivation and Context
Security issue.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…icrosoft#27778) This PR is on top of a previous PR and fixes the remaining issues. microsoft#27706 All tests here should be passing now over webgpu: https://wpt.live/webnn/conformance_tests/dequantizeLinear.https.any.html?gpu --------- Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
)

### Description
Add a pre-check for zero values in the divisor tensor for integral types in `Mod`. Returns an error `Status` instead of hitting undefined behavior (SIGFPE / structured exception).

- **`element_wise_ops.cc`**: Added `CheckZeroDivisorImpl` as a single template struct in the `mod_internal` namespace using `if constexpr (std::is_integral<T>::value)` to guard the check — no-op for non-integer types. The struct's `operator()` returns `Status` (via `ORT_RETURN_IF`) and is dispatched with `InvokeRet<Status>`. When the divisor is a constant initializer, `TryGetConstantInput` validates for zeros once at kernel creation time in the out-of-line constructor (using `ORT_THROW_IF_ERROR`), avoiding per-`Compute` overhead. A `divisor_is_validated_constant_` flag tracks whether the one-time check was performed. In `Compute`, non-constant divisors are scanned via the type dispatcher (using `ORT_RETURN_IF_ERROR`) before calling `CallModImpl`, skipping the check when the constant was already validated. The Mod constructor is defined out-of-line after the `mod_internal` namespace to keep it contiguous.
- **`element_wise_ops_test.cc`**: Added `Mod_int8_by_zero`, `Mod_int32_by_zero`, `Mod_int64_by_zero_scalar` tests covering tensor and scalar divisor cases, plus `Mod_int32_by_zero_constant_initializer` to exercise the `TryGetConstantInput` constructor path with `is_initializer = true`.

### Motivation and Context
Integer modulo by zero is UB in C++ and causes a hardware exception that crashes the process. Float types produce NaN naturally via `std::fmod`, but int8/int16/int32/int64/uint* types do not. This is the same class of issue that was fixed for the `Div` operator in microsoft#27693, now applied to the `Mod` operator.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
## Description
Adds per-session thread pool work callbacks, allowing callers to hook into the enqueue/start/stop/abandon lifecycle of thread pool work items. The feature is gated behind a build flag (`--enable_session_threadpool_callbacks`) with zero overhead when disabled.

## API additions
- C API: `OrtApi::SetPerSessionThreadPoolCallbacks` — stores an `OrtThreadPoolCallbacksConfig` on the `OrtEnv`, applied to per-session thread pools
- C++ wrapper: `Ort::Env::SetPerSessionThreadPoolCallbacks`
- Versioned C config struct `OrtThreadPoolCallbacksConfig` with fields: `on_enqueue`, `on_start_work`, `on_stop_work`, `on_abandon`, `user_context`
- Four callback typedefs: `OrtThreadPoolWorkEnqueueFn`, `OrtThreadPoolWorkStartFn`, `OrtThreadPoolWorkStopFn`, `OrtThreadPoolWorkAbandonFn`

## Implementation
- `EigenNonBlockingThreadPool.h`: Introduced a policy-based design with two compile-time callback policies:
  - `WorkNoCallbackPolicy`: `Work = std::function<void()>`, all callback methods are trivial inlines eliminated by the compiler. Zero overhead for non-callback builds.
  - `WorkWithCallbackPolicy`: `Work = WorkItem` bundling tasks with callback data; invokes user callbacks around task execution via `MakeWork`/`Execute`/`OnEnqueue`/`OnAbandon` methods.
- `ThreadPoolTempl<Environment, CallbackPolicy>` uses the policy for all callback-related operations.
- `RunQueue::RevokeWithTag` calls `policy_->OnAbandon(e.w)` on successful revocation; the policy implementation decides whether to invoke user callbacks.
- `threadpool.h`: `extended_eigen_threadpool_` changed to `unique_ptr<ExtendedThreadPoolInterface>` for type erasure across policy instantiations. `EnableSpinning`/`DisableSpinning` added to the virtual interface.
- `threadpool.cc`: Single `#ifdef` selects policy at `ThreadPoolTempl` instantiation.
- `environment.h/.cc`: Added `SetPerSessionWorkCallbacks`/`GetPerSessionWorkCallbacks` on `Environment`.
- `inference_session.cc`: Propagates callbacks from `Environment` to per-session thread pool options.
- `thread_utils.h/.cc`: Added callback fields to `OrtThreadPoolParams` and wiring in `CreateThreadPoolHelper`.
- `env.h`: `OrtThreadPoolCallbacksConfig*` pointer in `ThreadOptions`.

## Build
- CMake option `onnxruntime_ENABLE_SESSION_THREADPOOL_CALLBACKS`; `build.py` argument `--enable_session_threadpool_callbacks`

## Tests
- 8 callback-specific tests: Schedule, OnEnqueueOnly, NoCallbacks, ParallelFor, ParallelSection, Abandon, EnqueueReturnsNull, NoEnqueueWithStartStop
- End-to-end C API test (`SetPerSessionThreadPoolCallbacks` via ModelBuilder with 1M-element Mul)
- All 73 existing ThreadPool tests pass unchanged with both callback-enabled and callback-disabled builds (81/81 and 73/73 respectively)

## Motivation and Context
Thread pool work callbacks enable telemetry, tracing, and resource management by providing visibility into when work is enqueued, executed, and abandoned in per-session thread pools. This is needed for production diagnostics and performance instrumentation scenarios.

Co-authored-by: Siyuan Peng <siyuanpeng@microsoft.com>
…icrosoft#27834)

Use tile_size_k_vec=32 (instead of 16) for the MatMulNBits default kernel, doubling the number of threads working on K-dimension reduction per output row. This improves token generation throughput by ~3% on NVIDIA GPUs by better utilizing memory bandwidth. Intel devices retain tile_size_k_vec=16 due to different subgroup and cache characteristics.

Changes:
- matmul_nbits.h: Add tile_size_k_vec parameter (default 16) to the MatMulNBitsProgram constructor.
- matmul_nbits.cc: Select tile_size_k_vec=32 for non-Intel vendors and pass it to the program constructor.
### Description
Run-level profiling (introduced in PR microsoft#26846) does not currently capture profiling events for operators inside subgraphs. This PR fixes that by threading the `run_profiler` pointer through `OpKernelContextInternal` to subgraph execution, following the same pattern as `terminate_flag`.

### Root Cause
`utils::ExecuteSubgraph()` had no `run_profiler` parameter and always passed `nullptr` to `ExecuteGraphImpl`, so nested operators (inside If, Loop, Scan, BeamSearch, GreedySearch) were never profiled at the run level.

### Fix
1. **`OpKernelContextInternal`** — Added `run_profiler_` member and `GetRunProfiler()` accessor.
2. **`SessionScope` / `ExecuteKernel()`** — Pass the run profiler into `OpKernelContextInternal`.
3. **`ExecuteSubgraph()`** — Added `profiling::Profiler* run_profiler = nullptr` parameter, forwarded to `ExecuteGraphImpl()`.
4. **Control flow ops** (`if.cc`, `loop.cc`, `scan_utils.cc`) — Pass `context_.GetRunProfiler()` to `ExecuteSubgraph()`.
5. **Contrib transformer ops** (`beam_search_impl_gpt.h`, `beam_search_impl_t5.h`, `beam_search_impl_whisper.h`, `greedy_search_impl_gpt.h`) — All 8 `ExecuteSubgraph()` call sites updated to pass `this->context_.GetRunProfiler()`.

Plugin EP control flow kernels (`PluginEpIfKernelImpl`, etc.) delegate to the same internal kernels, so the fix propagates automatically.

### Tests
- **`CheckRunProfilerWithSubgraph`** (`inference_session_test.cc`) — Runs `if_mul.onnx`, enables run profiling, asserts `mul_0` (inside If's then-branch) appears in the profile JSON.
- **`CheckRunProfilerWithBeamSearch`** (`beam_search_test.cc`) — Runs `tiny_gpt2_beamsearch.onnx`, enables run profiling, asserts decoder subgraph Node entries (beyond the top-level BeamSearch op) appear in the profile JSON.
### Files Changed (12 files)

| File | Change |
|------|--------|
| `core/framework/op_kernel_context_internal.h` | Added `run_profiler_` member, `GetRunProfiler()`, constructor param |
| `core/framework/sequential_executor.cc` | `SessionScope::GetRunProfiler()`, pass to `OpKernelContextInternal` |
| `core/framework/utils.h` / `utils.cc` | `run_profiler` param on `ExecuteSubgraph()` |
| `core/providers/cpu/controlflow/if.cc` | Forward `GetRunProfiler()` |
| `core/providers/cpu/controlflow/loop.cc` | Forward `GetRunProfiler()` |
| `core/providers/cpu/controlflow/scan_utils.cc` | Forward `GetRunProfiler()` |
| `contrib_ops/cpu/transformers/beam_search_impl_gpt.h` | 2 call sites |
| `contrib_ops/cpu/transformers/beam_search_impl_t5.h` | 2 call sites |
| `contrib_ops/cpu/transformers/beam_search_impl_whisper.h` | 2 call sites |
| `contrib_ops/cpu/transformers/greedy_search_impl_gpt.h` | 2 call sites |
| `test/framework/inference_session_test.cc` | `CheckRunProfilerWithSubgraph` test |
| `test/contrib_ops/beam_search_test.cc` | `CheckRunProfilerWithBeamSearch` test |
### Description
Replace `actions/cache@v4` with `actions/cache@v5`.

### Motivation and Context
`actions/cache@v4` uses Node 20, which is deprecated.
This pull request introduces a new synchronization API for plugin execution providers (EPs) in ONNX Runtime, and adds comprehensive test infrastructure to verify its usage. The main theme is enabling EPs to synchronize device operations, which is particularly important for IO binding and async execution scenarios. The changes also update the test framework to support and validate this new capability. **Synchronization API for Plugin EPs:** * Added a new optional `Sync` method to the `OrtEp` C API interface, allowing EPs to block until all preceding device tasks are complete. This is primarily used by IO binding to ensure device inputs are ready before execution. (`include/onnxruntime/core/session/onnxruntime_ep_c_api.h`) * Implemented the `Sync` method in the example plugin EP, with a test hook that increments a counter for verification purposes. (`onnxruntime/test/autoep/library/example_plugin_ep/ep.cc`, `onnxruntime/test/autoep/library/example_plugin_ep/ep.h`) [[1]](diffhunk://#diff-60ddcfdf7fe7273a7f06c4c1eb39933737e6fe8c2f00bdf2e5f49c2d1f911fa4R187) [[2]](diffhunk://#diff-60ddcfdf7fe7273a7f06c4c1eb39933737e6fe8c2f00bdf2e5f49c2d1f911fa4R589-R601) [[3]](diffhunk://#diff-5e9391ab7d2d558c5fa992b5fc373add5c52225aa43ce1af323ffbd8c2b86733R105-R106) **Test Infrastructure and Verification:** * Added test hooks (`ExampleEpTestHooks_ResetSyncCount`, `ExampleEpTestHooks_GetSyncCount`) to the example plugin EP, allowing tests to reset and retrieve the sync call count. (`onnxruntime/test/autoep/library/example_plugin_ep/ep_test_hooks.h`, `onnxruntime/test/autoep/library/example_plugin_ep/ep_test_hooks.cc`) [[1]](diffhunk://#diff-a587d529618260bec7cbecf107513dacb795fff9fb34ae99c3a2db36bdcc8befR1-R23) [[2]](diffhunk://#diff-7123fbca69d2580f0483d6589817e275c05b086c1fb56281a83f0fb895bdc06fR1-R11) * Updated test execution logic to load these hooks dynamically and verify that the `Sync` method is called exactly once during inference with IO binding. 
(`onnxruntime/test/autoep/test_execution.cc`) [[1]](diffhunk://#diff-3e289607015487374dcf7d9ab1d73a2ca3c3e5a44cab5958e4334afcdd5f4e28R299-R358) [[2]](diffhunk://#diff-3e289607015487374dcf7d9ab1d73a2ca3c3e5a44cab5958e4334afcdd5f4e28R1099-R1119) **Plugin EP Interface Updates:** * Extended the `PluginExecutionProvider` C++ interface to support the new `Sync` method, delegating to the plugin EP if implemented. (`onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.h`, `onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc`) [[1]](diffhunk://#diff-db92123bb63f8b1cc0a776ba3dcad95118826d031c8f65e79969cfaddb8c3e0aR117-R118) [[2]](diffhunk://#diff-6dac10650c4e1c5a55b95378173b33e95b300bf7c2350d8476088693b98652a5R632-R638) **Performance Test Framework Enhancements:** * Added logic to detect if a plugin EP uses an NVIDIA GPU device, enabling CUDA IO binding automatically in performance tests when appropriate. (`onnxruntime/test/perftest/common_utils.cc`, `onnxruntime/test/perftest/utils.h`, `onnxruntime/test/perftest/ort_test_session.cc`) [[1]](diffhunk://#diff-2b8b7de0106a523d40c40f901f6ff170bff722b0c147fbfec36b269e21c9526bR203-R221) [[2]](diffhunk://#diff-228a0b2557ae67945d94db8f9e74bb523517c2aa738db91fcfdda0958fa65f6cR40-R41) [[3]](diffhunk://#diff-f3908cf4d2982fd2b008beb8c951d9cab67b28cb07077493b8d1dd8b448d6e18R98-R108) * Ensured that async execution is used in performance tests with IO binding, relying on the new synchronization mechanism. 
(`onnxruntime/test/perftest/ort_test_session.cc`) [[1]](diffhunk://#diff-f3908cf4d2982fd2b008beb8c951d9cab67b28cb07077493b8d1dd8b448d6e18L57) [[2]](diffhunk://#diff-f3908cf4d2982fd2b008beb8c951d9cab67b28cb07077493b8d1dd8b448d6e18R66-R69)

These changes collectively improve device synchronization support for plugin EPs and provide robust testing to ensure correct behavior.
## Description
This PR adds a standalone CUDA Plugin Execution Provider (`CudaPluginExecutionProvider`) built as a dynamically loadable shared library (`libonnxruntime_providers_cuda_plugin.so`) on top of the ORT EP Plugin API. The implementation reuses the existing CUDA kernel stack through adapter/shim layers (force-included headers and macro-based registration overrides), eliminating the need to maintain a parallel copy of 100+ CUDA kernels. CUDA Graph capture/replay is intentionally deferred until the plugin-facing EP API exposes the required session callbacks.

## Summary of Changes

### Build system and CMake

| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Adds `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN` build option, records plugin build info, and includes the plugin-specific CMake file. |
| `cmake/onnxruntime_providers_cuda_plugin.cmake` | **New.** Defines the plugin shared-library target: collects `.cc`/`.cu` sources from `core/providers/cuda/` and `contrib_ops/cuda/`, applies exclusion filters for incompatible files (tunable, controlflow, registration tables), force-includes adapter headers, and links CUDA/cuDNN/ORT components. |
| `cmake/onnxruntime_providers_cuda.cmake` | Minor additions to expose include paths needed by plugin builds. |
| `cmake/onnxruntime_unittests.cmake` | Enables dynamic plugin EP usage in provider tests and fills in missing CUDA include/link settings for the plugin configuration. |
| `cmake/external/cuda_configuration.cmake` | Adds CUDA configuration support for the plugin build path. |

### Plugin runtime implementation (new files)

| File | Purpose |
|------|---------|
| `plugin/cuda_ep_factory.cc/.h` | Implements `OrtEpFactory` — device enumeration, session-option parsing, allocator registration, kernel registry creation, and all static C-compatible plugin callbacks. Thread-safe lazy kernel registry initialization. |
| `plugin/cuda_ep.cc/.h` | Plugin-side CUDA EP object deriving from `ep::adapter::Ep`. Carries session-specific `Config` (NHWC preference, TF32, cuDNN algorithm selection, convolution workspace, attention kernels). |
| `plugin/cuda_allocator_plugin.cc/.h` | Plugin allocators for device and pinned memory, exposed through the EP API. |
| `plugin/cuda_stream_plugin.cc/.h` | Plugin-owned CUDA stream, cuBLAS, cuBLASLt, and cuDNN handle management. Provides two stream adapter modes (`PluginStreamShim` for `.cc`, `OrtStreamAdapter` for `.cu`/`.cc` contexts). |
| `plugin/cuda_data_transfer_plugin.cc/.h` | Data transfer bridge for host↔device copies used by plugin-backed tensors and Python bindings. |
| `plugin/cuda_memcpy_plugin.cc` | MemcpyToHost / MemcpyFromHost kernel implementations for the plugin path. |
| `plugin/cuda_controlflow_plugin.cc/.cu/.h` | Plugin-native `If`, `Loop`, and `Scan` wrappers that delegate to `OrtEpApi` control-flow hooks instead of inheriting from in-tree CPU base implementations. |
| `plugin/cuda_plugin_ep.cc` | Exports the DLL entry points (`OrtCreateEpFactory` / `OrtReleaseEpFactory`) used by ORT to create and release the CUDA EP factory. |
| `plugin/cuda_kernel_adapter.h` | **Core shim** (1088 lines). Provides `CudaKernel` base class, error-return macros, type helpers (`ToCudaType`), handle-management abstractions, and stream adapters. Force-included in all plugin `.cc` files to transparently adapt existing kernel code. |
| `plugin/cuda_plugin_kernels.cu/.h` | Aggregates self-registered kernel definitions via `PluginKernelCollector` macro overrides, replacing the centralized registration tables used in the bundled build. |
| `plugin/cuda_plugin_utils.h` | Shared utility helpers for the plugin (logging, error checking, config parsing). |
| `plugin/provider_api_shims.cc` | Stub implementations for shared-provider bridge functions that are not needed in the plugin path. |
| `plugin/cuda_plugin_ep_symbols.def` | Windows symbol export definitions for the plugin DLL. |

### EP adapter and API extensions

| File | Change |
|------|--------|
| `include/onnxruntime/ep/api.h` | Makes plugin API initialization thread-safe; preserves access to ORT, EP, and model editor API tables during plugin loading. |
| `include/onnxruntime/ep/adapter/node.h` | Adds node metadata accessors (operator domain, optional-output handling) needed by reused CUDA kernels. |
| `include/onnxruntime/ep/adapter/op_kernel.h` | Adds `RequiredInput`/`RequiredOutput` helpers and adapter fixes so existing CUDA kernels run against plugin adapter contexts. |
| `include/onnxruntime/ep/adapter/op_kernel_info.h` | Extends adapter kernel-info with attribute and config accessors required by migrated kernels. |
| `include/onnxruntime/ep/adapter/allocator.h` | Minor allocator adapter adjustments for plugin compatibility. |
| `include/onnxruntime/ep/adapter/kernel_def_builder.h` | Adds kernel definition builder hooks for plugin registration. |
| `include/onnxruntime/core/framework/tensor.h` | Restores a plugin-only `Tensor::Create` compatibility path for kernels relying on the older static factory form. |
| `onnxruntime/core/providers/shared_library/provider_api.h` | Turns the shared-provider bridge into a no-op for plugin builds so the EP adapter facade owns type resolution. |

### CUDA kernel compatibility migration
- Adapts ~80 core CUDA and contrib CUDA kernel source files to compile under the plugin build via macro-based registration overrides and targeted compatibility fixes (not operator rewrites).
- Moves or templates reusable helper logic in shared CPU/CUDA headers (`ConstantOfShapeBase`, `PadBase`, `SliceBase`, `SplitBase`, `ScatterND`, `UpsampleBase`, `DeformConvAttributes`) so kernels compile in adapter mode.
- Key contrib kernel adaptations: attention variants (MHA, GQA, paged, sparse, packed), skip-layer-norm, group-norm, MoE, fused-conv, inverse, bias-dropout, matmul-nbits, qordered ops.
- Key core kernel adaptations: softmax, topk, conv/conv-transpose, batch-norm, instance-norm, pool, RNN, reduction, einsum, matmul, cumsum, identity, pad, split, scatter-nd, slice, upsample, tile, unsqueeze, gather-nd, concat, dropout, non-max-suppression.

### Python integration

| File | Change |
|------|--------|
| `onnxruntime/python/onnxruntime_pybind_module.cc` | Extends `get_available_providers()` to surface dynamically registered plugin EPs discovered from `OrtEpDevice` enumeration. |
| `onnxruntime/python/onnxruntime_pybind_state.cc` | Allows Python session creation to instantiate providers from registered plugin EP devices, including `device_id` selection, instead of only built-in or legacy dynamic-load EP paths. |
| `onnxruntime/python/onnxruntime_pybind_schema.cc` | Adds schema query support for plugin-registered operators. |

### Testing and validation

| File | Change |
|------|--------|
| `test/python/transformers/test_cuda_plugin_ep.py` | **New** (1861 lines). Comprehensive test suite covering 5 stages: registration, ONNX ops, NHWC layout preference, contrib ops, and op-level validation. |
| `test/python/transformers/cuda_plugin_ep_helper.py` | **New** (192 lines). Utility for transparently routing existing tests to the plugin EP. |
| `test/python/transformers/test_gqa.py` | Fixes `total_sequence_length` tensor placement from CUDA to CPU (was causing failures under the plugin EP's stricter memory layout); routes tests through plugin EP. |
| `test/python/transformers/test_moe_cuda.py` | Routes through plugin EP when available. |
| `test/framework/dynamic_plugin_ep_test.cc` | **New** (120 lines). C++ unit test exercising dynamic plugin EP loading and device enumeration. |
| `test/unittest_util/base_tester.cc` | Routes CUDA test requests to `CudaPluginExecutionProvider` when registered, allowing existing CUDA provider tests to exercise the plugin path. |
| `tools/ci_build/cuda_plugin_parity_report.py` | **New** (737 lines). |
Comparison script that produces a parity report of ops in bundled-only vs. plugin-only vs. both builds, via static parsing or runtime registry interrogation. | ### Documentation | File | Change | |------|--------| | `docs/cuda_plugin_ep/cuda_plugin_ep_design.md` | **New** (990 lines). Plugin architecture, build/deployment flow, operator exclusions, adapter design, and the decision to defer CUDA Graph support. | | `docs/cuda_plugin_ep/QUICK_START.md` | **New** (108 lines). Build instructions, C++ and Python usage examples, and known limitations. | ### Other | File | Change | |------|--------| | `tools/python/gen_opkernel_doc.py` | Extended to generate documentation for plugin-registered kernels. | | `orttraining/.../reduction_ops.cc` | Minor compatibility fix for training reduction ops under the plugin build configuration. | ## Testing - **Build**: Configure with `--build_cuda_ep_as_plugin` (or `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`); verify `libonnxruntime_providers_cuda_plugin.so` is produced alongside existing CUDA provider artifacts. - **C++ unit tests**: Run `onnxruntime_provider_test` — `BaseTester` routes CUDA coverage through `CudaPluginExecutionProvider`. Run the new `dynamic_plugin_ep_test` for load/enumerate validation. - **Python tests**: Register the plugin library, confirm `onnxruntime.get_available_providers()` includes `CudaPluginExecutionProvider`, and run `test_cuda_plugin_ep.py` (5-stage suite: registration → ONNX ops → NHWC → contrib ops → op validation). - **Parity report**: Run `tools/ci_build/cuda_plugin_parity_report.py` to verify kernel coverage parity between bundled and plugin builds. - **Backward compatibility**: Verify unchanged behavior for the in-tree CUDA EP build path (`onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=OFF`). - **Known limitation**: CUDA graph support remains disabled in the plugin path and is documented as deferred. 
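The three-way bucketing the parity report produces (bundled-only vs. plugin-only vs. both) can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the script's actual API — the `partition` helper and the toy op lists below are hypothetical:

```python
# Hypothetical sketch of the op bucketing a parity report performs.
# Real input would come from parsing kernel registration sources or
# interrogating the runtime registries of both builds.

def partition(bundled_ops, plugin_ops):
    """Split op names into bundled-only, plugin-only, and both."""
    bundled, plugin = set(bundled_ops), set(plugin_ops)
    return {
        "bundled_only": sorted(bundled - plugin),
        "plugin_only": sorted(plugin - bundled),
        "both": sorted(bundled & plugin),
    }

# Toy registries standing in for the parsed kernel tables.
report = partition(
    bundled_ops=["Conv", "MatMul", "Softmax", "If"],
    plugin_ops=["Conv", "MatMul", "Softmax", "MemcpyToHost"],
)
print(report["bundled_only"])  # ['If']
print(report["plugin_only"])   # ['MemcpyToHost']
print(report["both"])          # ['Conv', 'MatMul', 'Softmax']
```

A non-empty `bundled_only` bucket is the signal that the plugin build's exclusion filters dropped an op that the in-tree build still registers.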
## Motivation and Context

The CUDA EP is currently compiled into the ORT runtime binary, tightly coupling its release cycle to the core runtime. This PR creates a path to decouple CUDA EP delivery by implementing it as a standalone plugin using the EP Plugin API.

The key design tradeoff is reusing the existing ~100+ CUDA kernel implementations through force-include adapter headers and macro-based registration overrides, rather than rewriting them. This approach validates the plugin EP against current CUDA coverage without maintaining a second kernel stack, at the cost of introducing adapter/shim complexity. CUDA Graph support is explicitly deferred until the EP Plugin API can represent the capture/replay lifecycle.

**Related**: PR microsoft#27817 (CUDA Plugin EP: Test Coverage & Bug Fixes) is squash-merged into this branch.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (or documented in description)
- [ ] CI passes
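The self-registration approach that the `PluginKernelCollector` macro overrides implement (kernels register themselves at static-initialization time instead of being listed in a central table) can be illustrated with a Python analog. The decorator and class names below are illustrative only — the C++ version works through redefined registration macros, not decorators:

```python
# Python analog of the self-registration pattern behind the
# PluginKernelCollector macro overrides: each kernel definition appends
# itself to a collector, so no centralized registration table is needed.
# All names here are hypothetical stand-ins for the C++ machinery.

PLUGIN_KERNEL_REGISTRY = []

def register_kernel(op_type, since_version):
    """Decorator standing in for a redefined registration macro."""
    def wrap(cls):
        PLUGIN_KERNEL_REGISTRY.append((op_type, since_version, cls))
        return cls
    return wrap

@register_kernel("MemcpyFromHost", since_version=1)
class MemcpyFromHost:
    def compute(self, ctx):
        ...  # would copy host -> device

@register_kernel("MemcpyToHost", since_version=1)
class MemcpyToHost:
    def compute(self, ctx):
        ...  # would copy device -> host

# At factory-creation time the collected entries build the kernel
# registry in one pass; adding a kernel file needs no table edit.
print([op for op, _, _ in PLUGIN_KERNEL_REGISTRY])
```

The same property is what lets the plugin build compile unmodified kernel sources: the force-included adapter header swaps the registration macro, and every kernel the file defines lands in the plugin registry automatically.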
…ft#27914)

### Description

Specify `main` as the target branch for the release candidate cron job.

### Motivation and Context

The pipeline won't work without a branch specifier.
ankitm3k approved these changes Apr 1, 2026
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.