
Sync with Microsoft ONNX Runtime - 31032026#1007

Closed
ai-fw-intg wants to merge 15 commits into ovep-develop from sync_msft_31032026

Conversation

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

Copilot AI and others added 15 commits March 25, 2026 15:17
…#27342)

### Description

Moves the `--build_wasm_static_lib → --build_wasm` implication from
`build.py` into `build_args.py`'s post-processing, **before** the cmake
generator selection. Previously, `build_args.py` chose the generator
based on `args.build_wasm` (still `False`), and `build.py` only set it
to `True` afterwards—too late.

- **`tools/ci_build/build_args.py`**: Set `args.build_wasm = True` when
`args.build_wasm_static_lib` is set, prior to generator and
cross-compilation logic.
- **`tools/ci_build/build.py`**: Remove the now-redundant identical
check.
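The corrected ordering can be sketched with a minimal argparse stand-in (hypothetical simplified option set and generator choice, not the real `build_args.py` logic):

```python
import argparse

def parse_build_args(argv):
    # Hypothetical stand-in for build_args.py (the real option set is much larger).
    parser = argparse.ArgumentParser()
    parser.add_argument("--build_wasm", action="store_true")
    parser.add_argument("--build_wasm_static_lib", action="store_true")
    args = parser.parse_args(argv)

    # The fix: apply the implication *before* generator selection,
    # so the selection below sees the correct flag.
    if args.build_wasm_static_lib:
        args.build_wasm = True

    # Generator selection (simplified): wasm builds need Ninja on Windows.
    args.cmake_generator = "Ninja" if args.build_wasm else "Visual Studio 17 2022"
    return args
```

With the implication applied first, `--build_wasm_static_lib` alone now selects the wasm generator.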

### Motivation and Context

Using `--build_wasm_static_lib` without `--build_wasm` caused cmake to
use the wrong generator (e.g., Visual Studio instead of Ninja on
Windows) and miss Emscripten-specific configuration, leading to build
failures like missing `libiconv`.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com>
…n MatMulNBits (microsoft#27820)

### Description

Routes fp16 `HQNBIT_CompInt8` through the fp32 MLAS path
(`SQNBIT_CompInt8`) at the operator level for both 4-bit and 8-bit
MatMulNBits, then removes the ~370 lines of dead HQ CompInt8 wrapper
code from MLAS.

**Operator changes (matmul_nbits.cc):**
- PrePack: Uses `SQNBIT_CompInt8` for sizing/packing, pre-converts fp16
scales and bias to fp32, computes BZpCorr for asymmetric KleidiAI on
ARM64.
- ComputeBPacked: Bulk fp16→fp32 conversion of A, calls
`MlasQNBitGemmBatch<float>` with `SQNBIT_CompInt8`, bulk fp32→fp16
conversion of C.

**MLAS cleanup (qnbitgemm.cpp, qnbitgemm_kernel_neon.cpp):**
- Removed `HQ4BitGemm_CompInt8`, `HQ8BitGemm_CompInt8`,
`HQ8BitCompInt8PerGemmWorkspace`, associated enum values, dispatch
branches, workspace entries, and `HQNBIT_CompInt8` NEON kernel
conditions.
- Added `HQNBIT_CompInt8` → `SQNBIT_CompInt8` redirect in
`MlasIsQNBitGemmAvailable` for `GetComputeType<MLFloat16>`
compatibility.

### Motivation and Context

The HQ CompInt8 kernels are wrappers that convert fp16→fp32 per-tile
before calling the same SQ fp32 kernels. This change:
1. **Eliminates per-tile overhead** via bulk conversion at the operator
level.
2. **Enables KleidiAI for fp16 4-bit** — previously bypassed by the
`HQNBIT_CompInt8` path.
3. **Removes ~370 lines of dead wrapper code** from MLAS.
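A minimal Python sketch of the bulk-conversion flow (hypothetical names; `fp32_gemm` stands in for `MlasQNBitGemmBatch<float>`, and byte buffers stand in for tensors):

```python
import struct

def bulk_fp16_to_fp32(raw: bytes) -> list:
    # One full-matrix conversion pass, replacing the old per-tile
    # fp16->fp32 wrappers inside MLAS ("e" is IEEE half precision).
    return list(struct.unpack(f"<{len(raw) // 2}e", raw))

def bulk_fp32_to_fp16(vals) -> bytes:
    return struct.pack(f"<{len(vals)}e", *vals)

def matmul_nbits_fp16(a_fp16: bytes, fp32_gemm) -> bytes:
    # Hypothetical ComputeBPacked flow: convert A once, run the fp32
    # (SQNBIT_CompInt8) kernel, convert C back once.
    a = bulk_fp16_to_fp32(a_fp16)
    c = fp32_gemm(a)  # stands in for MlasQNBitGemmBatch<float>
    return bulk_fp32_to_fp16(c)
```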

### Improvements
Measured on `Snapdragon X Elite - X1E78100 - Qualcomm Oryon CPU`

**Asymmetric:**

| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|--------------------|-------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 1.28× | 1.55× | **1.26×** | 1187.5ms |
| Qwen 1.5B | 512 | 1.14× | 1.63× | **1.55×** | 2257.2ms |
| Qwen 3B | 256 | 1.32× | 1.82× | **1.29×** | 2351.3ms |
| Qwen 3B | 512 | 1.38× | 1.70× | **1.28×** | 4777.2ms |
| Qwen 7B | 256 | 1.58× | 2.26× | **1.40×** | 4094.5ms |
| Qwen 7B | 512 | 1.49× | 2.23× | **1.52×** | 8002.6ms |

**Symmetric:**

| Model | Seq Len | Acc1/Acc4 (before) | Acc1/Acc4 (after) | Acc4 speedup | Acc4 latency (after) |
|-------|---------|--------------------|-------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 0.95× | 1.45× | **1.67×** | 1255.5ms |
| Qwen 1.5B | 512 | 1.04× | 1.52× | **1.55×** | 2406.7ms |
| Qwen 3B | 256 | 1.39× | 1.88× | **1.32×** | 2215.0ms |
| Qwen 3B | 512 | 1.42× | 1.85× | **1.31×** | 4318.3ms |
| Qwen 7B | 256 | 1.66× | 2.58× | **1.55×** | 3564.4ms |
| Qwen 7B | 512 | 1.57× | 2.60× | **1.64×** | 7227.9ms |

**NOTE**: The 8-bit accuracy level 4 path shows some regression (5–25%
on 1.5B/3B models, neutral on 7B) due to the bulk fp16↔fp32 conversion
overhead replacing the old per-tile approach. The old HQ CompInt8
wrappers kept small tiles cache-hot, while the new unified path does
full-matrix conversion passes. This trade-off is acceptable since 4-bit
is the dominant quantization format (gaining 26–67%), 8-bit acc4 still
outperforms acc1 by 1.7–2.2×, and the regression is most pronounced at
smaller model sizes where absolute latencies are already low. A proper
fix would be 8-bit KleidiAI-style kernels rather than restoring the
wrapper code.
…rt. (microsoft#27825)

### Description
Support for Aarch64 SME intrinsics was added in MSVC version 19.40, but ONNX Runtime's stated minimum supported Visual Studio 2022 version predates 19.40.

This patch modifies cmake/CMakeLists.txt to check the MSVC version when MSVC is the target compiler; for versions below 19.40, KleidiAI is disabled in the build.
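A sketch of what such a guard might look like in cmake/CMakeLists.txt (the `onnxruntime_USE_KLEIDIAI` option name is assumed; the actual check may differ):

```cmake
# Aarch64 SME intrinsics need MSVC 19.40+, so disable KleidiAI on
# older toolsets (hypothetical sketch, option name assumed).
if(MSVC AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS "19.40")
  message(WARNING "MSVC ${CMAKE_CXX_COMPILER_VERSION} lacks SME intrinsics; disabling KleidiAI")
  set(onnxruntime_USE_KLEIDIAI OFF)
endif()
```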

### Motivation and Context
This issue was raised when cross compiling 1.24 for Windows on Arm.
microsoft#27304

---------

Signed-off-by: Colm Donelan <coldon01@e135129.arm.com>
Co-authored-by: Colm Donelan <coldon01@e135129.arm.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
)

### Description
Enable ccache and vcpkg caching for Linux workflows that use
`reusable_linux_build.yml`, saving ~15-20 min on a 100% cache hit.
Also parallelizes tests, saving ~6 minutes.

Additionally, enable vcpkg and ccache for the other Linux workflows. No
numbers available for comparison.

### Motivation and Context

This change reduces wasted CO2 and time.

### Known Issues

Benign: the Android workflow doesn't seem to be populating its ccache.
### Description
See below



### Motivation and Context
Summary: The vulnerability lies in ONNX Runtime's validate_package.py
script, which uses unsanitized string concatenation with os.system() to
construct shell commands. This allows attackers to inject arbitrary
shell commands via the --package_name argument, leading to potential
remote code execution. The issue affects the release validation
pipeline, which operates with elevated privileges, exposing sensitive
credentials and secrets. The root cause is the lack of input
sanitization and the use of os.system() for command execution.

Affected code locations:

- `tools/nuget/validate_package.py` line 241: `os.system("tar zxvf " + package_name)`
- `tools/nuget/validate_package.py` line 339: `os.system("copy " + full_nuget_path + " " + nupkg_copy_name)`

Suggested fix: replace `os.system()` with `subprocess.run()` using argument lists (no shell interpolation):

```
import shutil
import subprocess

# Instead of: os.system("tar zxvf " + package_name)
subprocess.run(["tar", "zxvf", package_name], check=True)

# Instead of: os.system("copy " + full_nuget_path + " " + nupkg_copy_name)
shutil.copy2(full_nuget_path, nupkg_copy_name)
```
Align maxStorageBufferBindingSize down to the nearest multiple of
minStorageBufferOffsetAlignment after querying device limits. This
ensures that when large buffers are split into segments, each segment's
byte offset satisfies WebGPU's bind group offset alignment requirement
(typically 256 bytes).
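The align-down step can be sketched as follows (assuming a power-of-two alignment, which WebGPU guarantees for `minStorageBufferOffsetAlignment`; the variable names are illustrative):

```python
def align_down(value: int, alignment: int) -> int:
    # Round down to the nearest multiple of `alignment`; the bit mask
    # is valid because the alignment is a power of two (typically 256).
    return value & ~(alignment - 1)

# A hypothetical device limit that is not already a multiple of 256:
max_binding = align_down((1 << 30) + 3, 256)
# Every segment offset k * max_binding is now 256-byte aligned.
```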
### Description

This PR updates the pattern matchings to perform multi-head attention
fusion for the conformer encoder inside [Nemotron
speech](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b).

<img width="550" height="976" alt="image"
src="https://github.com/user-attachments/assets/a194308e-ce69-4128-9389-aae2a64b312f"
/>

### Motivation and Context

These changes allow the `MultiHeadAttention` op to appear in the encoder
ONNX model.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…t#27823)

### Description
DmlOperatorQuantization21 was missing the tensor reshaping logic that
the older DmlOperatorElementwiseQLinear already had.

Scalar scale tensors get padded to 4D, but a 5D input stays 5D. DML
rejects the dimension mismatch with E_INVALIDARG, and the resulting
exception unwind triggers a sized-delete bug in WRL's MakeAllocator
that AddressSanitizer detects. The fix ports the same logic from
DmlOperatorElementwiseQLinear into this path so that the dimensions
match.
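The reshaping idea can be sketched as a rank-padding helper (hypothetical; not the DML EP code):

```python
def pad_to_input_rank(scale_dims, input_rank):
    # Prepend 1s so the scale tensor's dimension count matches the
    # input's, since DML requires equal dimension counts.
    return [1] * (input_rank - len(scale_dims)) + list(scale_dims)
```

A scalar scale against a 5D input becomes `[1, 1, 1, 1, 1]` instead of a fixed 4D shape.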

### Motivation and Context
This is required to ensure the DML EP correctly handles this scenario.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description
This change addresses a problem in the DML EP where AlignToPow2
rounded tensorByteSize up to a 4-byte boundary before the data was read
from the source buffer. This caused CreateCpuResource, CreateResource,
WriteToFile, and the inputRawData vector construction to read 1–3 bytes
past the end of the original tensor data.

CreateResource and CreateCpuResource already independently align the
D3D12 resource descriptor size, so they work correctly with the original
(unaligned) byte count. The fix is to move the alignment to the location
where it's needed.
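The over-read hazard can be sketched as follows (illustrative only; names are hypothetical):

```python
def align_to_pow2(size: int, pow2: int = 4) -> int:
    # Round *up* to the next multiple of a power-of-two boundary.
    return (size + pow2 - 1) & ~(pow2 - 1)

# Reading align_to_pow2(5) == 8 bytes from a 5-byte source buffer would
# read 3 bytes past its end; the aligned size belongs only on the D3D12
# resource descriptor, not on the source-buffer read.
```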

### Motivation and Context
This is required because it addresses a crash / incorrect behavior in
the DML EP.
…ft#27595)

This pull request introduces support for node "layering annotations" and
improves resource accounting and memory management during graph
partitioning in ONNX Runtime. The changes add new mechanisms for
annotating nodes, filtering nodes by annotation during partitioning, and
efficiently accounting for resources in fused nodes. Several APIs are
extended to support these features, and new configuration options are
introduced to guide layer assignment.

**Layering annotations & partitioning:**

* Added `layering_annotation_` member and associated getter/setter/clear
methods to the `Node` class, allowing nodes to be annotated for layer
assignment. Also added a method to clear these annotations after
partitioning to save memory. (`include/onnxruntime/core/graph/graph.h`)
[[1]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R177-R184)
[[2]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R266-R272)
[[3]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R702-R703)
* Extended the graph partitioning logic to support filtering nodes by
their layering annotation using a `LayeringIndex`, ensuring only nodes
matching the current execution provider's assignment are considered
during partitioning. (`onnxruntime/core/framework/graph_partitioner.cc`)
[[1]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bR155)
[[2]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bR199-R286)
[[3]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL244-R357)
[[4]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL433-R545)
[[5]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL451-R564)
[[6]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL477-R591)
* Added a new session option `kOrtSessionOptionsLayerAssignmentSettings`
to configure layer assignment using annotation prefixes per device.
(`include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h`)
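The annotation-based filtering can be sketched as follows (hypothetical node representation and prefix semantics, not the actual graph_partitioner.cc code):

```python
def nodes_for_ep(nodes, ep_prefix):
    # During partitioning, offer an execution provider only the nodes
    # whose layering annotation matches the prefix configured for that
    # provider's device; unannotated nodes are excluded here.
    return [n for n in nodes
            if n.get("layering_annotation", "").startswith(ep_prefix)]
```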

**Resource accounting improvements:**

* Improved the `IResourceAccountant` interface to allow resetting and
committing pending weights per node, and updated resource accounting
logic to correctly sum and commit costs for all constituent nodes in
fused nodes, preventing double-counting or undercounting.
(`include/onnxruntime/core/framework/resource_accountant.h`,
`include/onnxruntime/core/graph/indexed_sub_graph.h`,
`onnxruntime/core/framework/graph_partitioner.cc`)
[[1]](diffhunk://#diff-7b1c9ef14536f9a66ed370cb729b6609d12c5907b460d8f145a7ad5a401e0fb6L48-R72)
[[2]](diffhunk://#diff-3f09a80586759ee33e272477c3eb96f28d9b37f1e8251d13f1211c0450945135L89-R114)
[[3]](diffhunk://#diff-e2d3910ae7593ee7ba4fd74e53f738fa973ae2fc32c069f1088ba458b91f8d4bL391-L397)
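A minimal Python stand-in for the improved accounting flow (hypothetical API names; the real interface is `IResourceAccountant` in C++):

```python
class ResourceAccountant:
    # Pending per-node costs are reset, accumulated, then committed
    # exactly once per fused node.
    def __init__(self):
        self.committed = 0
        self._pending = 0

    def reset_pending(self):
        self._pending = 0

    def add_pending(self, cost):
        self._pending += cost

    def commit_pending(self):
        self.committed += self._pending
        self._pending = 0

def account_fused_node(acct, constituent_costs):
    # Sum costs across all constituent nodes of a fused node, then
    # commit once, so nothing is double-counted or undercounted.
    acct.reset_pending()
    for cost in constituent_costs:
        acct.add_pending(cost)
    acct.commit_pending()
```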

**API and code organization:**

* Updated the `Graph` class and related APIs to propagate layering
annotations during function inlining and to provide a method for
removing all layering annotations after partitioning.
(`include/onnxruntime/core/graph/graph.h`)
[[1]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R1341-R1346)
[[2]](diffhunk://#diff-aaea1507ec81a94c72a1fa72ce320df712156b665f7798573be3f7e439bb4c37R1590-R1594)
* Moved the `CreateAccountants` function out of the `NodeStatsRecorder`
class to the namespace level for clarity.
(`include/onnxruntime/core/framework/resource_accountant.h`)

These changes enable more flexible and memory-efficient graph
partitioning, particularly for scenarios involving hardware-specific
layer assignments and dynamic resource constraints.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…icrosoft#27699)

### Description
If the ONNX file is malformed, it could lead to an incorrect memory
access. This change enforces that does not happen.



### Motivation and Context
security issue

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…icrosoft#27778)

This PR is on top of a previous PR and fixes the remaining issues.
microsoft#27706

All tests here should now pass on WebGPU:

https://wpt.live/webnn/conformance_tests/dequantizeLinear.https.any.html?gpu

---------

Co-authored-by: edgchen1 <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>