Improve compile and run time of marray_* by reducing template instantiations (NFCI) by Maetveis · Pull Request #1173 · KhronosGroup/SYCL-CTS

Maetveis · 2026-01-21T10:04:24Z

Reduce the number of template instantiations in marray_operators.h by making the initial sequences and scalars runtime parameters instead of template parameters.

This reduces the number of SYCL kernels by 4x to 16x depending on the test case.

The change could theoretically decrease runtime performance, because the kernels now have to do runtime dispatch on the sequence initialization. It seems like that's not the case and runtime is also significantly improved, likely by reducing kernel JIT times.
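To illustrate the idea, here is a minimal host-only sketch of moving the initial sequence and scalar from template parameters to runtime parameters. The names init_sequence, init_scalar, all_init_sequences and all_init_scalars come from the diff in this PR. The enumerators, make_sequence, make_scalar and sum_after_add are hypothetical, and the element-wise addition only stands in for the operators under test.

```cpp
#include <array>
#include <cstddef>

// Hypothetical enumerators; the real ones in marray_operators.h may differ.
enum class init_sequence { zeros, ones, ascending, descending };
enum class init_scalar { zero, one, two };

constexpr std::array<init_sequence, 4> all_init_sequences = {
    init_sequence::zeros, init_sequence::ones, init_sequence::ascending,
    init_sequence::descending};
constexpr std::array<init_scalar, 3> all_init_scalars = {
    init_scalar::zero, init_scalar::one, init_scalar::two};

// Runtime dispatch: the sequence kind is an ordinary argument, so this
// function is instantiated once per N instead of once per (N, sequence).
template <std::size_t N>
std::array<int, N> make_sequence(init_sequence seq) {
  std::array<int, N> values{};
  for (std::size_t i = 0; i < N; ++i) {
    switch (seq) {
      case init_sequence::zeros: values[i] = 0; break;
      case init_sequence::ones: values[i] = 1; break;
      case init_sequence::ascending: values[i] = static_cast<int>(i) + 1; break;
      case init_sequence::descending: values[i] = static_cast<int>(N - i); break;
    }
  }
  return values;
}

inline int make_scalar(init_scalar sca) {
  switch (sca) {
    case init_scalar::zero: return 0;
    case init_scalar::one: return 1;
    case init_scalar::two: return 2;
  }
  return 0;  // not reached for valid enumerators
}

// Stand-in for one test body: adds the scalar to every element and sums
// the results. Only N is a template parameter.
template <std::size_t N>
int sum_after_add(init_sequence seq, init_scalar sca) {
  const std::array<int, N> input = make_sequence<N>(seq);
  const int scalar = make_scalar(sca);
  int sum = 0;
  for (const int v : input) sum += v + scalar;
  return sum;
}
```

In this hypothetical setup, a version templated on both the sequence and the scalar would need 4 x 3 = 12 instantiations per N, while the runtime-parameter version needs one.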

These numbers are from a local run on an Intel PVC GPU. Rows marked JIT were compiled with -fsycl-targets=spir64, rows marked AOT with -fsycl-targets=intel_gpu_pvc.
For the JIT runtime measurements, the on-disk cache was removed before each run. (With a warm cache, the JIT runtime numbers are the same as AOT.)

I used the following commands:

TESTS=(
    test_marray_arithmetic_assignment
    test_marray_arithmetic_binary
    test_marray_basic
    test_marray_bitwise
    test_marray_pre_post
    test_marray_relational
)
# For compilation
ninja -C build "${TESTS[@]}"
# Runtime
for test in "${TESTS[@]}"; do
    build/bin/${test}
done

Compile time

| Mode | Without PR | With PR | Speedup |
|------|------------|---------|---------|
| JIT  | 9m 38s     | 1m 35s  | 6.1x    |
| AOT  | 12m 32s    | 1m 57s  | 6.4x    |

Runtime

| Mode | Without PR | With PR | Speedup |
|------|------------|---------|---------|
| JIT  | 10m 6s     | 2m 9s   | 4.7x    |
| AOT  | 52s        | 42s     | 1.2x    |
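The speedup column is the time without the PR divided by the time with it. A small sketch of that arithmetic follows, using the JIT compile-time row as the worked example. The helper names to_seconds and speedup are illustrative only.

```cpp
// Converts a "Xm Ys" wall-clock reading to seconds.
constexpr int to_seconds(int minutes, int seconds) {
  return minutes * 60 + seconds;
}

// Ratio of the time without the PR to the time with it.
constexpr double speedup(int before_s, int after_s) {
  return static_cast<double>(before_s) / after_s;
}

// Compile time, JIT row: 9m 38s (578s) vs 1m 35s (95s).
constexpr double jit_compile_speedup =
    speedup(to_seconds(9, 38), to_seconds(1, 35));
static_assert(jit_compile_speedup > 6.0 && jit_compile_speedup < 6.2,
              "rounds to the 6.1x reported in the table");
```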

…iations (NFCI)

Reduce the number of template instantiations in marray_operators.h by making the initial sequences and scalars runtime parameters instead of template parameters.

This reduces the number of SYCL kernels by 4x to 16x depending on the test case.

This could theoretically decrease runtime performance, but it seems like that's also significantly improved, likely by reducing kernel JIT times.

These numbers are from a local run on an Intel GPU. Obtained by running the commands

```bash
TESTS=(
    test_marray_arithmetic_assignment
    test_marray_arithmetic_binary
    test_marray_basic
    test_marray_bitwise
    test_marray_pre_post
    test_marray_relational
)
ninja -C build "${TESTS[@]}"
```

- Before: 15m 20s (920s)
- After: 2m 19s (139s) (6.6x speedup)

Runtime:

```bash
for test in "${TESTS[@]}"; do
    build/bin/${test}
done
```

- Before: 20m 51s (1251s)
- After: 3m 8s (188s) (6.7x speedup)

bader · 2026-01-21T18:35:15Z

> It seems like that's not the case and runtime is also significantly improved, likely by reducing kernel JIT times.

@Maetveis, do you have build and run time numbers for AOT mode?

Maetveis · 2026-01-22T09:02:44Z

> > It seems like that's not the case and runtime is also significantly improved, likely by reducing kernel JIT times.
>
> @Maetveis, do you have build and run time numbers for AOT mode?

I updated the description with AOT measurements. The JIT numbers are also new, because I was previously compiling with debug info enabled. All the new numbers are with -DCMAKE_BUILD_TYPE=Release.
The overall picture stays roughly the same.

Even in AOT mode there is some speedup at runtime. I would guess it comes from reduced overhead in the SYCL runtime's internal per-kernel data structures, but I didn't investigate further.

TApplencourt · 2026-02-02T17:07:03Z

tests/marray_basic/marray_operators.h

+    for (const init_sequence seq : all_init_sequences) {
+      for (const init_scalar sca : all_init_scalars) {


One day we will be able to use https://en.cppreference.com/w/cpp/ranges/cartesian_product_view.html and stop dealing with so many indentation levels :p
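Until C++23's std::views::cartesian_product is an option, a small helper can flatten such a loop nest. The sketch below is a hypothetical, eager C++11-compatible stand-in, not something in SYCL-CTS. The name cartesian_pairs is made up for illustration, and it allocates a std::vector, so it is only suitable for host-side loops.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not part of SYCL-CTS: eagerly builds every (a, b)
// pair in the same order as two nested range-for loops would visit them.
template <typename RangeA, typename RangeB>
auto cartesian_pairs(const RangeA& first, const RangeB& second)
    -> std::vector<std::pair<typename RangeA::value_type,
                             typename RangeB::value_type>> {
  std::vector<std::pair<typename RangeA::value_type,
                        typename RangeB::value_type>>
      pairs;
  pairs.reserve(first.size() * second.size());
  for (const auto& a : first) {
    for (const auto& b : second) {
      pairs.emplace_back(a, b);
    }
  }
  return pairs;
}
```

With C++17 structured bindings, the two loops above could then be written as a single `for (const auto& [seq, sca] : cartesian_pairs(all_init_sequences, all_init_scalars))`, at the cost of materializing all pairs up front.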

TApplencourt · 2026-02-02T17:07:45Z

LGTM, thanks! I'm a little sad to lose the expressiveness of for_all_combinations, but the speed-ups are too good to be picky :)

Maetveis · 2026-02-04T09:15:21Z

@bader can this PR be merged?

bader · 2026-02-04T18:19:11Z

@TApplencourt, any objections to merge?

TApplencourt · 2026-02-04T19:33:41Z

None, thanks guys!

Maetveis requested a review from a team as a code owner January 21, 2026 10:04

TApplencourt added the "Agenda" label (To be discussed during a SYCL committee meeting) Jan 27, 2026

TApplencourt reviewed Feb 2, 2026

TApplencourt approved these changes Feb 2, 2026

bader merged commit cea4ae6 into KhronosGroup:main Feb 4, 2026
9 checks passed

Maetveis deleted the improve_marray_comptime branch February 4, 2026 20:53
