Improve compile and run time of marray_* by reducing template instantiations (NFCI) #1173

Merged
bader merged 1 commit into KhronosGroup:main from Maetveis:improve_marray_comptime on Feb 4, 2026

@Maetveis Maetveis commented Jan 21, 2026

Reduce the number of template instantiations in marray_operators.h by making the initial sequences and scalars runtime parameters instead of template parameters.

This reduces the number of SYCL kernels by 4x to 16x depending on the test case.

The change could theoretically decrease runtime performance, because the kernels now have to do runtime dispatch on the sequence initialization. It seems like that's not the case and runtime is also significantly improved, likely by reducing kernel JIT times.
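The trade-off described above can be sketched as follows. This is a minimal illustration, not the code in marray_operators.h: the enumerators and the helpers `fill_templated`, `fill_runtime`, and `check_variants_agree` are hypothetical names, and only the type name `init_sequence` comes from the diff.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical enumerators; the real init_sequence values may differ.
enum class init_sequence { zero, ascending, descending };

// Before: the sequence kind is a template parameter, so any templated
// caller (for example a kernel) is instantiated once per enumerator.
template <init_sequence Seq, std::size_t N>
std::array<int, N> fill_templated() {
  std::array<int, N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    if constexpr (Seq == init_sequence::zero)
      out[i] = 0;
    else if constexpr (Seq == init_sequence::ascending)
      out[i] = static_cast<int>(i) + 1;
    else
      out[i] = static_cast<int>(N - i);
  }
  return out;
}

// After: the sequence kind is an ordinary argument, so there is one
// instantiation per N and the choice is made by a runtime switch.
template <std::size_t N>
std::array<int, N> fill_runtime(init_sequence seq) {
  std::array<int, N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    switch (seq) {
      case init_sequence::zero: out[i] = 0; break;
      case init_sequence::ascending: out[i] = static_cast<int>(i) + 1; break;
      case init_sequence::descending: out[i] = static_cast<int>(N - i); break;
    }
  }
  return out;
}

// Both variants produce the same data; only the instantiation count differs.
inline void check_variants_agree() {
  assert((fill_templated<init_sequence::zero, 4>() ==
          fill_runtime<4>(init_sequence::zero)));
  assert((fill_templated<init_sequence::ascending, 4>() ==
          fill_runtime<4>(init_sequence::ascending)));
  assert((fill_templated<init_sequence::descending, 4>() ==
          fill_runtime<4>(init_sequence::descending)));
}
```

In this sketch the templated variant yields three instantiations per `N` and the runtime variant yields one, at the cost of a `switch` inside the loop.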

These numbers are from a local run on an Intel PVC GPU. For the rows marked JIT, the tests were compiled with -fsycl-targets=spir64, and for the rows marked AOT with -fsycl-targets=intel_gpu_pvc.
For the JIT runtime measurements, the on-disk cache was removed before each run (with a warm cache, the JIT runtime numbers are the same as AOT).

I used the following commands:

TESTS=(
    test_marray_arithmetic_assignment
    test_marray_arithmetic_binary
    test_marray_basic
    test_marray_bitwise
    test_marray_pre_post
    test_marray_relational
)
# For compilation
ninja -C build "${TESTS[@]}"
# Runtime
for test in "${TESTS[@]}"; do
    build/bin/${test}
done

Compile time

| | Without PR | With PR | Speedup |
|---|---|---|---|
| JIT | 9m 38s | 1m 35s | 6.1x |
| AOT | 12m 32s | 1m 57s | 6.4x |

Runtime

| | Without PR | With PR | Speedup |
|---|---|---|---|
| JIT | 10m 6s | 2m 9s | 4.7x |
| AOT | 52s | 42s | 1.2x |
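The speedup column is the time without the PR divided by the time with it. A small sketch that re-derives the four reported factors from the measured times follows. The helper names `to_seconds`, `speedup`, and `rounds_to` are hypothetical and introduced only for this check.

```cpp
#include <cassert>

// Convert a "Xm Ys" wall-clock reading to seconds.
constexpr int to_seconds(int minutes, int seconds) {
  return minutes * 60 + seconds;
}

// Speedup = time without the PR / time with the PR.
constexpr double speedup(int before_s, int after_s) {
  assert(after_s > 0);
  return static_cast<double>(before_s) / after_s;
}

// True if `value` rounds to `reported` at one decimal place.
constexpr bool rounds_to(double value, double reported) {
  const double diff = value > reported ? value - reported : reported - value;
  return diff <= 0.05;
}

// Compile time: JIT and AOT rows.
static_assert(rounds_to(speedup(to_seconds(9, 38), to_seconds(1, 35)), 6.1));
static_assert(rounds_to(speedup(to_seconds(12, 32), to_seconds(1, 57)), 6.4));
// Runtime: JIT and AOT rows.
static_assert(rounds_to(speedup(to_seconds(10, 6), to_seconds(2, 9)), 4.7));
static_assert(rounds_to(speedup(to_seconds(0, 52), to_seconds(0, 42)), 1.2));
```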

…iations (NFCI)

Reduce the number of template instantiations in marray_operators.h by
making the initial sequences and scalars runtime parameters instead of
template parameters.

This reduces the number of SYCL kernels by 4x to 16x depending on the
test case.

This could theoretically decrease runtime performance, but it seems like
that's also significantly improved, likely by reducing kernel JIT times.

These numbers are from a local run on an Intel GPU, obtained by running the following commands:
```bash
TESTS=(
    test_marray_arithmetic_assignment
    test_marray_arithmetic_binary
    test_marray_basic
    test_marray_bitwise
    test_marray_pre_post
    test_marray_relational
)
ninja -C build "${TESTS[@]}"
```

- Before: 15m 20s (920s)
- After: 2m 19s (139s) (6.6x speedup)

Runtime:

```bash
for test in "${TESTS[@]}"; do
    build/bin/${test}
done
```

- Before: 20m 51s (1251s)
- After:  3m  8s  (188s) (6.7x speedup)
@Maetveis Maetveis requested a review from a team as a code owner January 21, 2026 10:04

bader commented Jan 21, 2026

> It seems like that's not the case and runtime is also significantly improved, likely by reducing kernel JIT times.

@Maetveis, do you have build and run time numbers for AOT mode?

Maetveis commented Jan 22, 2026

> > It seems like that's not the case and runtime is also significantly improved, likely by reducing kernel JIT times.
>
> @Maetveis, do you have build and run time numbers for AOT mode?

I updated the description with AOT measurements. The JIT numbers are also new, because I was previously compiling with debug info enabled. All the new numbers are with -DCMAKE_BUILD_TYPE=Release.
The overall picture stays roughly the same.

Even in AOT mode there is some speedup at runtime. I would guess it comes from reduced overhead in the per-kernel internal data structures of the SYCL runtime, but I didn't investigate further.

@TApplencourt TApplencourt added the Agenda To be discussed during a SYCL committee meeting label Jan 27, 2026
Comment on lines +367 to +368
for (const init_sequence seq : all_init_sequences) {
for (const init_scalar sca : all_init_scalars) {

One day we will be able to use https://en.cppreference.com/w/cpp/ranges/cartesian_product_view.html and stop dealing with so many indentation levels :p
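For illustration, here is a sketch of what that could look like, assuming a C++23 standard library that provides `std::views::cartesian_product`. It falls back to the nested loops otherwise. The enumerators, the array contents, and the helper `all_combinations` are hypothetical. Only the names `init_sequence`, `init_scalar`, `all_init_sequences`, and `all_init_scalars` come from the diff.

```cpp
#include <array>
#include <cassert>
#include <utility>
#include <vector>
#if __has_include(<version>)
#include <version>  // provides the library feature-test macros
#endif
#if defined(__cpp_lib_ranges_cartesian_product)
#include <ranges>
#endif

// Hypothetical enumerators; the real ones in the test suite may differ.
enum class init_sequence { zero, ascending };
enum class init_scalar { zero, one };

inline constexpr std::array all_init_sequences{init_sequence::zero,
                                               init_sequence::ascending};
inline constexpr std::array all_init_scalars{init_scalar::zero,
                                             init_scalar::one};

// Collect every (sequence, scalar) pair in nested-loop order.
inline std::vector<std::pair<init_sequence, init_scalar>> all_combinations() {
  std::vector<std::pair<init_sequence, init_scalar>> out;
#if defined(__cpp_lib_ranges_cartesian_product)
  // One flat loop; the rightmost range varies fastest.
  for (const auto [seq, sca] :
       std::views::cartesian_product(all_init_sequences, all_init_scalars)) {
    out.emplace_back(seq, sca);
  }
#else
  for (const init_sequence seq : all_init_sequences) {
    for (const init_scalar sca : all_init_scalars) {
      out.emplace_back(seq, sca);
    }
  }
#endif
  assert(out.size() == all_init_sequences.size() * all_init_scalars.size());
  return out;
}
```

Both branches visit the pairs in the same order, so the flat loop would be a drop-in replacement wherever the library supports it.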

@TApplencourt

LGTM, thanks! I'm a little sad to lose the expressiveness of for_all_combinations, but the speed-ups are too good to be picky :)

Maetveis commented Feb 4, 2026

@bader can this PR be merged?

bader commented Feb 4, 2026

@TApplencourt, any objections to merge?

@TApplencourt

None, thanks guys!

@bader bader merged commit cea4ae6 into KhronosGroup:main Feb 4, 2026
9 checks passed
@Maetveis Maetveis deleted the improve_marray_comptime branch February 4, 2026 20:53