Improve compile and run time of marray_* by reducing template instantiations (NFCI) #1173
Reduce the number of template instantiations in marray_operators.h by
making the initial sequences and scalars runtime parameters instead of
template parameters.
This reduces the number of SYCL kernels by 4x to 16x depending on the
test case.
In theory this could decrease runtime performance, but in practice runtime
also improves significantly, likely because of reduced kernel JIT times.
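To illustrate why moving the sequence and scalar from template parameters to runtime parameters shrinks the number of instantiations, here is a minimal plain C++17 sketch without any SYCL dependency. The enumerator names, the `check_*` functions, and the counters are hypothetical stand-ins, not the code from `marray_operators.h`; the 12-to-2 ratio it produces only reflects the sizes chosen here.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical stand-ins for the test suite's enums (enumerators are assumed).
enum class init_sequence { zero, increasing, decreasing };
enum class init_scalar { zero, one };

inline constexpr std::size_t num_sequences = 3;
inline constexpr std::size_t num_scalars = 2;

inline int templated_instances = 0;
inline int runtime_instances = 0;

// "Before" style: every (T, Seq, Sca) triple is a distinct instantiation.
template <typename T, init_sequence Seq, init_scalar Sca>
void check_templated() {
  // A function-local static is initialized once per distinct instantiation.
  static const int once = ++templated_instances;
  (void)once;
}

// "After" style: only T is a template parameter; the rest is passed at run time.
template <typename T>
void check_runtime(init_sequence, init_scalar) {
  static const int once = ++runtime_instances;
  (void)once;
}

// Instantiates check_templated for every (sequence, scalar) combination.
template <typename T, std::size_t... I>
void run_all_templated(std::index_sequence<I...>) {
  (check_templated<T, static_cast<init_sequence>(I / num_scalars),
                   static_cast<init_scalar>(I % num_scalars)>(),
   ...);
}

// Covers the same combinations with ordinary loops.
template <typename T>
void run_all_runtime() {
  for (std::size_t s = 0; s < num_sequences; ++s)
    for (std::size_t c = 0; c < num_scalars; ++c)
      check_runtime<T>(static_cast<init_sequence>(s),
                       static_cast<init_scalar>(c));
}

inline int count_templated_instances() {
  run_all_templated<int>(std::make_index_sequence<num_sequences * num_scalars>{});
  run_all_templated<float>(std::make_index_sequence<num_sequences * num_scalars>{});
  return templated_instances;
}

inline int count_runtime_instances() {
  run_all_runtime<int>();
  run_all_runtime<float>();
  return runtime_instances;
}
```

With two element types, three sequences, and two scalars, the templated style creates one instantiation per combination, while the runtime style creates one per element type. In a SYCL test each such instantiation would typically correspond to its own kernel.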
These numbers are from a local run on an Intel GPU. Compile time, measured with:
```bash
TESTS=(
test_marray_arithmetic_assignment
test_marray_arithmetic_binary
test_marray_basic
test_marray_bitwise
test_marray_pre_post
test_marray_relational
)
ninja -C build "${TESTS[@]}"
```
- Before: 15m 20s (920s)
- After: 2m 19s (139s) (6.6x speedup)
Runtime:
```bash
for test in "${TESTS[@]}"; do
build/bin/${test}
done
```
- Before: 20m 51s (1251s)
- After: 3m 8s (188s) (6.7x speedup)
@Maetveis, do you have build and run time numbers for AOT mode?
I updated the description with AOT measurements. The JIT numbers are also new because I was compiling with debug info enabled before. The new numbers are all with … Even with AOT mode there is some speedup at runtime. I would guess it comes from reduced overhead in the SYCL runtime's per-kernel internal data structures, but I didn't investigate further.
    for (const init_sequence seq : all_init_sequences) {
      for (const init_scalar sca : all_init_scalars) {
One day we will be able to use https://en.cppreference.com/w/cpp/ranges/cartesian_product_view.html and stop dealing with so many indentation levels :p
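Until the C++23 `std::views::cartesian_product` is usable in the test suite, a small helper can flatten the nested loops in a similar way. This is only a sketch: `for_each_pair` and `collect_pairs` are hypothetical names, and the enumerators and the contents of `all_init_sequences` and `all_init_scalars` are assumed for illustration.

```cpp
#include <array>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the test suite's enums (enumerators are assumed).
enum class init_sequence { zero, increasing, decreasing };
enum class init_scalar { zero, one };

inline constexpr std::array all_init_sequences{
    init_sequence::zero, init_sequence::increasing, init_sequence::decreasing};
inline constexpr std::array all_init_scalars{init_scalar::zero, init_scalar::one};

// Hypothetical helper: calls fn(a, b) for every pair in the Cartesian product,
// iterating the second range fastest, like the nested loops it replaces.
template <typename RangeA, typename RangeB, typename Fn>
void for_each_pair(const RangeA& as, const RangeB& bs, Fn&& fn) {
  for (const auto& a : as)
    for (const auto& b : bs)
      fn(a, b);
}

// Example use: the call site needs only one level of indentation.
inline std::vector<std::pair<init_sequence, init_scalar>> collect_pairs() {
  std::vector<std::pair<init_sequence, init_scalar>> out;
  for_each_pair(all_init_sequences, all_init_scalars,
                [&](init_sequence seq, init_scalar sca) {
                  out.emplace_back(seq, sca);
                });
  return out;
}
```

The nesting still exists inside the helper, but each test body is written once as a lambda instead of being wrapped in two loops.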
LGTM thanks! I'm a little sad to lose the expressiveness of …
@bader can this PR be merged?
@TApplencourt, any objections to merge?
None, thanks guys!
Reduce the number of template instantiations in marray_operators.h by making the initial sequences and scalars runtime parameters instead of template parameters.
This reduces the number of SYCL kernels by 4x to 16x depending on the test case.
The change could theoretically decrease runtime performance, because the kernels now have to do runtime dispatch on the sequence initialization. In practice that does not seem to be the case: runtime also improves significantly, likely because of reduced kernel JIT times.
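The runtime dispatch mentioned above could look roughly like the following sketch, where a `switch` on an enum value replaces a template parameter. The function `make_sequence`, the enumerators, and the values each sequence produces are assumptions for illustration, not the actual test code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the test suite's enum (enumerators are assumed).
enum class init_sequence { zero, increasing, decreasing };

// The sequence kind is an ordinary argument, so one instantiation per (T, N)
// serves every kind; the cost is a branch per element at run time.
template <typename T, std::size_t N>
std::array<T, N> make_sequence(init_sequence seq) {
  std::array<T, N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    switch (seq) {
      case init_sequence::zero:
        out[i] = T{0};
        break;
      case init_sequence::increasing:
        out[i] = static_cast<T>(i + 1);  // 1, 2, ..., N
        break;
      case init_sequence::decreasing:
        out[i] = static_cast<T>(N - i);  // N, N-1, ..., 1
        break;
      default:
        assert(false && "unknown init_sequence");
        break;
    }
  }
  return out;
}
```

The branch is cheap compared with compiling a separate kernel for each sequence kind, which is consistent with the measurements reported here, although this sketch does not itself measure anything.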
These numbers are from a local run on an Intel PVC GPU. For the rows marked JIT, the tests were compiled with `-fsycl-targets=spir64`, and for AOT with `-fsycl-targets=intel_gpu_pvc`.
For runtime testing in the case of JIT, the on-disk cache was removed before each run. (With a warm cache, the JIT runtime numbers are the same as AOT.)
I used the following commands:
```bash
TESTS=(
test_marray_arithmetic_assignment
test_marray_arithmetic_binary
test_marray_basic
test_marray_bitwise
test_marray_pre_post
test_marray_relational
)
# For compilation
ninja -C build "${TESTS[@]}"
# Runtime
for test in "${TESTS[@]}"; do
build/bin/${test}
done
```
Compile time
Runtime