[RFC] asm/x86: Use L2 prefetch in mc_avx2 to reduce cache stalls (6% gain at 4K)#1482
[RFC] asm/x86: Use L2 prefetch in mc_avx2 to reduce cache stalls (6% gain at 4K)#1482Toiabzahoor wants to merge 1 commit intomemorysafety:mainfrom
Conversation
kkysen
left a comment
There was a problem hiding this comment.
Hi! Thanks for the PR! This is certainly very interesting and promising. However, in rav1d, we haven't changed any of the assembly from dav1d, as we're not very familiar with assembly and we don't want to get out of sync with dav1d. I assume this isn't something that we can do in Rust (there is _mm_prefetch, but we can't easily interleave calls inside the asm loop), but if we can, and it applies to other assembly/SIMD variants as well, that would be a perfect solution.
Also, have you checked if upstream [dav1d](https://code.videolan.org/videolan/dav1d) has done this yet, as rav1d is somewhat behind backporting all of the latest changes? If it's not, I'm sure they would welcome this change and then we could backport it once the review and merge it. I think they could also help better with your questions about assembly.
And then finally, it seems some of the tests are failing. I'm not sure what's going wrong there, but I assume some error is happening in the new assembly? Is that the page boundary issue you mentioned, or something else?
Thanks for the helpful feedback and for clarifying the project's approach to the assembly files. I completely understand wanting to stay in sync with the upstream repository to avoid maintenance overhead. Regarding the suggestion to use Rust intrinsics, interleaving those calls directly into the loop isn't quite feasible. We would have to rewrite the entire block using Rust SIMD intrinsics to get the prefetch instructions placed exactly where they need to be inside the loop execution, which moves us away from keeping the original assembly intact. Taking this directly to the upstream developers makes the most sense. I will close this pull request and submit an RFC over on the VideoLAN side to see what their assembly maintainers think of the approach. As for the failing tests, I looked into the CI logs and they seem isolated to the macOS x86_64 runner. The tests are aborting with an exit status 1 during the verification step while running under memory and undefined behavior sanitizers (ASAN/MSAN/UBSAN). Since this runs fine natively on Linux without instrumentations, it is highly likely the sanitizers on macOS are catching a speculative out-of-bounds read from the prefetch offsets, or there is a strict stack alignment quirk triggering the failure. I will debug that specific sanitizer interaction before I submit it upstream. Thanks again for your time and pointing me in the right direction! |
Yup, that's what I assumed. Rewriting the whole thing in Rust SIMD really would be nice (it can be far safer and avoid annoying Rust to asm calls that are not always well optimized), but that's a much bigger change. |
I completely agree, having it in pure Rust SIMD would be the ideal end goal to eliminate that FFI overhead. |
I think @kkysen is making an important point here. If this is something that could benefit dav1d too; the change would ideally be made upsteam such that both projects would benefit. The dav1d developers are assembly experts so you'd likely get better feedback too. |
The Bottleneck
While profiling
mc_avx2.asmon x86_64, I noticed significant stall cycles during motion compensation, specifically when reading strided reference rows for sub-pixel interpolation on high-resolution workloads.The Experiment
To combat this without causing L1 cache pollution (which would evict the active filter coefficients), I introduced manual L2 prefetches (
prefetcht1) for the upcoming rows (+ssq*2, etc.).The Results
Benchmarking a 10-second 4K 8-bit IVF file (
hyperfine, 10 runs, 1 thread) shows a consistent ~6% overall speedup, with pure User CPU time dropping from 1.808s to 1.756s.Request for Comments
I am relatively new to writing assembly and wanted to put this up as an RFC.