[RFC] asm/x86: Use L2 prefetch in mc_avx2 to reduce cache stalls (6% gain at 4K)#1482

Closed
Toiabzahoor wants to merge 1 commit into memorysafety:main from Toiabzahoor:l2-prefetch-opt

Toiabzahoor commented Apr 3, 2026

The Bottleneck
While profiling mc_avx2.asm on x86_64, I noticed significant stall cycles during motion compensation, specifically when reading strided reference rows for sub-pixel interpolation on high-resolution workloads.

The Experiment
To combat this without causing L1 cache pollution (which would evict the active filter coefficients), I introduced manual L2 prefetches (prefetcht1) for the upcoming rows (+ssq*2, etc.).
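For readers more at home in Rust than in NASM, the idea can be sketched roughly as follows. This is an illustration only, not the patch itself: `vertical_avg` and `prefetch_l2` are hypothetical names, and the two-row average is a stand-in for the real sub-pixel filters. The `_mm_prefetch` intrinsic with `_MM_HINT_T1` is the `core::arch` counterpart of `prefetcht1`.

```rust
/// Hint that the cache line at `p` will be needed soon, using the T1 hint
/// (the intrinsic form of `prefetcht1`). A prefetch is only a hint and does
/// not fault, even when the address is not mapped.
#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch_l2(p: *const u8) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T1};
    unsafe { _mm_prefetch::<_MM_HINT_T1>(p as *const i8) }
}

/// Fallback for other architectures: no prefetch is issued.
#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch_l2(_p: *const u8) {}

/// Hypothetical stand-in for a vertical filter: rounded average of each row
/// with the row below it. Reads `h + 1` rows of `w` pixels from `src`.
pub fn vertical_avg(src: &[u8], stride: usize, w: usize, h: usize) -> Vec<u8> {
    assert!(w <= stride && src.len() >= h * stride + w);
    let mut dst = vec![0u8; w * h];
    for y in 0..h {
        // Prefetch the row two strides below the current one, mirroring the
        // `+ssq*2` idea. `wrapping_add` keeps the address arithmetic defined
        // even when the target lies past the end of `src`.
        prefetch_l2(src.as_ptr().wrapping_add((y + 2) * stride));
        let top = &src[y * stride..][..w];
        let bot = &src[(y + 1) * stride..][..w];
        for x in 0..w {
            dst[y * w + x] = ((top[x] as u16 + bot[x] as u16 + 1) >> 1) as u8;
        }
    }
    dst
}
```

The prefetch changes only timing, never results, so the output is identical with or without it.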

The Results
Benchmarking a 10-second 4K 8-bit IVF file (hyperfine, 10 runs, 1 thread) shows a ~6% speedup in mean wall-clock time (2.011 s to 1.894 s), with user CPU time dropping from 1.808 s to 1.756 s (about 2.9%).

Benchmark 1: ./dav1d_base -i 4k_test_8bit.ivf -o /dev/null --threads 1
  Time (mean ± σ):      2.011 s ±  0.258 s    [User: 1.808 s, System: 0.044 s]

Benchmark 2: ./dav1d_opt -i 4k_test_8bit.ivf -o /dev/null --threads 1
  Time (mean ± σ):      1.894 s ±  0.087 s    [User: 1.756 s, System: 0.044 s]

Summary
  ./dav1d_opt ran 1.06 ± 0.14 times faster than ./dav1d_base
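
As a cross-check on these numbers, the summary line can be reproduced from the two means and standard deviations. The sketch below assumes first-order error propagation for a ratio of independent means, which matches the printed `1.06 ± 0.14`; the function names are illustrative.

```rust
/// Ratio of two means with first-order propagated uncertainty.
/// Each argument is `(mean, standard deviation)` in seconds.
pub fn speedup(base: (f64, f64), opt: (f64, f64)) -> (f64, f64) {
    let (mean_base, sd_base) = base;
    let (mean_opt, sd_opt) = opt;
    let ratio = mean_base / mean_opt;
    // Relative errors add in quadrature for a quotient.
    let rel = ((sd_base / mean_base).powi(2) + (sd_opt / mean_opt).powi(2)).sqrt();
    (ratio, ratio * rel)
}

/// Relative reduction from `before` to `after`, in percent.
pub fn percent_drop(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}
```

With the figures above, `speedup((2.011, 0.258), (1.894, 0.087))` gives roughly 1.06 ± 0.14, and `percent_drop(1.808, 1.756)` gives roughly 2.9. The uncertainty is larger than the measured gain, so more runs or a quieter machine would help confirm the effect.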

Request for Comments
I am relatively new to writing assembly and wanted to put this up as an RFC.

  1. Is there any historical context on why manual L2 prefetching is omitted here? (e.g., regressions on Zen architectures?)
  2. If this approach is viable, I would appreciate guidance on how to safely handle the page boundaries at the tail end of the macroblock to avoid TLB miss penalties, as I currently haven't implemented a tail loop for the edges.
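
To make the second question concrete, one possible shape for an edge guard is sketched below. It is an assumption-laden illustration, not an answer: the helper names are hypothetical, 4 KiB pages are assumed, and whether such a check pays for itself inside the hot loop is exactly what I am unsure about.

```rust
/// Byte offset of the row `ahead` rows below row `y`, or `None` when that
/// row would start outside a buffer of `len` bytes, in which case the
/// caller would simply skip the prefetch.
pub fn prefetch_offset(y: usize, ahead: usize, stride: usize, len: usize) -> Option<usize> {
    let off = y.checked_add(ahead)?.checked_mul(stride)?;
    (off < len).then_some(off)
}

/// True when byte addresses `a` and `b` fall in the same page,
/// assuming 4 KiB pages.
pub fn same_page(a: usize, b: usize) -> bool {
    a >> 12 == b >> 12
}
```

For a 256-byte buffer with a 64-byte stride, `prefetch_offset(1, 2, 64, 256)` yields `Some(192)`, while `prefetch_offset(3, 2, 64, 256)` yields `None` because row 5 would start past the end.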

Collaborator

kkysen left a comment

Hi! Thanks for the PR! This is certainly very interesting and promising. However, in rav1d, we haven't changed any of the assembly from dav1d, as we're not very familiar with assembly and we don't want to get out of sync with dav1d. I assume this isn't something that we can do in Rust (there is _mm_prefetch, but we can't easily interleave calls inside the asm loop), but if we can, and it applies to other assembly/SIMD variants as well, that would be a perfect solution.

Also, have you checked whether upstream [dav1d](https://code.videolan.org/videolan/dav1d) has done this yet? rav1d is somewhat behind on backporting the latest changes. If they haven't, I'm sure they would welcome this change, and we could backport it once they review and merge it. I think they could also better help with your questions about assembly.

And then finally, it seems some of the tests are failing. I'm not sure what's going wrong there, but I assume some error is happening in the new assembly? Is that the page boundary issue you mentioned, or something else?

@Toiabzahoor
Author

> Hi! Thanks for the PR! This is certainly very interesting and promising. However, in rav1d, we haven't changed any of the assembly from dav1d, as we're not very familiar with assembly and we don't want to get out of sync with dav1d. I assume this isn't something that we can do in Rust (there is _mm_prefetch, but we can't easily interleave calls inside the asm loop), but if we can, and it applies to other assembly/SIMD variants as well, that would be a perfect solution.
>
> Also, have you checked whether upstream [dav1d](https://code.videolan.org/videolan/dav1d) has done this yet? rav1d is somewhat behind on backporting the latest changes. If they haven't, I'm sure they would welcome this change, and we could backport it once they review and merge it. I think they could also better help with your questions about assembly.
>
> And then finally, it seems some of the tests are failing. I'm not sure what's going wrong there, but I assume some error is happening in the new assembly? Is that the page boundary issue you mentioned, or something else?

Thanks for the helpful feedback and for clarifying the project's approach to the assembly files. I completely understand wanting to stay in sync with the upstream repository to avoid maintenance overhead.

Regarding the suggestion to use Rust intrinsics, interleaving those calls directly into the loop isn't quite feasible. We would have to rewrite the entire block using Rust SIMD intrinsics to get the prefetch instructions placed exactly where they need to be inside the loop execution, which moves us away from keeping the original assembly intact.

Taking this directly to the upstream developers makes the most sense. I will close this pull request and submit an RFC over on the VideoLAN side to see what their assembly maintainers think of the approach.

As for the failing tests, I looked into the CI logs and they seem isolated to the macOS x86_64 runner. The tests abort with exit status 1 during the verification step while running under the memory and undefined-behavior sanitizers (ASAN/MSAN/UBSAN). Since this runs fine natively on Linux without instrumentation, it is highly likely that the sanitizers on macOS are catching a speculative out-of-bounds read from the prefetch offsets, or that a strict stack alignment quirk is triggering the failure. I will debug that specific sanitizer interaction before I submit it upstream.

Thanks again for your time and pointing me in the right direction!

Toiabzahoor closed this Apr 3, 2026
Collaborator

kkysen commented Apr 3, 2026

> Regarding the suggestion to use Rust intrinsics, interleaving those calls directly into the loop isn't quite feasible. We would have to rewrite the entire block using Rust SIMD intrinsics to get the prefetch instructions placed exactly where they need to be inside the loop execution, which moves us away from keeping the original assembly intact.

Yup, that's what I assumed. Rewriting the whole thing in Rust SIMD really would be nice (it can be far safer and avoid annoying Rust to asm calls that are not always well optimized), but that's a much bigger change.

Toiabzahoor reopened this Apr 3, 2026
@Toiabzahoor
Author

> > Regarding the suggestion to use Rust intrinsics, interleaving those calls directly into the loop isn't quite feasible. We would have to rewrite the entire block using Rust SIMD intrinsics to get the prefetch instructions placed exactly where they need to be inside the loop execution, which moves us away from keeping the original assembly intact.
>
> Yup, that's what I assumed. Rewriting the whole thing in Rust SIMD really would be nice (it can be far safer and avoid annoying Rust to asm calls that are not always well optimized), but that's a much bigger change.

I completely agree, having it in pure Rust SIMD would be the ideal end goal to eliminate that FFI overhead.
I've been working a lot with SIMD recently and I would actually be really interested in taking a crack at this. When I have some spare time, I might try porting one of the smaller mc_avx2 functions entirely into pure Rust core::arch intrinsics as a standalone proof-of-concept. If I can get the compiler to emit code that matches or beats the handwritten assembly performance, I'll open a new RFC to see if that's a direction the project would want to explore incrementally!
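
To give a feel for what such a proof-of-concept might look like, here is a minimal sketch using `core::arch` intrinsics with runtime feature detection. It is not one of the actual mc_avx2 functions: `avg`, `avg_avx2`, and `avg_scalar` are hypothetical names, and the kernel is a plain rounded average of two 8-bit rows chosen only because it is short.

```rust
/// Scalar reference: rounded average of two 8-bit rows.
pub fn avg_scalar(a: &[u8], b: &[u8], dst: &mut [u8]) {
    for ((d, &x), &y) in dst.iter_mut().zip(a).zip(b) {
        *d = ((x as u16 + y as u16 + 1) >> 1) as u8;
    }
}

/// AVX2 version: 32 pixels per iteration, scalar code for the tail.
/// Callers must ensure AVX2 is available and all slices have equal length.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn avg_avx2(a: &[u8], b: &[u8], dst: &mut [u8]) {
    use core::arch::x86_64::*;
    let n = dst.len();
    let mut i = 0;
    while i + 32 <= n {
        // SAFETY: `i + 32 <= n` and all three slices hold `n` bytes.
        unsafe {
            let va = _mm256_loadu_si256(a.as_ptr().add(i) as *const __m256i);
            let vb = _mm256_loadu_si256(b.as_ptr().add(i) as *const __m256i);
            // `_mm256_avg_epu8` computes (x + y + 1) >> 1 per unsigned byte.
            let avg = _mm256_avg_epu8(va, vb);
            _mm256_storeu_si256(dst.as_mut_ptr().add(i) as *mut __m256i, avg);
        }
        i += 32;
    }
    avg_scalar(&a[i..], &b[i..], &mut dst[i..]);
}

/// Safe entry point: picks the AVX2 path when the CPU supports it.
pub fn avg(a: &[u8], b: &[u8], dst: &mut [u8]) {
    assert!(a.len() == dst.len() && b.len() == dst.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was just detected and lengths were checked above.
            unsafe { avg_avx2(a, b, dst) };
            return;
        }
    }
    avg_scalar(a, b, dst);
}
```

Keeping a scalar reference next to the SIMD path makes it straightforward to check that both produce identical output before comparing performance against the handwritten assembly.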

Toiabzahoor closed this Apr 3, 2026
@thedataking
Collaborator

> Also, have you checked whether upstream dav1d has done this yet? rav1d is somewhat behind on backporting the latest changes. If they haven't, I'm sure they would welcome this change, and we could backport it once they review and merge it. I think they could also better help with your questions about assembly.

I think @kkysen is making an important point here. If this is something that could benefit dav1d too, the change would ideally be made upstream so that both projects benefit. The dav1d developers are assembly experts, so you'd likely get better feedback too.
