Intermittent performance drop at larger sizes

On x86 I'm sometimes seeing performance at the largest sizes drop far below what the usual memory-bottlenecked curve suggests. We're likely hitting [cache associativity issues](https://igoro.com/archive/gallery-of-processor-cache-effects/), TLB evictions, or some other hardware quirk caused by accesses at power-of-two strides. We should look into mitigating that better.
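
To illustrate the associativity hypothesis, here is a minimal sketch that counts how many cache sets a strided access pattern touches. The cache geometry (64-byte lines, 64 sets, as in a typical 32 KiB 8-way L1) is an assumption, not something measured on the affected machines, and `distinct_sets` is a hypothetical helper written for this illustration. It models set indexing only, not TLB behaviour or the real access pattern of our code.

```python
def distinct_sets(stride_bytes, n_accesses, line_bytes=64, n_sets=64):
    """Count the distinct cache sets hit by n_accesses strided accesses.

    Assumes a simple set-indexed cache: set = (address // line_bytes) % n_sets.
    The default geometry is an assumed 32 KiB, 8-way L1 with 64-byte lines.
    """
    return len({(i * stride_bytes // line_bytes) % n_sets
                for i in range(n_accesses)})


# Power-of-two stride: every access lands in the same set.
pow2 = distinct_sets(4096, 64)

# Same stride padded by one cache line: accesses spread over all sets.
padded = distinct_sets(4096 + 64, 64)

print(pow2, padded)  # prints "1 64"
```

Under these assumptions, 64 accesses at a 4096-byte stride compete for a single 8-way set, while padding the stride by one cache line spreads them across all 64 sets. If this is what is happening, padding or otherwise avoiding exact power-of-two strides would be one mitigation to try.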