On x86 I'm sometimes seeing performance for the largest sizes drop far more than the usual memory-bound curve would predict. We're likely hitting cache associativity conflicts, TLB evictions, or some other hardware quirk caused by accesses at power-of-two strides. We should look into mitigating that better.
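
To make the associativity hypothesis concrete, here is a rough sketch of how power-of-two strides can pile onto a single cache set. The cache geometry (32 KiB, 8-way, 64-byte lines) is an assumption chosen as a typical L1d configuration, not something measured on the affected machines. The helper `distinct_sets` is hypothetical, and the model ignores physical-versus-virtual indexing and any TLB effects.

```python
def distinct_sets(stride_bytes, n_accesses,
                  cache_bytes=32 * 1024, ways=8, line_bytes=64):
    """Count how many distinct cache sets a strided access pattern touches.

    Hypothetical helper: the default geometry is an assumed, typical L1d
    layout. The set index is modeled as (address // line_bytes) % num_sets.
    """
    num_sets = cache_bytes // (ways * line_bytes)  # 64 sets for the defaults
    sets_touched = {
        (i * stride_bytes // line_bytes) % num_sets
        for i in range(n_accesses)
    }
    return len(sets_touched)


if __name__ == "__main__":
    # A 4096-byte stride versus the same stride padded by one cache line.
    for stride in (4096, 4096 + 64):
        print(stride, distinct_sets(stride, 64))
```

Under these assumptions a 4096-byte stride sends every access to the same set, so only 8 lines (one per way) can stay resident, while padding the stride by one cache line spreads the same accesses over all sets. Padding is only one possible mitigation and may not be the right one here; the sketch is meant to help test the hypothesis, not to settle it.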