Using numpy.convolve together with caching should give a good speedup over the existing implementation. The API should not change, and the speedup should be quantified.