Optimize large-K matrix multiplication ("skinny matrix" case)

I choose the number of blocks based on the outer dimensions only. When K is large and the outer dimensions are small, few blocks are launched, each one has to walk the whole K dimension, and performance is extremely poor.
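To make the problem concrete, here is a rough sketch of the block-count arithmetic when the grid is sized from the outer dimensions alone. The 16x16 tile size and the helper name `blocks_for_output` are assumptions for illustration, not the actual kernel configuration.

```python
import math

def blocks_for_output(m, n, tile_m=16, tile_n=16):
    """Number of thread blocks when one block owns one output tile.

    tile_m and tile_n are hypothetical tile sizes, not the real kernel's.
    """
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)

# Skinny case: 10x12544 dot 12544x32 has a 10x32 output.
skinny_blocks = blocks_for_output(10, 32)
# A squarer case with the same tile size, for comparison.
square_blocks = blocks_for_output(1024, 1024)
print(skinny_blocks, square_blocks)
```

Under these assumed tile sizes the skinny case gets only 2 blocks, each of which walks all 12544 elements of K serially, while the 1024x1024 case gets 4096 blocks.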

Example: 10x12544 dot 12544x32 takes 3900 us, while the cuBLAS version takes 17.6 us, roughly 220x faster. For other shapes I am within a factor of 2 of CUDA performance.
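One possible direction, not something described above, is to split the K dimension so that several blocks can share one output tile and their partial sums are added at the end (often called split-K). The pure-Python sketch below only illustrates the arithmetic; the function name `matmul_split_k` and the `splits` parameter are hypothetical, and a real kernel would still need a reduction or atomic adds to combine the partial results.

```python
def matmul_split_k(a, b, splits):
    """Multiply a (M x K) by b (K x N) by summing partial products over K-slices.

    Each K-slice is independent, so on a GPU each slice could be given to
    its own block; here the slices simply run in a loop.
    """
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // splits)  # ceiling division: slice length along K
    out = [[0] * n for _ in range(m)]
    for start in range(0, k, chunk):
        stop = min(start + chunk, k)
        # Accumulate the partial product over K-slice [start, stop).
        for i in range(m):
            for j in range(n):
                out[i][j] += sum(a[i][p] * b[p][j] for p in range(start, stop))
    return out

a = [[1, 2, 3, 4], [5, 6, 7, 8]]      # 2 x 4
b = [[1, 0], [0, 1], [1, 0], [0, 1]]  # 4 x 2
result = matmul_split_k(a, b, splits=2)
print(result)
```

With integer inputs the result is the same for any number of splits; with floating-point inputs the summation order changes, so small rounding differences are to be expected.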