Rewrote flash attention to use BF16, transposed k and v, rewrote the task distribution, increased parallelism on decode, and doubled the registers used for the core of flash attention. #835
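
For reference, here is a minimal, hedged sketch of the flash-attention-style streaming softmax for a single query row, with k and v stored in BF16 and k kept in a transposed `[head_dim][n_kv]` layout. This is a scalar host-side illustration of the technique only; the function names, layouts, and parameters are illustrative assumptions, not this PR's actual kernel.

```cpp
// Illustrative sketch (not the PR's kernel): one query row of flash attention
// with BF16 storage for k/v and FP32 accumulation.
#include <cstdint>
#include <cstring>
#include <cmath>
#include <vector>

using bf16 = uint16_t;  // raw BF16 bits (upper half of an IEEE float32)

static inline float bf16_to_f32(bf16 h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

static inline bf16 f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<bf16>(bits >> 16);  // truncation; real kernels usually round
}

// q   : [head_dim]            FP32 query row
// k_t : [head_dim][n_kv]      BF16, transposed so each dim is contiguous over keys
// v   : [n_kv][head_dim]      BF16
// out : [head_dim]            FP32 attention output for this query
void flash_attn_row(const float* q, const bf16* k_t, const bf16* v,
                    float* out, int n_kv, int head_dim, float scale) {
    float m = -INFINITY;                  // running max of the scores
    float l = 0.0f;                       // running sum of exp(score - m)
    std::vector<float> acc(head_dim, 0.0f);

    for (int j = 0; j < n_kv; ++j) {
        // score = scale * dot(q, k[j]), reading k from the transposed layout
        float s = 0.0f;
        for (int d = 0; d < head_dim; ++d)
            s += q[d] * bf16_to_f32(k_t[d * n_kv + j]);
        s *= scale;

        // online softmax: rescale the previous accumulator when the max changes
        float m_new = std::fmax(m, s);
        float correction = std::exp(m - m_new);
        float p = std::exp(s - m_new);
        l = l * correction + p;
        for (int d = 0; d < head_dim; ++d)
            acc[d] = acc[d] * correction + p * bf16_to_f32(v[j * head_dim + d]);
        m = m_new;
    }
    for (int d = 0; d < head_dim; ++d)
        out[d] = acc[d] / l;
}
```

The scalar loop above does not itself exploit the transposed k layout, but that layout is what lets a vectorized or GPU kernel read a contiguous run of keys for each dimension while keeping the FP32 running max, sum, and accumulator in registers.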
test_868146247 was force-pushed and no longer has any new commits.
Pushing new commits will allow the pull request to be re-opened.