Optimize large K mat mult "skinny matrix" #41

@haroonsyed

Description

I choose the number of blocks based only on the outer dimensions (M and N). When K is large and the outer dimensions are small, very few blocks are launched and each one has to walk the entire K dimension, so performance crawls.

Example: 10x12544 dot 12544x32 takes 3900 µs, while the cuBLAS version takes 17.6 µs (roughly 220x slower). For other shapes I am within a factor of 2 of cuBLAS performance.
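
One common remedy for this shape is split-K: partition the K dimension into chunks, let separate blocks compute partial products, then sum the partials. The snippet below is only a CPU-side Python sketch of that idea, not the kernel in this repository. The tile size of 16 and the names `blocks_for_output`, `split_k_ranges`, and `split_k_matmul` are assumptions made up for illustration.

```python
import math

TILE = 16  # assumed tile edge; the real kernel's tile size is not stated in this issue


def blocks_for_output(m, n, tile=TILE):
    """Blocks launched when the grid depends only on the outer dims M and N."""
    return math.ceil(m / tile) * math.ceil(n / tile)


def split_k_ranges(k, splits):
    """Cut [0, k) into at most `splits` contiguous, nearly equal chunks."""
    step = math.ceil(k / splits)
    return [(start, min(start + step, k)) for start in range(0, k, step)]


def split_k_matmul(a, b, splits):
    """Compute a @ b by summing one partial product per K-chunk.

    On a GPU each chunk would map to its own group of blocks, and the
    partials would be combined with atomics or a second reduction pass.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for k0, k1 in split_k_ranges(k, splits):  # one "block group" per chunk
        for i in range(m):
            for j in range(n):
                partial = 0.0
                for p in range(k0, k1):
                    partial += a[i][p] * b[p][j]
                out[i][j] += partial  # reduction across chunks
    return out
```

Under the assumed 16x16 tile, `blocks_for_output(10, 32)` is 2, so two blocks would each loop over all 12544 values of K. Splitting K 64 ways yields chunks of 196 steps and 128 independent work items, which is the kind of extra parallelism a skinny shape needs. The right split count would have to be tuned against the cost of the final reduction.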
