
[AIMIGRAPHX-885] Add Relaxed Check for Concat fusions #4724

Draft
TedThemistokleous wants to merge 1 commit into develop from large_embedding_concats

Conversation

@TedThemistokleous
Collaborator

Motivation

Modified fuse_concats to allow larger inputs to be fused by relaxing the overhead check, which will result in a concat with many noop inputs. Since we're doing a read in most of these cases, these operations should be zero copy.

Technical Details

In some cases for prediction models we concat many inputs. Our concat fusion misses these because there is a large number of noop inputs.

Adding the additional short circuit here allows us to still perform a concat fusion for very large concats (100+ inputs). This is relevant when many inputs come from slice operations reading from gather table embeddings.

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

…copies to allow for point wise ops to be fused along with the noops and concats
@TedThemistokleous TedThemistokleous self-assigned this Mar 31, 2026
@TedThemistokleous TedThemistokleous added simple small or simple changes Perf Improve labels Mar 31, 2026
@TedThemistokleous TedThemistokleous changed the title AIMIGRAPHX-885 Add Relaxed Check for Concat fusions [AIMIGRAPHX-885] Add Relaxed Check for Concat fusions Mar 31, 2026
@pfultz2
Collaborator

pfultz2 commented Mar 31, 2026

this is relevant when many inputs are from slice operations when reading from gather table embeddings.

It seems like there is a better operator to be used here. The concat kernel is better optimized for reading different tensors, not slices from the same tensor.

If this comes from gather, then why can't we combine this into a larger gather operator?

```diff
 }
 }
-if(num_noops > std::max(size_t{1}, concat_ins->inputs().size() / 4))
+if((num_noops > std::max(size_t{1}, concat_ins->inputs().size() / 4) and (concat_ins->inputs().size() < 100)))
```
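The relaxed condition can be exercised in isolation. The following is a minimal sketch, assuming fusion is skipped when the predicate returns true; `skip_fusion_old`, `skip_fusion_relaxed`, and the input-count threshold of 100 mirror the condition in the diff, but the function names are hypothetical and not part of the MIGraphX API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Original check: bail out of fusion when more than a quarter of the
// concat inputs (and more than one input) are noops.
bool skip_fusion_old(std::size_t num_noops, std::size_t num_inputs)
{
    return num_noops > std::max(std::size_t{1}, num_inputs / 4);
}

// Relaxed check: only apply the bail-out to concats with fewer than
// 100 inputs, so very wide concats (e.g. many slices of an embedding
// table) are still fused despite having many noop inputs.
bool skip_fusion_relaxed(std::size_t num_noops, std::size_t num_inputs)
{
    return skip_fusion_old(num_noops, num_inputs) and num_inputs < 100;
}
```

With this change, a 120-input concat where 50 inputs are noops would previously have been rejected by the heuristic but is now eligible for fusion, while the behavior for narrow concats is unchanged.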
Collaborator

I don't think this is a good heuristic in general. The noop copy can be eliminated completely in eliminate_concat. It doesn't work for your case because the inputs are slices.

Collaborator Author

Okay, so it sounds like a better approach is to do this entire thing in eliminate_concat.

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #4724   +/-   ##
========================================
  Coverage    92.29%   92.29%           
========================================
  Files          580      580           
  Lines        28687    28688    +1     
========================================
+ Hits         26474    26475    +1     
  Misses        2213     2213           
Files with missing lines | Coverage Δ
src/fuse_concat.cpp | 96.67% <100.00%> (ø)

... and 1 file with indirect coverage changes


@TedThemistokleous
Collaborator Author

this is relevant when many inputs are from slice operations when reading from gather table embeddings.

It seems like there is a better operator to be used here. The concat kernel is better optimized for reading different tensors, not slices from the same tensor.

If this comes from gather, then why can't we combine this into a larger gather operator?

Cost/overhead. Gathers get expensive and tend to scale poorly in this model. Using this eliminated a bunch of overhead by fusing the preceding ops into the concat, even if they do appear to be noops.


Labels

Perf Improve · simple (small or simple changes)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants