[AIMIGRAPHX-885] Add Relaxed Check for Concat fusions #4724
TedThemistokleous wants to merge 1 commit into develop
…copies to allow for pointwise ops to be fused along with the noops and concats
It seems like there is a better operator to use here. The concat kernel is better optimized for reading different tensors, not slices from the same tensor. If this comes from gather, then why can't we combine this into a larger gather operator?
    }
}
- if(num_noops > std::max(size_t{1}, concat_ins->inputs().size() / 4))
+ if((num_noops > std::max(size_t{1}, concat_ins->inputs().size() / 4) and (concat_ins->inputs().size() < 100)))
I don't think this is a good heuristic in general. The noop copy can be eliminated completely in eliminate_concat. It doesn't work for your case because the inputs are slices.
Okay, so it sounds like a better approach is to do this entire thing in eliminate_concat.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## develop #4724 +/- ##
========================================
Coverage 92.29% 92.29%
========================================
Files 580 580
Lines 28687 28688 +1
========================================
+ Hits 26474 26475 +1
Misses 2213 2213
Cost/overhead. Gathers get expensive and tend to scale poorly in this model. Using this change eliminated a bunch of overhead by fusing the preceding ops into the concat, even if they appear to be noops.
Motivation
Modified fuse_concats to allow larger inputs to be fused by relaxing the overhead check, which will result in a concat with many noop inputs. Since we're doing a read in most of these cases, these operations should be zero-copy.
Technical Details
In some cases for prediction models, we concat many inputs. Our concat fusion misses these because a large number of the inputs are noops.
Adding the additional short circuit here allows us to still perform a concat fusion for very large concats with 100+ inputs. This is relevant when many inputs come from slice operations that read from gather table embeddings.
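As a rough illustration of the relaxed check, here is a minimal standalone sketch. It assumes the branch guarded by the condition in the diff skips the fusion, which is what the description above implies. The helper name `skip_concat_fusion` and its parameters are hypothetical and are not part of MIGraphX; only the condition itself is taken from the diff.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (not a MIGraphX API): returns true when the concat
// fusion would be skipped, assuming the guarded branch bails out.
//   num_noops  - number of concat inputs that would be plain noop copies
//   num_inputs - stands in for concat_ins->inputs().size()
constexpr bool skip_concat_fusion(std::size_t num_noops, std::size_t num_inputs)
{
    // Original rule: skip when more than a quarter of the inputs
    // (and more than one input) are noops.
    // Relaxed rule: apply that limit only to concats with fewer than 100 inputs.
    return num_noops > std::max(std::size_t{1}, num_inputs / 4) and num_inputs < 100;
}
```

Under this sketch, a 40-input concat with 30 noop inputs is still skipped, while a 120-input concat with 110 noop inputs is now fused, because the size test short-circuits the noop limit.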
Changelog Category
Add a CHANGELOG.md entry for any option other than Not Applicable