Reimplement llama with fused MLIR operators + simplify operators #76
Open: andrej wants to merge 121 commits into devel from alt-simplifying-refactor
Changes from all commits: 121 commits
cf6485d andrej: rework profiling
af46d81 andrej: vibe-coded flame graph visualization
1f184eb andrej: plot updates
eedc527 andrej: simplified implementation (no KV cache yet)
5372fca andrej: add KV cache
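The "add KV cache" commit refers to the standard decode-time optimization: key/value projections of past tokens are cached so each decode step only computes attention inputs for the newest token. A minimal pure-Python sketch with illustrative names (not the PR's actual API):

```python
# Hypothetical KV cache sketch: store per-token key/value vectors so a
# decode step appends one entry instead of recomputing the whole history.
class KVCache:
    def __init__(self):
        self.keys = []    # one key vector per past token
        self.values = []  # one value vector per past token

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        # return the full history for this step's attention
        return self.keys, self.values

cache = KVCache()
cache.append([0.1, 0.2], [1.0, 0.0])
ks, vs = cache.append([0.3, 0.4], [0.0, 1.0])  # two cached entries now
```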
f0d3289 andrej: start with simplified llama for NPU
fd6f759 andrej: offload last layer GEMM/GEMV
ec0d372 andrej: add profile path analyzer
280917a andrej: change profiling to allow annotation using contexts; remove nn.Module
7b56ef8 andrej: refactoring started; RMSNorm offloaded, last layer GEMM started
1daaa39 andrej: last layer GEMM offloaded
83380c2 andrej: fixes:
3aa1e02 andrej: cleanup
002d487 andrej: RMSNorm offloaded everywhere, cleanup
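For reference, the RMSNorm operator these commits offload computes y_i = w_i * x_i / sqrt(mean(x^2) + eps). A pure-Python sketch of the math, not the repo's MLIR kernel:

```python
import math

# Illustrative RMSNorm: scale each element by the reciprocal root-mean-square
# of the vector, then by a learned per-channel weight w.
def rmsnorm(x, w, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [wi * xi / rms for wi, xi in zip(w, x)]

out = rmsnorm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```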
db9c4b8 andrej: less buffer copying
b0942e8 andrej: offload first residual
02e5307 andrej: simplify
e815c50 andrej: offload second residual
ebe8f32 andrej: SwiGLU offloaded
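The SwiGLU feed-forward block gates one linear projection with the SiLU of another: silu(x @ W_gate) * (x @ W_up), where silu(z) = z * sigmoid(z). A sketch of the elementwise gating step only, with illustrative inputs:

```python
import math

def silu(z):
    # silu(z) = z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def swiglu(gate, up):
    # elementwise gating: silu(gate) * up
    return [silu(g) * u for g, u in zip(gate, up)]

y = swiglu([1.0, -1.0], [2.0, 2.0])
```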
7515199 andrej: offload RoPE
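Rotary position embedding (RoPE) rotates each consecutive pair of query/key elements by an angle that grows with token position. An illustrative single-pair sketch of the math the offloaded operator would implement:

```python
import math

# Rotate one (x0, x1) pair by theta = pos * base^(-2*pair_idx/dim).
# Illustrative only; argument names are not the PR's actual API.
def rope_pair(x0, x1, pos, pair_idx, dim, base=10000.0):
    theta = pos * base ** (-2.0 * pair_idx / dim)
    c, s = math.cos(theta), math.sin(theta)
    return x0 * c - x1 * s, x0 * s + x1 * c

r0, r1 = rope_pair(1.0, 0.0, pos=2, pair_idx=0, dim=4)
```

Note the rotation preserves the pair's norm, which is why RoPE composes cleanly with the dot-product attention that follows.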
7ef6b03 andrej: offload attention query projection linear layer
d43eeea andrej: offload attention key projection linear layer -- slight decrease in o…
de2995c andrej: add batching to GEMV, fix issue when K<vector_size, offload attention…
74300bd andrej: add strided_copy operator
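A strided copy gathers every stride-th element from a source buffer into a contiguous destination, which is how non-contiguous rows or columns get moved without a full transpose. A hypothetical sketch of the semantics (the actual NPU operator's interface is not shown in this log):

```python
# Gather `count` elements from `src`, starting at `offset`, stepping by
# `stride`, into a contiguous list.
def strided_copy(src, offset, stride, count):
    return [src[offset + i * stride] for i in range(count)]

out = strided_copy(list(range(10)), offset=1, stride=3, count=3)  # [1, 4, 7]
```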
b1eab7c andrej: add patchable callable
119bb7e andrej: simplify llama_npu.py, make GEMV operator input shapes simpler
7fee60d andrej: fix strided copy; offload KV cache concat + transpose to NPU
b72432b andrej: offload repeat_interleave
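In grouped-query attention (GQA), repeat_interleave expands the smaller set of KV heads so each group of query heads can share one KV head. An illustrative pure-Python version of the op:

```python
# Repeat each KV head `repeats` times in place, mirroring
# torch.repeat_interleave along the head dimension.
def repeat_interleave(heads, repeats):
    out = []
    for h in heads:
        out.extend([h] * repeats)
    return out

kv_heads = ["kv0", "kv1"]
expanded = repeat_interleave(kv_heads, repeats=4)  # 2 KV heads serve 8 query heads
```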
8b4b2b3 andrej: offload normalization/scaling + softmax (with -inf masking on CPU for…
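The scale-plus-softmax step divides attention scores by sqrt(head_dim) and sets masked positions to -inf before normalizing (per the commit, the -inf masking stays on the CPU). A sketch of the math with illustrative inputs:

```python
import math

# Numerically stable masked softmax: masked entries become -inf, which
# exponentiates to exactly 0 probability.
def masked_softmax(scores, mask, head_dim):
    scaled = [s / math.sqrt(head_dim) if m else float("-inf")
              for s, m in zip(scores, mask)]
    mx = max(scaled)
    exps = [math.exp(s - mx) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_softmax([1.0, 2.0, 3.0], [True, True, False], head_dim=64)
```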
5575d4f andrej: make softmax run-time parametrizable
7ed760a andrej: Fix device manager singleton
e2cdc1d andrej: fix GEMV for large Bs
855bec3 andrej: rework and offload transpose operator in llama, attention weight * va…
0086168 andrej: commit forgotten repeat operator
ffca541 andrej: offload last GEMV in GQA, reorganize/simplify decode code
c7926e8 andrej: initial steps for automatically fusing operators
f166957 andrej: [WIP] compilation refactor
11d5802 andrej: autofuse update
f1a2ab9 andrej: refactor compilation
d80b726 andrej: towards full fused ELF + some more refactoring
f0cda24 andrej: fixes; requires XRT PR 9560 to be merged
d715e48 andrej: fixes
684a725 andrej: finally working
447983c andrej: fixes
e025ac7 andrej: optimize out reconfiguration
e3c0e64 andrej: fix some compilation issues
34fc4c5 andrej: make all llama operators take a kernel archive and func prefix arg
a197f2d andrej: txn-fused swiglu
e35ed7b andrej: bring up to speed after host runtime refactor
2994ba2 andrej: refactor symbol renaming to not clash with externally defined library…
25605be andrej: fuse first part of attention
e01a6f0 andrej: make it possible to slice buffers in fused txn specification; fused-t…
e9cbc00 andrej: discover patching locations automatically by use of magic values
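The magic-value technique in "discover patching locations automatically" is a common trick: emit a recognizable sentinel wherever a run-time buffer address belongs, then scan the compiled stream for the sentinel to learn which words to patch. A hypothetical sketch with illustrative constants:

```python
# Sentinel planted at compile time wherever a real address must go later.
MAGIC = 0xDEADBEEF

def find_patch_offsets(words, magic=MAGIC):
    # Scan the instruction stream for sentinel words.
    return [i for i, w in enumerate(words) if w == magic]

def patch(words, offsets, address):
    # Overwrite each discovered slot with the run-time buffer address.
    for i in offsets:
        words[i] = address
    return words

stream = [0x100, MAGIC, 0x200, MAGIC]
patched = patch(stream, find_patch_offsets(stream), 0x4000)
```

This avoids hand-maintaining patch-offset tables, at the cost of choosing a sentinel that cannot occur as a legitimate word.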
b7d2834 andrej: make ELFs patchable; offload strided-copy for KV cache
687cb2a andrej: fuse repeat_interleave and post attention residual onto other operato…
8b0aaeb andrej: fused attn score and scaling onto end - 2.5 TPS
361f10e andrej: fuse on softmax as well
834f33b andrej: transpose fused onto the end; 2.6 TPS
f345e8d andrej: fuse attention context gemv - 2.7 TPS
b46838f andrej: fuse attn output onto end - 2.7 TPS
99ef9fa andrej: fuse GQA + post attention - 2.5 TPS
8eaa9bc andrej: fuse rms norm onto beginning of transformer block -- full transformer…
165a93b andrej: [WRONG RESULTS] 16x-fused transformer block
86d7de8 andrej: remove unnecessary syncs, remove unused ops -- 4.4 TPS
77bac5a andrej: [decode end-to-end fused] offload last rms norm and last linear layer…
6211124 andrej: cleanup
2da438d andrej: remove old llama implementation
38e1b4c hunhoffe: naive attempt at porting operations
ba15525 hunhoffe: most of the operators running in local tests
d483d4f hunhoffe: the great reformatting
9f29a92 hunhoffe: a few more formatting fixes
404211e hunhoffe: some refactoring
12b1dc6 hunhoffe: some work with swiglu
1aa050e hunhoffe: add licenses
b5898f0 hunhoffe: Merge branch 'devel' into simplifying-refactor: Reconcile operator ab…
a8e07fc hunhoffe: Move remaining operators to iron/operators/
b8fb175 hunhoffe: try to minimize changes that are not central to the refactor
84bd103 hunhoffe: remove extra test file
b3aa4d4 hunhoffe: fixup imports
2b0f897 hunhoffe: fix another import path
b97f32e hunhoffe: Fix import issue
d06e8e0 hunhoffe: GEMM fixup
626f731 hunhoffe: fix rms norm w/ weights
f589df8 hunhoffe: fixup swiglu decode
6722a05 hunhoffe: small steps
b5fb899 hunhoffe: Fixup paths a bit.
bc74a22 hunhoffe: swiglu decode working locally
83b0986 hunhoffe: try to integrate tensor more
f9a2539 hunhoffe: try to fix llama keywords
bb10c51 hunhoffe: fix another arg
520934d hunhoffe: Increment IRON
1ce9fbb hunhoffe: cleanup pytest config a bit
1c8a007 andrej: fixes to llama
142b3ad andrej: remove unused operator
bdf3987 andrej: remove old llama impl
71cdfca andrej: add command line args to llama
ce13fe8 andrej: Revert "try to integrate tensor more"
1761d64 andrej: reenable KV cache
7a4b207 andrej: format
ff41674 andrej: fixes
ba12295 andrej: Merge branch 'devel' into alt-simplifying-refactor
0335b30 andrej: merge devel
fca9cc4 andrej: allow some outliers to fail SwigluPrefill output verification -- test…
e23c8f9 andrej: remove unused, untested code
efe8c61 andrej: remove dead code
938b093 andrej: fixes
06d13b7 andrej: add more output to help debug CI failure
d908891 andrej: more CI troubleshooting
0a4c23e andrej: more output for debugging
811ceb9 andrej: unbuffered output
a28e86d andrej: more debug output
bd70cf7 andrej: PYTHONUNBUFFERED
d321c87 andrej: no parallelism in aiecc.py
9731173 github-actions[bot]: Update mlir-aie to version v1.3.0
a23f14f andrej: Merge remote-tracking branch 'origin/update-mlir-aie-to-v1.3.0' into …
34d2590 andrej: format
a077d8e andrej: update compilation arguments after aiecc update
34136d9 andrej: remove debug output
b5e2adb andrej: add more debug output
c509661 andrej: pick up aiecc UUID fix from latest wheels
d8dd6ea andrej: re-enable llama tests for different sequence lengths; clean up debug …
Review comment: Should we make an issue for this TODO?