[Paper Link] This repository contains the official code for the paper "IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference" (MLSys 2026). It provides the complete pipeline to reproduce the speedup and accuracy results reported in the paper.
This repository provides the code to validate the two core claims of our paper:
- Claim 1 (Accuracy): Our fully integer pipeline (IntAttention) maintains high fidelity relative to the FP16/baseline accuracy across various LLMs and Vision Transformers. (Corresponds to Table 1 and Table 2 in the paper).
- Claim 2 (Latency/Speed): The proposed IntAttention pipeline achieves significant latency reduction on ARM CPUs compared to FP32, FP16, and INT8 quantization baselines. (Corresponds to Figure 6 and Figure 7 in the paper).
- `pysimulation/`: PyTorch-based simulation of IntAttention and the baseline methods.
- `*.patch`: Patches that insert the IndexSoftmax implementation into the ARM ComputeLibrary and DeiT codebases.
- `acc_llm.py`: Evaluation script for Large Language Models.
- `bench_speed.cpp`: C++ benchmark for measuring latency on Armv8 CPUs.
Due to the fundamental differences in hardware execution, our evaluation is split into two environments:
- Hardware: An Armv8.6-a architecture device. The build script targets general Arm embedded chips and is tuned for Apple M-series chips (e.g., M2/M3/M4).
- Software: `scons`, `clang`.
- Hardware: A CUDA-compatible GPU with at least 10 GB of VRAM. (Note: because native INT32 matrix multiplication on GPUs is limited, our PyTorch simulation uses FP64 to guarantee exact INT32 emulation for accuracy validation.)
- Software: Python 3.10+, `pip`.
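Why FP64 suffices for exact INT32 emulation: a float64 mantissa has 53 bits, so every int32 value (and every intermediate sum of INT8 products in an attention matmul) is represented exactly, whereas float32's 24-bit mantissa is not wide enough in general. A minimal NumPy sketch (shapes and values are illustrative, not from the paper):

```python
import numpy as np

# Every int32 fits exactly in a float64 mantissa (53 bits), so accumulating
# INT8 products in float64 matches true integer arithmetic bit-for-bit.
rng = np.random.default_rng(0)
q = rng.integers(-128, 128, size=(64, 128), dtype=np.int8)  # toy Q, int8
k = rng.integers(-128, 128, size=(64, 128), dtype=np.int8)  # toy K, int8

exact = q.astype(np.int64) @ k.T.astype(np.int64)            # integer reference
via_f64 = (q.astype(np.float64) @ k.T.astype(np.float64)).astype(np.int64)

assert np.array_equal(exact, via_f64)  # FP64 reproduces the INT32 result exactly
```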
Clone the repository and its submodules (ARM ComputeLibrary and DeiT):
```shell
git clone --recursive https://github.com/WanliZhong/IntAttention
cd IntAttention
```

Check out the required version of ComputeLibrary and apply the IndexSoftmax patch:
```shell
cd ComputeLibrary
git checkout v52.5.0
git apply ../add_impl_for_ACL.patch
```

Compile the ComputeLibrary:
```shell
scons -j"$(sysctl -n hw.ncpu)" \
    os=macos arch=armv8.6-a \
    neon=1 opencl=0 embed_kernels=0 logging=0 \
    Werror=0 debug=0 asserts=0 examples=0 \
    extra_cxx_flags="-mcpu=apple-m2"
```

Compile the C++ benchmark script:
```shell
cd ..
INCDIR="./ComputeLibrary"
LIBDIR="./ComputeLibrary/build"
clang++ bench_speed.cpp -O3 -std=c++17 -arch arm64 \
    -I "$INCDIR/include" -I "$INCDIR" \
    "$LIBDIR/libarm_compute-static.a" \
    "$LIBDIR/libarm_compute_graph-static.a" \
    -o bench_speed -lpthread -ldl
```

Run the speed tests. The benchmark supports 4 pipelines (`--pipe 0` to `3`):
- `0` (Pure FP32): QK (FP32) -> Softmax (FP32) -> PV (FP32)
- `1` (FP16): QK (F16) -> Cast (F32) -> Softmax (F32) -> Cast (F16) -> PV (F16)
- `2` (Quantized INT8): S8 QK -> S32 -> FP32 Softmax -> S8 PV -> S32 -> F16
- `3` (IntAttention): S8 QK -> S32 -> IndexSoftmax (U8) -> U8xS8 PV -> S32 -> F16
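The dataflow of pipeline 3 can be sketched in NumPy. The real IndexSoftmax lives in the ComputeLibrary patch; the `index_softmax_u8` stand-in below (a plain softmax rounded to U8) and the unit quantization scales are assumptions for illustration only, not the paper's exact algorithm:

```python
import numpy as np

def index_softmax_u8(s32):
    # Stand-in for IndexSoftmax: shift scores in integer arithmetic,
    # then produce U8 probabilities (assumed output convention: /255 scale).
    s = s32 - s32.max(axis=-1, keepdims=True)
    p = np.exp(s.astype(np.float64))
    p /= p.sum(axis=-1, keepdims=True)
    return np.round(p * 255).astype(np.uint8)

def pipeline3(q_s8, k_s8, v_s8):
    # S8 QK -> S32 accumulators
    s32 = q_s8.astype(np.int32) @ k_s8.T.astype(np.int32)
    # IndexSoftmax -> U8 probabilities
    p_u8 = index_softmax_u8(s32)
    # U8 x S8 PV -> S32 accumulators (sizes here keep int32 from overflowing)
    acc = p_u8.astype(np.int32) @ v_s8.astype(np.int32)
    # Undo the /255 probability scale and cast to F16 (quant scales omitted)
    return (acc.astype(np.float32) / 255.0).astype(np.float16)
```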
Example Command (IntAttention vs FP16):
```shell
# Run IntAttention
./bench_speed --pipe 3 --L 1024 --d 128 --warmup 10 --runs 100

# Run FP16 Baseline for comparison
./bench_speed --pipe 1 --L 1024 --d 128 --warmup 10 --runs 100
```

(Here `--L` is the sequence length and `--d` is the head dimension.)
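To compare pipelines across several sequence lengths, the two commands above can be driven from a small Python helper. This is a hypothetical convenience wrapper: it assumes only the flags documented above (`--pipe`, `--L`, `--d`, `--warmup`, `--runs`) and the `./bench_speed` binary from the build step.

```python
import subprocess

def bench_cmd(pipe, L, d=128, warmup=10, runs=100, binary="./bench_speed"):
    # Build the argv list for one benchmark invocation.
    return [binary, "--pipe", str(pipe), "--L", str(L), "--d", str(d),
            "--warmup", str(warmup), "--runs", str(runs)]

def sweep(pipes=(1, 3), lengths=(256, 512, 1024, 2048)):
    # Run the FP16 baseline (1) and IntAttention (3) at each sequence length.
    for L in lengths:
        for pipe in pipes:
            subprocess.run(bench_cmd(pipe, L), check=True)
```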
```shell
pip install torch==2.8.0 torchvision==0.23.0 transformers==4.57.0 timm==1.0.19 lm_eval==0.4.9.2
```

We support evaluating `meta-llama/Llama-3.2-1B`, `facebook/opt-1.3b`, and `Qwen/Qwen3-1.7B` on standard zero-shot tasks.
Example Command (Llama-3.2-1B with IntAttention):
```shell
python acc_llm.py --model-name meta-llama/Llama-3.2-1B \
    --method int_attention \
    --tasks wikitext hellaswag lambada_openai piqa winogrande arc_challenge arc_easy
```

Ensure you have the ImageNet-1k validation dataset downloaded. We support models including `deit_base_patch16_224`, `vit_large_patch16_384`, and `cait_large_patch16_448`.
Example Command (DeiT-B-224 with IntAttention):
```shell
PYTHONPATH=./ python deit/main.py --device=cuda \
    --no-train-mode --eval \
    --model deit_base_patch16_224 \
    --data-path path/to/imagenet-object-localization-challenge/ILSVRC/Data/CLS-LOC/ \
    --method int_attention
```