ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention
Wenjie Liu*,1, Hao Wu*,1, Xin Qiu1, Yingqi Fan1, Yihan Zhang1, Anhao Zhao1, Yunpu Ma2, Xiaoyu Shen†,1
1Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, Ningbo
2LMU Munich
* Equal Contribution, † Corresponding Author (xyshen@eitech.edu.cn)
If you find this work useful for your research and applications, please consider citing:
@misc{liu2026vicaefficientmultimodalllms,
title={ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention},
author={Wenjie Liu and Hao Wu and Xin Qiu and Yingqi Fan and Yihan Zhang and Anhao Zhao and Yunpu Ma and Xiaoyu Shen},
year={2026},
eprint={2602.07574},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.07574},
}
- [TODO] Code, checkpoints, and documentation are being prepared and will be released soon.
- [2026.02.07] The preprint is now published!
- Insights on MLLM Redundancy: Demonstrates that projected visual embeddings are already well-aligned with language space, and effective vision-language interaction occurs in only a small subset of Transformer layers, revealing substantial redundancy in dense visual processing.
- ViCA Architecture: Introduces Vision-only Cross-Attention (ViCA), a minimal MLLM design where visual tokens bypass all self-attention and feed-forward layers, interacting with text solely via sparse cross-attention at key layers for efficient multimodal fusion.
- Performance-Efficiency Trade-off: Maintains approximately 98% of baseline accuracy across three MLLM backbone models and nine multimodal benchmarks, while reducing visual-side computation to about 4% of the original, significantly outperforming 26 existing pruning methods in performance-efficiency trade-offs.
- Hardware-Friendly Acceleration: Achieves >3.5× speedup in single-batch inference and >10× speedup in multi-batch inference, compatible with FlashAttention.
- Orthogonal to Token Pruning: Compatible with token pruning methods for further gains, e.g., combining with PDrop in training-free inference reduces visual computation to 2% with over 96% performance retention.
- News: Latest updates, news, and announcements.
- Highlights: Core insights and key features highlighted in this work.
- Preparation: Environment setup and required dependencies.
- Usage: Instructions on how to run and use the code.
- License: License information for this repository.
- Acknowledgments: Credits to projects and contributors that inspired or supported this work.
- Contact: Contact information for questions, feedback, or collaboration.
- Related Projects: Research projects from our group (EIT-NLP) related to MLLM compression.
1. Set up LLaVA following https://github.com/haotian-liu/LLaVA
cd LLaVA
conda create -n llava-vica python=3.10 -y
conda activate llava-vica
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install transformers==4.36.2
2. Copy updated files to the llava library

`modeling_llama_mask.py` and `llava_llama_mask.py` implement theoretical pruning in eager-attention:
- Mask the corresponding attention weights in the attention block of each transformer layer.
- In FFN, extract text tokens from hidden states, feed them through FFN, and concatenate with visual tokens.
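The two modifications above can be pictured with a toy sketch (our illustration, not the actual patch; the tensor layout, mask direction, and omission of the causal mask are assumptions):

```python
import torch

def masked_eager_attention(q, k, v, n_vis, keep_vision):
    """Eager attention where, in pruned layers, text queries are blocked
    from attending to the first n_vis (visual) key positions.
    Toy version: causal masking omitted for brevity."""
    # q, k, v: (batch, heads, seq, head_dim); visual tokens come first
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if not keep_vision:
        # zero out text-to-vision attention weights via -inf before softmax
        scores[..., n_vis:, :n_vis] = float("-inf")
    return torch.softmax(scores, dim=-1) @ v

def ffn_on_text_only(hidden, ffn, n_vis):
    """Visual tokens bypass the FFN: only text tokens are transformed,
    then the two parts are concatenated back in order."""
    vis, text = hidden[:, :n_vis], hidden[:, n_vis:]
    return torch.cat([vis, ffn(text)], dim=1)
```

With the mask applied, a text query's output is identical to attending over the text keys alone, which is what makes the visual-side computation removable in the accelerated variant.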
cp ../modeling_llama_mask.py ./llava/model/modeling_llama_prune.py
cp ../llava_llama_mask.py ./llava/model/language_model/llava_llama.py

`modeling_llama_accel.py` and `llava_llama_accel.py` implement practical acceleration in eager-attention and flash-attention:
- Use only text tokens as the hidden state, while the visual tokens remain frozen. In a few layers, the visual tokens are used as KV pairs in the attention block.
cp ../modeling_llama_accel.py ./llava/model/modeling_llama_prune.py
cp ../llava_llama_accel.py ./llava/model/language_model/llava_llama.py

- Download the checkpoints from our Model Zoo.
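The accelerated layout can be sketched as follows, under the assumption that the frozen visual tokens are simply prepended to the key/value sequence at the preserved layers (module names are made up; this is an illustration, not the released implementation):

```python
import torch
import torch.nn as nn

class AccelLayer(nn.Module):
    """Toy decoder layer for the accelerated variant: the hidden state holds
    text tokens only; frozen visual tokens enter a few layers as extra KV."""
    def __init__(self, d: int, use_vision: bool):
        super().__init__()
        self.use_vision = use_vision
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        if self.use_vision:
            # preserved layer: frozen visual tokens join the KV sequence only
            kv = torch.cat([vision, text], dim=1)
        else:
            kv = text  # pruned layer: pure text self-attention
        text = text + self.attn(text, kv, kv)[0]
        return text + self.ffn(text)  # FFN runs on text tokens only

vision = torch.randn(1, 576, 64)  # projected visual embeddings, frozen throughout
text = torch.randn(1, 16, 64)
layers = [AccelLayer(64, use_vision=(i in {0, 1, 7})) for i in range(8)]
for layer in layers:
    text = layer(text, vision)  # vision is never updated by any layer
```

Because the visual tokens never pass through self-attention or the FFN, their cost reduces to a few cross-attention KV reads, which is also why the layout stays compatible with FlashAttention.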
- We evaluate our model on the following 9 widely-used multimodal benchmarks to provide a comprehensive assessment across perception, reasoning, hallucination, and specialized capabilities:
- MME
- GQA
- MMBench
- MMBench_CN
- POPE
- SEED-I (image subset of SEED-Bench)
- SQA-I (image subset of ScienceQA)
- TextVQA
- VQAv2
T2V_LAYERS="[0,1,7,8,9,10,11,14]" bash scripts/v1_5/eval/mme.sh
...
T2V_LAYERS="[0,1,7,8,9,10,11,14]" bash scripts/v1_5/eval/textvqa.sh

For our experiments, we primarily use the LLaVA-1.5 training dataset, which can be prepared following the official guidelines.
We provide support for three LLaVA-1.5 model scales (3B, 7B, and 13B).
Our training approach consists of two stages: pretraining and fine-tuning. The training process is launched via the following shell scripts:
T2V_LAYERS="[0,1,7,8,9,10,11,14]" bash scripts/v1_5/pretrain.sh
T2V_LAYERS="[0,1,7,8,9,10,11,14]" bash scripts/v1_5/finetune.sh

`T2V_LAYERS`: Controls which transformer layers in the LLM apply text-to-vision cross-attention. Only the specified layers perform cross-attention between text and visual tokens; all remaining layers process only text tokens.
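One plausible way a launch script can turn this environment variable into a layer set (hypothetical; the released scripts may parse it differently):

```python
import ast
import os

def parse_t2v_layers(value: str) -> set[int]:
    """Parse a string like "[0,1,7,8,9,10,11,14]" into a set of layer indices."""
    return set(ast.literal_eval(value))

# Layers listed in T2V_LAYERS keep text-to-vision cross-attention;
# all other decoder layers process text tokens only.
t2v_layers = parse_t2v_layers(os.environ.get("T2V_LAYERS", "[]"))
```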
We train model variants using the theoretical-pruning implementation in `modeling_llama_mask.py` and `llava_llama_mask.py`.
Preserved text-to-vision cross-attention layers in LLaVA-1.5 models:
- LLaVA-1.5-3B: {0, 1, 14, 15, 18, 19, 21, 22, 23}
- LLaVA-1.5-7B: {0, 1, 7, 8, 9, 10, 11, 14}
- LLaVA-1.5-13B: {0, 6, 8, 9, 10, 13, 14, 16}
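As a quick sanity check on the sparsity, the preserved sets above cover only a small fraction of the decoder stack. The layer totals below are the standard Vicuna-7B/13B depths, which is an assumption on our part, not a number taken from this repository:

```python
# Preserved text-to-vision cross-attention layers, copied from the lists above.
T2V_LAYERS = {
    "llava-1.5-7b": [0, 1, 7, 8, 9, 10, 11, 14],
    "llava-1.5-13b": [0, 6, 8, 9, 10, 13, 14, 16],
}
# Decoder depths of the Vicuna backbones (assumed standard LLaMA sizes).
TOTAL_LAYERS = {"llava-1.5-7b": 32, "llava-1.5-13b": 40}

for name, layers in T2V_LAYERS.items():
    frac = len(layers) / TOTAL_LAYERS[name]
    print(f"{name}: {len(layers)}/{TOTAL_LAYERS[name]} cross-attention layers ({frac:.0%})")
```

So the 7B model keeps vision interaction in a quarter of its layers and the 13B model in a fifth, consistent with the claim that effective vision-language interaction is concentrated in a small subset of layers.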
This project is released under the Apache 2.0 license.
- Thanks to the LLaVA, FastV, and PyramidDrop libraries, which helped us quickly implement our ideas.
For questions, suggestions, or collaboration opportunities, please feel free to reach out:
- Wenjie Liu: wenjay_leo@outlook.com
- Hao Wu: haowu.ai.research@gmail.com
- Xiaoyu Shen: xyshen@eitech.edu.cn
- Survey
- Vision Encoder
- MLLM
