ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention
Wenjie Liu*,1, Hao Wu*,1, Xin Qiu1, Yingqi Fan1, Yihan Zhang1, Anhao Zhao1, Yunpu Ma2, Xiaoyu Shen†,1
1Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, Ningbo
2LMU Munich
* Equal Contribution, † Corresponding Author (xyshen@eitech.edu.cn)
If you find this work useful for your research and applications, please consider citing:
@misc{liu2026vicaefficientmultimodalllms,
title={ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention},
author={Wenjie Liu and Hao Wu and Xin Qiu and Yingqi Fan and Yihan Zhang and Anhao Zhao and Yunpu Ma and Xiaoyu Shen},
year={2026},
eprint={2602.07574},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.07574},
}
- [TODO] Code, checkpoints, and documentation are being prepared and will be released soon.
- [2026.02.07] The preprint is now published!
- Insights on MLLM Redundancy: Demonstrates that projected visual embeddings are already well-aligned with language space, and effective vision-language interaction occurs in only a small subset of Transformer layers, revealing substantial redundancy in dense visual processing.
- ViCA Architecture: Introduces Vision-only Cross-Attention (ViCA), a minimal MLLM design where visual tokens bypass all self-attention and feed-forward layers, interacting with text solely via sparse cross-attention at key layers for efficient multimodal fusion.
- Performance-Efficiency Trade-off: Maintains approximately 98% of baseline accuracy across three MLLM backbone models and nine multimodal benchmarks, while reducing visual-side computation to about 4% of the original, significantly outperforming 26 existing pruning methods in performance-efficiency trade-offs.
- Hardware-Friendly Acceleration: Achieves >3.5× speedup in single-batch inference and >10× speedup in multi-batch inference, compatible with FlashAttention.
- Orthogonal to Token Pruning: Compatible with token pruning methods for further gains, e.g., combining with PDrop in training-free inference reduces visual computation to 2% with over 96% performance retention.
- News: Latest updates, news, and announcements.
- Highlights: Core insights and key features highlighted in this work.
- Preparation: Environment setup and required dependencies.
- Usage: Instructions on how to run and use the code.
- License: License information for this repository.
- Acknowledgments: Credits to projects and contributors that inspired or supported this work.
- Contact: Contact information for questions, feedback, or collaboration.
- Related Projects: Research projects from our group (EIT-NLP) related to MLLM compression.
1. Set up LLaVA following https://github.com/haotian-liu/LLaVA
cd LLaVA
conda create -n llava-vica python=3.10 -y
conda activate llava-vica
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install transformers==4.36.2
2. Copy updated files to the llava library

`modeling_llama_mask.py` and `llava_llama_mask.py` implement theoretical pruning in eager-attention:
- Mask the corresponding attention weights in the attention block of each transformer layer.
- In FFN, extract text tokens from hidden states, feed them through FFN, and concatenate with visual tokens.
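The two modifications above can be pictured with a toy sketch (our illustration, not the actual patch; the tensor layout, mask direction, and omission of the causal mask are assumptions):

```python
import torch

def masked_eager_attention(q, k, v, n_vis, keep_vision):
    """Eager attention where, in pruned layers, text queries are blocked
    from attending to the first n_vis (visual) key positions.
    Toy version: causal masking omitted for brevity."""
    # q, k, v: (batch, heads, seq, head_dim); visual tokens come first
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if not keep_vision:
        # zero out text-to-vision attention weights via -inf before softmax
        scores[..., n_vis:, :n_vis] = float("-inf")
    return torch.softmax(scores, dim=-1) @ v

def ffn_on_text_only(hidden, ffn, n_vis):
    """Visual tokens bypass the FFN: only text tokens are transformed,
    then the two parts are concatenated back in order."""
    vis, text = hidden[:, :n_vis], hidden[:, n_vis:]
    return torch.cat([vis, ffn(text)], dim=1)
```

With the mask applied, a text query's output is identical to attending over the text keys alone, which is what makes the visual-side computation removable in the accelerated variant.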
cp ../modeling_llama_mask.py ./llava/model/modeling_llama_prune.py
cp ../llava_llama_mask.py ./llava/model/language_model/llava_llama.py

`modeling_llama_accel.py` and `llava_llama_accel.py` implement practical acceleration in eager-attention and flash-attention:
- Use only text tokens as the hidden state, while the visual tokens remain frozen. In a few layers, the visual tokens are used as KV pairs in the attention block.
cp ../modeling_llama_accel.py ./llava/model/modeling_llama_prune.py
cp ../llava_llama_accel.py ./llava/model/language_model/llava_llama.py

- Download the checkpoints from our Model Zoo.
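The accelerated layout can be sketched as follows, under the assumption that the frozen visual tokens are simply prepended to the key/value sequence at the preserved layers (module names are made up; this is an illustration, not the released implementation):

```python
import torch
import torch.nn as nn

class AccelLayer(nn.Module):
    """Toy decoder layer for the accelerated variant: the hidden state holds
    text tokens only; frozen visual tokens enter a few layers as extra KV."""
    def __init__(self, d: int, use_vision: bool):
        super().__init__()
        self.use_vision = use_vision
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        if self.use_vision:
            # preserved layer: frozen visual tokens join the KV sequence only
            kv = torch.cat([vision, text], dim=1)
        else:
            kv = text  # pruned layer: pure text self-attention
        text = text + self.attn(text, kv, kv)[0]
        return text + self.ffn(text)  # FFN runs on text tokens only

vision = torch.randn(1, 576, 64)  # projected visual embeddings, frozen throughout
text = torch.randn(1, 16, 64)
layers = [AccelLayer(64, use_vision=(i in {0, 1, 7})) for i in range(8)]
for layer in layers:
    text = layer(text, vision)  # vision is never updated by any layer
```

Because the visual tokens never pass through self-attention or the FFN, their cost reduces to a few cross-attention KV reads, which is also why the layout stays compatible with FlashAttention.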
- We evaluate our model on the following 9 widely-used multimodal benchmarks to provide a comprehensive assessment across perception, reasoning, hallucination, and specialized capabilities:
- MME
- GQA
- MMBench
- MMBench_CN
- POPE
- SEED-I (image subset of SEED-Bench)
- SQA-I (image subset of ScienceQA)
- TextVQA
- VQAv2
T2V_LAYERS="[0,1,7,8,9,10,11,14]" bash scripts/v1_5/eval/mme.sh
...
T2V_LAYERS="[0,1,7,8,9,10,11,14]" bash scripts/v1_5/eval/textvqa.sh

For our experiments, we primarily use the LLaVA-1.5 training dataset, which can be prepared following the official guidelines.
We provide support for three LLaVA-1.5 model scales (3B, 7B, and 13B).
Our training approach consists of two stages: pretraining and fine-tuning. The training process is launched via the following shell scripts:
T2V_LAYERS="[0,1,7,8,9,10,11,14]" bash scripts/v1_5/pretrain.sh
T2V_LAYERS="[0,1,7,8,9,10,11,14]" bash scripts/v1_5/finetune.sh

`T2V_LAYERS`: Controls which transformer layers in the LLM apply text-to-vision cross-attention. Only the specified layers perform cross-attention between text and visual tokens; all remaining layers process only text tokens.
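One plausible way a launch script can turn this environment variable into a layer set (hypothetical; the released scripts may parse it differently):

```python
import ast
import os

def parse_t2v_layers(value: str) -> set[int]:
    """Parse a string like "[0,1,7,8,9,10,11,14]" into a set of layer indices."""
    return set(ast.literal_eval(value))

# Layers listed in T2V_LAYERS keep text-to-vision cross-attention;
# all other decoder layers process text tokens only.
t2v_layers = parse_t2v_layers(os.environ.get("T2V_LAYERS", "[]"))
```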
We train model variants using the theoretical-pruning implementation in `modeling_llama_mask.py` and `llava_llama_mask.py`.
Preserved text-to-vision cross-attention layers in LLaVA-1.5 models:
- LLaVA-1.5-3B: {0, 1, 14, 15, 18, 19, 21, 22, 23}
- LLaVA-1.5-7B: {0, 1, 7, 8, 9, 10, 11, 14}
- LLaVA-1.5-13B: {0, 6, 8, 9, 10, 13, 14, 16}
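As a quick sanity check on the sparsity, the preserved sets above cover only a small fraction of the decoder stack. The layer totals below are the standard Vicuna-7B/13B depths, which is an assumption on our part, not a number taken from this repository:

```python
# Preserved text-to-vision cross-attention layers, copied from the lists above.
T2V_LAYERS = {
    "llava-1.5-7b": [0, 1, 7, 8, 9, 10, 11, 14],
    "llava-1.5-13b": [0, 6, 8, 9, 10, 13, 14, 16],
}
# Decoder depths of the Vicuna backbones (assumed standard LLaMA sizes).
TOTAL_LAYERS = {"llava-1.5-7b": 32, "llava-1.5-13b": 40}

for name, layers in T2V_LAYERS.items():
    frac = len(layers) / TOTAL_LAYERS[name]
    print(f"{name}: {len(layers)}/{TOTAL_LAYERS[name]} cross-attention layers ({frac:.0%})")
```

So the 7B model keeps vision interaction in a quarter of its layers and the 13B model in a fifth, consistent with the claim that effective vision-language interaction is concentrated in a small subset of layers.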
This project is released under the Apache 2.0 license.
- Thanks to the LLaVA, FastV, and PyramidDrop libraries, which helped us quickly implement our ideas.
For questions, suggestions, or collaboration opportunities, please feel free to reach out:
- Wenjie Liu: wenjay_leo@outlook.com
- Hao Wu: haowu.ai.research@gmail.com
- Xiaoyu Shen: xyshen@eitech.edu.cn
- Survey
- Vision Encoder
- MLLM
