Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, Zuxuan Wu
FluxMem uses a training-free hierarchical memory with temporal (mid-term) and spatial (long-term) compression to adaptively prune redundant visual tokens in streaming video, enabling efficient real-time reasoning for large multimodal models.
- 🧠 Hierarchical memory: short-term keeps the freshest frames, mid-term filters temporal redundancy, long-term further removes spatial redundancy by anchoring salient tokens.
- 🪄 Training-free: drop-in gains without extra finetuning; if you do fine-tune, the gap just gets wider.
- 🧩 Plug-and-play: slips into Qwen2.5-VL as a memory add-on—no model surgery, no code rewrites.
- ⚡ Efficient: trims 60–70% visual tokens while lifting performance on both online and offline long-video benchmarks.
FluxMem
├── models/
│ ├── qwen2-5-vl/ # FluxMem-patched Qwen2.5-VL model & processor
│ └── qwen-vl-utils/ # Vision preprocessing
├── qwen-vl-finetune/ # Training pipeline, data configs, SFT scripts
├── evaluation/ # StreamingBench, OVO-Bench, lmms-eval recipes
└── assets/ # Figures used in README
Create venv:
uv venv --python=python3.11
source .venv/bin/activate- Inference essentials
uv pip install -e models/qwen2-5-vl uv pip install -e models/qwen-vl-utils
- Training
uv pip install -e "qwen-vl-finetune[train]" - Evaluation
uv pip install -e evaluation/lmms-eval # for VideoMME / MLVU / LongVideoBench uv pip install ffmpeg-python==0.2.0 moviepy==1.0.3 # for StreamingBench / OVO-Bench
- Flash-Attn 2: download the matching wheel, then
e.g.
uv pip install ./flash_attn-*.whl --no-build-isolationflash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl(CUDA 12, torch 2.6, Python 3.11).
import torch
from qwen_vl_utils_fluxmem import process_vision_info
from qwen2_5_vl_fluxmem import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-VL-7B-Instruct",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto"
)
processor = Qwen2_5_VLProcessor.from_pretrained(
"Qwen/Qwen2.5-VL-7B-Instruct",
)
video_path = 'PATH_TO_YOUR_VIDEO'
prompt = 'Describe this video.'
max_pixels = 256 * 28 * 28
fps = 1
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "text", "text": prompt},
{"video": video_path, 'fps': fps, "max_pixels": max_pixels},
],
}]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
**video_kwargs,
)
inputs = inputs.to(model.device)
generated_ids = model.generate(
**inputs,
max_new_tokens=128,
do_sample=False,
temperature=0.0,
use_fluxmem=True, # enable FluxMem
short_frames=8,
medium_frames=64,
)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)- Configure dataset paths in
qwen-vl-finetune/qwenvl/data/__init__.py(annotation_path,data_path). - Run the default SFT script:
cd qwen-vl-finetune
bash scripts/sft.sh(Adjust hyperparameters/paths in scripts as needed; deepspeed & flash-attn are supported.)
- StreamingBench:
bash evaluation/streamingbench/streamingbench.sh. - OVO-Bench:
bash evaluation/ovobench/ovobench.sh. - VideoMME / MLVU / LongVideoBench:
bash evaluation/lmms-eval/qwen25vl_fluxmem_*.sh. - Datasets: You can download the evaluation benchmarks from the corresponding link: StreamingBench; OVO-Bench.
Apache-2.0. Please also follow upstream model and dataset licenses.
If you find FluxMem useful, please cite:
@inproceedings{xie2026fluxmem,
title={FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding},
author={Xie, Yiweng and He, Bo and Wang, Junke and Zheng, Xiangyu and Ye, Ziyi and Wu, Zuxuan},
booktitle={CVPR},
year={2026}
}We thank the following projects for their contributions and inspiration: Qwen2.5-VL, TimeChat-online, OVOBench, StreamingBench, LMMS-Eval.




