[CVPR 26] FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding

Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, Zuxuan Wu

FluxMem uses a training-free hierarchical memory with temporal (mid-term) and spatial (long-term) compression to adaptively prune redundant visual tokens in streaming video, enabling efficient real-time reasoning for large multimodal models.

Highlights

🧠 Hierarchical memory: short-term keeps the freshest frames, mid-term filters temporal redundancy, long-term further removes spatial redundancy by anchoring salient tokens.
🪄 Training-free: drop-in gains without extra fine-tuning; with fine-tuning, the gains grow further.
🧩 Plug-and-play: slips into Qwen2.5-VL as a memory add-on, with no model surgery and no code rewrites.
⚡ Efficient: trims 60–70% of visual tokens while improving performance on both online and offline long-video benchmarks.
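
The three memory tiers can be pictured with a small toy model. The following is a minimal sketch, not the FluxMem implementation: it assumes each frame is a list of token feature vectors, that temporal redundancy is measured by cosine similarity against the same token position in the previous frame, and that spatial redundancy is measured against anchor tokens already kept within the frame. All function names, the greedy anchoring rule, and the 0.9 thresholds are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def temporal_prune(prev_frame, frame, thresh=0.9):
    """Keep tokens that changed relative to the same position in the previous frame."""
    return [tok for prev, tok in zip(prev_frame, frame) if cosine(prev, tok) < thresh]

def spatial_prune(tokens, thresh=0.9):
    """Greedily keep anchor tokens; drop tokens too similar to any kept anchor."""
    anchors = []
    for tok in tokens:
        if all(cosine(tok, a) < thresh for a in anchors):
            anchors.append(tok)
    return anchors

def build_memory(frames, short_frames=2, medium_frames=2):
    """Split frames (oldest first) into long/mid/short tiers and compress them."""
    n = len(frames)
    short_start = max(0, n - short_frames)
    mid_start = max(0, short_start - medium_frames)
    memory = {"long": [], "mid": [], "short": frames[short_start:]}
    for i in range(short_start):
        # The first frame has no predecessor, so it skips temporal pruning.
        kept = frames[i] if i == 0 else temporal_prune(frames[i - 1], frames[i])
        if i < mid_start:
            # Long-term tier (assumed): also remove spatial redundancy.
            memory["long"].append(spatial_prune(kept))
        else:
            memory["mid"].append(kept)
    return memory

# Five toy frames with two 2-D tokens each.
frames = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
memory = build_memory(frames, short_frames=2, medium_frames=2)
```

In this toy run the two newest frames stay untouched in the short-term tier, the static third frame contributes no tokens to the mid-term tier, and the oldest frame collapses to a single anchor token in the long-term tier.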

Repository Layout

FluxMem
├── models/
│   ├── qwen2-5-vl/         # FluxMem-patched Qwen2.5-VL model & processor
│   └── qwen-vl-utils/      # Vision preprocessing
├── qwen-vl-finetune/       # Training pipeline, data configs, SFT scripts
├── evaluation/             # StreamingBench, OVO-Bench, lmms-eval recipes
└── assets/                 # Figures used in README

Installation

Create venv:

uv venv --python=python3.11
source .venv/bin/activate

Inference essentials

uv pip install -e models/qwen2-5-vl
uv pip install -e models/qwen-vl-utils

Training

uv pip install -e "qwen-vl-finetune[train]"

Evaluation

uv pip install -e evaluation/lmms-eval      # for VideoMME / MLVU / LongVideoBench
uv pip install ffmpeg-python==0.2.0 moviepy==1.0.3   # for StreamingBench / OVO-Bench

Flash-Attn 2: download the wheel that matches your CUDA, PyTorch, and Python versions, then install it:
```
uv pip install ./flash_attn-*.whl --no-build-isolation
```
e.g. flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl (CUDA 12, torch 2.6, Python 3.11).
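
To check a downloaded wheel against the local interpreter before installing, the build tags can be read off the filename. This is a minimal sketch: the regular expression is inferred only from the single example filename above, so other naming variants may not match, and `parse_flash_attn_wheel` and `matches_python` are hypothetical helper names.

```python
import re
import sys

# Pattern inferred from the example filename above (an assumption, not a spec).
WHEEL_RE = re.compile(
    r"flash_attn-(?P<version>[^+]+)\+cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)"
    r"cxx11abi(?P<abi>TRUE|FALSE)-(?P<py>cp\d+)-cp\d+-(?P<platform>.+)\.whl$"
)

def parse_flash_attn_wheel(filename):
    """Split a flash-attn wheel filename into its build tags."""
    match = WHEEL_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized wheel name: {filename}")
    return match.groupdict()

def matches_python(tags, version_info=sys.version_info):
    """Check the wheel's CPython tag against a (major, minor, ...) version tuple."""
    return tags["py"] == f"cp{version_info[0]}{version_info[1]}"

tags = parse_flash_attn_wheel(
    "flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"
)
print(tags)
print("matches this interpreter:", matches_python(tags))
```

The CUDA and torch tags can be compared the same way against `torch.version.cuda` and `torch.__version__` once PyTorch is installed.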

Quick Start

import torch

from qwen_vl_utils_fluxmem import process_vision_info
from qwen2_5_vl_fluxmem import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", 
    torch_dtype=torch.bfloat16, 
    attn_implementation="flash_attention_2",
    device_map="auto"
)
processor = Qwen2_5_VLProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
)

video_path = "PATH_TO_YOUR_VIDEO"
prompt = "Describe this video."

max_pixels = 256 * 28 * 28  # per-frame pixel budget
fps = 1  # frames sampled per second of video

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "video", "video": video_path, "fps": fps, "max_pixels": max_pixels},
        ],
    },
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text], 
    images=image_inputs, 
    videos=video_inputs, 
    padding=True, 
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    temperature=0.0,
    use_fluxmem=True,   # enable FluxMem
    short_frames=8,
    medium_frames=64,
)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
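
The `max_pixels` and `fps` settings above determine how many visual tokens a video can produce before any pruning. The following back-of-the-envelope sketch assumes one visual token per 28×28 pixel block (as the `256 * 28 * 28` expression suggests), ignores any temporal merging of frames, and applies the 60–70% reduction from the Highlights uniformly. The function names are hypothetical, and the numbers are rough upper bounds rather than measured values.

```python
def estimate_visual_tokens(duration_s, fps=1, max_pixels=256 * 28 * 28,
                           pixels_per_token=28 * 28):
    """Upper bound on visual tokens: sampled frames times per-frame token cap."""
    frames = int(duration_s * fps)
    tokens_per_frame = max_pixels // pixels_per_token
    return frames * tokens_per_frame

def after_pruning(total_tokens, pruned_percent):
    """Tokens left after removing the given percentage (integer arithmetic)."""
    return total_tokens * (100 - pruned_percent) // 100

total = estimate_visual_tokens(duration_s=600)  # a 10-minute video at 1 fps
print(total)                     # 153600
print(after_pruning(total, 60))  # 61440
print(after_pruning(total, 70))  # 46080
```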

Training

Configure dataset paths in qwen-vl-finetune/qwenvl/data/__init__.py (annotation_path, data_path).
Run the default SFT script:

cd qwen-vl-finetune
bash scripts/sft.sh

Adjust hyperparameters and paths in the scripts as needed; DeepSpeed and flash-attn are supported.
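
As an illustration of the dataset configuration step, the sketch below assumes that qwen-vl-finetune/qwenvl/data/__init__.py follows the dictionary-based registry used by the upstream Qwen2.5-VL fine-tuning code, where each dataset is a dictionary with `annotation_path` and `data_path` keys. The dataset name `MY_VIDEO_SFT`, the registry name `data_dict`, and both paths are hypothetical placeholders; check the actual file before editing.

```python
# Hypothetical dataset entry; replace the placeholder paths with your own.
MY_VIDEO_SFT = {
    "annotation_path": "/path/to/annotations.json",  # conversation-style annotations
    "data_path": "/path/to/videos",                  # root directory of the media files
}

# Assumed registry mapping a dataset name to its entry.
data_dict = {
    "my_video_sft": MY_VIDEO_SFT,
}
```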

Evaluation

StreamingBench: bash evaluation/streamingbench/streamingbench.sh
OVO-Bench: bash evaluation/ovobench/ovobench.sh
VideoMME / MLVU / LongVideoBench: bash evaluation/lmms-eval/qwen25vl_fluxmem_*.sh
Datasets: download the evaluation benchmarks from their respective pages: StreamingBench, OVO-Bench.
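
To run several benchmarks in sequence, the commands above can be collected programmatically. This is a small dry-run sketch using only the script paths listed above; `evaluation_commands` is a hypothetical helper, and it only builds the command list without executing anything.

```python
import glob
import os

def evaluation_commands(repo_root="."):
    """Return the bash commands for all evaluation scripts found under repo_root."""
    fixed = [
        "evaluation/streamingbench/streamingbench.sh",
        "evaluation/ovobench/ovobench.sh",
    ]
    # Expand the lmms-eval wildcard relative to the repository root.
    pattern = os.path.join(repo_root, "evaluation", "lmms-eval", "qwen25vl_fluxmem_*.sh")
    matched = sorted(os.path.relpath(p, repo_root) for p in glob.glob(pattern))
    return [["bash", script] for script in fixed + matched]

for cmd in evaluation_commands("."):
    print(" ".join(cmd))
```

Each returned command can then be passed to `subprocess.run(cmd, check=True, cwd=repo_root)` to execute it from the repository root.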

License

Apache-2.0. Please also follow upstream model and dataset licenses.

Citation

If you find FluxMem useful, please cite:

@inproceedings{xie2026fluxmem,
  title={FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding},
  author={Xie, Yiweng and He, Bo and Wang, Junke and Zheng, Xiangyu and Ye, Ziyi and Wu, Zuxuan},
  booktitle={CVPR},
  year={2026}
}

Acknowledgements

We thank the following projects for their contributions and inspiration: Qwen2.5-VL, TimeChat-online, OVO-Bench, StreamingBench, LMMS-Eval.
