# [CVPR 2026] VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
CVPR 2026 | Paper | Project Page | Checkpoints
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32–256 tokens, achieving a state-of-the-art trade-off between efficiency and performance. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator that supports arbitrary resolutions out of the box while requiring significantly less compute.
| Highlight | Details |
|---|---|
| 🎯 1024×1024 in just 64 tokens | Achieves 3.94 gFID vs. 5.87 gFID for diffusion-based SOTA (1,024 tokens) |
| ⚡ Constant 179G FLOPs | 63× more efficient than LlamaGen (11T FLOPs at 1024×1024) |
| 🌐 Resolution-agnostic | Supports arbitrary resolutions and aspect ratios out of the box |
| 🎛️ Dynamic token count | User-controllable 32–256 tokens per image |
| 🔍 Native super-resolution | Supports image super-resolution out of the box |
- [Feb 2026] 🎉 VibeToken is accepted at CVPR 2026!
- [Feb 2026] Training scripts released.
- [Feb 2026] Inference code and checkpoints released.
```bash
# 1. Clone and set up
git clone https://github.com/<your-org>/VibeToken.git
cd VibeToken
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt

# 2. Download a checkpoint (see Checkpoints section below)
mkdir -p checkpoints
wget https://huggingface.co/mpatel57/VibeToken/resolve/main/VibeToken_LL.bin -O ./checkpoints/VibeToken_LL.bin

# 3. Reconstruct an image
python reconstruct.py --auto \
    --config configs/vibetoken_ll.yaml \
    --checkpoint ./checkpoints/VibeToken_LL.bin \
    --image ./assets/example_1.png \
    --output ./assets/reconstructed.png
```

All checkpoints are hosted on Hugging Face.
**Tokenizer checkpoints (VibeToken):**

| Name | Resolution | rFID (256 tokens) | rFID (64 tokens) | Download |
|---|---|---|---|---|
| VibeToken-LL | 1024×1024 | 3.76 | 4.12 | VibeToken_LL.bin |
| VibeToken-LL | 256×256 | 5.12 | 0.90 | same as above |
| VibeToken-SL | 1024×1024 | 4.25 | 2.41 | VibeToken_SL.bin |
| VibeToken-SL | 256×256 | 5.44 | 0.40 | same as above |
**Generator checkpoints (VibeToken-Gen):**

| Name | Training Resolution(s) | Tokens | Best gFID | Download |
|---|---|---|---|---|
| VibeToken-Gen-B | 256×256 | 65 | 7.62 | VibeTokenGen-b-fixed65_dynamic_1500k.pt |
| VibeToken-Gen-B | 1024×1024 | 65 | 7.37 | same as above |
| VibeToken-Gen-XXL | 256×256 | 65 | 3.62 | VibeTokenGen-xxl-dynamic-65_750k.pt |
| VibeToken-Gen-XXL | 1024×1024 | 65 | 3.54 | same as above |
```bash
uv venv --python=3.11.6
source .venv/bin/activate
uv pip install -r requirements.txt
```

Tip: If you don't have `uv`, install it via `pip install uv` or see the uv docs. Alternatively, use `python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt`.
Download the VibeToken-LL checkpoint (see Checkpoints), then:
```bash
# Auto mode (recommended) -- automatically determines optimal patch sizes
python reconstruct.py --auto \
    --config configs/vibetoken_ll.yaml \
    --checkpoint ./checkpoints/VibeToken_LL.bin \
    --image ./assets/example_1.png \
    --output ./assets/reconstructed.png
```

```bash
# Manual mode -- specify patch sizes explicitly
python reconstruct.py \
    --config configs/vibetoken_ll.yaml \
    --checkpoint ./checkpoints/VibeToken_LL.bin \
    --image ./assets/example_1.png \
    --output ./assets/reconstructed.png \
    --encoder_patch_size 16 \
    --decoder_patch_size 16
```

Note: For best results, the input image resolution should be a multiple of 32. Images with other resolutions are automatically rescaled to the nearest multiple of 32.
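As a rough illustration of the rescaling rule above, each dimension is snapped to the nearest multiple of 32 (this helper is illustrative; the script's internal resizing may differ in details such as tie-breaking):

```python
def nearest_multiple_of_32(x: int) -> int:
    """Round an image dimension to the nearest multiple of 32 (minimum 32)."""
    return max(32, round(x / 32) * 32)

# e.g. a 1000x700 input would be rescaled to 992x704
print(nearest_multiple_of_32(1000), nearest_multiple_of_32(700))
```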
Download both the VibeToken-LL and VibeToken-Gen-XXL checkpoints (see Checkpoints), then:
```bash
python generate.py \
    --gpt-ckpt ./checkpoints/VibeTokenGen-xxl-dynamic-65_750k.pt \
    --gpt-model GPT-XXL --num-output-layer 4 \
    --num-codebooks 8 --codebook-size 32768 \
    --image-size 256 --cfg-scale 4.0 --top-k 500 --temperature 1.0 \
    --class-dropout-prob 0.1 \
    --extra-layers "QKV" \
    --latent-size 65 \
    --config ./configs/vibetoken_ll.yaml \
    --vq-ckpt ./checkpoints/VibeToken_LL.bin \
    --sample-dir ./assets/ \
    --skip-folder-creation \
    --compile \
    --decoder-patch-size 32,32 \
    --target-resolution 1024,1024 \
    --llamagen-target-resolution 256,256 \
    --precision bf16 \
    --global-seed 156464151
```

`--target-resolution` controls the tokenizer output resolution, while `--llamagen-target-resolution` controls the generator's internal resolution (max 512×512; for higher resolutions, the tokenizer handles upscaling).
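To clarify what `--cfg-scale`, `--top-k`, and `--temperature` control at each AR sampling step, here is a generic sketch of classifier-free-guidance logit mixing followed by top-k temperature sampling. This is the standard formulation, not VibeToken-Gen's actual code; all names are illustrative:

```python
import math
import random

def sample_next_token(cond_logits, uncond_logits, cfg_scale=4.0, top_k=500,
                      temperature=1.0, rng=None):
    """One AR step: CFG logit mixing, top-k filtering, temperature softmax, draw."""
    rng = rng or random.Random()
    # Classifier-free guidance: push logits away from the unconditional prediction.
    logits = [u + cfg_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
    # Top-k filtering: discard everything below the k-th largest logit.
    if top_k < len(logits):
        kth = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= kth else float("-inf") for l in logits]
    # Temperature-scaled softmax (max-subtracted for numerical stability).
    m = max(logits)
    weights = [math.exp((l - m) / temperature) for l in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sample a token index from the resulting categorical distribution.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

With `cfg_scale=1.0` the unconditional term cancels and sampling follows the conditional distribution directly; larger values sharpen class adherence at some cost to diversity.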
To train the VibeToken tokenizer from scratch, please refer to TRAIN.md for detailed instructions.
We would like to acknowledge the following repositories that inspired our work and upon which we directly build: 1d-tokenizer, LlamaGen, and UniTok.
If you find VibeToken useful in your research, please consider citing:
```bibtex
@inproceedings{vibetoken2026,
  title     = {VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations},
  author    = {Patel, Maitreya and Li, Jingtao and Zhuang, Weiming and Yang, Yezhou and Lyu, Lingjuan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```

If you have any questions, feel free to open an issue or reach out!
