
[mllm] support longcat_next #1637

Open
xin3he wants to merge 8 commits into main from xinhe/3-30

Conversation

xin3he (Contributor) commented Mar 30, 2026

Description

ValueError: Cannot use apply_chat_template because this processor does not have a chat template.

To reproduce: auto-round /storage/xinhe/meituan-longcat/LongCat-Next/
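The fix guards the chat-template call instead of letting it raise. A minimal sketch of the idea, not the actual auto-round code (the function and fallback names here are hypothetical):

```python
def build_prompt(processor, messages, fallback_template=None):
    """Apply the processor's chat template when one exists; otherwise
    fall back to a registered template or a plain role-tagged join."""
    if getattr(processor, "chat_template", None):
        return processor.apply_chat_template(messages, tokenize=False)
    if fallback_template is not None:
        return fallback_template(messages)
    # Last resort: flatten the conversation into a simple prompt.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```

With a processor that lacks a chat template, `build_prompt` degrades gracefully instead of raising the `ValueError` above.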

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: Xin He <xin3.he@intel.com>
Copilot AI review requested due to automatic review settings March 30, 2026 06:28
@xin3he xin3he requested review from lvliang-intel and n1ck-guo and removed request for Copilot March 30, 2026 06:31
XuehaoSun (Contributor) commented:

2026-03-30 15:56:33 INFO __main__.py L599: start to quantize meituan-longcat/LongCat-Next
2026-03-30 15:56:34 INFO autoround.py L178: using MLLM mode for multimodal model.
/data3/hf_new_model_cache/modules/transformers_modules/meituan_hyphen_longcat/LongCat_hyphen_Next/522f2020e5ed353429cc403b72491ba1899ef0e6/modular_longcat_next_audio.py:220: Fut
  @autocast(enabled=True, dtype=torch.float32)
2026-03-30 15:56:41 WARNING modeling_utils.py L2446: You are attempting to use Flash Attention 2 without specifying a torch dtype. This might lead to unexpected behaviour
/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/diffusers/models/lora.py:393: FutureWarning: `LoRACompatibleLinear` is deprecated and will be removed in
  deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
self.visual_offset_vals=tensor([150581, 166965, 183349, 199733, 216117, 232501, 248885, 265269])
self.audio_offset_vals=tensor([131125, 139317, 143413, 145461, 146485, 147509, 148533, 149557])
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:01<00:00, 11.14it/s]
2026-03-30 15:57:01 WARNING compressor.py L286: longcat_next does not support for NeelNanda/pile-10k, will use liuhaotian/llava_conv_58k with default config as an alternative.
2026-03-30 15:57:01 WARNING compressor.py L296: reset batch_size(8) to 1 and gradient_accumulate_steps(1) to 8, because batch_size=8 cannot be used for liuhaotian/llava_conv_58k
2026-03-30 15:57:01 INFO base.py L517: using torch.bfloat16 for quantization tuning
2026-03-30 15:57:01 INFO base.py L834: 'enable_torch_compile' is set to `False` by default. Enabling it can reduce tuning cost by 20%, but it might throw an exception.
2026-03-30 15:57:01 WARNING formats.py L166: some layers are skipped quantization (shape not divisible by 32): audio_head.heads.[0-7], lm_head, model.audio_tokenizer.audio_flow_
2026-03-30 15:57:01 INFO base.py L1660: Using predefined ignore_layers: classifier
2026-03-30 15:57:02 INFO base.py L1818: start to cache block inputs
2026-03-30 15:57:07 WARNING base.py L2328: Some layers are offloaded to cpu, which may severely impact calibration speed. Please consider using more cards.
Some parameters are on the meta device because they were offloaded to the cpu.
2026-03-30 15:57:28 WARNING dataset.py L251: seqlen(2048) is greater than the maximum length supported by the liuhaotian/llava_conv_58k, reset to 512
2026-03-30 15:57:28 INFO dataset.py L99: use dataset llava_conv_58k, downloading...
cache block inputs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [14:44<00:00,  6.91s/it]
2026-03-30 16:12:42 INFO base.py L1835: caching done
Quantizing model.layers.0:   0%|                                                                                                                         | 0/100 [00:10<?, ?it/s]
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
^[[B^[[A2026-03-30 16:54:23 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.000444 -> iter 194: 0.000079,'peak_ram': 86.58GB, 'peak_vram': 66.75GB
Quantizing model.layers.1:   1%|█                                                                                                           | 1/100 [42:12<69:39:08, 2532.81s/it]
quantized 784/785 layers in the block, loss iter 0: 0.001716 -> iter 195: 0.000445,'peak_ram': 94.89GB, 'peak_vram': 66.75GB
Quantizing model.layers.2:   2%|██                                                                                                        | 2/100 [1:23:27<68:01:31, 2498.89s/it]
quantized 784/785 layers in the block, loss iter 0: 0.002576 -> iter 199: 0.001224,'peak_ram': 103.3GB, 'peak_vram': 66.75GB
Quantizing model.layers.3:   3%|███▏                                                                                                      | 3/100 [2:04:37<66:58:09, 2485.46s/it]
quantized 784/785 layers in the block, loss iter 0: 0.003595 -> iter 197: 0.001099,'peak_ram': 104.32GB, 'peak_vram': 66.75GB
Quantizing model.layers.4:   4%|████▏                                                                                                     | 4/100 [2:43:32<64:41:42, 2426.07s/it]
quantized 784/785 layers in the block, loss iter 0: 0.003605 -> iter 192: 0.001413,'peak_ram': 116.5GB, 'peak_vram': 66.75GB
Quantizing model.layers.5:   5%|█████▎                                                                                                    | 5/100 [3:21:56<62:51:48, 2382.19s/it]
quantized 784/785 layers in the block, loss iter 0: 0.004384 -> iter 192: 0.002084,'peak_ram': 116.6GB, 'peak_vram': 66.75GB
Quantizing model.layers.6:   6%|██████▎                                                                                                   | 6/100 [4:00:49<61:45:39, 2365.31s/it]
quantized 784/785 layers in the block, loss iter 0: 0.006060 -> iter 196: 0.002672,'peak_ram': 121.61GB, 'peak_vram': 66.75GB
Quantizing model.layers.7:   7%|███████▍                                                                                                  | 7/100 [4:39:00<60:28:36, 2341.03s/it]2026-03-30 21:30:55 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.009842 -> iter 169: 0.003777,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.8:   8%|████████▍                                                                                                 | 8/100 [5:18:48<60:12:30, 2355.99s/it]2026-03-30 22:10:08 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.009777 -> iter 199: 0.004623,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.9:   9%|█████████▌                                                                                                | 9/100 [5:57:55<59:29:10, 2353.30s/it]2026-03-30 22:48:58 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.018928 -> iter 191: 0.008281,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.10:  10%|██████████▍                                                                                             | 10/100 [6:36:50<58:41:27, 2347.64s/it]2026-03-30 23:28:31 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.022149 -> iter 180: 0.011693,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.11:  11%|███████████▍                                                                                            | 11/100 [7:16:19<58:12:02, 2354.18s/it]2026-03-31 00:09:23 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.041877 -> iter 196: 0.017732,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.12:  12%|████████████▍                                                                                           | 12/100 [7:57:11<58:16:20, 2383.87s/it]2026-03-31 00:52:34 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.072172 -> iter 197: 0.030324,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.13:  13%|█████████████▌                                                                                          | 13/100 [8:40:29<59:10:31, 2448.64s/it]2026-03-31 01:34:45 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.134645 -> iter 190: 0.045848,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.13:  14%|██████████████▌                                                                                         | 14/100 [9:22:36<59:03:33, 2472.25s/it]Traceback (most recent call last):
  File "/home/uttest/miniforge3/envs/autoround_test/bin/auto-round", line 10, in <module>
    sys.exit(run())
             ^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 822, in run
    start()
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 541, in start
    tune(args)
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 761, in tune
    model, folders = autoround.quantize_and_save(export_dir, format=args.format)  # pylint: disable=E1101
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/base.py", line 1018, in quantize_and_save
    model, _ = self.quantize()
               ^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/base.py", line 1850, in quantize
    inputs = all_inputs[block_names[0]]
             ~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 'model.audio_tokenizer.audio_model.layers.0'

xin3he (Contributor, Author) commented Mar 31, 2026

Thank you for checking, @XuehaoSun.
The audio part should be skipped, since the dataset only contains image and text. I will fix it and let you know.
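The `KeyError: 'model.audio_tokenizer.audio_model.layers.0'` above suggests the audio tower was selected for calibration but never received inputs. A sketch of the intended skip, assuming blocks are chosen by module-name prefix (the function name and prefix list are hypothetical, not the actual auto-round code):

```python
def filter_calib_blocks(block_names, skip_prefixes=("model.audio_tokenizer.",)):
    """Drop blocks that never receive calibration inputs, e.g. the audio
    tower when the calibration dataset is image+text only."""
    return [n for n in block_names if not n.startswith(tuple(skip_prefixes))]
```

Filtering the block list before input caching avoids looking up cached inputs for blocks that the calibration data never exercises.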

@xin3he xin3he marked this pull request as draft April 1, 2026 11:09
Signed-off-by: Xin He <xin3.he@intel.com>
xin3he (Contributor, Author) commented Apr 2, 2026

It's more complex than originally expected. Since it's an omni model, more time is needed to enable it.

xin3he (Contributor, Author) commented Apr 9, 2026

NotImplementedError: Could not run 'flash_attn::_flash_attn_varlen_forward' with arguments from the 'CPU' backend.
This model must always run on a CUDA device; offloading is not supported.

@xin3he xin3he marked this pull request as ready for review April 9, 2026 07:16
Copilot AI review requested due to automatic review settings April 9, 2026 07:16
Signed-off-by: Xin He <xin3.he@intel.com>
xin3he (Contributor, Author) commented Apr 9, 2026

@XuehaoSun Please run with AR_CALIB_FORCE_CUDA=1 and try it again. Thanks.
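The `AR_CALIB_FORCE_CUDA` name comes from this PR; the reading logic below is only a sketch of how such a toggle is typically consumed (the function name is hypothetical), forcing calibration caching onto CUDA for models whose kernels, like `flash_attn`, cannot run on CPU:

```python
import os

def calib_device(default="cpu"):
    """Return the device used for calibration caching. When
    AR_CALIB_FORCE_CUDA=1, force CUDA instead of offloading to CPU."""
    if os.environ.get("AR_CALIB_FORCE_CUDA", "0") == "1":
        return "cuda"
    return default
```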

@xin3he xin3he requested a review from wenhuach21 April 9, 2026 07:17
Copilot AI (Contributor) left a comment


Pull request overview

Adds support for the meituan-longcat/LongCat-Next MLLM family by introducing a dedicated longcat_next processor/template and small loader behavior tweaks to avoid chat-template related failures.

Changes:

  • Register a new longcat_next MLLM template and processor.
  • Adjust chat-template handling to avoid calling apply_chat_template when no template is present.
  • Add LongCat-specific tokenizer loading behavior (fix_mistral_regex) and additional calibration/block-selection tweaks.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| auto_round/utils/model.py | Adds a LongCat-related tokenizer loading flag and (per diff) updates multimodal block discovery behavior. |
| auto_round/envs.py | Adds a new environment toggle for calibration device behavior. |
| auto_round/compressors/mllm/template.py | Registers the new longcat_next template. |
| auto_round/compressors/mllm/processor.py | Adds LongCatNextProcessor and tightens chat-template checks. |
| auto_round/compressors/base.py | Adds (per diff) an env-controlled override for GPU vs CPU calibration caching behavior. |

xin3he and others added 2 commits April 9, 2026 08:43
Signed-off-by: Xin He <xin3.he@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
xin3he (Contributor, Author) commented Apr 10, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xin3he xin3he requested a review from XuehaoSun April 10, 2026 03:12
xin3he (Contributor, Author) commented Apr 10, 2026

/azp run Unit-Test-CUDA-AutoRound

@intel intel deleted a comment from azure-pipelines bot Apr 10, 2026
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Xin He <xin3.he@intel.com>
3 participants