
[mllm] support longcat_next #1637

Open
xin3he wants to merge 8 commits into main from xinhe/3-30

Conversation

xin3he (Contributor) commented Mar 30, 2026

Description

ValueError: Cannot use apply_chat_template because this processor does not have a chat template.

To reproduce: auto-round /storage/xinhe/meituan-longcat/LongCat-Next/
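The fix guards the chat-template call instead of letting it raise. A minimal sketch of the idea, not the actual auto-round code (the function and fallback names here are hypothetical):

```python
def build_prompt(processor, messages, fallback_template=None):
    """Apply the processor's chat template when one exists; otherwise
    fall back to a registered template or a plain role-tagged join."""
    if getattr(processor, "chat_template", None):
        return processor.apply_chat_template(messages, tokenize=False)
    if fallback_template is not None:
        return fallback_template(messages)
    # Last resort: flatten the conversation into a simple prompt.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```

With a processor that lacks a chat template, `build_prompt` degrades gracefully instead of raising the `ValueError` above.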

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: Xin He <xin3.he@intel.com>
Copilot AI review requested due to automatic review settings March 30, 2026 06:28
@xin3he xin3he requested review from lvliang-intel and n1ck-guo and removed request for Copilot March 30, 2026 06:31
XuehaoSun (Contributor) commented:

2026-03-30 15:56:33 INFO __main__.py L599: start to quantize meituan-longcat/LongCat-Next
2026-03-30 15:56:34 INFO autoround.py L178: using MLLM mode for multimodal model.
/data3/hf_new_model_cache/modules/transformers_modules/meituan_hyphen_longcat/LongCat_hyphen_Next/522f2020e5ed353429cc403b72491ba1899ef0e6/modular_longcat_next_audio.py:220: Fut
  @autocast(enabled=True, dtype=torch.float32)
2026-03-30 15:56:41 WARNING modeling_utils.py L2446: You are attempting to use Flash Attention 2 without specifying a torch dtype. This might lead to unexpected behaviour
/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/diffusers/models/lora.py:393: FutureWarning: `LoRACompatibleLinear` is deprecated and will be removed in
  deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
self.visual_offset_vals=tensor([150581, 166965, 183349, 199733, 216117, 232501, 248885, 265269])
self.audio_offset_vals=tensor([131125, 139317, 143413, 145461, 146485, 147509, 148533, 149557])
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:01<00:00, 11.14it/s]
2026-03-30 15:57:01 WARNING compressor.py L286: longcat_next does not support for NeelNanda/pile-10k, will use liuhaotian/llava_conv_58k with default config as an alternative.
2026-03-30 15:57:01 WARNING compressor.py L296: reset batch_size(8) to 1 and gradient_accumulate_steps(1) to 8, because batch_size=8 cannot be used for liuhaotian/llava_conv_58k
2026-03-30 15:57:01 INFO base.py L517: using torch.bfloat16 for quantization tuning
2026-03-30 15:57:01 INFO base.py L834: 'enable_torch_compile' is set to `False` by default. Enabling it can reduce tuning cost by 20%, but it might throw an exception.
2026-03-30 15:57:01 WARNING formats.py L166: some layers are skipped quantization (shape not divisible by 32): audio_head.heads.[0-7], lm_head, model.audio_tokenizer.audio_flow_
2026-03-30 15:57:01 INFO base.py L1660: Using predefined ignore_layers: classifier
2026-03-30 15:57:02 INFO base.py L1818: start to cache block inputs
2026-03-30 15:57:07 WARNING base.py L2328: Some layers are offloaded to cpu, which may severely impact calibration speed. Please consider using more cards.
Some parameters are on the meta device because they were offloaded to the cpu.
2026-03-30 15:57:28 WARNING dataset.py L251: seqlen(2048) is greater than the maximum length supported by the liuhaotian/llava_conv_58k, reset to 512
2026-03-30 15:57:28 INFO dataset.py L99: use dataset llava_conv_58k, downloading...
cache block inputs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [14:44<00:00,  6.91s/it]
2026-03-30 16:12:42 INFO base.py L1835: caching done
Quantizing model.layers.0:   0%|                                                                                                                         | 0/100 [00:10<?, ?it/s]
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
^[[B^[[A2026-03-30 16:54:23 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.000444 -> iter 194: 0.000079,'peak_ram': 86.58GB, 'peak_vram': 66.75GB
Quantizing model.layers.1:   1%|█                                                                                                           | 1/100 [42:12<69:39:08, 2532.81s/it]
quantized 784/785 layers in the block, loss iter 0: 0.001716 -> iter 195: 0.000445,'peak_ram': 94.89GB, 'peak_vram': 66.75GB
Quantizing model.layers.2:   2%|██                                                                                                        | 2/100 [1:23:27<68:01:31, 2498.89s/it]
quantized 784/785 layers in the block, loss iter 0: 0.002576 -> iter 199: 0.001224,'peak_ram': 103.3GB, 'peak_vram': 66.75GB
Quantizing model.layers.3:   3%|███▏                                                                                                      | 3/100 [2:04:37<66:58:09, 2485.46s/it]
quantized 784/785 layers in the block, loss iter 0: 0.003595 -> iter 197: 0.001099,'peak_ram': 104.32GB, 'peak_vram': 66.75GB
Quantizing model.layers.4:   4%|████▏                                                                                                     | 4/100 [2:43:32<64:41:42, 2426.07s/it]
quantized 784/785 layers in the block, loss iter 0: 0.003605 -> iter 192: 0.001413,'peak_ram': 116.5GB, 'peak_vram': 66.75GB
Quantizing model.layers.5:   5%|█████▎                                                                                                    | 5/100 [3:21:56<62:51:48, 2382.19s/it]
quantized 784/785 layers in the block, loss iter 0: 0.004384 -> iter 192: 0.002084,'peak_ram': 116.6GB, 'peak_vram': 66.75GB
Quantizing model.layers.6:   6%|██████▎                                                                                                   | 6/100 [4:00:49<61:45:39, 2365.31s/it]
quantized 784/785 layers in the block, loss iter 0: 0.006060 -> iter 196: 0.002672,'peak_ram': 121.61GB, 'peak_vram': 66.75GB
Quantizing model.layers.7:   7%|███████▍                                                                                                  | 7/100 [4:39:00<60:28:36, 2341.03s/it]2026-03-30 21:30:55 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.009842 -> iter 169: 0.003777,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.8:   8%|████████▍                                                                                                 | 8/100 [5:18:48<60:12:30, 2355.99s/it]2026-03-30 22:10:08 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.009777 -> iter 199: 0.004623,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.9:   9%|█████████▌                                                                                                | 9/100 [5:57:55<59:29:10, 2353.30s/it]2026-03-30 22:48:58 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.018928 -> iter 191: 0.008281,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.10:  10%|██████████▍                                                                                             | 10/100 [6:36:50<58:41:27, 2347.64s/it]2026-03-30 23:28:31 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.022149 -> iter 180: 0.011693,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.11:  11%|███████████▍                                                                                            | 11/100 [7:16:19<58:12:02, 2354.18s/it]2026-03-31 00:09:23 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.041877 -> iter 196: 0.017732,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.12:  12%|████████████▍                                                                                           | 12/100 [7:57:11<58:16:20, 2383.87s/it]2026-03-31 00:52:34 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.072172 -> iter 197: 0.030324,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.13:  13%|█████████████▌                                                                                          | 13/100 [8:40:29<59:10:31, 2448.64s/it]2026-03-31 01:34:45 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.134645 -> iter 190: 0.045848,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.13:  14%|██████████████▌                                                                                         | 14/100 [9:22:36<59:03:33, 2472.25s/it]Traceback (most recent call last):
  File "/home/uttest/miniforge3/envs/autoround_test/bin/auto-round", line 10, in <module>
    sys.exit(run())
             ^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 822, in run
    start()
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 541, in start
    tune(args)
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 761, in tune
    model, folders = autoround.quantize_and_save(export_dir, format=args.format)  # pylint: disable=E1101
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/base.py", line 1018, in quantize_and_save
    model, _ = self.quantize()
               ^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/base.py", line 1850, in quantize
    inputs = all_inputs[block_names[0]]
             ~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 'model.audio_tokenizer.audio_model.layers.0'

xin3he (Contributor, Author) commented Mar 31, 2026

Thank you for checking, @XuehaoSun.
The audio part should be skipped, since the dataset only contains image and text. I will fix it and let you know.
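The `KeyError: 'model.audio_tokenizer.audio_model.layers.0'` above suggests the audio tower was selected for calibration but never received inputs. A sketch of the intended skip, assuming blocks are chosen by module-name prefix (the function name and prefix list are hypothetical, not the actual auto-round code):

```python
def filter_calib_blocks(block_names, skip_prefixes=("model.audio_tokenizer.",)):
    """Drop blocks that never receive calibration inputs, e.g. the audio
    tower when the calibration dataset is image+text only."""
    return [n for n in block_names if not n.startswith(tuple(skip_prefixes))]
```

Filtering the block list before input caching avoids looking up cached inputs for blocks that the calibration data never exercises.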

@xin3he xin3he marked this pull request as draft April 1, 2026 11:09
Signed-off-by: Xin He <xin3.he@intel.com>
xin3he (Contributor, Author) commented Apr 2, 2026

It's more complex than originally expected. Since it's an omni model, more time is needed to enable it.

xin3he (Contributor, Author) commented Apr 9, 2026

NotImplementedError: Could not run 'flash_attn::_flash_attn_varlen_forward' with arguments from the 'CPU' backend.
This model must always run on a CUDA device; offloading is not supported.

@xin3he xin3he marked this pull request as ready for review April 9, 2026 07:16
Copilot AI review requested due to automatic review settings April 9, 2026 07:16
Signed-off-by: Xin He <xin3.he@intel.com>
xin3he (Contributor, Author) commented Apr 9, 2026

@XuehaoSun Please run with AR_CALIB_FORCE_CUDA=1 and try it again. Thanks.
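The `AR_CALIB_FORCE_CUDA` name comes from this PR; the reading logic below is only a sketch of how such a toggle is typically consumed (the function name is hypothetical), forcing calibration caching onto CUDA for models whose kernels, like `flash_attn`, cannot run on CPU:

```python
import os

def calib_device(default="cpu"):
    """Return the device used for calibration caching. When
    AR_CALIB_FORCE_CUDA=1, force CUDA instead of offloading to CPU."""
    if os.environ.get("AR_CALIB_FORCE_CUDA", "0") == "1":
        return "cuda"
    return default
```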

@xin3he xin3he requested a review from wenhuach21 April 9, 2026 07:17
Copilot AI (Contributor) left a comment


Pull request overview

Adds support for the meituan-longcat/LongCat-Next MLLM family by introducing a dedicated longcat_next processor/template and small loader behavior tweaks to avoid chat-template related failures.

Changes:

  • Register a new longcat_next MLLM template and processor.
  • Adjust chat-template handling to avoid calling apply_chat_template when no template is present.
  • Add LongCat-specific tokenizer loading behavior (fix_mistral_regex) and additional calibration/block-selection tweaks.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| auto_round/utils/model.py | Adds a LongCat-related tokenizer loading flag and (per diff) updates multimodal block discovery behavior. |
| auto_round/envs.py | Adds a new environment toggle for calibration device behavior. |
| auto_round/compressors/mllm/template.py | Registers the new longcat_next template. |
| auto_round/compressors/mllm/processor.py | Adds LongCatNextProcessor and tightens chat-template checks. |
| auto_round/compressors/base.py | Adds (per diff) an env-controlled override for GPU vs CPU calibration caching behavior. |

xin3he and others added 2 commits April 9, 2026 08:43
Signed-off-by: Xin He <xin3.he@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
xin3he (Contributor, Author) commented Apr 10, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xin3he xin3he requested a review from XuehaoSun April 10, 2026 03:12
xin3he (Contributor, Author) commented Apr 10, 2026

/azp run Unit-Test-CUDA-AutoRound

@intel intel deleted a comment from azure-pipelines bot Apr 10, 2026
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Xin He <xin3.he@intel.com>
3 participants