
[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5#38832

Merged
vadiklyutiy merged 2 commits into vllm-project:main from vadiklyutiy:qwen35-fp4-mtp.fc
Apr 3, 2026

Conversation

@vadiklyutiy
Collaborator

Description

Fix AssertionError when loading nvidia/Qwen3.5-397B-A17B-NVFP4 with method="mtp".

The NVFP4 checkpoint stores the entire MTP branch in BF16, but hf_quant_config.json only excludes mtp.layers.0*, omitting mtp.fc. As a result, the ColumnParallelLinear for mtp.fc is created with NVFP4 quantization (packed uint8, so half the logical input dim), and weight loading then crashes because the BF16 checkpoint weight shape doesn't match.
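A hypothetical sketch of the shape arithmetic behind the crash (the hidden size is an assumed illustrative value, not taken from the model config):

```python
# NVFP4 packs two 4-bit weights into one uint8, so the quantized
# parameter's input dim is half the logical one. The BF16 checkpoint
# weight for mtp.fc keeps the full logical shape, so the shapes can
# never match and the loader's shape assertion fires.
hidden_size = 8192  # illustrative value

quantized_shape = (hidden_size, hidden_size // 2)  # packed uint8 parameter
checkpoint_shape = (hidden_size, hidden_size)      # plain BF16 checkpoint weight

print(quantized_shape == checkpoint_shape)  # → False
```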

Fix: force quant_config=None for mtp.fc when the quantization method is modelopt_fp4, so the layer is built unquantized and matches the BF16 checkpoint weight.

This is a temporary workaround until NVIDIA/Model-Optimizer#1124 is merged and the checkpoint is re-exported with the corrected exclude_modules.

Related:

Test

2x B200, TP=2:

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 2 \
  --language-model-only \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --max-model-len 1024

Before: AssertionError at parameter.py:153 during MTP weight loading.

After: Server starts, inference works:

{"prompt": "What is 2+2?", "max_tokens": 32}
-> "The sum of 2 and 2 is 4..."

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
@vadiklyutiy vadiklyutiy requested a review from sighingnow as a code owner April 2, 2026 17:20
@mergify mergify bot added qwen Related to Qwen models bug Something isn't working labels Apr 2, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request implements a workaround for Qwen 3.5 MTP models by forcing the fc layer to remain unquantized when the modelopt_fp4 quantization configuration is used. This addresses an issue where the layer is stored as BF16 in checkpoints but missing from the exclusion list in the quantization configuration. I have no feedback to provide.

@ZJY0516 ZJY0516 added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 2, 2026
@vadiklyutiy vadiklyutiy merged commit 771913e into vllm-project:main Apr 3, 2026
57 checks passed