[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5#38832
Merged
vadiklyutiy merged 2 commits into vllm-project:main on Apr 3, 2026
Conversation
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Contributor
Code Review
This pull request implements a workaround for Qwen 3.5 MTP models by forcing the fc layer to remain unquantized when the modelopt_fp4 quantization configuration is used. This addresses an issue where the layer is stored as BF16 in checkpoints but missing from the exclusion list in the quantization configuration. I have no feedback to provide.
ZJY0516 approved these changes on Apr 2, 2026
Description
Fix `AssertionError` when loading `nvidia/Qwen3.5-397B-A17B-NVFP4` with `method="mtp"`.

The NVFP4 checkpoint stores the entire MTP branch in BF16, but `hf_quant_config.json` only excludes `mtp.layers.0*`, missing `mtp.fc`. This causes the `ColumnParallelLinear` for `mtp.fc` to be created with NVFP4 quantization (packed uint8, half input dim), which then crashes at weight loading when the BF16 checkpoint weight shape doesn't match.

Fix: Force `quant_config=None` for `mtp.fc` when the quant method is `modelopt_fp4`.

This is a temporary workaround until NVIDIA/Model-Optimizer#1124 is merged and the checkpoint is re-exported with the corrected `exclude_modules`.

Related:
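The workaround can be sketched roughly as follows. This is a minimal sketch, assuming a vLLM-style quant config object that exposes `get_name()`; `fc_quant_config` is a hypothetical helper name, and the actual change in vLLM may be inlined into the MTP module constructor rather than factored out like this.

```python
def fc_quant_config(quant_config):
    """Pick the quant config to use when building mtp.fc.

    The NVFP4 checkpoint stores mtp.fc in BF16, but hf_quant_config.json
    omits it from exclude_modules, so under modelopt_fp4 the layer is
    forced to be built unquantized to match the checkpoint weight shapes.
    """
    if quant_config is not None and quant_config.get_name() == "modelopt_fp4":
        return None  # build mtp.fc as a plain (BF16) linear layer
    return quant_config
```

In the MTP module, the `ColumnParallelLinear` for `mtp.fc` would then be constructed with `quant_config=fc_quant_config(quant_config)` instead of the raw `quant_config`, leaving every other layer's quantization untouched.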
Test
2x B200, TP=2:
Before: `AssertionError` at `parameter.py:153` during MTP weight loading.

After: Server starts, inference works:
{"prompt": "What is 2+2?", "max_tokens": 32} -> "The sum of 2 and 2 is 4..."
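For context, the assertion comes from a plain shape mismatch, which can be illustrated numerically. The dimensions below are made up for illustration and `nvfp4_packed_shape` is a hypothetical helper; the point is only that NVFP4 packs two 4-bit values per uint8 byte, so a quantized weight is allocated with half the input dimension of the BF16 tensor stored in the checkpoint.

```python
def nvfp4_packed_shape(out_features, in_features):
    # NVFP4 stores two 4-bit values per uint8 byte along the input dim
    return (out_features, in_features // 2)

bf16_ckpt_shape = (4096, 8192)                 # shape stored in the checkpoint
param_shape = nvfp4_packed_shape(4096, 8192)   # shape allocated under NVFP4

# The shapes disagree, so loading the BF16 checkpoint weight into the
# NVFP4-allocated parameter fails the shape assertion at load time.
assert param_shape == (4096, 4096)
assert param_shape != bf16_ckpt_shape
```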