[Bug]: gemma 4: RuntimeError: The size of tensor a (512) must match the size of tensor b (256) at non-singleton dimension 3 #1651

@XuehaoSun

Description

Problem Description

2026-04-03 09:25:38 INFO __main__.py L599: start to quantize google/gemma-4-26B-A4B
2026-04-03 09:25:39 INFO autoround.py L178: using MLLM mode for multimodal model.
model.safetensors.index.json: 103kB [00:00, 7.08MB/s]
Fetching 2 files: 100%|██████████| 2/2 [09:06<00:00, 273.07s/it]
Download complete: 100%|██████████| 51.6G/51.6G [09:06<00:00, 94.5MB/s]
Loading weights: 100%|██████████| 1013/1013 [00:00<00:00, 4362.49it/s]
generation_config.json: 100%|██████████| 181/181 [00:00<00:00, 729kB/s]
processor_config.json: 1.69kB [00:00, 426kB/s]
2026-04-03 09:35:42 INFO base.py L517: using torch.bfloat16 for quantization tuning
2026-04-03 09:35:42 INFO base.py L834: 'enable_torch_compile' is set to `False` by default. Enabling it can reduce tuning cost by 20%, but it might throw an exception.
2026-04-03 09:35:42 WARNING formats.py L166: some layers are skipped quantization (shape not divisible by 32): model.vision_tower.encoder.layers.[0-26].mlp.down_proj.linear, model.vision_tower.encoder.layers.[0-26].mlp.gate_proj.linear, model.vision_tower.encoder.layers.[0-26].mlp.up_proj.linear
2026-04-03 09:35:42 INFO replace_modules.py L107: Experts (before replacement) [model.language_model.layers.0.experts] (Gemma4TextExperts):
Gemma4TextExperts(
  (act_fn): GELUTanh()
)
2026-04-03 09:35:42 WARNING modeling_utils.py L4432: `loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.
2026-04-03 09:35:42 INFO device.py L1690: Before applying custom replacements 'peak_ram': 3.72GB, 'peak_vram': 48.32GB
2026-04-03 09:35:53 INFO moe_experts_interface.py L642: [MoE Prep] Unfused 30 MOE experts modules
2026-04-03 09:35:53 INFO device.py L1690: After applying custom replacements 'peak_ram': 46.27GB, 'peak_vram': 48.32GB
2026-04-03 09:35:53 INFO replace_modules.py L80: Prepared 30 MOE modules for quantization
2026-04-03 09:35:53 INFO replace_modules.py L107: Experts (after replacement) [model.language_model.layers.0.experts] (Gemma4TextExperts):
Gemma4TextExperts(
  (act_fn): GELUTanh()
  (0-127): 128 x _ExpertContainer(
    (down_proj): Linear(in_features=704, out_features=2816, bias=False)
    (gate_proj): Linear(in_features=2816, out_features=704, bias=False)
    (up_proj): Linear(in_features=2816, out_features=704, bias=False)
  )
)
2026-04-03 09:35:55 INFO base.py L1818: start to cache block inputs
2026-04-03 09:35:55 INFO calib_dataset.py L912: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
README.md: 100%|██████████| 373/373 [00:00<00:00, 787kB/s]
dataset_infos.json: 100%|██████████| 921/921 [00:00<00:00, 2.65MB/s]
Map: 100%|██████████| 10000/10000 [00:38<00:00, 260.69 examples/s]
Filter: 100%|██████████| 10000/10000 [00:07<00:00, 1371.86 examples/s]
Casting the dataset: 100%|██████████| 1243/1243 [00:05<00:00, 223.56 examples/s]
cache block inputs: 100%|██████████| 128/128 [00:00<00:00, 258.52it/s]
2026-04-03 09:37:08 INFO base.py L1835: caching done
/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/torch/autograd/graph.py:865: UserWarning: Memory Efficient attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:897.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
quantized 392/392 layers in the block, loss iter 0: 0.009922 -> iter 197: 0.002370,'peak_ram': 53.63GB, 'peak_vram': 48.32GB
quantized 392/392 layers in the block, loss iter 0: 0.013414 -> iter 181: 0.004113,'peak_ram': 53.63GB, 'peak_vram': 48.32GB
  3%|▎         | 1/30 [04:54<2:22:32, 294.92s/it]
quantized 392/392 layers in the block, loss iter 0: 0.028987 -> iter 167: 0.008328,'peak_ram': 53.63GB, 'peak_vram': 48.32GB
  7%|▋         | 2/30 [11:06<2:38:46, 340.22s/it]
quantized 392/392 layers in the block, loss iter 0: 0.022758 -> iter 144: 0.006956,'peak_ram': 53.63GB, 'peak_vram': 48.32GB
 10%|█         | 3/30 [17:33<2:42:35, 361.32s/it]
quantized 392/392 layers in the block, loss iter 0: 0.018828 -> iter 152: 0.006226,'peak_ram': 53.63GB, 'peak_vram': 48.32GB
 13%|█▎        | 4/30 [24:19<2:44:16, 379.11s/it]
Traceback (most recent call last):
  File "/home/uttest/miniforge3/envs/autoround_test/bin/auto-round", line 10, in <module>
    sys.exit(run())
             ^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 822, in run
    start()
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 541, in start
    tune(args)
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 761, in tune
    model, folders = autoround.quantize_and_save(export_dir, format=args.format)  # pylint: disable=E1101
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/base.py", line 1018, in quantize_and_save
    model, _ = self.quantize()
               ^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/base.py", line 1867, in quantize
    self._quantize_blocks(
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/base.py", line 3303, in _quantize_blocks
    q_input, input_ids = self._quantize_block(
                         ^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/base.py", line 3011, in _quantize_block
    output = self._get_block_outputs(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/base.py", line 2044, in _get_block_outputs
    tmp_output = self.block_forward(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/utils.py", line 149, in block_forward
    output = block(input_ids, *input_tuple, **input_others)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/transformers/modeling_layers.py", line 93, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/transformers/models/gemma4/modeling_gemma4.py", line 1361, in forward
    hidden_states, _ = self.self_attn(
                       ^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/transformers/models/gemma4/modeling_gemma4.py", line 1194, in forward
    query_states = apply_rotary_pos_emb(query_states, cos, sin, unsqueeze_dim=2)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/transformers/models/gemma4/modeling_gemma4.py", line 753, in apply_rotary_pos_emb
    return (x * cos) + (rotate_half(x) * sin)
            ~~^~~~~
RuntimeError: The size of tensor a (512) must match the size of tensor b (256) at non-singleton dimension 3
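The failure is a plain broadcasting mismatch inside `x * cos` in `apply_rotary_pos_emb`: the query tensor's last dimension (512) does not match the last dimension of the rotary `cos`/`sin` tables (256), which suggests the cached block inputs carry position embeddings built for a different head/rotary dimension than the block being quantized. A minimal sketch reproducing the same error class (the shapes here are hypothetical, chosen only to mirror the 512-vs-256 mismatch at dim 3):

```python
import torch

# Hypothetical shapes mirroring the traceback: query_states with head_dim 512
# at dimension 3, while cos was built for a rotary dimension of 256.
q = torch.zeros(1, 8, 4, 512)    # e.g. (batch, seq, heads, head_dim)
cos = torch.zeros(1, 8, 1, 256)  # rotary table with a smaller last dim

err_msg = ""
try:
    _ = q * cos  # same elementwise multiply as `x * cos` in apply_rotary_pos_emb
except RuntimeError as e:
    err_msg = str(e)

print(err_msg)
```

Neither 512 nor 256 is 1, so broadcasting cannot reconcile dimension 3 and torch raises the same "size of tensor a (512) must match the size of tensor b (256)" error seen above.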

Reproduction Steps

auto-round --model_name google/gemma-4-26B-A4B --iters 200 --bits 4
