
GLM-4.5-Air quantized model failed to load in VLLM - with fix #769

@avtc

Description


I quantized the GLM-4.5-Air model with AutoRound and with GPTQModel, skipping the shared_experts layers in both configurations.
The GPTQModel quant loads in vLLM.
The AutoRound quant does not load in vLLM and fails with the errors below.
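For context, skipping shared_experts typically means marking those modules as not-to-quantize in the tool's per-layer configuration. A rough sketch of building such a skip list; the layer count, module names, and the `{"bits": 16}` (keep full precision) convention are illustrative, not the exact AutoRound or GPTQModel API:

```python
# Hypothetical sketch of a per-layer skip list for shared_experts.
num_layers = 46  # illustrative, not read from the GLM-4.5-Air config
projs = ("gate_proj", "up_proj", "down_proj")
layer_config = {
    f"model.layers.{i}.mlp.shared_experts.{p}": {"bits": 16}
    for i in range(num_layers) for p in projs
}
print(len(layer_config))  # -> 138
```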

The initial error:

(VllmWorker rank=3 pid=35201) INFO 08-26 12:47:02 [gptq_marlin.py:266] Using ExllamaLinearKernel for GPTQMarlinLinearMethod
Loading safetensors checkpoint shards:   0% Completed | 0/23 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511] WorkerProc failed to start.
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511] Traceback (most recent call last):
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/v1/executor/multiproc_executor.py", line 485, in worker_main
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/v1/executor/multiproc_executor.py", line 382, in __init__
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     self.worker.load_model()
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/v1/worker/gpu_worker.py", line 201, in load_model
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/v1/worker/gpu_model_runner.py", line 1876, in load_model
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     self.model = model_loader.load_model(
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     self.load_weights(model, model_config)
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/model_loader/default_loader.py", line 259, in load_weights
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     loaded_weights = model.load_weights(
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]                      ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/models/glm4_moe.py", line 673, in load_weights
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     return loader.load_weights(weights)
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/models/utils.py", line 291, in load_weights
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/models/utils.py", line 249, in _load_module
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     yield from self._load_module(prefix,
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     loaded_params = module_load_weights(weights)
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/models/glm4_moe.py", line 565, in load_weights
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     param = params_dict[name]
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]             ~~~~~~~~~~~^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511] KeyError: 'layers.29.mlp.shared_experts.down_proj.weight'
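One way to diagnose a mismatch like this is to inspect tensor names and dtypes directly from the safetensors headers, without loading any shard data. A minimal stdlib-only sketch following the published safetensors layout (an 8-byte little-endian length prefix followed by a JSON header); the file built here is synthetic, just to make the example self-contained:

```python
import json
import os
import struct
import tempfile

def read_safetensors_dtypes(path):
    """Return {tensor_name: dtype} by parsing only the .safetensors
    header (8-byte LE length prefix + JSON), never the tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return {name: info["dtype"] for name, info in header.items()
            if name != "__metadata__"}

# Build a tiny synthetic shard for demonstration: one BF16 scalar tensor
# named like the mismatched module from this report.
header = {"model.layers.1.mlp.gate.e_score_correction_bias":
          {"dtype": "BF16", "shape": [1], "data_offsets": [0, 2]}}
blob = json.dumps(header).encode("utf-8")
with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as f:
    f.write(struct.pack("<Q", len(blob)) + blob + b"\x00\x00")
    path = f.name

dtypes = read_safetensors_dtypes(path)
print(dtypes)  # {'model.layers.1.mlp.gate.e_score_correction_bias': 'BF16'}
os.remove(path)
```

Running this over each shard of both quants makes dtype differences like the one found below easy to spot.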
  1. I compared the quants' metadata and found that in the AutoRound quant, model.layers.1.mlp.gate.e_score_correction_bias (and every other module with e_score_correction_bias) has dtype torch.bfloat16, while in the GPTQModel quant it is torch.float32.

  2. I compared the configs and tried the GPTQModel quantize config on the AutoRound quant, and it works.
    In fact, adding this one property made it work:

    "dynamic": {
      "-:.*shared_experts": {},
      "-:.*shared_head": {},
      "-:lm_head.weight": {},
      "-:model.embed_tokens.weight": {}
    },
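As I understand it, the `-:` prefix in GPTQModel's `dynamic` config marks a negative rule: modules whose names match the regex are excluded from quantization and kept in full precision, which is how vLLM learns to expect plain `weight` tensors for shared_experts. A minimal sketch of how such patterns partition module names (module names here are illustrative):

```python
import re

# Regex bodies of the "-:" (exclude-from-quantization) rules above.
skip_patterns = [r".*shared_experts", r".*shared_head",
                 r"lm_head\.weight", r"model\.embed_tokens\.weight"]

def is_quantized(name):
    """A module is quantized unless some skip pattern matches its name."""
    return not any(re.match(p, name) for p in skip_patterns)

modules = [
    "model.layers.29.mlp.shared_experts.down_proj",  # excluded
    "model.layers.29.mlp.experts.0.down_proj",       # quantized
    "lm_head.weight",                                # excluded
]
for m in modules:
    print(m, "->", "quantize" if is_quantized(m) else "skip")
```

Without these rules in the AutoRound config, vLLM assumes shared_experts are quantized and never registers their full-precision `weight` parameters, which is consistent with the KeyError above.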

P.S. I padded the intermediate size to run with tp=8. I also pass --enable-expert-parallel --dtype=float16 to vLLM; --enable-expert-parallel can be omitted if the MoE intermediate size is also padded.
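The padding itself is just rounding the dimension up so each tensor-parallel shard divides evenly; a sketch, assuming tp=8 and a GPTQ group size of 128 (the sizes below are illustrative, not read from the GLM-4.5-Air config):

```python
def pad_to_multiple(size, divisor):
    """Round size up to the next multiple of divisor (ceiling division)."""
    return -(-size // divisor) * divisor

# Each of the 8 shards should hold a whole number of group-size-128
# quantization groups, so pad to a multiple of tp * group_size.
tp, group_size = 8, 128
print(pad_to_multiple(10944, tp * group_size))  # -> 11264
```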

Metadata

Labels: bug (Something isn't working)
