Description
I have quantized the GLM-4.5-Air model with both AutoRound and GPTQModel. During quantization I skipped the `shared_experts` layers in both configurations. The GPTQModel quant loads in vLLM, but the AutoRound quant does not load into vLLM and fails with errors.
The initial error:
```
(VllmWorker rank=3 pid=35201) INFO 08-26 12:47:02 [gptq_marlin.py:266] Using ExllamaLinearKernel for GPTQMarlinLinearMethod
Loading safetensors checkpoint shards: 0% Completed | 0/23 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511] WorkerProc failed to start.
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511] Traceback (most recent call last):
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/v1/executor/multiproc_executor.py", line 485, in worker_main
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/v1/executor/multiproc_executor.py", line 382, in __init__
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     self.worker.load_model()
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/v1/worker/gpu_worker.py", line 201, in load_model
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/v1/worker/gpu_model_runner.py", line 1876, in load_model
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     self.model = model_loader.load_model(
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     self.load_weights(model, model_config)
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/model_loader/default_loader.py", line 259, in load_weights
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     loaded_weights = model.load_weights(
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]                      ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/models/glm4_moe.py", line 673, in load_weights
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     return loader.load_weights(weights)
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/models/utils.py", line 291, in load_weights
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/models/utils.py", line 249, in _load_module
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     yield from self._load_module(prefix,
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     loaded_params = module_load_weights(weights)
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]   File "/home/ubuntu/git/avtc/vllm/vllm/model_executor/models/glm4_moe.py", line 565, in load_weights
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]     param = params_dict[name]
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511]             ~~~~~~~~~~~^^^^^^
(VllmWorker rank=0 pid=35198) ERROR 08-26 12:47:03 [multiproc_executor.py:511] KeyError: 'layers.29.mlp.shared_experts.down_proj.weight'
```
I have compared the two quants' metadata and found that the AutoRound quant has `model.layers.1.mlp.gate.e_score_correction_bias` (and all other modules with `e_score_correction_bias`) of type `torch.bfloat16`, while the GPTQModel quant has type `torch.float32`.
I have also compared the configs and tried using the GPTQModel quantize config for the AutoRound quant, and it works. In fact, adding this property is what made it load:
```json
"dynamic": {
    "-:.*shared_experts": {},
    "-:.*shared_head": {},
    "-:lm_head.weight": {},
    "-:model.embed_tokens.weight": {}
},
```
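As I understand the `dynamic` property, the `-:` prefix marks a regex as a negative rule: any module whose name matches is excluded from quantization, which is how vLLM learns that `shared_experts` (and the other listed modules) were kept unquantized. A rough illustration of that matching logic (my own sketch, not GPTQModel's actual implementation):

```python
import re

def is_quantized(module_name, dynamic):
    """Return False if any "-:"-prefixed regex in `dynamic` matches the module.

    `dynamic` is a dict like the one from the quantize config above:
    keys are "-:<regex>" entries, values are per-module overrides.
    """
    for rule in dynamic:
        if rule.startswith("-:") and re.match(rule[2:], module_name):
            return False  # negative rule: module stays unquantized
    return True
```

With the four rules from the config, `layers.29.mlp.shared_experts.down_proj` would be treated as unquantized, while the routed-expert projections would still be loaded as GPTQ weights.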
P.S. I padded the intermediate size to run with tp8. I also use `--enable-expert-parallel --dtype=float16` for vLLM; `--enable-expert-parallel` can be omitted if the MoE intermediate size is also padded.
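For reference, the padding I mean is rounding the intermediate size up so that each tensor-parallel shard is still a whole multiple of the quantization group size (commonly 128 for GPTQ; that rationale and the helper below are my assumptions, and the sizes in the test are examples, not GLM-4.5-Air's actual config values):

```python
def pad_intermediate(size, tp_size, group_size=128):
    """Round `size` up so each of `tp_size` shards is a multiple of `group_size`."""
    unit = tp_size * group_size
    # Ceiling division without floats: -(-a // b) == ceil(a / b).
    return -(-size // unit) * unit
```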