I thought we handled this already? All the FP8 checkpoints have separate QKV scales, and we merge them after weight loading. Is there anything special about Phi-3?
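For context, a minimal sketch of the kind of post-load scale merging the comment refers to (this is not vLLM's actual implementation; the function name, inputs, and the max-scale requantization strategy are assumptions for illustration):

```python
import torch

# Illustrative sketch only: fold separate q/k/v FP8 scales into one shared
# scale after weight loading by requantizing every shard against the largest
# scale. Names and strategy are assumptions, not vLLM's code.
def merge_shard_scales(shard_weights, shard_scales):
    max_scale = torch.stack(shard_scales).max()
    requantized = []
    for w, scale in zip(shard_weights, shard_scales):
        w_fp32 = w.to(torch.float32) * scale      # dequantize with the shard's own scale
        requantized.append(w_fp32 / max_scale)    # requantize against the shared scale
    # A real implementation would cast the result back to an FP8 dtype here.
    return torch.cat(requantized, dim=0), max_scale
```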
Your current environment
🐛 Describe the bug
Description:
When loading FP8 quantized models with merged linear modules (e.g., Phi-3 with merged qkv_proj and up_gate_proj), the scales for each shard are not handled correctly. This occurs because the vLLM FP8 config assumes separate scales for each shard, but merged layers have a single scale.
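As a hypothetical illustration of the mismatch (the tensor names and values below are made up for clarity), the checkpoint ships one scale for the merged qkv_proj while the layer tracks one scale per logical shard; repeating the merged scale gives every shard the same value:

```python
import torch

# Checkpoint stores a single scale for the merged qkv_proj weight,
# but the merged layer expects one scale per logical shard (q, k, v).
checkpoint_scale = torch.tensor([0.0213])   # single merged scale from the checkpoint
num_shards = 3                              # q, k, v

# Repeating the merged scale fills every shard slot with the same value.
per_shard_scales = checkpoint_scale.repeat(num_shards)
print(per_shard_scales)                     # tensor([0.0213, 0.0213, 0.0213])
```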
Steps to Reproduce:
Expected Behavior:
Scales should be correctly loaded for merged linear modules in FP8 checkpoints.
Proposed Fix:
Modify process_weights_after_loading in MergedColumnParallelLinear and QKVParallelLinear to repeat the merged scale during weight loading.

Temporary Workaround:
Apply the following patch in vllm/model_executor/layers/linear.py:

cc @robertgshaw2-neuralmagic @comaniac
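The patch itself is not reproduced above. As a rough sketch of the idea in the proposed fix (not the author's actual diff; the function name and parameter shapes are assumptions), repeating a single merged scale so each shard slot receives a copy could look like:

```python
import torch

# Sketch only: when the checkpoint provides one scale for a merged module
# (e.g. qkv_proj), copy it into every shard slot of the layer's per-shard
# scale parameter; otherwise copy the per-shard scales through unchanged.
def load_merged_scale(per_shard_scale_param: torch.Tensor,
                      loaded_scale: torch.Tensor) -> None:
    if loaded_scale.numel() == 1 and per_shard_scale_param.numel() > 1:
        # Single merged scale in the checkpoint: repeat it for every shard.
        per_shard_scale_param.copy_(loaded_scale.expand_as(per_shard_scale_param))
    else:
        per_shard_scale_param.copy_(loaded_scale)

# Usage with hypothetical values: three shard slots, one merged scale.
scales = torch.empty(3)
load_merged_scale(scales, torch.tensor([0.0213]))
print(scales)  # tensor([0.0213, 0.0213, 0.0213])
```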