I thought we handled this already? All the FP8 checkpoints have separate QKV scales, and we merge them after weight loading. Is there anything special about Phi-3?
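For context, a minimal sketch of the kind of post-load scale merging the comment refers to (this is not vLLM's actual implementation; the function name, inputs, and the max-scale requantization strategy are assumptions for illustration):

```python
import torch

# Illustrative sketch only: fold separate q/k/v FP8 scales into one shared
# scale after weight loading by requantizing every shard against the largest
# scale. Names and strategy are assumptions, not vLLM's code.
def merge_shard_scales(shard_weights, shard_scales):
    max_scale = torch.stack(shard_scales).max()
    requantized = []
    for w, scale in zip(shard_weights, shard_scales):
        w_fp32 = w.to(torch.float32) * scale      # dequantize with the shard's own scale
        requantized.append(w_fp32 / max_scale)    # requantize against the shared scale
    # A real implementation would cast the result back to an FP8 dtype here.
    return torch.cat(requantized, dim=0), max_scale
```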
Your current environment
🐛 Describe the bug
Description:
When loading FP8 quantized models with merged linear modules (e.g., Phi-3 with merged qkv_proj and up_gate_proj), the scales for each shard are not handled correctly. This occurs because the vLLM FP8 config assumes separate scales for each shard, but merged layers have a single scale.
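As a hypothetical illustration of the mismatch (the tensor names and values below are made up for clarity), the checkpoint ships one scale for the merged qkv_proj while the layer tracks one scale per logical shard; repeating the merged scale gives every shard the same value:

```python
import torch

# Checkpoint stores a single scale for the merged qkv_proj weight,
# but the merged layer expects one scale per logical shard (q, k, v).
checkpoint_scale = torch.tensor([0.0213])   # single merged scale from the checkpoint
num_shards = 3                              # q, k, v

# Repeating the merged scale fills every shard slot with the same value.
per_shard_scales = checkpoint_scale.repeat(num_shards)
print(per_shard_scales)                     # tensor([0.0213, 0.0213, 0.0213])
```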
Steps to Reproduce:
Expected Behavior:
Scales should be correctly loaded for merged linear modules in FP8 checkpoints.
Proposed Fix:
Modify process_weights_after_loading in MergedColumnParallelLinear and QKVParallelLinear to repeat the merged scale during weight loading.

Temporary Workaround:
Apply the following patch in vllm/model_executor/layers/linear.py:

cc @robertgshaw2-neuralmagic @comaniac
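The patch itself is not reproduced above. As a rough sketch of the idea in the proposed fix (not the author's actual diff; the function name and parameter shapes are assumptions), repeating a single merged scale so each shard slot receives a copy could look like:

```python
import torch

# Sketch only: when the checkpoint provides one scale for a merged module
# (e.g. qkv_proj), copy it into every shard slot of the layer's per-shard
# scale parameter; otherwise copy the per-shard scales through unchanged.
def load_merged_scale(per_shard_scale_param: torch.Tensor,
                      loaded_scale: torch.Tensor) -> None:
    if loaded_scale.numel() == 1 and per_shard_scale_param.numel() > 1:
        # Single merged scale in the checkpoint: repeat it for every shard.
        per_shard_scale_param.copy_(loaded_scale.expand_as(per_shard_scale_param))
    else:
        per_shard_scale_param.copy_(loaded_scale)

# Usage with hypothetical values: three shard slots, one merged scale.
scales = torch.empty(3)
load_merged_scale(scales, torch.tensor([0.0213]))
print(scales)  # tensor([0.0213, 0.0213, 0.0213])
```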