Hey @youkaichao - are there any specific quantization methods that are failing? We ran into this problem when originally refactoring the quantization parameters. Inside process_weights_after_loading, we actually save the loaded weights into a raw nn.Parameter - is this code not run during cpu offloading?
The current CPU offloading method is not compatible with torch.compile. This is not specific to quantization: any user of CPU offloading cannot use torch.compile.
To get full compatibility across torch.compile + CPU offloading + quantization, we need to refactor the quantization weight-loading logic.
> we actually save the loaded weights into a raw nn.Parameter
Although we reset the loaded weights into a raw nn.Parameter later, it turns out tensor subclass initialization is quite complicated. The moment we create PackedvLLMParameter, things happen that we don't want.
Your current environment
N/A
Model Input Dumps
No response
🐛 Describe the bug
When we use CPU offloading together with torch.compile, it will error. The error is caused by this line:

vllm/vllm/model_executor/models/utils.py, line 482 (commit 49628fe)

Creating a state dict during the forward pass will error.
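For illustration, here is a minimal standalone sketch (not the vLLM code) of the failure mode: a module whose forward pass builds a state dict, then compiled with torch.compile. Whether this hard-errors or only fails to capture a full graph can depend on the PyTorch version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OffloadedLinear(nn.Module):
    # Toy module that, like a CPU-offloading wrapper, rebuilds its weights
    # inside forward by materializing a state dict on the fly.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        sd = self.state_dict()  # state dict created during forward
        return F.linear(x, sd["linear.weight"], sd["linear.bias"])


compiled = torch.compile(OffloadedLinear(), fullgraph=True)
try:
    compiled(torch.randn(2, 8))
except Exception as exc:  # expected: Dynamo cannot trace state_dict() here
    print(f"{type(exc).__name__}: {exc}")
```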
I tried another approach using tensor subclasses in #10609. It works well for unquantized models, but does not work for quantized models.
The problem with quantized models is that we have some classes that inherit from torch.nn.Parameter, e.g.:

vllm/vllm/model_executor/parameter.py, line 19 (commit 49628fe)
Using both tensor subclasses and parameter subclasses together is a known problem in PyTorch; see https://github.com/albanD/subclass_zoo/blob/main/custom_parameter.py for an example.
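For context, here is a rough sketch of the pattern in question: a metadata-carrying subclass of torch.nn.Parameter, loosely modeled on vLLM's parameter classes. The class name and fields are illustrative, not the actual definitions.

```python
import torch
from torch import nn


class PackedParameter(nn.Parameter):
    # Illustrative stand-in for a vLLM-style parameter subclass that carries
    # quantization metadata (e.g. how values are packed along a dimension).
    def __new__(cls, data: torch.Tensor, packed_dim: int, packed_factor: int):
        # nn.Parameter customizes __new__, so subclasses must route data through it.
        self = super().__new__(cls, data, requires_grad=False)
        self.packed_dim = packed_dim
        self.packed_factor = packed_factor
        return self


qweight = PackedParameter(
    torch.empty(512 // 8, 1024, dtype=torch.int32), packed_dim=0, packed_factor=8
)
print(type(qweight).__name__, qweight.shape, qweight.packed_factor)
```

Combining such a Parameter subclass with a tensor subclass (as in the offloading approach above) is exactly the interaction the subclass_zoo example warns about.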
To make torch.compile compatible with CPU offloading and quantization, we need to refactor the weight-loading logic and how we create/store weights. Take the GPTQ linear layer as an example: we should avoid using nn.Parameter and directly register the tensor as a buffer (see the sketch after the list below). The key ideas are:
- No nn.Parameter, no class inheritance. We directly assign a tensor to the module. The tensor should be registered as a buffer.
- Every tensor can have a weight_loader attribute. We can bind arguments needed for weight loading, e.g. self.qweight.weight_loader = partial(generic_weight_loader, args).
- For the information we need during runtime (i.e. computing the forward pass), we store it in the module object, e.g. self.qweight_packed_factor = self.quant_config.pack_factor.
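As referenced above, here is a minimal sketch of what a GPTQ-style linear layer could look like under this scheme. It is not vLLM's actual implementation; QuantConfig, generic_weight_loader, and the tensor shapes are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from functools import partial

import torch
from torch import nn


@dataclass
class QuantConfig:
    # Stand-in for a GPTQ quantization config (illustrative, not vLLM's class).
    pack_factor: int = 8   # e.g. 8 x 4-bit values packed into one int32
    group_size: int = 128


def generic_weight_loader(param: torch.Tensor, loaded_weight: torch.Tensor) -> None:
    # Hypothetical loader: copy checkpoint data into the pre-allocated buffer.
    param.copy_(loaded_weight)


class GPTQLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, quant_config: QuantConfig):
        super().__init__()
        self.quant_config = quant_config

        # Key idea 1: no nn.Parameter, no class inheritance -- plain tensors
        # registered as buffers.
        self.register_buffer(
            "qweight",
            torch.empty(
                in_features // quant_config.pack_factor, out_features, dtype=torch.int32
            ),
        )
        self.register_buffer(
            "scales",
            torch.empty(
                in_features // quant_config.group_size, out_features, dtype=torch.float16
            ),
        )

        # Key idea 2: bind the arguments needed for weight loading onto each tensor.
        self.qweight.weight_loader = partial(generic_weight_loader, self.qweight)
        self.scales.weight_loader = partial(generic_weight_loader, self.scales)

        # Key idea 3: runtime metadata lives on the module, not on a Parameter subclass.
        self.qweight_packed_factor = quant_config.pack_factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Placeholder: a real implementation would dequantize qweight with scales
        # (or call a fused GPTQ kernel) and apply the linear transformation.
        raise NotImplementedError("illustrative sketch only")
```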
With all these changes, we should be able to use CPU offloading with quantization and torch.compile.