Hey @youkaichao - are there any specific quantization methods that are failing? We ran into this problem when originally refactoring the quantization parameters. Inside process_weights_after_loading, we actually save the loaded weights into a raw nn.Parameter - is this code not run during cpu offloading?
The current CPU offloading method is not compatible with torch.compile. This is not specific to quantization: any user of CPU offloading cannot use torch.compile.
To get full compatibility across torch.compile + CPU offloading + quantization, we need to refactor the quantization weight-loading logic.
> we actually save the loaded weights into a raw nn.Parameter
Although we reset the loaded weights into a raw nn.Parameter later, it turns out tensor subclass initialization is quite complicated. The moment we create PackedvLLMParameter, things happen that we don't want.
Your current environment
N/A
Model Input Dumps
No response
🐛 Describe the bug
When we use CPU offloading together with torch.compile, it will error. The error is caused by this line:

vllm/vllm/model_executor/models/utils.py, line 482 (commit 49628fe)

Creating a state dict during the forward pass will error.
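For illustration, here is a minimal standalone sketch (not the vLLM code) of the failure mode: a module whose forward pass builds a state dict, then compiled with torch.compile. Whether this hard-errors or only fails to capture a full graph can depend on the PyTorch version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OffloadedLinear(nn.Module):
    # Toy module that, like a CPU-offloading wrapper, rebuilds its weights
    # inside forward by materializing a state dict on the fly.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        sd = self.state_dict()  # state dict created during forward
        return F.linear(x, sd["linear.weight"], sd["linear.bias"])


compiled = torch.compile(OffloadedLinear(), fullgraph=True)
try:
    compiled(torch.randn(2, 8))
except Exception as exc:  # expected: Dynamo cannot trace state_dict() here
    print(f"{type(exc).__name__}: {exc}")
```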
I tried another approach using tensor subclasses in #10609. It works well for unquantized models, but does not work for quantized models.
The problem with quantized models is that we have some classes that inherit from torch.nn.Parameter, e.g.:

vllm/vllm/model_executor/parameter.py, line 19 (commit 49628fe)
Using both tensor subclasses and parameter subclasses together is a known problem in PyTorch; see https://github.com/albanD/subclass_zoo/blob/main/custom_parameter.py for an example.
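For context, here is a rough sketch of the pattern in question: a metadata-carrying subclass of torch.nn.Parameter, loosely modeled on vLLM's parameter classes. The class name and fields are illustrative, not the actual definitions.

```python
import torch
from torch import nn


class PackedParameter(nn.Parameter):
    # Illustrative stand-in for a vLLM-style parameter subclass that carries
    # quantization metadata (e.g. how values are packed along a dimension).
    def __new__(cls, data: torch.Tensor, packed_dim: int, packed_factor: int):
        # nn.Parameter customizes __new__, so subclasses must route data through it.
        self = super().__new__(cls, data, requires_grad=False)
        self.packed_dim = packed_dim
        self.packed_factor = packed_factor
        return self


qweight = PackedParameter(
    torch.empty(512 // 8, 1024, dtype=torch.int32), packed_dim=0, packed_factor=8
)
print(type(qweight).__name__, qweight.shape, qweight.packed_factor)
```

Combining such a Parameter subclass with a tensor subclass (as in the offloading approach above) is exactly the interaction the subclass_zoo example warns about.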
To make torch.compile compatible with CPU offloading and quantization, we need to refactor the weight-loading logic and how we create/store weights. Take the GPTQ linear layer as an example: we should avoid using nn.Parameter and directly register the tensor as a buffer (see the sketch after the list below). The key ideas are:
- No nn.Parameter, no class inheritance. We directly assign a tensor to the module. The tensor should be registered as a buffer.
- Every tensor can have a weight_loader attribute. We can bind arguments needed for weight loading, e.g. self.qweight.weight_loader = partial(generic_weight_loader, args).
- For the information we need during runtime (i.e. computing the forward pass), we store it in the module object, e.g. self.qweight_packed_factor = self.quant_config.pack_factor.
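As referenced above, here is a minimal sketch of what a GPTQ-style linear layer could look like under this scheme. It is not vLLM's actual implementation; QuantConfig, generic_weight_loader, and the tensor shapes are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from functools import partial

import torch
from torch import nn


@dataclass
class QuantConfig:
    # Stand-in for a GPTQ quantization config (illustrative, not vLLM's class).
    pack_factor: int = 8   # e.g. 8 x 4-bit values packed into one int32
    group_size: int = 128


def generic_weight_loader(param: torch.Tensor, loaded_weight: torch.Tensor) -> None:
    # Hypothetical loader: copy checkpoint data into the pre-allocated buffer.
    param.copy_(loaded_weight)


class GPTQLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, quant_config: QuantConfig):
        super().__init__()
        self.quant_config = quant_config

        # Key idea 1: no nn.Parameter, no class inheritance -- plain tensors
        # registered as buffers.
        self.register_buffer(
            "qweight",
            torch.empty(
                in_features // quant_config.pack_factor, out_features, dtype=torch.int32
            ),
        )
        self.register_buffer(
            "scales",
            torch.empty(
                in_features // quant_config.group_size, out_features, dtype=torch.float16
            ),
        )

        # Key idea 2: bind the arguments needed for weight loading onto each tensor.
        self.qweight.weight_loader = partial(generic_weight_loader, self.qweight)
        self.scales.weight_loader = partial(generic_weight_loader, self.scales)

        # Key idea 3: runtime metadata lives on the module, not on a Parameter subclass.
        self.qweight_packed_factor = quant_config.pack_factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Placeholder: a real implementation would dequantize qweight with scales
        # (or call a fused GPTQ kernel) and apply the linear transformation.
        raise NotImplementedError("illustrative sketch only")
```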
With all these changes, we should be able to use CPU offloading with quantization and torch.compile.