tracking torch.compile compatibility with cpu offloading #10612

Open · 1 task done · May be fixed by #11622
youkaichao opened this issue Nov 25, 2024 · 3 comments
Labels: bug (Something isn't working)

@youkaichao (Member)

Your current environment

N/A

Model Input Dumps

No response

🐛 Describe the bug

When we use cpu offloading together with torch.compile, it will error:

torch._dynamo.exc.Unsupported: builtin: setattr [<class 'torch._dynamo.variables.dicts.ConstDictVariable'>, <class 'torch._dynamo.variables.constant.ConstantVariable'>, <class 'torch._dynamo.variables.dicts.ConstDictVariable'>] False

The error is caused by this line:

for k, v in module.state_dict().items()

Creating a state dict during the forward pass is something Dynamo cannot trace, so it errors.
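For reference, here is a minimal sketch of the failing pattern (not vLLM's actual offloading code; the hook and module below are made up for illustration): a forward pre-hook that rebuilds the state dict on every call, run under torch.compile.

    import torch

    class Layer(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(16, 16)

        def forward(self, x):
            return self.linear(x)

    def offload_hook(module, args):
        # Rebuilding the state dict on every forward call is the part
        # Dynamo cannot trace.
        for k, v in module.state_dict().items():
            pass  # e.g. copy the offloaded tensor back onto the GPU here
        return args

    layer = Layer()
    layer.register_forward_pre_hook(offload_hook)
    compiled = torch.compile(layer, fullgraph=True)
    # Expected to raise torch._dynamo.exc.Unsupported (or to graph-break
    # without fullgraph=True) on recent PyTorch versions.
    compiled(torch.randn(1, 16))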

I tried another approach using tensor subclasses in #10609. It works well for unquantized models, but does not work for quantized models.

The problem with quantized models is that we have some classes that inherit from torch.nn.Parameter, e.g.

class BasevLLMParameter(Parameter):

Combining tensor subclasses and Parameter subclasses is a known problem in PyTorch; see https://github.com/albanD/subclass_zoo/blob/main/custom_parameter.py for an example.
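For illustration, a simplified stand-in for the Parameter-subclass pattern (not vLLM's actual class definition): quantization wants Parameter subclasses that carry weight-loading metadata, while the approach in #10609 wants the underlying data to be a Tensor subclass, and combining the two is the fragile part.

    import torch
    from torch.nn import Parameter

    class MyQuantParameter(Parameter):
        """Simplified stand-in for BasevLLMParameter: a Parameter subclass
        that carries weight-loading metadata alongside the data."""
        def __new__(cls, data, weight_loader=None):
            obj = super().__new__(cls, data, requires_grad=False)
            obj.weight_loader = weight_loader
            return obj

    # Works on its own, but wrapping a Tensor subclass (as the offloading
    # approach in #10609 does) inside such a Parameter subclass is what is
    # known to be problematic.
    w = MyQuantParameter(torch.empty(4, 4, dtype=torch.int32),
                         weight_loader=print)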

To make torch.compile compatible with cpu offloading and quantization, we need to refactor the weight loading logic and how we create/store weights.

Take the GPTQ linear layer for example:

        qweight = PackedvLLMParameter(
            data=torch.empty(
                input_size_per_partition // self.quant_config.pack_factor,
                output_size_per_partition,
                dtype=torch.int32,
            ),
            input_dim=0,
            output_dim=1,
            packed_dim=0,
            packed_factor=self.quant_config.pack_factor,
            weight_loader=weight_loader)

We should avoid using nn.Parameter and instead register the tensor directly as a buffer:

self.qweight = torch.empty(
    input_size_per_partition // self.quant_config.pack_factor,
    output_size_per_partition,
    dtype=torch.int32,
)
self.qweight.weight_loader = weight_loader
self.qweight_packed_factor = self.quant_config.pack_factor

The key ideas are:

  1. No nn.Parameter, no class inheritance. We directly assign a tensor to the module. The tensor should be registered as a buffer.
  2. Every tensor can have a weight_loader attribute. We can bind arguments needed for weight loading, e.g. self.qweight.weight_loader = partial(generic_weight_loader, args)
  3. For the information we need at runtime (i.e. when computing the forward pass), we store it on the module object, e.g. self.qweight_packed_factor = self.quant_config.pack_factor.

With all these changes, we should be able to use cpu offloading with quantization and torch.compile. A minimal sketch of the proposed layout follows.
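To make the three points above concrete, here is a runnable sketch of the proposed layout, assuming register_buffer as the registration mechanism; the names GPTQLinearSketch and generic_weight_loader are hypothetical and only for illustration.

    from functools import partial

    import torch

    def generic_weight_loader(param: torch.Tensor, loaded_weight: torch.Tensor,
                              input_dim: int, output_dim: int,
                              packed_dim: int, packed_factor: int) -> None:
        # Hypothetical loader: a real one would use input_dim/output_dim/
        # packed_dim/packed_factor to place the checkpoint shard correctly.
        param.copy_(loaded_weight)

    class GPTQLinearSketch(torch.nn.Module):
        def __init__(self, input_size_per_partition: int,
                     output_size_per_partition: int, pack_factor: int):
            super().__init__()
            # 1. A plain tensor registered as a buffer, no nn.Parameter
            #    subclass and no class inheritance.
            self.register_buffer(
                "qweight",
                torch.empty(input_size_per_partition // pack_factor,
                            output_size_per_partition,
                            dtype=torch.int32))
            # 2. Bind the loading metadata into the weight loader itself.
            self.qweight.weight_loader = partial(
                generic_weight_loader,
                input_dim=0, output_dim=1, packed_dim=0,
                packed_factor=pack_factor)
            # 3. Keep the information needed at runtime on the module object.
            self.qweight_packed_factor = pack_factor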

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@youkaichao (Member, Author)

cc @dsikka

@robertgshaw2-neuralmagic (Collaborator)

Hey @youkaichao - are there any specific quantization methods that are failing? We ran into this problem when originally refactoring the quantization parameters. Inside process_weights_after_loading, we actually save the loaded weights into a raw nn.Parameter - is this code not run during cpu offloading?

@youkaichao (Member, Author)

The issue is quite complicated.

  1. The current cpu offloading method is not compatible with torch.compile. This is independent of quantization: any user of cpu offloading cannot use torch.compile.
  2. To solve the compatibility problem, I'm investigating the tensor subclasses approach in [core] improve cpu offloading implementation #10609. It works for unquantized models, but not for quantized ones.
  3. To get full compatibility of torch.compile + cpu offloading + quantization, we need to refactor the quantization weight-loading logic.

we actually save the loaded weights into a raw nn.Parameter

Although we reset the loaded weights into a raw nn.Parameter later, it turns out that tensor subclass initialization is quite complicated. The moment we create a PackedvLLMParameter, things happen that we don't want.
