Conversation

@evgri243 (Contributor) commented Oct 23, 2025

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Docs change / refactoring / dependency upgrade

Motivation and Context / Related issue

This PR adds support for multi-device training scenarios where model parameters are distributed across multiple GPU devices (e.g., when layers are assigned to devices directly via module.to(device[i]), or when using device_map="auto" with Accelerate).

Problem solved:
When training large models that don't fit on a single GPU, parameters and gradients can be spread across multiple devices. The existing Opacus optimizers and gradient clipping modules assumed all tensors were on the same device, causing runtime errors during norm computation and gradient clipping operations.

Changes:

  1. Multi-device support in optimizers: Modified DPOptimizer and AdaClipDPOptimizer to move tensors to appropriate devices before operations such as torch.stack() and torch.einsum(), preventing device mismatch errors during gradient clipping and accumulation (a sketch of this pattern follows the list).

  2. Multi-device support in fast gradient clipping: Extended the same handling to GradSampleModuleFastGradientClipping and DPPerLayerOptimizer, and covered additional optimizer edge cases that were previously missed.
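
To illustrate the kind of fix involved, here is a minimal sketch of the device-alignment pattern (illustrative only, not the actual Opacus code; the helper name and arguments are made up): per-parameter norm vectors that live on different GPUs are moved to one device before torch.stack(), so per-sample gradient norms can be computed for a model split across devices.

import torch

def per_sample_norms_multidevice(grad_samples, target_device=None):
    # Per-parameter, per-sample gradient norms; each tensor may live on a
    # different GPU when the model's layers are spread across devices.
    per_param_norms = [g.reshape(len(g), -1).norm(2, dim=-1) for g in grad_samples]
    # Move the (small) per-parameter norm vectors to one common device before
    # stacking, so torch.stack() does not raise a device mismatch error.
    if target_device is None:
        target_device = per_param_norms[0].device
    per_param_norms = [n.to(target_device) for n in per_param_norms]
    return torch.stack(per_param_norms, dim=1).norm(2, dim=1)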

How Has This Been Tested

  • The code was used to train a 7B Zetta model with LoRA on an 8xH200 GPU node.
  • Added test suite in multidevice_optimizer_test.py covering:
    • DPOptimizer, AdaClipDPOptimizer, and DPPerLayerOptimizer with multi-device models
    • Both clip_and_accumulate() and full step() operations
    • Helper function _clip_and_accumulate_parameter() with multi-device parameters
  • Added additional tests in grad_sample_module_fast_gradient_clipping_test.py for:
    • get_norm_sample() with parameters on different devices
    • get_clipping_coef() with parameters on different devices
  • All tests require at least 2 GPUs and verify that operations complete without device mismatch errors while maintaining correctness (a minimal sketch of such a setup appears below).
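
For reference, a minimal sketch of the kind of multi-device setup the tests exercise, assuming at least two CUDA devices; the model, sizes, and hyperparameters are illustrative and not taken from the actual test code:

import torch
import torch.nn as nn
from opacus import GradSampleModule
from opacus.optimizers import DPOptimizer

# Toy model with layers pinned to two different GPUs.
class TwoDeviceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16).to("cuda:0")
        self.fc2 = nn.Linear(16, 2).to("cuda:1")

    def forward(self, x):
        x = self.fc1(x.to("cuda:0"))
        return self.fc2(x.to("cuda:1"))

model = GradSampleModule(TwoDeviceNet())
optimizer = DPOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.1),
    noise_multiplier=1.0,
    max_grad_norm=1.0,
    expected_batch_size=4,
    loss_reduction="sum",
)

x = torch.randn(4, 8)
loss = model(x).sum()
loss.backward()
optimizer.step()  # clipping must handle grad samples living on different devices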

Checklist

  • The documentation is up-to-date with the changes I made.
  • I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
  • All tests passed, and additional code has been covered with new tests.

meta-cla bot added the CLA Signed label on Oct 23, 2025.
meta-codesync bot commented Oct 23, 2025

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D85355821. (Because this pull request was imported automatically, there will not be any future comments.)

@iden-kalemaj (Contributor) commented:

Hi there, I'd like to better understand your use case. Did you try using FSDP (fully sharded data parallel) to train the 7B model? https://pytorch.org/blog/enabling-fully-sharded-data-parallel-fsdp2-in-opacus/. We've used it to train similar-sized models with similar GPU resources.

If you did try it, were there any issues or gaps? If not, can FSDP fill the same need for your use case?

iden-kalemaj self-assigned this on Oct 23, 2025
evgri243 changed the title from "Evgri243/multi device models" to "Enable multi-device model support" on Oct 24, 2025
@evgri243 (Contributor, Author) commented Oct 24, 2025

We are slowly getting there.

The product team (we are Research) is using the TRL library, which is heavily built around Transformers' Trainer. So far we have struggled to make FastGradientClippingTensor work with the custom losses used by the DPO and KTO trainers, although we are working on that.

Meanwhile, at least this much works: it allows, albeit very inefficiently, the following usage:

from transformers import AutoModel
from trl import KTOTrainer

model = AutoModel.from_pretrained(model_name, device_map="auto")

trainer = KTOTrainer(model, ...)
trainer.train()  # training data is provided to the trainer constructor

It is inefficient to the point that only 1 of the 8 GPUs is active at any given time, but it at least manages to run KTO with the minimum recommended batch size of 16. We hope to come up with a better solution soon; in the meantime, this execution mode is quite simple to support and doesn't affect the code much.

@evgri243 (Contributor, Author) commented Oct 24, 2025

@iden-kalemaj, a question related to my other PR: as we are exploring wrapper-less methods, and assuming there are more seasoned PyTorch experts in meta-pytorch, is there any feasible way to implement ghost clipping the PyTorch way but without loss wrapping? Loss wrapping causes another set of problems :( The method itself and the idea are amazing, by the way!

While we were quite successful with GradSampleModule, this one is harder to implement an alternative for. Are there other workable options, such as custom functions with backward, modules, hooks, or a PyTorch feature request to upvote, that would allow a double backward pass without major side effects on autograd? We are willing to attempt an implementation; we just need an idea.

Take KTO as an example: https://github.com/huggingface/trl/blob/05a1feb05010b4c321abce481bc43de6ec366d48/trl/trainer/kto_trainer.py#L1119. Any idea on how to make them work together is welcome.

@coveralls commented:

Pull Request Test Coverage Report for Build 18756613059

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 40 of 252 (15.87%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-2.2%) to 78.167%

Changes Missing Coverage                                           Covered Lines   Changed/Added Lines   %
opacus/optimizers/adaclipoptimizer.py                              0               5                     0.0%
opacus/tests/grad_sample_module_fast_gradient_clipping_test.py    4               60                    6.67%
opacus/tests/multidevice_optimizer_test.py                         26              177                   14.69%

Totals Coverage Status

  • Change from base Build 18769308800: -2.2%
  • Covered Lines: 5671
  • Relevant Lines: 7255

💛 - Coveralls

@iden-kalemaj (Contributor) commented:

Thank you for explaining the use case. This makes sense to me, although it will probably have limited use given that it is not data parallel. I will approve this change.

Regarding your second question on ghost clipping, can you open an issue so that we can discuss it there?

@iden-kalemaj (Contributor) left a review comment:

Review automatically exported from Phabricator review in Meta.

meta-codesync bot commented Oct 27, 2025

This pull request has been merged in 7dbbb40.
