Conversation

@evgri243 (Contributor) commented Oct 23, 2025

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Docs change / refactoring / dependency upgrade

Motivation and Context / Related issue

This PR adds support for multi-device training scenarios where model parameters are distributed across multiple GPU devices (e.g., when layers are assigned to devices directly via module.to(device[i]), or when using device_map="auto" with Accelerate).

Problem solved:
When training large models that don't fit on a single GPU, parameters and gradients can be spread across multiple devices. The existing Opacus optimizers and gradient clipping modules assumed all tensors were on the same device, causing runtime errors during norm computation and gradient clipping operations.

Changes:

  1. Multi-device support in optimizers: Modified DPOptimizer and AdaClipDPOptimizer to move tensors to appropriate devices before operations such as torch.stack() and torch.einsum(), preventing device mismatch errors during gradient clipping and accumulation (a sketch of this pattern follows the list).

  2. Multi-device support in fast gradient clipping: Extended the same handling to GradSampleModuleFastGradientClipping and DPPerLayerOptimizer, and covered additional optimizer edge cases that were previously missed.
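
To illustrate the kind of fix involved, here is a minimal sketch of the device-alignment pattern (illustrative only, not the actual Opacus code; the helper name and arguments are made up): per-parameter norm vectors that live on different GPUs are moved to one device before torch.stack(), so per-sample gradient norms can be computed for a model split across devices.

import torch

def per_sample_norms_multidevice(grad_samples, target_device=None):
    # Per-parameter, per-sample gradient norms; each tensor may live on a
    # different GPU when the model's layers are spread across devices.
    per_param_norms = [g.reshape(len(g), -1).norm(2, dim=-1) for g in grad_samples]
    # Move the (small) per-parameter norm vectors to one common device before
    # stacking, so torch.stack() does not raise a device mismatch error.
    if target_device is None:
        target_device = per_param_norms[0].device
    per_param_norms = [n.to(target_device) for n in per_param_norms]
    return torch.stack(per_param_norms, dim=1).norm(2, dim=1)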

How Has This Been Tested

  • The code was used to train a 7B Zetta model with LoRA on an 8xH200 GPU node.
  • Added test suite in multidevice_optimizer_test.py covering:
    • DPOptimizer, AdaClipDPOptimizer, and DPPerLayerOptimizer with multi-device models
    • Both clip_and_accumulate() and full step() operations
    • Helper function _clip_and_accumulate_parameter() with multi-device parameters
  • Added additional tests in grad_sample_module_fast_gradient_clipping_test.py for:
    • get_norm_sample() with parameters on different devices
    • get_clipping_coef() with parameters on different devices
  • All tests require at least 2 GPUs and verify that operations complete without device mismatch errors while maintaining correctness (a minimal sketch of such a setup appears below).
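
For reference, a minimal sketch of the kind of multi-device setup the tests exercise, assuming at least two CUDA devices; the model, sizes, and hyperparameters are illustrative and not taken from the actual test code:

import torch
import torch.nn as nn
from opacus import GradSampleModule
from opacus.optimizers import DPOptimizer

# Toy model with layers pinned to two different GPUs.
class TwoDeviceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16).to("cuda:0")
        self.fc2 = nn.Linear(16, 2).to("cuda:1")

    def forward(self, x):
        x = self.fc1(x.to("cuda:0"))
        return self.fc2(x.to("cuda:1"))

model = GradSampleModule(TwoDeviceNet())
optimizer = DPOptimizer(
    torch.optim.SGD(model.parameters(), lr=0.1),
    noise_multiplier=1.0,
    max_grad_norm=1.0,
    expected_batch_size=4,
    loss_reduction="sum",
)

x = torch.randn(4, 8)
loss = model(x).sum()
loss.backward()
optimizer.step()  # clipping must handle grad samples living on different devices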

Checklist

  • The documentation is up-to-date with the changes I made.
  • I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
  • All tests passed, and additional code has been covered with new tests.

meta-cla bot added the CLA Signed label on Oct 23, 2025.
meta-codesync bot commented Oct 23, 2025

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D85355821. (Because this pull request was imported automatically, there will not be any future comments.)

@iden-kalemaj (Contributor) commented:

Hi there, I'd like to better understand your use case. Did you try using FSDP (fully sharded data parallel) to train the 7B model? https://pytorch.org/blog/enabling-fully-sharded-data-parallel-fsdp2-in-opacus/. We've used it to train similar-sized models with similar GPU resources.

If you did try it, were there any issues or gaps? If not, can FSDP fill the same need for your use case?

iden-kalemaj self-assigned this on Oct 23, 2025
evgri243 changed the title from "Evgri243/multi device models" to "Enable multi-device model support" on Oct 24, 2025
@evgri243 (Contributor, Author) commented Oct 24, 2025

We are slowly getting there.

The product team (we are Research) is using the TRL library, which is heavily built around Transformers' Trainer. So far we have struggled to make FastGradientClippingTensor work with the custom losses used by the DPO and KTO trainers, although we are working on that.

Meanwhile, at least this much works: it allows, albeit very inefficiently, the following usage:

from transformers import AutoModel
from trl import KTOTrainer

model = AutoModel.from_pretrained(model_name, device_map="auto")

trainer = KTOTrainer(model, ...)
trainer.train()  # training data is provided to the trainer constructor

It is inefficient to the point that only 1 of the 8 GPUs is active at any given time, but it at least manages to run KTO with the minimum recommended batch size of 16. We hope to come up with a better solution soon; in the meantime, this execution mode is quite simple to support and doesn't affect the code much.

@evgri243 (Contributor, Author) commented Oct 24, 2025

@iden-kalemaj, a question related to my other PR: as we are exploring wrapper-less methods, and assuming there are more seasoned PyTorch experts in meta-pytorch, is there any feasible way to implement ghost clipping the PyTorch way but without loss wrapping? Loss wrapping causes another set of problems :( The method itself and the idea are amazing, by the way!

While we were quite successful with GradSampleModule, this one is harder to implement an alternative for. Are there other workable options, such as custom functions with backward, modules, hooks, or a PyTorch feature request to upvote, that would allow a double backward pass without major side effects on autograd? We are willing to attempt an implementation; we just need an idea.

Take KTO as an example: https://github.com/huggingface/trl/blob/05a1feb05010b4c321abce481bc43de6ec366d48/trl/trainer/kto_trainer.py#L1119. Any idea on how to make them work together is welcome.

@coveralls commented:

Pull Request Test Coverage Report for Build 18756613059

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 40 of 252 (15.87%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-2.2%) to 78.167%

Changes Missing Coverage                                           Covered Lines   Changed/Added Lines   %
opacus/optimizers/adaclipoptimizer.py                              0               5                     0.0%
opacus/tests/grad_sample_module_fast_gradient_clipping_test.py    4               60                    6.67%
opacus/tests/multidevice_optimizer_test.py                         26              177                   14.69%

Totals Coverage Status

  • Change from base Build 18769308800: -2.2%
  • Covered Lines: 5671
  • Relevant Lines: 7255

💛 - Coveralls

@iden-kalemaj (Contributor) commented:

Thank you for explaining the use case. This makes sense to me, although it will probably have limited use given that it is not data parallel. I will approve this change.

Regarding your second question on ghost clipping, can you open an issue so that we can discuss it there?

@iden-kalemaj (Contributor) left a review comment:

Review automatically exported from Phabricator review in Meta.

meta-codesync bot commented Oct 27, 2025

This pull request has been merged in 7dbbb40.
