Enable multi-device model support #796
Conversation
@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this in D85355821. (Because this pull request was imported automatically, there will not be any future comments.)
Hi there, I'd like to better understand your use case. Did you try using FSDP (Fully Sharded Data Parallel) to train the 7B model? https://pytorch.org/blog/enabling-fully-sharded-data-parallel-fsdp2-in-opacus/. We've used it to train models of similar size to yours with similar GPU resources. If you did try it, were there any issues or gaps? If not, can FSDP fill the same need for your use case?
We are slowly getting there. The product team (we are Research) uses the TRL library, which is heavily built around the Transformers Trainer. So far we struggle to match FastGradientClippingTensor with the custom losses used by the DPO and KTO trainers, although we are working on that. Meanwhile, at least this change, however inefficiently, allows:

```python
model = AutoModel.from_pretrained(model_name, device_map="auto")
trainer = KTOTrainer(model, ...)
trainer.train(input_data, ...)
```

It is inefficient to the point that only 1 GPU out of 8 is in use at any given time, but it at least manages to run KTO with the minimal recommended batch size of 16. Soon we hope to come up with a better solution, but meanwhile this execution mode is quite simple to support and it doesn't affect the code much.
@iden-kalemaj, regarding my other PR: we are exploring wrap-less methods, assuming there are more seasoned PyTorch experts here. While we were quite successful with GradSampleModule, this one is harder to implement an alternative for. Are there other workable options, such as functions with a custom backward, modules, hooks, or a PyTorch feature request to upvote, that would allow a double backward pass without major side effects on autograd? We are willing to attempt an implementation; we just need an idea. Just look at KTO, for example: https://github.com/huggingface/trl/blob/05a1feb05010b4c321abce481bc43de6ec366d48/trl/trainer/kto_trainer.py#L1119 Any idea on how to make them work together is welcome.
Pull Request Test Coverage Report for Build 18756613059
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls
Thank you for explaining the use case. This makes sense to me, although it will probably have limited use cases given that it is not data parallel. I will approve this change. Regarding your second question on ghost clipping, can you open an issue so that we can discuss it there?
iden-kalemaj left a comment
Review automatically exported from Phabricator review in Meta.
This pull request has been merged in 7dbbb40.
Types of changes
Motivation and Context / Related issue
This PR adds support for multi-device training scenarios where model parameters are distributed across multiple GPU devices (e.g., when assigning different layers directly with `module.to(device[i])` or when using `device_map="auto"` with accelerate).

Problem solved:
When training large models that don't fit on a single GPU, parameters and gradients can be spread across multiple devices. The existing Opacus optimizers and gradient clipping modules assumed all tensors were on the same device, causing runtime errors during norm computation and gradient clipping operations.
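For illustration, here is a minimal sketch (not from the PR) of the failure mode and the fix pattern: `torch.stack` requires all of its inputs to be on the same device, so values collected from parameters spread across GPUs must first be moved to a common device. The two-GPU split and the choice of `target_device` below are assumptions made only for this example.

```python
import torch
from torch import nn

# Hypothetical model split across two GPUs (assumes at least 2 CUDA devices).
model = nn.Sequential(nn.Linear(16, 32), nn.Linear(32, 4))
model[0].to("cuda:0")
model[1].to("cuda:1")

# Per-parameter norms now live on different devices.
per_param_norms = [p.detach().norm(2) for p in model.parameters()]

# torch.stack(per_param_norms) would raise:
#   RuntimeError: Expected all tensors to be on the same device
# Fix: move every value to one target device before stacking.
target_device = per_param_norms[0].device
total_norm = torch.stack([n.to(target_device) for n in per_param_norms]).norm(2)
print(total_norm)
```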
Changes:
- Multi-device support in optimizers: Modified `DPOptimizer` and `AdaClipDPOptimizer` to move tensors to appropriate devices before operations like `torch.stack()` and `torch.einsum()`, preventing device mismatch errors during gradient clipping and accumulation (see the sketch after this list).
- Multi-device support in `GradSampleModuleFastGradientClipping`: Extended multi-device handling to `GradSampleModuleFastGradientClipping`, `DPPerLayerOptimizer`, and additional edge cases in optimizers that were previously uncovered.
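The device moves described above follow roughly the pattern sketched below. This is an illustrative example only, not the PR's actual code: the function name `clip_per_sample_grads`, the gradient shapes, and the `1e-6` stabilizer are assumptions; it is meant only to show per-sample norms from different devices being combined and the clipping factor being broadcast back to each gradient's device.

```python
import torch

def clip_per_sample_grads(per_sample_grads, max_grad_norm):
    """Illustrative device-alignment pattern, not the actual Opacus code.

    per_sample_grads: list of tensors shaped (batch, *param_shape), possibly
    living on different devices when the model is sharded manually or via
    device_map="auto".
    """
    # Per-sample L2 norm of each parameter's gradient, computed on whatever
    # device that gradient lives on.
    per_param_norms = [
        g.reshape(g.shape[0], -1).norm(2, dim=-1) for g in per_sample_grads
    ]

    # Move all per-sample norms to one device before stacking; otherwise
    # torch.stack raises a device-mismatch RuntimeError.
    target_device = per_param_norms[0].device
    per_sample_norms = torch.stack(
        [n.to(target_device) for n in per_param_norms], dim=1
    ).norm(2, dim=1)

    # Per-sample clipping factors, computed once on the target device.
    clip_factor = (max_grad_norm / (per_sample_norms + 1e-6)).clamp(max=1.0)

    # Scale each per-sample gradient on its own device and sum over the batch
    # dimension; the factor is moved to the gradient's device first.
    return [
        torch.einsum("i,i...->...", clip_factor.to(g.device), g)
        for g in per_sample_grads
    ]
```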
How Has This Been Tested
- `multidevice_optimizer_test.py` covering:
  - `DPOptimizer`, `AdaClipDPOptimizer`, and `DPPerLayerOptimizer` with multi-device models
  - `clip_and_accumulate()` and full `step()` operations
  - `_clip_and_accumulate_parameter()` with multi-device parameters
- `grad_sample_module_fast_gradient_clipping_test.py` for:
  - `get_norm_sample()` with parameters on different devices
  - `get_clipping_coef()` with parameters on different devices
Checklist