Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend HuggingFace Trainer test to multinode multiGPU #303

Merged
merged 1 commit into from
Jan 16, 2025

Conversation

sutaakar
Copy link
Contributor

@sutaakar sutaakar commented Jan 13, 2025

Description

Extend HuggingFace Trainer tests to cover multinode and multiGPU scenarios.

Note: Prometheus GPU utilization check is currently limited to Nvidia only, will be extended to AMD once AMD Prometheus integration is fully ready in test environment.
Note: MultiGPU AMD tests are failing with some generic NCCL error. Will try to run tests again once we upgrade to ROCm 6.3.

How Has This Been Tested?

Manually by running the tests against OCP cluster with Nvidia and AMD GPUs.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Copy link
Contributor

@astefanutti astefanutti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a couple of comments / questions, otherwise looks already great!

tests/kfto/support.go Outdated Show resolved Hide resolved
tests/kfto/support.go Outdated Show resolved Hide resolved
tests/kfto/support.go Outdated Show resolved Hide resolved
trainer.train()
trainer.save_model()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is the model being saved here? Are we sure the local weights trained one each worker shard are shared across workers so the saved model is trained on the global dataset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking how to check that....
any idea?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quickly looking at the code, it doesn't look like the model is wrapped to sync the weights globally. I'd check the logs to get more evidence. We may need to adapt the example e.g. with https://huggingface.co/docs/transformers/accelerate or with something equivalent to DDP. Otherwise that dilutes the value of this example in distributed mode.

Copy link
Contributor Author

@sutaakar sutaakar Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I get it right then the Trainer should already use distributed training under the hood - https://huggingface.co/docs/transformers/main/main_classes/trainer#id1

The Trainer class provides an API for feature-complete training in PyTorch, and it supports distributed training on multiple GPUs/TPUs, mixed precision for NVIDIA GPUs, AMD GPUs, and torch.amp for PyTorch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, exactly what we wanted to see!

That's OK if that's DDP. The model is a bit small and it's good to have an example for DDP.

We can plan to work on an another example for FSDP with a larger model, WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can plan to work on an another example for FSDP with a larger model, WDYT?

Sounds good.
Though thinking whether we should put FSDP (maybe even DDP) rather as an example into https://github.com/opendatahub-io/distributed-workloads/tree/main/examples.
As it doesn't cover the Training operator capabilities much as it rather covers Trainer usage itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, let's cover FSDP with an example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe such FSDP example can be placed into Training operator upstream?

Copy link
Contributor

@astefanutti astefanutti Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could yes, though I'm also thinking about an example equivalent to the LLM fine-tuning example we have for Ray, that is, using the KFTO SDK from a workbench, and more state-of-the-art with LoRA / qLoRA, etc.

@astefanutti
Copy link
Contributor

Awesome work!

/lgtm

And(
HaveLen(numGpus),
ContainElement(
// Check that at lest some GPU was utilized on more than 50%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Karel! PR looks awesome!
Just a nit pick : lest --> least

@abhijeet-dhumal
Copy link
Contributor

/lgtm

@abhijeet-dhumal
Copy link
Contributor

abhijeet-dhumal commented Jan 16, 2025

Thanks, I ran the tests and the execution works as expected 🙌
image

@abhijeet-dhumal
Copy link
Contributor

/approve

1 similar comment
@sutaakar
Copy link
Contributor Author

/approve

Copy link

openshift-ci bot commented Jan 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhijeet-dhumal, sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 37a59e4 into opendatahub-io:main Jan 16, 2025
2 checks passed
@sutaakar sutaakar deleted the refactor-training branch January 16, 2025 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants