Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Add PyTorch DDP MNIST training example #2387

Merged
merged 1 commit into from
Feb 1, 2025

Conversation

astefanutti
Copy link
Contributor

What this PR does / why we need it:

This PR adds an example that demonstrates how to train MNIST with PyTorch DDP using the training operator and SDK v2.

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Jan 14, 2025

Pull Request Test Coverage Report for Build 13055130505

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on pr-10 at 100.0%

Totals Coverage Status
Change from base Build 13016586638: 100.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls

@astefanutti astefanutti force-pushed the pr-10 branch 3 times, most recently from c953498 to dced478 Compare January 14, 2025 13:57
Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for creating this example @astefanutti. We will use it as getting started example!
However, we discussed before that we want to keep all of our examples as Jupyter Notebooks: #2213.
At least initially, before we see a need to have other examples.

So Data Scientists and ML Engineers can quickly take them and execute locally or inside the Kubeflow Notebooks.

Additionally, we are planning to build the testing infra using Papermill to make sure these Notebooks are runnable.

cc @kubeflow/wg-training-leads @Electronic-Waste @akshaychitneni @shravan-achar

@andreyvelich
Copy link
Member

Also, we should keep this example super easy with training function as small as possible, similar to this getting started example: https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob
So it will be easier to understand.

@astefanutti astefanutti force-pushed the pr-10 branch 2 times, most recently from 6b542e9 to 5f45584 Compare January 14, 2025 14:11
@astefanutti
Copy link
Contributor Author

astefanutti commented Jan 14, 2025

However, we discussed before that we want to keep all of our examples as Jupyter Notebooks: #2213.
At least initially, before we see a need to have other examples.

@andreyvelich awesome, sorry if I missed that.

Do I understand it correctly you initially want the examples to be created under /test/e2e/notebooks?

@astefanutti
Copy link
Contributor Author

Also, we should keep this example super easy with training function as small as possible, similar to this getting started example: https://www.kubeflow.org/docs/components/training/getting-started/#getting-started-with-pytorchjob
So it will be easier to understand.

I can see it's nice to have an example as small as possible. I can certainly remove the "evaluation" part to make it shorter.

That being said, I'd be inclined to assume evaluation is a critical part of the training for any Data Scientist or ML Engineer, so the value is high and it does not add much complexity nor foreign concepts that the train section already has.

@andreyvelich
Copy link
Member

Do I understand it correctly you initially want the examples to be created under /test/e2e/notebooks?

We can still use the /examples folder for them, maybe we can use the /test/e2e/notebooks folder for additional test suites, if we need them.
For example, we can keep the script to run Notebooks in the e2e/notebooks folder: https://github.com/kubeflow/training-operator/blob/master/scripts/run-notebook.sh

@astefanutti
Copy link
Contributor Author

Thanks, that makes all sense. Keeping examples in the examples directory makes them easier to find, obviously :)

I can turn this into a Jupyter notebook. I don't see one for the v1 MNIST example in any case.

Would that be useful if I proceed with that, or you'd rather have that done as part of the e2e testing?

@andreyvelich
Copy link
Member

I can turn this into a Jupyter notebook. I don't see one for the v1 MNIST example in any case.
Would that be useful if I proceed with that, or you'd rather have that done as part of the e2e testing?

Sure, go ahead! We can create the E2Es once you have Notebook ready.

Do you want to take the FashionMNIST example as reference: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb ?
I think, FashionMNIST might be more representative than MNIST (it is a first example that PyTorch also shows in their tutorials: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html?highlight=nn%20crossentropyloss)

@astefanutti
Copy link
Contributor Author

astefanutti commented Jan 14, 2025

Do you want to take the FashionMNIST example as reference: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/image-classification/Train-CNN-with-FashionMNIST.ipynb ?
I think, FashionMNIST might be more representative than MNIST (it is a first example that PyTorch also shows in their tutorials: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html?highlight=nn%20crossentropyloss)

Actually I hesitated when I started :)

I agree with you. Let's move it to use FashionMNIST.

@andreyvelich
Copy link
Member

Hi @astefanutti, did you get a chance to transfer your example into Jupyter Notebook so we can use it as Getting Started example ?

@astefanutti
Copy link
Contributor Author

@andreyvelich yes I'm on it, I should be able to push it quickly.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@astefanutti astefanutti force-pushed the pr-10 branch 2 times, most recently from 77fbfdc to 95788ea Compare January 20, 2025 17:47
@astefanutti
Copy link
Contributor Author

@andreyvelich PTAL

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this @astefanutti!
I left a few comments.

examples/pytorch/mnist-ddp/mnist.py Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.py Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.py Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.py Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.py Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.py Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.py Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.ipynb Outdated Show resolved Hide resolved
@astefanutti
Copy link
Contributor Author

@andreyvelich thanks for the review, I'm on it!

@astefanutti
Copy link
Contributor Author

@andreyvelich it's updated according to your comments. PTAL.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these updates @astefanutti!
I left a few more comments.
cc @Electronic-Waste @kubeflow/wg-training-leads @seanlaii @deepanker13 @saileshd1402

examples/pytorch/mnist-ddp/mnist.ipynb Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.ipynb Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.ipynb Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.ipynb Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.ipynb Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.ipynb Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.ipynb Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.ipynb Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.ipynb Outdated Show resolved Hide resolved
examples/pytorch/mnist-ddp/mnist.ipynb Outdated Show resolved Hide resolved
@astefanutti astefanutti force-pushed the pr-10 branch 4 times, most recently from cd0b98c to 531f78a Compare January 30, 2025 14:19
@astefanutti
Copy link
Contributor Author

@andreyvelich I think that's ready for another review cycle.

There is one open point about nproc_per_node that may be best addressed separately.

@andreyvelich
Copy link
Member

There is one open point about nproc_per_node that may be best addressed separately.

Sure, please can you create an issue for it, so we can track it for CPU-based TrainJobs @astefanutti ?

@andreyvelich
Copy link
Member

/rerun-all

@andreyvelich
Copy link
Member

Thank you for this great contribution @astefanutti!
/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@astefanutti
Copy link
Contributor Author

There is one open point about nproc_per_node that may be best addressed separately.

Sure, please can you create an issue for it, so we can track it for CPU-based TrainJobs @astefanutti ?

@andreyvelich thanks, created #2407.

@google-oss-prow google-oss-prow bot merged commit ee11629 into kubeflow:master Feb 1, 2025
56 checks passed
@astefanutti astefanutti deleted the pr-10 branch February 1, 2025 22:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants