-
Notifications
You must be signed in to change notification settings - Fork 88
Cuda update policy and guide #74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 5 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,197 @@ | ||||||
|
|
||||||
| # [CUDA version support] | ||||||
|
|
||||||
| **Authors:** | ||||||
| * @atalman @malfet @tinglvv @nWEIdia @ptrblck | ||||||
|
|
||||||
|
|
||||||
| ## **Motivation** | ||||||
|
|
||||||
| The proposal is to provide two main benefits | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am a bit curious why we didn't go for "prodictability" as a motivation here? (note that I'm not arguing we should do this, just want to hear the reasoning on your end) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe @ptrblck can also chime in for this. I believe for CUDA version updates we would like to first evaluate if update is necessary and there are no major regressions in new version of CUDA, this is why we have a delay and evaluation step. For example we added support for CUDA 12.4, 12.6 and 12.8, skipping 12.5 and 12.7 versions. |
||||||
|
|
||||||
| - Aligned decision policy provides transparent decision making; | ||||||
| - Incorporate decision points in the early release process to stagger CUDA migration from the RC release integration work. | ||||||
|
|
||||||
| ## **Proposed Policy and Process** | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since this is an existing policy, in the past year when has this resulted in us deciding to not do a cuda upgrade? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, this is a history of decisions we took for past year #74 (comment) We supported CUDA: 11.8, 12.1, 12.4, 12.6 and 12.8 . |
||||||
|
|
||||||
| ### We should introduce new version of CUDA when | ||||||
|
|
||||||
| - It enables new important GPU architecture (For example Blackwell with CUDA-12.8) | ||||||
| - It brings significant performance improvements | ||||||
| - It fixes significant correctness issues | ||||||
| - It significantly reduces binary/memory footprint | ||||||
atalman marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
|
||||||
| ### We would deprecate version of CUDA when | ||||||
|
|
||||||
| As soon as we introduce a new Experimental Version we should consider moving the previous Experimental Version to Stable, and decommission the previous Stable version. Typically we want to support 2-3 versions of CUDA as follows: | ||||||
|
||||||
|
|
||||||
| - Optional Legacy Version: If we need to have 1 version for backend compatibility or to work around the current limitation. For example: CUDA older driver is incompatible with newer CUDA version | ||||||
|
||||||
| - Stable Version: This is a stable CUDA version that is used most of the time. This is the version we want to upload to PyPI. Please note that if the stable version is equal or slightly older to the version then the version fbcode is using, we probably need to keep it in CI/CD system as Optional Legacy Version. | ||||||
|
||||||
| - Latest Experimental Version: This is the latest version of CUDA that we want to support. Minimal requirement is to have it available in nightly releases (CD) | ||||||
|
||||||
|
|
||||||
| ### Detailed Process of Introducing new CUDA version | ||||||
|
|
||||||
| 1. Evaluate CUDA update necessity | ||||||
|
||||||
| Goal: When any of this is true: | ||||||
| - It enables new important GPU architecture (For example Blackwell with CUDA-12.8) | ||||||
| - It brings significant performance improvements | ||||||
| - It fixes significant correctness issues | ||||||
| - It significantly reduces binary/memory footprint | ||||||
|
|
||||||
| 2. Evaluate if we have all packages for update | ||||||
| When: As soon as Update determined to be necessary. Start by creating RFC [issue](https://github.com/pytorch/pytorch/issues/145544) with possible CUDA matrix to support for next release. | ||||||
|
||||||
| When: As soon as Update determined to be necessary. Start by creating RFC [issue](https://github.com/pytorch/pytorch/issues/145544) with possible CUDA matrix to support for next release. | |
| When: As soon as Update determined to be necessary. Start by creating RFC issue (see [example](https://github.com/pytorch/pytorch/issues/145544)) with possible CUDA matrix to support for next release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
supporting 3 versions can be an exception for certain release where we need to keep legacy version
Can you share examples of when we have needed to keep a legacy version around in the past? Sounds like we might incur a significant CI cost to test 50% more CUDA versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is current situation with CUDA 11.8: pytorch/pytorch#147383
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add this as an example here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First we drop CD support and then CI support.
Is the order important?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, dropping CD support is easy to do. When Dropping CI support is most likely a migration of these CI jobs to next CUDA version and may take somewhat longer.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: cuDNN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add this matrix to the install matrix on https://pytorch.org/get-started/locally/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add it to RELEASE.md as well
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this? I thought we moved to plain Linux containers and are using our install_cuda.sh script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is still used, however should be deprecated soon once we migrate focal->jammy in our CI/CD https://github.com/pytorch/pytorch/blob/main/.ci/docker/build.sh#L73
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NVIDIA driver updates should not be required during the lifetime of a major CUDA release as minor version compatibility will be used. If we add driver API calls, we should use cudaGetDriverEntryPoint making sure this API is available on the driver we are using. Keeping older driver on CI nodes will also give us a better signal what users would expect. Updating the driver is of course still valid, but might not be necessary in each CUDA toolkit update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any broad policy we want to set for cudnn upgrade?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is some ambiguity below on what it means to "support" a give version of cuda.
In particular make it clear that this is always about full releases, unless specified otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also add that we are defining the PyTorch binary support. Source builds are most of the time supported at CUDA release.