OOMptimizer: bucketing batch size profiles to make GPUs go 🔥 #9763
MaskType may also need to be supported at some point, as an alternative to LengthsType. I don't think it's a big deal, though.
Needs a bit of doc for all the cases
What does relative gap mean?
Added doc to explain
What's this used for? Might as well use Hydra with a dataclass rather than click.
Right, I went with click out of an old habit. It auto-parses bucket duration bins like `[1,2,3,4]` into a list of floats.
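The bin parsing mentioned above can be sketched in plain Python; this is a hypothetical illustration (the function name `parse_bucket_bins` is not from the PR), and the real script would attach such logic as a click parameter callback:

```python
# Hypothetical sketch of auto-parsing a CLI string like "[1,2,3,4]"
# into a list of floats, as a click parameter callback might do.
import ast


def parse_bucket_bins(value: str) -> list[float]:
    """Parse a bracketed list of bucket duration bins into floats."""
    parsed = ast.literal_eval(value)  # safely evaluates literals like "[1,2,3,4]"
    if not isinstance(parsed, (list, tuple)):
        raise ValueError(f"Expected a list of numbers, got: {value!r}")
    return [float(x) for x in parsed]


# With click, this would be wired up roughly as:
# @click.option("--buckets", callback=lambda ctx, param, v: parse_bucket_bins(v))

print(parse_bucket_bins("[1,2,3,4]"))  # [1.0, 2.0, 3.0, 4.0]
```

Using `ast.literal_eval` (rather than `eval`) keeps the parsing restricted to Python literals, so arbitrary expressions on the command line are rejected.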
You can use restore_from(..., return_config=True)...
Ended up going with `--module-name` and `--config-path`, as discussed offline. It works well.
It is theoretically possible to do these three lines in a CUDA stream capture with "relaxed" mode to avoid doing any GPU-side computation. However, it will only work for code that has no data-dependent shapes (like torch.nonzero). Note that I haven't run your code and don't know how slow it is right now.
It is surprisingly fast: for ~30 buckets the total runtime seems to be within 1-2 minutes. If CUDA graph "relaxed" mode were OK with skipping NCCL ops, we might even incorporate this as a training-time calibration (which we can't do now because these steps trigger NCCL syncs; if one GPU dies and another doesn't, it would hang). But even as-is, I think this is a viable approach.
For lots of buckets (e.g. 100+) it takes a while. We should try the "relaxed" CUDA graph trick, and if it works, make a follow-up PR.
Just curious how long "a while" is.
Unfortunately, the relaxed CUDA graph trick definitely won't always work. I spoke with someone who works on end-to-end training, and he told me there is a cudaStreamSynchronize() in torch.amp.GradScaler, which will prevent using relaxed stream capture for models that do gradient scaling in mixed-precision training.
I think it's around 15 minutes for 150 buckets.