OOMptimizer: bucketing batch size profiles to make GPUs go 🔥 #9763

pzelasko · 2024-07-17T15:14:26Z

What does this PR do ?

Major contributions:

Canary-1B can be trained with 5x larger batch sizes than our earlier baseline, maxing out GPU utilization (memory, compute, and power consumption). The mean training step time becomes 2.75x longer, so training throughput is 5x / 2.75x ~= 180% of the original recipe. I managed to reproduce Canary-1B in about 40k training steps on the same number of GPUs, changing only the bucketing/batch size settings using the new features in this PR.
- Update: this actually reproduces Canary-1B in half of the training time with slightly improved WERs.
- Update 2: also reproduced Canary-1B in the original training time using 4x fewer GPUs.
Note: these tools are applicable to all ASR models and can easily be made applicable to any (audio|text)->(audio|text) model.
OOMptimizer: a script that, given a model config and bucket bins, finds the optimal batch size for each bucket bin. Optimal = maximum GPU utilization.
2D bucketing with a dedicated estimation script and dataloading support. It allows stratifying sampling by both input and output sequence lengths, which improves training throughput for encoder-decoder models.
Ability to filter out examples exceeding a tokens-per-second threshold during training (e.g. some datasets have very severe outliers, with 20x more tokens than the median).
Concurrent bucketing, which speeds up the start of the training loop (in my experiments it reduces the wait time from ~2 min to ~20 seconds). With this setting, the bucketing buffer is filled asynchronously and the sampler can draw batches once it is at least 10% full, so training can start sooner.
Documentation and examples of usage of the new features.
Unit and integration tests for the new features.
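To make the tokens-per-second filter and the 2D bucketing idea from the list above more concrete, here is a minimal sketch. It is not the code in this PR or in lhotse: the `Example` class, `keep_example`, `select_bucket`, and the bin values are all hypothetical, and the real bucket selection logic may differ. It assumes each bin is a `(max_duration, max_tokens)` pair and that bins are sorted by duration, then by token count.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Example:
    duration: float  # input length: seconds of audio
    num_tokens: int  # output length: tokens in the transcript


def keep_example(ex: Example, max_tps: float) -> bool:
    """Hypothetical filter: keep examples at or below a tokens-per-second threshold."""
    return ex.num_tokens / ex.duration <= max_tps


def select_bucket(ex: Example, bins: Sequence[Tuple[float, int]]) -> Optional[int]:
    """Return the index of the first (max_duration, max_tokens) bin that fits `ex`.

    Returns None when the example fits no bin (it would be discarded).
    """
    for idx, (max_duration, max_tokens) in enumerate(bins):
        if ex.duration <= max_duration and ex.num_tokens <= max_tokens:
            return idx
    return None


# Illustrative bins only: two token sub-buckets per duration bucket.
bins = [(5.0, 20), (5.0, 50), (10.0, 40), (10.0, 100)]
examples = [Example(4.0, 30), Example(8.0, 30), Example(1.0, 200)]
kept = [ex for ex in examples if keep_example(ex, max_tps=10.0)]
assignments = [select_bucket(ex, bins) for ex in kept]
```

Stratifying on both dimensions means a short utterance with an unusually long transcript lands in a bucket sized for long outputs, instead of inflating the padding of a bucket chosen by duration alone.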

Collection: ASR

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

titu1994

Very interesting approach to maximizing the batch size. Minor comments, but it looks good.

titu1994 · 2024-07-17T15:56:24Z

scripts/speech_recognition/oomptimizer.py

+
+    @property
+    def max_batch_size(self) -> int | None:
+        if (


Needs a bit of doc for all the cases

titu1994 · 2024-07-17T15:56:48Z

scripts/speech_recognition/oomptimizer.py

+            return self._max_ok
+        return None
+
+    @property


What does relative gap mean?

Added doc to explain
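For readers wondering the same thing, one plausible reading of "relative gap" is sketched below. This is an assumption, not the definition used in `oomptimizer.py`: the gap is taken as the distance between the largest batch size known to fit and the smallest one known to fail, divided by the latter. The `find_max_batch_size` function, its parameters, and the `fits` callback are hypothetical; in practice `fits` would run a training step and catch the CUDA out-of-memory error.

```python
from typing import Callable, Optional


def find_max_batch_size(
    fits: Callable[[int], bool],
    start: int = 32,
    rel_gap_threshold: float = 0.05,
    limit: int = 1 << 20,
) -> Optional[int]:
    """Search for the largest batch size for which `fits(batch_size)` is True.

    Doubles the batch size until the first failure, then bisects between the
    largest success and the smallest failure until their relative gap is small.
    Returns None if even a batch size of 1 does not fit.
    """
    max_ok: Optional[int] = None    # largest batch size that fit
    min_fail: Optional[int] = None  # smallest batch size that failed
    bs = start
    while True:
        if fits(bs):
            max_ok = bs
        else:
            min_fail = bs
        if min_fail is None:
            if bs >= limit:  # never failed: stop at the safety cap
                return max_ok
            bs = min(bs * 2, limit)
        elif max_ok is None:
            if min_fail == 1:
                return None
            bs = max(1, min_fail // 2)
        else:
            rel_gap = (min_fail - max_ok) / min_fail
            if rel_gap <= rel_gap_threshold or min_fail - max_ok == 1:
                return max_ok
            bs = (max_ok + min_fail) // 2


# Toy stand-in for a GPU that can hold at most 100 examples of this bucket.
best = find_max_batch_size(lambda bs: bs <= 100)
```

A non-zero threshold trades a few percent of batch size for fewer probing steps per bucket.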

titu1994 · 2024-07-17T15:58:59Z

scripts/speech_recognition/oomptimizer.py

+        return False
+
+
+class FloatList(click.Option):


What's this used for? Might as well use hydra with a dataclass rather than click.

Right, I went with click out of an old habit. This auto-parses bucket duration bins [1,2,3,4] into a list of floats.
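As an illustration of the parsing described here, the sketch below turns a string such as `[1,2,3,4]` into a list of floats using only the standard library. The `parse_float_list` helper is hypothetical and is not the `FloatList` implementation from the PR; in a `click.Option` subclass, logic like this would typically run wherever the raw option value is converted.

```python
import ast


def parse_float_list(text: str) -> list[float]:
    """Parse a string like "[1,2,3,4]" into [1.0, 2.0, 3.0, 4.0]."""
    value = ast.literal_eval(text)  # safely evaluates Python literals only
    if not isinstance(value, (list, tuple)):
        raise ValueError(f"Expected a list of numbers, got: {text!r}")
    return [float(item) for item in value]


bucket_duration_bins = parse_float_list("[1,2,3,4]")
```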

titu1994 · 2024-07-17T16:00:41Z

scripts/speech_recognition/oomptimizer.py

+
+    print("Intializing ASR model.")
+    # TODO(pzelasko): This currently only supports "from_pretrained".
+    #                 We need to be able to read a model training configuration and instantiate the model


You can use restore_from(..., return_config=True)...

Ended up going with --module-name and --config-path, as discussed offline. It works well.

pzelasko · 2024-07-17T17:49:41Z

I realized we can make a post-processing pass on the max_batch_size list and merge buckets with identical batch sizes. Merging buckets will improve randomization. If the original num_buckets was large enough to trigger merging, that approach will lead us to an optimal number of buckets.
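A minimal sketch of such a post-processing pass follows. It assumes bins are given as ascending upper bounds on duration and that only adjacent buckets are merged; the `merge_buckets` name and the sample numbers are hypothetical and may not match what the PR ends up doing.

```python
from typing import List, Sequence, Tuple


def merge_buckets(
    bins: Sequence[float], batch_sizes: Sequence[int]
) -> Tuple[List[float], List[int]]:
    """Merge adjacent buckets that ended up with the same max batch size.

    `bins[i]` is the upper duration bound of bucket i. When two neighbours
    share a batch size, the earlier bucket absorbs the later one by taking
    over its (larger) upper bound.
    """
    merged_bins: List[float] = []
    merged_sizes: List[int] = []
    for upper, batch_size in zip(bins, batch_sizes):
        if merged_sizes and merged_sizes[-1] == batch_size:
            merged_bins[-1] = upper  # extend the previous bucket
        else:
            merged_bins.append(upper)
            merged_sizes.append(batch_size)
    return merged_bins, merged_sizes


new_bins, new_sizes = merge_buckets([5.0, 10.0, 15.0, 20.0], [64, 32, 32, 16])
```

Fewer, wider buckets hold more examples each, which is where the improved randomization comes from.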

galv · 2024-07-17T17:52:41Z

scripts/speech_recognition/oomptimizer.py

+            oom = False
+            try:
+                print(f"Current gap: {gen.current_rel_gap}. Attempting shapes: {[b.shape for b in batch]}", end=" ")
+                optimizer.zero_grad()


It is theoretically possible to do these three lines in a cuda stream capture with "relaxed" mode to avoid doing any sort of GPU-side computation. However, it will work only for code that has no data-dependent shapes (like torch.nonzero). Note that I haven't run your code and don't know how slow it is right now.

It is surprisingly fast: for ~30 buckets the total runtime seems to be within 1-2 minutes. If CUDA graph "relaxed" mode were "ok" with skipping NCCL ops, we might even incorporate this as a training-time calibration (which we can't do now because these steps trigger NCCL syncs: if one GPU dies and another doesn't, it would hang). But even as-is I think this is a viable approach.

For lots of buckets (i.e. 100+) it takes a while. We should try the "relaxed" CUDA graph trick, and if it works, make a follow up PR.

Just curious how long "a while" is.

The relaxed CUDA graph trick definitely won't always work, unfortunately. I spoke with someone who works on end-to-end training, and he told me that there is a cudaStreamSynchronize() in torch.amp.GradScaler, which will prevent using relaxed stream capture for models that do gradient scaling in mixed precision training.

I think it's around 15 minutes for 150 buckets.

ko3n1g · 2024-08-15T14:08:26Z

.github/workflows/cicd-main.yml

+             /home/TestData/asr_tokenizers/canary/es/tokenizer_spe_bpe_v1024_max_4/tokenizer.model \
+          --langs spl_tokens en es \
+          --prompt-format canary \
+          --prompt '[{"role":"user","slots":{"source_lang":"en","target_lang":"en","task":"asr","pnc":"yes"}}]' \


Suggested change

--prompt '[{"role":"user","slots":{"source_lang":"en","target_lang":"en","task":"asr","pnc":"yes"}}]' \

--prompt \'[{"role":"user","slots":{"source_lang":"en","target_lang":"en","task":"asr","pnc":"yes"}}]\' \

I think that should do the trick @pzelasko

Thanks... I was seriously scratching my head with this one lol.

That also resulted in an error. I am disabling the check so this PR can go in. If we can figure out how to work around the quoting issue, I will re-enable the check in a follow-up PR.
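For anyone picking up the quoting issue later, one way to check how a JSON argument survives POSIX shell quoting is a round trip through Python's `shlex`. This sketch only covers the shell layer; it does not model how the YAML workflow file or the CI runner process the command, which may be where the failure actually occurs.

```python
import json
import shlex

# The prompt from the CI command, built as a Python object.
prompt = [
    {
        "role": "user",
        "slots": {"source_lang": "en", "target_lang": "en", "task": "asr", "pnc": "yes"},
    }
]

# Compact JSON, matching the form used on the command line.
prompt_arg = json.dumps(prompt, separators=(",", ":"))

# shlex.quote wraps the string so a POSIX shell passes it through unchanged.
quoted = shlex.quote(prompt_arg)
command = f"--prompt-format canary --prompt {quoted}"

# Round trip: splitting the command the way a shell would recovers the JSON.
parsed_args = shlex.split(command)
```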

ko3n1g

hope you don't mind my comments!

galv

Forgot to hit approve last time.

ko3n1g

Will merge this after the pipeline passes.

…9763)
* Initial working draft of the OOMptimizer.
* Support model config. Add bucket merging.
* fix
* code review
* Support bucket_batch_size option for lhotse dataloading
* Ability to force a memory fraction to be unused in OOMptimizer
* Ability to force a memory fraction to be unused in OOMptimizer
* Fix for autocast and configurable dtype
* Allow token-per-second filtering
* Fix an issue with canary tokenizer
* Lift the requirement to use CanaryTokenizer with canary prompt format
* Fixes
* Initial 2D bucketing draft
* Separate script for 2D bucket estimation
* Full 2D bucketing support: estimate_uduration_bins_2d, oomptimizer, training
* fix
* fix
* fix
* fix
* fix
* fix
* Unit tests for bucket_batch_size and 2D bucketing for audio
* Docs for 2D estimate duration bins
* Fixes
* Preliminary support for prompt format in estimate_duration_bins_2d
* fixes
* fix for bucket selection edge case
* Add more info about the distribution to estimate_duration_bins_2d.py
* Include CUDA RAM usage tracking in OOMptimizer
* Track batch_size, num frames/tokens, and their padding ratio for AED multi task models
* OOMptimizer documentation
* Resolve TODOs and support any combination of (audio|text)->(audio|text) modalities
* Add missing property decorator
* fixes
* Add docs about 2D bucketing with tokenizer and prompts
* Fix bucket allocation logic for 2D bucketing
* Bump lhotse version
* fix...
* Reverse bucket iteration order; move oomptimizer_schema to AsrModel
* Make OOMptimizer compatible with dataclass mini-batches
* Refine the schema
* fixes after merging main
* fix oomptimizer with pretrained models; verified canary, parakeet tdt and ctc
* Disable concurrent bucketing to prevent spawning extra threads in tests
* fix tests and make life more colorful
* formatting
* more reasonable starting batch size settings
* Disable clearing of cuda memory cache
* Even more conservative profile by incorporating DDP overhead simulation
* Bucket selection fix and an extended unit test
* Refactor registered_prompt_format_fn to enable prompt formatting before Sampler
* porting fix
* Fixes, move fast-path to prompted dataset
* Changes from Daniel's review
* OOMptimizer tests + fixes for 1D bucketing case
* estimate duration bins tests
* address Daniel's review
* fix CPU unit test
* try to fix CI test
* Apply suggestions from code review
* Disable 2D bucketing test with prompt due to quoting issue
---------
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Co-authored-by: oliver könig <[email protected]>

Initial working draft of the OOMptimizer.

f71e27e

Signed-off-by: Piotr Żelasko <[email protected]>

pzelasko requested review from titu1994 and galv July 17, 2024 15:14

titu1994 previously approved these changes Jul 17, 2024

pzelasko mentioned this pull request Jul 17, 2024

Support for pre-determined batch sizes in DynamicBucketingSampler lhotse-speech/lhotse#1372

Merged

galv reviewed Jul 17, 2024

Support model config. Add bucket merging.

5995dbe

Signed-off-by: Piotr Żelasko <[email protected]>

pzelasko dismissed titu1994’s stale review via 5995dbe July 18, 2024 18:56

pzelasko and others added 3 commits July 18, 2024 15:50

fix

561e674

Signed-off-by: Piotr Żelasko <[email protected]>

code review

5970c34

Signed-off-by: Piotr Żelasko <[email protected]>

Support bucket_batch_size option for lhotse dataloading

b4ab721

Signed-off-by: Piotr Żelasko <[email protected]>

github-actions bot added the common label Jul 19, 2024

pzelasko added 2 commits July 19, 2024 08:49

Ability to force a memory fraction to be unused in OOMptimizer

9e632e4

Signed-off-by: Piotr Żelasko <[email protected]>

Ability to force a memory fraction to be unused in OOMptimizer

4b009bd

Signed-off-by: Piotr Żelasko <[email protected]>

pzelasko force-pushed the oomptimizer branch from b0e9057 to 4b009bd Compare July 19, 2024 12:53

pzelasko added 2 commits July 19, 2024 09:02

Fix for autocast and configurable dtype

a386fa8

Signed-off-by: Piotr Żelasko <[email protected]>

Allow token-per-second filtering

a4e2c66

Signed-off-by: Piotr Żelasko <[email protected]>

github-actions bot added the ASR label Jul 19, 2024

pzelasko added 11 commits July 22, 2024 13:06

Fix an issue with canary tokenizer

0cdc58d

Signed-off-by: Piotr Żelasko <[email protected]>

Lift the requirement to use CanaryTokenizer with canary prompt format

9c3e625

Fixes

aaa05a5

Signed-off-by: Piotr Żelasko <[email protected]>

Initial 2D bucketing draft

e7556fb

Signed-off-by: Piotr Żelasko <[email protected]>

Separate script for 2D bucket estimation

8497a25

Signed-off-by: Piotr Żelasko <[email protected]>

Full 2D bucketing support: estimate_uduration_bins_2d, oomptimizer, training

bc60b5f

Signed-off-by: Piotr Żelasko <[email protected]>

fix

10c2ada

Signed-off-by: Piotr Żelasko <[email protected]>

fix

bb0bc4f

Signed-off-by: Piotr Żelasko <[email protected]>

fix

5e442bf

Signed-off-by: Piotr Żelasko <[email protected]>

fix

97a800c

Signed-off-by: Piotr Żelasko <[email protected]>

fix

21588ba

Signed-off-by: Piotr Żelasko <[email protected]>

pzelasko added Run CICD and removed Run CICD labels Aug 14, 2024

Merge branch 'main' into oomptimizer

9bb2693

pzelasko added Run CICD and removed Run CICD labels Aug 15, 2024

pzelasko added 2 commits August 15, 2024 09:49

try to fix CI test

81b4d92

Signed-off-by: Piotr Żelasko <[email protected]>

Merge branch 'oomptimizer' of https://github.com/nvidia/nemo into oomptimizer

10444f8

pzelasko added Run CICD and removed Run CICD labels Aug 15, 2024

ko3n1g reviewed Aug 15, 2024

.github/workflows/cicd-main.yml (3 review threads, outdated and resolved)

Apply suggestions from code review

02c88f5

Co-authored-by: oliver könig <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>

pzelasko added Run CICD and removed Run CICD labels Aug 15, 2024

Merge remote-tracking branch 'origin/main' into oomptimizer

2383c93

galv previously approved these changes Aug 15, 2024

Disable 2D bucketing test with prompt due to quoting issue

6f066e0

Signed-off-by: Piotr Żelasko <[email protected]>

pzelasko dismissed galv’s stale review via 6f066e0 August 15, 2024 18:46

pzelasko added Run CICD and removed Run CICD labels Aug 15, 2024

galv approved these changes Aug 15, 2024

ko3n1g approved these changes Aug 16, 2024

pzelasko merged commit a2c1627 into main Aug 16, 2024
128 of 129 checks passed

pzelasko deleted the oomptimizer branch August 16, 2024 11:59

	--prompt '[{"role":"user","slots":{"source_lang":"en","target_lang":"en","task":"asr","pnc":"yes"}}]' \
	--prompt \'[{"role":"user","slots":{"source_lang":"en","target_lang":"en","task":"asr","pnc":"yes"}}]\' \
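
The two `--prompt` lines above differ only in how the JSON argument is quoted. As a rough illustration (not the PR's CI script), the sketch below uses Python's standard `shlex` module to show how a top-level POSIX shell would tokenize each form. The names `prompt`, `quote_json_arg`, and the shortened `escaped` command are hypothetical and introduced only for this example. The workflow file may nest the command inside another quoted string, which would presumably be why the escaped form was suggested; that is an assumption, not something the thread states.

```python
import json
import shlex

# Hypothetical payload mirroring the slots shown in the lines above.
prompt = [
    {
        "role": "user",
        "slots": {"source_lang": "en", "target_lang": "en", "task": "asr", "pnc": "yes"},
    }
]


def quote_json_arg(value) -> str:
    """Serialize to compact JSON and quote it as a single POSIX shell word."""
    return shlex.quote(json.dumps(value, separators=(",", ":")))


# Form 1: plain single quotes around the JSON (what shlex.quote produces here).
plain = f"--prompt {quote_json_arg(prompt)}"
plain_tokens = shlex.split(plain)
recovered = json.loads(plain_tokens[1])  # the JSON survives tokenization intact

# Form 2: backslash-escaped single quotes, shortened for readability.
# At the top level of a POSIX shell the single quotes become literal characters
# and the double quotes are consumed, so the result is no longer valid JSON.
escaped = "--prompt \\'[{\"role\":\"user\"}]\\'"
escaped_tokens = shlex.split(escaped)

print(plain_tokens[1])
print(escaped_tokens[1])
```

Which form is correct depends on how many quoting layers sit between the workflow file and the final command, so a check like this is best run against the exact string the CI job executes.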