
Conversation

TroyGarden

Summary:

# context

* APS uses a "variable batch size" schedule during training, e.g., a smaller `batch_size` (such as 32) to warm up, then a larger `batch_size` (such as 64) for the rest of training.
```
    batch_size_schedule:
      - batch_size: 32
        max_iters: 5
      - batch_size: 64
        max_iters: 999999999
```
* However, this becomes a problem for torch.export (PT2 IR), because the exported program assumes the `batch_size` is constant.
  NOTE: this "variable batch" concept is fundamentally different from "variable length" (VLE/VBE).
* In the variable-batch scenario, within the same batch/training iteration every feature in the KJT shares the same `batch_size` (it can only change in a later iteration), so the correlation `batch_size = len(kjt._lengths) // len(kjt._keys)` holds, and `kjt.stride()` returns the `batch_size` by calculation from `_lengths` and `_keys`.
* In the variable-length scenario, within the same batch/training iteration each feature in the KJT can have a different `batch_size`, and there is no such correlation between `_lengths`, `_keys`, and `batch_size`.
* So this "variable batch size" **CANNOT** simply be resolved by marking all input KJTs as variable-length; instead, the `batch_size` has to be used as a dynamic shape implicitly, via the `mark_dynamic_kjt` util function.
  WARNING: it is the user's responsibility to ensure that `variable_batch` is only used when `variable_length` is set to `False`; otherwise it will cause unexpected behavior with the dynamic shapes in torch.export.
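The variable-batch invariant described above can be sketched in plain Python, with lists standing in for a KJT's `_keys` and `_lengths` (the feature names and length values below are made up for illustration; no torchrec import is needed):

```python
# Plain-Python sketch of the variable-batch invariant: every key in the
# KJT shares one batch_size, so _lengths is laid out key-major with
# exactly batch_size entries per key.
keys = ["f1", "f2", "f3"]      # stand-in for kjt._keys
batch_size = 4                 # shared by every feature this iteration
lengths = [
    1, 2, 0, 3,                # per-sample lengths for f1
    2, 2, 1, 0,                # per-sample lengths for f2
    0, 1, 1, 2,                # per-sample lengths for f3
]                              # stand-in for kjt._lengths

def stride(lengths, keys):
    # Mirrors how kjt.stride() can recover batch_size by calculation
    # in the variable-batch (non-VBE) case.
    return len(lengths) // len(keys)

assert stride(lengths, keys) == batch_size   # 12 // 3 == 4
```

In the variable-length (VBE) case this division would be meaningless, since each key could contribute a different number of length entries; that is why variable batch needs its own dynamic-shape treatment.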

Reviewed By: spmex, malaybag

Differential Revision: D82792378

@meta-cla bot added the CLA Signed label Sep 19, 2025
@facebook-github-bot

@TroyGarden has exported this pull request. If you are a Meta employee, you can view the originating diff in D82792378.

Summary:
Pull Request resolved: pytorch#3388

Pull Request resolved: pytorch#3387

