The algorithm form of scale schedule has been deprecated; it is now available as an argument to the trainer. This PR removes the Scale Schedule algorithm class, so the feature must be specified via the trainer init args. It also restores the scale schedule method card from the 0.3.1 release, updated to reflect the non-algorithm-class usage. Closes #434.
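
For readers migrating, a hedged before/after sketch is included below. The old algorithm-class import path and `ratio` argument shown in the comment are assumptions reconstructed from this commit message, not verified against the 0.3.x API; the trainer-argument form matches the method card that follows.

```python
from composer import Trainer

# Before (0.3.x algorithm form, now removed) -- the path and signature here are assumed:
# from composer.algorithms import ScaleSchedule
# trainer = Trainer(..., algorithms=[ScaleSchedule(ratio=0.5)])

# After: pass the ratio directly to the trainer init args.
trainer = Trainer(
    ...,
    max_duration="20ep",
    scale_schedule_ratio=0.5,
)
```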
# ⚖️ Scale Schedule

![scale_schedule.png](https://storage.googleapis.com/docs.mosaicml.com/images/methods/scale_schedule.png)

Tags: `Best Practice`, `Speedup`

## TL;DR

Scale Schedule changes the number of training steps by a dilation factor, dilating the learning rate schedule
accordingly. Doing so varies the training budget, making it possible to explore tradeoffs between cost (measured in
time or money) and the quality of the final model.

## Attribution

The number of training steps to perform is an important hyperparameter to tune when developing a model. This technique
appears implicitly throughout the deep learning literature. One example of a systematic study of this approach is the
*scan-SGD* technique in
[How Important is Importance Sampling for Deep Budgeted Training](https://openreview.net/forum?id=TqQ0oOzJlai) by
Eric Arazo, Diego Ortega, Paul Albert, Noel O'Connor, and Kevin McGuinness. Posted to OpenReview in 2020.

## Hyperparameters

- `ratio` - The ratio of the scaled learning rate schedule to the full learning rate schedule. For example, a ratio
  of 0.8 would train for 80% as many steps as the original schedule (see the sketch below).
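
As a concrete illustration of the ratio, a minimal arithmetic sketch is shown below; the helper function is hypothetical and not part of Composer, it only shows how a ratio rescales a step budget and its milestones.

```python
# Hypothetical helper for illustration only -- not a Composer API.
def apply_scale_schedule_ratio(max_steps: int, milestones: list[int], ratio: float):
    """Scale the total step budget and the LR milestones by the same ratio."""
    return int(max_steps * ratio), [int(m * ratio) for m in milestones]

# A ratio of 0.8 trains for 80% as many steps, with milestones shifted to match.
print(apply_scale_schedule_ratio(1000, [600, 800], ratio=0.8))
# -> (800, [480, 640])
```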

## Example Effects

Changing the length of training will affect the final accuracy of the model. For example, training ResNet-50 on
ImageNet for the standard schedule in the `composer` library leads to final validation accuracy of 76.6%, while
using scale schedule with a ratio of 0.5 leads to final validation accuracy of 75.6%. Training for longer can lead
to diminishing returns or even overfitting and worse validation accuracy. In general, the cost of training is
proportional to the length of training when using scale schedule (assuming all other techniques, such as progressive
resizing, have their schedules scaled accordingly).

```{note}
The warmup periods of schedulers are not scaled by the scale schedule ratio.
```
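
To illustrate the note above, here is a hedged sketch (the `MultiStepWithWarmupScheduler` arguments are assumed from the scheduler's documented interface, and `...` stands in for the remaining trainer arguments): with a ratio of 0.5 the run shrinks from 90 to 45 epochs and the milestones move from 30/60 to 15/30 epochs, while the 8-epoch warmup is left untouched.

```python
from composer import Trainer
from composer.optim.scheduler import MultiStepWithWarmupScheduler

trainer = Trainer(
    ...,
    max_duration="90ep",
    schedulers=MultiStepWithWarmupScheduler(t_warmup="8ep", milestones=["30ep", "60ep"]),
    scale_schedule_ratio=0.5,
    # Trains for 45 epochs; the milestones scale to 15ep and 30ep,
    # but the 8-epoch warmup period is not scaled.
)
```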

## Implementation Details

Scale schedule is implemented as part of the {class}`~.Trainer` via the `scale_schedule_ratio` argument.
The trainer will scale the `max_duration` by the `scale_schedule_ratio`, and also adjust non-warmup milestones
for the learning rate schedulers.

Scale schedule supports all Composer Schedulers:

```{eval-rst}
.. currentmodule:: composer.optim.scheduler
.. autosummary::
    :nosignatures:

    StepScheduler
    MultiStepScheduler
    MultiStepWithWarmupScheduler
    ConstantScheduler
    LinearScheduler
    LinearWithWarmupScheduler
    ExponentialScheduler
    CosineAnnealingScheduler
    CosineAnnealingWithWarmupScheduler
    CosineAnnealingWarmRestartsScheduler
    PolynomialScheduler
```

```{eval-rst}
.. seealso:: The :ref:`Scheduling Guide <Composer Schedulers>` for more information about Composer Schedulers.
```

Scale schedule also supports the following PyTorch schedulers (see the example sketch after this list):

* {class}`~torch.optim.lr_scheduler.StepLR`
* {class}`~torch.optim.lr_scheduler.MultiStepLR`
* {class}`~torch.optim.lr_scheduler.ExponentialLR`
* {class}`~torch.optim.lr_scheduler.CosineAnnealingLR`
* {class}`~torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`
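
For instance, a hedged sketch of pairing a PyTorch scheduler with the scale schedule ratio is shown below; it assumes a `model` and `optimizer` are constructed elsewhere and that the `MultiStepLR` milestones are interpreted in epochs.

```python
from torch.optim.lr_scheduler import MultiStepLR
from composer import Trainer

# `optimizer` is assumed to be constructed elsewhere; milestones are in epochs.
scheduler = MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

trainer = Trainer(
    ...,
    max_duration="90ep",
    optimizers=optimizer,
    schedulers=scheduler,
    scale_schedule_ratio=0.5,  # 45 epochs of training; milestones scale to 15 and 30
)
```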

For example, the code below will scale the training time by half
(to 10 epochs) and also scale the learning rate schedule.

```{eval-rst}
.. testcode::

    from composer import Trainer
    from composer.optim.scheduler import MultiStepScheduler

    trainer = Trainer(
        ...,
        max_duration="20ep",
        schedulers=MultiStepScheduler(milestones=["10ep", "16ep"]),
        scale_schedule_ratio=0.5,
    )

    # or equivalently, with default SSR=1.0:
    trainer = Trainer(
        ...,
        max_duration="10ep",
        schedulers=MultiStepScheduler(milestones=["5ep", "8ep"]),
    )
```

For additional details on using the scale schedule ratio, see the {ref}`Scale Schedule Ratio <Scale Schedule Ratio>`
section in the schedulers guide.

## Suggested Hyperparameters

The default scale schedule ratio is 1.0. For a standard maximum number of epochs (these will differ depending on the
task), scaling down the learning rate schedule will lead to a monotonic decrease in accuracy. Increasing the scale
schedule ratio will often improve the accuracy up to a plateau, although this leads to longer training time and added
cost.

## Composability

As a general rule, scale schedule can be applied in conjunction with any method. If other methods also perform actions
according to a schedule, it is important to modify their schedules to coincide with the altered number of epochs.
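
As a hedged illustration of that bookkeeping (the config dictionary below is hypothetical, not a Composer API), any companion method that acts at fixed epochs should have its own milestones multiplied by the same ratio:

```python
scale_schedule_ratio = 0.5

# Hypothetical epoch milestones for a companion method (not a Composer API).
companion_schedule = {"start_epoch": 10, "end_epoch": 80}

# Scale the companion method's milestones by the same ratio used for training.
scaled = {k: int(v * scale_schedule_ratio) for k, v in companion_schedule.items()}
print(scaled)  # {'start_epoch': 5, 'end_epoch': 40}
```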