Diverging runs when multiple resume #3725
Comments
@maxrousseau Have you tried this with a bigger model such as llama3-8b or mpt-125m using our llm-foundry stack? Also, I'm curious: how are we restarting mid-batch in your examples?
Hi, thanks for replying! No, I have not tried llm-foundry. I am developing my own infrastructure for small-scale experiments with Composer so that I have as much flexibility as possible (i.e. not limited to training LLMs). Sorry if I was unclear about the mid-batch point. It is not clear to me whether a "step" corresponds to (a) a full dataloader batch (1024 examples in my case) or (b) a device_microbatch (32 or 64 samples on my RTX 3090). If it is the latter, that could be the source of the problem.
IIUC, gradient accumulation is not the issue: in Composer's event loop we only checkpoint after gradient accumulation is done and the optimizer has taken its step. In other words, a step should be considered a full dataloader batch. To test this hypothesis, you can set … Alternatively, to fix your non-determinism, I would double-check whether you are running forward passes with exactly the same data points at each checkpoint resumption. We use streaming to ensure that our samples are exactly the same upon resumption.
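To make the "step versus microbatch" distinction concrete, here is a minimal, dependency-free sketch. It is not Composer's implementation. It uses a hypothetical scalar linear model and hypothetical helper names (`grad_mse`, `accumulated_grad`) to show that accumulating gradients over microbatches, each scaled by the full batch size, reproduces the full-batch gradient, so a single optimizer step covers the whole dataloader batch.

```python
# Sketch under stated assumptions: a scalar model y = w * x with MSE loss.
# One "step" = one full dataloader batch, split into microbatches whose
# gradients are accumulated before a single optimizer update.


def grad_mse(w, xs, ys, denom):
    """Gradient of sum((w*x - y)^2) / denom with respect to the scalar w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / denom


def accumulated_grad(w, xs, ys, microbatch_size):
    """Sum microbatch gradients, each normalised by the full batch size."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, microbatch_size):
        mb_x = xs[start:start + microbatch_size]
        mb_y = ys[start:start + microbatch_size]
        total += grad_mse(w, mb_x, mb_y, denom=n)
    return total


# Sizes taken from the thread: 1024 examples per batch, 32 per microbatch.
batch_size, microbatch_size = 1024, 32
xs = [i / batch_size for i in range(batch_size)]
ys = [3.0 * x + 0.5 for x in xs]
w = 1.0

full = grad_mse(w, xs, ys, denom=batch_size)
accum = accumulated_grad(w, xs, ys, microbatch_size)
microbatches_per_step = batch_size // microbatch_size

print(microbatches_per_step, abs(full - accum))
```

Under these assumptions, a checkpoint written only after the optimizer update never contains a partially accumulated gradient, which is why the microbatch size should not change the step count.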
So I just tried with this configuration:
But the number of steps is identical, so, as you said, a Composer step is indeed a full dataloader batch (see run here: wandb run). I will check whether the batches are identical after resumption; however, as I understand from your docs, spin_dataloaders should be set to "True" by default.
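One way to check whether batches match after resumption is to fingerprint each batch's sample indices and compare an uninterrupted run against a resumed one. The sketch below is a standalone illustration, not Composer or streaming code. The functions `epoch_batches`, `fingerprint`, and `resumed_batches` are hypothetical, and resumption is modelled as replaying the epoch and skipping the batches already consumed.

```python
import hashlib
import random


def epoch_batches(num_samples, batch_size, seed, epoch):
    """Yield batches of sample indices, shuffled deterministically per epoch."""
    order = list(range(num_samples))
    random.Random(seed * 1000 + epoch).shuffle(order)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]


def fingerprint(batch):
    """Short, order-sensitive hash of a batch of sample indices."""
    payload = ",".join(map(str, batch)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def resumed_batches(num_samples, batch_size, seed, epoch, batches_seen):
    """Replay the epoch and skip batches consumed before the checkpoint."""
    batches = epoch_batches(num_samples, batch_size, seed, epoch)
    for i, batch in enumerate(batches):
        if i >= batches_seen:
            yield batch


uninterrupted = [fingerprint(b) for b in epoch_batches(4096, 1024, seed=17, epoch=0)]
resumed = [
    fingerprint(b)
    for b in resumed_batches(4096, 1024, seed=17, epoch=0, batches_seen=2)
]

# If data order is reproducible, the resumed run sees the tail of the same epoch.
print(resumed == uninterrupted[2:])
```

In a real run, the same idea applies by logging a hash of each batch's sample IDs (or token IDs) per step and diffing the logs of the two runs.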
Environment
composer version 0.26.0
torch 2.4.0
To reproduce
Steps to reproduce the behavior:
$ git clone https://github.com/maxrousseau/rafale.git
$ cd rafale
$ uv venv
$ . .venv/bin/activate
$ uv pip install -r cuda-requirements.txt
$ uv pip install -e .
$ rafale-run test/pythia_tinystories.yaml
$ # cancel the current run
$ rafale-run test/pythia_tinystories.yaml # resumes from the "latest" checkpoint
Expected behavior
Near-exact continuation of the training loss curve compared to the uninterrupted run. Instead, after the second or third resumption, the loss begins to diverge (see plot below). I suspect that gradient accumulation may be causing an issue where the gradients are not stored in the checkpoint, or that we are restarting mid-batch (and the accumulated gradients are lost).
Note: purple is the uninterrupted run which has lower training loss.
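To pin down where a resumed run stops tracking the uninterrupted one, the per-step losses of both runs can be compared directly. The following is a small sketch with a hypothetical helper, `first_divergence`, and made-up loss values used purely for illustration. The relative tolerance is an assumption and would need tuning for a real run.

```python
def first_divergence(reference, resumed, rel_tol=1e-3):
    """Return the first step whose losses differ by more than rel_tol, else None."""
    for step, (a, b) in enumerate(zip(reference, resumed)):
        scale = max(abs(a), abs(b), 1e-12)  # avoid division by zero
        if abs(a - b) / scale > rel_tol:
            return step
    return None


# Illustrative values only, not taken from the runs in this issue.
reference_loss = [2.50, 2.31, 2.17, 2.05, 1.96]
resumed_loss = [2.50, 2.31, 2.17, 2.09, 2.03]

diverged_at = first_divergence(reference_loss, resumed_loss)
print(diverged_at)
```

Comparing the reported step against the steps at which checkpoints were loaded would indicate whether divergence starts exactly at a resumption boundary.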
Additional context
I am using device_microbatch_size="auto" for my training run. The configuration of the run is the following: