Diverging runs when multiple resume #3725

maxrousseau · 2024-11-26T16:47:38Z

** Environment **

composer version 0.26.0
torch 2.4.0

** To reproduce **

Steps to reproduce the behavior:

$ git clone https://github.com/maxrousseau/rafale.git
$ cd rafale
$ uv venv
$ . .venv/bin/activate
$ uv pip install -r cuda-requirements.txt
$ uv pip install -e .
$ rafale-run test/pythia_tinystories.yaml
$ # cancel the current run
$ rafale-run test/pythia_tinystories.yaml # resumes from the "latest" checkpoint

Expected behavior

The training loss curve should continue almost exactly as in the uninterrupted run. Instead, after the second or third resumption, the loss begins to diverge (see plot below). I suspect gradient accumulation may be the cause: either the accumulated gradients are not stored in the checkpoint, or the run restarts mid-batch and the accumulated gradients are lost.

Note: purple is the uninterrupted run which has lower training loss.

Additional context

I am using device_microbatch_size="auto" for my training run. The configuration of the run is the following:

run:
    name: "pythia14m-tinystories" # name of your experiment, used for checkpointing
    seed: 42
    n_epochs: 1
    max_lr: 6e-04
    warmup_pct: 0.01
    schedule: "cosine-warmup" # linear, linear-warmup, cosine, cosine-warmup
    optimizer: "AdamW"
    eval_interval: "100ba"
    clip_type: "norm"
    clip_value: 1.0
    device_bs: "auto"
    save_interval: "50ba"
    train_key: "train"
    eval_key: "validation"
    model:
        config: "pythia14m" # config key
        type: "decoder"
        use_pretrained: True
        # mode: None
        # n_classes: None

data:
    pipeline: "tinystories_neox" # the preprocessing/tokenization pipeline
    config:
        name: "tinystories"
        num_processes: 8
        tokenizer_name: "neox"
        shuffle_dataset: True # this will shuffle the whole training dataset once
        input_id_key: "input_ids"
        train_batch_size: 1024
        eval_batch_size: 16
        shuffle_train: False
        dataset_path: "~/code/data/TinyStories"
        tokenizer_path: "EleutherAI/pythia-14m"
        max_sequence_length: 512
        pad_token_id: -100
        pad_inputs: True
        is_prepared: False
        subset_key_mappings: { "train": "train", "validation": "validation" } # (source: target)

j316chuck · 2024-11-26T19:13:08Z

@maxrousseau have you tried this with a bigger model like llama3-8b or mpt-125m using our llm-foundry stack?

Also, I'm curious: how are we restarting mid-batch in your example?

maxrousseau · 2024-11-26T19:22:57Z

Hi, thanks for replying! No, I have not tried llm-foundry. I am developing my own infra for small-scale experiments using composer, to have as much flexibility as possible (i.e. not limited to training LLMs).

Sorry if I was unclear about the mid-batch thing. It is not clear to me whether a "step" corresponds to a) a full dataloader batch (in my case 1024 examples) or b) a device_microbatch (here 32 or 64 samples on my RTX 3090). If it is the latter, that could be the source of the problem.

j316chuck · 2024-11-27T20:44:47Z

IIUC, gradient accumulation is not the issue: in the event loop of composer, we only checkpoint after gradient accumulation is done and the optimizer has taken its step. In other words, a step should be considered a full dataloader batch.
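A minimal sketch of the accounting described above, written in plain Python rather than taken from Composer's event loop. The function `simulate_steps` and its parameters are hypothetical; the numbers come from the original post (batch size 1024, a microbatch of 64 as one of the two sizes mentioned, save interval of 50 batches). It only illustrates that microbatches accumulate inside a step and that a checkpoint can land only on a step boundary.

```python
def simulate_steps(batch_size, microbatch_size, n_steps, save_every):
    """Count microbatch passes per optimizer step and list checkpointed steps.

    Hypothetical illustration only; this is not Composer's event loop.
    """
    if batch_size % microbatch_size != 0:
        raise ValueError("batch_size must be a multiple of microbatch_size")
    microbatches_per_step = batch_size // microbatch_size
    checkpoints = []
    total_microbatches = 0
    for step in range(1, n_steps + 1):
        # Gradients are accumulated over every microbatch of the batch...
        for _ in range(microbatches_per_step):
            total_microbatches += 1
        # ...then the optimizer steps once; only after that may a checkpoint be saved.
        if step % save_every == 0:
            checkpoints.append(step)
    return microbatches_per_step, total_microbatches, checkpoints


per_step, total, saved = simulate_steps(
    batch_size=1024, microbatch_size=64, n_steps=150, save_every=50
)
print(per_step, total, saved)
```

Under this model, no checkpoint can ever capture partially accumulated gradients, which is why the microbatch size should not affect what is restored.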

To test this hypothesis, you can replace "auto" with a fixed value divisible by your save_interval; you should still be able to reproduce the non-determinism. If not, there is likely a bug on our side, and we would need to investigate our recent changes further.

Alternatively, to fix your non-determinism, I would double-check that you are running forward passes with exactly the same data points at each checkpoint resumption. We use streaming to ensure that our samples are exactly the same upon resumption.
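One way to run the data check suggested above is to fingerprint every batch and compare the resumed run against the uninterrupted one. The sketch below is an assumption-laden illustration in plain Python: `batch_fingerprints` is a hypothetical stand-in for a dataloader (a seeded shuffle of sample indices), and `start_batch` mimics fast-forwarding past batches that were already consumed.

```python
import hashlib
import random


def batch_fingerprints(n_samples, batch_size, seed, start_batch=0):
    """Return one short hash per batch of a seeded shuffle of sample indices.

    Hypothetical stand-in for a dataloader; `start_batch` mimics skipping
    already-consumed batches when resuming.
    """
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    prints = []
    for start in range(start_batch * batch_size, n_samples, batch_size):
        batch = order[start:start + batch_size]
        digest = hashlib.sha256(repr(batch).encode()).hexdigest()[:12]
        prints.append(digest)
    return prints


baseline = batch_fingerprints(n_samples=64, batch_size=8, seed=42)
# Resume that fast-forwards past the 3 batches already consumed.
resumed_ok = batch_fingerprints(n_samples=64, batch_size=8, seed=42, start_batch=3)
# Resume that restarts the data order from the beginning instead.
resumed_bad = batch_fingerprints(n_samples=64, batch_size=8, seed=42)[: len(resumed_ok)]

print("fast-forwarded resume matches:", resumed_ok == baseline[3:])
print("restarted resume matches:", resumed_bad == baseline[3:])
```

In a real run, the same idea would apply to the token IDs of each training batch (the `input_ids` key in the configuration above), logged once per step in both runs.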

maxrousseau · 2024-11-30T15:18:25Z

So I just tried with this configuration:

training dataloader batch size = 1024
device microbatch size = 32
save interval = 32

The number of steps is unchanged, so, as you said, a composer step is indeed a full dataloader batch.
In any case, after the second resumption from checkpoint (after 128 steps), the loss begins to diverge from the baseline.
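To pin down exactly where the curves separate, the logged losses of both runs can be compared step by step. This is a small sketch under stated assumptions: `first_divergence` is a hypothetical helper, the tolerance is arbitrary, and the loss values are made up for illustration (they are not taken from the runs in this issue).

```python
import math


def first_divergence(baseline, resumed, rel_tol=1e-3):
    """Return the first step whose losses differ beyond rel_tol, or None.

    Both arguments map step -> training loss; only shared steps are compared.
    """
    for step in sorted(set(baseline) & set(resumed)):
        if not math.isclose(baseline[step], resumed[step], rel_tol=rel_tol):
            return step
    return None


# Made-up losses for illustration only, not values from the runs in this issue.
baseline = {126: 3.100, 127: 3.090, 128: 3.080, 129: 3.070, 130: 3.060}
resumed = {126: 3.100, 127: 3.090, 128: 3.080, 129: 3.095, 130: 3.110}
print(first_divergence(baseline, resumed))
```

If the first divergent step is always the one right after a checkpoint boundary, that points at state restored (or not restored) on resume rather than at accumulated numerical noise.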

see run here: wandb run

I will check whether the batches are identical after resumption; however, as I understand from your docs, spin_dataloaders should be set to "True" by default.

maxrousseau added the "bug" label ("Something isn't working") on Nov 26, 2024
