Diverging runs when multiple resume #3725

Open
maxrousseau opened this issue Nov 26, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@maxrousseau

maxrousseau commented Nov 26, 2024

Environment

composer version 0.26.0
torch 2.4.0

To reproduce

Steps to reproduce the behavior:

$ git clone https://github.com/maxrousseau/rafale.git
$ cd rafale
$ uv venv
$ . .venv/bin/activate
$ uv pip install -r cuda-requirements.txt
$ uv pip install -e .
$ rafale-run test/pythia_tinystories.yaml
$ # cancel the current run
$ rafale-run test/pythia_tinystories.yaml # resumes from the "latest" checkpoint

Expected behavior

Near-exact continuation of the training loss curve compared to the uninterrupted run. Instead, after the second or third resumption, the loss begins to diverge (see plot below). I suspect gradient accumulation may be the cause: either the accumulated gradients are not stored in the checkpoint, or the run restarts mid-batch and the accumulated gradients are lost.

Note: purple is the uninterrupted run which has lower training loss.
[image: training loss curves of the resumed runs and the uninterrupted run]

Additional context

I am using device_microbatch_size="auto" for my training run. The run configuration is the following:

run:
    name: "pythia14m-tinystories" # name of your experiment, used for checkpointing
    seed: 42
    n_epochs: 1
    max_lr: 6e-04
    warmup_pct: 0.01
    schedule: "cosine-warmup" # linear, linear-warmup, cosine, cosine-warmup
    optimizer: "AdamW"
    eval_interval: "100ba"
    clip_type: "norm"
    clip_value: 1.0
    device_bs: "auto"
    save_interval: "50ba"
    train_key: "train"
    eval_key: "validation"
    model:
        config: "pythia14m" # config key
        type: "decoder"
        use_pretrained: True
        # mode: None
        # n_classes: None

data:
    pipeline: "tinystories_neox" # the preprocessing/tokenization pipeline
    config:
        name: "tinystories"
        num_processes: 8
        tokenizer_name: "neox"
        shuffle_dataset: True # this will shuffle the whole training dataset once
        input_id_key: "input_ids"
        train_batch_size: 1024
        eval_batch_size: 16
        shuffle_train: False
        dataset_path: "~/code/data/TinyStories"
        tokenizer_path: "EleutherAI/pythia-14m"
        max_sequence_length: 512
        pad_token_id: -100
        pad_inputs: True
        is_prepared: False
        subset_key_mappings: { "train": "train", "validation": "validation" } # (source: target)
@maxrousseau maxrousseau added the bug Something isn't working label Nov 26, 2024
@j316chuck
Contributor

@maxrousseau have you tried this with a bigger model like llama3-8b or mpt-125m using our llm-foundry stack?

also, curious how are we re-starting mid-batch in your examples?

@maxrousseau
Author

Hi, thanks for replying! No, I have not tried llm-foundry. I am developing my own infra for small-scale experiments using composer, to have as much flexibility as possible (i.e. not limited to training LLMs).

Sorry if I was unclear about the mid-batch thing. It is not clear to me whether a "step" corresponds to (a) a full dataloader batch (in my case 1024 examples) or (b) a device_microbatch (here 32 or 64 samples on my RTX3090). If it is the latter, that could be the source of the problem.

@j316chuck
Contributor

j316chuck commented Nov 27, 2024

IIUC, gradient accumulation is not the issue: in the event loop of composer, we only checkpoint after the gradient accumulation is done and the optimizer has taken its step. In other words, a step should be considered a full dataloader batch.
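To make the point about steps and microbatches concrete, here is a toy sketch in plain Python. It is not Composer's code: `grad_sum`, `train_step`, `train`, and the one-weight linear model are made up for illustration. It only shows the general pattern described above, where gradients are accumulated over microbatches, the weights are updated once per full batch, and a checkpoint can only happen between steps.

```python
def grad_sum(w, xs, ys):
    """Summed gradient of the squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def train_step(w, batch_x, batch_y, microbatch_size, lr=0.01):
    """One optimizer step over a full dataloader batch with gradient accumulation."""
    accumulated = 0.0
    for start in range(0, len(batch_x), microbatch_size):
        xs = batch_x[start:start + microbatch_size]
        ys = batch_y[start:start + microbatch_size]
        accumulated += grad_sum(w, xs, ys)  # accumulate only, no weight update yet
    return w - lr * accumulated / len(batch_x)  # single update per full batch


def train(batches, microbatch_size, save_interval):
    """Return the final weight and the steps at which a checkpoint would be saved."""
    w, saved_at = 0.0, []
    for step, (batch_x, batch_y) in enumerate(batches, start=1):
        w = train_step(w, batch_x, batch_y, microbatch_size)
        if step % save_interval == 0:  # reached only after the optimizer update
            saved_at.append(step)
    return w, saved_at


# Hypothetical toy data: six identical batches of four examples each.
batches = [([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])] * 6
w_micro1, saved_micro1 = train(batches, microbatch_size=1, save_interval=2)
w_micro4, saved_micro4 = train(batches, microbatch_size=4, save_interval=2)
print(saved_micro1, saved_micro4)       # same checkpoint steps for both settings
print(abs(w_micro1 - w_micro4) < 1e-9)  # same weights up to float rounding
```

In this sketch the microbatch size changes neither the number of steps nor where a checkpoint lands, so no partially accumulated gradient ever needs to be saved.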

To test this hypothesis, you can set auto to a value divisible by your save_interval and you should still be able to reproduce the non-determinism. If not, there is likely a bug on our side and we would need to investigate further on our recent changes.

Alternatively, to fix your non-determinism, I would double-check whether you are running forward passes with the exact same data points at each checkpoint resumption. We use streaming to ensure that our samples are exactly the same upon resumption.
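One possible way to run that check is to log a fingerprint of every training batch, keyed by step, in both the uninterrupted and the resumed run, and then compare the two logs. The sketch below is an assumption-laden illustration, not part of Composer: `batch_fingerprint` and `first_mismatch` are hypothetical helpers, and batches are assumed to be convertible to nested lists of token ids (for example via `.tolist()` on the tensor stored under the `input_ids` key).

```python
import hashlib
import json


def batch_fingerprint(batch):
    """Short, order-sensitive hash of a batch given as nested lists of token ids."""
    payload = json.dumps(batch, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def first_mismatch(baseline, resumed):
    """Return the first step whose fingerprint differs between two runs, or None.

    Both arguments map step number -> fingerprint; only shared steps are compared.
    """
    for step in sorted(set(baseline) & set(resumed)):
        if baseline[step] != resumed[step]:
            return step
    return None


# Hypothetical toy data: ten distinct batches of two sequences each.
data = [[[i, i + 1], [i + 2, i + 3]] for i in range(0, 40, 4)]
baseline = {step: batch_fingerprint(b) for step, b in enumerate(data, start=1)}

# Correct resume at step 6: the data continues with the sixth batch.
good = {step: batch_fingerprint(b) for step, b in enumerate(data[5:], start=6)}
# Faulty resume: the step counter continues, but the data restarts from the beginning.
bad = {step: batch_fingerprint(b) for step, b in enumerate(data[:5], start=6)}

print(first_mismatch(baseline, good))  # None
print(first_mismatch(baseline, bad))   # 6
```

If the first mismatching step coincides with a resumption point, the data order rather than the optimizer state is the likely culprit.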

@maxrousseau
Author

maxrousseau commented Nov 30, 2024

So I just tried with this configuration:

  • training dataloader batch size = 1024
  • device microbatch size = 32
  • save interval = 32

But the number of steps is identical, so, as you said, a composer step is indeed a full dataloader batch.
In any case, after the second resumption from a checkpoint (after 128 steps), the loss begins to diverge from the baseline.

see run here: wandb run

I will check whether the batches are identical after resumption; however, as I understand from your docs, spin_dataloaders should be set to "True" by default.
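For context, the general idea behind "spinning" a dataloader is to rebuild it on resume and discard the batches that were already consumed before the checkpoint. The sketch below illustrates only that general idea in plain Python; it is not Composer's implementation, and `make_batches` and `resume_batches` are hypothetical names. It also shows why fast-forwarding helps only when the sample order itself is reproducible.

```python
import random
from itertools import islice


def make_batches(num_samples, batch_size, seed=None):
    """Yield batches of sample indices; the shuffle is reproducible only via the seed."""
    order = list(range(num_samples))
    if seed is not None:
        random.Random(seed).shuffle(order)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]


def resume_batches(num_samples, batch_size, batches_seen, seed=None):
    """Rebuild the loader and skip the batches consumed before the checkpoint."""
    return islice(make_batches(num_samples, batch_size, seed), batches_seen, None)


full_run = list(make_batches(32, 4, seed=42))

# Same seed on resume: the remaining batches match the uninterrupted run.
same_order = list(resume_batches(32, 4, batches_seen=3, seed=42))
print(same_order == full_run[3:])  # True

# Different shuffle on resume: skipping the right number of batches is not enough.
other_order = list(resume_batches(32, 4, batches_seen=3, seed=7))
print(other_order == full_run[3:])
```

So even with fast-forwarding enabled, any shuffling or preprocessing step that is not seeded identically on every launch could still produce different batches after a resume.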
