
Issue reproducing fine-tuning results – NaN loss after 40 epochs #44

@orspiegel

Description


Hello,
I’m trying to reproduce your fine-tuning results using the checkpoint multisubject_subj01_1024hid_nolow_300ep with subject01 data.
I’m training on a single L40S GPU with a batch size of 24.

However, after 40 epochs, training aborts with the following error:

Traceback (most recent call last):
  File "/net/mraid20/ifs/wisdom/userdata/orsp/repos/MindEyeV2/src/Train.py", line 834, in <module>
    utils.check_loss(loss)
  File "/net/mraid20/ifs/wisdom/userdata/orsp/repos/MindEyeV2/src/utils.py", line 208, in check_loss
    raise ValueError('NaN loss')
ValueError: NaN loss

This occurs at:

utils.check_loss(loss)

I’m launching the training script (Train.ipynb) via accel.slurm.

Do you have any insights on why this might be happening?
Thanks in advance!
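For context, the guard that fires here is a simple non-finite check; a minimal sketch of what utils.check_loss appears to do (the function name and error message come from the traceback above, the body is assumed), together with the usual first mitigations when a loss goes NaN mid-training, would be:

```python
import math

def check_loss(loss_value: float) -> None:
    """Abort training as soon as the loss becomes non-finite.
    Sketch of the guard in utils.check_loss (body assumed)."""
    if math.isnan(loss_value) or math.isinf(loss_value):
        raise ValueError('NaN loss')

# Common first things to try when this fires after many epochs:
#   - lower the learning rate or lengthen the warmup schedule
#   - clip gradients, e.g. torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   - if training in fp16, keep the loss (and softmax/log ops) in fp32
```

The mitigation list is generic advice, not something confirmed for this repo.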
