
Issue reproducing fine-tuning results – NaN loss after 40 epochs #44

@orspiegel

Description


Hello,
I’m trying to reproduce your fine-tuning results using the checkpoint multisubject_subj01_1024hid_nolow_300ep with subject01 data.
I’m training on a single L40S GPU with a batch size of 24.

However, after 40 epochs, training aborts with the following error:

Traceback (most recent call last):
  File "/net/mraid20/ifs/wisdom/userdata/orsp/repos/MindEyeV2/src/Train.py", line 834, in <module>
    utils.check_loss(loss)
  File "/net/mraid20/ifs/wisdom/userdata/orsp/repos/MindEyeV2/src/utils.py", line 208, in check_loss
    raise ValueError('NaN loss')
ValueError: NaN loss

This occurs at:

utils.check_loss(loss)

I’m launching the training script (Train.ipynb) via accel.slurm.

Do you have any insights on why this might be happening?
Thanks in advance!
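For context, the guard that fires here is a simple non-finite check; a minimal sketch of what utils.check_loss appears to do (the function name and error message come from the traceback above, the body is assumed), together with the usual first mitigations when a loss goes NaN mid-training, would be:

```python
import math

def check_loss(loss_value: float) -> None:
    """Abort training as soon as the loss becomes non-finite.
    Sketch of the guard in utils.check_loss (body assumed)."""
    if math.isnan(loss_value) or math.isinf(loss_value):
        raise ValueError('NaN loss')

# Common first things to try when this fires after many epochs:
#   - lower the learning rate or lengthen the warmup schedule
#   - clip gradients, e.g. torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
#   - if training in fp16, keep the loss (and softmax/log ops) in fp32
```

The mitigation list is generic advice, not something confirmed for this repo.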
