Hello,
I’m trying to reproduce your fine-tuning results using the checkpoint multisubject_subj01_1024hid_nolow_300ep with subject01 data.
I’m training on a single L40S GPU with a batch size of 24.
However, after 40 epochs, I encountered the following error:
Traceback (most recent call last):
  File "/net/mraid20/ifs/wisdom/userdata/orsp/repos/MindEyeV2/src/Train.py", line 834, in <module>
    utils.check_loss(loss)
  File "/net/mraid20/ifs/wisdom/userdata/orsp/repos/MindEyeV2/src/utils.py", line 208, in check_loss
    raise ValueError('NaN loss')
ValueError: NaN loss
This occurs at:
I’m running the training script (Train.ipynb) using accel.slurm.
Do you have any insights on why this might be happening?
Thanks in advance!