Error while training TDNN in stage 5 #13

shuuennokage opened this issue Feb 26, 2021 · 6 comments


shuuennokage commented Feb 26, 2021

Hello,
I recently encountered a problem while running the training scripts.
After initializing the dataset and the model, training started and a lot of errors popped up, like the following:

[screenshot: error messages during training]

(I am using almost the same settings as the mini_librispeech example, just with a different dataset.)

edit: I read the closed issue "Loss nan"; it describes the same problem, but in my case it happens across my entire training set.
I also checked my data and it seems fine. What can I do to get rid of these errors?


shuuennokage commented Mar 2, 2021

The "Loss nan" issue says that samples having too few #frames will cause this error (similar to the case in CTC where input_length < target_length).
This problem occurs when I use my own dataset, no matter the dataset is monolingual or multilingual.
How could I check where the error is occurring by myself? I'm not sure if the data or the settings are causing these problems.
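Right now the only check I can think of is a rough scan over the data files, something like this (Kaldi-style data/train/utt2num_frames and data/train/text files are assumed, and the "one output frame per word" rule is only my guess, not the exact constraint of the chain loss):

```python
# Rough scan for utterances whose supervision might be longer than the
# (subsampled) input. File paths, the subsampling factor, and the
# one-output-frame-per-word rule are all assumptions.
SUBSAMPLING = 3  # assumed frame-subsampling factor

num_frames = {}
with open("data/train/utt2num_frames") as f:
    for line in f:
        utt, n = line.split()
        num_frames[utt] = int(n)

with open("data/train/text") as f:
    for line in f:
        utt, *words = line.split()
        out_frames = num_frames.get(utt, 0) // SUBSAMPLING
        if out_frames < len(words):
            print(f"suspicious: {utt} has {out_frames} output frames "
                  f"for {len(words)} words")
```

Does a check along these lines make sense, or is there a better way to validate the data?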


shuuennokage commented Mar 4, 2021

Hello,
I dug deeper into the code and found that the tensors become NaN after passing through the TDNN (1-D dilated convolution) layer.
The input tensors still have normal values before going into the TDNN layer:
[screenshot: tensor values before the TDNN layer]

but after passing through the TDNN layer, the tensors become NaN:
[screenshot: tensor values after the TDNN layer]

I modified the code inside model/tdnn.py to see when the tensors turn into NaN:
[screenshot: modified model/tdnn.py with NaN checks]
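Roughly, the probe I added looks like this (the layer below is just a stand-in for illustration; the real layer in model/tdnn.py is defined differently):

```python
import torch
import torch.nn as nn

class ProbedTDNNLayer(nn.Module):
    """Stand-in for one TDNN layer, instrumented to report where NaNs appear."""
    def __init__(self, in_dim, out_dim, kernel_size=3, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size, dilation=dilation)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, x):  # x: (batch, in_dim, time)
        assert not torch.isnan(x).any(), "NaN already present in the input"
        x = self.conv(x)
        if torch.isnan(x).any():
            print("NaN appeared right after the dilated conv")
        x = self.bn(torch.relu(x))
        if torch.isnan(x).any():
            print("NaN appeared after ReLU/BatchNorm")
        return x
```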
Another weird point is that the NaN activations only appear from step 1 onward, not on step 0.
However, the log-prob-deriv sum is already NaN and the loss is inf on step 0:
[screenshot: step 0 log showing NaN log-prob-deriv sum and inf loss]

I tried modifying the learning rate and the batch size, but nothing changed.
What should I do to train properly on my own dataset? Is there perhaps something wrong with my dataset?

YiwenShaoStephen (Owner) commented

Hi, thanks for the detailed information. It looks like you get NaN gradients at step 0 (your loss for step 0 is inf), which then puts NaN into the network's parameters and makes all activations become NaN in the following steps.
I suggest you look into the values at step 0 to see how this happens. Since you didn't get any error information from step 0, I suspect there might be an issue in computing the loss.
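One generic way to pinpoint where the first bad gradient comes from is PyTorch's anomaly detection; roughly something like this around the first step (the `model`/`chain_loss`/`batch` names below are placeholders, not this repo's actual code):

```python
import torch

# Anomaly detection is slow, so enable it only while debugging; it reports
# which backward op produced the first NaN/Inf gradient.
torch.autograd.set_detect_anomaly(True)

# 'model', 'batch' and 'chain_loss' are placeholders for the actual objects;
# the point is only where the checks go.
output = model(batch["input"])
loss = chain_loss(output, batch["supervision"])
if not torch.isfinite(loss):
    print("loss is already non-finite before backward:", loss.item())
loss.backward()
```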

shuuennokage (Author) commented

Hello,
I actually got an error message during step 0 (although the loss is inf, not NaN):
[screenshot: error message at step 0]
I got this error message right after the model structure was printed (layers and number of parameters).

Anyway, thank you very much for the reply!
I'll try looking into the values and the loss computation to see where the problem is.
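As a stopgap while I debug, I'm thinking of guarding the training loop with something like this (the `train_loader`/`compute_loss`/`utt_ids` names are placeholders, not the repo's actual code), so a single bad minibatch at least doesn't poison the weights:

```python
import torch

# 'train_loader', 'model', 'optimizer', 'compute_loss' and the 'utt_ids'
# key are all placeholders for the actual training-loop objects.
for step, batch in enumerate(train_loader):
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    if not torch.isfinite(loss):
        # Log the offending minibatch and skip the update so the
        # parameters don't get overwritten with NaN.
        print(f"step {step}: non-finite loss {loss.item()} for {batch.get('utt_ids')}")
        continue
    loss.backward()
    optimizer.step()
```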

shuuennokage (Author) commented

Hello,
After several rounds of trial and error, I decided to shrink the dataset to speed up debugging (from 111656 to 19874 utterances).
And... the bug just disappeared by itself; I have no idea what happened.
[screenshot: training running without errors]
No more error messages. I've removed the debug output, so nothing extra is printed; everything is fine now.

But now I have a new question: why was this happening, and why did decreasing the size of the training set fix it?
Is this related to the batch size? (I did some research and found that this kind of problem can be mitigated by decreasing the lr or increasing the batch size, so what I did effectively amounts to increasing the batch size relative to the dataset, not just decreasing the size of the training set.)


shuuennokage commented Mar 17, 2021

Hello,
I did some experiments and found that the system crashes when the training data reaches about 80k~90k utterances.
Once the amount of data reaches that range, the system crashes no matter how I adjust the lr and the batch size.
Is this example only suitable for small training sets?
Also, the same settings work on my GTX 1080 Ti but crash on a Titan RTX; I don't know whether that is related.
