
IndexError: index xxxxxx is out of bounds for dimension 0 with size 36 #14

Open
igor17400 opened this issue Apr 18, 2024 · 19 comments
Labels: bug (Something isn't working)

@igor17400
Contributor

Hi, when executing some models such as NRMS and MINER, I am encountering the following error message after a few epochs (2 for NRMS and 5 for MINER):

IndexError: index 1021546040 is out of bounds for dimension 0 with size 36

From my investigations, it seems that the loss might be exploding and the forward pass is returning nan vectors. For instance, before receiving the error, I receive warning messages like:

Encountered `nan` values in tensor. Will be removed.  warnings.warn(*args, **kwargs)  # noqa: B028

I am using the default configuration files:


I plan to add some print statements to better evaluate the bug.
Has anyone else faced this issue?

@andreeaiana
Owner

Hi, I noticed that if the learning rate is too large, the forward pass can return nan values. However, I ran the NRMS model on both MINDlarge and Adressa 1-2 weeks ago and didn't run into any issues myself.
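For anyone hitting exploding gradients, one knob worth trying is Lightning's gradient clipping; a sketch of a hypothetical trainer override (the exact config key in this repo may differ):

trainer:
  gradient_clip_val: 0.5  # clip the gradient norm before each optimizer step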

Someone else faced a similar problem with an earlier version of the code and the MANNeR model here.

@igor17400
Contributor Author

Thank you for your response @andreeaiana!

I just finished collecting the prints I mentioned before. I added them inside the validation_step as follows:

    def validation_step(self, batch: RecommendationBatch, batch_idx: int):
        loss, preds, targets, cand_news_size, _, _, _, _, _, _, _ = self.model_step(batch)

        print("********* loss *********")
        print(loss.size())
        print(loss)
        print("********* preds *********")
        print(preds.size())
        print(preds)
        print("********* targets *********")
        print(targets.size())
        print(targets)

I tested the NRMS model. During the validation phase of the 2nd epoch, it outputs the same error message:

IndexError: index 4607182419821563448 is out of bounds for dimension 0 with size 36

And, as I suspected, the tensor values are indeed coming back as nan, as can be seen below:

Print statements ``` ********* preds ********* torch.Size([454]) tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16) ********* targets ********* torch.Size([454]) tensor([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0') ********* loss ********* torch.Size([]) tensor(nan, device='cuda:0') ********* preds ********* torch.Size([386]) tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16) ********* targets ********* torch.Size([386]) tensor([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0') ********* loss ********* torch.Size([]) tensor(nan, device='cuda:0') ********* preds ********* torch.Size([146]) tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16) ********* targets ********* torch.Size([146]) tensor([1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.], device='cuda:0') ********* loss ********* torch.Size([]) tensor(nan, device='cuda:0') ********* preds ********* torch.Size([515]) tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16) ********* targets ********* torch.Size([515]) tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0') ********* loss ********* torch.Size([]) tensor(nan, device='cuda:0') ********* preds ********* torch.Size([500]) tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16) ********* targets ********* torch.Size([500]) tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.], device='cuda:0') ********* loss ********* torch.Size([]) tensor(nan, device='cuda:0') ********* preds ********* torch.Size([47]) tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16) ********* targets ********* torch.Size([47]) tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.], device='cuda:0') ```

Regarding the learning rate you mentioned, I'm using the default configuration of the file nrms.yaml. That is,

optimizer:
  _target_: torch.optim.Adam
  _partial_: true
  lr: 0.0001
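Side note (an assumption on my side, only relevant if the run uses true 16-bit precision rather than mixed precision): Adam's default eps=1e-8 is below the smallest positive float16 value (~6e-8), so it can round to zero in the update denominator and destabilize training. A hedged variant of the same config would be:

optimizer:
  _target_: torch.optim.Adam
  _partial_: true
  lr: 0.0001
  eps: 1.0e-6  # stays representable in float16; irrelevant under mixed precision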

@igor17400
Contributor Author

I'll add some print statements during the training part of epoch 2 to see what happens with the preds and the loss.

@igor17400
Contributor Author

Apparently some predictions are being returned as nan. I've added the print statements inside the model_step method to better visualize this during training. Here is one example:

########### targets ###########
torch.Size([50])
tensor([0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.],

########### batch[batch_cand] ###########
torch.Size([50])
tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7,
        7, 7], device='cuda:0')

########### user_ids ###########
tensor([38572, 68013, 60944,  5008, 58917, 88430, 30919, 56731],
       device='cuda:0')
tensor([38572, 68013, 60944,  5008, 58917, 88430, 30919, 56731],
       device='cuda:0')

########### cand_news_ids ###########
torch.Size([50])
tensor([31602, 62391, 50135, 59893, 38783, 50675, 24423, 62360, 24111, 49180,
        48019, 63970, 33619, 48046, 32544, 44422, 38263, 44290,  7419, 62563,
        43102, 20678, 33885, 58114, 30172, 51398, 27845, 39115, 25764, 41178,
        34876, 59673, 51048,   287, 45266, 55689, 35729, 55689, 59981,  7809,
        32544,  7319, 41020, 50675, 31947, 43432, 43432, 43432, 43432, 55689],
       device='cuda:0')

########### loss ###########
torch.Size([])
tensor(nan, device='cuda:0', grad_fn=<DivBackward1>)

########### preds ###########
torch.Size([80])
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
 nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
 nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
 nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan], device='cuda:0',
       dtype=torch.float16, grad_fn=<CatBackward0>)

########### y_true ###########
torch.Size([8, 5])
tensor([[1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1.],
        [0., 1., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.]], device='cuda:0')

########### targets ###########
torch.Size([40])
tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 1., 0., 0.], device='cuda:0')

########### batch[batch_cand] ###########
torch.Size([40])
tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4,
        4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7], device='cuda:0')

########### user_ids ###########
tensor([28439, 40880, 54316, 40815, 32693, 44860, 23931, 14864],
       device='cuda:0')
tensor([28439, 40880, 54316, 40815, 32693, 44860, 23931, 14864],
       device='cuda:0')

########### cand_news_ids ###########
torch.Size([40])
tensor([35729, 59981, 59981, 59981, 59981, 40839, 40839, 58363, 40839, 64851,
        40318, 32791, 48759, 63550, 13801, 12042, 56598, 35729, 35172, 64542,
        13930, 45270, 55204, 13930, 55689, 57651, 57651, 49685, 57651, 57651,
        37660, 22417, 14029, 17117, 36261,  8643, 23508, 63958, 64968, 10913],
       device='cuda:0')

@igor17400
Contributor Author

I'm struggling to understand why this is happening 🧐
The same error happened with the MINS model as well, on the 2nd epoch.

I found this issue regarding nn.MultiheadAttention. Maybe that's the problem?
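For reference, here is a self-contained sketch of one well-known way nn.MultiheadAttention can emit nan (a fully masked key_padding_mask row leaves the softmax with nothing to normalize over). I'm not claiming this is what happens here; it just illustrates the kind of bug discussed in that issue:

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 4, 8)

# Every key position is masked out for this sequence: the attention softmax
# then normalizes a row of -inf scores and the whole output becomes nan.
key_padding_mask = torch.ones(1, 4, dtype=torch.bool)

out, _ = mha(x, x, x, key_padding_mask=key_padding_mask)
print(torch.isnan(out).all())  # tensor(True)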

@igor17400
Contributor Author

@andreeaiana, could you please check if our library versions match?
Alternatively, you could send me your list of versions and I can compare them, whichever works best.

  • Result for python --version: Python 3.9.19

  • Result for conda list > package_list.txt: packages_list.txt

@andreeaiana
Owner

@igor17400 Sure, I apologize for the slow reply.

  • Result for python --version: Python 3.9.16
  • Result for conda list > package_list.txt: package_list.txt

@igor17400
Contributor Author

@andreeaiana no worries!
Thank you very much, I'll let you know of any progress.

@igor17400
Contributor Author

@andreeaiana here are some findings until now.

  1. NRMS

As I mentioned before, there apparently is some bug with nn.MultiheadAttention. I then decided to run a test and created a new module called custom_transformer.py following the linked comment.

You can better understand what I did by looking at this file https://github.com/andreeaiana/newsreclib/blob/b7e3357b9247a2efbb57766b5383b8a442d3f531/newsreclib/models/components/encoders/user/nrms.py

After that modification I was able to successfully train NRMS for all epochs (10 in total); however, I don't know if this was a coincidence.

  2. MINER

This case is a bit different because it doesn't use nn.MultiheadAttention and is still computing nan vectors even so. Let me show what I've found so far.

Apparently the news_encoder is returning a vector of nan values. I added print statements for all variables in the forward function, as shown below:

if torch.isnan(scores).any():
    print("******* forward *********")
    print("------- batch[x_hist] --------")
    print(batch["x_hist"])
    print("------- hist_news_vector --------")
    print(hist_news_vector.size())
    print(hist_news_vector)

    print("------- batch[x_cand] --------")
    print(batch["x_cand"])
    print("------- cand_news_vector --------")
    print(cand_news_vector.size())
    print(cand_news_vector)

    print("------- hist_categ_vector --------")
    print(hist_categ_vector.size())
    print(hist_categ_vector)

    print("------- cand_categ_vector --------")
    print(cand_categ_vector.size())
    print(cand_categ_vector)

    print("------- categ_bias_agg --------")
    print(categ_bias_agg.size())
    print(categ_bias_agg)

    print("------- user_vector --------")
    print(user_vector.size())
    print(user_vector)

    print("------- scores --------")
    print(scores.size())
    print(scores)
    print("**********************************")

In addition, I would just like to highlight the following code logic:

hist_news_vector = self.news_encoder(batch["x_hist"])
hist_news_vector_agg, mask_hist = to_dense_batch(
    hist_news_vector, batch["batch_hist"]
)
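(For context, to_dense_batch only pads the flat per-user news vectors into a dense [num_users, max_history, dim] tensor plus a validity mask; a tiny standalone example, unrelated to the nan itself:)

import torch
from torch_geometric.utils import to_dense_batch

hist_news_vector = torch.randn(5, 3)        # 5 encoded news items, dim 3
batch_hist = torch.tensor([0, 0, 0, 1, 1])  # user 0 read 3 items, user 1 read 2

dense, mask = to_dense_batch(hist_news_vector, batch_hist)
print(dense.shape)  # torch.Size([2, 3, 3]) - histories padded to the longest one
print(mask)         # tensor([[True, True, True], [True, True, False]])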

Then, at some point I received the warning: UserWarning: Encountered nan values in tensor. Will be removed. warnings.warn(*args, **kwargs) # noqa: B028

And I checked the prints as shown below:

------- batch[x_hist] --------
{'news_ids': tensor([30160, 24917, 30680, 10359, 36312, 21685, 57967, 24374, 40163, 17968,
        29276, 61055, 31599, 33203, 62931, 41777, 17825, 19769,  5642, 59546,
         7158, 51942, 54624, 51221, 63049,   477, 26799, 46866, 30727,  3259,
        52551, 46795, 37509, 36754, 27922, 27140,  2735, 53494,  1267, 15253,
        36053,  4166, 10919, 50635, 43142, 43623, 54469, 22570,  6523, 23571,
        21977, 33707, 45729, 10059, 41997, 64408,  4593, 40716,   250,  5978,
        63229,  9101, 63123, 42274, 16781, 51667, 35601, 34930,    50, 46811,
        20344, 24691, 15253, 58030, 24298, 60991, 25450, 14349, 10470, 46039,
        29730,   719,  2203, 31191, 20216, 16233,  6233, 64503,  9653, 17799,
        30974, 42281, 46513, 44396,  5978, 13925, 40716, 23653,  9803, 60184,
        61342, 42620, 46267, 52551, 62058, 23958, 28257, 15676],
       device='cuda:0'), 'title': {'input_ids': tensor([[    0,  6179,   598,  ...,     1,     1,     1],
        [    0,  2264,   294,  ...,     1,     1,     1],
        [    0,  1749,  9899,  ...,     1,     1,     1],
        ...,

Look at how the hist_news_vector is being printed:

------- hist_news_vector --------
torch.Size([108, 256])
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       dtype=torch.float16, grad_fn=<NativeDropoutBackward0>)

Apparently there is something going on with the user_encoder, but in the case of MINER it uses the PolyAttention module, so I'm wondering why this is happening.
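One way I plan to narrow this down (a sketch, assuming the encoders are ordinary nn.Modules; the helper name is mine, not from the repo) is to register forward hooks that report the first submodule whose output goes non-finite:

import torch
import torch.nn as nn

def add_nan_hooks(model: nn.Module):
    # Print the name of every submodule whose output contains nan/inf.
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for o in outs:
                if torch.is_tensor(o) and not torch.isfinite(o).all():
                    print(f"non-finite output in {name} ({module.__class__.__name__})")
                    break
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# e.g. add_nan_hooks(self.news_encoder) once after the model is built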

  3. MANNeR

I just noticed that it uses MHSAAddAtt, which in turn uses nn.MultiheadAttention. I'll check if making a replacement similar to what was done in NRMS solves the bug.

@igor17400
Contributor Author

Apparently the error, at least in the case of MINS, is coming from the news_encoder when passing the input_ids and attention_mask to the PLM model. In the case I'm testing, it's roberta-base.

The prints I added were the following:

if self.encode_text:
    text_vectors = [
        encoder(news[name]) for name, encoder in self.text_encoders.items()
    ]
    if torch.isnan(text_vectors[0]).any() or torch.isnan(text_vectors[1]).any():
        print("@@@@@@ self.text_encoders.items() @@@@@@")
        print(self.text_encoders.items())
        print("********")
        for name, encoder in self.text_encoders.items():
            print("------- encoder --------")
            print(encoder)
            print("------- name --------")
            print(name)
            print("------- news[name] --------")
            print(news[name].keys())
            print(news[name])
            torch.save(news[name]["input_ids"], f"{name}_input_ids.pth")
            torch.save(
                news[name]["attention_mask"], f"{name}_attention_mask.pth"
            )
            print("------- encoder(news[name]) --------")
            print(encoder(news[name]))
            print("---------------")
        print("********")

Here is one example of nan during the encoding:

********
------- encoder --------
PLM(
  (plm_model): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): RobertaPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (multihead_attention): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
  )
  (additive_attention): AdditiveAttention(
    (linear): Linear(in_features=768, out_features=200, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
)
------- name --------
abstract
------- news[name] --------
dict_keys(['input_ids', 'attention_mask'])
{'input_ids': tensor([[    0, 37545,  2839,  ...,     1,     1,     1],
        [    0,   970,    18,  ...,     1,     1,     1],
        [    0,   133, 15091,  ...,     1,     1,     1],
        ...,
        [    0,   133,  5474,  ...,     1,     1,     1],
        [    0, 16035, 48032,  ...,     1,     1,     1],
        [    0, 20861,  4121,  ...,     1,     1,     1]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')}
------- encoder(news[name]) --------
tensor([[-0.0431,  0.2167,  0.0773,  ...,  0.2761,  0.1248, -0.0325],
        [-0.0584,  0.2065,  0.0724,  ...,  0.2703,  0.1244, -0.0341],
        [-0.1138,  0.2350,  0.0594,  ...,  0.3142,  0.1550, -0.0509],
        ...,
        [-0.0830,  0.2133,  0.0742,  ...,  0.2913,  0.1406, -0.0444],
        [-0.0589,  0.2003,  0.0781,  ...,  0.3198,  0.1290, -0.0375],
        [-0.0709,  0.2073,  0.0741,  ...,  0.3025,  0.1254, -0.0388]],
       device='cuda:0', dtype=torch.float16, grad_fn=<SqueezeBackward1>)
---------------
------- encoder --------
PLM(
  (plm_model): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): RobertaPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (multihead_attention): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
  )
  (additive_attention): AdditiveAttention(
    (linear): Linear(in_features=768, out_features=200, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
)
------- name --------
title
------- news[name] --------
dict_keys(['input_ids', 'attention_mask'])
{'input_ids': tensor([[    0, 37545,  2839,  ...,     1,     1,     1],
        [    0,  7608,  5105,  ...,     1,     1,     1],
        [    0,   510, 43992,  ...,     1,     1,     1],
        ...,
        [    0,   673, 10188,  ...,     1,     1,     1],
        [    0, 16035, 48032,  ...,     1,     1,     1],
        [    0,  6407,   740,  ...,     1,     1,     1]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')}
------- encoder(news[name]) --------
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       dtype=torch.float16, grad_fn=<SqueezeBackward1>)
---------------
********

As can be seen, the encoding for the title is returning an all-nan vector, but I don't know why this is happening.
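One quick check I might try (a sketch; this assumes Lightning's 16-bit autocast is active in that forward pass) is to re-run the exact same batch with autocast disabled and see whether the float32 result is finite, which would point to an fp16 overflow inside the PLM rather than bad inputs:

# inside the same `for name, encoder in self.text_encoders.items():` loop
with torch.autocast(device_type="cuda", enabled=False):
    out_fp32 = encoder(news[name])
print(name, torch.isfinite(out_fp32).all())  # True here would suggest an fp16 overflow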

@andreeaiana
Copy link
Owner

> @andreeaiana here are some findings until now.
>
> 1. NRMS
>
> As I mentioned before, there apparently is some bug with nn.MultiheadAttention. I then decided to run a test and created a new module called custom_transformer.py following the linked comment.
>
> You can better understand what I did by looking at this file https://github.com/andreeaiana/newsreclib/blob/b7e3357b9247a2efbb57766b5383b8a442d3f531/newsreclib/models/components/encoders/user/nrms.py
>
> After that modification I was able to successfully train NRMS for all epochs (10 in total); however, I don't know if this was a coincidence.
>
> 2. MANNeR
>
> I just noticed that it uses MHSAAddAtt, which in turn uses nn.MultiheadAttention. I'll check if making a replacement similar to what was done in NRMS solves the bug.

I think if your custom_transformer.py fixes the nn.MultiheadAttention bug, we can replace the call to nn.MultiheadAttention with our custom implementation everywhere. I'll try to have a closer look at all the changes towards the end of this week (very sorry about this, super packed schedule at the moment) and try running some of the models again myself with an updated conda environment.

@igor17400
Contributor Author

Hi @andreeaiana, sorry for the late response.
I think the substitution of the new module might have been a coincidence, but it's hard to say for sure. This behavior seems random, depending on the samples selected during the epochs; to be honest, I didn't identify any pattern. I'm running the MINS model, which previously encountered errors, and so far no error has been thrown 🧐

@igor17400
Contributor Author

I just received the error again, even when changing the line

self.multihead_attention = nn.MultiheadAttention(
to the new MultiheadAttention declared locally.

MINS model:

warnings.warn(*args, **kwargs)  # noqa: B028
Epoch 3:  61%|▌| 9397/15529 [44:41<29:10,  3.50it/s, v_num=e96d, val/loss=5.440, val/loss_best=5.430, val/auc=0.468, val/mrr=0.172, val/ndcg@10=0.20

@igor17400
Contributor Author

I made some more tests, and I believe the error might be associated with the precision:

I was receiving the warning warnings.warn(*args, **kwargs) # noqa: B028 when running MINER, and after I changed to precision: 32 it apparently works. However, it takes longer to train the model. I believe using precision: bf16-true should fix this issue. The catch is that, in the case of MINS, GRU apparently doesn't support this precision yet: pytorch/pytorch#116763
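For reference, the switch is just the Lightning precision flag; a sketch of a hypothetical trainer override (the exact config group in this repo may differ):

trainer:
  precision: bf16-true  # same memory footprint as fp16, but a much wider dynamic range
  # precision: 32       # fallback for models with layers that lack bf16 support (e.g. GRU)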

@andreeaiana
Owner

Maybe we can then switch to precision: bf16-true for all the cases where this works. Until this is also supported for GRU, an interim solution would be to train it with precision: 32 whenever precision: 16 results in the aforementioned errors.

andreeaiana added the bug (Something isn't working) label on Apr 30, 2024
@igor17400
Contributor Author

Hi @andreeaiana, I'm running some tests to evaluate whether this solution is stable; so far I was able to train NRMS for 10 epochs with different parameters, without any errors, using precision: bf16-true. I'll run some tests with other models to see if this solution holds.

In addition, I would like to ask another question, if possible: is it possible to use pre-trained embeddings with MANNeR? I'm asking because it seems that the code only supports PLMs.

@andreeaiana
Owner

> Hi @andreeaiana, I'm running some tests to evaluate whether this solution is stable; so far I was able to train NRMS for 10 epochs with different parameters, without any errors, using precision: bf16-true. I'll run some tests with other models to see if this solution holds.

Great, thanks a lot!

> In addition, I would like to ask another question, if possible: is it possible to use pre-trained embeddings with MANNeR? I'm asking because it seems that the code only supports PLMs.

Similar to MINER, MANNeR was designed from the start to work with PLMs, so I implemented it as such in NewsRecLib. I think changing the code to work also with pre-trained embeddings should be quite straightforward, but I'm not sure how these would affect the model's performance. I expect that MANNeR would still work well with pre-trained embeddings, but some extra experiments would be needed to confirm this.

@Lr-2002

Lr-2002 commented May 21, 2024

Hello, I'm using bfloat16 with GRU. Has this been solved? And which PyTorch version has support for bfloat16 in GRU?

@andreeaiana
Owner

Hi @Lr-2002,

The support for BFloat16 in GRU is a PyTorch issue. According to this discussion, if you build PyTorch from commit 9504182 it should work.
