Reproducing results from the paper with RoBERTa using fairseq #11
Hi! Thanks for this repository.

I've been trying to reproduce the results from the paper but ran into some problems. I tried the script in

fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh

which I would've expected to score 88.13 on average, as shown in Table 1 in the paper. With FreeLB I got the scores

0.8597, 0.8884, 0.8057, 0.8669, 0.8633

(mean 0.8568; logs here). For the RoBERTa baseline I got the scores

0.8741, 0.7949, 0.8417, 0.6330, 0.6007

(mean 0.7488; logs here). Appreciate any help!
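For reference, the means above can be checked quickly (a minimal Python sketch; the score lists are copied from the runs above):

```python
import statistics

# Dev-set scores from the five runs of each setup reported above.
freelb_scores = [0.8597, 0.8884, 0.8057, 0.8669, 0.8633]
baseline_scores = [0.8741, 0.7949, 0.8417, 0.6330, 0.6007]

for name, scores in [("FreeLB", freelb_scores), ("baseline", baseline_scores)]:
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation across runs
    print(f"{name}: mean={mean:.4f} stdev={stdev:.4f}")
```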
Comments

Hi Benjamin, sorry for the confusion. I took a look at the hyperparameters; the result reported in the paper was actually obtained with a different adv_lr. Also, as the smallest dataset, RTE's results might have the highest variance. You could try more runs or tune adv_lr around that value a little bit.
Thanks! I tried five runs with that setup. The result is significantly better than before, but still 1.8% worse than the mean in the paper, which is quite high for just being variance. I guess I'll try tuning the parameters a bit, especially adv_lr.
Another potential issue is how the scores are defined. In the paper, the scores are the highest results evaluated at multiple checkpoints for each run, but it is possible that you are looking at the results from the last checkpoint. Regarding the variance: if you compare your scores across the 5 runs, the variance is somewhat large, especially for the RoBERTa baseline.
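Concretely, "highest result" means taking the max over checkpoints within each run before averaging across runs, roughly like this sketch (made-up numbers for illustration, not the actual evaluation code):

```python
# Per-checkpoint dev accuracies for each run (made-up numbers for illustration).
runs = [
    [0.80, 0.86, 0.84],  # run 1: evaluated at three checkpoints
    [0.78, 0.82, 0.88],  # run 2
]

best_per_run = [max(ckpts) for ckpts in runs]   # scoring used in the paper
last_per_run = [ckpts[-1] for ckpts in runs]    # a common but different choice

print(sum(best_per_run) / len(best_per_run))  # 0.87
print(sum(last_per_run) / len(last_per_run))  # 0.86
```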
Hmm yes, that's strange. I did use the highest scores from multiple checkpoints (not just the last one).
I can reproduce the results now! I ran five more seeds with the same setup; the only thing I changed was the seeds. From these scores I'd be inclined to think that low seeds behave strangely, because these scores are consistently better than the previous ones (although that is surely impossible). Probably it's just high variance because of the dataset size, as you mentioned. Feel free to close this issue. Thanks for your help!
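For anyone else reproducing this: the extra runs are just the same launch script with different seeds, roughly like the sketch below. The seed values and the assumption that the script takes the seed as its first positional argument are both hypothetical; check how rte-fp32-clip.sh actually reads its arguments.

```python
import subprocess

# Hypothetical wrapper: re-run the FreeLB RTE script with several seeds.
# Assumes fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh takes the seed
# as its first positional argument; the seed values are placeholders.
for seed in [6, 7, 8, 9, 10]:
    subprocess.run(
        ["bash", "fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh", str(seed)],
        check=True,
    )
```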