Reproducing results from the paper with RoBERTa using fairseq #11

Closed · bminixhofer opened this issue Jul 18, 2020 · 5 comments
@bminixhofer

Hi! Thanks for this repository.

I've been trying to reproduce the results from the paper but ran into some problems. I tried the script in fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh, which I would have expected to score 88.13 on average, as reported in Table 1 of the paper.

I tried:

  1. five seeds with the setup currently checked into the repo:
# run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     1  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     2  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     3  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     4  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     9016  1.4e-1

and got the scores 0.8597, 0.8884, 0.8057, 0.8669, 0.8633 (mean 0.8568). logs here.

  2. five seeds with the parameters from Table 7 in the paper, using the default fairseq parameters (from https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md#3-fine-tuning-on-glue-task) for any parameters not specified there:
# run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     1  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     2  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     3  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     4  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     5  0

and got the scores 0.8741, 0.7949, 0.8417, 0.6330, 0.6007 (mean 0.7488). logs here.
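
For reference, a quick sanity check of the means and the spread of the two runs above (plain Python over just the numbers quoted here, nothing repo-specific):

from statistics import mean, stdev

# dev accuracies copied from the two setups above
repo_script = [0.8597, 0.8884, 0.8057, 0.8669, 0.8633]      # launch script as checked in
table7_defaults = [0.8741, 0.7949, 0.8417, 0.6330, 0.6007]  # Table 7 params + fairseq defaults

for name, scores in [("repo script", repo_script), ("Table 7 + defaults", table7_defaults)]:
    print(f"{name}: mean={mean(scores):.4f}  std={stdev(scores):.4f}")

The second setup is not just worse on average, its standard deviation is also several times larger.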

Appreciate any help!

@zhuchen03 (Owner) commented Jul 18, 2020

Hi Benjamin,

Sorry for the confusion. I took a look at the hyperparameters. The result reported in the paper was actually obtained with adv_lr=3e-2, but I did not check carefully and left some other hyperparameters from my grid search in the released launch script. In other words, could you try
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 1 1.4e-1
and see if it reproduces the results?

Also, since RTE is the smallest dataset, its results might have the highest variance. You could try more runs, or tune adv_lr around this value a little bit (see the sweep sketch below).
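
In case it helps with the tuning, here is a small sketch (plain Python) that just prints launch lines in the same column order as the header of rte-fp32-clip.sh for a few adv_lr values around 3e-2. The particular grid values and seeds below are only an example, not the grid used for the paper:

# columns: GPU TOTAL_NUM_UPDATES WARMUP_UPDATES LR NUM_CLASSES MAX_SENTENCES FREQ DATA ADV_LR ADV_STEP INIT_MAG SEED MNORM
adv_lrs = ["2e-2", "3e-2", "4e-2", "5e-2"]  # example grid around the suggested 3e-2
seeds = [1, 2, 3, 4, 5]
for adv_lr in adv_lrs:
    for seed in seeds:
        print(f"run_exp 1 2036 122 1e-5 2 2 8 RTE {adv_lr} 3 1.6e-1 {seed} 1.4e-1")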

zhuchen03 pinned this issue Jul 18, 2020
@bminixhofer (Author) commented Jul 19, 2020

Thanks! I tried five runs with that setup:

run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         1     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         2     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         3     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         4     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         5     1.4e-1

and got the scores 0.8597, 0.8597, 0.8669, 0.8776, 0.8525 (mean 0.8633). logs here.

That is significantly better than before, but still 1.8 points worse than the mean in the paper, which seems like a lot to attribute to variance alone.

I guess I'll try tuning the parameters a bit, especially adv_lr (and I'll do some more runs with the parameters from above as well to make sure it isn't just "bad" random seeds).

@zhuchen03 (Owner)

Another potential issue is how the scores are defined. In the paper, each run's score is the best result among the checkpoints evaluated during that run, but it is possible that you are looking at the result from the last checkpoint only.

Regarding the variance, if you compare your scores across the 5 runs, the variance is somewhat large, especially for the RoBERTa baseline.

@bminixhofer (Author)

Hmm yes, that's strange. I did use the highest score across checkpoints (shown as, e.g., Best metric: 0.8525179856115108 in the log files), not the last score, so that is not the issue.
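
For anyone else checking the same thing, a small sketch of how the Best metric: lines could be pulled out of the log files and averaged (the logs/*.log path is just a placeholder for wherever your logs end up):

import glob
import re
from statistics import mean

best = []
for path in glob.glob("logs/*.log"):  # placeholder path, point this at your log directory
    with open(path) as f:
        found = re.findall(r"Best metric:\s*([0-9.]+)", f.read())
    if found:
        best.append(float(found[-1]))  # last "Best metric" line = best checkpoint of that run

print(best, mean(best) if best else "no logs found")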

@bminixhofer (Author) commented Jul 19, 2020

I can reproduce the results now! I ran five more seeds with the same setup:

run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         123     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         456     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         789     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         10112   1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         131415  1.4e-1

and got the scores 0.8849, 0.8812, 0.8669, 0.8849, 0.8777 (mean 0.8791), which is definitely within the margin of error of the paper.

The only thing I changed was the seeds. From these scores I'd be inclined to think that low seeds behave strangely, since these scores are consistently better than the previous ones (although that is surely impossible).

Probably just high variance because of the dataset size, as you mentioned.

Feel free to close this issue. Thanks for your help.
