Reproducing results from the paper with RoBERTa using fairseq #11

Closed · bminixhofer opened this issue Jul 18, 2020 · 5 comments
@bminixhofer

Hi! Thanks for this repository.

I've been trying to reproduce the results from the paper but ran into some problems. I tried the script in fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh, which I would have expected to score 88.13 on average, as reported in Table 1 of the paper.

I tried:

  1. five seeds with the setup currently checked into the repo:
# run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     1  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     2  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     3  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     4  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     9016  1.4e-1

and got the scores 0.8597, 0.8884, 0.8057, 0.8669, 0.8633 (mean 0.8568). logs here.

  2. five seeds with the parameters from Table 7 in the paper, using the default fairseq parameters (from https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md#3-fine-tuning-on-glue-task) for any parameters not specified there:
# run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     1  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     2  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     3  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     4  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     5  0

and got the scores 0.8741, 0.7949, 0.8417, 0.6330, 0.6007 (mean 0.7488). logs here.
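
For reference, a quick sanity check of the means and the spread of the two runs above (plain Python over just the numbers quoted here, nothing repo-specific):

from statistics import mean, stdev

# dev accuracies copied from the two setups above
repo_script = [0.8597, 0.8884, 0.8057, 0.8669, 0.8633]      # launch script as checked in
table7_defaults = [0.8741, 0.7949, 0.8417, 0.6330, 0.6007]  # Table 7 params + fairseq defaults

for name, scores in [("repo script", repo_script), ("Table 7 + defaults", table7_defaults)]:
    print(f"{name}: mean={mean(scores):.4f}  std={stdev(scores):.4f}")

The second setup is not just worse on average, its standard deviation is also several times larger.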

Appreciate any help!

@zhuchen03 (Owner) commented Jul 18, 2020

Hi Benjamin,

Sorry for the confusion. I took a look at the hyperparameters. The result reported in the paper was actually obtained with adv_lr=3e-2, but I did not check carefully and left some other hyperparameters from my grid search in the released launch script. In other words, could you try
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 1 1.4e-1
and see if it reproduces the results?

Also, since RTE is the smallest dataset, its results might have the highest variance. You could try more runs, or tune adv_lr around this value a little bit (see the sweep sketch below).
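
In case it helps with the tuning, here is a small sketch (plain Python) that just prints launch lines in the same column order as the header of rte-fp32-clip.sh for a few adv_lr values around 3e-2. The particular grid values and seeds below are only an example, not the grid used for the paper:

# columns: GPU TOTAL_NUM_UPDATES WARMUP_UPDATES LR NUM_CLASSES MAX_SENTENCES FREQ DATA ADV_LR ADV_STEP INIT_MAG SEED MNORM
adv_lrs = ["2e-2", "3e-2", "4e-2", "5e-2"]  # example grid around the suggested 3e-2
seeds = [1, 2, 3, 4, 5]
for adv_lr in adv_lrs:
    for seed in seeds:
        print(f"run_exp 1 2036 122 1e-5 2 2 8 RTE {adv_lr} 3 1.6e-1 {seed} 1.4e-1")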

zhuchen03 pinned this issue Jul 18, 2020
@bminixhofer (Author) commented Jul 19, 2020

Thanks! I tried five runs with that setup:

run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         1     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         2     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         3     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         4     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         5     1.4e-1

and got the scores 0.8597, 0.8597, 0.8669, 0.8776, 0.8525 (mean 0.8633). logs here.

That is significantly better than before, but still 1.8 points worse than the mean in the paper, which seems like a lot to attribute to variance alone.

I guess I'll try tuning the parameters a bit, especially adv_lr (and I'll do some more runs with the parameters from above as well to make sure it isn't just "bad" random seeds).

@zhuchen03 (Owner)

Another potential issue is how the scores are defined. In the paper, each run's score is the best result among the checkpoints evaluated during that run, but it is possible that you are looking at the result from the last checkpoint only.

Regarding the variance, if you compare your scores across the 5 runs, the variance is somewhat large, especially for the RoBERTa baseline.

@bminixhofer (Author)

Hmm yes, that's strange. I did use the highest score across checkpoints (shown as, e.g., Best metric: 0.8525179856115108 in the log files), not the last score, so that is not the issue.
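
For anyone else checking the same thing, a small sketch of how the Best metric: lines could be pulled out of the log files and averaged (the logs/*.log path is just a placeholder for wherever your logs end up):

import glob
import re
from statistics import mean

best = []
for path in glob.glob("logs/*.log"):  # placeholder path, point this at your log directory
    with open(path) as f:
        found = re.findall(r"Best metric:\s*([0-9.]+)", f.read())
    if found:
        best.append(float(found[-1]))  # last "Best metric" line = best checkpoint of that run

print(best, mean(best) if best else "no logs found")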

@bminixhofer (Author) commented Jul 19, 2020

I can reproduce the results now! I ran five more seeds with the same setup:

run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         123     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         456     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         789     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         10112   1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         131415  1.4e-1

and got the scores 0.8849, 0.8812, 0.8669, 0.8849, 0.8777 (mean 0.8791), which is definitely within the margin of error of the paper.

The only thing I changed was the seeds. From these scores I'd be inclined to think that low seeds behave strangely, since these scores are consistently better than the previous ones (although that is surely impossible).

Probably just high variance because of the dataset size, as you mentioned.

Feel free to close this issue. Thanks for your help.
