STILTs = Supplementary Training on Intermediate Labeled-data Tasks
This repository contains code for BERT on STILTs. It is a fork of the Hugging Face implementation of BERT.
STILTs is a method for supplementary training on an intermediate task before fine-tuning for a downstream target task. We show in our paper that STILTs can improve the performance and stability of the final model on the target task.
BERT on STILTs achieves a GLUE score of 82.0, compared to 80.5 of BERT without STILTs.
Base Model | Intermediate Task | Target Task | Download | Val Score | Test Score |
---|---|---|---|---|---|
BERT-Large | - | CoLA | Link | 65.3 | 61.2 |
BERT-Large | MNLI | SST | Link | 93.9 | 95.1 |
BERT-Large | MNLI | MRPC | Link | 90.4 | 88.6 |
BERT-Large | MNLI | STS-B | Link | 90.7 | 89.0 |
BERT-Large | - | QQP | Link | 90.0 | 81.2 |
BERT-Large | - | MNLI | Link | 86.7 | 86.2 |
BERT-Large | MNLI | QNLI | Link | 92.3 | 92.8 |
BERT-Large | MNLI | RTE | Link | 84.1 | 79.0 |
BERT-Large | - | WNLI* | N/A | 56.3 | 65.1 |
Overall GLUE Score: 82.0
Models differ slightly from published results because they were retrained.
You will need to download the GLUE data to run our tasks. See here.
You will also need to set the two following environment variables:
GLUE_DIR
: This should point to the location of the GLUE data downloaded fromjiant
.BERT_ALL_DIR
: SetBERT_ALL_DIR=/PATH_TO_THIS_REPO/cache/bert_metadata
- For mor general use:
BERT_ALL_DIR
should point to the location of BERT downloaded from here. Importantly, theBERT_ALL_DIR
needs to contain the filesuncased_L-24_H-1024_A-16/bert_config.json
anduncased_L-24_H-1024_A-16/vocab.txt
.
- For mor general use:
To generate validation/test predictions, as well as validation metrics, run something like the following:
export TASK=rte
export BERT_LOAD_PATH=path/to/mnli__rte.p
export OUTPUT_PATH=rte_output
python glue/train.py \
--task_name $TASK \
--do_val --do_test \
--do_lower_case \
--bert_model bert-large-uncased \
--bert_load_mode full_model_only \
--bert_load_path $BERT_LOAD_PATH \
--eval_batch_size 64 \
--output_dir $OUTPUT_PATH
We recommend training with a batch size of 16/24/32.
export TASK=mnli
export OUTPUT_PATH=mnli_output
python glue/train.py \
--task_name $TASK \
--do_train --do_val --do_test --do_val_history \
--do_save \
--do_lower_case \
--bert_model bert-large-uncased \
--bert_load_mode from_pretrained \
--bert_save_mode model_all \
--train_batch_size 24 \
--learning_rate 2e-5 \
--output_dir $OUTPUT_PATH
export PRETRAINED_MODEL_PATH=/path/to/mnli.p
export TASK=rte
export OUTPUT_PATH=rte_output
python glue/train.py \
--task_name $TASK \
--do_train --do_val --do_test --do_val_history \
--do_save \
--do_lower_case \
--bert_model bert-large-uncased \
--bert_load_path $PRETRAINED_MODEL_PATH \
--bert_load_mode model_only \
--bert_save_mode model_all \
--train_batch_size 24 \
--learning_rate 2e-5 \
--output_dir $OUTPUT_PATH
export TASK_A=mnli
export TASK_B=rte
export OUTPUT_PATH_A=mnli
export OUTPUT_PATH_B=mnli__rte
# MNLI
python glue/train.py \
--task_name $TASK_A \
--do_train --do_val \
--do_save \
--do_lower_case \
--bert_model bert-large-uncased \
--bert_load_mode from_pretrained \
--bert_save_mode model_all \
--train_batch_size 24 \
--learning_rate 2e-5 \
--output_dir $OUTPUT_PATH_A
# MNLI -> RTE
python glue/train.py \
--task_name $TASK_B \
--do_train --do_val --do_test --do_val_history \
--do_save \
--do_lower_case \
--bert_model bert-large-uncased \
--bert_load_path $OUTPUT_PATH_A/all_state.p \
--bert_load_mode state_model_only \
--bert_save_mode model_all \
--train_batch_size 24 \
--learning_rate 2e-5 \
--output_dir $OUTPUT_PATH_B
We have included helper scripts for exporting submissions to the GLUE leaderboard. To prepare for submission, copy the template from cache/submission_template
to a given new output folder:
cp -R cache/submission_template /path/to/new_submission
After running a fine-tuned/pretrained model on a task with the --do_test
argument, a folder (e.g. rte_output
) will be created containing test_preds.csv
among other files. Run the following command to convert test_preds.csv
to the submission format in the output folder.
python glue/format_for_glue.py
--task-name rte \
--input-base-path /path/to/rte_output \
--output-base-path /path/to/new_submission
Once you have exported submission predictions for each task, you should have 11 .tsv
files in total. If you run wc -l *.tsv
, you should see something like the following:
1105 AX.tsv
1064 CoLA.tsv
9848 MNLI-mm.tsv
9797 MNLI-m.tsv
1726 MRPC.tsv
5464 QNLI.tsv
390966 QQP.tsv
3001 RTE.tsv
1822 SST-2.tsv
1380 STS-B.tsv
147 WNLI.tsv
426597 total
Next run zip -j -D submission.zip *.tsv
in the folder to generate the submission zip file. Upload the zip file to https://gluebenchmark.com/submit to submit to the leaderboard.
This repository also supports the use of Adapter layers for BERT.
Q: What does STILTs stand for?
STILTs stand for Supplementary Training on Intermediate Labeled-data Tasks.
Q: That's it? You finetune on one task, then finetune on another?
Yes—in some sense, this is the simplest possible approach to pretrain/multi-task-train on another task. We do not perform multi-task training or balancing of multi-task losses, any additional hyperparameter search, early stopping or any other modification to the training procedure. We simply apply the standard BERT training procedure twice.
Even so, this simple method of supplementary training is still competitive with more complex multi-task methods such as MT-DNN and ALICE.
So far, we have observed three main benefits of STILTs:
- STILTs tends to improve performance on tasks with little data (<10k training examples)
- STILTs tends to stabilize training on tasks with little data
- In cases where the intermediate and target tasks are closely related (e.g. MNLI/RTE), we observe a significant improvement in performance.
Q: The paper/abstract mentions a GLUE score of 81.8, while the leaderboard shows 82.0. What's going on?
The GLUE benchmark underwent an update wherein QNLI(v1) was replaced by a QNLIv2, which has a different train/val/test split. The GLUE leaderboard currently reports QNLIv2 scores. On the other hand, all experiments in the paper were run on QNLIv1, so we chose to report the GLUE score based on QNLIv1 in the paper.
In short, with QNLIv1, we get a GLUE score of 81.8. With QNLIv2, we got a score of 82.0.
Q: When I run finetuning on CoLA/MRPC/STS-B/RTE, I get terrible results. Why?
Finetuning BERT-Large on tasks with little training data (<10k) tends to be unstable. This is referenced in the original BERT paper.
Q: Where are the other models (GPT, ELMo) referenced in the paper?
Those results were obtained using the jiant framework. We currently have no plans to publish the trained models for those experiments.
@article{phang2018stilts,
title={Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks},
author={Phang, Jason and F\'evry,, Thibault and Bowman, Samuel R.},
journal={arXiv preprint arXiv:1811.01088v2},
year={2018}
}