A test of the "Attention Is Off By One" hypothesis with RoBERTa and Esperanto.
The dataset is the Esperanto portion of INRIA's OSCAR corpus, which is derived from Common Crawl. Additionally, the dataset contains the Esperanto sub-corpus of the Leipzig Corpora Collection. In particular, along with OSCAR, I use the following `epo_*-sentences.txt` files from the Leipzig Corpora:
Dataset | Year | # of Sentences |
---|---|---|
OSCAR | 2020 | 974k |
LCC - Literature | 2011 | 300k |
LCC - Mixed | 2012 | 1M |
LCC - Newscrawl | 2017 | 1M |
LCC - Web | 2012 | 1M |
LCC - Wikipedia | 2021 | 300k |
Total | - | 4.57M |
The dataset is 473 MB and is available on Hugging Face at chriswmurphy/esperanto.
The models are RoBERTa with 84M parameters: the first is a baseline that uses the default softmax in its Attention mechanism, and the second is a challenger that instead uses the proposed softmax1. Both models are also available on Hugging Face, at chriswmurphy/esperberto-softmax0 and chriswmurphy/esperberto-softmax1. The idea to use RoBERTa with Esperanto came from this blog post.
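For reference, softmax1 adds 1 to the softmax denominator, which lets an Attention head assign (near-)zero total weight when no token is relevant. Below is a minimal NumPy sketch of the two functions; the names are mine, and the repo's actual implementation operates on PyTorch tensors inside RoBERTa's Attention layers, so treat this as an illustration rather than the code used for training:

```python
import numpy as np

def softmax(x):
    """Standard softmax: outputs always sum to exactly 1."""
    z = np.exp(x - x.max())  # shift by max for numerical stability
    return z / z.sum()

def softmax1(x):
    """softmax1: adds 1 to the denominator, equivalent to appending a
    zero logit and discarding its probability mass. Outputs sum to
    strictly less than 1, approaching 0 when all logits are very negative."""
    m = max(x.max(), 0.0)  # include the implicit zero logit in the shift
    z = np.exp(x - m)
    return z / (np.exp(-m) + z.sum())
```

With uniformly negative logits, `softmax` is still forced to spread a full unit of attention across the tokens, while `softmax1` can "abstain" and emit nearly zero weight everywhere.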
As expected, softmax1 does not impact model performance at single-precision.
Model | Loss | Runtime | Cost |
---|---|---|---|
EsperBERTo w/ softmax0 | 4.46 | 9h 16m | $11.22 |
EsperBERTo w/ softmax1 | 4.44 | 9h 16m | $11.24 |
Here we report the average excess kurtosis in the Attention output weights from our initial run. (A Gaussian has an excess kurtosis of zero.) The weights in the dense Attention layers are Gaussian to a good approximation.
Model | Dense Weight | Dense Bias | LayerNorm Weight | LayerNorm Bias |
---|---|---|---|---|
EsperBERTo w/ softmax0 | ||||
EsperBERTo w/ softmax1 |
Finally, we report the average excess kurtosis in the Attention output activations from our initial run. Once again, there is no meaningful difference between the softmax0 and softmax1 models here, and the kurtosis of the Attention output activations is consistent with a Gaussian.
Model | Dense | Dropout | Output (LayerNorm) |
---|---|---|---|
EsperBERTo w/ softmax0 | |||
EsperBERTo w/ softmax1 |
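The measurement code behind these tables isn't shown here, but the statistic itself is simple: excess kurtosis is the standardized fourth moment minus 3, so a Gaussian scores zero and heavy-tailed (outlier-prone) distributions score positive. A minimal NumPy sketch (the function name `excess_kurtosis` is mine, not from the repo):

```python
import numpy as np

def excess_kurtosis(w):
    """Excess kurtosis of a flattened weight (or activation) tensor.
    0 for a Gaussian; large positive values indicate heavy tails/outliers."""
    w = np.asarray(w, dtype=np.float64).ravel()
    z = (w - w.mean()) / w.std()  # standardize to zero mean, unit variance
    return (z ** 4).mean() - 3.0  # fourth moment of N(0,1) is 3
```

Applied per layer to the dense and LayerNorm parameters (and to the activations), then averaged across layers, this yields the numbers reported above.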
I'm running on an AWS g5.2xlarge EC2 instance with 1x Nvidia A10G GPU.
You can use the following command to reproduce my results: `screen -S run bash runner.sh`
Don't forget to add your Hugging Face token and username to the env vars before running, e.g.

```shell
echo "HUGGINGFACE_TOKEN=<mySecretTokenVariable>" > .env
echo "HUGGINGFACE_USER=<myHFUserName>" >> .env
```
To test before running in earnest, add the test_pipeline flag, e.g. `python train_model.py --test_pipeline`.
To run the unit tests, do `pytest tests`.