In this repository you will find the code for the final project of the M.Inf.2202 Deep Learning for Natural Language Processing course at the University of Göttingen. You can find the handout here.
In this project, important components of the BERT model are implemented. The embeddings produced by our BERT model were used for three downstream tasks: sentiment classification, paraphrase detection, and semantic textual similarity.
We implemented a multi-task learning approach to improve on the baseline model.
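As a rough illustration of what such a multi-task setup can look like, the sketch below shares one BERT encoder across three lightweight task heads. All class, attribute, and method names here are hypothetical; the real implementation lives in multitask_classifier.py.

```python
import torch
import torch.nn as nn


class MultitaskBertSketch(nn.Module):
    """Sketch only: one shared encoder, three task-specific heads."""

    def __init__(self, bert, hidden_size=768, num_sentiment_classes=5):
        super().__init__()
        self.bert = bert  # shared BERT encoder (e.g., loaded via from_pretrained())
        # One lightweight head per downstream task:
        self.sentiment_head = nn.Linear(hidden_size, num_sentiment_classes)
        self.paraphrase_head = nn.Linear(2 * hidden_size, 1)   # binary logit
        self.similarity_head = nn.Linear(2 * hidden_size, 1)   # regression score

    def embed(self, input_ids, attention_mask):
        # Use the pooled [CLS] representation as the sentence embedding.
        return self.bert(input_ids, attention_mask)["pooler_output"]

    def predict_sentiment(self, ids, mask):
        return self.sentiment_head(self.embed(ids, mask))

    def predict_paraphrase(self, ids1, mask1, ids2, mask2):
        pair = torch.cat([self.embed(ids1, mask1), self.embed(ids2, mask2)], dim=-1)
        return self.paraphrase_head(pair)

    def predict_similarity(self, ids1, mask1, ids2, mask2):
        pair = torch.cat([self.embed(ids1, mask1), self.embed(ids2, mask2)], dim=-1)
        return self.similarity_head(pair)
```

Because the encoder is shared, gradient updates from all three tasks shape the same sentence embeddings, which is the core idea behind the multi-task approach.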
- Make sure you have Anaconda or Miniconda installed.
- Run setup.sh to set up a conda environment and install the dependencies:

```bash
source setup.sh
```
The setup.sh file contains the following:

```bash
#!/usr/bin/env bash
conda create -n dnlp python=3.8
conda activate dnlp
conda install pytorch==1.8.0 torchvision torchaudio cudatoolkit=10.1 -c pytorch
pip install tqdm==4.58.0
pip install requests==2.25.1
pip install importlib-metadata==3.7.0
pip install filelock==3.0.12
pip install sklearn==0.0
pip install tokenizers==0.10.1
pip install explainaboard_client==0.0.7
```
Activate the dnlp environment by running:

```bash
conda activate dnlp
```
Details of the environment, including all libraries and dependencies used, can be found in the environment.yml file.
Since the model is designed to perform three different NLP tasks, we train on three different datasets. More precisely, we use the Stanford Sentiment Treebank (Socher et al., 2013), consisting of 11,855 single sentences extracted from movie reviews, to classify a text's sentiment. Further, we use the Quora dataset, consisting of 400,000 question pairs, to train for paraphrase detection. Lastly, we seek to measure the degree of semantic equivalence using data from the SemEval STS Benchmark dataset, consisting of 8,628 sentence pairs of varying similarity.
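The three datasets differ greatly in size, so a multi-task training loop needs some schedule for mixing them. Below is a minimal sketch of one possible round-robin schedule; the loader names are hypothetical, and this is not necessarily the schedule used in multitask_classifier.py.

```python
from itertools import cycle


def interleaved_batches(sst_loader, quora_loader, sts_loader):
    """Sketch: pace one epoch by the largest dataset (Quora) while cycling the
    smaller SST and STS loaders so every task is trained throughout the epoch."""
    sst_iter, sts_iter = cycle(sst_loader), cycle(sts_loader)
    for quora_batch in quora_loader:
        yield "paraphrase", quora_batch
        yield "sentiment", next(sst_iter)
        yield "similarity", next(sts_iter)
```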
To train the model, run this command:

```bash
python -u multitask_classifier.py --option finetune --lr 1e-5 --batch_size 64 --local_files_only
```

or simply submit the provided submit_train.sh file, which contains the command above, via sbatch:

```bash
sbatch submit_train.sh
```
Note: The hyperparameters for our best model are already included in the above code snippets.
Disclaimer: We trained on a computer cluster provided by the university's IT infrastructure. You may run into slow training speeds on commonly available consumer hardware.
The model is evaluated automatically after each training epoch, using the method model_eval_multitask() imported from evaluation.py.
We evaluate each task separately and combine the results into a weighted summary metric (see the sketch after this list):
- Paraphrase Detection: proportion of correctly classified paraphrase pairs
- Sentiment Classification: proportion of correctly classified sentiments
- Semantic Textual Similarity: Pearson correlation coefficient between predicted and true similarity scores
- Best Metric: weighted average of the individual task metrics
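A minimal sketch of how such a weighted summary can be computed is shown below. The equal weighting and function name are illustrative only; the actual computation lives in model_eval_multitask() in evaluation.py.

```python
import numpy as np


def summary_metric(para_preds, para_labels,
                   sent_preds, sent_labels,
                   sts_preds, sts_labels):
    """Sketch: combine the three per-task metrics into one summary number."""
    para_acc = np.mean(np.array(para_preds) == np.array(para_labels))
    sent_acc = np.mean(np.array(sent_preds) == np.array(sent_labels))
    sts_corr = np.corrcoef(sts_preds, sts_labels)[0, 1]  # Pearson correlation
    # Equal weights here are purely for illustration; the actual weighting
    # may differ in evaluation.py.
    return (para_acc + sent_acc + sts_corr) / 3.0
```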
For this project we used the so-called bert-base-uncased pre-trained model implementation, which loads pre-trained weights. Within the multitask_classifier.py file we call the from_pretrained() method from base_bert.py to retrieve the pre-trained model. If you want to train with the pre-trained BERT weights kept frozen, replace "finetune" with "pretrain" when running the training, i.e.

```bash
python -u multitask_classifier.py --option pretrain --lr 1e-5 --batch_size 64 --local_files_only
```
Our model achieves the following performance on the three tasks:
Model | Paraphrase Accuracy | Sentiment Accuracy | Semantic Text Similarity Correlation |
---|---|---|---|
Multitask Classifier BERT Model | 89.01% | 49.59% | 88.00% |
During training, we write a text file with information about our best model after every epoch.
Model | Paraphrase Accuracy | Sentiment Accuracy | Semantic Text Similarity Correlation |
---|---|---|---|
Multitask Classifier pretrain BERT Model | 26.34% | 62.47% | 37.30% |
As an experiment, we checked the model's performance when setting the option to 'pretrain' instead of 'finetune'. As expected, the results were not satisfying: since the BERT weights are frozen and only the task-specific layers update their weights, the model cannot learn efficiently.
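Conceptually, the --option flag decides whether gradients flow into the shared encoder. A hedged sketch of this switch is given below; the function and attribute names are assumptions, and the actual logic lives in multitask_classifier.py.

```python
def apply_option(model, option):
    """Sketch: 'pretrain' freezes the shared BERT encoder so only the
    task-specific heads receive gradient updates; 'finetune' trains everything."""
    assert option in ("pretrain", "finetune")
    for param in model.bert.parameters():  # 'model.bert' is an assumed attribute name
        param.requires_grad = (option == "finetune")
```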