Practical 1 of Advanced Topics in Computational Semantics (first-year MSc AI @ UvA).
In this project, we test multiple models proposed by Conneau et al. (2017). The following models are considered:
- Baseline: averaging word embeddings to obtain sentence representations.
- Unidirectional LSTM applied to the word embeddings, where the last hidden state is used as the sentence representation.
- Simple bidirectional LSTM (BiLSTM), where the last hidden states of the forward and backward layers are concatenated to form the sentence representation.
- BiLSTM with max pooling, where max pooling over the concatenated forward and backward word-level hidden states yields the sentence representation.
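For illustration, here is a minimal PyTorch sketch of the max-pooling encoder (the class and dimension names are ours, not necessarily those used in this repository):

```python
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    """Sketch: BiLSTM encoder with max pooling over time."""

    def __init__(self, embed_dim=300, hidden_dim=2048):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim) pre-trained word vectors
        hidden, _ = self.lstm(embeddings)  # (batch, seq_len, 2 * hidden_dim)
        # Element-wise max over the time dimension gives the sentence vector.
        sentence, _ = hidden.max(dim=1)    # (batch, 2 * hidden_dim)
        return sentence
```

With hidden_dim=2048 the resulting sentence vectors are 4096-dimensional, as in Conneau et al.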
Prerequisites:
- Anaconda. Available at: https://www.anaconda.com/distribution/
Getting started:
- Open the Anaconda prompt and clone this repository (or download and unpack the zip):
git clone https://github.com/Luuk99/ATCS_Practical_1.git
- Create the environment:
conda env create -f environment.yml
- Activate the environment:
conda activate ATCS
- View the notebook with the experimental results:
jupyter notebook results.ipynb
Training a model:
- Follow steps 1-3 of the getting-started section above (clone the repository, create the environment, and activate it).
- Download the English model (en) from spaCy for the tokenizer:
python -m spacy download en
- Create a .data folder inside the root folder and place the SNLI data (downloaded from the SNLI website) in this folder.
- Run the training of the models:
python main.py --model MODEL
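For example, to train the BiLSTM with max pooling and show a progress bar (both flags are documented in the arguments section below):
python main.py --model BiLSTMMax --progress_bar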
Running SentEval:
- Clone the SentEval project:
git clone https://github.com/facebookresearch/SentEval.git
- Navigate to the SentEval folder.
- Install SentEval:
python setup.py install
- Open Git Bash, navigate to the data/downstream folder, and download the data:
bash get_transfer_data.bash
- Download the GloVe embeddings from the Stanford website.
- Move the .zip file to the SentEval/pretrained folder and unzip it there (make sure the .txt file ends up directly in the pretrained folder).
- Move the entire SentEval folder inside the ATCS_Practical_1 folder.
- Run SentEval from the ATCS_Practical_1 folder:
python senteval.py --model MODEL
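Internally, SentEval works through two callbacks, prepare and batcher, that map batches of sentences to embedding matrices. A minimal sketch of how such a script can be wired up (the encoder call is a hypothetical placeholder; the task path and task list are illustrative):

```python
import numpy as np
import senteval

def encode_sentence(sentence):
    # Hypothetical placeholder: replace with a forward pass through one of
    # the trained encoders to get a fixed-size sentence vector.
    return np.zeros(300, dtype=np.float32)

def prepare(params, samples):
    # Optional hook: build vocabularies or load embeddings once per task.
    return

def batcher(params, batch):
    # batch is a list of tokenized sentences (lists of words).
    return np.vstack([encode_sentence(' '.join(sent)) for sent in batch])

params = {'task_path': 'SentEval/data', 'usepytorch': True, 'kfold': 10}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(['MR', 'CR', 'SUBJ', 'MPQA'])
```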
- If you want to use the --development flag to run on a smaller dataset while making changes:
- Create a folder .development_data in the root folder.
- Copy the SNLI dataset from .data to .development_data.
- Limit the .json files to your taste. Since I used a batch size of 64, I use the following limits (see the sketch after this list):
- 64x400 for train
- 64x100 for dev
- 64x100 for test
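A quick way to produce the truncated copies is a small script along these lines (the SNLI file names and subfolder are assumptions; adjust to the actual contents of .data):

```python
# Truncate the SNLI splits for the --development feature.
# Assumes the folder structure was already copied to .development_data
# in the previous step; file names/paths are assumptions.
limits = {'snli_1.0_train.jsonl': 64 * 400,
          'snli_1.0_dev.jsonl': 64 * 100,
          'snli_1.0_test.jsonl': 64 * 100}

for name, n_lines in limits.items():
    with open(f'.data/snli/snli_1.0/{name}') as src:
        head = [line for _, line in zip(range(n_lines), src)]
    with open(f'.development_data/snli/snli_1.0/{name}', 'w') as dst:
        dst.writelines(head)
```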
- Add the --progress_bar argument during training to see training progress.
- If you want to use a checkpoint, use the --checkpoint_dir argument and provide the path to the checkpoint file (include the .ckpt file name at the end of the path); see the example below.
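For example (the exact checkpoint path is hypothetical; PyTorch Lightning creates these under the log directory):
python main.py --model BiLSTM --checkpoint_dir pl_logs/lightning_logs/version_0/checkpoints/epoch=9.ckpt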
- Use our trained models instead of training them yourself (training can take very long):
- Download the models from this Drive folder.
- Move the individual model folders inside your pl_logs/lightning_logs/ folder.
- Use our SentEval results instead of running it yourself (running takes about 3 hours per model). The results can be found in the senteval_outputs folder.
Running on the Lisa cluster:
- Use the enviroment_Lisa.yml file to create the correct environment.
- No need to download en from spaCy; this is done in the .job files.
- Run the provided .job files for the different models.
- If you alter the .job files, keep in mind not to use the --progress_bar argument; it does not fare well on Lisa.
The models can be trained with the following command-line arguments:
usage: main.py [-h] [--model MODEL] [--lr LR] [--lr_decay LR_DECAY]
[--lr_decrease_factor LR_DECREASE_FACTOR] [--lr_threshold LR_THRESHOLD]
[--batch_size BATCH_SIZE] [--checkpoint_dir CHECKPOINT_DIR]
[--seed SEED] [--log_dir LOG_DIR] [--progress_bar] [--development]
optional arguments:
-h, --help Show help message and exit.
--model MODEL What model to use. Options: ['AWE', 'UniLSTM', 'BiLSTM', 'BiLSTMMax']. Default is 'AWE'.
--lr LR Learning rate to use. Default is 0.1.
--lr_decay LR_DECAY Learning rate decay after each epoch. Default is 0.99.
--lr_decrease_factor LR_DECREASE_FACTOR Factor to divide learning rate by when dev accuracy decreases. Default is 5.
--lr_threshold LR_THRESHOLD Learning rate threshold to stop at. Default is 10e-5.
--batch_size BATCH_SIZE Minibatch size. Default is 64.
--checkpoint_dir CHECKPOINT_DIR Directory where the pretrained model checkpoint is located. Default is None (no checkpoint used).
--seed SEED Seed to use for reproducing results. Default is 1234.
--log_dir LOG_DIR Directory where the PyTorch Lightning logs should be created. Default is 'pl_logs'.
--progress_bar Use a progress bar indicator for interactive experimentation. Not to be used in conjunction with SLURM jobs.
--development Limit the size of the datasets in development.
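Taken together, the learning-rate arguments define the following schedule: the LR decays by --lr_decay every epoch, is divided by --lr_decrease_factor whenever the dev accuracy drops, and training stops once the LR falls below --lr_threshold. A minimal Python sketch of that logic (train_one_epoch is a hypothetical stand-in for the actual PyTorch Lightning loop):

```python
def train_one_epoch(lr):
    # Hypothetical placeholder: one epoch of SNLI training at the given
    # learning rate, returning the resulting dev accuracy.
    return 0.0

lr, lr_decay, decrease_factor, threshold = 0.1, 0.99, 5, 10e-5
best_dev_acc = 0.0

while lr >= threshold:                # stop below --lr_threshold
    dev_acc = train_one_epoch(lr)
    lr *= lr_decay                    # --lr_decay, applied every epoch
    if dev_acc < best_dev_acc:
        lr /= decrease_factor         # --lr_decrease_factor on a dev drop
    best_dev_acc = max(best_dev_acc, dev_acc)
```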
Authors:
- Luuk Kaandorp - [email protected]
Acknowledgements:
- SentEval was cloned from the original GitHub project.
- The PyTorch Lightning implementation was developed using information from the UvA Deep Learning Course (https://uvadlc.github.io/).