Practical 1 of Advanced Topics in Computational Semantics (first-year MSc AI @ UvA).
In this project, we test multiple models proposed by Conneau et al. (2017). The following models are considered:
- Baseline: averaging word embeddings to obtain sentence representations.
- Unidirectional LSTM applied to the word embeddings, where the last hidden state is used as the sentence representation.
- Simple bidirectional LSTM (BiLSTM), where the last hidden states of the forward and backward layers are concatenated to form the sentence representation.
- BiLSTM with max pooling, where max pooling over the concatenated forward and backward word-level hidden states yields the sentence representation.
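For illustration, here is a minimal PyTorch sketch of the max-pooling encoder (the class and dimension names are ours, not necessarily those used in this repository):

```python
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    """Sketch: BiLSTM encoder with max pooling over time."""

    def __init__(self, embed_dim=300, hidden_dim=2048):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim) pre-trained word vectors
        hidden, _ = self.lstm(embeddings)  # (batch, seq_len, 2 * hidden_dim)
        # Element-wise max over the time dimension gives the sentence vector.
        sentence, _ = hidden.max(dim=1)    # (batch, 2 * hidden_dim)
        return sentence
```

With hidden_dim=2048 the resulting sentence vectors are 4096-dimensional, as in Conneau et al.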
Prerequisites:
- Anaconda. Available at: https://www.anaconda.com/distribution/
Getting started:
- Open the Anaconda prompt and clone this repository (or download and unpack the zip):
git clone https://github.com/Luuk99/ATCS_Practical_1.git
- Create the environment:
conda env create -f environment.yml
- Activate the environment:
conda activate ATCS
- View the notebook with the experimental results:
jupyter notebook results.ipynb
Training a model:
- Follow steps 1-3 of the getting-started section above (clone the repository, create the environment, and activate it).
- Download the English model (en) from spaCy for the tokenizer:
python -m spacy download en
- Create a .data folder inside the root folder and place the SNLI data (downloaded from the SNLI website) in this folder.
- Run the training of the models:
python main.py --model MODEL
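For example, to train the BiLSTM with max pooling and show a progress bar (both flags are documented in the arguments section below):
python main.py --model BiLSTMMax --progress_bar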
Running SentEval:
- Clone the SentEval project:
git clone https://github.com/facebookresearch/SentEval.git
- Navigate to the SentEval folder.
- Install SentEval:
python setup.py install
- Open Git Bash, navigate to the data/downstream folder, and download the data:
bash get_transfer_data.bash
- Download the GloVe embeddings from the Stanford website.
- Move the .zip file to the SentEval/pretrained folder and unzip it there (make sure the .txt file ends up directly in the pretrained folder).
- Move the entire SentEval folder inside the ATCS_Practical_1 folder.
- Run SentEval from the ATCS_Practical_1 folder:
python senteval.py --model MODEL
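Internally, SentEval works through two callbacks, prepare and batcher, that map batches of sentences to embedding matrices. A minimal sketch of how such a script can be wired up (the encoder call is a hypothetical placeholder; the task path and task list are illustrative):

```python
import numpy as np
import senteval

def encode_sentence(sentence):
    # Hypothetical placeholder: replace with a forward pass through one of
    # the trained encoders to get a fixed-size sentence vector.
    return np.zeros(300, dtype=np.float32)

def prepare(params, samples):
    # Optional hook: build vocabularies or load embeddings once per task.
    return

def batcher(params, batch):
    # batch is a list of tokenized sentences (lists of words).
    return np.vstack([encode_sentence(' '.join(sent)) for sent in batch])

params = {'task_path': 'SentEval/data', 'usepytorch': True, 'kfold': 10}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(['MR', 'CR', 'SUBJ', 'MPQA'])
```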
- If you want to use the --development flag to run on a smaller dataset while making changes:
- Create a folder .development_data in the root folder.
- Copy the SNLI dataset from .data to .development_data.
- Limit the .json files to your taste. Since I used a batch size of 64, I use the following limits (see the sketch after this list):
- 64x400 for train
- 64x100 for dev
- 64x100 for test
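A quick way to produce the truncated copies is a small script along these lines (the SNLI file names and subfolder are assumptions; adjust to the actual contents of .data):

```python
# Truncate the SNLI splits for the --development feature.
# Assumes the folder structure was already copied to .development_data
# in the previous step; file names/paths are assumptions.
limits = {'snli_1.0_train.jsonl': 64 * 400,
          'snli_1.0_dev.jsonl': 64 * 100,
          'snli_1.0_test.jsonl': 64 * 100}

for name, n_lines in limits.items():
    with open(f'.data/snli/snli_1.0/{name}') as src:
        head = [line for _, line in zip(range(n_lines), src)]
    with open(f'.development_data/snli/snli_1.0/{name}', 'w') as dst:
        dst.writelines(head)
```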
- Add the --progress_bar argument during training to see training progress.
- If you want to use a checkpoint, use the --checkpoint_dir argument and provide the path to the checkpoint file (include the .ckpt file name at the end of the path); see the example below.
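For example (the exact checkpoint path is hypothetical; PyTorch Lightning creates these under the log directory):
python main.py --model BiLSTM --checkpoint_dir pl_logs/lightning_logs/version_0/checkpoints/epoch=9.ckpt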
- Use our trained models instead of training them yourself (training can take very long):
- Download the models from this Drive folder.
- Move the individual model folders inside your pl_logs/lightning_logs/ folder.
- Use our SentEval results instead of running it yourself (running takes about 3 hours per model). The results can be found in the senteval_outputs folder.
Running on the Lisa cluster:
- Use the enviroment_Lisa.yml file to create the correct environment.
- No need to download en from spaCy; this is done in the .job files.
- Run the provided .job files for the different models.
- If you alter the .job files, keep in mind not to use the --progress_bar argument; it does not fare well on Lisa.
The models can be trained with the following command-line arguments:
usage: main.py [-h] [--model MODEL] [--lr LR] [--lr_decay LR_DECAY]
[--lr_decrease_factor LR_DECREASE_FACTOR] [--lr_threshold LR_THRESHOLD]
[--batch_size BATCH_SIZE] [--checkpoint_dir CHECKPOINT_DIR]
[--seed SEED] [--log_dir LOG_DIR] [--progress_bar] [--development]
optional arguments:
-h, --help Show help message and exit.
--model MODEL What model to use. Options: ['AWE', 'UniLSTM', 'BiLSTM', 'BiLSTMMax']. Default is 'AWE'.
--lr LR Learning rate to use. Default is 0.1.
--lr_decay LR_DECAY Learning rate decay after each epoch. Default is 0.99.
--lr_decrease_factor LR_DECREASE_FACTOR Factor to divide learning rate by when dev accuracy decreases. Default is 5.
--lr_threshold LR_THRESHOLD Learning rate threshold to stop at. Default is 10e-5.
--batch_size BATCH_SIZE Minibatch size. Default is 64.
--checkpoint_dir CHECKPOINT_DIR Directory where the pretrained model checkpoint is located. Default is None (no checkpoint used).
--seed SEED Seed to use for reproducing results. Default is 1234.
--log_dir LOG_DIR Directory where the PyTorch Lightning logs should be created. Default is 'pl_logs'.
--progress_bar Use a progress bar indicator for interactive experimentation. Not to be used in conjunction with SLURM jobs.
--development Limit the size of the datasets in development.
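Taken together, the learning-rate arguments define the following schedule: the LR decays by --lr_decay every epoch, is divided by --lr_decrease_factor whenever the dev accuracy drops, and training stops once the LR falls below --lr_threshold. A minimal Python sketch of that logic (train_one_epoch is a hypothetical stand-in for the actual PyTorch Lightning loop):

```python
def train_one_epoch(lr):
    # Hypothetical placeholder: one epoch of SNLI training at the given
    # learning rate, returning the resulting dev accuracy.
    return 0.0

lr, lr_decay, decrease_factor, threshold = 0.1, 0.99, 5, 10e-5
best_dev_acc = 0.0

while lr >= threshold:                # stop below --lr_threshold
    dev_acc = train_one_epoch(lr)
    lr *= lr_decay                    # --lr_decay, applied every epoch
    if dev_acc < best_dev_acc:
        lr /= decrease_factor         # --lr_decrease_factor on a dev drop
    best_dev_acc = max(best_dev_acc, dev_acc)
```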
Authors:
- Luuk Kaandorp - [email protected]
Acknowledgements:
- SentEval was cloned from the original GitHub project.
- The PyTorch Lightning implementation was developed using information from the UvA Deep Learning Course (https://uvadlc.github.io/).