This repository contains all data and documentation for building a neural machine translation system for Macedonian to English. This work was done during the M.Sc. course Machine Translation (summer term) held by Prof. Dr. Alex Fraser.
The SETimes corpus contains 207,777 parallel sentences for the Macedonian-English language pair.
For all experiments, the corpus was split into training, development, and test sets:
Data set | Sentences | Download |
---|---|---|
Training | 205,777 | via GitHub or located in data/setimes.mk-en.train.tgz |
Development | 1,000 | via GitHub or located in data/setimes.mk-en.dev.tgz |
Test | 1,000 | via GitHub or located in data/setimes.mk-en.test.tgz |
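The Download column points at archives in the repository's data folder. Assuming the train and dev archives are published at the same raw-GitHub path as the test archive used later in this README (this path is an assumption for train and dev), they can be fetched like this:

```bash
# Fetch the prepared splits directly from this repository
# (train/dev URLs assumed to follow the same pattern as the test-set link below)
for split in train dev test; do
    wget "https://github.com/stefan-it/nmt-mk-en/raw/master/data/setimes.mk-en.${split}.tgz"
    tar -xzf "setimes.mk-en.${split}.tgz"
done
```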
The first NMT system for Macedonian to English is built with fairseq. We trained three systems with different architectures:
- Standard Bi-LSTM
- CNN as encoder, LSTM as decoder
- Fully convolutional
All necessary scripts can be found in the scripts folder of this repository.
In the first step, we need to download and extract the parallel SETimes corpus for Macedonian to English:
```bash
wget http://nlp.ffzg.hr/data/corpora/setimes/setimes.en-mk.txt.tgz
tar -xf setimes.en-mk.txt.tgz
```
The data_preparation.sh script performs the following steps on the corpus:
- download of the Moses tokenizer script; tokenization of the whole corpus
- download of the BPE scripts; learning and applying BPE on the corpus
```bash
./data_preparation.sh setimes.en-mk.mk.txt setimes.en-mk.en.txt
```
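For orientation, the tokenization and BPE steps inside the script roughly correspond to the sketch below. The tool paths (tokenizer.perl from Moses, learn_bpe.py/apply_bpe.py from subword-nmt), the 32,000 merge operations and the intermediate file names are assumptions; data_preparation.sh in the scripts folder is the authoritative version and may include additional cleaning steps:

```bash
# Sketch only: assumed tool paths and file names; see scripts/data_preparation.sh
# Tokenize both sides of the corpus with the Moses tokenizer
perl tokenizer.perl -l mk < setimes.en-mk.mk.txt > corpus.tok.mk
perl tokenizer.perl -l en < setimes.en-mk.en.txt > corpus.tok.en

# Learn a joint BPE model with 32,000 merge operations and apply it to both sides
cat corpus.tok.mk corpus.tok.en | python learn_bpe.py -s 32000 > bpe.32000.codes
python apply_bpe.py -c bpe.32000.codes < corpus.tok.mk > corpus.clean.bpe.32000.mk
python apply_bpe.py -c bpe.32000.codes < corpus.tok.en > corpus.clean.bpe.32000.en
```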
After that, the corpus is split into training, development, and test sets:
```bash
./split_dataset corpus.clean.bpe.32000.mk corpus.clean.bpe.32000.en
```
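The split produces the train.*, dev.* and test.* files that are moved into place below. A minimal sketch matching the sentence counts from the table above could look like this; holding out the last 2,000 lines is an assumption, the actual selection logic lives in the split script:

```bash
# Illustrative split only; the repository's split script is authoritative.
# Hold out the last 1,000 lines for test and the 1,000 lines before that for dev.
for lang in mk en; do
    src=corpus.clean.bpe.32000.${lang}
    total=$(wc -l < "$src")
    head -n $((total - 2000)) "$src" > train.${lang}
    tail -n 2000 "$src" | head -n 1000 > dev.${lang}
    tail -n 1000 "$src" > test.${lang}
done
```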
The following folder structure needs to be created:
```bash
mkdir {train,dev,test}
mv dev.* dev
mv train.* train
mv test.* test
mkdir model-data
```
After that, the fairseq tool can be invoked to preprocess the corpus:
```bash
fairseq preprocess -sourcelang mk -targetlang en -trainpref train/train \
  -validpref dev/dev -testpref test/test -thresholdsrc 3 \
  -thresholdtgt 3 -destdir model-data
```
After the preprocessing steps, the three models can be trained.
With the following command the Bi-LSTM model can be trained:
```bash
fairseq train -sourcelang mk -targetlang en -datadir model-data -model blstm \
  -nhid 512 -dropout 0.2 -dropout_hid 0 -optim adam -lr 0.0003125 \
  -savedir model-blstm
```
With the following command the CNN encoder, LSTM decoder model can be trained:
```bash
fairseq train -sourcelang mk -targetlang en -datadir model-data -model conv \
  -nenclayer 6 -dropout 0.2 -dropout_hid 0 -savedir model-conv
```
With the following command the fully convolutional model can be trained:
```bash
fairseq train -sourcelang mk -targetlang en -datadir model-data -model fconv \
  -nenclayer 4 -nlayer 3 -dropout 0.2 -optim nag -lr 0.25 \
  -clip 0.1 -momentum 0.99 -timeavg -bptt 0 -savedir model-fconv
```
With the following command the Bi-LSTM model can decode the test set:
```bash
fairseq generate -sourcelang mk -targetlang en \
  -path model-blstm/model_best.th7 -datadir model-data -beam 10 \
  -nbest 1 -dataset test > model-blstm/system.output
```
With the following command the CNN encoder, LSTM decoder model can decode the test set:
```bash
fairseq generate -sourcelang mk -targetlang en -path model-conv/model_best.th7 \
  -datadir model-data -beam 10 -nbest 1 \
  -dataset test > model-conv/system.output
```
With the following command the fully convolutional model can decode the test set:
```bash
fairseq generate -sourcelang mk -targetlang en -path model-fconv/model_best.th7 \
  -datadir model-data -beam 10 -nbest 1 \
  -dataset test > model-fconv/system.output
```
With the helper script fairseq_bleu.sh the BLEU score of all models can be calculated very easily. The script expects the system output file as a command-line argument:
```bash
./fairseq_bleu.sh model-blstm/system.output
```
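fairseq_bleu.sh in the scripts folder is the authoritative implementation; the sketch below only illustrates what such a helper typically does. The H-/T- line prefixes of the generation output, the tab-separated field positions, and the use of Moses' multi-bleu.perl are assumptions:

```bash
#!/usr/bin/env bash
# Illustrative sketch of a BLEU helper; assumed output format and scorer.
# Usage: ./fairseq_bleu.sh model-blstm/system.output
OUTPUT=$1

# Extract hypotheses and references from the generation log and undo BPE
grep '^H-' "$OUTPUT" | cut -f3 | sed 's/@@ //g' > "$OUTPUT.hyp"
grep '^T-' "$OUTPUT" | cut -f2 | sed 's/@@ //g' > "$OUTPUT.ref"

# Score with Moses' multi-bleu.perl (assumed to be available locally)
perl multi-bleu.perl "$OUTPUT.ref" < "$OUTPUT.hyp"
```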
We used two different numbers of BPE merge operations: 16,000 and 32,000. Here are the results on the final test set:
Model | BPE merge operations | BLEU score |
---|---|---|
Bi-LSTM | 32,000 | 46.84 |
Bi-LSTM | 16,000 | 47.57 |
CNN encoder, LSTM decoder | 32,000 | 19.83 |
CNN encoder, LSTM decoder | 16,000 | 9.59 |
Fully convolutional | 32,000 | 48.81 |
Fully convolutional | 16,000 | 49.03 |
The best BLEU score was obtained with the fully convolutional model and 16,000 merge operations.
The second NMT system for Macedonian to English is built with the tensor2tensor library. We trained two systems: one subword-based system and one character-based NMT system.
Notice: The problem definition for this task can be found in translate_enmk.py in the root of this repository. This problem was once directly included in tensor2tensor, but I decided to replace the integrated tensor2tensor problem for Macedonian to English with a more challenging one. To replicate all experiments in this repository, the translate_enmk.py problem is now a user-defined problem and must be included in the following way:
```bash
cp translate_enmk.py /tmp
echo "from . import translate_enmk" > /tmp/__init__.py
```
To use this problem, the --t2t_usr_dir command-line option must point to the appropriate folder (in this example /tmp). For more information about user-defined problems, see the official documentation.
The following training steps were tested with tensor2tensor version 1.5.1.
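To pin that exact version, an installation along these lines should work (choosing a compatible TensorFlow build is left out here, since it depends on the local CUDA setup):

```bash
# Install the tensor2tensor version these steps were tested with
pip install tensor2tensor==1.5.1
```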
First, we create the initial directory structure:
```bash
mkdir -p t2t_data t2t_datagen t2t_train t2t_output
```
In the next step, the training and development datasets are downloaded and prepared:
```bash
t2t-datagen --data_dir=t2t_data --tmp_dir=t2t_datagen/ \
  --problem=translate_enmk_setimes32k --t2t_usr_dir /tmp
```
Then the training step can be started:
```bash
t2t-trainer --data_dir=t2t_data --problems=translate_enmk_setimes32k_rev \
  --model=transformer --hparams_set=transformer_base --output_dir=t2t_output \
  --t2t_usr_dir /tmp
```
The number of GPUs used for training can be specified with the --worker_gpu option.
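For example, the training run from above could be spread over four GPUs (the GPU count here is only an illustration):

```bash
t2t-trainer --data_dir=t2t_data --problems=translate_enmk_setimes32k_rev \
  --model=transformer --hparams_set=transformer_base --output_dir=t2t_output \
  --worker_gpu=4 --t2t_usr_dir /tmp
```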
In the next step, the test dataset is downloaded and extracted:
wget "https://github.com/stefan-it/nmt-mk-en/raw/master/data/setimes.mk-en.test.tgz"
tar -xzf setimes.mk-en.test.tgz
Then the decoding step for the test dataset can be started:
```bash
t2t-decoder --data_dir=t2t_data --problems=translate_enmk_setimes32k_rev \
  --model=transformer --decode_hparams="beam_size=4,alpha=0.6" \
  --decode_from_file=test.mk --decode_to_file=system.output \
  --hparams_set=transformer_base --output_dir=t2t_output/ \
  --t2t_usr_dir /tmp
```
The BLEU score can be calculated with the built-in t2t-bleu tool:
```bash
t2t-bleu --translation=system.output --reference=test.en
```
The following results can be achieved using the Transformer model. A character-based model was also trained and measured. A big Transformer model was also trained using tensor2tensor version 1.2.9 (the latest version has a bug, see this issue); a sketch of that training command follows the results table.
Model | BLEU score |
---|---|
Transformer | 54.00 (uncased) |
Transformer (big) | 43.74 (uncased) |
Transformer (char-based) | 37.43 (uncased) |
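The "Transformer (big)" run from the table was trained analogously. The sketch below only swaps the hparams set and writes to a separate output directory (that directory name is an assumption); as noted above, this run used tensor2tensor 1.2.9:

```bash
# Assumed command for the "Transformer (big)" run; only the hparams set differs
t2t-trainer --data_dir=t2t_data --problems=translate_enmk_setimes32k_rev \
  --model=transformer --hparams_set=transformer_big --output_dir=t2t_output_big \
  --t2t_usr_dir /tmp
```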
We want to train a character-based NMT system with the dl4mt-c2c library in the near future.
We would like to thank the Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften (LRZ) for giving us access to the NVIDIA DGX-1 supercomputer.
- Short presentation at the Deep Learning Workshop @ LRZ can be found here.