Code and dataset for the paper Cross-Lingual Natural Language Generation via Pre-Training (AAAI-20).
With the pre-trained XNLG, the supervision signals of NLG can be transferred to other languages. For example, you can fine-tune XNLG with English Abstractive Summarization (AS) data and then directly perform French AS, or even Chinese-French AS.
This repo is based on XLM.
- numpy
- nlgeval (for calculating BLEU scores)
- PyTorch 1.1.0
- fastBPE (for generating and applying BPE codes)
- Moses (for tokenization)
- apex (for fp16 training)
- tqdm
- gdown (for downloading from Google Drive)
You can install some of the required tools by running `bash ./preprocess/install-tools.sh`.
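If you want to apply the released BPE codes to your own text outside the provided scripts, a minimal sketch using the fastBPE Python bindings could look like the following (the code/vocab paths are placeholders; point them at the files released with the pre-trained model):

```python
# Minimal sketch: apply existing BPE codes to whitespace-tokenized text.
# Assumes the `fastBPE` pip package; the paths below are placeholders.
import fastBPE

bpe = fastBPE.fastBPE("codes_xnli_15", "vocab_xnli_15")  # placeholder paths
sentences = ["this is an example sentence ."]            # already tokenized (e.g., with Moses)
print(bpe.apply(sentences))                               # e.g., ['this is an ex@@ ample sent@@ ence .']
```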
You can directly use a pre-trained XLM model as the pre-trained model for Stage #1.
In the paper, we used the pre-trained model provided by XLM.
Languages | Layers | Model | BPE codes | Vocabulary |
---|---|---|---|---|
XNLI-15 | 12 | Model | BPE codes | Vocabulary |
Monolingual data
In the paper, we use Wikipedia dumps as the monolingual training data, which you can prepare with `get-data-wiki.sh [lang]`.
E.g., `bash ./preprocess/get-data-wiki.sh en`.
Parallel data
In the paper, we use MultiUN as the parallel corpus for en-zh and en-fr. You can prepare the parallel training data with `get-data-wiki.sh [lang1-lang2]`.
E.g., `bash ./preprocess/get-data-wiki.sh en-fr`.
export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU xnlg-train.py
--exp_name stage1_en-zh-fr # experiment name
--dump_path ./dump # where to store the experiment
--data_path ./data/processed/XNLG # data location
--lgs 'en-fr-zh' # considered languages
--mlm_steps 'en,zh,fr,en-fr,en-zh' # MLM/XMLM objective
--emb_dim 1024 # embeddings / model dimension
--n_layers 12 # number of layers
--n_heads 16 # number of heads
--dropout 0.1 # dropout
--attention_dropout 0.1 # attention dropout
--gelu_activation true # GELU instead of ReLU
--batch_size 32 # sequences per batch
--bptt 256 # sequence length (streams of 256 tokens for MLM)
--optimizer adam,lr=0.0001 # optimizer (training is quite sensitive to this parameter)
--epoch_size 300000 # number of sentences per epoch
--max_epoch 100000 # max number of epochs (~infinite here)
--validation_metrics _valid_mlm_ppl # validation metric (when to save the best model)
--stopping_criterion _valid_mlm_ppl,25 # stopping criterion (if criterion does not improve 25 times)
--fp16 true
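For reference, the `--mlm_steps` objective masks tokens in the standard XLM/BERT way: about 15% of tokens become prediction targets, of which 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. The snippet below is an illustrative re-implementation of that masking recipe, not the repository's actual code:

```python
import torch

def mask_tokens(x, mask_index, vocab_size, p=0.15):
    """Illustrative MLM masking: select ~15% of positions as prediction
    targets; of those, 80% become <mask>, 10% a random token, 10% unchanged."""
    pred_mask = torch.rand(x.shape) < p                 # positions to predict
    targets = x.clone()
    targets[~pred_mask] = -100                          # convention: ignored by the loss

    rand = torch.rand(x.shape)
    x = x.clone()
    x[pred_mask & (rand < 0.8)] = mask_index            # 80%: <mask>
    random_ids = torch.randint(vocab_size, x.shape)
    replace = pred_mask & (rand >= 0.8) & (rand < 0.9)  # 10%: random token
    x[replace] = random_ids[replace]
    # remaining 10%: keep the original token
    return x, targets
```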
We provide the pre-trained XNLG models used in the paper:
Languages | Layers (enc-dec) | Validation | Model | BPE codes | Vocabulary |
---|---|---|---|---|---|
en,zh | 10-6 | en-zh | Model | BPE codes | Vocabulary |
en,fr,zh | 10-6 | en-fr | Model | BPE codes | Vocabulary |
en,fr,zh | 10-6 | en-zh | Model | BPE codes | Vocabulary |
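The released files are hosted on Google Drive, and `gdown` (listed in the requirements) can fetch them programmatically. A minimal sketch, where `GOOGLE_DRIVE_FILE_ID` and the output filename are placeholders to be taken from the links in the table above:

```python
# Sketch: download a released checkpoint from Google Drive with gdown.
# GOOGLE_DRIVE_FILE_ID is a placeholder; use the ID from the corresponding link above.
import gdown

url = "https://drive.google.com/uc?id=GOOGLE_DRIVE_FILE_ID"
gdown.download(url, "xnlg_en-fr-zh_10-6.pth", quiet=False)  # output filename is arbitrary
```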
At Stage #2, the model is trained on the same data as Stage #1.
Notes:
- To load the model pre-trained at Stage #1, use `--reload_model`. `--reload_model [NAME1].pth,[NAME2].pth` means initializing the encoder and decoder with `[NAME1]` and `[NAME2]`, respectively.
- In the paper, we used a 10-layer encoder and a 6-layer decoder, so you can use `--n_layers` to set the number of decoder layers and `--n_enc_layers` to set the number of encoder layers. (When a 10-layer Transformer is loaded from a 12-layer Transformer, it uses the parameters of the first 10 layers of the 12-layer one.)
- During Stage #2, the encoder parameters are frozen, and we only update the decoder parameters. You can use `--train_model_names decoder`.
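The notes on layer counts and frozen encoders boil down to two operations: copy only the first `n_enc_layers` Transformer blocks from the checkpoint, and turn off gradients for everything except the decoder. The sketch below is a simplified illustration of that idea with made-up parameter names, not the repository's actual loading code:

```python
import torch

def load_first_layers(model, ckpt_path, n_layers=10):
    """Illustrative: initialize a 10-layer model from a 12-layer checkpoint by
    keeping only parameters whose layer index is < n_layers."""
    state = torch.load(ckpt_path, map_location="cpu")
    kept = {}
    for name, tensor in state.items():
        parts = name.split(".")
        # assume layer parameters look like "layers.<idx>.<...>" (made-up naming)
        if parts[0] == "layers" and int(parts[1]) >= n_layers:
            continue
        kept[name] = tensor
    model.load_state_dict(kept, strict=False)

def freeze_all_but_decoder(encoder, decoder):
    """Illustrative equivalent of --train_model_names decoder."""
    for p in encoder.parameters():
        p.requires_grad = False
    for p in decoder.parameters():
        p.requires_grad = True
```

The full Stage #2 training command: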
export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU xnlg-train.py
--exp_name stage2_en-zh-fr
--dump_path ./dump
--data_path ./data/processed/XNLG
--lgs 'ar-bg-de-el-en-es-fr-hi-ru-sw-th-tr-ur-vi-zh'
--mt_steps 'en-zh,zh-en,en-fr,fr-en'
--ae_steps 'en,zh,fr'
--reload_model /path/to/mlm_tlm_xnli15_1024.pth,/path/to/mlm_tlm_xnli15_1024.pth
--emb_dim 1024
--n_layers 6
--n_heads 8
--dropout 0.1
--attention_dropout 0.1
--gelu_activation True
--batch_size 16
--bptt 256
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001
--epoch_size 10000
--max_vocab 95000
--encoder_only False
--train_model_names decoder
--stopping_criterion 'valid_en-zh_mt_bleu,25'
--validation_metrics 'valid_en-zh_mt_bleu,valid_en-fr_mt_bleu'
--eval_bleu True
--word_shuffle 3
--word_dropout 0.1
--word_blank 0.1
--lambda_ae 0.5
--n_enc_layers 10
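The `--ae_steps` objective trains the model to reconstruct a sentence from a noised version of it; `--word_shuffle 3`, `--word_dropout 0.1`, and `--word_blank 0.1` control that noise, as in XLM-style denoising auto-encoding. Below is a simplified, list-of-tokens illustration of the three noise functions, not the actual batched implementation:

```python
import random

def add_noise(tokens, shuffle_k=3, dropout=0.1, blank=0.1, blank_token="<unk>"):
    """Illustrative denoising-autoencoder noise:
    - shuffle each word by at most `shuffle_k` positions,
    - drop words with probability `dropout`,
    - replace words with a blank token with probability `blank`."""
    # word shuffle: sort by (index + uniform noise in [0, shuffle_k])
    keys = [i + random.uniform(0, shuffle_k) for i in range(len(tokens))]
    tokens = [t for _, t in sorted(zip(keys, tokens))]
    # word dropout / blanking
    noised = []
    for t in tokens:
        r = random.random()
        if r < dropout:
            continue                      # drop the word
        elif r < dropout + blank:
            noised.append(blank_token)    # blank the word
        else:
            noised.append(t)
    return noised

print(add_noise("the cat sat on the mat".split()))
```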
We use SQuAD 1.1 as the English QG dataset and WebQA as the Chinese QG dataset. You can get our processed dataset by:
bash ./preprocess/get-data-xqg.sh
or directly download it here.
When decoding for QG, we use a decoding vocabulary, which can be downloaded here.
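The decoding vocabulary is used to constrain generation (`--decode_with_vocab` / `--decode_vocab_sizes` in the command below), presumably by restricting which tokens the decoder may emit. The sketch below illustrates one common way to do this, masking logits outside the allowed set before choosing the next token; it is not the repository's exact implementation:

```python
import torch

def restrict_logits(logits, allowed_ids):
    """Illustrative: keep only tokens in `allowed_ids`; everything else gets -inf
    so it can never be chosen by greedy or beam search."""
    mask = torch.full_like(logits, float("-inf"))
    mask[..., allowed_ids] = 0.0
    return logits + mask

# usage sketch
vocab_size = 95000
logits = torch.randn(vocab_size)          # scores for one decoding step
allowed_ids = torch.arange(0, 20000)      # placeholder decoding vocabulary
next_token = restrict_logits(logits, allowed_ids).argmax().item()
```

Fine-tune for cross-lingual QG: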
python xnlg-ft.py
--exp_name xqg
--dump_path ./dump
--model_path /path/to/pre-trained/XNLG/model
--data_path ./data/processed/XNLG
--transfer_tasks XQG
--optimizer adam,lr=0.000005
--batch_size 16
--n_epochs 200
--epoch_size 4000
--max_len_q 256
--max_len_a 20
--max_len_e 230
--max_vocab 95000
--train_layers 1,10 # Use `1,10` or `encoder` for zero-shot QG
--vocab_path ./data/xqg-decoding-vocab
--decode_with_vocab True # set to True when evaluating on Chinese
--decode_vocab_sizes 95000,95000
--n_enc_layers 10
--n_dec_layers 6
--beam_size 3
--ds_name xqg
--train_directions en-en
--eval_directions en-en,zh-zh
For supervised QG, `--train_layers` should be set to `all`. For supervised Chinese QG, just set `--train_directions` and `--eval_directions` to `zh-zh`.
With a fine-tuned model, you can generate questions in a specific language by controlling the generation direction:
python qg.py
--vocab_path /path/to/vocab/folder
--data_path ./data/processed/XNLG
--model_dir /path/to/exp
--job_name [exp-index] # a hash code like `a23h1yv1`
--direction en-zh # en-en, en-zh, zh-en or zh-zh
Calculate BLEU and METEOR scores:
python calc_nlg_scores.py
-i /path/to/generated/questions
--lang zh
--dataset_dir /path/to/eval-dataset
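The BLEU/METEOR evaluation uses nlgeval (listed in the requirements). If you prefer to compute the scores directly in Python, nlgeval's file-based helper can be used; the file paths below are placeholders:

```python
# Sketch: compute BLEU/METEOR with nlgeval's file-based API (placeholder paths).
from nlgeval import compute_metrics

scores = compute_metrics(
    hypothesis="generated_questions.txt",  # one generated question per line
    references=["test.q.tok.txt"],         # list of reference files, same line order
    no_skipthoughts=True,                  # skip embedding-based metrics
    no_glove=True,
)
print(scores)  # dict of metric name -> score (Bleu_1..Bleu_4, METEOR, ROUGE_L, CIDEr)
```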
NOTE: The Chinese training data are stored in a word-segmented format like 中国 商代 最后 一 个 君王 是 谁 ?, but for evaluation, the Chinese questions in `eval-dataset` should be split character by character, like 中 国 商 代 最 后 一 个 君 王 是 谁 ?.
You can split them by:
fn=test.q.zh.lc; cat ./data/xqg/$fn | python -u ./tools/zh_split_words.py > ./data/xqg-eval/$fn
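If you prefer not to go through `zh_split_words.py`, an equivalent character-level split takes only a few lines of Python. This is a sketch of the same idea (not necessarily identical to the script's behavior), assuming the input is the whitespace-segmented format shown above:

```python
import sys

# Sketch: re-split whitespace-segmented Chinese text into single characters,
# e.g. "中国 商代 ... 谁 ?"  ->  "中 国 商 代 ... 谁 ?"
for line in sys.stdin:
    chars = [ch for ch in line.strip() if not ch.isspace()]
    print(" ".join(chars))
```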
Calculate ROUGE scores for Chinese:
python ./xnlg/calc_rouge.py
--ref /path/to/ground_truth
--hyp /path/to/generated_sentences
--zh True
Calculate ROUGE scores for other languages:
python ./xnlg/calc_rouge.py
--ref /path/to/ground_truth
--hyp /path/to/generated_sentences
We use English/French/Chinese Gigaword, processed by extracting the first sentence and the headline of each article as the source and target sentences. You can get our processed dataset by:
bash ./preprocess/get-data-xsumm.sh
or directly download it here.
python xnlg-ft.py
--exp_name xsumm
--dump_path ./dump
--model_path /path/to/pre-trained/XNLG/model
--data_path ./data/processed/XNLG
--transfer_tasks XSumm
--optimizer adam,lr=0.000005
--batch_size 32
--n_epochs 200
--epoch_size 4000
--max_len 120
--max_vocab 95000
--train_layers 1,10
--decode_with_vocab False
--n_enc_layers 10
--n_dec_layers 6
--beam_size 3
--ds_name xgiga
--train_directions en-en
--eval_directions zh-zh
For supervised AS, `--train_layers` should be set to `all`. For supervised French AS, just set `--train_directions` and `--eval_directions` to `fr-fr`.
python summ.py
--data_path ./data/processed/XNLG
--model_dir /path/to/exp
--job_name [exp-index] # a hash code like `a23h1yv1`
--direction en-fr # en-en/fr-fr/zh-zh/en-zh/fr-en/...
Please cite the paper Cross-Lingual Natural Language Generation via Pre-Training if you find the resources in this repository useful.
@article{xnlg,
title={Cross-Lingual Natural Language Generation via Pre-Training},
author={Chi, Zewen and Dong, Li and Wei, Furu and Wang, Wenhui and Mao, Xian-Ling and Huang, Heyan},
journal={arXiv preprint arXiv:1909.10481},
year={2019}
}