Code and dataset for the paper Cross-Lingual Natural Language Generation via Pre-Training (AAAI-20).
- XLM-Align (ACL 2021, paper, repo, model): Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment
- InfoXLM (NAACL 2021, paper, repo, model): InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training
- XNLG (AAAI 2020, paper, repo): a multilingual/cross-lingual pre-trained model for natural language generation, e.g., fine-tuning XNLG with English abstractive summarization (AS) data and directly performing French AS or even Chinese-French AS
- mT6 (paper): mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs
- XLM-E (paper): XLM-E: Cross-lingual Language Model Pre-training via ELECTRA

News:
- August 5, 2021: Code and models of InfoXLM and XLM-Align are released.
- May 6, 2021: XLM-Align (InfoXLMv2) and xTune were accepted by ACL 2021.
- April 18, 2021: mT6 preprint released (arXiv).
- March 11, 2021: InfoXLM was accepted by NAACL 2021.
- numpy
- nlgeval (for calculating BLEU scores)
- pytorch 1.1.0
- fastBPE (for generating and applying BPE codes)
- Moses (for tokenization)
- apex (for fp16 training)
- tqdm
- gdown (for downloading from Google Drive)
- pythainlp 2.0.6
You can install some of the required tools via `bash ./preprocess/install-tools.sh`.
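If you prefer to install the Python dependencies by hand, the following is a minimal sketch; the package names are assumed to be the standard PyPI/GitHub ones, and fastBPE, Moses, and apex are handled by the install-tools script instead:
pip install numpy tqdm gdown pythainlp==2.0.6   # general Python dependencies
pip install torch==1.1.0                        # PyTorch version listed above
pip install git+https://github.com/Maluuba/nlg-eval.git   # nlgeval, used for BLEU/METEOR scoring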
You can directly use pre-trained XLM as the pre-trained model for Stage #1.
In the paper, we used the pre-trained model provided by XLM:
Languages | Layers | Model | BPE codes | Vocabulary |
---|---|---|---|---|
XNLI-15 | 12 | Model | BPE codes | Vocabulary |
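For example, you could fetch the checkpoint and its BPE codes/vocabulary as in the sketch below; the URLs are placeholders, so substitute the Model, BPE codes, and Vocabulary links from the table above:
mkdir -p ./models
wget -P ./models [model-url]       # e.g., the XNLI-15 checkpoint mlm_tlm_xnli15_1024.pth
wget -P ./models [bpe-codes-url]
wget -P ./models [vocab-url]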
Monolingual data
In the paper, we use Wikipedia as the monolingual training data. You can get monolingual training data with `get-data-wiki.sh [lang]`.
E.g., bash ./preprocess/get-data-wiki.sh en
Parallel data
In the paper, we use MultiUN as the parallel corpus for en-zh and en-fr.
You can get parallel training data with `get-data-para.sh [lang1-lang2]`.
E.g., bash ./preprocess/get-data-para.sh en-fr
export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU xnlg-train.py
--exp_name stage1_en-zh-fr # experiment name
--dump_path ./dump # where to store the experiment
--data_path ./data/processed/XNLG # data location
--lgs 'en-fr-zh' # considered languages
--mlm_steps 'en,zh,fr,en-fr,en-zh' # MLM/XMLM objective
--emb_dim 1024 # embeddings / model dimension
--n_layers 12 # number of layers
--n_heads 16 # number of heads
--dropout 0.1 # dropout
--attention_dropout 0.1 # attention dropout
--gelu_activation true # GELU instead of ReLU
--batch_size 32 # sequences per batch
--bptt 256 # sequences length (streams of 256 tokens for MLM)
--optimizer adam,lr=0.0001 # optimizer (training is quite sensitive to this parameter)
--epoch_size 300000 # number of sentences per epoch
--max_epoch 100000 # max number of epochs (~infinite here)
--validation_metrics _valid_mlm_ppl # validation metric (when to save the best model)
--stopping_criterion _valid_mlm_ppl,25 # stopping criterion (if criterion does not improve 25 times)
--fp16 true
We provide the pre-trained XNLG used in the paper:
Languages | Layers (enc-dec) | Validation | Model | BPE codes | Vocabulary |
---|---|---|---|---|---|
en,zh | 10-6 | en-zh | Model | BPE codes | Vocabulary |
en,fr,zh | 10-6 | en-fr | Model | BPE codes | Vocabulary |
en,fr,zh | 10-6 | en-zh | Model | BPE codes | Vocabulary |
At Stage #2, the model is trained with the same data as Stage #1.
Notes:
- To load the model pre-trained at Stage #1, use `--reload_model`. `--reload_model [NAME1].pth,[NAME2].pth` initializes the encoder with `[NAME1]` and the decoder with `[NAME2]`, respectively.
- In the paper, we used a 10-layer encoder and a 6-layer decoder, so you can use `--n_layers` to set the number of decoder layers and `--n_enc_layers` to set the number of encoder layers. (When a 10-layer Transformer is loaded from a 12-layer Transformer, it uses the parameters of the first 10 layers of the 12-layer one.)
- During Stage #2, the encoder parameters are frozen and only the decoder parameters are updated, which you can specify with `--train_model_names decoder`.
export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU xnlg-train.py
--exp_name stage2_en-zh-fr
--dump_path ./dump
--data_path ./data/processed/XNLG
--lgs 'ar-bg-de-el-en-es-fr-hi-ru-sw-th-tr-ur-vi-zh'
--mt_steps 'en-zh,zh-en,en-fr,fr-en'
--ae_steps 'en,zh,fr'
--reload_model /path/to/mlm_tlm_xnli15_1024.pth,/path/to/mlm_tlm_xnli15_1024.pth
--emb_dim 1024
--n_layers 6
--n_heads 8
--dropout 0.1
--attention_dropout 0.1
--gelu_activation True
--batch_size 16
--bptt 256
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001
--epoch_size 10000
--max_vocab 95000
--encoder_only False
--train_model_names decoder
--stopping_criterion 'valid_en-zh_mt_bleu,25'
--validation_metrics 'valid_en-zh_mt_bleu,valid_en-fr_mt_bleu'
--eval_bleu True
--word_shuffle 3
--word_dropout 0.1
--word_blank 0.1
--lambda_ae 0.5
--n_enc_layers 10
We use SQuAD 1.1 as the English QG dataset and WebQA as the Chinese QG dataset. You can get our processed dataset by:
bash ./preprocess/get-data-xqg.sh
or directly download it here.
When decoding for QG, we use a decoding vocabulary, which can be downloaded here.
python xnlg-ft.py
--exp_name xqg
--dump_path ./dump
--model_path /path/to/pre-trained/XNLG/model
--data_path ./data/processed/XNLG
--transfer_tasks XQG
--optimizer adam,lr=0.000005
--batch_size 16
--n_epochs 200
--epoch_size 4000
--max_len_q 256
--max_len_a 20
--max_len_e 230
--max_vocab 95000
--train_layers 1,10 # Use `1,10` or `encoder` for zero-shot QG
--vocab_path ./data/xqg-decoding-vocab
--decode_with_vocab True # set to True when evaluating on Chinese
--decode_vocab_sizes 95000,95000
--n_enc_layers 10
--n_dec_layers 6
--beam_size 3
--ds_name xqg
--train_directions en-en
--eval_directions en-en,zh-zh
For supervised QG, `--train_layers` should be set to `all`. For supervised Chinese QG, just set `--train_directions` and `--eval_directions` to `zh-zh`.
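For example, a supervised Chinese QG run could look like the sketch below, which keeps the hyperparameters of the zero-shot command above and only changes the trained layers and the directions (the experiment name is arbitrary):
python xnlg-ft.py
--exp_name xqg-supervised-zh
--dump_path ./dump
--model_path /path/to/pre-trained/XNLG/model
--data_path ./data/processed/XNLG
--transfer_tasks XQG
--optimizer adam,lr=0.000005
--batch_size 16
--n_epochs 200
--epoch_size 4000
--max_len_q 256
--max_len_a 20
--max_len_e 230
--max_vocab 95000
--train_layers all # update all layers for supervised QG
--vocab_path ./data/xqg-decoding-vocab
--decode_with_vocab True # decode with a vocabulary for Chinese
--decode_vocab_sizes 95000,95000
--n_enc_layers 10
--n_dec_layers 6
--beam_size 3
--ds_name xqg
--train_directions zh-zh
--eval_directions zh-zh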
With a fine-tuned model, you can generate questions in a specific language by controlling the generation direction:
python qg.py
--vocab_path /path/to/vocab/folder
--data_path ./data/processed/XNLG
--model_dir /path/to/exp
--job_name [exp-index] # a hash code like `a23h1yv1`
--direction en-zh # en-en, en-zh, zh-en or zh-zh
Calculate BLEU and METEOR scores:
python calc_nlg_scores.py
-i /path/to/generated/questions
--lang zh
--dataset_dir /path/to/eval-dataset
NOTE: The Chinese training data are stored in a word-segmented format like `中国 商代 最后 一 个 君王 是 谁 ?`. But for evaluation, the Chinese questions in `eval-dataset` should be split character by character, like `中 国 商 代 最 后 一 个 君 王 是 谁 ?`.
You can split it by:
fn=test.q.zh.lc; cat ./data/xqg/$fn | python -u ./tools/zh_split_words.py > ./data/xqg-eval/$fn
Calculate ROUGE scores for Chinese:
python ./xnlg/calc_rouge.py
--ref /path/to/ground_truth
--hyp /path/to/generated_sentences
--zh True
Calculate ROUGE scores for other languages:
python ./xnlg/calc_rouge.py
--ref /path/to/ground_truth
--hyp /path/to/generated_sentences
We use English/French/Chinese Gigaword, processed by extracting the first sentence and the headline of each article as the source and target sentences, respectively. You can get our processed dataset by:
bash ./preprocess/get-data-xsumm.sh
or directly download it here.
python xnlg-ft.py
--exp_name xsumm
--dump_path ./dump
--model_path /path/to/pre-trained/XNLG/model
--data_path ./data/processed/XNLG
--transfer_tasks XSumm
--optimizer adam,lr=0.000005
--batch_size 32
--n_epochs 200
--epoch_size 4000
--max_len 120
--max_vocab 95000
--train_layers 1,10
--decode_with_vocab False
--n_enc_layers 10
--n_dec_layers 6
--beam_size 3
--ds_name xgiga
--train_directions en-en
--eval_directions zh-zh
For supervised AS, `--train_layers` should be set to `all`. For supervised French AS, just set `--train_directions` and `--eval_directions` to `fr-fr`.
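For example, a supervised French AS run could look like the sketch below, which keeps the hyperparameters of the command above and only changes the trained layers and the directions (the experiment name is arbitrary):
python xnlg-ft.py
--exp_name xsumm-supervised-fr
--dump_path ./dump
--model_path /path/to/pre-trained/XNLG/model
--data_path ./data/processed/XNLG
--transfer_tasks XSumm
--optimizer adam,lr=0.000005
--batch_size 32
--n_epochs 200
--epoch_size 4000
--max_len 120
--max_vocab 95000
--train_layers all # update all layers for supervised AS
--decode_with_vocab False
--n_enc_layers 10
--n_dec_layers 6
--beam_size 3
--ds_name xgiga
--train_directions fr-fr
--eval_directions fr-fr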
python summ.py
--data_path ./data/processed/XNLG
--model_dir /path/to/exp
--job_name [exp-index] # a hash code like `a23h1yv1`
--direction en-fr # en-en/fr-fr/zh-zh/en-zh/fr-en/...
Please cite the paper Cross-Lingual Natural Language Generation via Pre-Training if you find the resources in this repository useful.
@inproceedings{xnlg,
author = {Chi, Zewen and Dong, Li and Wei, Furu and Wang, Wenhui and Mao, Xian{-}Ling and Huang, Heyan},
title = {Cross-Lingual Natural Language Generation via Pre-Training},
booktitle = {The Thirty-Fourth {AAAI} Conference on Artificial Intelligence},
pages = {7570--7577},
publisher = {{AAAI} Press},
year = {2020},
url = {https://www.aaai.org/Papers/AAAI/2020GB/AAAI-ChiZ.7682.pdf}
}