功能脚本

TencentPretrain为预训练模型提供了丰富的脚本。这里首先列举项目包括的脚本以及它们的功能，然后详细介绍部分脚本的使用方式。

脚本名	功能描述
average_model.py	对多个模型的参数取平均，比如对不同训练步数的模型参数取平均，增加模型的鲁棒性
build_vocab.py	根据给定的语料和分词器构造词典
cloze_test.py	随机遮住单词并进行预测，返回前n个预测结果（完形填空）
convert_bart_from_huggingface_to_tencentpretrain.py	将Huggingface的BART预训练模型（PyTorch）转到本项目格式
convert_bart_from_tencentpretrain_to_huggingface.py	将本项目的BART预训练模型转到Huggingface格式（PyTorch）
convert_albert_from_huggingface_to_tencentpretrain.py	将Huggingface的ALBERT预训练模型（PyTorch）转到本项目格式
convert_albert_from_tencentpretrain_to_huggingface.py	将本项目的ALBERT预训练模型转到Huggingface格式（PyTorch）
convert_bert_extractive_qa_from_huggingface_to_tencentpretrain.py	将Huggingface的BERT阅读理解模型（PyTorch）转到本项目格式
convert_bert_extractive_qa_from_tencentpretrain_to_huggingface.py	将本项目的BERT阅读理解模型转到Huggingface格式（PyTorch）
convert_bert_from_original_tf_to_tencentpretrain.py	将Google的BERT预训练模型（TF）转到本项目格式
convert_bert_from_huggingface_to_tencentpretrain.py	将Huggingface的BERT预训练模型（PyTorch）转到本项目格式
convert_bert_from_tencentpretrain_to_original_tf.py	将本项目的BERT预训练模型转到Google格式（TF）
convert_bert_from_tencentpretrain_to_huggingface.py	将本项目的BERT预训练模型转到Huggingfac格式（PyTorch）
convert_bert_text_classification_from_huggingface_to_tencentpretrain.py	将Huggingface的BERT文本分类模型（PyTorch）转到本项目格式
convert_bert_text_classification_from_tencentpretrain_to_huggingface.py	将本项目的BERT文本分类模型转到Huggingface格式（PyTorch）
convert_bert_token_classification_from_huggingface_to_tencentpretrain.py	将Huggingface的BERT序列标注模型（PyTorch）转到本项目格式
convert_bert_token_classification_from_tencentpretrain_to_huggingface.py	将本项目的BERT序列标注模型转到Huggingface格式（PyTorch）
convert_gpt2_from_huggingface_to_tencentpretrain.py	将Huggingface的GPT-2预训练模型（PyTorch）转到本项目格式
convert_gpt2_from_tencentpretrain_to_huggingface.py	将本项目的GPT-2预训练模型转到Huggingface格式（PyTorch）
convert_pegasus_from_huggingface_to_tencentpretrain.py	将Huggingface的Pegasus预训练模型（PyTorch）转到本项目格式
convert_pegasus_from_tencentpretrain_to_huggingface.py	将本项目的Pegasus预训练模型转到Huggingface格式（PyTorch）
convert_t5_from_huggingface_to_tencentpretrain.py	将Huggingface的T5预训练模型（PyTorch）转到本项目格式
convert_t5_from_tencentpretrain_to_huggingface.py	将本项目的T5预训练模型转到Huggingface格式（PyTorch）
convert_xlmroberta_from_huggingface_to_tencentpretrain.py	将Huggingface的XLM-RoBERTa预训练模型（PyTorch）转到本项目格式
convert_xlmroberta_from_tencentpretrain_to_huggingface.py	将本项目的XLM-RoBERTa预训练模型转到Huggingface格式（PyTorch）
diff_vocab.py	输出两个词典的重合度
dynamic_vocab_adapter.py	根据词典调整模型，使模型和词典匹配
extract_embeddings.py	抽取预训练模型的embedding层
extract_features.py	通过预训练模型得到的文本表示
generate_lm.py	使用语言模型生成文本
generate_seq2seq.py	使用seq2seq模型生成文本
run_bayesopt.py	使用贝叶斯优化搜索LightGBM模型超参数
run_lgb.py	使用LightGBM进行模型融合（分类任务）
topn_words_dep.py	上下文相关的以词搜词，根据最后一层的隐层表示进行最近邻检索
topn_words_indep.py	上下文无关的以词搜词，根据embedding层词向量进行最近邻检索

完形填空

cloze_test.py 基于MLM任务，对遮住的词进行预测，返回topn最有可能的词。可以在cloze_test.py的基础上进行数据增强等操作。cloze_test.py 使用示例：

python3 scripts/cloze_test.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                              --config_path models/bert/base_config.json \
                              --test_path datasets/tencent_profile.txt --prediction_path output.txt

cloze_test.py 只支持目标任务为MLM的预训练模型。

特征抽取

extract_features.py 让文本经过词向量层，编码层，pooling层，得到文本表示。extract_features.py 使用示例：

python3 scripts/extract_features.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                    --config_path models/bert/base_config.json \
                                    --test_path datasets/tencent_profile.txt --prediction_path features.pt \
                                    --pooling first

文本经过BERT模型的词向量层以及编码层，再取第一个位置，也就是[CLS]位置的向量（--pooling first），作为文本表示。但是当我们使用余弦衡量文本相似度的时候，上面这种文本表示方式效果不好。我们可以对文本表示进行白化操作：

python3 scripts/extract_features.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                    --config_path models/bert/base_config.json \
                                    --test_path datasets/tencent_profile.txt --prediction_path features.pt \
                                    --pooling first --whitening_size 64

--whitening_size 64 表明会使用白化操作，并且向量经过变化后，维度变为64。如果不指定 --whitening_size ，则不会使用白化操作。推荐在特征抽取过程中使用白化操作。

词向量抽取

extract_embeddings.py 从预训练模型权重embedding层中抽取词向量。这里的词向量指传统的上下文无关词向量。抽取出的词向量可以用于初始化其他模型（比如CNN）。extract_embeddings.py 使用示例：

python3 scripts/extract_embeddings.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                      --word_embedding_path embeddings.txt

--word_embedding_path 指定输出词向量文件的路径。词向量文件的格式遵循这里，可以被主流项目直接使用。

以词搜词

预训练模型能够产生高质量的词向量。传统的词向量（比如word2vec和GloVe）给定一个单词固定的向量（上下文无关向量）。然而，一词多义是人类语言中的常见现象。一个单词的意思依赖于其上下文。我们可以使用预训练模型的隐层去表示单词。值得注意的是大多数的中文预训练模型是基于字的。如果需要真正的词向量而不是字向量，用户需要下载基于词的BERT模型和词典。上下文无关词向量以词搜词 scripts/topn_words_indep.py 使用示例（基于字和基于词）：

python3 scripts/topn_words_indep.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                    --test_path target_words.txt

python3 scripts/topn_words_indep.py --load_model_path models/wiki_bert_word_model.bin --vocab_path models/wiki_word_vocab.txt \
                                    --test_path target_words.txt

上下文无关词向量来自于模型的embedding层， target_words.txt 的格式如下所示：

word-1
word-2
...
word-n

下面给出上下文相关词向量以词搜词 scripts/topn_words_dep.py 使用示例（基于字和基于词）：

python3 scripts/topn_words_dep.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                  --cand_vocab_path models/google_zh_vocab.txt --test_path target_words_with_sentences.txt --config_path models/bert/base_config.json \
                                  --batch_size 256 --seq_length 32 --tokenizer bert

python3 scripts/topn_words_dep.py --load_model_path models/bert_wiki_word_model.bin --vocab_path models/wiki_word_vocab.txt \
                                  --cand_vocab_path models/wiki_word_vocab.txt --test_path target_words_with_sentences.txt --config_path models/bert/base_config.json \
                                  --batch_size 256 --seq_length 32 --tokenizer space

我们把目标词替换成词典中其它的词（候选词），将序列送入网络。我们把目标词/候选词对应位置的隐层（最后一层）看作是目标词/候选词的上下文相关词向量。如果两个单词在特定上下文中的隐层向量接近，那么它们可能在特定的上下文中有相似的意思。 --cand_vocab_path 指定候选词文件的路径。由于需要将目标词替换成所有的候选词，然后经过网络，因此我们可以选择较小的候选词词典。如果用户使用基于词的模型，需要对 target_words_with_sentences.txt 文件的句子进行分词。 target_words_with_sentences.txt 文件的格式如下：

word1 sent1
word2 sent2 
...
wordn sentn

单词与句子之间使用\t分隔。

模型平均

average_models.py 对多个有相同结构的预训练模型权重取平均。average_models.py 使用示例：

python3 scripts/average_models.py --model_list_path models/book_review_model.bin-4000 models/book_review_model.bin-5000 \
                                  --output_model_path models/book_review_model.bin

在预训练阶段，我们训练5000步，每隔1000步存储一次。我们对训练4000步和5000步的预训练模型进行平均。

文本生成（语言模型）

我们可以使用 generate_lm.py 来用语言模型生成文本。给定文本开头，模型根据开头续写。使用 generate_lm.py 加载GPT-2-distil模型并进行生成示例：

python3 scripts/generate_lm.py --load_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin-250000 --vocab_path models/google_zh_vocab.txt \
                               --test_path beginning.txt --prediction_path generated_sentence.txt \
                               --config_path models/gpt2/distil_config.json --seq_length 128 \
                               --embedding word pos --remove_embedding_layernorm \
                               --encoder transformer --mask causal --layernorm_positioning pre \
                               --target lm --tie_weights

beginning.txt 包含了文本的开头。generated_sentence.txt 包含了文本的开头以及模型续写的内容。--load_model_path 指定预训练模型（LM预训练目标）路径。

文本生成（Seq2seq模型）

我们可以使用 generate_seq2seq.py 来用Seq2seq模型进行文本生成。给定中文，模型根据中文翻译英文。使用 generate_seq2seq.py 加载中译英模型iwslt_zh_en并进行生成示例：

python3 scripts/generate_seq2seq.py --load_model_path models/iwslt_zh_en_model.bin-50000 \
                                    --vocab_path models/google_zh_vocab.txt --tgt_vocab_path models/google_uncased_en_vocab.txt --tgt_tokenizer bert \
                                    --test_path input.txt --prediction_path output.txt \
                                    --config_path models/transformer/base_config.json --seq_length 64 --tgt_seq_length 64 \
                                    --embedding word sinusoidalpos --tgt_embedding word sinusoidalpos \
                                    --encoder transformer --mask fully_visible --decoder transformer

input.txt 包含了要翻译的内容。output.txt 包含了翻译后的内容。--load_model_path 指定预训练模型（seq2seq预训练目标）路径。

比较词典

我们可以使用 diff_vocab.py 来比较两个词典。使用 diff_vocab.py 来比较诗歌词典和古文词典：

python scripts/diff_vocab.py --vocab_1 models/google_zh_poem_vocab.txt --vocab_2 models/google_zh_ancient_vocab.txt

--vocab_1 和 --vocab_2 分别指定需要比较的词典路径。

运行以上命令将会得到输出：

vocab_1: models/google_zh_poem_vocab.txt, size: 22556
vocab_2: models/google_zh_ancient_vocab.txt, size: 25369
vocab_1 - vocab_2 = 114
vocab_2 - vocab_1 = 2927
vocab_1 & vocab_2 = 22442

分别展示了两个词典的差集大小和交集大小。

根据词典调整模型

我们可以使用 dynamic_vocab_adapter.py 来根据新词典对旧模型中的embedding层和softmax层进行调整，进而产生适配新词典的新模型。若新词典中的词在旧模型中存在，则直接复制；否则随机初始化相关参数。使用 dynamic_vocab_adapter.py 来将旧词典 models/google_zh_vocab.txt 对应的模型 models/google_zh_model.bin 按照新词典 models/google_zh_poem_vocab.txt 进行适配，并产生新模型 models/google_zh_poem_model.bin 。

python scripts/dynamic_vocab_adapter.py --old_model_path models/google_zh_model.bin \
                                        --old_vocab_path models/google_zh_vocab.txt \
                                        --new_vocab_path models/google_zh_poem_vocab.txt \
                                        --new_model_path models/google_zh_poem_model.bin

--old_model_path 和 --old_vocab_path 分别指定旧模型和旧词典的位置，--new_model_path 和 --new_vocab_path 分别指定新模型和新词典的位置。

模型转换

本项目的预训练模型转到Huggingface格式：

我们在Huggingface模型仓库（uer用户）中的每个项目下给出了详细的TencentPretrain到Huggingface转换脚本使用方法。

Huggingface的预训练模型转到本项目格式：

BART：以Huggingface中的bart-base-chinese-cluecorpussmall模型为例：

python3 scripts/convert_bart_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                    --output_model_path tencentpretrain_pytorch_model.bin \
                                                                    --layers_num 6

ALBERT：以Huggingface中的albert-base-chinese-cluecorpussmall模型为例：

python3 scripts/convert_albert_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                      --output_model_path tencentpretrain_pytorch_model.bin

Roberta：以Huggingface中的chinese_roberta_L-2_H-128模型为例：

python3 scripts/convert_bert_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                    --output_model_path tencentpretrain_pytorch_model.bin \
                                                                    --layers_num 2 --type mlm

GPT-2：以Huggingface中的gpt2-chinese-cluecorpussmall模型为例：

python3 scripts/convert_gpt2_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                    --output_model_path tencentpretrain_pytorch_model.bin \
                                                                    --layers_num 12

RoBERTa（BERT）阅读理解模型：以Huggingface中的roberta-base-chinese-extractive-qa模型为例：

python3 scripts/convert_bert_extractive_qa_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                                  --output_model_path tencentpretrain_pytorch_model.bin \
                                                                                  --layers_num 12

RoBERTa（BERT）分类模型：以Huggingface中的roberta-base-finetuned-dianping-chinese模型为例：

python3 scripts/convert_bert_text_classification_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                                        --output_model_path tencentpretrain_pytorch_model.bin \
                                                                                        --layers_num 12

RoBERTa（BERT）序列标注模型：以Huggingface中的roberta-base-finetuned-cluener2020-chinese模型为例：

python3 scripts/convert_bert_token_classification_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                                         --output_model_path tencentpretrain_pytorch_model.bin \
                                                                                         --layers_num 12

T5：以Huggingface中的t5-base-chinese-cluecorpussmall模型为例：

python3 scripts/convert_t5_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                  --output_model_path tencentpretrain_pytorch_model.bin \
                                                                  --layers_num 12 \
                                                                  --type t5

T5-v1_1：以Huggingface中的t5-v1_1-small-chinese-cluecorpussmall模型为例：

python3 scripts/convert_t5_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                  --output_model_path tencentpretrain_pytorch_model.bin \
                                                                  --layers_num 8 --decoder_layers_num 8 \
                                                                  --type t5-v1_1

Pegasus：以Huggingface中的pegasus-base-chinese-cluecorpussmall模型为例：

python3 scripts/convert_pegasus_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                       --output_model_path tencentpretrain_pytorch_model.bin \
                                                                       --layers_num 12 --decoder_layers_num 12

XLM-RoBERTa：以Huggingface中的xlm-roberta-base模型为例：

python3 scripts/convert_xlmroberta_from_huggingface_to_tencentpretrain.py --input_model_path pytorch_model.bin \
                                                                          --output_model_path tencentpretrain_pytorch_model.bin \
                                                                          --layers_num 12

Home
主页
- 项目特色
- 依赖环境
- 快速上手
- 预训练数据
- 下游任务数据集
- 预训练模型仓库
- 使用说明
- 竞赛解决方案
  - 中文任务测评基准CLUE
  - SMP2020-EWECT
  - SMP2019-ECISA
  - CCF-BDCI2021-面向黑灰产治理的恶意短信变体字还原
  - 英文任务测评基准GLUE
  - 视觉任务评测基准
- 引用

Provide feedback

Saved searches

Use saved searches to filter your results more quickly