Modelzoo

With the help of TencentPretrain, we pre-trained models with different properties (for example, models based on different corpora, encoders, and targets). All pre-trained weights introduced in this section are in TencentPretrain format and can be loaded by TencentPretrain directly. More pre-trained weights will be released in the near future. Unless otherwise noted, Chinese pre-trained models use the BERT tokenizer with models/google_zh_vocab.txt as the vocabulary (the one used in the original BERT project), and models/bert/base_config.json is the default configuration file. Commonly used vocabulary and configuration files are included in the models/ folder, so users do not need to download them. In addition, we use scripts/convert_xxx_from_tencentpretrain_to_huggingface.py to convert pre-trained weights into the format supported by Huggingface Transformers, and upload them to the Huggingface model hub (uer). The rest of this section provides download links for the pre-trained weights and shows how to use them. Because of space constraints, further details of each pre-trained weight are given on its Huggingface model hub page, which we link to when introducing the weight.

Chinese RoBERTa Pre-trained Weights

This is the set of 24 Chinese RoBERTa weights. CLUECorpusSmall is used as the training corpus. Configuration files are in the models/bert/ folder. We only provide configuration files for the Tiny, Mini, Small, Medium, Base, and Large models. To load other models, modify emb_size, feedforward_size, hidden_size, heads_num, and layers_num in the configuration file. Note that emb_size = hidden_size, feedforward_size = 4 * hidden_size, and heads_num = hidden_size / 64. More details of these pre-trained weights are discussed here.
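For sizes without a ready-made configuration file, the five fields can be derived from the layer count and hidden size. The following is a minimal sketch, not part of TencentPretrain: it takes emb_size equal to hidden_size, and the helper name derive_config and the stand-in template dictionary are hypothetical. In practice the template would be loaded from one of the provided files, such as models/bert/tiny_config.json, so that every other key is carried over unchanged.

```python
import json


def derive_config(template, layers_num, hidden_size):
    """Return a copy of `template` with the size-dependent fields overridden."""
    if hidden_size % 64 != 0:
        raise ValueError("hidden_size must be a multiple of 64")
    config = dict(template)  # keep every key that does not depend on model size
    config.update(
        {
            "emb_size": hidden_size,
            "feedforward_size": 4 * hidden_size,
            "hidden_size": hidden_size,
            "heads_num": hidden_size // 64,
            "layers_num": layers_num,
        }
    )
    return config


# Stand-in for a loaded template; only the size-dependent keys are shown here.
template = {
    "emb_size": 128,
    "feedforward_size": 512,
    "hidden_size": 128,
    "heads_num": 2,
    "layers_num": 2,
}

# Configuration for the 6/256 weight (L=6, H=256).
print(json.dumps(derive_config(template, layers_num=6, hidden_size=256), indent=2))
```

The resulting dictionary can be saved as a new JSON file and passed through --config_path.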

Links to the pre-trained Chinese weights for different numbers of layers (L) and hidden sizes (H):

	H=128	H=256	H=512	H=768
L=2	2/128 (Tiny)	2/256	2/512	2/768
L=4	4/128	4/256 (Mini)	4/512 (Small)	4/768
L=6	6/128	6/256	6/512	6/768
L=8	8/128	8/256	8/512 (Medium)	8/768
L=10	10/128	10/256	10/512	10/768
L=12	12/128	12/256	12/512	12/768 (Base)

Take the Tiny weight as an example: we download it through the link above and put it in the models/ folder. We can either conduct further pre-training on it:

python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --data_processor mlm

python3 pretrain.py --dataset_path dataset.pt --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
                    --vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 \
                    --data_processor mlm --target mlm

or use it on a downstream classification dataset:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
                                   --vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --learning_rate 3e-4 --epochs_num 8 --batch_size 64

In the fine-tuning stage, pre-trained models of different sizes usually require different hyper-parameters. An example of using grid search to find the best hyper-parameters:

python3 finetune/run_classifier_grid.py --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
                                        --vocab_path models/google_zh_vocab.txt \
                                        --config_path models/bert/tiny_config.json \
                                        --train_path datasets/book_review/train.tsv \
                                        --dev_path datasets/book_review/dev.tsv \
                                        --learning_rate_list 3e-5 1e-4 3e-4 --epochs_num_list 3 5 8 --batch_size_list 32 64

We can reproduce the experimental results reported here with the grid search script above.
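To see how many runs the lists passed to the grid search imply, the combinations can be enumerated directly. This is an illustrative sketch and not the implementation of finetune/run_classifier_grid.py; the function name expand_grid is hypothetical, and the values are copied from the command above.

```python
from itertools import product


def expand_grid(learning_rates, epochs_nums, batch_sizes):
    """Return one dict of hyper-parameters per combination (Cartesian product)."""
    return [
        {"learning_rate": lr, "epochs_num": epochs, "batch_size": bs}
        for lr, epochs, bs in product(learning_rates, epochs_nums, batch_sizes)
    ]


combos = expand_grid([3e-5, 1e-4, 3e-4], [3, 5, 8], [32, 64])
print(f"{len(combos)} combinations")
for combo in combos:
    # Each line shows the flags one would pass to a single fine-tuning run.
    print(" ".join(f"--{name} {value}" for name, value in combo.items()))
```

With three learning rates, three epoch counts, and two batch sizes, the search covers 3 * 3 * 2 = 18 settings, so the cost of the grid grows quickly as lists get longer.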

Chinese word-based RoBERTa Pre-trained Weights

This is the set of 5 Chinese word-based RoBERTa weights. CLUECorpusSmall is used as the training corpus. Configuration files are in the models/bert/ folder. Google sentencepiece is used as the tokenizer, with models/cluecorpussmall_spm.model as the sentencepiece model. Most Chinese pre-trained weights are character-based. Compared with character-based models, word-based models are faster (because of shorter sequence lengths) and perform better in our experiments. More details of these pre-trained weights are discussed here.

Links to the pre-trained Chinese weights of different sizes:

Link
L=2/H=128 (Tiny)
L=4/H=256 (Mini)
L=4/H=512 (Small)
L=8/H=512 (Medium)
L=12/H=768 (Base)

Take the word-based Tiny weight as an example: we download it through the link above and put it in the models/ folder. We can either conduct further pre-training on it:

python3 preprocess.py --corpus_path corpora/book_review.txt --spm_model_path models/cluecorpussmall_spm.model \
                      --dataset_path dataset.pt --processes_num 8 --data_processor mlm

python3 pretrain.py --dataset_path dataset.pt --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
                    --spm_model_path models/cluecorpussmall_spm.model --config_path models/bert/tiny_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 \
                    --data_processor mlm --target mlm

or use it on a downstream classification dataset:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
                                   --spm_model_path models/cluecorpussmall_spm.model \
                                   --config_path models/bert/tiny_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --learning_rate 3e-4 --epochs_num 8 --batch_size 64

An example of using grid search to find the best hyper-parameters for the word-based model:

python3 finetune/run_classifier_grid.py --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
                                        --spm_model_path models/cluecorpussmall_spm.model \
                                        --config_path models/bert/tiny_config.json \
                                        --train_path datasets/book_review/train.tsv \
                                        --dev_path datasets/book_review/dev.tsv \
                                        --learning_rate_list 3e-5 1e-4 3e-4 --epochs_num_list 3 5 8 --batch_size_list 32 64

We can reproduce the experimental results reported here with the grid search script above.

Chinese GPT-2 Pre-trained Weights

This is the set of Chinese GPT-2 pre-trained weights. Configuration files are in the models/gpt2/ folder.

Links and detailed descriptions (Huggingface model hub) of the pre-trained GPT-2 weights:

Model link	Description link
CLUECorpusSmall GPT-2	https://huggingface.co/uer/gpt2-chinese-cluecorpussmall
CLUECorpusSmall GPT-2-distil	https://huggingface.co/uer/gpt2-distil-chinese-cluecorpussmall
Poem GPT-2	https://huggingface.co/uer/gpt2-chinese-poem
Couplet GPT-2	https://huggingface.co/uer/gpt2-chinese-couplet
Lyric GPT-2	https://huggingface.co/uer/gpt2-chinese-lyric
Ancient GPT-2	https://huggingface.co/uer/gpt2-chinese-ancient

Note that extended vocabularies (models/google_zh_poem_vocab.txt and models/google_zh_ancient_vocab.txt) are used in the Poem and Ancient GPT-2 models. The CLUECorpusSmall GPT-2-distil model uses the models/gpt2/distil_config.json configuration file; models/gpt2/config.json is used for the other weights.
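The vocabulary and configuration choices for the GPT-2 weights can be summarized as defaults plus a few overrides. The lookup below is only a sketch for illustration: the function name gpt2_files is hypothetical, the keys are the model names from the table, and weights not mentioned in the note are assumed to fall back to models/google_zh_vocab.txt, following the "unless otherwise noted" rule at the top of this page.

```python
DEFAULT_VOCAB = "models/google_zh_vocab.txt"
DEFAULT_CONFIG = "models/gpt2/config.json"

# Weights that deviate from the defaults, as described in the note.
VOCAB_OVERRIDES = {
    "Poem GPT-2": "models/google_zh_poem_vocab.txt",
    "Ancient GPT-2": "models/google_zh_ancient_vocab.txt",
}
CONFIG_OVERRIDES = {
    "CLUECorpusSmall GPT-2-distil": "models/gpt2/distil_config.json",
}


def gpt2_files(model_name):
    """Return (vocab_path, config_path) for a GPT-2 weight from the table."""
    vocab_path = VOCAB_OVERRIDES.get(model_name, DEFAULT_VOCAB)
    config_path = CONFIG_OVERRIDES.get(model_name, DEFAULT_CONFIG)
    return vocab_path, config_path


for name in ("CLUECorpusSmall GPT-2-distil", "Poem GPT-2", "Lyric GPT-2"):
    vocab_path, config_path = gpt2_files(name)
    print(f"{name}: --vocab_path {vocab_path} --config_path {config_path}")
```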

Take the CLUECorpusSmall GPT-2-distil weight as an example: we download it through the link above and put it in the models/ folder. We can either conduct further pre-training on it:

python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 \
                      --seq_length 128 --data_processor lm 

python3 pretrain.py --dataset_path dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/gpt2/distil_config.json \
                    --output_model_path models/book_review_gpt2_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
                    --learning_rate 5e-5 --batch_size 64

or use it on a downstream classification dataset:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
                                   --vocab_path models/google_zh_vocab.txt \
                                   --config_path models/gpt2/distil_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --learning_rate 3e-5 --epochs_num 8 --batch_size 64

GPT-2 models can be used for text generation. First, we create story_beginning.txt and enter the beginning of the text. Then we use scripts/generate_lm.py to generate text:

python3 scripts/generate_lm.py --load_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
                               --vocab_path models/google_zh_vocab.txt \
                               --config_path models/gpt2/distil_config.json \
                               --test_path story_beginning.txt --prediction_path story_full.txt \
                               --seq_length 128
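The prompt file can be created with any text editor. As a minimal sketch, assuming the script expects the beginning as a single line of plain UTF-8 text, the file could also be written from Python; the helper name write_beginning and the sample beginning are placeholders.

```python
from pathlib import Path


def write_beginning(path, beginning):
    """Write the beginning of the text to `path` as a single UTF-8 line."""
    text = " ".join(beginning.split())  # collapse line breaks and repeated spaces
    Path(path).write_text(text + "\n", encoding="utf-8")
    return text


if __name__ == "__main__":
    # Sample beginning ("Once upon a time,"); replace it with your own text.
    write_beginning("story_beginning.txt", "很久很久以前，")
```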

Chinese ALBERT Pre-trained Weights

This is the set of Chinese ALBERT pre-trained weights. Configuration files are in the models/albert/ folder.

Links and detailed descriptions (Huggingface model hub) of the pre-trained ALBERT weights:

Model link	Description link
CLUECorpusSmall ALBERT-base	https://huggingface.co/uer/albert-base-chinese-cluecorpussmall
CLUECorpusSmall ALBERT-large	https://huggingface.co/uer/albert-large-chinese-cluecorpussmall

Take the CLUECorpusSmall ALBERT-base weight as an example: we download it through the link above and put it in the models/ folder. An example of fine-tuning ALBERT-base on a downstream dataset and then running inference:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_albert_base_seq512_model.bin \
                                   --vocab_path models/google_zh_vocab.txt --config_path models/albert/base_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --learning_rate 2e-5 --epochs_num 3 --batch_size 64

python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin \
                                          --vocab_path models/google_zh_vocab.txt \
                                          --config_path models/albert/base_config.json \
                                          --test_path datasets/book_review/test_nolabel.tsv \
                                          --prediction_path datasets/book_review/prediction.tsv \
                                          --labels_num 2

Chinese T5 Pre-trained Weights

This is the set of Chinese T5 pre-trained weights. Configuration files are in the models/t5/ folder.

Links and detailed descriptions (Huggingface model hub) of the pre-trained T5 weights:

Model link	Description link
CLUECorpusSmall T5-small	https://huggingface.co/uer/t5-small-chinese-cluecorpussmall
CLUECorpusSmall T5-base	https://huggingface.co/uer/t5-base-chinese-cluecorpussmall

Take the CLUECorpusSmall T5-small weight as an example: we download it through the link above and put it in the models/ folder. We can conduct further pre-training on it:

python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --vocab_path models/google_zh_with_sentinel_vocab.txt \
                      --dataset_path dataset.pt \
                      --processes_num 8 --seq_length 128 \
                      --dynamic_masking --data_processor t5

python3 pretrain.py --dataset_path dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_t5_small_seq512_model.bin \
                    --vocab_path models/google_zh_with_sentinel_vocab.txt \
                    --config_path models/t5/small_config.json \
                    --output_model_path models/book_review_t5_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
                    --learning_rate 5e-4 --batch_size 64 \
                    --span_masking --span_geo_prob 0.3 --span_max_length 5

or use it on a downstream dataset:

python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_t5_small_seq512_model.bin \
                                  --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                  --config_path models/t5/small_config.json \
                                  --train_path datasets/tnews_text2text/train.tsv \
                                  --dev_path datasets/tnews_text2text/dev.tsv \
                                  --seq_length 128 --tgt_seq_length 8 --learning_rate 3e-4 --epochs_num 3 --batch_size 32

python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                         --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                         --config_path models/t5/small_config.json \
                                         --test_path datasets/tnews_text2text/test_nolabel.tsv \
                                         --prediction_path datasets/tnews_text2text/prediction.tsv \
                                         --seq_length 128 --tgt_seq_length 8 --batch_size 32

Users can download the tnews dataset in text2text format from here.

Chinese T5-v1_1 Pre-trained Weights

This is the set of Chinese T5-v1_1 pre-trained weights. Configuration files are in the models/t5-v1_1/ folder.

Links and detailed descriptions (Huggingface model hub) of the pre-trained T5-v1_1 weights:

Model link	Description link
CLUECorpusSmall T5-v1_1-small	https://huggingface.co/uer/t5-v1_1-small-chinese-cluecorpussmall
CLUECorpusSmall T5-v1_1-base	https://huggingface.co/uer/t5-v1_1-base-chinese-cluecorpussmall

Take the CLUECorpusSmall T5-v1_1-small weight as an example: we download it through the link above and put it in the models/ folder. We can conduct further pre-training on it:

python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --vocab_path models/google_zh_with_sentinel_vocab.txt \
                      --dataset_path dataset.pt \
                      --processes_num 8 --seq_length 128 \
                      --dynamic_masking --data_processor t5

python3 pretrain.py --dataset_path dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
                    --vocab_path models/google_zh_with_sentinel_vocab.txt \
                    --config_path models/t5-v1_1/small_config.json \
                    --output_model_path models/book_review_t5-v1_1_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
                    --learning_rate 5e-4 --batch_size 64 \
                    --span_masking --span_geo_prob 0.3 --span_max_length 5

or use it on a downstream dataset:

python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
                                  --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                  --config_path models/t5-v1_1/small_config.json \
                                  --train_path datasets/tnews_text2text/train.tsv \
                                  --dev_path datasets/tnews_text2text/dev.tsv \
                                  --seq_length 128 --tgt_seq_length 8 --learning_rate 3e-4 --epochs_num 3 --batch_size 32

python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                         --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                         --config_path models/t5-v1_1/small_config.json \
                                         --test_path datasets/tnews_text2text/test_nolabel.tsv \
                                         --prediction_path datasets/tnews_text2text/prediction.tsv \
                                         --seq_length 128 --tgt_seq_length 8 --batch_size 32

PEGASUS Pre-trained Weights

This is the set of PEGASUS pre-trained weights. Configuration files are in the models/pegasus/ folder.

Links and detailed descriptions (Huggingface model hub) of the PEGASUS weights:

Model link	Description link
CLUECorpusSmall PEGASUS-base	https://huggingface.co/uer/pegasus-base-chinese-cluecorpussmall
CLUECorpusSmall PEGASUS-large	https://huggingface.co/uer/pegasus-large-chinese-cluecorpussmall

BART Pre-trained Weights

This is the set of BART pre-trained weights. Configuration files are in the models/bart/ folder.

Links and detailed descriptions (Huggingface model hub) of the BART weights:

Model link	Description link
CLUECorpusSmall BART-base	https://huggingface.co/uer/bart-base-chinese-cluecorpussmall
CLUECorpusSmall BART-large	https://huggingface.co/uer/bart-large-chinese-cluecorpussmall

Fine-tuned Chinese RoBERTa Weights

This is the set of fine-tuned Chinese RoBERTa weights. ChineseTextualInference SBERT NLI uses the models/sbert/base_config.json configuration file. The rest use the models/bert/base_config.json configuration file.

Links and detailed descriptions (Huggingface model hub) of the fine-tuned RoBERTa weights:

Model link	Description link
JD full sentiment classification	https://huggingface.co/uer/roberta-base-finetuned-jd-full-chinese
JD binary sentiment classification	https://huggingface.co/uer/roberta-base-finetuned-jd-binary-chinese
Dianping sentiment classification	https://huggingface.co/uer/roberta-base-finetuned-dianping-chinese
Ifeng news topic classification	https://huggingface.co/uer/roberta-base-finetuned-ifeng-chinese
Chinanews news topic classification	https://huggingface.co/uer/roberta-base-finetuned-chinanews-chinese
CLUENER2020 NER	https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese
Extractive QA	https://huggingface.co/uer/roberta-base-chinese-extractive-qa
ChineseTextualInference SBERT NLI	https://huggingface.co/uer/sbert-base-chinese-nli

One can load these pre-trained models for pre-training, fine-tuning, and inference.

Chinese Pre-trained Weights Besides Transformer

This is the set of pre-trained weights for models other than Transformer.

Links and detailed descriptions of the pre-trained weights:

Model link	Configuration file	Model details	Training details
CLUECorpusSmall LSTM language model	models/rnn_config.json	--embedding word --remove_embedding_layernorm --encoder lstm --target lm	steps: 500000; learning rate: 1e-3; batch size: 64*8 (the number of GPUs); sequence length: 256
CLUECorpusSmall GRU language model	models/rnn_config.json	--embedding word --remove_embedding_layernorm --encoder gru --target lm	steps: 500000; learning rate: 1e-3; batch size: 64*8 (the number of GPUs); sequence length: 256
CLUECorpusSmall GatedCNN language model	models/gatedcnn_9_config.json	--embedding word --remove_embedding_layernorm --encoder gatedcnn --target lm	steps: 500000; learning rate: 1e-4; batch size: 64*8 (the number of GPUs); sequence length: 256
CLUECorpusSmall ELMo	models/birnn_config.json	--embedding word --remove_embedding_layernorm --encoder bilstm --target bilm	steps: 500000; learning rate: 5e-4; batch size: 64*8 (the number of GPUs); sequence length: 256
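The "batch size: 64*8" entries can be read as a per-GPU batch size of 64 on 8 GPUs. As a rough sketch of the arithmetic, assuming every step processes a full batch of full-length sequences (so the token figure is an upper bound), the amount of data seen during training works out as follows. The function name training_volume is hypothetical.

```python
def training_volume(per_gpu_batch_size, gpus_num, steps, seq_length):
    """Return (global batch size, sequences seen, upper bound on tokens seen)."""
    global_batch = per_gpu_batch_size * gpus_num
    sequences = global_batch * steps
    tokens = sequences * seq_length
    return global_batch, sequences, tokens


global_batch, sequences, tokens = training_volume(64, 8, 500000, 256)
print(global_batch)  # 512
print(sequences)     # 256000000
print(tokens)        # 65536000000
```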

Chinese Pre-trained Weights from Other Organizations

Model link	Description	Description link
Google Chinese BERT-Base	Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer	https://github.com/google-research/bert
Google Chinese ALBERT-Base	Configuration file: models/albert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer	https://github.com/google-research/albert
Google Chinese ALBERT-Large	Configuration file: models/albert/large_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer	https://github.com/google-research/albert
Google Chinese ALBERT-Xlarge	Configuration file: models/albert/xlarge_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer	https://github.com/google-research/albert
Google Chinese ALBERT-Xxlarge	Configuration file: models/albert/xxlarge_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer	https://github.com/google-research/albert
HFL Chinese BERT-wwm	Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer	https://github.com/ymcui/Chinese-BERT-wwm
HFL Chinese BERT-wwm-ext	Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer	https://github.com/ymcui/Chinese-BERT-wwm
HFL Chinese RoBERTa-wwm-ext	Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer	https://github.com/ymcui/Chinese-BERT-wwm
HFL Chinese RoBERTa-wwm-large-ext	Configuration file: models/bert/large_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer	https://github.com/ymcui/Chinese-BERT-wwm

English Pre-trained Weights from Other Organizations

Model link	Description	Description link
English BERT-Base-uncased	Configuration file: models/bert/base_config.json; Vocabulary: models/google_uncased_en_vocab.txt; Tokenizer: BertTokenizer	https://github.com/google-research/bert
English BERT-Base-cased	Configuration file: models/bert/base_config.json; Vocabulary: models/google_cased_en_vocab.txt; Tokenizer: BertTokenizer	https://github.com/google-research/bert
English BERT-Large-uncased	Configuration file: models/bert/large_config.json; Vocabulary: models/google_uncased_en_vocab.txt; Tokenizer: BertTokenizer	https://github.com/google-research/bert
English BERT-Large-cased	Configuration file: models/bert/large_config.json; Vocabulary: models/google_cased_en_vocab.txt; Tokenizer: BertTokenizer	https://github.com/google-research/bert
English BERT-Large-WWM-uncased	Configuration file: models/bert/large_config.json; Vocabulary: models/google_uncased_en_vocab.txt; Tokenizer: BertTokenizer	https://github.com/google-research/bert
English BERT-Large-WWM-cased	Configuration file: models/bert/large_config.json; Vocabulary: models/google_cased_en_vocab.txt; Tokenizer: BertTokenizer	https://github.com/google-research/bert
English RoBERTa-Base	Configuration file: models/xlm-roberta/base_config.json; Vocabulary: models/huggingface_gpt2_vocab.txt, models/huggingface_gpt2_merges.txt; Tokenizer: BPETokenizer	https://huggingface.co/roberta-base
English RoBERTa-Large	Configuration file: models/xlm-roberta/large_config.json; Vocabulary: models/huggingface_gpt2_vocab.txt, models/huggingface_gpt2_merges.txt; Tokenizer: BPETokenizer	https://huggingface.co/roberta-large
