Modelzoo
With the help of TencentPretrain, we pre-trained models with different properties (for example, models based on different corpora, encoders, and targets). All pre-trained weights introduced in this section are in TencentPretrain format and can be loaded by TencentPretrain directly. More pre-trained weights will be released in the near future. Unless otherwise noted, Chinese pre-trained models use the BERT tokenizer with models/google_zh_vocab.txt as the vocabulary (the one used in the original BERT project), and models/bert/base_config.json is the default configuration file. Commonly used vocabulary and configuration files are included in the models/ folder, so users do not need to download them. In addition, we use scripts/convert_xxx_from_tencentpretrain_to_huggingface.py to convert pre-trained weights into the format that Huggingface Transformers supports, and upload them to the Huggingface model hub (uer). In the rest of this section, we provide download links for the pre-trained weights and show how to use them. Because of space constraints, further details of each pre-trained weight are discussed on its Huggingface model hub page, which we link to when we introduce the weight.
This is the set of 24 Chinese RoBERTa weights. CLUECorpusSmall is used as the training corpus. Configuration files are in the models/bert/ folder. We only provide configuration files for the Tiny, Mini, Small, Medium, Base, and Large models. To load other models, modify emb_size, feedforward_size, hidden_size, heads_num, and layers_num in the configuration file. Note that emb_size = hidden_size, feedforward_size = 4 * hidden_size, and heads_num = hidden_size / 64. More details of these pre-trained weights are discussed here.
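The size relations above can be sketched in a few lines of Python. This is only an illustration: `derive_config_fields` is a hypothetical helper, not part of TencentPretrain, and it covers just the five size-dependent fields. Any other fields would presumably be copied unchanged from one of the provided files, such as models/bert/tiny_config.json.

```python
import json
from typing import Dict


def derive_config_fields(layers_num: int, hidden_size: int) -> Dict[str, int]:
    """Return the five size-dependent fields for a given L/H combination."""
    if hidden_size % 64 != 0:
        raise ValueError("hidden_size must be a multiple of 64")
    return {
        "emb_size": hidden_size,               # emb_size = hidden_size
        "feedforward_size": 4 * hidden_size,   # feedforward_size = 4 * hidden_size
        "hidden_size": hidden_size,
        "heads_num": hidden_size // 64,        # heads_num = hidden_size / 64
        "layers_num": layers_num,
    }


if __name__ == "__main__":
    # Example: the 6/512 weight, for which no configuration file is provided.
    print(json.dumps(derive_config_fields(6, 512), indent=2))
```

For the 6/512 weight this yields 8 attention heads and a feed-forward size of 2048.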
The pre-trained Chinese weight links of different layers (L) and hidden sizes (H):
Layers (L) | H=128 | H=256 | H=512 | H=768 |
---|---|---|---|---|
L=2 | 2/128 (Tiny) | 2/256 | 2/512 | 2/768 |
L=4 | 4/128 | 4/256 (Mini) | 4/512 (Small) | 4/768 |
L=6 | 6/128 | 6/256 | 6/512 | 6/768 |
L=8 | 8/128 | 8/256 | 8/512 (Medium) | 8/768 |
L=10 | 10/128 | 10/256 | 10/512 | 10/768 |
L=12 | 12/128 | 12/256 | 12/512 | 12/768 (Base) |
Taking the Tiny weight as an example, we download it through the link above and put it in the models/ folder. We can either conduct further pre-training on it:
python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 --data_processor mlm
python3 pretrain.py --dataset_path dataset.pt --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
--output_model_path models/output_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 \
--data_processor mlm --target mlm
or use it on a downstream classification dataset:
python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
--train_path datasets/book_review/train.tsv \
--dev_path datasets/book_review/dev.tsv \
--test_path datasets/book_review/test.tsv \
--learning_rate 3e-4 --epochs_num 8 --batch_size 64
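The further pre-training command above is written for a single machine with 8 GPUs. The sketch below shows how the two distributed flags could be generated for a different GPU count. It assumes that --world_size is the number of training processes and that --gpu_ranks lists one rank per process, numbered from 0 on a single machine. `distributed_args` is a hypothetical helper, not part of TencentPretrain.

```python
from typing import List


def distributed_args(gpu_count: int) -> List[str]:
    """Build the --world_size / --gpu_ranks arguments for one machine."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    # Assumption: ranks are simply 0 .. gpu_count - 1 on a single machine.
    ranks = [str(rank) for rank in range(gpu_count)]
    return ["--world_size", str(gpu_count), "--gpu_ranks", *ranks]


if __name__ == "__main__":
    # Reproduces the flags used in the 8-GPU command.
    print(" ".join(distributed_args(8)))
```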
In the fine-tuning stage, pre-trained models of different sizes usually require different hyper-parameters. The following example uses grid search to find the best hyper-parameters:
python3 finetune/run_classifier_grid.py --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/bert/tiny_config.json \
--train_path datasets/book_review/train.tsv \
--dev_path datasets/book_review/dev.tsv \
--learning_rate_list 3e-5 1e-4 3e-4 --epochs_num_list 3 5 8 --batch_size_list 32 64
We can reproduce the experimental results reported here with the grid search script above.
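To get a sense of the cost of this search, the sketch below enumerates the hyper-parameter combinations implied by the three lists passed on the command line. It assumes the script tries every combination exhaustively; it illustrates the search space and is not the implementation of finetune/run_classifier_grid.py. `build_grid` is a hypothetical name.

```python
from itertools import product
from typing import List, Tuple


def build_grid(
    learning_rates: List[float],
    epochs_nums: List[int],
    batch_sizes: List[int],
) -> List[Tuple[float, int, int]]:
    """Return every (learning_rate, epochs_num, batch_size) combination."""
    return list(product(learning_rates, epochs_nums, batch_sizes))


if __name__ == "__main__":
    # Values taken from the grid search command above.
    grid = build_grid([3e-5, 1e-4, 3e-4], [3, 5, 8], [32, 64])
    print(f"{len(grid)} fine-tuning runs")
    for learning_rate, epochs_num, batch_size in grid:
        print(learning_rate, epochs_num, batch_size)
```

With 3 learning rates, 3 epoch counts, and 2 batch sizes, an exhaustive search amounts to 18 fine-tuning runs.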
This is the set of 5 Chinese word-based RoBERTa weights. CLUECorpusSmall is used as the training corpus. Configuration files are in the models/bert/ folder. Google SentencePiece is used as the tokenizer, with models/cluecorpussmall_spm.model as the SentencePiece model. Most Chinese pre-trained weights are based on Chinese characters. Compared with character-based models, word-based models are faster (because of the shorter sequence length) and perform better according to our experimental results. More details of these pre-trained weights are discussed here.
The pre-trained Chinese weight links of different sizes:
Link |
---|
L=2/H=128 (Tiny) |
L=4/H=256 (Mini) |
L=4/H=512 (Small) |
L=8/H=512 (Medium) |
L=12/H=768 (Base) |
Taking the word-based Tiny weight as an example, we download it through the link above and put it in the models/ folder. We can either conduct further pre-training on it:
python3 preprocess.py --corpus_path corpora/book_review.txt --spm_model_path models/cluecorpussmall_spm.model \
--dataset_path dataset.pt --processes_num 8 --data_processor mlm
python3 pretrain.py --dataset_path dataset.pt --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
--spm_model_path models/cluecorpussmall_spm.model --config_path models/bert/tiny_config.json \
--output_model_path models/output_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 \
--data_processor mlm --target mlm
or use it on a downstream classification dataset:
python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
--spm_model_path models/cluecorpussmall_spm.model \
--config_path models/bert/tiny_config.json \
--train_path datasets/book_review/train.tsv \
--dev_path datasets/book_review/dev.tsv \
--test_path datasets/book_review/test.tsv \
--learning_rate 3e-4 --epochs_num 8 --batch_size 64
The following example uses grid search to find the best hyper-parameters for the word-based model:
python3 finetune/run_classifier_grid.py --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
--spm_model_path models/cluecorpussmall_spm.model \
--config_path models/bert/tiny_config.json \
--train_path datasets/book_review/train.tsv \
--dev_path datasets/book_review/dev.tsv \
--learning_rate_list 3e-5 1e-4 3e-4 --epochs_num_list 3 5 8 --batch_size_list 32 64
We can reproduce the experimental results reported here with the grid search script above.
This is the set of Chinese GPT-2 pre-trained weights. Configuration files are in models/gpt2/ folder.
The link and detailed description (Huggingface model hub) of different pre-trained GPT-2 weights:
Note that extended vocabularies (models/google_zh_poem_vocab.txt and models/google_zh_ancient_vocab.txt) are used in the Poem and Ancient GPT-2 models. The CLUECorpusSmall GPT-2-distil model uses the models/gpt2/distil_config.json configuration file; models/gpt2/config.json is used for the other weights.
Taking the CLUECorpusSmall GPT-2-distil weight as an example, we download it through the link above and put it in the models/ folder. We can either conduct further pre-training on it:
python3 preprocess.py --corpus_path corpora/book_review.txt \
--vocab_path models/google_zh_vocab.txt \
--dataset_path dataset.pt --processes_num 8 \
--seq_length 128 --data_processor lm
python3 pretrain.py --dataset_path dataset.pt \
--pretrained_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/gpt2/distil_config.json \
--output_model_path models/book_review_gpt2_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
--learning_rate 5e-5 --batch_size 64
or use it on a downstream classification dataset:
python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/gpt2/distil_config.json \
--train_path datasets/book_review/train.tsv \
--dev_path datasets/book_review/dev.tsv \
--test_path datasets/book_review/test.tsv \
--learning_rate 3e-5 --epochs_num 8 --batch_size 64
The GPT-2 model can be used for text generation. First, we create story_beginning.txt and enter the beginning of the text. Then we use scripts/generate_lm.py to generate the continuation:
python3 scripts/generate_lm.py --load_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/gpt2/distil_config.json \
--test_path story_beginning.txt --prediction_path story_full.txt \
--seq_length 128
This is the set of Chinese ALBERT pre-trained weights. Configuration files are in models/albert/ folder.
The link and detailed description (Huggingface model hub) of different pre-trained ALBERT weights:
Taking the CLUECorpusSmall ALBERT-base weight as an example, we download it through the link above and put it in the models/ folder. The following example uses ALBERT-base on a downstream dataset:
python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_albert_base_seq512_model.bin \
--vocab_path models/google_zh_vocab.txt --config_path models/albert/base_config.json \
--train_path datasets/book_review/train.tsv \
--dev_path datasets/book_review/dev.tsv \
--test_path datasets/book_review/test.tsv \
--learning_rate 2e-5 --epochs_num 3 --batch_size 64
python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin \
--vocab_path models/google_zh_vocab.txt \
--config_path models/albert/base_config.json \
--test_path datasets/book_review/test_nolabel.tsv \
--prediction_path datasets/book_review/prediction.tsv \
--labels_num 2
This is the set of Chinese T5 pre-trained weights. Configuration files are in models/t5/ folder.
The link and detailed description (Huggingface model hub) of different pre-trained T5 weights:
Model link | Description link |
---|---|
CLUECorpusSmall T5-small | https://huggingface.co/uer/t5-small-chinese-cluecorpussmall |
CLUECorpusSmall T5-base | https://huggingface.co/uer/t5-base-chinese-cluecorpussmall |
Taking the CLUECorpusSmall T5-small weight as an example, we download it through the link above and put it in the models/ folder. We can either conduct further pre-training on it:
python3 preprocess.py --corpus_path corpora/book_review.txt \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--dataset_path dataset.pt \
--processes_num 8 --seq_length 128 \
--dynamic_masking --data_processor t5
python3 pretrain.py --dataset_path dataset.pt \
--pretrained_model_path models/cluecorpussmall_t5_small_seq512_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5/small_config.json \
--output_model_path models/book_review_t5_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
--learning_rate 5e-4 --batch_size 64 \
--span_masking --span_geo_prob 0.3 --span_max_length 5
or use it on a downstream dataset:
python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_t5_small_seq512_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5/small_config.json \
--train_path datasets/tnews_text2text/train.tsv \
--dev_path datasets/tnews_text2text/dev.tsv \
--seq_length 128 --tgt_seq_length 8 --learning_rate 3e-4 --epochs_num 3 --batch_size 32
python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5/small_config.json \
--test_path datasets/tnews_text2text/test_nolabel.tsv \
--prediction_path datasets/tnews_text2text/prediction.tsv \
--seq_length 128 --tgt_seq_length 8 --batch_size 32
Users can download the tnews dataset in text2text format from here.
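The further pre-training command in this section passes --span_masking --span_geo_prob 0.3 --span_max_length 5. The sketch below shows what these values would imply if masked span lengths follow a geometric distribution truncated at the maximum length and renormalized. That reading of the flags is an assumption, and `span_length_distribution` is a hypothetical helper rather than TencentPretrain code.

```python
from typing import List


def span_length_distribution(geo_prob: float, max_length: int) -> List[float]:
    """Return P(length = k) for k = 1..max_length under a truncated geometric law."""
    if not 0.0 < geo_prob <= 1.0:
        raise ValueError("geo_prob must be in (0, 1]")
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    # Unnormalized geometric weights: p * (1 - p)^(k - 1).
    raw = [geo_prob * (1.0 - geo_prob) ** (k - 1) for k in range(1, max_length + 1)]
    total = sum(raw)
    # Renormalize so the truncated distribution sums to 1.
    return [value / total for value in raw]


if __name__ == "__main__":
    probs = span_length_distribution(0.3, 5)
    for length, prob in enumerate(probs, start=1):
        print(f"span length {length}: {prob:.3f}")
    mean = sum(k * p for k, p in enumerate(probs, start=1))
    print(f"mean span length: {mean:.2f}")
```

Under this assumption, shorter spans are always more likely than longer ones, and raising --span_geo_prob shifts the mass further toward single-token spans.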
This is the set of Chinese T5-v1_1 pre-trained weights. Configuration files are in models/t5-v1_1/ folder.
The link and detailed description (Huggingface model hub) of different pre-trained T5-v1_1 weights:
Taking the CLUECorpusSmall T5-v1_1-small weight as an example, we download it through the link above and put it in the models/ folder. We can either conduct further pre-training on it:
python3 preprocess.py --corpus_path corpora/book_review.txt \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--dataset_path dataset.pt \
--processes_num 8 --seq_length 128 \
--dynamic_masking --data_processor t5
python3 pretrain.py --dataset_path dataset.pt \
--pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5-v1_1/small_config.json \
--output_model_path models/book_review_t5-v1_1_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
--learning_rate 5e-4 --batch_size 64 \
--span_masking --span_geo_prob 0.3 --span_max_length 5
or use it on a downstream dataset:
python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5-v1_1/small_config.json \
--train_path datasets/tnews_text2text/train.tsv \
--dev_path datasets/tnews_text2text/dev.tsv \
--seq_length 128 --tgt_seq_length 8 --learning_rate 3e-4 --epochs_num 3 --batch_size 32
python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5-v1_1/small_config.json \
--test_path datasets/tnews_text2text/test_nolabel.tsv \
--prediction_path datasets/tnews_text2text/prediction.tsv \
--seq_length 128 --tgt_seq_length 8 --batch_size 32
This is the set of PEGASUS pre-trained weights. Configuration files are in models/pegasus/ folder.
The link and detailed description (Huggingface model hub) of PEGASUS weights:
This is the set of BART pre-trained weights. Configuration files are in models/bart/ folder.
The link and detailed description (Huggingface model hub) of BART weights:
This is the set of fine-tuned Chinese RoBERTa weights. SBERT ChineseTextualInference NLI uses the models/sbert/base_config.json configuration file. The rest use the models/bert/base_config.json configuration file.
The link and detailed description (Huggingface model hub) of different fine-tuned RoBERTa weights:
One can load these pre-trained models for pre-training, fine-tuning, and inference.
This is the set of pre-trained weights for models other than Transformer.
The link and detailed description of different pre-trained weights:
Model link | Configuration file | Model details | Training details |
---|---|---|---|
CLUECorpusSmall LSTM language model | models/rnn_config.json | --embedding word --remove_embedding_layernorm --encoder lstm --target lm | steps: 500000; learning rate: 1e-3; batch size: 64*8 (the number of GPUs); sequence length: 256 |
CLUECorpusSmall GRU language model | models/rnn_config.json | --embedding word --remove_embedding_layernorm --encoder gru --target lm | steps: 500000; learning rate: 1e-3; batch size: 64*8 (the number of GPUs); sequence length: 256 |
CLUECorpusSmall GatedCNN language model | models/gatedcnn_9_config.json | --embedding word --remove_embedding_layernorm --encoder gatedcnn --target lm | steps: 500000; learning rate: 1e-4; batch size: 64*8 (the number of GPUs); sequence length: 256 |
CLUECorpusSmall ELMo | models/birnn_config.json | --embedding word --remove_embedding_layernorm --encoder bilstm --target bilm | steps: 500000; learning rate: 5e-4; batch size: 64*8 (the number of GPUs); sequence length: 256 |
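The training details in the table can be turned into a rough training budget. The sketch below assumes that "64*8" means a batch size of 64 on each of 8 GPUs and that every sequence is filled to the full sequence length, so the token count is an upper bound. `training_budget` is a hypothetical helper.

```python
from typing import Dict


def training_budget(
    steps: int, per_gpu_batch_size: int, gpu_count: int, seq_length: int
) -> Dict[str, int]:
    """Estimate how many sequences and tokens a training run processes."""
    # Assumption: the global batch is the per-GPU batch times the GPU count.
    global_batch_size = per_gpu_batch_size * gpu_count
    sequences_seen = steps * global_batch_size
    return {
        "global_batch_size": global_batch_size,
        "sequences_seen": sequences_seen,
        # Upper bound: padding would lower the real token count.
        "max_tokens_seen": sequences_seen * seq_length,
    }


if __name__ == "__main__":
    # Values shared by all four rows of the table.
    for name, value in training_budget(500000, 64, 8, 256).items():
        print(f"{name}: {value:,}")
```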
Model link | Description | Description link |
---|---|---|
Google Chinese BERT-Base | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/bert |
Google Chinese ALBERT-Base | Configuration file: models/albert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
Google Chinese ALBERT-Large | Configuration file: models/albert/large_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
Google Chinese ALBERT-Xlarge | Configuration file: models/albert/xlarge_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
Google Chinese ALBERT-Xxlarge | Configuration file: models/albert/xxlarge_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/albert |
HFL Chinese BERT-wwm | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
HFL Chinese BERT-wwm-ext | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
HFL Chinese RoBERTa-wwm-ext | Configuration file: models/bert/base_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
HFL Chinese RoBERTa-wwm-large-ext | Configuration file: models/bert/large_config.json; Vocabulary: models/google_zh_vocab.txt; Tokenizer: BertTokenizer | https://github.com/ymcui/Chinese-BERT-wwm |
Model link | Description | Description link |
---|---|---|
English BERT-Base-uncased | Configuration file: models/bert/base_config.json; Vocabulary: models/google_uncased_en_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/bert |
English BERT-Base-cased | Configuration file: models/bert/base_config.json; Vocabulary: models/google_cased_en_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/bert |
English BERT-Large-uncased | Configuration file: models/bert/large_config.json; Vocabulary: models/google_uncased_en_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/bert |
English BERT-Large-cased | Configuration file: models/bert/large_config.json; Vocabulary: models/google_cased_en_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/bert |
English BERT-Large-WWM-uncased | Configuration file: models/bert/large_config.json; Vocabulary: models/google_uncased_en_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/bert |
English BERT-Large-WWM-cased | Configuration file: models/bert/large_config.json; Vocabulary: models/google_cased_en_vocab.txt; Tokenizer: BertTokenizer | https://github.com/google-research/bert |
English RoBERTa-Base | Configuration file: models/xlm-roberta/base_config.json; Vocabulary: models/huggingface_gpt2_vocab.txt, models/huggingface_gpt2_merges.txt; Tokenizer: BPETokenizer | https://huggingface.co/roberta-base |
English RoBERTa-Large | Configuration file: models/xlm-roberta/large_config.json; Vocabulary: models/huggingface_gpt2_vocab.txt, models/huggingface_gpt2_merges.txt; Tokenizer: BPETokenizer | https://huggingface.co/roberta-large |
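Because each weight must be paired with the right configuration and vocabulary files, it can help to keep that pairing in one place. The sketch below records a few BertTokenizer-based rows from the tables above and turns them into the --vocab_path and --config_path arguments used throughout this page. `MODEL_FILES` and `file_args` are hypothetical names, and the BPE-based RoBERTa rows are left out because their extra tokenizer arguments are not covered here.

```python
from typing import Dict, List

# A few rows copied from the tables above (BertTokenizer-based models only).
MODEL_FILES: Dict[str, Dict[str, str]] = {
    "Google Chinese BERT-Base": {
        "config_path": "models/bert/base_config.json",
        "vocab_path": "models/google_zh_vocab.txt",
    },
    "HFL Chinese RoBERTa-wwm-large-ext": {
        "config_path": "models/bert/large_config.json",
        "vocab_path": "models/google_zh_vocab.txt",
    },
    "English BERT-Base-uncased": {
        "config_path": "models/bert/base_config.json",
        "vocab_path": "models/google_uncased_en_vocab.txt",
    },
    "English BERT-Large-cased": {
        "config_path": "models/bert/large_config.json",
        "vocab_path": "models/google_cased_en_vocab.txt",
    },
}


def file_args(model_name: str) -> List[str]:
    """Return the vocabulary and configuration arguments for a listed model."""
    files = MODEL_FILES[model_name]  # raises KeyError for unlisted models
    return ["--vocab_path", files["vocab_path"], "--config_path", files["config_path"]]


if __name__ == "__main__":
    print(" ".join(file_args("English BERT-Large-cased")))
```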