Scripts for LLM pretraining and finetuning (SFT)
LoRA & DeepSpeed supported
The repository is based on tatsu-lab/stanford_alpaca.
- Before you start continual pre-training an LLM, provide the model name (from Hugging Face) or a local model path.
- Prepare the training data. For pretraining you can use plain text in Markdown or txt format; the included example is *A Guide to Writing the NeurIPS Impact Statement*. You can add more text corpora to the data folder (a data-packing sketch follows the launch commands below).
- Launch
pip install -r requirements.txt
cd llm_pretrain
./pretrain_llama.sh
Note that some parameter settings differ between these models.
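For reference, here is a minimal sketch of how plain-text corpora are typically packed into fixed-length token blocks for causal-LM pretraining. The function name and block size are illustrative, and the repository's generate_pretrain_data.py may work differently:

```python
from pathlib import Path

from transformers import AutoTokenizer

def build_pretrain_blocks(data_dir: str, tokenizer, block_size: int = 2048):
    """Concatenate all .md/.txt files in data_dir and split the resulting
    token stream into fixed-length blocks for causal language modeling."""
    corpus = "\n\n".join(
        p.read_text(encoding="utf-8")
        for p in sorted(Path(data_dir).glob("*"))
        if p.suffix in {".md", ".txt"}
    )
    ids = tokenizer(corpus)["input_ids"]
    # Drop the trailing remainder that does not fill a whole block.
    return [ids[i : i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]

tokenizer = AutoTokenizer.from_pretrained("path/to/your/model")  # or a Hugging Face model name
blocks = build_pretrain_blocks("data", tokenizer)
```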
- Before you start fine-tuning an LLM, provide the model name (from Hugging Face) or a local model path.
- Prepare the training data. You can add your own task data following the example in sft_examples.json, which is similar to alpaca_data.json.
The format is as follows:
{
  "binary_selection": [
    {
      "instruction": "Does the following text violate the law?\nText: OH MY FUCKING GOD",
      "output": "No"
    },
    ...
  ],
  "another_task_name": [
    {
      "instruction": "How are you?",
      "output": "Not bad."
    },
    ...
  ],
  ...
}
Note that if you put alpaca_data.json in the data folder, the script will use it as part of the training data.
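For reference, a minimal sketch of how such a task file could be flattened and merged with alpaca_data.json into a single training list; the function name is illustrative, and the repository's generate_sft_data.py may differ:

```python
import json
import os

def load_training_examples(data_dir: str = "data"):
    """Flatten {task_name: [{instruction, output}, ...]} into one list of
    examples, appending alpaca_data.json if it is present in the data folder."""
    with open(os.path.join(data_dir, "sft_examples.json"), encoding="utf-8") as f:
        tasks = json.load(f)
    examples = [
        {"instruction": ex["instruction"], "input": ex.get("input", ""), "output": ex["output"]}
        for items in tasks.values()
        for ex in items
    ]
    alpaca_path = os.path.join(data_dir, "alpaca_data.json")
    if os.path.exists(alpaca_path):  # alpaca_data.json already uses this flat format
        with open(alpaca_path, encoding="utf-8") as f:
            examples.extend(json.load(f))
    return examples
```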
LLaMA-2: since there is no pad_token in LLaMA-2, it is recommended to set `tokenizer.pad_token = tokenizer.unk_token`.
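For example (the model name below is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative model name
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token  # reuse <unk> as the padding token
```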
- Launch
pip install -r requirements.txt
cd llm_sft
./train_llama.sh

Or, for LoRA fine-tuning with Baichuan:

pip install -r requirements.txt
cd llm_sft
./train_baichuan_LORA.sh
You can adjust the LoRA configuration in train_lora.py; a typical setup is sketched below. In our experiments with Baichuan, your transformers version should be >= 4.29.0 and < 4.34.0.
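The exact values in train_lora.py may differ; a typical PEFT LoRA setup looks roughly like this, with all hyperparameters and target module names being illustrative:

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                 # rank of the low-rank update matrices
    lora_alpha=16,       # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # module names vary by architecture (e.g. W_pack for Baichuan)
)
# `model` is the base causal LM loaded beforehand; only the LoRA weights are trained.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```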
Note that some parameter settings differ between these models.
If you want to use DeepSpeed, add the following argument to the training command:
--deepspeed "./configs/default_offload_opt_param.json" \
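The evaluation/inference_single.py scripts in both directories are presumably for single-prompt inference on a trained checkpoint. A minimal sketch under that assumption, using a standard transformers generation loop (the path and generation settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/your/checkpoint"  # illustrative path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "How are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Repository structure: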
.
├── LICENSE
├── README.md
├── llm_pretrain_clean
│ ├── data
│ │ └── A_Guide_to_Writing_the_NeurIPS_Impact_Statement.md
│ ├── evaluation
│ │ └── inference_single.py
│ ├── generate_pretrain_data.py
│ ├── pretrain.py
│ ├── pretrain_baichuan2.sh
│ ├── pretrain_llama.sh
│ ├── pretrain_mistral.sh
│ └── requirements.txt
│ └── utils.py
└── sft_model_clean
├── README.md
├── configs
│ └── default_offload_opt_param.json
├── data
│ ├── alpaca_data.json
│ └── sft_examples.json
├── evaluation
│ └── inference_single.py
├── generate_sft_data.py
├── requirements.txt
├── train.py
├── train_baichuan.sh
├── train_baichuan_LORA.sh
├── train_llama.sh
├── train_lora.py
├── train_mistral.sh
└── utils.py