
Commit

init

wangxidong06 committed Mar 7, 2024
1 parent 84538b1 commit ee35f7c
Showing 134 changed files with 153,779 additions and 155,205 deletions.
5 changes: 0 additions & 5 deletions .gitignore
@@ -1,5 +0,0 @@
/ckpts
/data
/logs
/wanda_logs
/result
32 changes: 13 additions & 19 deletions 0.download_data.sh
@@ -1,32 +1,26 @@
# download ApolloCorpus
mkdir metadata

cd metadata
wget https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus/resolve/main/ApolloCorpus.zip
unzip ApolloCorpus.zip
cd train/pretrain

qa_dir="qa"
pretrain_sft_dir="pretrain_sft"
# Prepare Data for Mix training
mkdir mixTrain

if [ ! -d "$qa_dir" ]; then
mkdir -p "$qa_dir"
fi

if [ ! -d "$pretrain_sft_dir" ]; then
mkdir -p "$pretrain_sft_dir"
fi

cd train/pretrain
# Mixtraining Only use QA pairs in Pretrain
for file in *; do
if [[ $file == *_qa.json ]]; then
mv "$file" "$qa_dir/"
elif [[ $file == *_text.json ]]; then
mv "$file" "$pretrain_sft_dir/"
fi
cp "$file" "../mixTrain/"
done
mv pretrain_sft/ ../
mv qa/ ../
cd ../
rm pretrain

mv sft/ all_sft/
# copy all file from sft to mix_train
mv sft/* mixTrain/

# merge all the file from mix_train directory to json
python merge_json_train.py
cd ../
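`merge_json_train.py` itself is not included in this commit. For orientation only, here is a shell one-liner with the intended effect: a sketch assuming every file under `mixTrain/` holds a JSON array of training records, with `jq` substituted for the actual script (the merged `mixTrain.json` is what `2.data_process_train.sh` later reads):

```bash
# Concatenate all JSON arrays under mixTrain/ into a single training file.
# -s slurps every input file into one array of arrays; 'add' flattens them.
jq -s 'add' mixTrain/*.json > mixTrain.json
```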


11 changes: 7 additions & 4 deletions 1.data_process_test&dev.sh
@@ -1,10 +1,13 @@
python ./src/process/prepare/data_process_test_qwen.py \
# Take gemma as an example; other models' python code is in ./src/process/prepare/data_process_test_{model}.py
mkdir -p ./data/gemma

python ./src/process/prepare/data_process_test_gemma.py \
--data_path ./metadata/test.json \
--few_shot 3 \
--save_path ./data/Qwen/test.json
--save_path ./data/gemma/test.json


python ./src/process/prepare/data_process_test_qwen.py \
python ./src/process/prepare/data_process_test_gemma.py \
--data_path ./metadata/dev.json \
--few_shot 3 \
--save_path ./data/Qwen/dev.json
--save_path ./data/gemma/dev.json
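The per-model preparation scripts differ only in their filename suffix, so several models can be processed in one loop (a sketch: only the `gemma` and `qwen` variants are named in this diff; any other model name is an assumption):

```bash
# Build few-shot test and dev files in each model's prompt format.
for model in gemma qwen; do
  mkdir -p ./data/${model}
  for split in test dev; do
    python ./src/process/prepare/data_process_test_${model}.py \
      --data_path ./metadata/${split}.json \
      --few_shot 3 \
      --save_path ./data/${model}/${split}.json
  done
done
```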
18 changes: 10 additions & 8 deletions 2.data_process_train.sh
@@ -1,15 +1,17 @@
model_name = Qwen
model_path = /your_model_path/Qwen1.5-0.5B
experiment_name=qwenallsftcom_data
# need change 4 place
# Please set the wandb key in the python file (e.g. ./src/process/prepare/data_process_train_gemma.py)

mkdir wandb_logs

experiment_name=Gemma_MixTrain_Data
log_folder="./logs/${experiment_name}"
mkdir -p $log_folder
log_name=$(date +"%m-%d_%H-%M").log


python ./src/process/prepare/data_process_train_qwen.py \
--data_path ./metadata/train/sft.json \
--model_path ${model_path} \
python ./src/process/prepare/data_process_train_gemma.py \
--data_path ./metadata/train/mixTrain.json \
--model_path /your/path/to/gemma-2b \
--wandb_log ./wandb_logs \
--experiment_name ${experiment_name} \
--save_path ./data/${model_name}/allsftcom > ${log_folder}/$log_name 2>&1 &

--save_path ./data/Gemma/mixTrain > ${log_folder}/$log_name 2>&1 &
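Because the tokenization job is detached with `&` and its output is redirected, progress is only visible in the date-stamped log file. A quick way to follow the newest log for this experiment (a convenience snippet, not part of the commit):

```bash
# Follow the newest log file for this experiment.
log_folder="./logs/Gemma_MixTrain_Data"
tail -f "${log_folder}/$(ls -t ${log_folder} | head -n 1)"
```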
19 changes: 11 additions & 8 deletions 3.single_node_train_qwen.sh → 3.single_node_train_gemma.sh
@@ -1,11 +1,13 @@
#!/bin/bash
#python *.py

# Please set the wandb key in the python file (e.g. ./src/sft/train_gemma_resume_val.py)
process_port=29502
experiment_name=Qwen1.5_0.5B_allsft
model_dir=/your_model_path/Qwen1.5-0.5B
train_data_file=./data/Qwen/allsftData
dev_data_file=./data/Qwen/dev.json
experiment_name=Gemma2b_MixTrain_Train
model_dir=/your/path/to/gemma-2b
# ckpt_dir=
train_data_file=./data/gemma/MixTrain
dev_data_file=./data/gemma/dev.json
output_dir=./ckpts
log_folder="./logs/${experiment_name}"
mkdir -p $log_folder
@@ -17,7 +19,7 @@ accelerate launch \
--num_machines 1 \
--main_process_port ${process_port} \
--num_cpu_threads_per_process 8 \
--deepspeed_multinode_launcher standard ./src/sft/train_qwen_resume_val.py \
--deepspeed_multinode_launcher standard ./src/sft/train_gemma_resume_val.py \
--model_path ${model_dir} \
--experiment_name ${experiment_name} \
--gradient_accumulation_steps 8 \
@@ -26,11 +28,12 @@ accelerate launch \
--output_dir ${output_dir} \
--log_dir ./wandb_logs \
--n_epochs 1 \
--train_bsz_per_gpu 4 \
--eval_bsz_per_gpu 4 \
--learning_rate 1e-4 \
--train_bsz_per_gpu 2 \
--eval_bsz_per_gpu 2 \
--learning_rate 1e-5 \
--eval_step -1 \
--save_step -1 \
--warmup_rates 0.03 \
--max_ckpts 5 \
--gradient_checkpointing > ${log_folder}/$log_name 2>&1 &
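With these flags, the effective global batch size is `train_bsz_per_gpu * gradient_accumulation_steps * number of GPU processes`. A quick sanity check (the GPU count below is an assumption; substitute your own):

```bash
# Effective batch size = per-GPU batch x gradient accumulation x GPU count.
train_bsz_per_gpu=2
gradient_accumulation_steps=8
num_gpus=8   # assumption: a single 8-GPU node
echo $(( train_bsz_per_gpu * gradient_accumulation_steps * num_gpus ))   # -> 128
```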

9 changes: 4 additions & 5 deletions 4.eval.sh
@@ -1,13 +1,12 @@
experiment_name=Qwen1.5-0.5B_test
cd .
experiment_name=Gemma2b_MixTrain_Test
log_folder="./logs/${experiment_name}"
result_folder="./results/${experiment_name}"
mkdir -p $log_folder
mkdir -p $result_folder
log_name=$(date +"%m-%d_%H-%M").log

python ./src/evaluate/eval_qwen.py \
--input_path=./data/qwen/test.json \
python ./src/evaluate/eval_gemma.py \
--input_path=./data/gemma/test.json \
--output_path=${result_folder}/model_ans.jsonl \
--score_path=${result_folder}/score.json \
--wrong_item_path=${result_folder}/wrong_item.json > ${log_folder}/$log_name 2>&1 &
--wrong_item_path=${result_folder}/wrong_item.json > ${log_folder}/$log_name 2>&1 &
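When the background job finishes, the benchmark score is written to `score.json` and the mis-answered items to `wrong_item.json`. A quick inspection, assuming `wrong_item.json` holds a JSON array (the `jq` usage is our own):

```bash
# Print the benchmark scores and count mis-answered items.
result_folder="./results/Gemma2b_MixTrain_Test"
jq . "${result_folder}/score.json"
jq length "${result_folder}/wrong_item.json"
```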
65 changes: 36 additions & 29 deletions README.md
@@ -1,4 +1,4 @@
# Apollo, Multilingual Medicine: Model, Dataset, Benchmark, Code
# Multilingual Medicine: Model, Dataset, Benchmark, Code

Covering English, Chinese, French, Hindi, Spanish, and Arabic so far
<center>
@@ -25,12 +25,12 @@ Covering English, Chinese, French, Hindi, Spanish, and Arabic so far
## Results
🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-0.5B" target="_blank">Apollo-0.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-1.8B" target="_blank">Apollo-1.8B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-2B" target="_blank">Apollo-2B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-6B" target="_blank">Apollo-6B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-7B" target="_blank">Apollo-7B</a>


<details><summary>Click to expand</summary>

![Apollo](assets/result.png)



</details>



@@ -43,8 +43,8 @@ Covering English, Chinese, French, Hindi, Spanish, and Arabic so far

![Apollo](assets/dataset.png)

- [Zip File](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus/blob/main/ApolloCorpus.zip)
- [Data category](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus/tree/main/train)
- [Zip File](https://huggingface.co/datasets/FreedomIntelligence/Medbase_data/blob/main/Medbase_data-datasets.zip)
- [Data category](https://huggingface.co/datasets/FreedomIntelligence/Medbase_data/tree/main/train)
- Pretrain:
- data item:
- json_name: {data_source}_{language}_{data_type}.json
@@ -126,17 +126,37 @@ Covering English, Chinese, French, Hindi, Spanish, and Arabic so far
## Results reproduction
<details><summary>Click to expand</summary>
To facilitate training and evaluation, a series of bash scripts are provided below. These scripts are exemplified with the Qwen model and include steps for data download, processing, training, and evaluation. If you are working with a different model, adjustments to the content of these bash files may be necessary.
```bash
bash 0.download_data.sh
bash 1.data_process_test&dev.sh
bash 2.data_process_train.sh
bash 3.single_node_train_qwen.sh
bash 4.eval.sh
```
After executing these commands, the score will be saved at `score_path`.
We take Gemma-2b as an example; an end-to-end driver sketch follows these steps.
1. Download the dataset for the project:
```bash
bash 0.download_data.sh
```
2. Prepare the test and dev sets for the specific model:
We create test data for each model with its own special tokens
```bash
bash "1.data_process_test&dev.sh"
```
3. Prepare the training data for the specific model (the tokenized data is created in advance):
- You can adjust the data training order and the number of training epochs in this step
```bash
bash 2.data_process_train.sh
```
4. Train the model
- To train on multiple nodes, please refer to ./scripts/multi_node_train_*.sh
```bash
bash 3.single_node_train_gemma.sh
```
5. Evaluate your model
- Generate scores on the benchmark
```bash
bash 4.eval.sh
```
- Play with your checkpoints from the command line
```bash
python ./src/evaluate/cli_demo.py --model_name='./ckpts/your/path/tfmr'
```
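For convenience, the five stages can be chained into one driver. Scripts 2-4 detach their python workers with `&`, so a sequential driver has to poll for the worker instead of relying on `wait` (a sketch using the file names from this commit; the polling pattern and intervals are our own):

```bash
#!/bin/bash
# End-to-end reproduction driver (sketch).
set -e

wait_for() {
  # Block until no running process matches the given command-line pattern.
  sleep 10
  while pgrep -f "$1" > /dev/null; do sleep 60; done
}

bash 0.download_data.sh
bash "1.data_process_test&dev.sh"

bash 2.data_process_train.sh
wait_for data_process_train_gemma.py

bash 3.single_node_train_gemma.sh
wait_for train_gemma_resume_val.py

bash 4.eval.sh
wait_for eval_gemma.py
```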
</details>
@@ -160,16 +180,3 @@ Please use the following citation if you intend to use our dataset for training
primaryClass={cs.CL}
}
```

## Contribution and Feedback
If you encounter any issues or have suggestions, requests, or want to report a bug, please feel free to open a GitHub issue. We welcome PRs!

## Star History

<a href="https://star-history.com/#FreedomIntelligence/Apollo&Date">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=FreedomIntelligence/Apollo&type=Date&theme=dark" />
<source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=FreedomIntelligence/Apollo&type=Date" />
<img alt="Star History Chart" src="https://api.star-history.com/svg?repos=FreedomIntelligence/Apollo&type=Date" />
</picture>
</a>
