fix error commit & update pro
huybery committed Aug 16, 2023
1 parent 1cdc7be commit 64f4da7
Showing 34 changed files with 1,154 additions and 161 deletions.
57 changes: 44 additions & 13 deletions PRO/README.md
@@ -4,47 +4,78 @@ Authors: Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, H
arXiv: [Abstract](https://arxiv.org/abs/2306.17492) / [PDF](https://arxiv.org/pdf/2306.17492.pdf)

## Abstract
Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment by combining a reward model, typically based on Bradley-Terry paired comparison, with an RL algorithm such as Proximal Policy Optimization (PPO) to optimize LLM responses. However, RLHF exhibits complexity, instability, and sensitivity to hyperparameters. In this paper, we propose Preference Ranking Optimization (PRO) as an alternative to PPO for directly aligning LLMs with the Bradley-Terry comparison. PRO extends the pairwise Bradley-Terry comparison to accommodate preference rankings of any length. By iteratively contrasting the likelihood of generating responses, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms existing alignment algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations. Furthermore, we demonstrate that longer, more diverse, and higher-quality preference ranking sequences can consistently enhance the performance of human alignment.
Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. However, it encompasses two main drawbacks: (1) RLHF exhibits complexity, instability, and sensitivity to hyperparameters in contrast to SFT. (2) Despite massive trial-and-error, multiple sampling is reduced to pair-wise contrast, thus lacking contrasts from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to directly fine-tune LLMs for human alignment. PRO extends the pair-wise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by the LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms baseline algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations.

## The pipeline of PRO
<div align="center"><img src="./resources/pipeline.jpg" style="zoom:100%"></div>
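For intuition only, the ranking step of this pipeline can be sketched as an iterative top-versus-rest contrast over the likelihoods the LLM assigns to the candidate responses. The snippet below is a minimal sketch, not the repository's training code; the tensor name, the use of pre-computed (e.g. length-normalized) sequence log-likelihoods, and the omission of any auxiliary SFT term are assumptions.

```
import torch
import torch.nn.functional as F

def pro_ranking_loss(seq_logprobs: torch.Tensor) -> torch.Tensor:
    """Illustrative PRO-style ranking loss.

    seq_logprobs: shape [n], log-likelihoods the LLM assigns to n candidate
    responses, ordered from most preferred (index 0) to least preferred.
    """
    loss = seq_logprobs.new_zeros(())
    # Iteratively treat the k-th best response as the positive and contrast it
    # against every response ranked below it, aligning the whole ranking.
    for k in range(seq_logprobs.size(0) - 1):
        loss = loss - F.log_softmax(seq_logprobs[k:], dim=0)[0]
    return loss

# Toy usage with three candidates, best first.
print(pro_ranking_loss(torch.tensor([-1.0, -2.0, -3.0])))
```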

## Results
### Automatic Evaluation
<div align="center"><img src="./resources/automatic.jpg" style="zoom:100%"></div>
### Automatic Evaluation on *HH-RLHF*
<div align="center"><img src="./resources/automatic_hh.jpg" style="zoom:100%"></div>

### GPT-4 Evaluation
<div align="center"><img src="./resources/gpt4.jpg" style="zoom:33%"></div>

### Human Evaluation
<div align="center"><img src="./resources/human.jpg" style="zoom:33%"></div>

### Automatic Evaluation on *Summarize From Feedback*
<div align="center"><img src="./resources/automatic_summarize.jpg" style="zoom:50%"></div>

## Running!
### Data Preparation
1. Download [data.zip](https://ylab-mobile-prod.oss-cn-beijing.aliyuncs.com/yueli.ybw/pro_data.zip) and unzip it.
2. Place the unzipped ```data/``` folder in the root directory of the project.
3. You can also get the raw data from [this repo](https://github.com/anthropics/hh-rlhf), and run the following command to preprocess it to get the same data as ```train_len2/``` in ```data.zip```:
We provide the preprocessed data for training and testing, which can be obtained with the following steps:
1. Download [data.zip](https://ylab-mobile-prod.oss-cn-beijing.aliyuncs.com/yueli.ybw/data.zip) and unzip it.
2. Place the unzipped ```data``` folder in the root directory of the project.

We also provide scripts for preprocessing the raw data. Please follow the steps below to prepare the data:
1. Create a directory named ```data``` in the root directory of this project.
2. Create a directory named ```data/raw_data``` in the ```data``` directory.
3. Download the raw data from [*HH-RLHF*](https://github.com/anthropics/hh-rlhf) or [*Summarize From Feedback*](https://github.com/openai/summarize-from-feedback), name it ```hhrlhf``` or ```summarize_from_feedback``` respectively, and put it in the ```data/raw_data``` directory.
4. Run the following commands to preprocess the data:

```
cd train/preprocess_data
# For HH-RLHF
cd train/hh_preprocess_data
python step_1_process.py
python step_2_get_train_data.py
python step_3_get_test_data.py
# For Summarize From Feedback
cd ../summarize_preprocess_data
python step_1_process.py
python step_2_get_train_data.py
python step_3_get_test_data.py
```
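After preprocessing, the evaluation scripts in this commit read each test split as JSON Lines and access ```prefix[0]``` and ```suffix[0]``` on every record. A quick way to sanity-check the output is sketched below; the path mirrors the HH-RLHF evaluation code and is an assumption, so adjust it to the split you built.

```
import json

# Assumed location, mirroring the HH-RLHF evaluation code in this commit.
with open("data/hh_test/helpful_online.json", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(list(record.keys()))   # expect at least 'prefix' and 'suffix'
print(record["prefix"][0])   # prompt / dialogue context used at inference time
print(record["suffix"][0])   # reference response used for the BLEU metric
```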

### Train
We provide the training script for training the model. For example, you can run the following command to train the model:
We provide scripts for training the model. For example, you can run the following commands:
```
cd train
./train.sh [id_of_exp] train_len2 2
# Train LLMs with HH-RLHF
./train_hh.sh [id_of_exp] hh_train_len2 2
# Train LLMs with Summarize From Feedback
./train_summarize.sh [id_of_exp] summarize_train_len2 2
# Length 3
./train3_summarize.sh [id_of_exp] summarize_train_len3_alpaca 3
```
You can modify ```train.sh``` to train the model with a different dataset.

The scripts can be easily modified to train LLMs with different datasets.
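The three positional arguments appear to be an experiment id (```id_of_exp```), the name of a preprocessed training set, and the ranking length. Judging from the evaluation code added in this commit, a run's checkpoints are then expected under a path of the form sketched below (values are illustrative):

```
import os

id_of_exp, ranking_len = "1", 2   # illustrative; match the values used for training
checkpoint_dir = os.path.join(
    "checkpoints", f"index_{id_of_exp}", f"stage_{ranking_len}", "best_checkpoint"
)
print(checkpoint_dir)   # the eval scripts load the model and tokenizer from here
```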

### Test
You can run the following command to test the model:
The following commands can be used to test the model:
```
cd eval
# Test LLMs with HH-RLHF
cd eval_hh
./run_infer_main_dist.sh
# Test LLMs with Summarize From Feedback
cd ../eval_summarize
./run_infer_main_dist.sh
```
> **Note:** Before running this script, you should modify ```infer_main_dist.sh``` to specify ```id_of_exp``` and the corresponding ranking length used in training.
> **Note:** Before running, ```id_of_exp``` and the corresponding ranking length used during training have to be specified in ```run_infer_main_dist.sh```.

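The reward stage writes one JSON line per test example, with an ```infer``` field holding the generated text ```t```, its reward-model ```score```, and a ```bleu``` value (see ```infer_and_eval_main_reward.py``` below). If you want to aggregate these outputs yourself, a rough sketch follows; the path assumes the *Summarize From Feedback* pipeline with ```id_of_exp=1``` and ranking length 2.

```
import json

# Assumed output of the reward stage for index 1, stage 2; adjust to your run.
path = "eval_summarize/inference_res/infer_main_1_2_test.json"
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

avg_reward = sum(r["infer"]["score"] for r in records) / len(records)
avg_bleu = sum(r["infer"]["bleu"] for r in records) / len(records)
print(f"avg reward: {avg_reward:.4f}  avg BLEU: {avg_bleu:.4f}")
```
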
## Citation
If you find this work helpful, please cite our paper as:
File renamed without changes.
@@ -76,7 +76,7 @@ def get_args():
"helpful_online.json",
"helpful_rejection.json"
]:
file_path = os.path.join("..", "data", "test", file_name)
file_path = os.path.join("..", "data", "hh_test", file_name)
with open(file_path, "r", encoding='utf-8') as f:
infer_data = {line_index: json.loads(l) for line_index, l in enumerate(f.readlines()) if (line_index-rank) % rank_sum == 0}

File renamed without changes.
File renamed without changes.
@@ -62,7 +62,6 @@ def pipeline(prompts):
text = text_res[index]
assert truncated_prompts[index].rstrip() in text
text = text.replace(truncated_prompts[index].rstrip(), "").strip()
# text = text[prompts_size[index]:].strip()
for stop in ["Human:", "human:", "Assistant:", "assistant:"]:
stop_ix = text.find(stop)
if stop_ix >= 0:
40 changes: 0 additions & 40 deletions PRO/eval/metrics2.py → PRO/eval_hh/metrics2.py
@@ -21,47 +21,7 @@ def get_bleu(hyp, ref):
ref = ref.strip()
return nltk.translate.bleu_score.sentence_bleu([ref], hyp)

# Thank trlx for their helpful code:
# https://github.com/CarperAI/trlx/blob/main/examples/hh/ppo_hh.py#L115
def create_reward_fn_1():
reward_tokenizer = AutoTokenizer.from_pretrained("gpt2")
reward_tokenizer.pad_token = reward_tokenizer.eos_token
reward_tokenizer.truncation_side = "left"
reward_model = TrainRewardModel("EleutherAI/gpt-j-6B", reward_tokenizer.eos_token_id)
checkpoint = os.path.join("..", "rm", "gptj-rm-static", "hf_ckpt.pt")

reward_model.load_state_dict(torch.load(checkpoint))
reward_device = "cuda:{}".format(rank)
reward_model = reward_model.half().to(reward_device)
reward_model.eval()

def get_score(prefixes, suffixes):
# prefixes = [[p1, p1, p1], [p2, p2, p2]]
# suffixes = [s1, s2]
texts = []
for p, s in zip(prefixes,suffixes):
p = "".join(p)
p = p.replace("<|prompter|>", "\n\nHuman: ").replace("<|assistant|>", "\n\nAssistant: ")
texts.append(p + s + reward_tokenizer.eos_token)

input = reward_tokenizer(
texts,
padding=True,
truncation=True,
max_length=reward_tokenizer.max_len_single_sentence,
return_tensors="pt",
).to(reward_device)

with torch.no_grad():
rewards = reward_model(input['input_ids']) # [batch]

return rewards.view(-1)
# return torch.sigmoid(rewards.view(-1))

return get_score, 16

def create_reward_fn_2():
# model_name = "OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5"
model_name = "OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1"
model_device = "cuda:{}".format(rank)
tokenizer = AutoTokenizer.from_pretrained(model_name)
File renamed without changes.
@@ -3,19 +3,14 @@ export OMP_NUM_THREADS=16

id=$1
ranking_len=$2
# 30 min
accelerate launch --config_file dp_config.yaml infer_and_eval_main_generate.py \
--index $id \
--stage $ranking_len > logs/generate_infer_main_${id}_${ranking_len}.log 2>&1

#10 min
accelerate launch --config_file dp_config.yaml infer_and_eval_main_reward.py \
--index $id \
--stage $ranking_len > logs/reward_infer_main_${id}_${ranking_len}.log 2>&1

#1 second
python -u infer_and_eval_main_score.py \
--index $id \
--stage $ranking_len > logs/score_infer_main_${id}_${ranking_len}.log 2>&1

# total 40 min
--stage $ranking_len > logs/score_infer_main_${id}_${ranking_len}.log 2>&1
16 changes: 16 additions & 0 deletions PRO/eval_summarize/dp_config.yaml
@@ -0,0 +1,16 @@
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
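# one process per GPU; adjust num_processes to the number of available GPUs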
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
95 changes: 95 additions & 0 deletions PRO/eval_summarize/infer_and_eval_main_generate.py
@@ -0,0 +1,95 @@
#import some packages and reward funcs
import os
import argparse
import json
import tqdm
import torch
import torch.nn.functional as F
import metrics2
from transformers import (
AutoConfig,
AutoTokenizer,
LlamaTokenizer,
AutoModelForCausalLM
)
from infer_func_now import setup_seed, generate_pipeline
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs
from datetime import timedelta

def get_args():
parser = argparse.ArgumentParser(description="")
parser.add_argument('--index', type=str)
parser.add_argument('--stage', type=int)
parser.add_argument('--directory', default="best_checkpoint", type=str)
args = parser.parse_args()
return args

if __name__ == "__main__":
args = get_args()
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))
accelerator = Accelerator(kwargs_handlers=[kwargs])
rank = int(os.environ['RANK'])
rank_sum = accelerator.num_processes
model_name_or_path = os.path.join("..", "checkpoints", f"index_{args.index}", f"stage_{args.stage}", f"{args.directory}")
model_device = "cuda:{}".format(rank)

model_config = AutoConfig.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, config=model_config, torch_dtype=torch.bfloat16).to(model_device)
if accelerator.is_main_process:
print(type(model))
print(model.config)
if model.config.architectures[0].lower() == "llamaforcausallm":
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path)
tokenizer.unk_token = "<unk>"
tokenizer.bos_token = "<s>"
tokenizer.eos_token = "</s>"
else:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.sep_token = "<sep>"
model.resize_token_embeddings(len(tokenizer))

print(model.dtype)
torch.cuda.empty_cache()
model.eval()
print(f"Rank {rank} is activated...")
if accelerator.is_main_process:
file_name = "test.json"
save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name))
if os.path.exists(save_path):
os.remove(save_path)
accelerator.wait_for_everyone()

file_name = "test.json"
file_path = os.path.join("..", "data", "summarize_test", file_name)
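# shard the test set across processes: this rank keeps every rank_sum-th line of the file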
with open(file_path, "r", encoding='utf-8') as f:
infer_data = {line_index: json.loads(l) for line_index, l in enumerate(f.readlines()) if (line_index-rank) % rank_sum == 0}

for line_index in infer_data:
infer_data[line_index]["line_index"] = line_index
infer_data = [infer_data[line_index] for line_index in infer_data]

prompts = [l['prefix'][0] for l in infer_data]

setup_seed()
generated_suffixes, truncated_prompts = generate_pipeline(model, tokenizer, prompts, add_special_tokens=True)
setup_seed()
save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name))

for index in range(len(infer_data)):
infer_data[index]['infer'] = {"t": generated_suffixes[index]}
with open(save_path, 'a', encoding='utf-8') as f:
for line in infer_data:
content = json.dumps(line, ensure_ascii=False)
f.write(content+'\n')

accelerator.wait_for_everyone()

print("")
if accelerator.is_main_process:
print("Eval on {}".format(file_name))
torch.cuda.empty_cache()
accelerator.wait_for_everyone()
84 changes: 84 additions & 0 deletions PRO/eval_summarize/infer_and_eval_main_reward.py
@@ -0,0 +1,84 @@
#import some packages and reward funcs
import os
import argparse
import json
import tqdm
import torch
import torch.nn.functional as F
import metrics2
from transformers import (
AutoConfig,
AutoTokenizer,
LlamaTokenizer,
AutoModelForCausalLM
)
from peft import PeftConfig, PeftModel
from infer_func_now import setup_seed
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs
from datetime import timedelta

def get_args():
parser = argparse.ArgumentParser(description="")
parser.add_argument('--index', type=str)
parser.add_argument('--stage', type=int)
parser.add_argument('--directory', default="best_checkpoint", type=str)
args = parser.parse_args()
return args

if __name__ == "__main__":
args = get_args()
setup_seed()
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))
accelerator = Accelerator(kwargs_handlers=[kwargs])
rank = int(os.environ['RANK'])
rank_sum = accelerator.num_processes
torch.cuda.empty_cache()
print(f"Rank {rank} is activated...")
if accelerator.is_main_process:
file_name = "test.json"
save_path = os.path.join("inference_res", "infer_main_{}_{}_{}".format(args.index, args.stage, file_name))
if os.path.exists(save_path):
os.remove(save_path)

save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name))
with open(save_path, 'r', encoding='utf-8') as f:
infer_data = [json.loads(l) for l in f.readlines()]
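# ranks appended their shards out of order, so restore the original example order by line_index before scoring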
if "line_index" in infer_data[0]:
infer_data = {l["line_index"]: l for l in infer_data}
with open(save_path, 'w', encoding='utf-8') as f:
infer_data = [infer_data[line_index] for line_index in range(len(infer_data))]
for line in infer_data:
content = json.dumps(line, ensure_ascii=False)
f.write(content+'\n')

accelerator.wait_for_everyone()

get_score, reward_batch_size = metrics2.create_reward_fn()

file_name = "test.json"
save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name))
with open(save_path, 'r', encoding='utf-8') as f:
infer_data = [json.loads(l) for line_index, l in enumerate(f.readlines()) if (line_index - rank) % rank_sum == 0]
raw_prefixes = [l['prefix'][0].strip() + " " for l in infer_data]
generated_suffixes = [l['infer']["t"].strip() for l in infer_data]

setup_seed()
rewards = []
batch_size = reward_batch_size
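# score generations in reward-model-sized batches; sigmoid squashes raw reward-model outputs into (0, 1)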
for index in tqdm.tqdm(range(0,len(raw_prefixes), batch_size), desc=f"Rank {rank} rewarding..."):
if len(raw_prefixes) - index < batch_size:
batch_size = len(raw_prefixes) - index
rewards.extend(torch.sigmoid(get_score(raw_prefixes[index:index+batch_size], generated_suffixes[index:index+batch_size])).cpu().detach().numpy().tolist())
assert len(rewards) == len(generated_suffixes) and len(rewards) == len(infer_data), (len(rewards), len(generated_suffixes), len(infer_data))

for index in range(len(infer_data)):
infer_data[index]["infer"]["score"] = rewards[index]
infer_data[index]["infer"]["bleu"] = metrics2.get_bleu(infer_data[index]['infer']['t'], infer_data[index]['suffix'][0])

save_path = os.path.join("inference_res", "infer_main_{}_{}_{}".format(args.index, args.stage, file_name))
with open(save_path, 'a', encoding='utf-8') as f:
for line in infer_data:
content = json.dumps(line, ensure_ascii=False)
f.write(content+'\n')
print(f"Rank {rank} completed!")