fix error commit & update pro
huybery committed Aug 16, 2023
1 parent 1cdc7be commit 64f4da7
Showing 34 changed files with 1,154 additions and 161 deletions.
57 changes: 44 additions & 13 deletions PRO/README.md
@@ -4,47 +4,78 @@ Authors: Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, H
arXiv: [Abstract](https://arxiv.org/abs/2306.17492) / [PDF](https://arxiv.org/pdf/2306.17492.pdf)

## Abstract
Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment by combining a reward model, typically based on Bradley-Terry paired comparison, with an RL algorithm such as Proximal Policy Optimization (PPO) to optimize LLM responses. However, RLHF exhibits complexity, instability, and sensitivity to hyperparameters. In this paper, we propose Preference Ranking Optimization (PRO) as an alternative to PPO for directly aligning LLMs with the Bradley-Terry comparison. PRO extends the pairwise Bradley-Terry comparison to accommodate preference rankings of any length. By iteratively contrasting the likelihood of generating responses, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms existing alignment algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations. Furthermore, we demonstrate that longer, more diverse, and higher-quality preference ranking sequences can consistently enhance the performance of human alignment.
Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. However, it encompasses two main drawbacks: (1) RLHF exhibits complexity, instability, and sensitivity to hyperparameters in contrast to SFT. (2) Despite massive trial-and-error, multiple sampling is reduced to pair-wise contrast, thus lacking contrasts from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to directly fine-tune LLMs for human alignment. PRO extends the pair-wise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by the LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms baseline algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations.

## The pipeline of PRO
<div align="center"><img src="./resources/pipeline.jpg" style="zoom:100%"></div>
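For intuition only, the ranking step of this pipeline can be sketched as an iterative top-versus-rest contrast over the likelihoods the LLM assigns to the candidate responses. The snippet below is a minimal sketch, not the repository's training code; the tensor name, the use of pre-computed (e.g. length-normalized) sequence log-likelihoods, and the omission of any auxiliary SFT term are assumptions.

```
import torch
import torch.nn.functional as F

def pro_ranking_loss(seq_logprobs: torch.Tensor) -> torch.Tensor:
    """Illustrative PRO-style ranking loss.

    seq_logprobs: shape [n], log-likelihoods the LLM assigns to n candidate
    responses, ordered from most preferred (index 0) to least preferred.
    """
    loss = seq_logprobs.new_zeros(())
    # Iteratively treat the k-th best response as the positive and contrast it
    # against every response ranked below it, aligning the whole ranking.
    for k in range(seq_logprobs.size(0) - 1):
        loss = loss - F.log_softmax(seq_logprobs[k:], dim=0)[0]
    return loss

# Toy usage with three candidates, best first.
print(pro_ranking_loss(torch.tensor([-1.0, -2.0, -3.0])))
```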

## Results
### Automatic Evaluation
<div align="center"><img src="./resources/automatic.jpg" style="zoom:100%"></div>
### Automatic Evaluation on *HH-RLHF*
<div align="center"><img src="./resources/automatic_hh.jpg" style="zoom:100%"></div>

### GPT-4 Evaluation
<div align="center"><img src="./resources/gpt4.jpg" style="zoom:33%"></div>

### Human Evaluation
<div align="center"><img src="./resources/human.jpg" style="zoom:33%"></div>

### Automatic Evaluation on *Summarize From Feedback*
<div align="center"><img src="./resources/automatic_summarize.jpg" style="zoom:50%"></div>

## Running!
### Data Preparation
1. Download [data.zip](https://ylab-mobile-prod.oss-cn-beijing.aliyuncs.com/yueli.ybw/pro_data.zip) and unzip it.
2. Place the unzipped ```data/``` folder in the root directory of the project.
3. You can also get the raw data from [this repo](https://github.com/anthropics/hh-rlhf), and run the following command to preprocess it to get the same data as ```train_len2/``` in ```data.zip```:
We provide the preprocessed data for training and testing, which can be obtained with the following steps:
1. Download [data.zip](https://ylab-mobile-prod.oss-cn-beijing.aliyuncs.com/yueli.ybw/data.zip) and unzip it.
2. Place the unzipped ```data``` folder in the root directory of the project.

We also provide scripts for preprocessing the raw data. Please follow the steps below to prepare the data:
1. Create a directory named ```data``` in the root directory of this project.
2. Create a directory named ```data/raw_data``` in the ```data``` directory.
3. Download the raw data from [*HH-RLHF*](https://github.com/anthropics/hh-rlhf) or [*Summarize From Feedback*](https://github.com/openai/summarize-from-feedback), name it ```hhrlhf``` or ```summarize_from_feedback``` respectively, and put it in the ```data/raw_data``` directory.
4. Run the following commands to preprocess the data:

```
cd train/preprocess_data
# For HH-RLHF
cd train/hh_preprocess_data
python step_1_process.py
python step_2_get_train_data.py
python step_3_get_test_data.py
# For Summarize From Feedback
cd ../summarize_preprocess_data
python step_1_process.py
python step_2_get_train_data.py
python step_3_get_test_data.py
```
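After preprocessing, the evaluation scripts in this commit read each test split as JSON Lines and access ```prefix[0]``` and ```suffix[0]``` on every record. A quick way to sanity-check the output is sketched below; the path mirrors the HH-RLHF evaluation code and is an assumption, so adjust it to the split you built.

```
import json

# Assumed location, mirroring the HH-RLHF evaluation code in this commit.
with open("data/hh_test/helpful_online.json", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(list(record.keys()))   # expect at least 'prefix' and 'suffix'
print(record["prefix"][0])   # prompt / dialogue context used at inference time
print(record["suffix"][0])   # reference response used for the BLEU metric
```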

### Train
We provide the training script for training the model. For example, you can run the following command to train the model:
We provide scripts for training the model. For example, you can run the following commands:
```
cd train
./train.sh [id_of_exp] train_len2 2
# Train LLMs with HH-RLHF
./train_hh.sh [id_of_exp] hh_train_len2 2
# Train LLMs with Summarize From Feedback
./train_summarize.sh [id_of_exp] summarize_train_len2 2
# Length 3
./train3_summarize.sh [id_of_exp] summarize_train_len3_alpaca 3
```
You can modify ```train.sh``` to train the model with a different dataset.

The scripts can be easily modified to train LLMs with different datasets.
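The three positional arguments appear to be an experiment id (```id_of_exp```), the name of a preprocessed training set, and the ranking length. Judging from the evaluation code added in this commit, a run's checkpoints are then expected under a path of the form sketched below (values are illustrative):

```
import os

id_of_exp, ranking_len = "1", 2   # illustrative; match the values used for training
checkpoint_dir = os.path.join(
    "checkpoints", f"index_{id_of_exp}", f"stage_{ranking_len}", "best_checkpoint"
)
print(checkpoint_dir)   # the eval scripts load the model and tokenizer from here
```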

### Test
You can run the following command to test the model:
The following commands can be used to test the model:
```
cd eval
# Test LLMs with HH-RLHF
cd eval_hh
./run_infer_main_dist.sh
# Test LLMs with Summarize From Feedback
cd ../eval_summarize
./run_infer_main_dist.sh
```
> **Note:** Before running this script, you should modify ```infer_main_dist.sh``` to specify ```id_of_exp``` and the corresponding ranking length used in training.
> **Note:** Before running, ```id_of_exp``` and the corresponding ranking length used during training have to be specified in ```run_infer_main_dist.sh```.

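The reward stage writes one JSON line per test example, with an ```infer``` field holding the generated text ```t```, its reward-model ```score```, and a ```bleu``` value (see ```infer_and_eval_main_reward.py``` below). If you want to aggregate these outputs yourself, a rough sketch follows; the path assumes the *Summarize From Feedback* pipeline with ```id_of_exp=1``` and ranking length 2.

```
import json

# Assumed output of the reward stage for index 1, stage 2; adjust to your run.
path = "eval_summarize/inference_res/infer_main_1_2_test.json"
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

avg_reward = sum(r["infer"]["score"] for r in records) / len(records)
avg_bleu = sum(r["infer"]["bleu"] for r in records) / len(records)
print(f"avg reward: {avg_reward:.4f}  avg BLEU: {avg_bleu:.4f}")
```
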
## Citation
If you find this work helpful, please cite our paper as:
File renamed without changes.
@@ -76,7 +76,7 @@ def get_args():
"helpful_online.json",
"helpful_rejection.json"
]:
file_path = os.path.join("..", "data", "test", file_name)
file_path = os.path.join("..", "data", "hh_test", file_name)
with open(file_path, "r", encoding='utf-8') as f:
infer_data = {line_index: json.loads(l) for line_index, l in enumerate(f.readlines()) if (line_index-rank) % rank_sum == 0}

File renamed without changes.
File renamed without changes.
@@ -62,7 +62,6 @@ def pipeline(prompts):
text = text_res[index]
assert truncated_prompts[index].rstrip() in text
text = text.replace(truncated_prompts[index].rstrip(), "").strip()
# text = text[prompts_size[index]:].strip()
for stop in ["Human:", "human:", "Assistant:", "assistant:"]:
stop_ix = text.find(stop)
if stop_ix >= 0:
40 changes: 0 additions & 40 deletions PRO/eval/metrics2.py → PRO/eval_hh/metrics2.py
@@ -21,47 +21,7 @@ def get_bleu(hyp, ref):
ref = ref.strip()
return nltk.translate.bleu_score.sentence_bleu([ref], hyp)

# Thank trlx for their helpful code:
# https://github.com/CarperAI/trlx/blob/main/examples/hh/ppo_hh.py#L115
def create_reward_fn_1():
reward_tokenizer = AutoTokenizer.from_pretrained("gpt2")
reward_tokenizer.pad_token = reward_tokenizer.eos_token
reward_tokenizer.truncation_side = "left"
reward_model = TrainRewardModel("EleutherAI/gpt-j-6B", reward_tokenizer.eos_token_id)
checkpoint = os.path.join("..", "rm", "gptj-rm-static", "hf_ckpt.pt")

reward_model.load_state_dict(torch.load(checkpoint))
reward_device = "cuda:{}".format(rank)
reward_model = reward_model.half().to(reward_device)
reward_model.eval()

def get_score(prefixes, suffixes):
# prefixes = [[p1, p1, p1], [p2, p2, p2]]
# suffixes = [s1, s2]
texts = []
for p, s in zip(prefixes,suffixes):
p = "".join(p)
p = p.replace("<|prompter|>", "\n\nHuman: ").replace("<|assistant|>", "\n\nAssistant: ")
texts.append(p + s + reward_tokenizer.eos_token)

input = reward_tokenizer(
texts,
padding=True,
truncation=True,
max_length=reward_tokenizer.max_len_single_sentence,
return_tensors="pt",
).to(reward_device)

with torch.no_grad():
rewards = reward_model(input['input_ids']) # [batch]

return rewards.view(-1)
# return torch.sigmoid(rewards.view(-1))

return get_score, 16

def create_reward_fn_2():
# model_name = "OpenAssistant/oasst-rm-2.1-pythia-1.4b-epoch-2.5"
model_name = "OpenAssistant/oasst-rm-2-pythia-6.9b-epoch-1"
model_device = "cuda:{}".format(rank)
tokenizer = AutoTokenizer.from_pretrained(model_name)
File renamed without changes.
@@ -3,19 +3,14 @@ export OMP_NUM_THREADS=16

id=$1
ranking_len=$2
# 30 min
accelerate launch --config_file dp_config.yaml infer_and_eval_main_generate.py \
--index $id \
--stage $ranking_len > logs/generate_infer_main_${id}_${ranking_len}.log 2>&1

#10 min
accelerate launch --config_file dp_config.yaml infer_and_eval_main_reward.py \
--index $id \
--stage $ranking_len > logs/reward_infer_main_${id}_${ranking_len}.log 2>&1

#1 second
python -u infer_and_eval_main_score.py \
--index $id \
--stage $ranking_len > logs/score_infer_main_${id}_${ranking_len}.log 2>&1

# total 40 min
--stage $ranking_len > logs/score_infer_main_${id}_${ranking_len}.log 2>&1
16 changes: 16 additions & 0 deletions PRO/eval_summarize/dp_config.yaml
@@ -0,0 +1,16 @@
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
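# one process per GPU; adjust num_processes to the number of available GPUs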
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
95 changes: 95 additions & 0 deletions PRO/eval_summarize/infer_and_eval_main_generate.py
@@ -0,0 +1,95 @@
#import some packages and reward funcs
import os
import argparse
import json
import tqdm
import torch
import torch.nn.functional as F
import metrics2
from transformers import (
AutoConfig,
AutoTokenizer,
LlamaTokenizer,
AutoModelForCausalLM
)
from infer_func_now import setup_seed, generate_pipeline
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs
from datetime import timedelta

def get_args():
parser = argparse.ArgumentParser(description="")
parser.add_argument('--index', type=str)
parser.add_argument('--stage', type=int)
parser.add_argument('--directory', default="best_checkpoint", type=str)
args = parser.parse_args()
return args

if __name__ == "__main__":
args = get_args()
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))
accelerator = Accelerator(kwargs_handlers=[kwargs])
rank = int(os.environ['RANK'])
rank_sum = accelerator.num_processes
model_name_or_path = os.path.join("..", "checkpoints", f"index_{args.index}", f"stage_{args.stage}", f"{args.directory}")
model_device = "cuda:{}".format(rank)

model_config = AutoConfig.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, config=model_config, torch_dtype=torch.bfloat16).to(model_device)
if accelerator.is_main_process:
print(type(model))
print(model.config)
if model.config.architectures[0].lower() == "llamaforcausallm":
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_path)
tokenizer.unk_token = "<unk>"
tokenizer.bos_token = "<s>"
tokenizer.eos_token = "</s>"
else:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.sep_token = "<sep>"
model.resize_token_embeddings(len(tokenizer))

print(model.dtype)
torch.cuda.empty_cache()
model.eval()
print(f"Rank {rank} is activated...")
if accelerator.is_main_process:
file_name = "test.json"
save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name))
if os.path.exists(save_path):
os.remove(save_path)
accelerator.wait_for_everyone()

file_name = "test.json"
file_path = os.path.join("..", "data", "summarize_test", file_name)
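# shard the test set across processes: this rank keeps every rank_sum-th line of the file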
with open(file_path, "r", encoding='utf-8') as f:
infer_data = {line_index: json.loads(l) for line_index, l in enumerate(f.readlines()) if (line_index-rank) % rank_sum == 0}

for line_index in infer_data:
infer_data[line_index]["line_index"] = line_index
infer_data = [infer_data[line_index] for line_index in infer_data]

prompts = [l['prefix'][0] for l in infer_data]

setup_seed()
generated_suffixes, truncated_prompts = generate_pipeline(model, tokenizer, prompts, add_special_tokens=True)
setup_seed()
save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name))

for index in range(len(infer_data)):
infer_data[index]['infer'] = {"t": generated_suffixes[index]}
with open(save_path, 'a', encoding='utf-8') as f:
for line in infer_data:
content = json.dumps(line, ensure_ascii=False)
f.write(content+'\n')

accelerator.wait_for_everyone()

print("")
if accelerator.is_main_process:
print("Eval on {}".format(file_name))
torch.cuda.empty_cache()
accelerator.wait_for_everyone()
84 changes: 84 additions & 0 deletions PRO/eval_summarize/infer_and_eval_main_reward.py
@@ -0,0 +1,84 @@
#import some packages and reward funcs
import os
import argparse
import json
import tqdm
import torch
import torch.nn.functional as F
import metrics2
from transformers import (
AutoConfig,
AutoTokenizer,
LlamaTokenizer,
AutoModelForCausalLM
)
from peft import PeftConfig, PeftModel
from infer_func_now import setup_seed
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs
from datetime import timedelta

def get_args():
parser = argparse.ArgumentParser(description="")
parser.add_argument('--index', type=str)
parser.add_argument('--stage', type=int)
parser.add_argument('--directory', default="best_checkpoint", type=str)
args = parser.parse_args()
return args

if __name__ == "__main__":
args = get_args()
setup_seed()
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=5400))
accelerator = Accelerator(kwargs_handlers=[kwargs])
rank = int(os.environ['RANK'])
rank_sum = accelerator.num_processes
torch.cuda.empty_cache()
print(f"Rank {rank} is activated...")
if accelerator.is_main_process:
file_name = "test.json"
save_path = os.path.join("inference_res", "infer_main_{}_{}_{}".format(args.index, args.stage, file_name))
if os.path.exists(save_path):
os.remove(save_path)

save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name))
with open(save_path, 'r', encoding='utf-8') as f:
infer_data = [json.loads(l) for l in f.readlines()]
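# ranks appended their shards out of order, so restore the original example order by line_index before scoring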
if "line_index" in infer_data[0]:
infer_data = {l["line_index"]: l for l in infer_data}
with open(save_path, 'w', encoding='utf-8') as f:
infer_data = [infer_data[line_index] for line_index in range(len(infer_data))]
for line in infer_data:
content = json.dumps(line, ensure_ascii=False)
f.write(content+'\n')

accelerator.wait_for_everyone()

get_score, reward_batch_size = metrics2.create_reward_fn()

file_name = "test.json"
save_path = os.path.join("inference_res/cache", "infer_generate_main_{}_{}_{}".format(args.index, args.stage, file_name))
with open(save_path, 'r', encoding='utf-8') as f:
infer_data = [json.loads(l) for line_index, l in enumerate(f.readlines()) if (line_index - rank) % rank_sum == 0]
raw_prefixes = [l['prefix'][0].strip() + " " for l in infer_data]
generated_suffixes = [l['infer']["t"].strip() for l in infer_data]

setup_seed()
rewards = []
batch_size = reward_batch_size
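# score generations in reward-model-sized batches; sigmoid squashes raw reward-model outputs into (0, 1)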
for index in tqdm.tqdm(range(0,len(raw_prefixes), batch_size), desc=f"Rank {rank} rewarding..."):
if len(raw_prefixes) - index < batch_size:
batch_size = len(raw_prefixes) - index
rewards.extend(torch.sigmoid(get_score(raw_prefixes[index:index+batch_size], generated_suffixes[index:index+batch_size])).cpu().detach().numpy().tolist())
assert len(rewards) == len(generated_suffixes) and len(rewards) == len(infer_data), (len(rewards), len(generated_suffixes), len(infer_data))

for index in range(len(infer_data)):
infer_data[index]["infer"]["score"] = rewards[index]
infer_data[index]["infer"]["bleu"] = metrics2.get_bleu(infer_data[index]['infer']['t'], infer_data[index]['suffix'][0])

save_path = os.path.join("inference_res", "infer_main_{}_{}_{}".format(args.index, args.stage, file_name))
with open(save_path, 'a', encoding='utf-8') as f:
for line in infer_data:
content = json.dumps(line, ensure_ascii=False)
f.write(content+'\n')
print(f"Rank {rank} completed!")