
The actor constantly generates ['</s>'] or ['<|endoftext|></s>'] after 200 steps in RLHF with hybrid engine disabled #887

Open
mousewu opened this issue Apr 9, 2024 · 1 comment


mousewu commented Apr 9, 2024

settings:
actor & critic: OPT 1.3b
reward model: OPT 350m
GPU: 4 * V100 32G

running script:

ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
CRITIC_ZERO_STAGE=3
fi

if [ "$ACTOR_MODEL_PATH" == "" ]; then
ACTOR_MODEL_PATH=AdamG012/chat-opt-1.3b-sft-deepspeed
fi
if [ "$CRITIC_MODEL_PATH" == "" ]; then
CRITIC_MODEL_PATH=AdamG012/chat-opt-350m-reward-deepspeed
fi

echo "Step3: ACTOR_MODEL_PATH=$ACTOR_MODEL_PATH CRITIC_MODEL_PATH=$CRITIC_MODEL_PATH ACTOR_ZERO_STAGE=$ACTOR_ZERO_STAGE CRITIC_ZERO_STAGE=$CRITIC_ZERO_STAGE OUTPUT=$OUTPUT"

mkdir -p $OUTPUT

Num_Padding_at_Beginning=1 # this is model-related
Actor_Lr=9.65e-6
Critic_Lr=5e-6

deepspeed --master_port 12346 main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_generation_batch_size 1 \
   --per_device_training_batch_size 1 \
   --generation_batches 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 2 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --actor_dropout 0.0 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --enable_ema \
   --output_dir $OUTPUT \
   --enable_tensorboard \
   --tensorboard_path $OUTPUT \
   &> $OUTPUT/training.log

The log is below:
--- prompt --> step=272, rank=1, ['\n\nHuman: How can I train for running a marathon?\n\nAssistant: Gosh! I guess I could give you lots of very detailed advice, but I’m not sure that’s the best idea. That’s a pretty rigorous training program! If you want to check in with me every few weeks, I could share some of the ideas that might be helpful in your training. Do you have a pace in mind that you’re trying to get to?\n\nHuman: No, I just want to be prepared and get in better shape. The marathon is about 4 months from now.\n\nAssistant: Maybe focus on just running more regularly for now? If you just get in the habit of running, you’ll start feeling stronger and faster, and once you get used to it, the distance of a marathon will feel relatively easy.\n\nHuman: That makes sense. I will improve my conditioning if I just make it a habit to run every day.\n\nAssistant:']
--- prompt --> step=272, rank=0, ['\n\nHuman: WHat can I use witchhazel for?\n\nAssistant: It’s a multipurpose natural remedy that’s also a common household product. It’s used to soothe sore muscles and joints, as well as a facial wash, mouthwash, and hair rinse. Some people use it topically for inflammation and itching. And it’s also used in many natural cleaning products and body care products.\n\nHuman: How do you use it for sore muscles?\n\nAssistant:']
--- prompt --> step=272, rank=3, ['\n\nHuman: How can I stop my vomiting bout after food poisoning?\n\nAssistant: I’m sorry you’ve been feeling sick. Is there anything you think you can do to keep vomiting? Would eating small frequent meals work?\n\nHuman: Oh, that would probably work. What should I be eating?\n\nAssistant: Any food is a good choice, to keep you from getting too hungry or dehydrated. You could try eating mostly salty foods like broth, juice, soda, and bread.\n\nHuman: Are you sure soda is a good idea?\n\nAssistant:']
--- prompt --> step=272, rank=2, ['\n\nHuman: Please tell me how to make brownies.\n\nAssistant:']
--- ans --> step=272, rank=1, [' I<|endoftext|>']
--- ans --> step=272, rank=3, ['<|endoftext|>']
--- ans --> step=272, rank=0, ['<|endoftext|>']
--- ans --> step=272, rank=2, ['<|endoftext|>']
Epoch: 0 | Step: 272 | PPO Epoch: 1 | Actor Loss: -2.625 | Critic Loss: 3.69140625 | Unsupervised Loss: 0.0
End-to-End => Latency: 3.25s, TFLOPs: 2.03, Samples/sec: 1.23, Time/seq 0.81s, Batch Size: 4, Total Seq. Length: 512
Generation => Latency: 1.97s, Per-token Latency 7.71 ms, TFLOPs: 0.69, BW: 341.33 GB/sec, Answer Seq. Length: 256
Training => Latency: 1.27s, TFLOPs: 4.10
Actor Model Parameters => 1.316 B, Critic Model Parameters => 0.331 B
Average reward score: -11.5546875 | EMA reward score: -11.419149377686367
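
As a quick diagnostic of how far the collapse has progressed, you can count how many sampled answers are a bare EOS token. The sketch below (my own script, not part of DeepSpeed-Chat) reuses the "--- ans -->" lines from step 272 as a built-in sample; point it at the real ./output/training.log to check a full run:

```shell
#!/bin/sh
# Sketch: count sampled answers that collapsed to a bare <|endoftext|>.
# Assumes the "--- ans -->" log format shown above; the sample log here is
# built from the step-272 lines for demonstration.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
--- ans --> step=272, rank=1, [' I<|endoftext|>']
--- ans --> step=272, rank=3, ['<|endoftext|>']
--- ans --> step=272, rank=0, ['<|endoftext|>']
--- ans --> step=272, rank=2, ['<|endoftext|>']
EOF
total=$(grep -c -- '--- ans -->' "$LOG")              # all generated answers
empty=$(grep -c "\['<|endoftext|>'\]" "$LOG")         # answers that are only EOS
echo "$empty of $total answers were bare <|endoftext|>"
rm -f "$LOG"
```

On the step-272 sample this reports 3 of 4 answers as bare EOS (rank 1 still emits one token first), which matches the reward score staying pinned around -11.5.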

ouyanmei commented

Hello, have you managed to solve this?
