
The actor constantly generates ['</s>'] or ['<|endoftext|></s>'] after 200 steps in RLHF with hybrid engine disabled #887

Open
mousewu opened this issue Apr 9, 2024 · 1 comment


mousewu commented Apr 9, 2024

settings:
actor & critic: OPT 1.3b
reward model: OPT 350m
GPU: 4 * V100 32G

running script:

ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
CRITIC_ZERO_STAGE=3
fi

if [ "$ACTOR_MODEL_PATH" == "" ]; then
ACTOR_MODEL_PATH=AdamG012/chat-opt-1.3b-sft-deepspeed
fi
if [ "$CRITIC_MODEL_PATH" == "" ]; then
CRITIC_MODEL_PATH=AdamG012/chat-opt-350m-reward-deepspeed
fi

echo "Step3: ACTOR_MODEL_PATH=$ACTOR_MODEL_PATH CRITIC_MODEL_PATH=$CRITIC_MODEL_PATH ACTOR_ZERO_STAGE=$ACTOR_ZERO_STAGE CRITIC_ZERO_STAGE=$CRITIC_ZERO_STAGE OUTPUT=$OUTPUT"

mkdir -p $OUTPUT

Num_Padding_at_Beginning=1 # this is model-related
Actor_Lr=9.65e-6
Critic_Lr=5e-6

deepspeed --master_port 12346 main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_generation_batch_size 1 \
   --per_device_training_batch_size 1 \
   --generation_batches 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 2 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --actor_dropout 0.0 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --enable_ema \
   --output_dir $OUTPUT \
   --enable_tensorboard \
   --tensorboard_path $OUTPUT \
   &> $OUTPUT/training.log

The log is below:
--- prompt --> step=272, rank=1, ['\n\nHuman: How can I train for running a marathon?\n\nAssistant: Gosh! I guess I could give you lots of very detailed advice, but I’m not sure that’s the best idea. That’s a pretty rigorous training program! If you want to check in with me every few weeks, I could share some of the ideas that might be helpful in your training. Do you have a pace in mind that you’re trying to get to?\n\nHuman: No, I just want to be prepared and get in better shape. The marathon is about 4 months from now.\n\nAssistant: Maybe focus on just running more regularly for now? If you just get in the habit of running, you’ll start feeling stronger and faster, and once you get used to it, the distance of a marathon will feel relatively easy.\n\nHuman: That makes sense. I will improve my conditioning if I just make it a habit to run every day.\n\nAssistant:']
--- prompt --> step=272, rank=0, ['\n\nHuman: WHat can I use witchhazel for?\n\nAssistant: It’s a multipurpose natural remedy that’s also a common household product. It’s used to soothe sore muscles and joints, as well as a facial wash, mouthwash, and hair rinse. Some people use it topically for inflammation and itching. And it’s also used in many natural cleaning products and body care products.\n\nHuman: How do you use it for sore muscles?\n\nAssistant:']
--- prompt --> step=272, rank=3, ['\n\nHuman: How can I stop my vomiting bout after food poisoning?\n\nAssistant: I’m sorry you’ve been feeling sick. Is there anything you think you can do to keep vomiting? Would eating small frequent meals work?\n\nHuman: Oh, that would probably work. What should I be eating?\n\nAssistant: Any food is a good choice, to keep you from getting too hungry or dehydrated. You could try eating mostly salty foods like broth, juice, soda, and bread.\n\nHuman: Are you sure soda is a good idea?\n\nAssistant:']
--- prompt --> step=272, rank=2, ['\n\nHuman: Please tell me how to make brownies.\n\nAssistant:']
--- ans --> step=272, rank=1, [' I<|endoftext|>']
--- ans --> step=272, rank=3, ['<|endoftext|>']
--- ans --> step=272, rank=0, ['<|endoftext|>']
--- ans --> step=272, rank=2, ['<|endoftext|>']
Epoch: 0 | Step: 272 | PPO Epoch: 1 | Actor Loss: -2.625 | Critic Loss: 3.69140625 | Unsupervised Loss: 0.0
End-to-End => Latency: 3.25s, TFLOPs: 2.03, Samples/sec: 1.23, Time/seq 0.81s, Batch Size: 4, Total Seq. Length: 512
Generation => Latency: 1.97s, Per-token Latency 7.71 ms, TFLOPs: 0.69, BW: 341.33 GB/sec, Answer Seq. Length: 256
Training => Latency: 1.27s, TFLOPs: 4.10
Actor Model Parameters => 1.316 B, Critic Model Parameters => 0.331 B
Average reward score: -11.5546875 | EMA reward score: -11.419149377686367
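
As a quick diagnostic of how far the collapse has progressed, you can count how many sampled answers are a bare EOS token. The sketch below (my own script, not part of DeepSpeed-Chat) reuses the "--- ans -->" lines from step 272 as a built-in sample; point it at the real ./output/training.log to check a full run:

```shell
#!/bin/sh
# Sketch: count sampled answers that collapsed to a bare <|endoftext|>.
# Assumes the "--- ans -->" log format shown above; the sample log here is
# built from the step-272 lines for demonstration.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
--- ans --> step=272, rank=1, [' I<|endoftext|>']
--- ans --> step=272, rank=3, ['<|endoftext|>']
--- ans --> step=272, rank=0, ['<|endoftext|>']
--- ans --> step=272, rank=2, ['<|endoftext|>']
EOF
total=$(grep -c -- '--- ans -->' "$LOG")              # all generated answers
empty=$(grep -c "\['<|endoftext|>'\]" "$LOG")         # answers that are only EOS
echo "$empty of $total answers were bare <|endoftext|>"
rm -f "$LOG"
```

On the step-272 sample this reports 3 of 4 answers as bare EOS (rank 1 still emits one token first), which matches the reward score staying pinned around -11.5.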

ouyanmei commented

Hello, have you managed to solve this?
