Hello Team,
I'm working with the "sft_data_alfworld.json" file. I notice that not all monte_carlo_step_reward values in this file are 1; some are 0.2 or even 0. How do you obtain the step rewards in expert trajectories? Intuitively, every score in an expert trajectory should be 1. Thank you!
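For reference, here is a quick way to see the distribution of these values. It assumes only that a monte_carlo_step_reward key appears somewhere in the JSON; it makes no assumption about the file's overall structure, which I have not verified:

```python
import json
from collections import Counter

def collect_rewards(node, key="monte_carlo_step_reward", out=None):
    """Recursively gather every value stored under `key`,
    wherever it sits in the (unspecified) JSON structure."""
    if out is None:
        out = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                out.append(v)
            else:
                collect_rewards(v, key, out)
    elif isinstance(node, list):
        for item in node:
            collect_rewards(item, key, out)
    return out

with open("sft_data_alfworld.json") as f:
    data = json.load(f)

# Tally how often each reward value occurs, e.g. Counter({1.0: ..., 0.2: ..., 0.0: ...})
print(Counter(collect_rewards(data)))
```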
Thank you for the question. The step rewards in the expert trajectories within the sft_data_alfworld.json file are computed with the same method used for the step rewards of the sampled agent trajectories. The reason some rewards are 0 or 0.2 is likely that the scoring model is not fully robust: even when given a prefix of the correct trajectory, its rollouts sometimes fail to complete the task. However, both expert and agent trajectories receive step rewards from the same scoring model, which keeps the comparison of action scores fair and consistent.
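For concreteness, here is a minimal sketch of this kind of Monte Carlo step-reward estimation. The interface below (env.reset, env.step returning (obs, done, success), policy.act) and the rollout count are placeholders for illustration, not the repo's actual API; a reward of 0.2 would be consistent with, say, 1 success out of 5 rollouts:

```python
def mc_step_reward(env, policy, prefix_actions, n_rollouts=5, max_steps=30):
    """Monte Carlo estimate of a step's reward: replay the trajectory
    prefix up to this step, then let the scoring policy roll out to the
    end several times; the reward is the fraction of rollouts that
    complete the task successfully."""
    successes = 0
    for _ in range(n_rollouts):
        obs = env.reset()
        done, success = False, False
        for action in prefix_actions:      # replay the (expert) prefix
            obs, done, success = env.step(action)
        steps = 0
        while not done and steps < max_steps:
            action = policy.act(obs)       # scoring model picks the next action
            obs, done, success = env.step(action)
            steps += 1
        successes += int(success)
    return successes / n_rollouts          # e.g. 1 success in 5 rollouts -> 0.2
```

Under this scheme an expert step can still score below 1 whenever the scoring policy's own rollouts from that prefix fail, which is exactly the non-robustness described above.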
Great work!
I have a question regarding the sft_data_alfworld.json file. Which model was used for sampling to obtain the step rewards?
Ground-truth step rewards constructed with models of different capabilities can vary. I'd like to know whether the ground-truth step rewards are re-estimated with the current model at each iteration, or how this process is handled.