
How is "monte_carlo_step_reward" obtained in expert trajectories? #2

Open
consultantQ opened this issue Jul 18, 2024 · 2 comments


@consultantQ

Hello Team,
I'm working with the "sft_data_alfworld.json" file. I notice that in this file, not all monte_carlo_step_reward are 1; some are 0.2 or even 0. How do you obtain the step rewards in expert trajectories? Intuitively, all the scores in expert trajectories should be 1. Thank you!

@WeiminXiong
Owner

Thank you for the question. The step rewards for expert trajectories in the sft_data_alfworld.json file are calculated with the same method used for the sampled agent trajectories. Some rewards come out as 0 or 0.2 most likely because the scoring model is not fully robust: even when it is given a prefix of the correct trajectory, it does not always complete the task successfully. Since both expert and agent trajectories are scored by the same model, the action scores remain fair and consistent when compared against each other.
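
For anyone landing here later: the field name suggests the step rewards are Monte Carlo estimates, i.e., the trajectory is completed from each step several times and the final task reward is averaged. Below is a minimal sketch of that procedure under this assumption; the function names, the `rollout_fn` interface, and the toy data are illustrative, not the repository's actual API. Note that with five rollouts per step, averages fall on multiples of 0.2, which would be consistent with the 0 and 0.2 values discussed above.

```python
import random
from typing import Callable, List


def monte_carlo_step_reward(
    prefix: List[str],
    rollout_fn: Callable[[List[str]], float],
    num_samples: int = 5,
) -> float:
    """Average the final task reward over several sampled completions
    of a trajectory prefix (1.0 = task success, 0.0 = failure)."""
    returns = [rollout_fn(prefix) for _ in range(num_samples)]
    return sum(returns) / num_samples


def estimate_step_rewards(
    trajectory: List[str],
    rollout_fn: Callable[[List[str]], float],
    num_samples: int = 5,
) -> List[float]:
    """Score every step of a trajectory (expert or agent-sampled) with
    the same Monte Carlo procedure, so scores are comparable across sources."""
    return [
        monte_carlo_step_reward(trajectory[: i + 1], rollout_fn, num_samples)
        for i in range(len(trajectory))
    ]


if __name__ == "__main__":
    # Toy stand-in for the scoring model: completions succeed more often
    # the longer the expert prefix they are conditioned on.
    def toy_rollout(prefix: List[str]) -> float:
        return 1.0 if random.random() < len(prefix) / 4 else 0.0

    expert = ["go to desk 1", "take key", "go to drawer 2", "put key in drawer"]
    print(estimate_step_rewards(expert, toy_rollout, num_samples=5))
```

Under this reading, an expert step can still score below 1 whenever the scoring model's rollouts from that prefix fail, which matches the explanation above.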

@WangHanLinHenry

Great work!
I have a question regarding the sft_data_alfworld.json file. Which model was used for sampling to obtain the step rewards?

Ground-truth step rewards constructed with models of different capabilities can vary. I'd like to know whether the step rewards for the ground-truth data are re-estimated with the current model at each iteration, or how this process is handled.

Thank you!
