Hello Team,
I'm working with the "sft_data_alfworld.json" file. I notice that not all monte_carlo_step_reward values in this file are 1; some are 0.2 or even 0. How do you obtain the step rewards in expert trajectories? Intuitively, every score in an expert trajectory should be 1. Thank you!
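For reference, here is a quick way to see the distribution of these values. It assumes only that a monte_carlo_step_reward key appears somewhere in the JSON; it makes no assumption about the file's overall structure, which I have not verified:

```python
import json
from collections import Counter

def collect_rewards(node, key="monte_carlo_step_reward", out=None):
    """Recursively gather every value stored under `key`,
    wherever it sits in the (unspecified) JSON structure."""
    if out is None:
        out = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                out.append(v)
            else:
                collect_rewards(v, key, out)
    elif isinstance(node, list):
        for item in node:
            collect_rewards(item, key, out)
    return out

with open("sft_data_alfworld.json") as f:
    data = json.load(f)

# Tally how often each reward value occurs, e.g. Counter({1.0: ..., 0.2: ..., 0.0: ...})
print(Counter(collect_rewards(data)))
```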
Thank you for the question. The step rewards in the expert trajectories within the sft_data_alfworld.json file are computed with the same method used for the step rewards of the sampled agent trajectories. The reason some rewards are 0 or 0.2 is likely that the scoring model is not fully robust: even when given a prefix of the correct trajectory, its rollouts sometimes fail to complete the task. However, both expert and agent trajectories receive step rewards from the same scoring model, which keeps the comparison of action scores fair and consistent.
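For concreteness, here is a minimal sketch of this kind of Monte Carlo step-reward estimation. The interface below (env.reset, env.step returning (obs, done, success), policy.act) and the rollout count are placeholders for illustration, not the repo's actual API; a reward of 0.2 would be consistent with, say, 1 success out of 5 rollouts:

```python
def mc_step_reward(env, policy, prefix_actions, n_rollouts=5, max_steps=30):
    """Monte Carlo estimate of a step's reward: replay the trajectory
    prefix up to this step, then let the scoring policy roll out to the
    end several times; the reward is the fraction of rollouts that
    complete the task successfully."""
    successes = 0
    for _ in range(n_rollouts):
        obs = env.reset()
        done, success = False, False
        for action in prefix_actions:      # replay the (expert) prefix
            obs, done, success = env.step(action)
        steps = 0
        while not done and steps < max_steps:
            action = policy.act(obs)       # scoring model picks the next action
            obs, done, success = env.step(action)
            steps += 1
        successes += int(success)
    return successes / n_rollouts          # e.g. 1 success in 5 rollouts -> 0.2
```

Under this scheme an expert step can still score below 1 whenever the scoring policy's own rollouts from that prefix fail, which is exactly the non-robustness described above.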
Great work!
I have a question regarding the sft_data_alfworld.json file. Which model was used for sampling to obtain the step rewards?
Ground-truth step rewards constructed with models of different capabilities can vary. I'd like to know whether the ground-truth step rewards are re-estimated with the current model at each iteration, or how this process is handled.