After testing the PPO algorithm across 56 Atari environments, I noticed a discrepancy in some of them. In particular, in nine environments the mean rewards attained differed from those attained by the PPO implementations from Stable Baselines3 and CleanRL. The table below lists the nine environments; five trials were conducted for each (implementation, environment) combination, and an environment-wise one-way ANOVA was then conducted to determine the effect of implementation source on mean reward. With respect to Baselines (not the 108K variant), the implementation means are significantly different.
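For reference, a minimal sketch of the per-environment test, assuming each implementation's scores are collected as five final mean rewards (the reward values and variable names below are placeholders, not the reported results):

```python
# Hedged sketch: one-way ANOVA on final mean rewards for a single environment.
# The numbers below are placeholders, not the actual results from the table.
from scipy.stats import f_oneway

rewards = {
    "Baselines":         [410.0, 395.2, 430.1, 402.7, 388.9],
    "Stable Baselines3": [512.3, 498.7, 530.0, 505.1, 521.4],
    "CleanRL":           [508.9, 515.2, 495.6, 524.8, 511.0],
}

# H0: all implementations reach the same mean reward in this environment.
f_stat, p_value = f_oneway(*rewards.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Implementation source has a significant effect on mean reward.")
```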
In the figure below, the training curves are aggregated over five trials, with the shaded regions indicating the minimum, maximum, and mean. The y-axis is the mean reward and the x-axis is the number of frames (40 million frames in total). The curves for Baselines, Stable Baselines3, and CleanRL are in purple, orange, and red respectively (the blue and green curves can be ignored). Baselines' curves are clearly different from the CleanRL and Stable Baselines3 curves, consistent with the table above.
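For completeness, the aggregation used in the figure can be reproduced roughly as follows, assuming the five curves are already aligned on a common frame axis (the data below is synthetic and only illustrates the plotting pattern):

```python
# Hedged sketch: mean line plus min/max shaded band over five aligned trials.
import numpy as np
import matplotlib.pyplot as plt

frames = np.linspace(0, 40_000_000, 200)          # common x-axis (frames)
curves = np.random.randn(5, 200).cumsum(axis=1)   # placeholder for 5 trials

plt.plot(frames, curves.mean(axis=0), color="purple", label="Baselines")
plt.fill_between(frames, curves.min(axis=0), curves.max(axis=0),
                 color="purple", alpha=0.2)
plt.xlabel("Frames")
plt.ylabel("Mean reward")
plt.legend()
plt.show()
```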
After manually debugging the code, I managed to locate the inconsistency. The environment was not conforming to the ALE specification of 108K frames per episode for the v4 variant, which is the default variant used by this repository and most DRL libraries (e.g., CleanRL and Stable Baselines3). After setting max_episode_steps in the make_atari function to 27K steps (108K frames), the implementations became consistent in three of the nine environments, as seen in the table above and the figure below.
I will create a pull request that sets the default number of frames per episode to 108K (27K steps), with minimal changes to the original codebase so that other components are not affected. However, I believe there may still be other inconsistencies, since six environments still differ significantly between the implementations. Any suggestions on the possible causes would be much appreciated. In case the pull request is not accepted, I have also included the fix below for those wanting to train Atari environments :)
```python
# one line change in baselines/baselines/common/atari_wrappers.py
def make_atari(env_id, max_episode_steps=27000):
```
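With that patch applied, a quick sanity check of the new episode cap might look like the following (the environment name is chosen arbitrarily, and this assumes the old 4-tuple gym step API that Baselines targets):

```python
# Hedged sketch: verify that the patched make_atari truncates episodes at
# 27,000 agent steps (108,000 raw frames with the default frame skip of 4).
from baselines.common.atari_wrappers import make_atari

env = make_atari("BreakoutNoFrameskip-v4")  # picks up the new 27,000-step default
env.reset()
steps = 0
done = False
while not done:
    _, _, done, _ = env.step(env.action_space.sample())
    steps += 1

assert steps <= 27_000, "episode ran past the 108K-frame ALE limit"
print(f"episode ended after {steps} agent steps (~{steps * 4} frames)")
```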
Run Command To Replicate: