# Proximal Policy Optimization (PPO)

> John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, & Oleg Klimov. (2017). Proximal Policy Optimization Algorithms.
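The core of PPO is the clipped surrogate objective from the paper above: the probability ratio between the new and old policies is clipped to `[1 - eps, 1 + eps]` so a single minibatch cannot push the policy too far from the one that collected the data. As a minimal NumPy sketch of that loss (illustrative only, not this repo's implementation; the default `clip_range=0.2` follows the paper):

```python
import numpy as np

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_range=0.2):
    """Clipped surrogate loss from Schulman et al. (2017).

    Sketch only: computes -E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)]
    where r_t = pi_new(a|s) / pi_old(a|s).
    """
    ratio = np.exp(log_prob_new - log_prob_old)   # r_t
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Maximizing the surrogate == minimizing its negation.
    return -np.mean(np.minimum(unclipped, clipped))
```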
### Run with default arguments
```sh
./unstable_baselines/ppo/train.sh --rank 0 --seed 1 "BreakoutNoFrameskip-v4"
```
### Run multiple environments with default arguments
```sh
./unstable_baselines/ppo/train.sh --rank 0 --seed 1 "BreakoutNoFrameskip-v4" "SeaquestNoFrameskip-v4" "PongNoFrameskip-v4"
```
### Atari-like environment (Image observation + discrete action)
```sh
python -m unstable_baselines.ppo.run --rank 0 --seed 1 --logdir='./log/{env_id}/ppo/{rank}' \
    --logging='training.log' --monitor_dir='monitor' --tb_logdir='' --model_dir='model' \
    --env_id="BreakoutNoFrameskip-v4" --num_envs=8 --num_epochs=10000 \
    --num_steps=125 --num_subepochs=8 --batch_size=256 --verbose=2 \
    --shared_net --record_video
```
* Enabling `shared_net` shares the CNN feature extractor between the policy and the value function (see the sketch below).
* Total timesteps (samples) ≈ `num_envs * num_steps * num_epochs` (~10M in this case).
* Number of times each sample is reused ≈ `num_subepochs` (~8 in this case).
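For intuition, here is a minimal sketch of what a shared policy/value network can look like. The layer sizes follow the common Nature-CNN layout for 84×84 frame stacks and are illustrative assumptions, not this repo's exact architecture (which may also use a different framework than the PyTorch shown here):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Policy and value heads on one shared CNN torso (shared_net=True).

    Illustrative only: sizes assume 4 stacked 84x84 grayscale frames.
    """
    def __init__(self, n_actions: int):
        super().__init__()
        self.torso = nn.Sequential(                    # shared feature extractor
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.policy_head = nn.Linear(512, n_actions)   # action logits
        self.value_head = nn.Linear(512, 1)            # state value

    def forward(self, obs: torch.Tensor):
        features = self.torso(obs)                     # computed once, used twice
        return self.policy_head(features), self.value_head(features)
```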
### Continuous control environment
```sh
python -m unstable_baselines.ppo.run --rank 0 --seed 1 --logdir='./log/{env_id}/ppo/{rank}' \
    --logging='training.log' --monitor_dir='monitor' --tb_logdir='' --model_dir='model' \
    --env_id="HalfCheetahBulletEnv-v0" --num_envs=1 --num_epochs=1000 \
    --num_steps=1000 --num_subepochs=10 --batch_size=100 --verbose=2 \
    --ent_coef=0.0 --record_video
```
* Total timesteps (samples) = `num_envs * num_steps * num_epochs` (~1M in this case).
* Number of times each sample is reused = `num_subepochs` (~10 in this case); see the update-loop sketch below.
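To make the sample-reuse arithmetic concrete, here is a hedged sketch of a typical PPO update loop. The functions `collect_rollout` and `update_minibatch` are hypothetical placeholders, not this repo's API; the numbers match the continuous-control command above:

```python
import numpy as np

# Illustrative settings from the command above: 1 * 1000 * 1000 epochs = ~1M samples.
num_envs, num_steps, num_subepochs, batch_size = 1, 1000, 10, 100

def run_epoch(collect_rollout, update_minibatch):
    """One PPO epoch: collect a rollout, then reuse it num_subepochs times."""
    rollout = collect_rollout(num_envs * num_steps)    # 1000 fresh samples
    indices = np.arange(num_envs * num_steps)
    for _ in range(num_subepochs):                     # each sample seen 10x
        np.random.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            update_minibatch(rollout, indices[start:start + batch_size])
```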
Benchmarked Atari environments:
* BeamRiderNoFrameskip-v4
* BreakoutNoFrameskip-v4
* PongNoFrameskip-v4
* SeaquestNoFrameskip-v4
* AsteroidsNoFrameskip-v4
* EnduroNoFrameskip-v4
* QbertNoFrameskip-v4
* MsPacmanNoFrameskip-v4
**Learning curve**

| env_id | Max rewards | Mean rewards | Std rewards | Train samples | Train seeds | Eval episodes | Eval seed |
|---|---|---|---|---|---|---|---|
| AsteroidsNoFrameskip-v4 | 1570 | 1072 | 281.73 | 10M | 1~8 | 20 | 0 |
| BeamRiderNoFrameskip-v4 | 2832 | 1513.4 | 647.36 | 10M | 1~8 | 20 | 0 |
| BreakoutNoFrameskip-v4 | 368 | 131.85 | 118.28 | 10M | 1~8 | 20 | 0 |
| EnduroNoFrameskip-v4 | 302 | 189.2 | 29.79 | 10M | 1~8 | 20 | 0 |
| MsPacmanNoFrameskip-v4 | 2650 | 2035.5 | 463.1 | 10M | 1~8 | 20 | 0 |
| PongNoFrameskip-v4 | 21 | 21 | 0 | 10M | 1~8 | 20 | 0 |
| QbertNoFrameskip-v4 | 16925 | 16441.25 | 259.23 | 10M | 1~8 | 20 | 0 |
| SeaquestNoFrameskip-v4 | 1760 | 1750 | 17.32 | 10M | 1~8 | 20 | 0 |

*M = million (1e6)*
**Hyperparameters**

| env_id | AsteroidsNoFrameskip-v4 | BeamRiderNoFrameskip-v4 | BreakoutNoFrameskip-v4 | EnduroNoFrameskip-v4 | MsPacmanNoFrameskip-v4 | PongNoFrameskip-v4 | QbertNoFrameskip-v4 | SeaquestNoFrameskip-v4 |
|---|---|---|---|---|---|---|---|---|
| num_envs | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| num_epochs | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| num_steps | 125 | 125 | 125 | 125 | 125 | 125 | 125 | 125 |
| num_subepochs | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| batch_size | 256 | 256 | 256 | 256 | 256 | 256 | 256 | 256 |
| ent_coef | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| vf_coef | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| shared_net | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Benchmarked PyBullet environments:
* HalfCheetahBulletEnv-v0
* AntBulletEnv-v0
* HopperBulletEnv-v0
* Walker2DBulletEnv-v0
* HumanoidBulletEnv-v0
**Learning curve**

| env_id | Max rewards | Mean rewards | Std rewards | Train samples | Train seeds | Eval episodes | Eval seed |
|---|---|---|---|---|---|---|---|
| AntBulletEnv-v0 | 2247.002 | 2157.180 | 107.803 | 2M | 1 | 20 | 0 |
| HalfCheetahBulletEnv-v0 | 2696.556 | 2477.882 | 759.322 | 2M | 1 | 20 | 0 |
| HopperBulletEnv-v0 | 2689.504 | 2542.172 | 373.381 | 2M | 1 | 20 | 0 |
| HumanoidBulletEnv-v0 | 2447.299 | 1883.564 | 923.937 | 8M | 1 | 20 | 0 |
| Walker2DBulletEnv-v0 | 2108.727 | 2005.461 | 286.699 | 4M | 1 | 20 | 0 |
**Hyperparameters**

| env_id | AntBulletEnv-v0 | HalfCheetahBulletEnv-v0 | HopperBulletEnv-v0 | HumanoidBulletEnv-v0 | Walker2DBulletEnv-v0 |
|---|---|---|---|---|---|
| num_envs | 1 | 1 | 1 | 16 | 4 |
| num_epochs | 2000 | 2000 | 2000 | 1000 | 2000 |
| num_steps | 1000 | 1000 | 1000 | 500 | 500 |
| num_subepochs | 10 | 10 | 10 | 20 | 20 |
| batch_size | 100 | 100 | 100 | 1000 | 1000 |
| lr | 3e-4 | 3e-4 | 3e-4 | 3e-4 | 3e-4 |
| ent_coef | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| vf_coef | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| shared_net | ❌ | ❌ | ❌ | ❌ | ❌ |
| MlpNet | [256, 256] | [256, 256] | [256, 256] | [256, 256] | [256, 256] |
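With `shared_net` disabled, the policy and value function use separate networks; `MlpNet [256, 256]` above denotes two hidden layers of 256 units each. A hedged sketch of that layout follows; everything beyond the hidden sizes (activation choice, diagonal-Gaussian policy head, framework) is an illustrative assumption, not this repo's exact implementation:

```python
import torch
import torch.nn as nn

class SeparateActorCritic(nn.Module):
    """Separate policy/value MLPs (shared_net=False), [256, 256] hidden layers."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        def mlp(out_dim):
            return nn.Sequential(
                nn.Linear(obs_dim, 256), nn.Tanh(),
                nn.Linear(256, 256), nn.Tanh(),
                nn.Linear(256, out_dim),
            )
        self.policy_mean = mlp(act_dim)   # its own trunk, nothing shared
        self.value = mlp(1)               # a second, independent trunk
        # State-independent log-std, a common diagonal-Gaussian choice.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor):
        return self.policy_mean(obs), self.log_std.exp(), self.value(obs)
```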
Supported observation and action spaces:

| | Box | Discrete | MultiDiscrete | MultiBinary |
|---|---|---|---|---|
| Observation | ✔️ | ✔️ | ✔️ | ✔️ |
| Action | ✔️ | ✔️ | ❌ | ❌ |
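For example, a quick compatibility check against this table with `gym` (the helper below is illustrative, not part of this repo):

```python
import gym

# Space types accepted per the table above.
OBS_SPACES = (gym.spaces.Box, gym.spaces.Discrete,
              gym.spaces.MultiDiscrete, gym.spaces.MultiBinary)
ACT_SPACES = (gym.spaces.Box, gym.spaces.Discrete)

def check_env_supported(env_id: str) -> bool:
    """Illustrative helper: verify an env's spaces match the table above."""
    env = gym.make(env_id)
    ok = (isinstance(env.observation_space, OBS_SPACES)
          and isinstance(env.action_space, ACT_SPACES))
    env.close()
    return ok

print(check_env_supported("BreakoutNoFrameskip-v4"))  # Discrete actions -> True
```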
Architecture: Atari-like environments use `shared_net=True` (one CNN shared by the policy and value heads); continuous control environments use `shared_net=False` (separate policy and value MLPs).