Proximal Policy Optimization (PPO)

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, & Oleg Klimov. (2017). Proximal Policy Optimization Algorithms.
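At its core, PPO maximizes a clipped surrogate objective that keeps the updated policy close to the policy that collected the rollout, which is what lets each batch of samples be reused for several sub-epochs. A minimal NumPy sketch of that loss (illustrative only; the function and argument names are not from this repo):

```python
import numpy as np

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_range=0.2):
    """Clipped surrogate policy loss from Schulman et al. (2017) -- illustrative sketch."""
    ratio = np.exp(log_prob_new - log_prob_old)                    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # PPO maximizes the element-wise minimum of the two terms; negate it to get a loss to minimize
    return -np.mean(np.minimum(unclipped, clipped))
```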

How to use

Run with default arguments

./unstable_baselines/ppo/train.sh --rank 0 --seed 1 "BreakoutNoFrameskip-v4"

Run multiple environments with default arguments

./unstable_baselines/ppo/train.sh --rank 0 --seed 1 "BreakoutNoFrameskip-v4" "SeaquestNoFrameskip-v4" "PongNoFrameskip-v4"

Atari-like environment (Image observation + discrete action)

python -m unstable_baselines.ppo.run --rank 0 --seed 1 --logdir='./log/{env_id}/ppo/{rank}' \
               --logging='training.log' --monitor_dir='monitor' --tb_logdir='' --model_dir='model' \
               --env_id="BreakoutNoFrameskip-v4" --num_envs=8 --num_epochs=10000 \
               --num_steps=125 --num_subepochs=8 --batch_size=256 --verbose=2 \
               --shared_net --record_video

Enabling shared_net shares the CNN feature extractor between the policy and the value function.
Total timesteps (samples) = num_envs * num_steps * num_epochs (10M in this case).
Each sample is reused roughly num_subepochs times (8 in this case); the arithmetic is sketched below.
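With the Atari settings above, the sample budget works out as follows (plain arithmetic, not repo code):

```python
num_envs, num_steps, num_epochs, num_subepochs = 8, 125, 10000, 8

samples_per_epoch = num_envs * num_steps            # 1,000 transitions per rollout
total_timesteps = samples_per_epoch * num_epochs    # 10,000,000 environment steps
gradient_passes_per_sample = num_subepochs          # each transition is revisited ~8 times
print(total_timesteps)                              # 10000000
```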

Continuous control environment

python -m unstable_baselines.ppo.run --rank 0 --seed 1 --logdir='./log/{env_id}/ppo/{rank}' \
               --logging='training.log' --monitor_dir='monitor' --tb_logdir='' --model_dir='model' \
               --env_id="HalfCheetahBulletEnv-v0" --num_envs=1 --num_epochs=1000 \
               --num_steps=1000 --num_subepochs=10 --batch_size=100 --verbose=2 \
               --ent_coef=0.0 --record_video

Total timesteps (samples) = num_envs * num_steps * num_epochs (1M in this case).
Each sample is reused num_subepochs times (10 in this case).
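The Bullet environments are registered with gym when pybullet_envs is imported (the pybullet package must be installed); a quick way to inspect the continuous spaces of the environment used above:

```python
import gym
import pybullet_envs  # noqa: F401 -- importing this module registers the Bullet envs with gym

env = gym.make("HalfCheetahBulletEnv-v0")
print(env.observation_space)  # Box(...): continuous observations
print(env.action_space)       # Box(...): continuous actions
```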

Atari 2600

Video

BeamRiderNoFrameskip-v4, BreakoutNoFrameskip-v4, PongNoFrameskip-v4, SeaquestNoFrameskip-v4,
AsteroidsNoFrameskip-v4, EnduroNoFrameskip-v4, QbertNoFrameskip-v4, MsPacmanNoFrameskip-v4

Results

Learning curve

| env_id | Max rewards | Mean rewards | Std rewards | Train samples | Train seeds | Eval episodes | Eval seed |
|--------|-------------|--------------|-------------|---------------|-------------|---------------|-----------|
| AsteroidsNoFrameskip-v4 | 1570 | 1072 | 281.73 | 10M | 1~8 | 20 | 0 |
| BeamRiderNoFrameskip-v4 | 2832 | 1513.4 | 647.36 | 10M | 1~8 | 20 | 0 |
| BreakoutNoFrameskip-v4 | 368 | 131.85 | 118.28 | 10M | 1~8 | 20 | 0 |
| EnduroNoFrameskip-v4 | 302 | 189.2 | 29.79 | 10M | 1~8 | 20 | 0 |
| MsPacmanNoFrameskip-v4 | 2650 | 2035.5 | 463.1 | 10M | 1~8 | 20 | 0 |
| PongNoFrameskip-v4 | 21 | 21 | 0 | 10M | 1~8 | 20 | 0 |
| QbertNoFrameskip-v4 | 16925 | 16441.25 | 259.23 | 10M | 1~8 | 20 | 0 |
| SeaquestNoFrameskip-v4 | 1760 | 1750 | 17.32 | 10M | 1~8 | 20 | 0 |

M = million (1e6)

Hyperparameters

| env_id | AsteroidsNoFrameskip-v4 | BeamRiderNoFrameskip-v4 | BreakoutNoFrameskip-v4 | EnduroNoFrameskip-v4 | MsPacmanNoFrameskip-v4 | PongNoFrameskip-v4 | QbertNoFrameskip-v4 | SeaquestNoFrameskip-v4 |
|--------|---|---|---|---|---|---|---|---|
| num_envs | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| num_epochs | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| num_steps | 125 | 125 | 125 | 125 | 125 | 125 | 125 | 125 |
| num_subepochs | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| batch_size | 256 | 256 | 256 | 256 | 256 | 256 | 256 | 256 |
| ent_coef | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| vf_coef | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| shared_net | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |

PyBullet

Video

HalfCheetahBulletEnv-v0, AntBulletEnv-v0, HopperBulletEnv-v0,
Walker2DBulletEnv-v0, HumanoidBulletEnv-v0

Learning curve

| env_id | Max rewards | Mean rewards | Std rewards | Train samples | Train seeds | Eval episodes | Eval seed |
|--------|-------------|--------------|-------------|---------------|-------------|---------------|-----------|
| AntBulletEnv-v0 | 2247.002 | 2157.180 | 107.803 | 2M | 1 | 20 | 0 |
| HalfCheetahBulletEnv-v0 | 2696.556 | 2477.882 | 759.322 | 2M | 1 | 20 | 0 |
| HopperBulletEnv-v0 | 2689.504 | 2542.172 | 373.381 | 2M | 1 | 20 | 0 |
| HumanoidBulletEnv-v0 | 2447.299 | 1883.564 | 923.937 | 8M | 1 | 20 | 0 |
| Walker2DBulletEnv-v0 | 2108.727 | 2005.461 | 286.699 | 4M | 1 | 20 | 0 |

Hyperparameters

| env_id | AntBulletEnv-v0 | HalfCheetahBulletEnv-v0 | HopperBulletEnv-v0 | HumanoidBulletEnv-v0 | Walker2DBulletEnv-v0 |
|--------|---|---|---|---|---|
| num_envs | 1 | 1 | 1 | 16 | 4 |
| num_epochs | 2000 | 2000 | 2000 | 1000 | 2000 |
| num_steps | 1000 | 1000 | 1000 | 500 | 500 |
| num_subepochs | 10 | 10 | 10 | 20 | 20 |
| batch_size | 100 | 100 | 100 | 1000 | 1000 |
| lr | 3e-4 | 3e-4 | 3e-4 | 3e-4 | 3e-4 |
| ent_coef | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| vf_coef | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| shared_net | | | | | |
| MlpNet | [256, 256] | [256, 256] | [256, 256] | [256, 256] | [256, 256] |

Architecture

| | Box | Discrete | MultiDiscrete | MultiBinary |
|---|---|---|---|---|
| Observation | ✔️ | ✔️ | ✔️ | ✔️ |
| Action | ✔️ | ✔️ | | |
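The action-space support above reflects the usual choice of policy distribution: a categorical head for Discrete actions and a diagonal Gaussian head for Box actions. A hedged sketch of that dispatch (names are illustrative and not this repo's API):

```python
import gym

def policy_head_for(action_space):
    """Describe the output distribution a PPO policy head would use (illustrative)."""
    if isinstance(action_space, gym.spaces.Discrete):
        return f"categorical over {action_space.n} actions"
    if isinstance(action_space, gym.spaces.Box):
        return f"diagonal Gaussian over {action_space.shape[0]} action dims"
    raise NotImplementedError(f"unsupported action space: {action_space}")
```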


Atari-like environment: shared_net=True

Continuous control environment: shared_net=False
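A minimal PyTorch sketch of the two layouts, assuming the standard Nature-CNN trunk for Atari and the [256, 256] MLPs from the table above; this is illustrative and not the repo's actual model code:

```python
import torch.nn as nn

class SharedCnnActorCritic(nn.Module):
    """Atari-like (shared_net=True): one CNN trunk feeds both policy and value heads."""
    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # assumes 84x84 frames
        )
        self.policy_head = nn.Linear(512, num_actions)  # logits of a categorical policy
        self.value_head = nn.Linear(512, 1)

    def forward(self, obs):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features)


class SeparateMlpActorCritic(nn.Module):
    """Continuous control (shared_net=False): independent [256, 256] MLP trunks."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, act_dim),                     # mean of a diagonal Gaussian
        )
        self.value = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, 1),
        )

    def forward(self, obs):
        return self.policy(obs), self.value(obs)
```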