# Proximal Policy Optimization (PPO)

> John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, & Oleg Klimov. (2017). Proximal Policy Optimization Algorithms.
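The core of PPO is the clipped surrogate objective from the paper above: the probability ratio between the new and old policies is clipped to `[1 - eps, 1 + eps]` so a single minibatch cannot push the policy too far from the one that collected the data. As a minimal NumPy sketch of that loss (illustrative only, not this repo's implementation; the default `clip_range=0.2` follows the paper):

```python
import numpy as np

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_range=0.2):
    """Clipped surrogate loss from Schulman et al. (2017).

    Sketch only: computes -E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)]
    where r_t = pi_new(a|s) / pi_old(a|s).
    """
    ratio = np.exp(log_prob_new - log_prob_old)   # r_t
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Maximizing the surrogate == minimizing its negation.
    return -np.mean(np.minimum(unclipped, clipped))
```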
### Run with default arguments
```sh
./unstable_baselines/ppo/train.sh --rank 0 --seed 1 "BreakoutNoFrameskip-v4"
```
### Run multiple environments with default arguments
```sh
./unstable_baselines/ppo/train.sh --rank 0 --seed 1 "BreakoutNoFrameskip-v4" "SeaquestNoFrameskip-v4" "PongNoFrameskip-v4"
```
### Atari-like environment (Image observation + discrete action)
```sh
python -m unstable_baselines.ppo.run --rank 0 --seed 1 --logdir='./log/{env_id}/ppo/{rank}' \
    --logging='training.log' --monitor_dir='monitor' --tb_logdir='' --model_dir='model' \
    --env_id="BreakoutNoFrameskip-v4" --num_envs=8 --num_epochs=10000 \
    --num_steps=125 --num_subepochs=8 --batch_size=256 --verbose=2 \
    --shared_net --record_video
```
* Enabling `shared_net` shares the CNN feature extractor between the policy and the value function (see the sketch below).
* Total timesteps (samples) ≈ `num_envs * num_steps * num_epochs` (~10M in this case).
* Number of times each sample is reused ≈ `num_subepochs` (~8 in this case).
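For intuition, here is a minimal sketch of what a shared policy/value network can look like. The layer sizes follow the common Nature-CNN layout for 84×84 frame stacks and are illustrative assumptions, not this repo's exact architecture (which may also use a different framework than the PyTorch shown here):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Policy and value heads on one shared CNN torso (shared_net=True).

    Illustrative only: sizes assume 4 stacked 84x84 grayscale frames.
    """
    def __init__(self, n_actions: int):
        super().__init__()
        self.torso = nn.Sequential(                    # shared feature extractor
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.policy_head = nn.Linear(512, n_actions)   # action logits
        self.value_head = nn.Linear(512, 1)            # state value

    def forward(self, obs: torch.Tensor):
        features = self.torso(obs)                     # computed once, used twice
        return self.policy_head(features), self.value_head(features)
```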
### Continuous control environment
```sh
python -m unstable_baselines.ppo.run --rank 0 --seed 1 --logdir='./log/{env_id}/ppo/{rank}' \
    --logging='training.log' --monitor_dir='monitor' --tb_logdir='' --model_dir='model' \
    --env_id="HalfCheetahBulletEnv-v0" --num_envs=1 --num_epochs=1000 \
    --num_steps=1000 --num_subepochs=10 --batch_size=100 --verbose=2 \
    --ent_coef=0.0 --record_video
```
* Total timesteps (samples) = `num_envs * num_steps * num_epochs` (~1M in this case).
* Number of times each sample is reused = `num_subepochs` (~10 in this case); see the update-loop sketch below.
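To make the sample-reuse arithmetic concrete, here is a hedged sketch of a typical PPO update loop. The functions `collect_rollout` and `update_minibatch` are hypothetical placeholders, not this repo's API; the numbers match the continuous-control command above:

```python
import numpy as np

# Illustrative settings from the command above: 1 * 1000 * 1000 epochs = ~1M samples.
num_envs, num_steps, num_subepochs, batch_size = 1, 1000, 10, 100

def run_epoch(collect_rollout, update_minibatch):
    """One PPO epoch: collect a rollout, then reuse it num_subepochs times."""
    rollout = collect_rollout(num_envs * num_steps)    # 1000 fresh samples
    indices = np.arange(num_envs * num_steps)
    for _ in range(num_subepochs):                     # each sample seen 10x
        np.random.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            update_minibatch(rollout, indices[start:start + batch_size])
```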
Benchmarked Atari environments:
* BeamRiderNoFrameskip-v4
* BreakoutNoFrameskip-v4
* PongNoFrameskip-v4
* SeaquestNoFrameskip-v4
* AsteroidsNoFrameskip-v4
* EnduroNoFrameskip-v4
* QbertNoFrameskip-v4
* MsPacmanNoFrameskip-v4
**Learning curve**

| env_id | Max rewards | Mean rewards | Std rewards | Train samples | Train seeds | Eval episodes | Eval seed |
|---|---|---|---|---|---|---|---|
| AsteroidsNoFrameskip-v4 | 1570 | 1072 | 281.73 | 10M | 1~8 | 20 | 0 |
| BeamRiderNoFrameskip-v4 | 2832 | 1513.4 | 647.36 | 10M | 1~8 | 20 | 0 |
| BreakoutNoFrameskip-v4 | 368 | 131.85 | 118.28 | 10M | 1~8 | 20 | 0 |
| EnduroNoFrameskip-v4 | 302 | 189.2 | 29.79 | 10M | 1~8 | 20 | 0 |
| MsPacmanNoFrameskip-v4 | 2650 | 2035.5 | 463.1 | 10M | 1~8 | 20 | 0 |
| PongNoFrameskip-v4 | 21 | 21 | 0 | 10M | 1~8 | 20 | 0 |
| QbertNoFrameskip-v4 | 16925 | 16441.25 | 259.23 | 10M | 1~8 | 20 | 0 |
| SeaquestNoFrameskip-v4 | 1760 | 1750 | 17.32 | 10M | 1~8 | 20 | 0 |

*M = million (1e6)*
**Hyperparameters**

| env_id | AsteroidsNoFrameskip-v4 | BeamRiderNoFrameskip-v4 | BreakoutNoFrameskip-v4 | EnduroNoFrameskip-v4 | MsPacmanNoFrameskip-v4 | PongNoFrameskip-v4 | QbertNoFrameskip-v4 | SeaquestNoFrameskip-v4 |
|---|---|---|---|---|---|---|---|---|
| num_envs | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| num_epochs | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| num_steps | 125 | 125 | 125 | 125 | 125 | 125 | 125 | 125 |
| num_subepochs | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| batch_size | 256 | 256 | 256 | 256 | 256 | 256 | 256 | 256 |
| ent_coef | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| vf_coef | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| shared_net | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Benchmarked PyBullet environments:
* HalfCheetahBulletEnv-v0
* AntBulletEnv-v0
* HopperBulletEnv-v0
* Walker2DBulletEnv-v0
* HumanoidBulletEnv-v0
**Learning curve**

| env_id | Max rewards | Mean rewards | Std rewards | Train samples | Train seeds | Eval episodes | Eval seed |
|---|---|---|---|---|---|---|---|
| AntBulletEnv-v0 | 2247.002 | 2157.180 | 107.803 | 2M | 1 | 20 | 0 |
| HalfCheetahBulletEnv-v0 | 2696.556 | 2477.882 | 759.322 | 2M | 1 | 20 | 0 |
| HopperBulletEnv-v0 | 2689.504 | 2542.172 | 373.381 | 2M | 1 | 20 | 0 |
| HumanoidBulletEnv-v0 | 2447.299 | 1883.564 | 923.937 | 8M | 1 | 20 | 0 |
| Walker2DBulletEnv-v0 | 2108.727 | 2005.461 | 286.699 | 4M | 1 | 20 | 0 |
**Hyperparameters**

| env_id | AntBulletEnv-v0 | HalfCheetahBulletEnv-v0 | HopperBulletEnv-v0 | HumanoidBulletEnv-v0 | Walker2DBulletEnv-v0 |
|---|---|---|---|---|---|
| num_envs | 1 | 1 | 1 | 16 | 4 |
| num_epochs | 2000 | 2000 | 2000 | 1000 | 2000 |
| num_steps | 1000 | 1000 | 1000 | 500 | 500 |
| num_subepochs | 10 | 10 | 10 | 20 | 20 |
| batch_size | 100 | 100 | 100 | 1000 | 1000 |
| lr | 3e-4 | 3e-4 | 3e-4 | 3e-4 | 3e-4 |
| ent_coef | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| vf_coef | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| shared_net | ❌ | ❌ | ❌ | ❌ | ❌ |
| MlpNet | [256, 256] | [256, 256] | [256, 256] | [256, 256] | [256, 256] |
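With `shared_net` disabled, the policy and value function use separate networks; `MlpNet [256, 256]` above denotes two hidden layers of 256 units each. A hedged sketch of that layout follows; everything beyond the hidden sizes (activation choice, diagonal-Gaussian policy head, framework) is an illustrative assumption, not this repo's exact implementation:

```python
import torch
import torch.nn as nn

class SeparateActorCritic(nn.Module):
    """Separate policy/value MLPs (shared_net=False), [256, 256] hidden layers."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        def mlp(out_dim):
            return nn.Sequential(
                nn.Linear(obs_dim, 256), nn.Tanh(),
                nn.Linear(256, 256), nn.Tanh(),
                nn.Linear(256, out_dim),
            )
        self.policy_mean = mlp(act_dim)   # its own trunk, nothing shared
        self.value = mlp(1)               # a second, independent trunk
        # State-independent log-std, a common diagonal-Gaussian choice.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor):
        return self.policy_mean(obs), self.log_std.exp(), self.value(obs)
```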
Supported observation and action spaces:

| | Box | Discrete | MultiDiscrete | MultiBinary |
|---|---|---|---|---|
| Observation | ✔️ | ✔️ | ✔️ | ✔️ |
| Action | ✔️ | ✔️ | ❌ | ❌ |
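For example, a quick compatibility check against this table with `gym` (the helper below is illustrative, not part of this repo):

```python
import gym

# Space types accepted per the table above.
OBS_SPACES = (gym.spaces.Box, gym.spaces.Discrete,
              gym.spaces.MultiDiscrete, gym.spaces.MultiBinary)
ACT_SPACES = (gym.spaces.Box, gym.spaces.Discrete)

def check_env_supported(env_id: str) -> bool:
    """Illustrative helper: verify an env's spaces match the table above."""
    env = gym.make(env_id)
    ok = (isinstance(env.observation_space, OBS_SPACES)
          and isinstance(env.action_space, ACT_SPACES))
    env.close()
    return ok

print(check_env_supported("BreakoutNoFrameskip-v4"))  # Discrete actions -> True
```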
Architecture: Atari-like environments use `shared_net=True` (one CNN shared by the policy and value heads); continuous control environments use `shared_net=False` (separate policy and value MLPs).