From 6e5439bf14417e28dfd8d1c0de31ca0b20b3cb72 Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Fri, 5 Jan 2018 19:58:40 +0200 Subject: [PATCH 01/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 14e507c..3e6599e 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,5 @@ # A2C -An implementation of `Synchronous Advantage Actor Critic (A2C)` in TensorFlow. A2C is a variant of advantage actor critic introduced by [OpenAI in their published baselines](https://github.com/openai/baselines). However, these baselines are difficult to understand and modify. So, I implemented the A2C based on their implementation but in a clearer and simpler way. +An implementation of `Synchronous Advantage Actor Critic (A2C)` in TensorFlow. A2C is a variant of advantage actor critic introduced by [OpenAI in their published baselines](https://github.com/openai/baselines). However, these baselines are difficult to understand and modify. So, I made the A2C based on their implementation but in a clearer and simpler way. ## Asynchronous vs Synchronous Advantage Actor Critic From f12d1ccbe293d59d28f15358d1ae7c84e33df6e9 Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Fri, 5 Jan 2018 19:59:40 +0200 Subject: [PATCH 02/14] Update README.md --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 3e6599e..eb9049d 100644 --- a/README.md +++ b/README.md @@ -67,3 +67,6 @@ In the project, two configuration files are provided as examples for training on ## License This project is licensed under the Apache License 2.0 - see the LICENSE file for details. + +## Reference Repository +[OpenAI Baselines](https://github.com/openai/baselines) From 92a4114cb7afddcb9d8274e9165dc1748499f77e Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Fri, 5 Jan 2018 20:03:53 +0200 Subject: [PATCH 03/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index eb9049d..9033334 100644 --- a/README.md +++ b/README.md @@ -34,7 +34,7 @@ tensorboard --logdir=experiments/my_experiment/summaries ### Video Producing During training, you can generate videos of the trained agent playing the game. This is achieved by changing `record_video_every` in the configuration file from -1 to the number of episodes between two generated videos. Generated videos are in your experiment directory. -During testing, videos are generated automatically if the optional `monitor` method is implemented in the environment. +During testing, videos are generated automatically if the optional `monitor` method is implemented in the environment. As for the gym included environment, it's already been implemented. ## Usage ### Main Dependencies From 65459618a69cd5e40c49fadbc1e8a0013ad973cb Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Sat, 6 Jan 2018 01:43:32 +0200 Subject: [PATCH 04/14] Update README.md --- README.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/README.md b/README.md index 9033334..9daed54 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,12 @@ # A2C An implementation of `Synchronous Advantage Actor Critic (A2C)` in TensorFlow. A2C is a variant of advantage actor critic introduced by [OpenAI in their published baselines](https://github.com/openai/baselines). However, these baselines are difficult to understand and modify. So, I made the A2C based on their implementation but in a clearer and simpler way. +## What's new to OpenAI Baseline? +1. 
Support for Tensorboard visualization per running agent in an environment. +2. Support for different policy networks in an easier way. +3. Support for environments other than OpenAI gym in an easy way. +4. Support for video exporting per environment. +5. Simple and easy code to modify and begin experimenting. All you do is a plug and play. ## Asynchronous vs Synchronous Advantage Actor Critic Asynchronous advantage actor critic was introduced in [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/pdf/1602.01783.pdf). The difference between both methods is that in asynchronous AC, parallel agents update the global network each one on its own. So, at a certain time, the weights used by an agent maybe different than the weights used by another agent leading to the fact that each agent plays with a different policy to explore more and more of the environment. However, in synchronous AC, all of the updates by the parallel agents are collected to update the global network. To encourage exploration, stochastic noise is added to the probability distribution of the actions predicted by each agent. From f1b986eb4f6f33cb41ec101a6356c4e4fd703d38 Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Sat, 6 Jan 2018 01:43:54 +0200 Subject: [PATCH 05/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 9daed54..9d68809 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # A2C An implementation of `Synchronous Advantage Actor Critic (A2C)` in TensorFlow. A2C is a variant of advantage actor critic introduced by [OpenAI in their published baselines](https://github.com/openai/baselines). However, these baselines are difficult to understand and modify. So, I made the A2C based on their implementation but in a clearer and simpler way. -## What's new to OpenAI Baseline? +### What's new to OpenAI Baseline? 1. Support for Tensorboard visualization per running agent in an environment. 2. Support for different policy networks in an easier way. 3. Support for environments other than OpenAI gym in an easy way. From ec6e4c26866b103396107d3ab513e6032ad7e254 Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Sat, 6 Jan 2018 01:45:02 +0200 Subject: [PATCH 06/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 9d68809..903facf 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ An implementation of `Synchronous Advantage Actor Critic (A2C)` in TensorFlow. A 2. Support for different policy networks in an easier way. 3. Support for environments other than OpenAI gym in an easy way. 4. Support for video exporting per environment. -5. Simple and easy code to modify and begin experimenting. All you do is a plug and play. +5. Simple and easy code to modify and begin experimenting. All you need to do is a plug and play! ## Asynchronous vs Synchronous Advantage Actor Critic Asynchronous advantage actor critic was introduced in [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/pdf/1602.01783.pdf). The difference between both methods is that in asynchronous AC, parallel agents update the global network each one on its own. So, at a certain time, the weights used by an agent maybe different than the weights used by another agent leading to the fact that each agent plays with a different policy to explore more and more of the environment. 
However, in synchronous AC, all of the updates by the parallel agents are collected to update the global network. To encourage exploration, stochastic noise is added to the probability distribution of the actions predicted by each agent. From a64840c826a457963d936e94d9ed4fb5c3dec6be Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Sat, 6 Jan 2018 01:45:14 +0200 Subject: [PATCH 07/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 903facf..9a1a458 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ An implementation of `Synchronous Advantage Actor Critic (A2C)` in TensorFlow. A 2. Support for different policy networks in an easier way. 3. Support for environments other than OpenAI gym in an easy way. 4. Support for video exporting per environment. -5. Simple and easy code to modify and begin experimenting. All you need to do is a plug and play! +5. Simple and easy code to modify and begin experimenting. All you need to do is plug and play! ## Asynchronous vs Synchronous Advantage Actor Critic Asynchronous advantage actor critic was introduced in [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/pdf/1602.01783.pdf). The difference between both methods is that in asynchronous AC, parallel agents update the global network each one on its own. So, at a certain time, the weights used by an agent maybe different than the weights used by another agent leading to the fact that each agent plays with a different policy to explore more and more of the environment. However, in synchronous AC, all of the updates by the parallel agents are collected to update the global network. To encourage exploration, stochastic noise is added to the probability distribution of the actions predicted by each agent. From c2765e9483abfb728d460759a3d9887f595c5307 Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Sat, 6 Jan 2018 01:47:05 +0200 Subject: [PATCH 08/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 9a1a458..3ca9a91 100644 --- a/README.md +++ b/README.md @@ -37,7 +37,7 @@ tensorboard --logdir=experiments/my_experiment/summaries
-### Video Producing +### Video Generation During training, you can generate videos of the trained agent playing the game. This is achieved by changing `record_video_every` in the configuration file from -1 to the number of episodes between two generated videos. Generated videos are in your experiment directory. During testing, videos are generated automatically if the optional `monitor` method is implemented in the environment. As for the gym included environment, it's already been implemented. From e512b65556dee9f6f6d49382c120a7f4cc4ec79f Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Sat, 6 Jan 2018 15:00:31 +0200 Subject: [PATCH 09/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3ca9a91..a47a992 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ The methods that should be implemented in the new environment class are: 5. `get_action_space()` for returing an object with attribute tuple `n` representing the number of possible actions in the environment. 6. `render()` for rendering the environment if appropriate. -### Policy Models Supported +### Policy Networks Supported This implementation comes with the basic CNN policy network from OpenAI baseline. However, it supports using different policy networks. All you have to do is to inherit from the base class `BasePolicy` in `models\base_policy.py`, and implement all the methods in a plug and play fashion again :D (See the CNNPolicy example class). ### Tensorboard Visualization From 1d41cca6c2f08f2db7f66e7479b198b555ee6828 Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Sat, 6 Jan 2018 15:02:25 +0200 Subject: [PATCH 10/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a47a992..65a819f 100644 --- a/README.md +++ b/README.md @@ -17,7 +17,7 @@ Asynchronous advantage actor critic was introduced in [Asynchronous Methods for ### Environments Supported This implementation allows for using different environments. It's not restricted to OpenAI gym environments. If you want to attach the project to another environment rather than that provided by gym, all you have to do is to inherit from the base class `BaseEnv` in `envs/base_env.py`, and implement all the methods in a plug and play fashion (See the gym environment example class). You also have to add the name of the new environment class in `A2C.py\env_name_parser()` method. -The methods that should be implemented in the new environment class are: +The methods that should be implemented in a new environment class are: 1. `make()` for creating the environment and returning a reference to it. 2. `step()` for taking a step in the environment and returning a tuple (observation images, reward float value, done boolean, any other info). 3. `reset()` for resetting the environment to the initial state. From df928217930bf2c763d32f0cb84a15462a9b0c61 Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Sat, 6 Jan 2018 19:03:45 +0200 Subject: [PATCH 11/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 65a819f..1644323 100644 --- a/README.md +++ b/README.md @@ -22,7 +22,7 @@ The methods that should be implemented in a new environment class are: 2. `step()` for taking a step in the environment and returning a tuple (observation images, reward float value, done boolean, any other info). 3. `reset()` for resetting the environment to the initial state. 4. 
`get_observation_space()` for returning an object with attribute tuple `shape` representing the shape of the observation space. -5. `get_action_space()` for returing an object with attribute tuple `n` representing the number of possible actions in the environment. +5. `get_action_space()` for returning an object with attribute `n` representing the number of possible actions in the environment. 6. `render()` for rendering the environment if appropriate. ### Policy Networks Supported From 702c74b22a63456371c9a6c02956f8ba0bc4705b Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Sat, 6 Jan 2018 19:04:50 +0200 Subject: [PATCH 12/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 1644323..f90b859 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ An implementation of `Synchronous Advantage Actor Critic (A2C)` in TensorFlow. A 1. Support for Tensorboard visualization per running agent in an environment. 2. Support for different policy networks in an easier way. 3. Support for environments other than OpenAI gym in an easy way. -4. Support for video exporting per environment. +4. Support for video generation of an agent acting in the environment. 5. Simple and easy code to modify and begin experimenting. All you need to do is plug and play! ## Asynchronous vs Synchronous Advantage Actor Critic From 6f257d2dd34b7fb14d2688c32f3d5a9a213d5922 Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Sat, 6 Jan 2018 19:07:30 +0200 Subject: [PATCH 13/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f90b859..415f6fd 100644 --- a/README.md +++ b/README.md @@ -26,7 +26,7 @@ The methods that should be implemented in a new environment class are: 6. `render()` for rendering the environment if appropriate. ### Policy Networks Supported -This implementation comes with the basic CNN policy network from OpenAI baseline. However, it supports using different policy networks. All you have to do is to inherit from the base class `BasePolicy` in `models\base_policy.py`, and implement all the methods in a plug and play fashion again :D (See the CNNPolicy example class). +This implementation comes with the basic CNN policy network from OpenAI baseline. However, it supports using different policy networks. All you have to do is to inherit from the base class `BasePolicy` in `models\base_policy.py`, and implement all the methods in a plug and play fashion again :D (See the CNNPolicy example class). You also have to add the name of the new policy network class in `models\model.py\policy_name_parser()` method. ### Tensorboard Visualization This implementation allows for the beautiful Tensorboard visualization. It displays the time plots per running agent of the two most important signals in reinforcement learning: episode length and total reward in the episode. All you have to do is to launch Tensorboard from your experiment directory located in `experiments/`. From 08bd418e2ee2174d741d3985845b764d199c86cf Mon Sep 17 00:00:00 2001 From: Mostafa Gamal Date: Sat, 6 Jan 2018 19:09:16 +0200 Subject: [PATCH 14/14] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 415f6fd..cbe4b64 100644 --- a/README.md +++ b/README.md @@ -38,7 +38,7 @@ tensorboard --logdir=experiments/my_experiment/summaries ### Video Generation -During training, you can generate videos of the trained agent playing the game. 
This is achieved by changing `record_video_every` in the configuration file from -1 to the number of episodes between two generated videos. Generated videos are in your experiment directory. +During training, you can generate videos of the trained agent acting (playing) in the environment. This is achieved by changing `record_video_every` in the configuration file from -1 to the number of episodes between two generated videos. Videos are generated in your experiment directory. During testing, videos are generated automatically if the optional `monitor` method is implemented in the environment. As for the gym included environment, it's already been implemented.
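The environment contract referenced in the patches above (patches 09–11) is described only in prose. Purely as an illustration, and not part of the patch series itself, a non-gym environment wired in the way the README describes might look roughly like the sketch below. The module path `envs.base_env`, the lack of constructor arguments, and the `action` parameter of `step()` are assumptions not shown in these diffs; the dummy dynamics and space sizes are placeholders.

```python
# Hypothetical sketch of a non-gym environment for this A2C implementation.
# Assumptions: BaseEnv lives at envs/base_env.py and needs no constructor
# arguments; step() receives a discrete action index. The dynamics below are
# dummies purely for illustration.
from collections import namedtuple

import numpy as np

from envs.base_env import BaseEnv  # base class referenced in the README

# Lightweight objects exposing the attributes the README requires:
# an observation space with `shape` and an action space with `n`.
ObservationSpace = namedtuple("ObservationSpace", ["shape"])
ActionSpace = namedtuple("ActionSpace", ["n"])


class MyCustomEnv(BaseEnv):
    """Hypothetical environment; its class name would also be registered in
    A2C.py's env_name_parser(), as the README describes."""

    def make(self):
        # Create the environment and return a reference to it.
        self._steps = 0
        return self

    def step(self, action):
        # Return the 4-tuple described in the README:
        # (observation images, reward float, done boolean, any other info).
        self._steps += 1
        observation = np.zeros(self.get_observation_space().shape, dtype=np.uint8)
        reward = 0.0
        done = self._steps >= 1000
        return observation, reward, done, {}

    def reset(self):
        # Reset to the initial state and return the first observation.
        self._steps = 0
        return np.zeros(self.get_observation_space().shape, dtype=np.uint8)

    def get_observation_space(self):
        # Object with a `shape` attribute, e.g. 84x84 frames with 4 stacked channels.
        return ObservationSpace(shape=(84, 84, 4))

    def get_action_space(self):
        # Object with an `n` attribute: the number of possible discrete actions.
        return ActionSpace(n=6)

    def render(self):
        # Rendering is optional; this dummy environment has nothing to draw.
        pass
```

A custom policy network would follow the same plug-and-play route against `BasePolicy` in `models/base_policy.py`, with its class name registered in `policy_name_parser()` in `models/model.py`, as patch 13 notes.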