This repository contains the runs for the safe exploration experiments described in the original paper. The paper proposes a novel architecture for solving real-world problems in which violating safety-critical constraints is heavily penalized.
The paper introduces two reference environments, the Ball (n-dimensional) and Spaceship environments, whose dynamics are modelled by first-order and second-order differential equations respectively. The proposed solution models the environment as a Constrained Markov Decision Process and focuses on constrained policy optimization, where an additional safety layer is built on top of the policy produced by DDPG (Deep Deterministic Policy Gradient). The safety layer avoids constraint violations by performing an action correction after every policy evaluation; i.e., after every policy query, it solves an optimization problem to find the minimal change to the action such that the safety constraints are satisfied.
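For reference, below is a minimal sketch of that correction step under the linearized constraint model used in the paper, assuming at most one constraint is active at a time; the function name and the shapes of `c`, `G`, and `C` are illustrative assumptions, not taken from this repository.

```python
import numpy as np

def safe_action(a_pi, c, G, C, eps=1e-8):
    """Closed-form action correction for linearized constraints.

    a_pi : (action_dim,)                  action proposed by the policy
    c    : (num_constraints,)             current constraint values c_i(s)
    G    : (num_constraints, action_dim)  learned sensitivities g_i(s)
    C    : (num_constraints,)             constraint upper bounds
    """
    # Lagrange multipliers, assuming at most one constraint is active
    lam = (G @ a_pi + c - C) / (np.sum(G * G, axis=1) + eps)
    lam = np.maximum(lam, 0.0)
    # Project the action along the gradient of the most violated constraint
    i = np.argmax(lam)
    return a_pi - lam[i] * G[i]
```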
Our experiments include designing the Safety Layer from scratch and integrating it with DDPG and Twin Delayed Deep Deterministic Policy Gradient (TD3) on various gym environments, including Ball-1D, Ball-2D, Ball-3D, Spaceship-Arena, Spaceship-Corridor, and Bioreactor.
TD3 is an improvement over DDPG that reduces the maximization (overestimation) bias by training twin critics and using the minimum of their estimates as the bootstrap target (see the sketch after this paragraph). Our experiments also include rewards and cumulative constraint violations for each of the environments with customized reward shaping. We performed a comparative analysis showing how a minimal safety layer implementation on top of the deterministic policy model effectively boosts the training and evaluation rewards obtained by the agent while navigating the respective environment over episodes, and is nearly successful in attaining constraint-free actions.
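As an illustration of the clipped double-Q idea mentioned above, here is a minimal sketch of the TD3 target computation; the function and argument names are illustrative assumptions and do not come from the notebooks in this repository.

```python
import torch

def td3_target(critic1_t, critic2_t, actor_t, reward, next_state, done,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Clipped double-Q target: bootstrap from the minimum of two target critics."""
    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action
        a_next = actor_t(next_state)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-max_action, max_action)
        # Pessimistic minimum over the twin target critics
        q_next = torch.min(critic1_t(next_state, a_next),
                           critic2_t(next_state, a_next))
        return reward + gamma * (1.0 - done) * q_next
```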
The plots obtained using the safety layer for the different environments show that the agent attains optimal convergence in terms of rewards in far fewer episodes. The action correction comes at the cost of increased wall-clock time, since every action selection requires a forward pass through the trained constraint model to return a safe action for navigating the environment. The implementation also achieves zero constraint violations in some of the environments, highlighting the potential of a linear safety approximation in several industrial use cases.
All of the results are compiled as .npy files inside the files link. The script for visualizing the results obtained for all of the above-mentioned environments is available at Link. Some visualizations and comparisons can be found at Link.
All working implementations are located inside the ./notebooks directory.
Members | Github-ID |
---|---|
Rajarshi Dutta | @Rajarshi1001 |
Udvas Basak | @IonUdvas |
Divyani Gaur | @DivyaniGaur |
- Dalal, Gal, et al. "Safe exploration in continuous action spaces." arXiv preprint arXiv:1801.08757 (2018).