Some reinforcement learning algorithms implemented with TensorFlow 2
For practice I chose the OpenAI Gym LunarLander environment. It is representative and doesn't skip frames like some other environments.
- Advantage Actor-Critic with online update, N-step returns backup and entropy bonus
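A minimal sketch of how the N-step return, advantage, and entropy bonus can fit together in the actor and critic losses; the names and hyperparameters here are illustrative, not the repository's actual code:

```python
import tensorflow as tf

def n_step_return(rewards, bootstrap_value, gamma=0.99):
    # G_t = r_t + gamma * r_{t+1} + ... + gamma^N * V(s_{t+N})
    g, returns = bootstrap_value, []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return tf.stack(returns[::-1])

def actor_critic_losses(logits, actions, values, returns, entropy_beta=0.01):
    advantages = returns - values                        # A(s,a) = G - V(s)
    log_pi = tf.nn.log_softmax(logits)
    log_probs = tf.gather(log_pi, actions, batch_dims=1)
    entropy = -tf.reduce_sum(tf.exp(log_pi) * log_pi, axis=-1)
    policy_loss = -tf.reduce_mean(
        log_probs * tf.stop_gradient(advantages) + entropy_beta * entropy)
    value_loss = tf.reduce_mean(tf.square(advantages))   # critic regression to G
    return policy_loss, value_loss
```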
- Proximal Policy Optimization (PPO) + Generalized Advantage Estimator (GAE)
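For reference, GAE reduces to a single backward pass over one rollout; this helper only illustrates the formula and is not the code used in the repository:

```python
import numpy as np

def gae(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    # A_t     = delta_t + gamma * lam * (1 - done_t) * A_{t+1}
    values = np.append(values, next_value)
    advantages = np.zeros(len(rewards), dtype=np.float32)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last_adv = delta + gamma * lam * nonterminal * last_adv
        advantages[t] = last_adv
    returns = advantages + values[:-1]   # value-function targets for PPO
    return advantages, returns
```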
- Soft Actor-Critic with Value network (alpha term regularization taken from SAC v2)
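The SAC v2 alpha regularization mentioned above is usually a separate gradient step on log(alpha) against a target entropy; a hedged sketch with illustrative names (the target entropy of -2 assumes the 2-D continuous LunarLander action space):

```python
import tensorflow as tf

log_alpha = tf.Variable(0.0)                      # optimize log(alpha) to keep alpha > 0
target_entropy = -2.0                             # assumption: -dim(action space)
alpha_optimizer = tf.keras.optimizers.Adam(3e-4)

def update_alpha(log_probs):
    # log_probs: log pi(a|s) of actions freshly sampled from the current policy
    with tf.GradientTape() as tape:
        alpha_loss = -tf.reduce_mean(
            tf.exp(log_alpha) * tf.stop_gradient(log_probs + target_entropy))
    grads = tape.gradient(alpha_loss, [log_alpha])
    alpha_optimizer.apply_gradients(zip(grads, [log_alpha]))
    return tf.exp(log_alpha)                      # current temperature for the actor loss
```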
- APE-X DPG
- APE-X with Soft Actor-Critic
- Curiosity based on Random Network Distillation (with Soft Actor-Critic)
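The RND curiosity bonus is the prediction error of a trainable network against a frozen, randomly initialized one; a minimal sketch (layer sizes are arbitrary, not those used in this repo):

```python
import tensorflow as tf

def make_rnd_nets(obs_dim, embed_dim=64):
    def net():
        return tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation='relu', input_shape=(obs_dim,)),
            tf.keras.layers.Dense(embed_dim)])
    target, predictor = net(), net()
    target.trainable = False                       # target stays random and frozen
    return target, predictor

def intrinsic_reward(target, predictor, obs):
    # curiosity bonus = || f_target(s) - f_predictor(s) ||^2, averaged over features
    return tf.reduce_mean(tf.square(target(obs) - predictor(obs)), axis=-1)
```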
- Recurrent Experience Replay in Distributed Reinforcement Learning (R2D2) with SAC.
- Orchestrator
- Agent
- Learner
- Agent buffer: responsible for collecting trajectories; an important part of the whole algorithm.
Note 1: for this experiment the famous LunarLander environment was altered to produce 'stacked' states. This is achieved by adding linearly interpolated states between 'state' and 'next_state' (see the sketch after Note 2).
Note 2: Because the original paper says nothing about behavior near the trajectory end, the simplest approach was taken: the length of the last trajectory may vary, but it always contains at least 2 records.
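A sketch of what Notes 1 and 2 describe; the function and parameter names are illustrative, not the agent buffer's actual API:

```python
import numpy as np

def stack_with_interpolation(state, next_state, n_substeps=4):
    # Note 1: build a 'stacked' observation by linearly interpolating
    # between state and next_state, since the env returns single frames.
    alphas = np.linspace(0.0, 1.0, n_substeps, endpoint=False)
    frames = [state + a * (next_state - state) for a in alphas]
    return np.stack(frames, axis=0)               # shape: (n_substeps, obs_dim)

def split_episode(records, rollout_len):
    # Note 2: chop one episode into fixed-size trajectories; the last one
    # may be shorter, but only trajectories with at least 2 records are kept.
    chunks = []
    for start in range(0, len(records), rollout_len):
        chunk = records[start:start + rollout_len]
        if len(chunk) >= 2:
            chunks.append(chunk)
    return chunks
```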
- Regularizing Action Policies for Smooth Control implementation based on Soft Actor-Critic
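The paper adds temporal and spatial smoothness penalties to the actor loss; a hedged sketch assuming `policy` returns mean actions (the coefficients and noise scale below are illustrative):

```python
import tensorflow as tf

def smoothness_penalty(policy, states, next_states, sigma=0.05,
                       lambda_t=1.0, lambda_s=1.0):
    a = policy(states)
    a_next = policy(next_states)                   # temporal: consecutive states
    a_near = policy(states + tf.random.normal(tf.shape(states), stddev=sigma))
    temporal = tf.reduce_mean(tf.reduce_sum(tf.square(a - a_next), axis=-1))
    spatial = tf.reduce_mean(tf.reduce_sum(tf.square(a - a_near), axis=-1))
    return lambda_t * temporal + lambda_s * spatial   # added to the SAC actor loss
```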
- Active Dendrites networks implementation (arXiv paper)
- Modified LunarLander environment: LunarLander multitask. This implementation has two tasks: the original landing task and a new one, lift-off. The latter requires the lander to fly off from the landing pad.
- Active Dendrites and k-Winner-Takes-All layers (sketched after this list)
- The training script
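A simplified sketch of the two layer types; the gating here takes the sigmoid of the strongest dendritic segment response, and class and argument names are illustrative rather than the repository's actual API:

```python
import tensorflow as tf

class KWinnersTakeAll(tf.keras.layers.Layer):
    """Keeps the k largest activations per sample and zeroes the rest."""
    def __init__(self, k, **kwargs):
        super().__init__(**kwargs)
        self.k = k

    def call(self, x):
        kth_largest = tf.math.top_k(x, k=self.k).values[..., -1:]
        return tf.where(x >= kth_largest, x, tf.zeros_like(x))

class ActiveDendritesDense(tf.keras.layers.Layer):
    """Dense layer whose units are gated by dendritic segments attending to a
    context vector (e.g. a task embedding)."""
    def __init__(self, units, num_segments, **kwargs):
        super().__init__(**kwargs)
        self.dense = tf.keras.layers.Dense(units)
        self.units, self.num_segments = units, num_segments

    def build(self, input_shape):
        _, context_shape = input_shape            # expects inputs = (features, context)
        self.segments = self.add_weight(
            name='segments',
            shape=(self.units, self.num_segments, int(context_shape[-1])))

    def call(self, inputs):
        x, context = inputs
        feed = self.dense(x)                                       # (batch, units)
        seg = tf.einsum('bc,usc->bus', context, self.segments)     # segment responses
        gate = tf.sigmoid(tf.reduce_max(seg, axis=-1))             # strongest segment per unit
        return feed * gate
```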