Efficient_Off_Policy_Meta_Reinforcement_Learning_via_Probabilistic_Context_Variables
-
Uses a latent variable to summarize the data collected so far into sufficient statistics about the current task.
-
The distribution over the latent variable conditioned on collected data is stochastic to encourage temporally correlated exploration.
-
The architecture for the distribution is permutation-invariant wrt its input. (VERY NEAT!)
-
Demonstrates that, compared to alternatives, it is better to train the latent variable distribution on recently collected data while training the actor-critic on data sampled uniformly from the experience buffer.
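A minimal sketch (not the authors' code) of the permutation-invariant context encoder noted above: each transition in the context is mapped to a Gaussian factor, and the factors are combined by a product of Gaussians. Because the combination is a sum over per-transition precisions, the posterior does not depend on the order of the transitions. The one-layer encoder weights `W`, `b` and all shapes are hypothetical placeholders.

```python
import jax
import jax.numpy as jnp

def encode_transition(transition, W, b):
    """Map one (s, a, r, s') transition vector to the (mu, log_var) of a Gaussian factor."""
    h = jnp.tanh(W @ transition + b)          # toy one-layer encoder
    d = h.shape[0] // 2
    return h[:d], h[d:]                        # mu_n, log_var_n

def product_of_gaussians(mus, log_vars):
    """Combine per-transition factors; invariant to permutation of the context."""
    precisions = jnp.exp(-log_vars)            # 1 / sigma_n^2
    precision = jnp.sum(precisions, axis=0)    # sum over transitions (order-invariant)
    mu = jnp.sum(precisions * mus, axis=0) / precision
    return mu, 1.0 / precision                 # posterior mean and variance over z

# Toy usage: encode a context of N=5 transitions and combine the factors.
key = jax.random.PRNGKey(0)
context = jax.random.normal(key, (5, 8))       # 5 transitions, 8-dim each
W, b = jax.random.normal(key, (4, 8)), jnp.zeros(4)
mus, log_vars = jax.vmap(lambda c: encode_transition(c, W, b))(context)
mu_z, var_z = product_of_gaussians(mus, log_vars)
# Sampling z ~ N(mu_z, var_z) and holding it fixed gives the posterior-sampling exploration.
```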
Meta_Gradient_Reinforcement_Learning
-
Proposes to learn the value of the hyper-parameters lambda and gamma, which parameterize the return function. These are now referred to as meta-parameters.
-
Online cross-validation is used: the meta-objective is evaluated on the subsequent sample of experience, so no extra data is needed to train the meta-parameters.
-
The meta-parameters are trained by gradient descent; the meta-gradient can be obtained in closed form with some approximations (see the sketch below).
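A minimal toy sketch of the meta-gradient idea (my own reconstruction, not the paper's full algorithm): take an inner gradient step on a value function using a return parameterised by gamma, then evaluate the updated value function on a held-out, subsequent sample (the online cross-validation step) with a fixed reference gamma, and differentiate that outer loss with respect to gamma. The linear value function and all data below are hypothetical placeholders.

```python
import jax
import jax.numpy as jnp

def discounted_return(rewards, gamma):
    # G = sum_t gamma^t r_t, differentiable with respect to gamma
    return jnp.sum((gamma ** jnp.arange(rewards.shape[0])) * rewards)

def inner_update(theta, phi, rewards, gamma, alpha=0.1):
    # one step on the train sample: regress v(s) = phi . theta onto the gamma-return
    loss = lambda th: (jnp.dot(phi, th) - discounted_return(rewards, gamma)) ** 2
    return theta - alpha * jax.grad(loss)(theta)

def meta_loss(gamma, theta, phi, rewards, phi_val, rewards_val, gamma_ref):
    # evaluate the *updated* parameters on held-out data; the target return uses a
    # fixed reference gamma_ref, as in the paper's online cross-validation
    theta_new = inner_update(theta, phi, rewards, gamma)
    target = discounted_return(rewards_val, gamma_ref)
    return (jnp.dot(phi_val, theta_new) - target) ** 2

# toy data and one meta-update on gamma
phi, phi_val = jnp.array([1.0, 0.5]), jnp.array([0.8, 0.2])
rewards, rewards_val = jnp.array([1.0, 0.0, 1.0]), jnp.array([0.0, 1.0, 1.0])
theta, gamma, gamma_ref, meta_lr = jnp.zeros(2), 0.9, 0.99, 0.01

g = jax.grad(meta_loss)(gamma, theta, phi, rewards, phi_val, rewards_val, gamma_ref)
gamma = gamma - meta_lr * g    # gradient step on the meta-parameter
```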
Exploiting_Hierarchy_for_Learning_and_Transfer_in_KL-regularized_RL
-
The policy has a hierarchical structure, comprising a high-level policy, which is agnostic to low-level control and instructs a low-level policy through a latent variable.
-
The objective function includes a KL regularization term to ensure the agent's policy does not stray too far from a default policy, which can be fixed or learnt.
-
Restricting which observations are available to the high-level versus the low-level policy (information asymmetry) leads to more robust behavior in the transfer setting.
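A minimal sketch (hypothetical shapes, toy numbers) of the KL-regularised objective with the hierarchical policy: the high-level policy emits a Gaussian over the latent z, and the per-step reward is augmented with -alpha * KL(pi_high || pi_default), which keeps the agent close to the (fixed or learnt) default policy.

```python
import jax.numpy as jnp

def kl_diag_gaussians(mu_p, log_std_p, mu_q, log_std_q):
    """KL( N(mu_p, std_p^2) || N(mu_q, std_q^2) ) for diagonal Gaussians."""
    var_p, var_q = jnp.exp(2 * log_std_p), jnp.exp(2 * log_std_q)
    return jnp.sum(log_std_q - log_std_p + (var_p + (mu_p - mu_q) ** 2) / (2 * var_q) - 0.5)

def kl_regularised_reward(reward, mu_hi, log_std_hi, mu_def, log_std_def, alpha=0.1):
    """Per-step reward minus the KL between the high-level and default latent policies."""
    return reward - alpha * kl_diag_gaussians(mu_hi, log_std_hi, mu_def, log_std_def)

# Example: the high-level latent distribution is pulled toward the default's.
r_aug = kl_regularised_reward(
    reward=1.0,
    mu_hi=jnp.array([0.3, -0.1]), log_std_hi=jnp.array([-1.0, -1.0]),
    mu_def=jnp.zeros(2),          log_std_def=jnp.zeros(2),
)
```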
MCP_Learning_Composable_Hierarchical_Control_with_Multiplicative_Compositional_Policies
-
The policy consists of multiple primitive policies (Gaussians), which are combined multiplicatively through a learned gating function.
-
The primitives are pre-trained on motion imitation tasks and then reused, with a new gating function, to transfer to tasks with different goals.
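A minimal sketch (hypothetical shapes, not the authors' code) of multiplicative composition for Gaussian primitives: the composite policy is proportional to prod_i pi_i(a|s)^{w_i(s,g)}, which for Gaussian primitives is again a Gaussian whose precision is the weight-scaled sum of the primitive precisions.

```python
import jax.numpy as jnp

def compose_primitives(mus, sigmas, weights):
    """mus, sigmas: (K, action_dim) primitive means/stds; weights: (K,) non-negative gates."""
    scaled_prec = weights[:, None] / (sigmas ** 2)   # w_i / sigma_i^2
    precision = jnp.sum(scaled_prec, axis=0)          # composite precision
    mu = jnp.sum(scaled_prec * mus, axis=0) / precision
    return mu, 1.0 / jnp.sqrt(precision)              # composite mean and std

# Example: 3 primitives over a 2-D action space; the gate (a goal-conditioned network
# in the paper, fixed numbers here) favours the first primitive.
mus = jnp.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
sigmas = jnp.ones((3, 2))
weights = jnp.array([0.7, 0.2, 0.1])
mu_c, std_c = compose_primitives(mus, sigmas, weights)
```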
Learning_Modular_Neural_Network_Policies_for_Multi_Task_and_Multi_Robot_Transfer
-
The policy consists of a task-specific module and a robot-specific module.
-
The modules are trained jointly across many robot-task combinations so that task modules are robot-agnostic and robot modules are task-agnostic.
-
At test time, the corresponding task and robot modules are combined, demonstrating zero-shot generalization to unseen robot-task combinations.
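A minimal sketch (hypothetical parameters and shapes) of the module composition: the task module maps the task observation to an intermediate representation, and the robot module maps that representation plus the robot's own state to an action; at test time an unseen (task module, robot module) pair can be composed.

```python
import jax.numpy as jnp

def task_module(task_obs, params):
    W, b = params
    return jnp.tanh(W @ task_obs + b)            # robot-agnostic task representation

def robot_module(task_repr, robot_state, params):
    W, b = params
    x = jnp.concatenate([task_repr, robot_state])
    return jnp.tanh(W @ x + b)                    # action for this particular robot

# Zero-shot composition of a task module and a robot module never trained together.
task_params  = (jnp.ones((4, 6)) * 0.1, jnp.zeros(4))
robot_params = (jnp.ones((2, 4 + 3)) * 0.1, jnp.zeros(2))
action = robot_module(task_module(jnp.ones(6), task_params), jnp.ones(3), robot_params)
```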
Learning_Invariant_Feature_Spaces_to_Transfer_Skills_with_Reinforcement_Learning
-
Learns a feature space over observations that is invariant to the specific robot.
-
Skills are transferred from a source agent to a target agent with a different morphology by rewarding the target for tracking the source's trajectory in the shared feature space.
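A minimal sketch (hypothetical encoders and shapes) of that transfer reward: each robot has its own encoder into the shared, morphology-invariant feature space, and the target robot is rewarded for matching the source robot's features at the corresponding step.

```python
import jax.numpy as jnp

def embed(state, W):
    return jnp.tanh(W @ state)                    # learned, robot-specific encoder

def transfer_reward(target_state, source_state, W_target, W_source):
    # negative squared distance in the shared feature space
    return -jnp.sum((embed(target_state, W_target) - embed(source_state, W_source)) ** 2)

# Example with toy encoders for a 3-dim source state and a 4-dim target state.
W_source, W_target = jnp.ones((2, 3)) * 0.2, jnp.ones((2, 4)) * 0.2
r = transfer_reward(jnp.ones(4), jnp.ones(3), W_target, W_source)
```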
Learning_to_Reinforcement_Learn
-
The policy is a recurrent NN.
-
The key idea is that the previous action and reward are fed to the RNN as additional inputs at the current timestep, so the recurrent state can adapt the policy within a task.
-
Experiments on different types of bandits and MDPs demonstrate different aspects of meta-RL.
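A minimal sketch (hypothetical sizes, hand-rolled vanilla RNN cell rather than the paper's LSTM) of the input construction: at each timestep the network receives the current observation together with the previous action (one-hot) and previous reward, so the hidden state can carry task statistics across the episode.

```python
import jax
import jax.numpy as jnp

def rnn_step(h, obs, prev_action, prev_reward, n_actions, params):
    W_x, W_h, b = params
    x = jnp.concatenate([obs,
                         jax.nn.one_hot(prev_action, n_actions),
                         jnp.array([prev_reward])])
    return jnp.tanh(W_x @ x + W_h @ h + b)        # new hidden state -> policy/value heads

# Toy sizes: 5-dim observation, 3 actions, 8-dim hidden state.
params = (jnp.ones((8, 5 + 3 + 1)) * 0.1, jnp.eye(8) * 0.5, jnp.zeros(8))
h = rnn_step(jnp.zeros(8), jnp.ones(5), prev_action=1, prev_reward=0.0,
             n_actions=3, params=params)
```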
Meta_World_A_Benchmark_and_Evaluation_for_Multi_Task_and_Meta_Reinforcement_Learning