Uses a latent variable to summarize the data collected so far into sufficient statistics about the current task.
The distribution over the latent variable conditioned on collected data is stochastic to encourage temporally correlated exploration.
The architecture for the distribution is permutation-invariant wrt its input. (VERY NEAT!)
Demonstrates that, compared to alternatives, it is better to train the latent variable distribution using recently collected data and to train the actor-critic with data sampled uniformly from the experience buffer.
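One way to obtain a stochastic, permutation-invariant distribution over the latent variable is to let each collected transition contribute an independent Gaussian factor and multiply the factors together. The sketch below assumes this product-of-Gaussians form and a 1-D latent; the per-transition (mean, variance) pairs are made-up stand-ins for encoder outputs, and `product_of_gaussians` is a hypothetical helper name.

```python
import random


def product_of_gaussians(mus, variances):
    """Combine 1-D Gaussian factors N(mu_i, var_i) into a single Gaussian.

    The product of Gaussian densities is, up to normalization, a Gaussian
    whose precision is the sum of the factor precisions. Sums do not depend
    on order, so the result is permutation-invariant in the factors.
    """
    precisions = [1.0 / v for v in variances]
    var = 1.0 / sum(precisions)
    mu = var * sum(m * p for m, p in zip(mus, precisions))
    return mu, var


# Hypothetical per-transition encoder outputs: (mean, variance) of each factor.
factors = [(0.5, 1.0), (1.5, 0.5), (-0.2, 2.0)]
mu, var = product_of_gaussians(*zip(*factors))

# Shuffling the collected transitions leaves the distribution unchanged.
rng = random.Random(0)
shuffled = factors[:]
rng.shuffle(shuffled)
mu_shuffled, var_shuffled = product_of_gaussians(*zip(*shuffled))

# The latent is sampled rather than set to the mean, so the distribution
# stays stochastic (assumption: one sample is held fixed for an episode).
z = rng.gauss(mu, var ** 0.5)
```

Holding one sampled `z` fixed for a whole episode is what would make the resulting exploration temporally correlated.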
Proposes to learn the values of the hyper-parameters lambda and gamma, which parameterize the return function. These are now referred to as meta-parameters.
Online cross-validation is used to ensure that no extra data is needed to train the meta-parameters.
The meta-parameters are trained using gradients, which can be obtained in closed form with approximations.
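To make concrete how lambda and gamma parameterize the return, here is a minimal sketch of the standard lambda-return recursion. It only computes the return for fixed meta-parameters; the meta-gradient update and the online cross-validation step are not reproduced. The function name and the toy numbers are illustrative assumptions.

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """Lambda-returns for one trajectory, computed backwards in time.

    rewards[t]     : reward received after the action at step t
    next_values[t] : value estimate v(s_{t+1}) of the state reached at step t
    Recursion: G_t = r_t + gamma * ((1 - lam) * v(s_{t+1}) + lam * G_{t+1}),
    bootstrapping from the final value estimate at the end of the trajectory.
    """
    g = next_values[-1]
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        returns[t] = g
    return returns


rewards = [1.0, 1.0, 1.0]
next_values = [0.5, 0.5, 0.0]  # toy value estimates; the last state is terminal

td_targets = lambda_returns(rewards, next_values, gamma=0.9, lam=0.0)
mc_returns = lambda_returns(rewards, next_values, gamma=0.9, lam=1.0)
```

With `lam=0.0` the recursion reduces to one-step TD targets, and with `lam=1.0` to the discounted Monte Carlo return, so the two meta-parameters control both the horizon and the degree of bootstrapping.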
The policy has a hierarchical structure comprising a high-level policy, which is agnostic to low-level control, and a low-level policy, which receives instructions from the high-level policy through a latent variable.
The objective function includes a KL regularization term to ensure the agent's policy does not stray too far from a default policy, which can be fixed or learnt.
Restricting information to either the high-level policy or the low-level policy leads to more robust behavior in the transfer setting.
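The KL regularization term can be illustrated with a small sketch for discrete action distributions. It assumes a per-step objective of the form reward minus a weighted KL between the agent's policy and the default policy; the weight `alpha` and both function names are hypothetical, and the hierarchical, latent-variable form of the objective is not reproduced.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def regularized_reward(reward, policy_probs, default_probs, alpha):
    """Reward penalized by how far the policy strays from the default policy.

    alpha is a hypothetical regularization weight, not a value from the notes.
    """
    return reward - alpha * kl_divergence(policy_probs, default_probs)


default_policy = [0.5, 0.5]
close_policy = [0.5, 0.5]  # identical to the default: no penalty
far_policy = [0.9, 0.1]    # strays from the default: penalized

same = regularized_reward(1.0, close_policy, default_policy, alpha=0.1)
penalized = regularized_reward(1.0, far_policy, default_policy, alpha=0.1)
```

A policy that matches the default keeps its full reward, while one that deviates pays a penalty that grows with the divergence.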
The policy consists of multiple primitive policies, which are combined multiplicatively.
The policy is trained to perform well on motion imitation tasks, and then transfer to tasks with different goals.
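Multiplicative combination of primitives has a simple closed form when each primitive is a Gaussian over actions. The sketch below assumes 1-D Gaussian primitives with non-negative weights acting as exponents on each density; `compose_primitives` and the numbers are illustrative, not taken from the notes.

```python
def compose_primitives(mus, sigmas, weights):
    """Compose 1-D Gaussian primitives as prod_i N(mu_i, sigma_i^2) ** w_i.

    Raising a Gaussian to the power w scales its precision by w, and
    multiplying Gaussians adds precisions, so the composite is again
    Gaussian with a precision-weighted mean.
    """
    precisions = [w / (s * s) for w, s in zip(weights, sigmas)]
    total = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, mus)) / total
    std = (1.0 / total) ** 0.5
    return mean, std


# Two hypothetical primitives that prefer opposite actions.
mus, sigmas = [-1.0, 1.0], [1.0, 1.0]

blended = compose_primitives(mus, sigmas, weights=[1.0, 1.0])
first_only = compose_primitives(mus, sigmas, weights=[1.0, 0.0])
```

Equal weights blend the two primitives into a sharper distribution centered between them, whereas a zero weight switches a primitive off entirely; this is the main difference from additively mixing primitives.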
The policy consists of task-specific and robot-specific modules.
Task-specific and robot-specific modules are trained to be invariant to a specific robot-task combination.
At test time, the corresponding task and robot modules are combined, demonstrating zero-shot capability.
Learn a feature space for the observation that is invariant to the specific robots.
Try to do imitation learning to transfer from a source agent to a target agent having different morphologies.
The policy is a recurrent NN.
The key idea is that the previous reward and action are fed into the RNN as inputs at the current timestep.
Perform experiments on different types of bandits and MDPs to demonstrate different aspects of meta-RL.
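The recurrent-policy input described above can be sketched as follows: the observation is concatenated with a one-hot encoding of the previous action and the previous reward before each RNN step. This is a minimal illustration with a plain tanh RNN cell and random weights; the cell type, the dimensions, and all function names are assumptions.

```python
import math
import random


def one_hot(index, size):
    """One-hot encode a discrete action index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_input(obs, prev_action, prev_reward, num_actions):
    """Concatenate observation, previous action (one-hot), and previous reward."""
    return list(obs) + one_hot(prev_action, num_actions) + [float(prev_reward)]


def rnn_step(x, h, w_xh, w_hh):
    """One step of a plain tanh RNN: h' = tanh(W_xh x + W_hh h)."""
    new_h = []
    for j in range(len(h)):
        pre = sum(w * xi for w, xi in zip(w_xh[j], x))
        pre += sum(w * hi for w, hi in zip(w_hh[j], h))
        new_h.append(math.tanh(pre))
    return new_h


rng = random.Random(0)
num_actions, obs_dim, hidden = 3, 2, 4
in_dim = obs_dim + num_actions + 1  # observation + one-hot action + reward
w_xh = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
w_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(hidden)]

h = [0.0] * hidden
x = build_input([0.1, 0.2], prev_action=1, prev_reward=1.0, num_actions=num_actions)
h = rnn_step(x, h, w_xh, w_hh)
```

Because the hidden state `h` is carried across steps and sees past rewards and actions, it can accumulate information about the current task without any change to the weights.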