Model Predictive Actor-Critic Reinforcement Learning

Supplemental code for the ICRA 2021 paper on MoPAC. Video.

MOPAC showcase gif

Abstract

Substantial advancements to model-based reinforcement learning algorithms have been impeded by the model bias induced by the collected data, which generally hurts performance. Meanwhile, their inherent sample efficiency warrants utility for most robot applications, limiting potential damage to the robot and its environment during training. Inspired by information theoretic model predictive control and advances in deep reinforcement learning, we introduce Model Predictive Actor-Critic (MoPAC), a hybrid model-based/model-free method that combines model predictive rollouts with policy optimization so as to mitigate model bias. MoPAC leverages optimal trajectories to guide policy learning, but explores via its model-free method, allowing the algorithm to learn more expressive dynamics models. This combination guarantees optimal skill learning up to an approximation error and reduces necessary physical interaction with the environment, making it suitable for real-robot training. We provide extensive results showcasing how our proposed method generally outperforms the current state of the art and conclude by evaluating MoPAC for learning on a physical robotic hand performing valve rotation and finger gaiting--a task that requires grasping, manipulation, and then regrasping of an object.

Reference

@inproceedings{morgan2021model,
  author = {Andrew Morgan and Daljeet Nandha and Georgia Chalvatzaki and Carlo D'Eramo and Aaron Dollar and Jan Peters},
  title = {Model Predictive Actor-Critic: Accelerating Robot Skill Acquisition with Deep Reinforcement Learning},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year = {2021}
}

Installation

  1. Install MuJoCo (https://www.roboti.us/index.html) at ~/.mujoco/mujoco200 and copy your license key to ~/.mujoco/mjkey.txt
  2. Clone mopac
git clone --recursive https://github.com/dnandha/mopac.git
  3. Create a conda environment and install mopac
cd mopac
conda env create -f environment.yml
conda activate mopac
pip install -e .
pip install -e viskit
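
Before creating the environment, it can help to confirm that the MuJoCo files from step 1 are where the instructions expect them. The following is a minimal sketch, not part of the repository; the function name check_mujoco_layout is hypothetical, and it only checks that the two paths exist, not that the license is valid.

```python
from pathlib import Path


def check_mujoco_layout(home=None):
    """Return the expected MuJoCo paths that are missing under `home`."""
    home = Path(home) if home is not None else Path.home()
    expected = [
        home / ".mujoco" / "mujoco200",  # MuJoCo install directory from step 1
        home / ".mujoco" / "mjkey.txt",  # license key from step 1
    ]
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    missing = check_mujoco_layout()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("MuJoCo directory and license key found.")
```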

Usage

Configuration files can be found in mopac/examples/config/.

Running locally:

mopac run_local mopac.examples.development --config=mopac.examples.config.${envname}.0 --gpus=1 --trial-gpus=1

Running on cluster:

ray start --block --head --redis-port=6379 --temp-dir=${ray_tmp_dir} &
mopac run_example_cluster mopac.examples.development --config=mopac.examples.config.${envname}.0

Restoring a checkpoint locally:

mopac run_local mopac.examples.development --config=mopac.examples.config.${envname}.0 --gpus=1 --trial-gpus=1 --restore=${path_to_checkpoint_ending_with_slash}
for chkpt in $(find ~/ray_mopac/${env_name}/${exp_name} -name checkpoint_${number}); do mopac run_local mopac.examples.development --config=mopac.examples.config.${envname}.0 --gpus=1 --trial-gpus=1 --restore=${chkpt}/; done
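
The shell loop above can also be sketched in Python, which may be easier to adapt when filtering checkpoints. This is an illustrative sketch only: find_checkpoints and restore_command are hypothetical helper names, and, unlike find -name, the search keeps directories only. The command arguments are copied from the run_local line above.

```python
from pathlib import Path


def find_checkpoints(exp_dir, number):
    """Return sorted directories named checkpoint_<number> under exp_dir."""
    root = Path(exp_dir).expanduser()
    return sorted(p for p in root.rglob(f"checkpoint_{number}") if p.is_dir())


def restore_command(config, checkpoint):
    """Build the run_local argument list for restoring one checkpoint."""
    return [
        "mopac", "run_local", "mopac.examples.development",
        f"--config={config}",
        "--gpus=1", "--trial-gpus=1",
        f"--restore={checkpoint}/",  # the restore path ends with a slash
    ]
```

Each returned list can be printed with " ".join(...) for inspection or passed to subprocess.run to launch the restore.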

Restoring a checkpoint on cluster:

for chkpt in $(find ~/ray_mopac/${env_name}/${exp_name} -name checkpoint_${number}); do export chkpt=${chkpt}/; sbatch -J ${path_to_job_script}; done

The job script must contain the parameter --restore=${chkpt}, e.g. mopac run_example_cluster mopac.examples.development --config=mopac.examples.config.${envname}.0 --restore=${chkpt}.

New environments

To run on a different environment, modify the provided template. You also need to provide the environment's termination function in mopac/static. If the file is named after the lowercase version of the environment name, it will be found automatically. See hopper.py for an example.
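
As a rough illustration of what such a termination file might contain, here is a minimal sketch for a hypothetical environment called MyEnv, which would be saved as mopac/static/myenv.py. The class name StaticFns, the termination_fn(obs, act, next_obs) signature, the observation layout, and the 0.5 threshold are all assumptions made for this example; check hopper.py for the interface the code actually expects.

```python
import numpy as np


class StaticFns:
    """Hypothetical static functions for an environment called MyEnv."""

    @staticmethod
    def termination_fn(obs, act, next_obs):
        """Return a (batch, 1) boolean array, True where the episode ends."""
        # Inputs are assumed to be batched 2-D arrays: (batch, dim)
        assert obs.ndim == act.ndim == next_obs.ndim == 2

        # Assumption: index 0 of the observation is a height-like quantity
        height = next_obs[:, 0]
        finite = np.isfinite(next_obs).all(axis=-1)
        not_done = finite & (height > 0.5)  # 0.5 is an illustrative threshold
        return ~not_done[:, None]
```

The function is batched because model rollouts predict many next states at once, so the termination check has to be applied to each row.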

Logging

This codebase contains viskit as a submodule. You can view saved runs with:

viskit ~/ray_mopac --port 6008

assuming you used the default log_dir.

Hyperparameters

The rollout length schedule is defined by a length-4 list in a config file. The format is [start_epoch, end_epoch, start_length, end_length], so the following:

'rollout_schedule': [20, 100, 5, 15]

corresponds to a model rollout length linearly increasing from 5 to 15 over epochs 20 to 100.

The mix ratio of model-based and model-free samples is defined by a length-4 list in a config file with the same four-entry layout, where the last two entries give the start and end values of the ratio:

'ratio_schedule': [20, 100, 5, 15]
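
The rollout length schedule can be sketched as a clamped linear interpolation over epochs. This is an illustration of the described behavior rather than the repository's code: the function name scheduled_value is hypothetical, and holding the value constant outside the epoch range and truncating to an integer are assumptions.

```python
def scheduled_value(epoch, schedule):
    """Linearly interpolate between start and end values over an epoch range."""
    start_epoch, end_epoch, start_value, end_value = schedule
    if epoch <= start_epoch:
        return start_value  # assumption: hold the start value before the range
    if epoch >= end_epoch:
        return end_value  # assumption: hold the end value after the range
    fraction = (epoch - start_epoch) / (end_epoch - start_epoch)
    # Assumption: rollout lengths are integers, so truncate the result
    return int(start_value + fraction * (end_value - start_value))


rollout_schedule = [20, 100, 5, 15]
print(scheduled_value(0, rollout_schedule))    # 5
print(scheduled_value(60, rollout_schedule))   # 10
print(scheduled_value(100, rollout_schedule))  # 15
```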

If you want to speed up training in terms of wall clock time (but possibly make the runs less sample-efficient), you can set a timeout for model training (max_model_t, in seconds) or train the model less frequently (every model_train_freq steps).
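
To make the two options concrete, here is a minimal sketch of how a training timeout and a training frequency typically behave. It is not the repository's implementation: should_train_model, train_with_timeout, and the train_one_epoch callback are hypothetical names, and only max_model_t and model_train_freq come from the text above.

```python
import time


def should_train_model(step, model_train_freq):
    """True on steps where the model would be retrained."""
    return step % model_train_freq == 0


def train_with_timeout(train_one_epoch, max_epochs, max_model_t=None):
    """Run up to max_epochs epochs, stopping once max_model_t seconds pass."""
    start = time.monotonic()
    epochs_run = 0
    for _ in range(max_epochs):
        # Check the time budget before starting another epoch
        if max_model_t is not None and time.monotonic() - start > max_model_t:
            break
        train_one_epoch()
        epochs_run += 1
    return epochs_run
```

A larger model_train_freq or a smaller max_model_t reduces the time spent fitting the model, at the cost of rollouts being generated from a less up-to-date or less well-fit model.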

Acknowledgements

The underlying soft actor-critic implementation in MoPAC comes from Tuomas Haarnoja and Kristian Hartikainen's softlearning codebase. The modeling code is a slightly modified version of Kurtland Chua's PETS implementation.

This code is an extension of MBPO for model predictive rollouts.