URLS provides a set of unsupervised reinforcement learning algorithms and experiments for researching how unsupervised RL can be applied across a variety of paradigms.
The codebase is based upon URLB and ExORL. Further details are provided in the following papers:
- URLB: Unsupervised Reinforcement Learning Benchmark
- Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning
URLS is intended as a successor to URLB, allowing for a larger number of experiments and RL paradigms.
Install MuJoCo if it is not already installed:
- Download the MuJoCo binaries here.
- Unzip the downloaded archive into `~/.mujoco/`.
- Append the MuJoCo subdirectory `bin` path to the `LD_LIBRARY_PATH` environment variable.
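A minimal shell sketch of these steps, assuming the Linux archive is named `mujoco200_linux.zip` (substitute the file and version directory you actually downloaded):

```sh
# Sketch only: the archive name and version directory are assumptions.
mkdir -p ~/.mujoco
unzip mujoco200_linux.zip -d ~/.mujoco/
# Make the MuJoCo shared libraries visible to the dynamic linker
# (append this line to ~/.bashrc to make it permanent).
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco200_linux/bin
```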
Install the following libraries:
```sh
sudo apt update
sudo apt install libosmesa6-dev libgl1-mesa-glx libglfw3 unzip
```
Install dependencies:
```sh
conda env create -f conda_env.yml
conda activate urls-env
```
We provide the following workflows:
Pre-training: learn from the agent's intrinsic reward on a specific domain.
```sh
python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN
```
Fine-tuning: fine-tune the pre-trained agent on a specific task; the task-specific extrinsic reward is now used for the agent.
```sh
python finetune.py pretrained_agent=UNSUPERVISED_AGENT task=TASK snapshot_ts=TS obs_type=OBS_TYPE
```
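As a concrete illustration of this pre-train/fine-tune workflow, the run below uses example values (ICM on the `walker` domain, `walker_walk` task, a `100000`-step snapshot, state observations); none of these are required choices:

```sh
# Example values only: any agent/domain/task/snapshot combination from this README works.
python pretrain.py agent=icm domain=walker
python finetune.py pretrained_agent=icm task=walker_walk snapshot_ts=100000 obs_type=states
```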
Pre-training: learn from the agent's intrinsic reward on a specific domain.
```sh
python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN
```
Sampling: sample demonstrations from the agent's replay buffer on a specific task.
```sh
python sampling.py agent=UNSUPERVISED_AGENT task=TASK samples=SAMPLES snapshot_ts=TS obs_type=OBS_TYPE
```
Offline learning: learn a policy using the offline data collected on the specific task.
```sh
python train_offline.py agent=OFFLINE_AGENT expl_agent=UNSUPERVISED_AGENT task=TASK
```
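For instance, the offline workflow might be run end to end as follows (agent, task, sample count and snapshot timestep are illustrative values):

```sh
# Example values only: pre-train ICM, sample demonstrations, then learn offline with TD3+BC.
python pretrain.py agent=icm domain=walker
python sampling.py agent=icm task=walker_walk samples=10000 snapshot_ts=100000 obs_type=states
python train_offline.py agent=td3_bc expl_agent=icm task=walker_walk
```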
Pre-training: learn from the agent's intrinsic reward on a specific domain.
```sh
python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN
```
Sampling: sample demonstrations from the agent's replay buffer with constraints and images.
```sh
python sampling.py agent=UNSUPERVISED_AGENT task=TASK samples=SAMPLES snapshot_ts=TS obs_type=OBS_TYPE
```
Trajectories to images: create an image dataset from the sampled trajectories.
```sh
python data_to_images.py --env=DOMAIN
```
Train VAE: train a variational autoencoder (VAE) on the image dataset.
```sh
python train_encoder.py --env=DOMAIN
```
Train MPC: train the LS3 safe model predictive controller on a specific domain.
```sh
python train_mpc.py --env=DOMAIN
```
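Putting the safe-control pipeline together, an end-to-end run could look like the sketch below; the `SimplePointBot` domain and `goal` task come from the environment table further down, while the remaining values are illustrative:

```sh
# Example values only: collect data, build the image dataset, train the VAE, then the LS3 MPC.
python pretrain.py agent=icm domain=SimplePointBot
python sampling.py agent=icm task=goal samples=10000 snapshot_ts=100000 obs_type=states
python data_to_images.py --env=SimplePointBot
python train_encoder.py --env=SimplePointBot
python train_mpc.py --env=SimplePointBot
```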
The following unsupervised reinforcement learning agents are available. Replace `UNSUPERVISED_AGENT` with the agent's command; for example, to use DIAYN, set `UNSUPERVISED_AGENT=diayn`.
Agent | Command | Type | Implementation Author(s) | Paper | Intrinsic Reward |
---|---|---|---|---|---|
ICM | `icm` | Knowledge | Denis | paper | $\Vert g(\mathbf{z}_{t+1} \mid \mathbf{z}_{t}, \mathbf{a}_{t}) - \mathbf{z}_{t+1} \Vert^{2}$ |
Disagreement | `disagreement` | Knowledge | Catherine | paper | $\mathrm{Var}\{ g_{i}(\mathbf{z}_{t+1} \mid \mathbf{z}_{t}, \mathbf{a}_{t}) \}$ |
RND | `rnd` | Knowledge | Kevin | paper | $\Vert g(\mathbf{z}_{t}, \mathbf{a}_{t}) - \tilde{g}(\mathbf{z}_{t}, \mathbf{a}_{t}) \Vert_{2}^{2}$ |
APT(ICM) | `icm_apt` | Data | Hao, Kimin | paper | $\sum_{j \in \mathrm{random}} \log \Vert \mathbf{z}_{t} - \mathbf{z}_{j} \Vert$ |
APT(Ind) | `ind_apt` | Data | Hao, Kimin | paper | $\sum_{j \in \mathrm{random}} \log \Vert \mathbf{z}_{t} - \mathbf{z}_{j} \Vert$ |
ProtoRL | `proto` | Data | Denis | paper | $\sum_{j \in \mathrm{random}} \log \Vert \mathbf{z}_{t} - \mathbf{z}_{j} \Vert$ |
DIAYN | `diayn` | Competence | Misha | paper | |
APS | `aps` | Competence | Hao, Kimin | paper | |
SMM | `smm` | Competence | Albert | paper | |
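For instance, to pre-train with DIAYN using its command from the table (the `walker` domain is just an example value):

```sh
python pretrain.py agent=diayn domain=walker
```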
The following 5 RL procedures are available to learn a policy offline from the unsupervised data. Replace `OFFLINE_AGENT` with the procedure's command; for example, to use behavior cloning, set `OFFLINE_AGENT=bc`.
Offline RL Procedure | Command | Paper |
---|---|---|
Behavior Cloning | `bc` | paper |
CQL | `cql` | paper |
CRR | `crr` | paper |
TD3+BC | `td3_bc` | paper |
TD3 | `td3` | paper |
The following environments with specific domains and tasks are provided. We also provide a wrapper to convert Gym environments to DMC extended time-step types based on DeepMind's acme wrapper.
Environment Type | Domain | Task |
---|---|---|
DeepMind Control | `walker` | `stand`, `walk`, `run`, `flip` |
DeepMind Control | `quadruped` | `walk`, `run`, `stand`, `jump` |
DeepMind Control | `jaco` | `reach_top_left`, `reach_top_right`, `reach_bottom_left`, `reach_bottom_right` |
DeepMind Control | `cheetah` | `run` |
Gym Box2D | `BipedalWalker-v3` | `walk` |
Gym Box2D | `CarRacing-v1` | `race` |
Gym Classic Control | `MountainCarContinuous-v0` | `goal` |
Safe Control | `SimplePointBot` | `goal` |
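Assuming the Gym and safe-control domains are selected the same way as the DMC ones (an assumption, not confirmed by the scripts above), pre-training on a Box2D environment would look something like:

```sh
# Assumed invocation: the Gym environment ID is passed as the domain.
python pretrain.py agent=icm domain=BipedalWalker-v3
```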
The majority of URLS, including the ExORL- and URLB-based code, is licensed under the MIT license; however, portions of the project are available under separate license terms: DeepMind code is licensed under the Apache 2.0 license.