Releases: takuseno/d3rlpy
Release v0.70
Command Line Interface
New commands are added in this version.
record
You can record videos of the evaluation episodes without writing any code.
$ d3rlpy record d3rlpy_logs/CQL_20201224224314/model_100.pt --env-id HopperBulletEnv-v0
# record wrapped environment
$ d3rlpy record d3rlpy_logs/Discrete_CQL_20201224224314/model_100.pt \
--env-header 'import gym; env = d3rlpy.envs.Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'
play
You can run evaluation episodes while rendering images.
# play simple environment
$ d3rlpy play d3rlpy_logs/CQL_20201224224314/model_100.pt --env-id HopperBulletEnv-v0
# play wrapped environment
$ d3rlpy play d3rlpy_logs/Discrete_CQL_20201224224314/model_100.pt \
--env-header 'import gym; env = d3rlpy.envs.Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'
data-point mask for bootstrapping
Ensemble training of Q-functions has been shown to be a powerful way to achieve robust training. Previously, the bootstrap
option was available for algorithms, but the mask for the Q-function loss was randomly re-created every time a batch was sampled.
In this version, a create_mask option is available for MDPDataset and ReplayBuffer, which creates a unique mask at each data point.
# offline training
dataset = d3rlpy.dataset.MDPDataset(observations, actions, rewards, terminals, create_mask=True, mask_size=5)
cql = d3rlpy.algos.CQL(n_critics=5, bootstrap=True, target_reduction_type='none')
cql.fit(dataset)
# online training
buffer = d3rlpy.online.buffers.ReplayBuffer(1000000, create_mask=True, mask_size=5)
sac = d3rlpy.algos.SAC(n_critics=5, bootstrap=True, target_reduction_type='none')
sac.fit_online(env, buffer)
As shown above, target_reduction_type is newly introduced to specify how target Q-values are aggregated across the ensemble. In the standard Soft Actor-Critic, target_reduction_type='min'. If you choose 'none', each ensemble Q-function uses its own target value, which is similar to what Bootstrapped DQN does.
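For instance, a minimal sketch contrasting the two settings, with constructor arguments mirroring the snippet above:
import d3rlpy
# standard Soft Actor-Critic style: take the minimum over the ensemble as the target
sac_min = d3rlpy.algos.SAC(n_critics=2, target_reduction_type='min')
# Bootstrapped-DQN style: each ensemble member bootstraps from its own target
sac_independent = d3rlpy.algos.SAC(n_critics=5, bootstrap=True, target_reduction_type='none')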
better module access
From this version, you can access all modules through the top-level d3rlpy import.
# previously
from d3rlpy.datasets import get_cartpole
dataset = get_cartpole()
# v0.70
import d3rlpy
dataset = d3rlpy.datasets.get_cartpole()
new logger style
From this version, structlog is used internally to print information instead of the raw print function. This allows us to emit more structured information. Furthermore, you can control what is shown and what is saved to the file by overriding the logger configuration.
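As a rough illustration of what overriding the logger configuration could look like, here is a generic structlog setup that renders entries as JSON with timestamps; this is a sketch of plain structlog usage, not d3rlpy's built-in configuration:
import structlog
# emit structured log entries as JSON lines with ISO timestamps
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt='iso'),
        structlog.processors.JSONRenderer(),
    ]
)
logger = structlog.get_logger()
logger.info('epoch_finished', epoch=1, td_error=0.5)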
enhancements
- soft_q_backup option is added to CQL
- Paper Reproduction page has been added to the documentation in order to show the performance with the paper configuration
- commit method at D3RLPyLogger returns metrics (thanks, @jamartinh)
bugfix
- fix epoch count in offline training
- fix total_step count in online training
- fix typos in documentation (thanks, @pstansell)
Release v0.61
CLI
The record command is newly introduced in this version. You can record videos of evaluation episodes with the saved model.
$ d3rlpy record d3rlpy_logs/CQL_20210131144357/model_100.pt --env-id Hopper-v2
You can also use the wrapped environment.
$ d3rlpy record d3rlpy_logs/DQN_online_20210130170041/model_1000.pt \
--env-header 'import gym; from d3rlpy.envs import Atari; env = Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'
bugfix
- fix saving models every step in fit_online method
- fix Atari wrapper to reproduce the paper result
- fix CQL and BEAR algorithms
Release v0.60
logo
New logo images are made for d3rlpy 🎉
(standard and inverted logo images)
ActionScaler
ActionScaler
provides action scaling pre/post-processing for continuous control algorithms. Previously, actions had to be within [-1.0, 1.0]. From now on, you don't need to care about the range of actions.
from d3rlpy.algos import CQL
cql = CQL(action_scaler='min_max') # just pass the action_scaler argument
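For intuition, a 'min_max' action scaler of this kind typically maps dataset actions into [-1.0, 1.0] with the standard min-max transform sketched below; this illustrates the idea rather than the exact implementation:
# forward transform applied to dataset actions
scaled_action = 2.0 * (action - action_min) / (action_max - action_min) - 1.0
# inverse transform applied to the policy output before it reaches the environment
action = (scaled_action + 1.0) / 2.0 * (action_max - action_min) + action_min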
handling timeout episodes
Episodes terminated by timeouts should not be treated as true terminations when bootstrapping. From this version, you can specify episode boundaries separately from the terminal flags.
from d3rlpy.dataset import MDPDataset
observations = ...
actions = ...
rewards = ...
terminals = ... # this indicates the environmental termination
episode_terminals = ... # this indicates episode boundaries
dataset = MDPDataset(observations, actions, rewards, terminals, episode_terminals)
# if episode_terminals is omitted, terminals will be used to determine episode boundaries
# dataset = MDPDataset(observations, actions, rewards, terminals)
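As a concrete toy example (hypothetical values), an episode cut off by a timeout ends without an environmental termination, so the terminal flag stays zero while the episode boundary is still marked:
import numpy as np
from d3rlpy.dataset import MDPDataset
# 4-step toy trajectory that is cut off by a time limit at the last step
observations = np.random.random((4, 3))
actions = np.random.random((4, 1))
rewards = np.random.random(4)
terminals = np.array([0.0, 0.0, 0.0, 0.0])          # the environment never signaled termination
episode_terminals = np.array([0.0, 0.0, 0.0, 1.0])  # but the episode ends here due to the timeout
dataset = MDPDataset(observations, actions, rewards, terminals, episode_terminals)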
In online training, you can specify this option via timelimit_aware
flag.
import gym
from d3rlpy.algos import SAC
env = gym.make('Hopper-v2') # make sure the environment is wrapped by gym.wrappers.TimeLimit
sac = SAC()
sac.fit_online(env, timelimit_aware=True) # this flag is True by default
reference: https://arxiv.org/abs/1712.00378
batch online training
When training with computationally expensive environments such as robotics simulators or rich 3D games, training can take a long time because environment steps are slow.
To address this, d3rlpy supports batch online training.
import gym
from d3rlpy.algos import SAC
from d3rlpy.envs import AsyncBatchEnv
if __name__ == '__main__': # this is necessary if you use AsyncBatchEnv
env = AsyncBatchEnv([lambda: gym.make('Hopper-v2') for _ in range(10)]) # distributing 10 environments in different processes
sac = SAC(use_gpu=True)
sac.fit_batch_online(env) # train with 10 environments concurrently
docker image
A pre-built d3rlpy Docker image is available on Docker Hub.
$ docker run -it --gpus all --name d3rlpy takuseno/d3rlpy:latest bash
enhancements
- BEAR algorithm is updated based on the official implementation
  - new mmd_kernel option is available
- new to_mdp_dataset method is added to ReplayBuffer
- ConstantEpsilonGreedy explorer is added
- d3rlpy.envs.ChannelFirst wrapper is added (thanks for reporting, @feyza-droid)
- new dataset utility function d3rlpy.datasets.get_d4rl is added (see the sketch after this list)
  - timeouts are handled inside the function
- offline RL paper reproduction codes are added
- smoothed moving average plot at d3rlpy plot CLI function (thanks, @pstansell)
- user-friendly messages for assertion errors
- better memory consumption
- save_interval argument is added to fit_online
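A minimal sketch of how a couple of these utilities might be used together, assuming the d4rl package and an Atari environment are installed locally:
import gym
from d3rlpy.datasets import get_d4rl
from d3rlpy.envs import ChannelFirst
# load a D4RL dataset; timeouts are handled inside the function
dataset, env = get_d4rl('hopper-medium-v0')
# wrap an image-based environment so that observations become channel-first
atari_env = ChannelFirst(gym.make('BreakoutNoFrameskip-v4'))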
bugfix
- core dumps are fixed in Google Colaboratory tutorials
- typos in some documentation pages are fixed (thanks for reporting, @pstansell)
Release v0.51
minor fix
- add typing-extensions dependency
- update MANIFEST.in
Release v0.50
typing
Now, d3rlpy is fully type-annotated, not only for better use of this library but also for a better contribution experience.
- mypy and pylint check type consistency and code quality
- due to the large number of changes required to add type annotations, there might be regressions that are not detected by the linters
CLI
v0.50 introduces the new command-line interface: the d3rlpy command, which helps you do more without any effort. For now, d3rlpy provides the following commands.
# plot CSV data
$ d3rlpy plot d3rlpy_logs/XXX/YYY.csv
# plot all CSV data under the directory
$ d3rlpy plot-all d3rlpy_logs/XXX
# export the saved model as inference formats (e.g. ONNX, TorchScript)
$ d3rlpy export d3rlpy_logs/XXX/model_YYY.pt
enhancements
- faster CPU-to-GPU transfer
  - this change makes online training roughly 2x faster
- make the IQN Q-function more precise based on the paper
documentation
- Add documentation about the SB3 integration (thanks, @araffin)
Release v0.41
Algorithm
- Policy in Latent Action Space (PLAS)
Off-Policy Evaluation
Off-policy evaluation (OPE) is a method to evaluate policy performance using only an offline dataset.
# train policy
from d3rlpy.algos import CQL
from d3rlpy.datasets import get_pybullet
dataset, env = get_pybullet('hopper-bullet-mixed-v0')
cql = CQL()
cql.fit(dataset.episodes)
# Off-Policy Evaluation
from d3rlpy.ope import FQE
from d3rlpy.metrics.scorer import soft_opc_scorer
from d3rlpy.metrics.scorer import initial_state_value_estimation_scorer
fqe = FQE(algo=cql)
fqe.fit(dataset.episodes,
        eval_episodes=dataset.episodes,
        scorers={
            'soft_opc': soft_opc_scorer(1000),
            'init_value': initial_state_value_estimation_scorer
        })
- Fitted Q-Evaluation
Q Function Factory
d3rlpy provides flexible controls over Q-functions through the Q function factory. Following this change, the previous q_func_type argument has been renamed to q_func_factory.
from d3rlpy.algos import DQN
from d3rlpy.q_functions import QRQFunctionFactory
# initialize Q function factory
q_func_factory = QRQFunctionFactory(n_quantiles=32)
# give it to algorithm object
dqn = DQN(q_func_factory=q_func_factory)
You can also pass the Q function name as a string.
dqn = DQN(q_func_factory='qr')
You can also make your own Q function factory; see the documentation for the list of currently supported Q function factories.
EncoderFactory
- DenseNet architecture (only for vector observation)
from d3rlpy.algos import DQN
dqn = DQN(encoder_factory='dense')
N-step TD calculation
d3rlpy supports N-step TD calculation for ALL algorithms. You can pass the n_steps argument to configure this parameter.
from d3rlpy.algos import DQN
dqn = DQN(n_steps=5) # n_steps=1 by default
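For reference, the N-step target generalizes the one-step TD target by accumulating N discounted rewards before bootstrapping; in standard textbook form (shown for intuition, not as d3rlpy's exact implementation):
y_t = \sum_{k=0}^{N-1} \gamma^k r_{t+k} + \gamma^N Q_{\theta'}(s_{t+N}, a_{t+N})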
Paper reproduction scripts
d3rlpy supports many algorithms across both online and offline paradigms. Originally, d3rlpy was designed for industrial practitioners, but academic research remains important to push deep reinforcement learning forward. Currently, online DQN-variant reproduction codes are available.
The evaluation results will also be available soon.
enhancements
- build_with_dataset and build_with_env methods are added to algorithm objects (see the sketch after this list)
- shuffle flag is added to fit method (thanks, @jamartinh)
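A minimal sketch of how building a model without training could look, e.g. to restore saved weights before evaluation; the environment name and model path below are hypothetical:
import gym
from d3rlpy.algos import DQN
env = gym.make('CartPole-v0')
# build the underlying networks from the environment's spaces without calling fit
dqn = DQN()
dqn.build_with_env(env)
# hypothetical path: weights saved by a previous run can now be loaded
dqn.load_model('d3rlpy_logs/DQN_XXX/model_YYY.pt')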
Release v0.40
Algorithms
- Support the discrete version of Soft Actor-Critic
- fit_online has an n_steps argument instead of n_epochs for the complete reproduction of the papers.
OptimizerFactory
d3rlpy provides more flexible controls for optimizer configuration via OptimizerFactory
.
from d3rlpy.optimizers import AdamFactory
from d3rlpy.algos import DQN
dqn = DQN(optim_factory=AdamFactory(weight_decay=1e-4))
See more at https://d3rlpy.readthedocs.io/en/v0.40/references/optimizers.html .
EncoderFactory
d3rlpy provides more flexible controls for the neural network architecture via EncoderFactory
.
from d3rlpy.algos import DQN
from d3rlpy.encoders import VectorEncoderFactory
# encoder factory
encoder_factory = VectorEncoderFactory(hidden_units=[300, 400], activation='tanh')
# give it to the algorithm object
dqn = DQN(encoder_factory=encoder_factory)
You can also build your own encoder.
import torch
import torch.nn as nn
from d3rlpy.encoders import EncoderFactory
# your own neural network
class CustomEncoder(nn.Module):
    def __init__(self, observation_shape, feature_size):
        super().__init__()
        self.feature_size = feature_size
        self.fc1 = nn.Linear(observation_shape[0], 64)
        self.fc2 = nn.Linear(64, feature_size)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = torch.relu(self.fc2(h))
        return h

    # THIS IS IMPORTANT!
    def get_feature_size(self):
        return self.feature_size
# your own encoder factory
class CustomEncoderFactory(EncoderFactory):
    TYPE = 'custom'  # this is necessary

    def __init__(self, feature_size):
        self.feature_size = feature_size

    def create(self, observation_shape, action_size=None, discrete_action=False):
        return CustomEncoder(observation_shape, self.feature_size)

    def get_params(self, deep=False):
        return {
            'feature_size': self.feature_size
        }
dqn = DQN(encoder_factory=CustomEncoderFactory(feature_size=64))
See more at https://d3rlpy.readthedocs.io/en/v0.40/references/network_architectures.html .
Stable Baselines 3 wrapper
- Now d3rlpy is partially compatible with Stable Baselines 3.
- More documentation will be available soon.
bugfix
- fix the memory leak problem at fit_online
  - now you can train online algorithms with a big replay buffer size for image observations
- fix preprocessing at CQL.
- fix ColorJitter augmentation.
installation
PyPi
- From this version, d3rlpy officially supports Windows.
- The binary packages for each platform are built and uploaded via GitHub Actions, which means you don't have to install Cython to install this package from PyPI.
Anaconda
- As of the previous version, d3rlpy is also available on conda-forge.
Release v0.32
This version introduces a hotfix.
⚠️ Fix a significant bug in online training with image observations.
Release v0.31
This version introduces minor changes.
- Move the n_epochs argument to the fit method.
- Fix scikit-learn compatibility issues.
- Fix zero-division error during online training.
Release v0.30
Algorithm
- Support Advantage-Weighted Actor-Critic (AWAC)
- fit_online method is available as a convenient alias to the d3rlpy.online.iterators.train function.
- the action unnormalization problem in AWR is fixed.
Metrics
- The following metrics are available.
- initial_state_value_estimation_scorer
- soft_opc_scorer
⚠️ MDPDataset
- d3rlpy.dataset module is now implemented with Cython in order to speed up memory copies.
- The following operations are significantly faster than in the previous version:
  - creating TransitionMiniBatch objects
  - frame stacking via the n_frames argument (see the sketch below)
  - lambda return calculation at AWR algorithms
- This change makes Atari training approximately 6% faster.
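For context, frame stacking is configured through the n_frames argument on the algorithm side; a minimal sketch (constructor defaults may differ between versions):
from d3rlpy.algos import DQN
# stack the last 4 frames to build each observation for image-based training
dqn = DQN(n_frames=4)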