Features for Offline Reinforcement Learning Pipeline #359

Mobius1D · 2021-07-11T09:23:43Z

Mobius1D
Jul 11, 2021
Collaborator

These are the features I am planning for the pipeline.
Many of them are inspired by d3rlpy.

- A _run function that takes in the trajectory and trains the policy. For now it will just sample from the trajectory and train the network using the policy. Appropriate stop conditions and hooks will be provided for evaluation. This will cover both on-policy evaluation and off-policy evaluation such as FQE (Fitted Q Evaluation).
- Provisions for loading datasets into trajectories. This covers datasets such as d4rl-atari, d4rl, and d4rl-pybullet; d3rlpy also has its own datasets, which can be interfaced with our trajectories. This will be useful until our own datasets are made.
- Default implementations of algorithms, so that users do not need to set up all the networks themselves. This would be very useful for beginners. Refer to Setup Algorithm.
- A function to allow users to make their own datasets. I am not sure if we can add this as a part of run function or allow the user to add it separately by providing a function to make the dataset.
- Train-test split functionality.
- Functionality for cross-validation and grid search. You can check cross-validation here.
- A feature to build the algorithm with default parameters based on the given dataset. Refer to the build_with_dataset function.
  - The trajectory type and the sampler can be chosen by default from what we know about the algorithm and the dataset, without the user having to code them.
- An evaluator function that performs off-policy or on-policy evaluation based on the input and can be used as a hook or passed as a separate argument. FQE can be implemented for off-policy evaluation.
- Different metrics as hooks for evaluation while training is in progress. Refer to Metrics.
- Training logs during the experiment rather than only after the experiment is done, with default provisions for this.
- Saving models so that they can be deployed in other frameworks, with support for different formats such as .onnx and .pt.
- Features for recording and displaying videos. For now this will be implemented as a prototype, as a wrapper around other Python libraries. It is limited to pybullet and atari for now but can be extended to many other environments.
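
To make the train-test split item concrete, here is a minimal sketch in Python (chosen only because d3rlpy is a Python library). It assumes the dataset is a list of episodes and splits at the episode level so that transitions from one episode never end up on both sides. The name `split_episodes` and the tuple layout are hypothetical, not part of this package or of d3rlpy.

```python
import random


def split_episodes(episodes, test_fraction=0.2, seed=0):
    """Shuffle whole episodes and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    indices = list(range(len(episodes)))
    rng.shuffle(indices)
    # Keep at least one episode for testing.
    n_test = max(1, round(len(episodes) * test_fraction))
    test = [episodes[i] for i in indices[:n_test]]
    train = [episodes[i] for i in indices[n_test:]]
    return train, test


# Toy data: 10 episodes of 3 (state, action, reward) tuples each.
episodes = [[("s%d_%d" % (e, t), 0, 1.0) for t in range(3)] for e in range(10)]
train, test = split_episodes(episodes, test_fraction=0.2)
```

The same index shuffling could probably be reused for k-fold cross-validation by slicing the shuffled indices into k folds instead of two parts.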

Features such as a utility function for CQL loss aren't included in this list because these will be developed as more algorithms are added.

The main _run function is pretty simple since the algorithm doesn't need to interact with the environment. The approach I see is to draw a batch with a batch sampler on every iteration and train on that batch. Another vital component is an evaluator integrated into the _run function. The other features are more about making the package approachable and extending it.
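
The loop described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: `run_offline`, `TabularQ`, and the hook signature `hook(step, learner)` are hypothetical names, and a toy tabular Q-learner stands in for a real network.

```python
import random


class TabularQ:
    """Toy learner (hypothetical): tabular Q-learning on batches from a fixed dataset."""

    def __init__(self, n_states, n_actions, lr=0.5, gamma=0.9):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.lr, self.gamma = lr, gamma

    def update(self, batch):
        for s, a, r, s_next, done in batch:
            target = r if done else r + self.gamma * max(self.q[s_next])
            self.q[s][a] += self.lr * (target - self.q[s][a])


def run_offline(learner, dataset, n_steps, batch_size, hooks=(), eval_every=100, seed=0):
    """Sample a batch per iteration, update the learner, and call hooks periodically."""
    rng = random.Random(seed)
    for step in range(1, n_steps + 1):
        batch = rng.choices(dataset, k=batch_size)  # uniform, with replacement
        learner.update(batch)
        if step % eval_every == 0:
            for hook in hooks:
                hook(step, learner)
    return learner


# Fixed dataset of (state, action, reward, next_state, done) transitions.
dataset = [(0, 1, 0.0, 1, False), (1, 1, 1.0, 1, True), (0, 0, 0.0, 0, False)]
log = []
learner = run_offline(
    TabularQ(2, 2), dataset, n_steps=500, batch_size=4,
    hooks=[lambda step, lrn: log.append((step, lrn.q[0][1]))],
)
```

The environment never appears in the loop; the hooks are the only place where evaluation or logging would plug in.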

These are just the features that I thought would be important. Feedback on the proposed ideas, or suggestions for other features that could be added, is welcome.

Credits: https://github.com/takuseno/d3rlpy

findmyway · 2021-07-11T16:05:31Z

findmyway
Jul 11, 2021
Maintainer

Nice points.

> For now it will just sample from the trajectory and train the network using the policy.

Sometimes the trajectory is too large to load entirely into memory, so we might not be able to sample from it in memory. You may consider reservoir sampling here.
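
For reference, a minimal sketch of reservoir sampling (the classic Algorithm R) in Python. It keeps a uniform sample of k items from a stream while holding only k items in memory at any time. The generator `transitions` is a hypothetical stand-in for a dataset read from disk.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly without replacement from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def transitions():
    """Hypothetical stream of (step, action, reward) tuples, produced lazily."""
    for t in range(100_000):
        yield (t, 0, 0.0)


batch = reservoir_sample(transitions(), k=32, seed=0)
```

If the stream holds fewer than k items, the function simply returns all of them.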

> A function to allow users to make their own datasets. I am not sure if we can add this as a part of run function or allow the user to add it separately by providing a function to make the dataset.

This looks trivial since we can simply add a custom hook.

Do you have any concrete plan for what you'd like to implement the next week?

2 replies

Mobius1D Jul 14, 2021
Collaborator Author

> Sometimes the trajectory is too large to load entirely into memory, so we might not be able to sample from it in memory. You may consider reservoir sampling here.

Makes sense.

> This looks trivial since we can simply add a custom hook.

Yes, but we still need a trajectory based on expert data, so first we train a policy to get the maximum score. At that point we can either run the policy, collect the data, and pass the source (if we do reservoir sampling) into _run to train in offline mode, or pass in the offline policy and let _run recognize it, generate the dataset, and train by itself. It's going to be a small difference, but from a usability standpoint the latter is better.

Mobius1D Jul 14, 2021
Collaborator Author

> Do you have any concrete plan for what you'd like to implement the next week?

Will post it shortly.

pilgrimygy · 2021-07-12T02:16:54Z

pilgrimygy
Jul 12, 2021
Collaborator

I have some ideas:

- Maybe we don't need a new _run function. In DQN and DQN variants, we train the agent on data sampled from trajectories. In fact, we only use a fixed dataset instead of trajectories.
- Making a dataset is not difficult, at least for environments like CartPole. In fact, existing hooks can collect all paths; we only need to collect them. We need to tell the user that we can do this. Where would you put the dataset? I currently put it in each learner. When training is needed, we call the update! function, as in dqns/common, and sample a batch from the dataset stored in the learner. That is my thinking.
I will continue to add later.

4 replies

Mobius1D Jul 14, 2021
Collaborator Author

> Maybe we don't need a new _run function. In DQN and DQN variants, we train the agent on data sampled from trajectories. In fact, we only use a fixed dataset instead of trajectories.

Yes, I completely agree that we don't have to make the agent take actions. But we do have to let the user get metrics, run evaluation during training, report training progress, and so on. A _run function will help keep the code simple for the end user.

> Making a dataset is not difficult, at least for environments like CartPole. In fact, existing hooks can collect all paths; we only need to collect them.

Yes. Making a dataset is not that difficult since we get the trajectory directly. But some complex environments need a lot of training before the agent stabilizes. Plus, to reproduce results reliably we need to use standard datasets such as d4rl. I am curious how you handle collecting the dataset for Atari experiments.

Mobius1D Jul 14, 2021
Collaborator Author

I also have a question about collecting datasets using hooks because, to my knowledge, every time I use the run function with an agent that has a trajectory, the agent will start training. Is the idea to have a DoEveryStep hook and collect the data by adding it to the trajectory?

findmyway Jul 14, 2021
Maintainer

> because every time I use the run function with an agent with a trajectory the agent will start training, to my knowledge

No, it could be either training or testing.
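
As a rough illustration of collecting a dataset with a per-step hook while no training happens, here is a small Python sketch. Everything in it is hypothetical (`CollectTransitions`, `run_episode`, and the toy counter environment); it is not this package's hook API, only the general idea of a callback that records transitions while a fixed policy acts.

```python
import random


class CollectTransitions:
    """Hypothetical per-step hook that appends each transition to a dataset."""

    def __init__(self):
        self.dataset = []

    def __call__(self, state, action, reward, next_state, done):
        self.dataset.append((state, action, reward, next_state, done))


def run_episode(policy, step_hook, horizon=5):
    """Toy environment: the state is a step counter and the episode lasts `horizon` steps."""
    for state in range(horizon):
        action = policy(state)
        reward = 1.0 if action == 1 else 0.0
        done = state == horizon - 1
        # No learner update anywhere: the policy only acts and the hook records.
        step_hook(state, action, reward, state + 1, done)


collector = CollectTransitions()
rng = random.Random(0)
for _ in range(3):
    run_episode(lambda s: rng.choice([0, 1]), collector)
```

After the loop, `collector.dataset` holds the transitions of all three episodes and could be handed to an offline training routine.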

pilgrimygy Jul 14, 2021
Collaborator

Well. Looking forward to your work~
