
Add a way to handle non-episodic environments #613

Closed
wants to merge 10 commits into from

Conversation

@HenriDeh (Member) commented Apr 4, 2022

Hello,

TLDR:
This adds an is_episodic field to Trajectory so that BatchSamplers and algorithms can choose not to multiply the value of a last state by 0. This makes it possible to handle infinite-horizon environments that are stopped after a given number of steps so that they can occasionally reset.

Complete story:
In the zoo implementations, the value of a next_state is multiplied by (1-t) when computing a bootstrapping target (a sketch of this computation follows the list below). However, I personally work with non-episodic environments: they theoretically never stop, but they run for a given number of steps before resetting. That means that their "terminal" state does not have a value of 0.
I didn't know how to proceed if I wanted to, say, use SAC on a non-episodic environment.

  • If the environment returns is_terminated == true, its last state will have a value of 0.
  • If it does not, then it will run forever. While this works fine for some environments like pendulum, more complex ones may need to reset to the initial state.
  • In any case, even an episodic environment may need to impose a limit on the number of steps per episode. If such a limit is reached, the last state should not have a value of 0.
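For concreteness, here is a minimal sketch of how such a bootstrap target is typically computed; this is illustrative only (not the exact zoo code), and one_step_target is a made-up name:

# Illustrative one-step TD target. r is the reward, q_next the estimated value
# of the next state, and t is 1 (true) when the environment reported termination.
function one_step_target(r, t, q_next; γ = 0.99f0)
    # When t == 1 the next state is treated as valueless, which is wrong for a
    # non-episodic environment that was merely cut off after a fixed number of steps.
    return r .+ γ .* (1 .- t) .* q_next
end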

I first thought that adding a second way of interrupting the while loop in core/run.jl without setting is_stop == true could do it: an optional episode_stop_condition argument that takes a callable, just like stop_condition.

However, my problem with that is how to handle NStep bootstrapping. Indeed, in trajectory_extensions.jl, the NStepBatchSampler uses the terminal trace to determine whether states belong to the same episode. This would be impossible with a non-episodic environment that never returns a true terminal.

So, here is a final proposal to handle the problem with minimal changes:
Since the terminal information is needed by AbstractBatchSamplers only, the user can give the sampler that information by setting an is_episodic field on Trajectories. If it is false, the sampler sets the whole terminal trace to 0, but only AFTER using it to cut off steps from the next episode. I initially thought about adding this to the samplers, but they are hardly ever user-defined; since samplers sample from trajectories, they can always query this field. By default the field is set to true in order to keep existing experiments working.
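To make the idea concrete, here is a rough sketch with simplified, partly hypothetical names (the actual code in this PR may differ):

# Simplified stand-in for the real Trajectory type, which has more fields.
struct Trajectory{T}
    traces::T
    is_episodic::Bool
end
Trajectory(traces; is_episodic = true) = Trajectory(traces, is_episodic)  # true keeps existing experiments working

# Hypothetical helper a BatchSampler could use: the raw terminal trace is still
# used to avoid sampling across episode boundaries, but the terminals handed to
# the learner are forced to false when the trajectory is non-episodic, so the
# value of the last state is no longer multiplied by 0.
function terminals_for_target(t::Trajectory, inds)
    terminals = t.traces[:terminal][inds]
    return t.is_episodic ? terminals : falses(length(inds))
end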

@HenriDeh changed the title from "add is_episodic" to "Add a way to handle non-episodic environments" Apr 4, 2022
@HenriDeh (Member, Author) commented Apr 4, 2022

I don't know if this kind of change fits your plans for the package, so I'd rather wait for your opinion before merging, @findmyway.
I work with RL in unusual applications, so I have specific needs; that is why I'm making so many PRs to make the package more flexible, but I don't want to clutter your project with too many things you don't want.

@HenriDeh requested a review from findmyway April 4, 2022 11:07
@findmyway mentioned this pull request Apr 4, 2022
@findmyway (Member)

Hi @HenriDeh,

Thanks for this PR. I guess discussions at #140 may also be interesting to you.

Indeed, non-episodic environments have not been considered seriously so far. I plan to improve this part in #614, and it is the highest priority for me. So could we postpone merging this PR for now? I'll figure out a draft design first and discuss it with you later.

@HenriDeh (Member, Author) commented Apr 4, 2022

Okay, sure! I understand; this PR is a bit hacky, so you may want something else. I can work with my local branch in the meantime.
In fact, I'm very interested to know more about your plans for the future. I am currently implementing Retrace, and to achieve the flexibility I need for my research, I also need to implement my algorithm in a way that can work with different types of QNetworks: I would like to be able to use the same algorithm with standard QNetwork bootstrapping, Target QNetworks, and Twin QNetworks. So I designed some sort of API that could be reused by other algorithms that use QNetworks, like what is described in #139. In essence, Retrace is just another way of computing targets for an MSE loss, and I thought I could make something like a compute_target(policy, traj, qnetwork, batch).
The odds that I came up with the same ideas as you are slim, so we can wait and I'll adapt to your design. I'll make a PR with Retrace and QNetworks to show you my ideas, but I expect that you will want to put that one on hold too.
Likewise, I wanted to implement a Discrete equivalent of GaussianNetwork to overload some functions in MPO. That, I think, should not be a problem. Don't hesitate to ping me if you want to discuss it.

@HenriDeh marked this pull request as draft April 14, 2022 13:33
@HenriDeh (Member, Author) commented Apr 14, 2022

Hello, I just want to point out that this approach to non-episodic environments leads to incorrect learning of the last step!
This is caused by the fact that the terminal state is not saved in the trajectory; thus, the q' value computed by the target_network is that of the first state of the next episode. This is not a problem for episodic environments, since this value is multiplied by 0, but in non-episodic environments we need the q' of the correct state.

This is impossible to achieve without changing the way trajectories work: they must trace the correct last state. I think we could change that in the following way:

  • Remove the pop! of state (and action) at PreEpisodeStage.
  • Add a dummy terminal and reward at PostEpisodeStage, so that indices still line up.
  • Keep a list of the indices of the last states stored in the trajectory.
  • When sampling, exclude that list from the valid_range of sampleable indices.
  • SART (and similar) CircularArrays must no longer be built with :state and :action traces that have +1 capacity.

This is not hard to do, but it involves changes that are probably breaking for implementations that do not use BatchSampler, if there are any. But I think it would be nice to have this in v0.11. I'll make the changes in this PR to showcase the idea, and we'll see what to do with it.
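To illustrate the sampling side of the change, here is a minimal sketch assuming the trajectory keeps such a list of last-state indices (the names are made up for illustration):

# Illustrative only: exclude episode-final indices from the candidate indices a
# BatchSampler may draw, so that for every sampled step t, state t+1 belongs to
# the same episode as state t.
sampleable_indices(n_stored, last_state_indices) = setdiff(1:n_stored, last_state_indices)

# For example, with 10 stored steps where steps 4 and 10 are the stored last states:
# sampleable_indices(10, [4, 10]) == [1, 2, 3, 5, 6, 7, 8, 9]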

@HenriDeh (Member, Author)

I think it should work like this; I'll try tomorrow.

@findmyway (Member)

Hi @HenriDeh

I've been thinking about this scenario this weekend. There are still several points I'm not sure about yet. It would be great if you could help clarify them 🤗.

Hello, I just want to point out that this approach to non-episodic environments leads to incorrect learning of the last step!

To address this problem, I prefer to split all the experiences into chunks. Each chunk is an ElasticSARTBuffer, and all the chunks are wrapped in a Vector (or CircularVector?).


(In my upcoming new design, each Traces will form a chunk.)

Now my questions are:

  1. How to evict stale experiences? Once the total number of experiences reaches a threshold (see the sketch after these questions),
    a. we pop out the oldest chunks one by one. So the chunk wrapper should support the popfirst! method.
    b. we pop out the oldest experiences (the state-action-reward-terminal tuples) one by one. This requires that both the ElasticSARTBuffer and the chunk wrapper support the popfirst! method.
  2. How to sample batches?
    a. The BatchSampling strategy is easy to implement. Do you have some other use cases, like the NStepSampling, or sampling based on chunks? In Combine transformers and RL (#392), each element in a batch contains N consecutive experiences.
  3. How do you decide when to reset! the non-episodic environments?
    a. This might affect the design of the interfaces slightly. Is the reset! signal controlled by a predefined number of steps, or is it calculated dynamically?
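To make question 1 concrete, here is a rough sketch of the two eviction strategies; the chunk operations used here (length, popfirst!) are assumed, not existing package API:

# chunks stands for the Vector of ElasticSARTBuffer-like chunks.
function evict_by_chunk!(chunks, capacity)          # option (a)
    while !isempty(chunks) && sum(length, chunks) > capacity
        popfirst!(chunks)                           # drop the whole oldest chunk
    end
end

function evict_by_step!(chunks, capacity)           # option (b)
    while !isempty(chunks) && sum(length, chunks) > capacity
        popfirst!(first(chunks))                    # drop the single oldest experience
        isempty(first(chunks)) && popfirst!(chunks) # discard the emptied chunk
    end
end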

Once the answers to the above questions are clear, I should be able to split the trajectory-related code into https://github.com/JuliaReinforcementLearning/Trajectories.jl in the next week.

@HenriDeh (Member, Author)

Hello @findmyway,

I'm not sure I fully understand what a "chunk" is. Is the structure like the following?

Trajectory
  Traces # an ElasticSARTBuffer
    Trace{:state} # a (Circular)Vector of Chunk
      Chunk1 # an ElasticArray of states
      Chunk2 # an ElasticArray of states
      ...
    Trace{:action} # ibid for other traces
    Trace{:reward}
    Trace{:terminal}

Does each chunk contain one episode? (In that case I would argue for a struct named Episode instead of Chunk, to use the terminology of Sutton & Barto.)

If this is correct, here's my take on your questions:

  1. I would say option B, because 1) it bounds the memory used by the buffer: if the episodes get longer because the agent improves, the buffer will not take more space; and 2) I think it would work out of the box for a continuing environment with online algorithms (one environment runs and never stops): there is a single Chunk with a maximum length. With this option, the capacity of the ElasticSARTBuffer is the maximum total length of the Chunks of each Trace. That said, I would not be against the possibility of letting the user choose option A instead, but I believe it is less commonly used, so I'd give it a lower priority.
  2. Yes, with Retrace, V-trace and the generalized advantage estimator. In this case, I like to think about sampling as "sampling a first state t" but fetching the experiences from state t to t+N (or to the terminal state T if t+N > T). NStepBatchSampler does that, but only with rewards, and compacts them into a single discounted reward-to-go. With Retrace and the others, we need to keep everything, because rewards are multiplied by importance sampling ratios and TD errors (which are computed with qnetworks using states and actions). I think Chunks are in fact quite convenient for these samplers. This is different from Transformers, however, because Transformers will want to sample the states (observations, actually) before t, while n-step TD samplers want the states after t. Combining both would mean sampling states before and after t.
  3. I see two possibilities.
    a) The environment controls that internally, or it is wrapped by a step counter. In that case the algorithm must somehow know that the "terminal" states are not really terminal and that it must not assign them a value of 0 when computing targets. That's the approach I used in this PR because it was the simplest to implement.
    b) We add an EpisodeResetCondition argument to run, akin to StopCondition, that will reset the environment when appropriate. Currently, the EpisodeResetCondition is implicitly ResetAtTerminal (made-up names); we could make that explicit and add options such as ResetAfterNSteps, which would reset the environment even if it did not return is_terminated == true, so the value would be multiplied by 1. I like this approach because we can imagine more exotic, user-defined conditions like ResetWhenTheCartpoleStaysDownForNConsecutiveFrames. The most common approach is to use a fixed number of steps before resetting, but one could totally imagine a number of steps that varies according to a schedule, or any other condition (which means that the ResetCondition should be called with the (policy, trajectory, env, stage) arguments to allow for every imaginable condition; a minimal sketch follows after this list).
    Note that in both a) and b), we must not use dummy states because we need the real last state to compute its value.
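To make option b) concrete, here is a minimal sketch of what such a reset condition could look like; ResetAfterNSteps and this exact call signature are the made-up names from above, not existing API:

# Purely illustrative reset condition.
mutable struct ResetAfterNSteps
    n::Int
    t::Int
end
ResetAfterNSteps(n) = ResetAfterNSteps(n, 0)

# Called at every step with the full context, so user-defined conditions can
# also inspect the policy, the trajectory, or the environment itself.
function (c::ResetAfterNSteps)(policy, trajectory, env, stage)
    c.t += 1
    c.t < c.n && return false
    c.t = 0
    return true  # reset the env, but do not treat the last state as terminal
end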

Hope this helps, don't hesitate if you need more explanations :)

@findmyway (Member)

Many thanks for your detailed comments. Things are clear to me now.

Does each chunk contain one episode? (In that case I would argue for a struct named Episode instead of Chunk, to use the terminology of Sutton & Barto.)

Exactly, I'll take this suggestion.

@HenriDeh (Member, Author)

If you are up for solution b):

b) We add an EpisodeResetCondition argument to run, akin to StopCondition, that will reset the environment when appropriate. Currently, the EpisodeResetCondition is implicitly ResetAtTerminal (made-up names); we could make that explicit and add options such as ResetAfterNSteps, which would reset the environment even if it did not return is_terminated == true, so the value would be multiplied by 1. I like this approach because we can imagine more exotic, user-defined conditions like ResetWhenTheCartpoleStaysDownForNConsecutiveFrames. The most common approach is to use a fixed number of steps before resetting, but one could totally imagine a number of steps that varies according to a schedule, or any other condition (which means that the ResetCondition should be called with the (policy, trajectory, env, stage) arguments to allow for every imaginable condition).

I can help and implement that in a PR soon. If you have something different in mind, I can still help to lighten your workload :)

@findmyway (Member)

Thanks! @HenriDeh

Yes, I agree with this approach. Though this may be easy to implement, we'd better make the change together with those in Trajectories.jl.

I understand that the development speed is kind of slow right now 😄. But I think the initial version will be ready before the end of this week.

@HenriDeh (Member, Author)

Okay, I'll make a first draft, and then when Trajectories is more advanced, we'll see what I need to do. I don't think it will need much adaptation to the new design, because it is really a simple change in run.jl in the end. I'm closing this PR.

@HenriDeh closed this Apr 20, 2022