Add a way to handle non-episodic environments #613
I don't know if this kind of change fits your plans for the package, so I'd rather wait for your opinion before merging @findmyway. |
Hi @HenriDeh, thanks for this PR. The discussion in #140 may also interest you. Indeed, non-episodic environments have not been considered seriously so far. I plan to improve this part in #614, and it is my highest priority. So could we postpone merging this PR for now? I'll figure out a draft design first and discuss it with you later. |
Okay sure! I understand, this PR is a bit hacky so you may want something else. I can work with my local branch in the meantime. |
Hello, I just want to point out that this approach to non-episodic environments leads to incorrect learning of the last step! Learning it correctly is impossible without changing the way trajectories work: they must trace the correct last state. I think we could change that in the following way:
This is not hard to do but it involves changes that are probably breaking for implementations that do not use |
I think it should work like this, I'll try tomorrow. |
Hi @HenriDeh, I've been thinking about this scenario over the weekend. There are still several points I'm not sure about. It would be great if you could help clarify them 🤗.
To address this problem, I prefer to split all the experiences into chunks. Each chunk is an (In my upcoming new design, the each Now my questions are:
Once the answers to the above questions are clear, I should be able to split the trajectory-related code out into https://github.com/JuliaReinforcementLearning/Trajectories.jl next week. |
Hello @findmyway, I'm not sure I fully understand what a "chunk" is. Is the structure like the following?
Where each chunk contains one episode? (In that case I would vouch for a struct named If this is correct, here's my take on your questions:
Hope this helps, don't hesitate if you need more explanations :) |
Many thanks for your detailed comments. Things are clear to me now.
Exactly, I'll take this suggestion. |
If you are up for solution b
I can help and implement that in a PR soon. If you have something different in mind I can still help to lighten your workload :) |
Thanks! @HenriDeh Yes, I agree with this approach. Though this may be easy to implement, we'd better make the change together with those in I understand that the development speed is kind of slow right now 😄. But I think the initial version will be ready before the end of this week. |
Okay, I'll make a first draft, and when Trajectories is more advanced we'll see what I need to do. I think it won't need much adaptation due to the new design; it is really a simple change in |
Hello,

TLDR: This adds an `is_episodic` field to Trajectory so that BatchSamplers and algorithms can choose not to multiply the value of a last state by 0. This makes it possible to handle infinite-horizon environments that stop after a given number of steps for an occasional reset.

Complete story:

In the zoo implementations, the value of a `next_state` is multiplied by `(1-t)` when computing a bootstrapping target. However, I personally work with non-episodic environments: they theoretically never stop, but they last a given number of steps before resetting. That means that their "terminal" state does not have a value of 0.

I didn't know how to proceed if I wanted to, say, use SAC on a non-episodic environment.
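To make the `(1-t)` issue concrete, here is a minimal sketch of a one-step bootstrapped target. It is written in Python purely as an illustration and is not the package's Julia code; the function name `td_target` and the default `gamma` are assumptions made up for this example.

```python
def td_target(reward, next_value, terminal, gamma=0.99):
    """One-step target: r + gamma * (1 - t) * V(next_state).

    `td_target` is a hypothetical helper for illustration only.
    """
    # When terminal is True, (1 - t) is 0 and the value of next_state is discarded.
    return reward + gamma * (1.0 - float(terminal)) * next_value


# The same transition, flagged as terminal or not:
cut_off = td_target(1.0, 10.0, terminal=True)      # bootstrap term dropped
continuing = td_target(1.0, 10.0, terminal=False)  # bootstrap term kept
```

If a non-episodic environment is merely interrupted for a reset, flagging the step as terminal yields the first target, although the second one is the one that matches the task.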
With `is_terminated == true`, its last state will have a value of 0.

I first thought that adding a second way of interrupting the while loop in `core/run.jl` without setting `is_stop == true` could do it: an optional `episode_stop_condition` argument that takes a callable, just like `stop_condition`.

However, my problem with that is how to handle NStep bootstrapping. Indeed, in `trajectory_extensions.jl`, the `NStepBatchSampler` uses the terminal trace to determine whether states belong to the same episode. This would be impossible with a non-episodic environment that never returns a `true`.

So, here is a final proposition to handle the problem with minimal changes:

Since the terminal information is needed by `AbstractBatchSampler`s only, the user can give the sampler such information by setting an `is_episodic` field on `Trajectories`. If false, the sampler changes all `terminal` traces to 0 AFTER using them to cut off steps from the next episode. I initially thought about adding this to the samplers, but they are barely ever user-defined. Since samplers sample from trajectories, they can always query this field. By default the field is set to true in order to keep existing experiments in a working state.
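The proposed sampler behaviour could look roughly like the sketch below. It is a Python illustration under stated assumptions, not the `NStepBatchSampler` implementation: the function `n_step_target`, its arguments, and the flat-list layout are hypothetical, and `next_values[i]` is assumed to hold the value of the true successor state of step `i` (not the state observed after a reset).

```python
def n_step_target(rewards, terminals, next_values, start, n,
                  gamma=0.99, is_episodic=True):
    """n-step target starting at index `start` (hypothetical helper).

    The terminal trace is first used to cut the window at an episode
    boundary. Only afterwards, if `is_episodic` is False, the flag is
    treated as 0 so that the last state's value is still bootstrapped.
    """
    end = min(start + n, len(rewards))
    ret, discount, last = 0.0, 1.0, start
    for i in range(start, end):
        ret += discount * rewards[i]
        discount *= gamma
        last = i
        if terminals[i]:
            break  # never mix in steps from the next episode
    # Zero the terminal flag only after it has been used for the cut-off.
    t = terminals[last] if is_episodic else False
    return ret + discount * (1.0 - float(t)) * next_values[last]


rewards = [1.0, 1.0, 1.0, 1.0]
terminals = [False, True, False, False]
next_values = [5.0, 6.0, 7.0, 8.0]

episodic = n_step_target(rewards, terminals, next_values, 0, 3, gamma=0.5)
non_episodic = n_step_target(rewards, terminals, next_values, 0, 3,
                             gamma=0.5, is_episodic=False)
```

In both calls the window stops at index 1, so no reward from the following episode leaks in; the only difference is whether the value stored at `next_values[1]` contributes to the target.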