TRPO #747

Merged: 14 commits into JuliaReinforcementLearning:master on Sep 11, 2022

Conversation

@baedan (Contributor) commented on Aug 8, 2022:

This PR implements Trust-Region Policy Optimization (TRPO) and adds a CartPole experiment for it.

To this end, I wrote a few utility functions that are shared among policy-gradient policies (#737). But perhaps a better way to go about it would be to have a `PolicyGradientPolicy` type and have it wrap different learners.

@findmyway self-requested a review on August 8, 2022 at 10:26.

@findmyway (Member) commented:

Looks fine to me in general. I think there's still room for improvement in the gradient part. I'll add more detailed comments this weekend.


gps = gradient(params(A.model)) do
    old_logits[] = A.model(s)
    total_loss = map(eachcol(softmax(old_logits[])), a) do x, y

@findmyway (Member):

Could be simplified to `logits[CartesianIndex.(a, 1:length(a))]`.
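
For illustration, a small standalone sketch (made-up data; `softmax` comes from Flux/NNlib) showing that the suggested `CartesianIndex` indexing selects the same per-sample entries as the `map` over `eachcol`:

```julia
using Flux: softmax

logits = randn(Float32, 4, 3)   # 4 actions × 3 samples
a = [2, 1, 4]                   # action taken in each sample
probs = softmax(logits)         # column-wise softmax (dims = 1 by default)

# map over columns, as in the current code
p_map = map(eachcol(probs), a) do col, act
    col[act]
end

# equivalent: one CartesianIndex per (action, sample) pair
p_idx = probs[CartesianIndex.(a, 1:length(a))]

@assert p_map ≈ p_idx
```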

end

# store logits as intermediate value
old_logits = Ref{Matrix{Float32}}()

@findmyway (Member):

Why is a `Ref` used here?

@baedan (Contributor, author):

There were some oddities with Zygote second-order derivatives w.r.t. implicit parameters when I tried `local old_logits`, yielding inconsistent results after `mapreduce(vec, vcat, gradient)` (I sometimes got a `Ref(0)` term as the first element). I'm not sure this is still necessary, however, since I've since changed how the second-order gradient is calculated.

Comment on lines 105 to 109
for _ in 1:p.max_backtrack_step
    θ = θₖ + Δ
    search_condition(θ) && break
    Δ = Δ * p.backtrack_coeff
end

@findmyway (Member):

It seems in-place updating would be fine here?
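
A minimal sketch of what in-place updating could look like here (`search_condition`, `max_backtrack_step`, and `backtrack_coeff` mirror the names in the snippet above; `backtrack!` and everything else is a hypothetical helper, not this PR's code):

```julia
# Backtracking line search with in-place updates: θ and Δ are reused as buffers
# instead of allocating a new vector on every iteration.
function backtrack!(θ, θₖ, Δ, search_condition; max_backtrack_step = 10, backtrack_coeff = 0.8f0)
    for _ in 1:max_backtrack_step
        θ .= θₖ .+ Δ               # overwrite θ in place rather than rebinding it
        search_condition(θ) && return true
        Δ .*= backtrack_coeff      # shrink the step in place
    end
    return false                   # no acceptable step within the backtrack budget
end
```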

export action_distribution, policy_gradient_estimate, IsPolicyGradient
export conjugate_gradient!

struct IsPolicyGradient end

@findmyway (Member):

Based on the usages, it seems making this a subtype of `AbstractPolicyGradient <: AbstractPolicy` would be better?

@baedan (Contributor, author):

I think so! Or, better yet, perhaps a `PolicyGradient` wrapper type?
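
To make the two options concrete, here is a rough sketch (not code from this PR; `AbstractPolicy` is the existing supertype in ReinforcementLearningBase.jl, all other names are invented for illustration):

```julia
using ReinforcementLearningBase: AbstractPolicy

# Option (a): a dedicated abstract subtype that the shared utilities dispatch on,
# instead of the IsPolicyGradient trait.
abstract type AbstractPolicyGradient <: AbstractPolicy end
# struct TRPOPolicy <: AbstractPolicyGradient ... end

# Option (b): a concrete wrapper that owns the shared policy-gradient machinery
# and delegates the algorithm-specific pieces to a learner.
struct PolicyGradient{L} <: AbstractPolicy
    learner::L   # e.g. a TRPO- or VPG-specific learner
end
```

Either way, `policy_gradient_estimate` and friends could then dispatch on the type rather than taking an explicit trait argument.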


function policy_gradient_estimate(::IsPolicyGradient, policy, states, actions, advantage)
    gs = gradient(params(policy.approximator)) do
        action_logits = action_distribution(policy.dist, policy.approximator(states))

@findmyway (Member):

`action_logits` → `action_distribution`?

```
"""
action_distribution(dist::Type{T}, model_output) where {T<:ContinuousDistribution} =
    map(col -> dist(col...), eachcol(model_output))

@findmyway (Member):

This makes some extra assumptions:

  1. The parameters of the distribution are all of the same length and size (scalars, to be specific)
  2. The output of the model is a Matrix

Maybe we can figure out a more elegant way to use StructArrays.jl here later.

@baedan (Contributor, author):

Yeah, the semantics of `dist` are pretty bad here. After this I realized that punning on `dist` wouldn't even work with a `Normal`, since we usually want the network to output the log of the variance, not the variance itself.

Perhaps we could just ask the user to specify the distribution type with a trait, and overload a `dist` function (with a better name, of course).
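
A hedged sketch of that idea (`output_to_distribution` and `action_distributions` are made-up names, not this PR's API), assuming the Gaussian head outputs a mean and a log-variance per column:

```julia
using Distributions
using Flux: softmax

# Users overload one method per distribution type, so e.g. a Gaussian policy can
# interpret the second entry of each column as log(σ²) rather than σ itself.
output_to_distribution(::Type{<:Normal}, col) = Normal(col[1], exp(col[2] / 2))   # col = (μ, log σ²)
output_to_distribution(::Type{<:Categorical}, col) = Categorical(softmax(col))    # col = raw logits

action_distributions(D, model_output::AbstractMatrix) =
    map(col -> output_to_distribution(D, col), eachcol(model_output))
```

For example, `action_distributions(Categorical, model(states))` would return one `Categorical` per column of the model output.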

See [here](https://spinningup.openai.com/en/latest/algorithms/trpo.html#key-equations) for more information.
"""
function surrogate_advantage(model, states, actions, advantage, action_logits)
    π_θₖ = map(eachcol(softmax(action_logits)), actions) do a, b

@findmyway (Member):

Same as above

    π_θₖ = map(eachcol(softmax(action_logits)), actions) do a, b
        a[b]
    end
    π_θ = map(eachcol(softmax(model(states))), actions) do a, b

@findmyway (Member):

Same as above
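
For reference, the quantity `surrogate_advantage` computes is the surrogate advantage from the linked Spinning Up page, i.e. the importance-weighted advantage

```math
\mathcal{L}(\theta_k, \theta) = \mathbb{E}_{s, a \sim \pi_{\theta_k}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} \, A^{\pi_{\theta_k}}(s, a) \right],
```

with `π_θ` and `π_θₖ` above being the per-sample probabilities of the taken actions under the new and old parameters, respectively.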

@findmyway (Member) commented:

I'll merge this first. I may find some time in the next week to polish this further ;)

@findmyway merged commit 0a344ce into JuliaReinforcementLearning:master on Sep 11, 2022.