sampling in continuous/complex action spaces with 'density prior' is not working #200

Open
ManorZ opened this issue Jul 9, 2022 · 0 comments
Labels
bug Something isn't working

ManorZ commented Jul 9, 2022

Search before asking

  • I have searched the MuZero issues and found no similar bug report.

🐛 Describe the bug

In Learning and Planning in Complex Action Spaces (Hubert et al.), there are basically two changes compared to MuZero:

  1. Modify the policy probabilities inside PUCB to use the 'sampled policy' pi_hat = (beta_hat / beta) * pi (see the sketch after this list).
  2. Sample K actions instead of evaluating all possible actions (infinitely many in the continuous case).
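To make change 1 concrete, here is a minimal sketch of the sampled-policy correction, assuming the K actions are drawn from the policy itself (beta = pi), in which case the ratio collapses to empirical counts. `sampled_policy` is a hypothetical helper of mine, not code from the paper or the repo:

```python
from collections import Counter

def sampled_policy(sampled_actions, K):
    """Hypothetical helper: pi_hat(a) = (beta_hat(a) / beta(a)) * pi(a).

    When the K actions are sampled from the policy itself (beta = pi),
    the pi terms cancel and pi_hat reduces to the empirical fraction
    beta_hat(a) = count(a) / K.
    """
    counts = Counter(sampled_actions)
    return {a: count / K for a, count in counts.items()}
```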

In the code, I think I see two differences:

  1. Only one sample is drawn at the root, not K.
  2. Regarding pi_hat = (beta_hat / beta) * pi, I see two options in the code: 'uniform prior' and 'density prior'.
  • The uniform prior gives equal density to all actions and weights the policy accordingly; the current code makes sense there.
  • The density prior needs to account for each action's CDF (at the parent), but it doesn't work (error message and description below). The snippet after this list contrasts the two options.
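For concreteness, a rough sketch of how the two options could assign a child's prior, assuming the policy head is exposed as a torch.distributions.Distribution (the variable names here are mine, not the repo's):

```python
import torch

# Hypothetical stand-in for the network's policy output:
policy = torch.distributions.Normal(torch.zeros(1), torch.ones(1))
action = policy.sample()

K = 16
prior_uniform = 1.0 / K                                     # 'uniform prior': equal weight per sampled child
prior_density = float(policy.log_prob(action).sum().exp())  # 'density prior': policy density at the sampled action
```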


The error message I get:
```
File "/home/user_231/muzero-general/self_play.py", line 401, in ucb_score
  child.prior / sum([child.prior for child in parent.children.values()])
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
Last test reward: 0.00. Training step: 0/3. Played games: 0. Loss: 0.00
```

This is because nothing assigns node.prior in the continuous branch.
I think it has to be set by the parent in its expand method, equal to each child's CDF at the sampled point.
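A minimal sketch of what that could look like in the continuous branch of expand, assuming distribution is a torch.distributions.Distribution and that Node takes its prior in the constructor as elsewhere in the repo. Note that I use the density (log_prob(...).exp()) at the sampled point rather than the CDF, since that is what torch distributions expose directly:

```python
# Sketch of the continuous branch of expand() (hypothetical, untested):
action = distribution.sample()
prior = float(distribution.log_prob(action).sum().exp())  # density at the sampled point
action_value = action.detach().cpu().numpy()
self.children[Action(action_value)] = Node(prior)         # child now carries a usable prior
```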

Also, regarding K, I think we need a small change in the expand method so that it samples more than one action (note that a torch Distribution's sample expects a shape tuple, not an int):

```python
action_values = distribution.sample((K,)).detach().cpu().numpy()  # shape: (K, action_dim)
for action_value in action_values:
    self.children[Action(action_value)] = Node()  # the prior still needs to be set here, as noted above
```
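Putting the two fixes together, a hedged sketch of the sampled expansion (same assumptions as above, and untested against the repo):

```python
# Draw K actions at once and give each child a density prior:
actions = distribution.sample((K,))         # shape: (K, action_dim)
log_probs = distribution.log_prob(actions)  # log density per sample
for action, log_p in zip(actions, log_probs):
    prior = float(log_p.sum().exp())
    self.children[Action(action.detach().cpu().numpy())] = Node(prior)
```

Once every child.prior is a float, the normalization in ucb_score (child.prior divided by the sum of the siblings' priors) no longer hits the 'int' + 'NoneType' error above.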

Environment

No response

Minimal Reproducible Example

```
python muzero.py mujoco_IP {"node_prior":"density"}
```

Additional

No response
