Updates on Reanalyse / Sample Efficiency (Re-executing MCTS, Parallelization, Stabilization with a target model, etc.) #142

FXDevailly · 2021-03-27T21:34:00Z

This pull request aims to provide the following additions:

Re-execution of the MCTS in Reanalyse to obtain fresh policy targets (child_visits), which, as stated in the MuZero paper, improves sample efficiency.
Use of lagging parameters (a target model) in Reanalyse to stabilize the bootstrapping of root values (as done in the MuZero paper).
The ability to run multiple Reanalyse processes in parallel (especially useful when running MCTS is costly and/or the memory buffer is large).
An updated 'Reanalyse sampling' method that prioritizes updating episodes with 'older' targets (root values and child_visits).
Three new settings in the game files: 1) the parameter-update frequency of the target model used in Reanalyse, 2) the number of Reanalyse processes to use, and 3) whether to update root values from the target model alone (representation + value) or from the re-executed MCTS root_values.
For PER, priorities are now updated in Reanalyse using the new MCTS predictions (in the same way they are initialized when saving an episode).
A new game file, cartpole_sample_efficient.py, with some of the settings used in the MuZero paper (PER=1, num_td_steps=5, num_unroll_steps=5), a train_steps/play_steps ratio of 20, parallelized Reanalyse, a frequently updated target model, and root_values updates based on the re-execution of the MCTS.
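As a rough illustration of the lagging-parameters idea listed above, here is a minimal sketch, assuming the target model is simply a delayed copy of the online weights that is refreshed every fixed number of training steps. The class name `LaggingTarget`, the `update_interval` parameter, and the plain-dict weights are hypothetical and are not taken from this pull request.

```python
import copy


class LaggingTarget:
    """Hypothetical helper: holds a delayed copy of the online weights."""

    def __init__(self, online_weights, update_interval):
        if update_interval < 1:
            raise ValueError("update_interval must be at least 1")
        self.update_interval = update_interval
        # Deep copy so later changes to the online weights do not leak in.
        self.weights = copy.deepcopy(online_weights)
        self.last_sync_step = 0

    def maybe_sync(self, online_weights, training_step):
        """Refresh the copy once `update_interval` steps have passed."""
        if training_step - self.last_sync_step >= self.update_interval:
            self.weights = copy.deepcopy(online_weights)
            self.last_sync_step = training_step
            return True
        return False


if __name__ == "__main__":
    online = {"w": 0.0}
    target = LaggingTarget(online, update_interval=3)
    for step in range(1, 7):
        online["w"] = float(step)  # stand-in for an optimizer update
        synced = target.maybe_sync(online, step)
        print(step, online["w"], target.weights["w"], synced)
```

A small interval corresponds to the "frequent update" setting mentioned for the sample-efficient cartpole file. A larger interval makes the bootstrapped values change more slowly.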


theword · 2021-03-29T17:31:43Z

The number of games that get played drastically changes.

Your pull request -

Compared to the default cartpole -

But the results look good!

I didn't compare your other game parameters to the defaults. They could influence the results, which makes this an imperfect experiment, but this method seems to be more productive if you can pay the higher computational cost.

FXDevailly · 2021-03-29T17:40:31Z

Thanks for the feedback.
One of the changes influencing the number of played games is the ratio in the game file (20 in the sample_efficient version). I think you could try the same value (20) with the original version (not this pull request) to make the comparison fairer. Also, regarding the computational cost, parallelization (using multiple Reanalyse processes) can compensate for the higher Reanalyse cost.
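To make the effect of the ratio concrete, here is a small arithmetic sketch. It assumes the ratio acts as a lower bound on training steps per self-played step, so that self-play pauses whenever training_steps / played_steps falls below it. The function name and the numbers are illustrative only.

```python
def max_played_steps(training_steps: int, ratio: float) -> int:
    """Most self-played steps allowed if training_steps / played_steps >= ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return int(training_steps // ratio)


if __name__ == "__main__":
    # With the same training budget, a larger ratio allows far fewer played steps.
    for ratio in (2, 20):
        print(ratio, max_played_steps(10_000, ratio))
```

Under this assumption, raising the ratio to 20 by itself cuts the number of played steps (and hence games) for a given training budget, independently of the Reanalyse changes.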

theword · 2021-03-29T20:24:41Z

Yeah, with parallelization it works out great. I didn't notice any performance problems on my CPU.

However, I think your code is now breaking tic-tac-toe and Connect 4.

When I run other games (neither cartpole nor your cartpole sample), I get:

Possible unhandled error from worker: ray::Reanalyse.reanalyse()

It then fails at return self.env.legal_actions() on line 167 of the tic-tac-toe game file:

AttributeError: 'NoneType' object has no attribute 'legal_actions'

This happens whether self.use_updated_mcts_value_targets is False or True.

theword · 2021-04-22T03:03:38Z

So I revisited this PR because it did increase the efficiency of MuZero.

I tracked down the culprit to this line of code in replay_buffer.py, in the init() of Reanalyse:

self.game.env = None

When legal_actions was called by the MCTS, the game had no env object. When I took out this line of code, everything ran without error and appeared to work. Cartpole worked because its legal_actions just returns a list of 0 and 1, but the other games call legal_actions on self.env. What was the intended purpose of this line?

FXDevailly · 2021-04-22T12:59:44Z

When using cartpole, the "legal_actions" method is part of the Game object (self.game in Reanalyze). Since there can be many reanalyze processes, I wanted to save memory by removing the environments themselves (self.game.env in Reanalyze) in these game objects since they were not required to run the MCTS.

However, it appears that for other games (such as tic-tac-toe) the "legal_actions" method is obtained from the env object (self.game.env) itself, so the env is required. I guess the memory impact should not be too big if we just remove the line, as you suggested. Thanks for sharing this feedback!
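The difference between the two kinds of game wrapper can be reproduced in isolation. The sketch below uses hypothetical stand-in classes (`CartPoleLikeGame`, `BoardGame`, `BoardEnv`) rather than the repository's game files. It only shows why clearing `env` is harmless when `legal_actions` returns a constant list and fails when the call is delegated to `self.env`.

```python
class BoardEnv:
    """Stand-in environment that owns the legal-move logic."""

    def legal_actions(self):
        return list(range(9))


class CartPoleLikeGame:
    """Wrapper whose legal actions do not depend on the environment."""

    def __init__(self):
        self.env = object()

    def legal_actions(self):
        return [0, 1]


class BoardGame:
    """Wrapper that delegates legal_actions to its environment."""

    def __init__(self):
        self.env = BoardEnv()

    def legal_actions(self):
        return self.env.legal_actions()


def legal_actions_without_env(game):
    """Mimic `self.game.env = None`, then ask for the legal actions."""
    game.env = None
    try:
        return game.legal_actions()
    except AttributeError as error:
        return f"AttributeError: {error}"


if __name__ == "__main__":
    print(legal_actions_without_env(CartPoleLikeGame()))
    print(legal_actions_without_env(BoardGame()))
```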

FXDevailly

This line saves a negligible amount of memory but is incompatible with some environments, as pointed out by @theword.
It should therefore be deleted.

FXDevailly · 2021-04-22T13:03:14Z

replay_buffer.py

+        # Import the game class to enable MCTS updates
+        game_module = importlib.import_module("games." + self.config.game_filename)
+        self.game = game_module.Game()
+        self.game.env = None


Suggested change

-        self.game.env = None

qwyin · 2022-02-10T20:52:56Z

This is great. But after loading a checkpoint and restarting the training, reanalyse_priorities is initialized to None. That leads to an error at self.reanalyse_priorities += 1.
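One possible guard is sketched below. It assumes the priorities are per-episode staleness counters that are incremented after each Reanalyse pass and reset for the episode that was just updated. The class and method names are hypothetical, and a plain list stands in for whatever array type the pull request actually uses.

```python
import random


class ReanalysePriorities:
    """Hypothetical staleness counters used to pick the next episode."""

    def __init__(self, priorities=None):
        # May be None, e.g. right after restoring a checkpoint.
        self.priorities = priorities

    def _ensure_initialised(self, num_games):
        # Lazily create the counters instead of failing on `None += 1`.
        if self.priorities is None:
            self.priorities = [1] * num_games

    def sample(self, num_games, rng=random):
        """Pick an episode index, favouring episodes with older targets."""
        self._ensure_initialised(num_games)
        return rng.choices(range(num_games), weights=self.priorities, k=1)[0]

    def mark_reanalysed(self, game_index, num_games):
        """Age every episode by one pass, then reset the refreshed one."""
        self._ensure_initialised(num_games)
        self.priorities = [p + 1 for p in self.priorities]
        self.priorities[game_index] = 1


if __name__ == "__main__":
    priorities = ReanalysePriorities(None)  # simulates a restored checkpoint
    index = priorities.sample(num_games=3)
    priorities.mark_reanalysed(index, num_games=3)
    print(index, priorities.priorities)
```

The sketch does not handle a buffer whose size changes between calls. A real fix would also need to keep the counters aligned with the episodes stored in the buffer.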

YanivO1123 · 2022-03-23T09:14:48Z

replay_buffer.py

-                    torch.squeeze(values).detach().cpu().numpy()
-                )
+                # re-execute MCTS to update targets (child visits and root_values)
+                l = len(game_history.root_values)


Hey,

Shouldn't there be an else on this part?
If I understand correctly, the if on line 346 is there to trigger "do not use re-executed MCTS tree roots; use updated values directly from the value function instead".
So I believe these lines should run only when that if does not fire.
Apologies if the mistake is mine.

Ah, I now see that I missed the if on line 392, which solves the problem of unintentionally updating with tree roots.
For computational efficiency, though, wouldn't it still be better to add the else instead of computing both?

trunghng · 2024-08-18

I believe there shouldn't be an "else" here, since we have to execute MCTS to obtain a "fresh" policy (used as the training target 80% of the time, as mentioned in the MuZero Reanalyze appendix of the paper). Meanwhile, the value target is updated either by re-running MCTS or from a target network.
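The control flow described in this reply can be sketched as follows. This is a simplified illustration and not the code in replay_buffer.py: `run_mcts` and `target_value_fn` are hypothetical callables, and only the flag name `use_updated_mcts_value_targets` comes from this thread.

```python
def reanalyse_targets(observation, run_mcts, target_value_fn,
                      use_updated_mcts_value_targets):
    """Return (child_visits, root_value) for one stored position.

    MCTS always runs because the policy target needs fresh visit counts.
    Only the source of the value target depends on the flag.
    """
    child_visits, mcts_root_value = run_mcts(observation)
    if use_updated_mcts_value_targets:
        root_value = mcts_root_value
    else:
        # Value comes from the lagging target model instead of the search.
        root_value = target_value_fn(observation)
    return child_visits, root_value


if __name__ == "__main__":
    def fake_mcts(observation):
        return [0.25, 0.75], 1.5  # (visit distribution, root value)

    def fake_target_value(observation):
        return -0.5

    print(reanalyse_targets("obs", fake_mcts, fake_target_value, True))
    print(reanalyse_targets("obs", fake_mcts, fake_target_value, False))
```

In this sketch the target model is evaluated only when it is needed, but the search is never skipped, which is why a plain if/else around the MCTS call would not work.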

FXDevailly added 3 commits March 27, 2021 16:09

=re-execution of the mcts, parallelized reanalyse processes, and bootstrapping stabilization with a target network

e6eba73

=reanalyse settings added for sample efficiency

483ceee

=reanalyse settings added for sample efficiency

a8abc83

FXDevailly changed the title ~~Updates on Reanalyse / Sample Efficiency~~ Updates on Reanalyse / Sample Efficiency (Re-executing the MCTS, Parallelization, Stabilization with a target model, etc.) Mar 27, 2021

FXDevailly changed the title ~~Updates on Reanalyse / Sample Efficiency (Re-executing the MCTS, Parallelization, Stabilization with a target model, etc.)~~ Updates on Reanalyse / Sample Efficiency (Re-executing MCTS, Parallelization, Stabilization with a target model, etc.) Mar 27, 2021

Update replay_buffer.py

55acc24

trunghng Aug 18, 2024
