[rllib] Add SAC testing to premerge #59581
base: master
Conversation
Signed-off-by: Mark Towers <[email protected]>
# Conflicts: # rllib/BUILD.bazel
# Conflicts: # rllib/BUILD.bazel # rllib/examples/algorithms/bc/cartpole_bc.py # rllib/examples/algorithms/bc/cartpole_bc_with_offline_evaluation.py # rllib/examples/algorithms/bc/pendulum_bc.py # rllib/examples/algorithms/iql/pendulum_iql.py # rllib/examples/algorithms/marwil/cartpole_marwil.py
# Conflicts: # rllib/BUILD.bazel # rllib/examples/algorithms/appo/cartpole_appo.py # rllib/examples/algorithms/appo/halfcheetah_appo.py # rllib/examples/algorithms/appo/multi_agent_cartpole_appo.py # rllib/examples/algorithms/appo/multi_agent_pong_appo.py # rllib/examples/algorithms/appo/multi_agent_stateless_cartpole_appo.py # rllib/examples/algorithms/appo/pendulum_appo.py # rllib/examples/algorithms/appo/pong_appo.py # rllib/examples/algorithms/appo/stateless_cartpole_appo.py
# Conflicts: # rllib/examples/algorithms/appo/halfcheetah_appo.py # rllib/examples/algorithms/appo/multi_agent_cartpole_appo.py # rllib/examples/algorithms/appo/multi_agent_pong_appo.py # rllib/examples/algorithms/appo/multi_agent_stateless_cartpole_appo.py # rllib/examples/algorithms/appo/pendulum_appo.py # rllib/examples/algorithms/appo/pong_appo.py # rllib/examples/algorithms/appo/stateless_cartpole_appo.py # rllib/utils/test_utils.py
Code Review
This pull request significantly refactors and expands the APPO and SAC learning regression tests within RLlib. It introduces a new, detailed documentation block in rllib/BUILD.bazel and release/release_tests.yaml outlining test objectives, compute configurations, and environment legends for APPO and SAC. Many existing APPO and SAC tests are removed, and new ones are added or updated to cover various environments (Cartpole, Stateless Cartpole, TicTacToe, Atari, MuJoCo) and compute setups (local CPU, single CPU, single GPU, multi-GPU single/multi-node). The changes include adjusting test names, script paths, resource allocation (num-cpus, num-learners, num-gpus-per-learner, num-env-runners), and success criteria (stop-reward, stop-iters).
Review comments highlight several areas for improvement: updating docstrings in atari_appo.py, mujoco_appo.py, stateless_cartpole_appo_with_lstm.py, tictactoe_appo.py, atari_sac.py, mujoco_sac.py, stateless_cartpole_sac_with_lstm.py, and tictactoe_sac.py to ensure consistency with code parameters (rewards, timesteps, environment names) and correct example usage. There are also comments on fixing incorrect main attribute paths and multi_gpu tags in rllib/BUILD.bazel for SAC tests, and correcting a grammatical error in cartpole_appo.py's docstring. Additionally, the TicTacToe environment (rllib/examples/envs/classes/multi_agent/tic_tac_toe.py) is updated to include a max_timesteps limit, modify reward logic for invalid moves and draws, and improve observation space descriptions and rendering.
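To illustrate the TicTacToe changes the summary mentions (a `max_timesteps` cap plus reward handling for invalid moves and draws), the snippet below sketches the general pattern in a turn-based multi-agent environment. It is a hypothetical sketch, not the updated `tic_tac_toe.py`: the cap value, the -1 invalid-move penalty, the neutral draw reward, and the class layout are all assumptions.

```python
# Hypothetical sketch of the pattern described above -- NOT the code in
# rllib/examples/envs/classes/multi_agent/tic_tac_toe.py. The cap (max_timesteps),
# the -1 invalid-move penalty, and the neutral draw reward are assumed values.
class TinyTicTacToe:
    """Turn-based two-player env: "player1" places 1s, "player2" places -1s."""

    def __init__(self, max_timesteps: int = 20):
        self.max_timesteps = max_timesteps  # assumed cap; keeps episodes finite
        self.board = [0] * 9
        self.t = 0
        self.current = "player1"

    def reset(self):
        self.board = [0] * 9
        self.t = 0
        self.current = "player1"
        return {self.current: list(self.board)}, {}

    def step(self, action_dict):
        action = action_dict[self.current]
        mark = 1 if self.current == "player1" else -1
        rewards = {self.current: 0.0}

        if self.board[action] != 0:
            # Invalid move: penalize the mover instead of ending or looping forever.
            rewards[self.current] = -1.0
        else:
            self.board[action] = mark

        self.t += 1
        won = self._three_in_a_row(mark)
        draw = (not won) and all(cell != 0 for cell in self.board)
        if won:
            rewards[self.current] = 1.0
        elif draw:
            rewards[self.current] = 0.0  # draws are neutral in this sketch

        terminateds = {"__all__": won or draw}
        truncateds = {"__all__": self.t >= self.max_timesteps}  # hard episode cap

        # The next player to move observes the board (one agent acts per step).
        self.current = "player2" if self.current == "player1" else "player1"
        obs = {self.current: list(self.board)}
        return obs, rewards, terminateds, truncateds, {}

    def _three_in_a_row(self, mark):
        lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),
                 (0, 4, 8), (2, 4, 6)]
        return any(all(self.board[i] == mark for i in line) for line in lines)
```

The timestep cap guarantees a finite episode even if an agent keeps selecting occupied cells, while the penalty gives the learner a signal to avoid them.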
py_test(
    name = "learning_tests_multi_agent_pendulum_sac_gpu",
    name = "learning_tests_sac_mujoco_single_gpu",
    size = "large",
    srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"],
    srcs = ["examples/algorithms/sac/mujoco_sac.py"],
    args = [
        "--as-test",
        "--num-agents=2",
        "--num-cpus=4",
        "--num-cpus=8",
        "--num-env-runners=5",
        "--num-learners=1",
        "--num-gpus-per-learner=1",
        "--stop-reward=200",
    ],
    main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
    tags = [
        "exclusive",
        "gpu",
        "learning_tests",
        "learning_tests_continuous",
        "team:rllib",
        "torch_only",
    ],
)

py_test(
    name = "learning_tests_multi_agent_pendulum_sac_multi_cpu",
    size = "large",
    srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"],
    args = [
        "--num-agents=2",
        "--num-learners=2",
    ],
    main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
    tags = [
        "exclusive",
        "learning_tests",
        "learning_tests_continuous",
        "team:rllib",
        "torch_only",
    ],
)

py_test(
    name = "learning_tests_multi_agent_pendulum_sac_multi_gpu",
    size = "large",
    timeout = "eternal",
    srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"],
    args = [
        "--num-agents=2",
        "--num-learners=2",
        "--num-gpus-per-learner=1",
    ],
    main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
    main = "examples/algorithms/sac/mountaincar_sac.py",
    tags = [
        "exclusive",
        "learning_tests",
        "learning_tests_continuous",
        "multi_gpu",
        "team:rllib",
        "torch_only",
    ],
)
| """Example showing how to train SAC in a multi-agent Pendulum environment. | ||
| This example demonstrates Soft Actor-Critic (SAC) in a multi-agent setting where | ||
| multiple independent agents each control their own pendulum. Each agent has its | ||
| own policy that learns to swing up and balance its pendulum. | ||
| This example: | ||
| - Trains on the MultiAgentPendulum environment with configurable number of agents | ||
| - Uses a multi-agent prioritized experience replay buffer | ||
| - Configures separate policies for each agent via policy_mapping_fn | ||
| - Applies n-step returns with random n in range [2, 5] | ||
| - Uses automatic entropy tuning with target_entropy="auto" | ||
| How to run this script | ||
| ---------------------- | ||
| `python tictactoe_sac.py --num-agents=2` | ||
| To train with more agents: | ||
| `python tictactoe_sac.py --num-agents=4` | ||
| To scale up with distributed learning using multiple learners and env-runners: | ||
| `python tictactoe_sac.py --num-learners=2 --num-env-runners=8` | ||
| To use a GPU-based learner add the number of GPUs per learners: | ||
| `python tictactoe_sac.py --num-learners=1 --num-gpus-per-learner=1` | ||
| For debugging, use the following additional command line options | ||
| `--no-tune --num-env-runners=0 --num-learners=0` | ||
| which should allow you to set breakpoints anywhere in the RLlib code and | ||
| have the execution stop there for inspection and debugging. | ||
| By setting `--num-learners=0` and `--num-env-runners=0` will make them run locally | ||
| instead of remote Ray Actor where breakpoints aren't possible. | ||
| For logging to your WandB account, use: | ||
| `--wandb-key=[your WandB API key] --wandb-project=[some project name] | ||
| --wandb-run-name=[optional: WandB run name (within the defined project)]` | ||
| Results to expect | ||
| ----------------- | ||
| Training should show all agents learning to swing up their pendulums within 500k | ||
| timesteps. |
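For context on the configuration the quoted docstring describes (per-agent policies via `policy_mapping_fn`, random n-step returns in [2, 5], `target_entropy="auto"`, a multi-agent prioritized replay buffer), the sketch below shows roughly how that looks with RLlib's `SACConfig`. It is an illustrative assumption, not code from this PR: `make_multi_agent` stands in for the example's own `MultiAgentPendulum` class, and the replay-buffer type string and tuple-valued `n_step` are assumed to match the current RLlib API.

```python
# Hedged sketch -- assumptions, not code from this PR.
from ray.rllib.algorithms.sac import SACConfig
from ray.rllib.env.multi_agent_env import make_multi_agent

NUM_AGENTS = 2
MultiAgentPendulum = make_multi_agent("Pendulum-v1")  # stand-in for the example's env class

config = (
    SACConfig()
    .environment(MultiAgentPendulum, env_config={"num_agents": NUM_AGENTS})
    .multi_agent(
        # One policy per agent, selected by agent id.
        policies={f"p{i}" for i in range(NUM_AGENTS)},
        policy_mapping_fn=lambda agent_id, episode, **kwargs: f"p{agent_id}",
    )
    .training(
        n_step=(2, 5),          # random n-step returns drawn from [2, 5]
        target_entropy="auto",  # automatic entropy tuning
        replay_buffer_config={"type": "MultiAgentPrioritizedEpisodeReplayBuffer"},
    )
)
# algo = config.build(); algo.train()
```

The example script under review presumably wires the equivalent settings through its command-line flags (`--num-agents`, `--num-learners`, `--num-gpus-per-learner`).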
| name = "learning_tests_sac_atari_multi_gpu", | ||
| size = "large", | ||
| srcs = ["examples/algorithms/sac/pendulum_sac.py"], | ||
| srcs = ["examples/algorithms/sac/atari_sac.py"], | ||
| args = [ | ||
| "--as-test", | ||
| "--num-cpus=16", | ||
| "--num-env-runners=10", | ||
| "--num-learners=2", | ||
| "--num-gpus-per-learner=1", | ||
| "--stop-reward=5", | ||
| ], | ||
| main = "examples/algorithms/sac/pendulum_sac.py", | ||
| tags = [ | ||
| "exclusive", | ||
| "learning_tests", | ||
| "learning_tests_continuous", | ||
| "multi_gpu", | ||
| "team:rllib", | ||
| "torch_only", | ||
| ], | ||
| ) | ||
|
|
||
| # MultiAgentPendulum | ||
| py_test( | ||
| name = "learning_tests_multi_agent_pendulum_sac", | ||
| size = "large", | ||
| srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"], | ||
| args = [ | ||
| "--as-test", | ||
| "--num-agents=2", | ||
| "--num-cpus=4", | ||
| ], | ||
| main = "examples/algorithms/sac/multi_agent_pendulum_sac.py", | ||
| main = "examples/algorithms/sac/atari_sac.py", | ||
| tags = [ | ||
| "exclusive", | ||
| "gpu", | ||
| "learning_tests", | ||
| "learning_tests_continuous", | ||
| "team:rllib", | ||
| "torch_only", | ||
| ], | ||
| ) |
This test is configured for multi-GPU execution (num_learners=2, num_gpus_per_learner=1), but it's tagged with gpu. For consistency with other multi-GPU tests in this file (e.g., learning_test_appo_tictactoe_multi_gpu), this should be multi_gpu.
tags = [
"exclusive",
"multi_gpu",
"learning_tests",
"team:rllib",
],
The algorithm should reach the default reward threshold of XX on Breakout
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).
The number of environment steps can be
through changed `default_timesteps`. Training is fast on this simple environment,
There's a minor grammatical error in the docstring. It should be "can be changed through" instead of "can be through changed".
-    The number of environment steps can be
-    through changed `default_timesteps`. Training is fast on this simple environment,
+    The number of environment steps can be changed
+    through `default_timesteps`.
The algorithm should reach the default reward threshold of 9000.0 within
2 million timesteps (see: `default_timesteps` in the code).
The algorithm should reach the default reward threshold of 300.0 (500.0 is the
maximum) within approximately 2 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.
Training will run for 100 thousand timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of XX compared to the
random policy. The number of environment steps can be changed through
`python [script file name].py --num-learners=2 --num-env-runners=8`
To use a GPU-based learner add the number of GPUs per learners:
`python [script file name].py --num-learners=1 --num-gpus-per-learner=1`
The example usage in the docstring uses [script file name].py as a placeholder. This should be replaced with the actual script name, atari_sac.py, for clarity.
-    `python [script file name].py --num-learners=2 --num-env-runners=8`
-    To use a GPU-based learner add the number of GPUs per learners:
-    `python [script file name].py --num-learners=1 --num-gpus-per-learner=1`
+    `python atari_sac.py --num-learners=2 --num-env-runners=8`
+    To use a GPU-based learner add the number of GPUs per learners:
+    `python atari_sac.py --num-learners=1 --num-gpus-per-learner=1`
Results to expect
-----------------
Training should reach a reward of ~12,000 within 1M timesteps (~2000 iterations).
…9895)

## Description

For #59508 and #59581, using TicTacToe would cause the following error:

```
ray::DQN.train() (pid=88183, ip=127.0.0.1, actor_id=2b775f13e808cc4aaaa23bde01000000, repr=DQN(env=<class 'ray.rllib.examples.envs.classes.multi_agent.tic_tac_toe.TicTacToe'>; env-runners=0; learners=0; multi-agent=True))
  File "ray/python/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "ray/python/ray/tune/trainable/trainable.py", line 328, in train
    result = self.step()
  File "ray/python/ray/rllib/algorithms/algorithm.py", line 1242, in step
    train_results, train_iter_ctx = self._run_one_training_iteration()
  File "ray/python/ray/rllib/algorithms/algorithm.py", line 3666, in _run_one_training_iteration
    training_step_return_value = self.training_step()
  File "ray/python/ray/rllib/algorithms/dqn/dqn.py", line 646, in training_step
    return self._training_step_new_api_stack()
  File "ray/python/ray/rllib/algorithms/dqn/dqn.py", line 668, in _training_step_new_api_stack
    self.local_replay_buffer.add(episodes)
  File "ray/python/ray/rllib/utils/replay_buffers/prioritized_episode_buffer.py", line 314, in add
    existing_eps.concat_episode(eps)
  File "ray/python/ray/rllib/env/multi_agent_episode.py", line 862, in concat_episode
    sa_episode.concat_episode(other.agent_episodes[agent_id])
  File "ray/python/ray/rllib/env/single_agent_episode.py", line 618, in concat_episode
    assert self.t == other.t_started
AssertionError
```

In the multi-agent episode's `concat_episode`, we check whether any agent has not yet received the next observation following an observation/action pair. This produces a hanging action: one episode chunk holds the observation and action, and the next chunk holds the resulting observation, reward, etc. This [code](https://github.com/ray-project/ray/blob/22cf6ef6af2cddc233bca7ce59668ed8f4bbb17e/rllib/env/multi_agent_episode.py#L848) checks whether this has happened and then adds an extra step at the beginning to include the hanging data. However, in testing, the multi-agent episode `cut` method already handles this (using `slice` instead would cause a hidden bug), so an unnecessary extra step's data was being added and the episode beginnings no longer lined up. Therefore, this PR removes that code and replaces it with a simple check that assumes the hanging action is equivalent to the initial action in the next episode chunk.

For testing, I found that the `concat_episode` test was using `slice`, which doesn't account for hanging data, while `cut`, which is used in the env-runner, does. I modified the test to be more functional: I created a custom environment in which agents take actions at different frequencies and observe their own timestep. This lets us test by concatenating all episode chunks with the same ID and checking that the observations increase 0, 1, 2, ..., ensuring that no data goes missing for users.

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
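To make the testing approach described above concrete, here is a minimal hypothetical sketch (not the actual test added in that PR): a toy environment whose agents act at different frequencies and observe their own step counter, so that concatenating all chunks of an episode and checking that the observations read 0, 1, 2, ... catches any lost or duplicated step.

```python
# Hypothetical sketch, not the test added in that PR: a toy two-agent environment
# following RLlib's MultiAgentEnv dict conventions, where "slow" only acts every
# other step and every observation is the agent's own step counter.
class CountingFrequencyEnv:
    def __init__(self, episode_len: int = 10):
        self.episode_len = episode_len
        self.t = 0
        self.counts = {"fast": 0, "slow": 0}

    def reset(self):
        self.t = 0
        self.counts = {"fast": 0, "slow": 0}
        return dict(self.counts), {}  # both agents start by observing 0

    def step(self, action_dict):
        self.t += 1
        # "fast" acts every step, "slow" only on even steps.
        acting = ["fast"] + (["slow"] if self.t % 2 == 0 else [])
        obs, rewards = {}, {}
        for agent in acting:
            self.counts[agent] += 1
            obs[agent] = self.counts[agent]
            rewards[agent] = 0.0
        done = self.t >= self.episode_len
        return obs, rewards, {"__all__": done}, {"__all__": False}, {}


# The property the new test checks, in spirit: after concatenating every chunk of
# an episode, each agent's observations must count up with no gaps or repeats.
def assert_contiguous(observations_per_agent):
    for agent, obs in observations_per_agent.items():
        assert obs == list(range(len(obs))), f"missing or duplicated data for {agent}"
```

Per the description above, the real test drives episode chunks through the env-runner's `cut` path (the case the fix targets) rather than `slice`.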
Description
The RLlib team is in the process of updating its algorithm examples and premerge testing.
This PR updates the SAC algorithm examples and adds them to BUILD.bazel.