Conversation

pseudo-rnd-thoughts (Member) commented Dec 19, 2025

Description

The RLlib team is in the process of updating its algorithm examples and premerge testing.
This PR updates the SAC algorithm examples and adds them to BUILD.bazel.
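
For orientation, the example scripts referenced by the new BUILD targets broadly follow one pattern: parse the shared CLI flags that the `py_test` args pass in (`--num-env-runners`, `--num-learners`, `--num-gpus-per-learner`, `--stop-reward`, ...), build an `SACConfig`, and train until a stop criterion is hit. The snippet below is a minimal hand-written sketch of that pattern against the current RLlib API stack, not the actual script added in this PR; the real examples use RLlib's shared example-script utilities for flag parsing and stop conditions.

```python
# Schematic sketch only -- not the actual example script from this PR.
import argparse

from ray.rllib.algorithms.sac import SACConfig

parser = argparse.ArgumentParser()
# Flags mirroring the args passed by the BUILD.bazel py_test targets.
parser.add_argument("--num-env-runners", type=int, default=2)
parser.add_argument("--num-learners", type=int, default=0)
parser.add_argument("--num-gpus-per-learner", type=int, default=0)
parser.add_argument("--stop-reward", type=float, default=-250.0)
parser.add_argument("--stop-iters", type=int, default=200)
args = parser.parse_args()

config = (
    SACConfig()
    .environment("Pendulum-v1")
    .env_runners(num_env_runners=args.num_env_runners)
    .learners(
        num_learners=args.num_learners,
        num_gpus_per_learner=args.num_gpus_per_learner,
    )
    .training(gamma=0.99)
)

algo = config.build()
for _ in range(args.stop_iters):
    result = algo.train()
    # The real examples use RLlib's shared test utilities to honor
    # `--stop-reward` / `--as-test`; here we only run a fixed number of iters.
algo.stop()
```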

Mark Towers and others added 30 commits November 21, 2025 17:06
Signed-off-by: Mark Towers <[email protected]>
# Conflicts:
#	rllib/BUILD.bazel
Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: Mark Towers <[email protected]>
# Conflicts:
#	rllib/BUILD.bazel
#	rllib/examples/algorithms/bc/cartpole_bc.py
#	rllib/examples/algorithms/bc/cartpole_bc_with_offline_evaluation.py
#	rllib/examples/algorithms/bc/pendulum_bc.py
#	rllib/examples/algorithms/iql/pendulum_iql.py
#	rllib/examples/algorithms/marwil/cartpole_marwil.py
Signed-off-by: Mark Towers <[email protected]>
# Conflicts:
#	rllib/BUILD.bazel
#	rllib/examples/algorithms/appo/cartpole_appo.py
#	rllib/examples/algorithms/appo/halfcheetah_appo.py
#	rllib/examples/algorithms/appo/multi_agent_cartpole_appo.py
#	rllib/examples/algorithms/appo/multi_agent_pong_appo.py
#	rllib/examples/algorithms/appo/multi_agent_stateless_cartpole_appo.py
#	rllib/examples/algorithms/appo/pendulum_appo.py
#	rllib/examples/algorithms/appo/pong_appo.py
#	rllib/examples/algorithms/appo/stateless_cartpole_appo.py
Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: Mark Towers <[email protected]>
# Conflicts:
#	rllib/BUILD.bazel
Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: Mark Towers <[email protected]>
# Conflicts:
#	rllib/examples/algorithms/appo/halfcheetah_appo.py
#	rllib/examples/algorithms/appo/multi_agent_cartpole_appo.py
#	rllib/examples/algorithms/appo/multi_agent_pong_appo.py
#	rllib/examples/algorithms/appo/multi_agent_stateless_cartpole_appo.py
#	rllib/examples/algorithms/appo/pendulum_appo.py
#	rllib/examples/algorithms/appo/pong_appo.py
#	rllib/examples/algorithms/appo/stateless_cartpole_appo.py
#	rllib/utils/test_utils.py
Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: Mark Towers <[email protected]>
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request significantly refactors and expands the APPO and SAC learning regression tests within RLlib. It introduces a new, detailed documentation block in rllib/BUILD.bazel and release/release_tests.yaml outlining test objectives, compute configurations, and environment legends for APPO and SAC. Many existing APPO and SAC tests are removed, and new ones are added or updated to cover various environments (Cartpole, Stateless Cartpole, TicTacToe, Atari, MuJoCo) and compute setups (local CPU, single CPU, single GPU, multi-GPU single/multi-node). The changes include adjusting test names, script paths, resource allocation (num-cpus, num-learners, num-gpus-per-learner, num-env-runners), and success criteria (stop-reward, stop-iters).

Review comments highlight several areas for improvement:

- Updating the docstrings in atari_appo.py, mujoco_appo.py, stateless_cartpole_appo_with_lstm.py, tictactoe_appo.py, atari_sac.py, mujoco_sac.py, stateless_cartpole_sac_with_lstm.py, and tictactoe_sac.py so they are consistent with the code parameters (rewards, timesteps, environment names) and show correct example usage.
- Fixing incorrect main attribute paths and multi_gpu tags in rllib/BUILD.bazel for the SAC tests, and correcting a grammatical error in cartpole_appo.py's docstring.
- Updating the TicTacToe environment (rllib/examples/envs/classes/multi_agent/tic_tac_toe.py) to include a max_timesteps limit, modify the reward logic for invalid moves and draws, and improve the observation space descriptions and rendering.
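
To make the TicTacToe change concrete, the sketch below shows the kind of `step()` reward branching and episode cap the summary describes. All values and names are illustrative assumptions, not the actual code in rllib/examples/envs/classes/multi_agent/tic_tac_toe.py.

```python
# Illustrative only -- penalty/reward values and the cap are assumptions, not the PR's code.
def tictactoe_step_outcome(move_is_valid, winner, board_full, t, max_timesteps=100):
    """Return (reward_current_player, reward_opponent, terminated, truncated)."""
    if not move_is_valid:
        # Penalize only the offending player; the game continues.
        return -5.0, 0.0, False, False
    if winner is not None:
        # Zero-sum win/loss rewards end the episode.
        return 1.0, -1.0, True, False
    if board_full:
        # Draw: neutral reward for both players, episode ends.
        return 0.0, 0.0, True, False
    # Episode-length cap so repeated invalid moves can't make games run forever.
    return 0.0, 0.0, False, t >= max_timesteps
```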

Comment on lines 1649 to 1668
py_test(
name = "learning_tests_multi_agent_pendulum_sac_gpu",
name = "learning_tests_sac_mujoco_single_gpu",
size = "large",
srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"],
srcs = ["examples/algorithms/sac/mujoco_sac.py"],
args = [
"--as-test",
"--num-agents=2",
"--num-cpus=4",
"--num-cpus=8",
"--num-env-runners=5",
"--num-learners=1",
"--num-gpus-per-learner=1",
"--stop-reward=200",
],
main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
tags = [
"exclusive",
"gpu",
"learning_tests",
"learning_tests_continuous",
"team:rllib",
"torch_only",
],
)

py_test(
name = "learning_tests_multi_agent_pendulum_sac_multi_cpu",
size = "large",
srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"],
args = [
"--num-agents=2",
"--num-learners=2",
],
main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
tags = [
"exclusive",
"learning_tests",
"learning_tests_continuous",
"team:rllib",
"torch_only",
],
)

py_test(
name = "learning_tests_multi_agent_pendulum_sac_multi_gpu",
size = "large",
timeout = "eternal",
srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"],
args = [
"--num-agents=2",
"--num-learners=2",
"--num-gpus-per-learner=1",
],
main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
main = "examples/algorithms/sac/mountaincar_sac.py",
tags = [
"exclusive",
"learning_tests",
"learning_tests_continuous",
"multi_gpu",
"team:rllib",
"torch_only",
],
)

critical

There are a couple of issues with this py_test definition:

  1. The main attribute on line 1661 points to mountaincar_sac.py, but it should be mujoco_sac.py to match the srcs.
  2. This is a single-GPU test, but it's tagged with multi_gpu on line 1665. It should be gpu for consistency.

Comment on lines +1 to +41
"""Example showing how to train SAC in a multi-agent Pendulum environment.
This example demonstrates Soft Actor-Critic (SAC) in a multi-agent setting where
multiple independent agents each control their own pendulum. Each agent has its
own policy that learns to swing up and balance its pendulum.
This example:
- Trains on the MultiAgentPendulum environment with configurable number of agents
- Uses a multi-agent prioritized experience replay buffer
- Configures separate policies for each agent via policy_mapping_fn
- Applies n-step returns with random n in range [2, 5]
- Uses automatic entropy tuning with target_entropy="auto"
How to run this script
----------------------
`python tictactoe_sac.py --num-agents=2`
To train with more agents:
`python tictactoe_sac.py --num-agents=4`
To scale up with distributed learning using multiple learners and env-runners:
`python tictactoe_sac.py --num-learners=2 --num-env-runners=8`
To use a GPU-based learner add the number of GPUs per learners:
`python tictactoe_sac.py --num-learners=1 --num-gpus-per-learner=1`
For debugging, use the following additional command line options
`--no-tune --num-env-runners=0 --num-learners=0`
which should allow you to set breakpoints anywhere in the RLlib code and
have the execution stop there for inspection and debugging.
By setting `--num-learners=0` and `--num-env-runners=0` will make them run locally
instead of remote Ray Actor where breakpoints aren't possible.
For logging to your WandB account, use:
`--wandb-key=[your WandB API key] --wandb-project=[some project name]
--wandb-run-name=[optional: WandB run name (within the defined project)]`
Results to expect
-----------------
Training should show all agents learning to swing up their pendulums within 500k
timesteps.

high

The docstring describes a multi-agent Pendulum environment, but the filename is tictactoe_sac.py. This seems to be a copy-paste error and should be updated to describe the TicTacToe SAC example.
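
For reference, a configuration matching what that (misplaced) docstring describes — one policy per agent, a multi-agent prioritized replay buffer, random n-step returns in [2, 5], and automatic entropy tuning — would look roughly like the sketch below. This is an illustration against the recent RLlib API stack, not the PR's multi_agent_pendulum_sac.py; the `MultiAgentPendulum` import path and the replay buffer `type` string are assumptions.

```python
# Rough sketch of the setup the docstring describes -- not the PR's actual script.
from ray.rllib.algorithms.sac import SACConfig
from ray.rllib.examples.envs.classes.multi_agent import MultiAgentPendulum  # import path assumed

NUM_AGENTS = 2

config = (
    SACConfig()
    .environment(MultiAgentPendulum, env_config={"num_agents": NUM_AGENTS})
    .multi_agent(
        # One policy per agent, mapped by agent id.
        policies={f"p{i}" for i in range(NUM_AGENTS)},
        policy_mapping_fn=lambda agent_id, episode, **kwargs: f"p{agent_id}",
    )
    .training(
        # Random n-step returns with n drawn from [2, 5].
        n_step=(2, 5),
        # Automatic entropy tuning.
        target_entropy="auto",
        # Multi-agent prioritized experience replay (type string assumed).
        replay_buffer_config={"type": "MultiAgentPrioritizedEpisodeReplayBuffer"},
    )
)
```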

Comment on lines +1629 to 1647
name = "learning_tests_sac_atari_multi_gpu",
size = "large",
srcs = ["examples/algorithms/sac/pendulum_sac.py"],
srcs = ["examples/algorithms/sac/atari_sac.py"],
args = [
"--as-test",
"--num-cpus=16",
"--num-env-runners=10",
"--num-learners=2",
"--num-gpus-per-learner=1",
"--stop-reward=5",
],
main = "examples/algorithms/sac/pendulum_sac.py",
tags = [
"exclusive",
"learning_tests",
"learning_tests_continuous",
"multi_gpu",
"team:rllib",
"torch_only",
],
)

# MultiAgentPendulum
py_test(
name = "learning_tests_multi_agent_pendulum_sac",
size = "large",
srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"],
args = [
"--as-test",
"--num-agents=2",
"--num-cpus=4",
],
main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
main = "examples/algorithms/sac/atari_sac.py",
tags = [
"exclusive",
"gpu",
"learning_tests",
"learning_tests_continuous",
"team:rllib",
"torch_only",
],
)

medium

This test is configured for multi-GPU execution (num_learners=2, num_gpus_per_learner=1), but it's tagged with gpu. For consistency with other multi-GPU tests in this file (e.g., learning_test_appo_tictactoe_multi_gpu), this should be multi_gpu.

    tags = [
        "exclusive",
        "multi_gpu",
        "learning_tests",
        "team:rllib",
    ],

Comment on lines 44 to 46
The algorithm should reach the default reward threshold of XX on Breakout
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).

medium

The docstring has a placeholder XX for the reward threshold and mentions Breakout while the default environment for this example is Pong. Please update the placeholder and clarify the environment to avoid confusion.

Comment on lines 39 to 40
The number of environment steps can be
through changed `default_timesteps`. Training is fast on this simple environment,

medium

There's a minor grammatical error in the docstring. It should be "can be changed through" instead of "can be through changed".

Suggested change
The number of environment steps can be
through changed `default_timesteps`. Training is fast on this simple environment,
The number of environment steps can be changed
through `default_timesteps`.

Comment on lines 53 to 54
The algorithm should reach the default reward threshold of 9000.0 within
2 million timesteps (see: `default_timesteps` in the code).

medium

The docstring mentions an expected reward of 9000.0, but the default_reward in the parser is set to 800.0. The default environment is Humanoid-v4, for which 800.0 seems more reasonable. The 9000.0 might be a leftover from a HalfCheetah-v4 example. Please align the docstring with the code.

Comment on lines 48 to 51
The algorithm should reach the default reward threshold of 300.0 (500.0 is the
maximum) within approximately 2 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.

medium

The docstring states an expected reward of 300.0 within 2 million timesteps, but the parser defaults are 350.0 and 5_000_000 timesteps. Please update the docstring to match the code's default parameters for consistency.

Comment on lines 46 to 48
Training will run for 100 thousand timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of XX compared to the
random policy. The number of environment steps can be changed through

medium

The docstring contains a placeholder XX for the expected mean return. Please replace this with the actual expected value.

Comment on lines +23 to +26
`python [script file name].py --num-learners=2 --num-env-runners=8`
To use a GPU-based learner add the number of GPUs per learners:
`python [script file name].py --num-learners=1 --num-gpus-per-learner=1`

medium

The example usage in the docstring uses [script file name].py as a placeholder. This should be replaced with the actual script name, atari_sac.py, for clarity.

Suggested change
`python [script file name].py --num-learners=2 --num-env-runners=8`
To use a GPU-based learner add the number of GPUs per learners:
`python [script file name].py --num-learners=1 --num-gpus-per-learner=1`
`python atari_sac.py --num-learners=2 --num-env-runners=8`
To use a GPU-based learner add the number of GPUs per learners:
`python atari_sac.py --num-learners=1 --num-gpus-per-learner=1`

Results to expect
-----------------
Training should reach a reward of ~12,000 within 1M timesteps (~2000 iterations).

medium

The docstring mentions an expected reward of ~12,000, but the parser's default_reward is 800.0. The default environment is Humanoid-v4. The high reward value might be from a different environment like HalfCheetah-v4. Please correct the docstring to match the default configuration.

simonsays1980 pushed a commit that referenced this pull request Jan 8, 2026
…9895)

## Description
For #59508 and #59581: using the TicTacToe environment would
cause the following error

```
ray::DQN.train() (pid=88183, ip=127.0.0.1, actor_id=2b775f13e808cc4aaaa23bde01000000, repr=DQN(env=<class 'ray.rllib.examples.envs.classes.multi_agent.tic_tac_toe.TicTacToe'>; env-runners=0; learners=0; multi-agent=True))
  File "ray/python/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "ray/python/ray/tune/trainable/trainable.py", line 328, in train
    result = self.step()
  File "ray/python/ray/rllib/algorithms/algorithm.py", line 1242, in step
    train_results, train_iter_ctx = self._run_one_training_iteration()
  File "ray/python/ray/rllib/algorithms/algorithm.py", line 3666, in _run_one_training_iteration
    training_step_return_value = self.training_step()
  File "ray/python/ray/rllib/algorithms/dqn/dqn.py", line 646, in training_step
    return self._training_step_new_api_stack()
  File "ray/python/ray/rllib/algorithms/dqn/dqn.py", line 668, in _training_step_new_api_stack
    self.local_replay_buffer.add(episodes)
  File "ray/python/ray/rllib/utils/replay_buffers/prioritized_episode_buffer.py", line 314, in add
    existing_eps.concat_episode(eps)
  File "ray/python/ray/rllib/env/multi_agent_episode.py", line 862, in concat_episode
    sa_episode.concat_episode(other.agent_episodes[agent_id])
  File "ray/python/ray/rllib/env/single_agent_episode.py", line 618, in concat_episode
    assert self.t == other.t_started
AssertionError
```

In the multi-agent episode's `concat_episode`, we check whether any agent has taken
an action but not yet received its next observation. This results in a hanging
action, where one episode chunk ends with the observation and action and the next
chunk begins with the resulting observation, reward, etc. This
[code](https://github.com/ray-project/ray/blob/22cf6ef6af2cddc233bca7ce59668ed8f4bbb17e/rllib/env/multi_agent_episode.py#L848)
checks whether this has happened and, if so, adds an extra step at the beginning to
include this hanging data.

However, in testing, the multi-agent episode `cut` method already handles this
(using `slice` instead would cause a hidden bug), meaning an extra, unnecessary
step's data was being added and the episode beginnings no longer lined up.
Therefore, this PR removes this code and replaces it with a simple check that
assumes the hanging action is equivalent to the initial action in the next episode.

For testing, I found that the `concat_episode` test was using `slice`, which doesn't
account for hanging data, while `cut`, which is what the env-runner uses, does. I
modified the test to be more functional: I created a custom environment in which
agents take actions at different frequencies and receive their own timestep as the
observation. This lets us test by concatenating all episodes with the same ID and
checking that the observations increase 0, 1, 2, ..., ensuring that no data goes
missing for users.

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
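
To illustrate the test idea described in the commit message above (agents stepping at different frequencies, with each agent's own step counter as its observation, so that concatenated chunks must yield observations 0, 1, 2, ...), here is a small, self-contained toy in plain Python. It is not the actual RLlib `MultiAgentEpisode` test code; all names are hypothetical.

```python
# Toy illustration of the concatenation check -- not RLlib's actual test.
from collections import defaultdict


def generate_stream(num_steps, frequencies):
    """Yield (agent_id, obs) pairs; each agent acts every `frequencies[agent]` env steps,
    and its observation is its own step counter (0, 1, 2, ...)."""
    counters = defaultdict(int)
    for t in range(num_steps):
        for agent, freq in frequencies.items():
            if t % freq == 0:
                yield agent, counters[agent]
                counters[agent] += 1


def split_into_chunks(stream, chunk_size):
    """Cut the episode stream into chunks at arbitrary points (like env-runner cuts)."""
    chunk, chunks = [], []
    for item in stream:
        chunk.append(item)
        if len(chunk) == chunk_size:
            chunks.append(chunk)
            chunk = []
    if chunk:
        chunks.append(chunk)
    return chunks


# Concatenate all chunks back together and check per-agent observations are gapless.
chunks = split_into_chunks(generate_stream(100, {"a0": 1, "a1": 3, "a2": 7}), chunk_size=8)
per_agent = defaultdict(list)
for chunk in chunks:
    for agent, obs in chunk:
        per_agent[agent].append(obs)
for agent, observations in per_agent.items():
    assert observations == list(range(len(observations))), f"data lost for {agent}"
print("no data went missing:", {a: len(o) for a, o in per_agent.items()})
```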
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026