
Conversation

@pseudo-rnd-thoughts
Member

@pseudo-rnd-thoughts pseudo-rnd-thoughts commented Dec 19, 2025

Description

The RLlib team is in the process of updating its algorithm examples and premerge testing.
This PR updates the SAC algorithm examples and adds them to BUILD.bazel.

Mark Towers and others added 30 commits November 21, 2025 17:06
Mark Towers added 4 commits December 19, 2025 16:54
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly refactors and expands the APPO and SAC learning regression tests within RLlib. It introduces a new, detailed documentation block in rllib/BUILD.bazel and release/release_tests.yaml outlining test objectives, compute configurations, and environment legends for APPO and SAC. Many existing APPO and SAC tests are removed, and new ones are added or updated to cover various environments (Cartpole, Stateless Cartpole, TicTacToe, Atari, MuJoCo) and compute setups (local CPU, single CPU, single GPU, multi-GPU single/multi-node). The changes include adjusting test names, script paths, resource allocation (num-cpus, num-learners, num-gpus-per-learner, num-env-runners), and success criteria (stop-reward, stop-iters).

Review comments highlight several areas for improvement: updating docstrings in atari_appo.py, mujoco_appo.py, stateless_cartpole_appo_with_lstm.py, tictactoe_appo.py, atari_sac.py, mujoco_sac.py, stateless_cartpole_sac_with_lstm.py, and tictactoe_sac.py to ensure consistency with code parameters (rewards, timesteps, environment names) and correct example usage. There are also comments on fixing incorrect main attribute paths and multi_gpu tags in rllib/BUILD.bazel for SAC tests, and correcting a grammatical error in cartpole_appo.py's docstring. Additionally, the TicTacToe environment (rllib/examples/envs/classes/multi_agent/tic_tac_toe.py) is updated to include a max_timesteps limit, modify reward logic for invalid moves and draws, and improve observation space descriptions and rendering.

Comment on lines 1649 to 1668
py_test(
name = "learning_tests_multi_agent_pendulum_sac_gpu",
name = "learning_tests_sac_mujoco_single_gpu",
size = "large",
srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"],
srcs = ["examples/algorithms/sac/mujoco_sac.py"],
args = [
"--as-test",
"--num-agents=2",
"--num-cpus=4",
"--num-cpus=8",
"--num-env-runners=5",
"--num-learners=1",
"--num-gpus-per-learner=1",
"--stop-reward=200",
],
main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
tags = [
"exclusive",
"gpu",
"learning_tests",
"learning_tests_continuous",
"team:rllib",
"torch_only",
],
)

py_test(
name = "learning_tests_multi_agent_pendulum_sac_multi_cpu",
size = "large",
srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"],
args = [
"--num-agents=2",
"--num-learners=2",
],
main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
tags = [
"exclusive",
"learning_tests",
"learning_tests_continuous",
"team:rllib",
"torch_only",
],
)

py_test(
name = "learning_tests_multi_agent_pendulum_sac_multi_gpu",
size = "large",
timeout = "eternal",
srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"],
args = [
"--num-agents=2",
"--num-learners=2",
"--num-gpus-per-learner=1",
],
main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
main = "examples/algorithms/sac/mountaincar_sac.py",
tags = [
"exclusive",
"learning_tests",
"learning_tests_continuous",
"multi_gpu",
"team:rllib",
"torch_only",
],
)

critical

There are a couple of issues with this py_test definition:

  1. The main attribute on line 1661 points to mountaincar_sac.py, but it should be mujoco_sac.py to match the srcs.
  2. This is a single-GPU test, but it's tagged with multi_gpu on line 1665. It should be gpu for consistency.
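
With both fixes applied, the target would presumably look roughly like the sketch below; the argument list is carried over from the quoted diff and may not match exactly which flags the final test keeps.

```
py_test(
    name = "learning_tests_sac_mujoco_single_gpu",
    size = "large",
    srcs = ["examples/algorithms/sac/mujoco_sac.py"],
    args = [
        "--as-test",
        "--num-cpus=8",
        "--num-env-runners=5",
        "--num-learners=1",
        "--num-gpus-per-learner=1",
    ],
    # main matches srcs, and the single-GPU test is tagged "gpu", not "multi_gpu".
    main = "examples/algorithms/sac/mujoco_sac.py",
    tags = [
        "exclusive",
        "gpu",
        "learning_tests",
        "learning_tests_continuous",
        "team:rllib",
        "torch_only",
    ],
)
```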

Comment on lines +1 to +41
"""Example showing how to train SAC in a multi-agent Pendulum environment.
This example demonstrates Soft Actor-Critic (SAC) in a multi-agent setting where
multiple independent agents each control their own pendulum. Each agent has its
own policy that learns to swing up and balance its pendulum.
This example:
- Trains on the MultiAgentPendulum environment with configurable number of agents
- Uses a multi-agent prioritized experience replay buffer
- Configures separate policies for each agent via policy_mapping_fn
- Applies n-step returns with random n in range [2, 5]
- Uses automatic entropy tuning with target_entropy="auto"
How to run this script
----------------------
`python tictactoe_sac.py --num-agents=2`
To train with more agents:
`python tictactoe_sac.py --num-agents=4`
To scale up with distributed learning using multiple learners and env-runners:
`python tictactoe_sac.py --num-learners=2 --num-env-runners=8`
To use a GPU-based learner add the number of GPUs per learners:
`python tictactoe_sac.py --num-learners=1 --num-gpus-per-learner=1`
For debugging, use the following additional command line options
`--no-tune --num-env-runners=0 --num-learners=0`
which should allow you to set breakpoints anywhere in the RLlib code and
have the execution stop there for inspection and debugging.
By setting `--num-learners=0` and `--num-env-runners=0` will make them run locally
instead of remote Ray Actor where breakpoints aren't possible.
For logging to your WandB account, use:
`--wandb-key=[your WandB API key] --wandb-project=[some project name]
--wandb-run-name=[optional: WandB run name (within the defined project)]`
Results to expect
-----------------
Training should show all agents learning to swing up their pendulums within 500k
timesteps.

high

The docstring describes a multi-agent Pendulum environment, but the filename is tictactoe_sac.py. This seems to be a copy-paste error and should be updated to describe the TicTacToe SAC example.

Comment on lines 1629 to 1647
name = "learning_tests_sac_atari_multi_gpu",
size = "large",
srcs = ["examples/algorithms/sac/pendulum_sac.py"],
srcs = ["examples/algorithms/sac/atari_sac.py"],
args = [
"--as-test",
"--num-cpus=16",
"--num-env-runners=10",
"--num-learners=2",
"--num-gpus-per-learner=1",
"--stop-reward=5",
],
main = "examples/algorithms/sac/pendulum_sac.py",
tags = [
"exclusive",
"learning_tests",
"learning_tests_continuous",
"multi_gpu",
"team:rllib",
"torch_only",
],
)

# MultiAgentPendulum
py_test(
name = "learning_tests_multi_agent_pendulum_sac",
size = "large",
srcs = ["examples/algorithms/sac/multi_agent_pendulum_sac.py"],
args = [
"--as-test",
"--num-agents=2",
"--num-cpus=4",
],
main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
main = "examples/algorithms/sac/atari_sac.py",
tags = [
"exclusive",
"gpu",
"learning_tests",
"learning_tests_continuous",
"team:rllib",
"torch_only",
],
)

medium

This test is configured for multi-GPU execution (num_learners=2, num_gpus_per_learner=1), but it's tagged with gpu. For consistency with other multi-GPU tests in this file (e.g., learning_test_appo_tictactoe_multi_gpu), this should be multi_gpu.

    tags = [
        "exclusive",
        "multi_gpu",
        "learning_tests",
        "team:rllib",
    ],

Comment on lines 44 to 46
The algorithm should reach the default reward threshold of XX on Breakout
within 10 million timesteps (40 million frames with 4x frame stacking,
see: `default_timesteps` in the code).

medium

The docstring has a placeholder XX for the reward threshold and mentions Breakout while the default environment for this example is Pong. Please update the placeholder and clarify the environment to avoid confusion.

Comment on lines 39 to 40
The number of environment steps can be
through changed `default_timesteps`. Training is fast on this simple environment,

medium

There's a minor grammatical error in the docstring. It should be "can be changed through" instead of "can be through changed".

Suggested change
The number of environment steps can be
through changed `default_timesteps`. Training is fast on this simple environment,
The number of environment steps can be changed
through `default_timesteps`.

Comment on lines 53 to 54
The algorithm should reach the default reward threshold of 9000.0 within
2 million timesteps (see: `default_timesteps` in the code).

medium

The docstring mentions an expected reward of 9000.0, but the default_reward in the parser is set to 800.0. The default environment is Humanoid-v4, for which 800.0 seems more reasonable. The 9000.0 might be a leftover from a HalfCheetah-v4 example. Please align the docstring with the code.

Comment on lines 48 to 51
The algorithm should reach the default reward threshold of 300.0 (500.0 is the
maximum) within approximately 2 million timesteps (see: `default_timesteps`
in the code). The number of environment
steps can be changed through argparser's `default_timesteps`.

medium

The docstring states an expected reward of 300.0 within 2 million timesteps, but the parser defaults are 350.0 and 5_000_000 timesteps. Please update the docstring to match the code's default parameters for consistency.

Comment on lines 46 to 48
Training will run for 100 thousand timesteps (see: `default_timesteps` in the
code) for p0 (policy 0) to achieve a mean return of XX compared to the
random policy. The number of environment steps can be changed through

medium

The docstring contains a placeholder XX for the expected mean return. Please replace this with the actual expected value.

Comment on lines +23 to +26
`python [script file name].py --num-learners=2 --num-env-runners=8`
To use a GPU-based learner add the number of GPUs per learners:
`python [script file name].py --num-learners=1 --num-gpus-per-learner=1`

medium

The example usage in the docstring uses [script file name].py as a placeholder. This should be replaced with the actual script name, atari_sac.py, for clarity.

Suggested change
`python [script file name].py --num-learners=2 --num-env-runners=8`
To use a GPU-based learner add the number of GPUs per learners:
`python [script file name].py --num-learners=1 --num-gpus-per-learner=1`
`python atari_sac.py --num-learners=2 --num-env-runners=8`
To use a GPU-based learner add the number of GPUs per learners:
`python atari_sac.py --num-learners=1 --num-gpus-per-learner=1`

Results to expect
-----------------
Training should reach a reward of ~12,000 within 1M timesteps (~2000 iterations).

medium

The docstring mentions an expected reward of ~12,000, but the parser's default_reward is 800.0. The default environment is Humanoid-v4. The high reward value might be from a different environment like HalfCheetah-v4. Please correct the docstring to match the default configuration.

Mark Towers added 7 commits December 29, 2025 12:26
simonsays1980 pushed a commit that referenced this pull request Jan 8, 2026
…9895)

## Description
For #59508 and #59581, using TicTacToe caused the following error:

```
ray::DQN.train() (pid=88183, ip=127.0.0.1, actor_id=2b775f13e808cc4aaaa23bde01000000, repr=DQN(env=<class 'ray.rllib.examples.envs.classes.multi_agent.tic_tac_toe.TicTacToe'>; env-runners=0; learners=0; multi-agent=True))
  File "ray/python/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "ray/python/ray/tune/trainable/trainable.py", line 328, in train
    result = self.step()
  File "ray/python/ray/rllib/algorithms/algorithm.py", line 1242, in step
    train_results, train_iter_ctx = self._run_one_training_iteration()
  File "ray/python/ray/rllib/algorithms/algorithm.py", line 3666, in _run_one_training_iteration
    training_step_return_value = self.training_step()
  File "ray/python/ray/rllib/algorithms/dqn/dqn.py", line 646, in training_step
    return self._training_step_new_api_stack()
  File "ray/python/ray/rllib/algorithms/dqn/dqn.py", line 668, in _training_step_new_api_stack
    self.local_replay_buffer.add(episodes)
  File "ray/python/ray/rllib/utils/replay_buffers/prioritized_episode_buffer.py", line 314, in add
    existing_eps.concat_episode(eps)
  File "ray/python/ray/rllib/env/multi_agent_episode.py", line 862, in concat_episode
    sa_episode.concat_episode(other.agent_episodes[agent_id])
  File "ray/python/ray/rllib/env/single_agent_episode.py", line 618, in concat_episode
    assert self.t == other.t_started
AssertionError
```

In the multi-agent episode's `concat_episode`, we check whether any agent has taken an action but not yet received the resulting observation. This produces a hanging action: one episode chunk ends with the observation and action, and the next chunk begins with the resulting observation, reward, etc. This
[code](https://github.com/ray-project/ray/blob/22cf6ef6af2cddc233bca7ce59668ed8f4bbb17e/rllib/env/multi_agent_episode.py#L848)
checks whether this has happened and, if so, adds an extra step at the beginning to include the hanging data.

However, in testing, the multi-agent episode's `cut` method already implements this (using `slice` instead would cause a hidden bug), so an unnecessary extra step's worth of data was being added and the episode beginnings no longer lined up.
Therefore, this PR removes this code and replaces it with a simple check that assumes the hanging action is equivalent to the initial action in the next episode.

For testing, I found that the `concat_episode` test was using `slice`, which doesn't account for hanging data, while `cut`, which is used in the env-runner, does. I modified the test to be more functional: I created a custom environment in which agents take actions at different frequencies and each agent's observation is its own timestep. This lets us concatenate all episode chunks with the same ID and check that the observations increase 0, 1, 2, ..., ensuring that no data goes missing for users.

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
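
As a standalone illustration of the invariant the test described in the commit above checks (this sketch does not use RLlib's episode classes; the helper name and data are made up):

```
# Toy sketch of the property described above, independent of RLlib's
# MultiAgentEpisode/SingleAgentEpisode classes.
def concat_chunks(chunks):
    """Concatenate (t_started, observations) chunks of one episode ID.

    A chunk produced by a correct `cut` starts exactly where the previous
    chunk ended, so the concatenated observations must count 0, 1, 2, ...
    with nothing duplicated and nothing missing.
    """
    merged, t = [], 0
    for t_started, obs in chunks:
        # Mirrors the `self.t == other.t_started` assertion in the traceback.
        assert t_started == t, f"chunk starts at {t_started}, expected {t}"
        merged.extend(obs)
        t += len(obs)
    return merged


# The test environment returns each agent's own timestep as its observation,
# so the merged stream should simply be 0, 1, 2, ...
assert concat_chunks([(0, [0, 1, 2]), (3, [3, 4]), (5, [5, 6, 7])]) == list(range(8))
```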
Mark Towers added 2 commits January 12, 2026 10:30
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
Mark Towers added 8 commits January 14, 2026 19:42
@pseudo-rnd-thoughts pseudo-rnd-thoughts marked this pull request as ready for review January 20, 2026 19:08
@pseudo-rnd-thoughts pseudo-rnd-thoughts requested a review from a team as a code owner January 20, 2026 19:08

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.

"--num-gpus-per-learner=1",
],
main = "examples/algorithms/sac/multi_agent_pendulum_sac.py",
main = "examples/algorithms/sac/mountaincar_sac.py",

Test references deleted file as main entry point

High Severity

The test learning_tests_sac_mujoco_single_gpu specifies srcs as mujoco_sac.py but main points to mountaincar_sac.py. Since mountaincar_sac.py is deleted in this same PR, the test will fail at runtime because the main entry point file no longer exists. The main directive needs to match the srcs file (mujoco_sac.py).

self.board = [-x for x in self.board]
reward = {self.current_player: 0.0}
self.current_player = opponent
self.timestep += 1

TicTacToe double timestep increment causes incorrect game state

High Severity

The step() method already increments self.timestep at line 92 (start of method). The newly added self.timestep += 1 at line 155 in the "continue game" path causes the timestep to be incremented twice per step, making the game end prematurely and corrupting the timestep tracking.

reward = {self.current_player: 0.0}
self.current_player = opponent
self.timestep += 1
self.board = [-x for x in self.board]

TicTacToe double board flip breaks opponent perspective

High Severity

The board is flipped at line 152 to present the opponent's perspective (where their pieces appear as +1). The newly added second flip at line 156 reverses this, causing the opponent to receive the board from the wrong perspective, breaking the game's core observation logic.
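
To make the perspective issue concrete, here is a tiny sketch with toy board values (not the actual TicTacToe class); the duplicate `self.timestep += 1` flagged in the previous comment sits in this same continue-game branch:

```
# Toy illustration: flipping the board twice returns it to the mover's
# perspective, so the opponent would not see their own pieces as +1.
board = [1, 0, -1]                      # current player's pieces are +1
flipped_once = [-x for x in board]      # opponent's perspective (their pieces are +1)
flipped_twice = [-x for x in flipped_once]
assert flipped_twice == board           # the second flip undoes the first
```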

head_fcnet_kernel_initializer="orthogonal_",
head_fcnet_kernel_initializer_kwargs={"gain": 0.01},
),
)

LSTM example file missing required LSTM configuration

Medium Severity

The file is named stateless_cartpole_sac_with_lstm.py and BUILD.bazel marks it as an "(SA/D/LSTM)" test, but the DefaultModelConfig doesn't set use_lstm=True. Since use_lstm defaults to False, this example runs without LSTM despite its name suggesting LSTM functionality, defeating the purpose of testing SAC with recurrent networks on a partially observable environment.
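
For reference, the missing piece would presumably look something like the following sketch; only `use_lstm=True` is the essential change, and the cell size / sequence length values are placeholders rather than the ones the example should necessarily use:

```
from ray.rllib.core.rl_module.default_model_config import DefaultModelConfig

# Sketch of an LSTM-enabled model config for this example; values other than
# use_lstm are placeholders.
model_config = DefaultModelConfig(
    use_lstm=True,
    lstm_cell_size=256,
    max_seq_len=20,
    head_fcnet_kernel_initializer="orthogonal_",
    head_fcnet_kernel_initializer_kwargs={"gain": 0.01},
)
# Passed to the algorithm config via something like:
#   config.rl_module(model_config=model_config)
```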

@ray-gardener ray-gardener bot added the rllib RLlib related issues label Jan 20, 2026