[RLlib] - Algorithm.add_module does not use the module_state argument. #46247

Open
simonsays1980 opened this issue Jun 25, 2024 · 0 comments
Labels: bug (Something that is supposed to be working; but isn't), P0 (Issues that should be fixed in short order), rllib (RLlib related issues), rllib-multi-agent (An RLlib multi-agent related problem)
What happened + What you expected to happen

What happened

Calling Algorithm.add_module with a module_state does not apply the passed-in module state; instead, the module is loaded or built directly from the passed-in SingleAgentRLModuleSpec. This results in an error about missing network weights, caused by the inference-only design of EnvRunner modules.

trying to add module: th-rlm-16
2024-06-24 11:26:56,456 ERROR actor_manager.py:185 -- Worker exception caught during `apply()`: Error(s) in loading state_dict for PPOTorchRLModule:
        Missing key(s) in state_dict: "encoder.actor_encoder.net.mlp.0.weight", "encoder.actor_encoder.net.mlp.0.bias", "encoder.actor_encoder.net.mlp.2.weight", "encoder.actor_encoder.net.mlp.2.bias", "encoder.critic_encoder.net.mlp.0.weight", "encoder.critic_encoder.net.mlp.0.bias", "encoder.critic_encoder.net.mlp.2.weight", "encoder.critic_encoder.net.mlp.2.bias", "vf.net.mlp.0.weight", "vf.net.mlp.0.bias". 
        Unexpected key(s) in state_dict: "encoder.encoder.net.mlp.0.weight", "encoder.encoder.net.mlp.0.bias", "encoder.encoder.net.mlp.2.weight", "encoder.encoder.net.mlp.2.bias". 
Traceback (most recent call last):
  File "/home/thwu/miniconda3/envs/rlmodule/lib/python3.10/site-packages/ray/rllib/utils/actor_manager.py", line 181, in apply
    return func(self, *args, **kwargs)
  File "/home/thwu/miniconda3/envs/rlmodule/lib/python3.10/site-packages/ray/rllib/env/env_runner_group.py", line 555, in _set_weights
    env_runner.set_weights(_weights, global_vars)
  File "/home/thwu/miniconda3/envs/rlmodule/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/home/thwu/miniconda3/envs/rlmodule/lib/python3.10/site-packages/ray/rllib/env/multi_agent_env_runner.py", line 672, in set_weights
    self.module.set_state(weights)
  File "/home/thwu/miniconda3/envs/rlmodule/lib/python3.10/site-packages/ray/rllib/core/rl_module/marl_module.py", line 334, in set_state
    self._rl_modules[module_id].set_state(state)
  File "/home/thwu/miniconda3/envs/rlmodule/lib/python3.10/site-packages/ray/rllib/core/rl_module/torch/torch_rl_module.py", line 73, in set_state
    self.load_state_dict(state_dict)
  File "/home/thwu/miniconda3/envs/rlmodule/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PPOTorchRLModule:
        Missing key(s) in state_dict: "encoder.actor_encoder.net.mlp.0.weight", "encoder.actor_encoder.net.mlp.0.bias", "encoder.actor_encoder.net.mlp.2.weight", "encoder.actor_encoder.net.mlp.2.bias", "encoder.critic_encoder.net.mlp.0.weight", "encoder.critic_encoder.net.mlp.0.bias", "encoder.critic_encoder.net.mlp.2.weight", "encoder.critic_encoder.net.mlp.2.bias", "vf.net.mlp.0.weight", "vf.net.mlp.0.bias". 
        Unexpected key(s) in state_dict: "encoder.encoder.net.mlp.0.weight", "encoder.encoder.net.mlp.0.bias", "encoder.encoder.net.mlp.2.weight", "encoder.encoder.net.mlp.2.bias".
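The missing/unexpected keys in the traceback reflect the inference-only design: an inference-only PPO module builds a single shared encoder (encoder.encoder.*), while the full learner-side module has separate actor/critic encoders plus a value head. A toy illustration (plain Python, no RLlib; key names copied from the traceback, shapes omitted) of the key comparison that torch's load_state_dict performs:

```python
# Toy sketch of the state_dict key comparison behind the RuntimeError above.
# Parameter keys of a full (inference_only=False) PPOTorchRLModule,
# abbreviated to the first MLP layer:
full_module_keys = {
    "encoder.actor_encoder.net.mlp.0.weight",
    "encoder.actor_encoder.net.mlp.0.bias",
    "encoder.critic_encoder.net.mlp.0.weight",
    "encoder.critic_encoder.net.mlp.0.bias",
    "vf.net.mlp.0.weight",
    "vf.net.mlp.0.bias",
}
# Keys found in an inference-only state_dict (shared encoder, no value head):
inference_only_state_keys = {
    "encoder.encoder.net.mlp.0.weight",
    "encoder.encoder.net.mlp.0.bias",
}

# load_state_dict reports exactly these two set differences:
missing = full_module_keys - inference_only_state_keys
unexpected = inference_only_state_keys - full_module_keys
print(sorted(missing))
print(sorted(unexpected))
```

Because the module was rebuilt in inference-only form on the EnvRunner side while the full module's state was expected, both sets are non-empty and the load fails.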

What you expected to happen

A module state can be loaded into a module when calling Algorithm.add_module, with the Learner's module being inference_only=False and the EnvRunner's module being inference_only=True. Ideally, the module state comes from a Learner's module, because that module contains all networks.
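The expected control flow can be sketched with a minimal mock (not RLlib code; the Module, ModuleSpec, and add_module names here are hypothetical stand-ins): build the module from the spec, then, if a module_state was passed, apply it via set_state rather than discarding it.

```python
# Minimal mock sketch, NOT RLlib's implementation: illustrates the control
# flow this issue expects from Algorithm.add_module.
class Module:
    def __init__(self):
        # A freshly built module starts with untrained state.
        self.state = {"freshly": "initialized"}

    def set_state(self, state):
        self.state = dict(state)


class ModuleSpec:
    def build(self):
        return Module()


def add_module(modules, module_id, module_spec, module_state=None):
    """Expected behavior: a passed-in module_state overrides the fresh build."""
    module = module_spec.build()
    if module_state is not None:
        # This is the step the issue reports as currently skipped.
        module.set_state(module_state)
    modules[module_id] = module
    return module


modules = {}
m = add_module(modules, "new_module", ModuleSpec(), module_state={"w": 1.0})
print(m.state)  # the passed-in (trained) state, not the fresh one
```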

Versions / Dependencies

Python 3.11
Ray master

Reproduction script

# Imports inferred from the aliases used below (Ray master at the time of
# filing); `serialization.ModuleSpec` is the reporter's own helper for
# persisting trained modules.
from ray.rllib.algorithms import algorithm as rllib_algorithm
from ray.rllib.algorithms.ppo.torch import ppo_torch_rl_module
from ray.rllib.core.rl_module import rl_module


def add_trained_modules(
    algorithm: rllib_algorithm.Algorithm,
    module_specs: list[serialization.ModuleSpec],
    evaluation_workers: bool = True,
):
    """Add a list of trained modules to an RLlib Algorithm."""
    for module_spec in module_specs:
        # if algorithm.get_module(module_spec.name):
        #     continue

        module = module_spec.load_module()
        # Request a full (non-inference-only) module so all networks exist.
        model_config_dict = module.config.model_config_dict
        model_config_dict["_inference_only"] = False
        print(f"trying to add module: {module_spec.name}")
        algorithm.add_module(
            module_spec.name,
            rl_module.SingleAgentRLModuleSpec(
                module_class=ppo_torch_rl_module.PPOTorchRLModule,
                observation_space=module.config.observation_space,
                action_space=module.config.action_space,
                model_config_dict=model_config_dict,
                catalog_class=module.config.catalog_class,
            ),
            # BUG: this state is ignored; the module is built fresh from the
            # spec instead, which later fails on the EnvRunners (see traceback).
            module_state=module.get_state(inference_only=False),
            evaluation_workers=evaluation_workers,
        )
        print(f"added module: {module_spec.name}")

Issue Severity

Medium: It is a significant difficulty but I can work around it.

simonsays1980 self-assigned this Jun 25, 2024