
[Question] Vectorized custom environments that output (num_envs, obs_size) without stacking #2066

Closed
luchungi opened this issue Jan 5, 2025 · 4 comments
Labels: custom gym env (Issue related to Custom Gym Env), question (Further information is requested)

Comments


luchungi commented Jan 5, 2025

❓ Question

I have a question about vectorized custom environments where the step() and reset() functions already produce outputs of shape (n_envs, obs_size) or (n_envs,) for observations, rewards, dones, etc.

Reading the documentation, all the helper functions seem to be built for stacking the outputs of multiple independently running envs. I have tried a hack where I inherit from VecEnv in the sample code below. Although it runs, for some reason reset() is never called even though the dones are true. Running the code, you will see that reset(), which contains a print('reset'), is only called once despite episodes ending.

My questions are:

  1. Is this scenario not supported, i.e. custom envs that already output vectorized observations, rewards, etc. without the need to stack them?
  2. If I cannot get the "hack" to work, would calling predict() and train() manually in a loop work the same as learn()?
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3.common.vec_env.base_vec_env import VecEnv
from stable_baselines3 import PPO

class CustomEnv(VecEnv):

    def __init__(self, n, batch_size):

        self.batch_size=batch_size
        self.n = n
        self.observation_space = spaces.Box(low=-np.ones(n), high=np.ones(n))
        self.action_space = spaces.Box(low=-1., high=1.)
        self.total_steps = 10
        self.rng = np.random.default_rng()
        self.num_envs = batch_size


    def reset(self, seed=None, options=None):
        print('reset')
        obs = self.rng.random(size=(self.batch_size, self.n))
        self.curr_step = 0
        return obs

    def step(self, action):

        obs = self.rng.random(size=(self.batch_size, self.n)) # obs is of shape (batch_size, n) and type float
        reward = self.rng.random(size=(self.batch_size,)) # reward is of shape (batch_size) and type float

        info = [{'TimeLimit.truncated': False, 'terminal_observation': np.zeros(self.batch_size)} for _ in range(self.batch_size)]
        self.curr_step += 1
        if self.curr_step == self.total_steps:
            done = np.array([True] * self.batch_size)
        else:
            done = np.array([False] * self.batch_size)
        truncated = np.array([False] * self.batch_size)
        return obs, reward, done, info

    def step_async(self, actions):
        print('step_async')
        pass

    def step_wait(self):
        print('step_wait')
        pass

    def seed(self, seed=None):
        print('seed')
        pass

    def close(self):
        print('close')

    def env_is_wrapped(self, wrapper_class):
        print(f'env_is_wrapped: {wrapper_class}')
        return False

    def get_attr(self, attr_name, indices=None):
        print(f'get_attr: {attr_name}')
        return getattr(self, attr_name)

    def set_attr(self, attr_name, value, indices=None):
        print(f'set_attr')
        setattr(self, attr_name, value)

    def env_method(self, method_name, *args, **kwargs):
        print(f'env_method: {method_name}')
        return getattr(self, method_name)(*args, **kwargs)

env = CustomEnv(10, 2)
agent = PPO("MlpPolicy", env, verbose=1)
agent.learn(10000)



araffin commented Jan 8, 2025

I have tried a hack where I inherit from VecEnv

This is not a hack, and it is the way to go for a vectorized env.

We have several working examples (most of them are already in our doc).
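
For illustration, a minimal self-resetting VecEnv subclass for an env that already produces batched outputs could look roughly like this (a sketch only, not taken from the doc examples; it assumes the SB3 2.x VecEnv constructor, and the BatchedEnv name and its attributes are made up):

import numpy as np
from gymnasium import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env.base_vec_env import VecEnv

class BatchedEnv(VecEnv):
    """Toy env whose reset()/step_wait() already return arrays of shape (num_envs, ...)."""

    def __init__(self, n, num_envs, episode_length=10):
        self.n = n
        self.episode_length = episode_length
        self.curr_step = 0
        self.rng = np.random.default_rng()
        self.render_mode = None  # defined before super().__init__(), which queries render_mode via get_attr() in SB3 2.x
        super().__init__(
            num_envs=num_envs,
            observation_space=spaces.Box(low=-np.ones(n), high=np.ones(n)),
            action_space=spaces.Box(low=-1.0, high=1.0, shape=(1,)),
        )

    def reset(self):
        self.curr_step = 0
        return self.rng.random(size=(self.num_envs, self.n)).astype(np.float32)

    def step_async(self, actions):
        self._actions = actions

    def step_wait(self):
        self.curr_step += 1
        obs = self.rng.random(size=(self.num_envs, self.n)).astype(np.float32)
        rewards = self.rng.random(size=(self.num_envs,)).astype(np.float32)
        dones = np.full((self.num_envs,), self.curr_step >= self.episode_length)
        infos = [{} for _ in range(self.num_envs)]
        if dones.all():
            # a VecEnv resets itself: stash the final observations in the info dicts
            # and return the first observations of the next episode instead
            for env_idx in range(self.num_envs):
                infos[env_idx]["terminal_observation"] = obs[env_idx]
            obs = self.reset()
        return obs, rewards, dones, infos

    def close(self):
        pass

    def get_attr(self, attr_name, indices=None):
        # the VecEnv API expects one value per sub-environment (indices ignored for brevity)
        return [getattr(self, attr_name)] * self.num_envs

    def set_attr(self, attr_name, value, indices=None):
        setattr(self, attr_name, value)

    def env_method(self, method_name, *args, indices=None, **kwargs):
        # call on the single underlying batched env (indices ignored for brevity)
        return [getattr(self, method_name)(*args, **kwargs)]

    def env_is_wrapped(self, wrapper_class, indices=None):
        return [False] * self.num_envs

env = BatchedEnv(10, 2)
agent = PPO("MlpPolicy", env, verbose=1, n_steps=10, batch_size=2)
agent.learn(1_000)

The key point is that, once training has started, reset() is never called from the outside: the env stores the final observations in the info dicts and resets itself inside step_wait().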


luchungi commented Jan 9, 2025

Thanks for the directions. Much appreciated.

Can I clarify this paragraph in the docs:

When using vectorized environments, the environments are automatically reset at the end of each episode. Thus, the observation returned for the i-th environment when done[i] is true will in fact be the first observation of the next episode, not the last observation of the episode that has just terminated. You can access the “real” final observation of the terminated episode—that is, the one that accompanied the done event provided by the underlying environment—using the terminal_observation keys in the info dicts returned by the VecEnv.

Does it mean that, within env.step(), we need to reset env[i] ourselves and return obs[i] as the first observation of the new episode when done[i] is True?
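
To double check my reading of that paragraph, the same behaviour can be seen with a standard DummyVecEnv (a small sketch using CartPole rather than my custom env):

import gymnasium as gym
import numpy as np
from stable_baselines3.common.vec_env import DummyVecEnv

vec_env = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(2)])
obs = vec_env.reset()
for _ in range(200):
    # one random action per sub-environment
    actions = np.stack([vec_env.action_space.sample() for _ in range(vec_env.num_envs)])
    obs, rewards, dones, infos = vec_env.step(actions)
    for env_idx, done in enumerate(dones):
        if done:
            # obs[env_idx] is already the first observation of the next episode;
            # the final observation of the episode that just ended is only in the info dict
            final_obs = infos[env_idx]["terminal_observation"]

So obs[i] already belongs to the new episode whenever dones[i] is True.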


araffin commented Jan 13, 2025

as the observation of a new episode when done[i] is True?

Maybe the easiest way to answer your question is to have a look at:

    if self.buf_dones[env_idx]:
        # save final observation where user can get it, then reset
        self.buf_infos[env_idx]["terminal_observation"] = obs
        obs, self.reset_infos[env_idx] = self.envs[env_idx].reset()
    self._save_obs(env_idx, obs)
return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones), deepcopy(self.buf_infos))

luchungi (Author) commented

I think my customised environment does not lend itself to creating self.envs as a list of individual envs. In any case, I found a workaround by using the VecEnvWrapper mentioned in the docs. Since all episodes in my custom env reset at the same time due to a fixed episode length, I added the reset() call into the step() function of the VecEnvWrapper as shown below. Without this (i.e. if you remove the if done.all() condition), the envs do not reset automatically.

import numpy as np
import torch
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3.common.vec_env import VecEnvWrapper
from stable_baselines3 import PPO

class CustomEnv(gym.Env):

    def __init__(self, n, batch_size):
        self.batch_size=batch_size
        self.n = n
        self.observation_space = spaces.Box(low=-np.ones(n), high=np.ones(n))
        self.action_space = spaces.Box(low=-1., high=1.)
        self.total_steps = 10
        self.num_envs = batch_size

    def reset(self, seed=None, options=None):
        obs = torch.randn(size=(self.batch_size, self.n))
        self.curr_step = 0
        return obs, {}

    def step(self, action):

        obs = torch.randn(self.batch_size, self.n)
        reward = torch.randn(self.batch_size)

        info = [{'TimeLimit.truncated': False, 'terminal_observation': torch.zeros(self.batch_size, self.n) if self.curr_step == self.total_steps else None} for _ in range(self.batch_size)]
        self.curr_step += 1
        if self.curr_step == self.total_steps:
            done = torch.tensor([True]).repeat(self.batch_size)
        else:
            done = torch.tensor([False]).repeat(self.batch_size)
        truncated = torch.tensor([False]).repeat(self.batch_size)
        return obs, reward, done, truncated, info

class CustomVecEnvWrapper(VecEnvWrapper):

    def __init__(self, venv):
        super().__init__(venv=venv)

    def reset(self):
        print("reset")
        obs, _ = self.venv.reset()
        return obs

    def step_async(self, actions):
        pass

    def step_wait(self):
        pass

    def step(self, actions):
        obs, reward, done, _, info = self.venv.step(actions)
        if done.all():
            obs = self.reset()
        return obs, reward.squeeze().numpy(), done.squeeze().numpy(), info

env = CustomEnv(10, 2)
env = CustomVecEnvWrapper(env)
agent = PPO("MlpPolicy", env, verbose=1, n_steps=10, batch_size=2)
agent.learn(31)
