
[Question] Vectorized custom environments that output (num_envs, obs_size) without stacking #2066

Closed
luchungi opened this issue Jan 5, 2025 · 4 comments
Labels: custom gym env (Issue related to Custom Gym Env), question (Further information is requested)

Comments


luchungi commented Jan 5, 2025

❓ Question

I have a question about vectorized custom environments where the step() and reset() functions already produce outputs of shape (n_envs, obs_size) or (n_envs,) for observations, rewards, dones, etc.

Reading the documentation, all the helper functions seem to be built for stacking the outputs of multiple independently running envs. I have tried a hack where I inherit from VecEnv in the sample code below. Although it runs, for some reason reset() is never called even though the dones are true. Running the code, you will see that reset(), which contains a print('reset'), is only called once despite episodes ending.

My questions are:

  1. Is this scenario not supported, i.e. custom envs that already output vectorized observations, rewards, etc. without the need to stack them?
  2. If I cannot get the "hack" to work, would calling predict() and train() manually in a loop work the same as learn()?
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3.common.vec_env.base_vec_env import VecEnv
from stable_baselines3 import PPO

class CustomEnv(VecEnv):

    def __init__(self, n, batch_size):

        self.batch_size=batch_size
        self.n = n
        self.observation_space = spaces.Box(low=-np.ones(n), high=np.ones(n))
        self.action_space = spaces.Box(low=-1., high=1.)
        self.total_steps = 10
        self.rng = np.random.default_rng()
        self.num_envs = batch_size


    def reset(self, seed=None, options=None):
        print('reset')
        obs = self.rng.random(size=(self.batch_size, self.n))
        self.curr_step = 0
        return obs

    def step(self, action):

        obs = self.rng.random(size=(self.batch_size, self.n)) # obs is of shape (batch_size, n) and type float
        reward = self.rng.random(size=(self.batch_size,)) # reward is of shape (batch_size) and type float

        info = [{'TimeLimit.truncated': False, 'terminal_observation': np.zeros(self.batch_size)} for _ in range(self.batch_size)]
        self.curr_step += 1
        if self.curr_step == self.total_steps:
            done = np.array([True] * self.batch_size)
        else:
            done = np.array([False] * self.batch_size)
        truncated = np.array([False] * self.batch_size)
        return obs, reward, done, info

    def step_async(self, actions):
        print('step_async')
        pass

    def step_wait(self):
        print('step_wait')
        pass

    def seed(self, seed=None):
        print('seed')
        pass

    def close(self):
        print('close')

    def env_is_wrapped(self, wrapper_class):
        print(f'env_is_wrapped: {wrapper_class}')
        return False

    def get_attr(self, attr_name, indices=None):
        print(f'get_attr: {attr_name}')
        return getattr(self, attr_name)

    def set_attr(self, attr_name, value, indices=None):
        print(f'set_attr')
        setattr(self, attr_name, value)

    def env_method(self, method_name, *args, **kwargs):
        print(f'env_method: {method_name}')
        return getattr(self, method_name)(*args, **kwargs)

env = CustomEnv(10, 2)
agent = PPO("MlpPolicy", env, verbose=1)
agent.learn(10000)



araffin commented Jan 8, 2025

I have tried a hack where I inherit from VecEnv

This is not a hack, and it is the way to go for a vectorized env.

We have several working examples (most of them are already in our doc).
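
For illustration, a minimal self-resetting VecEnv subclass for an env that already produces batched outputs could look roughly like this (a sketch only, not taken from the doc examples; it assumes the SB3 2.x VecEnv constructor, and the BatchedEnv name and its attributes are made up):

import numpy as np
from gymnasium import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env.base_vec_env import VecEnv

class BatchedEnv(VecEnv):
    """Toy env whose reset()/step_wait() already return arrays of shape (num_envs, ...)."""

    def __init__(self, n, num_envs, episode_length=10):
        self.n = n
        self.episode_length = episode_length
        self.curr_step = 0
        self.rng = np.random.default_rng()
        self.render_mode = None  # defined before super().__init__(), which queries render_mode via get_attr() in SB3 2.x
        super().__init__(
            num_envs=num_envs,
            observation_space=spaces.Box(low=-np.ones(n), high=np.ones(n)),
            action_space=spaces.Box(low=-1.0, high=1.0, shape=(1,)),
        )

    def reset(self):
        self.curr_step = 0
        return self.rng.random(size=(self.num_envs, self.n)).astype(np.float32)

    def step_async(self, actions):
        self._actions = actions

    def step_wait(self):
        self.curr_step += 1
        obs = self.rng.random(size=(self.num_envs, self.n)).astype(np.float32)
        rewards = self.rng.random(size=(self.num_envs,)).astype(np.float32)
        dones = np.full((self.num_envs,), self.curr_step >= self.episode_length)
        infos = [{} for _ in range(self.num_envs)]
        if dones.all():
            # a VecEnv resets itself: stash the final observations in the info dicts
            # and return the first observations of the next episode instead
            for env_idx in range(self.num_envs):
                infos[env_idx]["terminal_observation"] = obs[env_idx]
            obs = self.reset()
        return obs, rewards, dones, infos

    def close(self):
        pass

    def get_attr(self, attr_name, indices=None):
        # the VecEnv API expects one value per sub-environment (indices ignored for brevity)
        return [getattr(self, attr_name)] * self.num_envs

    def set_attr(self, attr_name, value, indices=None):
        setattr(self, attr_name, value)

    def env_method(self, method_name, *args, indices=None, **kwargs):
        # call on the single underlying batched env (indices ignored for brevity)
        return [getattr(self, method_name)(*args, **kwargs)]

    def env_is_wrapped(self, wrapper_class, indices=None):
        return [False] * self.num_envs

env = BatchedEnv(10, 2)
agent = PPO("MlpPolicy", env, verbose=1, n_steps=10, batch_size=2)
agent.learn(1_000)

The key point is that, once training has started, reset() is never called from the outside: the env stores the final observations in the info dicts and resets itself inside step_wait().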


luchungi commented Jan 9, 2025

Thanks for the directions. Much appreciated.

Can I clarify this paragraph in the docs:

When using vectorized environments, the environments are automatically reset at the end of each episode. Thus, the observation returned for the i-th environment when done[i] is true will in fact be the first observation of the next episode, not the last observation of the episode that has just terminated. You can access the “real” final observation of the terminated episode—that is, the one that accompanied the done event provided by the underlying environment—using the terminal_observation keys in the info dicts returned by the VecEnv.

Does it mean that, within env.step(), we need to reset env[i] ourselves and return obs[i] as the first observation of the new episode when done[i] is True?
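
To double check my reading of that paragraph, the same behaviour can be seen with a standard DummyVecEnv (a small sketch using CartPole rather than my custom env):

import gymnasium as gym
import numpy as np
from stable_baselines3.common.vec_env import DummyVecEnv

vec_env = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(2)])
obs = vec_env.reset()
for _ in range(200):
    # one random action per sub-environment
    actions = np.stack([vec_env.action_space.sample() for _ in range(vec_env.num_envs)])
    obs, rewards, dones, infos = vec_env.step(actions)
    for env_idx, done in enumerate(dones):
        if done:
            # obs[env_idx] is already the first observation of the next episode;
            # the final observation of the episode that just ended is only in the info dict
            final_obs = infos[env_idx]["terminal_observation"]

So obs[i] already belongs to the new episode whenever dones[i] is True.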


araffin commented Jan 13, 2025

as the observation of a new episode when done[i] is True?

Maybe the easiest way to answer your question is to have a look at:

    if self.buf_dones[env_idx]:
        # save final observation where user can get it, then reset
        self.buf_infos[env_idx]["terminal_observation"] = obs
        obs, self.reset_infos[env_idx] = self.envs[env_idx].reset()
    self._save_obs(env_idx, obs)
return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones), deepcopy(self.buf_infos))

luchungi (Author) commented

I think my customised environment does not lend itself to creating self.envs as a list of individual envs. In any case, I found a workaround by using the VecEnvWrapper mentioned in the docs. Since all episodes in my custom env reset at the same time due to a fixed episode length, I added the reset() call into the step() function of the VecEnvWrapper as shown below. Without this (i.e. if you remove the if done.all() condition), the envs do not reset automatically.

import numpy as np
import torch
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3.common.vec_env import VecEnvWrapper
from stable_baselines3 import PPO

class CustomEnv(gym.Env):

    def __init__(self, n, batch_size):
        self.batch_size=batch_size
        self.n = n
        self.observation_space = spaces.Box(low=-np.ones(n), high=np.ones(n))
        self.action_space = spaces.Box(low=-1., high=1.)
        self.total_steps = 10
        self.num_envs = batch_size

    def reset(self, seed=None, options=None):
        obs = torch.randn(size=(self.batch_size, self.n))
        self.curr_step = 0
        return obs, {}

    def step(self, action):

        obs = torch.randn(self.batch_size, self.n)
        reward = torch.randn(self.batch_size)

        info = [{'TimeLimit.truncated': False, 'terminal_observation': torch.zeros(self.batch_size, self.n) if self.curr_step == self.total_steps else None} for _ in range(self.batch_size)]
        self.curr_step += 1
        if self.curr_step == self.total_steps:
            done = torch.tensor([True]).repeat(self.batch_size)
        else:
            done = torch.tensor([False]).repeat(self.batch_size)
        truncated = torch.tensor([False]).repeat(self.batch_size)
        return obs, reward, done, truncated, info

class CustomVecEnvWrapper(VecEnvWrapper):

    def __init__(self, venv):
        super().__init__(venv=venv)

    def reset(self):
        print("reset")
        obs, _ = self.venv.reset()
        return obs

    def step_async(self, actions):
        pass

    def step_wait(self):
        pass

    def step(self, actions):
        obs, reward, done, _, info = self.venv.step(actions)
        if done.all():
            obs = self.reset()
        return obs, reward.squeeze().numpy(), done.squeeze().numpy(), info

env = CustomEnv(10, 2)
env = CustomVecEnvWrapper(env)
agent = PPO("MlpPolicy", env, verbose=1, n_steps=10, batch_size=2)
agent.learn(31)
