The X server sometimes work but sometimes stop. #22

Open
JisuHann opened this issue Aug 13, 2023 · 2 comments

Comments

@JisuHann

Hi. Thanks for your work!

I followed your instructions and found that the X server sometimes works and sometimes doesn't. Specifically, the screen just freezes and nothing moves. This happens right after the initialization stage. The last output on the screen is as follows:

08/13 23:26:33 INFO: Starting 0-th SingleProcessVectorSampledTasks generator with args {'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7f7c403d73d0>, 'scenes': ['FloorPlan16_physics', 'FloorPlan17_physics', 'FloorPlan18_physics', 'FloorPlan19_physics', 'FloorPlan20_physics'], 'env_args': {'gridSize': 0.25, 'width': 224, 'height': 224, 'visibilityDistance': 1.0, 'agentMode': 'arm', 'fieldOfView': 100, 'agentControllerType': 'mid-level', 'server_class': <class 'ai2thor.fifo_server.FifoServer'>, 'useMassThreshold': True, 'massThreshold': 10, 'autoSimulation': False, 'autoSyncTransforms': True, 'renderDepthImage': True, 'x_display': '0.1'}, 'max_steps': 200, 'sensors': [<ithor_arm.ithor_arm_sensors.DepthSensorThor object at 0x7f7cccede370>, <allenact_plugins.ithor_plugin.ithor_sensors.RGBSensorThor object at 0x7f7c403c7ac0>, <ithor_arm.ithor_arm_sensors.RelativeAgentArmToObjectSensor object at 0x7f7c403c7c40>, <ithor_arm.ithor_arm_sensors.RelativeObjectToGoalSensor object at 0x7f7c403c7d90>, <ithor_arm.ithor_arm_sensors.PickedUpObjSensor object at 0x7f7c403d70a0>], 'action_space': Discrete(13), 'seed': 506456969, 'deterministic_cudnn': False, 'rewards_config': {'step_penalty': -0.01, 'goal_success_reward': 10.0, 'pickup_success_reward': 5.0, 'failed_stop_reward': 0.0, 'shaping_weight': 1.0, 'failed_action_penalty': -0.03}, 'scene_period': 'manual', 'sampler_mode': 'train', 'cap_training': None} [vector_sampled_tasks.py: 975]

After this, no further console output appears, and I confirmed that the entire system has stopped working. Do you know when this happens and how I can solve it?
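Since the log above shows 'x_display': '0.1', one quick thing to rule out is whether that X display is reachable at all before training starts. The following is only a minimal sketch: the helper names are made up for illustration, it assumes the standard `xdpyinfo` tool is installed, and it assumes that the value '0.1' corresponds to the display name ':0.1'.

```python
import shutil
import subprocess
from typing import Optional


def to_display_name(x_display: str) -> str:
    """Turn a value such as '0.1' into the X display name ':0.1' (assumed mapping)."""
    return x_display if ":" in x_display else f":{x_display}"


def x_display_reachable(x_display: str, timeout_s: float = 5.0) -> Optional[bool]:
    """Return True/False if xdpyinfo could be run, or None if it is not installed."""
    if shutil.which("xdpyinfo") is None:
        return None  # cannot tell without the tool
    try:
        result = subprocess.run(
            ["xdpyinfo", "-display", to_display_name(x_display)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # the X server did not answer in time
    return result.returncode == 0


if __name__ == "__main__":
    print(x_display_reachable("0.1"))
```

If this prints False on the runs that freeze and True on the runs that work, the problem is more likely in the X server setup than in the training code.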

When I terminated the process, the following error was printed:

Traceback (most recent call last):
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
    util._exit_function()
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_forkserver.py", line 65, in poll
    if not wait([self.sentinel], timeout):
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
    util._exit_function()
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_forkserver.py", line 65, in poll
    if not wait([self.sentinel], timeout):
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/im2/anaconda3/envs/gos/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
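Every frame in this traceback is inside multiprocessing's join/poll code, i.e. a process waiting for a child process to exit when Ctrl-C arrived, so it does not show where a worker itself was stuck. A possible way to get that information is Python's standard `faulthandler` module. The sketch below is generic and not part of this codebase; `enable_hang_diagnostics` is a hypothetical helper that would have to be called at the start of each process of interest.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s: float = 300.0, stream=sys.stderr) -> None:
    """Hypothetical helper: make this process report where its threads are.

    `stream` must be a real file object (it needs a file descriptor).
    """
    # Dump the stacks of all threads on fatal errors such as segfaults.
    faulthandler.enable(file=stream, all_threads=True)
    # Dump the stacks on demand with `kill -USR1 <pid>` (Unix only).
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    # Dump the stacks every `timeout_s` seconds until cancelled, which
    # shows where a silent process is sitting.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)
```

With this in place, sending SIGUSR1 to a frozen worker (or waiting for the periodic dump) prints the line each thread is currently executing, which is more informative than the traceback produced by Ctrl-C.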

Lastly, I want to ask whether there are any plans to upgrade this framework. Compared to the allenact repository, this framework seems quite outdated (e.g., the ai2thor version is 0.0.1, but the current version is 5.0.0). I'd really appreciate it if this were taken into consideration.

Thank you.

@Lucaweihs
Contributor

Hi @JisuHann ,

Can you try reducing the number of processes used during training (i.e. change this line) and see if this allows training to proceed?

A newer version of this codebase can be found at https://github.com/allenai/disturb-free which was a follow-up work. The version of ai2thor used by that work is still <5.0.0 but it is more recent.

@JisuHann
Author

JisuHann commented Aug 16, 2023

Thank you for the quick response, @Lucaweihs!

First of all, I tried reducing the number of processes, but the run still stopped.
The machine has 2 RTX A6000 GPUs and 100 CPU cores, and I experimented with num_processes from 2 (1 per GPU) to 40 (20 per GPU). Only with a small num_processes does a run occasionally get through without stopping, and even then with very low probability.

To debug this, I captured one symptom of the issue. During each episode, I printed messages to find where the run stops. It turns out that it stops at the yield statement: the statements before and after it worked fine, but for some reason execution stopped at the yield statement while handling the observation_space_command or action_space_command at every step.
I confirmed that the command and res objects are printed correctly. For example,
command would be observation_space_command or action_space_command, and res can be Dict(depth_lowres:Box(-2.0, 18.0, (224, 224, 1), float32), rgb_lowres:Box(-2.1179039478302, 2.640000104904175, (224, 224, 3), float32), relative_agent_arm_to_obj:Box(-100.0, 100.0, (3,), float32), relative_obj_to_goal:Box(-100.0, 100.0, (3,), float32), pickedup_object:Box(0.0, 1.0, (1,), float32)).
So my question is: have you experienced a hang at a yield statement, and if so, how can I solve it?
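For context on what stopping at a yield means in plain Python: a generator is suspended at `yield` until its consumer calls `next()` or `send()` again, so a worker that appears stuck there may simply be waiting for the next command. The toy example below illustrates only this language behavior; `command_loop` and `trace` are invented names, not code from this repository.

```python
from typing import Generator, List


def command_loop(trace: List[str]) -> Generator[str, str, None]:
    """Toy generator that answers one command per send() call."""
    reply = "ready"
    while True:
        trace.append("before yield")
        # Execution is suspended on this line until the consumer
        # calls send(); nothing below runs in the meantime.
        command = yield reply
        trace.append(f"resumed with {command}")
        reply = f"result of {command}"


trace: List[str] = []
gen = command_loop(trace)

first = next(gen)  # runs the generator up to the first yield
# At this point trace == ["before yield"]: the generator is parked at yield.

reply = gen.send("observation_space_command")  # resumes until the next yield
```

If the process that should send the next command is itself blocked (for example while waiting on another worker), the last message printed by the generator will always be the one just before the yield.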

Second, I tried the disturb-free repository that you recommended, and the same thing happens again (even with fewer processes). I also tried another machine with 4 GeForce RTX 3090 GPUs, but it does not work there either.

Here are the details of my machine:

  • Platform: Linux-6.2.0-26-generic-x86_64-with-debian-bookworm-sid
  • 4 GeForce RTX 3090, CUDA Version: 11.7
  • Python version: 3.7.12
  • PyTorch version: 1.13.0+cu117
  • Tensorflow version: 2.7.4
