The X server sometimes work but sometimes stop. #22

JisuHann · 2023-08-13T14:47:58Z

Hi. Thanks for your work!

I followed your instructions and found that the X server works on some runs but not on others. To be specific, the screen simply freezes. This happens right after the initialization stage. The last line printed to the console is:

08/13 23:26:33 INFO: Starting 0-th SingleProcessVectorSampledTasks generator with args {'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7f7c403d73d0>, 'scenes': ['FloorPlan16_physics', 'FloorPlan17_physics', 'FloorPlan18_physics', 'FloorPlan19_physics', 'FloorPlan20_physics'], 'env_args': {'gridSize': 0.25, 'width': 224, 'height': 224, 'visibilityDistance': 1.0, 'agentMode': 'arm', 'fieldOfView': 100, 'agentControllerType': 'mid-level', 'server_class': <class 'ai2thor.fifo_server.FifoServer'>, 'useMassThreshold': True, 'massThreshold': 10, 'autoSimulation': False, 'autoSyncTransforms': True, 'renderDepthImage': True, 'x_display': '0.1'}, 'max_steps': 200, 'sensors': [<ithor_arm.ithor_arm_sensors.DepthSensorThor object at 0x7f7cccede370>, <allenact_plugins.ithor_plugin.ithor_sensors.RGBSensorThor object at 0x7f7c403c7ac0>, <ithor_arm.ithor_arm_sensors.RelativeAgentArmToObjectSensor object at 0x7f7c403c7c40>, <ithor_arm.ithor_arm_sensors.RelativeObjectToGoalSensor object at 0x7f7c403c7d90>, <ithor_arm.ithor_arm_sensors.PickedUpObjSensor object at 0x7f7c403d70a0>], 'action_space': Discrete(13), 'seed': 506456969, 'deterministic_cudnn': False, 'rewards_config': {'step_penalty': -0.01, 'goal_success_reward': 10.0, 'pickup_success_reward': 5.0, 'failed_stop_reward': 0.0, 'shaping_weight': 1.0, 'failed_action_penalty': -0.03}, 'scene_period': 'manual', 'sampler_mode': 'train', 'cap_training': None} [vector_sampled_tasks.py: 975]

After this, nothing more is printed to the console, and I confirmed that the whole system has stopped working. Do you know when this happens and how I can solve it?
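One quick way to separate "the X server itself is unresponsive" from "the training processes are stuck" is to probe the display from outside the training job. The following is a minimal sketch, not part of this repository: the helper name display_is_responsive is hypothetical, it assumes the xdpyinfo tool is installed, and it assumes that the 'x_display': '0.1' value in the log corresponds to the display string ":0.1".

```python
import shutil
import subprocess


def display_is_responsive(display: str, probe: str = "xdpyinfo", timeout_s: float = 5.0) -> bool:
    """Return True if `probe -display <display>` exits with status 0 within the timeout."""
    if shutil.which(probe) is None:
        # The probe tool is not installed, so nothing can be concluded; report False.
        return False
    try:
        result = subprocess.run(
            [probe, "-display", display],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A probe that never returns suggests the X server itself is hung.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Assumption: 'x_display': '0.1' from the log means display 0, screen 1.
    print(display_is_responsive(":0.1"))
```

If the probe still succeeds while training is frozen, the problem is more likely in the worker processes than in the X server.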

When I terminated the process, the following error was printed:

Traceback (most recent call last):
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
util._exit_function()
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/util.py", line 357, in _exit_function
p.join()
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_forkserver.py", line 65, in poll
if not wait([self.sentinel], timeout):
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/home/im2/anaconda3/envs/gos/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
Traceback (most recent call last):
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
util._exit_function()
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/util.py", line 357, in _exit_function
p.join()
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/popen_forkserver.py", line 65, in poll
if not wait([self.sentinel], timeout):
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/home/im2/anaconda3/envs/gos/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/im2/anaconda3/envs/gos/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/home/im2/anaconda3/envs/gos/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
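The tracebacks above only show processes waiting in join() when Ctrl-C arrived, so they do not reveal where a worker is actually stuck. A possible way to get that information is Python's standard faulthandler module. The sketch below is an assumption about how it could be wired in, not something this codebase does: install_hang_diagnostics is a hypothetical helper that would need to be called at the start of each worker process.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(watchdog_s: float = 0.0, file=sys.stderr) -> None:
    """Make this process dump Python stack traces on SIGUSR1, and optionally on a timer."""
    # After this call, `kill -USR1 <pid>` prints the stack of every thread in this process.
    faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)
    if watchdog_s > 0:
        # Additionally dump the stacks every `watchdog_s` seconds without exiting.
        faulthandler.dump_traceback_later(watchdog_s, repeat=True, file=file)
```

With this installed, sending SIGUSR1 to a frozen worker (for example with kill -USR1 and its PID) should print the line it is blocked on, which is usually more informative than the KeyboardInterrupt trace.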

Lastly, I want to ask whether there are any plans to upgrade this framework. Compared to the allenact repository, this framework seems quite outdated (e.g., the ai2thor version is 0.0.1, but the current version is 5.0.0). I'd really appreciate it if this were taken into consideration.

Thank you.

Lucaweihs · 2023-08-15T22:20:42Z

Hi @JisuHann ,

Can you try reducing the number of processes used during training (i.e. change this line) and see if this allows training to proceed?

A newer version of this codebase can be found at https://github.com/allenai/disturb-free which was a follow-up work. The version of ai2thor used by that work is still <5.0.0 but it is more recent.

JisuHann · 2023-08-16T12:43:09Z

Thank you for the quick response, @Lucaweihs!

First of all, I tried reducing the number of processes, but the hang happened again.
The GPU setup is 2 RTX A6000 GPUs with 100 CPU cores, and I experimented with num_processes from 2 (1 per GPU) to 40 (20 per GPU). Only with a small num_processes does a run occasionally get through without hanging, and even then only with very low probability.

To debug this, I narrowed down one symptom of the issue. During each episode, I printed messages to find the point at which execution stops. It turns out that it stops at the yield statement: the statements before and after it work fine, but for some reason execution stops at the yield statement while getting the observation_space_command or action_space_command at every step.
I confirmed that the command and res objects are printed correctly. For example, command would be observation_space_command or action_space_command, and res can be Dict(depth_lowres:Box(-2.0, 18.0, (224, 224, 1), float32), rgb_lowres:Box(-2.1179039478302, 2.640000104904175, (224, 224, 3), float32), relative_agent_arm_to_obj:Box(-100.0, 100.0, (3,), float32), relative_obj_to_goal:Box(-100.0, 100.0, (3,), float32), pickedup_object:Box(0.0, 1.0, (1,), float32)).
So my question is: have you experienced a hang at the yield statement, and if so, how can I solve it?
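For context on why a print before the yield can succeed while nothing follows it: a Python generator suspends at yield until its caller calls next() or send() again, so a stop at that line may mean the side driving the generator never sent the next command. The toy loop below only illustrates that behavior. It is not the actual vector_sampled_tasks.py code, and worker_loop and the command strings are hypothetical names.

```python
from typing import Any, Dict, Generator

# Hypothetical command names echoing the ones mentioned above.
OBSERVATION_SPACE_COMMAND = "observation_space"
ACTION_SPACE_COMMAND = "action_space"


def worker_loop(spaces: Dict[str, Any]) -> Generator[Any, str, None]:
    """Toy command loop: receive a command via send(), yield back the result."""
    result: Any = "started"
    while True:
        # Execution pauses here until the caller invokes send(); if the caller
        # is blocked elsewhere, the loop appears to "hang" on this line.
        command = yield result
        if command == "close":
            return
        result = spaces[command]


gen = worker_loop({OBSERVATION_SPACE_COMMAND: "Dict(...)", ACTION_SPACE_COMMAND: "Discrete(13)"})
print(next(gen))                       # prime the generator up to its first yield
print(gen.send(ACTION_SPACE_COMMAND))  # resume the loop and run to the next yield
```

If the real loop follows a similar pattern, the place to look would be whichever process or pipe is supposed to deliver the next command, rather than the yield line itself.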

Second, I tried the disturb-free repository that you recommended, and the same problem occurs there (even with fewer processes). I also tried another machine with 4 GeForce RTX 3090 GPUs, but it does not work there either.

The details of my machine are below.

Platform: Linux-6.2.0-26-generic-x86_64-with-debian-bookworm-sid
4 GeForce RTX 3090, CUDA Version: 11.7
Python version: 3.7.12
PyTorch version: 1.13.0+cu117
Tensorflow version: 2.7.4

JisuHann mentioned this issue Aug 13, 2023

The X server screen sometimes work but sometimes stop. allenai/m-vole#2

Open
