Unable to launch robot server within docker container. #81

Open
szhaovas opened this issue Jun 8, 2024 · 6 comments

@szhaovas

szhaovas commented Jun 8, 2024

Hello developers, thank you for maintaining robo-gym!

I have been having trouble running robo-gym inside a Docker container. My goal is to run the robo-gym server side inside the container and the robo-gym training script on my host machine. However, I cannot launch the robot server inside the container: the application always stalls at the step Starting Robot Server....

I initially thought it was a Docker port problem, but even when I launched both the server and the training script within the same container, the server still did not start, as shown (test.py in the right pane is simply the Random Agent MiR100 Simulation Environment example from the README):
Screenshot 2024-06-07 at 15 24 19
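
One way to rule a port problem in or out is to probe the port from the side that runs the training script. The following is a minimal sketch using only the Python standard library; the function name port_is_reachable is hypothetical, and 50100 is only an assumed example port for the server manager, not a value confirmed by robo-gym. Note that a successful connect only shows that something accepts connections on that port (with published Docker ports this may be the Docker proxy rather than the server itself).

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: 50100 is used here purely as an example port.
    print(port_is_reachable("127.0.0.1", 50100))
```

If this returns False from the host but True from inside the container, the port is most likely not published to the host.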

Steps to reproduce

  • I have pushed my Docker image to Docker Hub: docker pull szhaovas/robogym_test.
  • Launch a container with docker run --rm -it szhaovas/robogym_test.
  • Within the container terminal, run start-server-manager && attach-to-server-manager.
  • Right click to split the tmux pane, and on the other pane, run python3 test.py.

My setup

  • Probably doesn't matter, but my host machine is a MacBook Air M2 2023, and its OS is Ventura 13.5.2.
  • My docker version is 26.1.1.

Additional info

  • I think the problem is specific to Docker, since I was able to run the same example in an Ubuntu 20.04 ROS Noetic virtual machine.
  • The line that doesn't return seems to be the self.tmux_srv.new_session line inside ServerManager.new_session().
  • When running kill-all-robot-servers, it returns the error message error connecting to /tmp/tmux-0/ServerManager (No such file or directory).
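
The "No such file or directory" message suggests that the tmux server socket for the server manager was never created. A minimal sketch for checking this from Python, assuming tmux's usual layout of <tmpdir>/tmux-<uid>/<socket name> (where <tmpdir> is TMUX_TMPDIR or /tmp); the helper names tmux_socket_path and is_unix_socket are hypothetical and not part of robo-gym:

```python
import os
import stat


def tmux_socket_path(socket_name, uid=None, tmpdir=None):
    """Build the path where tmux places a named server socket."""
    uid = os.getuid() if uid is None else uid
    # tmux uses TMUX_TMPDIR if set, otherwise /tmp.
    tmpdir = tmpdir or os.environ.get("TMUX_TMPDIR", "/tmp")
    return os.path.join(tmpdir, "tmux-{}".format(uid), socket_name)


def is_unix_socket(path):
    """Return True only if path exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Path (or one of its parent directories) does not exist.
        return False


if __name__ == "__main__":
    path = tmux_socket_path("ServerManager")
    print(path, "exists" if is_unix_socket(path) else "is missing")
```

Running this as root inside the container checks /tmp/tmux-0/ServerManager, the path from the error message above.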

Thanks in advance!

@jr-b-reiterer
Contributor

Hi @szhaovas,

Have you tried replacing gui=True with gui=False in the environment initialization?

@szhaovas
Author

szhaovas commented Jul 1, 2024

Hi @jr-b-reiterer,

Thank you for the reply. Yes, I replaced gui=True with gui=False.
The test.py file in the forked repo I shared above contains the test script I was running.

@jr-b-reiterer
Contributor

When I test with your image, the behaviour is different: I get past the lines from your screenshot, but then the reset fails. The warning from gym that I get there suggests that you are using a version of gym that is too new (0.26). The present version of robo-gym is compatible only with gym up to 0.21 because of gym's API change. (An upgrade of robo-gym is in the works internally.)

Back to your observation: I am using Docker 20.10.21 on Ubuntu 20.04. I am not sure whether any difference here could cause the problem. You could test whether the behaviour changes when you run your test script not in a tmux pane but in a separate terminal attached to your running container:
docker exec -it <container name> bash

@szhaovas
Author

szhaovas commented Jul 2, 2024

Hi @jr-b-reiterer,

I downgraded gym to 0.21, and I am now getting the same error as you. I tried both running docker exec -it <container name> bash and running the test script in a tmux pane, and in both cases, I am no longer stuck at "Starting new robot server", but get an error at reset.
Do you know how I might fix the reset error? Thanks!
Screenshot 2024-07-02 at 15 55 56

@jr-b-reiterer
Contributor

I am not sure it will fix your issue, but apparently your downgrade of gym was not successful. The passive env checker that outputs the warning in your screenshot does not exist in Gym v0.21, see
https://github.com/openai/gym/blob/v0.21.0/gym/utils/passive_env_checker.py
vs
https://github.com/openai/gym/blob/0.26.0/gym/utils/passive_env_checker.py
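
To confirm which gym version the interpreter actually sees, the installed package metadata can be queried directly. This is a minimal sketch, not part of robo-gym: the helper names are hypothetical, and the upper bound (0, 21) is taken from the compatibility statement earlier in this thread and may need adjusting.

```python
from importlib import metadata  # available from Python 3.8


def parse_version(text):
    """Return the leading numeric components, e.g. '0.26.2' -> (0, 26, 2)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_supported(version_text, max_supported=(0, 21)):
    """Compare major.minor of version_text against an assumed upper bound."""
    parsed = parse_version(version_text)
    return bool(parsed) and parsed[:2] <= max_supported


def installed_gym_version():
    """Return the installed gym version string, or None if gym is absent."""
    try:
        return metadata.version("gym")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_gym_version()
    if version is None:
        print("gym is not installed in this environment")
    else:
        print(version, "supported" if is_supported(version) else "too new")
```

Running this with the same python3 that runs test.py helps catch the case where pip changed a different environment than the one in use.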

@szhaovas
Author

Update: got it to work with 2 fixes!

  1. The gym version had to be downgraded to 0.18.3 as described in a previous issue.
  2. Some container ports had to be mapped to host ports when launching the container.
     - On Linux machines, simply launch the container with the option --network host.
     - (Hacky. Please let me know if anyone has a cleaner solution.) macOS doesn't support host network mode, so instead I had to map ports specifically for the robot server and the server manager. What worked for me was docker run --rm -it -p 47000-47100:47000-47100 -p 50100-50200:50100-50200 <image>.
       - Within the container, find the 3 calls to find_free_port() in <robogym_server_modules>/server_manager/server.py (should be on L69, L75, L78), and give each of them a lower_bound and upper_bound within the range of mapped ports. Make sure the ranges don't overlap. In my case, I used find_free_port(47000, 47030), find_free_port(47030, 47060), and find_free_port(47060, 47100). Now the robo-gym training script on the host machine can communicate with Docker:
Screenshot 2024-08-15 at 18 12 07
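
For readers unfamiliar with the idea behind the second fix, here is a minimal sketch of a bounded free-port search. It is a hypothetical stand-in and not robo-gym's actual find_free_port: only the lower_bound and upper_bound parameters come from the comment above, and treating the range as half-open (upper bound excluded) is an assumption made here so that adjacent ranges cannot overlap.

```python
import socket


def find_free_port(lower_bound, upper_bound, host="0.0.0.0"):
    """Return the first port in [lower_bound, upper_bound) that can be bound.

    Hypothetical sketch; the real robo-gym helper may behave differently.
    """
    for port in range(lower_bound, upper_bound):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Port already in use (or not permitted); try the next one.
                continue
            # The socket is closed on leaving the with-block, freeing the port
            # for the caller to use.
            return port
    raise RuntimeError(
        "no free port in range {}-{}".format(lower_bound, upper_bound - 1)
    )


if __name__ == "__main__":
    # Ranges taken from the comment above; all lie within the published
    # 47000-47100 mapping.
    for low, high in [(47000, 47030), (47030, 47060), (47060, 47100)]:
        print(find_free_port(low, high))
```

Restricting the search to published ranges matters because a port chosen outside them would be bound inside the container but unreachable from the host.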
