Container start sometimes fails on host reboot #28

Open
jeflem opened this issue Aug 4, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@jeflem
Owner

jeflem commented Aug 4, 2024

When systemd stops an Ananke container, the default stop timeout is 10 seconds, which is often too short to shut down all JLab sessions and JHub gracefully. As a result, containers are not restarted automatically after a system reboot. Adding --stop-timeout=30 to the podman generate systemd line in run.sh should solve this problem (not tested).
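
A minimal sketch of what the adjusted line might look like. The container name `ananke-example` is hypothetical, and any other flags on the real line in run.sh are left out because they are not shown in this thread. The helper only prints the command so it can be inspected without Podman installed.

```shell
#!/bin/sh
# Sketch only: build the `podman generate systemd` command line with a
# longer stop timeout. The container name below is a hypothetical placeholder.

build_generate_cmd() {
  # $1 = stop timeout in seconds, $2 = container name
  printf 'podman generate systemd --stop-timeout=%s %s' "$1" "$2"
}

# Print the command instead of running it.
build_generate_cmd 30 ananke-example
echo
```

To apply it, the printed command would replace the corresponding line in run.sh, keeping whatever other options that line already has.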

jeflem added the bug label on Aug 4, 2024
jeflem changed the title from "Container stop timeout to small" to "Container stop timeout too small" on Aug 4, 2024
@jeflem
Owner Author

jeflem commented Aug 10, 2024

It seems the stop timeout isn't the problem (Podman sets it to 70 seconds). The problem is the start timeout, which Podman does not set. It could be set to infinity, see https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#TimeoutStartSec=
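
One way to try this without editing the generated unit is a systemd drop-in file. The sketch below assumes the unit is a user unit; the unit name `container-example.service` and the drop-in file name `timeout.conf` are hypothetical, not taken from Ananke.

```shell
#!/bin/sh
# Sketch only: write a drop-in that sets TimeoutStartSec=infinity.

write_start_timeout_dropin() {
  # $1 = directory holding the unit files, $2 = unit name
  dropin_dir="$1/$2.d"
  mkdir -p "$dropin_dir" || return 1
  # A drop-in only needs the section and the setting it overrides.
  printf '[Service]\nTimeoutStartSec=infinity\n' > "$dropin_dir/timeout.conf"
}

# Example call (hypothetical unit name), followed by a reload of the user manager:
#   write_start_timeout_dropin "$HOME/.config/systemd/user" container-example.service
#   systemctl --user daemon-reload
```

A drop-in survives regeneration of the unit file, which may matter if run.sh rewrites the unit on every run.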

jeflem changed the title from "Container stop timeout too small" to "Container start sometimes fails on host reboot" on Aug 27, 2024
@jeflem
Owner Author

jeflem commented Aug 27, 2024

It's not an issue of start or stop timeouts. Both values are set to 60 seconds on dev/Ananke 0.5. The core issue seems to be nvidia-persistenced.service coming up too slowly. In principle, the Ananke container's systemd unit could wait for nvidia-persistenced.service (via the --after and --requires arguments to podman generate systemd). But the NVIDIA service runs as root, while Ananke runs as an unprivileged user, and it seems that user services are not allowed to depend on root services (see the discussion in systemd issue 3312).
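
A possible workaround, not proposed in this thread and untested against Ananke, is to poll the state of the system service from the user side before starting the container. The sketch assumes that `systemctl is-active` is allowed to query the system manager as an unprivileged user. The function name and the `IS_ACTIVE_CMD` override are hypothetical helpers introduced only for illustration.

```shell
#!/bin/sh
# Sketch only: wait until a system unit reports "active", with a retry limit.

wait_for_unit() {
  # $1 = unit name, $2 = maximum number of checks, $3 = delay between checks (seconds)
  unit="$1"
  max_tries="$2"
  delay="$3"
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    # IS_ACTIVE_CMD may be overridden, e.g. for testing without systemd.
    if ${IS_ACTIVE_CMD:-systemctl is-active --quiet} "$unit"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example call: check up to 30 times, 2 seconds apart.
#   wait_for_unit nvidia-persistenced.service 30 2
```

Such a function could, for instance, be called from an ExecStartPre= line in a drop-in for the container's user unit, so the container only starts once the NVIDIA service is up or the retries are exhausted.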
