fallback on local execution if slurm fails #323
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #323      +/-   ##
==========================================
+ Coverage   74.81%   75.22%   +0.41%
==========================================
  Files          32       32
  Lines        4892     4913      +21
==========================================
+ Hits         3660     3696      +36
+ Misses       1232     1217      -15

☔ View full report in Codecov by Sentry.
Isn't this a bit dangerous? It could lead to the thundering herd problem if submitting slurm jobs fails for some reason. It also won't apply to
I think the risk is very low. This goes back to the previous behavior, which has only seen issues for one week (IIRC), when there were kafka server connection issues. And this fallback should™️ never happen in the regular case; rather, many other things will break if there's an issue with slurm. About the cluster variables, I believe we don't want them running locally, at least from the listener process, right? So I think with the current behavior the submitter will just fail when trying to start the slurm job. I could add a log message, if that's what you mean?
Ok, sure 👍
I was mostly thinking of the external colleagues, but if they don't want to test the cluster variables then yeah, we can just let them fail. Otherwise we could add a database option to make slurm a noop.
I've added a limit on local runners anyway, and a unit test. I'll merge tomorrow if there are no more comments.
Since they want to run it in an environment without access to a cluster with slurm, I assume it's clear cluster variables aren't supported. But we can wait and see, and add a more obvious option if that becomes an issue.
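The mechanism discussed in this thread — try to submit via slurm, fall back to local execution if submission fails, and cap the number of concurrent local runners so a slurm outage can't trigger a thundering herd — might look roughly like the following sketch. All names here (`submit_job`, `MAX_LOCAL_RUNNERS`) are illustrative assumptions, not DAMNIT's actual API.

```python
import subprocess
import threading

# Hypothetical sketch: cap on simultaneous local fallback runners.
MAX_LOCAL_RUNNERS = 2
_local_slots = threading.BoundedSemaphore(MAX_LOCAL_RUNNERS)

def submit_job(cmd: list[str]) -> str:
    """Submit cmd via sbatch; fall back to local execution on failure.

    Returns "slurm" or "local" depending on where the job ended up.
    """
    try:
        subprocess.run(
            ["sbatch", "--wrap", " ".join(cmd)],
            check=True, capture_output=True,
        )
        return "slurm"
    except (OSError, subprocess.CalledProcessError):
        # Slurm unavailable or submission failed: run locally,
        # but only if a fallback slot is free.
        if not _local_slots.acquire(blocking=False):
            raise RuntimeError("no free local runner slots")

        def _run():
            try:
                subprocess.run(cmd, check=True)
            finally:
                _local_slots.release()

        threading.Thread(target=_run, daemon=True).start()
        return "local"
```

The bounded semaphore is the key design choice here: even if every slurm submission starts failing, at most `MAX_LOCAL_RUNNERS` jobs run on the listener host, and the rest fail fast instead of piling up.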
This is mainly to enable external colleagues to start testing DAMNIT without access to slurm.
It would also be useful later this year if #255 fails dramatically for reasons...