fallback on local execution if slurm fails #323

Merged: 4 commits into master from feat/listenerLocalProcessFallback on Sep 11, 2024

Conversation

@tmichela (Member) commented Aug 30, 2024

This is mainly to enable external colleagues to start testing DAMNIT without access to slurm.
It would also be useful later this year if #255 fails dramatically for reasons...
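For readers outside the facility, here is a minimal sketch of the fallback idea this PR introduces. It is not the actual implementation in damnit/backend/listener.py; the helper `submit_slurm_job` and the `cmd` argument are hypothetical stand-ins for whatever the listener really calls.

```python
import logging
import subprocess

log = logging.getLogger(__name__)

def submit_slurm_job(cmd):
    # Hypothetical stand-in: the real listener submits the job to slurm and
    # raises if the submission fails.
    subprocess.run(["sbatch", *cmd], check=True)

def process_run(cmd):
    """Try slurm first; if submission fails, run the processing locally."""
    try:
        submit_slurm_job(cmd)
    except Exception:
        log.warning("Slurm submission failed, processing run locally",
                    exc_info=True)
        # Fallback: run the same processing command as a local subprocess.
        subprocess.Popen(cmd)
```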

codecov bot commented Aug 30, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 3 lines in your changes missing coverage. Please review.

Project coverage is 75.22%. Comparing base (c295f8d) to head (c08cd9b).
Report is 17 commits behind head on master.

Files with missing lines     Patch %   Lines
damnit/backend/listener.py   85.71%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #323      +/-   ##
==========================================
+ Coverage   74.81%   75.22%   +0.41%     
==========================================
  Files          32       32              
  Lines        4892     4913      +21     
==========================================
+ Hits         3660     3696      +36     
+ Misses       1232     1217      -15     

☔ View full report in Codecov by Sentry.

@JamesWrigley (Member)

Isn't this a bit dangerous? It could lead to the thundering herd problem if submitting slurm jobs fails for some reason. It also won't apply to cluster=True variables.

@tmichela (Member, Author) commented Sep 2, 2024

I think the risk is very low. This goes back to the previous behavior, which only saw issues during one week (IIRC) when there were Kafka server connection problems. And this fallback should™️ never happen in the regular case; or rather, many other things will break if there's an issue with slurm.
I can cap the number of running threads at an arbitrary limit if you're still worried about the risk. Would you prefer to queue or drop new runs if that scenario happens?

About the cluster variables, I believe we don't want them running locally, at least not from the listener process, right? So with the current behavior the submitter will just fail when it tries to start the slurm job. I could add a log message, if that's what you mean?

@JamesWrigley (Member)

Ok, sure 👍

> About the cluster variables, I believe we don't want them running locally, at least not from the listener process, right? So with the current behavior the submitter will just fail when it tries to start the slurm job. I could add a log message, if that's what you mean?

I was mostly thinking of the external colleagues, but if they don't want to test the cluster variables then yeah, we can just let them fail. Otherwise we could add a database option to make slurm a no-op.
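For illustration only, a sketch of that hypothetical database option, reusing the names from the sketch above; the `no_slurm` setting and the `db_settings` mapping are made up and not part of this PR.

```python
# Hypothetical config check, not part of this PR: if a "no_slurm" option is
# set in the database, skip slurm entirely and always run locally.
if db_settings.get("no_slurm", False):
    subprocess.Popen(cmd)    # always process locally
else:
    submit_slurm_job(cmd)    # normal path: submit to slurm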

@tmichela (Member, Author) commented Sep 9, 2024

I've added a limit on the number of local runners anyway, plus a unit test. I'll merge tomorrow if there are no more comments.

> I was mostly thinking of the external colleagues, but if they don't want to test the cluster variables then yeah, we can just let them fail.

Since they want to run it in an environment without access to a slurm cluster, I assume it's clear that cluster variables aren't supported. But we can add a more obvious option if that becomes an issue.
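A rough sketch of capping concurrent local runners, as mentioned above; the limit of 4 and the bookkeeping are illustrative guesses, not the code merged in damnit/backend/listener.py.

```python
import logging
import subprocess

log = logging.getLogger(__name__)

MAX_CONCURRENT_LOCAL = 4  # arbitrary cap, purely illustrative
local_procs = []          # local subprocesses started by the listener

def run_locally(cmd):
    """Start cmd as a local process unless the cap is reached; drop it otherwise."""
    # Drop handles of processes that have finished since the last call.
    local_procs[:] = [p for p in local_procs if p.poll() is None]
    if len(local_procs) >= MAX_CONCURRENT_LOCAL:
        log.warning("Too many local runners already active, dropping this run")
        return None
    proc = subprocess.Popen(cmd)
    local_procs.append(proc)
    return proc
```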

@tmichela merged commit 8275706 into master on Sep 11, 2024
6 checks passed
@tmichela deleted the feat/listenerLocalProcessFallback branch on Sep 11, 2024 at 11:47