Xfel gui sacct fix #1188
Open
dwmoreau wants to merge 3 commits into
…load
JobMonitor queried every active job with a separate sacct subprocess per submission id every 5 s, flooding NERSC's slurm accounting DB (ticket INC0256443). Batch all submission ids into one 'sacct --jobs=...' call per poll cycle, add TIMEOUT to the skip list so terminal jobs are not re-polled, and raise the poll interval to 10 s.
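The batching described in this commit message could look roughly like the sketch below. It is not the code from this PR: `build_sacct_command` and `parse_sacct_output` are hypothetical names, and the choice of output columns is an assumption. The sacct options used (`--jobs`, `--allocations`, `--noheader`, `--parsable2`, `--format`) are standard.

```python
from typing import Dict, Iterable, List


def build_sacct_command(submission_ids: Iterable[str]) -> List[str]:
    """Build one sacct invocation covering every submission id (hypothetical helper)."""
    # De-duplicate so an id shared by several jobs is only requested once.
    ids = ",".join(sorted(set(submission_ids)))
    return [
        "sacct",
        "--jobs=" + ids,
        "--allocations",   # one row per job allocation, no per-step rows
        "--noheader",
        "--parsable2",     # '|'-delimited output without a trailing delimiter
        "--format=JobID,State",
    ]


def parse_sacct_output(text: str) -> Dict[str, str]:
    """Map job id -> raw Slurm state from 'JobID|State' lines."""
    states: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, _, state = line.partition("|")
        # States such as "CANCELLED by 1234" carry a suffix; keep the first word.
        fields = state.split()
        states[job_id.strip()] = fields[0] if fields else "UNKNOWN"
    return states
```

A caller would pass the command to `subprocess.run(..., capture_output=True, text=True)` once per poll cycle and feed `stdout` to the parser, so the number of accounting-DB queries no longer grows with the number of submission ids.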
…ensembles per-chunk, reconcile in bulk
Query live job state with a single squeue (slurmctld) per cycle instead of sacct; fall back to one batched sacct only for jobs that have left the queue, to classify their terminal state once -- keeping load off the shared slurm accounting DB (ticket INC0256443). Cache terminal per-chunk statuses so finished chunks are not re-queried, and aggregate ensemble jobs by precedence instead of collapsing to UNKWN. Persist all status changes per cycle in a single UPDATE, and repaint the jobs tab only on a snapshot change.
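A minimal sketch of the per-cycle reconciliation this commit describes, assuming the squeue result has already been parsed into a dict. `plan_poll_cycle` and `should_repaint` are hypothetical names, and the status strings are placeholders rather than the GUI's actual values.

```python
from typing import Dict, Iterable, Set, Tuple


def plan_poll_cycle(
    tracked_ids: Iterable[str],
    queue_states: Dict[str, str],
    terminal_cache: Dict[str, str],
) -> Tuple[Dict[str, str], Set[str]]:
    """Split tracked ids into live states and ids needing one batched sacct call.

    queue_states   -- id -> state parsed from a single squeue call (assumed input)
    terminal_cache -- id -> terminal status recorded in an earlier cycle
    """
    live: Dict[str, str] = {}
    needs_sacct: Set[str] = set()
    for job_id in tracked_ids:
        if job_id in terminal_cache:
            continue  # already classified; never re-queried
        if job_id in queue_states:
            live[job_id] = queue_states[job_id]  # still known to slurmctld
        else:
            needs_sacct.add(job_id)  # left the queue: classify once via sacct
    return live, needs_sacct


def should_repaint(previous: Dict[str, str], current: Dict[str, str]) -> bool:
    """Repaint only when the status snapshot differs from the last one shown."""
    return previous != current
```

With this split, the accounting database is touched at most once per cycle, and only for ids that have disappeared from the queue since the previous poll.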
phyy-nx
requested changes
Jun 24, 2026
phyy-nx
left a comment
Contributor
Changeset looks great, just a few questions
…ve-job filter
Address review feedback on the batched sacct tracking:
- Inline SubmissionTracker._aggregate back into track(). It was only
reachable via the non-slurm base-class track_many path and had a single
call site, so the static-method extraction added indirection without
benefit. Behavior is unchanged (unanimous chunk statuses collapse to that
status, otherwise UNKWN).
- JobMonitor now filters active jobs against CACHEABLE_TERMINAL imported from
submission_tracker instead of a hand-maintained literal list. This adds the
previously-missing ERR to the terminal set and gives a single source of
truth, while keeping UNKWN excluded (it stays transient and re-tracked).
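The two behaviors described in this commit message can be sketched as follows. This is an illustration, not the repository's code: the members of `CACHEABLE_TERMINAL` shown here are assumptions (the message only confirms that ERR is included and UNKWN is excluded), and `aggregate` and `active_jobs` are hypothetical stand-ins for the logic inside `track()` and `JobMonitor`.

```python
from typing import Dict, Iterable, List

# Illustrative contents only; the real set lives in submission_tracker.
CACHEABLE_TERMINAL = frozenset({"DONE", "ERR", "TIMEOUT"})


def aggregate(chunk_statuses: Iterable[str]) -> str:
    """Unanimous chunk statuses collapse to that status, otherwise UNKWN."""
    unique = set(chunk_statuses)
    if len(unique) == 1:
        return unique.pop()
    return "UNKWN"  # mixed or empty input stays transient


def active_jobs(jobs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep jobs whose status is not terminal; UNKWN stays active and is re-tracked."""
    return [job for job in jobs if job["status"] not in CACHEABLE_TERMINAL]
```

Because the filter reads the shared constant, adding a terminal status in one place changes both what is cached and what the monitor stops polling.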
phyy-nx
approved these changes
Jun 24, 2026
Reduce Slurm load from the xfel GUI job monitor
Summary
The cctbx.xfel GUI's JobMonitor thread polls the status of submitted processing jobs. Its previous implementation ran a separate sacct subprocess for every submission id of every active job, every 5 seconds, querying Slurm's accounting database (slurmdbd). On the shared NERSC cctbx account this produced ~6 sacct queries/second sustained, slowing Slurm cluster-wide.
This PR reworks how the GUI tracks job status to put minimal, bounded load on Slurm.
Changes
Net effect: from dozens of accounting-DB queries every few seconds to roughly one controller query per 10 s, independent of job count.
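As a rough back-of-the-envelope check of the figures above: the count of 30 submission ids per cycle is an assumption, chosen only because it is consistent with the ~6 queries/second at a 5 s interval quoted in the Summary.

```python
def queries_per_second(queries_per_cycle: float, poll_interval_s: float) -> float:
    """Sustained query rate for a poller issuing a fixed number of queries per cycle."""
    if poll_interval_s <= 0:
        raise ValueError("poll interval must be positive")
    return queries_per_cycle / poll_interval_s


# Old design: one sacct per submission id every 5 s (30 ids is a hypothetical count).
old_rate = queries_per_second(30, 5.0)

# New design: one squeue per 10 s cycle, regardless of how many jobs are tracked.
new_rate = queries_per_second(1, 10.0)
```

The new rate leaves out the occasional batched sacct fallback, which adds at most one accounting-DB query in a cycle where jobs have left the queue.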
Files
state-mapping helper.