Xfel gui sacct fix #1188

Open
dwmoreau wants to merge 3 commits into master from xfel_gui_sacct_fix

@dwmoreau

Contributor

Reduce Slurm load from the xfel GUI job monitor

Summary

The cctbx.xfel GUI's JobMonitor thread polls the status of submitted processing jobs. Its previous implementation ran a separate sacct subprocess for every submission id of every active job, every 5 seconds, querying Slurm's accounting database (slurmdbd). On the shared NERSC cctbx account this produced ~6 sacct queries/second sustained, slowing Slurm cluster-wide.

This PR reworks how the GUI tracks job status to put minimal, bounded load on Slurm.

Changes

  • Poll the Slurm controller instead of the accounting database. Live status now comes from a single squeue call per cycle (in-memory slurmctld state — cheap), with a single batched sacct used only to resolve the final outcome of jobs that have already left the queue. In steady state this places essentially zero load on slurmdbd.
  • One query per cycle, not one per job. All active jobs are queried together rather than spawning a subprocess per submission id. Finished jobs (including TIMEOUT) are not re-queried.
  • Sensible status for multi-part (ensemble) jobs. Jobs split across several sub-jobs now report a derived status (RUNNING if any part runs, FAILED if any part fails, DONE only when all finish) instead of showing "unknown"; finished sub-jobs are cached and not re-queried.
  • Bulk DB writes and reduced UI churn. Status changes for a cycle are written to the application database in a single UPDATE, and the Jobs tab is redrawn only when something actually changed.
  • Poll interval relaxed from 5s to 10s.

Net effect: from dozens of accounting-DB queries every few seconds to ~one controller query per 10s, independent of job count.
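
As a rough illustration of the one-query-per-cycle idea, the sketch below builds a single squeue command for all tracked ids, parses its output, and reports which ids have already left the queue. The function names are hypothetical and this is not the code in submission_tracker.py; it assumes the standard squeue options `--jobs`, `--noheader`, and `--format` with the `%i` (job id) and `%T` (state) fields.

```python
import subprocess


def build_squeue_command(job_ids):
    """Return one squeue invocation covering every id in job_ids."""
    return [
        "squeue",
        "--noheader",
        "--format=%i %T",  # job id, then state in extended form (e.g. RUNNING)
        "--jobs=" + ",".join(str(j) for j in job_ids),
    ]


def parse_squeue_output(text):
    """Map job id -> Slurm state from '<id> <STATE>' lines."""
    states = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def split_live_and_departed(job_ids, squeue_text):
    """Return (live states, ids that squeue no longer lists)."""
    live = parse_squeue_output(squeue_text)
    departed = [str(j) for j in job_ids if str(j) not in live]
    return live, departed


def poll_once(job_ids):
    """Run squeue once for all ids (needs a Slurm installation)."""
    if not job_ids:
        return {}, []
    result = subprocess.run(
        build_squeue_command(job_ids),
        capture_output=True, text=True, check=False,
    )
    return split_live_and_departed(job_ids, result.stdout)
```

Only the ids returned as departed would then need the batched sacct lookup, so the accounting database is touched only when a job actually finishes.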

Files

  • xfel/ui/components/submission_tracker.py — squeue-based batched query with sacct terminal-state fallback; ensemble aggregation; per-chunk terminal cache; shared
    state-mapping helper.
  • xfel/ui/db/xfel_db.py — update_job_statuses() bulk status-update helper.
  • xfel/ui/components/xfel_gui_init.py — JobMonitor reconcile loop (batched query, bulk write, delta-only refresh).
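
For readers unfamiliar with the bulk-write pattern, here is a minimal sketch of writing a cycle's status changes in a single UPDATE. It uses an in-memory sqlite3 database as a stand-in for the application database, and the table name `job`, its columns, and the function name `bulk_update_statuses` are all assumptions made for illustration; the real update_job_statuses() helper may look quite different.

```python
import sqlite3


def bulk_update_statuses(conn, changes):
    """Write {job_id: new_status} with one UPDATE ... CASE statement."""
    if not changes:
        return 0
    ids = list(changes)
    # One "WHEN ? THEN ?" arm per job, all values passed as parameters.
    case_sql = " ".join("WHEN ? THEN ?" for _ in ids)
    in_sql = ",".join("?" for _ in ids)
    sql = (
        f"UPDATE job SET status = CASE id {case_sql} END "
        f"WHERE id IN ({in_sql})"
    )
    params = []
    for job_id in ids:
        params.extend([job_id, changes[job_id]])
    params.extend(ids)
    cursor = conn.execute(sql, params)
    conn.commit()
    return cursor.rowcount


# Usage with a throwaway in-memory table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO job VALUES (?, ?)", [(1, "RUN"), (2, "RUN"), (3, "PEND")]
)
updated = bulk_update_statuses(conn, {1: "DONE", 2: "ERR"})
```

The WHERE clause restricts the statement to the changed rows, so untouched jobs keep their status and the database sees one round trip per cycle.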

  load

JobMonitor queried every active job with a separate sacct subprocess per submission id every 5 s, flooding NERSC's Slurm accounting DB (ticket INC0256443). Batch all submission ids into one 'sacct --jobs=...' call per poll cycle, add TIMEOUT to the skip list so terminal jobs are not re-polled, and raise the poll interval to 10 s.
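
A minimal sketch of the batched sacct call described in this commit message, for illustration only. The function names are hypothetical; the sketch assumes the standard sacct options `--jobs`, `--format`, `--noheader`, `--parsable2`, and `--allocations`, and the parsing is an assumption about what a tracker might need rather than the PR's actual code.

```python
def build_sacct_command(job_ids):
    """Return one sacct invocation covering every id in job_ids."""
    return [
        "sacct",
        "--noheader",
        "--parsable2",        # '|'-delimited fields, no trailing delimiter
        "--allocations",      # one row per job, skipping job steps
        "--format=JobID,State",
        "--jobs=" + ",".join(str(j) for j in job_ids),
    ]


def parse_sacct_output(text):
    """Map job id -> first word of the State field."""
    states = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        job_id, state = line.split("|", 1)
        words = state.split()
        if words:
            # sacct can report e.g. "CANCELLED by 5001"; keep the state word.
            states[job_id.strip()] = words[0]
    return states
```

The command list can be passed to `subprocess.run` once per poll cycle, replacing the per-id subprocesses.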
  ensembles per-chunk, reconcile in bulk

Query live job state with a single squeue (slurmctld) per cycle instead of sacct; fall back to one batched sacct only for jobs that have left the queue, to classify their terminal state once -- keeping load off the shared Slurm accounting DB (ticket INC0256443). Cache terminal per-chunk statuses so finished chunks are not re-queried, and aggregate ensemble jobs by precedence instead of collapsing to UNKWN. Persist all status changes per cycle in a single UPDATE, and repaint the jobs tab only on a snapshot change.
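
The precedence-based aggregation and the per-chunk terminal cache could be sketched roughly as follows. The status labels, the precedence order (RUNNING, then FAILED, then PENDING), and the function names are assumptions for illustration; the PR text does not spell out how a mix of running and failed chunks is ranked, so the order here simply follows the order in which the description lists the rules.

```python
# Assumed precedence: the first label found among the chunks wins.
PRECEDENCE = ("RUNNING", "FAILED", "PENDING")
# Assumed set of per-chunk statuses that never change once reached.
TERMINAL = frozenset({"DONE", "FAILED", "TIMEOUT"})


def aggregate_ensemble(chunk_statuses):
    """Derive one status for a job split into several chunks."""
    statuses = list(chunk_statuses)
    if not statuses:
        return "UNKNOWN"
    for label in PRECEDENCE:
        if label in statuses:
            return label
    if all(s == "DONE" for s in statuses):
        return "DONE"
    return "UNKNOWN"


def chunks_to_query(chunk_ids, terminal_cache):
    """Skip chunks whose terminal status is already cached."""
    return [c for c in chunk_ids if c not in terminal_cache]


def remember_terminal(terminal_cache, fresh_statuses):
    """Store newly observed terminal statuses so they are not re-queried."""
    for chunk_id, status in fresh_statuses.items():
        if status in TERMINAL:
            terminal_cache[chunk_id] = status
```

With this shape, an ensemble whose chunks are all finished costs no further queries, since every chunk id is served from the cache.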
@dwmoreau dwmoreau requested a review from phyy-nx June 24, 2026 14:10

@phyy-nx phyy-nx left a comment

Contributor

Changeset looks great, just a few questions

Comment thread xfel/ui/components/submission_tracker.py Outdated
Comment thread xfel/ui/components/xfel_gui_init.py Outdated
…ve-job filter

  Address review feedback on the batched sacct tracking:

  - Inline SubmissionTracker._aggregate back into track(). It was only
    reachable via the non-slurm base-class track_many path and had a single
    call site, so the static-method extraction added indirection without
    benefit. Behavior is unchanged (unanimous chunk statuses collapse to that
    status, otherwise UNKWN).

  - JobMonitor now filters active jobs against CACHEABLE_TERMINAL imported from
    submission_tracker instead of a hand-maintained literal list. This adds the
    previously-missing ERR to the terminal set and gives a single source of
    truth, while keeping UNKWN excluded (it stays transient and re-tracked).
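
The two points in this commit message can be illustrated with a short sketch. The members of CACHEABLE_TERMINAL shown here are an assumption (the message only confirms that ERR is included and UNKWN is not), and the job dictionaries and function names are hypothetical stand-ins for the GUI's job objects.

```python
# Assumed contents; the real set lives in submission_tracker.
CACHEABLE_TERMINAL = frozenset({"DONE", "ERR", "TIMEOUT"})


def active_jobs(jobs):
    """Keep only jobs whose status is not a cacheable terminal state.

    UNKWN is deliberately absent from the set, so such jobs stay active
    and are tracked again on the next cycle.
    """
    return [job for job in jobs if job["status"] not in CACHEABLE_TERMINAL]


def collapse_unanimous(chunk_statuses):
    """Return the shared status if all chunks agree, otherwise UNKWN."""
    unique = set(chunk_statuses)
    return unique.pop() if len(unique) == 1 else "UNKWN"
```

Importing one shared set in both modules means a newly added terminal status only has to be declared in one place.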
@dwmoreau dwmoreau marked this pull request as ready for review June 26, 2026 11:13
4 participants