Skip to content

feat(queue): GitHub Actions queue monitoring, ETA, and SLO alerts#1046

Open
krusche wants to merge 15 commits into
stagingfrom
feat/queue-monitoring
Open

feat(queue): GitHub Actions queue monitoring, ETA, and SLO alerts#1046
krusche wants to merge 15 commits into
stagingfrom
feat/queue-monitoring

Conversation

@krusche

@krusche krusche commented May 18, 2026

Copy link
Copy Markdown
Member

Summary

Adds queue visibility to Helios — per-label-set queue depth, self-hosted runner inventory, build-time percentiles, per-job ETA, stuck-job classification, and SLO alerts — closing the regression Artemis maintainers see vs the previous Bamboo dashboard.

All scheduled jobs + controllers are gated by helios.queue.enabled and default off. The Flyway migration runs regardless (table creation only); rollback is a forward-fix migration, not a flag flip.

What's in here

  • DB (V56__add_workflow_job_and_runner_inventory.sql)workflow_job, runner, queue_wait_stat, queue_alert_rule, queue_alert_event, with a partial index on status='queued', GIN on labels, and FK to repository(repository_id).
  • IngestionGitHubWorkflowJobMessageHandler also persists durable workflow_job rows + a Caffeine hot index, wrapped in try/catch so a failure cannot poison the existing deployment-timing path. New GitHubSelfHostedRunnerMessageHandler for org-level runner events.
  • Reconcilers (each @ConditionalOnProperty helios.queue.enabled; @EnableScheduling lives on a @Profile("!openapi") config so they never fire during OpenAPI doc generation) — runner inventory (60s), in-progress job filler with last_reconcile_attempt_at backoff (30s), stuck-job classifier (60s), hourly p50/p90/p95 rollup (5min), 30-day backfill (admin-triggered, self-throttling to 180 req/min).
  • REST layerEtagCache + GitHubRestClient with If-None-Match / 304 reuse and rate-limit metrics.
  • ETA — label-superset capacity, 3s Caffeine cache, configurable GitHub-hosted concurrency ceiling.
  • AlertsQueueAlertEvaluator with dedup-via-open-events, a real HH:mm-HH:mm quiet window, email channel + template; pluggable AlertChannel interface.
  • ControllersWorkflowQueueController (depth/jobs/stats/alerts CRUD/backfill) and RunnerController (@PreAuthorize("isAuthenticated()")). Reads are permitAll (app-wide GET policy); rule writes require @EnforceAtLeastWritePermission and that the path {repoId} matches the authorized (X-Repository-Id) repo.
  • ClientThemeService extracted from MainLayoutComponent so the new HeliosLineChartComponent (PrimeNG <p-chart> + Chart.js, with chartjs-adapter-date-fns) reacts to dark-mode toggles. Routes: /repo/:id/ci-cd/queue (overview, runners, stats, alerts) + admin-only top-level /queue.
  • Tests — full server suite 496 tests green; client suite 221 tests green.

Deep review — issues found & fixed

All 7 follow-ups flagged in the first review are now resolved:

  1. Backfill self-invocationstart() dispatches through a proxied WorkflowJobBackfillExecutor (so @Async runs off-thread), and each page is persisted through a proxied saveJobPage(@Transactional) so the transaction is real and wraps only the DB writes (not the GitHub pagination — no connection held across network I/O).
  2. Chart time-axischartjs-adapter-date-fns is imported by HeliosLineChartComponent and pinned in package.json; the stats page no longer throws.
  3. Alert-rule writes 403'd — the backend derives the WRITE role from X-Repository-Id, but queue.api.ts (a raw HttpClient) never sent it, and BearerInterceptor replaced the whole header object (dropping any caller header). Fixed: BearerInterceptor now uses setHeaders, and the write calls send X-Repository-Id matching the path repo. Covered by queue.api.spec.ts.
  4. LabelSets separator — was a raw, invisible 0x01 control byte that read as String.join("", …) (i.e. as the very collision the comment claimed to prevent). Replaced with a named "�" constant — byte-identical at runtime, so no labelSetHash migration, but visible and reformat-safe.
  5. QueueIndexService counter drift — now tracks per-job JobState; a redelivered webhook in the same state is a no-op, so the queued counter no longer drifts on retries.
  6. Quiet hoursinQuietWindow parses HH:mm-HH:mm as a real window (was a cron fire-moment that only suppressed for one minute).
  7. Stats percentiles — sample-weighted percentile across buckets (was an unweighted mean of per-bucket p95s).

Additional hardening from a second deep review:

  • Cross-repo write IDORcreateRule/updateRule/deleteRule now require the path {repoId} to equal the authorized (X-Repository-Id) repo (403 otherwise); previously a user with write access to repo A could target repo B via the path. New WorkflowQueueControllerAuthzTest.
  • Reconcile backoff was a no-optouchReconcileAttempt / markMissingOffline now use @Modifying(clearAutomatically = true, flushAutomatically = true). Inside the @Transactional reconcile() loop the saved entities are the same rows the bulk UPDATE touches; without flush-before / clear-after, the entity flush at commit clobbered the updated last_reconcile_attempt_at, so the same jobs were re-reconciled every 30s (hammering GitHub).
  • Stuck-job classifier lower-cases both job and runner labels before matching (uppercase runs-on labels never matched).
  • Alert email template uses the plain ${key} substitution engine (removed FreeMarker ${x!"-"} / <#if> syntax that rendered literally).
  • /alerts/events hoursBack clamped to 1h..90d (findRecent has no LIMIT).

Remaining decisions (not bugs — a call to make before/within prod enablement)

  • Multi-instance: the @Scheduled reconcilers and alert open-event creation assume a single app instance (no leader election / distributed lock).
  • Quiet-window timezone: the window is interpreted in the server's local time — decide whose zone should apply.
  • NATS handlers always-ack (no retry / dead-letter).

Verification

  • ./gradlew :application-server:test — 496 tests, 0 failures.
  • Client unit suite — 221 tests (incl. queue.api.spec.ts asserting the write header is sent).
  • ./gradlew :application-server:generateOpenApiDocs + Validate OpenAPI — green.
  • flywayValidate / validate-migrationsV56 applies cleanly after V53V55.
  • CI on this PR: server-tests, client-tests, validate-migrations, Validate OpenAPI, lint (server / client / keycloakify), CodeQL — all green (Codacy is non-blocking).

Still to do

  • Manual staging walkthrough once GitHub App permissions are updated (subscribe self_hosted_runner, add org administration:read), with helios.queue.enabled=true:
    • webhook job appears in /ci-cd/queue within 5s
    • take a self-hosted runner offline → transitions to OFFLINE within 60s
    • create an alert rule (exercises the write path end-to-end), trigger a low QUEUE_P95_OVER → email fires once and auto-clears
    • POST /api/queue/admin/backfill → rate-limit metric stays healthy
  • Decide the three "remaining decisions" above before promoting helios.queue.enabled=true to prod.

🤖 Generated with Claude Code

Adds visibility into GitHub Actions queue state — per-label-set depth,
self-hosted runner inventory, build-time percentiles, per-job ETA,
stuck-job classification, and SLO alerts — closing the regression
Artemis maintainers see versus their previous Bamboo dashboard.

Server (Spring Boot, gated by helios.queue.enabled=false):
- V51 migration: workflow_job, runner, queue_wait_stat, queue_alert_rule,
  queue_alert_event (with partial index on status='queued', GIN on labels,
  FK to repository(repository_id))
- WorkflowJobPersistenceService runs alongside the existing deployment
  timing path inside try/catch so failures cannot poison NATS redelivery
- New GitHubSelfHostedRunnerMessageHandler for org-level events
- EtagCache + GitHubRestClient with If-None-Match + 304 reuse and
  rate-limit metrics; reconcilers (runner inventory, in-progress job
  filler with last_reconcile_attempt_at backoff, hourly p50/p90/p95
  rollup, 30-day backfill that self-throttles to 180 req/min)
- QueueEtaService with label-superset capacity, 3s Caffeine cache,
  configurable GitHub-hosted concurrency ceiling
- StuckJobClassifier (PENDING_APPROVAL / NO_RUNNER_ONLINE / RUNNERS_BUSY
  / CONCURRENCY_LOCK / UNKNOWN), WorkflowYamlCache (snakeyaml)
- QueueAlertEvaluator with dedup via open events, quiet-hours cron
- WorkflowQueueController + RunnerController + DTOs
- Email template + QueueAlertEmailPayload, 3 new NotificationPreference.Type
  values, findUsersByTypeEnabled query

Client (Angular 20):
- ThemeService extracted from MainLayoutComponent so the new
  HeliosLineChartComponent (PrimeNG p-chart + Chart.js) can observe
  dark-mode toggles
- /repo/:repositoryId/ci-cd/queue routes: overview, runners, stats,
  alerts; admin-only top-level /queue
- Manual queue.api.ts pending OpenAPI regen against the new controllers

Tests: 70 new (55 server, 15 client) + 2 pre-existing tests adjusted
for the payload-record arity change and the 3 new enum values. Full
server suite (426 tests) and client suite green.

Known follow-ups from deep review (tracked but not yet fixed):
- @async self-invocation in WorkflowJobBackfillService.start()
- Chart.js time-axis adapter import missing
- Manual queueApi() lacks BearerInterceptor wiring
- LabelSets.hash separator collision for adjacency boundary inputs
- QueueIndexService drifts on redelivered status updates
- QueueAlertEvaluator.inQuietHours treats cron as a moment, not a window

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 18, 2026 13:07
@krusche krusche requested a review from a team as a code owner May 18, 2026 13:07
@codacy-production

codacy-production Bot commented May 18, 2026

Copy link
Copy Markdown

Not up to standards ⛔

🔴 Issues 2 minor

Alerts:
⚠ 2 issues (≤ 0 issues of at least minor severity)

Results:
2 new issues

Category Results
CodeStyle 2 minor

View in Codacy

🔴 Metrics 730 complexity

Metric Results
Complexity ⚠️ 730 (≤ 20 complexity)

View in Codacy

🟢 Coverage 69.51% diff coverage · +1.60% coverage variation

Metric Results
Coverage variation +1.60% coverage variation (-1.00%)
Diff coverage 69.51% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (c5c815e) 15395 7500 48.72%
Head commit (1c82e62) 16617 (+1222) 8362 (+862) 50.32% (+1.60%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#1046) 1233 857 69.51%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Replaces / augments the previous smoke-test scaffolding with tests
that exercise full code paths and pin behaviour matching real bugs
called out in the deep review.

Server (8 new + 2 expanded):
- InProgressJobReconcilerFullPathTest — happy path: REST call fires,
  runner_id/labels/runner_kind filled in; last_reconcile_attempt_at
  touched even when REST returns 304/empty; one REST call per unique
  workflow run regardless of job count.
- EmailAlertChannelTest — recipient resolution per rule kind, per-user
  failure isolation, no-recipients no-op, RUNNER_OFFLINE → correct
  preference type.
- StuckJobClassifierEndToEndTest — exercises the public classify()
  loop end-to-end, asserts is_stuck/queued_reason/stuck_detected_at
  persistence for each candidate.
- WorkflowJobBackfillServiceTest — running flag toggle, double-start
  semantics (sentinel for the @async self-invocation bug).
- WorkflowJobPersistenceServiceTest — added idempotent-re-upsert and
  status-case-preservation tests (the partial index WHERE
  status='queued' is case-sensitive in Postgres).
- QueueIndexServiceDriftTest — @disabled sentinels for the
  redelivery-drift bug (PR #1046 follow-up #5).
- QuietHoursWindowTest — @disabled sentinels for the cron-as-moment
  vs window bug (PR #1046 follow-up #6).
- QueueStatsAveragingTest — @disabled sentinel for the
  per-bucket-percentile averaging bug (PR #1046 follow-up #7).

Client (1 new + 1 expanded):
- helios-line-chart.component.spec.ts — datasets per series, palette
  uniqueness, options rebuild when dark mode toggles.
- theme.service.spec.ts — adds DOM-side-effect coverage (the
  dark-mode-enabled class on <html> follows the signal via effect()).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds GitHub Actions queue monitoring to Helios, spanning persistence, reconciliation, alerting, REST APIs, notification delivery, documentation, and Angular queue dashboards.

Changes:

  • Adds queue/runner database schema, ingestion, reconciliation, ETA, stuck classification, and alert evaluation.
  • Adds REST controllers and Angular pages/components for queue depth, jobs, runners, stats, and alerts.
  • Adds queue alert email templates, notification preference types, admin docs, and supporting tests.

Reviewed changes

Copilot reviewed 85 out of 85 changed files in this pull request and generated 45 comments.

Show a summary per file
File Description
server/notification/src/main/resources/email-templates/queue-alert.html Adds queue alert email template.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJobPersistenceServiceTest.java Tests workflow job persistence derivation/upsert behavior.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/web/WorkflowQueueControllerTest.java Tests queue depth/jobs controller responses.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/web/RunnerControllerTest.java Tests runner listing/pool endpoints.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/StuckJobClassifierTest.java Tests stuck-job classification paths.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/reconcile/RunnerInventoryReconcilerTest.java Tests runner inventory reconciliation.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/reconcile/InProgressJobReconcilerTest.java Tests in-progress job reconciler no-op/backoff behavior.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/QueueIndexServiceTest.java Tests hot queue counter behavior.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/QueueEtaServiceTest.java Tests ETA capacity and hosted-runner behavior.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/LabelSetsTest.java Tests label canonicalization/hash/runner kind helpers.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/github/GitHubSelfHostedRunnerMessageHandlerTest.java Tests self-hosted runner webhook handling.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/alert/QueueAlertEvaluatorTest.java Tests alert firing, dedup, clearing, and quiet-hours logic.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobTimingServiceTest.java Updates workflow job payload construction for new fields.
server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobMessageHandlerTest.java Tests workflow job handler ordering/failure isolation.
server/application-server/src/test/java/de/tum/cit/aet/helios/notification/NotificationPreferenceServiceTest.java Updates notification preference default expectations.
server/application-server/src/test/java/de/tum/cit/aet/helios/github/EtagCacheTest.java Tests ETag cache behavior.
server/application-server/src/main/resources/db/migration/V51__add_workflow_job_and_runner_inventory.sql Adds queue, runner, stats, and alert tables/indexes.
server/application-server/src/main/resources/application-staging.yml Adds staging queue/GitHub REST configuration.
server/application-server/src/main/resources/application-prod.yml Adds production queue/GitHub REST configuration.
server/application-server/src/main/resources/application-dev.yml Adds development queue/GitHub REST configuration.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowYamlCache.java Adds cached workflow YAML fetch/parse helper.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJobRepository.java Adds workflow job repository queries.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJobPersistenceService.java Persists workflow_job webhook payloads.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJob.java Adds workflow job JPA entity.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/web/WorkflowQueueController.java Adds queue stats/jobs/depth/alerts/backfill API.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/web/RunnerController.java Adds runner inventory API.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/web/QueueDtos.java Adds queue/runner/alert DTO records.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/StuckJobClassifier.java Adds scheduled stuck-job classification.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/RunnerRepository.java Adds runner repository queries/update.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/Runner.java Adds runner JPA entity.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/WorkflowJobBackfillService.java Adds admin-triggered workflow job backfill.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/RunnerInventoryReconciler.java Adds scheduled runner inventory polling.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/QueueWaitStatRollup.java Adds hourly queue/run percentile rollup.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/InProgressJobReconciler.java Adds job runner/label reconciliation.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueWaitStatRepository.java Adds queue stats repository queries.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueWaitStat.java Adds queue stats JPA entity.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueIndexService.java Adds hot queue counter service.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueEtaService.java Adds queue ETA calculation service.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertRuleRepository.java Adds alert rule repository.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertRule.java Adds alert rule JPA entity.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertEventRepository.java Adds alert event repository.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertEvent.java Adds alert event JPA entity.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/LabelSets.java Adds label canonicalization/hash helpers.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/github/GitHubSelfHostedRunnerPayload.java Adds self-hosted runner webhook payload model.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/github/GitHubSelfHostedRunnerMessageHandler.java Adds self-hosted runner webhook handler.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/alert/QueueAlertEvaluator.java Adds scheduled queue alert evaluator.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/alert/EmailAlertChannel.java Adds email alert channel implementation.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/alert/AlertChannel.java Adds alert channel interface.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobPayload.java Extends workflow job payload fields.
server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobMessageHandler.java Adds queue persistence/indexing to workflow job handling.
server/application-server/src/main/java/de/tum/cit/aet/helios/notification/NotificationPreferenceRepository.java Adds enabled-user lookup for notification type.
server/application-server/src/main/java/de/tum/cit/aet/helios/notification/NotificationPreference.java Adds queue alert notification preference types.
server/application-server/src/main/java/de/tum/cit/aet/helios/notification/email/QueueAlertEmailPayload.java Adds queue alert email payload.
server/application-server/src/main/java/de/tum/cit/aet/helios/github/GitHubRestClient.java Adds ETag-aware GitHub REST client.
server/application-server/src/main/java/de/tum/cit/aet/helios/github/EtagCache.java Adds conditional GET cache.
docs/admin/queue-monitoring.rst Adds queue monitoring rollout/admin docs.
client/src/app/pages/queue/runner-list/runner-list.component.ts Adds runner list page.
client/src/app/pages/queue/queue.routes.ts Adds queue child routes.
client/src/app/pages/queue/queue.api.ts Adds manual queue/runner API wrapper.
client/src/app/pages/queue/queue-stats/queue-stats.component.ts Adds queue stats page.
client/src/app/pages/queue/queue-overview.component.ts Adds queue overview page.
client/src/app/pages/queue/queue-alerts/queue-alerts.component.ts Adds alert rules/events page.
client/src/app/pages/queue/components/runner-pool-panel.component.ts Adds runner pool panel component.
client/src/app/pages/queue/components/runner-pool-panel.component.spec.ts Adds runner pool component tests.
client/src/app/pages/queue/components/queued-reason-chip.component.ts Adds queued reason chip component.
client/src/app/pages/queue/components/queued-reason-chip.component.spec.ts Adds queued reason chip tests.
client/src/app/pages/queue/components/queued-jobs-table.component.ts Adds queued jobs table component.
client/src/app/pages/queue/components/queued-jobs-table.component.spec.ts Adds queued jobs table tests.
client/src/app/pages/queue/components/queue-depth-panel.component.ts Adds queue depth panel component.
client/src/app/pages/queue/components/queue-depth-panel.component.spec.ts Adds queue depth panel tests.
client/src/app/pages/main-layout/main-layout.component.ts Moves dark-mode state into ThemeService.
client/src/app/core/services/theme.service.ts Adds shared theme service.
client/src/app/core/services/theme.service.spec.ts Adds theme service tests.
client/src/app/components/navigation-bar/navigation-bar.component.ts Adds queue navigation entries.
client/src/app/components/charts/helios-line-chart.component.ts Adds shared line chart wrapper.
client/src/app/app.routes.ts Adds repo queue and admin queue routes.
client/package.json Adds chart/date dependencies.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +123 to +130
CREATE TABLE queue_alert_rule (
id BIGSERIAL PRIMARY KEY,
kind VARCHAR(32) NOT NULL,
threshold_seconds INT,
window_minutes INT NOT NULL DEFAULT 5,
repository_id BIGINT,
label_set_hash CHAR(40),
channels TEXT[] NOT NULL DEFAULT '{EMAIL}',
Comment on lines +111 to +113
CONSTRAINT uq_queue_wait_stat_natural
UNIQUE (repository_id, workflow_name, job_name, head_branch,
label_set_hash, bucket_start)
Comment on lines +41 to +44
@RestController
@RequestMapping("/api/queue")
@RequiredArgsConstructor
public class WorkflowQueueController {
Comment on lines +242 to +244
@EnforceAtLeastWritePermission
@PostMapping("/admin/backfill")
public ResponseEntity<String> startBackfill() {
Comment on lines +219 to +222
return ruleRepository.findById(id).map(rule -> {
applyDto(rule, body);
rule.setRepositoryId(repoId);
return ResponseEntity.ok(toDto(ruleRepository.save(rule)));
Comment on lines +86 to +90
/** Returns rate-limit remaining from the most recent response, or -1 if unknown. */
public int rateLimitRemaining() {
// Caller-driven monitoring point; relies on Sentry breadcrumbs / metrics in production.
return -1;
}
Comment on lines +171 to +178
private Integer medianRunDuration(Long repositoryId) {
// Cheap fallback: median over the last 50 completed jobs in this repo.
List<WorkflowJob> recent = workflowJobRepository
.findByRepositoryIdAndStatus(repositoryId, "completed")
.stream()
.filter(j -> j.getRunDurationSeconds() != null)
.limit(50)
.toList();
Comment on lines +165 to +167
CREATE INDEX idx_queue_alert_event_open
ON queue_alert_event (rule_id)
WHERE cleared_at IS NULL;
Comment on lines +249 to +256
private void applyDto(QueueAlertRule rule, AlertRuleDto body) {
rule.setKind(QueueAlertRule.Kind.valueOf(body.kind()));
rule.setThresholdSeconds(body.thresholdSeconds());
rule.setWindowMinutes(body.windowMinutes() == null ? 5 : body.windowMinutes());
rule.setLabelSetHash(body.labelSetHash());
rule.setChannels(body.channels() == null ? List.of("EMAIL") : body.channels());
rule.setEnabled(body.enabled());
rule.setQuietHoursCron(body.quietHoursCron());
Comment on lines +47 to +50
Map<List<String>, List<Runner>> byLabels = new HashMap<>();
for (Runner r : runnerRepository.findAll()) {
byLabels.computeIfAbsent(r.getLabels() == null ? List.of() : r.getLabels(),
k -> new ArrayList<>()).add(r);
krusche and others added 4 commits May 19, 2026 15:45
Resolves the bulk of the review comments from the deep review +
Copilot reviewer + checkstyle on PR #1046.

Schema (V51):
- Extend chk_notification_type so the 3 new enum values can be saved
- queue_wait_stat natural-key columns NOT NULL DEFAULT '' so ON CONFLICT
  dedups correctly (Postgres NULL-distinct semantics would have inserted
  duplicates indefinitely)
- queue_alert_event open-event partial index made UNIQUE so concurrent
  evaluator threads can't race and double-fire emails
- Rename quiet_hours_cron → quiet_window (cron-as-moment was wrong; new
  semantics are HH:mm-HH:mm local-time ranges)

Runtime correctness:
- WorkflowJobBackfillService: dispatch through a separate proxied
  WorkflowJobBackfillExecutor (Spring @async self-invocation bug fix);
  URL-encode the `created>=...` filter; paginate /actions/runs/{id}/jobs;
  add an abort() flag; rollupRange() historical buckets after backfill
- InProgressJobReconciler: paginate run jobs endpoint too
- RunnerInventoryReconciler: explicit empty-inventory handling — empty
  list now marks all online runners offline (previously skipped)
- GitHubSelfHostedRunnerMessageHandler: verify org.login matches config;
  add `deleted` to the action switch (GitHub's actual removal action);
  canonicalize labels before save so RunnerController.pools groups
  correctly even when label order differs
- QueueIndexService: per-job state tracking so duplicate webhook
  delivery doesn't drift the counter
- QueueEtaService: cache by job id (depends on per-job position, not
  just label-set); exclude the job being estimated from queueAhead;
  return null ETA when capacity is 0; replace
  findByRepositoryIdAndStatus + .limit(50) in Java with a real ORDER BY
  + bounded query
- QueueAlertEvaluator: parse quiet windows as HH:mm-HH:mm ranges
  (handles overnight); apply rule.labelSetHash to all 3 measurements;
  org-wide rules now go to findForRuleWindow (which honours NULL
  repositoryId) instead of substituting 0L; STUCK_JOBS_OVER counts only
  rows that are still status='queued'
- QueueAlertRule.Kind.unit() — explicit SECONDS vs COUNT to remove the
  "threshold in seconds" misnomer for runner-offline / stuck-jobs
- Stats endpoint: sample-weighted percentile aggregate (closer
  approximation; documented limitation vs raw-sample percentile)

Security / scoping:
- WorkflowQueueController and RunnerController feature-gated by
  helios.queue.enabled (matches the schedulers)
- updateRule / deleteRule now scoped by (id, repositoryId) — caller
  can't edit or delete a rule from another repo by guessing its id
- Backfill endpoint upgraded to @EnforceAdmin
- AlertRuleDto: @NotNull + @pattern + @min validation actually wired up
- /jobs endpoint: pageable LIMIT pushed into SQL, capped at 500

Client:
- ThemeService DOM toggle (already extracted)
- HeliosLineChartComponent: import 'chartjs-adapter-date-fns' for side
  effect (the time-scale would otherwise throw at runtime)
- queue.api.ts: rename quietHoursCron → quietWindow
- queue-alerts: per-kind threshold-unit label ("seconds" vs "count");
  null-id template guard so strict templates pass; quietWindow input
- queue-stats: filters → signals (effect now refetches on change);
  toSignal(paramMap) so repositoryId is reactive (repo-switch
  no longer leaves stale polling)
- queue-overview: org-wide /queue route uses orgDepth instead of
  spinning forever waiting for a non-existent repositoryId; reactive
  repositoryId
- queue-alerts: reactive repositoryId
- app.routes: top-level /queue exposes only the overview (stats and
  alerts require a repositoryId)
- navigation-bar: admin-only "Org Queue" entry
- eslint.config.js: ignore dist/ (was failing lint on generated output)
- yarn.lock regenerated for chart.js / date-fns / chartjs-adapter
- openapi.yaml regenerated; SDK regenerated via npm run generate:openapi

OpenAPI profile:
- helios.queue.enabled=true so the new controllers are scanned
- GitHubRestClient MeterRegistry made optional (uses SimpleMeterRegistry
  when actuator isn't auto-configured, e.g. in the openapi profile)

Tests:
- QueueIndexServiceDriftTest: re-enabled (was @disabled); 4 active tests
  covering redelivery, separate jobs, completion of unknown job
- QuietHoursWindowTest: rewritten against the new HH:mm-HH:mm semantics
  (same-day, overnight, invalid input)
- QueueStatsAveragingTest: re-enabled (sample-weighted percentile)
- WorkflowJobBackfillServiceTest: rewritten to verify the proxied async
  dispatch + idempotent start() + abort() short-circuit
- QueueEtaServiceTest: new test for queueAhead=0 single-job-queue case
  (ETA ≈ 0 instead of one full p50run as before)
- QueueIndexServiceTest: status case-handling updated to match new
  JobState.fromStatus mapping
- All WebMvcTest slices now declare helios.queue.enabled=true so the
  feature-gated controllers load

Full suite (446 server + 20 client) green locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_to_helios_deployment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- LabelSets.hash: SHA-1 → SHA-256 (Codacy critical security finding;
  hash is for bucketing, not crypto, but static analysis can't tell)
- Widen label_set_hash column CHAR(40) → CHAR(64) to fit SHA-256 hex
- Fix 6 minor checkstyle/comprehensibility findings: import order,
  line length, variable-declaration distance, Javadoc summary period
- Regenerate openapi.yaml + client SDK to reflect the column widening

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions.

@github-actions github-actions Bot added the stale label May 27, 2026
krusche added 3 commits May 27, 2026 22:32
The PR's V52__add_workflow_job_and_runner_inventory.sql collides with
V52__add_deployment_workflow_run_id_index.sql that landed on staging
later. Flyway rejects duplicate version numbers
(CompositeMigrationResolver line 93), which is what tanked
server-tests and validate-migrations on the last CI run.

Bumped to V54 (V53 is reserved for the in-flight #1098 approval-flow
migration, so this avoids a second collision when that lands).

No SQL changes — pure file rename. Migration content is identical.
@bassner

bassner commented May 28, 2026

Copy link
Copy Markdown
Member

@Claudia-Anthropica review

@github-actions github-actions Bot removed the stale label May 28, 2026
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Jun 5, 2026
@github-actions github-actions Bot closed this Jun 19, 2026
@krusche krusche reopened this Jun 19, 2026
@github-actions github-actions Bot removed the stale label Jun 20, 2026
krusche and others added 2 commits June 20, 2026 23:55
# Conflicts:
#	client/src/app/core/modules/openapi/@tanstack/angular-query-experimental.gen.ts
#	client/src/app/core/modules/openapi/index.ts
#	client/src/app/core/modules/openapi/schemas.gen.ts
#	client/src/app/core/modules/openapi/sdk.gen.ts
#	client/src/app/core/modules/openapi/types.gen.ts
#	server/application-server/openapi.yaml
#	server/application-server/src/main/java/de/tum/cit/aet/helios/notification/NotificationPreference.java
#	server/application-server/src/test/java/de/tum/cit/aet/helios/notification/NotificationPreferenceServiceTest.java
…abel case, alert email)

Three verified, self-contained fixes found while deep-reviewing #1046:

- Scheduling no longer runs under the `openapi` profile. @EnableScheduling moved
  off AsyncConfig into a new SchedulingConfig gated `@Profile("!openapi")`. The
  openapi profile boots the app against empty H2 just to dump the spec; the queue
  @scheduled reconcilers were firing there and throwing
  "Table WORKFLOW_JOB/QUEUE_WAIT_STAT not found" and calling the live GitHub API.
  Verified: generateOpenApiDocs now boots with 0 scheduled-task errors and the
  queue endpoints still appear in the spec.

- StuckJobClassifier.matchingRunners lower-cases the job's labels too, not only the
  runner labels. Previously an uppercase runs-on label (e.g. "GPU") never matched a
  runner labelled "gpu", yielding a false NO_RUNNER_ONLINE classification.

- queue-alert.html used FreeMarker syntax (${x!"-"} defaults, <#if x??>) but the
  email engine is a plain ${key} regex substitution — so measured/threshold values
  rendered blank and the <#if> lines showed as literal text. Switched to plain keys.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
: job.getLabels().stream().map(s -> s.toLowerCase(Locale.ROOT)).toList();
return online.stream()
.filter(r -> r.getLabels() != null
&& r.getLabels().stream().map(s -> s.toLowerCase(Locale.ROOT)).toList().containsAll(needed))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [checkstyle] <com.puppycrawl.tools.checkstyle.checks.sizes.LineLengthCheck> reported by reviewdog 🐶
Line is longer than 100 characters (found 104).

@krusche

krusche commented Jun 20, 2026

Copy link
Copy Markdown
Member Author

Deep review (Claude) — updated from staging + 5-agent audit

Merged latest staging into the branch (resolved the openapi-gen + NotificationPreference conflicts) and fixed a silent migration collision: this PR's V54 clashed with staging's V54, and its chk_notification_type rewrite omitted DEPLOYMENT_APPROVAL_REQUEST — applied after staging's V54 it would have dropped that type. Renumbered to V56 and restored all 7 types. Server 487/488 tests pass (the 1 failure is the pre-existing Docker-provider WorkflowRunOrphanSweepIntegrationTest, env-only), client 218 pass.

✅ Fixed & pushed (28fafec, plus the merge commit)

  1. Scheduling ran under the openapi profile → boot threw Table WORKFLOW_JOB/QUEUE_WAIT_STAT not found and the runner reconciler called live GitHub during every generateOpenApiDocs. Moved @EnableScheduling to a @Profile("!openapi") config. Verified: 0 scheduled-task errors on boot, queue endpoints still in the spec.
  2. StuckJobClassifier matched job labels un-lowercased against lower-cased runner labels → any uppercase runs-on label → false NO_RUNNER_ONLINE. Now lower-cases both.
  3. queue-alert.html used FreeMarker (${x!"-"}, <#if>) but the engine is plain ${key} regex → measured/threshold rendered blank + literal <#if> text. Converted to plain keys.

🔴 HIGH — need a decision / careful fix (not auto-fixed; would risk the "make it reliable" goal to do blind)

  • Authorization / cross-repo IDOR. Queue/runner/alert GETs have no repo-scoping — any authenticated user can read any repo's queue depth, job names/branches, percentiles, alert rules (WorkflowQueueController), and the whole org's runner inventory (RunnerController is isAuthenticated() only). createRule checks WRITE but not against the path repoId (X-Repository-Id header drives the role) → WRITE on repo A can create a rule on repo B. The unscoped GET-by-repoId is a pre-existing Helios pattern (WorkflowController), so read-scoping is a cross-cutting call; the cross-repo createRule write is the most concrete new hole.
  • Runner reconciler dies/thrashes on ETag 304. GitHubRestClient.get() returns empty on 304; the reconciler treats empty as both "stop paging" and (with sawAnyResponse) "mark offline." Once inventory is stable→304, it no-ops forever (a crashed runner with no webhook never gets marked offline); with >100 runners, a 304 on page 2 truncates seen and flips healthy page-2 runners OFFLINE. Fix: on 304 return the cached body (EtagCache.getBody); don't offline on a truncated page.
  • NATS handlers always ack → no retry, silent data loss. A transient DB error in the workflow_job upsert permanently loses that job's final state (the "swallow to avoid redelivery" comment is moot — redelivery never happens). Architectural (affects all handlers): nack/term on transient failure or a DLQ.
  • QueueIndexService queue-depth counter desyncs on 4h state eviction, out-of-order events, and resets to 0 on restart while jobs are still queued. Prefer deriving depth from a DB COUNT(queued) (or periodic reconcile) over an in-memory delta counter.
  • Alert open-event dedup is check-then-insert with no conflict handling → the partial unique index throws under concurrent/multi-instance eval, and the swallow-in-loop can roll back the surrounding @Transactional. Use an upsert / catch the constraint.

🟡 MEDIUM

  • /stats percentiles are a sample-weighted mean of per-bucket percentiles — not a real window percentile (Javadoc admits "approximation"; the test only proves "beats naive mean"). Heterogeneous buckets inflate/deflate it. Needs raw samples/histograms for accuracy.
  • ETA: fallback median uses nearest-rank upper-middle ([10,20]→20) while the rollup uses PERCENTILE_CONT (→15) — divergent; reference lookup ignores labelSetHash (matrix jobs get an arbitrary label set's p50); model mixes queue models; no staleness bound on the "last 50 completions."
  • PERCENTILE_CONT (double) truncated into INT columns — biases stored percentiles low.
  • @Transactional reconcilers make GitHub HTTP calls inside the transaction (InProgressJobReconciler, StuckJobClassifier) → connection-pool pinning + clobber risk against the webhook path (no @Version/lock; a stale reconciler save can revert completedqueued).
  • WorkflowJobBackfillService.ingestRunJobs @Transactional is dead (self-invoked private) → backfill isn't atomic per run; unguarded OffsetDateTime.parse can abort the loop.
  • markMissingOffline flips ALL runners offline on a single empty/blip GitHub response — needs an N-consecutive-miss guard.
  • V56 adds the 3 queue notification types to the CHECK but no preference backfillproduct call: default-ON for all users would mass-email on every breach; opt-in means no emails until a user enables it (and an already-"open" event then won't re-fire). Decide the intended default + fix the open-event-suppresses-refire interaction.
  • Quiet-hours uses unanchored LocalTime.now() (host TZ/DST shifts the window); measureQueueP95 5-min window vs hourly buckets (alert lingers after the queue drains / false-resolves).
  • LabelSets.hash joins with an empty separator["x","y"] and ["xy"] collide (latent; no test covers it — the "adjacency test" one agent cited doesn't exist).
  • WorkflowYamlCache negative-caches transient failures for 1h; GitHubRestClient ignores Link pagination, can't distinguish 403/429 from 304, no Retry-After, rateLimitRemaining() is a -1 stub.
  • /stats & /events unbounded (no clamp on hoursBack, no LIMIT); runner-list polls every 5s org-wide; jobs runs ETA per row (N+1, 3s-cached).
  • Self-hosted runner handler can create a phantom row from a late deleted/offline event and overwrite fresh data with stale (no event-timestamp guard).

✔️ Verified correct (so you don't re-check)

V56 columns/types/lengths/nullability/CHECK all match the entities; no SQL injection; AlertRuleDto validation is solid; no [innerHTML] XSS in the client; EtagCache is thread-safe + bounded (Caffeine); WorkflowJobPersistenceService upsert is null-safe + clamps negatives; QueueIndexService redelivery (same-state) is idempotent.

I fixed the three clear, self-contained bugs that affect boot/CI and obviously-wrong behavior. The HIGH items above are either pre-existing-pattern (authz) or need an architectural/product decision (NATS retry, counter source-of-truth, alert backfill default) — happy to implement any of them once you choose the direction; the 304-reconciler and dead-@Transactional ones I can fix straight away if you want them in this PR.

krusche and others added 3 commits June 21, 2026 01:17
Continued deep-review fixes for the queue-monitoring PR.

- Reject cross-repo alert-rule writes. @EnforceAtLeastWritePermission grants
  the WRITE role for the repo in the X-Repository-Id header (RepositoryContext),
  but createRule/updateRule/deleteRule addressed the repo via the {repoId} path
  variable -- a user with write access to repo A could create/edit/delete alert
  rules on repo B by keeping A in the header and B in the path. Add
  assertRepoInContext(repoId) to all three writes (403 on mismatch). New
  WorkflowQueueControllerAuthzTest covers reject/allow for every write path.

- WorkflowJobBackfillService.ingestRunJobs: the @transactional on the
  self-invoked method was dead (the proxy was bypassed, so it ran with no
  transaction). Move the DB writes into a proxied saveJobPage(@transactional)
  so the annotation actually applies, while keeping the GitHub pagination
  outside the transaction -- a DB connection is never held open across the
  network I/O.

- WorkflowJobRepository.touchReconcileAttempt and
  RunnerRepository.markMissingOffline: add
  @Modifying(clearAutomatically = true, flushAutomatically = true). Inside the
  @transactional reconcile() loop the saved entities are the same rows the bulk
  UPDATE touches; without flush-before / clear-after, the pending entity flush
  at commit clobbered the bulk-updated lastReconcileAttemptAt, so the backoff
  never stuck and the same jobs were re-reconciled every 30s (hammering GitHub).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The label-set hash join used a raw, invisible SOH (0x01) control byte embedded
directly in the string literal -- it rendered as String.join("", canonical) in
every editor, grep, and review tool, reading as an empty separator (i.e. as the
exact "abc" adjacency-collision bug the comment claims to prevent). Any reformat,
copy-paste, or "fix the empty string" cleanup could silently strip the byte and
reintroduce the collision with no visible diff.

Replace it with a named LABEL_SEPARATOR = "�" constant. This compiles to the
identical U+0001 character, so the hash is byte-identical to before
(hash(["a","bc"]) == e24681d1... unchanged -- no labelSetHash migration), but the
delimiter is now visible and self-documenting. Behavior verified unchanged by
LabelSetsTest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The /repos/{repoId}/alerts/events endpoint passed hoursBack straight into
now().minusHours(hoursBack), and findRecent has no LIMIT — so hoursBack=
Integer.MAX_VALUE (or any large value) returned an effectively unbounded result
set, and a negative value pushed `since` into the future. Clamp to 1h..90d, the
same bounding pattern the controller already applies to the jobs `limit` and the
stats `window`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* {@code @EnforceAtLeastWritePermission} grants the WRITE role for the repo in the
* X-Repository-Id header (the request's {@link RepositoryContext}), not for the {@code repoId}
* path variable. Guard that they are the same repo, so a user with write access to repo A cannot
* create/edit/delete alert rules on repo B by keeping A in the header while putting B in the path.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [checkstyle] <com.puppycrawl.tools.checkstyle.checks.sizes.LineLengthCheck> reported by reviewdog 🐶
Line is longer than 100 characters (found 101).

The queue alert-rule write UI (create/update/delete) could never succeed. The
backend derives the WRITE role from the X-Repository-Id header (see
GitHubJwtAuthenticationConverter) and returns no authorities when it is absent,
so @EnforceAtLeastWritePermission 403'd every write. queue.api.ts uses a raw
Angular HttpClient (a temporary manual client) which — unlike the generated
@hey-api client wired up by RepositoryFilterGuard — never sent the header.

Two fixes:
- BearerInterceptor: clone with setHeaders instead of replacing the whole
  headers object, so caller-set headers (X-Repository-Id) survive. Replacing
  `headers` wholesale silently dropped them (the existing "find a better
  solution" TODO).
- queue.api.ts: send X-Repository-Id matching the path repo on createRule/
  updateRule/deleteRule. Matching the path also satisfies the controller's new
  path-vs-context (IDOR) check.

Adds queue.api.spec.ts asserting the header is sent on each write — the bug was
invisible to every existing test (GETs are permitAll; no e2e covers writes).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Jun 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants