feat(queue): GitHub Actions queue monitoring, ETA, and SLO alerts#1046
feat(queue): GitHub Actions queue monitoring, ETA, and SLO alerts#1046krusche wants to merge 15 commits into
Conversation
Adds visibility into GitHub Actions queue state — per-label-set depth, self-hosted runner inventory, build-time percentiles, per-job ETA, stuck-job classification, and SLO alerts — closing the regression Artemis maintainers see versus their previous Bamboo dashboard. Server (Spring Boot, gated by helios.queue.enabled=false): - V51 migration: workflow_job, runner, queue_wait_stat, queue_alert_rule, queue_alert_event (with partial index on status='queued', GIN on labels, FK to repository(repository_id)) - WorkflowJobPersistenceService runs alongside the existing deployment timing path inside try/catch so failures cannot poison NATS redelivery - New GitHubSelfHostedRunnerMessageHandler for org-level events - EtagCache + GitHubRestClient with If-None-Match + 304 reuse and rate-limit metrics; reconcilers (runner inventory, in-progress job filler with last_reconcile_attempt_at backoff, hourly p50/p90/p95 rollup, 30-day backfill that self-throttles to 180 req/min) - QueueEtaService with label-superset capacity, 3s Caffeine cache, configurable GitHub-hosted concurrency ceiling - StuckJobClassifier (PENDING_APPROVAL / NO_RUNNER_ONLINE / RUNNERS_BUSY / CONCURRENCY_LOCK / UNKNOWN), WorkflowYamlCache (snakeyaml) - QueueAlertEvaluator with dedup via open events, quiet-hours cron - WorkflowQueueController + RunnerController + DTOs - Email template + QueueAlertEmailPayload, 3 new NotificationPreference.Type values, findUsersByTypeEnabled query Client (Angular 20): - ThemeService extracted from MainLayoutComponent so the new HeliosLineChartComponent (PrimeNG p-chart + Chart.js) can observe dark-mode toggles - /repo/:repositoryId/ci-cd/queue routes: overview, runners, stats, alerts; admin-only top-level /queue - Manual queue.api.ts pending OpenAPI regen against the new controllers Tests: 70 new (55 server, 15 client) + 2 pre-existing tests adjusted for the payload-record arity change and the 3 new enum values. Full server suite (426 tests) and client suite green. Known follow-ups from deep review (tracked but not yet fixed): - @async self-invocation in WorkflowJobBackfillService.start() - Chart.js time-axis adapter import missing - Manual queueApi() lacks BearerInterceptor wiring - LabelSets.hash separator collision for adjacency boundary inputs - QueueIndexService drifts on redelivered status updates - QueueAlertEvaluator.inQuietHours treats cron as a moment, not a window Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| CodeStyle | 2 minor |
🔴 Metrics 730 complexity
Metric Results Complexity ⚠️ 730 (≤ 20 complexity)
🟢 Coverage 69.51% diff coverage · +1.60% coverage variation
Metric Results Coverage variation ✅ +1.60% coverage variation (-1.00%) Diff coverage ✅ 69.51% diff coverage Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (c5c815e) 15395 7500 48.72% Head commit (1c82e62) 16617 (+1222) 8362 (+862) 50.32% (+1.60%) Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#1046) 1233 857 69.51% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Replaces / augments the previous smoke-test scaffolding with tests that exercise full code paths and pin behaviour matching real bugs called out in the deep review. Server (8 new + 2 expanded): - InProgressJobReconcilerFullPathTest — happy path: REST call fires, runner_id/labels/runner_kind filled in; last_reconcile_attempt_at touched even when REST returns 304/empty; one REST call per unique workflow run regardless of job count. - EmailAlertChannelTest — recipient resolution per rule kind, per-user failure isolation, no-recipients no-op, RUNNER_OFFLINE → correct preference type. - StuckJobClassifierEndToEndTest — exercises the public classify() loop end-to-end, asserts is_stuck/queued_reason/stuck_detected_at persistence for each candidate. - WorkflowJobBackfillServiceTest — running flag toggle, double-start semantics (sentinel for the @async self-invocation bug). - WorkflowJobPersistenceServiceTest — added idempotent-re-upsert and status-case-preservation tests (the partial index WHERE status='queued' is case-sensitive in Postgres). - QueueIndexServiceDriftTest — @disabled sentinels for the redelivery-drift bug (PR #1046 follow-up #5). - QuietHoursWindowTest — @disabled sentinels for the cron-as-moment vs window bug (PR #1046 follow-up #6). - QueueStatsAveragingTest — @disabled sentinel for the per-bucket-percentile averaging bug (PR #1046 follow-up #7). Client (1 new + 1 expanded): - helios-line-chart.component.spec.ts — datasets per series, palette uniqueness, options rebuild when dark mode toggles. - theme.service.spec.ts — adds DOM-side-effect coverage (the dark-mode-enabled class on <html> follows the signal via effect()). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds GitHub Actions queue monitoring to Helios, spanning persistence, reconciliation, alerting, REST APIs, notification delivery, documentation, and Angular queue dashboards.
Changes:
- Adds queue/runner database schema, ingestion, reconciliation, ETA, stuck classification, and alert evaluation.
- Adds REST controllers and Angular pages/components for queue depth, jobs, runners, stats, and alerts.
- Adds queue alert email templates, notification preference types, admin docs, and supporting tests.
Reviewed changes
Copilot reviewed 85 out of 85 changed files in this pull request and generated 45 comments.
Show a summary per file
| File | Description |
|---|---|
| server/notification/src/main/resources/email-templates/queue-alert.html | Adds queue alert email template. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJobPersistenceServiceTest.java | Tests workflow job persistence derivation/upsert behavior. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/web/WorkflowQueueControllerTest.java | Tests queue depth/jobs controller responses. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/web/RunnerControllerTest.java | Tests runner listing/pool endpoints. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/StuckJobClassifierTest.java | Tests stuck-job classification paths. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/reconcile/RunnerInventoryReconcilerTest.java | Tests runner inventory reconciliation. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/reconcile/InProgressJobReconcilerTest.java | Tests in-progress job reconciler no-op/backoff behavior. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/QueueIndexServiceTest.java | Tests hot queue counter behavior. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/QueueEtaServiceTest.java | Tests ETA capacity and hosted-runner behavior. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/LabelSetsTest.java | Tests label canonicalization/hash/runner kind helpers. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/github/GitHubSelfHostedRunnerMessageHandlerTest.java | Tests self-hosted runner webhook handling. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/queue/alert/QueueAlertEvaluatorTest.java | Tests alert firing, dedup, clearing, and quiet-hours logic. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobTimingServiceTest.java | Updates workflow job payload construction for new fields. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobMessageHandlerTest.java | Tests workflow job handler ordering/failure isolation. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/notification/NotificationPreferenceServiceTest.java | Updates notification preference default expectations. |
| server/application-server/src/test/java/de/tum/cit/aet/helios/github/EtagCacheTest.java | Tests ETag cache behavior. |
| server/application-server/src/main/resources/db/migration/V51__add_workflow_job_and_runner_inventory.sql | Adds queue, runner, stats, and alert tables/indexes. |
| server/application-server/src/main/resources/application-staging.yml | Adds staging queue/GitHub REST configuration. |
| server/application-server/src/main/resources/application-prod.yml | Adds production queue/GitHub REST configuration. |
| server/application-server/src/main/resources/application-dev.yml | Adds development queue/GitHub REST configuration. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowYamlCache.java | Adds cached workflow YAML fetch/parse helper. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJobRepository.java | Adds workflow job repository queries. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJobPersistenceService.java | Persists workflow_job webhook payloads. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/WorkflowJob.java | Adds workflow job JPA entity. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/web/WorkflowQueueController.java | Adds queue stats/jobs/depth/alerts/backfill API. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/web/RunnerController.java | Adds runner inventory API. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/web/QueueDtos.java | Adds queue/runner/alert DTO records. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/StuckJobClassifier.java | Adds scheduled stuck-job classification. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/RunnerRepository.java | Adds runner repository queries/update. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/Runner.java | Adds runner JPA entity. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/WorkflowJobBackfillService.java | Adds admin-triggered workflow job backfill. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/RunnerInventoryReconciler.java | Adds scheduled runner inventory polling. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/QueueWaitStatRollup.java | Adds hourly queue/run percentile rollup. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/reconcile/InProgressJobReconciler.java | Adds job runner/label reconciliation. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueWaitStatRepository.java | Adds queue stats repository queries. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueWaitStat.java | Adds queue stats JPA entity. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueIndexService.java | Adds hot queue counter service. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueEtaService.java | Adds queue ETA calculation service. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertRuleRepository.java | Adds alert rule repository. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertRule.java | Adds alert rule JPA entity. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertEventRepository.java | Adds alert event repository. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/QueueAlertEvent.java | Adds alert event JPA entity. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/LabelSets.java | Adds label canonicalization/hash helpers. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/github/GitHubSelfHostedRunnerPayload.java | Adds self-hosted runner webhook payload model. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/github/GitHubSelfHostedRunnerMessageHandler.java | Adds self-hosted runner webhook handler. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/alert/QueueAlertEvaluator.java | Adds scheduled queue alert evaluator. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/alert/EmailAlertChannel.java | Adds email alert channel implementation. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/queue/alert/AlertChannel.java | Adds alert channel interface. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobPayload.java | Extends workflow job payload fields. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/workflow/github/GitHubWorkflowJobMessageHandler.java | Adds queue persistence/indexing to workflow job handling. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/notification/NotificationPreferenceRepository.java | Adds enabled-user lookup for notification type. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/notification/NotificationPreference.java | Adds queue alert notification preference types. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/notification/email/QueueAlertEmailPayload.java | Adds queue alert email payload. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/github/GitHubRestClient.java | Adds ETag-aware GitHub REST client. |
| server/application-server/src/main/java/de/tum/cit/aet/helios/github/EtagCache.java | Adds conditional GET cache. |
| docs/admin/queue-monitoring.rst | Adds queue monitoring rollout/admin docs. |
| client/src/app/pages/queue/runner-list/runner-list.component.ts | Adds runner list page. |
| client/src/app/pages/queue/queue.routes.ts | Adds queue child routes. |
| client/src/app/pages/queue/queue.api.ts | Adds manual queue/runner API wrapper. |
| client/src/app/pages/queue/queue-stats/queue-stats.component.ts | Adds queue stats page. |
| client/src/app/pages/queue/queue-overview.component.ts | Adds queue overview page. |
| client/src/app/pages/queue/queue-alerts/queue-alerts.component.ts | Adds alert rules/events page. |
| client/src/app/pages/queue/components/runner-pool-panel.component.ts | Adds runner pool panel component. |
| client/src/app/pages/queue/components/runner-pool-panel.component.spec.ts | Adds runner pool component tests. |
| client/src/app/pages/queue/components/queued-reason-chip.component.ts | Adds queued reason chip component. |
| client/src/app/pages/queue/components/queued-reason-chip.component.spec.ts | Adds queued reason chip tests. |
| client/src/app/pages/queue/components/queued-jobs-table.component.ts | Adds queued jobs table component. |
| client/src/app/pages/queue/components/queued-jobs-table.component.spec.ts | Adds queued jobs table tests. |
| client/src/app/pages/queue/components/queue-depth-panel.component.ts | Adds queue depth panel component. |
| client/src/app/pages/queue/components/queue-depth-panel.component.spec.ts | Adds queue depth panel tests. |
| client/src/app/pages/main-layout/main-layout.component.ts | Moves dark-mode state into ThemeService. |
| client/src/app/core/services/theme.service.ts | Adds shared theme service. |
| client/src/app/core/services/theme.service.spec.ts | Adds theme service tests. |
| client/src/app/components/navigation-bar/navigation-bar.component.ts | Adds queue navigation entries. |
| client/src/app/components/charts/helios-line-chart.component.ts | Adds shared line chart wrapper. |
| client/src/app/app.routes.ts | Adds repo queue and admin queue routes. |
| client/package.json | Adds chart/date dependencies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| CREATE TABLE queue_alert_rule ( | ||
| id BIGSERIAL PRIMARY KEY, | ||
| kind VARCHAR(32) NOT NULL, | ||
| threshold_seconds INT, | ||
| window_minutes INT NOT NULL DEFAULT 5, | ||
| repository_id BIGINT, | ||
| label_set_hash CHAR(40), | ||
| channels TEXT[] NOT NULL DEFAULT '{EMAIL}', |
| CONSTRAINT uq_queue_wait_stat_natural | ||
| UNIQUE (repository_id, workflow_name, job_name, head_branch, | ||
| label_set_hash, bucket_start) |
| @RestController | ||
| @RequestMapping("/api/queue") | ||
| @RequiredArgsConstructor | ||
| public class WorkflowQueueController { |
| @EnforceAtLeastWritePermission | ||
| @PostMapping("/admin/backfill") | ||
| public ResponseEntity<String> startBackfill() { |
| return ruleRepository.findById(id).map(rule -> { | ||
| applyDto(rule, body); | ||
| rule.setRepositoryId(repoId); | ||
| return ResponseEntity.ok(toDto(ruleRepository.save(rule))); |
| /** Returns rate-limit remaining from the most recent response, or -1 if unknown. */ | ||
| public int rateLimitRemaining() { | ||
| // Caller-driven monitoring point; relies on Sentry breadcrumbs / metrics in production. | ||
| return -1; | ||
| } |
| private Integer medianRunDuration(Long repositoryId) { | ||
| // Cheap fallback: median over the last 50 completed jobs in this repo. | ||
| List<WorkflowJob> recent = workflowJobRepository | ||
| .findByRepositoryIdAndStatus(repositoryId, "completed") | ||
| .stream() | ||
| .filter(j -> j.getRunDurationSeconds() != null) | ||
| .limit(50) | ||
| .toList(); |
| CREATE INDEX idx_queue_alert_event_open | ||
| ON queue_alert_event (rule_id) | ||
| WHERE cleared_at IS NULL; |
| private void applyDto(QueueAlertRule rule, AlertRuleDto body) { | ||
| rule.setKind(QueueAlertRule.Kind.valueOf(body.kind())); | ||
| rule.setThresholdSeconds(body.thresholdSeconds()); | ||
| rule.setWindowMinutes(body.windowMinutes() == null ? 5 : body.windowMinutes()); | ||
| rule.setLabelSetHash(body.labelSetHash()); | ||
| rule.setChannels(body.channels() == null ? List.of("EMAIL") : body.channels()); | ||
| rule.setEnabled(body.enabled()); | ||
| rule.setQuietHoursCron(body.quietHoursCron()); |
| Map<List<String>, List<Runner>> byLabels = new HashMap<>(); | ||
| for (Runner r : runnerRepository.findAll()) { | ||
| byLabels.computeIfAbsent(r.getLabels() == null ? List.of() : r.getLabels(), | ||
| k -> new ArrayList<>()).add(r); |
Resolves the bulk of the review comments from the deep review + Copilot reviewer + checkstyle on PR #1046. Schema (V51): - Extend chk_notification_type so the 3 new enum values can be saved - queue_wait_stat natural-key columns NOT NULL DEFAULT '' so ON CONFLICT dedups correctly (Postgres NULL-distinct semantics would have inserted duplicates indefinitely) - queue_alert_event open-event partial index made UNIQUE so concurrent evaluator threads can't race and double-fire emails - Rename quiet_hours_cron → quiet_window (cron-as-moment was wrong; new semantics are HH:mm-HH:mm local-time ranges) Runtime correctness: - WorkflowJobBackfillService: dispatch through a separate proxied WorkflowJobBackfillExecutor (Spring @async self-invocation bug fix); URL-encode the `created>=...` filter; paginate /actions/runs/{id}/jobs; add an abort() flag; rollupRange() historical buckets after backfill - InProgressJobReconciler: paginate run jobs endpoint too - RunnerInventoryReconciler: explicit empty-inventory handling — empty list now marks all online runners offline (previously skipped) - GitHubSelfHostedRunnerMessageHandler: verify org.login matches config; add `deleted` to the action switch (GitHub's actual removal action); canonicalize labels before save so RunnerController.pools groups correctly even when label order differs - QueueIndexService: per-job state tracking so duplicate webhook delivery doesn't drift the counter - QueueEtaService: cache by job id (depends on per-job position, not just label-set); exclude the job being estimated from queueAhead; return null ETA when capacity is 0; replace findByRepositoryIdAndStatus + .limit(50) in Java with a real ORDER BY + bounded query - QueueAlertEvaluator: parse quiet windows as HH:mm-HH:mm ranges (handles overnight); apply rule.labelSetHash to all 3 measurements; org-wide rules now go to findForRuleWindow (which honours NULL repositoryId) instead of substituting 0L; STUCK_JOBS_OVER counts only rows that are still status='queued' - QueueAlertRule.Kind.unit() — explicit SECONDS vs COUNT to remove the "threshold in seconds" misnomer for runner-offline / stuck-jobs - Stats endpoint: sample-weighted percentile aggregate (closer approximation; documented limitation vs raw-sample percentile) Security / scoping: - WorkflowQueueController and RunnerController feature-gated by helios.queue.enabled (matches the schedulers) - updateRule / deleteRule now scoped by (id, repositoryId) — caller can't edit or delete a rule from another repo by guessing its id - Backfill endpoint upgraded to @EnforceAdmin - AlertRuleDto: @NotNull + @pattern + @min validation actually wired up - /jobs endpoint: pageable LIMIT pushed into SQL, capped at 500 Client: - ThemeService DOM toggle (already extracted) - HeliosLineChartComponent: import 'chartjs-adapter-date-fns' for side effect (the time-scale would otherwise throw at runtime) - queue.api.ts: rename quietHoursCron → quietWindow - queue-alerts: per-kind threshold-unit label ("seconds" vs "count"); null-id template guard so strict templates pass; quietWindow input - queue-stats: filters → signals (effect now refetches on change); toSignal(paramMap) so repositoryId is reactive (repo-switch no longer leaves stale polling) - queue-overview: org-wide /queue route uses orgDepth instead of spinning forever waiting for a non-existent repositoryId; reactive repositoryId - queue-alerts: reactive repositoryId - app.routes: top-level /queue exposes only the overview (stats and alerts require a repositoryId) - navigation-bar: admin-only "Org Queue" entry - eslint.config.js: ignore dist/ (was failing lint on generated output) - yarn.lock regenerated for chart.js / date-fns / chartjs-adapter - openapi.yaml regenerated; SDK regenerated via npm run generate:openapi OpenAPI profile: - helios.queue.enabled=true so the new controllers are scanned - GitHubRestClient MeterRegistry made optional (uses SimpleMeterRegistry when actuator isn't auto-configured, e.g. in the openapi profile) Tests: - QueueIndexServiceDriftTest: re-enabled (was @disabled); 4 active tests covering redelivery, separate jobs, completion of unknown job - QuietHoursWindowTest: rewritten against the new HH:mm-HH:mm semantics (same-day, overnight, invalid input) - QueueStatsAveragingTest: re-enabled (sample-weighted percentile) - WorkflowJobBackfillServiceTest: rewritten to verify the proxied async dispatch + idempotent start() + abort() short-circuit - QueueEtaServiceTest: new test for queueAhead=0 single-job-queue case (ETA ≈ 0 instead of one full p50run as before) - QueueIndexServiceTest: status case-handling updated to match new JobState.fromStatus mapping - All WebMvcTest slices now declare helios.queue.enabled=true so the feature-gated controllers load Full suite (446 server + 20 client) green locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_to_helios_deployment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- LabelSets.hash: SHA-1 → SHA-256 (Codacy critical security finding; hash is for bucketing, not crypto, but static analysis can't tell) - Widen label_set_hash column CHAR(40) → CHAR(64) to fit SHA-256 hex - Fix 6 minor checkstyle/comprehensibility findings: import order, line length, variable-declaration distance, Javadoc summary period - Regenerate openapi.yaml + client SDK to reflect the column widening Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions. |
# Conflicts: # client/yarn.lock
The PR's V52__add_workflow_job_and_runner_inventory.sql collides with V52__add_deployment_workflow_run_id_index.sql that landed on staging later. Flyway rejects duplicate version numbers (CompositeMigrationResolver line 93), which is what tanked server-tests and validate-migrations on the last CI run. Bumped to V54 (V53 is reserved for the in-flight #1098 approval-flow migration, so this avoids a second collision when that lands). No SQL changes — pure file rename. Migration content is identical.
|
@Claudia-Anthropica review |
|
There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions. |
# Conflicts: # client/src/app/core/modules/openapi/@tanstack/angular-query-experimental.gen.ts # client/src/app/core/modules/openapi/index.ts # client/src/app/core/modules/openapi/schemas.gen.ts # client/src/app/core/modules/openapi/sdk.gen.ts # client/src/app/core/modules/openapi/types.gen.ts # server/application-server/openapi.yaml # server/application-server/src/main/java/de/tum/cit/aet/helios/notification/NotificationPreference.java # server/application-server/src/test/java/de/tum/cit/aet/helios/notification/NotificationPreferenceServiceTest.java
…abel case, alert email) Three verified, self-contained fixes found while deep-reviewing #1046: - Scheduling no longer runs under the `openapi` profile. @EnableScheduling moved off AsyncConfig into a new SchedulingConfig gated `@Profile("!openapi")`. The openapi profile boots the app against empty H2 just to dump the spec; the queue @scheduled reconcilers were firing there and throwing "Table WORKFLOW_JOB/QUEUE_WAIT_STAT not found" and calling the live GitHub API. Verified: generateOpenApiDocs now boots with 0 scheduled-task errors and the queue endpoints still appear in the spec. - StuckJobClassifier.matchingRunners lower-cases the job's labels too, not only the runner labels. Previously an uppercase runs-on label (e.g. "GPU") never matched a runner labelled "gpu", yielding a false NO_RUNNER_ONLINE classification. - queue-alert.html used FreeMarker syntax (${x!"-"} defaults, <#if x??>) but the email engine is a plain ${key} regex substitution — so measured/threshold values rendered blank and the <#if> lines showed as literal text. Switched to plain keys. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| : job.getLabels().stream().map(s -> s.toLowerCase(Locale.ROOT)).toList(); | ||
| return online.stream() | ||
| .filter(r -> r.getLabels() != null | ||
| && r.getLabels().stream().map(s -> s.toLowerCase(Locale.ROOT)).toList().containsAll(needed)) |
There was a problem hiding this comment.
🚫 [checkstyle] <com.puppycrawl.tools.checkstyle.checks.sizes.LineLengthCheck> reported by reviewdog 🐶
Line is longer than 100 characters (found 104).
Deep review (Claude) — updated from staging + 5-agent auditMerged latest ✅ Fixed & pushed (
|
Continued deep-review fixes for the queue-monitoring PR.
- Reject cross-repo alert-rule writes. @EnforceAtLeastWritePermission grants
the WRITE role for the repo in the X-Repository-Id header (RepositoryContext),
but createRule/updateRule/deleteRule addressed the repo via the {repoId} path
variable -- a user with write access to repo A could create/edit/delete alert
rules on repo B by keeping A in the header and B in the path. Add
assertRepoInContext(repoId) to all three writes (403 on mismatch). New
WorkflowQueueControllerAuthzTest covers reject/allow for every write path.
- WorkflowJobBackfillService.ingestRunJobs: the @transactional on the
self-invoked method was dead (the proxy was bypassed, so it ran with no
transaction). Move the DB writes into a proxied saveJobPage(@transactional)
so the annotation actually applies, while keeping the GitHub pagination
outside the transaction -- a DB connection is never held open across the
network I/O.
- WorkflowJobRepository.touchReconcileAttempt and
RunnerRepository.markMissingOffline: add
@Modifying(clearAutomatically = true, flushAutomatically = true). Inside the
@transactional reconcile() loop the saved entities are the same rows the bulk
UPDATE touches; without flush-before / clear-after, the pending entity flush
at commit clobbered the bulk-updated lastReconcileAttemptAt, so the backoff
never stuck and the same jobs were re-reconciled every 30s (hammering GitHub).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The label-set hash join used a raw, invisible SOH (0x01) control byte embedded
directly in the string literal -- it rendered as String.join("", canonical) in
every editor, grep, and review tool, reading as an empty separator (i.e. as the
exact "abc" adjacency-collision bug the comment claims to prevent). Any reformat,
copy-paste, or "fix the empty string" cleanup could silently strip the byte and
reintroduce the collision with no visible diff.
Replace it with a named LABEL_SEPARATOR = "�" constant. This compiles to the
identical U+0001 character, so the hash is byte-identical to before
(hash(["a","bc"]) == e24681d1... unchanged -- no labelSetHash migration), but the
delimiter is now visible and self-documenting. Behavior verified unchanged by
LabelSetsTest.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The /repos/{repoId}/alerts/events endpoint passed hoursBack straight into
now().minusHours(hoursBack), and findRecent has no LIMIT — so hoursBack=
Integer.MAX_VALUE (or any large value) returned an effectively unbounded result
set, and a negative value pushed `since` into the future. Clamp to 1h..90d, the
same bounding pattern the controller already applies to the jobs `limit` and the
stats `window`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| * {@code @EnforceAtLeastWritePermission} grants the WRITE role for the repo in the | ||
| * X-Repository-Id header (the request's {@link RepositoryContext}), not for the {@code repoId} | ||
| * path variable. Guard that they are the same repo, so a user with write access to repo A cannot | ||
| * create/edit/delete alert rules on repo B by keeping A in the header while putting B in the path. |
There was a problem hiding this comment.
🚫 [checkstyle] <com.puppycrawl.tools.checkstyle.checks.sizes.LineLengthCheck> reported by reviewdog 🐶
Line is longer than 100 characters (found 101).
The queue alert-rule write UI (create/update/delete) could never succeed. The backend derives the WRITE role from the X-Repository-Id header (see GitHubJwtAuthenticationConverter) and returns no authorities when it is absent, so @EnforceAtLeastWritePermission 403'd every write. queue.api.ts uses a raw Angular HttpClient (a temporary manual client) which — unlike the generated @hey-api client wired up by RepositoryFilterGuard — never sent the header. Two fixes: - BearerInterceptor: clone with setHeaders instead of replacing the whole headers object, so caller-set headers (X-Repository-Id) survive. Replacing `headers` wholesale silently dropped them (the existing "find a better solution" TODO). - queue.api.ts: send X-Repository-Id matching the path repo on createRule/ updateRule/deleteRule. Matching the path also satisfies the controller's new path-vs-context (IDOR) check. Adds queue.api.spec.ts asserting the header is sent on each write — the bug was invisible to every existing test (GETs are permitAll; no e2e covers writes). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions. |
Summary
Adds queue visibility to Helios — per-label-set queue depth, self-hosted runner inventory, build-time percentiles, per-job ETA, stuck-job classification, and SLO alerts — closing the regression Artemis maintainers see vs the previous Bamboo dashboard.
All scheduled jobs + controllers are gated by
helios.queue.enabledand default off. The Flyway migration runs regardless (table creation only); rollback is a forward-fix migration, not a flag flip.What's in here
V56__add_workflow_job_and_runner_inventory.sql) —workflow_job,runner,queue_wait_stat,queue_alert_rule,queue_alert_event, with a partial index onstatus='queued', GIN on labels, and FK torepository(repository_id).GitHubWorkflowJobMessageHandleralso persists durableworkflow_jobrows + a Caffeine hot index, wrapped in try/catch so a failure cannot poison the existing deployment-timing path. NewGitHubSelfHostedRunnerMessageHandlerfor org-level runner events.@ConditionalOnProperty helios.queue.enabled;@EnableSchedulinglives on a@Profile("!openapi")config so they never fire during OpenAPI doc generation) — runner inventory (60s), in-progress job filler withlast_reconcile_attempt_atbackoff (30s), stuck-job classifier (60s), hourly p50/p90/p95 rollup (5min), 30-day backfill (admin-triggered, self-throttling to 180 req/min).EtagCache+GitHubRestClientwithIf-None-Match/ 304 reuse and rate-limit metrics.QueueAlertEvaluatorwith dedup-via-open-events, a realHH:mm-HH:mmquiet window, email channel + template; pluggableAlertChannelinterface.WorkflowQueueController(depth/jobs/stats/alerts CRUD/backfill) andRunnerController(@PreAuthorize("isAuthenticated()")). Reads arepermitAll(app-wide GET policy); rule writes require@EnforceAtLeastWritePermissionand that the path{repoId}matches the authorized (X-Repository-Id) repo.ThemeServiceextracted fromMainLayoutComponentso the newHeliosLineChartComponent(PrimeNG<p-chart>+ Chart.js, withchartjs-adapter-date-fns) reacts to dark-mode toggles. Routes:/repo/:id/ci-cd/queue(overview, runners, stats, alerts) + admin-only top-level/queue.Deep review — issues found & fixed
All 7 follow-ups flagged in the first review are now resolved:
start()dispatches through a proxiedWorkflowJobBackfillExecutor(so@Asyncruns off-thread), and each page is persisted through a proxiedsaveJobPage(@Transactional)so the transaction is real and wraps only the DB writes (not the GitHub pagination — no connection held across network I/O).chartjs-adapter-date-fnsis imported byHeliosLineChartComponentand pinned inpackage.json; the stats page no longer throws.X-Repository-Id, butqueue.api.ts(a rawHttpClient) never sent it, andBearerInterceptorreplaced the whole header object (dropping any caller header). Fixed:BearerInterceptornow usessetHeaders, and the write calls sendX-Repository-Idmatching the path repo. Covered byqueue.api.spec.ts.LabelSetsseparator — was a raw, invisible0x01control byte that read asString.join("", …)(i.e. as the very collision the comment claimed to prevent). Replaced with a named"�"constant — byte-identical at runtime, so nolabelSetHashmigration, but visible and reformat-safe.QueueIndexServicecounter drift — now tracks per-jobJobState; a redelivered webhook in the same state is a no-op, so the queued counter no longer drifts on retries.inQuietWindowparsesHH:mm-HH:mmas a real window (was a cron fire-moment that only suppressed for one minute).Additional hardening from a second deep review:
createRule/updateRule/deleteRulenow require the path{repoId}to equal the authorized (X-Repository-Id) repo (403 otherwise); previously a user with write access to repo A could target repo B via the path. NewWorkflowQueueControllerAuthzTest.touchReconcileAttempt/markMissingOfflinenow use@Modifying(clearAutomatically = true, flushAutomatically = true). Inside the@Transactional reconcile()loop the saved entities are the same rows the bulkUPDATEtouches; without flush-before / clear-after, the entity flush at commit clobbered the updatedlast_reconcile_attempt_at, so the same jobs were re-reconciled every 30s (hammering GitHub).runs-onlabels never matched).${key}substitution engine (removed FreeMarker${x!"-"}/<#if>syntax that rendered literally)./alerts/eventshoursBackclamped to 1h..90d (findRecenthas noLIMIT).Remaining decisions (not bugs — a call to make before/within prod enablement)
@Scheduledreconcilers and alert open-event creation assume a single app instance (no leader election / distributed lock).Verification
./gradlew :application-server:test— 496 tests, 0 failures.queue.api.spec.tsasserting the write header is sent)../gradlew :application-server:generateOpenApiDocs+ Validate OpenAPI — green.flywayValidate/validate-migrations—V56applies cleanly afterV53–V55.Still to do
self_hosted_runner, add orgadministration:read), withhelios.queue.enabled=true:/ci-cd/queuewithin 5sQUEUE_P95_OVER→ email fires once and auto-clears/api/queue/admin/backfill→ rate-limit metric stays healthyhelios.queue.enabled=trueto prod.🤖 Generated with Claude Code