monitoring: add consolidated workloads CPU & memory dashboard#210

Open
ejahnGithub wants to merge 5 commits into sigstore:main from ejahnGithub:workloads-cpu-memory-dashboard
Conversation

@ejahnGithub

Summary

Adds a single GCP Monitoring dashboard, Workloads CPU & Memory, that consolidates CPU and memory metrics across all Sigstore GKE workloads (grouped by namespace / container_name), so on-call engineers do not have to navigate multiple metric pages while investigating resource issues.

It includes (per namespace / container):

  • CPU usage (cores) + Memory used (bytes)
  • CPU & Memory limit utilization
  • CPU & Memory request utilization
  • Container restarts (delta 5m)
  • Node CPU & Memory allocatable utilization
  • Ephemeral storage used
  • Pod network RX / TX
  • Running containers per namespace

Aggregations use REDUCE_MAX on utilization charts so a single hot replica stays visible, and REDUCE_SUM on usage charts to show total per workload.
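As a rough sketch, the two aggregation shapes look like this in the dashboard JSON (alignment period and group-by fields below are illustrative, not copied from the actual dashboard):

```json
{
  "usage_charts_aggregation": {
    "alignmentPeriod": "60s",
    "perSeriesAligner": "ALIGN_RATE",
    "crossSeriesReducer": "REDUCE_SUM",
    "groupByFields": ["resource.label.\"namespace_name\"", "resource.label.\"container_name\""]
  },
  "utilization_charts_aggregation": {
    "alignmentPeriod": "60s",
    "perSeriesAligner": "ALIGN_MEAN",
    "crossSeriesReducer": "REDUCE_MAX",
    "groupByFields": ["resource.label.\"namespace_name\"", "resource.label.\"container_name\""]
  }
}
```

REDUCE_SUM answers "how much is this workload using in total", while REDUCE_MAX answers "is any single replica close to its limit", which averaging across replicas would hide.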

Wired in via a new google_monitoring_dashboard.workloads resource in gcp/modules/monitoring/infra/dashboards.tf (same pattern as the existing clients, timestamp_authority, and rekor_v1 dashboards).
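For reviewers unfamiliar with that pattern, the new resource is roughly the following sketch (the JSON file path is illustrative; the actual wiring in dashboards.tf may embed or load the JSON differently):

```hcl
# Hypothetical sketch of the new resource; path is an assumption.
resource "google_monitoring_dashboard" "workloads" {
  dashboard_json = file("${path.module}/dashboards/workloads.json")
}
```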

Testing

  • JSON syntax validated
  • terraform fmt -recursive clean
  • Tested locally by importing the dashboard JSON into staging via the Cloud Console UI

Will roll out to staging first via the usual Sigstore CI flow once merged.

Issue

Resolves sigstore/public-good-instance#1122

Eugene Jahn and others added 5 commits May 7, 2026 10:39
Adds a single GCP Monitoring dashboard that surfaces CPU and memory
across all Sigstore GKE workloads (grouped by namespace / container),
so oncall does not have to navigate multiple metric pages while
investigating resource issues.

The dashboard includes:
  - CPU usage in cores (rate of core_usage_time)
  - Memory used (non-evictable bytes)
  - CPU/memory limit utilization (REDUCE_MAX so a hot replica is visible)
  - CPU/memory request utilization (REDUCE_MAX)
  - Container restart deltas
  - Node CPU allocatable utilization

Resolves sigstore/public-good-instance#1122

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Eugene Jahn <ejahn@sigstore.dev>
The xyChart threshold schema does not accept color/direction for these
chart types; the dashboard create rejects them. Keep just the value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Eugene Jahn <ejahn@sigstore.dev>
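In other words, each threshold entry is reduced to its value field only; the threshold value below is illustrative:

```json
"thresholds": [
  { "value": 0.9 }
]
```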
Heights of 16 in a 12-column mosaic produced very tall narrow tiles.
Use h=4 (standard) for charts and keep h=4 for the overview banner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Eugene Jahn <ejahn@sigstore.dev>
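A tile in the 12-column mosaic then looks roughly like this (coordinates and title illustrative; widget contents elided):

```json
{
  "xPos": 0,
  "yPos": 4,
  "width": 6,
  "height": 4,
  "widget": {
    "title": "CPU usage (cores)",
    "xyChart": {}
  }
}
```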
…iles to workloads dashboard

Mirrors the standard GKE Workloads dashboard so oncall does not have
to navigate to multiple pages to find resource usage charts:

  - Pod network received / sent (per namespace)
  - Ephemeral storage used (per container)
  - Node memory allocatable utilization (sibling of node CPU)
  - Running containers per namespace (uptime count)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Eugene Jahn <ejahn@sigstore.dev>
Previous tile used ALIGN_COUNT + REDUCE_SUM, which sums sample counts
within the alignment window and is an approximation of container
count. Switch to ALIGN_MEAN per series + REDUCE_COUNT across series
so the y-axis is the exact number of running containers per
namespace.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Eugene Jahn <ejahn@sigstore.dev>
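A sketch of the new aggregation for the running-containers tile (the group-by label is assumed to match the namespace grouping used elsewhere in the dashboard):

```json
"aggregation": {
  "alignmentPeriod": "60s",
  "perSeriesAligner": "ALIGN_MEAN",
  "crossSeriesReducer": "REDUCE_COUNT",
  "groupByFields": ["resource.label.\"namespace_name\""]
}
```

REDUCE_COUNT counts time series rather than summing sample values, so each running container contributes exactly 1 per namespace regardless of how many samples fall in the alignment window.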
@ejahnGithub ejahnGithub requested a review from a team as a code owner May 7, 2026 15:39