Incident: INC0010019 (Severity 1)
Service: Azure Container Apps — octopetsapi (rg-octopets-v2, sub ca5ce512-88e1-44b1-97c6-22caf84fb2b0)
Timeline (UTC):
- 06:02:50: Incident opened — reports of 500 errors; view details slow; add-to-cart unresponsive.
- 06:06–06:10: Collected app state, latest revision, logs, and 60m metrics.
- 06:10: Findings: repeated System.OutOfMemoryException during request handling in Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation() (ListingEndpoints.cs:18) via MapListingEndpoints (line 53). CPU low; memory ~78–79% with 0.5 vCPU/1Gi; single replica with KEDA HTTP scaler (concurrentRequests=10).
- 06:11: Mitigation applied: scaled to 1 vCPU, 2Gi memory; replicas min=2, max=4; new revision octopetsapi--0000004.
- 06:12+: Post-mitigation monitoring initiated (every 10 minutes up to 60 minutes) to confirm 5xx decline and resource headroom.
Diagnostics summary:
- Logs (latest revision): dozens of unhandled exceptions logged by Kestrel (Microsoft.AspNetCore.Server.Kestrel[13]) with System.OutOfMemoryException.
- Stack traces consistently implicate AReallyExpensiveOperation() in ListingEndpoints.cs:18, called from the listing endpoint delegate at ListingEndpoints.cs:53.
- Metrics (last 60m): CPU mostly 0–18%; Memory sustained ~78–79% of 1Gi before scaling; request counts low to moderate during error window; 500s observed in logs correlated with the listing endpoint.
Hypotheses and likely root cause:
- AReallyExpensiveOperation likely allocates large data structures or processes large payloads without streaming/pagination, causing peak memory spikes and OOM under concurrent load.
- Single-replica and low memory limit exacerbated the impact.
Recommended code fixes:
- Refactor AReallyExpensiveOperation:
  - Replace materialization of large collections with streaming (IAsyncEnumerable) or pagination.
  - Avoid ToList/ToArray on large queries; if the data comes from EF Core, prefer AsAsyncEnumerable consumed with await foreach.
  - Cap response sizes; enforce request limits and validation.
  - Use pooled buffers (ArrayPool) for transient large buffers; avoid repeated large string concatenation; prefer StringBuilder with pre-sized capacity if needed.
  - Ensure async I/O to avoid synchronous blocking and unnecessary buffering.
  - Add CancellationToken support and timeouts; shed load under pressure by returning 429/503 (e.g., via a concurrency limiter or circuit breaker).
- Error handling: catch exceptions at the endpoint boundary to return structured 5xx with correlation IDs and quickly free resources; add telemetry for allocation sizes and GC stats.
- Concurrency controls: rate-limit this endpoint or reduce per-replica concurrentRequests; consider a queue/background worker if the operation is inherently heavy.
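The streaming, pagination, and response-cap items above could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual fix: the `ListingPaging` class, its method names, and the 100-item cap are hypothetical, and the real data source behind AReallyExpensiveOperation() is not shown in this issue.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: the names and the 100-item cap are assumptions.
public static class ListingPaging
{
    public const int MaxPageSize = 100;

    // Clamp caller-supplied paging values so a single request
    // cannot ask for an unbounded result set.
    public static (long Skip, int Take) Normalize(int page, int pageSize)
    {
        int safePage = Math.Max(page, 1);
        int take = Math.Clamp(pageSize, 1, MaxPageSize);
        return ((long)(safePage - 1) * take, take);
    }

    // Yield one page item by item instead of materializing
    // the whole source with ToList()/ToArray().
    public static async IAsyncEnumerable<T> StreamPageAsync<T>(
        IAsyncEnumerable<T> source,
        long skip,
        int take,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        if (take <= 0) yield break;

        long index = 0;
        int yielded = 0;
        await foreach (T item in source.WithCancellation(ct))
        {
            if (index++ < skip) continue;        // not yet inside the requested page
            yield return item;
            if (++yielded >= take) yield break;  // stop enumerating once the page is full
        }
    }
}
```

With these values, `ListingPaging.Normalize(0, 1000)` is clamped to page 1 with 100 items. The skip here happens in process, so skipped items are still enumerated; where the backing store supports it, pushing the skip/take into the query itself avoids that cost.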
IaC and configuration observations (drift):
- Live config prior to mitigation: 0.5 vCPU, 1Gi memory, min/max replicas 1, HTTP scaler concurrentRequests=10.
- Live config after mitigation: 1 vCPU, 2Gi memory, minReplicas=2, maxReplicas=4.
- If IaC defines different resources/replicas or lacks autoscaling safeguards, update templates to match the new baseline and include KEDA HTTP concurrency tuned for memory-heavy endpoints (e.g., lower concurrency per replica).
- Automated IaC scan did not retrieve templates from this repo; please verify infra definitions and align CPU/memory/replica settings accordingly.
Actionable tasks:
- Audit and refactor ListingEndpoints.AReallyExpensiveOperation() to remove unbounded allocations; implement streaming/pagination; add unit/integration tests with large datasets.
- Add guards and error handling around the listing endpoint; return controlled errors; add metrics for allocations and GC.
- Introduce concurrency limits/rate-limiting for the listing endpoint; consider cache/worker patterns.
- Update IaC to set container resources to at least 1 vCPU/2Gi; set minReplicas>=2; tune KEDA HTTP concurrency (e.g., 5–8) for memory-heavy operations.
- Add Application Insights/OTel tracing and memory counters; create dashboards and alerts for 5xx and memory >80%.
- Load/soak test the endpoint to confirm headroom and SLO compliance.
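The concurrency-limit task above could be prototyped with a small in-process gate that rejects work when all slots are busy instead of queueing it. This is a hedged sketch: `ConcurrencyGate` and `RunOrRejectAsync` are hypothetical names, and the integer return values stand in for the HTTP status the endpoint would send.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical per-replica gate; not part of the Octopets codebase.
public sealed class ConcurrencyGate : IDisposable
{
    private readonly SemaphoreSlim _slots;

    public ConcurrencyGate(int maxConcurrent)
    {
        if (maxConcurrent < 1)
            throw new ArgumentOutOfRangeException(nameof(maxConcurrent));
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
    }

    // Runs the work if a slot is free and returns 200;
    // otherwise returns 429 immediately without waiting.
    public async Task<int> RunOrRejectAsync(Func<Task> work)
    {
        if (!_slots.Wait(0)) return 429; // zero timeout: never blocks or queues

        try
        {
            await work().ConfigureAwait(false);
            return 200;
        }
        finally
        {
            _slots.Release(); // free the slot even if the work throws
        }
    }

    public void Dispose() => _slots.Dispose();
}
```

Rejecting early keeps the number of in-flight memory-heavy operations bounded per replica, which is the property the KEDA concurrency tuning also aims for. On .NET 7 or later, the built-in ASP.NET Core rate limiting middleware (`AddRateLimiter` with a concurrency limiter) may be a better fit than a hand-rolled gate.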
Acceptance criteria:
- Reproduce original scenario via load tests without OOM at target traffic; 5xx rate < 0.1% for the listing endpoint.
- Memory utilization <80% at P95 under expected concurrency; no OutOfMemoryException in logs for 24h.
- IaC and runtime configuration are consistent; autoscaling protects against spikes.
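The numeric criteria above (5xx rate below 0.1%, P95 memory below 80%, zero OutOfMemoryException entries) can be evaluated mechanically from load-test output. The sketch below is illustrative only: `AcceptanceCheck` is a hypothetical helper, the inputs are assumed to be exported from whatever load-test and metrics tooling is used, and the nearest-rank percentile method is an assumption, since the issue does not say how P95 should be computed.

```csharp
using System;
using System.Linq;

// Hypothetical checker for the acceptance thresholds listed in this issue.
public static class AcceptanceCheck
{
    // Fraction of requests that returned a 5xx status.
    public static double ErrorRate(long totalRequests, long serverErrors) =>
        totalRequests == 0 ? 0.0 : (double)serverErrors / totalRequests;

    // Nearest-rank percentile (assumed method) over the given samples.
    public static double Percentile(double[] samples, double percentile)
    {
        if (samples.Length == 0)
            throw new ArgumentException("At least one sample is required.", nameof(samples));

        double[] sorted = samples.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(percentile / 100.0 * sorted.Length);
        return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
    }

    // True only when all three numeric criteria hold.
    public static bool Passes(
        long totalRequests,
        long serverErrors,
        double[] memoryPercentSamples,
        int outOfMemoryLogCount) =>
        ErrorRate(totalRequests, serverErrors) < 0.001     // 5xx rate < 0.1%
        && Percentile(memoryPercentSamples, 95) < 80.0     // P95 memory < 80%
        && outOfMemoryLogCount == 0;                       // no OOM in the log window
}
```

The 24-hour log window and the IaC consistency criterion are not covered by this check and still need separate verification.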
Note: If assignment to specific owners is required, please route to repo maintainers. Including owners familiar with ListingEndpoints.cs is recommended.
This issue was created by srea-octopets-007--b51b793d
Tracked by the SRE agent here