Production incident: 500 errors/non-responsiveness in Container App — findings, mitigation, and fixes

Incident: INC0010019 (Severity 1)
Service: Azure Container Apps — octopetsapi (rg-octopets-v2, sub ca5ce512-88e1-44b1-97c6-22caf84fb2b0)
Timeline (UTC):
- 06:02:50: Incident opened — reports of 500 errors; view details slow; add-to-cart unresponsive.
- 06:06–06:10: Collected app state, latest revision, logs, and 60m metrics.
- 06:10: Findings: repeated System.OutOfMemoryException during request handling in Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation() (ListingEndpoints.cs:18), called from MapListingEndpoints (ListingEndpoints.cs:53). CPU was low; memory sat at ~78–79% of the 1Gi limit (0.5 vCPU/1Gi container); a single replica was running behind the KEDA HTTP scaler (concurrentRequests=10).
- 06:11: Mitigation applied: scaled to 1 vCPU, 2Gi memory; replicas min=2, max=4; new revision octopetsapi--0000004.
- 06:12+: Post-mitigation monitoring initiated (every 10 minutes up to 60 minutes) to confirm 5xx decline and resource headroom.

Diagnostics summary:
- Logs (latest revision): dozens of unhandled exceptions logged by Kestrel (Microsoft.AspNetCore.Server.Kestrel[13]) with System.OutOfMemoryException.
- Stack traces consistently implicate AReallyExpensiveOperation() in ListingEndpoints.cs:18, called from the listing endpoint delegate at ListingEndpoints.cs:53.
- Metrics (last 60m): CPU mostly 0–18%; memory sustained at ~78–79% of 1Gi before scaling; request counts low to moderate during the error window; the 500s seen in logs correlate with the listing endpoint.

Hypotheses and likely root cause:
- AReallyExpensiveOperation likely allocates large data structures or processes large payloads without streaming/pagination, causing peak memory spikes and OOM under concurrent load.
- Single-replica and low memory limit exacerbated the impact.
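
As a rough illustration of why the low memory limit mattered, the estimate below uses only the figures reported above (1Gi limit, ~78–79% sustained utilization, concurrentRequests=10). It assumes, purely for the sake of the estimate, that the sustained figure is the steady-state baseline and that concurrent requests share the remaining memory evenly; neither assumption has been verified.

```latex
\begin{aligned}
\text{headroom} &\approx 1024\,\text{MiB} \times (1 - 0.78) \approx 225\,\text{MiB} \\
\text{budget per request} &\approx \frac{225\,\text{MiB}}{10} \approx 22.5\,\text{MiB}
\end{aligned}
```

Under those assumptions, a request that transiently allocates more than a few tens of MiB could exhaust the container when ten run at once. The actual per-request footprint of AReallyExpensiveOperation() has not been measured.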

Recommended code fixes:
- Refactor AReallyExpensiveOperation:
  - Replace materialization of large collections with streaming (IAsyncEnumerable) or pagination.
  - Avoid ToList/ToArray on large queries; prefer an async stream (e.g., EF Core's AsAsyncEnumerable) consumed with await foreach, or an async iterator that uses yield return.
  - Cap response sizes; enforce request limits and validation.
  - Use pooled buffers (ArrayPool<byte>) for transient large buffers; avoid repeated large string concatenation; prefer StringBuilder with pre-sized capacity if needed.
  - Ensure async I/O to avoid synchronous blocking and unnecessary buffering.
  - Add CancellationToken support and timeouts; return 429/503 under pressure via load shedding or a circuit breaker.
- Error handling: catch exceptions at the endpoint boundary to return structured 5xx with correlation IDs and quickly free resources; add telemetry for allocation sizes and GC stats.
- Concurrency controls: rate-limit this endpoint or reduce per-replica concurrentRequests; consider a queue/background worker if the operation is inherently heavy.
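
The streaming and response-cap recommendations above can be sketched as follows. This is a minimal illustration, not the actual endpoint code, which is not shown in this report: `ListingPaging`, `StreamListingIdsAsync`, `ReadPageAsync`, and `MaxPageSize` are hypothetical names, and the integer stream stands in for whatever data source AReallyExpensiveOperation() reads.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class ListingPaging
{
    // Hypothetical cap; pick a value that fits the endpoint's memory budget.
    public const int MaxPageSize = 100;

    // Hypothetical stand-in for the real data source: yields one item at a
    // time instead of materializing the whole collection.
    public static async IAsyncEnumerable<int> StreamListingIdsAsync(
        int total, [EnumeratorCancellation] CancellationToken ct = default)
    {
        for (var id = 1; id <= total; id++)
        {
            ct.ThrowIfCancellationRequested();
            await Task.Yield(); // placeholder for real async I/O
            yield return id;
        }
    }

    // Reads a single page from the stream. At most MaxPageSize items are
    // held in memory, regardless of the requested page size.
    public static async Task<List<int>> ReadPageAsync(
        IAsyncEnumerable<int> source, int page, int pageSize,
        CancellationToken ct = default)
    {
        if (page < 1) throw new ArgumentOutOfRangeException(nameof(page));

        var size = Math.Clamp(pageSize, 1, MaxPageSize);
        var skip = (long)(page - 1) * size;
        var result = new List<int>(size);
        long index = 0;

        await foreach (var item in source.WithCancellation(ct))
        {
            if (index++ < skip) continue;     // discard items before the page
            result.Add(item);
            if (result.Count == size) break;  // stop enumerating early
        }
        return result;
    }
}

public static class PagingDemo
{
    public static async Task Main()
    {
        // The timeout bounds how long a single request may run.
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));

        var page = await ListingPaging.ReadPageAsync(
            ListingPaging.StreamListingIdsAsync(1_000_000, cts.Token),
            page: 2, pageSize: 500, ct: cts.Token);

        // The requested size of 500 is clamped to MaxPageSize.
        // Prints: 100 items, first=101, last=200
        Console.WriteLine($"{page.Count} items, first={page[0]}, last={page[^1]}");
    }
}
```

If the data comes from a database, pushing the skip/take into the query is usually cheaper than skipping items in application code as this sketch does; the point here is only that peak memory is bounded by the page size rather than by the size of the dataset.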

IaC and configuration observations (drift):
- Live config prior to mitigation: 0.5 vCPU, 1Gi memory, minReplicas=1, maxReplicas=1, HTTP scaler concurrentRequests=10.
- Live config after mitigation: 1 vCPU, 2Gi memory, minReplicas=2, maxReplicas=4.
- If IaC defines different resources/replicas or lacks autoscaling safeguards, update templates to match the new baseline and include KEDA HTTP concurrency tuned for memory-heavy endpoints (e.g., lower concurrency per replica).
- Automated IaC scan did not retrieve templates from this repo; please verify infra definitions and align CPU/memory/replica settings accordingly.

Actionable tasks:
- [ ] Audit and refactor ListingEndpoints.AReallyExpensiveOperation() to remove unbounded allocations; implement streaming/pagination; add unit/integration tests with large datasets.
- [ ] Add guards and error handling around the listing endpoint; return controlled errors; add metrics for allocations and GC.
- [ ] Introduce concurrency limits/rate-limiting for the listing endpoint; consider cache/worker patterns.
- [ ] Update IaC to set container resources to at least 1 vCPU/2Gi; set minReplicas>=2; tune KEDA HTTP concurrency (e.g., 5–8) for memory-heavy operations.
- [ ] Add Application Insights/OTel tracing and memory counters; create dashboards and alerts for 5xx and memory >80%.
- [ ] Load/soak test the endpoint to confirm headroom and SLO compliance.
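
For the concurrency-limit task above, one possible shape is an in-process gate that admits a fixed number of concurrent executions and rejects the rest immediately, so the endpoint can answer 429/503 instead of running out of memory. This is a sketch under stated assumptions: `ConcurrencyGate` and `TryRunAsync` are hypothetical names, and the limit of 5 is only an example taken from the 5–8 range suggested above.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: admits at most maxConcurrent operations at a time.
public sealed class ConcurrencyGate
{
    private readonly SemaphoreSlim _slots;

    public ConcurrencyGate(int maxConcurrent) =>
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);

    // Returns true if the work ran, false if it was rejected because all
    // slots were busy. A caller would map false to a 429 or 503 response.
    public async Task<bool> TryRunAsync(Func<Task> work)
    {
        // Zero timeout: never queue, reject immediately when saturated.
        if (!await _slots.WaitAsync(TimeSpan.Zero)) return false;
        try
        {
            await work();
            return true;
        }
        finally
        {
            _slots.Release();
        }
    }
}

public static class GateDemo
{
    public static async Task Main()
    {
        var gate = new ConcurrencyGate(maxConcurrent: 5);

        // Holds every admitted operation open until all ten have been
        // submitted, which simulates ten overlapping requests.
        var release = new TaskCompletionSource<bool>();

        var attempts = Enumerable.Range(0, 10)
            .Select(_ => gate.TryRunAsync(() => release.Task))
            .ToList();

        release.SetResult(true);
        var outcomes = await Task.WhenAll(attempts);

        // Prints: admitted=5, rejected=5
        Console.WriteLine(
            $"admitted={outcomes.Count(o => o)}, rejected={outcomes.Count(o => !o)}");
    }
}
```

ASP.NET Core 7 and later also ship a rate limiting middleware with a built-in concurrency limiter, which may be preferable to a hand-rolled gate if the app targets one of those versions. Either way the limit is per replica, so it should be chosen together with the KEDA concurrentRequests setting.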

Acceptance criteria:
- Load tests that replay the original scenario at target traffic complete without OOM; 5xx rate < 0.1% for the listing endpoint.
- Memory utilization <80% at P95 under expected concurrency; no OutOfMemoryException in logs for 24h.
- IaC and runtime configuration are consistent; autoscaling protects against spikes.

Note: If assignment to specific owners is required, please route to repo maintainers. Including owners familiar with ListingEndpoints.cs is recommended.
---
*This issue was created by srea-octopets-007--b51b793d*
Tracked by the SRE agent [here](https://portal.azure.com/?feature.customPortal=false&feature.canmodifystamps=true&feature.fastmanifest=false&nocdn=force&websitesextension_loglevel=verbose&Microsoft_Azure_PaasServerless=beta&microsoft_azure_paasserverless_assettypeoptions=%7B%22SreAgentCustomMenu%22%3A%7B%22options%22%3A%22%22%7D%7D#view/Microsoft_Azure_PaasServerless/AgentFrameBlade.ReactView/id/%2Fsubscriptions%2Fca5ce512-88e1-44b1-97c6-22caf84fb2b0%2FresourceGroups%2Frg-octopets-v2%2Fproviders%2FMicrosoft.App%2Fagents%2Fsrea-octopets-007/sreLink/%2Fviews%2Factivities%2Fthreads%2Fe976dcac-185f-463f-aeeb-6c0d5c96089f)

