Production incident: 500 errors/non-responsiveness in Container App — findings, mitigation, and fixes #7

@dm-chelupati

Incident: INC0010019 (Severity 1)
Service: Azure Container Apps — octopetsapi (rg-octopets-v2, sub ca5ce512-88e1-44b1-97c6-22caf84fb2b0)
Timeline (UTC):

  • 06:02:50: Incident opened — reports of 500 errors; view details slow; add-to-cart unresponsive.
  • 06:06–06:10: Collected app state, latest revision, logs, and 60m metrics.
  • 06:10: Findings: repeated System.OutOfMemoryException during request handling in Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation() (ListingEndpoints.cs:18), called from MapListingEndpoints (ListingEndpoints.cs:53). CPU was low; memory sat at ~78–79% of the 1Gi limit (container sized at 0.5 vCPU/1Gi); a single replica was running behind a KEDA HTTP scaler (concurrentRequests=10).
  • 06:11: Mitigation applied: scaled to 1 vCPU, 2Gi memory; replicas min=2, max=4; new revision octopetsapi--0000004.
  • 06:12+: Post-mitigation monitoring initiated (every 10 minutes up to 60 minutes) to confirm 5xx decline and resource headroom.

Diagnostics summary:

  • Logs (latest revision): dozens of unhandled exceptions logged by Kestrel (Microsoft.AspNetCore.Server.Kestrel[13]) with System.OutOfMemoryException.
  • Stack traces consistently implicate AReallyExpensiveOperation() in ListingEndpoints.cs:18, called from the listing endpoint delegate at ListingEndpoints.cs:53.
  • Metrics (last 60m): CPU mostly 0–18%; memory sustained at ~78–79% of 1Gi before scaling; request counts low to moderate during the error window; the 500s seen in the logs correlate with the listing endpoint.

Hypotheses and likely root cause:

  • AReallyExpensiveOperation likely allocates large data structures or processes large payloads without streaming/pagination, causing peak memory spikes and OOM under concurrent load.
  • Single-replica and low memory limit exacerbated the impact.

Recommended code fixes:

  • Refactor AReallyExpensiveOperation:
    • Replace materialization of large collections with streaming (IAsyncEnumerable) or pagination.
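    • Avoid ToList/ToArray on large queries; prefer AsAsyncEnumerable and consume results with await foreach (or yield return from an async iterator) so rows are not all held in memory at once.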
    • Cap response sizes; enforce request limits and validation.
    • Use pooled buffers (ArrayPool) for transient large buffers; avoid repeated large string concatenation; prefer StringBuilder with pre-sized capacity if needed.
    • Ensure async I/O to avoid synchronous blocking and unnecessary buffering.
    • Add CancellationToken support and timeouts; return 429/503 under pressure via load shedding or a circuit breaker.
  • Error handling: catch exceptions at the endpoint boundary to return structured 5xx with correlation IDs and quickly free resources; add telemetry for allocation sizes and GC stats.
  • Concurrency controls: rate-limit this endpoint or reduce per-replica concurrentRequests; consider a queue/background worker if the operation is inherently heavy.
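
The pagination and streaming recommendations above could be sketched roughly as follows. This is a minimal illustration, not the actual Octopets code: `ListingDto`, `ListingPaging`, and the `MaxPageSize` value of 100 are hypothetical names and numbers chosen for the example, and the in-memory `IEnumerable` source stands in for whatever data access `AReallyExpensiveOperation()` really performs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical DTO; the real listing type in Octopets.Backend may differ.
public record ListingDto(int Id, string Name);

public static class ListingPaging
{
    // Assumed cap on items per response; tune to real payload sizes.
    public const int MaxPageSize = 100;

    // Clamp caller-supplied paging values so a single request cannot ask
    // for an unbounded result set. The page number is capped as well so
    // the skip computation cannot overflow an int.
    public static (int Skip, int Take) Normalize(int page, int pageSize)
    {
        int safePage = Math.Clamp(page, 1, int.MaxValue / MaxPageSize);
        int take = Math.Clamp(pageSize, 1, MaxPageSize);
        return ((safePage - 1) * take, take);
    }

    // Yield one page lazily instead of materializing the whole source
    // with ToList/ToArray. Cancellation is checked before each item.
    public static async IAsyncEnumerable<ListingDto> StreamPageAsync(
        IEnumerable<ListingDto> source,
        int page,
        int pageSize,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        var (skip, take) = Normalize(page, pageSize);
        foreach (var item in source.Skip(skip).Take(take))
        {
            ct.ThrowIfCancellationRequested();
            yield return item;
            await Task.Yield(); // stand-in for real async I/O (e.g. a database read)
        }
    }
}
```

An endpoint handler could return the result of `StreamPageAsync` directly, passing through the request's `CancellationToken`, so that at most one page of items is ever produced per request. Whether this fits depends on what the expensive operation actually does, which the logs alone do not show.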

IaC and configuration observations (drift):

  • Live config prior to mitigation: 0.5 vCPU, 1Gi memory, minReplicas=1, maxReplicas=1, HTTP scaler concurrentRequests=10.
  • Live config after mitigation: 1 vCPU, 2Gi memory, minReplicas=2, maxReplicas=4.
  • If IaC defines different resources/replicas or lacks autoscaling safeguards, update templates to match the new baseline and include KEDA HTTP concurrency tuned for memory-heavy endpoints (e.g., lower concurrency per replica).
  • Automated IaC scan did not retrieve templates from this repo; please verify infra definitions and align CPU/memory/replica settings accordingly.

Actionable tasks:

  • Audit and refactor ListingEndpoints.AReallyExpensiveOperation() to remove unbounded allocations; implement streaming/pagination; add unit/integration tests with large datasets.
  • Add guards and error handling around the listing endpoint; return controlled errors; add metrics for allocations and GC.
  • Introduce concurrency limits/rate-limiting for the listing endpoint; consider cache/worker patterns.
  • Update IaC to set container resources to at least 1 vCPU/2Gi; set minReplicas>=2; tune KEDA HTTP concurrency (e.g., 5–8) for memory-heavy operations.
  • Add Application Insights/OTel tracing and memory counters; create dashboards and alerts for 5xx and memory >80%.
  • Load/soak test the endpoint to confirm headroom and SLO compliance.
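
For the concurrency-limit task above, one possible in-process guard is a non-blocking semaphore gate that rejects work when all slots are busy, so the caller can answer 429/503 instead of piling up memory-heavy requests. The sketch below is an assumption-laden illustration: `ConcurrencyGate` and `TryRunAsync` are hypothetical names, and the right slot count would have to come from load testing.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: admits at most maxConcurrent operations at a time.
public sealed class ConcurrencyGate
{
    private readonly SemaphoreSlim _slots;

    public ConcurrencyGate(int maxConcurrent)
    {
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
    }

    // Runs the work if a slot is free and returns true.
    // Returns false immediately (without queuing) when the gate is full,
    // so the endpoint can respond with 429 or 503.
    public async Task<bool> TryRunAsync(
        Func<CancellationToken, Task> work,
        CancellationToken ct = default)
    {
        if (!_slots.Wait(0)) // zero timeout: never blocks the request thread
        {
            return false;
        }

        try
        {
            await work(ct);
            return true;
        }
        finally
        {
            _slots.Release(); // free the slot even if the work throws
        }
    }
}
```

On .NET 7 or later, ASP.NET Core's built-in rate limiting middleware (its concurrency limiter policy) may be a better fit than a hand-rolled gate; the version above only shows the admit-or-reject idea in isolation.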

Acceptance criteria:

  • Reproduce original scenario via load tests without OOM at target traffic; 5xx rate < 0.1% for the listing endpoint.
  • Memory utilization <80% at P95 under expected concurrency; no OutOfMemoryException in logs for 24h.
  • IaC and runtime configuration are consistent; autoscaling protects against spikes.
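
The two numeric criteria above (5xx rate below 0.1% and P95 memory below 80%) could be evaluated from load-test output along these lines. This is a sketch only: `AcceptanceCheck` is a hypothetical helper, the nearest-rank percentile definition is an assumption (monitoring tools may interpolate differently), and the 24h no-OOM check on logs is not covered.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AcceptanceCheck
{
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    public static double Percentile(IReadOnlyList<double> samples, double p)
    {
        if (samples.Count == 0)
        {
            throw new ArgumentException("no samples", nameof(samples));
        }

        var sorted = samples.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p * sorted.Length / 100.0);
        return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
    }

    // True when the 5xx rate is below 0.1% and P95 memory is below 80%.
    public static bool Passes(
        long totalRequests,
        long serverErrors,
        IReadOnlyList<double> memoryPercentSamples)
    {
        if (totalRequests <= 0)
        {
            return false; // a run with no traffic proves nothing
        }

        double errorRate = (double)serverErrors / totalRequests;
        return errorRate < 0.001
            && Percentile(memoryPercentSamples, 95) < 80.0;
    }
}
```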

Note: If assignment to specific owners is required, please route to repo maintainers. Including owners familiar with ListingEndpoints.cs is recommended.

This issue was created by srea-octopets-007--b51b793d
Tracked by the SRE agent here