Durable Functions Power: Add GetDurableExecutionHistory as primary diagnostic and document child context error types #153

@singledigit

Description

The Durable Functions Power's troubleshooting guidance directs users to CloudWatch logs as the primary diagnostic tool. However, CloudWatch logs only show what happens during Lambda invocations — they don't show what happens between invocations (timeouts firing, callbacks arriving, context state transitions). This gap led to a multi-hour wild goose chase debugging a straightforward error handling issue.

What Happened

We had a waitForCallback with a 15-second timeout inside runInChildContext. When no callback arrived, the player got stuck — no timeout prompt, no auto-skip.

CloudWatch logs showed:

  • Question sent ✅
  • Lambda suspended ✅
  • One more invocation 15s later that replayed and suspended again
  • No "TIMEOUT FIRED" log message
  • No further invocations

Our conclusion from logs alone: "The timeout doesn't fire inside runInChildContext. This is a runtime bug."

We were wrong. The GetDurableExecutionHistory API told a completely different story:

Event 37: CallbackStarted (Timeout: 15)     ← timeout registered
Event 40: InvocationCompleted                ← Lambda suspends
Event 41: CallbackTimedOut                   ← TIMEOUT DID FIRE ✅
Event 42: ContextFailed (WaitForCallback)    ← error propagated
Event 43: ContextFailed (RunInChildContext)  ← child context failed
Event 45: ExecutionFailed                    ← entire execution crashed

The real bug: inside runInChildContext, the timeout surfaces as a ChildContextError wrapping the timeout, not as a CallbackError. Our catch block only checked error instanceof CallbackError, so the check didn't match; the error escaped, failed the child context, and crashed the execution. It was a 5-minute fix once we knew the actual error type.

Suggested Changes to the Power

1. GetDurableExecutionHistory as Primary Diagnostic

The troubleshooting steering should present GetDurableExecutionHistory as the primary diagnostic tool, with CloudWatch logs as the follow-up for details inside a single invocation.

## Primary Diagnostic: GetDurableExecutionHistory

CloudWatch logs only show what happens DURING Lambda invocations.
The execution history shows everything — including events BETWEEN invocations:

- Callback arrivals and timeouts
- Context state transitions  
- The exact event that caused a failure

Always check execution history BEFORE diving into CloudWatch logs.

Example workflow:

# Get the execution ARN from CloudWatch logs (it's in every log line)
# Then query the history:
aws lambda get-durable-execution-history \
  --durable-execution-arn "arn:aws:lambda:...:function:my-fn:version/durable-execution/exec-id/invocation-id"

Key events to look for:

  • CallbackTimedOut — timeout fired but may not have been caught
  • ContextFailed — child context threw an unhandled error
  • ExecutionFailed — entire execution crashed
  • CallbackStarted with Timeout field — confirms timeout was registered
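
The checklist above can be turned into a small triage helper. This is a minimal sketch, assuming the history has already been fetched and flattened into objects with `id`, `type`, and `detail` fields. That shape and the names `HistoryEvent` and `triage` are hypothetical and are not the API's response format; only the event type names come from the history shown earlier in this issue.

```typescript
// Hypothetical, simplified event shape; the real API response may differ.
interface HistoryEvent {
  id: number;
  type: string;
  detail?: string;
}

// Event types from the checklist above that indicate something went wrong.
const FAILURE_TYPES = new Set(["CallbackTimedOut", "ContextFailed", "ExecutionFailed"]);

interface Triage {
  timeoutRegistered: boolean;             // a CallbackStarted event carried a timeout
  firstFailure: HistoryEvent | undefined; // earliest failure-type event
  failures: HistoryEvent[];               // every failure-type event, in id order
}

function triage(events: HistoryEvent[]): Triage {
  // Sort a copy by event id so the earliest failure comes first.
  const ordered = [...events].sort((a, b) => a.id - b.id);
  const failures = ordered.filter((e) => FAILURE_TYPES.has(e.type));
  const timeoutRegistered = ordered.some(
    (e) => e.type === "CallbackStarted" && (e.detail ?? "").includes("Timeout"),
  );
  return { timeoutRegistered, firstFailure: failures[0], failures };
}

// The events from the incident described in this issue.
const incident: HistoryEvent[] = [
  { id: 37, type: "CallbackStarted", detail: "Timeout: 15" },
  { id: 40, type: "InvocationCompleted" },
  { id: 41, type: "CallbackTimedOut" },
  { id: 42, type: "ContextFailed", detail: "WaitForCallback" },
  { id: 43, type: "ContextFailed", detail: "RunInChildContext" },
  { id: 45, type: "ExecutionFailed" },
];

const report = triage(incident);
console.log(report.timeoutRegistered, report.firstFailure?.type);
```

Reading the earliest failure first would have pointed straight at CallbackTimedOut instead of at a suspected runtime bug.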

2. Document Error Types Inside Child Contexts

In replay-model-rules.md or a new error-handling.md:

## Error Types Inside Child Contexts

When catching errors from durable operations inside `runInChildContext`, 
the error type may differ from the top-level context:

// ❌ WRONG — only catches top-level timeouts:
catch (error) {
  if (error instanceof CallbackError) { ... }
}

// ✅ CORRECT — catches timeouts at any nesting level:
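function isCallbackTimeout(error: unknown): boolean {
  // Top-level context: the SDK error type matches directly.
  if (error instanceof CallbackError) return true;
  // Child context: the timeout arrives wrapped (ChildContextError in our case),
  // so fall back to the message. Heuristic only: 'Callback' also matches
  // callback failures that are not timeouts.
  const msg = error instanceof Error ? error.message : '';
  return msg.includes('timed out') || msg.includes('Callback');
}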

catch (error) {
  if (isCallbackTimeout(error)) { ... }
}
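
Message matching is brittle. If the wrapper keeps the original error reachable, for example through the standard `cause` property, walking that chain is a sturdier check. The following is a minimal sketch: the two classes are local stand-ins so the snippet runs on its own, the helper name `wrapsCallbackError` is hypothetical, and whether the SDK's ChildContextError actually exposes `cause` is an assumption that needs verifying against the SDK.

```typescript
// Local stand-ins so the sketch is self-contained; in real code, import the
// SDK's own error classes instead.
class CallbackError extends Error {}

class ChildContextError extends Error {
  constructor(message: string, cause: unknown) {
    super(message);
    // Assumption: the real wrapper exposes the original error as `cause`.
    Object.assign(this, { cause });
  }
}

// Returns true if the error, or anything in its `cause` chain, is a
// CallbackError. maxDepth guards against cyclic cause chains.
function wrapsCallbackError(error: unknown, maxDepth = 10): boolean {
  let current: unknown = error;
  for (let depth = 0; depth < maxDepth && current != null; depth++) {
    if (current instanceof CallbackError) return true;
    current = (current as { cause?: unknown }).cause;
  }
  return false;
}

const timeout = new CallbackError("Callback timed out");
const wrapped = new ChildContextError("child context failed", timeout);

console.log(wrapped instanceof CallbackError); // false: the plain check misses it
console.log(wrapsCallbackError(wrapped));      // true: found via the cause chain
```

Like the helper above, this treats any CallbackError as a match; if the SDK distinguishes timeouts from other callback failures, the check should be narrowed accordingly.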

3. Combined Example Needed

The SDK repo has a child-context example (no timeout) and a timeout example (no child context), but no example combining both. This is the exact pattern that tripped us up. An example showing waitForCallback with a timeout inside runInChildContext would be valuable.

Environment

  • Runtime: Node.js
  • SDK: @aws-lambda/durable-execution-sdk-js
  • Power: aws-lambda-durable-functions-power

Metadata

  • Status: In progress