Description
The Durable Functions Power's troubleshooting guidance directs users to CloudWatch logs as the primary diagnostic tool. However, CloudWatch logs only show what happens during Lambda invocations — they don't show what happens between invocations (timeouts firing, callbacks arriving, context state transitions). This gap led to a multi-hour wild goose chase debugging a straightforward error handling issue.
What Happened
We had a waitForCallback with a 15-second timeout inside runInChildContext. When no callback arrived, the player got stuck — no timeout prompt, no auto-skip.
CloudWatch logs showed:
- Question sent ✅
- Lambda suspended ✅
- One more invocation 15s later that replayed and suspended again
- No "TIMEOUT FIRED" log message
- No further invocations
Our conclusion from logs alone: "The timeout doesn't fire inside runInChildContext. This is a runtime bug."
We were wrong. The GetDurableExecutionHistory API told a completely different story:
Event 37: CallbackStarted (Timeout: 15) ← timeout registered
Event 40: InvocationCompleted ← Lambda suspends
Event 41: CallbackTimedOut ← TIMEOUT DID FIRE ✅
Event 42: ContextFailed (WaitForCallback) ← error propagated
Event 43: ContextFailed (RunInChildContext) ← child context failed
Event 45: ExecutionFailed ← entire execution crashed
The real bug: Inside runInChildContext, the timeout throws a ChildContextError wrapping the timeout, not a CallbackError. Our catch (error instanceof CallbackError) didn't match, so the error escaped, killed the child context, and crashed the execution. A 5-minute fix once we knew the actual error type.
Suggested Changes to the Power
1. GetDurableExecutionHistory as Primary Diagnostic
The troubleshooting steering should present GetDurableExecutionHistory as the primary diagnostic tool, ahead of CloudWatch logs.
## Primary Diagnostic: GetDurableExecutionHistory
CloudWatch logs only show what happens DURING Lambda invocations.
The execution history shows everything — including events BETWEEN invocations:
- Callback arrivals and timeouts
- Context state transitions
- The exact event that caused a failure
Always check execution history BEFORE diving into CloudWatch logs.
Example workflow:
# Get the execution ARN from CloudWatch logs (it's in every log line)
# Then query the history:
aws lambda get-durable-execution-history \
--durable-execution-arn "arn:aws:lambda:...:function:my-fn:version/durable-execution/exec-id/invocation-id"
Key events to look for:
- CallbackTimedOut: timeout fired but may not have been caught
- ContextFailed: child context threw an unhandled error
- ExecutionFailed: entire execution crashed
- CallbackStarted with Timeout field: confirms timeout was registered
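As a rough illustration of scanning a history for the key events listed above, here is a minimal TypeScript sketch. The `HistoryEvent` shape (`id`, `type`) and the `findFailureEvents` helper are hypothetical and are not taken from the real API response schema; the sample data is the event sequence from our incident.

```typescript
// Hypothetical event shape: these field names are assumptions for illustration,
// not the actual GetDurableExecutionHistory response schema.
interface HistoryEvent {
  id: number;
  type: string;
}

// Event types that indicate something went wrong (from the list above).
const FAILURE_TYPES = new Set(["CallbackTimedOut", "ContextFailed", "ExecutionFailed"]);

// Returns failure-related events ordered by id, so the earliest one
// (the most likely root cause) comes first.
function findFailureEvents(events: HistoryEvent[]): HistoryEvent[] {
  return [...events]
    .sort((a, b) => a.id - b.id)
    .filter((e) => FAILURE_TYPES.has(e.type));
}

// The events observed in the incident described in this issue.
const sampleEvents: HistoryEvent[] = [
  { id: 37, type: "CallbackStarted" },
  { id: 40, type: "InvocationCompleted" },
  { id: 41, type: "CallbackTimedOut" },
  { id: 42, type: "ContextFailed" },
  { id: 43, type: "ContextFailed" },
  { id: 45, type: "ExecutionFailed" },
];

const failures = findFailureEvents(sampleEvents);
console.log(failures.map((e) => `${e.id}: ${e.type}`).join("\n"));
```

Reading the first entry of the result would have pointed us at the timeout immediately, instead of at the Lambda runtime.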
2. Document Error Types Inside Child Contexts
In replay-model-rules.md or a new error-handling.md:
## Error Types Inside Child Contexts
When catching errors from durable operations inside `runInChildContext`,
the error type may differ from the top-level context:
// ❌ WRONG: only catches top-level timeouts
catch (error) {
  if (error instanceof CallbackError) { ... }
}
// ✅ CORRECT: catches timeouts at any nesting level
// (the message check is a heuristic and also matches other callback errors)
function isCallbackTimeout(error: unknown): boolean {
  if (error instanceof CallbackError) return true;
  const msg = (error as any)?.message ?? '';
  return msg.includes('timed out') || msg.includes('Callback');
}
catch (error) {
  if (isCallbackTimeout(error)) { ... }
}
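String matching on error messages depends on the exact wording the SDK uses. If the wrapper exposes the original error on a `cause` property (an assumption, we have not confirmed how ChildContextError stores the wrapped error), a type-based check could walk the chain instead. The sketch below uses stand-in classes so it runs on its own; `findInCauseChain` is a hypothetical helper, not part of the SDK.

```typescript
// Stand-in classes so this sketch is self-contained. The real SDK classes,
// and whether the wrapper uses a `cause` property, are assumptions.
class CallbackError extends Error {}

class ChildContextError extends Error {
  constructor(message: string, cause?: unknown) {
    super(message);
    // Attach the wrapped error under `cause` (assumed property name).
    Object.assign(this, { cause });
  }
}

// Walks error -> error.cause -> ... and returns the first instance of `ctor`.
// maxDepth guards against cyclic cause chains.
function findInCauseChain<T>(
  error: unknown,
  ctor: new (...args: any[]) => T,
  maxDepth = 10,
): T | undefined {
  let current: unknown = error;
  for (let depth = 0; depth < maxDepth && current != null; depth++) {
    if (current instanceof ctor) return current;
    current = (current as { cause?: unknown }).cause;
  }
  return undefined;
}

// Simulate what we saw: a timeout wrapped by the child context.
const timeout = new CallbackError("callback timed out");
const wrapped = new ChildContextError("child context failed", timeout);

console.log(wrapped instanceof CallbackError);
console.log(findInCauseChain(wrapped, CallbackError) === timeout);
```

The direct `instanceof` check misses the wrapped error, which is the failure mode described above, while the chain walk finds it without relying on message text.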
3. Combined Example Needed
The SDK repo has a child-context example (no timeout) and a timeout example (no child context), but no example combining both. This is the exact pattern that tripped us up. An example of waitForCallback with a timeout inside runInChildContext would be valuable.
Environment
- Runtime: Node.js
- SDK: @aws-lambda/durable-execution-sdk-js
- Power: aws-lambda-durable-functions-power