2 changes: 1 addition & 1 deletion aws-lambda-durable-functions-power/POWER.md
@@ -87,7 +87,7 @@ Load the appropriate reference file based on what the user is working on:
- **Testing**, **local testing**, **cloud testing**, **test runner**, or **flaky tests** -> see [testing-patterns.md](steering/testing-patterns.md)
- **Deployment**, **CloudFormation**, **CDK**, **SAM**, **log groups**, **deploy**, or **infrastructure** -> see [deployment-iac.md](steering/deployment-iac.md)
- **Advanced patterns**, **GenAI agents**, **completion policies**, **step semantics**, or **custom serialization** -> see [advanced-patterns.md](steering/advanced-patterns.md)
- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, or **execution history** -> see [troubleshooting-executions.md](steering/troubleshooting-executions.md)
- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, **execution history**, **execution error**, **why did my execution fail**, **execution timed out**, **callback not received**, **diagnose execution**, or **root cause execution** -> see [troubleshooting-executions.md](steering/troubleshooting-executions.md)

## Quick Reference

@@ -19,8 +19,8 @@ When spawning the troubleshooting agent, provide:

```
Diagnose durable function execution issue:
- Function: <function-name>:<alias> (must be qualified ARN)
- Execution ID: <execution-id>
- Durable Execution ARN: <durable-execution-arn>
- Region: <region> (infer from ARN)

CRITICAL SAFETY RULES:
- This is READ-ONLY diagnosis
@@ -29,31 +29,115 @@ CRITICAL SAFETY RULES:
- Only suggest manual remediation if user explicitly requests it

Steps:
1. Run: aws lambda get-durable-execution-history --function-name <function> --execution-id <id>
2. Analyze execution status (RUNNING/SUCCEEDED/FAILED/TIMED_OUT)
3. Check for stuck operations (PENDING/RUNNING status)
4. Identify failed operations and error messages
5. Calculate operation durations and timeline
6. Diagnose specific issue:
- Stuck in WAIT_FOR_CALLBACK: Extract callback ID, suggest manual callback
- Failed operations: Show error and retry attempts
- Timeout: Calculate total duration, identify slow operations
- Unexpected behavior: Compare operation order with expected flow
7. Provide specific recommendations and next steps

Use jq for JSON parsing and analysis.
0. If the user provides a function name + alias (e.g., my-function:prod) instead of a full ARN:
- Resolve the alias to a version: aws lambda get-alias --function-name <functionName> --name <alias> --region <region> --query 'FunctionVersion' --output text
- List executions for that function: aws lambda list-durable-executions-by-function --function-name <functionName>:<version> --region <region>
- Ask the user to identify the execution, or use the most recent one.

1. Fetch the execution history directly:
Run: aws lambda get-durable-execution-history --durable-execution-arn <durable-execution-arn> --region <region> --include-execution-data

2. If the command succeeds, analyze and provide a user-friendly diagnosis:
a. Report the execution status (RUNNING/SUCCEEDED/FAILED/STOPPED/TIMED_OUT)
b. Identify the root cause:
- Failed operations: Show the EXACT error message verbatim in a code block
- Stuck in WAIT_FOR_CALLBACK: Extract callback ID, show how long it's been waiting
- Timeout: Show which operation was running when timeout occurred
- Unexpected behavior: Compare operation order with expected flow
c. Calculate operation durations and timeline
d. Provide a clear, plain-language explanation of what went wrong and why

3. If the command fails:
- Execution not found: Tell the user the execution ARN may be incorrect or the execution may have been purged. Ask them to verify the ARN.
- Permissions/network error: Tell the user to check that their caller identity has lambda:GetDurableExecutionHistory on the function ARN.
- In either case, direct them to the console as a fallback (see step 4)

4. ALWAYS provide a direct link to the Execution Details page in the Lambda console.
Parse the ARN (arn:<partition>:lambda:<region>:<accountId>:function:<functionName>:<functionVersion>/durable-execution/<executionName>/<invocationId>)
to extract region, functionName, functionVersion, executionName, and invocationId, then construct:
https://<region>.console.aws.amazon.com/lambda/home?region=<region>#/functions/<functionName>/versions/<functionVersion>/executions/<executionName>/<invocationId>

Frame it as: "**[View this execution in the console](<url>)**"

5. Provide specific, actionable next steps based on the diagnosis.

6. If unable to determine the root cause from execution history:
- Provide the console link (step 4)
- Offer to fetch the log group and pull relevant logs:
a. Get the log group:
aws lambda get-function-configuration --function-name <functionName>:<functionVersion> --region <region> --query 'LoggingConfig.LogGroup'
b. Query logs filtered by invocation ID (parsed from the ARN):
aws logs filter-log-events --log-group-name <logGroup> --region <region> --filter-pattern '"<invocationId>"'
c. If the function uses SDK structured logging (context.logger), query for step-level logs.
Use execution start time from step 2 as start-time, and execution end time (or current time if still running) as end-time:
QUERY_ID=$(aws logs start-query --log-group-name <logGroup> --region <region> \
--start-time <startEpoch> --end-time <endEpoch> \
--query-string 'fields @timestamp, @message | filter executionName = "<executionName>" | sort @timestamp asc' \
--query 'queryId' --output text)
Then poll for results:
aws logs get-query-results --query-id $QUERY_ID --region <region>
- If unable to determine the log group, ask the user to check the function's logs

OUTPUT FORMAT:
- Lead with a short summary: "Your execution [status] because [reason]."
- Show the error message verbatim in a code block if one exists.
- Explain what this means in plain language.
- Provide the console link for visual debugging.
- List recommended next steps.

Use jq for JSON parsing if available, otherwise use --query (AWS CLI built-in JMESPath) or python -m json.tool.
```
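The ARN parsing and console link construction described in step 4 can be sketched in Python as follows. This is a minimal illustration, not part of the steering file: the names `DurableExecutionArn`, `parse_durable_execution_arn`, and `console_url` are hypothetical, the ARN layout and URL template are copied from step 4, and the account ID in the usage note is a placeholder. The sketch uses the `console.aws.amazon.com` domain from the template as-is, so other partitions may need a different host.

```python
import re
from typing import NamedTuple


class DurableExecutionArn(NamedTuple):
    """Fields of a durable execution ARN, per the layout given in step 4."""
    partition: str
    region: str
    account_id: str
    function_name: str
    function_version: str
    execution_name: str
    invocation_id: str


# Layout assumed from step 4:
# arn:<partition>:lambda:<region>:<accountId>:function:<functionName>:<functionVersion>
#   /durable-execution/<executionName>/<invocationId>
_ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>[^:]+):lambda:(?P<region>[^:]+):(?P<account_id>[^:]+)"
    r":function:(?P<function_name>[^:/]+):(?P<function_version>[^:/]+)"
    r"/durable-execution/(?P<execution_name>[^/]+)/(?P<invocation_id>[^/]+)$"
)


def parse_durable_execution_arn(arn: str) -> DurableExecutionArn:
    """Split a durable execution ARN into its parts, or raise ValueError."""
    match = _ARN_PATTERN.match(arn.strip())
    if match is None:
        raise ValueError(f"Not a durable execution ARN: {arn!r}")
    return DurableExecutionArn(**match.groupdict())


def console_url(arn: str) -> str:
    """Build the Execution Details console link using the template from step 4."""
    p = parse_durable_execution_arn(arn)
    return (
        f"https://{p.region}.console.aws.amazon.com/lambda/home"
        f"?region={p.region}#/functions/{p.function_name}"
        f"/versions/{p.function_version}"
        f"/executions/{p.execution_name}/{p.invocation_id}"
    )
```

For example, `console_url("arn:aws:lambda:us-west-2:123456789012:function:my-function:3/durable-execution/abc-123/inv-456")` yields the same link shown in the first example under Example Usage. Rejecting malformed input with `ValueError` keeps the agent from emitting a broken link when the user pastes a function ARN instead of an execution ARN.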

## Example Usage

```
User: "My durable function execution abc-123 is stuck on my-function:prod"
User: "My durable function execution abc-123 is stuck on my-function:prod in us-west-2"

Claude: [Spawns Task agent with troubleshooting instructions]
Agent: [Runs get-durable-execution-history command]
Agent: [Analyzes with jq queries]
Agent: [Returns: "Execution stuck in WAIT_FOR_CALLBACK operation 'wait-for-approval'.
Callback ID: xyz789. Waiting since 2026-02-14. Timeout in 12 hours.
Recommendation: Check if approval email was sent, or manually send callback."]
Claude: [Presents findings and offers to send manual callback if needed]

Claude: Your execution is **waiting for an external callback** that hasn't arrived yet.

The operation `wait-for-approval` has been pending since 2026-02-14 (3 days).
It will time out in 12 hours if no callback is received.

Callback ID: `xyz789`

**[View this execution in the console](https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456)**
```
https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456
```

**Next steps:**
1. Check if the approval email was sent to the recipient
2. Verify the callback ID `xyz789` was included in the email
3. If you'd like me to help send the callback manually, let me know
```

```
User: "Execution def-456 failed on order-processor:live in us-east-1"

Agent: [Runs get-durable-execution-history command]
Agent: [Analyzes with jq queries]

Claude: Your execution **failed** at the `charge-payment` step after 3 retry attempts.

Error message:
```
PaymentGatewayError: Card declined - insufficient funds (code: card_declined)
```

This means the payment provider rejected the charge. The durable function
retried 3 times with exponential backoff but the error persisted because
it's a permanent failure (not a transient network issue).

**[View this execution in the console](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789)**
```
https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789
```

**Next steps:**
1. This is a business logic failure, not an infrastructure issue
2. Consider adding a non-retryable error classification for `card_declined`
3. Implement a compensation step to release the reserved inventory
```
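The OUTPUT FORMAT rules and the two example responses above follow a fixed shape: summary line, optional verbatim error, plain-language explanation, console link, then numbered next steps. A minimal sketch of assembling that shape is shown below. The function name `format_diagnosis` and its parameters are hypothetical and only illustrate the ordering; the steering file does not prescribe any implementation.

```python
from typing import Optional, Sequence

# Built at runtime so the example can be embedded in Markdown without
# containing a literal code fence.
FENCE = "`" * 3


def format_diagnosis(
    status: str,
    reason: str,
    explanation: str,
    console_url: str,
    next_steps: Sequence[str],
    error_message: Optional[str] = None,
) -> str:
    """Assemble a diagnosis in the order given by the OUTPUT FORMAT section."""
    # 1. Lead with a short summary.
    lines = [f"Your execution {status} because {reason}.", ""]

    # 2. Show the error message verbatim in a code block, if one exists.
    if error_message:
        lines += ["Error message:", FENCE, error_message, FENCE, ""]

    # 3. Plain-language explanation, then 4. the console link.
    lines += [
        explanation,
        "",
        f"**[View this execution in the console]({console_url})**",
        "",
        "**Next steps:**",
    ]

    # 5. Numbered list of recommended next steps.
    lines += [f"{i}. {step}" for i, step in enumerate(next_steps, start=1)]
    return "\n".join(lines)
```

Passing the error text through unchanged, rather than summarizing it, matches the instruction to show the exact message verbatim.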