2 changes: 1 addition & 1 deletion aws-lambda-durable-functions-power/POWER.md
@@ -87,7 +87,7 @@ Load the appropriate reference file based on what the user is working on:
- **Testing**, **local testing**, **cloud testing**, **test runner**, or **flaky tests** -> see [testing-patterns.md](steering/testing-patterns.md)
- **Deployment**, **CloudFormation**, **CDK**, **SAM**, **log groups**, **deploy**, or **infrastructure** -> see [deployment-iac.md](steering/deployment-iac.md)
- **Advanced patterns**, **GenAI agents**, **completion policies**, **step semantics**, or **custom serialization** -> see [advanced-patterns.md](steering/advanced-patterns.md)
- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, or **execution history** -> see [troubleshooting-executions.md](steering/troubleshooting-executions.md)
- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, **execution history**, **execution error**, **why did my execution fail**, **execution timed out**, **callback not received**, **diagnose execution**, or **root cause execution** -> see [troubleshooting-executions.md](steering/troubleshooting-executions.md)

## Quick Reference

@@ -19,8 +19,8 @@ When spawning the troubleshooting agent, provide:

```
Diagnose durable function execution issue:
- Function: <function-name>:<alias> (must be qualified ARN)
- Execution ID: <execution-id>
- Durable Execution ARN: <durable-execution-arn>
- Region: <region> (infer from ARN)

CRITICAL SAFETY RULES:
- This is READ-ONLY diagnosis
@@ -29,31 +29,99 @@ CRITICAL SAFETY RULES:
- Only suggest manual remediation if user explicitly requests it

Steps:
1. Run: aws lambda get-durable-execution-history --function-name <function> --execution-id <id>
2. Analyze execution status (RUNNING/SUCCEEDED/FAILED/TIMED_OUT)
3. Check for stuck operations (PENDING/RUNNING status)
4. Identify failed operations and error messages
5. Calculate operation durations and timeline
6. Diagnose specific issue:
- Stuck in WAIT_FOR_CALLBACK: Extract callback ID, suggest manual callback
- Failed operations: Show error and retry attempts
- Timeout: Calculate total duration, identify slow operations
- Unexpected behavior: Compare operation order with expected flow
7. Provide specific recommendations and next steps

Use jq for JSON parsing and analysis.
1. Fetch the execution history directly:
Run: aws lambda get-durable-execution-history --durable-execution-arn <durable-execution-arn> --region <region> --include-execution-data

2. If the command succeeds, analyze and provide a user-friendly diagnosis:
a. Report the execution status (RUNNING/SUCCEEDED/FAILED/TIMED_OUT)
b. Identify the root cause:
- Failed operations: Show the EXACT error message verbatim in a code block
- Stuck in WAIT_FOR_CALLBACK: Extract callback ID, show how long it's been waiting
- Timeout: Show which operation was running when timeout occurred
- Unexpected behavior: Compare operation order with expected flow
c. Calculate operation durations and timeline
d. Provide a clear, plain-language explanation of what went wrong and why

3. If the command fails:
- Execution not found: Tell the user the execution ID may be incorrect or the execution may have been purged. Ask them to verify the ARN.
- Permissions/network error: Suggest checking IAM permissions (lambda:GetDurableExecutionHistory)
Contributor

suggestion:

Permissions/network error: check that your caller identity has lambda:GetDurableExecutionHistory on the function ARN. This is not part of AWSLambdaBasicDurableExecutionRolePolicy.

Member

This is a good point. General note: do we clarify the permission model anywhere in the power? It might be useful to give some context about the permissions you need within the Lambda loop (e.g. checkpoint and get state, which are included in AWSLambdaBasicDurableExecutionRolePolicy) and the permissions you need outside of the loop (like this one).

Contributor Author

I'll add "Permissions/network error: check that your caller identity has lambda:GetDurableExecutionHistory on the function ARN."

I don't think we need to mention "AWSLambdaBasicDurableExecutionRolePolicy", since the user could be signed in with any role.

- In either case, direct them to the console as a fallback (see step 4)

4. ALWAYS provide a direct link to the Execution Details page in the Lambda console.
Parse the ARN (arn:<partition>:lambda:<region>:<accountId>:function:<functionName>:<functionVersion>/durable-execution/<executionName>/<invocationId>)
to extract region, functionName, functionVersion, executionName, and invocationId, then construct:
https://<region>.console.aws.amazon.com/lambda/home?region=<region>#/functions/<functionName>/versions/<functionVersion>/executions/<executionName>/<invocationId>

Frame it as: "**[View this execution in the console](<url>)**"

5. Provide specific, actionable next steps based on the diagnosis.

6. If unable to determine the root cause from execution history:
- Provide the console link (step 4)
- Offer to fetch the log group from the function configuration (function name and version are in the ARN):
aws lambda get-function-configuration --function-name <functionName>:<functionVersion> --region <region> --query 'LoggingConfig.LogGroup'
- If unable to determine the log group, ask the user to check the function's logs

OUTPUT FORMAT:
- Lead with a short summary: "Your execution [status] because [reason]."
- Show the error message verbatim in a code block if one exists.
- Explain what this means in plain language.
- Provide the console link for visual debugging.
- List recommended next steps.

Use jq for JSON parsing if available, otherwise use --query (AWS CLI built-in JMESPath) or python -m json.tool.
```
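
Step 4 of the prompt above describes parsing the durable execution ARN and building a console link. A minimal Python sketch of that step follows, assuming the ARN layout and URL template are exactly as written in the prompt. The names `DurableExecutionArn`, `parse_durable_execution_arn`, and `console_url` are illustrative and not part of the power, and the sketch does not handle partitions whose console lives on a different domain.

```python
import re
from typing import NamedTuple


class DurableExecutionArn(NamedTuple):
    """Fields of a durable execution ARN, as laid out in step 4 of the prompt."""
    partition: str
    region: str
    account_id: str
    function_name: str
    function_version: str
    execution_name: str
    invocation_id: str


# arn:<partition>:lambda:<region>:<accountId>:function:<functionName>:<functionVersion>
#   /durable-execution/<executionName>/<invocationId>
_ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>[^:]+):lambda:(?P<region>[^:]+):(?P<account_id>[^:]+)"
    r":function:(?P<function_name>[^:]+):(?P<function_version>[^/]+)"
    r"/durable-execution/(?P<execution_name>[^/]+)/(?P<invocation_id>[^/]+)$"
)


def parse_durable_execution_arn(arn: str) -> DurableExecutionArn:
    """Split the ARN into its named parts, or raise ValueError if it does not match."""
    match = _ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"not a durable execution ARN: {arn!r}")
    return DurableExecutionArn(**match.groupdict())


def console_url(arn: str) -> str:
    """Fill the console URL template from step 4 with the parsed ARN fields."""
    p = parse_durable_execution_arn(arn)
    return (
        f"https://{p.region}.console.aws.amazon.com/lambda/home"
        f"?region={p.region}#/functions/{p.function_name}"
        f"/versions/{p.function_version}"
        f"/executions/{p.execution_name}/{p.invocation_id}"
    )
```

Rejecting a malformed ARN up front lets the agent ask the user to verify the ARN instead of emitting a broken link.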

## Example Usage

```
User: "My durable function execution abc-123 is stuck on my-function:prod"
User: "My durable function execution abc-123 is stuck on my-function:prod in us-west-2"

Agent: [Runs get-durable-execution-history command]
Agent: [Analyzes with jq queries]

Claude: Your execution is **waiting for an external callback** that hasn't arrived yet.

The operation `wait-for-approval` has been pending since 2026-02-14 (3 days).
It will time out in 12 hours if no callback is received.

Callback ID: `xyz789`

**[View this execution in the console](https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456)**
```
https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456
```

**Next steps:**
1. Check if the approval email was sent to the recipient
2. Verify the callback ID `xyz789` was included in the email
3. If you'd like me to help send the callback manually, let me know
```
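
The "pending since 2026-02-14 (3 days)" line in the example above comes from comparing an operation's start time with the current time. The sketch below illustrates that calculation only. The record shape it uses (`name`, `status`, `started_at` keys) is hypothetical and is not the actual response format of `get-durable-execution-history`; the real fields would need to be mapped onto it first.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def find_stuck_operations(
    operations: List[Dict[str, str]], now: datetime
) -> List[Tuple[str, timedelta]]:
    """Return (name, time waiting) for operations still PENDING or RUNNING.

    NOTE: the keys "name", "status", and "started_at" are a hypothetical,
    simplified shape chosen for illustration, not the real API schema.
    """
    stuck = []
    for op in operations:
        if op["status"] in ("PENDING", "RUNNING"):
            # started_at is assumed to be ISO 8601 with an explicit UTC offset.
            started = datetime.fromisoformat(op["started_at"])
            stuck.append((op["name"], now - started))
    # Longest-waiting operation first, since it is the likeliest culprit.
    return sorted(stuck, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    ops = [
        {"name": "send-email", "status": "SUCCEEDED",
         "started_at": "2026-02-14T00:00:00+00:00"},
        {"name": "wait-for-approval", "status": "PENDING",
         "started_at": "2026-02-14T00:00:00+00:00"},
    ]
    now = datetime(2026, 2, 17, tzinfo=timezone.utc)
    for name, waiting in find_stuck_operations(ops, now):
        print(f"{name} has been waiting for {waiting.days} days")
```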

```
User: "Execution def-456 failed on order-processor:live in us-east-1"

Claude: [Spawns Task agent with troubleshooting instructions]
Agent: [Runs get-durable-execution-history command]
Agent: [Analyzes with jq queries]
Agent: [Returns: "Execution stuck in WAIT_FOR_CALLBACK operation 'wait-for-approval'.
Callback ID: xyz789. Waiting since 2026-02-14. Timeout in 12 hours.
Recommendation: Check if approval email was sent, or manually send callback."]
Claude: [Presents findings and offers to send manual callback if needed]

Claude: Your execution **failed** at the `charge-payment` step after 3 retry attempts.

Error message:
```
PaymentGatewayError: Card declined - insufficient funds (code: card_declined)
```

This means the payment provider rejected the charge. The durable function
retried 3 times with exponential backoff but the error persisted because
it's a permanent failure (not a transient network issue).

**[View this execution in the console](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789)**
```
https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789
```

**Next steps:**
1. This is a business logic failure, not an infrastructure issue
2. Consider adding a non-retryable error classification for `card_declined`
3. Implement a compensation step to release the reserved inventory
```
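
Both examples begin with the agent running `get-durable-execution-history`. As a rough sketch of that fetch step, the Python below builds the command from the prompt template and infers the region from the ARN, as the template asks. The helper names are illustrative. The `--output json` flag is the AWS CLI's standard global output option, added here so the result can be parsed. `fetch_history` assumes the AWS CLI is installed and credentials are configured.

```python
import json
import subprocess
from typing import Any, List


def region_from_arn(arn: str) -> str:
    """Return the region, which is the fourth colon-separated field of an ARN."""
    parts = arn.split(":")
    if len(parts) < 6 or parts[0] != "arn" or not parts[3]:
        raise ValueError(f"cannot infer region from ARN: {arn!r}")
    return parts[3]


def history_command(arn: str) -> List[str]:
    """Build the argv for the read-only history call from step 1 of the prompt."""
    return [
        "aws", "lambda", "get-durable-execution-history",
        "--durable-execution-arn", arn,
        "--region", region_from_arn(arn),
        "--include-execution-data",
        "--output", "json",
    ]


def fetch_history(arn: str) -> Any:
    """Run the command and parse its JSON output.

    On failure, raise with the CLI's stderr so the caller can report the
    error and fall back to the console link.
    """
    result = subprocess.run(
        history_command(arn), capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return json.loads(result.stdout)
```

Passing the arguments as a list rather than a shell string avoids quoting problems if the ARN contains characters such as `$` (for example a `$LATEST` version).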