Improve troubleshooting execution steering guide #161
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -19,8 +19,8 @@ When spawning the troubleshooting agent, provide: | |
|
|
||
| ``` | ||
| Diagnose durable function execution issue: | ||
| - Function: <function-name>:<alias> (must be qualified ARN) | ||
| - Execution ID: <execution-id> | ||
| - Durable Execution ARN: <durable-execution-arn> | ||
| - Region: <region> (infer from ARN) | ||
|
|
||
| CRITICAL SAFETY RULES: | ||
| - This is READ-ONLY diagnosis | ||
|
|
@@ -29,31 +29,99 @@ CRITICAL SAFETY RULES: | |
| - Only suggest manual remediation if user explicitly requests it | ||
|
|
||
| Steps: | ||
| 1. Run: aws lambda get-durable-execution-history --function-name <function> --execution-id <id> | ||
| 2. Analyze execution status (RUNNING/SUCCEEDED/FAILED/TIMED_OUT) | ||
| 3. Check for stuck operations (PENDING/RUNNING status) | ||
| 4. Identify failed operations and error messages | ||
| 5. Calculate operation durations and timeline | ||
| 6. Diagnose specific issue: | ||
| - Stuck in WAIT_FOR_CALLBACK: Extract callback ID, suggest manual callback | ||
| - Failed operations: Show error and retry attempts | ||
| - Timeout: Calculate total duration, identify slow operations | ||
| - Unexpected behavior: Compare operation order with expected flow | ||
| 7. Provide specific recommendations and next steps | ||
|
|
||
| Use jq for JSON parsing and analysis. | ||
| 1. Fetch the execution history directly: | ||
| Run: aws lambda get-durable-execution-history --durable-execution-arn <durable-execution-arn> --region <region> --include-execution-data | ||
|
|
||
| 2. If the command succeeds, analyze and provide a user-friendly diagnosis: | ||
| a. Report the execution status (RUNNING/SUCCEEDED/FAILED/TIMED_OUT) | ||
| b. Identify the root cause: | ||
| - Failed operations: Show the EXACT error message verbatim in a code block | ||
| - Stuck in WAIT_FOR_CALLBACK: Extract callback ID, show how long it's been waiting | ||
| - Timeout: Show which operation was running when timeout occurred | ||
| - Unexpected behavior: Compare operation order with expected flow | ||
| c. Calculate operation durations and timeline | ||
| d. Provide a clear, plain-language explanation of what went wrong and why | ||
|
|
||
| 3. If the command fails: | ||
| - Execution not found: Tell the user the execution ID may be incorrect or the execution may have been purged. Ask them to verify the ARN. | ||
| - Permissions/network error: Suggest checking IAM permissions (lambda:GetDurableExecutionHistory) | ||
|
Contributor
suggestion:
Member
This is a good point. General note: do we clarify the permission model anywhere in the power? It might be useful to add some context about the permissions needed within the Lambda loop (e.g. checkpoint and get state, which are included in AWSLambdaBasicDurableExecutionRolePolicy) and the permissions needed outside the loop (like this one).
Contributor
Author
I'll add "Permissions/network error: check that your caller identity has lambda:GetDurableExecutionHistory on the function ARN." I don't think we need to mention "AWSLambdaBasicDurableExecutionRolePolicy", since the user could be signed in with any role.
||
| - In either case, direct them to the console as a fallback (see step 4) | ||
|
|
||
| 4. ALWAYS provide a direct link to the Execution Details page in the Lambda console. | ||
| Parse the ARN (arn:<partition>:lambda:<region>:<accountId>:function:<functionName>:<functionVersion>/durable-execution/<executionName>/<invocationId>) | ||
| to extract region, functionName, functionVersion, executionName, and invocationId, then construct: | ||
| https://<region>.console.aws.amazon.com/lambda/home?region=<region>#/functions/<functionName>/versions/<functionVersion>/executions/<executionName>/<invocationId> | ||
|
|
||
| Frame it as: "**[View this execution in the console](<url>)**" | ||
|
|
||
| 5. Provide specific, actionable next steps based on the diagnosis. | ||
|
|
||
| 6. If unable to determine the root cause from execution history: | ||
| - Provide the console link (step 4) | ||
| - Offer to fetch the log group from the function configuration (function name and version are in the ARN): | ||
|
vishalsatam marked this conversation as resolved.
Outdated
|
||
| aws lambda get-function-configuration --function-name <functionName>:<functionVersion> --region <region> --query 'LoggingConfig.LogGroup' | ||
| - If unable to determine the log group, ask the user to check the function's logs | ||
|
|
||
| OUTPUT FORMAT: | ||
| - Lead with a short summary: "Your execution [status] because [reason]." | ||
| - Show the error message verbatim in a code block if one exists. | ||
| - Explain what this means in plain language. | ||
| - Provide the console link for visual debugging. | ||
| - List recommended next steps. | ||
|
vishalsatam marked this conversation as resolved.
|
||
|
|
||
| Use jq for JSON parsing if available, otherwise use --query (AWS CLI built-in JMESPath) or python -m json.tool. | ||
| ``` | ||
|
|
||
| ## Example Usage | ||
|
|
||
| ``` | ||
| User: "My durable function execution abc-123 is stuck on my-function:prod" | ||
| User: "My durable function execution abc-123 is stuck on my-function:prod in us-west-2" | ||
|
vishalsatam marked this conversation as resolved.
|
||
|
|
||
| Agent: [Runs get-durable-execution-history command] | ||
| Agent: [Analyzes with jq queries] | ||
|
|
||
| Claude: Your execution is **waiting for an external callback** that hasn't arrived yet. | ||
|
|
||
| The operation `wait-for-approval` has been pending since 2026-02-14 (3 days). | ||
| It will time out in 12 hours if no callback is received. | ||
|
|
||
| Callback ID: `xyz789` | ||
|
|
||
| **[View this execution in the console](https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456)** | ||
| ``` | ||
| https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456 | ||
| ``` | ||
|
vishalsatam marked this conversation as resolved.
|
||
|
|
||
| **Next steps:** | ||
| 1. Check if the approval email was sent to the recipient | ||
| 2. Verify the callback ID `xyz789` was included in the email | ||
| 3. If you'd like me to help send the callback manually, let me know | ||
| ``` | ||
|
|
||
| ``` | ||
| User: "Execution def-456 failed on order-processor:live in us-east-1" | ||
|
|
||
| Claude: [Spawns Task agent with troubleshooting instructions] | ||
| Agent: [Runs get-durable-execution-history command] | ||
| Agent: [Analyzes with jq queries] | ||
| Agent: [Returns: "Execution stuck in WAIT_FOR_CALLBACK operation 'wait-for-approval'. | ||
| Callback ID: xyz789. Waiting since 2026-02-14. Timeout in 12 hours. | ||
| Recommendation: Check if approval email was sent, or manually send callback."] | ||
| Claude: [Presents findings and offers to send manual callback if needed] | ||
|
|
||
| Claude: Your execution **failed** at the `charge-payment` step after 3 retry attempts. | ||
|
|
||
| Error message: | ||
| ``` | ||
| PaymentGatewayError: Card declined - insufficient funds (code: card_declined) | ||
| ``` | ||
|
|
||
| This means the payment provider rejected the charge. The durable function | ||
| retried 3 times with exponential backoff but the error persisted because | ||
| it's a permanent failure (not a transient network issue). | ||
|
|
||
| **[View this execution in the console](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789)** | ||
| ``` | ||
| https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789 | ||
| ``` | ||
|
|
||
| **Next steps:** | ||
| 1. This is a business logic failure, not an infrastructure issue | ||
| 2. Consider adding a non-retryable error classification for `card_declined` | ||
| 3. Implement a compensation step to release the reserved inventory | ||
| ``` | ||
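Step 4 of the revised prompt asks the agent to parse the durable execution ARN and build a console link. A minimal sketch of that parsing is below, assuming the ARN layout and URL template quoted in the diff. The names `ExecutionRef`, `parse_durable_execution_arn`, and `console_url` are hypothetical and are not part of the PR, and the console hostname is taken verbatim from the template, so it may not hold for partitions other than `aws`.

```python
import re
from typing import NamedTuple


class ExecutionRef(NamedTuple):
    """Fields named in step 4 of the prompt (hypothetical container type)."""
    partition: str
    region: str
    account_id: str
    function_name: str
    function_version: str
    execution_name: str
    invocation_id: str


# Pattern mirrors the ARN layout quoted in the diff:
# arn:<partition>:lambda:<region>:<accountId>:function:<functionName>:
#   <functionVersion>/durable-execution/<executionName>/<invocationId>
_ARN_RE = re.compile(
    r"^arn:(?P<partition>[^:]+):lambda:(?P<region>[^:]+):(?P<account_id>[^:]+)"
    r":function:(?P<function_name>[^:]+):(?P<function_version>[^/]+)"
    r"/durable-execution/(?P<execution_name>[^/]+)/(?P<invocation_id>[^/]+)$"
)


def parse_durable_execution_arn(arn: str) -> ExecutionRef:
    """Split a durable execution ARN into its components, or raise ValueError."""
    match = _ARN_RE.match(arn)
    if match is None:
        raise ValueError(f"not a durable execution ARN: {arn!r}")
    return ExecutionRef(**match.groupdict())


def console_url(arn: str) -> str:
    """Build the Execution Details link using the URL template from the diff."""
    ref = parse_durable_execution_arn(arn)
    return (
        f"https://{ref.region}.console.aws.amazon.com/lambda/home"
        f"?region={ref.region}#/functions/{ref.function_name}"
        f"/versions/{ref.function_version}"
        f"/executions/{ref.execution_name}/{ref.invocation_id}"
    )
```

Called with an ARN for `my-function` version `3` in `us-west-2`, execution `abc-123`, invocation `inv-456`, this yields the same link shown in the first example above. Raising on a malformed ARN, rather than guessing at missing parts, lets the agent fall back to asking the user to verify the ARN, as step 3 describes.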