diff --git a/plugins/aws-serverless/skills/aws-lambda-durable-functions/SKILL.md b/plugins/aws-serverless/skills/aws-lambda-durable-functions/SKILL.md
index f552e3b7..346edc0e 100644
--- a/plugins/aws-serverless/skills/aws-lambda-durable-functions/SKILL.md
+++ b/plugins/aws-serverless/skills/aws-lambda-durable-functions/SKILL.md
@@ -104,7 +104,7 @@ Load the appropriate reference file based on what the user is working on:
 - **Testing**, **local testing**, **cloud testing**, **test runner**, or **flaky tests** -> see [testing-patterns.md](references/testing-patterns.md)
 - **Deployment**, **CloudFormation**, **CDK**, **SAM**, **log groups**, **deploy**, or **infrastructure** -> see [deployment-iac.md](references/deployment-iac.md)
 - **Advanced patterns**, **GenAI agents**, **completion policies**, **step semantics**, or **custom serialization** -> see [advanced-patterns.md](references/advanced-patterns.md)
-- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, or **execution history** -> see [troubleshooting-executions.md](references/troubleshooting-executions.md)
+- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, **execution history**, **execution error**, **why did my execution fail**, **execution timed out**, **callback not received**, **diagnose execution**, or **root cause execution** -> see [troubleshooting-executions.md](references/troubleshooting-executions.md)
 
 ## Quick Reference
diff --git a/plugins/aws-serverless/skills/aws-lambda-durable-functions/references/troubleshooting-executions.md b/plugins/aws-serverless/skills/aws-lambda-durable-functions/references/troubleshooting-executions.md
index 66565dbd..3353a5e0 100644
--- a/plugins/aws-serverless/skills/aws-lambda-durable-functions/references/troubleshooting-executions.md
+++ b/plugins/aws-serverless/skills/aws-lambda-durable-functions/references/troubleshooting-executions.md
@@ -19,8 +19,8 @@ When spawning the troubleshooting agent, provide:
 
 ```
 Diagnose durable function execution issue:
-- Function: <function-name>:<qualifier> (must be qualified ARN)
-- Execution ID: <execution-id>
+- Durable Execution ARN: <durable-execution-arn>
+- Region: <region> (infer from ARN)
 
 CRITICAL SAFETY RULES:
 - This is READ-ONLY diagnosis
@@ -29,31 +29,144 @@ CRITICAL SAFETY RULES:
 - Only suggest manual remediation if user explicitly requests it
 
 Steps:
-1. Run: aws lambda get-durable-execution-history --function-name <function-name> --execution-id <execution-id>
-2. Analyze execution status (RUNNING/SUCCEEDED/FAILED/TIMED_OUT)
-3. Check for stuck operations (PENDING/RUNNING status)
-4. Identify failed operations and error messages
-5. Calculate operation durations and timeline
-6. Diagnose specific issue:
-   - Stuck in WAIT_FOR_CALLBACK: Extract callback ID, suggest manual callback
-   - Failed operations: Show error and retry attempts
-   - Timeout: Calculate total duration, identify slow operations
-   - Unexpected behavior: Compare operation order with expected flow
-7. Provide specific recommendations and next steps
-Use jq for JSON parsing and analysis.
+0. If the user provides a function name + alias (e.g., my-function:prod) instead of a full ARN:
+   - Resolve the alias to a version: aws lambda get-alias --function-name <function-name> --name <alias> --region <region> --query 'FunctionVersion' --output text
+   - List executions for that function: aws lambda list-durable-executions-by-function --function-name <function-name>:<version> --region <region>
+   - Ask the user to identify the execution, or use the most recent one.
+
+1. Fetch the execution history directly:
+   Run: aws lambda get-durable-execution-history --durable-execution-arn <durable-execution-arn> --region <region> --include-execution-data
+
+2. If the command succeeds, analyze and provide a user-friendly diagnosis:
+   a. Report the execution status (RUNNING/SUCCEEDED/FAILED/STOPPED/TIMED_OUT)
+   b. Identify the root cause by looking for these key events in the history:
+
+      **Execution-level failures:**
+      - `ExecutionFailed` — entire execution crashed; extract the error and cause fields
+      - `ExecutionTimedOut` — the execution exceeded its configured timeout
+      - `ExecutionStopped` — execution was manually stopped via StopDurableExecution
+
+      **Context and step failures:**
+      - `ContextFailed` — a child context threw an unhandled error; check the parent context for what triggered it
+      - `StepFailed` — an individual step failed; includes RetryDetails (CurrentAttempt, NextAttemptDelaySeconds) showing retry state
+
+      **Callback issues:**
+      - `CallbackStarted` with a Timeout field — confirms a timeout was registered; correlate with any subsequent `CallbackTimedOut`
+      - `CallbackTimedOut` — a timeout fired but may not have been caught by the function code
+      - `CallbackFailed` — the callback was resolved with an error
+
+      **Chained invocation failures:**
+      - `ChainedInvokeFailed` — a chained (child) durable execution failed
+      - `ChainedInvokeTimedOut` — a chained execution exceeded its timeout
+      - `ChainedInvokeStopped` — a chained execution was stopped
+
+      **Other signals:**
+      - `WaitCancelled` — a scheduled wait was cancelled before completing
+      - `InvocationCompleted` with an Error field — the Lambda invocation itself errored (e.g., a runtime crash)
+
+      **Diagnosis patterns:**
+      - Failed operations: Show the EXACT error message verbatim in a code block
+      - Stuck in WAIT_FOR_CALLBACK: Extract the callback ID and show how long it has been waiting
+      - Timeout: Show which operation was running when the timeout occurred
+      - Unexpected behavior: Compare operation order with the expected flow
+   c. Calculate operation durations and a timeline
+   d. Provide a clear, plain-language explanation of what went wrong and why
+
+3. If the command fails:
+   - Execution not found: Tell the user the execution ID may be incorrect or the execution may have been purged. Ask them to verify the ARN.
+   - Permissions/network error: Check that the caller identity has lambda:GetDurableExecutionHistory on the function ARN.
+   - In either case, direct them to the console as a fallback (see step 4).
+
+4. ALWAYS provide a direct link to the Execution Details page in the Lambda console.
+   Parse the ARN (arn:<partition>:lambda:<region>:<account>:function:<function-name>:<version>/durable-execution/<execution-name>/<invocation-id>)
+   to extract region, functionName, functionVersion, executionName, and invocationId, then construct:
+   https://<region>.console.aws.amazon.com/lambda/home?region=<region>#/functions/<function-name>/versions/<version>/executions/<execution-name>/<invocation-id>
+
+   Frame it as: "**[View this execution in the console](<console-url>)**"
+
+5. Provide specific, actionable next steps based on the diagnosis.
+
+6. If unable to determine the root cause from the execution history:
+   - Provide the console link (step 4)
+   - Offer to fetch the log group and pull relevant logs:
+     a. Get the log group:
+        aws lambda get-function-configuration --function-name <function-name>:<version> --region <region> --query 'LoggingConfig.LogGroup'
+     b. Query logs filtered by invocation ID (parsed from the ARN):
+        aws logs filter-log-events --log-group-name <log-group> --region <region> --filter-pattern '"<invocation-id>"'
+     c. If the function uses SDK structured logging (context.logger), query for step-level logs.
+        Use the execution start time from step 2 as start-time, and the execution end time (or the current time if still running) as end-time:
+        QUERY_ID=$(aws logs start-query --log-group-name <log-group> --region <region> \
+          --start-time <start-epoch-seconds> --end-time <end-epoch-seconds> \
+          --query-string 'fields @timestamp, @message | filter executionName = "<execution-name>" | sort @timestamp asc' \
+          --query 'queryId' --output text)
+        Then poll for results:
+        aws logs get-query-results --query-id $QUERY_ID --region <region>
+   - If unable to determine the log group, ask the user to check the function's logs
+
+OUTPUT FORMAT:
+- Lead with a short summary: "Your execution [status] because [reason]."
+- Show the error message verbatim in a code block if one exists.
+- Explain what this means in plain language.
+- Provide the console link for visual debugging.
+- List recommended next steps.
+
+Use jq for JSON parsing if available; otherwise use --query (the AWS CLI's built-in JMESPath) or python -m json.tool.
 ```
 
 ## Example Usage
 
+````
+User: "My durable function execution abc-123 is stuck on my-function:prod in us-west-2"
+
+Agent: [Runs get-durable-execution-history command]
+Agent: [Analyzes with jq queries]
+
+Claude: Your execution is **waiting for an external callback** that hasn't arrived yet.
+
+The operation `wait-for-approval` has been pending since 2026-02-14 (3 days).
+It will time out in 12 hours if no callback is received.
+
+Callback ID: `xyz789`
+
+**[View this execution in the console](https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456)**
+
 ```
-User: "My durable function execution abc-123 is stuck on my-function:prod"
+https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456
+```
+
+**Next steps:**
+1. Check if the approval email was sent to the recipient
+2. Verify the callback ID `xyz789` was included in the email
+3. If you'd like me to help send the callback manually, let me know
+````
+
+````
+User: "Execution def-456 failed on order-processor:live in us-east-1"
 
-Claude: [Spawns Task agent with troubleshooting instructions]
 Agent: [Runs get-durable-execution-history command]
 Agent: [Analyzes with jq queries]
-Agent: [Returns: "Execution stuck in WAIT_FOR_CALLBACK operation 'wait-for-approval'.
-  Callback ID: xyz789. Waiting since 2026-02-14. Timeout in 12 hours.
-  Recommendation: Check if approval email was sent, or manually send callback."]
-Claude: [Presents findings and offers to send manual callback if needed]
+
+Claude: Your execution **failed** at the `charge-payment` step after 3 retry attempts.
+
+Error message:
+
+```
+PaymentGatewayError: Card declined - insufficient funds (code: card_declined)
 ```
+
+This means the payment provider rejected the charge. The durable function
+retried 3 times with exponential backoff but the error persisted because
+it's a permanent failure (not a transient network issue).
+
+**[View this execution in the console](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789)**
+
+```
+https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789
+```
+
+**Next steps:**
+1. Treat this as a business logic failure, not an infrastructure issue
+2. Consider adding a non-retryable error classification for `card_declined`
+3. Implement a compensation step to release the reserved inventory
+````
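As a sanity check on the ARN-to-console-URL mapping in step 4 of the prompt, here is a minimal POSIX-shell sketch (not part of the skill files). The ARN below is a made-up example assembled from the sample values in the example dialogs, and the field layout assumes the `arn:<partition>:lambda:<region>:<account>:function:<name>:<version>/durable-execution/<execution>/<invocation>` shape the prompt describes.

```shell
#!/bin/sh
# Hypothetical durable-execution ARN built from the doc's sample values.
arn="arn:aws:lambda:us-west-2:123456789012:function:my-function:3/durable-execution/abc-123/inv-456"

# Split on ':' into positional parameters:
# $1=arn $2=partition $3=lambda $4=region $5=account $6="function"
# $7=function name $8="<version>/durable-execution/<execution>/<invocation>"
old_ifs=$IFS; IFS=:; set -- $arn; IFS=$old_ifs
region=$4; fn_name=$7; rest=$8

fn_version=${rest%%/*}                 # text before the first '/'
path_part=${rest#*/durable-execution/} # "<execution>/<invocation>"
execution_name=${path_part%%/*}
invocation_id=${path_part##*/}

url="https://${region}.console.aws.amazon.com/lambda/home?region=${region}#/functions/${fn_name}/versions/${fn_version}/executions/${execution_name}/${invocation_id}"
echo "$url"
```

Run against the sample ARN, this reproduces the console link shown in the first example dialog.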
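The prompt's jq-free fallback (`--query` or `python -m json.tool`) can also be done inline with `python3`. A sketch: the JSON below is a hand-written stand-in, and the field names `Status`, `Events`, and `Type` are assumptions for illustration, not the documented `get-durable-execution-history` response shape.

```shell
#!/bin/sh
# Stand-in for a saved API response; the real response shape may differ.
response='{"Status": "FAILED", "Events": [{"Type": "StepFailed", "Id": 7}]}'

# Extract fields with python3 when jq is unavailable.
status=$(printf '%s' "$response" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["Status"])')
first_event=$(printf '%s' "$response" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["Events"][0]["Type"])')

echo "$status / $first_event"
```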
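Step 6c of the prompt leaves the `--start-time`/`--end-time` values to the agent. One portable way to derive them, assuming ISO-8601 timestamps like the sample date in the dialogs and CloudWatch Logs `StartQuery`'s epoch-seconds convention; the `iso_to_epoch` helper name is hypothetical, not part of the skill.

```shell
#!/bin/sh
# Convert an ISO-8601 timestamp (e.g., an execution start time pulled from the
# history) into the epoch seconds that `aws logs start-query` expects.
iso_to_epoch() {
  python3 -c 'import sys
from datetime import datetime
print(int(datetime.fromisoformat(sys.argv[1].replace("Z", "+00:00")).timestamp()))' "$1"
}

# 2026-02-14 is the sample date used in the first example dialog.
start_time=$(iso_to_epoch "2026-02-14T00:00:00Z")
echo "$start_time"
```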