aws · bchampp · May 6, 2026 · May 6, 2026 · May 6, 2026 · May 6, 2026
diff --git a/aws-lambda-durable-functions-power/POWER.md b/aws-lambda-durable-functions-power/POWER.md
@@ -87,7 +87,7 @@ Load the appropriate reference file based on what the user is working on:
 - **Testing**, **local testing**, **cloud testing**, **test runner**, or **flaky tests** -> see [testing-patterns.md](steering/testing-patterns.md)
 - **Deployment**, **CloudFormation**, **CDK**, **SAM**, **log groups**, **deploy**, or **infrastructure** -> see [deployment-iac.md](steering/deployment-iac.md)
 - **Advanced patterns**, **GenAI agents**, **completion policies**, **step semantics**, or **custom serialization** -> see [advanced-patterns.md](steering/advanced-patterns.md)
-- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, or **execution history** -> see [troubleshooting-executions.md](steering/troubleshooting-executions.md)
+- **troubleshooting**, **stuck execution**, **failed execution**, **debug execution ID**, **execution history**, **execution error**, **why did my execution fail**, **execution timed out**, **callback not received**, **diagnose execution**, or **root cause execution** -> see [troubleshooting-executions.md](steering/troubleshooting-executions.md)
 
 ## Quick Reference
 

diff --git a/aws-lambda-durable-functions-power/steering/troubleshooting-executions.md b/aws-lambda-durable-functions-power/steering/troubleshooting-executions.md
@@ -19,8 +19,8 @@ When spawning the troubleshooting agent, provide:
 
 ```
 Diagnose durable function execution issue:
-- Function: <function-name>:<alias> (must be qualified ARN)
-- Execution ID: <execution-id>
+- Durable Execution ARN: <durable-execution-arn>
+- Region: <region> (infer from ARN)
 
 CRITICAL SAFETY RULES:
 - This is READ-ONLY diagnosis
@@ -29,31 +29,115 @@ CRITICAL SAFETY RULES:
 - Only suggest manual remediation if user explicitly requests it
 
 Steps:
-1. Run: aws lambda get-durable-execution-history --function-name <function> --execution-id <id>
-2. Analyze execution status (RUNNING/SUCCEEDED/FAILED/TIMED_OUT)
-3. Check for stuck operations (PENDING/RUNNING status)
-4. Identify failed operations and error messages
-5. Calculate operation durations and timeline
-6. Diagnose specific issue:
-   - Stuck in WAIT_FOR_CALLBACK: Extract callback ID, suggest manual callback
-   - Failed operations: Show error and retry attempts
-   - Timeout: Calculate total duration, identify slow operations
-   - Unexpected behavior: Compare operation order with expected flow
-7. Provide specific recommendations and next steps
-
-Use jq for JSON parsing and analysis.
+0. If the user provides a function name + alias (e.g., my-function:prod) instead of a full ARN:
+   - Resolve the alias to a version: aws lambda get-alias --function-name <functionName> --name <alias> --region <region> --query 'FunctionVersion' --output text
+   - List executions for that function: aws lambda list-durable-executions-by-function --function-name <functionName>:<version> --region <region>
+   - Ask the user to identify the execution, or use the most recent one.
+
+1. Fetch the execution history directly:
+   Run: aws lambda get-durable-execution-history --durable-execution-arn <durable-execution-arn> --region <region> --include-execution-data
+
+2. If the command succeeds, analyze and provide a user-friendly diagnosis:
+   a. Report the execution status (RUNNING/SUCCEEDED/FAILED/STOPPED/TIMED_OUT)
+   b. Identify the root cause:
+      - Failed operations: Show the EXACT error message verbatim in a code block
+      - Stuck in WAIT_FOR_CALLBACK: Extract callback ID, show how long it's been waiting
+      - Timeout: Show which operation was running when timeout occurred
+      - Unexpected behavior: Compare operation order with expected flow
+   c. Calculate operation durations and timeline
+   d. Provide a clear, plain-language explanation of what went wrong and why
+
+3. If the command fails:
+   - Execution not found: Tell the user the execution ID may be incorrect or the execution may have been purged. Ask them to verify the ARN.
+   - Permissions/network error: check that your caller identity has lambda:GetDurableExecutionHistory on the function ARN.
+   - In either case, direct them to the console as a fallback (see step 4)
+
+4. ALWAYS provide a direct link to the Execution Details page in the Lambda console.
+   Parse the ARN (arn:<partition>:lambda:<region>:<accountId>:function:<functionName>:<functionVersion>/durable-execution/<executionName>/<invocationId>)
+   to extract region, functionName, functionVersion, executionName, and invocationId, then construct:
+   https://<region>.console.aws.amazon.com/lambda/home?region=<region>#/functions/<functionName>/versions/<functionVersion>/executions/<executionName>/<invocationId>
+
+   Frame it as: "**[View this execution in the console](<url>)**"
+
+5. Provide specific, actionable next steps based on the diagnosis.
+
+6. If unable to determine the root cause from execution history:
+   - Provide the console link (step 4)
+   - Offer to fetch the log group and pull relevant logs:
+     a. Get the log group:
+        aws lambda get-function-configuration --function-name <functionName>:<functionVersion> --region <region> --query 'LoggingConfig.LogGroup'
+     b. Query logs filtered by invocation ID (parsed from the ARN):
+        aws logs filter-log-events --log-group-name <logGroup> --region <region> --filter-pattern '"<invocationId>"'
+     c. If the function uses SDK structured logging (context.logger), query for step-level logs.
+        Use execution start time from step 2 as start-time, and execution end time (or current time if still running) as end-time:
+        QUERY_ID=$(aws logs start-query --log-group-name <logGroup> --region <region> \
+          --start-time <startEpoch> --end-time <endEpoch> \
+          --query-string 'fields @timestamp, @message | filter executionName = "<executionName>" | sort @timestamp asc' \
+          --query 'queryId' --output text)
+        Then poll for results:
+        aws logs get-query-results --query-id $QUERY_ID --region <region>
+   - If unable to determine the log group, ask the user to check the function's logs
+
+OUTPUT FORMAT:
+- Lead with a short summary: "Your execution [status] because [reason]."
+- Show the error message verbatim in a code block if one exists.
+- Explain what this means in plain language.
+- Provide the console link for visual debugging.
+- List recommended next steps.
+
+Use jq for JSON parsing if available, otherwise use --query (AWS CLI built-in JMESPath) or python -m json.tool.
 ```
 
 ## Example Usage
 
 ```
-User: "My durable function execution abc-123 is stuck on my-function:prod"
+User: "My durable function execution abc-123 is stuck on my-function:prod in us-west-2"
 
-Claude: [Spawns Task agent with troubleshooting instructions]
 Agent: [Runs get-durable-execution-history command]
 Agent: [Analyzes with jq queries]
-Agent: [Returns: "Execution stuck in WAIT_FOR_CALLBACK operation 'wait-for-approval'.
-         Callback ID: xyz789. Waiting since 2026-02-14. Timeout in 12 hours.
-         Recommendation: Check if approval email was sent, or manually send callback."]
-Claude: [Presents findings and offers to send manual callback if needed]
+
+Claude: Your execution is **waiting for an external callback** that hasn't arrived yet.
+
+        The operation `wait-for-approval` has been pending since 2026-02-14 (3 days).
+        It will time out in 12 hours if no callback is received.
+
+        Callback ID: `xyz789`
+
+        **[View this execution in the console](https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456)**
+        ```
+        https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions/my-function/versions/3/executions/abc-123/inv-456
+        ```
+
+        **Next steps:**
+        1. Check if the approval email was sent to the recipient
+        2. Verify the callback ID `xyz789` was included in the email
+        3. If you'd like me to help send the callback manually, let me know
+```
+
+```
+User: "Execution def-456 failed on order-processor:live in us-east-1"
+
+Agent: [Runs get-durable-execution-history command]
+Agent: [Analyzes with jq queries]
+
+Claude: Your execution **failed** at the `charge-payment` step after 3 retry attempts.
+
+        Error message:
+        ```
+        PaymentGatewayError: Card declined - insufficient funds (code: card_declined)
+        ```
+
+        This means the payment provider rejected the charge. The durable function
+        retried 3 times with exponential backoff but the error persisted because
+        it's a permanent failure (not a transient network issue).
+
+        **[View this execution in the console](https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789)**
+        ```
+        https://us-east-1.console.aws.amazon.com/lambda/home?region=us-east-1#/functions/order-processor/versions/5/executions/def-456/inv-789
+        ```
+
+        **Next steps:**
+        1. This is a business logic failure, not an infrastructure issue
+        2. Consider adding a non-retryable error classification for `card_declined`
+        3. Implement a compensation step to release the reserved inventory
 ```