docs(aws-serverless): Backport improved troubleshooting execution guide #160

Merged
krokoko merged 2 commits into awslabs:main from vishalsatam:backport/improve-troubleshooting-steering on May 15, 2026


@vishalsatam
Contributor

@vishalsatam vishalsatam commented May 7, 2026

Backport changes from aws/aws-durable-execution-docs#161 to the agent-plugins repo.
Improvements include:

  • ARN-based workflow instead of separate function name + execution ID
  • Alias resolution and list executions step
  • Console link generation for visual debugging
  • Structured output format with clear diagnosis summaries
  • CloudWatch Logs querying when execution history is insufficient
  • Additional usage examples covering failure scenarios
  • Better error handling guidance (execution not found, permissions)
  • Expanded trigger keywords in SKILL.md for better routing
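
To illustrate the ARN-based and alias-resolution items above, here is a minimal sketch that splits a standard Lambda function ARN into its parts, including an optional alias or version qualifier. The helper name `parse_function_arn` and the `FunctionArn` type are hypothetical and not taken from the guide, and the layout of a durable execution ARN is not shown in this pull request, so only the ordinary function ARN form is handled.

```python
from typing import NamedTuple, Optional


class FunctionArn(NamedTuple):
    """Components of a Lambda function ARN (hypothetical helper type)."""
    partition: str
    region: str
    account_id: str
    function_name: str
    qualifier: Optional[str]  # alias or version, if the ARN carries one


def parse_function_arn(arn: str) -> FunctionArn:
    """Split arn:partition:lambda:region:account:function:name[:qualifier]."""
    parts = arn.split(":")
    well_formed = (
        len(parts) in (7, 8)
        and parts[0] == "arn"
        and parts[2] == "lambda"
        and parts[5] == "function"
    )
    if not well_formed:
        raise ValueError(f"not a Lambda function ARN: {arn!r}")
    # The eighth segment, when present, is the alias or version qualifier.
    qualifier = parts[7] if len(parts) == 8 else None
    return FunctionArn(parts[1], parts[3], parts[4], parts[6], qualifier)
```

Working from a single ARN means the region, account, function name, and alias all come from one input rather than from separately supplied values.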

Closes aws/aws-durable-execution-docs#162

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.


@bchampp bchampp left a comment

I think we should consider adding more explicit instructions under "Identify the root cause". From the original GitHub issue:

Key events to look for:

  • CallbackTimedOut — timeout fired but may not have been caught
  • ContextFailed — child context threw an unhandled error
  • ExecutionFailed — entire execution crashed
  • CallbackStarted with Timeout field — confirms timeout was registered

What do you think? Right now we're relying on the agent's ability to interpret these history events, but I think we can provide clearer guidance.
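
As a rough illustration of the kind of explicit guidance suggested here, the sketch below scans a list of history events and reports the four key signals listed above. The event shape is an assumption: each event is treated as a dict with an `EventType` key and, for `CallbackStarted`, a top-level `Timeout` key. The real history payload may nest these differently, and `summarize_key_events` is a hypothetical helper name.

```python
from typing import Dict, Iterable, List

# Hints copied from the key-event list in this comment.
KEY_EVENT_HINTS: Dict[str, str] = {
    "CallbackTimedOut": "timeout fired but may not have been caught",
    "ContextFailed": "child context threw an unhandled error",
    "ExecutionFailed": "entire execution crashed",
}


def summarize_key_events(events: Iterable[dict]) -> List[str]:
    """Return one finding per key event, in history order."""
    findings = []
    for index, event in enumerate(events):
        # "EventType" and "Timeout" are assumed field names, not confirmed ones.
        event_type = event.get("EventType")
        if event_type in KEY_EVENT_HINTS:
            findings.append(f"[{index}] {event_type}: {KEY_EVENT_HINTS[event_type]}")
        elif event_type == "CallbackStarted" and "Timeout" in event:
            findings.append(f"[{index}] CallbackStarted: timeout was registered")
    return findings
```

A `CallbackStarted` finding followed later by `CallbackTimedOut` would suggest the timeout was registered and did fire, which narrows the question to whether the handler caught it.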

@vishalsatam vishalsatam force-pushed the backport/improve-troubleshooting-steering branch from 42cf189 to 039881e Compare May 11, 2026 18:42
@vishalsatam
Contributor Author

> I think we should consider adding more explicit instructions under "Identify the root cause". From the original GitHub issue:
>
> Key events to look for:
>
>   • CallbackTimedOut — timeout fired but may not have been caught
>   • ContextFailed — child context threw an unhandled error
>   • ExecutionFailed — entire execution crashed
>   • CallbackStarted with Timeout field — confirms timeout was registered
>
> What do you think? Right now we're relying on the agent's ability to interpret these history events, but I think we can provide clearer guidance.

I updated the commit to include details on the failure-related event types.

Backport changes from aws/aws-durable-execution-docs#161 to the
agent-plugins repo. Improvements include:
- ARN-based workflow instead of separate function name + execution ID
- Alias resolution and list executions step
- Console link generation for visual debugging
- Structured output format with clear diagnosis summaries
- CloudWatch Logs querying when execution history is insufficient
- Additional usage examples covering failure scenarios
- Better error handling guidance (execution not found, permissions)
- Expanded trigger keywords in SKILL.md for better routing

Closes aws/aws-durable-execution-docs#162
@vishalsatam vishalsatam force-pushed the backport/improve-troubleshooting-steering branch from 039881e to 2478a8b Compare May 11, 2026 18:46
@vishalsatam
Contributor Author

@scottschreckengaust @theagenticguy @krokoko, could you please review this pull request? If you approve, please merge it as well, since I don't have permission to merge.

@krokoko krokoko enabled auto-merge May 15, 2026 16:40
Contributor

@theagenticguy theagenticguy left a comment

LGTM

@krokoko krokoko added this pull request to the merge queue May 15, 2026
Merged via the queue into awslabs:main with commit 38a9152 May 15, 2026
22 checks passed
bchampp pushed a commit to aws/aws-durable-execution-docs that referenced this pull request May 15, 2026
Add detailed event type documentation to step 2b of the troubleshooting
guide, providing explicit guidance on key execution history events:

- Execution-level failures (ExecutionFailed, ExecutionTimedOut, ExecutionStopped)
- Context and step failures (ContextFailed, StepFailed with RetryDetails)
- Callback issues (CallbackStarted, CallbackTimedOut, CallbackFailed)
- Chained invocation failures (ChainedInvokeFailed/TimedOut/Stopped)
- Other signals (WaitCancelled, InvocationCompleted with Error)

This gives agents clearer guidance for interpreting execution history
events rather than relying on implicit knowledge.

Syncs with awslabs/agent-plugins#160.
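
The event categories in this commit message could be encoded as a simple lookup, sketched below under a few assumptions. `ChainedInvokeTimedOut` and `ChainedInvokeStopped` are expanded from the `ChainedInvokeFailed/TimedOut/Stopped` shorthand, the category labels are paraphrased, and `group_by_category` is a hypothetical helper rather than part of the guide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Event type -> category, following the commit message's grouping.
# InvocationCompleted is left out: it only signals a problem when it carries
# an Error, which cannot be judged from the type name alone.
EVENT_CATEGORIES: Dict[str, str] = {
    "ExecutionFailed": "execution-level failure",
    "ExecutionTimedOut": "execution-level failure",
    "ExecutionStopped": "execution-level failure",
    "ContextFailed": "context or step failure",
    "StepFailed": "context or step failure",
    "CallbackStarted": "callback issue",
    "CallbackTimedOut": "callback issue",
    "CallbackFailed": "callback issue",
    "ChainedInvokeFailed": "chained invocation failure",
    "ChainedInvokeTimedOut": "chained invocation failure",
    "ChainedInvokeStopped": "chained invocation failure",
    "WaitCancelled": "other signal",
}


def group_by_category(event_types: Iterable[str]) -> Dict[str, List[str]]:
    """Group the recognized event types by category; ignore everything else."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for event_type in event_types:
        category = EVENT_CATEGORIES.get(event_type)
        if category is not None:
            grouped[category].append(event_type)
    return dict(grouped)
```

Grouping first lets a diagnosis summary lead with execution-level failures and then drill into the step, callback, or chained-invoke events that preceded them.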

Development

Successfully merging this pull request may close these issues.

backport agent to serverless

4 participants