docs(aws-serverless): Backport improved troubleshooting execution guide #160

Open: vishalsatam wants to merge 1 commit into awslabs:main from vishalsatam:backport/improve-troubleshooting-steering

@vishalsatam vishalsatam commented May 7, 2026

Backport changes from aws/aws-durable-execution-docs#161 to the agent-plugins repo.
Improvements include:

  • ARN-based workflow instead of separate function name + execution ID
  • Alias resolution and list executions step
  • Console link generation for visual debugging
  • Structured output format with clear diagnosis summaries
  • CloudWatch Logs querying when execution history is insufficient
  • Additional usage examples covering failure scenarios
  • Better error handling guidance (execution not found, permissions)
  • Expanded trigger keywords in SKILL.md for better routing

Closes aws/aws-durable-execution-docs#162

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

@bchampp bchampp left a comment

I think we should consider adding more explicit instructions under "Identify the root cause". From the original GitHub issue:

Key events to look for:

  • CallbackTimedOut — timeout fired but may not have been caught
  • ContextFailed — child context threw an unhandled error
  • ExecutionFailed — entire execution crashed
  • CallbackStarted with Timeout field — confirms timeout was registered

What do you think? Right now we're relying on the agent's ability to interpret these history events, but I think we can provide clearer guidance.
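
As an illustration of the kind of explicit guidance suggested above, here is a minimal sketch that scans a list of history events and reports the four event types named in the comment. The event names and their descriptions come from the comment; the event layout (a dict with an `EventType` key and an optional top-level `Timeout` key) and the helper name `summarize_failures` are assumptions made for this example, not the actual execution history schema.

```python
from typing import Iterable

# Descriptions copied from the review comment above.
FAILURE_HINTS = {
    "CallbackTimedOut": "timeout fired but may not have been caught",
    "ContextFailed": "child context threw an unhandled error",
    "ExecutionFailed": "entire execution crashed",
}


def summarize_failures(events: Iterable[dict]) -> list[str]:
    """Return one finding per history event that matches a key event type.

    Assumption: each event is a dict with an "EventType" key, and a
    CallbackStarted event carries its timeout in a top-level "Timeout" key.
    The real history format may differ.
    """
    findings = []
    for index, event in enumerate(events):
        event_type = event.get("EventType")  # hypothetical field name
        if event_type in FAILURE_HINTS:
            findings.append(f"event {index}: {event_type}: {FAILURE_HINTS[event_type]}")
        elif event_type == "CallbackStarted" and "Timeout" in event:
            # Confirms that a timeout was registered for this callback.
            findings.append(
                f"event {index}: CallbackStarted: timeout was registered ({event['Timeout']})"
            )
    return findings
```

Calling `summarize_failures` on a history that contains a `CallbackStarted` event with a `Timeout` followed by a `CallbackTimedOut` event would yield two findings, which together suggest that the timeout was registered and later fired.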

@vishalsatam force-pushed the backport/improve-troubleshooting-steering branch from 42cf189 to 039881e on May 11, 2026 18:42
@vishalsatam (Author)

I think we should consider adding more explicit instructions under "Identify the root cause". From the original GitHub issue:

Key events to look for:

  • CallbackTimedOut — timeout fired but may not have been caught
  • ContextFailed — child context threw an unhandled error
  • ExecutionFailed — entire execution crashed
  • CallbackStarted with Timeout field — confirms timeout was registered

What do you think? Right now we're relying on the agent's ability to interpret these history events, but I think we can provide clearer guidance.

I updated the commit to include details for the failure-type events.

@vishalsatam force-pushed the backport/improve-troubleshooting-steering branch from 039881e to 2478a8b on May 11, 2026 18:46
@vishalsatam (Author)

@scottschreckengaust @theagenticguy @krokoko, can you please review this pull request? If you approve, please merge it as well, since I don't have merge permissions.

Development

Successfully merging this pull request may close these issues.

backport agent to serverless
