Skip to content

fix(sdk): match reference behaviour for large error payloads#133

Merged
1 commit merged intomainfrom
checkpoint-opt
Nov 8, 2025
Merged

fix(sdk): match reference behaviour for large error payloads#133
1 commit merged intomainfrom
checkpoint-opt

Conversation

@ghost
Copy link
Copy Markdown

@ghost ghost commented Nov 7, 2025

Changes:

  • When payloads are large, we checkpoint the error and return only
    failed.
  • When payloads are small, we return back the error

fixes: #41

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@ghost ghost requested review from wangyb-A and yaythomas as code owners November 7, 2025 22:36
Changes:
 - When payloads are large, we checkpoint the error and return only
   failed.
 - When payloads are small, we return back the error

fixes: #41
@ghost ghost force-pushed the checkpoint-opt branch from f36d1b6 to 73e5bf4 Compare November 7, 2025 22:38
ThreadPoolExecutor(
max_workers=2, thread_name_prefix="dex-handler"
) as executor,
contextlib.closing(execution_state) as execution_state,
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding here the context so that it always closes automatically.

Comment on lines +734 to +736

def close(self):
self.stop_checkpointing()
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensures we close with the contextlib.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we add a comment here, and the place that we use the contextlib.closing to point out we are stopping the ckeckpointing


# Stop background checkpointing thread
execution_state.stop_checkpointing()
execution_state.create_checkpoint(success_operation, is_sync=True)
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we do create_checkpoint(success_operation, is_sync=True) over create_checkpoint_sync is because that was unwrapping the error. This made it appear like a user error, when in fact it was a checkpoint error.

The change in how we handle user errors made this bug visible.

@wangyb-A
Copy link
Copy Markdown
Contributor

wangyb-A commented Nov 7, 2025

I ran an integration test with this patch, confirmed that the error is handled correctly

@ghost ghost merged commit e2d4527 into main Nov 8, 2025
9 checks passed
@ghost ghost deleted the checkpoint-opt branch November 8, 2025 01:02
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize Response Path on Failure

3 participants