@rdettai-sk (Collaborator) commented Nov 28, 2025

Description

We observed that query spikes create huge backlogs of leaf search tasks that don't get cancelled when the queries time out.

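This happens because the timeout cancellation isn't propagated to the spawned tasks: when the timeout fires, only the future awaiting them is dropped, and dropping a Tokio JoinHandle detaches its task instead of aborting it, so the task keeps running.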

This implementation is based on JoinSet, a Tokio primitive that helps manage the lifecycle of a group of tasks. Unlike a plain JoinHandle, dropping a JoinSet aborts all the tasks it still owns, which is crucial to make sure all the tasks get cancelled when the leaf request times out.

How was this PR tested?

Describe how you tested this PR.

@rdettai-sk self-assigned this Nov 28, 2025
@guilload self-requested a review December 2, 2025 18:30

```diff
-    try_join_all(leaf_request_tasks),
-)
-.await??;
+let leaf_responses: Vec<LeafSearchResponse> = try_join_vec.try_join_all().await?;
```

Member

I know this is how things currently work, but do we actually need the responses to come back in the same order as the requests? That seems odd.

Collaborator Author

I asked myself the same question. My conclusion is that it's likely for reproducibility reasons: same list of splits + same query => same result. But that only holds at the leaf level. Given that the split list doesn't seem to be deterministic on the root (list_indexes_metadata() doesn't guarantee any ordering), I don't know how much we really gain from this.

Member

Probably nothing. Let's get rid of that constraint so we can use JoinSet directly.

Collaborator Author

Fine by me! It simplifies the code quite a bit. I'll apply the same order-independent processing when gathering join errors from individual splits.
