Skip to content

Conversation

@tontinton
Copy link
Contributor

@tontinton tontinton commented Apr 19, 2025

I think I found a nice optimization here while starting to research the search query code. @rdettai

Count queries can return much faster by not downloading splits, and I couldn't think of a good reason to always download split files on time range queries, if the split time range is fully contained in the search request time range.

Split the PR into 2: #5759.

@tontinton tontinton changed the title Get count from split metadata on simple query with timerange Get count from split metadata on simple query with time range Apr 19, 2025
@tontinton tontinton changed the title Get count from split metadata on simple query with time range Get count from split metadata on simple time range query Apr 19, 2025
Copy link
Collaborator

@rdettai rdettai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice optimization. I think it deserves a unit test. E.g the following test scenario:

  • The mock metastore returns 4 splits (1 outside the range, 1 overlapping the star, 1 overlapping the end, 1 overlapping the whole range and 1 within the range.
  • The mock search service expects only 1 call.

Comment on lines 757 to 759
let start_time = split.time_range.as_ref().map(|x| x.start()).copied();
let end_time = split.time_range.as_ref().map(|x| x.end()).copied();
if is_metadata_count_request(search_request, start_time, end_time) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, to make search_partial_hits_phase a bit more readable, I would pass the time_range directly to is_metadata_count_request

jobs_to_leaf_request(search_request, indexes_metas_for_leaf_search, client_jobs)?;
leaf_request_tasks.push(cluster_client.leaf_search(leaf_request, client.clone()));
}
leaf_search_responses.extend(try_join_all(leaf_request_tasks).await?);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, moving it to a separate line would help see a bit more clearly how errors are resolved.

Suggested change
leaf_search_responses.extend(try_join_all(leaf_request_tasks).await?);
let executed_leaf_search_responses = try_join_all(leaf_request_tasks).await?;
leaf_search_responses.extend(executed_leaf_search_responses);

@tontinton tontinton force-pushed the optimize-timestamp-range-count branch from de8dde6 to f3ab102 Compare April 22, 2025 20:40
@tontinton
Copy link
Contributor Author

Nice optimization. I think it deserves a unit test. E.g the following test scenario:

* The mock metastore returns 4 splits (1 outside the range, 1 overlapping the star, 1 overlapping the end, 1 overlapping the whole range and 1 within the range.

* The mock search service expects only 1 call.

Will add tests when I have a bit more time.

@tontinton tontinton requested a review from rdettai April 29, 2025 17:38
Copy link
Collaborator

@rdettai rdettai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding a test. Unfortunately I think it is flawed (so was my example of test earlier, I should have written "the mock search service expects to be called for 3 splits")

Comment on lines 5131 to 5139
mock_search.expect_leaf_search().times(1).returning(|_req| {
Ok(quickwit_proto::search::LeafSearchResponse {
num_hits: 1,
partial_hits: vec![mock_partial_hit("split_inside", 1, 1)],
failed_splits: Vec::new(),
num_attempted_splits: 1,
..Default::default()
})
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • asserting that _req contains the right splits is the important part of this test
  • the mock response should be all splits except "split_inside" which is the one we resolve at the root level 🙃

Comment on lines 5159 to 5160
assert_eq!(resp.num_hits, 1);
assert_eq!(resp.hits.len(), 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the response I would have expected here. num_hits should be 10 (num docs in spit inside) + whatever you decide leaf_search returns for the overlapping splits.

end_timestamp: Some(129_000),
index_id_patterns: vec!["test-index".to_string()],
query_ast: qast_json_helper("test", &["body"]),
max_hits: 10,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your change is not expected to run if max_hits!=0

@tontinton tontinton force-pushed the optimize-timestamp-range-count branch from c535af6 to 89f26f1 Compare April 30, 2025 19:21
@tontinton
Copy link
Contributor Author

tontinton commented Apr 30, 2025

Thanks for adding a test. Unfortunately I think it is flawed (so was my example of test earlier, I should have written "the mock search service expects to be called for 3 splits")

You're totally right, I was supposed to be making a count query, fixed now.

@tontinton tontinton requested a review from rdettai April 30, 2025 19:22
@tontinton tontinton force-pushed the optimize-timestamp-range-count branch from 89f26f1 to 926a2f3 Compare April 30, 2025 20:48
Comment on lines 667 to 670
if let Some(request_start_timestamp) = request.start_timestamp {
let Some(split_start_timestamp) = split_start_timestamp else {
return false;
};
if split_start_timestamp < request_start_timestamp {
return false;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, I wonder whether this wouldn't be easier to read:

Suggested change
if let Some(request_start_timestamp) = request.start_timestamp {
let Some(split_start_timestamp) = split_start_timestamp else {
return false;
};
if split_start_timestamp < request_start_timestamp {
return false;
}
}
match (request.start_timestamp, split_start_timestamp) {
(Some(request_start), Some(split_start)) if split_start >= request_start => {}
(Some(_), _) => return false,
(None, _) => {}
}

Comment on lines 5154 to 5158
num_hits: req.leaf_requests[0]
.split_offsets
.iter()
.map(|s| s.num_docs)
.sum(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make this test more powerful by setting a smaller number here.

Comment on lines 5184 to 5165
assert_eq!(resp.num_hits, 50);
assert_eq!(resp.hits.len(), 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is missing from the other tests, but would be nice to also assert resp.num_successful_splits here

@rdettai
Copy link
Collaborator

rdettai commented May 14, 2025

last minute thought, you could also add some tests in https://github.com/tontinton/quickwit/blob/926a2f33d7b35a5cee064adc457c4503f96dc725/quickwit/rest-api-tests/scenarii/qw_search_api/0001_ts_range.yaml to double check the limit conditions, e.g:

endpoint: simple/search
params:
  query: "*"
  start_timestamp: 1684993000
  end_timestamp: 1684993004
expected:
  num_hits: 3

should confirm that the upper bound exclusion condition is correct and stays that way.

@tontinton tontinton force-pushed the optimize-timestamp-range-count branch from 926a2f3 to 2a6a590 Compare May 14, 2025 18:49
@tontinton tontinton requested a review from rdettai May 14, 2025 18:49
@tontinton tontinton force-pushed the optimize-timestamp-range-count branch 4 times, most recently from 4f1812e to 8c4d360 Compare May 15, 2025 12:09
trinity-1686a and others added 8 commits September 8, 2025 17:01
* Remove `quickwit-lambda` package

* Fix warning
Bumps the github-actions group with 2 updates: [actions/github-script](https://github.com/actions/github-script) and [actions/setup-node](https://github.com/actions/setup-node).


Updates `actions/github-script` from 7 to 8
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@v7...v8)

Updates `actions/setup-node` from 4 to 5
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](actions/setup-node@v4...v5)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: actions/setup-node
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
fulmicoton and others added 5 commits September 12, 2025 16:25
…oss#5890)

The following methods are used in order:
- from the `QW_NUM_CPUS` environment variable
- from the `KUBERNETES_LIMITS_CPU` environment variable
- from the operating system
-  default to 2.

Co-authored-by: fulmicoton <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants