- 
                Notifications
    You must be signed in to change notification settings 
- Fork 490
Get count from split metadata on simple time range query #5758
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Get count from split metadata on simple time range query #5758
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice optimization. I think it deserves a unit test. E.g the following test scenario:
- The mock metastore returns 4 splits (1 outside the range, 1 overlapping the star, 1 overlapping the end, 1 overlapping the whole range and 1 within the range.
- The mock search service expects only 1 call.
        
          
                quickwit/quickwit-search/src/root.rs
              
                Outdated
          
        
      | let start_time = split.time_range.as_ref().map(|x| x.start()).copied(); | ||
| let end_time = split.time_range.as_ref().map(|x| x.end()).copied(); | ||
| if is_metadata_count_request(search_request, start_time, end_time) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, to make search_partial_hits_phase a bit more readable, I would pass the time_range directly to is_metadata_count_request
        
          
                quickwit/quickwit-search/src/root.rs
              
                Outdated
          
        
      | jobs_to_leaf_request(search_request, indexes_metas_for_leaf_search, client_jobs)?; | ||
| leaf_request_tasks.push(cluster_client.leaf_search(leaf_request, client.clone())); | ||
| } | ||
| leaf_search_responses.extend(try_join_all(leaf_request_tasks).await?); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, moving it to a separate line would help see a bit more clearly how errors are resolved.
| leaf_search_responses.extend(try_join_all(leaf_request_tasks).await?); | |
| let executed_leaf_search_responses = try_join_all(leaf_request_tasks).await?; | |
| leaf_search_responses.extend(executed_leaf_search_responses); | 
de8dde6    to
    f3ab102      
    Compare
  
    | 
 Will add tests when I have a bit more time. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding a test. Unfortunately I think it is flawed (so was my example of test earlier, I should have written "the mock search service expects to be called for 3 splits")
        
          
                quickwit/quickwit-search/src/root.rs
              
                Outdated
          
        
      | mock_search.expect_leaf_search().times(1).returning(|_req| { | ||
| Ok(quickwit_proto::search::LeafSearchResponse { | ||
| num_hits: 1, | ||
| partial_hits: vec![mock_partial_hit("split_inside", 1, 1)], | ||
| failed_splits: Vec::new(), | ||
| num_attempted_splits: 1, | ||
| ..Default::default() | ||
| }) | ||
| }); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- asserting that _reqcontains the right splits is the important part of this test
- the mock response should be all splits except "split_inside" which is the one we resolve at the root level 🙃
        
          
                quickwit/quickwit-search/src/root.rs
              
                Outdated
          
        
      | assert_eq!(resp.num_hits, 1); | ||
| assert_eq!(resp.hits.len(), 1); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the response I would have expected here. num_hits should be 10 (num docs in spit inside) + whatever you decide leaf_search returns for the overlapping splits.
        
          
                quickwit/quickwit-search/src/root.rs
              
                Outdated
          
        
      | end_timestamp: Some(129_000), | ||
| index_id_patterns: vec!["test-index".to_string()], | ||
| query_ast: qast_json_helper("test", &["body"]), | ||
| max_hits: 10, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your change is not expected to run if max_hits!=0
c535af6    to
    89f26f1      
    Compare
  
    | 
 You're totally right, I was supposed to be making a count query, fixed now. | 
89f26f1    to
    926a2f3      
    Compare
  
            
          
                quickwit/quickwit-search/src/root.rs
              
                Outdated
          
        
      | if let Some(request_start_timestamp) = request.start_timestamp { | ||
| let Some(split_start_timestamp) = split_start_timestamp else { | ||
| return false; | ||
| }; | ||
| if split_start_timestamp < request_start_timestamp { | ||
| return false; | ||
| } | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, I wonder whether this wouldn't be easier to read:
| if let Some(request_start_timestamp) = request.start_timestamp { | |
| let Some(split_start_timestamp) = split_start_timestamp else { | |
| return false; | |
| }; | |
| if split_start_timestamp < request_start_timestamp { | |
| return false; | |
| } | |
| } | |
| match (request.start_timestamp, split_start_timestamp) { | |
| (Some(request_start), Some(split_start)) if split_start >= request_start => {} | |
| (Some(_), _) => return false, | |
| (None, _) => {} | |
| } | 
        
          
                quickwit/quickwit-search/src/root.rs
              
                Outdated
          
        
      | num_hits: req.leaf_requests[0] | ||
| .split_offsets | ||
| .iter() | ||
| .map(|s| s.num_docs) | ||
| .sum(), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make this test more powerful by setting a smaller number here.
        
          
                quickwit/quickwit-search/src/root.rs
              
                Outdated
          
        
      | assert_eq!(resp.num_hits, 50); | ||
| assert_eq!(resp.hits.len(), 0); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is missing from the other tests, but would be nice to also assert resp.num_successful_splits here
| last minute thought, you could also add some tests in https://github.com/tontinton/quickwit/blob/926a2f33d7b35a5cee064adc457c4503f96dc725/quickwit/rest-api-tests/scenarii/qw_search_api/0001_ts_range.yaml to double check the limit conditions, e.g: should confirm that the upper bound exclusion condition is correct and stays that way. | 
926a2f3    to
    2a6a590      
    Compare
  
    4f1812e    to
    8c4d360      
    Compare
  
    * Remove `quickwit-lambda` package * Fix warning
Bumps the github-actions group with 2 updates: [actions/github-script](https://github.com/actions/github-script) and [actions/setup-node](https://github.com/actions/setup-node). Updates `actions/github-script` from 7 to 8 - [Release notes](https://github.com/actions/github-script/releases) - [Commits](actions/github-script@v7...v8) Updates `actions/setup-node` from 4 to 5 - [Release notes](https://github.com/actions/setup-node/releases) - [Commits](actions/setup-node@v4...v5) --- updated-dependencies: - dependency-name: actions/github-script dependency-version: '8' dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: actions/setup-node dependency-version: '5' dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…oss#5890) The following methods are used in order: - from the `QW_NUM_CPUS` environment variable - from the `KUBERNETES_LIMITS_CPU` environment variable - from the operating system - default to 2. Co-authored-by: fulmicoton <[email protected]>
Follow-up for quickwit-oss#5884
8c4d360    to
    1ea3723      
    Compare
  
    
I think I found a nice optimization here while starting to research the search query code. @rdettai
Count queries can return much faster by not downloading splits, and I couldn't think of a good reason to always download split files on time range queries, if the split time range is fully contained in the search request time range.
Split the PR into 2: #5759.