
Conversation


@vinaykpud vinaykpud commented Jul 16, 2025

Description

Optimized the reduce logic by:

Applying QuickSelect instead of a PriorityQueue when the final size is smaller than the bucket count, and simply copying all buckets and sorting them when the final size equals the bucket count (a rough sketch of this strategy is shown below).

I have added the tests in the comment section below: #18777 (comment)
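
For illustration, here is a minimal, self-contained sketch of that selection strategy. It uses a hypothetical generic comparator and a plain list rather than the actual bucket types and utilities in InternalTerms.java, so treat it as an outline of the idea, not the shipped implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified illustration of the reduce-side selection strategy (not the actual
// InternalTerms code): pick the top `size` buckets out of `buckets`.
final class TopBucketsSketch {

    static <B> List<B> topBuckets(List<B> buckets, int size, Comparator<B> order) {
        List<B> result = new ArrayList<>(buckets);
        if (size >= result.size()) {
            // Final size >= bucket count: keep every bucket and just sort once.
            result.sort(order);
            return result;
        }
        // Final size < bucket count: QuickSelect partitions the list so the top
        // `size` elements occupy the first `size` slots in O(n) average time,
        // then only that prefix is sorted (O(size log size)).
        quickSelect(result, 0, result.size() - 1, size, order);
        List<B> top = new ArrayList<>(result.subList(0, size));
        top.sort(order);
        return top;
    }

    private static <B> void quickSelect(List<B> a, int lo, int hi, int k, Comparator<B> order) {
        while (lo < hi) {
            int p = partition(a, lo, hi, order);
            if (p == k) {
                return;
            } else if (p < k) {
                lo = p + 1;
            } else {
                hi = p - 1;
            }
        }
    }

    private static <B> int partition(List<B> a, int lo, int hi, Comparator<B> order) {
        B pivot = a.get(hi);
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (order.compare(a.get(j), pivot) < 0) { // "less" means it ranks higher for the requested order
                swap(a, i++, j);
            }
        }
        swap(a, i, hi);
        return i;
    }

    private static <B> void swap(List<B> a, int i, int j) {
        B tmp = a.get(i);
        a.set(i, a.get(j));
        a.set(j, tmp);
    }
}
```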

Benchmarking results:

#18777 (comment)

Related Issues

Resolves #18705
Related: #18650

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

rishabhmaurya and others added 3 commits July 16, 2025 10:25
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
@github-actions github-actions bot added bug Something isn't working Search:Performance labels Jul 16, 2025

❌ Gradle check result for a711100: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@vinaykpud vinaykpud closed this Jul 17, 2025
@vinaykpud vinaykpud reopened this Jul 17, 2025

❌ Gradle check result for f2ad692: FAILURE

@vinaykpud vinaykpud force-pushed the coord-node-opt branch 2 times, most recently from 0b85cba to 862535c on July 17, 2025 15:06
❌ Gradle check result for 862535c: FAILURE

❌ Gradle check result for 5410283: FAILURE

❌ Gradle check result for 5087f77: FAILURE

❌ Gradle check result for c563b43: FAILURE

❌ Gradle check result for 775f402: FAILURE

❌ Gradle check result for f638796: FAILURE

@vinaykpud vinaykpud closed this Jul 17, 2025
@vinaykpud vinaykpud reopened this Jul 17, 2025

✅ Gradle check result for f638796: SUCCESS


codecov bot commented Jul 17, 2025

Codecov Report

Attention: Patch coverage is 36.00000% with 32 lines in your changes missing coverage. Please review.

Project coverage is 72.79%. Comparing base (b609a50) to head (c7089ff).
Report is 30 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...earch/aggregations/bucket/terms/InternalTerms.java | 36.00% | 31 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18777      +/-   ##
============================================
- Coverage     72.80%   72.79%   -0.02%     
- Complexity    68535    68542       +7     
============================================
  Files          5572     5573       +1     
  Lines        314779   314843      +64     
  Branches      45691    45703      +12     
============================================
+ Hits         229166   229178      +12     
- Misses        67014    67033      +19     
- Partials      18599    18632      +33     



✅ Gradle check result for 6c58e5f: SUCCESS


vinaykpud commented Jul 18, 2025

Benchmarking Results

To measure the performance improvements from this optimization, I conducted comprehensive benchmarking tests using opensearch-benchmark.

Test Setup

Initial Data Ingestion:

opensearch-benchmark execute-test --workload=big5 --target-hosts=localhost:9200 \
  --kill-running-processes --workload-params "corpus_size:60,ingest_percentage:1.16" \
  --exclude-tasks="check-cluster-health"

Test Configuration:

  • Documents ingested: 804,000 into the big5 index
  • Additional field added to the index: custom numeric data with a cardinality of 200,000, i.e. 200,000 buckets during aggregation
  • Cluster setup: Single node, single shard
  • Target task: numeric-terms-agg

Benchmark Execution:

opensearch-benchmark execute-test --pipeline=benchmark-only --workload=big5 \
  --target-hosts=localhost:9200 --kill-running-processes \
  --include-tasks "numeric-terms-agg" --telemetry=node-stats

Iteration details for this task: numeric-terms-agg

Performance Results

I measured the execution time for the reduce method across different request sizes:

| Request Size | Metric | Before Optimization | After Optimization | Improvement |
|---|---|---|---|---|
| 100K | Average | 61ms | 19ms | 69% faster |
| 100K | P90 | 68ms | 21ms | 69% faster |
| 250K | Average | 92ms | 19ms | 79% faster |
| 250K | P90 | 107ms | 20ms | 81% faster |

Resource Utilization

The optimization also significantly reduced system resource consumption:

  • Decreased CPU utilization during aggregation operations
  • Reduced heap utilization

[Screenshots: CPU and heap utilization comparison, captured 2025-07-17]

Key Findings

  • Consistent performance: The optimized solution maintains ~19-21ms execution time regardless of request size
  • Scalability improvement: Performance gains increase with larger request sizes
  • Resource efficiency: Reduced CPU and memory overhead across all test scenarios


vinaykpud commented Jul 18, 2025

We now have two approaches for reducing merged buckets to obtain the top N buckets:

  1. Priority Queue-based solution (previous implementation)
  2. Quick Select-based solution (introduced in this PR)

Based on theoretical analysis, we expected the Quick Select solution to outperform the Priority Queue when the request size (top N) is large relative to cardinality, while the Priority Queue should perform better for smaller request sizes. To validate this hypothesis, I ran performance tests measuring the execution duration of each method, similar to the previous comment.
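
For reference, here is a minimal sketch of the heap-based alternative (hypothetical names, not the actual OpenSearch priority-queue implementation): maintaining a bounded heap of size N costs roughly O(B log N) over B candidate buckets, whereas QuickSelect averages O(B) plus O(N log N) to sort the selected prefix, which is consistent with the heap winning at very small N and QuickSelect winning as N grows.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Bounded-heap top-N selection, roughly O(B log N) for B candidate buckets.
// A simplified stand-in for the previous PriorityQueue-based reduce path.
final class HeapTopNSketch {

    static <B> List<B> topBuckets(Iterable<B> buckets, int n, Comparator<B> betterFirst) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Heap ordered worst-first so the head is always the evictable candidate.
        PriorityQueue<B> heap = new PriorityQueue<>(n, betterFirst.reversed());
        for (B bucket : buckets) {
            if (heap.size() < n) {
                heap.offer(bucket);
            } else if (betterFirst.compare(bucket, heap.peek()) < 0) {
                heap.poll();        // drop the current worst of the top N
                heap.offer(bucket); // keep the better candidate
            }
        }
        List<B> result = new ArrayList<>(heap);
        result.sort(betterFirst);   // heap iteration order is unspecified, so sort once
        return result;
    }
}
```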

Performance Results:

| Request Size | Quick Select (avg) | Priority Queue (avg) | Performance Winner |
|---|---|---|---|
| 500 | ~0.003 ms | ~0.003 ms | Tie |
| 1,000 | 0.02 ms | 0.013 ms | Priority Queue |
| 5,000 | 0.1 ms | 1.1 ms | Quick Select |
| 10,000 | 1.2 ms | 3.4 ms | Quick Select |

Conclusion: The Quick Select solution introduces slight overhead for smaller request sizes (≤1,000 items) but shows significant performance improvements at 5,000+ items, making it the preferred choice for high-cardinality use cases.

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
@vinaykpud
Contributor Author

Adding @rishabhmaurya for review
@getsaurabh02

❌ Gradle check result for 0b7e48f: FAILURE

@rishabhmaurya
Contributor

See if we can reuse the code from #18702 (comment); all 3 optimizations have duplicate logic, so it's better to maintain one version of it and reuse it.

@vinaykpud
Contributor Author

Although the high-level logic looks the same, the steps and the way they are used are different. Here we follow these steps (see the sketch after this list):

  1. Process buckets, update doc count errors, and count valid buckets
  2. Populate valid buckets from the reduced buckets based on the min doc count condition
  3. Select the top buckets if needed
  4. Calculate otherDocCount

So this logic is specific to the coordinator and is not shared with NumericTermsAggregator; I am not sure we can reuse the code from Optimization in Numeric Terms Aggregation query for Large Bucket Counts #18702 (comment).
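
A rough, simplified sketch of those coordinator-side steps (the ReducedBucket type is a hypothetical stand-in; the real code in InternalTerms.java also tracks per-bucket doc count error and uses the QuickSelect-based selection from the PR description):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified outline of the coordinator-side reduce steps listed above.
// Doc count error tracking is elided for brevity.
final class CoordinatorReduceSketch {

    record ReducedBucket(String key, long docCount) {}

    record Result(List<ReducedBucket> topBuckets, long otherDocCount) {}

    static Result reduce(List<ReducedBucket> reducedBuckets, long minDocCount, int requiredSize,
                         Comparator<ReducedBucket> order) {
        // Steps 1 + 2: walk the merged buckets, keep only those satisfying min doc count,
        // and accumulate the total doc count for the otherDocCount calculation.
        long totalDocCount = 0;
        List<ReducedBucket> validBuckets = new ArrayList<>();
        for (ReducedBucket bucket : reducedBuckets) {
            totalDocCount += bucket.docCount();
            if (bucket.docCount() >= minDocCount) {
                validBuckets.add(bucket);
            }
        }

        // Step 3: select the top `requiredSize` buckets. A plain sort stands in here for
        // the QuickSelect / copy-and-sort strategy sketched in the PR description.
        validBuckets.sort(order);
        List<ReducedBucket> top =
            new ArrayList<>(validBuckets.subList(0, Math.min(requiredSize, validBuckets.size())));

        // Step 4: otherDocCount = docs in all merged buckets minus docs in the returned buckets.
        long returnedDocCount = top.stream().mapToLong(ReducedBucket::docCount).sum();
        return new Result(top, totalDocCount - returnedDocCount);
    }
}
```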


@jainankitk jainankitk left a comment


I am wondering if we should use the object allocation metric (similar to #17447) instead of the JVM utilization metric in the benchmark shared in #18777 (comment)?

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Sep 8, 2025
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
❌ Gradle check result for 70904ce: FAILURE

@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Sep 27, 2025
Labels
bug Something isn't working Search:Performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Performance] Optimize the reduce merge at coordinator
3 participants