
Conversation


@vinaykpud vinaykpud commented Jul 16, 2025

Description

Optimized the reduce logic by:

Applying QuickSelect instead of a PriorityQueue when the final size is smaller than the bucket count, and simply copying all buckets and sorting them when the final size equals the bucket count (a rough sketch of this strategy is shown below).

I have added the tests in the comment section below: #18777 (comment)
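
For illustration, here is a minimal, self-contained sketch of that selection strategy. It uses a hypothetical generic comparator and a plain list rather than the actual bucket types and utilities in InternalTerms.java, so treat it as an outline of the idea, not the shipped implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified illustration of the reduce-side selection strategy (not the actual
// InternalTerms code): pick the top `size` buckets out of `buckets`.
final class TopBucketsSketch {

    static <B> List<B> topBuckets(List<B> buckets, int size, Comparator<B> order) {
        List<B> result = new ArrayList<>(buckets);
        if (size >= result.size()) {
            // Final size >= bucket count: keep every bucket and just sort once.
            result.sort(order);
            return result;
        }
        // Final size < bucket count: QuickSelect partitions the list so the top
        // `size` elements occupy the first `size` slots in O(n) average time,
        // then only that prefix is sorted (O(size log size)).
        quickSelect(result, 0, result.size() - 1, size, order);
        List<B> top = new ArrayList<>(result.subList(0, size));
        top.sort(order);
        return top;
    }

    private static <B> void quickSelect(List<B> a, int lo, int hi, int k, Comparator<B> order) {
        while (lo < hi) {
            int p = partition(a, lo, hi, order);
            if (p == k) {
                return;
            } else if (p < k) {
                lo = p + 1;
            } else {
                hi = p - 1;
            }
        }
    }

    private static <B> int partition(List<B> a, int lo, int hi, Comparator<B> order) {
        B pivot = a.get(hi);
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (order.compare(a.get(j), pivot) < 0) { // "less" means it ranks higher for the requested order
                swap(a, i++, j);
            }
        }
        swap(a, i, hi);
        return i;
    }

    private static <B> void swap(List<B> a, int i, int j) {
        B tmp = a.get(i);
        a.set(i, a.get(j));
        a.set(j, tmp);
    }
}
```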

Benchmarking results:

#18777 (comment)

Related Issues

Resolves #18705
Related: #18650

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

rishabhmaurya and others added 3 commits July 16, 2025 10:25
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
@github-actions github-actions bot added bug Something isn't working Search:Performance labels Jul 16, 2025

❌ Gradle check result for a711100: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@vinaykpud vinaykpud closed this Jul 17, 2025
@vinaykpud vinaykpud reopened this Jul 17, 2025

❌ Gradle check result for f2ad692: FAILURE

@vinaykpud vinaykpud force-pushed the coord-node-opt branch 2 times, most recently from 0b85cba to 862535c on July 17, 2025 15:06
❌ Gradle check result for 862535c: FAILURE

❌ Gradle check result for 5410283: FAILURE

❌ Gradle check result for 5087f77: FAILURE

❌ Gradle check result for c563b43: FAILURE

❌ Gradle check result for 775f402: FAILURE

❌ Gradle check result for f638796: FAILURE

@vinaykpud vinaykpud closed this Jul 17, 2025
@vinaykpud vinaykpud reopened this Jul 17, 2025

✅ Gradle check result for f638796: SUCCESS


codecov bot commented Jul 17, 2025

Codecov Report

Attention: Patch coverage is 36.00000% with 32 lines in your changes missing coverage. Please review.

Project coverage is 72.79%. Comparing base (b609a50) to head (c7089ff).
Report is 30 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...earch/aggregations/bucket/terms/InternalTerms.java | 36.00% | 31 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18777      +/-   ##
============================================
- Coverage     72.80%   72.79%   -0.02%     
- Complexity    68535    68542       +7     
============================================
  Files          5572     5573       +1     
  Lines        314779   314843      +64     
  Branches      45691    45703      +12     
============================================
+ Hits         229166   229178      +12     
- Misses        67014    67033      +19     
- Partials      18599    18632      +33     



✅ Gradle check result for 6c58e5f: SUCCESS


vinaykpud commented Jul 18, 2025

Benchmarking Results

To measure the performance improvements from this optimization, I conducted comprehensive benchmarking tests using opensearch-benchmark.

Test Setup

Initial Data Ingestion:

opensearch-benchmark execute-test --workload=big5 --target-hosts=localhost:9200 \
  --kill-running-processes --workload-params "corpus_size:60,ingest_percentage:1.16" \
  --exclude-tasks="check-cluster-health"

Test Configuration:

  • Documents ingested: 804,000 into the big5 index
  • Additional field added to the index: custom numeric data with a cardinality of 200,000, i.e. 200,000 buckets during aggregation
  • Cluster setup: Single node, single shard
  • Target task: numeric-terms-agg

Benchmark Execution:

opensearch-benchmark execute-test --pipeline=benchmark-only --workload=big5 \
  --target-hosts=localhost:9200 --kill-running-processes \
  --include-tasks "numeric-terms-agg" --telemetry=node-stats

Iteration details for this task: numeric-terms-agg

Performance Results

I measured the execution time for the reduce method across different request sizes:

| Request Size | Metric | Before Optimization | After Optimization | Improvement |
|---|---|---|---|---|
| 100K | Average | 61ms | 19ms | 69% faster |
| 100K | P90 | 68ms | 21ms | 69% faster |
| 250K | Average | 92ms | 19ms | 79% faster |
| 250K | P90 | 107ms | 20ms | 81% faster |

Resource Utilization

The optimization also significantly reduced system resource consumption:

  • Decreased CPU utilization during aggregation operations
  • Reduced heap utilization

[Screenshots: CPU and heap utilization comparison, captured 2025-07-17]

Key Findings

  • Consistent performance: The optimized solution maintains ~19-21ms execution time regardless of request size
  • Scalability improvement: Performance gains increase with larger request sizes
  • Resource efficiency: Reduced CPU and memory overhead across all test scenarios


vinaykpud commented Jul 18, 2025

We now have two approaches for reducing merged buckets to obtain the top N buckets:

  1. Priority Queue-based solution (previous implementation)
  2. Quick Select-based solution (introduced in this PR)

Based on theoretical analysis, we expected the Quick Select solution to outperform the Priority Queue when the request size (top N) is large relative to cardinality, while the Priority Queue should perform better for smaller request sizes. To validate this hypothesis, I ran performance tests measuring the execution duration of each method, similar to the previous comment.
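
For reference, here is a minimal sketch of the heap-based alternative (hypothetical names, not the actual OpenSearch priority-queue implementation): maintaining a bounded heap of size N costs roughly O(B log N) over B candidate buckets, whereas QuickSelect averages O(B) plus O(N log N) to sort the selected prefix, which is consistent with the heap winning at very small N and QuickSelect winning as N grows.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Bounded-heap top-N selection, roughly O(B log N) for B candidate buckets.
// A simplified stand-in for the previous PriorityQueue-based reduce path.
final class HeapTopNSketch {

    static <B> List<B> topBuckets(Iterable<B> buckets, int n, Comparator<B> betterFirst) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Heap ordered worst-first so the head is always the evictable candidate.
        PriorityQueue<B> heap = new PriorityQueue<>(n, betterFirst.reversed());
        for (B bucket : buckets) {
            if (heap.size() < n) {
                heap.offer(bucket);
            } else if (betterFirst.compare(bucket, heap.peek()) < 0) {
                heap.poll();        // drop the current worst of the top N
                heap.offer(bucket); // keep the better candidate
            }
        }
        List<B> result = new ArrayList<>(heap);
        result.sort(betterFirst);   // heap iteration order is unspecified, so sort once
        return result;
    }
}
```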

Performance Results:

| Request Size | Quick Select (avg) | Priority Queue (avg) | Performance Winner |
|---|---|---|---|
| 500 | ~0.003 ms | ~0.003 ms | Tie |
| 1,000 | 0.02 ms | 0.013 ms | Priority Queue |
| 5,000 | 0.1 ms | 1.1 ms | Quick Select |
| 10,000 | 1.2 ms | 3.4 ms | Quick Select |

Conclusion: The Quick Select solution introduces slight overhead for smaller request sizes (≤1,000 items) but shows significant performance improvements at 5,000+ items, making it the preferred choice for high-cardinality use cases.

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
@vinaykpud
Contributor Author

Adding @rishabhmaurya for review
@getsaurabh02

❌ Gradle check result for 0b7e48f: FAILURE

@rishabhmaurya
Contributor

See if we can reuse the code from #18702 (comment); all 3 optimizations have duplicate logic, so it's better to maintain one version of it and reuse it.

@vinaykpud
Contributor Author

Although the high-level logic looks the same, the steps and the way they are used are different. Here we follow these steps (see the sketch after this list):

  1. Process buckets, update doc count errors, and count valid buckets
  2. Populate valid buckets from the reduced buckets based on the min doc count condition
  3. Select the top buckets if needed
  4. Calculate otherDocCount

So this logic is specific to the coordinator and is not shared with NumericTermsAggregator; I am not sure we can reuse the code from Optimization in Numeric Terms Aggregation query for Large Bucket Counts #18702 (comment).
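
A rough, simplified sketch of those coordinator-side steps (the ReducedBucket type is a hypothetical stand-in; the real code in InternalTerms.java also tracks per-bucket doc count error and uses the QuickSelect-based selection from the PR description):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified outline of the coordinator-side reduce steps listed above.
// Doc count error tracking is elided for brevity.
final class CoordinatorReduceSketch {

    record ReducedBucket(String key, long docCount) {}

    record Result(List<ReducedBucket> topBuckets, long otherDocCount) {}

    static Result reduce(List<ReducedBucket> reducedBuckets, long minDocCount, int requiredSize,
                         Comparator<ReducedBucket> order) {
        // Steps 1 + 2: walk the merged buckets, keep only those satisfying min doc count,
        // and accumulate the total doc count for the otherDocCount calculation.
        long totalDocCount = 0;
        List<ReducedBucket> validBuckets = new ArrayList<>();
        for (ReducedBucket bucket : reducedBuckets) {
            totalDocCount += bucket.docCount();
            if (bucket.docCount() >= minDocCount) {
                validBuckets.add(bucket);
            }
        }

        // Step 3: select the top `requiredSize` buckets. A plain sort stands in here for
        // the QuickSelect / copy-and-sort strategy sketched in the PR description.
        validBuckets.sort(order);
        List<ReducedBucket> top =
            new ArrayList<>(validBuckets.subList(0, Math.min(requiredSize, validBuckets.size())));

        // Step 4: otherDocCount = docs in all merged buckets minus docs in the returned buckets.
        long returnedDocCount = top.stream().mapToLong(ReducedBucket::docCount).sum();
        return new Result(top, totalDocCount - returnedDocCount);
    }
}
```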


@jainankitk jainankitk left a comment


I am wondering if we should use the object allocation metric (similar to #17447) instead of the JVM utilization metric in the benchmark shared in #18777 (comment)?

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Sep 8, 2025
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
❌ Gradle check result for 70904ce: FAILURE

@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Sep 27, 2025
Labels
bug Something isn't working Search:Performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Performance] Optimize the reduce merge at coordinator
3 participants