Configure final reduce phase threads for heavy aggregation functions #14662
xiangfu0 merged 3 commits into apache:master
Conversation
Object[] values = record.getValues();
for (int i = 0; i < numAggregationFunctions; i++) {
  int colId = i + _numKeyColumns;
  values[colId] = _aggregationFunctions[i].extractFinalResult(values[colId]);
I think it'd make sense to either:
- put an upper limit on _numThreadsForFinalReduce (e.g. 2 or 3 * Runtime.getRuntime().availableProcessors()), or
- change the variable to a boolean flag enableParallelFinalReduce and use a sensible number of tasks

to prevent using an excessive number of futures or hitting various error modes, e.g. if _numThreadsForFinalReduce is Integer.MAX_VALUE then chunkSize is going to be negative.
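To illustrate the overflow concern, here is a standalone sketch (not Pinot code; `naiveChunkSize` and `safeChunkSize` are hypothetical names) showing how a ceiling division computed from an unbounded thread count wraps around, and how clamping the thread count first avoids it:

```java
public class ChunkSizeDemo {
  // Naive ceiling division: listSize + numThreads - 1 overflows int when
  // numThreads is huge (e.g. Integer.MAX_VALUE), yielding a non-positive chunk size.
  static int naiveChunkSize(int listSize, int numThreads) {
    return (listSize + numThreads - 1) / numThreads;
  }

  // Clamped variant: cap the thread count (here at 2 * available processors,
  // an assumed limit matching the review suggestion) before dividing.
  static int safeChunkSize(int listSize, int numThreads) {
    int maxThreads = Math.max(1, 2 * Runtime.getRuntime().availableProcessors());
    int threads = Math.max(1, Math.min(numThreads, maxThreads));
    return (listSize + threads - 1) / threads;
  }

  public static void main(String[] args) {
    // The naive formula overflows and produces a useless (non-positive) chunk size.
    System.out.println(naiveChunkSize(10_000, Integer.MAX_VALUE));
    // The clamped formula always yields a chunk size of at least 1.
    System.out.println(safeChunkSize(10_000, Integer.MAX_VALUE));
  }
}
```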
If the shared thread pool is overwhelmed by running tasks, it might be good to use the current thread not only to wait but also for task processing, stealing tasks until there's nothing left and only then waiting for the futures to finish.
> If the shared thread pool is overwhelmed by running tasks, it might be good to use the current thread not only to wait but also for task processing, stealing tasks until there's nothing left and only then waiting for the futures to finish.
Potentially, and this can be done transparently by configuring the executor's rejected execution handler to CallerRunsPolicy. However, beware: if the executor, which does non-blocking work, is sized to the number of available processors, then a thread pool that is overwhelmed means the available CPUs are overwhelmed too. Performing reductions on the caller thread would only lead to excessive context switching, and it might be better, from a global perspective, for the task to wait for capacity to become available.
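For reference, a minimal sketch (not from the PR) of the approach mentioned above: a bounded `ThreadPoolExecutor` configured with `CallerRunsPolicy`, so that tasks overflowing the queue run on the submitting thread instead of being rejected:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CallerRunsDemo {
  static int runTasks(int numTasks) throws InterruptedException {
    int cores = Math.max(1, Runtime.getRuntime().availableProcessors());
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
        cores, cores, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(4),                // deliberately small bounded queue
        new ThreadPoolExecutor.CallerRunsPolicy()); // overflow runs on the caller thread
    AtomicInteger completed = new AtomicInteger();
    for (int i = 0; i < numTasks; i++) {
      // execute() never throws RejectedExecutionException with CallerRunsPolicy
      executor.execute(completed::incrementAndGet);
    }
    executor.shutdown();
    executor.awaitTermination(10, TimeUnit.SECONDS);
    return completed.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(runTasks(100)); // all tasks complete, none rejected
  }
}
```

As the comment notes, this trades rejection for caller-thread work, which is only a win when the caller would otherwise idle.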
@@ -232,6 +232,12 @@ public static Integer getGroupTrimThreshold(Map<String, String> queryOptions) {
  return uncheckedParseInt(QueryOptionKey.GROUP_TRIM_THRESHOLD, groupByTrimThreshold);
}
Would it be possible to show that final reduce is parallelized in the explain output?
Can we do this automatically if the number of keys > X and for specific aggregation functions like funnel etc.?
Put some heuristic logic here.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14662       +/-   ##
============================================
+ Coverage     61.75%   63.73%    +1.98%
- Complexity      207     1469     +1262
============================================
  Files          2436     2708      +272
  Lines        133233   151490    +18257
  Branches      20636    23389     +2753
============================================
+ Hits          82274    96551    +14277
- Misses        44911    47683     +2772
- Partials       6048     7256     +1208
 */
@SuppressWarnings({"rawtypes", "unchecked"})
public abstract class IndexedTable extends BaseTable {
  private static final int THREAD_POOL_SIZE = Math.max(Runtime.getRuntime().availableProcessors(), 1);
(minor) Some constants are available in ResourceManager
This name is also confusing. It seems this is the upper bound used when _numThreadsForServerFinalReduce is not configured. Why not use the same upper bound?
True, reusing QueryMultiThreadingUtils.MAX_NUM_THREADS_PER_QUERY
ceac8f5 to
4d67c0b
Compare
Jackie-Jiang left a comment:
LGTM with minor comments
_trimThreshold = trimThreshold;
// NOTE: The upper limit of threads number for final reduce is set to 2 * number of available processors by default
_numThreadsExtractFinalResult = Math.min(queryContext.getNumThreadsExtractFinalResult(),
    Math.max(1, 2 * Runtime.getRuntime().availableProcessors()));
We should probably cap it at the number of CPU cores because this is a CPU-heavy operation.
for (int threadId = 0; threadId < numThreadsExtractFinalResult; threadId++) {
  int startIdx = threadId * chunkSize;
  int endIdx = Math.min(startIdx + chunkSize, topRecordsList.size());
  if (startIdx < endIdx) {
Not always the case in the test with a very small segment.
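The chunking pattern under discussion can be sketched end to end as follows (a simplified standalone version with hypothetical names, using a `long[]` and a doubling transform as a stand-in for extracting final aggregation results; not the actual Pinot classes). Note the `startIdx < endIdx` guard, which skips empty chunks when the record list is smaller than `numThreads * chunkSize`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFinalReduce {
  // Applies a per-row "final result" transform in parallel, one contiguous chunk per task.
  static void reduceInParallel(long[] rows, int numThreads) throws Exception {
    int chunkSize = (rows.length + numThreads - 1) / numThreads;  // ceiling division
    ExecutorService executor = Executors.newFixedThreadPool(numThreads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (int threadId = 0; threadId < numThreads; threadId++) {
        int startIdx = threadId * chunkSize;
        int endIdx = Math.min(startIdx + chunkSize, rows.length);
        if (startIdx < endIdx) {  // guard: skip empty chunks for small inputs
          futures.add(executor.submit(() -> {
            for (int i = startIdx; i < endIdx; i++) {
              rows[i] = rows[i] * 2;  // stand-in for extractFinalResult(...)
            }
          }));
        }
      }
      for (Future<?> future : futures) {
        future.get();  // propagate any task failure
      }
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    long[] rows = {1, 2, 3, 4, 5};
    // More threads than rows: the guard skips the empty trailing chunks.
    reduceInParallel(rows, 8);
    System.out.println(java.util.Arrays.toString(rows));
  }
}
```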
public static final int DEFAULT_NUM_THREADS_FOR_FINAL_REDUCE = 1;
public static final int DEFAULT_PARALLEL_CHUNK_SIZE_FOR_FINAL_REDUCE = 10_000;
…pache#14662) * Configure final reduce phase threads for heavy aggregation functions * Address comments * Add tests with numThreadsForFinalReduce
Add a new query option numThreadsForFinalReduce to allow customizing the number of threads per aggregate/reduce call. This will significantly reduce the execution time of aggregation group-by queries where there are many groups and each group's final reduce is very costly, such as funnel functions.
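Assuming the option is exposed like other Pinot query options (the SET syntax below is standard Pinot; the specific value 4 is just an illustration), usage might look like:

```sql
-- Enable 4 threads for the final reduce phase of this query
SET numThreadsForFinalReduce = 4;
SELECT userCol, FUNNEL_COUNT(...)
FROM myTable
GROUP BY userCol;
```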