
Support local replicated join and local exchange parallelism #14893

Open · wants to merge 1 commit into master
Conversation

Jackie-Jiang (Contributor) commented Jan 22, 2025

Related to #14518

Added a new table hint:

  • is_replicated (boolean)

Support local replicated join by configuring both sides with local distribution, and also hinting the right table as replicated:

SELECT /*+ joinOptions(left_distribution_type = 'local', right_distribution_type = 'local') */ a.col1, b.col2
FROM a
JOIN b /*+ tableOptions(is_replicated='true') */ ON a.col1 = b.col1

Also support parallelism for local exchange, to increase the parallelism of the intermediate stage, via the table hint partition_parallelism.
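For illustration, the two hints can be combined; a sketch (the table names a and b and the parallelism factor 4 are illustrative, and partition_parallelism follows the same tableOptions syntax as is_replicated above):

SELECT /*+ joinOptions(left_distribution_type = 'local', right_distribution_type = 'local') */ a.col1, b.col2
FROM a /*+ tableOptions(partition_parallelism='4') */
JOIN b /*+ tableOptions(is_replicated='true') */ ON a.col1 = b.col1

The intent is that each server joins its local partitions of a against its local replica of b, with the local exchange fanning data out to multiple intermediate operators on the same server.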

@Jackie-Jiang added labels on Jan 22, 2025: feature, documentation, release-notes (referenced by PRs that need attention when compiling the next release notes), multi-stage (related to the multi-stage query engine).
codecov-commenter commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 80.25210% with 47 lines in your changes missing coverage. Please review.

Project coverage is 63.73%. Comparing base (59551e4) to head (f5095e3).
Report is 1621 commits behind head on master.

Files with missing lines | Patch % | Lines
.../org/apache/pinot/query/routing/WorkerManager.java | 76.05% | 23 Missing and 11 partials ⚠️
...che/pinot/broker/routing/BrokerRoutingManager.java | 0.00% | 9 Missing ⚠️
...ery/planner/physical/MailboxAssignmentVisitor.java | 93.22% | 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14893      +/-   ##
============================================
+ Coverage     61.75%   63.73%   +1.98%     
- Complexity      207     1471    +1264     
============================================
  Files          2436     2708     +272     
  Lines        133233   151631   +18398     
  Branches      20636    23423    +2787     
============================================
+ Hits          82274    96638   +14364     
- Misses        44911    47724    +2813     
- Partials       6048     7269    +1221     
Flag | Coverage Δ
custom-integration1 | 100.00% <ø> (+99.99%) ⬆️
integration | 100.00% <ø> (+99.99%) ⬆️
integration1 | 100.00% <ø> (+99.99%) ⬆️
integration2 | 0.00% <ø> (ø)
java-11 | 63.70% <80.25%> (+1.99%) ⬆️
java-21 | 63.61% <80.25%> (+1.98%) ⬆️
skip-bytebuffers-false | 63.71% <80.25%> (+1.97%) ⬆️
skip-bytebuffers-true | 63.59% <80.25%> (+35.86%) ⬆️
temurin | 63.73% <80.25%> (+1.98%) ⬆️
unittests | 63.72% <80.25%> (+1.98%) ⬆️
unittests1 | 56.30% <83.40%> (+9.41%) ⬆️
unittests2 | 34.00% <0.00%> (+6.27%) ⬆️

Flags with carried forward coverage won't be shown.



List<String> getSegments(BrokerRequest brokerRequest) {
  Set<String> selectedSegments = _segmentSelector.select(brokerRequest);
  if (!selectedSegments.isEmpty()) {
Contributor

nit: isn't this if a bit redundant?

Contributor Author (Jackie-Jiang)

We want to short-circuit it. This is the same as calculateRouting().

@@ -55,6 +56,12 @@ public interface RoutingManager {
  @Nullable
  RoutingTable getRoutingTable(BrokerRequest brokerRequest, long requestId);

  /**
   * Returns the segments that are relevant for the given broker request.
Contributor

Let's specify here what null means.

Contributor Author (Jackie-Jiang)

Done

@gortiz (Contributor) left a comment

I would need more time to review the code and, ideally, some explanation of the decisions you made here. The changes look to me more complex than I would have expected. We are deviating from the standard Calcite semantics here (i.e., with singleton + parallelism), and I'm not sure why we need to do that.

What I would expect in this situation is that the join node uses the broadcast distribution for its right side (meaning that each incarnation of the join will see all the data). The main difference with the regular broadcast is that instead of picking one server per segment and broadcasting from them, we pick all servers that will execute the left side and read from them, sending the information to its own node.


private PinotLogicalExchange(RelOptCluster cluster, RelTraitSet traitSet, RelNode input, RelDistribution distribution,
-    PinotRelExchangeType exchangeType) {
+    PinotRelExchangeType exchangeType, List<Integer> keys) {
Contributor

In which cases will the keys be different from distribution.getKeys()? What is, in fact, the meaning of having keys = {X, Y, Z} with a distribution like random that doesn't support keys? Wouldn't it be better to change the distribution value depending on the keys? If we want to use a distribution + keys combination that is not permitted by Calcite, we can create our own implementation of RelDistribution.

Contributor Author (Jackie-Jiang)

Let me put more comments explaining this. We use SINGLETON to represent local exchange, but we also want to support parallelism for local exchange, where keys are needed. We can revisit this as we add more custom distribution types.

Comment on lines +105 to +106
// NOTE: We use SINGLETON to represent local distribution. Add keys to the exchange because we might want to
// switch it to HASH distribution to increase parallelism. See MailboxAssignmentVisitor for details.
Contributor

Reading RelDistribution.Type, shouldn't this be broadcast? The definition of broadcast is:

There are multiple instances of the stream, and all records appear in each instance

While the definition of singleton is:

There is only one instance of the stream. It sees all records.

Contributor

BTW, I don't get why we set DistributionType as SINGLETON in cases where we want to use HASH.

Contributor Author (Jackie-Jiang)

This is not broadcast because we don't want to send data to other servers. This is not strictly SINGLETON if we want to add parallelism to local exchange (split one block into multiple and spread them into multiple operators). If there is no extra parallelism (1-to-1 distribution), then it is SINGLETON.

@@ -42,7 +40,7 @@ public class MailboxAssignmentVisitor extends DefaultPostOrderTraversalVisitor<V
  public Void process(PlanNode node, DispatchablePlanContext context) {
    if (node instanceof MailboxSendNode) {
      MailboxSendNode sendNode = (MailboxSendNode) node;
-     int senderStageId = sendNode.getStageId();
+     Integer senderStageId = sendNode.getStageId();
Contributor

Why Integer? BaseNode.getStageId() always returns an int, right?

Contributor Author (Jackie-Jiang)

Correct, but using Integer can avoid a lot of boxing. I changed this to Integer to align with receiverStageId.

Jackie-Jiang (Contributor Author) replied, quoting @gortiz:

> I would need more time to review the code and, ideally, some explanation of the decisions you made here. The changes look to me more complex than I would have expected. We are deviating from the standard Calcite semantics here (i.e., with singleton + parallelism), and I'm not sure why we need to do that.
>
> What I would expect in this situation is that the join node uses the broadcast distribution for its right side (meaning that each incarnation of the join will see all the data). The main difference with the regular broadcast is that instead of picking one server per segment and broadcasting from them, we pick all servers that will execute the left side and read from them, sending the information to its own node.

Broadcast is supported in #14797, but there could still be data shuffling.
With this PR, we can completely eliminate data shuffling, and the right table is always served from the same server.
Regarding singleton + parallelism: this is needed to increase the parallelism of the intermediate stage. If we do singleton (1-to-1 exchange), there will be the same number of intermediate operators as leaf operators, which is not good enough in a lot of cases. We usually run only one leaf operator per server, but we want to run more intermediate operators to fully utilize the CPU.
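As a concrete illustration of the arithmetic (the numbers are made up): with 8 servers each running one leaf operator, a plain singleton (1-to-1) local exchange produces 8 intermediate operators in total, whereas a local exchange with partition_parallelism = 4 produces 32, still without shuffling any data across servers.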

@Jackie-Jiang requested a review from xiangfu0 on January 24, 2025.