
[PoC] Partition Level Parallelism #30

Draft
wants to merge 4 commits into base: master
Conversation

@ankitsultana (Owner) commented Mar 3, 2023

Approach:

Say a leaf stage has VirtualServers VS: [V0, V1, V2, V3]. We say that the leaf stage has a non-empty ColocationKey if and only if we can assign all selected segments such that each segment S with partition-id=P goes to VirtualServer VS[P % len(VS)].

This check is done in PartitionWorkerManager itself, which marks those leaf stages as "localized". In GreedyShuffleRewriter#visitTableScan we then return a non-empty colocation key if the stage was marked localized.
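
To make the invariant concrete, here is a minimal sketch of that check, with a generic parameter standing in for VirtualServer (names are hypothetical, not the PR's actual PartitionWorkerManager code):

import java.util.List;
import java.util.Map;

final class LocalizationCheck {
  // Hypothetical sketch: a leaf stage can keep a non-empty colocation key only
  // if every selected segment with partition-id P is hosted on VS[P % len(VS)].
  // <S> stands in for VirtualServer.
  static <S> boolean isLocalized(List<S> virtualServers,
      Map<String, Integer> segmentToPartitionId, Map<String, S> segmentToAssignedServer) {
    int n = virtualServers.size();
    for (Map.Entry<String, Integer> e : segmentToPartitionId.entrySet()) {
      S expected = virtualServers.get(e.getValue() % n);
      if (!expected.equals(segmentToAssignedServer.get(e.getKey()))) {
        return false;  // this segment breaks the P % len(VS) invariant
      }
    }
    return true;  // all segments are colocated; the leaf stage can be marked "localized"
  }
}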

Also, the shuffle-rewrite now does NOT re-assign virtual servers. It simply performs the following checks:

  1. Whether the colocation keys allow skipping the shuffle.
  2. Whether the number of query-partitions is the same.

Some other points:

  1. The virtualId of a VirtualServer (aka the query-partition id) is global. On current master, setting stageParallelism=4 spawns 4 threads in each server. With this PR, the behavior is to have 4 workers in total across all servers. If the number of servers needed to serve the query is higher than the parallelism, an error is returned; this can happen when the selected segments span more than stageParallelism servers (see the sketch below).
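
As a rough illustration of that last point (hypothetical names, not code from this PR), the dispatcher-side guard could look like:

import java.util.Set;

final class ParallelismGuard {
  // Hypothetical sketch: with stageParallelism=4 there are 4 workers in total
  // across all servers, so the query fails fast when the selected segments
  // span more servers than the configured parallelism.
  static void checkParallelism(int stageParallelism, Set<String> serversNeeded) {
    if (serversNeeded.size() > stageParallelism) {
      throw new IllegalStateException("Selected segments span " + serversNeeded.size()
          + " servers, which exceeds stageParallelism=" + stageParallelism);
    }
  }
}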

Comment on lines +60 to +62
public Map<String, Integer> getSegmentToPartitionMap() {
  return _partitionIdMap;
}


it would be nice if we had a typical 2-stage partition architecture

a 1-stage partition includes

  1. partition key column(s)
  2. partition method
  3. partition count

for example:
class PartitionInstance implements Partitioning {
  List<ServerInstance> _serverInstances;  
}
class PartitionTable implements Partitioning {
  Map<String, List<Partitioning>> _partitionIdToInstanceListMapping;
}

and the RoutingTable is a special 2-stage partition

class RoutingTable implements Partitioning {
  Map<String, List<Partitioning>> _partitionIdToSegmentListMapping;
}

for example, the following list of segments:

S1, P1, on Server1
S2, P1, on Server1
S3, P1, on Server2
S4, P2, on Server3
S5, P2, on Server4
S6, P2, on Server4
S7, P2, on Server5

The Routing Table will look like

_partitionIdToSegmentListMapping: 
{
  "P1": [ {"S1": ["Server1"], "S2": ["Server1"], "S3": ["Server2"]} ],
  "P2": [ {"S4": ["Server3"], "S5": ["Server4"], "S6": ["Server4"], "S7": ["Server5"] } ]
}
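
For concreteness, a small runnable sketch that materializes that example mapping with plain Java collections (the Partitioning/RoutingTable shape above is this comment's proposal, not the PR's current code):

import java.util.List;
import java.util.Map;

final class RoutingTableExample {
  public static void main(String[] args) {
    // partition -> segment -> server(s) for the S1..S7 example above
    Map<String, Map<String, List<String>>> partitionIdToSegmentListMapping = Map.of(
        "P1", Map.of(
            "S1", List.of("Server1"),
            "S2", List.of("Server1"),
            "S3", List.of("Server2")),
        "P2", Map.of(
            "S4", List.of("Server3"),
            "S5", List.of("Server4"),
            "S6", List.of("Server4"),
            "S7", List.of("Server5")));
    System.out.println(partitionIdToSegmentListMapping);
  }
}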


rethinking this a bit: it seems to me that the RoutingTable should follow straightforwardly from how the routing is produced and dispatched. thus

  • the server --> partition --> segment shape might be better suited for dispatch.
  • the partition --> segment --> server route is easier to construct based on the info the routing manager has


^ btw this operates under the assumption that partition-to-instance assignment for the same replica group only returns a single server (contradicting the current instance-partition API, which can return multiple)
it is more similar to the real-time table


@walterddr left a comment


several remarks from my end, for the partition-definition changes only.

_tableCache = tableCache;
}

public PartitionWorkerManager get(QueryPlan queryPlan, Map<Integer, List<Integer>> stageTree, long requestId,


suggest not doing the provider for now, unless we want some pluggability. (IMO the partition worker manager should be the default)

@ankitsultana (Owner, Author)

Yes +1

Comment on lines +762 to +768
Map<String, Integer> calculatePartitionInfo(Set<String> selectedSegments) {
  Map<String, Integer> result = new HashMap<>();
  for (String segment : selectedSegments) {
    result.put(segment, _segmentMetadataCache.getPartitionId(segment));
  }
  return result;
}


we can

  1. send the segment metadata info directly over to the server, or
  2. keep a partition-to-segment reverse map updated during helix changes

if we have too many segments (e.g. > 10,000), the single-threaded nature of the broker request handler is going to be a problem
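
A rough sketch of option 2, assuming the Helix change listener can push updates into a broker-side cache (all names here are hypothetical):

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class PartitionToSegmentsCache {
  // Hypothetical partition -> segments reverse map, maintained incrementally on
  // Helix changes so the request path never scans all segments per query.
  private final Map<Integer, Set<String>> _partitionToSegments = new ConcurrentHashMap<>();

  // Called from the Helix change listener when a segment is added or refreshed.
  void onSegmentAdded(String segment, int partitionId) {
    _partitionToSegments.computeIfAbsent(partitionId, k -> ConcurrentHashMap.newKeySet()).add(segment);
  }

  // Called when a segment is removed.
  void onSegmentRemoved(String segment, int partitionId) {
    Set<String> segments = _partitionToSegments.get(partitionId);
    if (segments != null) {
      segments.remove(segment);
    }
  }

  Set<String> getSegments(int partitionId) {
    return _partitionToSegments.getOrDefault(partitionId, Set.of());
  }
}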

@@ -25,25 +25,39 @@

public class RoutingTable {
private final Map<ServerInstance, List<String>> _serverInstanceToSegmentsMap;
private final Map<String, ServerInstance> _segmentToServerMap;


i was planning to keep 2 alternatives

// for backward compatibility
Map<ServerInstance, List<String>> _serverInstanceToSegmentsMap;
// for future dispatch; an unpartitioned dispatch will contain a `Collections.singletonMap(-1, oldSegmentLists)`
Map<ServerInstance, Map<Integer, List<String>>> _serverInstanceToPartitionedSegmentsMap;

and let the RoutingManager manage

  • partition->segments map
  • server->partitions map

individually (updated upon helix changes)
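
A short sketch of the backward-compatible conversion implied above, with a generic parameter standing in for ServerInstance (illustrative only):

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class UnpartitionedDispatch {
  // Hypothetical helper: an unpartitioned dispatch reuses the legacy per-server
  // segment lists under the sentinel partition id -1. <S> stands in for ServerInstance.
  static <S> Map<S, Map<Integer, List<String>>> toUnpartitioned(Map<S, List<String>> serverToSegments) {
    Map<S, Map<Integer, List<String>>> result = new HashMap<>();
    for (Map.Entry<S, List<String>> e : serverToSegments.entrySet()) {
      result.put(e.getKey(), Collections.singletonMap(-1, e.getValue()));
    }
    return result;
  }
}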

@@ -46,12 +45,12 @@ public class StageMetadata implements Serializable {
private List<String> _scannedTables;

// used for assigning server/worker nodes.
private List<VirtualServer> _serverInstances;
private List<VirtualServer> _virtualServers;


I am actually

  1. modifying this back to ServerInstance and
  2. making it a map: Map<ServerInstance, List<Integer>> instancesToPartitionMapping

again, use -1 here for non-partitioned, i.e. all servers should be responsible for all partitions
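
A tiny runnable example of that shape, with String keys standing in for ServerInstance and -1 marking a non-partitioned assignment (illustrative only):

import java.util.List;
import java.util.Map;

final class StageMetadataShapeExample {
  public static void main(String[] args) {
    // Partitioned dispatch: each server owns an explicit set of partitions.
    Map<String, List<Integer>> partitioned = Map.of(
        "server1", List.of(0, 2),
        "server2", List.of(1, 3));
    // Non-partitioned dispatch: every server is responsible for all partitions.
    Map<String, List<Integer>> nonPartitioned = Map.of(
        "server1", List.of(-1),
        "server2", List.of(-1));
    System.out.println(partitioned + " / " + nonPartitioned);
  }
}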

@ankitsultana force-pushed the master branch 2 times, most recently from d58cba3 to fe98bb0 on May 11, 2023 at 20:31