[PoC] Partition Level Parallelism #30
base: master
Conversation
public Map<String, Integer> getSegmentToPartitionMap() {
  return _partitionIdMap;
}
It would be nice if we had a typical 2-stage partition architecture. A 1-stage partition includes:
- partition key column(s)
- partition method
- partition count

For example:

class PartitionInstance implements Partitioning {
  List<ServerInstance> _serverInstances;
}

class PartitionTable implements Partitioning {
  Map<String, List<Partitioning>> _partitionIdToInstanceListMapping;
}

and the RoutingTable is a special 2-stage partition:

class RoutingTable implements Partitioning {
  Map<String, List<Partitioning>> _partitionIdToSegmentListMapping;
}
For example, consider the following list of segments (segment, partition, server):
S1, P1, on Server1
S2, P1, on Server1
S3, P1, on Server2
S4, P2, on Server3
S5, P2, on Server4
S6, P2, on Server4
S7, P2, on Server5
The RoutingTable will look like:
_partitionIdToSegmentListMapping:
{
"P1": [ {"S1": ["Server1"], "S2": ["Server1"], "S3": ["Server2"]} ],
"P2": [ {"S4": ["Server3"], "S5": ["Server4"], "S6": ["Server4"], "S7": ["Server5"] } ]
}
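As a rough illustration, the same routing example can be written down with plain Java collections (the nested Map/List shape here is an assumption for illustration only; the actual structure would be the Partitioning implementations sketched above, and the usual java.util imports are assumed):

// Plain-Java rendering of the routing example above. Segment and server names
// come from the example; the nested map shape is illustrative only.
Map<String, Map<String, List<String>>> partitionIdToSegmentListMapping = new HashMap<>();

Map<String, List<String>> p1 = new LinkedHashMap<>();
p1.put("S1", Collections.singletonList("Server1"));
p1.put("S2", Collections.singletonList("Server1"));
p1.put("S3", Collections.singletonList("Server2"));
partitionIdToSegmentListMapping.put("P1", p1);

Map<String, List<String>> p2 = new LinkedHashMap<>();
p2.put("S4", Collections.singletonList("Server3"));
p2.put("S5", Collections.singletonList("Server4"));
p2.put("S6", Collections.singletonList("Server4"));
p2.put("S7", Collections.singletonList("Server5"));
partitionIdToSegmentListMapping.put("P2", p2);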
Rethinking this a bit: it seems to me that the RoutingTable should follow straightforwardly from how the routing is produced and dispatched. Thus:
- the server --> partition --> segment mapping might be better suited for this.
- the partition --> segment --> server route is easier to construct based on the info the routing manager has.
^ btw, this operates under the assumption that the partition-to-instance assignment for the same replica group only returns a single server (contradicting the current instance partition API, which can return multiple). It is more similar to the real-time table.
Several remarks from my end, for the partition-definition-only changes.
    _tableCache = tableCache;
  }

  public PartitionWorkerManager get(QueryPlan queryPlan, Map<Integer, List<Integer>> stageTree, long requestId,
Suggest not doing the provider for now, unless we want some pluggability. (IMO the partition worker manager should be the default.)
Yes +1
Map<String, Integer> calculatePartitionInfo(Set<String> selectedSegments) {
  Map<String, Integer> result = new HashMap<>();
  for (String segment : selectedSegments) {
    result.put(segment, _segmentMetadataCache.getPartitionId(segment));
  }
  return result;
}
We could either:
- send the segment metadata info directly over to the server, or
- keep a partition-to-segment reverse map that is updated on Helix changes.

If we have too many segments (e.g. > 10,000), the single-threaded nature of the broker request handler is going to be a problem.
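A minimal sketch of the second option, assuming a hypothetical helper maintained by the routing manager (class, method, and field names here are illustrative, not the PR's code):

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: keep a partition -> segments reverse map that is
// rebuilt when Helix reports segment changes, so the broker request handler
// does not resolve partition ids segment-by-segment on the query path.
public class PartitionReverseMap {
  private volatile Map<Integer, Set<String>> _partitionToSegments = Collections.emptyMap();

  // Called by the routing manager whenever Helix reports a segment assignment change.
  public void onSegmentAssignmentChange(Map<String, Integer> segmentToPartitionId) {
    Map<Integer, Set<String>> rebuilt = new HashMap<>();
    for (Map.Entry<String, Integer> entry : segmentToPartitionId.entrySet()) {
      rebuilt.computeIfAbsent(entry.getValue(), k -> new HashSet<>()).add(entry.getKey());
    }
    _partitionToSegments = Collections.unmodifiableMap(rebuilt);
  }

  // Query-time lookup is a single map read instead of a scan over all segments.
  public Set<String> getSegments(int partitionId) {
    return _partitionToSegments.getOrDefault(partitionId, Collections.emptySet());
  }
}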
@@ -25,25 +25,39 @@

public class RoutingTable {
  private final Map<ServerInstance, List<String>> _serverInstanceToSegmentsMap;
  private final Map<String, ServerInstance> _segmentToServerMap;
I was planning to keep 2 alternatives:

// for backward compatibility
Map<ServerInstance, List<String>> _serverInstanceToSegmentsMap;

// for future dispatch; unpartitioned dispatch will contain a `Collections.singletonMap(-1, oldSegmentLists)`
Map<ServerInstance, Map<Integer, List<String>>> _serverInstanceToPartitionedSegmentsMap;

and let the RoutingManager manage:
- the partition -> segments map
- the server -> partitions map

individually (updated upon Helix changes).
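A minimal sketch of how the unpartitioned case could be folded into the partitioned map using the -1 sentinel; the helper name is hypothetical, ServerInstance is Pinot's server handle, and the usual java.util imports are assumed:

// Hypothetical helper: wrap the legacy per-server segment lists into the
// partitioned form, with -1 as the "unpartitioned" partition id.
static Map<ServerInstance, Map<Integer, List<String>>> toPartitionedView(
    Map<ServerInstance, List<String>> serverInstanceToSegmentsMap) {
  Map<ServerInstance, Map<Integer, List<String>>> partitioned = new HashMap<>();
  for (Map.Entry<ServerInstance, List<String>> entry : serverInstanceToSegmentsMap.entrySet()) {
    // Unpartitioned dispatch: the whole segment list lives under the sentinel key -1.
    partitioned.put(entry.getKey(), Collections.singletonMap(-1, entry.getValue()));
  }
  return partitioned;
}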
@@ -46,12 +45,12 @@ public class StageMetadata implements Serializable {
  private List<String> _scannedTables;

  // used for assigning server/worker nodes.
  private List<VirtualServer> _serverInstances;
  private List<VirtualServer> _virtualServers;
I am actually:
- modifying this back to ServerInstance, and
- making it a map: Map<ServerInstance, List<Integer>> instancesToPartitionMapping

Again, use -1 here for the non-partitioned case, i.e. all servers should be responsible for all partitions.
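A sketch of what that could look like in the stage metadata, with the -1 sentinel standing in for "all partitions"; the class shape and method name are illustrative only, not the PR's final code:

import java.io.Serializable;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: each assigned server maps to the partition ids it handles;
// a singleton list of -1 means "responsible for all partitions".
public class StageMetadata implements Serializable {
  private Map<ServerInstance, List<Integer>> _instancesToPartitionMapping = new HashMap<>();

  // Non-partitioned stage: every server gets the -1 sentinel.
  void assignUnpartitioned(List<ServerInstance> servers) {
    for (ServerInstance server : servers) {
      _instancesToPartitionMapping.put(server, Collections.singletonList(-1));
    }
  }
}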
Compare d58cba3 to fe98bb0
Approach:

Say a leaf stage has VirtualServers VS = [V0, V1, V2, V3]. We will say that the leaf stage has a non-empty ColocationKey if and only if we can assign all selected segments in such a way that each segment S with partition-id P goes to VirtualServer VS[P % len(VS)].
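A minimal sketch of that assignment check, assuming we already know which virtual servers host each segment; the method and parameter names are assumptions, not the actual PR code:

// Illustrative check: the leaf stage is "localized" only if every selected
// segment with partition id P is hosted by the virtual server at index
// P % len(VS), i.e. VS[P % VS.size()].
static boolean canLocalize(Map<String, Integer> segmentToPartitionId,
    Map<String, Set<VirtualServer>> segmentToHostingServers, List<VirtualServer> virtualServers) {
  for (Map.Entry<String, Integer> entry : segmentToPartitionId.entrySet()) {
    VirtualServer target = virtualServers.get(entry.getValue() % virtualServers.size());
    Set<VirtualServer> hosts = segmentToHostingServers.getOrDefault(entry.getKey(), Collections.emptySet());
    if (!hosts.contains(target)) {
      return false;  // this segment cannot be served by its modulo-assigned worker
    }
  }
  return true;  // all segments line up with VS[P % len(VS)]
}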
This check is done in PartitionWorkerManager itself, and it marks those leaf stages as "localized". In GreedyShuffleRewriter#visitTableScan we will return a non-empty colocation key if that stage was marked localized.

Also, the shuffle-rewrite now does NOT re-assign virtual servers. It simply performs the following check:
Some other points:

Previously, stageParallelism=4 would spawn 4 threads in each server. With this PR, the behavior is instead to have 4 different workers across all servers. If the number of servers needed to serve the query is higher than the parallelism, an error is returned. This can happen if the selected segments are spread across more than stageParallelism servers.
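As a sketch of that failure mode (variable names and the exception type are illustrative, not the PR's actual error handling):

// If the selected segments span more servers than the total worker budget,
// the query cannot be planned under this model and fails fast.
int serversNeeded = serverInstanceToSegmentsMap.keySet().size();
if (serversNeeded > stageParallelism) {
  throw new IllegalStateException("Query needs " + serversNeeded
      + " servers but stageParallelism is only " + stageParallelism);
}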