Closed
Labels
enhancement (New feature or request)
Description
Is your feature request related to a problem or challenge?
I think the use of multiple RepartitionExec and CoalesceBatchesExec nodes makes explain plans in DataFusion hard to read. This leads users of DataFusion, especially new users, to ask whether these nodes are really needed and what they are doing (see the Discord thread, for example).
For example, consider this plan, which repartitions each input to a HashJoin, but where each repartitioning requires three separate nodes:
ProjectionExec: expr=[name@1 as schoolname, name@3 as teachername]
  CoalesceBatchesExec: target_batch_size=8192
    HashJoinExec: mode=Partitioned, join_type=Inner, on=[(id@0, class_id@0)]
      CoalesceBatchesExec: target_batch_size=8192
        RepartitionExec: partitioning=Hash([id@0], 8), input_partitions=8
          RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
            VirtualExecutionPlan
      CoalesceBatchesExec: target_batch_size=8192
        RepartitionExec: partitioning=Hash([class_id@0], 8), input_partitions=8
          RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
            ProjectionExec: expr=[class_id@1 as class_id, name@2 as name]
              VirtualExecutionPlan
Describe the solution you'd like
Ideally I think the plan would look like this:
ProjectionExec: expr=[name@1 as schoolname, name@3 as teachername]
  CoalesceBatchesExec: target_batch_size=8192
    HashJoinExec: mode=Partitioned, join_type=Inner, on=[(id@0, class_id@0)]
      RepartitionExec: partitioning=Hash([id@0], 8), input_partitions=8 <-- repartition
        VirtualExecutionPlan
      RepartitionExec: partitioning=Hash([class_id@0], 8), input_partitions=8
        ProjectionExec: expr=[class_id@1 as class_id, name@2 as name]
          VirtualExecutionPlan
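To make the difference in plan shape concrete, here is a toy sketch that collapses each three-node chain into a single hash repartition node. The `Node` type and the `simplify`, `render`, and `example_plan` functions are hypothetical names made up for this illustration; they are not DataFusion's `ExecutionPlan` API, and the rewrite only changes the tree shape, not how anything executes.

```rust
/// Toy plan node: a label plus children (hypothetical, not DataFusion's API).
#[derive(Debug, Clone, PartialEq)]
struct Node {
    label: String,
    children: Vec<Node>,
}

fn node(label: &str, children: Vec<Node>) -> Node {
    Node { label: label.to_string(), children }
}

fn is_hash(label: &str) -> bool {
    label.starts_with("RepartitionExec: partitioning=Hash")
}

fn is_round_robin(label: &str) -> bool {
    label.starts_with("RepartitionExec: partitioning=RoundRobinBatch")
}

/// Collapse `CoalesceBatchesExec -> Hash repartition -> RoundRobin repartition`
/// chains so that only the hash repartition node remains.
fn simplify(n: Node) -> Node {
    let Node { label, mut children } = n;
    // Drop a CoalesceBatchesExec sitting directly on top of a hash repartition.
    if label.starts_with("CoalesceBatchesExec")
        && children.len() == 1
        && is_hash(&children[0].label)
    {
        return simplify(children.remove(0));
    }
    // Drop a round-robin repartition sitting directly below a hash repartition.
    if is_hash(&label) && children.len() == 1 && is_round_robin(&children[0].label) {
        children = children.remove(0).children;
    }
    Node { label, children: children.into_iter().map(simplify).collect() }
}

/// Render the tree with two spaces of indentation per level.
fn render(n: &Node, depth: usize, out: &mut String) {
    out.push_str(&"  ".repeat(depth));
    out.push_str(&n.label);
    out.push('\n');
    for child in &n.children {
        render(child, depth + 1, out);
    }
}

/// The plan from this issue, built by hand.
fn example_plan() -> Node {
    let left = node("CoalesceBatchesExec: target_batch_size=8192", vec![
        node("RepartitionExec: partitioning=Hash([id@0], 8), input_partitions=8", vec![
            node("RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1", vec![
                node("VirtualExecutionPlan", vec![]),
            ]),
        ]),
    ]);
    let right = node("CoalesceBatchesExec: target_batch_size=8192", vec![
        node("RepartitionExec: partitioning=Hash([class_id@0], 8), input_partitions=8", vec![
            node("RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1", vec![
                node("ProjectionExec: expr=[class_id@1 as class_id, name@2 as name]", vec![
                    node("VirtualExecutionPlan", vec![]),
                ]),
            ]),
        ]),
    ]);
    node("ProjectionExec: expr=[name@1 as schoolname, name@3 as teachername]", vec![
        node("CoalesceBatchesExec: target_batch_size=8192", vec![
            node("HashJoinExec: mode=Partitioned, join_type=Inner, on=[(id@0, class_id@0)]",
                 vec![left, right]),
        ]),
    ])
}

fn main() {
    let mut text = String::new();
    render(&simplify(example_plan()), 0, &mut text);
    print!("{text}");
}
```

The sketch keeps the `CoalesceBatchesExec` above the `HashJoinExec`, since only the nodes wrapped around a hash repartition are folded away.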
Describe alternatives you've considered
I think we could do this in at least two steps:
- Combine the CoalesceBatchesExec into the RepartitionExec (and anything else that needs it)
- Use a single RepartitionExec. I think this may require changes to the RepartitionExec to ensure that the inputs can be hashed in parallel
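The first step could look roughly like the sketch below: one operator that hash-partitions incoming rows and buffers each output partition until it reaches the target batch size. This is a minimal single-threaded illustration under my own assumptions, not DataFusion's implementation. `CoalescingHashRepartitioner` and `partition_of` are hypothetical names, rows are reduced to bare `i64` keys, and the parallel hashing mentioned in the second step is not addressed here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical operator: hash-partitions rows and buffers each output
/// partition until it holds `target_batch_size` rows.
struct CoalescingHashRepartitioner {
    target_batch_size: usize,
    buffers: Vec<Vec<i64>>,
}

/// Map a key to an output partition by hashing it.
fn partition_of(key: i64, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

impl CoalescingHashRepartitioner {
    fn new(num_partitions: usize, target_batch_size: usize) -> Self {
        Self {
            target_batch_size,
            buffers: vec![Vec::new(); num_partitions],
        }
    }

    /// Route every key in `batch` to its partition and return the
    /// (partition, batch) pairs that reached the target size.
    fn push(&mut self, batch: &[i64]) -> Vec<(usize, Vec<i64>)> {
        let mut ready = Vec::new();
        for &key in batch {
            let p = partition_of(key, self.buffers.len());
            self.buffers[p].push(key);
            if self.buffers[p].len() == self.target_batch_size {
                ready.push((p, std::mem::take(&mut self.buffers[p])));
            }
        }
        ready
    }

    /// Flush the partially filled buffers once the input is exhausted.
    fn finish(&mut self) -> Vec<(usize, Vec<i64>)> {
        self.buffers
            .iter_mut()
            .enumerate()
            .filter(|(_, buf)| !buf.is_empty())
            .map(|(p, buf)| (p, std::mem::take(buf)))
            .collect()
    }
}

fn main() {
    let mut op = CoalescingHashRepartitioner::new(4, 3);
    let input: Vec<i64> = (0..20).collect();
    let mut out = Vec::new();
    // Feed many small input batches, as a streaming source might.
    for chunk in input.chunks(2) {
        out.extend(op.push(chunk));
    }
    out.extend(op.finish());
    for (p, batch) in &out {
        println!("partition {p}: {batch:?}");
    }
}
```

Because the buffering happens inside the repartitioning step, no separate coalescing node would be needed downstream of it in this sketch.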
I think care needs to be taken to ensure that
Additional context
This came from a discussion in Discord: https://discord.com/channels/885562378132000778/1206315256977035394/1212085214168490015