Make repartitioning in PhysicalPlan output less confusing #9370

@alamb

Is your feature request related to a problem or challenge?

I think the use of multiple RepartitionExec and CoalesceBatchesExec nodes makes the explain plans in DataFusion hard to read. This causes users of DataFusion, especially new users, to ask whether they really need these nodes and what they are doing (see the Discord thread linked below, for example).

For example, consider this plan, which repartitions each input to a HashJoinExec. Each repartitioning requires three separate nodes (a CoalesceBatchesExec, a hash RepartitionExec, and a round-robin RepartitionExec):

ProjectionExec: expr=[name@1 as schoolname, name@3 as teachername]
  CoalesceBatchesExec: target_batch_size=8192
    HashJoinExec: mode=Partitioned, join_type=Inner, on=[(id@0, class_id@0)]
      CoalesceBatchesExec: target_batch_size=8192
        RepartitionExec: partitioning=Hash([id@0], 8), input_partitions=8
          RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
            VirtualExecutionPlan
      CoalesceBatchesExec: target_batch_size=8192
        RepartitionExec: partitioning=Hash([class_id@0], 8), input_partitions=8
          RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
            ProjectionExec: expr=[class_id@1 as class_id, name@2 as name]
              VirtualExecutionPlan

Describe the solution you'd like

Ideally I think the plan would look like this:

ProjectionExec: expr=[name@1 as schoolname, name@3 as teachername]
  CoalesceBatchesExec: target_batch_size=8192
    HashJoinExec: mode=Partitioned, join_type=Inner, on=[(id@0, class_id@0)]
      RepartitionExec: partitioning=Hash([id@0], 8), input_partitions=8 <-- repartition
        VirtualExecutionPlan
      RepartitionExec: partitioning=Hash([class_id@0], 8), input_partitions=8
        ProjectionExec: expr=[class_id@1 as class_id, name@2 as name]
          VirtualExecutionPlan
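
To make the difference between the two plans concrete, here is a rough sketch in Python that post-processes the explain text of the current plan into the shape proposed above. It is only an illustration of the desired output, not a proposed DataFusion change: `parse_plan`, `collapse`, and `render` are hypothetical helpers, and the two folding rules (hide a CoalesceBatchesExec that sits directly above a RepartitionExec, hide a round-robin RepartitionExec that sits directly below a hash RepartitionExec) are my reading of the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    children: list["Node"] = field(default_factory=list)


def parse_plan(plan: str) -> Node:
    """Build a tree from indented explain text (deeper indent = child)."""
    stack = []  # (indent, node) pairs for the current path from the root
    root = None
    for line in plan.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        node = Node(line.strip())
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if stack:
            stack[-1][1].children.append(node)
        else:
            root = node
        stack.append((indent, node))
    return root


def collapse(node: Node) -> Node:
    """Fold the three-node repartitioning pattern into one node (display only)."""
    children = [collapse(child) for child in node.children]
    only = children[0] if len(children) == 1 else None
    # Assumed rule 1: CoalesceBatchesExec directly above a RepartitionExec is hidden.
    if only and node.text.startswith("CoalesceBatchesExec") \
            and only.text.startswith("RepartitionExec"):
        return only
    # Assumed rule 2: a round-robin repartition directly below a hash one is hidden.
    if only and node.text.startswith("RepartitionExec: partitioning=Hash") \
            and only.text.startswith("RepartitionExec: partitioning=RoundRobinBatch"):
        return Node(node.text, only.children)
    return Node(node.text, children)


def render(node: Node, depth: int = 0) -> list[str]:
    """Turn the tree back into indented lines, two spaces per level."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


PLAN = """
ProjectionExec: expr=[name@1 as schoolname, name@3 as teachername]
  CoalesceBatchesExec: target_batch_size=8192
    HashJoinExec: mode=Partitioned, join_type=Inner, on=[(id@0, class_id@0)]
      CoalesceBatchesExec: target_batch_size=8192
        RepartitionExec: partitioning=Hash([id@0], 8), input_partitions=8
          RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
            VirtualExecutionPlan
      CoalesceBatchesExec: target_batch_size=8192
        RepartitionExec: partitioning=Hash([class_id@0], 8), input_partitions=8
          RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
            ProjectionExec: expr=[class_id@1 as class_id, name@2 as name]
              VirtualExecutionPlan
"""

simplified = render(collapse(parse_plan(PLAN)))
print("\n".join(simplified))
```

The CoalesceBatchesExec above the HashJoinExec is left alone because its child is not a RepartitionExec, which matches the plan above.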

Describe alternatives you've considered

I think we could do this in at least two steps:

  1. Combine the CoalesceBatchesExec into the RepartitionExec (and anything else that needs it)
  2. Use a single RepartitionExec -- I think this may require changes to RepartitionExec to ensure that the inputs can be hashed in parallel
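
As a rough sketch of what step 1 could mean, here is a toy Python model (not DataFusion code) of a hash repartition that does its own coalescing: rows are routed to an output partition by hash and buffered there until `target_batch_size` rows are available, so no separate coalescing node is needed afterwards. The function name, the row-as-dict representation, and the use of Python's built-in `hash` are all assumptions for illustration. It does not attempt step 2 (hashing the inputs in parallel).

```python
from collections import defaultdict


def hash_repartition_coalesced(batches, key, num_partitions, target_batch_size):
    """Yield (partition, batch) pairs from a stream of input batches.

    Each row goes to partition hash(row[key]) % num_partitions. Rows are
    buffered per output partition and emitted once target_batch_size rows
    have accumulated; leftovers are flushed when the input is exhausted.
    """
    buffers = defaultdict(list)
    for batch in batches:
        for row in batch:
            partition = hash(row[key]) % num_partitions
            buffers[partition].append(row)
            if len(buffers[partition]) >= target_batch_size:
                yield partition, buffers[partition]
                buffers[partition] = []
    # Flush partially filled buffers at end of input.
    for partition, rows in sorted(buffers.items()):
        if rows:
            yield partition, rows


# Toy input: 10 batches of 10 rows each, keyed by an integer id.
batches = [[{"id": i} for i in range(start, start + 10)]
           for start in range(0, 100, 10)]

for partition, out_batch in hash_repartition_coalesced(batches, "id", 4, 8):
    print(partition, len(out_batch))
```

With a plain hash repartition, each 10-row input batch would be split into several tiny batches, which is why a coalescing step follows it today; here only the last batch of each partition can be smaller than the target.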

I think care needs to be taken to ensure that

Additional context

This came from a discussion in discord: https://discord.com/channels/885562378132000778/1206315256977035394/1212085214168490015

Labels: enhancement