[FEA] GpuOutOfCoreSort over Wide Tables #14072

@sperlingxx

Description

This is an umbrella issue tracking several proposed optimizations aimed at resolving the severe performance degradation observed when running GpuOutOfCoreSort on extremely wide tables (~100 columns, including large columns such as ArrayType or long strings) in Spark-Rapids. This scenario often forces frequent, heavy spilling to host memory, creating significant I/O and memory bottlenecks.

Problem Description

This optimization proposal is directly motivated by severe performance bottlenecks observed in large-scale data warehousing environments, specifically concerning final data persistence. These systems often process high volumes of data and utilize wide tables, which together create an extremely challenging scenario for GPU-accelerated sorting.

The primary architectural constraint is the business requirement for a mandated final sort order (typically on primary keys) before writing the dataset to Parquet tables via a GpuInsertIntoHiveTable command. This requirement necessitates an extremely expensive GpuSortExec operation near the end of the query pipeline.

Query Plan Pattern: The Root Cause (Column Expansion Pattern)

The costly final GpuSortExec is a direct consequence of a preceding query plan structure that expands a small initial table into a massive final table through a deep chain of join operations.

TYPICAL SCENARIO: The OutOfCoreSort Bottleneck

GpuInsertIntoHiveTable [Target: Parquet]
│
│ # 🛑 CRITICAL PERFORMANCE BOTTLENECK 🛑
│ # PROBLEM: An IMPLICIT GpuSortExec must occur here because the target
│ # Hive table is defined with sort columns.
│ #
│ # WHY IT FAILS: The data stream at this point is at its MAXIMUM width
│ # (~200 columns). With so much data per row, the sort far exceeds GPU
│ # memory, forcing expensive out-of-core (OOC) spills to host memory/disk.
└── GpuProject [Final Projection: ~100 Columns]
    # Data width is ~100 columns here.
    └── GpuShuffledAsymmetricHashJoin / GpuSizedHashJoin [Left Outer]
        # (Stream-side Bucket/Local Join)
        # Further Expansion: Dozens of columns entering this join.
        ├── Build Side: Join Table 3 (Bucket Scan/Shuffle Read)
        └── GpuShuffledAsymmetricHashJoin / GpuSizedHashJoin [Left Outer]
            # (Stream-side Bucket/Local Join)
            # Further Expansion: Dozens of columns entering this join.
            ├── Build Side: Join Table 2 (Bucket Scan/Shuffle Read)
            └── GpuShuffledAsymmetricHashJoin / GpuSizedHashJoin [Left Outer]
                # (Stream-side Bucket/Local Join)
                # Column Expansion begins here.
                ├── Build Side: Join Table 1 (Bucket Scan/Shuffle Read)
                └── GpuFileGpuScan [Source: Parquet] # NARROW START: Only < 10 columns read.

Core Pain Points

A wide-table sort such as the final GpuSortExec above severely degrades overall GPU execution performance because:

  1. Super Heavy OutOfCoreSort & I/O Bottleneck: The required sort operates on a massive dataset that far exceeds available GPU memory, frequently triggering Out-of-Core Sort logic. This forces a huge volume of spill data to be transferred to host memory/disk, introducing significant I/O overhead.
  2. Low GPU Concurrency: The extreme I/O and memory pressure severely restrict parallel execution. Observed metrics show the average concurrent execution ratio is approximately 0.05, indicating that GPU tasks are idle—waiting for I/O and memory resources—for over 95% of their runtime.

Potential Approaches for Improvement

  • Sort Key Only Boundaries
    • During OutOfCoreSort boundary computation, copy only the sort key columns (not all columns) from GPU to host (see the first sketch after this list).
  • Cascade Compression (see #13948: "[FEA] Use GPU compression in spill framework for better performance")
  • Adaptive Compression
    • Intelligently decide whether to compress spill data based on runtime conditions (memory pressure, spill frequency, and historical compression effectiveness) to avoid overhead for incompressible data (see the second sketch after this list).
  • Priority-Based Proactive Spilling
    • Proactively spill batches from the pending queue when GPU memory pressure is detected, prioritizing batches that are furthest from being needed based on their sort order (first row value); see the third sketch after this list.
  • Adaptive OOC Split Sizing
    • Dynamically adjust the size of the splits/batches used during the OutOfCoreSort process to better manage GPU memory usage and reduce spilling overhead (see the fourth sketch after this list).
  • Enable cudf::merge for nested types (see #14060: "Enable efficient merge sort for nested types")
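
A minimal sketch of the "Sort Key Only Boundaries" idea, using Spark's public ColumnarBatch API. The keyOnlyView helper and the keyIndices parameter are hypothetical illustrations, not the plugin's actual API; a real implementation would derive the key ordinals from the sort's SortOrder expressions.

```scala
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

object SortKeyOnlyBoundaries {
  // Build a narrow view containing only the sort-key columns, so the
  // device-to-host copy used for boundary computation moves key bytes only,
  // instead of all ~100 columns.
  def keyOnlyView(batch: ColumnarBatch, keyIndices: Seq[Int]): ColumnarBatch = {
    val keyCols: Array[ColumnVector] = keyIndices.map(i => batch.column(i)).toArray
    // The new batch shares the underlying vectors; nothing is copied on the
    // device here. Only this narrow view is then transferred to host.
    new ColumnarBatch(keyCols, batch.numRows())
  }
}
```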
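
A minimal sketch of the adaptive-compression decision. The SpillStats type, its three signals, and all thresholds are assumptions for illustration; a real heuristic would be tuned against production workloads.

```scala
// Runtime signals assumed observable by the spill framework.
final case class SpillStats(
    memoryPressure: Double,       // fraction of the GPU pool in use, 0.0-1.0
    spillsPerMinute: Double,      // recent spill frequency
    avgCompressionRatio: Double)  // historical compressed/uncompressed size

object AdaptiveCompression {
  // Compress only when spills are frequent enough to amortize the cost and
  // the data has historically compressed well; skip compression for
  // effectively incompressible data (ratio close to 1.0).
  def shouldCompress(stats: SpillStats): Boolean = {
    val compressesWell = stats.avgCompressionRatio < 0.8
    val underPressure  = stats.memoryPressure > 0.85 || stats.spillsPerMinute > 10
    compressesWell && underPressure
  }
}
```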
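
A sketch of priority-based proactive spilling using a max-heap ordered by each batch's first-row key. SpillableBatch and its firstKey field are hypothetical stand-ins for the plugin's spillable batch handle.

```scala
import scala.collection.mutable

// Hypothetical handle for a batch waiting in the pending queue.
final case class SpillableBatch(id: Long, firstKey: Long, sizeBytes: Long) {
  def spill(): Unit = println(s"spilling batch $id (${sizeBytes}B) to host")
}

class ProactiveSpiller {
  // Max-heap on firstKey: the head is the batch whose data is needed
  // furthest in the future by an ascending merge.
  private val pending = mutable.PriorityQueue.empty[SpillableBatch](
    Ordering.by((b: SpillableBatch) => b.firstKey))

  def add(b: SpillableBatch): Unit = pending.enqueue(b)

  // On detected memory pressure, spill the least-urgent batches first until
  // enough device memory has been released.
  def onMemoryPressure(bytesToFree: Long): Unit = {
    var freed = 0L
    while (freed < bytesToFree && pending.nonEmpty) {
      val victim = pending.dequeue()
      victim.spill()
      freed += victim.sizeBytes
    }
  }
}
```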
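
A sketch of adaptive OOC split sizing: derive the split size from free device memory and task concurrency instead of a fixed configuration value. The freeDeviceMemory input, the 4x headroom factor, and the bounds are all assumptions.

```scala
object AdaptiveSplitSizing {
  private val MinSplitBytes = 16L << 20   // floor: 16 MiB per split
  private val MaxSplitBytes = 512L << 20  // cap: 512 MiB, keeps merge fan-in sane

  // Reserve headroom for the sort's scratch space, then clamp the candidate
  // size into [MinSplitBytes, MaxSplitBytes].
  def targetSplitBytes(freeDeviceMemory: Long, concurrentTasks: Int): Long = {
    val perTask   = freeDeviceMemory / math.max(1, concurrentTasks)
    val candidate = perTask / 4 // assumed 4x headroom for sort working memory
    math.min(MaxSplitBytes, math.max(MinSplitBytes, candidate))
  }
}
```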
