Description
This is an umbrella issue tracking several proposed optimizations aimed at resolving severe performance degradation observed when running GpuOutOfCoreSort on extremely wide tables (~100 columns, including large columns such as ArrayType or long strings) in Spark-Rapids. This scenario often forces frequent and heavy spilling to host memory, creating significant I/O and memory bottlenecks.
Problem Description
This optimization proposal is directly motivated by severe performance bottlenecks observed in large-scale data warehousing environments, specifically concerning final data persistence. These systems often process high volumes of data and utilize wide tables, which together create an extremely challenging scenario for GPU-accelerated sorting.
The primary architectural constraint is a business requirement that mandates a final sort order (typically on primary keys) before writing the dataset to Parquet tables via a GpuInsertIntoHiveTable command. This requirement necessitates an extremely expensive GpuSortExec operation near the end of the query pipeline.
Query Plan Pattern: The Root Cause (Column Expansion Pattern)
The costly final GpuSortExec is a direct consequence of a preceding query plan structure that expands a small initial table into a massive final table through a deep chain of join operations.
TYPICAL SCENARIO: The OutOfCoreSort Bottleneck
GpuInsertIntoHiveTable [Target: Parquet]
│
│ # 🛑 CRITICAL PERFORMANCE BOTTLENECK 🛑
│ # PROBLEM: An IMPLICIT GpuSortExec must occur here because the target
│ # Hive table is defined with sort columns.
│ #
│ # WHY IT FAILS: The data stream at this point is at its MAXIMUM width
│ # (~100 columns). Sorting this massive amount of data per row exceeds
│ # GPU memory, forcing expensive spills to host memory/disk (OOC).
└── GpuProject [Final Projection: ~100 Columns]
# Data width is ~100 columns here.
└── GpuShuffledAsymmetricHashJoin / GpuSizedHashJoin [Left Outer]
# (Stream-side Bucket/Local Join)
# Further Expansion: Dozens of columns entering this join.
├── Build Side: Join Table 3 (Bucket Scan/Shuffle Read)
└── GpuShuffledAsymmetricHashJoin / GpuSizedHashJoin [Left Outer]
# (Stream-side Bucket/Local Join)
# Further Expansion: Dozens of columns entering this join.
├── Build Side: Join Table 2 (Bucket Scan/Shuffle Read)
└── GpuShuffledAsymmetricHashJoin / GpuSizedHashJoin [Left Outer]
# (Stream-side Bucket/Local Join)
# Column Expansion begins here.
├── Build Side: Join Table 1 (Bucket Scan/Shuffle Read)
└── GpuFileGpuScan [Source: Parquet] # NARROW START: Only < 10 columns read.
Core Pain Points
A wide-table sort like the final GpuSortExec above severely degrades overall GPU execution performance because:
- Super Heavy OutOfCoreSort & I/O Bottleneck: The required sort operates on a massive dataset that far exceeds available GPU memory, frequently triggering Out-of-Core Sort logic. This forces a huge volume of spill data to be transferred to host memory/disk, introducing significant I/O overhead.
- Low GPU Concurrency: The extreme I/O and memory pressure severely restrict parallel execution. Observed metrics show the average concurrent execution ratio is approximately 0.05, indicating that GPU tasks are idle—waiting for I/O and memory resources—for over 95% of their runtime.
Potential approaches to improve
- Sort Key Only Boundaries
  - During OutOfCoreSort boundary computation, copy only the sort key columns (not all columns) from GPU to host (see the sketch after this list).
- Cascade Compression: see [FEA] Use GPU compression in spill framework for better performance #13948
- Adaptive Compression
  - Intelligently decide whether to compress spill data based on runtime conditions (memory pressure, spill frequency, and historical compression effectiveness) to avoid overhead for incompressible data (see the sketch after this list).
- Priority-Based Proactive Spilling
  - Proactively spill batches from the pending queue when GPU memory pressure is detected, prioritizing batches that are furthest from being needed based on their sort order (first row value); see the sketch after this list.
- Adaptive OOC Split Sizing
  - Dynamically adjust the size of the splits/batches used during the OutOfCoreSort process to better manage GPU memory usage and reduce spilling overhead (see the sketch after this list).
- Enable cudf::merge for nested types: Enable efficient merge sort for nested types #14060
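To make the items above concrete, the following minimal sketches illustrate each idea. They are illustrative only: apart from the cudf Java classes named explicitly, every object, method, parameter, and threshold (e.g. SortKeyOnlyBoundaries, sortKeyIndices) is a hypothetical stand-in, not existing plugin code. For Sort Key Only Boundaries, the idea is to project just the key columns out of a wide batch before the device-to-host copy used to compute split boundaries:

```scala
import ai.rapids.cudf.{ColumnVector, HostColumnVector, Table}

// Minimal sketch: copy only the sort-key columns to host for boundary
// computation, instead of the full ~100-column batch.
object SortKeyOnlyBoundaries {
  // `sortKeyIndices` is a hypothetical parameter holding the positions of
  // the sort keys inside the full batch.
  def copyKeysToHost(batch: Table, sortKeyIndices: Seq[Int]): Array[HostColumnVector] = {
    sortKeyIndices.map { i =>
      val keyCol: ColumnVector = batch.getColumn(i)
      keyCol.copyToHost() // D2H copy of a single key column only
    }.toArray
  }
}
```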
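For Adaptive Compression, one possible shape of the runtime heuristic; the thresholds and the moving-average bookkeeping are assumptions:

```scala
// Sketch of an adaptive decision: skip compression when spill data has
// proven nearly incompressible, unless GPU memory pressure is extreme.
class AdaptiveCompressionPolicy(
    ratioThreshold: Double = 0.8,    // compress only if the past ratio beat this
    pressureThreshold: Double = 0.9  // fraction of GPU memory in use
) {
  private var observedRatio = 1.0    // compressedSize / originalSize, moving average

  def shouldCompress(memoryPressure: Double, recentSpillsPerSec: Double): Boolean = {
    val worthIt = observedRatio < ratioThreshold
    val desperate = memoryPressure > pressureThreshold
    (worthIt || desperate) && recentSpillsPerSec > 0.0
  }

  def recordResult(originalBytes: Long, compressedBytes: Long): Unit = {
    val r = compressedBytes.toDouble / originalBytes
    observedRatio = 0.7 * observedRatio + 0.3 * r // simple exponential moving average
  }
}
```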
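For Priority-Based Proactive Spilling, a sketch of ordering the pending queue so that, under memory pressure, the batches whose first-row key is furthest from the merge frontier are spilled first (assuming an ascending sort; PendingBatch and spillToHost are hypothetical stand-ins):

```scala
import scala.collection.mutable

// Hypothetical pending-queue entry: first-row sort key plus batch size.
case class PendingBatch(firstRowKey: Long, sizeBytes: Long)

class ProactiveSpiller(spillToHost: PendingBatch => Unit) {
  // Max-heap on firstRowKey: the largest key is furthest from being
  // consumed by an ascending merge, so it is the best spill victim.
  private val pending =
    mutable.PriorityQueue.empty[PendingBatch](Ordering.by[PendingBatch, Long](_.firstRowKey))

  def add(batch: PendingBatch): Unit = pending.enqueue(batch)

  def onMemoryPressure(bytesToFree: Long): Unit = {
    var freed = 0L
    while (freed < bytesToFree && pending.nonEmpty) {
      val victim = pending.dequeue()
      spillToHost(victim) // spill the batch that will be needed last
      freed += victim.sizeBytes
    }
  }
}
```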
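For Adaptive OOC Split Sizing, a sketch of deriving the target split size from the memory actually available at runtime rather than a fixed configuration value; the fractions and bounds are illustrative:

```scala
// Sketch: pick a split size from free GPU memory and task concurrency,
// clamped to bounds so splits are neither tiny nor too large to hold.
object AdaptiveSplitSizer {
  def targetSplitBytes(freeGpuMemory: Long,
                       concurrentTasks: Int,
                       minBytes: Long = 32L * 1024 * 1024,
                       maxBytes: Long = 1024L * 1024 * 1024): Long = {
    val perTask = freeGpuMemory / math.max(1, concurrentTasks)
    // Keep headroom for the merge phase, which holds several splits at once.
    val candidate = perTask / 4
    math.min(maxBytes, math.max(minBytes, candidate))
  }
}
```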