[FEA] GpuOutOfCoreSort over Wide Tables #14072

@sperlingxx

Description

This is an umbrella issue tracking several proposed optimizations aimed at resolving the severe performance degradation observed when running GpuOutOfCoreSort on extremely wide tables (~100 columns, including large columns such as ArrayType or long strings) in Spark-Rapids. This scenario often forces frequent, heavy spilling to host memory, creating significant I/O and memory bottlenecks.

Problem Description

This optimization proposal is directly motivated by severe performance bottlenecks observed in large-scale data warehousing environments, specifically concerning final data persistence. These systems often process high volumes of data and utilize wide tables, which together create an extremely challenging scenario for GPU-accelerated sorting.

The primary architectural constraint is the business requirement for a mandated final sort order (typically on primary keys) before writing the dataset to Parquet tables via a GpuInsertIntoHiveTable command. This requirement necessitates an extremely expensive GpuSortExec operation near the end of the query pipeline.

Query Plan Pattern: The Root Cause (Column Expansion Pattern)

The costly final GpuSortExec is a direct consequence of a preceding query plan structure that expands a small initial table into a massive final table through a deep chain of join operations.

TYPICAL SCENARIO: The OutOfCoreSort Bottleneck

GpuInsertIntoHiveTable [Target: Parquet]
│
│ # 🛑 CRITICAL PERFORMANCE BOTTLENECK 🛑
│ # PROBLEM: An IMPLICIT GpuSortExec must occur here because the target
│ # Hive table is defined with sort columns.
│ #
│ # WHY IT FAILS: The data stream at this point is at its MAXIMUM width
│ # (~200 columns). With so much data per row, the sort far exceeds GPU
│ # memory, forcing expensive out-of-core (OOC) spills to host memory/disk.
└── GpuProject [Final Projection: ~100 Columns]
    # Data width is ~100 columns here.
    └── GpuShuffledAsymmetricHashJoin / GpuSizedHashJoin [Left Outer]
        # (Stream-side Bucket/Local Join)
        # Further Expansion: Dozens of columns entering this join.
        ├── Build Side: Join Table 3 (Bucket Scan/Shuffle Read)
        └── GpuShuffledAsymmetricHashJoin / GpuSizedHashJoin [Left Outer]
            # (Stream-side Bucket/Local Join)
            # Further Expansion: Dozens of columns entering this join.
            ├── Build Side: Join Table 2 (Bucket Scan/Shuffle Read)
            └── GpuShuffledAsymmetricHashJoin / GpuSizedHashJoin [Left Outer]
                # (Stream-side Bucket/Local Join)
                # Column Expansion begins here.
                ├── Build Side: Join Table 1 (Bucket Scan/Shuffle Read)
                └── GpuFileGpuScan [Source: Parquet] # NARROW START: Only < 10 columns read.

Core Pain Points

A wide-table sort such as the final GpuSortExec above severely degrades overall GPU execution performance because:

  1. Super Heavy OutOfCoreSort & I/O Bottleneck: The required sort operates on a massive dataset that far exceeds available GPU memory, frequently triggering Out-of-Core Sort logic. This forces a huge volume of spill data to be transferred to host memory/disk, introducing significant I/O overhead.
  2. Low GPU Concurrency: The extreme I/O and memory pressure severely restrict parallel execution. Observed metrics show the average concurrent execution ratio is approximately 0.05, indicating that GPU tasks are idle—waiting for I/O and memory resources—for over 95% of their runtime.

Potential Approaches for Improvement

  • Sort Key Only Boundaries
    • During OutOfCoreSort boundary computation, copy only the sort key columns (not all columns) from GPU to host (see the first sketch after this list).
  • Cascade Compression (see #13948: "[FEA] Use GPU compression in spill framework for better performance")
  • Adaptive Compression
    • Intelligently decide whether to compress spill data based on runtime conditions (memory pressure, spill frequency, and historical compression effectiveness) to avoid overhead for incompressible data (see the second sketch after this list).
  • Priority-Based Proactive Spilling
    • Proactively spill batches from the pending queue when GPU memory pressure is detected, prioritizing batches that are furthest from being needed based on their sort order (first row value); see the third sketch after this list.
  • Adaptive OOC Split Sizing
    • Dynamically adjust the size of the splits/batches used during the OutOfCoreSort process to better manage GPU memory usage and reduce spilling overhead (see the fourth sketch after this list).
  • Enable cudf::merge for nested types (see #14060: "Enable efficient merge sort for nested types")
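
A minimal sketch of the "Sort Key Only Boundaries" idea, using Spark's public ColumnarBatch API. The keyOnlyView helper and the keyIndices parameter are hypothetical illustrations, not the plugin's actual API; a real implementation would derive the key ordinals from the sort's SortOrder expressions.

```scala
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

object SortKeyOnlyBoundaries {
  // Build a narrow view containing only the sort-key columns, so the
  // device-to-host copy used for boundary computation moves key bytes only,
  // instead of all ~100 columns.
  def keyOnlyView(batch: ColumnarBatch, keyIndices: Seq[Int]): ColumnarBatch = {
    val keyCols: Array[ColumnVector] = keyIndices.map(i => batch.column(i)).toArray
    // The new batch shares the underlying vectors; nothing is copied on the
    // device here. Only this narrow view is then transferred to host.
    new ColumnarBatch(keyCols, batch.numRows())
  }
}
```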
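
A minimal sketch of the adaptive-compression decision. The SpillStats type, its three signals, and all thresholds are assumptions for illustration; a real heuristic would be tuned against production workloads.

```scala
// Runtime signals assumed observable by the spill framework.
final case class SpillStats(
    memoryPressure: Double,       // fraction of the GPU pool in use, 0.0-1.0
    spillsPerMinute: Double,      // recent spill frequency
    avgCompressionRatio: Double)  // historical compressed/uncompressed size

object AdaptiveCompression {
  // Compress only when spills are frequent enough to amortize the cost and
  // the data has historically compressed well; skip compression for
  // effectively incompressible data (ratio close to 1.0).
  def shouldCompress(stats: SpillStats): Boolean = {
    val compressesWell = stats.avgCompressionRatio < 0.8
    val underPressure  = stats.memoryPressure > 0.85 || stats.spillsPerMinute > 10
    compressesWell && underPressure
  }
}
```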
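
A sketch of priority-based proactive spilling using a max-heap ordered by each batch's first-row key. SpillableBatch and its firstKey field are hypothetical stand-ins for the plugin's spillable batch handle.

```scala
import scala.collection.mutable

// Hypothetical handle for a batch waiting in the pending queue.
final case class SpillableBatch(id: Long, firstKey: Long, sizeBytes: Long) {
  def spill(): Unit = println(s"spilling batch $id (${sizeBytes}B) to host")
}

class ProactiveSpiller {
  // Max-heap on firstKey: the head is the batch whose data is needed
  // furthest in the future by an ascending merge.
  private val pending = mutable.PriorityQueue.empty[SpillableBatch](
    Ordering.by((b: SpillableBatch) => b.firstKey))

  def add(b: SpillableBatch): Unit = pending.enqueue(b)

  // On detected memory pressure, spill the least-urgent batches first until
  // enough device memory has been released.
  def onMemoryPressure(bytesToFree: Long): Unit = {
    var freed = 0L
    while (freed < bytesToFree && pending.nonEmpty) {
      val victim = pending.dequeue()
      victim.spill()
      freed += victim.sizeBytes
    }
  }
}
```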
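
A sketch of adaptive OOC split sizing: derive the split size from free device memory and task concurrency instead of a fixed configuration value. The freeDeviceMemory input, the 4x headroom factor, and the bounds are all assumptions.

```scala
object AdaptiveSplitSizing {
  private val MinSplitBytes = 16L << 20   // floor: 16 MiB per split
  private val MaxSplitBytes = 512L << 20  // cap: 512 MiB, keeps merge fan-in sane

  // Reserve headroom for the sort's scratch space, then clamp the candidate
  // size into [MinSplitBytes, MaxSplitBytes].
  def targetSplitBytes(freeDeviceMemory: Long, concurrentTasks: Int): Long = {
    val perTask   = freeDeviceMemory / math.max(1, concurrentTasks)
    val candidate = perTask / 4 // assumed 4x headroom for sort working memory
    math.min(MaxSplitBytes, math.max(MinSplitBytes, candidate))
  }
}
```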
