feat: support batched table functions #18535
Draft
+4,792
−95
Batched Table Functions with LATERAL Join Support
Note: This is a draft PR for gathering feedback on the approach. It works, but it is not polished yet.
I'd like to hear opinions on whether this is the right path forward before I start finalizing the PR.
Which issue does this PR close?
Closes #18121.
The Problem
DataFusion's current table function API (TableFunctionImpl) requires constant arguments at planning time and returns a TableProvider. This prevents LATERAL joins where table functions need to correlate with outer query rows.
Current Architecture Limitations
Problems:
Proposed Solution
Introduce a new trait that processes arrays of arguments (batched invocation) and returns output with input row mapping for correlation:
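As a rough illustration of the idea only, here is a minimal sketch of what batched invocation with input row mapping could look like. It is not the PR's actual definition: the trait name BatchedTableFunctionSketch is hypothetical, the method is synchronous and returns an iterator rather than an async stream, and arguments are plain Vec<i64> columns instead of Arrow arrays. Only the names invoke_batch, BatchResultChunk, and input_row_indices come from the description above.

```rust
/// One chunk of output rows plus, for each output row, the index of the
/// input row (argument position) that produced it.
#[derive(Debug, PartialEq)]
pub struct BatchResultChunk {
    pub output: Vec<i64>,
    pub input_row_indices: Vec<u32>,
}

/// Hypothetical, simplified stand-in for the proposed trait: synchronous,
/// i64-only, and returning an iterator instead of an async stream.
pub trait BatchedTableFunctionSketch {
    /// `args[k][row]` is the value of argument `k` for input row `row`.
    fn invoke_batch<'a>(
        &'a self,
        args: &'a [Vec<i64>],
    ) -> Box<dyn Iterator<Item = BatchResultChunk> + 'a>;
}

/// Toy generate_series(start, end) with an inclusive end.
pub struct GenerateSeries;

impl BatchedTableFunctionSketch for GenerateSeries {
    fn invoke_batch<'a>(
        &'a self,
        args: &'a [Vec<i64>],
    ) -> Box<dyn Iterator<Item = BatchResultChunk> + 'a> {
        // Assumes exactly two argument columns; panics otherwise.
        let (starts, ends) = (&args[0], &args[1]);
        // One lazily produced chunk per input row, so at most one series
        // is materialized at a time.
        Box::new(starts.iter().zip(ends.iter()).enumerate().map(
            |(row, (&s, &e))| {
                let output: Vec<i64> = (s..=e).collect();
                let input_row_indices = vec![row as u32; output.len()];
                BatchResultChunk { output, input_row_indices }
            },
        ))
    }
}
```

Calling invoke_batch once with the columns start=[10, 20] and end=[12, 22] evaluates the function for both input rows in a single call, and each chunk records which input row its output belongs to.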
Key Design Decisions
1. Arrays as Arguments
ScalarUDF - receives evaluated data, not planning-time expressions
invoke_batch
2. Async API
TableProvider::scan() async pattern
3. Direct Pushdown Parameters
TableProvider::scan() pattern
invoke_batch()
4. Streaming Results
Stream<BatchResultChunk> for memory-bounded execution
input_row_indices for efficient LATERAL join implementation
Architecture Overview
Execution Flow Example: generate_series(start, end)
Input batch:
Evaluate args → start=[10, 20], end=[12, 22]
invoke_batch → Stream of chunks:
Combine using indices → Final output:
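The "combine using indices" step can be sketched as a gather over the outer rows. This is a minimal sketch under stated assumptions, not the executor in this PR: combine_lateral is a hypothetical helper, the outer side is reduced to a single column, and plain slices stand in for Arrow arrays. The values and indices in the usage note are what generate_series would yield for start=[10, 20], end=[12, 22] with an inclusive end.

```rust
/// Joins outer rows with table-function output for a LATERAL join.
/// `outer` holds one value per input row, and `values[i]` was produced by
/// input row `input_row_indices[i]`.
pub fn combine_lateral<T: Clone>(
    outer: &[T],
    values: &[i64],
    input_row_indices: &[u32],
) -> Vec<(T, i64)> {
    // Every output value must carry exactly one input row index.
    assert_eq!(values.len(), input_row_indices.len());
    input_row_indices
        .iter()
        .zip(values)
        // Repeat the outer row once per output row it produced.
        .map(|(&idx, &v)| (outer[idx as usize].clone(), v))
        .collect()
}
```

With a hypothetical outer column ["a", "b"], values [10, 11, 12, 20, 21, 22], and indices [0, 0, 0, 1, 1, 1], the result pairs "a" with 10, 11, 12 and "b" with 20, 21, 22. Because the mapping is just an index column, the outer side never needs to be re-scanned or re-evaluated per row.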
What's Included
Core Components
1. New Trait (datafusion/catalog/src/table.rs)
2. Logical Plan Nodes (datafusion/expr/src/logical_plan/plan.rs)
StandaloneBatchedTableFunction - for SELECT * FROM func(1, 100)
LateralBatchedTableFunction - for SELECT * FROM t LATERAL func(t.x)
3. Physical Executor (datafusion/catalog/src/batched_function/exec.rs)
BatchedTableFunctionExec with two modes (Standalone/Lateral)
6. Optimizer Rules (datafusion/optimizer/src/)
7. Reference Implementation (datafusion/functions-table/src/generate_series_batched.rs)
SQL Examples Enabled
Testing
Comprehensive test coverage:
batched_function/exec.rs
core/tests/batched_table_function.rs
sqllogictest/test_files/lateral.slt
Open Questions
1. Is a new trait the right approach?
This PR introduces BatchedTableFunctionImpl as a separate trait. An alternative would be to use TableProvider wrappers.
2. Is the API appropriate?
3. Is "Batched" the right terminology?
Current naming: BatchedTableFunctionImpl, StandaloneBatchedTableFunction, etc.
4. Should we keep both APIs or eventually deprecate TableFunctionImpl?
User-Facing Changes
New SQL capabilities:
New API for function developers:
Breaking changes:
The PR adds two new variants to the LogicalPlan enum, so downstream code that matches exhaustively on LogicalPlan will need to handle them.