
Conversation

@Yicong-Huang (Contributor) commented Dec 9, 2025

What changes were proposed in this pull request?

Add support for User-Defined Aggregate Functions (UDAF) in PySpark. Currently PySpark supports User-Defined Functions (UDF) and User-Defined Table Functions (UDTF), but lacks support for UDAF. Users need to write custom aggregation logic in Scala/Java or use less efficient workarounds.

This change adds UDAF support using a two-stage aggregation pattern with mapInArrow and applyInArrow. The basic idea is to implement aggregation (and partial aggregation) by:

df.selectExpr("rand() as key").mapInArrow(reduce).groupBy("key").applyInArrow(merge)

Here the function passed to mapInArrow calls Aggregator.reduce() for partial aggregation within each partition, and the function passed to applyInArrow calls Aggregator.merge() to combine the partial results, then Aggregator.finish() to produce the final result.
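A minimal sketch of the pattern, assuming a DataFrame df with a bigint value column; reduce_partial, merge_groups, and the output schemas are illustrative names for this sketch, not the actual patch internals:

import pyarrow as pa
import pyarrow.compute as pc

def reduce_partial(batches):
    # Stage 1 (mapInArrow): fold every row of this partition into one
    # partial sum, so only one row per partition crosses the shuffle.
    partial = 0
    for batch in batches:
        partial += pc.sum(batch.column("value")).as_py() or 0
    yield pa.RecordBatch.from_pydict({"key": [0], "partial": [partial]})

def merge_groups(table):
    # Stage 2 (applyInArrow): merge the partial sums into the final value.
    merged = pc.sum(table.column("partial")).as_py() or 0
    return pa.Table.from_pydict({"result": [merged]})

partials = df.mapInArrow(reduce_partial, "key int, partial bigint")
result = partials.groupBy("key").applyInArrow(merge_groups, "result bigint")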

Aligned with the Scala side, the implementation provides a Python Aggregator base class that users can subclass:

from typing import Any

class Aggregator:
    def zero(self) -> Any:
        """Return the zero value for the aggregation buffer"""
        raise NotImplementedError

    def reduce(self, buffer: Any, value: Any) -> Any:
        """Combine an input value into the buffer"""
        raise NotImplementedError

    def merge(self, buffer1: Any, buffer2: Any) -> Any:
        """Merge two intermediate buffers"""
        raise NotImplementedError

    def finish(self, reduction: Any) -> Any:
        """Produce the final result from the buffer"""
        raise NotImplementedError
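For instance, an aggregator whose buffer differs from its final output, such as an average, exercises all four methods. This MyAverage class is an illustrative sketch, not part of the patch:

class MyAverage(Aggregator):
    def zero(self):
        return (0, 0)  # (running sum, running count)

    def reduce(self, buffer, value):
        s, n = buffer
        return (s + value, n + 1)

    def merge(self, buffer1, buffer2):
        return (buffer1[0] + buffer2[0], buffer1[1] + buffer2[1])

    def finish(self, reduction):
        s, n = reduction
        return s / n if n else None  # average, or NULL for an empty group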

Users can create UDAF instances using the udaf() function and use them with DataFrame.agg():

sum_udaf = udaf(MySum(), "bigint")
df.agg(sum_udaf(df.value))
df.groupBy("group").agg(sum_udaf(df.value))

Key changes:

  • Added pyspark.sql.udaf module with Aggregator base class, UserDefinedAggregateFunction wrapper, and udaf() factory function
  • Integrated UDAF support in GroupedData.agg() by detecting UDAF columns via the _udaf_func attribute (sketched below)
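A rough sketch of what that detection could look like; the method body and helper names here are assumptions for illustration, not the actual implementation:

def agg(self, *exprs):
    # If any column carries a _udaf_func attribute, route the whole
    # aggregation through the mapInArrow/applyInArrow rewrite above.
    if any(hasattr(e, "_udaf_func") for e in exprs):
        return self._aggregate_with_udaf(exprs)  # hypothetical helper
    return self._regular_agg(*exprs)             # existing code path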

Why are the changes needed?

Currently PySpark lacks support for User-Defined Aggregate Functions (UDAF), which limits users' ability to express complex aggregation logic directly in Python. Users must either write custom aggregation logic in Scala/Java or use less efficient workarounds. This change adds UDAF support to complement existing UDF and UDTF support in PySpark, aligning with the Scala/Java Aggregator interface in org.apache.spark.sql.expressions.Aggregator.

Does this PR introduce any user-facing change?

Yes. This PR adds a new feature: User-Defined Aggregate Functions (UDAF) in PySpark. Users can now define custom aggregation logic by subclassing the Aggregator class and using the udaf() function to create UDAF instances that can be used with DataFrame.agg() and GroupedData.agg().

Example:

from pyspark.sql.udaf import Aggregator, udaf

class MySum(Aggregator):
    def zero(self):
        return 0
    def reduce(self, buffer, value):
        return buffer + value
    def merge(self, buffer1, buffer2):
        return buffer1 + buffer2
    def finish(self, reduction):
        return reduction

sum_udaf = udaf(MySum(), "bigint")
df.agg(sum_udaf(df.value))

How was this patch tested?

Added comprehensive unit tests in python/pyspark/sql/tests/test_udaf.py covering:

  • Basic aggregation (sum, average, max)
  • Grouped aggregation with groupBy().agg()
  • Null value handling
  • Empty DataFrame handling
  • Large datasets (20000+ rows) distributed across partitions
  • Error handling for invalid inputs
  • Integration with df.agg() and df.groupBy().agg()

Was this patch authored or co-authored using generative AI tooling?

No.

@allisonwang-db (Contributor) left a comment:
Nice feature!

    def test_udaf_mixed_with_other_agg_not_supported(self):
        """Test that mixing UDAF with other aggregate functions raises error."""

        class MySum(Aggregator):
Contributor:
Can we add some tests for more complicated data structures, like dictionaries?

Contributor (author):
added more data types!

@Yicong-Huang Yicong-Huang changed the title [SPARK-54647][PYTHON] Support User-Defined Aggregate Functions (UDAF) [WIP][SPARK-54647][PYTHON] Support User-Defined Aggregate Functions (UDAF) Dec 10, 2025
@Yicong-Huang Yicong-Huang marked this pull request as draft December 10, 2025 01:36
class Aggregator:
Contributor:
do we necessarily need this class?
I see UDTF doesn't need a base class.

    >>> class TestUDTF:
    ...     def eval(self, *args: Any):
    ...         yield "hello", "world"

Apply this UDAF to the given columns.

This creates a Column expression that can be used in DataFrame operations.
The actual aggregation is performed using mapInArrow and applyInArrow.
Contributor:
Why not a dedicated physical plan?

-----
This implementation uses mapInArrow and applyInArrow internally to perform
the aggregation. The approach follows:
1. mapInArrow: Performs partial aggregation (reduce) on each partition
@zhengruifeng (Contributor) commented Dec 11, 2025:

If we want to support partial aggregation with existing arrow UDFs, I think we should use a modified FlatMapGroupsInArrowExec with requiredChildDistribution = UnspecifiedDistribution.

* MapInArrow, Aggregate, and FlatMapGroupsInArrow operators.
*
* This implements a three-phase aggregation pattern:
* 1. Partial aggregation (MapInArrow): Applies reduce() on each partition, outputs
Contributor:

MapInArrowExec doesn't set requiredChildOrdering, so where does it sort the data for partial aggregation?

@zhengruifeng (Contributor):

The basic idea is to implement aggregation (and partial aggregation) by:
df.selectExpr("rand() as key").mapInArrow(reduce).groupBy("key").applyInArrow(merge)

I think there should be a sortWithinPartitions before mapInArrow for partial aggregation.
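Something like the following, where reduce_partial is the same hypothetical partition-level reducer sketched above and "key" is an assumed grouping column:

# Hedged sketch of the suggested ordering: sort each partition by key so
# the reducer can stream one group at a time instead of buffering all keys.
partials = (
    df.sortWithinPartitions("key")
      .mapInArrow(reduce_partial, "key int, partial bigint")
)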

@zhengruifeng (Contributor):

The whole approach is based on mapInArrow and applyInArrow; how does it support function registration so that it can be used in SQL?

group_buffers[grouping_key] = agg.zero()

if value is not None:
    group_buffers[grouping_key] = agg.reduce(group_buffers[grouping_key], value)
Contributor:

group_buffers holds the aggregation buffers for every key within a partition, so it will cause memory issues if the key cardinality is large.
A reasonable physical plan should sort the partition by the key and then output each group's partial aggregation result as soon as that group finishes.
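A hedged sketch of that sort-based alternative, assuming rows arrive pre-sorted by key and agg is the user's Aggregator instance; all names and the schema are illustrative assumptions:

import pyarrow as pa

def reduce_sorted(batches):
    # With the partition sorted by key, one buffer at a time suffices:
    # each group's partial result is emitted as soon as the group ends.
    sentinel = object()  # distinguishes "no group yet" from a NULL key
    current_key, buffer = sentinel, None
    for batch in batches:
        for row in batch.to_pylist():
            if row["key"] != current_key:
                if current_key is not sentinel:
                    yield pa.RecordBatch.from_pydict(
                        {"key": [current_key], "partial": [buffer]})
                current_key, buffer = row["key"], agg.zero()
            if row["value"] is not None:
                buffer = agg.reduce(buffer, row["value"])
    if current_key is not sentinel:
        yield pa.RecordBatch.from_pydict({"key": [current_key], "partial": [buffer]})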

Contributor:

This mimics HashAggregateExec, whereas SortAggregateExec is more stable.
