@xin-zhang2 (Contributor) commented Oct 31, 2025

Description

Adds an isDistinctSensitive() flag to AggregationFunction that indicates whether the function is sensitive to duplicate inputs, and an optimizer rule that removes DISTINCT from aggregates that are insensitive to both duplicates and input order.
Fixes #26075.
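
The rewrite relies on the fact that, for functions such as min, deduplicating the input cannot change the result, whereas for functions such as sum it can. A minimal plain-Java sketch of that property follows; the class and method names are illustrative stand-ins and are not part of Presto.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

// Illustrative plain-Java stand-ins; none of these names exist in Presto.
final class DistinctInsensitivityDemo
{
    // min(x): aggregate over every input row.
    static OptionalInt min(List<Integer> rows)
    {
        return rows.stream().mapToInt(Integer::intValue).min();
    }

    // min(DISTINCT x): deduplicate first, then aggregate.
    static OptionalInt minDistinct(List<Integer> rows)
    {
        return rows.stream().distinct().mapToInt(Integer::intValue).min();
    }

    // sum(x), shown for contrast.
    static int sum(List<Integer> rows)
    {
        return rows.stream().mapToInt(Integer::intValue).sum();
    }

    // sum(DISTINCT x), shown for contrast.
    static int sumDistinct(List<Integer> rows)
    {
        return rows.stream().distinct().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args)
    {
        List<Integer> rows = Arrays.asList(3, 1, 3, 2, 1);
        // min is unaffected by deduplication; sum is not.
        System.out.println("min agrees: " + min(rows).equals(minDistinct(rows)));
        System.out.println("sum agrees: " + (sum(rows) == sumDistinct(rows)));
    }
}
```

Because the two min variants always agree, the DISTINCT step is pure overhead for such functions, which is what the optimizer rule removes.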

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== NO RELEASE NOTE ==

@prestodb-ci added the from:IBM (PR from IBM) label Oct 31, 2025
@sourcery-ai (bot, Contributor) commented Oct 31, 2025

Reviewer's Guide

This PR introduces a new distinct-sensitivity flag for aggregation functions (in both the Java SPI and the C++ protocol), propagates the flag through all built-in and SQL-invoked implementations and function namespaces, and adds a new optimizer rule, with its registration and tests, that strips DISTINCT from aggregates that are insensitive to both duplicates and input order.

Class diagram for AggregationFunctionMetadata and related changes

classDiagram
class AggregationFunctionMetadata {
  -TypeSignature intermediateType
  -boolean isOrderSensitive
  -boolean isDistinctSensitive
  +AggregationFunctionMetadata(intermediateType, isOrderSensitive, isDistinctSensitive)
  +TypeSignature getIntermediateType()
  +boolean isOrderSensitive()
  +boolean isDistinctSensitive()
  +String toString()
}

class AggregationFunctionImplementation {
  <<interface>>
  +boolean isDecomposable()
  +boolean isOrderSensitive()
  +boolean isDistinctSensitive()
}

class SqlInvokedAggregationFunctionImplementation {
  -Type intermediateType
  -Type finalType
  -boolean isOrderSensitive
  -boolean isDistinctSensitive
  -List<Type> parameterTypes
  +SqlInvokedAggregationFunctionImplementation(intermediateType, finalType, isOrderSensitive, isDistinctSensitive, parameterTypes)
  +boolean isOrderSensitive()
  +boolean isDistinctSensitive()
  +List<Type> getParameterTypes()
}

AggregationFunctionImplementation <|.. SqlInvokedAggregationFunctionImplementation

Class diagram for BuiltInAggregationFunctionImplementation and AggregationHeader changes

classDiagram
class BuiltInAggregationFunctionImplementation {
  -List<Class> lambdaInterfaces
  -boolean decomposable
  -boolean orderSensitive
  -boolean distinctSensitive
  -AggregationMetadata aggregationMetadata
  +BuiltInAggregationFunctionImplementation(..., decomposable, orderSensitive, distinctSensitive, ...)
  +boolean isOrderSensitive()
  +boolean isDistinctSensitive()
  +AggregationMetadata getAggregationMetadata()
}

class AggregationHeader {
  -String name
  -Optional<String> description
  -boolean decomposable
  -boolean orderSensitive
  -boolean distinctSensitive
  -SqlFunctionVisibility visibility
  -boolean isCalledOnNullInput
  +AggregationHeader(name, description, decomposable, orderSensitive, distinctSensitive, visibility, isCalledOnNullInput)
  +boolean isOrderSensitive()
  +boolean isDistinctSensitive()
}

Class diagram for HiveAggregationFunctionDescription and HiveAggregationFunctionImplementation changes

classDiagram
class HiveAggregationFunctionDescription {
  -QualifiedObjectName name
  -List<Type> parameterTypes
  -List<Type> intermediateTypes
  -Type finalType
  -boolean decomposable
  -boolean orderSensitive
  -boolean distinctSensitive
  +HiveAggregationFunctionDescription(..., decomposable, orderSensitive, distinctSensitive)
  +boolean isOrderSensitive()
  +boolean isDistinctSensitive()
}

class HiveAggregationFunctionImplementation {
  +boolean isOrderSensitive()
  +boolean isDistinctSensitive()
}

HiveAggregationFunctionImplementation --> HiveAggregationFunctionDescription : uses

Class diagram for new optimizer rule RemoveInsensitiveAggregateDistinct

classDiagram
class RemoveInsensitiveAggregateDistinct {
  -Pattern<AggregationNode> pattern
  -FunctionAndTypeManager functionAndTypeManager
  +RemoveInsensitiveAggregateDistinct(functionAndTypeManager)
  +Pattern<AggregationNode> getPattern()
  +Result apply(AggregationNode node, Captures captures, Context context)
  -boolean canRemoveDistinct(AggregationNode aggregationNode)
  -boolean canRemoveDistinct(Aggregation aggregation)
}
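
The decision made by the two canRemoveDistinct overloads can be sketched roughly as follows. This is a simplified illustration under the assumption that the rule consults only the DISTINCT flag and the two sensitivity flags; the `Agg` type and its fields are hypothetical stand-ins, not the classes used by the PR.

```java
// Hypothetical, simplified stand-ins; not the classes used by the PR.
final class CanRemoveDistinctSketch
{
    // One aggregation call plus the two metadata flags the rule is assumed to consult.
    static final class Agg
    {
        final String function;
        final boolean distinct;
        final boolean orderSensitive;
        final boolean distinctSensitive;

        Agg(String function, boolean distinct, boolean orderSensitive, boolean distinctSensitive)
        {
            this.function = function;
            this.distinct = distinct;
            this.orderSensitive = orderSensitive;
            this.distinctSensitive = distinctSensitive;
        }
    }

    // DISTINCT is droppable only if it is present and the function is
    // insensitive to both duplicates and input order.
    static boolean canRemoveDistinct(Agg agg)
    {
        return agg.distinct && !agg.distinctSensitive && !agg.orderSensitive;
    }

    public static void main(String[] args)
    {
        // min: insensitive to duplicates and order, so DISTINCT can go.
        System.out.println(canRemoveDistinct(new Agg("min", true, false, false)));
        // sum: duplicates change the result, so DISTINCT must stay.
        System.out.println(canRemoveDistinct(new Agg("sum", true, false, true)));
    }
}
```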

Class diagram for AggregationFunction annotation changes

classDiagram
class AggregationFunction {
  <<annotation>>
  +boolean isOrderSensitive() default false
  +boolean isDistinctSensitive() default true
  +SqlFunctionVisibility visibility() default PUBLIC
  +String[] alias() default ""
}
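
As an illustration of how such an annotation default behaves, the sketch below declares a stand-in annotation with the same two flags and reads it back by reflection. `SketchAggregationFunction` and the annotated classes are hypothetical; the real annotation and its parser live in the Presto SPI and may differ in detail.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

final class AnnotationFlagDemo
{
    // Reads the flag the way an annotation parser might (assumed behavior).
    static boolean isDistinctSensitive(Class<?> clazz)
    {
        SketchAggregationFunction annotation = clazz.getAnnotation(SketchAggregationFunction.class);
        if (annotation == null) {
            throw new IllegalArgumentException(clazz.getName() + " is not annotated");
        }
        return annotation.isDistinctSensitive();
    }

    public static void main(String[] args)
    {
        System.out.println("sum: " + isDistinctSensitive(SketchSum.class));
        System.out.println("bitwise_or_agg: " + isDistinctSensitive(SketchBitwiseOr.class));
    }
}

// Hypothetical stand-in for the real @AggregationFunction annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SketchAggregationFunction
{
    String value();

    boolean isOrderSensitive() default false;

    // Conservative default: duplicates are assumed to matter unless a function opts out.
    boolean isDistinctSensitive() default true;
}

// Leaves the flag at its default, so DISTINCT would be preserved.
@SketchAggregationFunction("sum")
final class SketchSum {}

// Opts out explicitly, as the PR does for bitwise_or_agg.
@SketchAggregationFunction(value = "bitwise_or_agg", isDistinctSensitive = false)
final class SketchBitwiseOr {}
```

Defaulting to true means that existing or third-party functions that never set the flag keep their DISTINCT semantics; only functions that explicitly opt out become eligible for the rewrite.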

File-Level Changes

Change: Add isDistinctSensitive flag to aggregation metadata and protocol serialization
Details:
  • Introduce boolean isDistinctSensitive in AggregationFunctionMetadata with JSON annotations
  • Extend SPI interfaces and toString methods to expose the new flag
  • Update C++ protocol (presto_protocol_core.cpp/.h) and FunctionMetadata.cpp to serialize/deserialize isDistinctSensitive
Files:
  • presto-spi/src/main/java/com/facebook/presto/spi/function/AggregationFunctionMetadata.java
  • presto-spi/src/main/java/com/facebook/presto/spi/function/AggregationFunction.java
  • presto-spi/src/main/java/com/facebook/presto/spi/function/AggregationFunctionImplementation.java
  • presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.cpp
  • presto-native-execution/presto_cpp/presto_protocol/core/presto_protocol_core.h
  • presto-native-execution/presto_cpp/main/functions/FunctionMetadata.cpp

Change: Propagate distinctSensitive flag through aggregation implementations and namespaces
Details:
  • Add distinctSensitive constructor argument in BuiltInAggregationFunctionImplementation, SqlInvokedAggregationFunctionImplementation, HiveAggregationFunctionDescription, AggregationHeader etc.
  • Annotate @AggregationFunction declarations with isDistinctSensitive and update Annotation parser
  • Update function-namespace managers (worker, built-in, special and sidecar) and ParametricAggregation to pass the new flag
Files:
  • presto-main-base/src/main/java/com/facebook/presto/operator/aggregation/...
  • presto-hive-function-namespace/src/main/java/com/facebook/presto/hive/functions/aggregation/HiveAggregationFunctionDescription.java
  • presto-built-in-worker-function-tools/src/main/java/com/facebook/presto/builtin/tools/WorkerFunctionUtil.java
  • presto-main-base/src/main/java/com/facebook/presto/operator/aggregation/AggregationFromAnnotationsParser.java
  • presto-main-base/src/main/java/com/facebook/presto/metadata/BuiltInSpecialFunctionNamespaceManager.java
  • presto-function-namespace-managers-common/src/main/java/com/facebook/presto/functionNamespace/AbstractSqlInvokedFunctionNamespaceManager.java
  • presto-main-base/src/main/java/com/facebook/presto/operator/aggregation/ParametricAggregation.java
  • presto-main-base/src/main/java/com/facebook/presto/operator/aggregation/RealAverageAggregation.java
  • presto-main-base/src/main/java/com/facebook/presto/operator/aggregation/ReduceAggregationFunction.java

Change: Implement RemoveInsensitiveAggregateDistinct optimizer rule with registration and tests
Details:
  • Create new Rule class to clear DISTINCT on insensitive aggregates
  • Register it in logical and distributed PlanOptimizers
  • Add comprehensive unit tests in TestRemoveInsensitiveAggregateDistinct
Files:
  • presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/RemoveInsensitiveAggregateDistinct.java
  • presto-main-base/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java
  • presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestRemoveInsensitiveAggregateDistinct.java

Assessment against linked issues

Issue Objective Addressed Explanation
#26075 Implement an optimizer rule that rewrites min(distinct x) (and similar aggregates) to min(x) when the aggregate function is not sensitive to duplicates.
#26075 Ensure the optimization applies to all relevant aggregate functions (min, max, arbitrary, any_value) that are not sensitive to duplicates or order.
#26075 Add tests to verify that the optimization correctly removes 'distinct' for insensitive aggregate functions and does not remove it for sensitive ones.

Possibly linked issues

  • Optimize min(distinct x) to drop 'distinct' #26075: The PR adds an optimizer rule to remove the 'distinct' keyword from aggregate functions that are not sensitive to duplicate inputs, such as min, max, arbitrary, and any_value, directly addressing the optimization proposed in the issue.

public class RemoveInsensitiveAggregateDistinct
        implements Rule<AggregationNode>
{
    private static final Set<QualifiedObjectName> DISTINCT_INSENSITIVE_AGGREGATION_NAMES = ImmutableSet.<QualifiedObjectName>builder()
Review comment from a Contributor:

It would be good to think about how we can put this into the function metadata itself, rather than hardcoding this list.

Reply from the author (Contributor):

@tdcmeehan Thanks for the suggestion! I've updated the code to remove the hardcoding and use a flag in AggregationFunctionMetadata for this. Could you take another look when you have a chance? Thanks again!
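
The difference between the two approaches discussed in this thread can be sketched as follows. This is only an illustration: the names in `HARDCODED_NAMES` are taken from the functions mentioned in the linked issue summary, not from the elided list in the original code, and `FunctionInfo` and `my_max` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class MetadataFlagSketch
{
    // Approach under review: the rule keeps its own list of function names.
    // The names here are illustrative, not the PR's actual list.
    static final Set<String> HARDCODED_NAMES =
            new HashSet<>(Arrays.asList("min", "max", "arbitrary", "any_value"));

    // Hypothetical function descriptor that carries the flag itself.
    static final class FunctionInfo
    {
        final String name;
        final boolean distinctSensitive;

        FunctionInfo(String name, boolean distinctSensitive)
        {
            this.name = name;
            this.distinctSensitive = distinctSensitive;
        }
    }

    static boolean insensitiveByName(FunctionInfo function)
    {
        return HARDCODED_NAMES.contains(function.name);
    }

    // Suggested approach: ask the function's own metadata.
    static boolean insensitiveByMetadata(FunctionInfo function)
    {
        return !function.distinctSensitive;
    }

    public static void main(String[] args)
    {
        // A user-defined function that declares itself duplicate-insensitive.
        FunctionInfo custom = new FunctionInfo("my_max", false);
        System.out.println("by name: " + insensitiveByName(custom));
        System.out.println("by metadata: " + insensitiveByMetadata(custom));
    }
}
```

With the metadata flag, a function from any namespace can opt in to the optimization without the rule having to know its name.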

@xin-zhang2 force-pushed the removeInsensitiveAggregateDistinct branch 4 times, most recently from 4500631 to 44e138b (November 4, 2025 12:13)
@xin-zhang2 force-pushed the removeInsensitiveAggregateDistinct branch from 44e138b to d373b7f (November 4, 2025 16:06)
@xin-zhang2 marked this pull request as ready for review November 4, 2025 18:42
@prestodb-ci requested review from a team and pramodsatya and removed request for a team November 4, 2025 18:42
@sourcery-ai (bot, Contributor) left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/RemoveInsensitiveAggregateDistinct.java:51-52` </location>
<code_context>
+    @Override
+    public Result apply(AggregationNode node, Captures captures, Context context)
+    {
+        ImmutableMap.Builder<VariableReferenceExpression, Aggregation> aggregations = ImmutableMap.builder();
+        for (Map.Entry<VariableReferenceExpression, Aggregation> entry : node.getAggregations().entrySet()) {
+            Aggregation aggregation = entry.getValue();
+            if (canRemoveDistinct(aggregation)) {
</code_context>

<issue_to_address>
**suggestion (performance):** Aggregations are always rebuilt, even if no changes are made.

Consider returning `Result.empty()` when no DISTINCT flag is removed, so that the original node is left in place; this avoids unnecessary reconstruction and reduces plan node churn.

Suggested implementation:

```java
    public Result apply(AggregationNode node, Captures captures, Context context)
    {
        ImmutableMap.Builder<VariableReferenceExpression, Aggregation> aggregations = ImmutableMap.builder();
        // Tracks whether any aggregation was actually rewritten.
        boolean anyDistinctRemoved = false;
        for (Map.Entry<VariableReferenceExpression, Aggregation> entry : node.getAggregations().entrySet()) {
            Aggregation aggregation = entry.getValue();
            if (canRemoveDistinct(aggregation)) {
                // Same aggregation with the DISTINCT flag cleared.
                aggregations.put(entry.getKey(),
                        new Aggregation(
                                aggregation.getCall(),
                                aggregation.getFilter(),
                                aggregation.getOrderBy(),
                                false,
                                aggregation.getMask()));
                anyDistinctRemoved = true;
            }
            else {
                aggregations.put(entry);
            }
        }
        // Nothing changed: report no rewrite instead of rebuilding the node.
        if (!anyDistinctRemoved) {
            return Result.empty();
        }
        return Result.of(
                new AggregationNode(
                        node.getId(),
                        node.getSource(),
                        aggregations.build(),
                        node.getGroupingSets(),
                        node.getPreGroupedVariables(),
                        node.getStep(),
                        node.getHashVariable(),
                        node.getGroupIdVariable()));
    }
```
</issue_to_address>
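
The short-circuit suggested in Comment 1 can be exercised in isolation with simplified types. In the sketch below, `Agg` and `rewrite` are hypothetical stand-ins, and `Optional.empty()` plays the role that `Result.empty()` plays in the suggestion; it is not the rule's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class NoChangeShortCircuitSketch
{
    // Hypothetical stand-in for an aggregation; `removable` summarizes the metadata check.
    static final class Agg
    {
        final String function;
        final boolean distinct;
        final boolean removable;

        Agg(String function, boolean distinct, boolean removable)
        {
            this.function = function;
            this.distinct = distinct;
            this.removable = removable;
        }
    }

    // Returns the rewritten map, or Optional.empty() when no DISTINCT was dropped.
    static Optional<Map<String, Agg>> rewrite(Map<String, Agg> aggregations)
    {
        Map<String, Agg> rewritten = new LinkedHashMap<>();
        boolean changed = false;
        for (Map.Entry<String, Agg> entry : aggregations.entrySet()) {
            Agg agg = entry.getValue();
            if (agg.distinct && agg.removable) {
                rewritten.put(entry.getKey(), new Agg(agg.function, false, agg.removable));
                changed = true;
            }
            else {
                rewritten.put(entry.getKey(), agg);
            }
        }
        return changed ? Optional.of(rewritten) : Optional.empty();
    }

    public static void main(String[] args)
    {
        Map<String, Agg> onlySum = new LinkedHashMap<>();
        onlySum.put("s", new Agg("sum", true, false));
        System.out.println("sum only rewritten: " + rewrite(onlySum).isPresent());

        Map<String, Agg> withMin = new LinkedHashMap<>();
        withMin.put("m", new Agg("min", true, true));
        System.out.println("min rewritten: " + rewrite(withMin).isPresent());
    }
}
```

In a fixpoint-driven optimizer, reporting a change only when one actually happened also helps the rule avoid being re-applied to an unchanged plan.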

### Comment 2
<location> `presto-main-base/src/main/java/com/facebook/presto/operator/aggregation/BitwiseOrAggregation.java:27` </location>
<code_context>
 import com.facebook.presto.spi.function.OutputFunction;
 import com.facebook.presto.spi.function.SqlType;

-@AggregationFunction("bitwise_or_agg")
+@AggregationFunction(value = "bitwise_or_agg", isDistinctSensitive = false)
 public class BitwiseOrAggregation
 {
</code_context>

<issue_to_address>
**question (bug_risk):** Explicitly setting isDistinctSensitive to false may affect semantic expectations.

Confirm that ignoring duplicates in bitwise_or_agg aligns with all use cases, as this change may cause correctness issues if not intended.
</issue_to_address>
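
One way to reason about the question in Comment 2: bitwise OR is idempotent (x | x == x), so OR-ing a value in a second time cannot change the accumulated result. The sketch below checks this on a small input; it is a plain-Java illustration with hypothetical names that ignores NULL handling, not the Presto implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class BitwiseOrDistinctDemo
{
    // bitwise_or_agg(x) over non-null inputs (NULL handling is ignored in this sketch).
    static long bitwiseOr(List<Long> rows)
    {
        long result = 0;
        for (long row : rows) {
            result |= row;
        }
        return result;
    }

    // bitwise_or_agg(DISTINCT x): deduplicate first, then aggregate.
    static long bitwiseOrDistinct(List<Long> rows)
    {
        return bitwiseOr(rows.stream().distinct().collect(Collectors.toList()));
    }

    public static void main(String[] args)
    {
        List<Long> rows = Arrays.asList(0b0101L, 0b0011L, 0b0101L);
        System.out.println(bitwiseOr(rows) == bitwiseOrDistinct(rows));
    }
}
```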

### Comment 3
<location> `presto-main-base/src/main/java/com/facebook/presto/operator/aggregation/BooleanAndAggregation.java:31` </location>
<code_context>
 import static com.facebook.presto.operator.aggregation.state.TriStateBooleanState.NULL_VALUE;
 import static com.facebook.presto.operator.aggregation.state.TriStateBooleanState.TRUE_VALUE;

-@AggregationFunction(value = "bool_and", alias = "every")
+@AggregationFunction(value = "bool_and", alias = "every", isDistinctSensitive = false)
 public final class BooleanAndAggregation
 {
</code_context>

<issue_to_address>
**question:** Setting isDistinctSensitive to false for bool_and may have semantic implications.

Please verify that handling of duplicates and nulls in bool_and remains correct with isDistinctSensitive set to false.
</issue_to_address>
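
For Comment 3, a similar check can be sketched for bool_and, this time including NULL inputs. The sketch assumes that NULL inputs are skipped and that the result is NULL when no non-NULL input is seen; that assumption and all names below are illustrative and should be verified against the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

final class BoolAndDistinctDemo
{
    // bool_and(x): NULLs are skipped (assumption); the result is null
    // when no non-NULL input was seen.
    static Boolean boolAnd(List<Boolean> rows)
    {
        Boolean result = null;
        for (Boolean row : rows) {
            if (row == null) {
                continue;
            }
            result = (result == null) ? row : Boolean.valueOf(result && row);
        }
        return result;
    }

    // bool_and(DISTINCT x): deduplicate (NULL counts as one value), then aggregate.
    static Boolean boolAndDistinct(List<Boolean> rows)
    {
        return boolAnd(new ArrayList<>(new LinkedHashSet<>(rows)));
    }

    public static void main(String[] args)
    {
        List<Boolean> rows = Arrays.asList(true, null, true, false, false);
        System.out.println(Objects.equals(boolAnd(rows), boolAndDistinct(rows)));
    }
}
```

Under these assumptions, logical AND is idempotent in the same way as bitwise OR, so duplicates and repeated NULLs do not affect the outcome.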

### Comment 4
<location> `presto-main-base/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestRemoveInsensitiveAggregateDistinct.java:143` </location>
<code_context>
+    @Test
+    public void testMixedDistinct()
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding a test for aggregations with filters or order by clauses.

Please include test cases with filters and order by clauses in aggregations to ensure the rule handles these scenarios correctly.
</issue_to_address>


Labels

from:IBM PR from IBM

Development

Successfully merging this pull request may close these issues.

Optimize min(distinct x) to drop 'distinct'

3 participants