
Conversation

@anton-kutuzov
Contributor

Description

Join order can be misestimated due to the uniform distribution assumption in StatisticRange.overlapPercentWith().

When a column has a very wide numeric range but few distinct values (e.g., distinctValues = 14, low = 1, high = 3.6e9), the current overlap estimate becomes extremely small (e.g., 8.19e-10), which underestimates join cardinalities.

Example:

SELECT *
FROM table1 t1
JOIN table2 t2
  ON t1.eid = t2.eid
WHERE CAST(event_date AS DATE) = DATE '2025-09-07'
  AND t1.platform_id IN (1, 2, 3, 4);

table1 is large and table2 is small.
The column platform_id in table1 has 14 distinct values, with low = 1 and high = 3,662,098,119.
In this case, the method StatisticRange.overlapPercentWith() estimates the overlap as
(4 - 1) / (3,662,098,119 - 1) ≈ 8.19e-10
which effectively means “all rows are filtered out”.
But in reality, the filter IN (1,2,3,4) should keep roughly 4 out of 14 values (~29%).

Solution:
Introduce a density check, density = distinctValues / (high - low), and combine the uniform overlap with an NDV-based estimate when the density is low.
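
A rough sketch of the idea (the names mirror the diff discussed below; the threshold value and the minExcludeNaN body here are illustrative assumptions, not the exact patch):

final class OverlapSketch
{
    // Placeholder threshold; the value used by the actual patch may differ.
    private static final double DENSITY_HEURISTIC_THRESHOLD = 0.01;

    // Overlap of this range [low, high] with [otherLow, otherHigh];
    // distinctValues / otherDistinctValues are the NDVs (NaN when unknown).
    static double overlapPercent(
            double low, double high, double distinctValues,
            double otherLow, double otherHigh, double otherDistinctValues)
    {
        double length = high - low;
        double otherLength = otherHigh - otherLow;
        double lengthOfIntersect = Math.min(high, otherHigh) - Math.max(low, otherLow);

        double thisDensity = distinctValues / length;        // e.g. 14 / 3.66e9 ≈ 3.8e-9
        double otherDensity = otherDistinctValues / otherLength;

        // Few distinct values spread over a huge range: the uniform-distribution
        // overlap collapses toward 0, so fall back to an NDV ratio instead.
        if (!Double.isNaN(thisDensity) && !Double.isNaN(otherDensity)
                && Double.isFinite(length) && Double.isFinite(otherLength)
                && minExcludeNaN(thisDensity, otherDensity) < DENSITY_HEURISTIC_THRESHOLD) {
            return minExcludeNaN(distinctValues, otherDistinctValues) / distinctValues;
        }
        // Existing behavior: uniform-distribution assumption.
        return lengthOfIntersect / length;
    }

    // Prefers the non-NaN argument (mirrors Trino's MoreMath.minExcludeNaN).
    static double minExcludeNaN(double a, double b)
    {
        if (Double.isNaN(a)) {
            return b;
        }
        if (Double.isNaN(b)) {
            return a;
        }
        return Math.min(a, b);
    }
}

With the example above, thisDensity ≈ 14 / 3.66e9 falls far below any reasonable threshold, so the NDV branch returns 4 / 14 ≈ 0.29 instead of 8.19e-10.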

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot

cla-bot bot commented Dec 7, 2025

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Anton Kutuzov.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check that your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using: git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

@anton-kutuzov force-pushed the fix_overlap_percent_low_density branch from e52a475 to 9479bf0 on December 7, 2025 19:07
@cla-bot bot added the cla-signed label on Dec 7, 2025
@anton-kutuzov force-pushed the fix_overlap_percent_low_density branch from 9479bf0 to bf0ef28 on December 7, 2025 19:44
@wendigo requested a review from findepi on December 9, 2025 12:38
double otherDensity = other.distinctValues / other.length();
double minDensity = minExcludeNaN(thisDensity, otherDensity);

if (!isNaN(thisDensity) && !isNaN(otherDensity)
Member

!isNaN(thisDensity) && !isNaN(otherDensity) -> !isNaN(minDensity)

if (!isNaN(thisDensity) && !isNaN(otherDensity)
&& isFinite(length()) && isFinite(other.length())
&& minDensity < DENSITY_HEURISTIC_THRESHOLD) {
return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;
Member

can this be return min(other.distinctValues / this.distinctValues, 1); ?

Contributor Author

I think that we cannot use:
min(other.distinctValues / this.distinctValues, 1)
because in cases like IN (1, 2, 3, 4), the distinctValues for other is NaN. Then:
min(other.distinctValues / this.distinctValues, 1) = min(NaN, 1) = NaN
so the result would be NaN, which is not valid.

Instead, we should use:
minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues
or equivalently:
minExcludeNaN(other.distinctValues / this.distinctValues, 1)


Also, I looked more carefully at the idea of removing the weight I added earlier. If we do that, the estimate goes from about 0.29 to 1, because other.distinctValues is NaN.
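
For illustration, the relevant plain-Java semantics (minExcludeNaN as in MoreMath, preferring the non-NaN argument):

double thisDistinct = 14;
double otherDistinct = Double.NaN; // NDV of the filter range is unknown

double a = Math.min(otherDistinct / thisDistinct, 1);                  // NaN: min() propagates NaN
double b = minExcludeNaN(otherDistinct / thisDistinct, 1);             // 1.0: the NaN argument is ignored
double c = minExcludeNaN(thisDistinct, otherDistinct) / thisDistinct;  // 14 / 14 = 1.0, equivalent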

}

if (lengthOfIntersect > 0) {
double thisDensity = this.distinctValues / length();
Member

Please add a code comment explaining this section

double minDensity = minExcludeNaN(thisDensity, otherDensity);

if (!isNaN(thisDensity) && !isNaN(otherDensity)
&& isFinite(length()) && isFinite(other.length())
Member

Why do we check that the lengths are finite?
I think we want to skip lengthOfIntersect == length() case here

if (lengthOfIntersect > 0) {
double thisDensity = this.distinctValues / length();
double otherDensity = other.distinctValues / other.length();
double minDensity = minExcludeNaN(thisDensity, otherDensity);
Member

ExcludeNaN is redundant (we already guard against this and other density being NaN), but it still adds cognitive overhead. I'd inline the minDensity variable and remove ExcludeNaN from its definition.

Comment on lines +132 to +135
if (!isNaN(thisDensity) && !isNaN(otherDensity)
&& isFinite(length()) && isFinite(other.length())
&& minDensity < DENSITY_HEURISTIC_THRESHOLD) {
return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;
Member

Add a comment explaining why this particular logic is here.
In particular, a future person working on this code should understand what would break if they delete these lines. Not that people delete random lines -- however, sometimes new code lines cause new problems, and removing or changing them is always a possibility. It must be understandable what circumstances we should be concerned about when doing so.
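
One possible wording, as a sketch (hypothetical text, not from the patch):

// Heuristic guard against the uniform-distribution assumption breaking down.
// When distinct values are sparse relative to the range width (e.g. NDV = 14
// over [1, 3.66e9]), lengthOfIntersect / length() collapses to ~0 and a
// filter like IN (1, 2, 3, 4) is estimated to remove all rows. Deleting this
// branch reintroduces that underestimate; see the PR description for the
// motivating query.
if (!isNaN(thisDensity) && !isNaN(otherDensity)
        && minDensity < DENSITY_HEURISTIC_THRESHOLD) {
    return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;
}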

StatisticRange smallFilter = range(0, 100, 5);

double overlap = boundaryRange.overlapPercentWith(smallFilter);
assertThat(overlap).isBetween(0.01, 0.5);
Member

that's very wide

StatisticRange sparseRange = range(1, 3662098119.0, 14);
StatisticRange filterRange = range(1, 4, 4);

double expectedOverlap = 4.0 / 14.0;
Member

inline?

StatisticRange boundaryRange = range(0, 10000, 10);
StatisticRange smallFilter = range(0, 100, 5);

double overlap = boundaryRange.overlapPercentWith(smallFilter);
Member

I'd inline this variable and add line break before .isBetween

StatisticRange filterRange = range(100, 200, 5);

double expected = 5.0 / 10.0;
double actual = verySparse.overlapPercentWith(filterRange);
Member

I'd inline this variable and add line break before .isCloseTo

StatisticRange verySparse = range(0, 1e9, 10);
StatisticRange filterRange = range(100, 200, 5);

double expected = 5.0 / 10.0;
Member

inline?


double expected = 5.0 / 10.0;
double actual = verySparse.overlapPercentWith(filterRange);
assertThat(actual).isCloseTo(expected, within(0.1));
Member

0.1 is a large error margin given that the expected value is 0.5; it could be 0.0001.

assertOverlap(unboundedRange(0.0), unboundedRange(0), 0);
}

@Test
Member

I love unit tests... except they actually don't really tell the story.
Sometimes it's just testing that a particular math formula produces a particular result... which feels like testing Java itself.

When it's not obvious why we expect a given result, it's not clear what to do when a test fails after a change. Oftentimes, updating the test is all that's needed (formula changed ⇒ result changed ⇒ expected test value changed).
It would be more interesting to see a test where the expected value is harder to dispute. TestTpcdsLocalStats is an example of such a test -- it simply checks that the estimated row count matches reality. Reality cannot be disputed, and being close to it is a generally desired property of the estimate.
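
For flavor, such a reality-anchored check might look roughly like this (the helper name and tolerance are hypothetical; TestTpcdsLocalStats has its own harness):

@Test
public void testSparseNdvInFilter()
{
    // Hypothetical helper: run the query on real TPC-DS data and assert that
    // the planner's estimated output row count is within 10% of the actual
    // row count.
    assertEstimatedRowCountCloseToActual(
            "SELECT * FROM store_sales WHERE ss_quantity IN (1, 2, 3, 4)",
            0.10);
}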

@findepi
Member

findepi commented Dec 10, 2025

Do I understand correctly that in the OP context, for a platform_id = 1 OR platform_id = 2 OR platform_id = 3 OR platform_id = 4 filter we would produce a good estimate, and for platform_id IN (1, 2, 3, 4) we would produce a poor one?

@raunaqmorarka
Member

> Do I understand correctly that in the OP context, for a platform_id = 1 OR platform_id = 2 OR platform_id = 3 OR platform_id = 4 filter we would produce a good estimate, and for platform_id IN (1, 2, 3, 4) we would produce a poor one?

Those ORs would get simplified to an IN. The problem here comes from another optimization that simplifies such filters to a BETWEEN 1 AND 4 filter for better efficiency (SimplifyContinuousInValues), and that changes the estimate.
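
For illustration, in terms of the range(low, high, ndv) helper used in the tests above (values from the PR description; modeling the BETWEEN filter's NDV as NaN is an assumption):

// platform_id in table1: 14 distinct values over [1, 3662098119]
StatisticRange platformId = range(1, 3662098119.0, 14);

// Kept as IN (1, 2, 3, 4), the filter carries NDV = 4, so an NDV-based
// estimate can recover 4 / 14 ≈ 0.29.
StatisticRange inFilter = range(1, 4, 4);

// Rewritten by SimplifyContinuousInValues to BETWEEN 1 AND 4, the NDV is
// lost and only the positional overlap remains:
// (4 - 1) / (3662098119 - 1) ≈ 8.19e-10.
StatisticRange betweenFilter = range(1, 4, Double.NaN);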
