
Conversation

@anton-kutuzov
Contributor

Description

Join order can be misestimated due to the uniform distribution assumption in StatisticRange.overlapPercentWith().

When a column has a very wide numeric range but few distinct values (e.g., distinctValues = 14, low = 1, high = 3.6e9), the current overlap estimate becomes extremely small (e.g., 8.19e-10), which underestimates join cardinalities.

Example:

SELECT *
FROM table1 t1
JOIN table2 t2
  ON t1.eid = t2.eid
WHERE CAST(event_date AS DATE) = DATE '2025-09-07'
  AND t1.platform_id IN (1, 2, 3, 4);

table1 is large and table2 is small.
The column platform_id in table1 has 14 distinct values, with low = 1 and high = 3,662,098,119.
In this case, the method StatisticRange.overlapPercentWith() estimates the overlap as
(4 - 1) / (3,662,098,119 - 1) ≈ 8.19e-10
which effectively means “all rows are filtered out”.
But in reality, the filter IN (1,2,3,4) should keep roughly 4 out of 14 values (~29%).

Solution:
Introduce a density check, density = distinctValues / (high - low), and combine the uniform overlap with an NDV-based estimate when the density is low.
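
A rough sketch of the idea (the names mirror the diff discussed below; the threshold value and the minExcludeNaN body here are illustrative assumptions, not the exact patch):

final class OverlapSketch
{
    // Placeholder threshold; the value used by the actual patch may differ.
    private static final double DENSITY_HEURISTIC_THRESHOLD = 0.01;

    // Overlap of this range [low, high] with [otherLow, otherHigh];
    // distinctValues / otherDistinctValues are the NDVs (NaN when unknown).
    static double overlapPercent(
            double low, double high, double distinctValues,
            double otherLow, double otherHigh, double otherDistinctValues)
    {
        double length = high - low;
        double otherLength = otherHigh - otherLow;
        double lengthOfIntersect = Math.min(high, otherHigh) - Math.max(low, otherLow);

        double thisDensity = distinctValues / length;        // e.g. 14 / 3.66e9 ≈ 3.8e-9
        double otherDensity = otherDistinctValues / otherLength;

        // Few distinct values spread over a huge range: the uniform-distribution
        // overlap collapses toward 0, so fall back to an NDV ratio instead.
        if (!Double.isNaN(thisDensity) && !Double.isNaN(otherDensity)
                && Double.isFinite(length) && Double.isFinite(otherLength)
                && minExcludeNaN(thisDensity, otherDensity) < DENSITY_HEURISTIC_THRESHOLD) {
            return minExcludeNaN(distinctValues, otherDistinctValues) / distinctValues;
        }
        // Existing behavior: uniform-distribution assumption.
        return lengthOfIntersect / length;
    }

    // Prefers the non-NaN argument (mirrors Trino's MoreMath.minExcludeNaN).
    static double minExcludeNaN(double a, double b)
    {
        if (Double.isNaN(a)) {
            return b;
        }
        if (Double.isNaN(b)) {
            return a;
        }
        return Math.min(a, b);
    }
}

With the example above, thisDensity ≈ 14 / 3.66e9 falls far below any reasonable threshold, so the NDV branch returns 4 / 14 ≈ 0.29 instead of 8.19e-10.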

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot

cla-bot bot commented Dec 7, 2025

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Anton Kutuzov.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check that your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using: git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings; see https://github.com/settings/emails

@anton-kutuzov force-pushed the fix_overlap_percent_low_density branch from e52a475 to 9479bf0 on December 7, 2025 19:07
@cla-bot bot added the cla-signed label on Dec 7, 2025
@anton-kutuzov force-pushed the fix_overlap_percent_low_density branch from 9479bf0 to bf0ef28 on December 7, 2025 19:44
@wendigo requested a review from findepi on December 9, 2025 12:38
double otherDensity = other.distinctValues / other.length();
double minDensity = minExcludeNaN(thisDensity, otherDensity);

if (!isNaN(thisDensity) && !isNaN(otherDensity)
Member

!isNaN(thisDensity) && !isNaN(otherDensity) -> !isNaN(minDensity)

if (!isNaN(thisDensity) && !isNaN(otherDensity)
&& isFinite(length()) && isFinite(other.length())
&& minDensity < DENSITY_HEURISTIC_THRESHOLD) {
return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;
Member

can this be return min(other.distinctValues / this.distinctValues, 1); ?

Contributor Author

I think that we cannot use:
min(other.distinctValues / this.distinctValues, 1)
because in cases like IN (1, 2, 3, 4), the distinctValues for other is NaN. Then:
min(other.distinctValues / this.distinctValues, 1) = min(NaN, 1) = NaN
so the result would be NaN, which is not valid.

Instead, we should use:
minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues
or equivalently:
minExcludeNaN(other.distinctValues / this.distinctValues, 1)


Also, I looked more carefully at the idea of removing the weight I added earlier. If we do that, the estimate goes from about 0.29 to 1, because other.distinctValues is NaN.
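
For illustration, the relevant plain-Java semantics (minExcludeNaN as in MoreMath, preferring the non-NaN argument):

double thisDistinct = 14;
double otherDistinct = Double.NaN; // NDV of the filter range is unknown

double a = Math.min(otherDistinct / thisDistinct, 1);                  // NaN: min() propagates NaN
double b = minExcludeNaN(otherDistinct / thisDistinct, 1);             // 1.0: the NaN argument is ignored
double c = minExcludeNaN(thisDistinct, otherDistinct) / thisDistinct;  // 14 / 14 = 1.0, equivalent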

}

if (lengthOfIntersect > 0) {
double thisDensity = this.distinctValues / length();
Member

Please add a code comment explaining this section

double minDensity = minExcludeNaN(thisDensity, otherDensity);

if (!isNaN(thisDensity) && !isNaN(otherDensity)
&& isFinite(length()) && isFinite(other.length())
Member

Why do we check that the lengths are finite?
I think we want to skip lengthOfIntersect == length() case here

if (lengthOfIntersect > 0) {
double thisDensity = this.distinctValues / length();
double otherDensity = other.distinctValues / other.length();
double minDensity = minExcludeNaN(thisDensity, otherDensity);
Member

ExcludeNaN is redundant (we already guard against this and other density being NaN), but it still adds cognitive overhead. I'd inline the minDensity variable and remove ExcludeNaN from its definition.

Comment on lines +132 to +135
if (!isNaN(thisDensity) && !isNaN(otherDensity)
&& isFinite(length()) && isFinite(other.length())
&& minDensity < DENSITY_HEURISTIC_THRESHOLD) {
return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;
Member

Add a comment explaining why this particular logic is here.
In particular, a future person working on this code should understand what would break if they delete these lines. Not that people delete random lines -- however, sometimes new code lines cause new problems, and removing or changing them is always a possibility. It must be understandable what circumstances we should be concerned about when doing so.
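
One possible wording, as a sketch (hypothetical text, not from the patch):

// Heuristic guard against the uniform-distribution assumption breaking down.
// When distinct values are sparse relative to the range width (e.g. NDV = 14
// over [1, 3.66e9]), lengthOfIntersect / length() collapses to ~0 and a
// filter like IN (1, 2, 3, 4) is estimated to remove all rows. Deleting this
// branch reintroduces that underestimate; see the PR description for the
// motivating query.
if (!isNaN(thisDensity) && !isNaN(otherDensity)
        && minDensity < DENSITY_HEURISTIC_THRESHOLD) {
    return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;
}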

StatisticRange smallFilter = range(0, 100, 5);

double overlap = boundaryRange.overlapPercentWith(smallFilter);
assertThat(overlap).isBetween(0.01, 0.5);
Member

that's very wide

StatisticRange sparseRange = range(1, 3662098119.0, 14);
StatisticRange filterRange = range(1, 4, 4);

double expectedOverlap = 4.0 / 14.0;
Member

inline?

StatisticRange boundaryRange = range(0, 10000, 10);
StatisticRange smallFilter = range(0, 100, 5);

double overlap = boundaryRange.overlapPercentWith(smallFilter);
Member

I'd inline this variable and add line break before .isBetween

StatisticRange filterRange = range(100, 200, 5);

double expected = 5.0 / 10.0;
double actual = verySparse.overlapPercentWith(filterRange);
Member

I'd inline this variable and add line break before .isCloseTo

StatisticRange verySparse = range(0, 1e9, 10);
StatisticRange filterRange = range(100, 200, 5);

double expected = 5.0 / 10.0;
Member

inline?


double expected = 5.0 / 10.0;
double actual = verySparse.overlapPercentWith(filterRange);
assertThat(actual).isCloseTo(expected, within(0.1));
Member

0.1 is a large error margin given that the expected value is 0.5; it could be 0.0001.

assertOverlap(unboundedRange(0.0), unboundedRange(0), 0);
}

@Test
Member

I love unit tests... except they actually don't really tell the story.
Sometimes it's just testing that a particular math formula produces a particular result... which feels like testing Java itself.

When it's not obvious why we expect a given result, it's not clear what to do when a test fails after a change. Oftentimes, updating the test is all that's needed (formula changed ⇒ result changed ⇒ expected test value changed).
It would be more interesting to see a test where the expected value is harder to dispute. TestTpcdsLocalStats is an example of such a test -- it simply checks that the estimated row count matches reality. Reality cannot be disputed, and being close to it is a generally desired property of the estimate.
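
For flavor, such a reality-anchored check might look roughly like this (the helper name and tolerance are hypothetical; TestTpcdsLocalStats has its own harness):

@Test
public void testSparseNdvInFilter()
{
    // Hypothetical helper: run the query on real TPC-DS data and assert that
    // the planner's estimated output row count is within 10% of the actual
    // row count.
    assertEstimatedRowCountCloseToActual(
            "SELECT * FROM store_sales WHERE ss_quantity IN (1, 2, 3, 4)",
            0.10);
}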

@findepi
Member

findepi commented Dec 10, 2025

Do I understand correctly that in the OP context, for a platform_id = 1 OR platform_id = 2 OR platform_id = 3 OR platform_id = 4 filter we would produce a good estimate, and for platform_id IN (1, 2, 3, 4) we would produce a poor one?

@raunaqmorarka
Member

> Do I understand correctly that in the OP context, for a platform_id = 1 OR platform_id = 2 OR platform_id = 3 OR platform_id = 4 filter we would produce a good estimate, and for platform_id IN (1, 2, 3, 4) we would produce a poor one?

Those ORs would get simplified to an IN. The problem here comes from another optimization that simplifies such filters to a BETWEEN 1 AND 4 filter for better efficiency (SimplifyContinuousInValues), and that changes the estimate.
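
For illustration, in terms of the range(low, high, ndv) helper used in the tests above (values from the PR description; modeling the BETWEEN filter's NDV as NaN is an assumption):

// platform_id in table1: 14 distinct values over [1, 3662098119]
StatisticRange platformId = range(1, 3662098119.0, 14);

// Kept as IN (1, 2, 3, 4), the filter carries NDV = 4, so an NDV-based
// estimate can recover 4 / 14 ≈ 0.29.
StatisticRange inFilter = range(1, 4, 4);

// Rewritten by SimplifyContinuousInValues to BETWEEN 1 AND 4, the NDV is
// lost and only the positional overlap remains:
// (4 - 1) / (3662098119 - 1) ≈ 8.19e-10.
StatisticRange betweenFilter = range(1, 4, Double.NaN);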
