Improve overlap percent estimation for low-density ranges in StatisticRange #27570
Conversation
double otherDensity = other.distinctValues / other.length();
double minDensity = minExcludeNaN(thisDensity, otherDensity);

if (!isNaN(thisDensity) && !isNaN(otherDensity)
!isNaN(thisDensity) && !isNaN(otherDensity) -> !isNaN(minDensity)
if (!isNaN(thisDensity) && !isNaN(otherDensity)
        && isFinite(length()) && isFinite(other.length())
        && minDensity < DENSITY_HEURISTIC_THRESHOLD) {
    return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;
can this be return min(other.distinctValues / this.distinctValues, 1); ?
I think that we cannot use:

min(other.distinctValues / this.distinctValues, 1)

because in cases like IN (1, 2, 3, 4), the distinctValues for other is NaN. Then:

min(other.distinctValues / this.distinctValues, 1) = min(NaN, 1) = NaN

so the result would be NaN, which is not valid. Instead, we should use:

minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues

or, equivalently:

minExcludeNaN(other.distinctValues / this.distinctValues, 1)

Also, I looked more carefully at the idea of removing the weight I added before. If we do that, the estimate will go from about 0.29 to 1, because other.distinctValues is NaN.
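
A tiny runnable Java demo of the NaN behavior described above (the helper below is an illustrative stand-in for minExcludeNaN, not the project's actual implementation):

    public class NaNMinDemo
    {
        // Illustrative stand-in for the NaN-aware helper discussed above:
        // returns the non-NaN argument when exactly one input is NaN.
        static double minExcludeNaN(double a, double b)
        {
            if (Double.isNaN(a)) {
                return b;
            }
            if (Double.isNaN(b)) {
                return a;
            }
            return Math.min(a, b);
        }

        public static void main(String[] args)
        {
            double thisDistinctValues = 14;
            double otherDistinctValues = Double.NaN; // unknown NDV, as for IN (1, 2, 3, 4)

            // Plain min propagates NaN and poisons the estimate:
            System.out.println(Math.min(otherDistinctValues / thisDistinctValues, 1)); // NaN

            // The NaN-aware variant falls back to the known value:
            System.out.println(minExcludeNaN(thisDistinctValues, otherDistinctValues) / thisDistinctValues); // 1.0
        }
    }

The second line printing 1.0 matches the observation above that dropping the weight would push the estimate from ~0.29 to 1 when other.distinctValues is NaN.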
}

if (lengthOfIntersect > 0) {
    double thisDensity = this.distinctValues / length();
Please add a code comment explaining this section
double minDensity = minExcludeNaN(thisDensity, otherDensity);

if (!isNaN(thisDensity) && !isNaN(otherDensity)
        && isFinite(length()) && isFinite(other.length())
Why do we check that the lengths are finite?
I think we want to skip the lengthOfIntersect == length() case here.
if (lengthOfIntersect > 0) {
    double thisDensity = this.distinctValues / length();
    double otherDensity = other.distinctValues / other.length();
    double minDensity = minExcludeNaN(thisDensity, otherDensity);
ExcludeNaN is redundant here (we already guard against this and other density being NaN), but it is still cognitive overhead. I'd inline the minDensity variable and remove ExcludeNaN from its definition.
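
A sketch of the inlined form being suggested (keeping the guard from the diff unchanged):

    if (lengthOfIntersect > 0) {
        double thisDensity = this.distinctValues / length();
        double otherDensity = other.distinctValues / other.length();

        // Plain min is safe here: both densities are checked for NaN
        // earlier in the same condition.
        if (!isNaN(thisDensity) && !isNaN(otherDensity)
                && isFinite(length()) && isFinite(other.length())
                && min(thisDensity, otherDensity) < DENSITY_HEURISTIC_THRESHOLD) {
            return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;
        }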
if (!isNaN(thisDensity) && !isNaN(otherDensity)
        && isFinite(length()) && isFinite(other.length())
        && minDensity < DENSITY_HEURISTIC_THRESHOLD) {
    return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;
Add a comment explaining why this particular logic is here.
In particular, a future person working on this code should understand what would break if they deleted these lines. Not that people delete random lines; however, new code lines sometimes cause new problems, and removing or changing them is always a possibility. It must be clear what circumstances we should be concerned about when doing so.
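
One possible shape for such a comment, drafted from the PR description (wording is illustrative only, not part of the patch):

    // Low-density heuristic: when a column spans a very wide numeric range but
    // has few distinct values (density below DENSITY_HEURISTIC_THRESHOLD), the
    // uniform-distribution estimate lengthOfIntersect / length() collapses to
    // ~0 and join cardinality is badly underestimated (e.g. the IN (1, 2, 3, 4)
    // example in the PR description). Fall back to an NDV-based ratio instead;
    // deleting this branch reintroduces that underestimation.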
StatisticRange smallFilter = range(0, 100, 5);

double overlap = boundaryRange.overlapPercentWith(smallFilter);
assertThat(overlap).isBetween(0.01, 0.5);
that's very wide
StatisticRange sparseRange = range(1, 3662098119.0, 14);
StatisticRange filterRange = range(1, 4, 4);

double expectedOverlap = 4.0 / 14.0;
inline?
StatisticRange boundaryRange = range(0, 10000, 10);
StatisticRange smallFilter = range(0, 100, 5);

double overlap = boundaryRange.overlapPercentWith(smallFilter);
I'd inline this variable and add a line break before .isBetween.
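
For example (a sketch of the suggested shape):

    assertThat(boundaryRange.overlapPercentWith(smallFilter))
            .isBetween(0.01, 0.5);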
StatisticRange filterRange = range(100, 200, 5);

double expected = 5.0 / 10.0;
double actual = verySparse.overlapPercentWith(filterRange);
I'd inline this variable and add a line break before .isCloseTo.
StatisticRange verySparse = range(0, 1e9, 10);
StatisticRange filterRange = range(100, 200, 5);

double expected = 5.0 / 10.0;
inline?
double expected = 5.0 / 10.0;
double actual = verySparse.overlapPercentWith(filterRange);
assertThat(actual).isCloseTo(expected, within(0.1));
0.1 is a lot of error margin given the expected value is 0.5. It could be 0.0001.
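
Combined with the inlining suggestion above, the assertion might end up as (a sketch, using the tighter tolerance):

    assertThat(verySparse.overlapPercentWith(filterRange))
            .isCloseTo(5.0 / 10.0, within(0.0001));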
assertOverlap(unboundedRange(0.0), unboundedRange(0), 0);
}

@Test
I love unit tests... except they actually don't really tell the story.
Sometimes it's just testing that a particular math formula produces a particular result... which feels like testing Java itself.
When it's not obvious why we expect a given result, it's not clear what to do when a test fails after a change. Oftentimes, updating the test is all that's needed (formula changed ⇒ result changed ⇒ test expected value changed).
It would be more interesting to see a test where the expected value is harder to dispute. TestTpcdsLocalStats is an example of such a test -- it simply checks that the estimated row count matches reality. Reality cannot be disputed, and an estimate being close to it is a generally desired property of the estimate.
Do I understand correctly that in the OP context, for a ...

Those ORs would get simplified to an IN. The problem here comes from another optimization, where we simplify filters like that to a BETWEEN 1 AND 4 filter for better efficiency (SimplifyContinuousInValues), and that changes the estimate.
Description

Join order can be misestimated due to the uniform distribution assumption in StatisticRange.overlapPercentWith(). When a column has a very wide numeric range but few distinct values (e.g., distinctValues = 14, low = 1, high = 3.6e9), the current overlap estimation becomes extremely small (e.g., 8.19e-10), underestimating join cardinalities.

Example: table1 is large and table2 is small. The column platform_id in table1 has 14 distinct values, with low = 1 and high = 3 662 098 119. In this case, the method StatisticRange.overlapPercentWith() estimates the overlap as

(4 - 1) / (3,662,098,119 - 1) ≈ 8.19e-10

which effectively means "all rows are filtered out". But in reality, the filter IN (1, 2, 3, 4) should keep roughly 4 out of 14 values (~29%).

Solution: Introduce a density check density = distinctValues / (high - low) and combine the uniform overlap with an NDV-based estimate when density is low.

Additional context and related issues
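
To make the solution above concrete, here is a minimal, self-contained Java sketch assembled from the diff fragments in the conversation. The names DENSITY_HEURISTIC_THRESHOLD and minExcludeNaN mirror the patch, but the threshold value and the helper body here are assumptions for illustration, not the actual implementation:

    import static java.lang.Double.isFinite;
    import static java.lang.Double.isNaN;

    public class LowDensityOverlapSketch
    {
        // Placeholder value; the real constant lives in StatisticRange.
        private static final double DENSITY_HEURISTIC_THRESHOLD = 1e-6;

        // Illustrative stand-in for the NaN-aware min helper used in the patch.
        static double minExcludeNaN(double a, double b)
        {
            if (isNaN(a)) {
                return b;
            }
            if (isNaN(b)) {
                return a;
            }
            return Math.min(a, b);
        }

        static double overlapPercent(double low, double high, double distinctValues,
                double filterLow, double filterHigh, double filterDistinctValues)
        {
            double length = high - low;
            double filterLength = filterHigh - filterLow;
            double lengthOfIntersect = Math.min(high, filterHigh) - Math.max(low, filterLow);

            if (lengthOfIntersect > 0) {
                double thisDensity = distinctValues / length;
                double otherDensity = filterDistinctValues / filterLength;
                double minDensity = minExcludeNaN(thisDensity, otherDensity);

                // Low-density heuristic: the uniform estimate collapses to ~0
                // for very wide, sparse ranges, so fall back to an NDV-based ratio.
                if (!isNaN(thisDensity) && !isNaN(otherDensity)
                        && isFinite(length) && isFinite(filterLength)
                        && minDensity < DENSITY_HEURISTIC_THRESHOLD) {
                    return minExcludeNaN(distinctValues, filterDistinctValues) / distinctValues;
                }
                // Default uniform-distribution estimate.
                return lengthOfIntersect / length;
            }
            return 0;
        }

        public static void main(String[] args)
        {
            // The example from the description: 14 NDVs over [1, 3662098119],
            // filtered by IN (1, 2, 3, 4) simplified to BETWEEN 1 AND 4.
            System.out.println(overlapPercent(1, 3_662_098_119.0, 14, 1, 4, 4));
        }
    }

With the heuristic, the example prints ~0.2857 (4 / 14), instead of the ~8.19e-10 produced by the uniform estimate alone.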
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: