Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions core/trino-main/src/main/java/io/trino/cost/StatisticRange.java
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ public class StatisticRange
{
private static final double INFINITE_TO_FINITE_RANGE_INTERSECT_OVERLAP_HEURISTIC_FACTOR = 0.25;
private static final double INFINITE_TO_INFINITE_RANGE_INTERSECT_OVERLAP_HEURISTIC_FACTOR = 0.5;
private static final double DENSITY_HEURISTIC_THRESHOLD = 1e-3;

// TODO unify field and method names with SymbolStatsEstimate
/**
Expand Down Expand Up @@ -122,7 +123,17 @@ public double overlapPercentWith(StatisticRange other)
if (isInfinite(length()) && isFinite(lengthOfIntersect)) {
return INFINITE_TO_FINITE_RANGE_INTERSECT_OVERLAP_HEURISTIC_FACTOR;
}

if (lengthOfIntersect > 0) {
double thisDensity = this.distinctValues / length();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a code comment explaining this section

double otherDensity = other.distinctValues / other.length();
double minDensity = minExcludeNaN(thisDensity, otherDensity);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExcludeNaN is redundant (we guard against this and other density being nan) but still is a cognitive overhead. I'd inline minDensity variable and remove ExcludeNaN from it's definition.


if (!isNaN(thisDensity) && !isNaN(otherDensity)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!isNaN(thisDensity) && !isNaN(otherDensity) -> !isNaN(minDensity)

&& isFinite(length()) && isFinite(other.length())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we check that the lengths are finite ?
I think we want to skip lengthOfIntersect == length() case here

&& minDensity < DENSITY_HEURISTIC_THRESHOLD) {
return minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be return min(other.distinctValues / this.distinctValues, 1); ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we cannot use:
min(other.distinctValues / this.distinctValues, 1)
because in cases like IN (1, 2, 3, 4), the distinctValues for other is NaN. Then:
min(other.distinctValues / this.distinctValues, 1) = min(NaN, 1) = NaN
so the result would be NaN, which is not valid.

Instead, we should use:
minExcludeNaN(this.distinctValues, other.distinctValues) / this.distinctValues
or equivalently:
minExcludeNaN(other.distinctValues / this.distinctValues, 1)


Also, I looked more carefully at the idea of removing the weight I added before. If we do that, the estimate will go from about 0.29 to 1, because other.distinctValues is NaN

Comment on lines +132 to +135
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining why this particular logic is here.
In particular, a future person working on this code should understand what would break if they delete these lines. Not that people delete random lines -- however. sometimes new code lines cause new problems and removing or changing them is always a possibility. It must be understandable what circumstances we should be concerned about when doing so.

}
return lengthOfIntersect / length();
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
import static java.lang.Double.NaN;
import static java.lang.Double.POSITIVE_INFINITY;
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.AssertionsForClassTypes.within;

public class TestStatisticRange
{
Expand Down Expand Up @@ -59,6 +60,55 @@ public void testOverlapPercentWith()
assertOverlap(unboundedRange(0.0), unboundedRange(0), 0);
}

@Test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i love unit tests... except they actually don't really tell the story
sometimes it's just testing that a particular math formula produces a particular result... which feels like testing Java itself.

When it's not obvious why we expect given result, then it's not clear to do what to do when a test fails on a change. Oftentimes, updating the test is there's needed (formula changed ⇒ result change ⇒ test expected value changed).
It would be more interesting to see a test where expected value is harder to dispute. TestTpcdsLocalStats is an example of such test -- it simply tests that estimated row count matches reality. The reality cannot be disputed and estimate being close to it is generally desired property of the estimate.

public void testLowDensityOverlap()
{
StatisticRange sparseRange = range(1, 3662098119.0, 14);
StatisticRange filterRange = range(1, 4, 4);

double expectedOverlap = 4.0 / 14.0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline?

assertOverlap(sparseRange, filterRange, expectedOverlap);
}

@Test
public void testDensityThresholdBoundary()
{
StatisticRange boundaryRange = range(0, 10000, 10);
StatisticRange smallFilter = range(0, 100, 5);

double overlap = boundaryRange.overlapPercentWith(smallFilter);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd inline this variable and add line break before .isBetween

assertThat(overlap).isBetween(0.01, 0.5);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's very wide

}

@Test
public void testHighDensityOverlap()
{
StatisticRange denseRange = range(0, 100, 50);
StatisticRange filterRange = range(20, 30, 5);

assertOverlap(denseRange, filterRange, 0.1);
}

@Test
public void testVeryLowDensity()
{
StatisticRange verySparse = range(0, 1e9, 10);
StatisticRange filterRange = range(100, 200, 5);

double expected = 5.0 / 10.0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline?

double actual = verySparse.overlapPercentWith(filterRange);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd inline this variable and add line break before .isCloseTo

assertThat(actual).isCloseTo(expected, within(0.1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.1 is a lot error margin given the expected is 0.5.
could be 0.0001

}

@Test
public void testDensityWithZeroDistinctValues()
{
StatisticRange zeroDistinct = range(0, 1000, 0);
StatisticRange filterRange = range(100, 200, 5);

assertOverlap(zeroDistinct, filterRange, 0);
}

@Test
public void testIntersect()
{
Expand Down