@feniljain
Contributor

@feniljain feniljain commented Nov 4, 2025

Which issue does this PR close?

What changes are included in this PR?

Added a distinct-element calculator to the core hash join loop. It assumes that indices are returned in increasing order. I couldn't find a place where this assumption is broken, but if that's not the case, please let me know.

Also, I am not 100% sure my implementation for avg_fanout is correct, so do let me know if that needs changes.

Are these changes tested?

There are no failures in the sqllogictests or in the tests under datafusion/core/tests/sql/. Should I add a test case for this?

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Nov 4, 2025
@feniljain feniljain force-pushed the feat-hash-join-metrics branch from 7ba1336 to e00d993 Compare November 4, 2025 19:14
@2010YOUY01
Contributor

Thank you, I think the implementation is correct.

The only consideration is performance. The hash join implementation is definitely on the performance-critical path, so we have to be careful not to introduce additional overhead. This PR should be good to go if we can verify that it has no impact on performance.

In this PR, the extra overhead is that, for each batch, we count the distinct values in a sorted vector like [0,1,1,2,2,2...] that is at most one batch size long, so it seems unlikely to be the bottleneck. (@alamb Could you help trigger the benchmark please?)
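
A minimal sketch of the counting step described above, assuming the matched indices arrive as a `u64` slice sorted in non-decreasing order. The function name `count_distinct_sorted` is hypothetical and this is not the code from the PR:

```rust
/// Counts the distinct values in a slice that is sorted in non-decreasing order.
/// Assumption: the caller guarantees the ordering; on unsorted input this
/// returns the number of runs of equal values rather than the distinct count.
fn count_distinct_sorted(indices: &[u64]) -> usize {
    if indices.is_empty() {
        return 0;
    }
    // The first element always starts a run; every adjacent pair that
    // differs starts another one.
    1 + indices
        .windows(2)
        .filter(|pair| pair[0] != pair[1])
        .count()
}
```

For the example vector `[0, 1, 1, 2, 2, 2]` there are three runs, so the function returns 3. The work is a single linear pass over at most one batch of indices.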

I believe these metrics provide more insight than simply computing output_rows / input_rows for equal joins. However, if they introduce noticeable overhead, we can move them under ExplainAnalyzeLevel::Dev, and track them only when this extra-verbose level is enabled. We should also document that more detailed analyze levels may incur additional execution overhead but offer deeper insights.
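
To illustrate why such metrics say more than output_rows / input_rows alone, here is a rough sketch. The struct `JoinCounters`, its field names, and the exact definitions below (hit rate as matched probe rows over probe rows, average fanout as output rows over matched probe rows) are assumptions for illustration, not necessarily what the PR implements:

```rust
/// Hypothetical counters accumulated while probing a hash join.
struct JoinCounters {
    /// Rows read from the probe side.
    probe_rows: usize,
    /// Distinct probe rows that found at least one match.
    matched_probe_rows: usize,
    /// Rows produced by the join.
    output_rows: usize,
}

/// Returns `num / den` as a float, or `None` when the denominator is zero.
fn ratio(num: usize, den: usize) -> Option<f64> {
    if den == 0 {
        None
    } else {
        Some(num as f64 / den as f64)
    }
}

impl JoinCounters {
    /// Fraction of probe rows that matched anything (assumed definition).
    fn probe_hit_rate(&self) -> Option<f64> {
        ratio(self.matched_probe_rows, self.probe_rows)
    }

    /// Average number of output rows per matched probe row (assumed definition).
    fn avg_fanout(&self) -> Option<f64> {
        ratio(self.output_rows, self.matched_probe_rows)
    }

    /// The plain output/input ratio, which mixes the two effects above.
    fn output_ratio(&self) -> Option<f64> {
        ratio(self.output_rows, self.probe_rows)
    }
}
```

With 4 probe rows, 2 of which match and together produce 6 output rows, the plain ratio is 1.5, while the split view shows a hit rate of 0.5 and a fanout of 3.0. The same 1.5 could also come from every row matching with a much lower fanout, which is the distinction the extra metrics expose.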

@xudong963
Member

> Thank you, I think the implementation is correct.
>
> The only consideration is performance. The hash join implementation is definitely on the performance-critical path, so we have to be careful not to introduce additional overhead. This PR should be good to go if we can verify that it has no impact on performance.
>
> In this PR, the extra overhead is that, for each batch, we count the distinct values in a sorted vector like [0,1,1,2,2,2...] that is at most one batch size long, so it seems unlikely to be the bottleneck. (@alamb Could you help trigger the benchmark please?)
>
> I believe these metrics provide more insight than simply computing output_rows / input_rows for equal joins. However, if they introduce noticeable overhead, we can move them under ExplainAnalyzeLevel::Dev, and track them only when this extra-verbose level is enabled. We should also document that more detailed analyze levels may incur additional execution overhead but offer deeper insights.

+1. It would be better to profile the classic join patterns to get a clear picture.

@alamb
Contributor

alamb commented Nov 5, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing feat-hash-join-metrics (e00d993) to c5fb605 diff using: tpch_mem
Results will be posted here when complete

@alamb
Contributor

alamb commented Nov 5, 2025

> In this PR, the extra overhead is that, for each batch, we count the distinct values in a sorted vector like [0,1,1,2,2,2...] that is at most one batch size long, so it seems unlikely to be the bottleneck. (@alamb Could you help trigger the benchmark please?)

Kicked it off

@alamb
Contributor

alamb commented Nov 5, 2025

🤖: Benchmark completed

Comparing HEAD and feat-hash-join-metrics
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ feat-hash-join-metrics ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 140.73 ms │              143.40 ms │    no change │
│ QQuery 2     │  26.64 ms │               29.79 ms │ 1.12x slower │
│ QQuery 3     │  34.25 ms │               41.59 ms │ 1.21x slower │
│ QQuery 4     │  29.03 ms │               29.87 ms │    no change │
│ QQuery 5     │  87.12 ms │               89.33 ms │    no change │
│ QQuery 6     │  19.23 ms │               19.52 ms │    no change │
│ QQuery 7     │ 220.66 ms │              232.08 ms │ 1.05x slower │
│ QQuery 8     │  32.49 ms │               34.91 ms │ 1.07x slower │
│ QQuery 9     │ 107.24 ms │              103.10 ms │    no change │
│ QQuery 10    │  62.77 ms │               67.07 ms │ 1.07x slower │
│ QQuery 11    │  17.54 ms │               18.19 ms │    no change │
│ QQuery 12    │  50.81 ms │               52.23 ms │    no change │
│ QQuery 13    │  47.93 ms │               49.43 ms │    no change │
│ QQuery 14    │  13.78 ms │               14.63 ms │ 1.06x slower │
│ QQuery 15    │  24.87 ms │               26.07 ms │    no change │
│ QQuery 16    │  25.58 ms │               25.40 ms │    no change │
│ QQuery 17    │ 149.39 ms │              156.53 ms │    no change │
│ QQuery 18    │ 276.29 ms │              291.48 ms │ 1.05x slower │
│ QQuery 19    │  36.79 ms │               38.34 ms │    no change │
│ QQuery 20    │  49.66 ms │               49.73 ms │    no change │
│ QQuery 21    │ 318.85 ms │              329.51 ms │    no change │
│ QQuery 22    │  22.17 ms │               21.37 ms │    no change │
└──────────────┴───────────┴────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 1793.83ms │
│ Total Time (feat-hash-join-metrics)   │ 1863.55ms │
│ Average Time (HEAD)                   │   81.54ms │
│ Average Time (feat-hash-join-metrics) │   84.71ms │
│ Queries Faster                        │         0 │
│ Queries Slower                        │         7 │
│ Queries with No Change                │        15 │
│ Queries with Failure                  │         0 │
└───────────────────────────────────────┴───────────┘

@alamb
Contributor

alamb commented Nov 5, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing feat-hash-join-metrics (e00d993) to c5fb605 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor

alamb commented Nov 5, 2025

🤖: Benchmark completed

Comparing HEAD and feat-hash-join-metrics
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ feat-hash-join-metrics ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2676.22 ms │             2687.81 ms │     no change │
│ QQuery 1     │  1239.21 ms │             1201.13 ms │     no change │
│ QQuery 2     │  2297.31 ms │             2287.02 ms │     no change │
│ QQuery 3     │  1203.15 ms │             1194.59 ms │     no change │
│ QQuery 4     │  2317.59 ms │             2396.58 ms │     no change │
│ QQuery 5     │ 28535.00 ms │            28593.26 ms │     no change │
│ QQuery 6     │  4243.56 ms │             4140.09 ms │     no change │
│ QQuery 7     │  3738.66 ms │             3537.71 ms │ +1.06x faster │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 46250.69ms │
│ Total Time (feat-hash-join-metrics)   │ 46038.19ms │
│ Average Time (HEAD)                   │  5781.34ms │
│ Average Time (feat-hash-join-metrics) │  5754.77ms │
│ Queries Faster                        │          1 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │          7 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ feat-hash-join-metrics ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.56 ms │                2.30 ms │ +1.11x faster │
│ QQuery 1     │    48.43 ms │               49.43 ms │     no change │
│ QQuery 2     │   137.41 ms │              133.50 ms │     no change │
│ QQuery 3     │   164.42 ms │              163.90 ms │     no change │
│ QQuery 4     │  1212.32 ms │             1092.00 ms │ +1.11x faster │
│ QQuery 5     │  1641.93 ms │             1547.45 ms │ +1.06x faster │
│ QQuery 6     │     2.32 ms │                2.51 ms │  1.08x slower │
│ QQuery 7     │    56.78 ms │               56.49 ms │     no change │
│ QQuery 8     │  1623.39 ms │             1433.44 ms │ +1.13x faster │
│ QQuery 9     │  1985.94 ms │             1803.53 ms │ +1.10x faster │
│ QQuery 10    │   365.86 ms │              362.26 ms │     no change │
│ QQuery 11    │   423.33 ms │              426.94 ms │     no change │
│ QQuery 12    │  1560.49 ms │             1346.06 ms │ +1.16x faster │
│ QQuery 13    │  2258.56 ms │             2137.00 ms │ +1.06x faster │
│ QQuery 14    │  1364.37 ms │             1255.45 ms │ +1.09x faster │
│ QQuery 15    │  1361.68 ms │             1254.45 ms │ +1.09x faster │
│ QQuery 16    │  2744.94 ms │             2704.16 ms │     no change │
│ QQuery 17    │  2699.05 ms │             2691.15 ms │     no change │
│ QQuery 18    │  4902.42 ms │             5318.32 ms │  1.08x slower │
│ QQuery 19    │   128.58 ms │              133.94 ms │     no change │
│ QQuery 20    │  2034.58 ms │             2178.17 ms │  1.07x slower │
│ QQuery 21    │  2335.23 ms │             2493.61 ms │  1.07x slower │
│ QQuery 22    │  3979.30 ms │             4134.97 ms │     no change │
│ QQuery 23    │ 12979.52 ms │            13176.79 ms │     no change │
│ QQuery 24    │   214.84 ms │              222.73 ms │     no change │
│ QQuery 25    │   474.79 ms │              485.79 ms │     no change │
│ QQuery 26    │   218.05 ms │              222.30 ms │     no change │
│ QQuery 27    │  2864.29 ms │             2968.56 ms │     no change │
│ QQuery 28    │ 23459.77 ms │            23870.99 ms │     no change │
│ QQuery 29    │   962.14 ms │              982.48 ms │     no change │
│ QQuery 30    │  1374.68 ms │             1289.52 ms │ +1.07x faster │
│ QQuery 31    │  1408.04 ms │             1348.50 ms │     no change │
│ QQuery 32    │  4598.77 ms │             4668.63 ms │     no change │
│ QQuery 33    │  5701.39 ms │             5612.52 ms │     no change │
│ QQuery 34    │  5967.52 ms │             5664.98 ms │ +1.05x faster │
│ QQuery 35    │  2164.31 ms │             2006.57 ms │ +1.08x faster │
│ QQuery 36    │   124.84 ms │              119.75 ms │     no change │
│ QQuery 37    │    52.35 ms │               51.32 ms │     no change │
│ QQuery 38    │   127.46 ms │              118.98 ms │ +1.07x faster │
│ QQuery 39    │   200.91 ms │              198.17 ms │     no change │
│ QQuery 40    │    40.65 ms │               42.20 ms │     no change │
│ QQuery 41    │    40.50 ms │               38.77 ms │     no change │
│ QQuery 42    │    32.30 ms │               32.18 ms │     no change │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 96041.03ms │
│ Total Time (feat-hash-join-metrics)   │ 95842.78ms │
│ Average Time (HEAD)                   │  2233.51ms │
│ Average Time (feat-hash-join-metrics) │  2228.90ms │
│ Queries Faster                        │         13 │
│ Queries Slower                        │          4 │
│ Queries with No Change                │         26 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ feat-hash-join-metrics ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 142.43 ms │              142.85 ms │     no change │
│ QQuery 2     │  27.62 ms │               28.79 ms │     no change │
│ QQuery 3     │  40.54 ms │               36.59 ms │ +1.11x faster │
│ QQuery 4     │  29.48 ms │               30.18 ms │     no change │
│ QQuery 5     │  90.55 ms │               89.00 ms │     no change │
│ QQuery 6     │  19.70 ms │               19.50 ms │     no change │
│ QQuery 7     │ 227.81 ms │              234.14 ms │     no change │
│ QQuery 8     │  33.79 ms │               33.35 ms │     no change │
│ QQuery 9     │ 108.03 ms │              105.63 ms │     no change │
│ QQuery 10    │  64.37 ms │               66.94 ms │     no change │
│ QQuery 11    │  18.40 ms │               17.97 ms │     no change │
│ QQuery 12    │  52.20 ms │               53.43 ms │     no change │
│ QQuery 13    │  47.74 ms │               52.96 ms │  1.11x slower │
│ QQuery 14    │  14.24 ms │               14.31 ms │     no change │
│ QQuery 15    │  25.47 ms │               25.93 ms │     no change │
│ QQuery 16    │  24.89 ms │               24.85 ms │     no change │
│ QQuery 17    │ 153.19 ms │              161.08 ms │  1.05x slower │
│ QQuery 18    │ 286.54 ms │              285.21 ms │     no change │
│ QQuery 19    │  37.91 ms │               38.49 ms │     no change │
│ QQuery 20    │  49.92 ms │               51.43 ms │     no change │
│ QQuery 21    │ 334.08 ms │              341.07 ms │     no change │
│ QQuery 22    │  21.08 ms │               22.20 ms │  1.05x slower │
└──────────────┴───────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 1849.99ms │
│ Total Time (feat-hash-join-metrics)   │ 1875.92ms │
│ Average Time (HEAD)                   │   84.09ms │
│ Average Time (feat-hash-join-metrics) │   85.27ms │
│ Queries Faster                        │         1 │
│ Queries Slower                        │         3 │
│ Queries with No Change                │        18 │
│ Queries with Failure                  │         0 │
└───────────────────────────────────────┴───────────┘

@feniljain
Contributor Author

Considering the results, I think we should move it to ExplainAnalyzeLevel::Dev then?

@2010YOUY01
Contributor

> Considering the results, I think we should move it to ExplainAnalyzeLevel::Dev then?

The benchmark runs in a noisy cloud environment, and tpch_mem takes quite a short time, so the results might not be accurate.

I’ve verified this with tpch_mem10 locally, and it actually slows down several queries. I tried to make the distinct-index counting faster (and sent a PR, feniljain#2), and after that I think the results look good:

--------------------
Benchmark tpch_mem_sf10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       main ┃ pr-18488-opt ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  224.33 ms │    225.34 ms │     no change │
│ QQuery 2     │   52.61 ms │     53.36 ms │     no change │
│ QQuery 3     │  116.20 ms │    115.79 ms │     no change │
│ QQuery 4     │   70.10 ms │     69.32 ms │     no change │
│ QQuery 5     │  199.10 ms │    198.30 ms │     no change │
│ QQuery 6     │   50.64 ms │     46.92 ms │ +1.08x faster │
│ QQuery 7     │  474.45 ms │    479.40 ms │     no change │
│ QQuery 8     │  158.07 ms │    157.65 ms │     no change │
│ QQuery 9     │  341.40 ms │    336.64 ms │     no change │
│ QQuery 10    │  143.37 ms │    150.52 ms │     no change │
│ QQuery 11    │   41.77 ms │     42.82 ms │     no change │
│ QQuery 12    │  146.51 ms │    147.67 ms │     no change │
│ QQuery 13    │  118.36 ms │    120.90 ms │     no change │
│ QQuery 14    │   25.81 ms │     25.68 ms │     no change │
│ QQuery 15    │   54.43 ms │     55.78 ms │     no change │
│ QQuery 16    │   47.39 ms │     47.76 ms │     no change │
│ QQuery 17    │  414.56 ms │    417.56 ms │     no change │
│ QQuery 18    │ 1042.19 ms │   1074.00 ms │     no change │
│ QQuery 19    │   77.35 ms │     80.07 ms │     no change │
│ QQuery 20    │   97.83 ms │     98.10 ms │     no change │
│ QQuery 21    │ 4416.18 ms │   4322.24 ms │     no change │
│ QQuery 22    │   35.76 ms │     35.70 ms │     no change │
└──────────────┴────────────┴──────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary           ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)           │ 8348.40ms │
│ Total Time (pr-18488-opt)   │ 8301.53ms │
│ Average Time (main)         │  379.47ms │
│ Average Time (pr-18488-opt) │  377.34ms │
│ Queries Faster              │         1 │
│ Queries Slower              │         0 │
│ Queries with No Change      │        21 │
│ Queries with Failure        │         0 │
└─────────────────────────────┴───────────┘

@feniljain
Contributor Author

feniljain commented Nov 6, 2025

> It's running in a noisy cloud environment, and tpch_mem takes quite a short time, so it might not be accurate.

Very interesting

> I’ve verified this with tpch_mem10 locally, and it actually slows down several queries.

Thanks for confirming!

> I tried to make this count distinct indices faster (and sent a PR feniljain#2),

I'm very curious about this PR, because it seems you have written the same logic as mine, but using loops instead. Am I missing some detail?

Is it the None check that is causing all this overhead?

@2010YOUY01
Contributor

> > It's running in a noisy cloud environment, and tpch_mem takes quite a short time, so it might not be accurate.
>
> Very interesting
>
> > I’ve verified this with tpch_mem10 locally, and it actually slows down several queries.
>
> Thanks for confirming!
>
> > I tried to make this count distinct indices faster (and sent a PR feniljain#2),
>
> I'm very curious about this PR, because it seems you have written the same logic as mine, but using loops instead. Am I missing some detail?
>
> Is it the None check that is causing all this overhead?

If we make the loop body really simple, the compiler can often generate more efficient machine code (for example, using SIMD instructions), and the hardware can execute it faster through several mechanisms (e.g. better memory prefetching). This can result in a several-fold speedup over an equivalent implementation.

I'm not entirely sure under which circumstances the compiler might fail to optimize, so I try to keep the loop body as simple as possible, and that usually works well.
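
As a rough illustration of the two styles being compared, here is a hypothetical reconstruction. Neither function is taken from this PR or from feniljain#2; the names are made up, and which one the compiler vectorizes is not guaranteed:

```rust
/// Option-tracking style: the previous value lives in an `Option`, so every
/// iteration compares against `Some(idx)` and therefore also checks for `None`.
fn count_distinct_option(indices: &[u64]) -> usize {
    let mut prev: Option<u64> = None;
    let mut count = 0;
    for &idx in indices.iter() {
        if prev != Some(idx) {
            count += 1;
        }
        prev = Some(idx);
    }
    count
}

/// Simple-loop style: the first element is handled once up front, and the loop
/// body is a single comparison of neighbours added to the counter as 0 or 1.
fn count_distinct_simple(indices: &[u64]) -> usize {
    if indices.is_empty() {
        return 0;
    }
    let mut count = 1usize;
    for i in 1..indices.len() {
        count += (indices[i] != indices[i - 1]) as usize;
    }
    count
}
```

Both functions return the same result on sorted input. One way to check what the compiler actually did is to inspect the assembly of a release build, for example with `cargo rustc --release -- --emit asm`, and to benchmark both versions rather than assume a difference.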

make HJ indices distinct count faster
@feniljain
Contributor Author

Yup, it's easier to autovectorize simple loops! But in Rust, IIUC, even iterators are lowered to loops. It would be interesting to explore what didn't work out in this case.

I have merged your PR, thanks for it!

@2010YOUY01
Contributor

2010YOUY01 commented Nov 7, 2025

> Yup, it's easier to autovectorize simple loops! But in Rust, IIUC, even iterators are lowered to loops. It would be interesting to explore what didn't work out in this case.
>
> I have merged your PR, thanks for it!

Yes, that’s true. I’m also curious about when auto-vectorization actually works and what’s an easy way to verify it. Once we start adding complexity to a simple loop body (e.g., iterators, branches, or inlined function calls), it tends to stop working at some point.

So my current practice is to keep hot paths as simple as possible (avoid iterators and inlined calls).

I guess the situation is somewhat analogous to how optimizations are implemented inside DataFusion.
For example, SELECT * FROM parquet_table WHERE a > 0 is likely to benefit from Parquet statistics pruning, whereas SELECT * FROM parquet_table WHERE pow(a, 2) < 4 probably won’t. In theory, it’s entirely possible to support such cases, but in practice, we usually don’t have enough engineering bandwidth to implement them yet.
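
A toy sketch of the pruning analogy, assuming per-row-group min/max statistics for an integer column `a`. The struct and function names are hypothetical and this is not DataFusion's pruning API:

```rust
/// Hypothetical min/max statistics for column `a` in one row group.
struct ColumnStats {
    min: i64,
    max: i64,
}

/// Predicate `a > threshold`: the row group can be skipped when even its
/// largest value fails the comparison.
fn can_skip_gt(stats: &ColumnStats, threshold: i64) -> bool {
    stats.max <= threshold
}

/// Predicate `pow(a, 2) < 4`: for integers this holds only for a in -1..=1,
/// so the row group can be skipped when its whole range lies outside that
/// interval. A pruner needs this extra rewrite before min/max can help.
fn can_skip_square_lt_4(stats: &ColumnStats) -> bool {
    stats.max <= -2 || stats.min >= 2
}
```

The first check follows directly from the shape of the predicate, while the second only works because someone derived the equivalent range by hand. That is the same kind of extra engineering effort that a more complex loop body asks of the compiler's optimizer.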

@2010YOUY01
Contributor

Let's include some simple tests; otherwise it's good to go once the tests pass. Thanks again!

Labels

physical-plan Changes to the physical-plan crate

Development

Successfully merging this pull request may close these issues.

Add selectivity metrics (for Explain Analyze) in Hash Join

4 participants