-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Apache DataSketches plugin for Trino #27563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
e892109 to
09190ed
Compare
...ino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/SketchFunctionsPlugin.java
Outdated
Show resolved
Hide resolved
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/UnionWithParams.java
Outdated
Show resolved
Hide resolved
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/UnionWithParams.java
Show resolved
Hide resolved
| ```sql | ||
| SELECT | ||
| o_orderdate AS date, | ||
| theta_sketch_estimate(theta_sketch_union(o_custkey_sketch)) AS unique_user_count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see only functions that take a sketch as a parameter.
How to i make my first sketch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sketches are produced either by other engines or ingestion pipelines (e.g. Spark or Hive) rather than Trino itself. Apache DataSketches are used quite commonly.
Plugin to query Apache DataSketches (https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html). This includes merge and estimate function for theta sketch.
09190ed to
497b043
Compare
| } | ||
|
|
||
| @Test | ||
| public void testSimpleMerge() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This unit test should not be in AbstractTestQueryFramework class.
it doesn't use QueryRunner, so running this test should not require paying the cost of QR setup
| state1 = factory.createSingleState(); | ||
| state2 = factory.createSingleState(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this shared mutable state in a concurrent test class?
| state.setNominalEntries(nominalEntries); | ||
| state.setSeed(seed); | ||
| state.setSketch(inputValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks the code overrides state in the @InputFunction function.
Isn't the @InputFunction function supposed to accumulate?
| # DataSketches functions | ||
|
|
||
| DataSketches is a high-performance library of stochastic streaming | ||
| algorithms, commonly called sketches. Sketches are small, stateful programs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This document uses "sketch" to describe an algorithm (here), but also a data structure (when used as a type name).
This is confusing.
| is available through {func}`theta_sketch_union` and | ||
| {func}`theta_sketch_estimate`, typically used in `COUNT DISTINCT` queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a typical usage, do you want to show it in an example?
Or did you mean "instead of count(DISTINCT ...) queries"?
| is available through {func}`theta_sketch_union` and | ||
| {func}`theta_sketch_estimate`, typically used in `COUNT DISTINCT` queries. | ||
| DataSketches can be created with Hive or Pig using their respective sketch | ||
| APIs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather see a concrete instructions user can follow to create these data sketches. IMO they should not need to go to a different system. It's really not a big lift especially that Trino already has an aggregation function for building them (IcebergThetaSketchForStats), it's just not user facing.
cc @mosabua for how to reference to other systems.
|
|
||
| Note: Trino does not create new sketches. Build the Theta sketches in your | ||
| ingest pipeline or another engine (e.g., Spark or Hive) and store the | ||
| serialized sketch bytes as a ``VARBINARY`` column. Trino functions operate on those serialized sketches. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataSketches supports many sketch families and serialization format is to some extend configurable. This should say what's supported.
|
|
||
| ## Functions | ||
|
|
||
| :::{function} theta_sketch_union(sketches [, nominal_entries, seed]) -> varbinary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it aggregation function? should there be a corresponding scalar function for unioning two sketches?
Also sketches -> sketch? or sketches -> varbinary? cc @mosabua
| SliceOutput sliceOutput = new DynamicSliceOutput(SIZE_OF_LONG + SIZE_OF_INT + SIZE_OF_INT + slice.length()); | ||
| sliceOutput.appendInt(state.getNominalEntries()); | ||
| sliceOutput.appendLong(state.getSeed()); | ||
|
|
||
| sliceOutput.appendInt(slice.length()); | ||
| sliceOutput.appendBytes(slice); | ||
|
|
||
| VARBINARY.writeSlice(out, sliceOutput.slice()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compare with
Lines 51 to 55 in da1de5e
| checkArgument(state.getUpdateSketch() == null || state.getCompactSketch() == null, "A state must not have both transient accumulator and combined form set"); | |
| CompactSketch compactSketch = Optional.ofNullable(state.getCompactSketch()) | |
| .orElseGet(() -> state.getUpdateSketch().compact()); | |
| Slice slice = Slices.wrappedBuffer(compactSketch.toByteArray()); | |
| VARBINARY.writeSlice(out, slice); |
Plugin to query Apache DataSketches (https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html). This includes merge and estimate function for theta sketch.
Description
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:
Supercedes #6643