Skip to content

Conversation

@sopel39
Copy link
Member

@sopel39 sopel39 commented Dec 5, 2025

Plugin to query Apache DataSketches (https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html). This includes merge and estimate function for theta sketch.

Description

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## General
* Add DataSketches functions. ({issue}`issuenumber`)

Supercedes #6643

```sql
SELECT
o_orderdate AS date,
theta_sketch_estimate(theta_sketch_union(o_custkey_sketch)) AS unique_user_count,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see only functions that take a sketch as a parameter.
How to i make my first sketch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sketches are produced either by other engines or ingestion pipelines (e.g. Spark or Hive) rather than Trino itself. Apache DataSketches are used quite commonly.

Plugin to query Apache DataSketches (https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html).
This includes merge and estimate function for theta sketch.
}

@Test
public void testSimpleMerge()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unit test should not be in AbstractTestQueryFramework class.
it doesn't use QueryRunner, so running this test should not require paying the cost of QR setup

Comment on lines +61 to +62
state1 = factory.createSingleState();
state2 = factory.createSingleState();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this shared mutable state in a concurrent test class?

Comment on lines +36 to +38
state.setNominalEntries(nominalEntries);
state.setSeed(seed);
state.setSketch(inputValue);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks the code overrides state in the @InputFunction function.
Isn't the @InputFunction function supposed to accumulate?

# DataSketches functions

DataSketches is a high-performance library of stochastic streaming
algorithms, commonly called sketches. Sketches are small, stateful programs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This document uses "sketch" to describe an algorithm (here), but also a data structure (when used as a type name).
This is confusing.

Comment on lines +11 to +12
is available through {func}`theta_sketch_union` and
{func}`theta_sketch_estimate`, typically used in `COUNT DISTINCT` queries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a typical usage, do you want to show it in an example?
Or did you mean "instead of count(DISTINCT ...) queries"?

is available through {func}`theta_sketch_union` and
{func}`theta_sketch_estimate`, typically used in `COUNT DISTINCT` queries.
DataSketches can be created with Hive or Pig using their respective sketch
APIs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather see a concrete instructions user can follow to create these data sketches. IMO they should not need to go to a different system. It's really not a big lift especially that Trino already has an aggregation function for building them (IcebergThetaSketchForStats), it's just not user facing.

cc @mosabua for how to reference to other systems.


Note: Trino does not create new sketches. Build the Theta sketches in your
ingest pipeline or another engine (e.g., Spark or Hive) and store the
serialized sketch bytes as a ``VARBINARY`` column. Trino functions operate on those serialized sketches.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataSketches supports many sketch families and serialization format is to some extend configurable. This should say what's supported.


## Functions

:::{function} theta_sketch_union(sketches [, nominal_entries, seed]) -> varbinary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it aggregation function? should there be a corresponding scalar function for unioning two sketches?

Also sketches -> sketch? or sketches -> varbinary? cc @mosabua

Comment on lines +47 to +54
SliceOutput sliceOutput = new DynamicSliceOutput(SIZE_OF_LONG + SIZE_OF_INT + SIZE_OF_INT + slice.length());
sliceOutput.appendInt(state.getNominalEntries());
sliceOutput.appendLong(state.getSeed());

sliceOutput.appendInt(slice.length());
sliceOutput.appendBytes(slice);

VARBINARY.writeSlice(out, sliceOutput.slice());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compare with

checkArgument(state.getUpdateSketch() == null || state.getCompactSketch() == null, "A state must not have both transient accumulator and combined form set");
CompactSketch compactSketch = Optional.ofNullable(state.getCompactSketch())
.orElseGet(() -> state.getUpdateSketch().compact());
Slice slice = Slices.wrappedBuffer(compactSketch.toByteArray());
VARBINARY.writeSlice(out, slice);

@findepi findepi requested a review from mosabua December 10, 2025 13:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants