While running high-volume multi-stage engine queries on Pinot with a high-cardinality join key, we observed a disproportionate latency increase as the data volume grew on both sides of the join, for the following query shape:
SELECT
  count(*)
FROM
  table_A
WHERE (
    user_uuid IN (
      SELECT user_uuid FROM table_B
    )
  )
  AND (
    user_uuid NOT IN (
      SELECT user_uuid FROM table_B
    )
  )
LIMIT 100
option(useMultistageEngine=true, timeoutMs=120000, useColocatedJoin=true, maxRowsInJoin=40000000)
Profiling on a server showed that the main cause of the latency increase is inefficient groupId generation in org/apache/pinot/query/runtime/operator/MultistageGroupByExecutor.generateGroupByKeys, for two reasons:
Object2IntOpenHashMap resolves collisions with open addressing, which performs poorly for high-cardinality use cases.
The map starts with a low default initial size of 16 and a default load factor of 0.75, which causes many resizes, each rehashing all existing keys. For high-cardinality use cases this contributes significantly to the overall query runtime.
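To give a rough sense of the resize cost described above, here is a minimal back-of-the-envelope sketch. It assumes a simplified model in which the table doubles whenever the key count exceeds capacity times load factor; the real Object2IntOpenHashMap thresholds may differ slightly. The class and method names (ResizeModel, countResizes) are hypothetical and not part of Pinot or fastutil. The key count of 40,000,000 is borrowed from the maxRowsInJoin option in the query above.

```java
// Hypothetical helper, not part of Pinot or fastutil.
// Simplified model: the table doubles until capacity * loadFactor
// can hold all keys. Each doubling rehashes every key inserted so far.
class ResizeModel {

    // Number of doublings needed to fit numKeys, starting from initialCapacity.
    static int countResizes(long initialCapacity, double loadFactor, long numKeys) {
        long capacity = initialCapacity;
        int resizes = 0;
        while ((long) (capacity * loadFactor) < numKeys) {
            capacity *= 2;
            resizes++;
        }
        return resizes;
    }

    public static void main(String[] args) {
        long numKeys = 40_000_000L;
        // Default settings mentioned above: initial size 16, load factor 0.75.
        System.out.println("default: " + countResizes(16, 0.75, numKeys) + " resizes");
        // Presized table (2^26 slots) large enough to hold all keys up front.
        System.out.println("presized: " + countResizes(1L << 26, 0.75, numKeys) + " resizes");
    }
}
```

Under this model the default settings need 22 doublings to hold 40 million keys, while a table presized to 2^26 slots needs none. Whether presizing is practical depends on having a usable cardinality estimate before execution.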
We are considering a few strategies, such as better hash-map selection (avoiding open addressing for high cardinality) and generating groupIds in batches. We plan to use benchmarks to select the strategy with the best ROI.
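As an illustration of two of these ideas combined (presizing and batch generation), here is a minimal sketch. It is not Pinot's implementation: GroupIdGenerator, generateGroupIds, and the expectedCardinality parameter are hypothetical names, and java.util.HashMap (a chained map) stands in for fastutil only to keep the example dependency-free. It boxes the int values, so it is not a performance recommendation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Pinot code: assigns dense group ids to keys,
// using a map presized from an assumed cardinality estimate.
class GroupIdGenerator {
    private final Map<Object, Integer> groupIds;

    GroupIdGenerator(int expectedCardinality) {
        // Size the map so expectedCardinality keys fit under the 0.75 load factor
        // without any intermediate resize. HashMap caps its table at 2^30 slots.
        long needed = (long) Math.ceil(expectedCardinality / 0.75);
        int initialCapacity = (int) Math.min(1L << 30, Math.max(1L, needed));
        groupIds = new HashMap<>(initialCapacity);
    }

    // Processes a whole batch of keys in one call and returns one group id per key.
    // New keys get the next unused id; repeated keys reuse their existing id.
    int[] generateGroupIds(Object[] keys) {
        int[] ids = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            Integer id = groupIds.get(keys[i]);
            if (id == null) {
                id = groupIds.size();
                groupIds.put(keys[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }

    int numGroups() {
        return groupIds.size();
    }

    public static void main(String[] args) {
        GroupIdGenerator generator = new GroupIdGenerator(1_000);
        int[] ids = generator.generateGroupIds(new Object[] {"u1", "u2", "u1", "u3"});
        System.out.println(java.util.Arrays.toString(ids) + " groups=" + generator.numGroups());
    }
}
```

Any real choice between map implementations would need to come from the benchmarks mentioned above.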
This optimization can improve performance for both the Pinot v1 and v2 engines at once, since both engines rely on this logic. cc: @Jackie-Jiang