Add size check in DictionaryBuildingSingleValuedRowBasedKeySerdeHelper in putToKeyBuffer #18541
Conversation
Pull Request Overview
This PR adds dictionary size estimation to prevent out-of-memory (OOM) issues in historical nodes by implementing size checks before adding entries to dictionaries in DictionaryBuildingSingleValuedRowBasedKeySerdeHelper and its subclasses.
Key changes:
- Adds an abstract estimatedKeySize() method to track the memory usage of dictionary entries
- Implements size estimation for different data types (strings, arrays, structured data)
- Adds size validation before dictionary insertion to prevent unbounded growth
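A minimal sketch of that pattern follows. The names estimatedKeySize, currentEstimatedSize, and maxDictionarySize come from the PR's diff; the wrapper class and addToDictionary method are hypothetical illustrations, not the actual Druid code.

// Sketch only; not the actual Druid implementation.
abstract class SizeCheckingDictionaryHelper
{
  private final long maxDictionarySize;   // memory budget for the dictionary
  private long currentEstimatedSize = 0;  // running estimate of the dictionary footprint

  SizeCheckingDictionaryHelper(long maxDictionarySize)
  {
    this.maxDictionarySize = maxDictionarySize;
  }

  // Subclasses estimate the footprint of a single dictionary entry for their type.
  abstract int estimatedKeySize(Object key);

  // Refuses the insert (returns false) instead of letting the dictionary grow without bound.
  boolean addToDictionary(Object key)
  {
    final int size = estimatedKeySize(key);
    if (currentEstimatedSize + size > maxDictionarySize) {
      return false;
    }
    currentEstimatedSize += size;
    // ... actual dictionary insertion would happen here ...
    return true;
  }
}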
} else {
  // fall back to string representation for other types, this might be under-estimating for map
  size += (int) estimateStringKeySize(String.valueOf(o));
}
The method only processes StructuredData objects but ignores other types in the key array. This could lead to underestimating the total size when the key contains non-StructuredData objects.
    }
  }
} else if (obj != null) {
  // For non-StructuredData objects, estimate size using their string representation
  size += (int) estimateStringKeySize(String.valueOf(obj));
int size = estimatedKeySize(key.getKey());
if (currentEstimatedSize + size > maxDictionarySize) {
  return false;
}
currentEstimatedSize += size;
The size estimation is calculated for the entire key array, but only the current dimension (at index idx) should be added to the dictionary. This leads to incorrect size accounting, as it includes sizes of dimensions that may already be in the dictionary.
This comment is correct: it should be estimatedKeySize(obj) in DictionaryBuildingSingleValuedRowBasedKeySerdeHelper, not the estimate for the whole grouping key array.
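A hypothetical sketch of the fix described here, assuming the grouping key exposes its values as an Object[] via key.getKey() and the current dimension index is idx (variable names follow the surrounding diff; this is not the exact patched code):

// Estimate only the single dimension value being added, not the whole grouping key array.
final Object obj = key.getKey()[idx];      // value for the current dimension (assumed accessor)
final int size = estimatedKeySize(obj);    // instead of estimatedKeySize(key.getKey())
if (currentEstimatedSize + size > maxDictionarySize) {
  return false;                            // dictionary budget exhausted
}
currentEstimatedSize += size;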
public static final class JsonTypeStrategy implements TypeStrategy
{
  private final ObjectStrategy objectStrategy;
nit: it feels a bit off to me to have to make an ObjectStrategy to implement the TypeStrategy. Ideally I would expect both the nested column ObjectStrategy and TypeStrategy to be implemented such that they can just share some common methods instead of one relying on the other. ObjectStrategyComplexTypeStrategy was a sort of backwards-compatible fallback to make it so that we could fill in TypeStrategy from a pre-existing ObjectStrategy.
Not a big deal though
Moved all functions to TypeStrategy, looks better for encapsulation!
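For illustration only, a simplified sketch of that encapsulation direction, where the json strategy owns its serialization instead of delegating to a wrapped ObjectStrategy. The interface and class below are hypothetical stand-ins, not Druid's actual TypeStrategy API.

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;

// Hypothetical, simplified stand-in for the real TypeStrategy interface.
interface SimpleTypeStrategy<T>
{
  byte[] toBytes(T value);
  T fromBytes(byte[] bytes);
}

// The json strategy implements serialization directly rather than wrapping an ObjectStrategy.
final class JsonTypeStrategySketch implements SimpleTypeStrategy<Object>
{
  private static final ObjectMapper JSON_MAPPER = new ObjectMapper();

  @Override
  public byte[] toBytes(Object value)
  {
    try {
      return JSON_MAPPER.writeValueAsBytes(value);
    }
    catch (JsonProcessingException e) {
      throw new IllegalStateException(e);
    }
  }

  @Override
  public Object fromBytes(byte[] bytes)
  {
    try {
      return JSON_MAPPER.readValue(bytes, Object.class);
    }
    catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }
}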
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
int hashCompare = Long.compare(hash.getAsLong(), o.hash.getAsLong());
if (hashCompare != 0) {
  return hashCompare;
}

return Integer.compare(getSizeEstimate(), o.getSizeEstimate());
Using only (hash, sizeEstimate) as the final tie-breakers means two unequal values that collide on 64-bit hash and share the same serialized length will compare as 0 and thus be considered equal (see equals implementation), violating the general contract of equals/compareTo. To make equality logically correct, add a final fallback that compares the actual serialized byte arrays when hash and size are identical before returning 0.
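A sketch of the kind of fallback suggested here, written as a standalone helper that assumes the serialized bytes of both values are available; this is not the actual StructuredData code.

final class StructuredDataOrderingSketch
{
  // Compares by hash, then length, then the serialized bytes, so that two distinct
  // values colliding on hash and length still compare as unequal.
  static int compareWithByteFallback(long hashA, long hashB, byte[] bytesA, byte[] bytesB)
  {
    int hashCompare = Long.compare(hashA, hashB);
    if (hashCompare != 0) {
      return hashCompare;
    }
    int sizeCompare = Integer.compare(bytesA.length, bytesB.length);
    if (sizeCompare != 0) {
      return sizeCompare;
    }
    // Lengths are equal here; unsigned lexicographic comparison of the bytes breaks the tie.
    for (int i = 0; i < bytesA.length; i++) {
      int cmp = Integer.compare(bytesA[i] & 0xFF, bytesB[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return 0;
  }
}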
// guarantees that equals is consistent with compareTo
return compareTo(that) == 0;
equals delegates entirely to compareTo, which (due to the current hash/size tie-breaker) can declare different logical values equal on a hash collision plus equal length, creating a possibility of incorrect equality semantics. equals should instead directly compare the underlying values (e.g., by serialized bytes or structured value comparison) and only use compareTo for ordering. Recommend implementing equals via structural/byte-array equality and letting compareTo remain a total ordering.
return java.util.Objects.equals(this.value, that.value);
  return Arrays.deepEquals((Object[]) value, (Object[]) that.value);
}
return Objects.equals(value, that.value);
// guarantees that equals is consistent with compareTo
[nitpick] Routing equals through compareTo forces the full ordering path (including hash/size computations and potential serialization) for simple equality checks, adding overhead to hash-based collections; introduce a fast path (e.g., identity, null, primitive/String direct comparison, then serialized-bytes equality) before invoking ordering logic.
// Fast path for null
if (isNull() && that.isNull()) {
  return true;
} else if (isNull() || that.isNull()) {
  return false;
}
// Fast path for String
if (isString() && that.isString()) {
  return asString().equals(that.asString());
}
// Fast path for Number
if (isNumber() && that.isNumber()) {
  // Use Double.compare for equality of numbers
  return Double.compare(asNumber().doubleValue(), that.asNumber().doubleValue()) == 0;
}
// For complex objects, compare hash and size estimate first
if (hash.getAsLong() != that.hash.getAsLong()) {
  return false;
}
if (getSizeEstimate() != that.getSizeEstimate()) {
  return false;
}
// Fallback to compareTo for final check (should be rare)
// hashCode that relies on the object equality. Translates the hashcode to an integer as well
public int equalityHash()
{
  return Longs.hashCode(hash.getAsLong());
[nitpick] hashCode truncates the 64-bit XXHash to 32 bits, increasing collision probability in hash-based collections for large datasets; consider mixing both high and low 32-bit parts (e.g., (int)(h ^ (h >>> 32))) to better distribute entropy.
long h = hash.getAsLong();
return (int)(h ^ (h >>> 32));
Description
Add size check in DictionaryBuildingSingleValuedRowBasedKeySerdeHelper in putToKeyBuffer. Without this change, dictionary size is unbounded and can cause historical node OOM. Additionally:
- Added JsonTypeStrategy for json columns.
- Changed StructuredData to return consistent results for equals and compareTo, and replaced its hashCode function to use the hash of the serialized bytes instead (a generic sketch of this consistency pattern follows below).
This PR has: