Add size check in DictionaryBuildingSingleValuedRowBasedKeySerdeHelper in putToKeyBuffer #18541
Conversation
Pull Request Overview
This PR adds dictionary size estimation to prevent out-of-memory (OOM) issues in historical nodes by implementing size checks before adding entries to dictionaries in DictionaryBuildingSingleValuedRowBasedKeySerdeHelper and its subclasses.
Key changes:
- Adds an abstract estimatedKeySize() method to track the memory usage of dictionary entries
- Implements size estimation for different data types (strings, arrays, structured data)
- Adds size validation before dictionary insertion to prevent unbounded growth
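A minimal sketch of that pattern follows. The names estimatedKeySize, currentEstimatedSize, and maxDictionarySize come from the PR's diff; the wrapper class and addToDictionary method are hypothetical illustrations, not the actual Druid code.

// Sketch only; not the actual Druid implementation.
abstract class SizeCheckingDictionaryHelper
{
  private final long maxDictionarySize;   // memory budget for the dictionary
  private long currentEstimatedSize = 0;  // running estimate of the dictionary footprint

  SizeCheckingDictionaryHelper(long maxDictionarySize)
  {
    this.maxDictionarySize = maxDictionarySize;
  }

  // Subclasses estimate the footprint of a single dictionary entry for their type.
  abstract int estimatedKeySize(Object key);

  // Refuses the insert (returns false) instead of letting the dictionary grow without bound.
  boolean addToDictionary(Object key)
  {
    final int size = estimatedKeySize(key);
    if (currentEstimatedSize + size > maxDictionarySize) {
      return false;
    }
    currentEstimatedSize += size;
    // ... actual dictionary insertion would happen here ...
    return true;
  }
}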
} else {
  // fall back to string representation for other types, this might be under-estimating for map
  size += (int) estimateStringKeySize(String.valueOf(o));
}
The method only processes StructuredData objects but ignores other types in the key array. This could lead to underestimating the total size when the key contains non-StructuredData objects.
    }
  }
} else if (obj != null) {
  // For non-StructuredData objects, estimate size using their string representation
  size += (int) estimateStringKeySize(String.valueOf(obj));
int size = estimatedKeySize(key.getKey());
if (currentEstimatedSize + size > maxDictionarySize) {
  return false;
}
currentEstimatedSize += size;
The size estimation is calculated for the entire key array, but only the current dimension (at index idx) should be added to the dictionary. This leads to incorrect size accounting, as it includes sizes of dimensions that may already be in the dictionary.
This comment is correct: it should be estimatedKeySize(obj) in DictionaryBuildingSingleValuedRowBasedKeySerdeHelper, not the estimate for the whole grouping key array.
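A hypothetical sketch of the fix described here, assuming the grouping key exposes its values as an Object[] via key.getKey() and the current dimension index is idx (variable names follow the surrounding diff; this is not the exact patched code):

// Estimate only the single dimension value being added, not the whole grouping key array.
final Object obj = key.getKey()[idx];      // value for the current dimension (assumed accessor)
final int size = estimatedKeySize(obj);    // instead of estimatedKeySize(key.getKey())
if (currentEstimatedSize + size > maxDictionarySize) {
  return false;                            // dictionary budget exhausted
}
currentEstimatedSize += size;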
public static final class JsonTypeStrategy implements TypeStrategy
{
  private final ObjectStrategy objectStrategy;
nit: it feels a bit off to me to have to make an ObjectStrategy to implement the TypeStrategy. Ideally I would expect both the nested column ObjectStrategy and TypeStrategy to be implemented such that they can just share some common methods instead of one relying on the other. ObjectStrategyComplexTypeStrategy was a sort of backwards-compatible fallback to make it so that we could fill in TypeStrategy from a pre-existing ObjectStrategy.
Not a big deal though
Moved all functions to TypeStrategy, looks better for encapsulation!
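For illustration only, a simplified sketch of that encapsulation direction, where the json strategy owns its serialization instead of delegating to a wrapped ObjectStrategy. The interface and class below are hypothetical stand-ins, not Druid's actual TypeStrategy API.

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;

// Hypothetical, simplified stand-in for the real TypeStrategy interface.
interface SimpleTypeStrategy<T>
{
  byte[] toBytes(T value);
  T fromBytes(byte[] bytes);
}

// The json strategy implements serialization directly rather than wrapping an ObjectStrategy.
final class JsonTypeStrategySketch implements SimpleTypeStrategy<Object>
{
  private static final ObjectMapper JSON_MAPPER = new ObjectMapper();

  @Override
  public byte[] toBytes(Object value)
  {
    try {
      return JSON_MAPPER.writeValueAsBytes(value);
    }
    catch (JsonProcessingException e) {
      throw new IllegalStateException(e);
    }
  }

  @Override
  public Object fromBytes(byte[] bytes)
  {
    try {
      return JSON_MAPPER.readValue(bytes, Object.class);
    }
    catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }
}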
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
int hashCompare = Long.compare(hash.getAsLong(), o.hash.getAsLong());
if (hashCompare != 0) {
  return hashCompare;
}

return Integer.compare(getSizeEstimate(), o.getSizeEstimate());
Using only (hash, sizeEstimate) as the final tie-breakers means two unequal values that collide on 64-bit hash and share the same serialized length will compare as 0 and thus be considered equal (see equals implementation), violating the general contract of equals/compareTo. To make equality logically correct, add a final fallback that compares the actual serialized byte arrays when hash and size are identical before returning 0.
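A sketch of the kind of fallback suggested here, written as a standalone helper that assumes the serialized bytes of both values are available; this is not the actual StructuredData code.

final class StructuredDataOrderingSketch
{
  // Compares by hash, then length, then the serialized bytes, so that two distinct
  // values colliding on hash and length still compare as unequal.
  static int compareWithByteFallback(long hashA, long hashB, byte[] bytesA, byte[] bytesB)
  {
    int hashCompare = Long.compare(hashA, hashB);
    if (hashCompare != 0) {
      return hashCompare;
    }
    int sizeCompare = Integer.compare(bytesA.length, bytesB.length);
    if (sizeCompare != 0) {
      return sizeCompare;
    }
    // Lengths are equal here; unsigned lexicographic comparison of the bytes breaks the tie.
    for (int i = 0; i < bytesA.length; i++) {
      int cmp = Integer.compare(bytesA[i] & 0xFF, bytesB[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return 0;
  }
}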
// guarantees that equals is consistent with compareTo
return compareTo(that) == 0;
equals delegates entirely to compareTo, which (due to the current hash/size tie-breaker) can declare different logical values equal on a hash collision plus equal length, creating a possibility of incorrect equality semantics. equals should instead directly compare the underlying values (e.g., by serialized bytes or structured value comparison) and only use compareTo for ordering. Recommend implementing equals via structural/byte-array equality and letting compareTo remain a total ordering.
return java.util.Objects.equals(this.value, that.value);
  return Arrays.deepEquals((Object[]) value, (Object[]) that.value);
}
return Objects.equals(value, that.value);
// guarantees that equals is consistent with compareTo
[nitpick] Routing equals through compareTo forces the full ordering path (including hash/size computations and potential serialization) for simple equality checks, adding overhead to hash-based collections; introduce a fast path (e.g., identity, null, primitive/String direct comparison, then serialized-bytes equality) before invoking ordering logic.
// Fast path for null
if (isNull() && that.isNull()) {
  return true;
} else if (isNull() || that.isNull()) {
  return false;
}
// Fast path for String
if (isString() && that.isString()) {
  return asString().equals(that.asString());
}
// Fast path for Number
if (isNumber() && that.isNumber()) {
  // Use Double.compare for equality of numbers
  return Double.compare(asNumber().doubleValue(), that.asNumber().doubleValue()) == 0;
}
// For complex objects, compare hash and size estimate first
if (hash.getAsLong() != that.hash.getAsLong()) {
  return false;
}
if (getSizeEstimate() != that.getSizeEstimate()) {
  return false;
}
// Fallback to compareTo for final check (should be rare)
// hashCode that relies on the object equality. Translates the hashcode to an integer as well
public int equalityHash()
{
  return Longs.hashCode(hash.getAsLong());
[nitpick] hashCode truncates the 64-bit XXHash to 32 bits, increasing collision probability in hash-based collections for large datasets; consider mixing both high and low 32-bit parts (e.g., (int)(h ^ (h >>> 32))) to better distribute entropy.
long h = hash.getAsLong();
return (int)(h ^ (h >>> 32));
Description
Add size check in DictionaryBuildingSingleValuedRowBasedKeySerdeHelper in putToKeyBuffer. Without this change, dictionary size is unbounded and can cause historical node OOM. Additionally:
- Added JsonTypeStrategy for json columns.
- Changed StructuredData to return consistent results for equals and compareTo, and replaced its hashCode function to use the hash of the serialized bytes instead (a generic sketch of this consistency pattern follows below).
This PR has: