Skip to content

Conversation

@benbellick
Copy link
Member

@benbellick benbellick commented Nov 17, 2025

Closes #603

@benbellick benbellick force-pushed the benbellick/handle-structured-udt branch 4 times, most recently from b7ea41e to bee33ea Compare November 18, 2025 22:00
Support both opaque (google.protobuf.Any) and
structured (Literal.Struct) encodings for user-defined type literals per Substrait spec.

- Split UserDefinedLiteral into UserDefinedAny and UserDefinedStruct
- Move type parameters to interface level for parameterized types
- Add ExtensionCollector.getExtensionCollection() method
- Full Calcite integration with REINTERPRET pattern for Any-based UDTs
- Add Scala visitor methods and comprehensive documentation
- Comprehensive test coverage including roundtrip tests
@benbellick benbellick force-pushed the benbellick/handle-structured-udt branch from bee33ea to b4fbb07 Compare November 18, 2025 22:05
Copy link

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some initial suggestions, there are a couple of places where I am not 100% sure like switching from SqlTypeName.ROW to SqlTypeName.OTHER and I have explicitly mentioned that, the rest should be safe to be implemented, feel free to ping me when it's reviewable again :)


/**
* Test-specific variant of the core implementation that treats Calcite-copied types as
* equivalent, even when they are not the same Java instance. Calcite often clones UDTs during

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is accurate, in general you use canonize to "create" data types, which internally uses a cache and the same object is returned if already seen before.

Lots of places in Calcite compare them by instance (Java's ==), so if that was true it would break in a few places.

RelDataTypeFactory is pretty explicit on this in its javadoc.

public RelDataType createUserDefinedType(
      String urn, String name,
      List<Type.Parameter> params, boolean nullable) {
    SubstraitUserDefinedType type =
        new SubstraitUserDefinedType(urn, name, params, nullable);
    return canonize(type); <-- this is what guarantees uniqueness for your types
  }

Make sure you never create types manually, at that point you should be able to drop methods like matchesCalciteAlias that are just working around the issue, possibly also matchesSubstraitType


/** Creates a SubstraitUserDefinedType from a Substrait Type.UserDefined. */
public static SubstraitUserDefinedType from(io.substrait.type.Type.UserDefined type) {
return new SubstraitUserDefinedType(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you must canonize, and for doing that you need a handle to a Calcite type factory


/** Creates a SubstraitUserDefinedAnyType from a Substrait Type.UserDefined. */
public static SubstraitUserDefinedAnyType from(io.substrait.type.Type.UserDefined type) {
return new SubstraitUserDefinedAnyType(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, canonization is needed here

io.substrait.type.Type.UserDefined type,
List<RelDataType> fieldTypes,
List<String> fieldNames) {
return new SubstraitUserDefinedStructType(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canonization needed

throw new IllegalArgumentException("Field types and names must have same length");
}
this.fieldTypes = fieldTypes;
this.fieldNames = fieldNames;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you must call again computeDigest() after setting fieldTypes and fieldNames (true for any field in general), because RelDataTypeImpl identity is based on the digest field. This is true for all subclasses adding or changing the field.

Similarly, I expect to see equals() and hashCode() overrides when you have extra fields, I think you should add them.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RelDataTypeImpl is Serializable, are we making sure the extra fields are serializable too?

Comment on lines +222 to +223
// Can be considered as ROW since it has structure
return SqlTypeName.ROW;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure, but I am afraid you would then be matching built-in ROW types, I'd rather play it safe and use SqlTypeName.OTHER (like the base class) and use isStruct() override to indicate structure, something like:

@Override
public boolean isStruct() {
  return true;
}

I am starting from the assumption that, if any structured type would make sense as ROW, we would not have this boolean method, but I don't have any evidence for saying you shouldn't do it.

Comment on lines +137 to +140
java.util.List<String> fieldNames =
java.util.stream.IntStream.range(0, expr.fields().size())
.mapToObj(i -> "f" + i)
.collect(java.util.stream.Collectors.toList());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From https://substrait.io/types/type_classes/#communicating-user-defined-types I was under the impression that UDTs were coming with their own field names, expr.fields() might have them set?

If that's not necessarily the case, we should at least preserve them whenever they are set.

}

@Override
public RelDataTypeField getField(String fieldName, boolean caseSensitive, boolean elideRecord) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The elideRecord parameter is not used here, IIRC it's used for nested field access. You should check how RelDataTypeImpl.getField() behaves and possibly:

  • remove the override if not needed,
  • call super.getField() at the top and add your extra logic after,
  • add the nested type handling and honor the elideRecord param

new SubstraitUserDefinedType.SubstraitUserDefinedAnyType(
URN, NAME, List.of(integerParam), true);

assertEquals(left, right);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be able to assert left == right after fixing type canonization

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add tests for nested UDT field access? (to exercise the elideRecord argument in getField) and, if supported, a UDT within a struct within UDT (if such a mixup is valid)?

It might be interesting to test also two UDTs with same URN/name but different type parameters like Point<i32> and Point<i64>, I expect them to be different, as i32 != i64

@benbellick
Copy link
Member Author

I decided to break this up into 2 PRs, one for core and one for isthmus. The one for core is here: #613

@benbellick
Copy link
Member Author

Working on another version of this here: #623

This will be built off of #613

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UDT literals only supported Any representation, not Struct representation

3 participants