-
Notifications
You must be signed in to change notification settings - Fork 93
Benbellick/handle structured udt #609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
b7ea41e to
bee33ea
Compare
Support both opaque (google.protobuf.Any) and structured (Literal.Struct) encodings for user-defined type literals per Substrait spec. - Split UserDefinedLiteral into UserDefinedAny and UserDefinedStruct - Move type parameters to interface level for parameterized types - Add ExtensionCollector.getExtensionCollection() method - Full Calcite integration with REINTERPRET pattern for Any-based UDTs - Add Scala visitor methods and comprehensive documentation - Comprehensive test coverage including roundtrip tests
bee33ea to
b4fbb07
Compare
asolimando
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some initial suggestions, there are a couple of places where I am not 100% sure like switching from SqlTypeName.ROW to SqlTypeName.OTHER and I have explicitly mentioned that, the rest should be safe to be implemented, feel free to ping me when it's reviewable again :)
|
|
||
| /** | ||
| * Test-specific variant of the core implementation that treats Calcite-copied types as | ||
| * equivalent, even when they are not the same Java instance. Calcite often clones UDTs during |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is accurate, in general you use canonize to "create" data types, which internally uses a cache and the same object is returned if already seen before.
Lots of places in Calcite compare them by instance (Java's ==), so if that was true it would break in a few places.
RelDataTypeFactory is pretty explicit on this in its javadoc.
public RelDataType createUserDefinedType(
String urn, String name,
List<Type.Parameter> params, boolean nullable) {
SubstraitUserDefinedType type =
new SubstraitUserDefinedType(urn, name, params, nullable);
return canonize(type); <-- this is what guarantees uniqueness for your types
}
Make sure you never create types manually, at that point you should be able to drop methods like matchesCalciteAlias that are just working around the issue, possibly also matchesSubstraitType
|
|
||
| /** Creates a SubstraitUserDefinedType from a Substrait Type.UserDefined. */ | ||
| public static SubstraitUserDefinedType from(io.substrait.type.Type.UserDefined type) { | ||
| return new SubstraitUserDefinedType( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you must canonize, and for doing that you need a handle to a Calcite type factory
|
|
||
| /** Creates a SubstraitUserDefinedAnyType from a Substrait Type.UserDefined. */ | ||
| public static SubstraitUserDefinedAnyType from(io.substrait.type.Type.UserDefined type) { | ||
| return new SubstraitUserDefinedAnyType( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, canonization is needed here
| io.substrait.type.Type.UserDefined type, | ||
| List<RelDataType> fieldTypes, | ||
| List<String> fieldNames) { | ||
| return new SubstraitUserDefinedStructType( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Canonization needed
| throw new IllegalArgumentException("Field types and names must have same length"); | ||
| } | ||
| this.fieldTypes = fieldTypes; | ||
| this.fieldNames = fieldNames; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you must call again computeDigest() after setting fieldTypes and fieldNames (true for any field in general), because RelDataTypeImpl identity is based on the digest field. This is true for all subclasses adding or changing the field.
Similarly, I expect to see equals() and hashCode() overrides when you have extra fields, I think you should add them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RelDataTypeImpl is Serializable, are we making sure the extra fields are serializable too?
| // Can be considered as ROW since it has structure | ||
| return SqlTypeName.ROW; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% sure, but I am afraid you would then be matching built-in ROW types, I'd rather play it safe and use SqlTypeName.OTHER (like the base class) and use isStruct() override to indicate structure, something like:
@Override
public boolean isStruct() {
return true;
}
I am starting from the assumption that, if any structured type would make sense as ROW, we would not have this boolean method, but I don't have any evidence for saying you shouldn't do it.
| java.util.List<String> fieldNames = | ||
| java.util.stream.IntStream.range(0, expr.fields().size()) | ||
| .mapToObj(i -> "f" + i) | ||
| .collect(java.util.stream.Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From https://substrait.io/types/type_classes/#communicating-user-defined-types I was under the impression that UDTs were coming with their own field names, expr.fields() might have them set?
If that's not necessarily the case, we should at least preserve them whenever they are set.
| } | ||
|
|
||
| @Override | ||
| public RelDataTypeField getField(String fieldName, boolean caseSensitive, boolean elideRecord) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The elideRecord parameter is not used here, IIRC it's used for nested field access. You should check how RelDataTypeImpl.getField() behaves and possibly:
- remove the override if not needed,
- call super.getField() at the top and add your extra logic after,
- add the nested type handling and honor the
elideRecordparam
| new SubstraitUserDefinedType.SubstraitUserDefinedAnyType( | ||
| URN, NAME, List.of(integerParam), true); | ||
|
|
||
| assertEquals(left, right); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should be able to assert left == right after fixing type canonization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add tests for nested UDT field access? (to exercise the elideRecord argument in getField) and, if supported, a UDT within a struct within UDT (if such a mixup is valid)?
It might be interesting to test also two UDTs with same URN/name but different type parameters like Point<i32> and Point<i64>, I expect them to be different, as i32 != i64
|
I decided to break this up into 2 PRs, one for core and one for isthmus. The one for core is here: #613 |
Closes #603