Contributor

@ZorinAnton ZorinAnton commented Aug 5, 2025

This PR introduces support for dynamic scalar User-Defined Functions (UDFs) for round-trip conversion between Substrait plans and Calcite RelNodes.

It establishes the Substrait YAML extension files as the single source of truth (SSoT) for all function signatures. The primary addition is a new utility, YamlToSqlOperator, which acts as a bridge between the Substrait and Calcite worlds by dynamically generating Calcite SqlOperator objects from the parsed YAML definitions.

The implementation only generates these operators for functions that are truly dynamic, that is, functions present in the provided extensions but not already defined in the default SubstraitOperatorTable.

Key Changes:

@ZorinAnton ZorinAnton marked this pull request as draft August 5, 2025 13:05
@ZorinAnton ZorinAnton force-pushed the zor-udf branch 6 times, most recently from 2f8f79a to ad11957 Compare August 7, 2025 10:08
@ZorinAnton ZorinAnton marked this pull request as ready for review August 7, 2025 11:15
@nielspardon nielspardon self-requested a review August 7, 2025 13:30
Member

@vbarua vbarua left a comment

I started taking a look at this today, but didn't have time to finish.

I'm going to restate what I think you're doing with these changes to make sure that we're on the same page.

To start, you're generating SqlOperators dynamically from the function definitions in SimpleExtensions. You then use these to generate a SqlOperatorTable that covers functions that aren't already in the SubstraitOperatorTable.

This lets you do things like parse SQL queries with functions that are defined only in the SimpleExtension, and aren't part of the SubstraitOperatorTable.

With these definitions, you can also add additional mappings when converting from Substrait to Calcite, and back, for functions that aren't explicitly mapped in the static SCALAR_SIGS mapping.

Meta Comment

I find the usage of "custom" for this concept somewhat confusing. These aren't really custom functions because they are part of the core Substrait spec. A better word for this might be "dynamic". In my head, you're effectively generating SqlOperators dynamically based on the Substrait definitions, and then using them as fall-backs when explicit mappings are not available.

SimpleExtension.ExtensionCollection customExtensionCollection =
ExtensionUtils.getCustomExtensions(extensions);
List<SqlOperator> generatedCustomOperators =
YamlToSqlOperator.from(customExtensionCollection, this.factory);
Member

Do we actually need to distinguish between the SqlOperators defined explicitly in the SubstraitOperatorTable and those that you are generating here?

I think we can just generate mappings for every function found in the extension collection, and then chain the operator tables like:

SqlOperatorTables.chain(
  SubstraitOperatorTable.INSTANCE,
  SqlOperatorTables.of(generatedCustomOperators)
);

That way, we automatically give precedence to the non-custom ones.
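To make the intended precedence concrete, here is a minimal sketch of a "first table wins" lookup. It uses plain maps as hypothetical stand-ins for operator tables (the class name, the map-based tables, and the string labels are all assumptions for illustration); it does not reproduce how Calcite resolves operators across chained tables.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for chained operator tables: each map goes from a
// function name to a label for the operator registered under that name.
final class ChainedLookupSketch {

  /** Returns the operator from the first table in the chain that knows the name. */
  static Optional<String> lookup(List<Map<String, String>> chain, String name) {
    for (Map<String, String> table : chain) {
      String op = table.get(name);
      if (op != null) {
        return Optional.of(op);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    Map<String, String> builtIns = new LinkedHashMap<>();
    builtIns.put("ADD", "built-in ADD");

    Map<String, String> generated = new LinkedHashMap<>();
    generated.put("ADD", "generated ADD");
    generated.put("MY_UDF", "generated MY_UDF");

    // Built-ins come first, so they shadow generated operators with the same name.
    List<Map<String, String>> chain = List.of(builtIns, generated);
    System.out.println(lookup(chain, "ADD"));     // Optional[built-in ADD]
    System.out.println(lookup(chain, "MY_UDF"));  // Optional[generated MY_UDF]
    System.out.println(lookup(chain, "MISSING")); // Optional.empty
  }
}
```

With this ordering, generating operators for every function in the collection costs some duplicated work but cannot displace a built-in.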

Contributor Author

I'm not sure that precedence should be given to the default (non-custom) operators. What if users want to override them?

If I understood correctly, you are suggesting that we generate SqlOperators for all extensions, both the default ones and the dynamic ones:

List<SqlOperator> generatedCustomOperators =
        SimpleExtensionToSqlOperator.from(dynamicExtensionCollection, this.factory);

I see two issues here:

  • duplicated work (default operators already exist in SubstraitOperatorTable.INSTANCE)
  • it requires the fully fledged TypeExpressionEvaluator that will probably be needed at some point anyway, but I guess this should be done in a different PR

Member

I'm not sure that precedence should be given to the default (non-custom) operators.

I would argue that it's preferable to bind to the built-ins when possible, because internally in both the RelBuilder and optimizer rules Calcite looks for specific operator instances when performing optimization (e.g. comparison simplification, aggregate rewriting, etc.). If we use dynamic operators instead of built-in ones, Calcite can no longer perform these optimizations.
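A small sketch of why instance identity matters here. The `Operator` class, the `BUILT_IN_EQUALS` constant, and `canSimplify` are hypothetical stand-ins, not Calcite types; the point is only that a rule which compares against a specific singleton will not fire for a freshly generated operator, even when the name is identical.

```java
// Hypothetical model of identity-based rule matching; not Calcite code.
final class InstanceMatchSketch {

  /** Two operators can share a name and still be distinct instances. */
  static final class Operator {
    final String name;

    Operator(String name) {
      this.name = name;
    }
  }

  // Stand-in for a built-in operator singleton that optimizer rules know about.
  static final Operator BUILT_IN_EQUALS = new Operator("EQUALS");

  /** A rule that recognizes the operator by reference, not by name. */
  static boolean canSimplify(Operator op) {
    return op == BUILT_IN_EQUALS;
  }

  public static void main(String[] args) {
    Operator generated = new Operator("EQUALS"); // same name, new instance

    System.out.println(canSimplify(BUILT_IN_EQUALS)); // true
    System.out.println(canSimplify(generated));       // false
  }
}
```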

Contributor Author

Ok, I changed the order so that built-in operators have precedence.

this(SimpleExtension.loadDefaults(), features);
}

public SqlToSubstrait(SimpleExtension.ExtensionCollection extensions, FeatureBoard features) {
Member

I think this should just be

public SqlToSubstrait(SqlOperatorTable operatorTable, FeatureBoard features)

to give users direct control over the operator table they are using.

Contributor Author

@ZorinAnton ZorinAnton Aug 8, 2025

In this constructor we need to call the constructor of the base class, SqlConverterBase, which requires an instance of SimpleExtension.ExtensionCollection as an argument. Otherwise it uses SimpleExtension.loadDefaults(), which will cause an inconsistency between the operatorTable and the extensionCollection fields, as the former will contain operators that don't have corresponding extensions in the latter.

Contributor Author

ZorinAnton commented Aug 8, 2025

I'm going to restate what I think you're doing with these changes to make sure that we're on the same page.

Yes, you've correctly summarized the changes.

Meta Comment

I find the usage of "custom" for this concept somewhat confusing. These aren't really custom functions because they are part of the core Substrait spec. A better word for this might be "dynamic". In my head, you're effectively generating SqlOperators dynamically based on the Substrait definitions, and then using them as fall-backs when explicit mappings are not available.

I renamed all "custom" to "dynamic" in the changes; however, in my mind a User Defined Function is always something custom :)

@ZorinAnton ZorinAnton force-pushed the zor-udf branch 2 times, most recently from 2356755 to b232a95 Compare August 11, 2025 15:07
TypeConverter typeConverter) {
List<SimpleExtension.ValueArgument> requiredArgs =
function.args().stream()
.filter(SimpleExtension.Argument::required)
Member

I believe this filter is meaningless because required always returns true if you look at the implementations. I'm not even sure what required is actually used for.

Contributor Author

The required method is used here:

    return args().stream().filter(Argument::required).collect(Collectors.toList());

I would think it is supposed to reflect whether the argument is required or optional, so it is probably not correct for it to always return true.

List<SimpleExtension.ValueArgument> requiredArgs =
function.args().stream()
.filter(SimpleExtension.Argument::required)
.filter(t -> t instanceof SimpleExtension.ValueArgument)
Member

I'm not sure if it's valid to ignore Enum arguments when generating functions.

For example, in extract:

    impls:
      - args:
          - name: component
            options: [ YEAR, ISO_YEAR, US_YEAR, HOUR, MINUTE, SECOND,
                       MILLISECOND, MICROSECOND, SUBSECOND, PICOSECOND, UNIX_TIME, TIMEZONE_OFFSET ]
            description: The part of the value to extract.
          - name: x
            value: timestamp_tz
          - name: timezone
            description: Timezone string from IANA tzdb.
            value: string
        return: i64

the enumeration argument for component is part of the function signature. It might make sense to ignore functions with enum arguments to start with.

Contributor Author

I added the code to handle EnumArgument; however, I'm not sure whether SQL with functions using it can currently be translated to Substrait. At least, I couldn't manage to do it. Comments in this function indirectly mention that.

TypeExpressionEvaluator.evaluateExpression(
returnExpression, function.args(), substraitArgTypes);

return typeConverter.toCalcite(typeFactory, resolvedSubstraitType);
Member

You're not using the nullability of the function to determine if the output is nullable or not. We should probably account for it.

Contributor Author

Added handling of nullability.
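For reference, a minimal sketch of how a function's nullability handling could feed into the output type. It is based on a reading of the three modes named in the Substrait extension format (MIRROR, DECLARED_OUTPUT, DISCRETE) and is not the code added in this PR; the class, the enum, and the method signature are hypothetical, and DISCRETE is simplified to "use the declared return nullability".

```java
import java.util.List;

// Hypothetical sketch of output-nullability resolution; not the PR's implementation.
final class NullabilitySketch {

  enum Nullability {
    MIRROR,
    DECLARED_OUTPUT,
    DISCRETE
  }

  /**
   * @param mode the nullability handling declared for the function variant
   * @param argIsNullable nullability of each bound argument type
   * @param declaredNullable whether the declared return type is marked nullable
   */
  static boolean isOutputNullable(
      Nullability mode, List<Boolean> argIsNullable, boolean declaredNullable) {
    switch (mode) {
      case MIRROR:
        // Output is nullable if and only if at least one input is nullable.
        return argIsNullable.contains(Boolean.TRUE);
      case DECLARED_OUTPUT:
      case DISCRETE:
        // Simplification: trust the nullability written on the return type.
        return declaredNullable;
      default:
        throw new IllegalArgumentException("Unknown nullability mode: " + mode);
    }
  }

  public static void main(String[] args) {
    System.out.println(isOutputNullable(Nullability.MIRROR, List.of(false, true), false)); // true
    System.out.println(isOutputNullable(Nullability.MIRROR, List.of(false, false), true)); // false
    System.out.println(isOutputNullable(Nullability.DECLARED_OUTPUT, List.of(true), false)); // false
  }
}
```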

@ZorinAnton ZorinAnton force-pushed the zor-udf branch 5 times, most recently from f2527e1 to e7a3d0a Compare August 15, 2025 10:37
@ZorinAnton ZorinAnton requested a review from vbarua August 18, 2025 08:28
@ZorinAnton ZorinAnton changed the title from "Add udf support for Substrait<->Calcite conversion" to "feat(isthmus): add udf support for Substrait<->Calcite conversion" Sep 1, 2025
# Conflicts:
#	isthmus/src/main/java/io/substrait/isthmus/SqlToSubstrait.java

# Conflicts:
#	isthmus/src/main/java/io/substrait/isthmus/SqlToSubstrait.java
#	isthmus/src/test/java/io/substrait/isthmus/OptimizerIntegrationTest.java
#	isthmus/src/test/java/io/substrait/isthmus/PlanTestBase.java
# Conflicts:
#	isthmus/src/main/java/io/substrait/isthmus/SqlToSubstrait.java
@nielspardon nielspardon changed the title from "feat(isthmus): add udf support for Substrait<->Calcite conversion" to "feat(isthmus): add dynamic function conversion for Substrait<->Calcite" Oct 14, 2025
Member

@vbarua vbarua left a comment

Found some time to take a look at this again.

Overall I think it's a pretty reasonable feature, but I'm a little wary of enabling it by default everywhere like you have, because I can't fully think through the implications of having dynamic mappings in places where they weren't expected before. Examples of things I'm worried about are:

  • Systems that might actually prefer mappings to fail in order to warn users about missing features.
  • Dynamic function generation failures with custom user catalogs.

Ideally, I would love to make this capability something that users can opt into. That way, we could release it and users could try it out in their system and report any issues. The way you've structured it now though, it's pretty difficult to do this because the dynamic mappings are created as part of creating the various converters that are used to convert between SQL, Substrait and Calcite.

How difficult do you think it would be to make the dynamic mapping behaviour opt-in?
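As a rough illustration of what opt-in could look like, the sketch below keeps failing on unmapped functions by default and only falls back to a dynamic mapping when a flag is set. Everything in it (the class, the `dynamicEnabled` flag, the string results) is hypothetical and deliberately independent of the actual converter classes.

```java
import java.util.Set;

// Hypothetical opt-in resolver; not part of isthmus.
final class OptInMappingSketch {
  private final Set<String> explicitMappings;
  private final Set<String> extensionFunctions;
  private final boolean dynamicEnabled;

  OptInMappingSketch(
      Set<String> explicitMappings, Set<String> extensionFunctions, boolean dynamicEnabled) {
    this.explicitMappings = explicitMappings;
    this.extensionFunctions = extensionFunctions;
    this.dynamicEnabled = dynamicEnabled;
  }

  /** Explicit mappings always win; dynamic ones are used only when opted in. */
  String resolve(String name) {
    if (explicitMappings.contains(name)) {
      return "explicit:" + name;
    }
    if (dynamicEnabled && extensionFunctions.contains(name)) {
      return "dynamic:" + name;
    }
    // Default behaviour is preserved: an unmapped function is an error.
    throw new IllegalArgumentException("No mapping for function: " + name);
  }

  public static void main(String[] args) {
    Set<String> explicit = Set.of("add");
    Set<String> extensions = Set.of("add", "my_udf");

    OptInMappingSketch optedIn = new OptInMappingSketch(explicit, extensions, true);
    System.out.println(optedIn.resolve("add"));    // explicit:add
    System.out.println(optedIn.resolve("my_udf")); // dynamic:my_udf

    OptInMappingSketch strict = new OptInMappingSketch(explicit, extensions, false);
    try {
      strict.resolve("my_udf");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // No mapping for function: my_udf
    }
  }
}
```

Systems that prefer mappings to fail would simply leave the flag off.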

* @param sqlStatement a SQL statement string
* @param catalogReader the {@link Prepare.CatalogReader} for finding tables/views referenced in
* the SQL statement
* @param operatorTable the {@link SqlOperatorTable} for dynamic operators
Member

Strictly speaking, you can inject a SqlOperatorTable here that incorporates dynamic operators, but you can also inject a SqlOperatorTable that doesn't. What this allows you to do is have more fine-grained control over valid operators by allowing you to control what SqlOperatorTable you use, instead of just being forced to use the default SubstraitOperatorTable that we provide.

Suggested change
- * @param operatorTable the {@link SqlOperatorTable} for dynamic operators
+ * @param operatorTable the {@link SqlOperatorTable} for controlling valid operators

* @param sqlStatements a string containing one or more SQL statements
* @param catalogReader the {@link Prepare.CatalogReader} for finding tables/views referenced in
* the SQL statements
* @param operatorTable the {@link SqlOperatorTable} for dynamic operators
Member

Suggested change
- * @param operatorTable the {@link SqlOperatorTable} for dynamic operators
+ * @param operatorTable the {@link SqlOperatorTable} for controlling valid operators

public SqlExpressionToSubstrait(
FeatureBoard features, SimpleExtension.ExtensionCollection extensions) {
- super(features);
+ super(features, extensions);
Member

Good fix. This would probably have resulted in some weirdness for users providing their own extension collections.


public static List<SqlOperator> from(
SimpleExtension.ExtensionCollection collection, RelDataTypeFactory typeFactory) {
TypeConverter typeConverter = TypeConverter.DEFAULT;
Member

I would suggest making the TypeConverter injectable as well, to allow users who have custom extensions with custom types to use this capability more easily.

Effectively we can have these 2 overloads:

public static List<SqlOperator> from(SimpleExtension.ExtensionCollection collection) {}
public static List<SqlOperator> from(
      SimpleExtension.ExtensionCollection collection, RelDataTypeFactory typeFactory, TypeConverter typeConverter) {}

The first for default behaviour, and the second for customizable behaviour.
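The delegation pattern behind this suggestion, sketched with hypothetical stand-ins: the `TypeConverter` interface below is a one-method placeholder, strings stand in for types and operators, and the type factory parameter is left out for brevity. It is not the substrait-java API.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical sketch of a default overload delegating to a customizable one.
final class OverloadSketch {

  /** Placeholder for a converter from extension type names to SQL type names. */
  interface TypeConverter {
    String toSqlType(String extensionType);
  }

  // Assumed default: upper-case the extension type name.
  static final TypeConverter DEFAULT = t -> t.toUpperCase(Locale.ROOT);

  /** Default behaviour: callers who do not care get the standard converter. */
  static List<String> from(List<String> returnTypes) {
    return from(returnTypes, DEFAULT);
  }

  /** Customizable behaviour: callers with custom types inject their own converter. */
  static List<String> from(List<String> returnTypes, TypeConverter converter) {
    return returnTypes.stream().map(converter::toSqlType).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(from(List.of("i64", "string")));          // [I64, STRING]
    System.out.println(from(List.of("point"), t -> "UDT_" + t)); // [UDT_point]
  }
}
```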

argFamilies.add(typeName.getFamily());
} else if (arg instanceof SimpleExtension.EnumArgument) {
// Treat an EnumArgument as a required string literal.
argFamilies.add(SqlTypeFamily.STRING);
Member

This seems reasonable enough for now. We don't have a good story for enum arguments at this point, but this can be revisited when we've figured that out.
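A self-contained sketch of the mapping being discussed, where value arguments contribute their own type family and enum arguments are treated as string literals. The `Family` enum and the argument classes are hypothetical stand-ins for SqlTypeFamily and the SimpleExtension argument types, and rejecting other argument kinds is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of deriving operand families from extension arguments.
final class ArgFamilySketch {

  enum Family { NUMERIC, STRING, TIMESTAMP } // stand-in for SqlTypeFamily

  interface Argument {}

  static final class ValueArgument implements Argument {
    final Family family;

    ValueArgument(Family family) {
      this.family = family;
    }
  }

  static final class EnumArgument implements Argument {
    final List<String> options;

    EnumArgument(List<String> options) {
      this.options = options;
    }
  }

  static List<Family> families(List<Argument> args) {
    List<Family> result = new ArrayList<>();
    for (Argument arg : args) {
      if (arg instanceof ValueArgument) {
        result.add(((ValueArgument) arg).family);
      } else if (arg instanceof EnumArgument) {
        // Treat an enum argument as a required string literal.
        result.add(Family.STRING);
      } else {
        throw new IllegalArgumentException("Unsupported argument kind: " + arg);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Shaped like the extract example: component (enum), x (timestamp), timezone (string).
    List<Argument> extractArgs =
        List.of(
            new EnumArgument(List.of("YEAR", "HOUR")),
            new ValueArgument(Family.TIMESTAMP),
            new ValueArgument(Family.STRING));
    System.out.println(families(extractArgs)); // [STRING, TIMESTAMP, STRING]
  }
}
```

Note that this mapping accepts any string for the enum position; checking the literal against the declared options would be a separate step.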


public class ExtensionUtils {

public static SimpleExtension.ExtensionCollection getDynamicExtensions(
Member

We should add a docstring to this to explain what it's doing, and what is considered a dynamic extension.


List<SimpleExtension.ScalarFunctionVariant> customFunctions =
extensions.scalarFunctions().stream()
.filter(f -> !knownFunctionNames.contains(f.name().toLowerCase(Locale.ROOT)))
Member

I don't know if this is the correct way to filter for functions. In createScalarFunctionConverter, you're filtering the functions using the mappings in FunctionMappings.SCALAR_SIGS. Is there a reason not to do that here?

Also, would it make sense to consolidate that filtering functionality here as well?
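One way to picture the consolidation: a single helper that both call sites could share, so the definition of "dynamic" lives in one place. The sketch uses plain strings for function names and a plain set for the known names; where that set comes from (for example the names in the static mappings) is left as an assumption.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical shared filter for "dynamic" functions; not the PR's ExtensionUtils.
final class DynamicFilterSketch {

  /**
   * Keeps only the functions whose lower-cased name is not already known.
   *
   * @param extensionFunctionNames names of all functions in the extension collection
   * @param knownLowerCaseNames lower-cased names that already have an explicit mapping
   */
  static List<String> dynamicFunctions(
      List<String> extensionFunctionNames, Set<String> knownLowerCaseNames) {
    return extensionFunctionNames.stream()
        .filter(name -> !knownLowerCaseNames.contains(name.toLowerCase(Locale.ROOT)))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Set<String> known = Set.of("add", "subtract");
    List<String> all = List.of("Add", "subtract", "my_udf");
    System.out.println(dynamicFunctions(all, known)); // [my_udf]
  }
}
```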

allExtensions = allExtensions.merge(SimpleExtension.load(yamlFunctionFiles));
}
return allExtensions;
}
Member

@vbarua vbarua Nov 27, 2025

It doesn't appear that this is used. Do we need it, or can we delete it?

class SimpleExtensionToSqlOperatorTest {

@Test
void test() throws IOException {
Member

This test does demonstrate that we can produce a SqlOperator for a function declared in an extension, but it doesn't necessarily show that the argument checking and return type inference of the SqlOperator are aligned with the function declaration.

How feasible do you think it would be to check that the operators we are producing:

  • Apply the correct argument restrictions
  • Have the correct return type inference

this.expressionRexConverter.setRelNodeConverter(this);
}

private static ScalarFunctionConverter createScalarFunctionConverter(
Member

@vbarua vbarua Nov 27, 2025

My preference would be to make a function like this available to users so that they can produce a scalar function converter with dynamic mappings, but not force users to use dynamic mappings.
