-
Notifications
You must be signed in to change notification settings - Fork 7
User Defined Function ‐ Design Document
-
User Defined function in SELECT Clause,
- SELECT function1(col1, col2) FROM table1
Converted query: SELECT col1, col2 FROM table1, then run function1(col1, col2)
-
User Defined function in JOIN Clause
- SELECT col3 FROM table1 join table2 ON is_match(table1.col1, table2.col2) = True
Converted query: SELECT col3, col1, col2 FROM table1, table2, then run is_match(table1.col1, table2.col2) and select rows that the function result is True.
-
User Defined function in WHERE Clause
- SELECT col3 FROM table1, table2 WHERE is_match(table1.col1, table2.col2) = True (Same expression as above JOIN clause)
Converted query: SELECT col3, col1, col2 FROM table1, table2
- SELECT col3 FROM table1 WHERE function1(col1, col2) > 100 OR col4 > 500
Converted query: SELECT col3, col1, col2, col4 FROM table1. then run function1(col1, col2) and select rows that satisfy function1(col1, col2) > 100 OR col4 > 500. col4 > 500 can't be execute directly in query string.
-
Different Types of User Defined function:
- UDF returns zero things.
- UDF return multiple rows. For example, if there are several objects in one picture, object detection model will return serveral rows.
- UDF return multiple columns, such as the result of object detection model has columns x_min, y_min, x_max, y_max.
- UDF output with alias, the query can do filter directly based on the outputs of the function.
For a query involving User Defined Function, it can be segmented into two distinct parts. The first part is a standard database query, designed to fetch the necessary data from normal db. The second part is a specialized query that selects data from the results produced by executing a User Defined Function, along with any other relevant columns. The specialized query will return user desired results. Take following query as an example
SELECT function1(col1, col2) FROM table1 WHERE function2(col3,col4) > 100 AND col5 < 100
In the first stage, we need to parse the query. This query will retrieve required data columns, for the example query, the required columns are [col1, col2, col3, col4, col5]. When some column names in a query are identical, the resulting Dataframe will have duplicate column names, leading to ambiguity. To avoid this issue, we assign a unique alias to each selected column such as 'col1 AS col__0, col2 AS col__1 etc.'. Second, we need to extract information for User Defined Function and the specialized query. We also assign a unique alias for each function's output unless to deal the situation that the query using same User designed function multiple times. We construct a mapping for each User Defined Function and the specialized query.
udf_mapping:[
{'col_names': ['col__0', 'col__1'],
'function_name': 'function1',
'result_col_name': ['function__0']},
{'col_names': ['col__2', 'col__3'],
'function_name': 'function2',
'result_col_name': ['function__1']}
]
'dataframe_sql': {
'udf_mapping':udf_mapping,
'select_col': ['function__0'],
'filter_predicate': "function__1 > 100"
To avoid excessive memory usage caused by extracting large amounts of data from the database, we retrieve the data in batches from the database. The first query will be
SELECT col1 AS col__0, col2 AS col__1, col3 AS col__2, col4 AS col__3, col5 AS col__4
FROM table1
WHERE col5 < 100
LIMIT 10000
OFFSET 0
Next, the system will execute User Defined Function based on udf mapping. The output column of UDF will be expanded. There are several situations of UDF outputs.
- Zero output: drop that row.
- Multiple rows output: expand current row to multiple rows
- Multiple columns output: expand current column to multiple columns, e.g. for two outputs UDF, function__0 will be expanded to [function__0__0, function__0__1].
- UDF output with alias, we use defined alias as Dataframe column name.
After executing User Defined Function, we will get a dataframe to store the result. Then we will execute specialized query. Currently, we use DuckDB to execute query over Dataframe. Based on parameter 'dataframe_sql', the specialized query is as following:
SELECT function__0
FROM df
WHERE function__1 > 100
The specialized query will return user desired results.
- Remove the dependency of DuckDB, use a temp table in normal db to execute specialized query.
- Support query with User Defined Function and nested query.
- Support nested User Defined Function, e.g. function1(function2(col1), col2)
- For multiple inference services required query, we don't support User Defined Function as an intermediate filter predicate.