Skip to content

User Defined Function ‐ Design Document

Tengjun Jin edited this page Dec 22, 2023 · 5 revisions

User Defined function

Different Queries that Contain User Defined Function

  1. User Defined function in SELECT Clause,

    • SELECT function1(col1, col2) FROM table1

    Query executed in a normal DB: SELECT col1 AS col__0, col2 AS col__1 FROM table1

  2. User Defined function in JOIN Clause

    • SELECT col3 FROM table1 join table2 ON is_match(table1.col1, table2.col2) = True

    Query executed in a normal DB: SELECT col3 AS col__0, col1 AS col__1, col2 AS col__2 FROM table1 JOIN table2

  3. User Defined function in WHERE Clause

    • SELECT col3 FROM table1, table2 WHERE is_match(table1.col1, table2.col2) = True (Same expression as above JOIN clause)

    Query executed in a normal DB: SELECT col3 AS col__0, col1 AS col__1, col2 AS col__2 FROM table1 JOIN table2

    • SELECT col3 FROM table1 WHERE function1(col1, col2) > 100 OR col4 > 500

    Query executed in a normal DB: SELECT col3 AS col__0, col1 AS col__1, col2 AS col__2, col4 AS col__3 FROM table1

  4. Different Types of User Defined function:

    • UDF returns zero things.
    • UDF returns multiple rows. For instance, in an image with multiple objects, an object detection model will generate several rows, each representing a detected object.
    • UDF returns multiple columns. For example, the output of an object detection model might include columns like x_min, y_min, x_max, and y_max, representing the coordinates of detected objects.
    • Predefined aliases for UDF outputs. The query can directly apply filters based on the function's outputs.

Design Logic

For a query involving User Defined Function, it can be segmented into two distinct parts. The first part is a standard database query, designed to fetch the necessary data from normal db. The second part is a specialized query that selects data from the results produced by executing a User Defined Function, along with any other relevant columns. The specialized query will return user desired results. Take following query as an example

SELECT function1(col1, col2) FROM table1 WHERE function2(col3,col4) > 100 AND col5 < 100

Stage 1: Query Parsing

In the first stage, the query is parsed to retrieve the necessary data columns. In our example, these columns are [col1, col2, col3, col4, col5]. When a query contains identical column names, it can lead to ambiguity due to duplicate column names in the resulting Dataframe. To avoid this issue, we assign unique aliases to each selected column such as 'col1 AS col__0, col2 AS col__1', and so on. Additionally, we need to extract information about User Defined Functions and the specialized query. We also assign a unique alias to each function. This strategy is particularly useful in situations where the same User Defined Function (UDF) is utilized multiple times within a query. By doing so, we effectively manage potential ambiguities. Furthermore, we construct a a clear mapping for each UDF and the corresponding specialized query.

udf_mapping:[
    {'col_names': ['col__0', 'col__1'],
     'function_name': 'function1',
     'result_col_name': ['function__0']},
    {'col_names': ['col__2', 'col__3'],
     'function_name': 'function2',
     'result_col_name': ['function__1']}
]
'dataframe_sql': {
     'udf_mapping':udf_mapping,
     'select_col': ['function__0'],
     'filter_predicate': "function__1 > 100"

Stage 2: Executing First Query in a Normal DB

To minimize memory usage that can occur when extracting large volumes of data from the database, we implement a batch retrieval process. This approach involves extracting data in segmented portions rather than in a single query. The initial query in this batch process will be

SELECT col1 AS col__0, col2 AS col__1, col3 AS col__2, col4 AS col__3, col5 AS col__4
FROM table1
WHERE col5 < 100
LIMIT 10000
OFFSET 0

Stage 3: Executing User Defined Function

Following this, the system will execute the UDF in accordance with the established UDF mapping. If the UDF returns an output comprising multiple columns, the output column generated by the UDF will then be expanded accordingly. It's important to note that there are several scenarios regarding UDF outputs, each requiring different handling or processing approaches.

  • Zero output: drop current row.
  • Multiple rows output: the current row will be expanded into multiple rows to accommodate this output.
  • Multiple columns output: the current column will be expanded into multiple columns to accommodate this output., e.g. for two outputs UDF, function__0 will be expanded to [function__0__0, function__0__1].
  • Predefined aliases for UDF outputs: we utilize the predefined aliases as column names in the DataFrame.

Stage 4: Execute Specialized Query Over Dataframe

After executing the User Defined Function, the results are stored in a DataFrame. Subsequently, we proceed to execute the specialized query. At present, we utilize DuckDB to run queries over the DataFrame. The specialized query, which is determined based on the 'dataframe_sql' parameter, is structured as follows

SELECT function__0
FROM df
WHERE function__1 > 100

The specialized query will return the results as desired by the user.

TODO:

  • Removes the dependency of DuckDB. Instead, uses a temporary table in a standard database to execute the specialized query.
  • Supports queries incorporating both User Defined Functions and nested queries.
  • Supports nested usage of User Defined Function, e.g. function1(function2(col1), col2)
  • For queries that require multiple inference services, our system does not support using User Defined Functions (UDFs) as intermediate filter predicates for input filtering.