Skip to content

User Defined Function ‐ Design Document

Tengjun Jin edited this page Dec 22, 2023 · 5 revisions

User Defined function

Different Queries that contain user defined function

  1. User Defined function in SELECT Clause,

    • SELECT function1(col1, col2) FROM table1

    Converted query: SELECT col1, col2 FROM table1, then run function1(col1, col2)

  2. User Defined function in JOIN Clause

    • SELECT col3 FROM table1 join table2 ON is_match(table1.col1, table2.col2) = True

    Converted query: SELECT col3, col1, col2 FROM table1, table2, then run is_match(table1.col1, table2.col2) and select rows that the function result is True.

  3. User Defined function in WHERE Clause

    • SELECT col3 FROM table1, table2 WHERE is_match(table1.col1, table2.col2) = True (Same expression as above JOIN clause)

    Converted query: SELECT col3, col1, col2 FROM table1, table2

    • SELECT col3 FROM table1 WHERE function1(col1, col2) > 100 OR col4 > 500

    Converted query: SELECT col3, col1, col2, col4 FROM table1. then run function1(col1, col2) and select rows that satisfy function1(col1, col2) > 100 OR col4 > 500. col4 > 500 can't be execute directly in query string.

  4. Different Types of User Defined function:

    • UDF returns zero things.
    • UDF return multiple rows. For example, if there are several objects in one picture, object detection model will return serveral rows.
    • UDF return multiple columns, such as the result of object detection model has columns x_min, y_min, x_max, y_max.
    • UDF output with alias, the query can do filter directly based on the outputs of the function.

Design Logic

For a query involving User Defined Function, it can be segmented into two distinct parts. The first part is a standard database query, designed to fetch the necessary data from normal db. The second part is a specialized query that selects data from the results produced by executing a User Defined Function, along with any other relevant columns. The specialized query will return user desired results. Take query 'SELECT function1(col1, col2) FROM table1 WHERE function2(col3,col4) > 100 AND col5 < 100' as an example.

Stage 1: Query Parsing

In the first stage, we need to parse the query. This query will retrieve required data columns, for the example query, the required columns are [col1, col2, col3, col4, col5]. When some column names in a query are identical, the resulting Dataframe will have duplicate column names, leading to ambiguity. To avoid this issue, we assign a unique alias to each selected column such as 'col1 AS col__0, col2 AS col__1 etc.'. Second, we need to extract information for User Defined Function and the specialized query. We also assign a unique alias for each function's output unless to deal the situation that the query using same User designed function multiple times. We construct a mapping for each User Defined Function and the specialized query.

udf_mapping:[
    {'col_names': ['col__0', 'col__1'],
     'function_name': 'function1',
     'result_col_name': ['function__0']},
    {'col_names': ['col__2', 'col__3'],
     'function_name': 'function2',
     'result_col_name': ['function__1']}
]
'dataframe_sql': {
     'udf_mapping':udf_mapping,
     'select_col': ['function__0'],
     'filter_predicate': "function__1 > 100"

Stage 2: Executing First Query in Normal DB

To avoid excessive memory usage caused by extracting large amounts of data from the database, we retrieve the data in batches from the database. The first query will be

SELECT col1 AS col__0, col2 AS col__1, col3 AS col__2, col4 AS col__3, col5 AS col__4
FROM table1
WHERE col5 < 100

Stage 3: Executing User Defined Function

  1. Convert query as above, extract mapping relations. For each query, we construct a {output_index : {function_name:’’, related column name}}. {column name: column index in new SELECT clause}
  2. A function to execute user defined function and filter predicate that has OR relation with UDF
  3. Refactor full scan engine, selectively filter input data based on User-Defined Function filtering predicates. (Required for multiple inference query, previous design has an issue that if current inp_query_str doesn’t contain required table, the filter predicate doesn’t work although that columns has been materialized.)
  4. Run the converted query, execute user defined function on the results of converted query based on the extracted mapping relations, return final results.