Spark_SQL_Macro_examples
- Register a macro by calling the `registerMacro` function. Its second argument must be a call to the `udm` macro.
- `import org.apache.spark.sql.defineMacros._` so that the `registerMacro` and `udm` functions become implicitly available on your `SparkSession`.
```scala
import org.apache.spark.sql.defineMacros._

spark.registerMacro("intUDM", spark.udm({(i : Int) =>
val b = Array(5, 6)
val j = b(0)
val k = new java.sql.Date(System.currentTimeMillis()).getTime
i + j + k + Math.abs(j)
})
)
```

Once registered, you can use the macro as a UDF in Spark SQL. The SQL
`select intUDM(c_int) from sparktest.unit_test` generates the following plan:

```
Project [((cast((c_int#2 + 5) as bigint) + 1613706648609) + cast(abs(5) as bigint)) AS ((CAST((c_int + 5) AS BIGINT) + 1613706648609) + CAST(abs(5) AS BIGINT))#3L]
+- SubqueryAlias spark_catalog.default.unit_test
   +- Relation[c_varchar2_40#0,c_number#1,c_int#2] parquet
```
| Num. | macro | Catalyst expression |
|---|---|---|
| 1. | (i : Int) => i | macroarg(0, IntegerType) |
| 2. | (i : java.lang.Integer) => i | macroarg(0, IntegerType) |
| 3. | (i : Int) => i + 5 | (macroarg(0, IntegerType) + 5) |
| 4. | {(i : Int) => … | 5 |
| 5. | (i : Int) => org.apache.spark.SPARK_BRANCH.length + i | (4 + macroarg(0, IntegerType)) |
| 6. | {(i : Int) => … | ((macroarg(0, IntegerType) + 5) + abs(5)) |
| 7. | {(i : Int) => … | 5 |
- We support all types for which Spark can infer an `ExpressionEncoder`.
- See the Array examples below for the expressions supported inside Arrays.
- In example 5, `org.apache.spark.SPARK_BRANCH.length` is evaluated at macro definition time.
- We support ADT (case class) construction and field access; a registration sketch follows the table below.
- At the macro call-site the case class must be in scope, for example:
```scala
package macrotest {
  object ExampleStructs {
    case class Point(x: Int, y: Int)
  }
}

import macrotest.ExampleStructs.Point
```

| Num. | macro | Catalyst expression |
|---|---|---|
| 1. | {(p : Point) => … | named_struct(x, 1, y, 2) |
| 2. | {(p : Point) => … | ( … |
| 3. | {(p : Point) => … | named_struct( … |
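As a registration sketch (the macro name `pointMacro` is illustrative), a macro over `Point` is defined and registered like any other:

```scala
import org.apache.spark.sql.defineMacros._
import macrotest.ExampleStructs.Point

// Point(...) construction translates to a named_struct expression and the
// q.x / q.y field accesses translate to struct-field expressions
// (which are collapsed where possible).
spark.registerMacro("pointMacro", spark.udm({ (p: Point) =>
  val q = Point(p.x, p.y + 1)
  q.x + q.y
}))
```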
- We support Tuple construction and field access, as shown in the table and the sketch below.
| Num. | macro | Catalyst expression |
|---|---|---|
| 1. | {(t : Tuple2[Int, Int]) => … | named_struct( … |
| 2. | {(t : Tuple2[Int, Int]) => … | named_struct( … |
| 3. | {(t : Tuple4[Float, Double, Int, Int]) => … | named_struct( … |
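A minimal registration sketch for the Tuple case (the macro name `tupleMacro` is illustrative):

```scala
import org.apache.spark.sql.defineMacros._

// Tuple construction maps to a named_struct expression; the _1/_2
// accesses translate to struct-field accesses that are then collapsed.
spark.registerMacro("tupleMacro", spark.udm({ (i: Int) =>
  val t = (i, i * 2)
  t._1 + t._2
}))
```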
- We support construction and entry access for Array and immutable Map, as shown in the table and the sketch below.
| Num. | macro | Catalyst expression |
|---|---|---|
| 1. | {(i : Int) => … | (5 + macroarg(0, IntegerType)) |
| 2. | {(i : Int) => … | (macroarg(0, IntegerType) + (macroarg(0, IntegerType) + 1)) |
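A minimal registration sketch combining Array and immutable Map usage (the macro name `collMacro` is illustrative):

```scala
import org.apache.spark.sql.defineMacros._

spark.registerMacro("collMacro", spark.udm({ (i: Int) =>
  // Array construction and index access: b(0) folds to the literal 5.
  val b = Array(5, 6)
  // Immutable Map construction and key lookup: m(1) folds to the literal 10.
  val m = Map(1 -> 10, 2 -> 20)
  i + b(0) + m(1)
}))
```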
- Currently we support translation of the functions in Spark's `DateTimeUtils` module, but with function params that are `Date` and `Timestamp` instead of `Int` and `Long`. `org.apache.spark.sql.sqlmacros.DateTimeUtils` defines the supported functions with `Date` and `Timestamp` parameters.
```scala
import java.sql.Date
import java.sql.Timestamp
import java.time.ZoneId
import java.time.Instant
import org.apache.spark.unsafe.types.CalendarInterval
import org.apache.spark.sql.sqlmacros.DateTimeUtils._
{(dt : Date) =>
val dtVal = dt
val dtVal2 = new Date(System.currentTimeMillis())
val tVal = new Timestamp(System.currentTimeMillis())
val dVal3 = localDateToDays(java.time.LocalDate.of(2000, 1, 1))
val t2 = instantToMicros(Instant.now())
val t3 = stringToTimestamp("2000-01-01", ZoneId.systemDefault()).get
val t4 = daysToMicros(dtVal, ZoneId.systemDefault())
getDayInYear(dtVal) + getDayOfMonth(dtVal) + getDayOfWeek(dtVal2) +
getHours(tVal, ZoneId.systemDefault) + getSeconds(t2, ZoneId.systemDefault) +
getMinutes(t3, ZoneId.systemDefault()) +
getDayInYear(dateAddMonths(dtVal, getMonth(dtVal2))) +
getDayInYear(dVal3) +
getHours(
timestampAddInterval(t4, new CalendarInterval(1, 1, 1), ZoneId.systemDefault()),
ZoneId.systemDefault) +
getDayInYear(dateAddInterval(dtVal, new CalendarInterval(1, 1, 1L))) +
monthsBetween(t2, t3, true, ZoneId.systemDefault()) +
getDayOfMonth(getNextDateForDayOfWeek(dtVal2, "MO")) +
getDayInYear(getLastDayOfMonth(dtVal2)) + getDayOfWeek(truncDate(dtVal, "week")) +
getHours(toUTCTime(t3, ZoneId.systemDefault().toString), ZoneId.systemDefault())
}
```

This is translated to:

```
((((((((((((((
dayofyear(macroarg(0)) +
dayofmonth(macroarg(0))) +
dayofweek(CAST(timestamp_millis(1613603994973L) AS DATE))) +
hour(timestamp_millis(1613603995085L))) +
second(TIMESTAMP '2021-02-17 15:19:55.207')) +
minute(CAST('2000-01-01' AS TIMESTAMP))) +
dayofyear(add_months(macroarg(0), month(CAST(timestamp_millis(1613603994973L) AS DATE))))) +
dayofyear(make_date(2000, 1, 1))) +
hour(CAST(macroarg(0) AS TIMESTAMP) + INTERVAL '1 months 1 days 0.000001 seconds')) +
dayofyear(macroarg(0) + INTERVAL '1 months 1 days 0.000001 seconds')) +
months_between(TIMESTAMP '2021-02-17 15:19:55.207', CAST('2000-01-01' AS TIMESTAMP), true)) +
dayofmonth(next_day(CAST(timestamp_millis(1613603994973L) AS DATE), 'MO'))) +
dayofyear(last_day(CAST(timestamp_millis(1613603994973L) AS DATE)))) +
dayofweek(trunc(macroarg(0), 'week'))) +
hour(to_utc_timestamp(CAST('2000-01-01' AS TIMESTAMP), 'America/Los_Angeles')))
```
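Such a lambda is registered like any other macro; a much smaller sketch (the macro name `dateFun` is illustrative):

```scala
import java.sql.Date
import org.apache.spark.sql.defineMacros._
import org.apache.spark.sql.sqlmacros.DateTimeUtils._

// getDayInYear / getDayOfMonth should translate to the dayofyear /
// dayofmonth Catalyst expressions over the macro argument.
spark.registerMacro("dateFun", spark.udm({ (dt: Date) =>
  getDayInYear(dt) + getDayOfMonth(dt)
}))
```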
We support translation of the logical operators (AND, OR, NOT), the comparison operators
(>, >=, <, <=, ==, !=), the string predicate functions (startsWith, endsWith, contains),
the `if` statement and the `case` statement.
Support for case statements is limited:
- the case pattern must be of the form `cq"$pat => $expr2"`, so no `if` guards in a case
- for a constructor pattern like `(a, b)`, `Point(1, 2)`, etc., the pattern components must be literals
- `org.apache.spark.sql.sqlmacros.PredicateUtils` provides a class with the marker functions `is_null`, `is_not_null`, `null_safe_eq(o : Any)`, `in(a : Any*)` and `not_in(a : Any*)` for `Any` value. With `import org.apache.spark.sql.sqlmacros.PredicateUtils._` in scope at the macro call-site you can write expressions like `i.is_not_null` and `i.in(4, 5)` (see Example 1 below). Note that these functions exist for the purpose of translation only: if the macro cannot be translated, the Scala function is registered with Spark instead, and invoking that function at runtime will fail at these expressions.
```scala
import org.apache.spark.sql.sqlmacros.PredicateUtils._
import macrotest.ExampleStructs.Point
{ (i: Int) =>
val j = if (i > 7 && i < 20 && i.is_not_null) {
i
} else if (i == 6 || i.in(4, 5) ) {
i + 1
} else i + 2
val k = i match {
case 1 => i + 2
case _ => i + 3
}
val l = (j, k) match {
case (1, 2) => 1
case (3, 4) => 2
case _ => 3
}
val p = Point(k, l)
val m = p match {
case Point(1, 2) => 1
case _ => 2
}
j + k + l + m
}
```

This is translated to the expression tree:

```
((((
IF((((macroarg(0) > 7) AND (macroarg(0) < 20)) AND (macroarg(0) IS NOT NULL)),
macroarg(0),
(IF(((macroarg(0) = 6) OR (macroarg(0) IN (4, 5))),
(macroarg(0) + 1),
(macroarg(0) + 2))))
) +
CASE
WHEN (macroarg(0) = 1) THEN (macroarg(0) + 2)
ELSE (macroarg(0) + 3) END) +
CASE
WHEN (named_struct(
'col1',
(IF((((macroarg(0) > 7) AND (macroarg(0) < 20)) AND (macroarg(0) IS NOT NULL)), macroarg(0),
(IF(((macroarg(0) = 6) OR (macroarg(0) IN (4, 5))),
(macroarg(0) + 1),
(macroarg(0) + 2))))
),
'col2',
CASE WHEN (macroarg(0) = 1) THEN (macroarg(0) + 2) ELSE (macroarg(0) + 3) END
) = [1,2]) THEN 1
WHEN (named_struct(
'col1',
(IF((((macroarg(0) > 7) AND (macroarg(0) < 20)) AND (macroarg(0) IS NOT NULL)), macroarg(0),
(IF(((macroarg(0) = 6) OR (macroarg(0) IN (4, 5))),
(macroarg(0) + 1),
(macroarg(0) + 2))))
),
'col2',
CASE WHEN (macroarg(0) = 1) THEN (macroarg(0) + 2) ELSE (macroarg(0) + 3) END
) = [3,4]) THEN 2
ELSE 3 END) +
CASE WHEN (named_struct(
'x',
CASE WHEN (macroarg(0) = 1) THEN (macroarg(0) + 2) ELSE (macroarg(0) + 3) END,
'y',
CASE WHEN (named_struct('col1',
(IF((((macroarg(0) > 7) AND (macroarg(0) < 20))
AND (macroarg(0) IS NOT NULL)),
macroarg(0),
(IF(((macroarg(0) = 6) OR (macroarg(0) IN (4, 5))),
(macroarg(0) + 1), (macroarg(0) + 2))))
),
'col2',
CASE WHEN (macroarg(0) = 1) THEN (macroarg(0) + 2)
ELSE (macroarg(0) + 3) END
) = [1,2]
) THEN 1
WHEN (named_struct('col1',
(IF((((macroarg(0) > 7) AND (macroarg(0) < 20))
AND (macroarg(0) IS NOT NULL)),
macroarg(0),
(IF(((macroarg(0) = 6) OR (macroarg(0) IN (4, 5))),
(macroarg(0) + 1), (macroarg(0) + 2))))
),
'col2',
CASE WHEN (macroarg(0) = 1) THEN (macroarg(0) + 2)
ELSE (macroarg(0) + 3) END
) = [3,4]
) THEN 2
ELSE 3 END
) = [1,2]) THEN 1
ELSE 2 END
)
```
```scala
{ (s: String) =>
val i = if (s.endsWith("abc")) 1 else 0
val j = if (s.contains("abc")) 1 else 0
val k = if (s.is_not_null && s.not_in("abc")) 1 else 0
i + j + k
}
```

This is translated to the expression tree:

```
(((
IF(endswith(macroarg(0), 'abc'), 1, 0)) +
(IF(contains(macroarg(0), 'abc'), 1, 0))) +
(IF(((macroarg(0) IS NOT NULL) AND (NOT (macroarg(0) IN ('abc')))), 1, 0))
)
```
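Registered and invoked from SQL, the string example looks like the following sketch (the macro name `strMacro` is illustrative; `c_varchar2_40` is the string column of the test table used above):

```scala
import org.apache.spark.sql.defineMacros._
import org.apache.spark.sql.sqlmacros.PredicateUtils._

spark.registerMacro("strMacro", spark.udm({ (s: String) =>
  val i = if (s.endsWith("abc")) 1 else 0
  val j = if (s.contains("abc")) 1 else 0
  val k = if (s.is_not_null && s.not_in("abc")) 1 else 0
  i + j + k
}))

// Use it like a UDF in SQL.
spark.sql("select strMacro(c_varchar2_40) from sparktest.unit_test").show()
```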
- A macro definition may contain invocations of already registered macros.
- The syntax to call a registered macro is `registered_macros.<macro_name>(args...)`.
```scala
import org.apache.spark.sql.defineMacros._
import org.apache.spark.sql.sqlmacros.registered_macros
spark.registerMacro("m2", spark.udm({(i : Int) =>
val b = Array(5, 6)
val j = b(0)
val k = new java.sql.Date(System.currentTimeMillis()).getTime
i + j + k + Math.abs(j)
})
)
spark.registerMacro("m3", spark.udm({(i : Int) =>
val l : Int = registered_macros.m2(i)
i + l
})
)
```

Then the SQL `select m3(c_int) from sparktest.unit_test` has the following plan:

```
Project [(cast(c_int#3 as bigint) + ((cast((c_int#3 + 5) as bigint) + 1613707983401) + cast(abs(5) as bigint))) AS (CAST(c_int AS BIGINT) + ((CAST((c_int + 5) AS BIGINT) + 1613707983401) + CAST(abs(5) AS BIGINT)))#4L]
+- SubqueryAlias spark_catalog.default.unit_test
   +- Relation[c_varchar2_40#1,c_number#2,c_int#3] parquet
```
- We collapse `sparkexpr.GetMapValue`, `sparkexpr.GetStructField` and `sparkexpr.GetArrayItem` expressions (see the sketch after the table below).
- We also simplify `Unwrap <- Wrap` expression sub-trees for Option values.
| Num. | macro | Catalyst expression |
|---|---|---|
| 1. | {(p : Point) => … | (((macroarg(0, StructField(x,IntegerType,false), StructField(y,IntegerType,false)).x … |
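For example (a sketch; the exact rewritten form depends on the translation), constructing a `Point` and immediately reading a field should not leave a `GetStructField(named_struct(...))` sub-tree behind:

```scala
import org.apache.spark.sql.defineMacros._
import macrotest.ExampleStructs.Point

// The GetStructField over the freshly constructed named_struct is
// collapsed, so the translated expression should reduce to something
// like (macroarg(0) + 1) rather than a struct build plus a field access.
spark.registerMacro("collapsePoint", spark.udm({ (i: Int) =>
  Point(i, i + 1).y
}))
```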