NPath
Harish Butani edited this page Nov 2, 2012 · 2 revisions
On a flights table, here is a sample query:
flights_tiny (ORIGIN_CITY_NAME string, DEST_CITY_NAME string, YEAR int, MONTH int, DAY_OF_MONTH int, ARR_DELAY float, FL_NUM string);

select origin_city_name, fl_num, year, month, day_of_month, sz, tpath
from npath(
  flights_tiny
    partition by fl_num
    order by year, month, day_of_month,
  'ONTIME.LATE+',
  'LATE', arr_delay > 15,
  'EARLY', arr_delay < 0,
  'ONTIME', arr_delay >= 0 and arr_delay <= 15,
  'origin_city_name, fl_num, year, month, day_of_month, size(tpath) as sz, tpath as tpath'
);
The interface is as follows. After the input table, with its partition and order clauses, npath takes:
- A Symbol Pattern (dot separated list of Symbol names, with support for ‘+’ and ‘*’)
- One or more Symbol definitions: A Symbol definition is a pair of arguments: a name and any Boolean expression on the row.
- A result expression string, which is interpreted as a select list on the input row that matches the pattern. During this evaluation, tpath is also available as an array of the input rows that match the specified pattern.
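To make the interface concrete, here is a minimal Python model of how a symbol pattern could be evaluated over one ordered partition. It is only a sketch of the semantics described above, not the actual implementation: the function name `npath`, the one-letter encoding, the choice to label a row with the first symbol whose predicate holds, and the sample rows are all assumptions made for illustration.

```python
import re

def npath(rows, pattern, symbols):
    """Model of npath over a single partition whose rows are already ordered.

    pattern: dot-separated symbol names, each optionally followed by '+' or '*'.
    symbols: dict mapping a symbol name to a boolean predicate on a row.
    Returns one (start_row, tpath) pair per row at which the pattern matches.
    """
    # Give every symbol a one-letter code so the pattern becomes a regex
    # (assumption: at most 26 symbols, which keeps the codes in 'A'..'Z').
    codes = {name: chr(ord("A") + i) for i, name in enumerate(symbols)}

    regex_parts = []
    for step in pattern.split("."):
        quantifier = step[-1] if step[-1] in "+*" else ""
        name = step.rstrip("+*")
        regex_parts.append(codes[name] + quantifier)
    regex = re.compile("".join(regex_parts))

    # Label each row with the first symbol whose predicate holds ('-' if none).
    def label(row):
        for name, predicate in symbols.items():
            if predicate(row):
                return codes[name]
        return "-"

    labels = "".join(label(row) for row in rows)

    matches = []
    for start in range(len(rows)):  # every row is tried as a possible start
        m = regex.match(labels, start)
        if m and m.end() > start:   # ignore empty matches
            matches.append((rows[start], rows[start:m.end()]))
    return matches


# Made-up rows for one flight number, ordered by date.
flight_1531 = [
    {"day_of_month": 20, "arr_delay": 5.0},
    {"day_of_month": 21, "arr_delay": 30.0},
    {"day_of_month": 22, "arr_delay": 45.0},
    {"day_of_month": 23, "arr_delay": -3.0},
]
symbols = {
    "LATE": lambda r: r["arr_delay"] > 15,
    "EARLY": lambda r: r["arr_delay"] < 0,
    "ONTIME": lambda r: 0 <= r["arr_delay"] <= 15,
}
for first, tpath in npath(flight_1531, "ONTIME.LATE+", symbols):
    print(first["day_of_month"], len(tpath))
```

In this model the result expression would be evaluated against the first row of each match, with tpath bound to the matched slice; that choice is an assumption based on the sample query, where size(tpath) is selected alongside the row's own columns.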
Here is sample output from the above query:
Chicago 1531 2010 10 20 3 [{"origin_city_name":"Chicago","dest_city_name":"New York...},{"origin_city_name" ...}]
Baltimore 1599 2010 10 20 3 [...]
Baltimore 1599 2010 10 24 4 [...]
Some thoughts on how to extend the functionality:
- Add a skipsAhead parameter.
- Currently every row is checked for the pattern.
- With skipsAhead = true, once a match is found, matching would jump ahead past the last row in the matched set.
- The matched set of rows is exposed as an array of structs. This restricts the kind of operations possible in the result expression.
- Originally it was possible to write Groovy expressions on the array to do things like compute the average delay, max delay, etc.
- Since Hive doesn’t have functions to aggregate over arrays (beyond basics such as size), one idea is to set up a partition of the rows in tpath and run them through an aggregation step.
- The shape of these rows would have the original columns + the columns from the tpath structure.
- The user would express aggregations like so:
select origin_city_name, fl_num, year, month, day_of_month, sz, tpath
from npath(
  ...,
  'origin_city_name, fl_num, year, month, day_of_month, avg(tpath.arr_delay) as avgDelay'
)
into path='/tmp/testNPath';
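As a rough illustration of the aggregation idea, the following Python sketch evaluates aggregates over the rows of one match. It does not reflect any existing code: the helper `aggregate_match`, its parameters, and the sample rows are hypothetical, and the plain columns are assumed to come from the first matched row.

```python
from statistics import mean

def aggregate_match(start_row, tpath, keys, aggregations):
    """Evaluate a result expression containing aggregates for one match.

    keys: columns copied unchanged from the start row.
    aggregations: output name -> (aggregate function, tpath column).
    """
    result = {key: start_row[key] for key in keys}
    for out_name, (func, column) in aggregations.items():
        # Treat the matched rows as a small partition and aggregate one column.
        result[out_name] = func([row[column] for row in tpath])
    return result


# Made-up matched rows for one flight.
tpath = [
    {"fl_num": "1531", "day_of_month": 20, "arr_delay": 6.0},
    {"fl_num": "1531", "day_of_month": 21, "arr_delay": 30.0},
    {"fl_num": "1531", "day_of_month": 22, "arr_delay": 45.0},
]
summary = aggregate_match(
    tpath[0],
    tpath,
    keys=["fl_num", "day_of_month"],
    aggregations={
        "avgDelay": (mean, "arr_delay"),
        "maxDelay": (max, "arr_delay"),
    },
)
print(summary)
```

Here avgDelay plays the role of avg(tpath.arr_delay) in the query above; maxDelay is added only to show that any aggregate function fits the same shape.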
- Are the mechanics for specifying Patterns sufficient?
- In Groovy, each step in the pattern could be an arbitrary Groovy expression, so you could say: ‘(LATE && isWeekend(day_of_month)).LATE+’
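Two of the ideas above, skipsAhead and pattern steps written as arbitrary expressions, can be sketched together in Python. This is only a model under stated assumptions: `match_at`, `find_matches`, `is_weekend`, and the sample rows are hypothetical names and data, each pattern step is represented as a (predicate, quantifier) pair, and matching is assumed to be greedy with backtracking.

```python
from datetime import date

def match_at(rows, start, steps):
    """Return the exclusive end index of a match of `steps` beginning at
    rows[start], or None. Each step is (predicate, quantifier), where the
    quantifier is '' (exactly one), '+' (one or more) or '*' (zero or more).
    """
    def go(pos, i):
        if i == len(steps):
            return pos
        predicate, quantifier = steps[i]
        # Count how many consecutive rows from pos satisfy this step.
        run = 0
        while pos + run < len(rows) and predicate(rows[pos + run]):
            run += 1
        if quantifier == "":
            low, high = 1, min(run, 1)
        elif quantifier == "+":
            low, high = 1, run
        else:
            low, high = 0, run
        # Greedy with backtracking: try the longest run first.
        for n in range(high, low - 1, -1):
            end = go(pos + n, i + 1)
            if end is not None:
                return end
        return None

    return go(start, 0)

def find_matches(rows, steps, skips_ahead=False):
    """Scan one ordered partition and return (start, end) index pairs."""
    matches = []
    start = 0
    while start < len(rows):
        end = match_at(rows, start, steps)
        if end is not None and end > start:
            matches.append((start, end))
            # With skips_ahead, resume after the last row of the match.
            start = end if skips_ahead else start + 1
        else:
            start += 1
    return matches

def is_late(row):
    return row["arr_delay"] > 15

def is_weekend(row):
    # Hypothetical stand-in for isWeekend(day_of_month).
    return date(row["year"], row["month"], row["day_of_month"]).weekday() >= 5

# Made-up rows for one flight: Fri 22 through Tue 26 October 2010.
delays = [20.0, 30.0, 40.0, 25.0, 5.0]
rows = [
    {"year": 2010, "month": 10, "day_of_month": 22 + i, "arr_delay": d}
    for i, d in enumerate(delays)
]

# Equivalent of '(LATE && isWeekend(day_of_month)).LATE+'
steps = [(lambda r: is_late(r) and is_weekend(r), ""), (is_late, "+")]

print(find_matches(rows, steps))                    # overlapping matches
print(find_matches(rows, steps, skips_ahead=True))  # non-overlapping matches
```

Without skips_ahead, the Saturday and the Sunday each start a match that ends on the same Monday row, so the two matches overlap. With skips_ahead, only the match starting on Saturday is reported.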