[FEAT] Allow fuzzy matches on array-valued columns #1994
Comments
Yeah - so we've been experimenting with these new DuckDB functions too. We plan to include a function in the comparison library/comparison level library eventually, but for the moment you can do it using custom SQL. Here are some pointers if you didn't work it out already: https://gist.github.com/RobinL/d8a84f7a31fa7cb17dafb05c94518225
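(For reference, a plausible shape for that custom SQL, not the gist verbatim: nested `list_transform` to form the cartesian product of the two arrays, `flatten`, then `list_min`; `pairs`, `name_l`, and `name_r` are illustrative names.)

```sql
-- sketch: all pairwise Levenshtein distances via nested list_transform,
-- flattened into one list, then reduced to the best (smallest) distance
SELECT list_min(
           flatten(
               list_transform(name_l, a ->
                   list_transform(name_r, b -> levenshtein(a, b))
               )
           )
       ) <= 2 AS fuzzy_array_match
FROM pairs;
```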
Thanks, that may be a neater approach. Do you have any idea how to make this work on Spark? I see there are similar array functions, and will experiment with some.
For Spark we define a custom UDF, DUALARRAYEXPLODE, and use it like this:

```sql
ARRAY_MIN(
    TRANSFORM(DUALARRAYEXPLODE(name_l, name_r), x -> LEVENSHTEIN(x['_1'], x['_2']))
) <= 2
```
Thanks. Following Robin's sample code, I think UDFs can be avoided with an iterated transform to do the cartesian product (a little easier to understand than my iterated reduce proposal, though it involves one extra array operation and possibly intermediate storage). The approach can be put together as something like this:
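(The original snippet did not survive in this thread; a sketch of what such a nested-TRANSFORM expression could look like in Spark SQL, using only built-in functions, with `name_l`/`name_r` as in the earlier comment:)

```sql
-- sketch: cartesian product of the two arrays via nested TRANSFORM,
-- flattened, then aggregated with ARRAY_MIN (no UDF required)
ARRAY_MIN(
    FLATTEN(
        TRANSFORM(name_l, x ->
            TRANSFORM(name_r, y -> LEVENSHTEIN(x, y))
        )
    )
) <= 2
```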
I'm planning to work on a PR for this!
Is your proposal related to a problem?
One typical use of array-valued columns is when one known entity has multiple values for a single attribute, a match on any of which is acceptable. For example, one person may have multiple telephone numbers, an address history, or even multiple names (married/unmarried names, aliases, etc.). While exploding the records outside of splink and post-processing matches is one alternative to array-valued columns, this affects match probabilities and may create memory challenges.
Currently the only way to compare array-valued columns is to use the array_intersect functions, which require exact matches among the elements of the array.
However, in many circumstances, these values are prone to error and fuzzy matching may be required to identify matches.
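For concreteness, a status-quo comparison level based on exact element overlap might look like this in DuckDB SQL (`phone_numbers_l`/`phone_numbers_r` are illustrative column names, not splink's internals):

```sql
-- exact-match overlap: true only if the arrays share at least one identical element
len(array_intersect(phone_numbers_l, phone_numbers_r)) > 0
```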
Describe the solution you'd like
I propose a family of matching functions of the form array_min_sim(x, y, simtype) and array_max_sim(x, y, simtype). Here x and y are arrays, and "simtype" is a string identifying one of the existing fuzzy string comparison metrics (alternatively, a separate array function could be implemented for each metric).
The semantics would be to return the maximum or minimum similarity (alternatively, distance) between an element of array x and an element of array y. These max/min similarities could be thresholded to get comparison levels.
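To make the semantics concrete, comparison levels built on these hypothetical functions might read:

```sql
-- hypothetical syntax: neither function exists yet
array_max_sim(name_l, name_r, 'jaro_winkler') >= 0.9  -- best similarity across all element pairs
array_min_sim(name_l, name_r, 'levenshtein') <= 2     -- best (smallest) distance across all element pairs
```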
I'm not sure what the best implementation of this would be. DuckDB may allow an implementation using an iterated call to list_reduce. For example, for array_min_sim, the underlying code could be a list_reduce applied to [Infinity, x] (Infinity prepended to x), with reduce function (a, b) -> min(a, list_reduce([Infinity, y], (c, d) -> min(c, dist(b, d)))).
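(A rough DuckDB sketch of that reduce idea: prepending an Infinity sentinel to a string array would not type-check, so this version maps each pairing to a distance with `list_transform` first and then reduces with `least`. It assumes non-empty arrays, DuckDB's built-in `levenshtein`, and a DuckDB version with `list_reduce`; `pairs`, `name_l`, and `name_r` are illustrative names.)

```sql
-- sketch: array-min Levenshtein distance via nested list_transform/list_reduce
SELECT list_reduce(
           list_transform(name_l, b ->
               list_reduce(
                   list_transform(name_r, d -> levenshtein(b, d)),
                   (c, e) -> least(c, e)
               )
           ),
           (a, v) -> least(a, v)
       ) AS min_distance
FROM pairs;
```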
Alternatively, the arrays could be exploded locally and a full cartesian product constructed temporarily (with no other variables, to reduce memory overhead), the string comparison run on the expanded data, and the results then aggregated with min/max. I believe this is similar to what is/was being considered for array-valued blocking keys (#1448), though I can't say I understand the internals well enough to know whether this would work!
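(A standalone illustration of that explode-and-aggregate idea, not splink's internals; `pair_id` and `pairs` are hypothetical. In DuckDB the two unnests have to happen in separate steps, because parallel unnests in one SELECT are zipped rather than crossed:)

```sql
-- sketch: explode each array in turn to get the full cartesian product,
-- score every pairing, then aggregate back down with min
WITH left_exploded AS (
    SELECT pair_id, unnest(name_l) AS a, name_r
    FROM pairs
),
both_exploded AS (
    SELECT pair_id, a, unnest(name_r) AS b
    FROM left_exploded
)
SELECT pair_id, min(levenshtein(a, b)) AS min_distance
FROM both_exploded
GROUP BY pair_id;
```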
Describe alternatives you've considered
Un-nesting the data before running splink and post-processing is one option, but this affects m-values and creates memory challenges via duplicated records.
Using array intersections is another alternative, but it demands exact matching.
Additional context