POC for groupby-aggregator #384

vafl · 2021-09-20T07:22:01Z

Description of changes:

This is a small work in progress POC to test how we can compute metrics per group in one pass without doing an actual groupBy (see e.g. #149).
The idea is to use an Aggregator that keeps track of aggregations per group of a column. For this example I used the spark Aggregator api. It could perhaps also be implemented using the existing Analyzer with some changes to the current code (one has to be able to access the underlying aggregation function, the state etc in the wrapper that aggregates per group).

One issue could be that the aggregator state may become big if there are many values in the groupBy column, but I think the use case is low-cardinality columns anyway.

Let me know your thoughts.

Running the example gives

Map(thingD -> 1.0, thingA -> 13.0, thingE -> 20.0, thingB -> 0.3, thingC -> 20.0)
Map(thingD -> 1.0, thingA -> 9.0, thingE -> 20.0, thingB -> 0.3, thingC -> 9.700000000000001)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

sscdotopen · 2021-09-20T07:45:10Z

Hi Valentin,

Great to see you work on Deequ! We have something like this already in the profiler, but it does not use the aggregator API, maybe you want to include it in your PR as well. Here is the corresponding place in the code:

https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/profiles/ColumnProfiler.scala#L581

vafl · 2021-09-20T08:01:46Z

Hi Sebastian :-)

thanks for the pointer! This would fit in well I think.

Ideally, we could support any of the existing metrics on a per-group bases. I'll play around to see if I can integrate this with the existing Analyzer API. The other option is to move all current code into the Aggregator API, but this would be a bigger change and I'm not sure if there are any downsides with this. Do you have thoughts on this?

Also one would need a user API to specify e.g. constraints per group similar to the where.

sscdotopen · 2021-09-20T10:58:57Z

I don't think we should run all aggregations via the aggregator API, because there are some aggregations which might be run on high-cardinality columns (e.g. testing whether a key column contains no duplicates).

vafl · 2021-09-20T13:12:45Z

Thanks. Just to clarify, we can use the aggregator api without any grouping and it will work exactly like the existing metrics that we compute. The problem I mentioned only occurs if you use it to specifically compute a metric for each group. Anyway, I'll iterate on this a bit.

POC for groupby-aggregator

c389fd9

obaidcuet mentioned this pull request Oct 10, 2021

analyzers/checks to generate metrics based on certain dimensions/group by columns #388

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

POC for groupby-aggregator #384

POC for groupby-aggregator #384

vafl commented Sep 20, 2021

sscdotopen commented Sep 20, 2021

vafl commented Sep 20, 2021

sscdotopen commented Sep 20, 2021

vafl commented Sep 20, 2021

POC for groupby-aggregator #384

Are you sure you want to change the base?

POC for groupby-aggregator #384

Conversation

vafl commented Sep 20, 2021

sscdotopen commented Sep 20, 2021

vafl commented Sep 20, 2021

sscdotopen commented Sep 20, 2021

vafl commented Sep 20, 2021