Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC for groupby-aggregator #384

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

vafl
Copy link

@vafl vafl commented Sep 20, 2021

Description of changes:

This is a small work in progress POC to test how we can compute metrics per group in one pass without doing an actual groupBy (see e.g. #149).
The idea is to use an Aggregator that keeps track of aggregations per group of a column. For this example I used the spark Aggregator api. It could perhaps also be implemented using the existing Analyzer with some changes to the current code (one has to be able to access the underlying aggregation function, the state etc in the wrapper that aggregates per group).

One issue could be that the aggregator state may become big if there are many values in the groupBy column, but I think the use case is low-cardinality columns anyway.

Let me know your thoughts.

Running the example gives

Map(thingD -> 1.0, thingA -> 13.0, thingE -> 20.0, thingB -> 0.3, thingC -> 20.0)
Map(thingD -> 1.0, thingA -> 9.0, thingE -> 20.0, thingB -> 0.3, thingC -> 9.700000000000001)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sscdotopen
Copy link
Contributor

Hi Valentin,

Great to see you work on Deequ! We have something like this already in the profiler, but it does not use the aggregator API, maybe you want to include it in your PR as well. Here is the corresponding place in the code:

https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/profiles/ColumnProfiler.scala#L581

@vafl
Copy link
Author

vafl commented Sep 20, 2021

Hi Sebastian :-)

thanks for the pointer! This would fit in well I think.

Ideally, we could support any of the existing metrics on a per-group bases. I'll play around to see if I can integrate this with the existing Analyzer API. The other option is to move all current code into the Aggregator API, but this would be a bigger change and I'm not sure if there are any downsides with this. Do you have thoughts on this?

Also one would need a user API to specify e.g. constraints per group similar to the where.

@sscdotopen
Copy link
Contributor

I don't think we should run all aggregations via the aggregator API, because there are some aggregations which might be run on high-cardinality columns (e.g. testing whether a key column contains no duplicates).

@vafl
Copy link
Author

vafl commented Sep 20, 2021

Thanks. Just to clarify, we can use the aggregator api without any grouping and it will work exactly like the existing metrics that we compute. The problem I mentioned only occurs if you use it to specifically compute a metric for each group. Anyway, I'll iterate on this a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants