-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add aggregate functions #2925
Add aggregate functions #2925
Conversation
@steve-chavez would appreciate your thoughts when you have a moment! |
@timabdulla Thanks for taking the initiative!
This part seems different now that we have Impersonated Role Settings, we could leave the job to https://github.com/pgexperts/pg_plan_filter. For users that don't want to deal with aggregates, we could add a config option to enable/disable them (better save this as a last step of the implementation though). One other thing that we should consider is avoiding failure from clients whenever they add an aggregate but forget to specify GROUP BY. Using a window function for no GROUP BY (as mentioned on #2066 (comment), sorry this is all over the place) might be a good solution for this. |
src/PostgREST/Query/SqlFragment.hs
Outdated
@@ -496,6 +519,7 @@ setConfigLocalJson prefix keyVals = [setConfigLocal mempty (prefix, gucJsonVal k | |||
arrayByteStringToText :: [(ByteString, ByteString)] -> [(Text,Text)] | |||
arrayByteStringToText keyVal = (T.decodeUtf8 *** T.decodeUtf8) <$> keyVal | |||
|
|||
-- Investigate this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timabdulla You don't need to worry about this part.
We could say that these are "final aggregates" (json_agg, string_agg, etc), they're used to send an "interchange format" to clients. i.e. json, csv, xml, binary, etc.
For this feature we're dealing with regular aggregates (sum, max, etc), used to fold a number of rows into a postgres native format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we think in terms of the whole "data representation" feature and the big discussion in #1582, then I'm not sure whether you can actually make a reasonable distinction between those aggregations. @aljungberg WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment @wolfgangwalther. To be honest, I don't have as much context on the bigger-picture elements of the architecture and the like, so I am not totally clear on the practical implications of your comment as it relates to this PR. Any examples or illustrations would be very helpful! Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there are any practical implications for this PR right now. This was just a thought, not more, yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commenting without being fully read in on this PR, just to answer the question: are the 'final' aggregates different from other aggregation (in our envisioned future)?
So I think the answer is 'no'. With pervasive data representations, json_agg
, CSV representation, or even more exotic forms like "encode into a protobuf buffer" are all just pluggable transformations which we would no longer hard-code special cases for.
So json_agg
would just be one of the built-in data representations when "what we have" (an intermediate type that needs to be aggregated) differs from "what we want" (output consistent with the negotiated content type application/json
).
Considering the above, it might possible to isolate a first pass of the feature to only supporting aggregates and not tie them to GROUP BY. |
Hey @steve-chavez, Really appreciate your quick feedback! In terms of Specifically, this part:
The Let me know what you think. I'll also take a look at Impersonated Role Settings to get a better understanding of that feature. |
Ah, I missed the above. Regarding inferring the GROUP BY columns, I think it would be better to default to doing aggregates with the window function approach. This way the elements in @wolfgangwalther Please give your feedback whenever you have the time. |
Ok, got it. Thanks for explaining. That seems a simple enough change to make: Basically without an explicit GROUP BY, default to doing an aggregate over an "empty" window (e.g. We would then need a new query parameter for I also still need to add support for |
I disagree, for a couple of reasons:
It's not really the same. The inner join example actually changed the underlying resource (i.e. "which rows go into the resultset"). This is not the case here. The same rows still go into the resultset, they are just aggregated into a single value. But they have been part of the computation and have not been filtered out before. The semantics are still very clear. So we should:
Just the very basic "allow aggregate functions and auto-derive group by" should be plenty enough for this MR at first, I think. |
It is very useful, I don't disagree with that.
Hm, the fact remains that adding an element in I fear that violating a semantic we agreed upon before ( How about special syntax for auto-derive group by though? Could be like groupby=*
#or
groupby=! Idea taken from https://stackoverflow.com/a/416764/4692662 |
As I said before, the problem is not the number of output rows, which is really just a different "format" of the response. This is very similar to a different content-type, which is only changed via a header - but also returns a much different format. The problem is the number of input rows. An implicit inner join does change the input and should not be part of select. An aggregation does not change the input rows and is fine there.
A syntax for that would only ever make sense, if we plan to support manual specification of group by as well... .. but would that ever be useful? I doubt that anyone really needs to either:
None of that makes sense. The only useful thing we could do would be to allow different grouping sets or rollup etc: https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS So if we want to introduce any syntax now, we should think about how to represent those as syntax.. and then go on from there for the default / simple case. However, I'd feel fine to use the default / simple case in the absence of any specific syntax. One advantage of that is, that we will also have the case of |
You're right. My bad, I forgot about the
Great points. Agree, we should omit @timabdulla Sorry for the back and forth on this. Let's stick with Wolfgang's proposal. |
Thanks folks! Then I'll work to get this finished as per the original plan. One thing: I might omit support for |
Agreed. |
f10c2c9
to
997289e
Compare
Hi folks, I'm getting pretty close with my changes, but I had a few things I wanted to bring up: 'Hoisting' Aggregate Functions in the Case of Has-One SpreadsImagine the following example (and please forgive me if the details are a little contrived): We have a table called As it stands, the aggregation occurs within the subquery of the lateral join, and thus the sum basically does nothing: There is only one row within the subquery (this being a has-one relationship), and thus the sum of the Given this, I can see an argument for why we might 'hoist' the aggregate function to the top-level in the case of has-one spreads: Aggregate functions would be brought to the top-level (along with the associated But I then wonder a bit about consistency, as without resource spreading, this form of hoisting is not possible (or sensible, really), which means that It is important to note that in any case, there is also a workaround, in that if you invert the original query (i.e. The question: Should aggregate functions be hoisted in the case of has-one spreads? The implementation is not trivial, but it's also not hugely complex. Order of Operations on SelectsCurrently, the select syntax is implemented like so: I have opted for casting to be performed on the column itself before aggregation, as this seems much more useful than casting the return value of the aggregate function itself. This would be most useful in the case of wanting to perform an aggregation on a Please let me know if you have other thoughts or otherwise disagree with this choice.
|
I am thinking of the spread operator as a "shortcut" to express something like this (pseudo-syntax):
Ofc, this "direct access to a single nested field" is not possible with this syntax anyway, but the spread operator is kind of a shortcut to put multiple of those together. As such, I consider whatever is inside the spread operator to be part of the outer query afterwards. Since the spread operator is only possible with has-one relationships and, as you say, aggregation for 1 row doesn't make sense... I think hoisting you propose is a very nice solution. It has nice semantics and should be explainable in the docs, I think. In a perfect world our query syntax would only allow us to specify sensible queries. Given that hoisting would eliminate some variants that are not sensible and gives us better options.. I don't see much argument against it. 👍
I can see reasons for both casting the column before aggregation (like you gave an example), but also after aggregation (thinking data representations and how I actually want the final value to be rendered to the client. Can we do
Certainly out of scope for this PR, yes. The proper way to get rid of that is to validate all columns in |
Thanks for the quick reply!
|
Random thought: Perhaps (not in this PR) there could be a possibility of has-many spreads, so long as aggregate functions are involved. Example use case: |
997289e
to
35e4ddb
Compare
Ok, I now have something I feel that is pretty feature-complete. Still pending code cleanup (not yet ready for review, I would say), automated tests and docs. Hopefully I can knock that off that off this week, and then bring it for review. The PR has grown a fair bit in size, as handling all the various cases well took a bit of refactoring. |
35e4ddb
to
78fe42a
Compare
Ok, this PR is finally ready for review. The only thing I've yet to do is finish rounding out the tests, though I already have some tests in. That said, the actual code itself is more or less complete from my perspective. I welcome your comments and feedback; while I await that, I'll finish up the tests. OverviewAs a reminder, this PR enables support for aggregate functions. The basic request syntax works like so: You may also optionally provide a cast for the output of the aggregate function, like so: You may also provide a The following aggregate functions are supported: Aggregate functions are also supported in the case of embedded resource spreads by means of 'hoisting'. Further detail provided here: https://github.com/PostgREST/postgrest/pull/2925/files#diff-2295ae0baafabcf593f2991ae456859ad5c3a8190ff532155da6e8760d08411cR611 Notes and observations
Thank you folks. I look forward to hearing from you. |
if ("*" `elem` map (\(field, _, _) -> cfName field) selectItems) && any hasOutputRep knownColumns | ||
then rplan{select=concatMap (expandStarSelectItem knownColumns) selectItems} | ||
else rplan | ||
adjustContext context@ResolverContext{qi=ctxQI} (QualifiedIdentifier "" "pgrst_source") (Just a) = context{qi=ctxQI{qiName=a}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice move sticking this bit of the action into adjustContext
. That feels quite natural.
Somehow I do miss the explicitness of matching on from=..., fromAlias=...tblAlias
directly in the plan though, to me that really spelled out what condition triggers this special case. This is probably bike-shedding but to get back a little of that one could perhaps imagine renaming adjustContext
to resolveTable
since that's basically what it does (if we are accessing indirectly via pgrst_source
we figure out the true underlying table as documented, otherwise we're already resolved), and a
in the capture to fromAlias
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the suggestion, will do!
-- We expand if either of the below are true: | ||
-- * We have a '*' select AND there is an aggregate function in this ReadPlan's sub-tree. | ||
-- * We have a '*' select AND the target table has at least one data representation. | ||
-- We ignore any '*' selects that have an aggregate function attached (i.e for COUNT(*)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there's a case where an aggregate function needs to operate on a data rep domain column. It doesn't seem out of the realm of possibility. In that case we'd need to expand these stars too, wouldn't we? But I guess that would be a future enhancement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stars will be expanded in the case of data reps as well. I think the code comment I left is poorly worded, will revise.
In some sense, you can use aggregate functions already with data rep columns, but the semantics aren't great, tbh. For instance, if I am requesting JSON from an endpoint, then all data rep columns will be cast to JSON (by design.) The issue becomes that even though you could use any of the aggregate functions, you would then be applying an aggregate function to a JSON column, which generally won't work.
I think some thought will be required in order to make data reps play nicely with aggregates, and I would concur that this should be a future enhancement, as this PR is already big enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I imagine the same and like discussed one could even express the idea of aggregation itself as a kind of data rep. But no need to try to solve every problem at once. This looks like a nice feature and it's ready to go as is.
Did you by any chance do any testing of what happens when domain driven data reps are involved as the source data for an aggregation? Like maybe a data rep formats dates as strings and then the results are fed through string_agg
via your aggregation syntax. I don't think I saw any domains used in the tests. (If it doesn't work, I think that's fine, no need to let perfection be the enemy of good.)
Agree with the strategy of doing most of the heavy lifting in the planning stage. And yes, the 'shadow' type structure we're arriving at is perhaps a little awkward. Part of it started with data reps because we had different correctness guarantees in the query versus the plan, if I remember right, so we decided to not just extend the existing types. |
Got it. It's not an impediment or anything to development, but there is a nagging feeling that perhaps it could be simplified, not that I have any proposal for how to accomplish that. At the moment, there may be little value in that anyway, but just noting my own experience. Thank you very much for your review! |
@steve-chavez I see some of my tests are failing due to test data differences for different versions of Postgres. Is that deliberate? It looks like |
881b325
to
d0e3c48
Compare
Hey @steve-chavez, I think I added the missing test that should hopefully fix the coverage issue. Note that the very helpful tests that you added do not actually cover hoisting, as hoisting is only used in the case of spreads; without spreads, aggregates are applied within the context the joined table (and thus not hoisted.) I've added a test to cover hoisting/spreads, but I'm afraid hoisting isn't as useful as I'd hoped without support for has-many spreads. That said, I think we can get an easy win once has-many spreads are supported. I think adding support for domain reps would be good in the future too, but there are some conflicting semantics that need to be worked out. |
@steve-chavez seems like the latest tests fixed the code coverage issue and all checks are now passing. Two questions:
|
@timabdulla Great work on adding the remaining tests!
Yes, I'll squash here. Makes sense when all the commits are related to the same feature.
|
What about adding a config option to this PR that makes aggregate functions opt-in by default? That should be a relatively small change (I think) that would allow users to decide whether it's safe in their situation (for instance, they may have a small dataset, or they may have other safeguards, like the
No, it wouldn't, I don't think. |
@timabdulla Sounds good! Maybe |
added! |
38c26a7
to
4a4e73f
Compare
I went ahead and squashed the commits. I think all the CI steps should pass now. By the way, is there a way to run all the CI steps in the nix shell environment? Like one command to do the specs, the IO tests, linting, etc? |
@timabdulla There's
But maybe it makes sense to add a single |
4a4e73f
to
8b79806
Compare
@@ -45,7 +45,8 @@ prefix = "pgrst." | |||
dbSettingsNames :: [Text] | |||
dbSettingsNames = | |||
(prefix <>) <$> | |||
["db_anon_role" | |||
["db_aggregates_enabled" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was missing for the in-db config to work.
Will refactor it soon so it can be missed.
@@ -137,7 +138,8 @@ toText conf = | |||
where | |||
-- apply conf to all pgrst settings | |||
pgrstSettings = (\(k, v) -> (k, v conf)) <$> | |||
[("db-anon-role", q . T.decodeUtf8 . fromMaybe "" . configDbAnonRole) | |||
[("db-aggregates-enabled", T.toLower . show . configDbAggregates) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the name so it's more consistent with other boolean configs
The aggregate functions SUM(), MAX(), MIN(), AVG(), and COUNT() are now supported.
59f9995
to
bf6a909
Compare
@timabdulla Awesome work 🔥 ! Would you help me document this feature? It should be in a new page https://postgrest.org/en/latest/references/api.html https://github.com/PostgREST/postgrest-docs/tree/main/docs/references/api |
@timabdulla Btw, forgot that postgrest/nix/tools/devTools.nix Lines 66 to 84 in c3301a1
|
Thanks for all your help getting this over the line! It's very much appreciated. Sure thing - I will write something up shortly and leave it for your review. |
Do the new aggregate functions include https://postgrest.org/en/stable/references/api/aggregate_functions.html J |
@jdgamble555 no, not yet, but wouldn't be too tricky to add, I don't think. With the size of this PR being so large, I felt that was better to defer to a future PR, though it's obviously a super helpful feature. |
Does this PR actually support counting on embedded resources? Before it worked like this: So is there any "modern" equivalent after this PR? 😊 |
This was never really intended to work. You can read about the new way here: https://postgrest.org/en/stable/references/api/aggregate_functions.html#the-case-of-count |
This is an initial attempt to add support for aggregate functions, using the call pattern suggested here: #915 (comment) by @wolfgangwalther.
Note that I am aware that this PR is missing a number of things (at the very least,
count()
, support forHAVING
, the proposed limit on query cost to prevent denial-of-service failures, and, of course, automated tests), and I haven't fully tested this yet beyond some initial sanity checks.What I would really like to understand at this point is a) is the technical direction I am taking one that is appropriate, and b) what are the must-haves before this feature could be considered merge-able.
Thank you for your time and attention!