-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(statistics): use subquery instead of join to avoid cartesian product #558
Conversation
084550f
to
0fcff27
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that. The performance bump is relatively high with small numbers but still in an acceptable range in human time. It's very soon outperformed on larger numbers, so I think that's definitely the way to go.
I would suggest that the EXISTS
variant is generally preferable where large number of relations potientially exist. Even if most queries are not eligible, those few that acutally trigger those high-dimension joins will affect every other query. So basically, starting from a certain size of a table that might explode the query, that pattern is preferable. And, given the relatively small trade off when below the threshold, it's probably safe to start from here and fine tune on demand.
Note this was a one-shot attempt at fixing our issues, created during a hack session with @trowik and @MitanOmar . However it doesn't solve the issue and I think a general refactoring of the whole statistics backend is in order. |
0fcff27
to
47bfc07
Compare
The `docker compose run --rm` starts up a new container, which does not have access to the actual database, so the make target didn't actually work
This improves the test such that from the resulting sum, we can see exactly which two reports were affected, because every sum of reports will now return a distinct sum (any combination of 1,2, and 4 will return a different result) This was not the case with 1,2, and 3 before.
The way we filter in the statistics view, any added filter (that affects reports) adds a "dimension" to the cartesian product, exploding the total number of hours reported. Instead of using JOIN, we do EXISTS(SUBQUERY) now, which should avoid this issue. Might be a tiny bit slower, but let's try to make it correct first, then fast.
…oints Note: experimental, will likely not work. WIP commit
47bfc07
to
77ee70c
Compare
…VIEW This is still WIP, but I think this is the right way forward. TODO: * Further refactor the views, as the customer,project,task viewsets should return all the selected objects, regardless of reported sum. I think we will need to base those viewsets off of their respective models and join/prefetch the report statistic onto it * Finalize the rest of the refactoring to make it work again Optionally (and I hope it's not necessary): Build a statistic (database) view for all statistics endpoints. This would allow us to optimize the whole thing much more (and improve performance AND readability) at the cost of quite some more boilerplate code
77ee70c
to
18d2969
Compare
Closing, the resulting work is now in #568 |
The way we filter in the statistics view, any added filter (that affects reports) adds a "dimension" to the cartesian product, exploding the total number of hours reported.
Instead of using JOIN, we do EXISTS(SUBQUERY) now, which should avoid this issue. Might be a tiny bit slower, but let's try to make it correct first, then fast.
Fixes #89