Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(statistics): use subquery instead of join to avoid cartesian product #558

Closed

Conversation

winged
Copy link
Member

@winged winged commented Dec 23, 2024

The way we filter in the statistics view, any added filter (that affects reports) adds a "dimension" to the cartesian product, exploding the total number of hours reported.

Instead of using JOIN, we do EXISTS(SUBQUERY) now, which should avoid this issue. Might be a tiny bit slower, but let's try to make it correct first, then fast.

Fixes #89

@winged winged requested a review from a team as a code owner December 23, 2024 13:21
@winged winged force-pushed the fix_result_explosion_in_statistics branch from 084550f to 0fcff27 Compare December 23, 2024 13:28
Copy link
Contributor

@fugal-dy fugal-dy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that. The performance bump is relatively high with small numbers but still in an acceptable range in human time. It's very soon outperformed on larger numbers, so I think that's definitely the way to go.

I would suggest that the EXISTS variant is generally preferable where large number of relations potientially exist. Even if most queries are not eligible, those few that acutally trigger those high-dimension joins will affect every other query. So basically, starting from a certain size of a table that might explode the query, that pattern is preferable. And, given the relatively small trade off when below the threshold, it's probably safe to start from here and fine tune on demand.

@winged
Copy link
Member Author

winged commented Dec 23, 2024

Note this was a one-shot attempt at fixing our issues, created during a hack session with @trowik and @MitanOmar . However it doesn't solve the issue and I think a general refactoring of the whole statistics backend is in order.

@winged winged force-pushed the fix_result_explosion_in_statistics branch from 0fcff27 to 47bfc07 Compare December 27, 2024 15:07
The `docker compose run --rm` starts up a new container, which does not
have access to the actual database, so the make target didn't actually work
This improves the test such that from the resulting sum, we can see
exactly which two reports were affected, because every sum of reports
will now return a distinct sum (any combination of 1,2, and 4 will return
a different result)
This was not the case with 1,2, and 3 before.
The way we filter in the statistics view, any added filter (that affects
reports) adds a "dimension" to the cartesian product, exploding the total
number of hours reported.

Instead of using JOIN, we do EXISTS(SUBQUERY) now, which should avoid this
issue. Might be a tiny bit slower, but let's try to make it correct first, then fast.
…oints

Note: experimental, will likely not work.

WIP commit
@winged winged force-pushed the fix_result_explosion_in_statistics branch from 47bfc07 to 77ee70c Compare December 27, 2024 17:45
winged added 2 commits January 7, 2025 15:03
…VIEW

This is still WIP, but I think this is the right way forward.

TODO:

* Further refactor the views, as the customer,project,task viewsets should
  return all the selected objects, regardless of reported sum. I think we
  will need to base those viewsets off of their respective models and
  join/prefetch the report statistic onto it
* Finalize the rest of the refactoring to make it work again

Optionally (and I hope it's not necessary): Build a statistic (database) view
for all statistics endpoints. This would allow us to optimize the whole thing
much more (and improve performance AND readability) at the cost of quite some
more boilerplate code
@winged winged force-pushed the fix_result_explosion_in_statistics branch from 77ee70c to 18d2969 Compare January 7, 2025 14:03
@winged
Copy link
Member Author

winged commented Jan 8, 2025

Closing, the resulting work is now in #568

@winged winged closed this Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

statistic page loading wrong data
2 participants