Add benchmark of integrated probability #55
Conversation
@mcoughlin, I had time to isolate this test case for the performance of the integrated probability calculation, but I haven't had time yet to try to tune the query and optimize it. Do you have a student who is an SQL wizard that you could assign to take a look at it?
Codecov Report
Base: 96.34% // Head: 94.93% // Decreases project coverage by -1.41%.
Coverage diff for #55 vs. main: Coverage 96.34% → 94.93% (-1.41%); Files 4 → 4; Lines 82 → 79 (-3); Hits 79 → 75 (-4); Misses 3 → 4 (+1).
☔ View full report at Codecov.
Hi @lpsinger @mcoughlin, I'd be happy to take a look at this and try optimizing the query. I'm definitely not a SQL wizard, but I'm taking a class on spatial databases right now, so I'm learning more about SQL.
Great idea, @jadalilleboe!
So the query looks like this in SQL:
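A rough sketch of the shape of that query, using the skymaptile and fieldtile tables and the range-overlap operator (`&&`) that come up later in this thread; the probability aggregation, the `:pixel_area` constant, and the parameter placeholders are assumptions for illustration, not the literal benchmark query:

```sql
-- Integrate probability density over the pixels where a given sky map's tiles
-- overlap the field tiles. Overlapping fields would be double-counted here;
-- the real query computes the union of the fields.
SELECT SUM(skymaptile.probdensity
           * (upper(skymaptile.hpx * fieldtile.hpx)
              - lower(skymaptile.hpx * fieldtile.hpx))
           * :pixel_area) AS probability
FROM skymaptile
JOIN fieldtile
  ON skymaptile.hpx && fieldtile.hpx      -- range overlap between tiles
WHERE skymaptile.id = :skymap_id;         -- restrict to a single sky map
```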
I think the bottleneck is in the WHERE clause, in finding the overlap between the sky map tiles with a specific id and the field tile ranges, because removing the filter on the sky map id makes the query fast again.
Interestingly, simply removing the filter on the sky map id makes it fast.
Even more curious, "group by" is also fast. It's just filtering that is slow. Timings were taken for three variants (a sketch of each query shape follows below):
- Nothing
- Group by
- Filter
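A sketch of what the three variants above might look like; the query shapes are assumed from the rest of the thread, and `count(*)` stands in for the real aggregate:

```sql
-- Nothing: the bare range-overlap join.
SELECT count(*)
FROM skymaptile JOIN fieldtile ON skymaptile.hpx && fieldtile.hpx;

-- Group by: add the per-field aggregation.
SELECT fieldtile.id, count(*)
FROM skymaptile JOIN fieldtile ON skymaptile.hpx && fieldtile.hpx
GROUP BY fieldtile.id;

-- Filter: additionally restrict to a single sky map (the slow case).
SELECT fieldtile.id, count(*)
FROM skymaptile JOIN fieldtile ON skymaptile.hpx && fieldtile.hpx
WHERE skymaptile.id = 1
GROUP BY fieldtile.id;
```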
Here is the corresponding output for the same three variants: Nothing, Group by, and Filter.
...no way. If I just run `ANALYZE` first, the query is fast. @mcoughlin, do you confirm?
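A minimal sketch of refreshing the planner statistics on the two tables involved (table names as used elsewhere in this thread; that `ANALYZE` is the step being referred to here is an assumption):

```sql
-- Recompute the statistics the query planner uses for its row-count and
-- selectivity estimates.
ANALYZE skymaptile;
ANALYZE fieldtile;
```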
One difference between this minimal reproducer and Fritz is that this test only has 1 sky map loaded, whereas Fritz has several. If I populate the database with 10 sky maps in this test, I do not see a massive slowdown compared to the run with 1 sample sky map.
@lpsinger Are these the ZTF fields you are trying? How many are in the query?
The fields are squares the size of the ZTF field of view, but they are randomly and isotropically distributed. Chip gaps are not represented in the geometry. The number of fields is given in the name of the test.
This is what I find using the full ZTF field geometry, including the 64 chips, first with 1 sky map in the database and then with 10 sky maps in the database.
If I have 10 sky maps in the database, and I use the full ZTF field geometry, and I neglect to run `ANALYZE`, then I can reproduce the slowdown. And at this point, it does take minutes.
(The branch was force-pushed from 2d9e4b8 to edbbb5c.)
@mcoughlin asked me to take a look at this, and I could reproduce the slowdown that @lpsinger observed with the full ZTF field geometry and ~50 skymaps in the database (following the README and taking the latest 50 superevents from GraceDB). The main problem is the planner's default row estimate for `unnest()`.
A more detailed explanation is below. Looking at the "bad" plan, I see that the planner expects `unnest()` to return only 100 rows.
Since the query planner expects only 100 32-byte rows (i.e. less than a memory page), it also expects that the outer product of (filtered subsets of) skymaptile and fieldtile will be quicker to form from sequential reads than potentially slow random reads into skymaptile (more on this later). This happens because the query planner treats functions as black boxes by default and assumes constants for the number of rows they return. These constants are, however, configurable and stored in `pg_proc`.
Here you can see that both the per-call cost and the row estimate for `unnest` are left at their defaults:
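A sketch of how to inspect and override those constants (the value 1000 is purely illustrative, and altering a built-in function requires superuser privileges):

```sql
-- The planner's per-call cost and expected row count for set-returning
-- functions live in pg_proc.
SELECT proname, procost, prorows
FROM pg_proc
WHERE proname = 'unnest';

-- Override the row estimate for unnest() so the planner no longer assumes
-- the default of 100 rows.
ALTER FUNCTION pg_catalog.unnest(anyarray) ROWS 1000;
```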
The other factor that the planner uses to find the best plan is the estimated cost of each operation, in an arbitrary time unit. Out of the box, Postgres assumes that random reads are on average 4x slower than sequential reads, which will lead the query planner to prefer sequential scans (or their cousins, bitmap index scans) that examine up to 4x as many rows as the equivalent index scan. This makes sense for mechanical storage, but not for solid state, where random reads are not that much different. If you're using storage backed by SSDs in production, it may make sense to adjust the planner constants accordingly. On my system I can recover the "good" plan by setting the row estimate for unnest back to 100 and the random page cost to 1, but YMMV:
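A sketch of the corresponding setting, shown per session here (it can also be set in postgresql.conf or per tablespace):

```sql
-- Tell the planner that a random page read costs about the same as a
-- sequential one, which is closer to reality on SSD-backed storage.
-- The default of 4.0 is tuned for spinning disks.
SET random_page_cost = 1.0;
```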
Hope that helps!
While investigating temporary tables as a workaround for the bad plan, I noticed another issue. Looking at a plan for a similar query (without the skymaptile.id restriction, and one that still uses the "good" plan), I see that the planner expects each iteration of the loop to match 18816 rows, whereas it actually matches 125 on average:
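A sketch of the kind of query and EXPLAIN invocation that exposes this estimated-versus-actual mismatch (the query shape and the `count(*)` aggregate are assumptions; the numbers above come from the real plan):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT fieldtile.id, count(*)
FROM skymaptile
JOIN fieldtile ON skymaptile.hpx && fieldtile.hpx
GROUP BY fieldtile.id;
-- In the output, compare the planner's "rows=" estimate with the
-- "actual ... rows=" figure on the inner side of the nested loop.
```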
This happens because the selectivity function just returns 0.01 whenever one of the operands is not a constant. For a constant operand, it uses sample ranges to estimate the overlap fraction. There may be a way to extend this to non-constant operands, but I imagine it involves quite a bit of extra plumbing. For this particular case I was able to convince the planner to use the right plan by adding a multicolumn index that includes both id and hpx. This should at least break the scaling of the result set with the number of skymaps, since the (properly estimated) selectivity of the integer-equality operator gets multiplied into the total selectivity:
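A sketch of the kind of multicolumn index being described (the index name is assumed; the btree_gist extension lets GiST index the integer id column alongside the range, since SP-GiST does not support multicolumn indexes):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Cover both the sky map id and the tile range in one index, so that the
-- well-estimated id = ? condition feeds into the index scan's selectivity.
CREATE INDEX skymaptile_id_hpx_idx ON skymaptile USING gist (id, hpx);
```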
This index is ~50% larger than the existing spgist index (which itself was almost as large as the relation), so keeping both around is likely not an option.
(The branch was force-pushed from edbbb5c to 21cec74.)
@jvansanten, I tried adding a custom function with hardcoded planner stats as you suggested (21cec74), but the plan is still flipping between a "good" one and a "bad" one.
I think the row estimate for unnest() may not have been the primary issue after all. A bigger problem, especially when the skymaptile table becomes large, is that postgres assumes that the selectivity of && is a fixed 0.01 whenever neither operand is a constant.
Reducing the default selectivity for && by a factor of 10 restored the good plan.
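One way to experiment with that, not necessarily how it was done here (it modifies a built-in operator for the whole cluster and requires superuser), is to swap the operator's restriction estimator for one that returns a smaller constant:

```sql
-- contsel is a built-in restriction estimator that returns a small fixed
-- selectivity; using it for && makes range overlaps look more selective
-- to the planner than the default estimate.
ALTER OPERATOR && (anyrange, anyrange) SET (RESTRICT = contsel);
```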
It might be worth taking this question to pgsql-performance and seeing what the experts make of it. The answer may well end up being "your ranges are too small; use bigger ones," but maybe there's something we haven't thought of yet. Another option that may work well enough for now (and in versions of postgres that cloud providers will give you) is to use a multicolumn index that includes both the skymap id and pixel range. In my limited testing that also coaxed postgres to use the "good" plan, but again I'm not sure what impact that has on other queries you might need to support.
@lpsinger This seems like it may be good to raise at some higher level? What do you think?
Add benchmark of integrated probability within the union of a set of fields.
As reported by @mcoughlin and @Tohuvavohu, this operation is slower than it should be.