Large schema cache reloads cause 100% CPU spike when schema cache metadata is accessed. #3046
Is this
Sorry, forgot to mention that in the original post – no, we only have The tables on e.g.
Ah right, my bad. We only filter for extra search path schemas in the query at postgrest/src/PostgREST/SchemaCache.hs, line 952 (at f10d413) vs. the one at postgrest/src/PostgREST/SchemaCache.hs, line 784 (at f10d413).
This is because we might detect base tables in those views that are outside the search path (probably the case most of the time). IIRC we discussed a potential future optimization of querying, in the tables query, only the tables detected via the views query. Seems like you have hit that limitation right there.
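As an illustration of that idea (a sketch only, not PostgREST's actual queries), the tables query could be restricted to the exposed schemas plus just the relations that view rewrite rules depend on; the schema list here is a made-up example value:

```sql
-- Sketch: fetch tables in the exposed schemas, plus base relations that
-- views depend on, instead of every table in the database.
WITH view_dependencies AS (
  SELECT DISTINCT d.refobjid AS oid
  FROM pg_rewrite r
  JOIN pg_depend d
    ON d.classid = 'pg_rewrite'::regclass
   AND d.objid = r.oid
   AND d.refclassid = 'pg_class'::regclass
)
SELECT n.nspname AS schema, c.relname AS name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'p')
  AND (n.nspname = ANY (ARRAY['api'])                  -- exposed schemas (example)
       OR c.oid IN (SELECT oid FROM view_dependencies));
```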
@wolfgangwalther Hey - just wanted to check in to see if this is something that is likely to be worked on in an upcoming release of PostgREST? If not, how feasible do you think it would be for me, as a relatively inexperienced Haskell developer, to dive in and try to submit a PR fixing this (and if this is the route we decide to go, do you have a preference for how a fix is implemented)?
No plans from me regarding this in the near future. Too many other things on my list I'd like to do.
I would say in general the area around the schema cache should be easier to get started with than other areas - because big chunks here are actually SQL, not haskell. So if you know SQL well, that should make it easier.
I have not thought in detail about this. Reducing the size of the schema cache / improving the performance of the schema cache queries is always a good thing. I'm not exactly sure how, but one way to do this could be to merge the
Since #3213, we now log some schema cache stats:
It would be interesting to know the metrics for your case @colophonemes. If there's a large number of relations, perhaps we should exclude
@steve-chavez Here's the output from when I load using that build:
And here's some output from monitoring the container's CPU and memory usage:
ELAPSED %CPU %MEM
00:01 2.0 0.2
00:06 0.0 0.2
00:11 1.6 0.3
00:17 1.5 0.3
00:22 1.1 0.4
00:27 0.0 0.4
00:32 0.0 0.4
00:37 7.6 0.5
00:42 1.3 0.5
# 21/Feb/2024:16:24:04 +0000: Schema cache queried in 640.9 milliseconds
00:47 100.1 0.6
00:52 100.1 0.6
00:57 98.0 0.6
01:02 98.2 0.6
01:07 100.1 0.6
01:12 99.1 0.6
01:17 100.1 0.6
01:22 100.1 0.6
01:27 100.1 0.6
01:32 99.1 0.6
01:37 100.1 0.6
01:42 100.1 0.6
01:47 98.7 0.6
01:52 100.1 0.6
01:57 99.1 0.6
02:02 100.1 0.6
02:07 100.1 0.6
02:12 100.1 0.6
02:17 99.7 0.6
02:22 98.9 0.6
02:27 98.1 0.6
02:32 100.1 0.6
02:37 100.1 0.6
02:42 100.1 0.6
02:48 100.1 0.6
# 21/Feb/2024:16:24:04 +0000: Schema cache loaded 99 Relations, 230 Relationships, 16 Functions, 0 Domain Representations, 4 Media Type Handlers
02:53 3.2 0.6
02:58 0.0 0.6
@colophonemes Aha. So that means the schema cache queries are not the problem since they're fast. The problem is the schema cache parsing. This was previously discussed on #2450 (comment).
While this is right, it does not necessarily mean that "optimizing the queries" as discussed above is wrong. IIRC, we fetch a lot more data for the schema cache - and then filter out the stuff we don't need in Haskell code, right?
Yeah, I agree that it seems like it's the parsing rather than the DB query (indeed, I was able to extract and manually run the SQL queries myself and they were very fast), but I think I agree with @wolfgangwalther that if we could cut down the number of relations we're feeding into the parsing step, then parsing would go a lot faster.
Yeah, this seems like a straightforward solution; it could be a flag that's sort of the opposite of
@colophonemes Could you share your schema (DDL only) privately? (email on profile) Then I can work on improving the query or maybe the Haskell code.
That time should now be clearer on the main branch due to #3253
Unfortunately I don't think that I can do that easily. It's not super-straightforward to dump out the Timescale stuff in a way that's restorable on your end. I'll try to prepare a toy schema that results in the same behaviour.
Did some digging using the
The problem is not related to the code at postgrest/src/PostgREST/SchemaCache.hs, line 167 (at 97cd559).
By not computing the relationships and just replacing it with a
Some profiling also indicates that it's not the query but the different transformations we do in Haskell code that take a lot of time and CPU: postgrest/src/PostgREST/SchemaCache.hs, line 163 (at 97cd559).
It should be possible to do that logic on the db instead, but it will take a fair amount of work. Edit:
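As a rough illustration of what moving some of that logic into the database could look like (a sketch only, not PostgREST's actual relationship detection), plain foreign-key relationships can be read straight from pg_constraint:

```sql
-- Hedged sketch: derive table-to-table relationships from foreign keys
-- directly in SQL, instead of post-processing a larger result set in Haskell.
SELECT conrelid::regclass  AS referencing_table,
       confrelid::regclass AS referenced_table,
       conname             AS constraint_name,
       conkey,                       -- referencing column positions
       confkey                       -- referenced column positions
FROM pg_constraint
WHERE contype = 'f';                 -- foreign-key constraints only
```

Anything beyond plain foreign keys (view-based relationships, many-to-many detection) would need more of the existing Haskell logic ported over, which is presumably where the "fair amount of work" comes in.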
I expected the query itself to be fast. But if we fetch fewer tables, the relationship calculation in Haskell should be faster, right?
Environment
Description of issue
Our production instance of PostgREST exhibits the following behaviour:
- after a schema cache reload (e.g. triggered via NOTIFY pgrst reload, see the snippet below), CPU usage spikes to 100% for an extended period
- the spike appears to be triggered when schema cache metadata is accessed (e.g. via the / endpoint, but the behaviour seems to eventually happen regardless)

Here's me reproducing the issue in a local Docker container, connected to our production DB. The time it takes for the CPU spike to calm down is shorter, presumably because my laptop is beefier than the ECS instances we run our production PostgREST on.
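For reference, the reload can be triggered from SQL like this (assuming the default pgrst notification channel):

```sql
-- Ask a running PostgREST instance to reload its schema cache via
-- LISTEN/NOTIFY; "pgrst" is the default channel name.
NOTIFY pgrst, 'reload schema';
```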
We've observed this issue since at least v9.0.0 (though possibly it was present earlier), through to the present day. I've just run through the testing steps above on both v11.1.0 and v11.2.1, and the behaviour is there in both.
Possibly an interaction with TimescaleDB?
My guess about why this happens in production but not dev is that we use TimescaleDB for a number of our tables that contain time-series data. This means that the number of tables in our schema changes dynamically as more data is added to parent tables, and Timescale chunks these into child tables. The child tables are stored in a _timescaledb_internal schema, e.g. _timescaledb_internal._hyper_1_12345_chunk. Some of our larger tables have upwards of 500 of these chunks attached.

Formatting and running the tablesSqlQuery from PostgREST/SchemaCache.hs results in ~350 entries against a dev database, and nearly 13,000 entries against prod. If I add AND n.nspname !~ 'timescale' to the end of that query, the prod result drops back to around 300 entries.

So, our guess is that our production PostgREST instances are choking on this massive amount of data when processing the schema cache (looking at the network graph in the image above, it looks like it's transferring about 10MB over the wire to reload the schema).
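For context, a heavily simplified stand-in for that kind of query (not the real tablesSqlQuery) with the exclusion applied would look something like this:

```sql
-- Simplified stand-in for the tables query, with a schema filter that drops
-- TimescaleDB's internal chunk tables (e.g. _timescaledb_internal.*).
SELECT n.nspname AS schema, c.relname AS name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'p', 'v', 'm')
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND n.nspname !~ 'timescale';
```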
Possible mitigations
I guess that there are at least three possible mitigations that I can think of here:
- Filter out _timescaledb_internal and other Timescale schemas so they don't appear in the tables query (seems a bit gross, but maybe routes around needing to handle edge cases for the more general case of child tables)

Obviously all these mitigations rely on my being correct that the root cause of the issue is the large number of child tables present in the tables query.
That said, I'm wondering if there is in fact some perf bug here? (13k rows / 10MB of data is a lot relative to most schemas, but also seems like something that should be processable. I don't have any idea about how the combinatorics of adding additional tables scale here, though.)