Keep profile and history data in shared memory #66
base: master
Conversation
To simplify interaction between the collector process and client backends requesting profile or history data, this patch adds shared data structures to store statistics: a fixed-size shared hash table for the profile and a fixed-size shared array implementing a ring buffer for history data.

The profile hash table has a fixed size specified by the `pg_wait_sampling.max_profile_entries` GUC. When the table overflows, the least used entries are displaced from it. The eviction algorithm is the same as in the pg_stat_kcache extension: it is based on a usage metric stored within the hash table entries.

The shared structures for profile and history are in-memory only and are not persisted to disk, so all statistics are fully reset after a server restart. This is acceptable because for wait monitoring it is enough to track differential counters in the profile statistics.

The patch also makes all timing period GUCs reloadable via SIGHUP. The other GUCs affect the allocation of shared resources in some way, so they can only be changed via a server restart.

History keeping does not look useful for regular monitoring of wait events, so it is disabled by default by setting the `pg_wait_sampling.history_period` GUC to zero.
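For illustration, here is a minimal C sketch of the kind of fixed-size profile hash table and pg_stat_kcache-style usage-based eviction the description refers to. Only the identifiers `ProfileHashEntry`, `pgws_profile_hash` and `pgwsQueryId` appear in the diff hunks below; the key layout, the `usage` field and the dampening/discard fractions are assumptions, not the exact code of this patch.

```c
#include "postgres.h"
#include "utils/hsearch.h"

/* pgwsQueryId comes from the patch's compat.h; shown here for completeness */
typedef uint64 pgwsQueryId;

/* Assumed key/entry layout; pid and queryid are zeroed when the
 * corresponding profile_pid/profile_queries GUCs are disabled. */
typedef struct ProfileHashKey
{
	int			pid;
	uint32		wait_event_info;
	pgwsQueryId	queryid;
} ProfileHashKey;

typedef struct ProfileHashEntry
{
	ProfileHashKey key;
	uint64		counter;		/* accumulated sample count */
	double		usage;			/* eviction metric, decayed on each sweep */
} ProfileHashEntry;

extern HTAB *pgws_profile_hash;

/* Sort helper: ascending by usage */
static int
entry_cmp_usage(const void *lhs, const void *rhs)
{
	double		l = (*(ProfileHashEntry *const *) lhs)->usage;
	double		r = (*(ProfileHashEntry *const *) rhs)->usage;

	return (l > r) - (l < r);
}

/*
 * Called with the profile lock held exclusively when the fixed-size hash
 * table is full: dampen the usage of all entries and drop the least used ~5%.
 */
static void
profile_dealloc(void)
{
	HASH_SEQ_STATUS hash_seq;
	ProfileHashEntry *entry;
	ProfileHashEntry **entries;
	long		nentries = 0;
	long		nvictims;
	long		i;

	entries = palloc(hash_get_num_entries(pgws_profile_hash) *
					 sizeof(ProfileHashEntry *));

	hash_seq_init(&hash_seq, pgws_profile_hash);
	while ((entry = hash_seq_search(&hash_seq)) != NULL)
	{
		entries[nentries++] = entry;
		entry->usage *= 0.99;	/* age out entries that stop being hit */
	}

	qsort(entries, nentries, sizeof(ProfileHashEntry *), entry_cmp_usage);
	nvictims = Max(10, nentries / 20);
	nvictims = Min(nvictims, nentries);

	for (i = 0; i < nvictims; i++)
		hash_search(pgws_profile_hash, &entries[i]->key, HASH_REMOVE, NULL);

	pfree(entries);
}
```

In such a scheme the collector would call something like `profile_dealloc()` and retry whenever an insertion into the full table fails.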
@rjuju could you review this PR and give some feedback?
 * or without any locks as it's done in pg_stat_get_activity() function.
 * These arrays are not accessible externally and require to add some
 * iterator object into corresponding containing modules.
 */
if (pid == 0)
I agree that we can follow pg_stat_get_activity and get away without any locks. The only risk is that we take a few incorrect samples and attribute wait_event_info to a wrong pid. That doesn't seem to be a problem; at least pg_stat_activity lives with it.

Concerning the `pid == 0` check: I think we are safe, because the `wait_event_info == 0` condition guards us from sampling PGPROCs of terminated backends. Although PGPROC.pid is not cleared upon termination, wait_event_info is (not in ProcKill, but at various other places, e.g. AbortTransaction calls pgstat_report_wait_end).

Should we decide to sample non-waiting states (`wait_event_info == 0`), we could then iterate over BackendStatusArray to visit only live backends, like pg_stat_activity does via pgstat_read_current_status.

What do you think?
> I agree that we can follow pg_stat_get_activity and get away without any locks.

In pg_stat_activity we iterate over auxiliary PROC slots without any lock. This is practical because these slots are reserved for long-lived background workers such as bgwriter, checkpointer, walwriter, archiver and wal receiver. Changes in this specific range of slots occur very rarely and are most commonly initiated by external forces. Still, we might use the ProcStructLock spinlock to protect access while we gather wait event info, as this array is small enough (it currently occupies 5 slots).

Iteration over the other slots, intended mostly for regular backends, is protected in pg_stat_activity by ProcArrayLock, and the iteration has to be performed over ProcArray as an indirection level to access the real working PROC slots.

> Concerning the `pid == 0` check: I think we are safe, because the `wait_event_info == 0` condition guards us from sampling PGPROCs of terminated backends. Although PGPROC.pid is not cleared upon termination, wait_event_info is (not in ProcKill, but at various other places, e.g. AbortTransaction calls pgstat_report_wait_end).

Yes, in the current case it works. But once we start to add the "no wait" state, as pointed out in #10, we will have to reconsider the criteria for deciding whether a PROC slot is freed.

> Should we decide to sample non-waiting states (`wait_event_info == 0`), we could then iterate over BackendStatusArray to visit only live backends, like pg_stat_activity does via pgstat_read_current_status.

Looking up PROC slots from pid values obtained from BackendStatusArray items makes the iteration routine quadratic in time complexity. This factor is the main reason why sampling pg_stat_activity as frequently as the pg_wait_sampling extension does is not practical. IMO the most appropriate solution, which I propose in my comment, is to organize (in the future) the iteration over ProcArray and AuxiliaryProcs.
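To make the discussed iteration concrete, here is a rough sketch of lock-free sampling over allProcs in the spirit of pg_stat_get_activity(). `record_sample()` is a hypothetical helper, and the loop uses the stock PGPROC/PROC_HDR fields rather than this patch's exact code.

```c
#include "postgres.h"
#include "storage/proc.h"

/* Hypothetical helper that bumps the profile and/or history for one sample */
extern void record_sample(int pid, uint32 wait_event_info);

static void
probe_waits_sketch(void)
{
	uint32		i;

	/* allProcs covers both regular backends and auxiliary processes */
	for (i = 0; i < ProcGlobal->allProcCount; i++)
	{
		PGPROC	   *proc = &ProcGlobal->allProcs[i];
		uint32		wait_event_info = proc->wait_event_info;
		int			pid = proc->pid;

		/*
		 * wait_event_info == 0 covers both free slots and non-waiting
		 * backends; PGPROC.pid alone is not enough because ProcKill does
		 * not clear it.
		 */
		if (wait_event_info == 0 || pid == 0)
			continue;

		record_sample(pid, wait_event_info);
	}
}
```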
 * Therefore to reduce possible contention it's worth to segregate the logic
 * of PGPROCs iteration under ProcArrayLock and storing results to profile
 * and/or history under corresponding another lock.
 */
LWLockAcquire(ProcArrayLock, LW_SHARED);
Why do we take ProcArrayLock anyway? What concurrent operations does it prevent?
I don't think that blocking ProcArrayAdd and ProcArrayRemove buys us anything. We are not iterating over the procArray, but over allProcs. And we don't need the lock to read PGPROC.pid and wait_event_info.
Yes, you are right. In the current case, where we also verify slot validity via a non-zero wait_event_info value, we can omit acquiring this lock. This is mostly inherited from the previous implementation.
| pg_wait_sampling.profile_queries | bool | Whether profile should be per query | true |
| Parameter name | Data type | Description | Default value | Change policy |
| ------------------------------------ | --------- | ----------------------------------------------------------------------------------- | ------------- | ------------- |
| pg_wait_sampling.max_profile_entries | int4 | Maximum number of entries in profile hash table | 5000 | restart |
Maybe just profile_size, like history_size?
I mostly mimicked the max_connections name; to me, the _size suffix is associated with something measured in bytes.
processes are left in hash table. Therefore, it's necessary to take into
account periodic flushing of profile to prevent recycling of 32-bit pid values
in profile hash table and as consequence possible increments to profile entries
belonging to some old processes with the same pid values as for current ones.
Do you think pid recycling is such an important concern in practice to justify this warning? Wouldn't we run out of hash table entries and evict the old ones before we hit a recycled pid? I'm not sure this is frequent enough to draw the user's attention to.
Hmm, theoretically confusion caused by pid recycling is possible, and I suppose we have to declare it frankly.
However, it's worth noting in the docs that if the profile hash table is not unreasonably large and no processes other than postgres are actively spawned on the system, the probability of such pid recycling approaches zero.
| pg_wait_sampling.profile_period | int4 | Period for profile sampling in milliseconds (zero value disables profile gathering) | 10 | reload |
| pg_wait_sampling.history_period | int4 | Period for history sampling in milliseconds (zero value disables history gathering) | 0 | reload |
| pg_wait_sampling.profile_pid | bool | Whether profile should be per pid | true | restart |
| pg_wait_sampling.profile_queries | bool | Whether profile should be per query | true | restart |
Does changing profile_pid or profile_queries really require a restart?
Right now it looks redundant; this is intended for future work.

Tracking pid would determine whether we need a separate hash table that accumulates counters for currently active processes alongside another hash table for historic data that accumulates counters for finished ones, or how big a single hash table for historic and current statistics has to be when statistics from completed processes are flushed to the corresponding historic entries.

Similarly for tracking queryid: given that the number of pid values and wait events is bounded, knowing whether queryid tracking is on and the expected maximum number of tracked queryid values (or other future fields with variable cardinality in the profile key) would let us tune the memory required for the profile hash table more finely.

In addition, it's somewhat confusing to have profile entries with null values in the pid and/or queryid fields alongside non-null ones that in fact correspond to the same entries. This requires a profile reset after the setting is applied.
typedef uint64 pgwsQueryId;
#else
typedef uint32 pgwsQueryId;
#endif
Why not just use uint64?
This adds extra space to the profile hash key for older PG versions and, as a consequence, increases the size of the profile hash table. Also, this split in the type definition existed in pg_wait_sampling before; I just moved it to compat.h to avoid using the #if directive in the code.
Query.queryId is 64-bit since v11. I don't think we need to worry about allocating a bit more memory than strictly necessary for v10 and below. But surely we can leave this for another patch, if you like.
*vars = get_guc_variables();
*num_vars = GetNumConfigOptions();
#define PG_WAIT_EXTENSION -1
return WaitLatch(latch, wakeEvents, timeout);
I think we can readily drop support for v9.6 and even v10. So we don't need this compatibility wrapper.
Good idea! I think this ultimately has to be your decision, and I'll build on it.
HTAB *pgws_profile_hash = NULL;
LWLock *pgws_profile_lock = NULL;
History *pgws_history_ring = NULL;
LWLock *pgws_history_lock = NULL;
Maybe we should put these variables into a single struct pgws_shared_state or something like that?
Do you mean putting all these pointers into some structure? I don't think this brings anything, even in code readability.
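For reference, a hypothetical shape of the suggested grouping (not part of the patch); whether it actually helps readability is exactly the open question here.

```c
typedef struct pgwsSharedState
{
	HTAB	   *profile_hash;	/* fixed-size profile hash table */
	LWLock	   *profile_lock;	/* protects profile_hash */
	History	   *history_ring;	/* ring buffer for history samples */
	LWLock	   *history_lock;	/* protects history_ring */
} pgwsSharedState;

static pgwsSharedState *pgws = NULL;	/* one pointer instead of four */
```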
DefineCustomBoolVariable("pg_wait_sampling.profile_queries",
	"Sets whether profile should be collected per query.", NULL,
	&WhetherProfileQueryId, true,
	PGC_POSTMASTER, 0, NULL, pgwsEnableQueryId, NULL);
We should call MarkGUCPrefixReserved or EmitWarningsOnPlaceholders after we've defined all of GUCs.
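A minimal sketch of the suggested call, placed after the last DefineCustomXxxVariable(); MarkGUCPrefixReserved() exists since PostgreSQL 15, while older branches provide EmitWarningsOnPlaceholders() for the same purpose (the exact version guard below is an assumption).

```c
#if PG_VERSION_NUM >= 150000
	MarkGUCPrefixReserved("pg_wait_sampling");
#else
	EmitWarningsOnPlaceholders("pg_wait_sampling");
#endif
```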
Good catch! Thanks

"pg_wait_sampling queryids",
sizeof(pgwsQueryId) * get_max_procs_count(),
&found);
MemSet(pgws_proc_queryids, 0, sizeof(pgwsQueryId) * get_max_procs_count());
I think this MemSet should be in the `if (!found)` branch.
Agreed. Thanks
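The agreed fix would look roughly like this (variable names taken from the hunk above; the surrounding declarations are assumed):

```c
pgws_proc_queryids = (pgwsQueryId *)
	ShmemInitStruct("pg_wait_sampling queryids",
					sizeof(pgwsQueryId) * get_max_procs_count(),
					&found);
if (!found)
	MemSet(pgws_proc_queryids, 0,
		   sizeof(pgwsQueryId) * get_max_procs_count());
```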
info.entrysize = sizeof(ProfileHashEntry);
pgws_profile_hash = ShmemInitHash("pg_wait_sampling hash",
	MaxProfileEntries, MaxProfileEntries,
	&info, HASH_ELEM | HASH_BLOBS);
This code should be inside the `if (!found)` branch. Otherwise, this code will be called by every new backend in the EXEC_BACKEND configuration (e.g. on Windows).
The ShmemInitHash() function handles repeated initialization internally, so the call is idempotent by itself.
profile->items = (ProfileItem *) receive_array(PROFILE_REQUEST,
	sizeof(ProfileItem), &profile->count);
/* Extract profile from shared memory */
profile_count = hash_get_num_entries(pgws_profile_hash);
We should do this after locking the hash. Otherwise, we might not allocate enough memory if the hash grows before we lock it.
Yes, you are right. Thanks
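A sketch of the agreed ordering, with the entry count read only after the lock is taken; `pgws_profile_hash`, `pgws_profile_lock` and `ProfileItem` come from the diff, while `ProfileHashEntry.item` and the function name are assumptions.

```c
static ProfileItem *
snapshot_profile(int *count)
{
	HASH_SEQ_STATUS hash_seq;
	ProfileHashEntry *entry;
	ProfileItem *items;
	long		nentries;
	int			i = 0;

	LWLockAcquire(pgws_profile_lock, LW_SHARED);

	/* size the local copy only while holding the lock */
	nentries = hash_get_num_entries(pgws_profile_hash);
	items = (ProfileItem *) palloc(sizeof(ProfileItem) * nentries);

	hash_seq_init(&hash_seq, pgws_profile_hash);
	while ((entry = hash_seq_search(&hash_seq)) != NULL)
		items[i++] = entry->item;	/* assumed embedded ProfileItem */

	LWLockRelease(pgws_profile_lock);

	*count = i;
	return items;
}
```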
DefineCustomIntVariable("pg_wait_sampling.profile_period",
	"Sets period of waits profile sampling in milliseconds.",
	"0 disables profiling.",
	&ProfilePeriod, 10, 0, INT_MAX,
We must pass the GUC_UNIT_MS flag, so that time units, e.g. '10ms', are accepted.
Thanks, good addition!
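With the suggested flag added, the definition would look roughly like this; the PGC_SIGHUP context follows the PR description of reloadable period GUCs, and the hook arguments are assumptions.

```c
DefineCustomIntVariable("pg_wait_sampling.profile_period",
						"Sets period of waits profile sampling in milliseconds.",
						"0 disables profiling.",
						&ProfilePeriod,
						10, 0, INT_MAX,
						PGC_SIGHUP,
						GUC_UNIT_MS,	/* accept values like '10ms' or '1s' */
						NULL, NULL, NULL);
```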
if (!profile_pid)
	item.pid = 0;
/* Set up key for hashtable search */
key.pid = WhetherProfilePid ? pid : 0;
I think we'd better put a memset here:

> Caller must ensure there are no undefined padding bits in the keys!

https://github.com/postgres/postgres/blob/master/src/backend/utils/hash/dynahash.c#L37

We don't have any padding currently, but let's be future proof.
Thanks, good addition! Will be added
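The agreed change would look roughly like this; the key layout and the surrounding variables are assumptions based on the hunk above.

```c
ProfileHashKey key;
ProfileHashEntry *entry;
bool		found;

/* Zero the whole key so any future padding bits are well defined */
memset(&key, 0, sizeof(key));
key.pid = WhetherProfilePid ? pid : 0;
key.wait_event_info = wait_event_info;
key.queryid = WhetherProfileQueryId ? queryid : 0;

/* HASH_ENTER_NULL returns NULL instead of erroring when the table is full */
entry = (ProfileHashEntry *) hash_search(pgws_profile_hash, &key,
										 HASH_ENTER_NULL, &found);
```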
Why do we need to maintain our own array of queryIds? Why can't we just read PgBackendStatus.st_query_id as pg_stat_activity does? It turns out st_query_id is zero during execution of a prepared statement. (exec_execute_message in postgres.c calls pgstat_report_activity(STATE_RUNNING) and it resets st_query_id.) Here is a quick demo. In a psql session execute:
Oops, query_id is blank. Arguably it's a bug: https://www.postgresql.org/message-id/CA%2B427g8DiW3aZ6pOpVgkPbqK97ouBdf18VLiHFesea2jUk3XoQ%40mail.gmail.com
The answer is in tracking the queryId for just the top-level statement in