Keep profile and history data in shared memory #66
base: master
Conversation
To simplify interaction between the collector process and client backends requesting profile or history data, this patch adds shared data structures to store statistics: a fixed-size shared hash table for the profile and a fixed-size shared array implementing a ring buffer for history data.

The profile hash table has a fixed size specified by the `pg_wait_sampling.max_profile_entries` GUC. When the table overflows, the least used entries are displaced from it. The eviction algorithm is the same as in the pg_stat_kcache extension: it is based on a usage metric stored within the hash table entries.

The shared structures for profile and history are in-memory only and are not persisted to disk, so all statistics are fully reset after a server restart. This is acceptable because for wait monitoring it is enough to track differential counters in the profile statistics.

The patch also makes all timing period GUCs reloadable via SIGHUP. The other GUCs affect the allocation of shared resources in some way, so they can only be changed via a server restart.

History keeping does not look useful for regular monitoring of wait events, so it is disabled by default by setting the `pg_wait_sampling.history_period` GUC to zero.
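For illustration, here is a minimal C sketch of the kind of fixed-size profile hash table and pg_stat_kcache-style usage-based eviction the description refers to. Only the identifiers `ProfileHashEntry`, `pgws_profile_hash` and `pgwsQueryId` appear in the diff hunks below; the key layout, the `usage` field and the dampening/discard fractions are assumptions, not the exact code of this patch.

```c
#include "postgres.h"
#include "utils/hsearch.h"

/* pgwsQueryId comes from the patch's compat.h; shown here for completeness */
typedef uint64 pgwsQueryId;

/* Assumed key/entry layout; pid and queryid are zeroed when the
 * corresponding profile_pid/profile_queries GUCs are disabled. */
typedef struct ProfileHashKey
{
	int			pid;
	uint32		wait_event_info;
	pgwsQueryId	queryid;
} ProfileHashKey;

typedef struct ProfileHashEntry
{
	ProfileHashKey key;
	uint64		counter;		/* accumulated sample count */
	double		usage;			/* eviction metric, decayed on each sweep */
} ProfileHashEntry;

extern HTAB *pgws_profile_hash;

/* Sort helper: ascending by usage */
static int
entry_cmp_usage(const void *lhs, const void *rhs)
{
	double		l = (*(ProfileHashEntry *const *) lhs)->usage;
	double		r = (*(ProfileHashEntry *const *) rhs)->usage;

	return (l > r) - (l < r);
}

/*
 * Called with the profile lock held exclusively when the fixed-size hash
 * table is full: dampen the usage of all entries and drop the least used ~5%.
 */
static void
profile_dealloc(void)
{
	HASH_SEQ_STATUS hash_seq;
	ProfileHashEntry *entry;
	ProfileHashEntry **entries;
	long		nentries = 0;
	long		nvictims;
	long		i;

	entries = palloc(hash_get_num_entries(pgws_profile_hash) *
					 sizeof(ProfileHashEntry *));

	hash_seq_init(&hash_seq, pgws_profile_hash);
	while ((entry = hash_seq_search(&hash_seq)) != NULL)
	{
		entries[nentries++] = entry;
		entry->usage *= 0.99;	/* age out entries that stop being hit */
	}

	qsort(entries, nentries, sizeof(ProfileHashEntry *), entry_cmp_usage);
	nvictims = Max(10, nentries / 20);
	nvictims = Min(nvictims, nentries);

	for (i = 0; i < nvictims; i++)
		hash_search(pgws_profile_hash, &entries[i]->key, HASH_REMOVE, NULL);

	pfree(entries);
}
```

In such a scheme the collector would call something like `profile_dealloc()` and retry whenever an insertion into the full table fails.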
@rjuju could you review this PR and give some feedback?
 * or without any locks as it's done in pg_stat_get_activity() function.
 * These arrays are not accessible externally and require to add some
 * iterator object into corresponding containing modules.
 */
if (pid == 0)
I agree that we can follow pg_stat_get_activity and get away without any locks. The only risk is that we take a few incorrect samples and attribute wait_event_info to a wrong pid. That doesn't seem to be a problem; at least pg_stat_activity lives with it.

Concerning the `pid == 0` check: I think we are safe, because the `wait_event_info == 0` condition guards us from sampling PGPROCs of terminated backends. Although PGPROC.pid is not cleared upon termination, wait_event_info is (not in ProcKill, but at various other places, e.g. AbortTransaction calls pgstat_report_wait_end).

Should we decide to sample non-waiting states (`wait_event_info == 0`), we could then iterate over BackendStatusArray to visit only live backends, like pg_stat_activity does via pgstat_read_current_status.

What do you think?
> I agree that we can follow pg_stat_get_activity and get away without any locks.

In pg_stat_activity we iterate over auxiliary PROC slots without any lock. This is practical because these slots are reserved for long-lived background workers such as bgwriter, checkpointer, walwriter, archiver and wal receiver. Changes in this specific range of slots occur very rarely and are most commonly initiated by external forces. Still, we might use the ProcStructLock spinlock to protect access while we gather wait event info, as this array is small enough (it currently occupies 5 slots).

Iteration over the other slots, intended mostly for regular backends, is protected in pg_stat_activity by ProcArrayLock, and the iteration has to be performed over ProcArray as an indirection level to access the real working PROC slots.

> Concerning the `pid == 0` check: I think we are safe, because the `wait_event_info == 0` condition guards us from sampling PGPROCs of terminated backends. Although PGPROC.pid is not cleared upon termination, wait_event_info is (not in ProcKill, but at various other places, e.g. AbortTransaction calls pgstat_report_wait_end).

Yes, in the current case it works. But once we start to add the "no wait" state, as pointed out in #10, we will have to reconsider the criteria for deciding whether a PROC slot is freed.

> Should we decide to sample non-waiting states (`wait_event_info == 0`), we could then iterate over BackendStatusArray to visit only live backends, like pg_stat_activity does via pgstat_read_current_status.

Looking up PROC slots from pid values obtained from BackendStatusArray items makes the iteration routine quadratic in time complexity. This factor is the main reason why sampling pg_stat_activity as frequently as the pg_wait_sampling extension does is not practical. IMO the most appropriate solution, which I propose in my comment, is to organize (in the future) the iteration over ProcArray and AuxiliaryProcs.
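To make the discussed iteration concrete, here is a rough sketch of lock-free sampling over allProcs in the spirit of pg_stat_get_activity(). `record_sample()` is a hypothetical helper, and the loop uses the stock PGPROC/PROC_HDR fields rather than this patch's exact code.

```c
#include "postgres.h"
#include "storage/proc.h"

/* Hypothetical helper that bumps the profile and/or history for one sample */
extern void record_sample(int pid, uint32 wait_event_info);

static void
probe_waits_sketch(void)
{
	uint32		i;

	/* allProcs covers both regular backends and auxiliary processes */
	for (i = 0; i < ProcGlobal->allProcCount; i++)
	{
		PGPROC	   *proc = &ProcGlobal->allProcs[i];
		uint32		wait_event_info = proc->wait_event_info;
		int			pid = proc->pid;

		/*
		 * wait_event_info == 0 covers both free slots and non-waiting
		 * backends; PGPROC.pid alone is not enough because ProcKill does
		 * not clear it.
		 */
		if (wait_event_info == 0 || pid == 0)
			continue;

		record_sample(pid, wait_event_info);
	}
}
```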
 * Therefore to reduce possible contention it's worth to segregate the logic
 * of PGPROCs iteration under ProcArrayLock and storing results to profile
 * and/or history under corresponding another lock.
 */
LWLockAcquire(ProcArrayLock, LW_SHARED);
Why do we take ProcArrayLock anyway? What concurrent operations does it prevent?
I don't think that blocking ProcArrayAdd and ProcArrayRemove buys us anything. We are not iterating over the procArray, but over allProcs. And we don't need the lock to read PGPROC.pid and wait_event_info.
Yes, you are right. In the current case, where we also verify slot validity via a non-zero wait_event_info value, we can omit acquiring this lock. This is mostly inherited from the previous implementation.
| pg_wait_sampling.profile_queries | bool | Whether profile should be per query | true |
| Parameter name | Data type | Description | Default value | Change policy |
| ------------------------------------ | --------- | ----------------------------------------------------------------------------------- | ------------- | ------------- |
| pg_wait_sampling.max_profile_entries | int4 | Maximum number of entries in profile hash table | 5000 | restart |
Maybe just profile_size, like history_size?
I mostly mimicked the max_connections name; to me, the _size suffix is associated with something measured in bytes.
processes are left in hash table. Therefore, it's necessary to take into
account periodic flushing of profile to prevent recycling of 32-bit pid values
in profile hash table and as consequence possible increments to profile entries
belonging to some old processes with the same pid values as for current ones.
Do you think pid recycling is such an important concern in practice to justify this warning? Wouldn't we run out of hash table entries and evict the old ones before we hit a recycled pid? I'm not sure this is frequent enough to draw the user's attention to.
Hmm, theoretically confusion caused by pid recycling is possible, and I suppose we have to declare it frankly.
However, it's worth noting in the docs that if the profile hash table is not unreasonably large and no processes other than postgres are actively spawned on the system, the probability of such pid recycling approaches zero.
| pg_wait_sampling.profile_period | int4 | Period for profile sampling in milliseconds (zero value disables profile gathering) | 10 | reload |
| pg_wait_sampling.history_period | int4 | Period for history sampling in milliseconds (zero value disables history gathering) | 0 | reload |
| pg_wait_sampling.profile_pid | bool | Whether profile should be per pid | true | restart |
| pg_wait_sampling.profile_queries | bool | Whether profile should be per query | true | restart |
Does changing profile_pid or profile_queries really require a restart?
Right now it looks redundant; this is intended for future work.

Tracking pid would determine whether we need a separate hash table that accumulates counters for currently active processes alongside another hash table for historic data that accumulates counters for finished ones, or how big a single hash table for historic and current statistics has to be when statistics from completed processes are flushed to the corresponding historic entries.

Similarly for tracking queryid: given that the number of pid values and wait events is bounded, knowing whether queryid tracking is on and the expected maximum number of tracked queryid values (or other future fields with variable cardinality in the profile key) would let us tune the memory required for the profile hash table more finely.

In addition, it's somewhat confusing to have profile entries with null values in the pid and/or queryid fields alongside non-null ones that in fact correspond to the same entries. This requires a profile reset after the setting is applied.
typedef uint64 pgwsQueryId;
#else
typedef uint32 pgwsQueryId;
#endif
Why not just use uint64?
This adds extra space to the profile hash key for older PG versions and, as a consequence, increases the size of the profile hash table. Also, this split in the type definition existed in pg_wait_sampling before; I just moved it to compat.h to avoid using the #if directive in the code.
Query.queryId is 64-bit since v11. I don't think we need to worry about allocating a bit more memory than strictly necessary for v10 and below. But surely we can leave this for another patch, if you like.
*vars = get_guc_variables();
*num_vars = GetNumConfigOptions();
#define PG_WAIT_EXTENSION -1
return WaitLatch(latch, wakeEvents, timeout);
I think we can readily drop support for v9.6 and even v10. So we don't need this compatibility wrapper.
Good idea! I think this ultimately has to be your decision, and I'll build on it.
HTAB *pgws_profile_hash = NULL;
LWLock *pgws_profile_lock = NULL;
History *pgws_history_ring = NULL;
LWLock *pgws_history_lock = NULL;
Maybe we should put these variables into a single struct pgws_shared_state or something like that?
Do you mean putting all these pointers into some structure? I don't think this brings anything, even in code readability.
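For reference, a hypothetical shape of the suggested grouping (not part of the patch); whether it actually helps readability is exactly the open question here.

```c
typedef struct pgwsSharedState
{
	HTAB	   *profile_hash;	/* fixed-size profile hash table */
	LWLock	   *profile_lock;	/* protects profile_hash */
	History	   *history_ring;	/* ring buffer for history samples */
	LWLock	   *history_lock;	/* protects history_ring */
} pgwsSharedState;

static pgwsSharedState *pgws = NULL;	/* one pointer instead of four */
```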
DefineCustomBoolVariable("pg_wait_sampling.profile_queries",
	"Sets whether profile should be collected per query.", NULL,
	&WhetherProfileQueryId, true,
	PGC_POSTMASTER, 0, NULL, pgwsEnableQueryId, NULL);
We should call MarkGUCPrefixReserved or EmitWarningsOnPlaceholders after we've defined all of GUCs.
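A minimal sketch of the suggested call, placed after the last DefineCustomXxxVariable(); MarkGUCPrefixReserved() exists since PostgreSQL 15, while older branches provide EmitWarningsOnPlaceholders() for the same purpose (the exact version guard below is an assumption).

```c
#if PG_VERSION_NUM >= 150000
	MarkGUCPrefixReserved("pg_wait_sampling");
#else
	EmitWarningsOnPlaceholders("pg_wait_sampling");
#endif
```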
Good catch! Thanks

"pg_wait_sampling queryids",
sizeof(pgwsQueryId) * get_max_procs_count(),
&found);
MemSet(pgws_proc_queryids, 0, sizeof(pgwsQueryId) * get_max_procs_count());
I think this MemSet should be in the `if (!found)` branch.
Agreed. Thanks
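The agreed fix would look roughly like this (variable names taken from the hunk above; the surrounding declarations are assumed):

```c
pgws_proc_queryids = (pgwsQueryId *)
	ShmemInitStruct("pg_wait_sampling queryids",
					sizeof(pgwsQueryId) * get_max_procs_count(),
					&found);
if (!found)
	MemSet(pgws_proc_queryids, 0,
		   sizeof(pgwsQueryId) * get_max_procs_count());
```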
info.entrysize = sizeof(ProfileHashEntry);
pgws_profile_hash = ShmemInitHash("pg_wait_sampling hash",
	MaxProfileEntries, MaxProfileEntries,
	&info, HASH_ELEM | HASH_BLOBS);
This code should be inside the `if (!found)` branch. Otherwise, this code will be called by every new backend in the EXEC_BACKEND configuration (e.g. on Windows).
The ShmemInitHash() function handles repeated initialization internally, so the call is idempotent by itself.
profile->items = (ProfileItem *) receive_array(PROFILE_REQUEST,
	sizeof(ProfileItem), &profile->count);
/* Extract profile from shared memory */
profile_count = hash_get_num_entries(pgws_profile_hash);
We should do this after locking the hash. Otherwise, we might not allocate enough memory if the hash grows before we lock it.
Yes, you are right. Thanks
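A sketch of the agreed ordering, with the entry count read only after the lock is taken; `pgws_profile_hash`, `pgws_profile_lock` and `ProfileItem` come from the diff, while `ProfileHashEntry.item` and the function name are assumptions.

```c
static ProfileItem *
snapshot_profile(int *count)
{
	HASH_SEQ_STATUS hash_seq;
	ProfileHashEntry *entry;
	ProfileItem *items;
	long		nentries;
	int			i = 0;

	LWLockAcquire(pgws_profile_lock, LW_SHARED);

	/* size the local copy only while holding the lock */
	nentries = hash_get_num_entries(pgws_profile_hash);
	items = (ProfileItem *) palloc(sizeof(ProfileItem) * nentries);

	hash_seq_init(&hash_seq, pgws_profile_hash);
	while ((entry = hash_seq_search(&hash_seq)) != NULL)
		items[i++] = entry->item;	/* assumed embedded ProfileItem */

	LWLockRelease(pgws_profile_lock);

	*count = i;
	return items;
}
```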
DefineCustomIntVariable("pg_wait_sampling.profile_period",
	"Sets period of waits profile sampling in milliseconds.",
	"0 disables profiling.",
	&ProfilePeriod, 10, 0, INT_MAX,
We must pass the GUC_UNIT_MS flag, so that time units, e.g. '10ms', are accepted.
Thanks, good addition!
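With the suggested flag added, the definition would look roughly like this; the PGC_SIGHUP context follows the PR description of reloadable period GUCs, and the hook arguments are assumptions.

```c
DefineCustomIntVariable("pg_wait_sampling.profile_period",
						"Sets period of waits profile sampling in milliseconds.",
						"0 disables profiling.",
						&ProfilePeriod,
						10, 0, INT_MAX,
						PGC_SIGHUP,
						GUC_UNIT_MS,	/* accept values like '10ms' or '1s' */
						NULL, NULL, NULL);
```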
if (!profile_pid)
	item.pid = 0;
/* Set up key for hashtable search */
key.pid = WhetherProfilePid ? pid : 0;
I think we'd better put a memset here:

> Caller must ensure there are no undefined padding bits in the keys!

https://github.com/postgres/postgres/blob/master/src/backend/utils/hash/dynahash.c#L37

We don't have any padding currently, but let's be future proof.
Thanks, good addition! Will be added
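The agreed change would look roughly like this; the key layout and the surrounding variables are assumptions based on the hunk above.

```c
ProfileHashKey key;
ProfileHashEntry *entry;
bool		found;

/* Zero the whole key so any future padding bits are well defined */
memset(&key, 0, sizeof(key));
key.pid = WhetherProfilePid ? pid : 0;
key.wait_event_info = wait_event_info;
key.queryid = WhetherProfileQueryId ? queryid : 0;

/* HASH_ENTER_NULL returns NULL instead of erroring when the table is full */
entry = (ProfileHashEntry *) hash_search(pgws_profile_hash, &key,
										 HASH_ENTER_NULL, &found);
```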
Why do we need to maintain our own array of queryIds? Why can't we just read PgBackendStatus.st_query_id as pg_stat_activity does? It turns out st_query_id is zero during execution of a prepared statement. (exec_execute_message in postgres.c calls pgstat_report_activity(STATE_RUNNING) and it resets st_query_id.) Here is a quick demo. In a psql session execute:
Oops, query_id is blank. Arguably it's a bug: https://www.postgresql.org/message-id/CA%2B427g8DiW3aZ6pOpVgkPbqK97ouBdf18VLiHFesea2jUk3XoQ%40mail.gmail.com
The answer is in tracking the queryId for just the top-level statement in