-
Notifications
You must be signed in to change notification settings - Fork 694
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reply offload #1457
base: unstable
Are you sure you want to change the base?
Reply offload #1457
Conversation
c2d1a60
to
a0a156c
Compare
# For use cases where command replies include Bulk strings (e.g. GET, MGET) | ||
# reply offload can be enabled to eliminate espensive memory access | ||
# and redundant data copy performed by main thread | ||
# | ||
# reply-offload yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect their to be cases where tuning this variable makes sense? Generally we want to avoid configuration in Valkey to make it simple to operate. Can we make real-time decisions about offloading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to avoid the config too. It's better to start with no config and, if it turns out we need it later, then we can add. The reverse is not possible because removing a config is a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see results of Performance tests in the PR description. Reply offload benefits performance if either data size is large (e.g. 64 Kb) or number of I/O threads is big enough for small data sizes (e.g. starting 9 io-threads
config for 512 byte).
As eliminating expensive memory access to obj->ptr
by main thread is major component of reply offload optimization , it is very challenging to provide dynamic solution. Note: access to obj->ptr
is required to know object size. Besides this, assuming object size is available somehow, it will be relatively challenging to calibrate IsReplyOffloadBeneficial(data_size, io_threads_num) to make it generic for any OS/CPU architecture/etc.
It looks like we should provide config parameter in this case and customer need to test their workloads with and without reply offload and decide to activate it or not.
Please share your opinions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have other places where we dynamically choose to take an action, like with lazy-free effort and freeing objects in background threads. For lazy-free, we try to guess how much work it will be to free, and if the effort is small we do it in the main thread anyways. It seems like we can have a similar heuristic here, if the object is large or we have enough I/O threads, we offload the items to be freed. That way end users don't have to tune it and it works well by default.
@@ -3206,6 +3206,7 @@ standardConfig static_configs[] = { | |||
createBoolConfig("cluster-slot-stats-enabled", NULL, MODIFIABLE_CONFIG, server.cluster_slot_stats_enabled, 0, NULL, NULL), | |||
createBoolConfig("hide-user-data-from-log", NULL, MODIFIABLE_CONFIG, server.hide_user_data_from_log, 1, NULL, NULL), | |||
createBoolConfig("import-mode", NULL, DEBUG_CONFIG | MODIFIABLE_CONFIG, server.import_mode, 0, NULL, NULL), | |||
createBoolConfig("reply-offload", NULL, MODIFIABLE_CONFIG, server.reply_offload_enabled, 0, NULL, NULL), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess also why is the default off? IO threading is off by default, so it seems to allow this to be on by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please comments above regarding avoiding reply-offload
config at all. The answer to the question 'why reply-offload is off by default' is because it does not benefit performance in 100% of use cases.
https://github.com/valkey-io/valkey/actions/runs/12395947567/job/34606854911?pr=1457 Means you are leaking some memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.
The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.
I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.
src/networking.c
Outdated
* INTERNALS | ||
* The writevToClient strives to write all client reply buffers to the client connection. | ||
* However, it may encounter NET_MAX_WRITES_PER_EVENT or IOV_MAX or socket limit. In such case, | ||
* some client reply buffers will be written completely and some partially. | ||
* In next invocation writevToClient should resume from the exact position where it stopped. | ||
* Also writevToClient should communicate to _postWriteToClient which buffers written completely | ||
* and can be released. It is intricate in case of reply offloading as length of reply buffer does not match | ||
* to network bytes out. | ||
* | ||
* For this purpose, writevToClient uses 3 data members on the client struct as input/output paramaters: | ||
* io_last_written_buf - Last buffer that has been written to the client connection | ||
* io_last_written_bufpos - The buffer has been written until this position | ||
* io_last_written_data_len - The actual length of the data written from this buffer | ||
* This length differs from written bufpos in case of reply offload | ||
* | ||
* The writevToClient uses addBufferToReplyIOV, addCompoundBufferToReplyIOV, addOffloadedBulkToReplyIOV, addPlainBufferToReplyIOV | ||
* to build reply iovec array. These functions know to skip io_last_written_data_len, specifically addPlainBufferToReplyIOV | ||
* | ||
* In the end of execution writevToClient calls saveLastWrittenBuf for calculating "last written" buf/pos/data_len | ||
* and storing on the client. While building reply iov, writevToClient gathers auxiliary bufWriteMetadata that | ||
* helps in this calculation. In some cases, It may take several (> 2) invocations for writevToClient to write reply | ||
* from a single buffer but saveLastWrittenBuf knows to calculate "last written" buf/pos/data_len properly | ||
* | ||
* The _postWriteToClient uses io_last_written_buf and io_last_written_bufpos in order to detect completely written buffers | ||
* and release them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally internal comments should be near the code they are describing. Can we move this into the function near the relevant sections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved (spread) comments to relevant functions
src/networking.c
Outdated
payloadHeader *header = (payloadHeader *)ptr; | ||
ptr += sizeof(payloadHeader); | ||
|
||
if (header->type == CLIENT_REPLY_PAYLOAD_BULK_OFFLOAD) { | ||
clusterSlotStatsAddNetworkBytesOutForSlot(header->slot, header->actual_len); | ||
|
||
robj** obj_ptr = (robj**)ptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code like this would benefit a lot from some helper methods. Instead of just constantly moving and recasting values. Something like,
robj *getValkeyObjectFromHeader(payloadHeader *header) {
char *ptr = (char *ptr) header;
ptr += sizeof(payloadHeader);
return (robj**)ptr;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggested helper function does not address all the needs. As buffer can contain content like
header1ptr1ptr2ptr3header2plain_replyheader3ptr4ptr5 and it is more convenient to move ptr
and objv
accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand your comment, it doesn't look like it rendered correctly?
it is more convenient to move
I agree, but it is much harder to read.
I just looked briefly, mostly at the discussions. I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.
So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses. Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode. Is there any size threshold for when we embed object pointers in the cob? Is it as simple as if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes practice, because smaller strings are stored as EMBSTR. I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads? |
Signed-off-by: Alexander Shabanov <[email protected]>
db824f4
to
04e41c1
Compare
Fixed |
ac7e1f5
to
a40e72e
Compare
From the tests and perf profiling it appears that main cause for performance improvement from this feature comes from eliminating expensive memory access to |
Very good questions. I will publish results of various tests with and without I/O threads and with different data sizes on next week. IMPORTANT NOTE: we can't switch on or off reply offload dynamically according to obj(string) size cause main optimization is to eliminate expensive memory access to |
Got it. Thanks! At least, when the feature is ON, it doesn't make sense to dynamically switch it OFF based on length. But for single-threaded mode where this feature is normally OFF, we could consider switching it ON dynamically only for really huge strings, right? In this case we will have one expensive memory access, but we could avoid copying megabytes. Let's see the benchmark results if this makes sense. I appreciate you're testing this with different sizes and with/without IO threading. |
Published results of performance tests in the PR description |
a40e72e
to
cff89de
Compare
robj *obj2 = createObject(OBJ_STRING, sdscatfmt(sdsempty(), "test2")); | ||
_addBulkOffloadToBufferOrList(c, obj2); | ||
|
||
TEST_ASSERT(c->bufpos == sizeof(payloadHeader) + 2 * sizeof(void *)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add comments explaining the + 2
or better declare a meaningful variable name with the value of 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added more comments in tests
src/cluster_slot_stats.c
Outdated
|
||
int clusterSlotStatsEnabled(void) { | ||
return server.cluster_slot_stats_enabled && /* Config should be enabled. */ | ||
server.cluster_enabled; /* Cluster mode should be enabled. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments appear to be redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments retained from the original refactored code. They indeed redundant. Will remove them
clientReplyBlock *block = (clientReplyBlock *)listNodeValue(c->io_last_reply_block); | ||
c->io_last_bufpos = block->used; | ||
/* If reply offload enabled force new header */ | ||
block->last_header = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a last_header
per block if we already have a last_header
field per client
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single last_header
per client does not work well with DeferredReply
case. Besides this, last_header
per block is simpler, aligns well with existing code and data fields.
@@ -234,6 +266,9 @@ client *createClient(connection *conn) { | |||
c->commands_processed = 0; | |||
c->io_last_reply_block = NULL; | |||
c->io_last_bufpos = 0; | |||
c->io_last_written_buf = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we instead use a bit flag in the payload header to indicate if we are done? Since the main thread needs to iterate over the headers inside the buffer anyway to get the actual written bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the main thread needs to iterate over the headers inside the buffer anyway to get the actual written bytes" is not accurate. The main thread iterates over the headers inside the buffer ONLY ONCE when buffer should be released.
"use a bit flag in the payload header" approach will cause:
- main thread to iterate over the headers inside the buffer even when buffer can't be released fully yet
- writer thread to iterate over the headers inside the buffer in the end of write each time to detect/mark headers
The implemented io_last_written_buf/io_last_written_bufpos/io_last_written_data_len
eliminates redundant iterations over headers inside buffers and much simpler.
Signed-off-by: Alexander Shabanov <[email protected]>
cff89de
to
72be14b
Compare
Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more. This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values. We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold. |
Pay attention 9 threads means main thread + 8 I/O threads. Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly. |
@alexander-shabanov I do not think adding a configuration parameter is the preferred option in this case. Users are almost never tuning their caches at these levels and it is also very problematic to tell the user to learn his workload and formulate ahis own rules to when to enable this config. I also think there is some risk in introducing this degradation so we should work to understand what other alternatives we have.
|
/* Test 1: Add bulk offloads to the reply list */ | ||
|
||
/* Fill c->buf almost completely */ | ||
size_t reply_len = c->buf_usable_size - 2 * sizeof(payloadHeader) - 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment explaining why -4
?
freeReplyOffloadClient(c); | ||
|
||
return 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we test releaseBufOffloads
instead of directly calling decrRefCount
@@ -174,24 +179,14 @@ void clusterSlotStatsDecrNetworkBytesOutForReplication(long long len) { | |||
* This type is not aggregated, to stay consistent with server.stat_net_output_bytes aggregation. | |||
* This function covers the internal propagation component. */ | |||
void clusterSlotStatsAddNetworkBytesOutForShardedPubSubInternalPropagation(client *c, int slot) { | |||
/* For a blocked client, c->slot could be pre-filled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand why was it required before this PR.
static int canAddNetworkBytesOut(client *c) { | ||
return server.cluster_slot_stats_enabled && server.cluster_enabled && c->slot != -1; | ||
static int canAddNetworkBytesOut(int slot) { | ||
return clusterSlotStatsEnabled() && slot != -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming it to canAddSlotStats
and using it in both canAddCpuDuration
and canAddNetworkBytesIn
c->io_last_bufpos = ((clientReplyBlock *)listNodeValue(c->io_last_reply_block))->used; | ||
clientReplyBlock *block = (clientReplyBlock *)listNodeValue(c->io_last_reply_block); | ||
c->io_last_bufpos = block->used; | ||
/* If reply offload enabled force new header */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should indeed check if reply offload is enabled before writing NULL to the header? Checking a global is cheaper than write access to the block
@@ -374,6 +425,49 @@ void deleteCachedResponseClient(client *recording_client) { | |||
/* ----------------------------------------------------------------------------- | |||
* Low level functions to add more data to output buffers. | |||
* -------------------------------------------------------------------------- */ | |||
static inline void insertPayloadHeader(char *buf, size_t *bufpos, uint8_t type, size_t len, int slot, payloadHeader **last_header) { | |||
/* Save the latest header */ | |||
*last_header = (payloadHeader *)(buf + *bufpos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe new_header
is more clear?
c->bufpos += reply_len; | ||
/* We update the buffer peak after appending the reply to the buffer */ | ||
if (c->buf_peak < (size_t)c->bufpos) c->buf_peak = (size_t)c->bufpos; | ||
return reply_len; | ||
} | ||
|
||
size_t _addReplyToBuffer(client *c, const char *s, size_t len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be static as well. same for _addBulkOffloadToBuffer
struct iovec *iov = iov_arr; | ||
ssize_t bufpos, iov_bytes_len = 0; | ||
listNode *lastblock; | ||
char prefixes[iovmax / 3 + 1][LONG_STR_SIZE + 3]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iovmax / 3
- use constant instead of magic number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments for this line to clarify what the prefixes are and why these dimensions were chosen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to find a way to avoid allocating this array on the stack in the common case where reply offload is disabled
char prefixes[iovmax / 3 + 1][LONG_STR_SIZE + 3]; | ||
char crlf[2] = {'\r', '\n'}; | ||
int bufcnt = 0; | ||
bufWriteMetadata metadata[listLength(c->reply) + 1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment explaining that the +1 is for the static buffer
} | ||
|
||
static inline int updatePayloadHeader(payloadHeader *last_header, uint8_t type, size_t len, int slot) { | ||
if (last_header->type == type && last_header->slot == slot) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using for example MGET with small values that are not offloaded and come from different slots, wouldn't this cause multiple plain headers, thus requiring writeV instead of a simple write? We should investigate if this causes any performance degradation
Overview
This PR introduces the ability to offload replies to I/O threads as described at #1353.
Key Changes
network-bytes-out
for offloaded repliesNote: When reply offload disabled content and handling of client reply buffers remains as before this PR
Implementation Details
Reply construction:
_addReplyToBuffer
and_addReplyProtoToList
have been renamed to_addReplyPayloadToBuffer
and_addReplyPayloadToList
and extended to support different types of payloads - regular replies and offloaded replies._addReplyToBuffer
and_addReplyProtoToList
calls now_addReplyPayloadToBuffer
and_addReplyPayloadToList
and used for adding regular replies to client reply buffers._addBulkOffloadToBuffer
and_addBulkOffloadToList
are used for adding offloaded replies to client reply buffers.Write-to-client infrastructure:
The
writevToClient
and_postWriteToClient
has been significantly changed to support reply offload capability.Testing
Performance tests
Note: pay attention
io-threads 1
config means only main thread with no additional io-threads,io-threads 2
means main thread plus 1 I/O thread,io-threads 9
means main thread plus 8 I/O threads.No Reply Offload
Reply Offload Enabled