Setting TCP_QUICKACK reduces trylock latency by 40ms. #254
When running PArSEC on Linux (e.g. Debian bookworm on aarch64), there is a 40ms delay between requesting a lock and obtaining a response. For example, in a 1-agent, 1-shard localhost configuration, running `lua_bench` with 2 wallets shows the agent starting to execute the contract at 02:16:32.012, while the shard log shows the shard processing its first `trylock` a whole 43ms later, at 02:16:32.055, even though the communication is done entirely over a loopback interface. Inspecting the local traffic with `tcpdump -i lo` for the same timestamp range (the shard is listening on port 5556) reveals an agent->shard TCP packet that only gets ACK'd by the shard 43ms later. However, after that ACK, the shard's `trylock`
response is almost immediate.

I believe this comes from the delayed-ACK feature in Linux, where ACKs get briefly delayed to coalesce the sending of ACK receipts and improve throughput (at the expense of latency). According to https://topic.alibabacloud.com/a/linux-tcp-delay-confirmation_1_16_31848681.html, the default delay is exactly 40ms, suspiciously close to the number observed above. This hypothesis is further strengthened by the fact that this PR (which enables `TCP_QUICKACK`) improves the 5-wallet `lua_bench` tx submission rate by around 100x (from ~15tps to ~1500tps).

As suggested by netty/netty#13610, `TCP_QUICKACK` needs to be re-enabled after each `recv`; in fact, doing it only at the end of `tcp_socket::receive` (and once when setting up the socket) improved the tx submission rate by only 2x (to ~28tps).
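To illustrate, here is a minimal sketch of the idea (the helper name and the raw-fd plumbing are mine for illustration; the actual patch works inside `tcp_socket`):

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// The kernel clears quick-ack mode on its own after sending some ACKs,
// so setting TCP_QUICKACK once at connect time is not enough: it has
// to be re-armed after every recv().
void rearm_quickack(int sock_fd) {
    int flag = 1;
    // Error handling intentionally omitted, mirroring the current patch.
    setsockopt(sock_fd, IPPROTO_TCP, TCP_QUICKACK, &flag, sizeof(flag));
}

// Usage in a receive loop:
//     auto n = recv(sock_fd, buf, buf_len, 0);
//     rearm_quickack(sock_fd);
```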
I think that in our setting, enabling `TCP_NODELAY` (which disables coalescing on the sending side) also makes sense for our protocol, but I did not see significant performance changes, so maybe that can be omitted. (The best option for our `trylock` interface would probably be doing each `trylock` request and response inside a single `writev` or `TCP_CORK`; see the sketch below.)
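A rough sketch of the single-`writev` idea (the length-prefix framing here is an assumption for illustration, not necessarily our actual wire format):

```cpp
#include <array>
#include <cstdint>
#include <sys/uio.h>

// Hypothetical: emit a trylock request as one writev() call so the
// header and payload leave the socket together, instead of as two
// small writes that interact badly with delayed ACKs on the peer.
bool send_trylock_request(int sock_fd, void* payload, uint64_t len) {
    std::array<iovec, 2> iov{{
        {&len, sizeof(len)},                  // length prefix
        {payload, static_cast<size_t>(len)},  // request body
    }};
    // Ignores partial writes for brevity; real code would loop.
    return writev(sock_fd, iov.data(), static_cast<int>(iov.size()))
        == static_cast<ssize_t>(sizeof(len) + len);
}
```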
For now, this patch does not do any error handling and does not gate the `setsockopt` call to platforms where `TCP_QUICKACK` is defined (I guess, only Linux); a possible gating sketch is below. How should this be improved?
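For reference, one conventional way to gate the call and surface errors would be something like this (again just a sketch, not what the patch currently does):

```cpp
#ifdef TCP_QUICKACK
    // TCP_QUICKACK is Linux-specific; other platforms skip the call.
    int flag = 1;
    if(setsockopt(sock_fd, IPPROTO_TCP, TCP_QUICKACK, &flag,
                  sizeof(flag)) != 0) {
        // Propagate or log errno instead of silently ignoring it.
    }
#endif
```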