add tcp user timeout config knob #14

pH14 · 2023-03-09T21:24:32Z

Currently tokio-postgres exposes two knobs to maintain healthy connections: a connect timeout and keepalive settings that apply directly to the TCP socket. These cover connection establishment and idle connections, but not an established socket with data in flight whose peer stops responding for a long period of time. By default it can take 15-20 minutes for a connection to be killed under these circumstances: the kernel retransmits 15 times with exponential backoff, and the number of retries is controlled by tcp_retries2.
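For a rough sense of where that figure comes from, here is a back-of-the-envelope sketch of the backoff arithmetic. It assumes Linux's minimum and maximum retransmission timeouts of 200 ms and 120 s; the function name is made up for illustration. A real connection derives its RTO from the measured round-trip time, so this is a lower bound rather than the exact time to failure.

```rust
/// Linux's lower and upper bounds on the retransmission timeout (assumed here).
const RTO_MIN_MS: u64 = 200;
const RTO_MAX_MS: u64 = 120_000;

/// Hypothetical helper: total time spent waiting across the initial
/// transmission plus `retries` retransmissions, with the timeout doubling
/// after each attempt until it hits the cap.
pub fn hypothetical_timeout_ms(retries: u32) -> u64 {
    let mut rto = RTO_MIN_MS;
    let mut total = 0;
    // One wait after the original send, then one after each retry.
    for _ in 0..=retries {
        total += rto;
        rto = (rto * 2).min(RTO_MAX_MS);
    }
    total
}
```

With the default of 15 retries, `hypothetical_timeout_ms(15)` comes to 924,600 ms, or about 15.4 minutes, which is the low end of the range quoted above.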

The generally recommended solution to this problem is to set TCP_USER_TIMEOUT, which caps how long transmitted data may remain unacknowledged on an established socket before the kernel closes the connection. https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ has a great writeup of this case under "Busy ESTAB socket is not forever".

I haven't found a super satisfying way of testing this yet, but staging it here for now.

pH14 · 2023-03-23T20:15:45Z

OK, this was weirdly hard to verify. I spun up a scratch machine (this option is Linux-specific), started environmentd, and attached strace -e trace=setsockopt -f -p <envd>. From there we can see a steady stream of syscalls setting this option:

[pid 475309] setsockopt(86, SOL_TCP, TCP_USER_TIMEOUT, [5000], 4) = 0
[pid 475309] setsockopt(93, SOL_TCP, TCP_USER_TIMEOUT, [5000], 4) = 0
...
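For anyone reading along, each of those lines corresponds roughly to a call like the one below. This is a standalone sketch, not the code in this PR: the helper names are hypothetical, the constants are the Linux values of IPPROTO_TCP (same as SOL_TCP) and TCP_USER_TIMEOUT, and it binds to libc's setsockopt/getsockopt directly so it only needs the standard library. It will only work on Linux.

```rust
use std::io;
use std::mem::size_of;
use std::net::TcpStream;
use std::os::fd::AsRawFd;
use std::os::raw::{c_int, c_void};
use std::time::Duration;

// Linux-specific values (assumed; taken from the kernel/libc headers).
const IPPROTO_TCP: c_int = 6; // same value strace prints as SOL_TCP
const TCP_USER_TIMEOUT: c_int = 18;

extern "C" {
    fn setsockopt(fd: c_int, level: c_int, name: c_int, val: *const c_void, len: u32) -> c_int;
    fn getsockopt(fd: c_int, level: c_int, name: c_int, val: *mut c_void, len: *mut u32) -> c_int;
}

/// Hypothetical helper: set TCP_USER_TIMEOUT (in milliseconds) on a socket.
pub fn set_tcp_user_timeout(stream: &TcpStream, timeout: Duration) -> io::Result<()> {
    let millis = timeout.as_millis();
    if millis > u32::MAX as u128 {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "timeout too large"));
    }
    let millis = millis as u32;
    // SAFETY: the pointer and length describe a valid, live u32.
    let rc = unsafe {
        setsockopt(
            stream.as_raw_fd(),
            IPPROTO_TCP,
            TCP_USER_TIMEOUT,
            &millis as *const u32 as *const c_void,
            size_of::<u32>() as u32,
        )
    };
    if rc == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

/// Hypothetical helper: read the option back, to check that it took effect.
pub fn get_tcp_user_timeout(stream: &TcpStream) -> io::Result<Duration> {
    let mut millis: u32 = 0;
    let mut len = size_of::<u32>() as u32;
    // SAFETY: both pointers refer to valid, live locals of the stated size.
    let rc = unsafe {
        getsockopt(
            stream.as_raw_fd(),
            IPPROTO_TCP,
            TCP_USER_TIMEOUT,
            &mut millis as *mut u32 as *mut c_void,
            &mut len,
        )
    };
    if rc == 0 {
        Ok(Duration::from_millis(u64::from(millis)))
    } else {
        Err(io::Error::last_os_error())
    }
}
```

Calling `set_tcp_user_timeout(&stream, Duration::from_millis(5000))` on a connected socket should show up in strace in the same shape as the lines above, with `[5000]` as the value and `4` as the option length.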

pH14 · 2023-03-23T20:16:48Z

Not sure what to do about the failing tests. cc @benesch, I wouldn't mind a look at the change itself and any tips for CI here.

benesch · 2023-03-24T03:57:59Z

CI here has been broken for years I’m afraid! Recommend filing the patch upstream, where CI does work. Then the only thing that can bite us is a rebase issue, which is fairly unlikely, and we get plenty of end-to-end coverage of this library in Materialize.

pH14 · 2023-03-27T21:14:26Z

sfackler#1007 merged upstream. I didn't cherry-pick since the change is only a few lines and it didn't apply cleanly; not sure if that's cool or not. Otherwise, I think we're good to go here, though I'm not able to merge myself (cc @benesch).

benesch · 2023-03-28T07:58:07Z

Thanks, @pH14! I figured I'd just integrate the latest upstream changes (including yours) into the master branch, following the instructions here: https://github.com/MaterializeInc/rust-postgres#integrating-upstream-changes. I "fixed" the branch protection settings (I think) to not require CI to pass.

I think a cargo update -p tokio-postgres in the Materialize repo should pick this up now!

benesch approved these changes Mar 24, 2023

pH14 force-pushed the tcp-user-timeout branch from 18ee7eb to 1b30118 Compare March 27, 2023 21:08

add tcp user timeout config

54a3c5e

pH14 force-pushed the tcp-user-timeout branch from 1b30118 to 54a3c5e Compare March 27, 2023 21:10
