Revise WS Ping Timeout Handling to become more robust. #10337

sgothel · 2024-10-27T11:00:29Z

Summary

WS Ping timeout was described and included in the PR #9916 (Issue #9833), commit 994f8f9.
(Impact is discussed lately in PR #10334).

WS native control-frame Ping is considered the only reliable alive-checking instrument we have for long-lived WS connections.
This above COOL's protocol-ping and ICMP ping .

Previous Code

The original code pre commit 994f8f9 had a fixed one-shot period and timeout (TO) of 18s w/o closing the connection at checkTimeout.

Introduced Disconnect on WS Ping Timeout

Commit 994f8f9
uses a period of 3s and an average (avg) timeout (TO) of 2s.

The avg TO is not triggered in case of an one-off,
but over the cumulative average of the total connection duration (-> TimeAverage).
It renders the latency tolerance much higher in case of drop-outs,
the longer the connection is well alive. E.g. a sporadic 120s lag after
a 120s good responsive connection duration of 250ms would
result in the average of 1.25s and remains alive.
Only after multiple such extreme failures/lags occur, we would drop the connection.
Hence current 2s on average becomes much more relaxed than original fixed 20s.

Other Keepalive Timeout Related Instruments

World Ping Test - global ping test,
for ICMP ping, demonstrates latency of < 500ms (Sydney w/ ~333ms) at least for me.

Most other utilities I have re-checked now seem to use higher fixed values
ranging between 10s-120min for the timeout - leaning more towards the low-end.

5s: node.js default server.keepAliveTimeout
18s: Original code had a period==timeout
20s: Python WebSockets https://websockets.readthedocs.io/en/latest/topics/keepalive.html
20s: grpc https://github.com/grpc/grpc/blob/master/doc/keepalive.md

KeepAlive Intervals (also considered for timeout)

30s: .Net
30-150s: Zato's websocket-timeouts

Desired Solution (Robust Ping Timeout allowing Package-Loss)

Zato's websocket-timeouts
uses a missed ping threshold (default 5) and above timeout (=interval) (default 30s),
allowing for package-loss scenarios.

Zato actually uses a default of 5 * 30s = 150s, exceeding all above mentioned timeout defaults.

Proposed is a more tight but still above used defaults (5-1) * 9s + 8s = 44s,
i.e. threshold of 5 and an average timeout of 8s and interval of 9s.
Noted relaxed properties on the average timeout above apply here as well!

Here we should use a moving average, assuming constant time intervals (as is the case here),
covering at least threshold data points.

Such instrument gives following robust properties

Allows to miss threshold - 1 pings (package-loss),
Resets the timeout criteria if one pong has been received
Essentially expands the timeout to (threshold-1) * interval + avg_timeout, i.e. (5-1) * 9s + 8s = 44s
- avg_timeout being the moving average timeout for a single ping (default 8s)
- threshold being the number of pings to fail before timeout (default 5)
- interval being >= avg_timeout, period until next ping attempt (default 9s)
- Note: If any value is zero, the instrument is disabled
Goes well along with relaxed moving average and lower avg_timeout, i.e. only triggers ping timeout if degraded over time w/o one-offs while still allowing ping drops.

Additionally, it is suggested to only use this instrument when reaching a low number of free available sockets.

Further, StreamSocket::checkRemoval should not additionally
check for inactivity on WS to trigger a timeout, as it would override this instrument,
i.e. its inactivity timeout must either exceed this instrument or being dropped for WS.

Fallback Solution

As a fallback, we should at least ensure that the average timeout triggering disconnect
is not below 18s. However, even this value has not been utilized nor tested yet.

Therefore, the desired solution accounting for packet loss is more robust.

The text was updated successfully, but these errors were encountered:

mmeeks · 2024-10-28T17:56:29Z

It is -very- unclear to me why we want to ever close websockets; and I'm none the wiser having read this.

A websocket once setup and authenticated is used for hard-lifecycle management of an editing session.

Close it frivolously and really-bad-things can happen - in extremis data-loss if permissions changed underneath us and we can no longer ask the user what to do about a failed save. I see no reason to have added this closing of websocket at all FWIW - and I'd prefer it backed out ASAP until there is a clear understanding and concept around what we're trying to achieve here.

sgothel · 2024-10-28T19:43:22Z

It is -very- unclear to me why we want to ever close websockets; and I'm none the wiser having read this.

A websocket once setup and authenticated is used for hard-lifecycle management of an editing session.

Close it frivolously and really-bad-things can happen - in extremis data-loss if permissions changed underneath us and we can no longer ask the user what to do about a failed save. I see no reason to have added this closing of websocket at all FWIW - and I'd prefer it backed out ASAP until there is a clear understanding and concept around what we're trying to achieve here.

OK, no problem - I will take it out in #10355

sgothel · 2024-10-28T20:48:37Z

Closed for now, reopen if desired.

sgothel added enhancement New feature or request unconfirmed 24.04 labels Oct 27, 2024

sgothel added this to the 24.04 milestone Oct 27, 2024

sgothel self-assigned this Oct 27, 2024

sgothel added this to Collabora Online Oct 27, 2024

sgothel mentioned this issue Oct 27, 2024

Private/mmeeks/debongmore #10334

Merged

6 tasks

sgothel mentioned this issue Oct 28, 2024

Revise Handle Limited Open Connections #10355

Closed

5 tasks

sgothel closed this as completed Oct 28, 2024

github-project-automation bot moved this to Done in Collabora Online Oct 28, 2024

sgothel mentioned this issue Oct 30, 2024

Use system's maximum concurrent TCP connections for net::Defaults.maxTCPConnections #10375

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revise WS Ping Timeout Handling to become more robust. #10337

Revise WS Ping Timeout Handling to become more robust. #10337

sgothel commented Oct 27, 2024 •

edited

Loading

mmeeks commented Oct 28, 2024

sgothel commented Oct 28, 2024

sgothel commented Oct 28, 2024

Revise WS Ping Timeout Handling to become more robust. #10337

Revise WS Ping Timeout Handling to become more robust. #10337

Comments

sgothel commented Oct 27, 2024 • edited Loading

Summary

Previous Code

Introduced Disconnect on WS Ping Timeout

Other Keepalive Timeout Related Instruments

Desired Solution (Robust Ping Timeout allowing Package-Loss)

Fallback Solution

mmeeks commented Oct 28, 2024

sgothel commented Oct 28, 2024

sgothel commented Oct 28, 2024

sgothel commented Oct 27, 2024 •

edited

Loading