Revise WS Ping Timeout Handling to become more robust. #10337

Closed
sgothel opened this issue Oct 27, 2024 · 3 comments
sgothel commented Oct 27, 2024

Summary

WS Ping timeout handling was described and introduced in PR #9916 (Issue #9833), commit 994f8f9.
(Its impact is discussed more recently in PR #10334.)

The native WS control-frame Ping is considered the only reliable alive-checking instrument we have for long-lived WS connections,
ranking above COOL's protocol-level ping and ICMP ping.

Previous Code

The original code, prior to commit 994f8f9, used a fixed one-shot period and timeout (TO) of 18s without closing the connection in checkTimeout.

Introduced Disconnect on WS Ping Timeout

Commit 994f8f9
uses a period of 3s and an average (avg) timeout (TO) of 2s.

The avg TO is not triggered by a one-off,
but by the cumulative average over the total connection duration (-> TimeAverage).
This makes the latency tolerance for drop-outs grow
the longer the connection stays healthy. E.g., a sporadic 120s lag after
a 120s well-responsive stretch with 250ms latency would
result in an average of 1.25s, and the connection remains alive.
Only after multiple such extreme failures/lags occur would we drop the connection.
Hence the current 2s average is effectively much more relaxed than the original fixed 18s.
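For illustration, here is a minimal sketch of such a cumulative (running) average, assuming TimeAverage behaves like a plain running mean over all samples; names and details are illustrative, not COOL's actual implementation:

```cpp
#include <cstddef>

// Minimal running-mean sketch of the cumulative-average behaviour
// described above (illustrative, not COOL's TimeAverage).
class CumulativeAverage
{
    double _sum = 0.0;      // accumulated latency in seconds
    std::size_t _count = 0; // number of samples seen

public:
    // Feed one ping round-trip latency sample (seconds), return new average.
    double add(double latencySec)
    {
        _sum += latencySec;
        ++_count;
        return average();
    }

    double average() const { return _count ? _sum / _count : 0.0; }
};
```

With many good ~250ms samples accumulated, a single extreme lag moves this average only a little, which is why a one-off does not trip the 2s threshold while repeated lags eventually do.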

Other Keepalive Timeout Related Instruments

The World Ping Test, a global ICMP ping test,
demonstrates latencies of < 500ms (Sydney at ~333ms), at least for me.

Most other utilities I have re-checked seem to use higher fixed timeout values,
ranging between 10s and 120min, leaning more towards the low end.

KeepAlive Intervals (also considered for timeout)

Desired Solution (Robust Ping Timeout allowing Packet Loss)

Zato's websocket-timeouts
use a missed-ping threshold (default 5) and a per-ping timeout equal to the interval (default 30s),
allowing for packet-loss scenarios.

Zato thus effectively uses a default of 5 * 30s = 150s, exceeding all of the timeout defaults mentioned above.

Proposed is a tighter value that still exceeds the defaults in common use: (5-1) * 9s + 8s = 44s,
i.e. a threshold of 5, an average timeout of 8s, and an interval of 9s.
The relaxed properties of the average timeout noted above apply here as well!

Here we should use a moving average, assuming constant time intervals (as is the case here),
covering at least threshold data points (see the sketch below).
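A minimal sketch of such a fixed-window moving average, assuming constant sample intervals and a window of at least threshold samples (the class name and window handling are illustrative):

```cpp
#include <cstddef>
#include <deque>
#include <numeric>

// Moving average over the last `window` latency samples, assuming
// samples arrive at a constant interval (illustrative sketch).
class MovingAverage
{
    std::deque<double> _samples;
    std::size_t _window;

public:
    explicit MovingAverage(std::size_t window) : _window(window) {}

    // Feed one latency sample (seconds), dropping the oldest when full.
    double add(double latencySec)
    {
        _samples.push_back(latencySec);
        if (_samples.size() > _window)
            _samples.pop_front();
        return average();
    }

    double average() const
    {
        if (_samples.empty())
            return 0.0;
        return std::accumulate(_samples.begin(), _samples.end(), 0.0)
               / static_cast<double>(_samples.size());
    }
};
```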

Such an instrument has the following robust properties (a sketch follows the list):

  • Allows missing threshold - 1 pings (packet loss).
  • Resets the timeout criteria once a single pong has been received.
  • Effectively expands the timeout to (threshold-1) * interval + avg_timeout, i.e. (5-1) * 9s + 8s = 44s.
    • avg_timeout: the moving-average timeout for a single ping (default 8s).
    • threshold: the number of pings that must fail before timeout (default 5).
    • interval: the period until the next ping attempt, >= avg_timeout (default 9s).
    • Note: if any value is zero, the instrument is disabled.
  • Combines well with a relaxed moving average and a lower avg_timeout, i.e. it only triggers a ping timeout if the connection has degraded over time, ignoring one-offs while still allowing dropped pings.
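For illustration, here is a hedged sketch of the threshold/reset logic under the parameters above; all names are hypothetical, not COOL's actual API, and the moving-average latency check itself (see the earlier sketch) is assumed to happen outside this class:

```cpp
#include <chrono>

// Hypothetical sketch of the proposed instrument: a ping is sent every
// `interval`; a pong in time resets the missed counter; the connection
// is dropped only after `threshold` consecutive misses. A zero value
// for any parameter disables the instrument.
class PingTimeoutGuard
{
    std::chrono::seconds _interval;   // period until next ping (>= _avgTimeout)
    std::chrono::seconds _avgTimeout; // moving-average timeout per ping
    unsigned _threshold;              // pings to fail before disconnect
    unsigned _missed = 0;             // consecutive missed pings so far

public:
    PingTimeoutGuard(std::chrono::seconds interval,
                     std::chrono::seconds avgTimeout, unsigned threshold)
        : _interval(interval), _avgTimeout(avgTimeout), _threshold(threshold)
    {
    }

    bool enabled() const
    {
        return _interval.count() > 0 && _avgTimeout.count() > 0 && _threshold > 0;
    }

    // A pong arrived in time: reset the timeout criteria.
    void onPongInTime() { _missed = 0; }

    // The moving-average latency exceeded _avgTimeout for this ping;
    // returns true when the connection should be dropped.
    bool onPingTimedOut() { return enabled() && ++_missed >= _threshold; }

    // Worst-case time to disconnect:
    // (threshold - 1) * interval + avg_timeout, e.g. (5-1) * 9s + 8s = 44s.
    std::chrono::seconds effectiveTimeout() const
    {
        return (_threshold - 1) * _interval + _avgTimeout;
    }
};
```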

Additionally, it is suggested to only enable this instrument once the number of free available sockets becomes low.

Further, StreamSocket::checkRemoval should not additionally
check for inactivity on WS to trigger a timeout, as that would
override this instrument;
i.e. its inactivity timeout must either exceed this instrument's effective timeout or be dropped for WS.
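For illustration (hypothetical names, assuming the worst-case formula above), the required relation could be checked like this:

```cpp
#include <chrono>

// The generic WS inactivity timeout used around StreamSocket::checkRemoval
// must exceed the ping instrument's worst case, or be dropped for WS
// entirely, so it cannot pre-empt the ping timeout (illustrative check).
bool inactivityTimeoutIsSafe(std::chrono::seconds inactivityTimeout,
                             std::chrono::seconds interval,
                             std::chrono::seconds avgTimeout,
                             unsigned threshold)
{
    const auto pingWorstCase = (threshold - 1) * interval + avgTimeout;
    return inactivityTimeout > pingWorstCase; // e.g. > 44s for 5 / 9s / 8s
}
```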

Fallback Solution

As a fallback, we should at least ensure that the average timeout triggering a disconnect
is not below 18s. However, even this value has neither been utilized nor tested yet.

Therefore, the desired solution accounting for packet loss is more robust.

sgothel added the enhancement, unconfirmed and 24.04 labels on Oct 27, 2024
sgothel added this to the 24.04 milestone on Oct 27, 2024
sgothel self-assigned this on Oct 27, 2024

mmeeks commented Oct 28, 2024

It is -very- unclear to me why we want to ever close websockets; and I'm none the wiser having read this.

A websocket once setup and authenticated is used for hard-lifecycle management of an editing session.

Close it frivolously and really-bad-things can happen - in extremis data-loss if permissions changed underneath us and we can no longer ask the user what to do about a failed save. I see no reason to have added this closing of websocket at all FWIW - and I'd prefer it backed out ASAP until there is a clear understanding and concept around what we're trying to achieve here.


sgothel commented Oct 28, 2024

> It is -very- unclear to me why we want to ever close websockets; and I'm none the wiser having read this.
>
> A websocket once setup and authenticated is used for hard-lifecycle management of an editing session.
>
> Close it frivolously and really-bad-things can happen - in extremis data-loss if permissions changed underneath us and we can no longer ask the user what to do about a failed save. I see no reason to have added this closing of websocket at all FWIW - and I'd prefer it backed out ASAP until there is a clear understanding and concept around what we're trying to achieve here.

OK, no problem - I will take it out in #10355


sgothel commented Oct 28, 2024

Closed for now, reopen if desired.
