Summary
The WS Ping timeout was described and introduced in PR #9916 (Issue #9833), commit 994f8f9.
(Its impact was recently discussed in PR #10334.)
The native WS control-frame Ping is considered the only reliable liveness check we have for long-lived WS connections.
It ranks above COOL's protocol-level ping and ICMP ping.
Previous Code
Before commit 994f8f9, the code used a fixed one-shot period and timeout (TO) of 18s and did not close the connection in checkTimeout.
Introduced Disconnect on WS Ping Timeout
Commit 994f8f9 uses a period of 3s and an average (avg) timeout (TO) of 2s.
The avg TO is not triggered by a one-off delay,
but by the cumulative average over the total connection duration (-> TimeAverage).
This makes the latency tolerance for drop-outs grow
the longer the connection has been healthy. E.g. a sporadic 120s lag after
120s of a responsive connection at 250ms would
result in an average of 1.25s, so the connection remains alive.
Only after multiple such extreme lags occur would we drop the connection.
Hence the current 2s average is much more relaxed than the original fixed 18s.
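To illustrate why a cumulative average grows more tolerant with connection age, here is a minimal sketch. It assumes a plain per-sample running mean, which is not necessarily how TimeAverage weights its samples, so the figures are illustrative only and need not match the example above. The function names are hypothetical.

```python
def cumulative_mean(samples):
    """Plain running mean over every round-trip time seen so far (seconds)."""
    return sum(samples) / len(samples)


def max_tolerated_outlier(good_count, good_rtt, limit):
    """Largest single round-trip time that still keeps the cumulative mean
    at or below `limit`, after `good_count` samples of `good_rtt` seconds.

    Derived from (good_count * good_rtt + x) / (good_count + 1) <= limit.
    """
    return limit * (good_count + 1) - good_count * good_rtt


if __name__ == "__main__":
    # With a 3s ping period, 10 / 40 / 400 pongs correspond to roughly
    # 30s / 2min / 20min of healthy connection time.
    for n in (10, 40, 400):
        print(n, max_tolerated_outlier(n, good_rtt=0.25, limit=2.0))
```

Under this assumption the tolerated one-off lag grows linearly with the number of good pongs, which is the relaxation effect described above.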
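Other Keepalive Timeout Related Instruments
World Ping Test, a global ICMP ping test, shows latencies below 500ms (Sydney at ~333ms), at least from my location.
Most other utilities I have re-checked use higher fixed timeout values,
ranging from 10s to 120min and leaning towards the low end.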
Desired Solution (Robust Ping Timeout allowing Packet Loss)
Zato's websocket-timeouts
use a missed-ping threshold (default 5) and the above timeout (= interval, default 30s),
allowing for packet-loss scenarios.
Zato thus effectively uses a default of 5 * 30s = 150s, exceeding all timeout defaults mentioned above.
Proposed is a tighter value that still lies above the commonly used defaults: (5-1) * 9s + 8s = 44s,
i.e. a threshold of 5, an average timeout of 8s and an interval of 9s.
The relaxed properties of the average timeout noted above apply here as well!
Here we should use a moving average, assuming constant time intervals (as is the case here),
covering at least threshold data points.
Such an instrument has the following robust properties:
- Allows missing threshold - 1 pings (packet loss).
- Resets the timeout criterion once a pong has been received.
- Essentially expands the timeout to (threshold-1) * interval + avg_timeout, i.e. (5-1) * 9s + 8s = 44s, with
  - avg_timeout being the moving-average timeout for a single ping (default 8s),
  - threshold being the number of pings that must fail before the timeout (default 5),
  - interval being the period until the next ping attempt, >= avg_timeout (default 9s).
  - Note: if any value is zero, the instrument is disabled.
- Goes well with the relaxed moving average and a lower avg_timeout, i.e. it only triggers a ping timeout if the connection degrades over time, not on one-offs, while still allowing ping drops.
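The properties above can be sketched as follows. This is one possible reading of the proposal, not COOL code: the class PingMonitor, its method names, and the choice to count consecutive missed pings separately from the moving average of round-trip times are all assumptions made for illustration.

```python
from collections import deque


def effective_timeout(threshold, interval, avg_timeout):
    """Worst-case time until disconnect: (threshold-1) * interval + avg_timeout.
    Returns 0.0 when any value is zero, i.e. the instrument is disabled."""
    if threshold <= 0 or interval <= 0 or avg_timeout <= 0:
        return 0.0
    return (threshold - 1) * interval + avg_timeout


class PingMonitor:
    """Hypothetical ping-timeout tracker with a missed-ping threshold."""

    def __init__(self, threshold=5, interval=9.0, avg_timeout=8.0):
        self.threshold = threshold
        self.interval = interval
        self.avg_timeout = avg_timeout
        self.enabled = effective_timeout(threshold, interval, avg_timeout) > 0
        self.missed = 0
        # Moving window covering `threshold` round-trip samples.
        self.rtts = deque(maxlen=max(threshold, 1))

    def on_pong(self, rtt):
        """A pong arrived: reset the missed counter and record its RTT."""
        self.missed = 0
        self.rtts.append(rtt)

    def on_missed_ping(self):
        """No pong arrived before the next ping was due."""
        self.missed += 1

    def moving_average(self):
        return sum(self.rtts) / len(self.rtts) if self.rtts else 0.0

    def timed_out(self):
        if not self.enabled:
            return False
        return (self.missed >= self.threshold
                or self.moving_average() > self.avg_timeout)


if __name__ == "__main__":
    print(effective_timeout(5, 9.0, 8.0))  # proposed defaults
    print(5 * 30.0)                        # Zato-style threshold * interval
```

With the proposed defaults, four consecutive missed pings leave the connection open, and a single pong resets the count.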
Additionally, it is suggested to use this instrument only when the number of free available sockets runs low.
Further, StreamSocket::checkRemoval should not additionally
check for inactivity on WS to trigger a timeout, as that would override this instrument,
i.e. its inactivity timeout must either exceed this instrument's timeout or be dropped for WS.
Fallback Solution
As a fallback, we should at least ensure that the average timeout triggering a disconnect
is not below 18s. However, even this value has been neither used nor tested yet.
Therefore, the desired solution, which accounts for packet loss, is more robust.
It is -very- unclear to me why we want to ever close websockets; and I'm none the wiser having read this.
A websocket once setup and authenticated is used for hard-lifecycle management of an editing session.
Close it frivolously and really-bad-things can happen - in extremis data-loss if permissions changed underneath us and we can no longer ask the user what to do about a failed save. I see no reason to have added this closing of websocket at all FWIW - and I'd prefer it backed out ASAP until there is a clear understanding and concept around what we're trying to achieve here.