Private/mmeeks/debongmore #10334

mmeeks · 2024-10-26T14:40:11Z

Resolves: Handle limited open Connections due to keepalive connections (cool#9621) #9833 refining PR Handle limited open Connections due to keepalive connections #9916
Target version: master

Summary

TODO

...

Checklist

I have run make prettier-write and formatted the code.
All commits have Change-Id
I have run tests with make check
I have issued make run and manually verified that everything looks okay
Documentation (manuals or wiki) has been updated or is not required

mmeeks · 2024-10-26T14:44:34Z

Wow - the more I read here, the more scared I get; some questions.

Did we really add this "close websockets" functionality ? - and why was that hidden in this huge unrelated timeout change ?
where did the 2 second timeout number come from ?
are/were we really applying that to kit, and background-save process' websockets ?

Then there are more questions:

/// TO Criteria: Violate maximum idle (_pollTimeout default 64s)
const bool isIdle = _pollTimeout > std::chrono::microseconds::zero() &&
    durLast > _pollTimeout;
/// TO Criteria: Shall terminate?
const bool isTermination = SigUtil::getTerminationFlag();
if (isIdle || isTermination )

What is this 'isIdle' condition I just renamed ? - why is the _pollTimeout related in any way to when we should close sockets ? perhaps it should have the same value - but the _pollTimeout - is the default amount of time the poll runs when there is no event - it seems totally conceptually unrelated to the amount of time we should wait before closing sockets - luckily it is plausible time.

However as a constant - not a member it should not have an '_' prefix - it looks like an instance variable.

What does the 'TO Criteria:' mean ? in the comment - can we make that more readable, less shouty, and so on ?

Concerned really.

sgothel · 2024-10-26T18:01:15Z

net/WebSocketHandler.hpp

@@ -674,7 +674,8 @@ class WebSocketHandler : public ProtocolHandlerInterface
            LOG_WRN("CheckTimeout: Timeout websocket: Ping: last " << _pingMicroS.last() << "us, avg "
                    << _pingMicroS.average() << "us >= " << _pingTimeout.count() << "us over "
                    << (int)_pingMicroS.duration() << "s");
-            shutdownSilent();
+            if( isIPType() )
+                shutdownSilent();


Outch, missed the internal kit usage obviously, as also not considered in the modified orig code.
Thank you!

This needs more refinement (WIP), ~~i.e. determination whether to use or not use WS ping checkTimeout to also allow unit testing. The latter uses a local socket.~~
.. needs not to return true if local.

Would this disconnect the internal connection between coolwsd and kit?

The internal connection should never be disconnected, even if it idles for hours. We should have no disconnection logic for them, whatsoever.

sgothel · 2024-10-26T19:31:21Z

* Did we really add this "close websockets" functionality ? - and why was that hidden in this huge unrelated timeout change ?

It was included in the PR #9916 description
as well as in the commit of the patch series
994f8f9

By memory, it was discussed with you and/or Ash .. i.e.
ping as the only reliable alive-checking instrument we have.
I then explained WS native-ping vs COOL protocol-ping.
We didn't discuss ICMP ping .

* where did the 2 second timeout number come from ?

The original code had a period==timeout of 18s.

This code has a period of 3s and a timeout of 2s.
Timing was also described in above mentioned artifacts.

However, this timeout is not triggered in case of a one-off,
but over the average of the total connection duration (-> TimeAverage).
It renders the latency tolerance much higher in case of drop-outs,
the longer the connection is well alive. E.g. a sporadic 120s lag after
a 120s good responsive connection duration of 250ms would
result in the average of 1.25s and remains alive.
Only after multiple such extreme failures/lags occur, we would drop the connection.
Hence current 2s on average becomes much more relaxed than original fixed 20s.

World Ping Test - global ping test,
for ICMP ping, demonstrates latency of < 500ms (Sydney w/ ~333ms) at least for me.

Most other utilities I have re-checked now seem to use higher fixed values
ranging between 10s-120min for the timeout - leaning more towards the low-end.

5s: node.js default server.keepAliveTimeout
18s: Original code had a period==timeout
20s: Python WebSockets https://websockets.readthedocs.io/en/latest/topics/keepalive.html
20s: grpc https://github.com/grpc/grpc/blob/master/doc/keepalive.md

* are/were we really applying that to kit, and background-save process' websockets ?

Originally the WS ping/pong used (fixed 18s) without disconnecting.
What I understood your commit 8835ff0
resolved this by taking out non-IP based sockets.

I re-reviewed the referenced documentation and the code and will followup on it in a bit.

sgothel · 2024-10-26T19:47:36Z

What is this 'isIdle' condition I just renamed ? - why is the _pollTimeout related in any way to when we should close sockets ? perhaps it should have the same value - but the _pollTimeout - is the default amount of time the poll runs when there is no event - it seems totally conceptually unrelated to the amount of time we should wait before closing sockets - luckily it is plausible time.

Yes, its about inactivity (badly named idle) and wrongly named _pollTimeout
due to using the same value as SocketPollTimeout (64s).

isIdle -> isInactive
_pollTimeout -> inactivityTimeout

Note that this inactivity is disrupted by WS ping, hence also caps the WS ping average timeout to 64s.

However as a constant - not a member it should not have an '_' prefix - it looks like an instance variable.

An immutable private entity like previous code's _hostname.

What does the 'TO Criteria:' mean ? in the comment - can we make that more readable, less shouty, and so on ?

Sorry, TO is the common abbreviation for timeout - so I should have used lower case criteria I guess.

sgothel · 2024-10-27T08:51:34Z

Rebased and adjusted the WebSocketHandler::checkTimeout code (compile clean and working w/ product & unit testing).

sgothel · 2024-10-27T09:39:06Z

I re-reviewed the referenced documentation and the code and will followup on it in a bit.

Considering Zato's websocket-timeouts
using a missed ping threshold and above mentioned ~20s timeout (TO) value,
I prepare a commit considering threshold * avg_timeout (average = avg).

The threshold

allows to miss threshold-1 pings,
resets the timeout criteria if one pong has been received
essentially expands the timeout to threshold * avg_timeout, i.e. 5 * 10s = 50s (edited)
goes well along with TimeAverage and the lower avg_timeout

Bottom line, should render our WS ping more robust.

Edit: I have wrapped this up in #10337 and will provide PR.

sgothel · 2024-10-27T11:59:34Z

Yes, its about inactivity (badly named idle) and wrongly named _pollTimeout due to using the same value as SocketPollTimeout (64s).
* isIdle -> isInactive

* _pollTimeout -> inactivityTimeout
Note that this inactivity is disrupted by WS ping, hence also caps the WS ping average timeout to 64s.

Adding a commit to this PR using above naming scheme, expanding the inactivity timeout to 3600s and adding a unit test.

The fix for WS Ping will go into another PR addressing #10337

sgothel · 2024-10-27T17:51:07Z

Adding a commit to this PR using above naming scheme, expanding the inactivity timeout to 3600s and adding a unit test.

Resulted in the forced push'ed three commits

Remove virtual of checkRemoval and cleanup var names AND using 3600s inactivity timeout
Unit test cleanup/update
Adding a unit test for inactivity testing

sgothel

Good as-is with the additional commits re inactivity/checkRemoval.
Would be great to have a re-review by @mmeeks.
I will use this to add the PR for #10337

mmeeks

Needs fixing as described - but we can merge this to avoid the random closure of sockets and data-loss from documents with their Kit connection cut out from under them.

mmeeks · 2024-10-28T09:31:53Z

net/WebSocketHandler.hpp

-                    << (int)_pingMicroS.duration() << "s");
-            shutdownSilent();
-            return true;
+            std::shared_ptr<StreamSocket> socket = _socket.lock();


Why the lock and the copy when this comes in as a parameter ?

I'm quite disturbed by closing connections on longer average pings too - where does that come from ?

I'm quite disturbed by closing connections on longer average pings too - where does that come from ?

I described the source (and idea) of WS ping TO instrument #10334 (comment) as well as wrapped it up in #10337

I am currently about to finish my draft PR for #10337 and will post its initial version,
currently refining the UTs for it.

Why the lock and the copy when this comes in as a parameter ?

the _socket.lock() has moved up to checkTimeout to cover for the socket->isIPType() check, which you have correctly introduced with your change request (but not coded all out).

mmeeks · 2024-10-28T09:36:03Z

net/Socket.hpp

@@ -1067,7 +1063,7 @@ class StreamSocket : public Socket,
                 HostType hostType, ReadType readType = ReadType::NormalRead,
                 std::chrono::steady_clock::time_point creationTime = std::chrono::steady_clock::now() ) :
        Socket(fd, type, creationTime),
-        _pollTimeout( net::Defaults::get().SocketPollTimeout ),
+        _inactivityTimeout( net::Defaults::get().InactivityTimeout ),


Why do we copy this number into a member variable - rather than using it directly in the arithmetic ? surely getting that number from the defaults is far from slow - and it would then be clearer, simpler, fewer lines of code, less memory used for the socket class and so on (?). Please fix that.

I have cleaned up these parameter for a WIP for #10337, only querying the net::Defaults::get() once.
Due to the singleton magic nature of net::Defaults::get() (for UTs) it must be queried
to get the thread-safe single instance.

But then we have a perfectly good way of changing defaults for UTs that I highlighted in reviews (configure()), which was ignored in favor or introducing this (unnecessary) net::Defaults (and a new function configNet()).

I really don't understand why we need configNet() that takes net::Defaults when configure() does that in a homogeneous way, vis-a-vis configuration options. This new approach doesn't make these values "configurable", it creates a parallel way of changing them, that is orthogonal to the configuration file. Not a good idea. Also, net::Defaults has all-public members, so there is absolutely no thread safety here.

It's very easy to go in the wrong direction, with uninformed changes, thinking we're making major improvement. Better to understand the current logic and reasoning, before replacing it.

I will put these changes in this branch (simplified net::Defaults properties cache).

Ashod

Thanks, @mmeeks. Looks much simpler, although still more complex than my original suggestion to stick to checkTimeout() as much as possible.

It would be great to get a confirmation if this fixes a potential disconnection of the kits. Need to know if I should stop the hunt or not.

Ashod · 2024-10-26T14:49:41Z

net/WebSocketHandler.hpp

@@ -674,7 +674,8 @@ class WebSocketHandler : public ProtocolHandlerInterface
            LOG_WRN("CheckTimeout: Timeout websocket: Ping: last " << _pingMicroS.last() << "us, avg "
                    << _pingMicroS.average() << "us >= " << _pingTimeout.count() << "us over "
                    << (int)_pingMicroS.duration() << "s");
-            shutdownSilent();
+            if( isIPType() )


if (isIPType()) ?

revised and fixed with my amended commit 5ad12dc#diff-ee87941216cc8c8f64d6d5d5a0c1e50f07cb88bbc5f5295fbb1740e90a678fb1R679

Would this disconnect the internal connection between coolwsd and kit?

yes, internal connections are of type unix to the LocalServerSocket, validated.

Re net::Default configuration:

Its configuration via configNet UT callback was added in parallel since it is beyond the WSD scope,
where the general WSD configuration resides.

Lack of thread sync is due to its lifecycle the same reason as WSD configuration,
i.e. we call configNet pre-creation of sockets etc.

It is correct that we had duplication of maximum sockets (ahem), which will be revised due to changed semantics. However, the build global will influence this metric. Not too great, I agree.

WIP: Dropping the net::Defaults instance and configNet (using configure) as requested.

Ashod · 2024-10-28T10:49:11Z

net/WebSocketHandler.hpp

@@ -674,7 +674,8 @@ class WebSocketHandler : public ProtocolHandlerInterface
            LOG_WRN("CheckTimeout: Timeout websocket: Ping: last " << _pingMicroS.last() << "us, avg "
                    << _pingMicroS.average() << "us >= " << _pingTimeout.count() << "us over "
                    << (int)_pingMicroS.duration() << "s");
-            shutdownSilent();
+            if( isIPType() )
+                shutdownSilent();


Would this disconnect the internal connection between coolwsd and kit?

The internal connection should never be disconnected, even if it idles for hours. We should have no disconnection logic for them, whatsoever.

Ashod · 2024-10-28T10:57:48Z

net/Socket.hpp

@@ -1067,7 +1063,7 @@ class StreamSocket : public Socket,
                 HostType hostType, ReadType readType = ReadType::NormalRead,
                 std::chrono::steady_clock::time_point creationTime = std::chrono::steady_clock::now() ) :
        Socket(fd, type, creationTime),
-        _pollTimeout( net::Defaults::get().SocketPollTimeout ),
+        _inactivityTimeout( net::Defaults::get().InactivityTimeout ),


But then we have a perfectly good way of changing defaults for UTs that I highlighted in reviews (configure()), which was ignored in favor or introducing this (unnecessary) net::Defaults (and a new function configNet()).

I really don't understand why we need configNet() that takes net::Defaults when configure() does that in a homogeneous way, vis-a-vis configuration options. This new approach doesn't make these values "configurable", it creates a parallel way of changing them, that is orthogonal to the configuration file. Not a good idea. Also, net::Defaults has all-public members, so there is absolutely no thread safety here.

It's very easy to go in the wrong direction, with uninformed changes, thinking we're making major improvement. Better to understand the current logic and reasoning, before replacing it.

Ashod · 2024-10-28T11:11:24Z

By memory, it was discussed with you and/or Ash .. i.e. ping as the only reliable alive-checking instrument we have. I then explained WS native-ping vs COOL protocol-ping. We didn't discuss ICMP ping .

We discussed this exact point privately, yes. Let me quote myself:

Negative. TCP has a perfectly good way for this (with some footnotes). We simply poll the socket handle, and we get back the state, if it is closed. [...]

So it would seem I didn't quite agree that "pings are the only reliable alive-checking instrument." Indeed, we already detect closed sockets, in SocketPoll, and remove them. Here is another comment on the original PR, also questioning the approach taken.

To repeat what I've been advocating:

If we want to detect dead connections, relying on TCP and poll(2) seems straightforward (as we already do).
If we need to detect idle connections, checking the last data transfer time against a threshold (timeout) would do the trick very reliably (checkTimeout() is a good place to do it, and we already do that for HTTP GET).
If we need to detect slow connections, then we need both a rolling average data-rate and a threshold. We might also want to enforce this logic only when we have too many open sockets, or based on some other criteria, but not strictly at all times. Again, checkTimeout() is the obvious place to do that.

But most certainly we have to be extremely vigilant not to close WebSocket connections blindly and without justification, which are the lifelines of documents and their life-cycle.
Most critically, the internal connection between a Kit and coolwsd cannot be externally interrupted for any reason, whatsoever.

Hope this clarifies where I stand. Happy to discuss where there is a mismatch between my understanding/position and the requirements, of course.

sgothel · 2024-10-28T11:52:01Z

To repeat what I've been advocating:

* If we want to detect **dead** connections, relying on TCP and poll(2) seems straightforward (as we already do).

Yup.

* If we need to detect **idle** connections, checking the last data transfer time against a threshold (timeout) would do the trick very reliably (`checkTimeout()` is a good place to do it, and we already do that for HTTP GET).

We had this in place for replies.

Now we have general client inactivity (set to 1 hour, with revised commits in this branch).
(This was wrong earlier using just the 64s timeout).

The idle connection criteria was part of the requirement.

This metric is checked in StreamSocket::checkRemoval for outside facing connections,
called sequentially after virtual override checkTimeout. (edited)

* If we need to detect **slow** connections, then we need both a _rolling_ average data-rate and a threshold.

In line with proposal #10337 and described here above.
Thank you for mentioning the rolling or moving average characteristic here (I missed to mention that yet).

We might also want to enforce this logic only when we have too many open sockets, or based on some other criteria, but not strictly at all times.

This is a great addition. Hence we would want an addition property like LowFreeSockets...,
a percentage of the used maximum sockets. The latter is the hard-cut-off for new connections but different from build-in maximum session connection, as discussed above

Again, checkTimeout() is the obvious place to do that.

Yes, this is covered with the WS ping timeout within its checkTimeout override.

This overall slow connection criteria was part of the requirement.

But most certainly we have to be extremely vigilant not to close WebSocket connections blindly and without justification, which are the lifelines of documents and their life-cycle. Most critically, the internal connection between a Kit and coolwsd cannot be externally interrupted for any reason, whatsoever.

Hope this clarifies where I stand. Happy to discuss where there is a mismatch between my understanding/position and the requirements, of course.

Yup, thank you.

Revising the work within this branch.

sgothel · 2024-10-28T12:47:13Z

failed desktop cypress test writer/table_operation_spec.js not reproduced locally, kicking jenkins.

Change-Id: Ifc18ae945f9946e511a394dd92d4e884e6d3ae6d Signed-off-by: Michael Meeks <[email protected]>

We have one to the Kit process - that is allowed to take way longer than the ping time to respond, and it was not clear that this new close behavior on ping/ping was introduced in: e8a3ceb Which also appears to add a 2 second non-responding websocket ping timeout for some reason (!?). Signed-off-by: Michael Meeks <[email protected]> Signed-off-by: Sven Göthel <[email protected]> Change-Id: Iad74f253bdac5636757b130b299b5deacda658db

sgothel · 2024-10-28T16:12:21Z

Rebased and added commit 4ceaeb0
(redone due to android build issue not linking NetUtil.cpp)

mmeeks · 2024-10-28T17:23:35Z

Fixed CI failure mode in unit tests - only execute executable coolwsd ;-) but ... same Cypress test fails locally - interestingly.

sgothel · 2024-10-28T17:29:29Z

Fixed CI failure mode in unit tests - only execute executable coolwsd ;-) but ... same Cypress test fails locally - interestingly.

thank you!

I got this message too late after my push .. outch, @mmeeks please be so kind and re-push, thank you. duck and hide

Avoiding this: + test -s ./coolwsd + echo 'Cleaning up...' Cleaning up... + ./coolwsd --disable-cool-user-checking --cleanup /tmp/jenkins9134351613063228334.sh: line 41: ./coolwsd: Permission denied Signed-off-by: Michael Meeks <[email protected]> Change-Id: I17886f1198fedd51e7b90ce1759abfcd0323835c

sgothel · 2024-10-28T18:09:08Z

My cleanup work continues here #10355

mmeeks requested review from sgothel, Ashod and caolanm October 26, 2024 14:40

sgothel reviewed Oct 26, 2024

View reviewed changes

sgothel force-pushed the private/mmeeks/debongmore branch from 8835ff0 to 5ad12dc Compare October 27, 2024 08:49

sgothel mentioned this pull request Oct 27, 2024

Revise WS Ping Timeout Handling to become more robust. #10337

Closed

sgothel self-requested a review October 27, 2024 17:54

sgothel approved these changes Oct 27, 2024

View reviewed changes

mmeeks commented Oct 28, 2024

View reviewed changes

Ashod approved these changes Oct 28, 2024

View reviewed changes

mmeeks added 2 commits October 28, 2024 17:08

Simplify code flow.

9a68d84

Change-Id: Ifc18ae945f9946e511a394dd92d4e884e6d3ae6d Signed-off-by: Michael Meeks <[email protected]>

sgothel force-pushed the private/mmeeks/debongmore branch from 45c2005 to 61f5b7a Compare October 28, 2024 16:10

sgothel force-pushed the private/mmeeks/debongmore branch from 61f5b7a to 4ceaeb0 Compare October 28, 2024 16:21

sgothel force-pushed the private/mmeeks/debongmore branch from 4e1133f to ffae823 Compare October 28, 2024 17:26

mmeeks force-pushed the private/mmeeks/debongmore branch from ffae823 to 89d229e Compare October 28, 2024 17:38

sgothel mentioned this pull request Oct 28, 2024

Revise Handle Limited Open Connections #10355

Closed

5 tasks

mmeeks merged commit 3229102 into master Oct 29, 2024
13 checks passed

mmeeks deleted the private/mmeeks/debongmore branch October 29, 2024 09:36

sgothel mentioned this pull request Oct 30, 2024

Use system's maximum concurrent TCP connections for net::Defaults.maxTCPConnections #10375

Open

5 tasks

Private/mmeeks/debongmore #10334

Private/mmeeks/debongmore #10334

Conversation

mmeeks commented Oct 26, 2024 • edited by sgothel Loading

Summary

TODO

Checklist

mmeeks commented Oct 26, 2024

Choose a reason for hiding this comment

sgothel Oct 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sgothel commented Oct 26, 2024

sgothel commented Oct 26, 2024 • edited Loading

sgothel commented Oct 27, 2024

sgothel commented Oct 27, 2024 • edited Loading

sgothel commented Oct 27, 2024

sgothel commented Oct 27, 2024

sgothel left a comment

Choose a reason for hiding this comment

mmeeks left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Ashod left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Ashod commented Oct 28, 2024

sgothel commented Oct 28, 2024 • edited Loading

sgothel commented Oct 28, 2024

sgothel commented Oct 28, 2024 • edited Loading

mmeeks commented Oct 28, 2024

sgothel commented Oct 28, 2024

sgothel commented Oct 28, 2024

mmeeks commented Oct 26, 2024 •

edited by sgothel

Loading

sgothel Oct 27, 2024 •

edited

Loading

sgothel commented Oct 26, 2024 •

edited

Loading

sgothel commented Oct 27, 2024 •

edited

Loading

sgothel commented Oct 28, 2024 •

edited

Loading

sgothel commented Oct 28, 2024 •

edited

Loading