Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Private/mmeeks/debongmore #10334

Merged
merged 3 commits into from
Oct 29, 2024
Merged

Private/mmeeks/debongmore #10334

merged 3 commits into from
Oct 29, 2024

Conversation

mmeeks
Copy link
Contributor

@mmeeks mmeeks commented Oct 26, 2024

Summary

TODO

  • ...

Checklist

  • I have run make prettier-write and formatted the code.
  • All commits have Change-Id
  • I have run tests with make check
  • I have issued make run and manually verified that everything looks okay
  • Documentation (manuals or wiki) has been updated or is not required

@mmeeks
Copy link
Contributor Author

mmeeks commented Oct 26, 2024

Wow - the more I read here, the more scared I get; some questions.

  • Did we really add this "close websockets" functionality ? - and why was that hidden in this huge unrelated timeout change ?
  • where did the 2 second timeout number come from ?
  • are/were we really applying that to kit, and background-save process' websockets ?

Then there are more questions:

/// TO Criteria: Violate maximum idle (_pollTimeout default 64s)
const bool isIdle = _pollTimeout > std::chrono::microseconds::zero() &&
    durLast > _pollTimeout;
/// TO Criteria: Shall terminate?
const bool isTermination = SigUtil::getTerminationFlag();
if (isIdle || isTermination )

What is this 'isIdle' condition I just renamed ? - why is the _pollTimeout related in any way to when we should close sockets ? perhaps it should have the same value - but the _pollTimeout - is the default amount of time the poll runs when there is no event - it seems totally conceptually unrelated to the amount of time we should wait before closing sockets - luckily it is plausible time.

However as a constant - not a member it should not have an '_' prefix - it looks like an instance variable.

What does the 'TO Criteria:' mean ? in the comment - can we make that more readable, less shouty, and so on ?

Concerned really.

@@ -674,7 +674,8 @@ class WebSocketHandler : public ProtocolHandlerInterface
LOG_WRN("CheckTimeout: Timeout websocket: Ping: last " << _pingMicroS.last() << "us, avg "
<< _pingMicroS.average() << "us >= " << _pingTimeout.count() << "us over "
<< (int)_pingMicroS.duration() << "s");
shutdownSilent();
if( isIPType() )
shutdownSilent();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outch, missed the internal kit usage obviously, as also not considered in the modified orig code.
Thank you!

Copy link

@sgothel sgothel Oct 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs more refinement (WIP), i.e. determination whether to use or not use WS ping checkTimeout to also allow unit testing. The latter uses a local socket.
.. needs not to return true if local.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this disconnect the internal connection between coolwsd and kit?

The internal connection should never be disconnected, even if it idles for hours. We should have no disconnection logic for them, whatsoever.

@sgothel
Copy link

sgothel commented Oct 26, 2024

* Did we really add this "close websockets" functionality ? - and why was that hidden in this huge unrelated timeout change ?

It was included in the PR #9916 description
as well as in the commit of the patch series
994f8f9

By memory, it was discussed with you and/or Ash .. i.e.
ping as the only reliable alive-checking instrument we have.
I then explained WS native-ping vs COOL protocol-ping.
We didn't discuss ICMP ping .

* where did the 2 second timeout number come from ?

The original code had a period==timeout of 18s.

This code has a period of 3s and a timeout of 2s.
Timing was also described in above mentioned artifacts.

However, this timeout is not triggered in case of a one-off,
but over the average of the total connection duration (-> TimeAverage).
It renders the latency tolerance much higher in case of drop-outs,
the longer the connection is well alive. E.g. a sporadic 120s lag after
a 120s good responsive connection duration of 250ms would
result in the average of 1.25s and remains alive.
Only after multiple such extreme failures/lags occur, we would drop the connection.
Hence current 2s on average becomes much more relaxed than original fixed 20s.

World Ping Test - global ping test,
for ICMP ping, demonstrates latency of < 500ms (Sydney w/ ~333ms) at least for me.

Most other utilities I have re-checked now seem to use higher fixed values
ranging between 10s-120min for the timeout - leaning more towards the low-end.

* are/were we really applying that to kit, and background-save process' websockets ?

Originally the WS ping/pong used (fixed 18s) without disconnecting.
What I understood your commit 8835ff0
resolved this by taking out non-IP based sockets.

I re-reviewed the referenced documentation and the code and will followup on it in a bit.

@sgothel
Copy link

sgothel commented Oct 26, 2024

What is this 'isIdle' condition I just renamed ? - why is the _pollTimeout related in any way to when we should close sockets ? perhaps it should have the same value - but the _pollTimeout - is the default amount of time the poll runs when there is no event - it seems totally conceptually unrelated to the amount of time we should wait before closing sockets - luckily it is plausible time.

Yes, its about inactivity (badly named idle) and wrongly named _pollTimeout
due to using the same value as SocketPollTimeout (64s).

  • isIdle -> isInactive
  • _pollTimeout -> inactivityTimeout

Note that this inactivity is disrupted by WS ping, hence also caps the WS ping average timeout to 64s.

However as a constant - not a member it should not have an '_' prefix - it looks like an instance variable.

An immutable private entity like previous code's _hostname.

What does the 'TO Criteria:' mean ? in the comment - can we make that more readable, less shouty, and so on ?

Sorry, TO is the common abbreviation for timeout - so I should have used lower case criteria I guess.

@sgothel
Copy link

sgothel commented Oct 27, 2024

Rebased and adjusted the WebSocketHandler::checkTimeout code (compile clean and working w/ product & unit testing).

@sgothel
Copy link

sgothel commented Oct 27, 2024

I re-reviewed the referenced documentation and the code and will followup on it in a bit.

Considering Zato's websocket-timeouts
using a missed ping threshold and above mentioned ~20s timeout (TO) value,
I prepare a commit considering threshold * avg_timeout (average = avg).

The threshold

  • allows to miss threshold-1 pings,
  • resets the timeout criteria if one pong has been received
  • essentially expands the timeout to threshold * avg_timeout, i.e. 5 * 10s = 50s (edited)
  • goes well along with TimeAverage and the lower avg_timeout

Bottom line, should render our WS ping more robust.

Edit: I have wrapped this up in #10337 and will provide PR.

@sgothel
Copy link

sgothel commented Oct 27, 2024

Yes, its about inactivity (badly named idle) and wrongly named _pollTimeout due to using the same value as SocketPollTimeout (64s).

* isIdle -> isInactive

* _pollTimeout -> inactivityTimeout

Note that this inactivity is disrupted by WS ping, hence also caps the WS ping average timeout to 64s.

Adding a commit to this PR using above naming scheme, expanding the inactivity timeout to 3600s and adding a unit test.

The fix for WS Ping will go into another PR addressing #10337

@sgothel
Copy link

sgothel commented Oct 27, 2024

Adding a commit to this PR using above naming scheme, expanding the inactivity timeout to 3600s and adding a unit test.

Resulted in the forced push'ed three commits

  • Remove virtual of checkRemoval and cleanup var names AND using 3600s inactivity timeout
  • Unit test cleanup/update
  • Adding a unit test for inactivity testing

@sgothel sgothel self-requested a review October 27, 2024 17:54
Copy link

@sgothel sgothel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good as-is with the additional commits re inactivity/checkRemoval.
Would be great to have a re-review by @mmeeks.
I will use this to add the PR for #10337

Copy link
Contributor Author

@mmeeks mmeeks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs fixing as described - but we can merge this to avoid the random closure of sockets and data-loss from documents with their Kit connection cut out from under them.

<< (int)_pingMicroS.duration() << "s");
shutdownSilent();
return true;
std::shared_ptr<StreamSocket> socket = _socket.lock();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the lock and the copy when this comes in as a parameter ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quite disturbed by closing connections on longer average pings too - where does that come from ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quite disturbed by closing connections on longer average pings too - where does that come from ?

I described the source (and idea) of WS ping TO instrument #10334 (comment) as well as wrapped it up in #10337

I am currently about to finish my draft PR for #10337 and will post its initial version,
currently refining the UTs for it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the lock and the copy when this comes in as a parameter ?

the _socket.lock() has moved up to checkTimeout to cover for the socket->isIPType() check, which you have correctly introduced with your change request (but not coded all out).

net/Socket.hpp Outdated
@@ -1067,7 +1063,7 @@ class StreamSocket : public Socket,
HostType hostType, ReadType readType = ReadType::NormalRead,
std::chrono::steady_clock::time_point creationTime = std::chrono::steady_clock::now() ) :
Socket(fd, type, creationTime),
_pollTimeout( net::Defaults::get().SocketPollTimeout ),
_inactivityTimeout( net::Defaults::get().InactivityTimeout ),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we copy this number into a member variable - rather than using it directly in the arithmetic ? surely getting that number from the defaults is far from slow - and it would then be clearer, simpler, fewer lines of code, less memory used for the socket class and so on (?). Please fix that.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have cleaned up these parameter for a WIP for #10337, only querying the net::Defaults::get() once.
Due to the singleton magic nature of net::Defaults::get() (for UTs) it must be queried
to get the thread-safe single instance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then we have a perfectly good way of changing defaults for UTs that I highlighted in reviews (configure()), which was ignored in favor or introducing this (unnecessary) net::Defaults (and a new function configNet()).

I really don't understand why we need configNet() that takes net::Defaults when configure() does that in a homogeneous way, vis-a-vis configuration options. This new approach doesn't make these values "configurable", it creates a parallel way of changing them, that is orthogonal to the configuration file. Not a good idea. Also, net::Defaults has all-public members, so there is absolutely no thread safety here.

It's very easy to go in the wrong direction, with uninformed changes, thinking we're making major improvement. Better to understand the current logic and reasoning, before replacing it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will put these changes in this branch (simplified net::Defaults properties cache).

Copy link
Contributor

@Ashod Ashod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @mmeeks. Looks much simpler, although still more complex than my original suggestion to stick to checkTimeout() as much as possible.

It would be great to get a confirmation if this fixes a potential disconnection of the kits. Need to know if I should stop the hunt or not.

@@ -674,7 +674,8 @@ class WebSocketHandler : public ProtocolHandlerInterface
LOG_WRN("CheckTimeout: Timeout websocket: Ping: last " << _pingMicroS.last() << "us, avg "
<< _pingMicroS.average() << "us >= " << _pingTimeout.count() << "us over "
<< (int)_pingMicroS.duration() << "s");
shutdownSilent();
if( isIPType() )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (isIPType()) ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this disconnect the internal connection between coolwsd and kit?

yes, internal connections are of type unix to the LocalServerSocket, validated.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re net::Default configuration:

  • Its configuration via configNet UT callback was added in parallel since it is beyond the WSD scope,
    where the general WSD configuration resides.
  • Lack of thread sync is due to its lifecycle the same reason as WSD configuration,
    i.e. we call configNet pre-creation of sockets etc.
  • It is correct that we had duplication of maximum sockets (ahem), which will be revised due to changed semantics. However, the build global will influence this metric. Not too great, I agree.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP: Dropping the net::Defaults instance and configNet (using configure) as requested.

@@ -674,7 +674,8 @@ class WebSocketHandler : public ProtocolHandlerInterface
LOG_WRN("CheckTimeout: Timeout websocket: Ping: last " << _pingMicroS.last() << "us, avg "
<< _pingMicroS.average() << "us >= " << _pingTimeout.count() << "us over "
<< (int)_pingMicroS.duration() << "s");
shutdownSilent();
if( isIPType() )
shutdownSilent();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this disconnect the internal connection between coolwsd and kit?

The internal connection should never be disconnected, even if it idles for hours. We should have no disconnection logic for them, whatsoever.

net/Socket.hpp Outdated
@@ -1067,7 +1063,7 @@ class StreamSocket : public Socket,
HostType hostType, ReadType readType = ReadType::NormalRead,
std::chrono::steady_clock::time_point creationTime = std::chrono::steady_clock::now() ) :
Socket(fd, type, creationTime),
_pollTimeout( net::Defaults::get().SocketPollTimeout ),
_inactivityTimeout( net::Defaults::get().InactivityTimeout ),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then we have a perfectly good way of changing defaults for UTs that I highlighted in reviews (configure()), which was ignored in favor or introducing this (unnecessary) net::Defaults (and a new function configNet()).

I really don't understand why we need configNet() that takes net::Defaults when configure() does that in a homogeneous way, vis-a-vis configuration options. This new approach doesn't make these values "configurable", it creates a parallel way of changing them, that is orthogonal to the configuration file. Not a good idea. Also, net::Defaults has all-public members, so there is absolutely no thread safety here.

It's very easy to go in the wrong direction, with uninformed changes, thinking we're making major improvement. Better to understand the current logic and reasoning, before replacing it.

@Ashod
Copy link
Contributor

Ashod commented Oct 28, 2024

By memory, it was discussed with you and/or Ash .. i.e. ping as the only reliable alive-checking instrument we have. I then explained WS native-ping vs COOL protocol-ping. We didn't discuss ICMP ping .

We discussed this exact point privately, yes. Let me quote myself:

Negative. TCP has a perfectly good way for this (with some footnotes). We simply poll the socket handle, and we get back the state, if it is closed. [...]

So it would seem I didn't quite agree that "pings are the only reliable alive-checking instrument." Indeed, we already detect closed sockets, in SocketPoll, and remove them. Here is another comment on the original PR, also questioning the approach taken.

To repeat what I've been advocating:

  • If we want to detect dead connections, relying on TCP and poll(2) seems straightforward (as we already do).
  • If we need to detect idle connections, checking the last data transfer time against a threshold (timeout) would do the trick very reliably (checkTimeout() is a good place to do it, and we already do that for HTTP GET).
  • If we need to detect slow connections, then we need both a rolling average data-rate and a threshold. We might also want to enforce this logic only when we have too many open sockets, or based on some other criteria, but not strictly at all times. Again, checkTimeout() is the obvious place to do that.

But most certainly we have to be extremely vigilant not to close WebSocket connections blindly and without justification, which are the lifelines of documents and their life-cycle.
Most critically, the internal connection between a Kit and coolwsd cannot be externally interrupted for any reason, whatsoever.

Hope this clarifies where I stand. Happy to discuss where there is a mismatch between my understanding/position and the requirements, of course.

@sgothel
Copy link

sgothel commented Oct 28, 2024

To repeat what I've been advocating:

* If we want to detect **dead** connections, relying on TCP and poll(2) seems straightforward (as we already do).

Yup.

* If we need to detect **idle** connections, checking the last data transfer time against a threshold (timeout) would do the trick very reliably (`checkTimeout()` is a good place to do it, and we already do that for HTTP GET).

We had this in place for replies.

Now we have general client inactivity (set to 1 hour, with revised commits in this branch).
(This was wrong earlier using just the 64s timeout).

The idle connection criteria was part of the requirement.

This metric is checked in StreamSocket::checkRemoval for outside facing connections,
called sequentially after virtual override checkTimeout. (edited)

* If we need to detect **slow** connections, then we need both a _rolling_ average data-rate and a threshold. 

In line with proposal #10337 and described here above.
Thank you for mentioning the rolling or moving average characteristic here (I missed to mention that yet).

We might also want to enforce this logic only when we have too many open sockets, or based on some other criteria, but not strictly at all times.

This is a great addition. Hence we would want an addition property like LowFreeSockets...,
a percentage of the used maximum sockets. The latter is the hard-cut-off for new connections but different from build-in maximum session connection, as discussed above

Again, checkTimeout() is the obvious place to do that.

Yes, this is covered with the WS ping timeout within its checkTimeout override.

This overall slow connection criteria was part of the requirement.

But most certainly we have to be extremely vigilant not to close WebSocket connections blindly and without justification, which are the lifelines of documents and their life-cycle. Most critically, the internal connection between a Kit and coolwsd cannot be externally interrupted for any reason, whatsoever.

Hope this clarifies where I stand. Happy to discuss where there is a mismatch between my understanding/position and the requirements, of course.

Yup, thank you.

Revising the work within this branch.

@sgothel
Copy link

sgothel commented Oct 28, 2024

failed desktop cypress test writer/table_operation_spec.js not reproduced locally, kicking jenkins.

Change-Id: Ifc18ae945f9946e511a394dd92d4e884e6d3ae6d
Signed-off-by: Michael Meeks <[email protected]>
We have one to the Kit process - that is allowed to take way
longer than the ping time to respond, and it was not clear that
this new close behavior on ping/ping was introduced in:

	e8a3ceb

Which also appears to add a 2 second non-responding websocket
ping timeout for some reason (!?).

Signed-off-by: Michael Meeks <[email protected]>
Signed-off-by: Sven Göthel <[email protected]>
Change-Id: Iad74f253bdac5636757b130b299b5deacda658db
@sgothel
Copy link

sgothel commented Oct 28, 2024

Rebased and added commit 4ceaeb0
(redone due to android build issue not linking NetUtil.cpp)

@mmeeks
Copy link
Contributor Author

mmeeks commented Oct 28, 2024

Fixed CI failure mode in unit tests - only execute executable coolwsd ;-) but ... same Cypress test fails locally - interestingly.

@sgothel
Copy link

sgothel commented Oct 28, 2024

Fixed CI failure mode in unit tests - only execute executable coolwsd ;-) but ... same Cypress test fails locally - interestingly.

thank you!

I got this message too late after my push .. outch, @mmeeks please be so kind and re-push, thank you. duck and hide

Avoiding this:
+ test -s ./coolwsd
+ echo 'Cleaning up...'
Cleaning up...
+ ./coolwsd --disable-cool-user-checking --cleanup
/tmp/jenkins9134351613063228334.sh: line 41: ./coolwsd: Permission denied

Signed-off-by: Michael Meeks <[email protected]>
Change-Id: I17886f1198fedd51e7b90ce1759abfcd0323835c
@sgothel
Copy link

sgothel commented Oct 28, 2024

My cleanup work continues here #10355

@mmeeks mmeeks merged commit 3229102 into master Oct 29, 2024
13 checks passed
@mmeeks mmeeks deleted the private/mmeeks/debongmore branch October 29, 2024 09:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Handle limited open Connections due to keepalive connections (cool#9621)
3 participants