
More control over http keepalive and sockjs timeouts #264

Merged: 3 commits merged into master on Nov 18, 2016

Conversation

@jcheng5 (Member) commented Nov 17, 2016

A user reported disconnects when users on slow internet connections try to use apps that have large outputs (it's unrelated to how long the outputs take to calculate; it's purely the payload size that matters).

Repro

I was able to repro using the following app, plus using commands to slow down the network interface on Linux:

library(shiny)

ui <- fluidPage(
  "Hello",
  plotOutput("plot")
)

server <- function(input, output, session) {
  # Send a large payload (~22MB once serialized) that the client simply ignores
  session$sendCustomMessage("test", 1:3E6)
  output$plot <- renderPlot(plot(cars))
}

shinyApp(ui, server)

This app sends a ~22MB payload to the client (which is ignored) before rendering the plot.

Run this command on the server to slow down one of the network interfaces (tested on Ubuntu):

sudo tc qdisc add dev <network-interface> root netem delay 1000ms

(Replace <network-interface> with your interface name, e.g. eth0. If you don't know your network interface names, run ifconfig. Don't use the lo interface; that's the loopback, and slowing it affects not only client-to-SS communication but also SS-to-R, which seems to result in a different failure mode.)

To turn off the network interface slowness use this command:

sudo tc qdisc del dev <network-interface> root netem

Because we are not using the loopback adapter, you have to use a browser that's on a different machine (or if you're using a Linux VM, then a browser on the host OS will do).

You will get different results depending on what SockJS transport you use:

  • websocket: The entire payload is downloaded, but then sockjs-client immediately disconnects with a code 3000 and message "No response from heartbeat". Client does not attempt to reconnect.
  • xhr-streaming/xhr-polling: ERR_INCOMPLETE_CHUNKED_ENCODING after 45-50 seconds, I believe. Client then attempts to auto-reconnect but the session is gone.

I didn't test any other transports.

Root cause

There are several different timeouts that work in concert to disrupt this scenario. However, the bottom line is that SockJS was not designed for large amounts of latency in messages being received. There is an issue filed against sockjs-node that points to the underlying problem, which I'll describe.

From the server side, SockJS needs to know when a client connection is dead, so it can close resources that it's holding from the server side and let the application know it should execute whatever end-of-session logic it has. With xhr-streaming and xhr-polling, this obviously requires a timeout of some kind; because the client will serially issue multiple xhr-streaming or (especially) xhr-polling requests over the course of a single connection, there are bound to be times when one request has ended and we're not sure whether another one is ever going to arrive. SockJS has a disconnect_delay option that lets you control how long to wait; the default is 5 seconds. This means that when the client ends an xhr-streaming/xhr-polling request, by default it has to issue a new request within 5 seconds or else it will be considered disconnected.
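
For reference, both of these knobs correspond to options on sockjs-node's server (the new sockjs_* directives in this PR presumably feed into them). A minimal sketch; option names are from sockjs-node's documented API, and the values shown are its defaults, in milliseconds:

// Minimal sketch of configuring sockjs-node's timeouts (milliseconds).
var sockjs = require('sockjs');

var sockServer = sockjs.createServer({
  heartbeat_delay:  25 * 1000,  // interval between server->client heartbeat frames
  disconnect_delay:  5 * 1000   // grace period with no open request before the
                                // session is considered disconnected
});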

The problem is that the 5 second timer begins not when the current request disconnects, but when the last (asynchronous) message send begins. So if that message takes a long time to deliver, as in this case, the timer will have long since expired even if the client quickly starts a new xhr request.

With websockets, there's one continuous socket the whole time, so there isn't this particular problem. However, there's (apparently) a similar situation, where the client may have gone away but we never got a TCP FIN or RST or whatever; SockJS wants to know in a timely manner when this situation has occurred. The websocket transport is unique among the transport types in that when a heartbeat packet is sent from server to client (by default every 25 seconds), there is a hardcoded time limit of 10 seconds by which the client must acknowledge the heartbeat by sending a message back to the server. Similar to the xhr problem, this 10 second timer starts when the heartbeat is queued, not when it makes it to the client; and in our repro case above, it's going to be a couple of minutes between when the heartbeat is queued and when it is delivered.
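
Schematically, the websocket heartbeat mechanism works like the sketch below. This is illustrative only, not sockjs-node's actual implementation; the 25-second interval, the hardcoded 10-second limit, and the close code/message are taken from the behavior described above.

// Illustrative sketch only -- not sockjs-node's real code.
function startHeartbeat(conn) {
  var ackTimer = null;
  setInterval(function () {
    conn.write('h');              // heartbeat frame is *queued* here; on a slow
                                  // link, actual delivery can lag by minutes
    ackTimer = setTimeout(function () {
      // hardcoded 10-second limit; the client sees close code 3000,
      // "No response from heartbeat"
      conn.close(3000, 'No response from heartbeat');
    }, 10 * 1000);
  }, 25 * 1000);                  // heartbeat every 25 seconds by default
  conn.on('data', function () {   // any client message acknowledges the heartbeat
    clearTimeout(ackTimer);
  });
}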

Finally, there's one more timeout that affects the scenario above. In main.js we were calling socket.setTimeout(45*1000) on every HTTP request we receive. This was intended to be an HTTP keepalive timeout, so idle sockets don't stay connected to us too long. However, this timeout doesn't work the way I expected; it fires 45 seconds after data is written to the socket object, regardless of how long the data actually takes to send. (It was hard to find the right Node.js issue for this problem, as there seems to be some different buggy behavior that used to happen when you use the http library on top, but this seems like it may be it: nodejs/node#3319)
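
To make the observed semantics concrete, here's a simplified sketch of what main.js was effectively doing. Only the socket.setTimeout(45*1000) call reflects the actual code; the handler body and port are illustrative.

// Simplified sketch of the old behavior.
var http = require('http');

var server = http.createServer(function (req, res) {
  // Armed on every request, as main.js did. The timer counts from the last
  // activity on the socket *object* (i.e. our write() calls), not from when
  // the bytes actually reach the client.
  req.socket.setTimeout(45 * 1000, function () {
    // On a slow link this fires mid-download, even though the client is
    // still actively receiving data.
    req.socket.destroy();
  });
  // ... normal request handling continues here ...
});
server.listen(3838);  // Shiny Server's conventional port, for illustration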

Workaround

This pull request merely lets the admin configure these timeouts by hand. sockjs_disconnect_delay should be set to at least the number of seconds the longest expected payload delivery will take, and http_keepalive_timeout should be that number plus a buffer of 20% or 20 seconds, whichever is less (e.g. for a 200-second delivery: 200 + min(40, 20) = 220).

Also, disable_websockets on; should be used, since the heartbeat response timeout for websockets is hardcoded and can't be changed.

For the repro above, this works for me:

http_keepalive_timeout 220;
sockjs_disconnect_delay 200;
disable_websockets on;

Potential future work

  • response#end and socket#write both have callbacks that, in this test case at least, fire when the data is fully sent on the socket (so in this case, 2 minutes after write). So it seems like it'd be possible to implement the http_keepalive_timeout the "right" way, by monkeypatching the socket's write and setTimeout methods (see the sketch after this list).
  • It would similarly be possible to fire the disconnect timers in SockJS at the right time, but would definitely require some deep understanding of how SockJS works (I have spent some time looking already, and it wasn't obvious to me how to fix it without pretty big changes).
  • It'd be nice if SockJS's websocket heartbeat response timer were configurable, or disable-able. That seems simple enough.
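
A rough sketch of the first idea above, assuming socket.write's completion callback behaves as observed in this test case. The helper is hypothetical, not the project's code:

// Hypothetical sketch: re-arm the idle timer only once written data has
// actually drained to the client, via write's completion callback.
function installKeepalive(socket, ms) {
  var timer = null;
  function arm() {
    clearTimeout(timer);
    timer = setTimeout(function () { socket.destroy(); }, ms);
  }
  var origWrite = socket.write;
  socket.write = function () {
    clearTimeout(timer);            // never time out while a send is in flight
    var args = Array.prototype.slice.call(arguments);
    var userCb = typeof args[args.length - 1] === 'function' ? args.pop() : null;
    args.push(function () {
      arm();                        // data fully flushed; restart the clock
      if (userCb) userCb.apply(this, arguments);
    });
    return origWrite.apply(socket, args);
  };
  socket.on('data', arm);           // reads count as activity too
  arm();
}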

In multiple conversations with @trestletech and now @jmcphers we've questioned whether SockJS and websockets in general are worth the trouble, given how smoothly RStudio's long-polling works. They might not be, and switching to long-polling would be much simpler and give us much more control over these kinds of issues. However, there is the big disadvantage for SS/SSP that the number of outstanding long-polling requests you can have open to a single hostname is quite limited (usually 6), while the limit for WebSockets is much higher (around 255). It may be that HTTP/2 will solve this problem for us; it seems worth investigating.

@trestletech (Contributor) left a comment

LGTM! Just a couple of nits.

Should we note the disable_websockets on; caveat on the config settings?

}

sockjs_disconnect_delay {
desc "How long the SockJS server should wait between HTTP requests before considering the client to be disconnected.";
@trestletech (Contributor):

Can you note the default?

* Add additional configuration directives `http_keepalive_timeout`,
`sockjs_heartbeat_delay`, and `sockjs_disconnect_delay` to allow working
with very slow connections and large SockJS payloads.

@trestletech (Contributor):

Might be worth clarifying whether or not any of these default values changed from the previous release. (I'm curious)

@trestletech (Contributor):

And/or forbid larger-than-default delays when websockets are enabled, as that's almost certainly a misconfiguration?

@jcheng5 (Member, Author) commented Nov 18, 2016

Filed an issue on sockjs-node to make the websocket timeout configurable: sockjs/sockjs-node#219

@jcheng5 (Member, Author) commented Nov 18, 2016

@trestletech I'm not going to forbid or warn on larger-than-default delays with websockets enabled, for now. That's a great suggestion, but I'm not comfortable with the magnitude of the change at the moment (the logic for interpreting disable_websockets and sockjs_disconnect_delay lives in very different places in the code, AFAICT). I'm hoping the websocket value can become configurable; then we can get rid of the restriction. In the meantime, I don't think anybody is going to figure out these timeouts without talking to support anyway (or reading an article we'll need to write and put on the wiki or shiny.rstudio.com).

The verbiage suggestions have gone in. I'm going to merge now so we can have a build for that customer, but if you have other suggestions I'm happy to take them. Thanks!

@jcheng5 merged commit a6adccd into master on Nov 18, 2016
@jcheng5 deleted the joe/feature/more-timeouts branch on Nov 18, 2016 at 00:46