
More control over http keepalive and sockjs timeouts #264

Merged: 3 commits merged into master on Nov 18, 2016

Conversation

@jcheng5 (Member) commented Nov 17, 2016

A user reported disconnects when users on slow internet connections try to use apps that have large outputs (it's unrelated to how long the outputs take to calculate; it's purely the payload size that matters).

Repro

I was able to repro using the following app, plus using commands to slow down the network interface on Linux:

library(shiny)

ui <- fluidPage(
  "Hello",
  plotOutput("plot")
)

server <- function(input, output, session) {
  # Send a large payload (~22MB once serialized) that the client simply ignores
  session$sendCustomMessage("test", 1:3E6)
  output$plot <- renderPlot(plot(cars))
}

shinyApp(ui, server)

This app sends a ~22MB payload to the client (which is ignored) before rendering the plot.

Run this command on the server to slow down one of the network interfaces (tested on Ubuntu):

sudo tc qdisc add dev <network-interface> root netem delay 1000ms

(Replace <network-interface> with your interface name, e.g. eth0. If you don't know your network interface names, run ifconfig. Don't use the lo interface; that's the loopback, and slowing it affects not only client-to-SS communication but also SS-to-R, which seems to result in a different failure mode.)

To turn off the network interface slowness use this command:

sudo tc qdisc del dev <network-interface> root netem

Because we are not using the loopback adapter, you have to use a browser that's on a different machine (or if you're using a Linux VM, then a browser on the host OS will do).

You will get different results depending on what SockJS transport you use:

  • websocket: The entire payload is downloaded, but then sockjs-client immediately disconnects with a code 3000 and message "No response from heartbeat". Client does not attempt to reconnect.
  • xhr-streaming/xhr-polling: ERR_INCOMPLETE_CHUNKED_ENCODING after 45-50 seconds, I believe. Client then attempts to auto-reconnect but the session is gone.

I didn't test any other transports.

Root cause

There are several different timeouts that work in concert to disrupt this scenario. However, the bottom line is that SockJS was not designed for large amounts of latency in messages being received. There is an issue filed against sockjs-node that points to the underlying problem, which I'll describe.

From the server side, SockJS needs to know when a client connection is dead, so it can close resources that it's holding from the server side and let the application know it should execute whatever end-of-session logic it has. With xhr-streaming and xhr-polling, this obviously requires a timeout of some kind; because the client will serially issue multiple xhr-streaming or (especially) xhr-polling requests over the course of a single connection, there are bound to be times when one request has ended and we're not sure whether another one is ever going to arrive. SockJS has a disconnect_delay option that lets you control how long to wait; the default is 5 seconds. This means that when the client ends an xhr-streaming/xhr-polling request, by default it has to issue a new request within 5 seconds or else it will be considered disconnected.
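
For reference, both of these knobs correspond to options on sockjs-node's server (the new sockjs_* directives in this PR presumably feed into them). A minimal sketch; option names are from sockjs-node's documented API, and the values shown are its defaults, in milliseconds:

// Minimal sketch of configuring sockjs-node's timeouts (milliseconds).
var sockjs = require('sockjs');

var sockServer = sockjs.createServer({
  heartbeat_delay:  25 * 1000,  // interval between server->client heartbeat frames
  disconnect_delay:  5 * 1000   // grace period with no open request before the
                                // session is considered disconnected
});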

The problem is that the 5 second timer begins not when the current request disconnects, but when the last (asynchronous) message send begins. So if that message takes a long time to deliver, as in this case, the timer will have long since expired even if the client quickly starts a new xhr request.

With websockets, there's one continuous socket the whole time, so there isn't this particular problem. However, there's (apparently) a similar situation, where the client may have gone away but we never got a TCP FIN or RST or whatever; SockJS wants to know in a timely manner when this situation has occurred. The websocket transport is unique among the transport types in that when a heartbeat packet is sent from server to client (by default every 25 seconds), there is a hardcoded time limit of 10 seconds by which the client must acknowledge the heartbeat by sending a message back to the server. Similar to the xhr problem, this 10 second timer starts when the heartbeat is queued, not when it makes it to the client; and in our repro case above, it's going to be a couple of minutes between when the heartbeat is queued and when it is delivered.
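
Schematically, the websocket heartbeat mechanism works like the sketch below. This is illustrative only, not sockjs-node's actual implementation; the 25-second interval, the hardcoded 10-second limit, and the close code/message are taken from the behavior described above.

// Illustrative sketch only -- not sockjs-node's real code.
function startHeartbeat(conn) {
  var ackTimer = null;
  setInterval(function () {
    conn.write('h');              // heartbeat frame is *queued* here; on a slow
                                  // link, actual delivery can lag by minutes
    ackTimer = setTimeout(function () {
      // hardcoded 10-second limit; the client sees close code 3000,
      // "No response from heartbeat"
      conn.close(3000, 'No response from heartbeat');
    }, 10 * 1000);
  }, 25 * 1000);                  // heartbeat every 25 seconds by default
  conn.on('data', function () {   // any client message acknowledges the heartbeat
    clearTimeout(ackTimer);
  });
}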

Finally, there's one more timeout that affects the scenario above. In main.js we were calling socket.setTimeout(45*1000) on every HTTP request we receive. This was intended to be an HTTP keepalive timeout, so idle sockets don't stay connected to us too long. However, this timeout doesn't work the way I expected; it fires 45 seconds after data is written to the socket object, regardless of how long the data actually takes to send. (It was hard to find the right Node.js issue for this problem, as there seems to be some different buggy behavior that used to happen when you use the http library on top, but this seems like it may be it: nodejs/node#3319)
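
To make the observed semantics concrete, here's a simplified sketch of what main.js was effectively doing. Only the socket.setTimeout(45*1000) call reflects the actual code; the handler body and port are illustrative.

// Simplified sketch of the old behavior.
var http = require('http');

var server = http.createServer(function (req, res) {
  // Armed on every request, as main.js did. The timer counts from the last
  // activity on the socket *object* (i.e. our write() calls), not from when
  // the bytes actually reach the client.
  req.socket.setTimeout(45 * 1000, function () {
    // On a slow link this fires mid-download, even though the client is
    // still actively receiving data.
    req.socket.destroy();
  });
  // ... normal request handling continues here ...
});
server.listen(3838);  // Shiny Server's conventional port, for illustration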

Workaround

This pull request merely lets the admin configure these timeouts by hand. sockjs_disconnect_delay should be set to at least the number of seconds the longest expected payload delivery will take, and http_keepalive_timeout should be that number plus a buffer of 20% or 20 seconds, whichever is less (e.g. for a 200-second delivery: 200 + min(40, 20) = 220).

Also, disable_websockets on; should be used, since the heartbeat response timeout for websockets is hardcoded and can't be changed.

For the repro above, this works for me:

http_keepalive_timeout 220;
sockjs_disconnect_delay 200;
disable_websockets on;

Potential future work

  • response#end and socket#write both have callbacks that, in this test case at least, fire when the data is fully sent on the socket (so in this case, 2 minutes after write). So it seems like it'd be possible to implement the http_keepalive_timeout the "right" way, by monkeypatching the socket's write and setTimeout methods (see the sketch after this list).
  • It would similarly be possible to fire the disconnect timers in SockJS at the right time, but would definitely require some deep understanding of how SockJS works (I have spent some time looking already, and it wasn't obvious to me how to fix it without pretty big changes).
  • It'd be nice if SockJS's websocket heartbeat response timer were configurable, or disable-able. That seems simple enough.
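
A rough sketch of the first idea above, assuming socket.write's completion callback behaves as observed in this test case. The helper is hypothetical, not the project's code:

// Hypothetical sketch: re-arm the idle timer only once written data has
// actually drained to the client, via write's completion callback.
function installKeepalive(socket, ms) {
  var timer = null;
  function arm() {
    clearTimeout(timer);
    timer = setTimeout(function () { socket.destroy(); }, ms);
  }
  var origWrite = socket.write;
  socket.write = function () {
    clearTimeout(timer);            // never time out while a send is in flight
    var args = Array.prototype.slice.call(arguments);
    var userCb = typeof args[args.length - 1] === 'function' ? args.pop() : null;
    args.push(function () {
      arm();                        // data fully flushed; restart the clock
      if (userCb) userCb.apply(this, arguments);
    });
    return origWrite.apply(socket, args);
  };
  socket.on('data', arm);           // reads count as activity too
  arm();
}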

In multiple conversations with @trestletech and now @jmcphers we've questioned whether SockJS and websockets in general are worth the trouble, given how smoothly RStudio's long-polling works. They might not be, and switching to long-polling would be much simpler and give us much more control over these kinds of issues. However, there is the big disadvantage for SS/SSP that the number of outstanding long-polling requests you can have open to a single hostname is quite limited (usually 6), while the limit for WebSockets is much higher (around 255). It may be that HTTP/2 will solve this problem for us; it seems worth investigating.

@trestletech (Contributor) left a comment

LGTM! Just a couple of nits.

Should we note the disable_websockets on; caveat on the config settings?

}

sockjs_disconnect_delay {
desc "How long the SockJS server should wait between HTTP requests before considering the client to be disconnected.";
@trestletech (Contributor):

Can you note the default?

* Add additional configuration directives `http_keepalive_timeout`,
`sockjs_heartbeat_delay`, and `sockjs_disconnect_delay` to allow working
with very slow connections and large SockJS payloads.

@trestletech (Contributor):

Might be worth clarifying whether or not any of these default values changed from the previous release. (I'm curious)

@trestletech (Contributor):

And/or forbid larger-than-default delays when websockets are enabled, as that's almost certainly a misconfiguration?

@jcheng5 (Member, Author) commented Nov 18, 2016

Filed an issue on sockjs-node to make the websocket timeout configurable: sockjs/sockjs-node#219

@jcheng5 (Member, Author) commented Nov 18, 2016

@trestletech I'm not going to forbid or warn on larger-than-default delays with websockets enabled, for now. That's a great suggestion, but I'm not comfortable with the magnitude of the change at the moment (the logic for interpreting disable_websockets and sockjs_disconnect_delay lives in very different places in the code, AFAICT). I'm hoping the websocket value can become configurable; then we can get rid of the restriction. In the meantime, I don't think anybody is going to figure out these timeouts without talking to support anyway (or reading an article we'll need to write and put on the wiki or shiny.rstudio.com).

The verbiage suggestions have gone in. I'm going to merge now so we can have a build for that customer, but if you have other suggestions I'm happy to take them. Thanks!

@jcheng5 merged commit a6adccd into master on Nov 18, 2016
@jcheng5 deleted the joe/feature/more-timeouts branch on Nov 18, 2016 at 00:46