Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set the client side SUB IOPub socket subscription *before* we connect #673

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

DavisVaughan
Copy link
Contributor

@DavisVaughan DavisVaughan commented Jan 23, 2025

I am hoping this fixes some very strange weirdness we've been seeing ever since we added the Welcome message handshake infrastructure. In particular, this error we see in CI a lot:

thread 'test_env_vars' panicked at crates/amalthea/src/fixtures/dummy_frontend.rs:195:9:
assertion failed: Status(JupyterMessage { zmq_identities: [], header: JupyterHeader { msg_id: "24ea361b-b116-462e-a868-47d503ae01be", session: "5ccdaa85-4179-49b6-b8e2-97b484081e7e", username: "kernel", date: "2025-01-23T21:23:15.387272+00:00", msg_type: "status", version: "5.3" }, parent_header: None, content: KernelStatus { execution_state: Starting } }) does not match Message::Welcome(data)

That error comes from here:

// Immediately block until we've received the IOPub welcome message from the XPUB
// server side socket. This confirms that we've fully subscribed and avoids
// dropping any of the initial IOPub messages that a server may send if we start
// to perform requests immediately (in particular, busy/idle messages).
// https://github.com/posit-dev/ark/pull/577
assert_matches!(Self::recv(&iopub_socket), Message::Welcome(data) => {
assert_eq!(data.content.subscription, String::from(""));
});

Essentially "I was expecting to get a Welcome message, but instead got a KernelStatus { Starting } message"

We've been confused about this because we thought that an XPUB socket was supposed to send out a Welcome message before anything else when a SUB connects, which should make this impossible. But we've been thinking about it wrong! It's not that the XPUB socket is sending the KernelStatus { Starting } message first, it's that we've already missed the Welcome message!

Here's my theory about the race condition we are dealing with:

  • XPUB bind()s first
  • SUB connect()s
  • XPUB processes that connection, and says Welcome. But SUB has not set any subscriptions, so Welcome is dropped
  • SUB subscribe()s, and then starts waiting on Welcome
  • XPUB sends out KernelStatus { Starting }
  • SUB is very confused when we get KernelStatus { Starting } instead of Welcome, resulting in the above message

The solution is so simple. Just subscribe before calling connect(), so the welcome message can't get dropped. This is even confirmed by the documentation of ZMQ_XPUB_WELCOME_MSG:

Subscriber must subscribe to the Welcome message before connecting.

In this case Subscriber = our client side SUB. So yea, that would explain it.

I even dug all the way into the zmq C++ to see where the welcome message is sent out from. It does indeed look like it is sent out exactly once, at connect() time, providing a window where we can miss it if we are not subscribed yet!
https://github.com/zeromq/libzmq/blob/34f7fa22022bed9e0e390ed3580a1c83ac4a2834/src/xpub.cpp#L56-L65

@DavisVaughan
Copy link
Contributor Author

@jmcphers I'm pretty convinced from this that kallichore should probably be subscribing before connecting as well, right here:
https://github.com/posit-dev/kallichore/blob/6a5d4877cd81b14e13e8c2a096bbb41c2cd22452/crates/kcserver/src/zmq_ws_proxy.rs#L155-L173

Which would really come into play when kallichore supports the welcome message (https://github.com/posit-dev/kallichore/issues/1) like our test client does.

Comment on lines -204 to -205
/// Note that this needs to be called *after* the socket connection is
/// established on both ends.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is wrong, based on those xpub docs quoted above

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant