
Investigation: reliable pub/sub messaging #1

Open

gtoonstra opened this issue Jun 29, 2015 · 0 comments

Pub/sub messaging is not 100% reliable. If throughput is not too high, then it can be assumed that all currently connected consumers will receive all produced messages.

Some systems, like map-reduce, do not really care about dropped messages, because they have retry mechanisms in place to resend when no reply or response arrives within an allotted time. The general mechanism there is a "pigeon"-like messaging model, where you tie a message to a pigeon and regularly check whether anything comes back:

a. Send a message to do something remotely and record in a work log that work is to be done, expecting other systems to start sending updates on the progress of that work.
b. Leave the execution context and continue doing other important work.
c. Go through each work item in the list and check that the work is progressing.
d. If a job is dead, just start it again on a different node.

So the system inherently recovers from dropped messages and other kinds of failure; no specific recovery mechanism needs to be in place.
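As an illustration, a minimal Python sketch of such a work-log loop might look as follows; `send` and `pick_other_node` are hypothetical placeholders for whatever transport and scheduling the application actually uses:

```python
import time
import uuid

JOB_TIMEOUT = 30.0  # seconds without a progress update before a job counts as dead

# The work log maps a job id to what runs where and when it was last heard from.
work_log = {}

def dispatch(job, node, send):
    """Step a: send the job to a node and record it, expecting progress updates."""
    job_id = str(uuid.uuid4())
    work_log[job_id] = {"job": job, "node": node, "last_seen": time.time()}
    send(node, {"id": job_id, "job": job})
    return job_id

def on_progress(job_id):
    """Called whenever a progress message for job_id arrives."""
    if job_id in work_log:
        work_log[job_id]["last_seen"] = time.time()

def check_jobs(pick_other_node, send):
    """Steps c/d: periodically restart any job that went quiet, on another node."""
    now = time.time()
    for job_id, entry in list(work_log.items()):
        if now - entry["last_seen"] > JOB_TIMEOUT:
            del work_log[job_id]  # give up on this copy of the job
            dispatch(entry["job"], pick_other_node(entry["node"]), send)
```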

In a "Google Pregel"-like implementation, however, there is an important guarantee:

a message is guaranteed to be delivered exactly once.

An alternative guarantee (chosen by design; the two requirements are orthogonal) is:

a message is guaranteed to be delivered one or more times.

The third option is, of course, not guaranteeing delivery at all.
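For comparison: at-least-once delivery can be turned into effectively-once processing by deduplicating on the consumer side. A minimal sketch, assuming the producer attaches a unique `id` field to every message (`process` stands in for the actual handler):

```python
# Dedup on a producer-assigned message id: redeliveries are detected and skipped.
seen_ids = set()  # in a real system this set would be bounded or persisted

def handle(message, process):
    """Process each message at most once, even if the broker redelivers it."""
    msg_id = message["id"]
    if msg_id in seen_ids:
        return  # duplicate redelivery, already handled
    seen_ids.add(msg_id)
    process(message)
```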

When message throughput increases, there will eventually be dropped messages, caused by different issues:

1. A single producer may produce too many messages for the broker, so messages are dropped by the TCP stack on the broker. The producer/broker connection is a paired, directional connection, so the in-queue there only stores messages from that single producer.
2. A single consumer may not be fast enough to process all messages sent to it, which means that messages coming from the broker may get dropped.
3. The broker may fail.
4. The producer may fail to produce messages (i.e. it is no longer connected).
5. The consumer may fail to consume messages (i.e. it is no longer connected).

The latter two, 4 & 5, should be solved at application level, or simply ignored. In a vertex application, 4 is already covered, because there is a monitor that runs a "SURVEY" process, querying the workers that are actively producing data. This relies only on the assumption that if the replies to the survey manage to reach the surveyor, other messages those workers produce should be able to reach other workers as well. This is a reasonable assumption, as all messages flow through the same broker.
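A rough sketch of that survey check, with `queue.Queue` objects standing in for the broker connections (the real system would of course use its pub/sub sockets), and each reply assumed to carry a `worker` field identifying the sender:

```python
import queue
import time

SURVEY_DEADLINE = 2.0  # seconds the monitor waits for worker replies

def run_survey(worker_channels, reply_channel, expected_workers):
    """Ask every worker to report in; silence within the deadline means dead."""
    for channel in worker_channels:
        channel.put({"type": "SURVEY"})

    alive = set()
    deadline = time.time() + SURVEY_DEADLINE
    while time.time() < deadline and len(alive) < len(expected_workers):
        remaining = max(0.0, deadline - time.time())
        try:
            reply = reply_channel.get(timeout=remaining)
            alive.add(reply["worker"])
        except queue.Empty:
            break
    return alive, set(expected_workers) - alive
```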

Number 3, broker failure, basically means that the system as a whole has broken down and should be regarded as a complete system failure.

Numbers 1 & 2 are issues that are hard to solve at application level without significant overhead and should probably be solved at infrastructure level.
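For issue 2, for example, one infrastructure-level building block is a bounded in-queue with an explicit drop policy, so overflow becomes observable instead of a silent drop inside the TCP stack. A sketch (the capacity and drop-newest policy are arbitrary choices for illustration):

```python
import collections
import threading

class BoundedBuffer:
    """Consumer-side in-queue that makes overflow visible instead of silent."""

    def __init__(self, capacity):
        self._items = collections.deque()
        self._capacity = capacity
        self._lock = threading.Lock()
        self.dropped = 0  # messages dropped due to overflow, for monitoring

    def offer(self, message):
        """Accept a message if there is room; otherwise count it as dropped."""
        with self._lock:
            if len(self._items) >= self._capacity:
                self.dropped += 1
                return False
            self._items.append(message)
            return True

    def poll(self):
        """Hand the oldest buffered message to the consumer, or None if empty."""
        with self._lock:
            return self._items.popleft() if self._items else None
```

Counting drops this way at least makes issue 2 measurable, even before a real backpressure or persistence mechanism is chosen.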

This investigation should be aimed at providing answers for 1 & 2 only. 3 is a separate case that needs its own investigation into broker failure recovery: how the queues involved in 1 & 2 are persisted away from the broker on failure, and how all clients that were connected to the broker recover automatically. 4 & 5 are issues that are left for the application to solve (statistics?).
