
Investigation: reliable pub/sub messaging #1

Open

gtoonstra opened this issue Jun 29, 2015 · 0 comments

Pub/sub messaging is not 100% reliable. If throughput is not too high, then it can be assumed that all currently connected consumers will receive all produced messages.

Some systems, like map-reduce, do not really care about dropped messages, because they have retry mechanisms in place to resend when no reply or response arrives within an allotted time. The general mechanism there is a "pigeon"-like messaging model, where you tie a message to a pigeon and regularly check whether anything comes back:

a. Send a message to do something remotely and record in a work log that work is to be done, expecting other systems to start sending updates on the progress of that work.
b. Leave the execution context and continue doing other important work.
c. Go through each work item in the list and check that the work is progressing.
d. If a job is dead, just start it again on a different node.

So the system inherently recovers from dropped messages and other kinds of failure; no specific recovery mechanism needs to be in place.
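As an illustration, a minimal Python sketch of such a work-log loop might look as follows; `send` and `pick_other_node` are hypothetical placeholders for whatever transport and scheduling the application actually uses:

```python
import time
import uuid

JOB_TIMEOUT = 30.0  # seconds without a progress update before a job counts as dead

# The work log maps a job id to what runs where and when it was last heard from.
work_log = {}

def dispatch(job, node, send):
    """Step a: send the job to a node and record it, expecting progress updates."""
    job_id = str(uuid.uuid4())
    work_log[job_id] = {"job": job, "node": node, "last_seen": time.time()}
    send(node, {"id": job_id, "job": job})
    return job_id

def on_progress(job_id):
    """Called whenever a progress message for job_id arrives."""
    if job_id in work_log:
        work_log[job_id]["last_seen"] = time.time()

def check_jobs(pick_other_node, send):
    """Steps c/d: periodically restart any job that went quiet, on another node."""
    now = time.time()
    for job_id, entry in list(work_log.items()):
        if now - entry["last_seen"] > JOB_TIMEOUT:
            del work_log[job_id]  # give up on this copy of the job
            dispatch(entry["job"], pick_other_node(entry["node"]), send)
```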

In a "Google Pregel"-like implementation, however, there is an important guarantee:

a message is guaranteed to be delivered exactly once.

An alternative guarantee (chosen by design; the two requirements are orthogonal) is:

a message is guaranteed to be delivered one or more times.

The third option is, of course, not guaranteeing delivery at all.
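For comparison: at-least-once delivery can be turned into effectively-once processing by deduplicating on the consumer side. A minimal sketch, assuming the producer attaches a unique `id` field to every message (`process` stands in for the actual handler):

```python
# Dedup on a producer-assigned message id: redeliveries are detected and skipped.
seen_ids = set()  # in a real system this set would be bounded or persisted

def handle(message, process):
    """Process each message at most once, even if the broker redelivers it."""
    msg_id = message["id"]
    if msg_id in seen_ids:
        return  # duplicate redelivery, already handled
    seen_ids.add(msg_id)
    process(message)
```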

When message throughput increases, there will eventually be dropped messages, caused by different issues:

1. A single producer may produce too many messages for the broker, so messages are dropped by the TCP stack on the broker. The producer/broker connection is a paired, directional connection, so the in-queue there only stores messages from that single producer.
2. A single consumer may not be fast enough to process all messages sent to it, which means that messages coming from the broker may get dropped.
3. The broker may fail.
4. The producer may fail to produce messages (i.e. it is no longer connected).
5. The consumer may fail to consume messages (i.e. it is no longer connected).

The latter two, 4 & 5, should be solved at application level, or simply ignored. In a vertex application, 4 is already covered, because there is a monitor that runs a "SURVEY" process, querying the workers that are actively producing data. This relies only on the assumption that if the replies to the survey manage to reach the surveyor, other messages those workers produce should be able to reach other workers as well. This is a reasonable assumption, as all messages flow through the same broker.
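A rough sketch of that survey check, with `queue.Queue` objects standing in for the broker connections (the real system would of course use its pub/sub sockets), and each reply assumed to carry a `worker` field identifying the sender:

```python
import queue
import time

SURVEY_DEADLINE = 2.0  # seconds the monitor waits for worker replies

def run_survey(worker_channels, reply_channel, expected_workers):
    """Ask every worker to report in; silence within the deadline means dead."""
    for channel in worker_channels:
        channel.put({"type": "SURVEY"})

    alive = set()
    deadline = time.time() + SURVEY_DEADLINE
    while time.time() < deadline and len(alive) < len(expected_workers):
        remaining = max(0.0, deadline - time.time())
        try:
            reply = reply_channel.get(timeout=remaining)
            alive.add(reply["worker"])
        except queue.Empty:
            break
    return alive, set(expected_workers) - alive
```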

Number 3, broker failure, basically means that the system as a whole has broken down and should be regarded as a complete system failure.

Numbers 1 & 2 are issues that are hard to solve at application level without significant overhead and should probably be solved at infrastructure level.
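For issue 2, for example, one infrastructure-level building block is a bounded in-queue with an explicit drop policy, so overflow becomes observable instead of a silent drop inside the TCP stack. A sketch (the capacity and drop-newest policy are arbitrary choices for illustration):

```python
import collections
import threading

class BoundedBuffer:
    """Consumer-side in-queue that makes overflow visible instead of silent."""

    def __init__(self, capacity):
        self._items = collections.deque()
        self._capacity = capacity
        self._lock = threading.Lock()
        self.dropped = 0  # messages dropped due to overflow, for monitoring

    def offer(self, message):
        """Accept a message if there is room; otherwise count it as dropped."""
        with self._lock:
            if len(self._items) >= self._capacity:
                self.dropped += 1
                return False
            self._items.append(message)
            return True

    def poll(self):
        """Hand the oldest buffered message to the consumer, or None if empty."""
        with self._lock:
            return self._items.popleft() if self._items else None
```

Counting drops this way at least makes issue 2 measurable, even before a real backpressure or persistence mechanism is chosen.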

This investigation should be aimed at providing answers for 1 & 2 only. 3 is a separate case that needs its own investigation into broker failure recovery: how the queues involved in 1 & 2 are persisted away from the broker on failure, and how all clients that were connected to the broker recover automatically. 4 & 5 are issues that are left for the application to solve (statistics?).
