Skip to content

Sequences and Durability

Jens Alfke edited this page May 9, 2013 · 1 revision

One of the problems with implementing the _changes feed in a Sync Gateway cluster is establishing a consistent ordering (sequencing) of revisions. If we get this wrong, we can end up skipping revisions in the feed and the client won't find out about them (until the same doc is updated again), which is Very Bad.

Couchbase doesn't make this easy for us. We have two ways to find revisions: the changes view and the server's TAP feed. We need the view as a stable historical record, and we need the feed to get low latency so we can report changes immediately. But this causes some problems:

  1. The TAP feed notifies of documents added in-memory, before they're persisted to disk. So at the time the gateway learns of a new revision, it will probably not show up in the view results yet. The gateway needs to wait to query the view; but it doesn't know how long it'll take before the document is persisted.
  2. Different documents are stored on different server nodes, which have their own timetables for persisting them. So it's possible that one revision might take longer to write to disk than a subsequent one, in which case for a brief time the view results will have a gap in the sequence numbering.
  3. Since a client's consecutive _changes requests might hit different gateway nodes, they all need to generate a sufficiently-consistent partial ordering of the feed, otherwise there'll be no good way to interpret the since query parameter.

We assume that sequence numbers can be generated quickly enough that every document revision can be assigned a unique one, and that the numbers increment by 1.

We define a special document that stores the "checkpoint", the most recent durable sequence number. The invariant is that all revisions with sequences ≤ the checkpoint are safely stored on persistent storage and will show up in views. As time goes on this document is updated with larger sequence numbers.

Flow Of Control

A gateway node on startup reads the checkpoint document, queries the all-changes view for the most recent changes, and registers with the TAP feed.

It maintains a data structure that conceptually contains parallel arrays of sequence numbers; one is the numbers received in docs from the TAP feed, the other the numbers seen in the view results. Sequences less than the checkpoint can be ignored.

Example

Here's what these arrays might look like. The checkpoint sequence is 3, so arrays start after that.

TAP:  | 4 5 6   8
View: |   5 6   8

In other words, sequences up through 3 are known to be persistent. We've seen every following sequence up through 8, except for 7, show up in the TAP feed; but of those only 5, 6 and 8 have appeared in the latest view query.

This implies that the server node that stores 4 hasn't flushed it to disk yet, and that the gateway node adding 7 has acquired the sequence number but hasn't yet sent it to the Couchbase server (or the server node hasn't added it to RAM yet, or it has but the local gateway hasn't received the TAP entry yet.)

Updating The Checkpoint

A gateway node will write all data documents in "observe" mode -- that way when the write completes, it knows the document has been persisted to disk.

Afterwards, if the sequence number of the document is one plus the current checkpoint, it should increment the checkpoint and save the new value to its document.

Meanwhile, if the TAP feed reports that the checkpoint document has been updated by some other node, read the new checkpoint from the document.

Either way, after the checkpoint is incremented, sequences up to the new checkpoint can be trimmed from both arrays.

Querying The View

The view query should be re-run (with stale=false) after a new revision is persisted to disk. The issues are:

  • How do we know when this happens, to revisions written by other gateway nodes? The TAP feed notification comes too early, and the only node that can find out when persistence occurs is the one that wrote the document, so the others' can't tell.
  • Updating the index is quite expensive. If lots of documents are being updated, we should try to batch them together instead of updating the index after each one.
  • It's wasteful for all the gateway nodes to run the query. Fortunately the re-indexing will only occur once (per new revision) but scanning the b-tree and sending the response are relatively expensive too.

[A note on Couchbase view performance: it doesn't scale with Couchbase cluster size, because queries have to be run in parallel on every node. It's limited by the performance of the slowest individual node.]

Would it be best to capture the view output (at least for the most recent sequences) in a document? If one node runs the query and updates the doc, the rest of the nodes will see the change and can very cheaply get the new view output. We can use a document lock to arbitrate this job so that only one node at a time runs the query.

Public Sequence Numbers

We want to push new revisions out to clients as quickly as possible to minimize latency for apps like chat. That means sending them as soon as they appear in the TAP feed. But that means we won't necessarily be sending them in order of sequence number. This is a problem, because when the client reconnects to the changes feed, the sequence it sends in the since parameter will be the highest sequence it's seen, but that doesn't mean it's seen all sequences before it, so we won't know where to start.

Our solution is to make the sequence IDs shown to the client a bit more complex and a bit less specific. In cases where there's a known gap in the sequence numbering received over TAP, we don't report the actual sequence number. Instead we report the checkpoint (the latest persistent sequence) and append a suffix that denotes the set of subsequence sequences we've seen.

One idea for that suffix is to generate a binary number by treating every subsequence sequence as a bit, whose value is 0 if that sequence is missing and 1 if present, and then treating that as a binary number reading in the opposite direction (first sequence is 1, next is 2, etc.) For example, in the above example the checkpoint is 3 and the following bits are 01101, which read backwards is 10110 = 0x16. So the sequence ID for this state would be "3+16". This uniquely identifies the set of known sequences, so when receiving it from a client we can easily tell whether there's anything new to send back.

An alternate proposal is just to concatenate the highest known sequence onto the checkpoint, so in the above example the public sequence would be "3-8". This is simpler, but obviously not unique; when receiving such a string in the since parameter we can't tell whether or not anything has changed since. Instead we have to send back everything we have after sequence 3. (This may be wasteful but it won't cause incorrect behavior because the client will just ignore revisions it already has.)

In the event that there is no gap in the sequence numbering seen from TAP (which will be case any time there haven't been any very-recent writes), the last sequence and the checkpointed sequence will be the same, and we can just send out a regular sequence number.

Clone this wiki locally