Design Notes
The design consists of two parts:
- A back end responsible for maintaining and indexing the data, processing queries and updates, and calculating diffs and merges on database revisions
- A front end responsible for managing concurrency, consistency, replication, and durability (or a subset of these if they are not all needed)
More than one front end is possible, each tuned for different application needs. For example, one might implement ACID (Atomic, Consistent, Isolated, Durable) semantics while another implements BASE (Basically Available, Soft state, Eventually consistent) semantics and a third implements a hybrid.
The back end may be considered a relational database management system (RDBMS) with distributed version control system (DVCS) features for managing database revisions.
- Define a new, empty database revision by supplying a list of tables, each of which has a collection of columns and a primary key defined as a subset of those columns
- The identity of a row is defined by its primary key, and this is what we use when comparing rows while calculating diffs and merges
- Create new revisions by defining and applying SQL-style inserts, updates, and deletes
- Calculate diffs between revisions and serialize them for replication and persistence
- To generate a snapshot of an entire database, calculate a diff between the empty revision and the revision of interest
- Calculate three-way merges from revisions for concurrency control
- Define queries using SQL-style relational semantics
- Execute queries by supplying two database revisions
- The result is two sets of tuples satisfying the query constraints:
- New tuples which either appear in the second revision but not the first or which have changed from the first to the second
- Obsolete tuples which appear in the first but not the second
- Note that traditional query semantics may be achieved by specifying an empty revision as the first parameter and the database to be queried as the second
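The two-revision query semantics above can be sketched in a few lines. This is an illustrative model, not the actual implementation: a revision is reduced to a dict from primary key to row tuple, and the query to a predicate over rows.

```python
def two_revision_query(first, second, predicate):
    """Return (new, obsolete) tuple sets for a query run against two revisions.

    A tuple is "new" if it satisfies the predicate and is inserted or changed
    in the second revision; "obsolete" if it satisfies the predicate and is
    absent or changed in the second revision.
    """
    new, obsolete = set(), set()
    for pk, row in second.items():
        if predicate(row) and first.get(pk) != row:
            new.add(row)
    for pk, row in first.items():
        if predicate(row) and second.get(pk) != row:
            obsolete.add(row)
    return new, obsolete

empty = {}
db = {1: (1, "alice"), 2: (2, "bob")}
# Traditional semantics: empty revision first, so every matching
# row is "new" and nothing is obsolete.
new, obsolete = two_revision_query(empty, db, lambda row: True)
```

Passing the empty revision as the first argument recovers an ordinary query, as noted above.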
A diff operates on two database revisions and produces a set of inserts, updates, and deletes necessary to transform the first revision to the second.
- compare the roots of the two revisions (i.e. the tables of tables)
- if they're the same, the databases are identical and the diff is empty
- otherwise, iterate over each table, comparing the root nodes of the topmost primary key indexes
- if the root nodes match, the tables are the same and we can move on to the next table
- if the root nodes differ, iterate over the nodes in the index, comparing each corresponding node value
- if a node is in the first revision and not the second, add deletes to the diff for all the tuples under that node
- if a node is in the second revision and not the first, add inserts to the diff for all the tuples under that node
- if a node is in both revisions but the node value is different, descend to the next index in the primary key if applicable and repeat the above. If we've reached the end of the primary key, compare the rows.
- if the rows are the same, move on to the next pair
- if the rows differ, add an update to the diff
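The diff walk above can be sketched as follows. This is an illustrative model rather than the real index-tree implementation: a revision is a dict of tables, each table a dict from primary key to row, and structure sharing between revisions lets a cheap identity check stand in for the root-node comparison.

```python
def diff(first, second):
    """Produce inserts/updates/deletes that transform `first` into `second`.

    Because revisions share structure, `is` checks let us skip the whole
    database or whole tables without inspecting their contents.
    """
    ops = []
    if first is second:              # shared root: the databases are identical
        return ops
    for name in sorted(set(first) | set(second)):
        t1, t2 = first.get(name, {}), second.get(name, {})
        if t1 is t2:                 # shared table node: nothing changed here
            continue
        for pk in sorted(set(t1) | set(t2)):
            old, new = t1.get(pk), t2.get(pk)
            if old is None:
                ops.append(("insert", name, new))
            elif new is None:
                ops.append(("delete", name, old))
            elif old != new:
                ops.append(("update", name, old, new))
    return ops

users = {1: (1, "alice")}            # shared between both revisions
first = {"users": users, "items": {1: (1, "pen")}}
second = {"users": users, "items": {1: (1, "pencil"), 2: (2, "ink")}}
changes = diff(first, second)
```

The shared "users" table is skipped entirely by the identity check, mirroring the root-node comparison described above.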
A query diff operates on two database revisions and a query, and it produces a set of inserts, updates, and deletes necessary to transform the results of the query obtained from the first revision to those obtained from the second.
- compare the roots of the two revisions (i.e. the tables of tables)
- if they're the same, or if each corresponding table referenced in the query is the same, the diff is empty
- otherwise, execute the query on the query source
- if the source is a table reference,
- calculate a scan based on the query test
- visit each row in the scan on both table revisions simultaneously, skipping rows known to be identical in both revisions (unless we're being called recursively as the leaf of a join tree, in which case we may be requested to visit and test unchanged rows)
- if the row was updated, apply the test to each version
- if the test passes on both, include an update in the result
- if the test passes on the old but not the new, include a delete of the old row in the result
- if the test passes on the new but not the old, include an insert of the new row in the result
- if the row was deleted and it passes the test, include a delete in the result
- if the row was inserted and it passes the test, include an insert in the result
- if the source is a join,
- extract the join predicate from the query test
- recursively scan the rows of left subsource
- if the row is unchanged, do a scan on the right subsource for corresponding inserted, updated, or deleted rows (ignoring unchanged ones) according to the join predicate, and finally include any matches as inserts, updates, or deletions, respectively, of the combined rows in the result
- if the row was updated,
- do a scan on the new revision of the right subsource for corresponding rows, matching them according to the join predicate with the new row from the left subsource
- repeat the above with the old revision of the right subsource and old row from the left subsource, respectively
- for each match which appears in both of the above results, include an update of the combined rows in the result
- for each new match, include an insert
- for each obsolete match, include a delete
- if the row was inserted, do a scan on the new revision of the right subsource for corresponding rows, matching them according to the join predicate with the new row from the left subsource, and finally include each match as an insert of the combined row in the result
- if the row was deleted, do a scan on the old revision of the right subsource for corresponding rows, matching them according to the join predicate with the old row from the left subsource, and finally include each match as a delete of the combined row in the result
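The table-reference case of the query diff can be sketched as below. Again the model is illustrative (tables as dicts from primary key to row, the query test as a predicate); the join case builds on this recursively and is omitted here.

```python
def table_query_diff(old_table, new_table, test):
    """Classify each changed row by how the query test applies to its
    old and new versions, per the table-reference scan described above."""
    result = []
    for pk in sorted(set(old_table) | set(new_table)):
        old, new = old_table.get(pk), new_table.get(pk)
        if old == new:
            continue                                 # unchanged: skip
        if old is not None and new is not None:      # updated row
            if test(old) and test(new):
                result.append(("update", old, new))
            elif test(old):
                result.append(("delete", old))       # fell out of the query
            elif test(new):
                result.append(("insert", new))       # entered the query
        elif old is not None:                        # deleted row
            if test(old):
                result.append(("delete", old))
        else:                                        # inserted row
            if test(new):
                result.append(("insert", new))
    return result

old_rows = {1: (1, 5), 2: (2, 10), 3: (3, 20)}
new_rows = {1: (1, 15), 2: (2, 10), 4: (4, 30)}
changes = table_query_diff(old_rows, new_rows, lambda row: row[1] > 8)
```

Note how an update whose old version fails the test but whose new version passes becomes an insert in the query result, exactly as specified above.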
A merge operates on three database revisions (a base and two forks considered to be derived from that base) plus an application-defined conflict resolver, and produces a fourth revision which combines the modifications from the two forks.
- calculate diffs between the base and each fork as described above
- if one diff is empty, use the other one as the result
- otherwise
- for each table appearing in one diff but not the other, include those changes in the result
- for each table appearing in both diffs
- iterate over the row-level changes in the first diff
- if the second diff does not operate on the same row, include the change in the result and remove it from (a mutable clone of) the second diff
- if the second diff does operate on the same row, merge them:
- if both changes perform the same operation (e.g. insert or update with identical values, or both are deletes), then add the change to the result
- if both changes are inserts, compare each corresponding element in the tuples
- if the elements match, include that element in the result
- if the elements don't match, give up and defer to the application-defined conflict resolver for this row
- if both changes are updates, compare each corresponding element in the tuples
- if the elements match, include that element in the result
- if the elements don't match, but one of them matches the base, include the new, non-matching one in the result
- if one change is a delete and the other is an update, ignore the update and add the delete to the result
- note that it's impossible for one change to be an insert and the other an update or delete since the diffs were calculated from the same base, so we need not consider those cases
- include the merged version in the result, and remove the changes from (mutable clones of) both diffs
- iterate over the rows remaining in the second diff and add them to the result
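The element-wise merge of two updates to the same row can be sketched as below. All names are illustrative; the source only specifies deferring to the resolver for conflicting inserts, so treating an element changed differently by both forks as a resolver case is an assumption here.

```python
def merge_row_change(base, left, right, resolve):
    """Three-way merge of one row, element by element.

    `base` is the row in the common ancestor revision; `left` and `right`
    are the versions produced by the two forks. `resolve` stands in for the
    application-defined conflict resolver.
    """
    merged = []
    for b, l, r in zip(base, left, right):
        if l == r:
            merged.append(l)                  # both forks agree
        elif l == b:
            merged.append(r)                  # only the right fork changed it
        elif r == b:
            merged.append(l)                  # only the left fork changed it
        else:
            merged.append(resolve(b, l, r))   # conflicting edits (assumption)
    return tuple(merged)

# Non-conflicting case: each fork changed a different column.
row = merge_row_change((1, "a", 10), (1, "b", 10), (1, "a", 20),
                       resolve=lambda b, l, r: l)
```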
In order to make the semantics described above both memory- and time-efficient, we consider the following optimizations, ordered roughly from most important to least important:
- Allow a revision to share structure with its ancestors in inverse proportion to the difference between each one. This may be achieved using persistent data structures such as a persistent red/black tree.
- Memoize diffs and store them in a table in the base revision which maps fork revisions to diffs. This table should be held via a weak reference so that it is eligible for garbage collection under memory pressure.
- Optimize queries via join links which tie joined columns together. For example, if b.y is frequently joined to a.x, give each a.x item a direct reference to its corresponding b.y so that the join does not require an index lookup.
- When applying a patch containing multiple updates, cache the most recent index lookup so we don't have to do it again if the next update can reuse all or part of that lookup. Also consider sorting the updates in a patch ahead of time to maximize this benefit.
- Ignore no-op changes (i.e. a delete or update that has no effect)
- Allow destructive updates to limit new object creation (safer alternative: allow chaining updates together into a single patch to avoid creating persistent intermediate states)
- When executing queries, do partial diffs when we only care about a subset of tables and cache these in such a way that we can build on them to create more complete diffs as needed
- Consider automatically generating indexes and join links the first time a query or update might take advantage of them. The alternative is to require that they be created explicitly, as is traditional.
- Optimize diff and merge calculation by versioning paths through the index search trees (and across join links?) on update. If the versions match during diff calculation, we don't need to descend further.
- Remove tables and index nodes from the tree as soon as they become empty.
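The first optimization (structure sharing between revisions) can be sketched with a minimal path-copying persistent search tree. This is an illustrative toy, not the suggested red/black implementation: it does no rebalancing, and all names are hypothetical.

```python
class TreeNode:
    """One immutable node of a persistent binary search tree."""
    __slots__ = ("key", "value", "left", "right")

    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(node, key, value):
    # Path copying: rebuild only the nodes on the root-to-leaf path;
    # every other subtree is shared with the previous revision.
    if node is None:
        return TreeNode(key, value)
    if key < node.key:
        return TreeNode(node.key, node.value,
                        insert(node.left, key, value), node.right)
    if key > node.key:
        return TreeNode(node.key, node.value,
                        node.left, insert(node.right, key, value))
    return TreeNode(key, value, node.left, node.right)  # replace the value

def lookup(node, key):
    while node is not None:
        if key < node.key:
            node = node.left
        elif key > node.key:
            node = node.right
        else:
            return node.value
    return None

rev1 = insert(insert(insert(None, 2, "b"), 1, "a"), 3, "c")
rev2 = insert(rev1, 4, "d")   # rev1 is untouched; its left subtree is shared
```

A new revision costs O(depth) nodes, so closely related revisions share structure roughly in inverse proportion to the size of the diff between them, which is what makes memoized diffs and identity-based short-circuits cheap.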
As described above, there are multiple front ends possible, each implementing different concurrency, consistency, and durability semantics.
A front end has several concurrency models to choose from. In all of the following examples, we assume reads may proceed without ever blocking thanks to the MVCC model used by the back end. Thus, the only difference between these models is how concurrency is managed between multiple writers. Note also that concurrent writers may be in different threads on the same machine or on different machines.
We use the term "publish" in the following to mean making a revision public by updating a global reference to the current database revision and possibly writing a diff to a log or replication targets if applicable.
- Pessimistic locking: each writer must obtain a mutex in order to make an update
- obtain lock (blocking if the lock is already held)
- apply update and publish new database revision
- release lock
- Optimistic, rebase or STM-style concurrency: each writer may apply its update to produce a private revision and attempt to publish it using an atomic compare-and-set without blocking. If the compare-and-set fails (another writer beat us), we try the update again starting from the latest published revision.
- apply update to latest published revision
- attempt to publish result
- if this succeeds, we're done
- if this fails, we try again with the new published revision
- Optimistic, merge-based concurrency: each writer proceeds as above, except that instead of retrying the update on compare-and-set failure, we merge the latest published revision with our private one, using the old revision as the base
- apply update to latest published revision
- attempt to publish result
- if this succeeds, we're done
- if this fails, we merge and try to publish the result, repeating as necessary
The last of these models is most suitable for a distributed system of autonomous nodes, since it allows each node to publish locally without ever blocking on a distributed locking mechanism or making synchronous writes to other nodes.
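The merge-based optimistic model can be sketched as follows. `Head`, `publish`, and the set-based revision and merge model are illustrative stand-ins for the real back end; a lock makes the compare-and-set itself atomic.

```python
import threading

class Head:
    """Holds the current published revision behind an atomic compare-and-set."""

    def __init__(self, revision):
        self._revision = revision
        self._lock = threading.Lock()   # only to make the CAS atomic

    def get(self):
        with self._lock:
            return self._revision

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._revision is expected:
                self._revision = new
                return True
            return False

def publish(head, update, merge):
    """Apply `update` once; on CAS failure, merge with the latest published
    revision (using our old base as the merge base) instead of retrying."""
    base = head.get()
    mine = update(base)
    while not head.compare_and_set(base, mine):
        latest = head.get()
        mine = merge(base, latest, mine)   # three-way merge
        base = latest
    return mine

head = Head(frozenset())

def update(rev):
    # Simulate another writer publishing while we compute our revision.
    head.compare_and_set(head.get(), head.get() | {"x"})
    return rev | {"y"}

merged = publish(head, update, lambda base, latest, mine: latest | mine)
```

Note that the update itself runs only once; conflicts are absorbed by the merge, which is what lets autonomous nodes publish without blocking.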
A front end is free to implement its own definition of durability. Here are some examples:
- no persistence at all
- soft updates (i.e. asynchronous writes to a transaction log which do not block writers)
- soft updates plus durability confirmations (i.e. asynchronous writes as above, and when an update has been synced to disk we send a notification to the writer that it's been persisted)
- durable transactions (i.e. traditional, synchronous ACID updates where the writer is blocked and the update unpublished until the transaction has been synced to disk)
Note that each of these persistence models may be generalized to include not just one disk-based log, but multiple logs and remote peers. For example, we might consider an update to be durable only once we've received confirmation of receipt from all nodes in the cluster and that it has been synced to one or more (local or remote) logs.
The "soft update plus durability confirmation" model is most appealing for a distributed system of autonomous nodes since it gives us durability without adding latency.
In a distributed system of autonomous, cooperating nodes, we must ensure that each node is able to share changes with the others and incorporate any changes it receives in a way that is
- stable: no metastability or oscillations
- eventually consistent
- order-preserving: Revisions which have a well-defined ordering relationship (i.e. one is an ancestor of the other) are never in conflict and should never be treated as such. In particular, a change from a given revision must never overwrite a change from a descendant of that revision. We must therefore be careful to recognize and preserve such relationships even when updates may be received in different orders and by different paths by each node.
- memory efficient: A finite data set distributed over a finite number of nodes must not impose unbounded or excessive memory requirements for e.g. bookkeeping.
- time efficient: diff and merge calculation must not add significant latency to change propagation
Each node maintains references to the following database revisions:
- the current local "head" revision
- for each of the other nodes in the cluster,
- the remote head revision representing the remote node's state as of the most recent update we've received from that node
- an "ack map" with a key for all other nodes where each value is a pointer to a singly-linked list which contains the most recently-acknowledged revision from that node as well as all later revisions. We use this to mirror merges as we receive new acks.
This is the procedure for local updates (i.e. updates originating at the current node; not those received from other nodes):
- update the local head
- add the new revision to the revision list for each of the other nodes
- send the diff to each directly-connected node
- send an ack to all directly-connected nodes
When we receive a diff from another node:
- if we've already received this diff – or if it originated locally – ignore it
- otherwise:
- forward the diff to all other directly-connected nodes
- note: not every directly-connected node may be ready for more data, in which case we wait until it is and send it a diff which may combine the effect of multiple diffs
- send an ack to all directly-connected nodes (including the one we received the diff from)
- apply the diff to the remote head for that node
- merge the diff into the local head
When we receive an ack from another node:
- if we've already received this ack, ignore it
- otherwise:
- forward the ack to all other directly-connected nodes
- note: not every directly-connected node may be ready for more data, in which case we wait until it is and send it the most recent ack we've received at that point
- merge the revision being acknowledged into the remote head for the acknowledger, using the previously-acknowledged revision as the base
- update the reference in the "ack map" to point to the linked list cell holding the newly-acknowledged revision
- encourage tuple and index structure sharing by replacing remote heads (which have evolved based on incoming diffs) with the local head (or something based on it) whenever they match (or come close to matching)
- avoid sending explicit deletes for data implicitly deleted due to foreign key constraint cascading
- avoid sending diffs / acks to nodes that we know have gotten them (primarily, don't send a message back to the node that we got it from)
- potentially, keep a list of the nodes that have forwarded a specific diff message, and don't send to nodes on that list