
### Chapter Summary

This chapter is the first of three related topics regarding storage systems, namely replication, partitioning and transactions. Replication is important because of the properties it provides:

- Keep the data geographically close to the users
- Allow the system to continue to work even if some of its parts have failed (redundancy)
- Allow scaling out to serve increasing throughput

Three common patterns are discussed, namely single-leader, multi-leader and leaderless. In the single-leader pattern, the challenges are mainly follower set-up and replication lag. For the multi-leader and leaderless patterns, the main challenge is detecting and resolving write conflicts. This chapter discusses the definition of each pattern, its challenges and various solutions in detail.

### Mind Map

#### Single-leader
💡 What are synchronous replication and asynchronous replication?

- Synchronous replication means that when a write happens, the successful response to the client is only returned after the data has been copied to all the followers.
- Asynchronous replication means that the successful response is returned to the client as soon as the write is saved on the leader. The copying from leader to followers happens asynchronously in the background.

💡 What is semi-synchronous replication and why is it good?

In practice, synchronous replication is impractical because one unavailable node blocks all writes. Semi-synchronous replication refers to the setup where data written to the leader is synchronously copied to one follower and then copied to the other followers asynchronously. This ensures that at any time, at least two nodes have an up-to-date copy of the data.
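As a rough illustration (a minimal sketch, not any particular database's implementation), a semi-synchronous write path could look like this: the leader waits for an acknowledgement from one designated follower before confirming the write, and replicates to the remaining followers in the background. The `Leader`/`Follower` classes are assumptions made up for the sketch.

```python
import threading

class Follower:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        # In a real system this would be a network call plus a durable write.
        self.data[key] = value

class Leader:
    def __init__(self, sync_follower, async_followers):
        self.data = {}
        self.sync_follower = sync_follower
        self.async_followers = async_followers

    def write(self, key, value):
        self.data[key] = value
        # Semi-synchronous: block until ONE follower has the data...
        self.sync_follower.apply(key, value)
        # ...then replicate to the rest in the background.
        for f in self.async_followers:
            threading.Thread(target=f.apply, args=(key, value)).start()
        return "ok"  # at least two nodes now hold the up-to-date value

leader = Leader(Follower("f1"), [Follower("f2"), Follower("f3")])
leader.write("user:42", {"name": "Alice"})
```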

💡 What is the trade-off of async replication?

It results in weak durability: if the leader fails before all the data has been copied to the followers, data will be lost.

💡 What are the 4 steps to setup a new follower?

- Take a consistent snapshot of the leader's database at some point in time
- Copy the snapshot to the new follower node
- The follower connects to the leader and requests all the data changes since the snapshot was taken
- The follower catches up by processing the backlog of data changes made since the snapshot (see the sketch below)
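A minimal in-memory sketch of that sequence, assuming a toy leader whose replication log is just a Python list (the `Leader`/`Follower` classes here are illustrative, not a real database API):

```python
class Leader:
    def __init__(self):
        self.log = []    # append-only replication log
        self.data = {}

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def take_snapshot(self):
        # Step 1: a consistent snapshot plus the exact log position it corresponds to
        # (PostgreSQL: log sequence number, MySQL: binlog coordinates).
        return dict(self.data), len(self.log)

class Follower:
    def __init__(self):
        self.data = {}

    def setup(self, leader):
        snapshot, position = leader.take_snapshot()
        # Step 2: copy the snapshot to the new follower node.
        self.data = snapshot
        # Step 3: request every data change made since the snapshot was taken.
        backlog = leader.log[position:]
        # Step 4: catch up by processing the backlog (then keep tailing the log).
        for key, value in backlog:
            self.data[key] = value

leader = Leader()
leader.write("k1", "v1")
follower = Follower()
follower.setup(leader)   # snapshot, copy, request changes, catch up
```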

💡 What is the log sequence number in PostgreSQL and binlog coordinate in MySQL?

These two are the same concept: they refer to the exact position in the leader's replication log that the snapshot of the database is associated with.

💡 What are the 3 steps for an automatic failover process?

- Step 1: determine that the leader has failed, normally using a ping timeout
- Step 2: choose a new leader through an election process
- Step 3: reconfigure the system to use the new leader (a toy sketch follows the list)
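As a toy illustration only (not a production election protocol), the three steps might look like this, with the replica holding the most up-to-date log chosen as the new leader; the dictionary-based cluster representation is an assumption of the sketch:

```python
import time

def detect_failure(leader, timeout=30.0):
    # Step 1: declare the leader dead if it hasn't answered pings within `timeout` seconds.
    return time.time() - leader["last_ping_ack"] > timeout

def elect_new_leader(followers):
    # Step 2: a common choice is the replica with the most up-to-date replication log.
    return max(followers, key=lambda node: node["log_position"])

def failover(cluster):
    if not detect_failure(cluster["leader"]):
        return cluster
    new_leader = elect_new_leader(cluster["followers"])
    # Step 3: reconfigure the system so clients and followers use the new leader.
    cluster["followers"].remove(new_leader)
    cluster["leader"] = new_leader
    return cluster

cluster = {
    "leader": {"name": "a", "last_ping_ack": time.time() - 60, "log_position": 100},
    "followers": [
        {"name": "b", "last_ping_ack": time.time(), "log_position": 98},
        {"name": "c", "last_ping_ack": time.time(), "log_position": 95},
    ],
}
failover(cluster)   # node "b" becomes the new leader
```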

💡 Why is manual operation preferred for handling leader failover?

- Automatically handling the unreplicated writes of the failed leader is tricky; simply discarding them might have serious consequences
- Split brain may occur, meaning multiple leaders end up running at the same time, if the failover mechanism is not carefully designed
- If the timeout is not chosen properly, false alarms may occur and result in unnecessary failovers

💡 What is split brain?

More than one leader is running in a single-leader replication architecture. If there is no process to handle write conflicts, it will result in data loss or corruption.

💡 What are the four leader-to-follower replication methods?

- Statement-based
- Write-ahead log shipping
- Logical (row-based) log
- Trigger-based

💡 In a statement-based replication approach, what conditions may break the replication?

- Any statement that calls a nondeterministic function (e.g. NOW() or RAND()) will produce a different value on each replica
- Statements that depend on existing data (e.g. auto-incrementing columns) must be executed in exactly the same order on every replica
- Statements with side effects (e.g. triggers, stored procedures) may result in different effects on each replica

💡 What is the disadvantage of write-ahead log (WAL) shipping?

The log data is stored at a very low level, which makes the replication closely coupled to the storage engine, so the leader and followers usually cannot run different versions of the database software.

💡 What are the different ways to solve the replication lag issue?

- Reading your own writes (or read-after-write consistency): 1) serve often-modified entries from the leader only 2) serve from a follower only after some time has passed since the write (e.g. 1 minute after the update) 3) the client remembers the timestamp of its most recent write and uses it to check whether the replica is sufficiently up to date (see the routing sketch below)
- Monotonic reads: the same user always reads from the same replica
- Consistent prefix reads: if writes happen in a certain order, then anyone reading those writes will see them appear in the same order
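A minimal sketch of two of these ideas: routing each user to a fixed replica for monotonic reads, and falling back to the leader for a short window after that user's own write for read-after-write consistency. The hash-based routing and the one-minute window are illustrative assumptions, not prescriptions from the book.

```python
import hashlib
import time

READ_FROM_LEADER_WINDOW = 60.0   # seconds after a user's own write (illustrative)
last_write_at = {}               # user_id -> timestamp of that user's last write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def pick_replica(user_id, followers, leader):
    # Read-your-writes: users who wrote recently read from the leader.
    if time.time() - last_write_at.get(user_id, 0) < READ_FROM_LEADER_WINDOW:
        return leader
    # Monotonic reads: hash the user id so the same user always hits the same follower.
    index = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % len(followers)
    return followers[index]

followers = ["follower-1", "follower-2", "follower-3"]
record_write("alice")
print(pick_replica("alice", followers, "leader"))  # leader (recent write)
print(pick_replica("bob", followers, "leader"))    # always the same follower
```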

#### Multi-leader
💡 What are the ways (name 4) of achieving convergent conflict resolution?

- Last write wins approach (sketched below)
- Higher-numbered replica wins
- Somehow merge the values (e.g. concatenate strings)
- Record the conflict in an explicit data structure that preserves all the information, and ask the user to resolve it at some later time
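A sketch of the two simplest convergent strategies combined: last write wins by timestamp, with the higher-numbered replica breaking ties. Note that every replica applying the same rule is what makes the outcome convergent, and that LWW silently discards the losing concurrent write — exactly the durability trade-off discussed in the book. The `Write` record is an assumption of the sketch.

```python
from dataclasses import dataclass

@dataclass
class Write:
    value: str
    timestamp: float   # wall-clock time attached to the write
    replica_id: int    # used to break timestamp ties deterministically

def resolve(conflicting_writes):
    # Last write wins; on a timestamp tie, the higher-numbered replica wins.
    return max(conflicting_writes, key=lambda w: (w.timestamp, w.replica_id))

a = Write("meeting at 10:00", timestamp=1694934000.0, replica_id=1)
b = Write("meeting at 11:00", timestamp=1694934000.0, replica_id=2)
print(resolve([a, b]).value)   # "meeting at 11:00" -- replica 2 wins the tie
```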

💡 What is the replication topology in multi-leader replication architecture?

A replication topology describes the communication paths along which writes are propagated from one node to another.

💡 What are the commonly seen multi-leader replication topologies?

Circular topology, star topology and all-to-all topology. All-to-all is the most commonly used because it avoids a single point of failure: messages can travel along different paths, so one failed node does not interrupt the flow of replication.


💡 What is the causality issue of all-to-all topology and what is the solution to that?

The causality issue refers to the problem that some operations do not arrive at the system in the expected order due to network latency, so an operation that depends on another might arrive earlier (e.g. an update statement arrives before the insert of the item it updates). Version vectors are the solution to the causality issue.

#### Leaderless
💡 How does an unavailable node catch up on the writes when it comes back online?

- Read repair: the client fetches the data from several nodes during a read, detects stale values, and writes the up-to-date value back to the stale nodes (see the sketch below)
- Anti-entropy process: a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another
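A rough sketch of read repair, assuming each stored value carries a monotonically increasing version number so the client can tell which replicas are stale; the `Replica` class is a stand-in for real nodes:

```python
def read_with_repair(key, replicas):
    # Fetch the value (and its version) from every reachable replica.
    responses = {r: r.get(key) for r in replicas}
    # The newest version is taken as the current value.
    freshest = max(responses.values(), key=lambda v: v["version"])
    # Write the up-to-date value back to every replica that returned a stale one.
    for replica, value in responses.items():
        if value["version"] < freshest["version"]:
            replica.put(key, freshest)
    return freshest["value"]

class Replica:
    def __init__(self, store):
        self.store = store
    def get(self, key):
        return self.store[key]
    def put(self, key, value):
        self.store[key] = value

replicas = [
    Replica({"k": {"value": "new", "version": 2}}),
    Replica({"k": {"value": "old", "version": 1}}),   # stale node, just back online
]
print(read_with_repair("k", replicas))   # "new"; the stale replica gets repaired
```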

💡 What is the benefit of choosing r and w to be a majority (more than n/2) of the nodes?

This choice ensures w+r>n (quorum condition) while still tolerating up to n/2 node failures

💡 If a workload pattern is few writes and many reads, what is a better config for the quorum and why?

Reduce r, say to 1, and increase w, say to n (so r + w > n still holds). In this case, we reduce the read load. However, increasing w to n has the disadvantage that a single unavailable node will cause writes to fail.
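A small helper for reasoning about such configurations — just a sketch for checking the quorum condition and the fault tolerance of each side, not a client library:

```python
def check_quorum(n, w, r):
    """Report whether w + r > n holds and how many down nodes each operation tolerates."""
    return {
        "quorum_condition_holds": w + r > n,  # at least one read node overlaps the write set
        "writes_tolerate_failures": n - w,    # writes still succeed with up to n - w nodes down
        "reads_tolerate_failures": n - r,     # reads still succeed with up to n - r nodes down
    }

print(check_quorum(n=3, w=2, r=2))  # majority quorum: tolerates 1 failure on both sides
print(check_quorum(n=3, w=3, r=1))  # read-optimized: any single down node blocks writes
```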

💡 Define strict quorum reads and writes

In a strict quorum, reads and writes always go to the designated n home nodes for a value, and require r and w successful responses with w + r > n. In contrast, in a sloppy quorum (used in a large cluster with significantly more than n nodes), writes and reads still require w and r successful responses, but those responses may come from nodes outside the designated n home nodes for the value.
> Leaderless architecture became popular after Amazon used it for its in-house *Dynamo* system (the paper was written in 2007). [Riak](https://github.com/basho/riak), [Cassandra](https://github.com/apache/cassandra) and [Voldemort](https://github.com/voldemort/voldemort) are open-source datastores with leaderless replication models inspired by Dynamo, so this kind of database is also known as *Dynamo-style*.
The quorum is the essential concept to guarantee data consistency in Dynamo-style data storages. It is defined as follows:
> If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we're reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes.
![ddia-quorum.excalidraw](https://s3.ap-southeast-1.amazonaws.com/littlecheesecake.me/blog-post/ddia-5/ddia-quorum.excalidraw.png)

We can smartly choose w and r in different use cases:
- For a write-intensive use case, we can choose a smaller w (but a larger r) to reduce the write latencies. Vice versa for a read-intensive use case.
- Generally, we can choose w > n/2 and r > n/2 so the system can tolerate up to n/2 node failures.

However, even with w + r > n, there are likely to be edge cases where stale values are returned. Dynamo-style databases are generally optimized for use cases that can tolerate eventual consistency (BASE over ACID).

### Extended Topics
#### Version Vector
A version vector is a simple algorithm to capture the happens-before relationship between writes.
An important aspect to consider when reasoning about the scalability of these version vectors is the existence of three different orders of magnitude at play:
- a **small number** of replica nodes for each key
- a **large number** of server nodes
- a **huge number** of clients, keys and issued operations

Thus a scalable solution should avoid mechanisms that are linear in the highest magnitude and, if possible, even try to match the lowest scale.

Here's a summary and comparison of the different version vector approaches:

##### Version vector with causal histories
Causal histories (e.g. {a1, a2}) are maintained in this version vector solution, but this is **not useful in practical systems** because they scale linearly with the number of updates.

![version-vector_1.excalidraw](https://s3.ap-southeast-1.amazonaws.com/littlecheesecake.me/blog-post/ddia-5/version-vector_1.excalidraw.png)

Expand All @@ -191,7 +196,7 @@ A concise version vector can be achieved by storing all the update-to-date serve
![version-vector-3.excalidraw](https://s3.ap-southeast-1.amazonaws.com/littlecheesecake.me/blog-post/ddia-5/version-vector-3.excalidraw.png)
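A minimal sketch of how per-server version vectors are compared and merged: one write happens-before another if its vector is dominated on every entry; otherwise the two writes are concurrent and must be kept as siblings or merged. The dictionary representation and server names are assumptions made for the sketch.

```python
def dominates(a, b):
    """True if vector `a` has seen everything in `b` (b happened-before a, or they are equal)."""
    servers = set(a) | set(b)
    return all(a.get(s, 0) >= b.get(s, 0) for s in servers)

def compare(a, b):
    if dominates(a, b) and not dominates(b, a):
        return "a supersedes b"
    if dominates(b, a) and not dominates(a, b):
        return "b supersedes a"
    if a == b:
        return "equal"
    return "concurrent"   # neither saw the other: keep both values as siblings

def merge(a, b):
    # Element-wise maximum: the resulting vector has seen both writes.
    return {s: max(a.get(s, 0), b.get(s, 0)) for s in set(a) | set(b)}

v1 = {"server_a": 2, "server_b": 0}   # written after seeing 2 updates on server_a
v2 = {"server_a": 1, "server_b": 1}   # written after seeing 1 update on each server
print(compare(v1, v2))   # "concurrent" -- a conflict the application must resolve
print(merge(v1, v2))     # {'server_a': 2, 'server_b': 1}
```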

##### Version vector with per-client entry
The version vectors with one entry per replica node are not enough to track causality among concurrent clients. One natural approach is to track causality by using version vectors with one entry per client. In this implementation, each client needs to maintain a counter, increment it, and provide it in each PUT; otherwise we will encounter the problem illustrated below. This approach is not very practical, mainly because it is a big challenge for the storage system to keep track of a stable set of clients - the application can run concurrent and dynamic threads, which are difficult to identify and track.

![version-vector-4.excalidraw](https://s3.ap-southeast-1.amazonaws.com/littlecheesecake.me/blog-post/ddia-5/version-vector-4.excalidraw.png)

##### Dotted version vector
The example shows how the version vectors are updated for each update:
![version-vector-5.excalidraw](https://s3.ap-southeast-1.amazonaws.com/littlecheesecake.me/blog-post/ddia-5/version-vector-5.excalidraw.png)


As we can see, popular databases like DynamoDB and Cassandra use simpler versions of the version vectors in practice. This is a very common engineering design pattern: in reality, production-level systems make compromises for performance and engineering simplicity. A dotted version vector implementation can be found in Riak.

### Further Reading
- [Detection of Mutual Inconsistency in Distributed Systems - the 1983 classic paper on version vectors](https://www.cs.purdue.edu/homes/bb/cs542-11Spr/Parker_TSE83.pdf)