
Commit e9d71e3

zhouyueyue1115, yikeke, TomShawn, SunRunAway, and ti-srebot
authored
commit-message: add file: tidb-storage.md (pingcap#3071)
* commit-message: add file: tidb-storage.md
  Signed-off-by: zhouyueyue1115 <[email protected]>
* refine format and wording
* Update TOC.md
* Apply suggestions from code review
  LGTM
  Co-authored-by: TomShawn <[email protected]>
* Apply suggestions from code review
* refine format
* Update tidb-storage.md
* Update TOC.md
* Apply suggestions from code review
  Co-authored-by: Feng Liyuan <[email protected]>

Co-authored-by: Keke Yi <[email protected]>
Co-authored-by: TomShawn <[email protected]>
Co-authored-by: Feng Liyuan <[email protected]>
Co-authored-by: ti-srebot <[email protected]>
1 parent 7f81045 commit e9d71e3

6 files changed (+108, -0 lines)

TOC.md (+1)

@@ -193,6 +193,7 @@
+ Reference
+ Cluster Architecture
+ [Overview](/tidb-architecture.md)
+ [Storage](/tidb-storage.md) (added)
+ Key Monitoring Metrics
+ [Overview](/grafana-overview-dashboard.md)
+ [TiDB](/grafana-tidb-dashboard.md)

media/tidb-storage-1.png (14.7 KB)

media/tidb-storage-2.png (10.4 KB)

media/tidb-storage-3.png (344 KB)

media/tidb-storage-architecture.png (171 KB)

tidb-storage.md (+107)

@@ -0,0 +1,107 @@
---
title: Storage of TiDB database
summary: Understand the storage layer of a TiDB database
---

# Storage of TiDB database

This document introduces some design ideas and key concepts of TiKV.

![storage-architecture](/media/tidb-storage-architecture.png)
## Key-Value pairs

The first thing to decide for a data storage system is the data storage model, that is, in what form the data is saved. TiKV's choice is the Key-Value model, and it provides an ordered traversal method. There are two key points about the TiKV data storage model:

+ It is a huge Map (similar to `std::map` in C++) that stores Key-Value pairs.
+ The Key-Value pairs in the Map are ordered according to the Keys' binary order, which means you can Seek the position of a particular Key and then call the Next method to get the Key-Value pairs larger than this Key in increasing order (see the sketch after this list).

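To make the Seek-and-Next idea concrete, here is a minimal sketch in Rust (TiKV's implementation language), using the standard library's `BTreeMap` as a stand-in for TiKV's ordered Key-Value map; the keys and values are made up for illustration:

```rust
use std::collections::BTreeMap;

fn main() {
    // An ordered map of Key-Value pairs, sorted by the keys' binary order.
    let mut store: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();
    store.insert(b"user1".to_vec(), b"alice".to_vec());
    store.insert(b"user2".to_vec(), b"bob".to_vec());
    store.insert(b"user3".to_vec(), b"carol".to_vec());

    // "Seek" to the first Key >= b"user2", then "Next" through the rest in key order.
    for (key, value) in store.range(b"user2".to_vec()..) {
        println!(
            "{} -> {}",
            String::from_utf8_lossy(key),
            String::from_utf8_lossy(value)
        );
    }
}
```

In real TiKV the map is persistent and distributed across many machines, but the Seek/Next access pattern is the same idea.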
Note that the KV storage model for TiKV described in this document has nothing to do with SQL tables. This document does not discuss any SQL-related concepts and focuses only on how to implement a high-performance, highly reliable, distributed Key-Value storage such as TiKV.
## Local storage (RocksDB)

For any persistent storage engine, data is eventually saved on disk, and TiKV is no exception. However, TiKV does not write data directly to the disk; instead, it stores data in RocksDB, which is responsible for the data storage. The reason is that developing a standalone storage engine costs a lot, especially a high-performance standalone engine that requires careful optimization.

RocksDB is an excellent standalone storage engine open-sourced by Facebook. This engine can meet TiKV's various requirements for a standalone engine. Here, you can simply consider RocksDB as a single persistent Key-Value Map.

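As a rough illustration of "a single persistent Key-Value Map", the following sketch uses the third-party `rocksdb` crate (Rust bindings for RocksDB) to write and read back one pair; the path and keys are illustrative and error handling is simplified:

```rust
use rocksdb::DB;

fn main() -> Result<(), rocksdb::Error> {
    // Open (or create) a RocksDB instance at an illustrative local path.
    let db = DB::open_default("/tmp/rocksdb-demo")?;

    // The engine behaves like a persistent Key-Value map.
    db.put(b"user1", b"alice")?;
    if let Some(value) = db.get(b"user1")? {
        println!("user1 -> {}", String::from_utf8_lossy(&value));
    }
    Ok(())
}
```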
## Raft protocol

What's more, the implementation of TiKV faces a more difficult problem: keeping data safe in case a single machine fails.

A simple way is to replicate data to multiple machines, so that even if one machine fails, the replicas on other machines are still available. In other words, you need a data replication scheme that is reliable, efficient, and able to handle the situation of a failed replica. All of these are made possible by the Raft algorithm.

Raft is a consensus algorithm. This document only briefly introduces Raft; for more details, see [In Search of an Understandable Consensus Algorithm](https://raft.github.io/raft.pdf). Raft has several important features:

- Leader election
- Membership changes (such as adding replicas, deleting replicas, and transferring leaders)
- Log replication
TiKV uses Raft to perform data replication. Each data change is recorded as a Raft log. Through Raft log replication, data is safely and reliably replicated to multiple nodes of the Raft group. However, according to the Raft protocol, a write succeeds as long as the data is replicated to a majority of nodes.

![Raft in TiDB](/media/tidb-storage-1.png)

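The "majority of nodes" rule can be spelled out with a toy sketch (not TiKV or raft-rs code; the helper functions are invented for illustration): a write is considered successful once a quorum of the Raft group has acknowledged the corresponding log entry.

```rust
/// Number of acknowledgements required for a write to succeed in a
/// Raft group of `replicas` members: a majority, i.e. floor(n / 2) + 1.
fn quorum(replicas: usize) -> usize {
    replicas / 2 + 1
}

/// A log entry is committed once a majority of replicas has acknowledged it.
fn is_committed(acks: usize, replicas: usize) -> bool {
    acks >= quorum(replicas)
}

fn main() {
    // With 3 replicas, 2 acknowledgements are enough; one node may be down.
    assert_eq!(quorum(3), 2);
    assert!(is_committed(2, 3));
    // With 5 replicas, the group tolerates 2 failures.
    assert_eq!(quorum(5), 3);
    assert!(!is_committed(2, 5));
    println!("quorum checks passed");
}
```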
In summary, TiKV can quickly store data on disk via standalone RocksDB, and replicate data to multiple machines via Raft in case of machine failure. Data is written through the interface of Raft instead of directly to RocksDB. With the implementation of Raft, TiKV becomes a distributed Key-Value storage. Even with a few machine failures, TiKV can automatically replenish replicas by virtue of the native Raft protocol, which does not impact the application.
## Region

To make it easy to understand, let's assume that all data has only one replica. As mentioned earlier, TiKV can be regarded as a large, ordered KV Map, so data is distributed across multiple machines in order to achieve horizontal scalability. For a KV system, there are two typical solutions to distributing data across multiple machines:

* Hash: Create a hash by Key and select the corresponding storage node according to the hash value.
* Range: Divide ranges by Key, where a segment of consecutive Keys is stored on a node.
TiKV chooses the second solution, which divides the whole Key-Value space into a series of consecutive Key segments. Each segment is called a Region. There is a size limit for the data that each Region can store (the default value is 96 MB, and the size can be configured). Each Region can be described by `[StartKey, EndKey)`, a left-closed and right-open interval.

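A minimal sketch of the Region abstraction (illustrative types only, not TiKV's actual definitions): each Region covers a left-closed, right-open key range, and a key is routed to the Region whose range contains it.

```rust
/// A Region covers the left-closed, right-open key range [start_key, end_key).
/// An empty `end_key` stands for "unbounded" (the last Region).
struct Region {
    id: u64,
    start_key: Vec<u8>,
    end_key: Vec<u8>,
}

impl Region {
    fn contains(&self, key: &[u8]) -> bool {
        key >= self.start_key.as_slice()
            && (self.end_key.is_empty() || key < self.end_key.as_slice())
    }
}

/// Route a key to the Region that holds it (a linear scan for clarity;
/// a real system would index Regions by their start keys).
fn locate<'a>(regions: &'a [Region], key: &[u8]) -> Option<&'a Region> {
    regions.iter().find(|r| r.contains(key))
}

fn main() {
    let regions = vec![
        Region { id: 1, start_key: b"".to_vec(), end_key: b"g".to_vec() },
        Region { id: 2, start_key: b"g".to_vec(), end_key: b"p".to_vec() },
        Region { id: 3, start_key: b"p".to_vec(), end_key: b"".to_vec() },
    ];
    assert_eq!(locate(&regions, b"cat").unwrap().id, 1);
    assert_eq!(locate(&regions, b"key42").unwrap().id, 2);
    assert_eq!(locate(&regions, b"zebra").unwrap().id, 3);
    println!("all keys routed to the expected Regions");
}
```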
![Region in TiDB](/media/tidb-storage-2.png)

Note that the Region here has nothing to do with the table in SQL. In this document, forget about SQL and focus on KV for now. After dividing data into Regions, TiKV performs two important tasks:
* Distributing data to all nodes in the cluster, using the Region as the basic unit, and doing its best to ensure that the number of Regions on each node is roughly the same.
* Performing Raft replication and membership management per Region.

These two tasks are very important and will be introduced one by one.
* For the first task, data is divided into many Regions according to the Key, and the data of each Region is stored on only one node (ignoring multiple replicas). The TiDB system has a PD component that is responsible for spreading Regions as evenly as possible across all nodes in the cluster. In this way, on the one hand, the storage capacity is scaled horizontally (Regions on other nodes are automatically scheduled to a newly added node); on the other hand, load balancing is achieved (one node will not end up with a lot of data while the others have little).

    At the same time, to ensure that the upper-layer client can access the data it needs, a component (PD) in the system records the distribution of Regions across nodes; that is, for any given Key, PD knows which Region the Key falls in and which node that Region is placed on (see the routing sketch after this list).
* For the second task, TiKV replicates data in Regions, which means that the data in one Region has multiple replicas, each called a "Replica". The Replicas of a Region are stored on different nodes and form a Raft Group, which is kept consistent through the Raft algorithm.

    One of the Replicas serves as the Leader of the Group and the others as Followers. By default, all reads and writes are processed through the Leader: reads are served on the Leader, and writes are replicated to the Followers. The following diagram shows the whole picture of Regions and Raft Groups.

![TiDB Storage](/media/tidb-storage-3.png)

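To make the two tasks above concrete, here is a toy sketch (invented types, not PD's or TiKV's actual data structures) of the routing information the system needs: for any Key, which Region it falls in, which stores hold that Region's Replicas, and which Replica is currently the Raft Leader.

```rust
use std::collections::BTreeMap;

/// Placement of one Region: which stores hold its Replicas, and which
/// Replica currently acts as the Raft Leader of the group.
struct RegionRoute {
    region_id: u64,
    leader_store: u64,
    follower_stores: Vec<u64>,
}

/// A toy routing table keyed by each Region's StartKey. Because the map is
/// ordered, the Region owning a key is the one with the greatest StartKey <= key.
struct Router {
    by_start_key: BTreeMap<Vec<u8>, RegionRoute>,
}

impl Router {
    fn locate(&self, key: &[u8]) -> Option<&RegionRoute> {
        self.by_start_key
            .range(..=key.to_vec())
            .next_back()
            .map(|(_, route)| route)
    }
}

fn main() {
    let mut by_start_key = BTreeMap::new();
    by_start_key.insert(
        b"".to_vec(),
        RegionRoute { region_id: 1, leader_store: 10, follower_stores: vec![20, 30] },
    );
    by_start_key.insert(
        b"m".to_vec(),
        RegionRoute { region_id: 2, leader_store: 20, follower_stores: vec![10, 30] },
    );
    let router = Router { by_start_key };

    // By default, reads and writes for "user42" go through Region 2's Leader on store 20.
    let route = router.locate(b"user42").unwrap();
    println!(
        "key user42 -> region {}, leader on store {}, followers on {:?}",
        route.region_id, route.leader_store, route.follower_stores
    );
}
```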
As we distribute and replicate data in Regions, we have a distributed Key-Value system that, to some extent, has the capability of disaster recovery. You no longer need to worry about capacity, or about data loss caused by disk failure.
## MVCC

Many databases implement multi-version concurrency control (MVCC), and TiKV is no exception. Imagine a situation where two clients modify the value of a Key at the same time. Without MVCC, the data would need to be locked, which, in a distributed scenario, might cause performance and deadlock problems. TiKV's MVCC implementation is achieved by appending a version number to the Key. In short, without MVCC, TiKV's data layout can be seen as follows:
```
Key1 -> Value
Key2 -> Value
……
KeyN -> Value
```
With MVCC, the key arrangement of TiKV looks like this:
```
Key1_Version3 -> Value
Key1_Version2 -> Value
Key1_Version1 -> Value
……
Key2_Version4 -> Value
Key2_Version3 -> Value
Key2_Version2 -> Value
Key2_Version1 -> Value
……
KeyN_Version2 -> Value
KeyN_Version1 -> Value
……
```
Note that for multiple versions of the same Key, the versions with larger numbers are placed first (see the [Key-Value pairs](#key-value-pairs) section, where Keys are arranged in order). With this arrangement, when you obtain the Value through Key + Version, the MVCC Key can be constructed with the Key and the Version, namely `Key_Version`. Then you can directly locate the first position greater than or equal to this `Key_Version` through RocksDB's `SeekPrefix(Key_Version)` API.

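The following sketch illustrates the idea (it is not TiKV's actual MVCC key encoding; the encoding here is invented for illustration): store the version so that larger versions of the same Key sort first, then a read at a given version seeks to the first entry whose version is not greater than it.

```rust
use std::collections::BTreeMap;

/// Build the MVCC key: the user Key plus an encoded version. The version is
/// stored as `u64::MAX - version` in big-endian so that, under the same Key,
/// larger versions sort first in the ordered map (as described above).
fn mvcc_key(key: &[u8], version: u64) -> Vec<u8> {
    let mut k = key.to_vec();
    k.extend_from_slice(&(u64::MAX - version).to_be_bytes());
    k
}

/// Read the newest value of `key` whose version is <= `read_version`:
/// seek to `mvcc_key(key, read_version)` and take the first entry that
/// still belongs to the same user Key.
fn mvcc_get<'a>(
    store: &'a BTreeMap<Vec<u8>, Vec<u8>>,
    key: &[u8],
    read_version: u64,
) -> Option<&'a Vec<u8>> {
    store
        .range(mvcc_key(key, read_version)..)
        .next()
        .filter(|(k, _)| k.starts_with(key) && k.len() == key.len() + 8)
        .map(|(_, v)| v)
}

fn main() {
    let mut store = BTreeMap::new();
    store.insert(mvcc_key(b"Key1", 1), b"v1".to_vec());
    store.insert(mvcc_key(b"Key1", 3), b"v3".to_vec());
    store.insert(mvcc_key(b"Key2", 2), b"w2".to_vec());

    // A reader at version 2 sees Key1's version 1; a reader at version 5 sees version 3.
    assert_eq!(mvcc_get(&store, b"Key1", 2), Some(&b"v1".to_vec()));
    assert_eq!(mvcc_get(&store, b"Key1", 5), Some(&b"v3".to_vec()));
    println!("mvcc reads resolved to the expected versions");
}
```

Encoding the version as `u64::MAX - version` is just one way to obtain the "larger versions first" order within a plain byte-ordered map; the point is only that newer versions of a Key sort before older ones.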
## Distributed ACID transaction

TiKV's transactions adopt the model used by Google in BigTable: [Percolator](https://research.google.com/pubs/pub36726.html). TiKV's implementation is inspired by this paper, with a lot of optimizations. See [transaction overview](/transaction-overview.md) for details.
