From 1e48c6385929e397e96b84e726ce2a332fb0913e Mon Sep 17 00:00:00 2001
From: Neil Shen
Date: Thu, 22 Nov 2018 11:36:57 +0800
Subject: [PATCH 1/6] 80 characters per line is preferred

Signed-off-by: Neil Shen
---
 template.md                             |  10 +-
 text/2017-12-22-read-pool.md            |  95 +++++++++--------
 text/2018-08-29-unsafe-destroy-range.md | 130 ++++++++++++++++++------
 3 files changed, 155 insertions(+), 80 deletions(-)

diff --git a/template.md b/template.md
index d36a1e63..7c19ebbf 100644
--- a/template.md
+++ b/template.md
@@ -1,10 +1,13 @@
+
+
 # Summary

 One para explanation of the proposal.

 # Motivation

-Why are we doing this? What use cases does it support? What is the expected outcome?
+Why are we doing this? What use cases does it support? What is the expected
+outcome?

 # Detailed design

@@ -21,9 +24,10 @@ Why should we not do this?
 # Alternatives

 - Why is this design the best in the space of possible designs?
-- What other designs have been considered and what is the rationale for not choosing them?
+- What other designs have been considered and what is the rationale for not
+  choosing them?
 - What is the impact of not doing this?

 # Unresolved questions

-What parts of the design are still to be determined?
\ No newline at end of file
+What parts of the design are still to be determined?
diff --git a/text/2017-12-22-read-pool.md b/text/2017-12-22-read-pool.md
index 8fb3762b..7372444d 100644
--- a/text/2017-12-22-read-pool.md
+++ b/text/2017-12-22-read-pool.md
@@ -1,26 +1,28 @@
 # Summary

-This RFC proposes a new thread pool that supports async operations, which is called `ReadPool`. It
-is designed to be specifically used for all sorts of read operations like Kv Get, Kv Scan and
-Coprocessor Read to improve performance.
+This RFC proposes a new thread pool that supports async operations, which is
+called `ReadPool`. It is designed to be specifically used for all sorts of read
+operations like Kv Get, Kv Scan and Coprocessor Read to improve performance.

 # Motivation

-Currently, for all KV operations that stay inside `Storage`, they are executed on the same thread
-pool. This leads to bad operation isolation, i.e. heavy writes will cause slow reads. In fact, these
-read operations can be executed on other threads to be not blocked. For Coprocessor operations, we
-have an end-point worker thread to handle async snapshots. This is a bottleneck when there are heavy
-coprocessor requests.
+Currently, all KV operations that stay inside `Storage` are executed on the
+same thread pool. This leads to bad operation isolation, i.e. heavy writes
+will cause slow reads. In fact, these read operations can be executed on other
+threads so that they are not blocked. For Coprocessor operations, we have an
+end-point worker thread to handle async snapshots. This is a bottleneck when
+there are heavy coprocessor requests.

-By introducing a standalone thread pool that supports async operations (which is called ReadPool) in
-this RFC, we can fix both problems. For KV operations, read and write operations will be executed
-on different thread pools, so that they won't affect each other. For Coprocessor operations, since
-ReadPool supports async operations, the end-point worker is not necessary anymore and the bottleneck
-no longer exists.
+By introducing a standalone thread pool that supports async operations (which
+is called ReadPool) in this RFC, we can fix both problems. For KV operations,
+read and write operations will be executed on different thread pools, so that
+they won't affect each other.
+For Coprocessor operations, since ReadPool supports async operations, the
+end-point worker is not necessary anymore and the bottleneck no longer exists.

-In this RFC, there will be a ReadPool for KV operations and another ReadPool for Coprocessor
-operations, so that they won't block each other. In the future, it may be possible to merge
-them together.
+In this RFC, there will be a ReadPool for KV operations and another ReadPool
+for Coprocessor operations, so that they won't block each other. In the future,
+it may be possible to merge them together.

 # Detailed design

@@ -32,12 +34,13 @@ The ReadPool provides these features:

 3. Support max task threshold as a load control mechanism.

-The ReadPool can be implemented via 2 levels. The lower level is a new facility called
-`util::FuturePool`, which is a simple encapsulation over [CpuPool](https://docs.rs/futures-cpupool/)
-that supports feature 2. The higher level is `server::ReadPool`, which assembles multiple
-`FuturePool`s and supports features 1 and 3. In this way, `FuturePool` becomes a common facility
-that can be reused in the future to run other kinds of async operations that need contexts, not just
-limited to the scenario listed in this RFC.
+The ReadPool can be implemented via 2 levels. The lower level is a new facility
+called `util::FuturePool`, which is a simple encapsulation over
+[CpuPool](https://docs.rs/futures-cpupool/) that supports feature 2. The higher
+level is `server::ReadPool`, which assembles multiple `FuturePool`s and
+supports features 1 and 3. In this way, `FuturePool` becomes a common facility
+that can be reused in the future to run other kinds of async operations that
+need contexts, not just limited to the scenario listed in this RFC.

 ## FuturePool

```rust
pub trait Context: fmt::Debug + Send {
}
```

-The `FuturePool` user can provide customized `impl Context` so that some code can be executed
-periodically within the context of the thread pool's thread.
+The `FuturePool` user can provide customized `impl Context` so that some code
+can be executed periodically within the context of the thread pool's thread.

 The `FuturePool` is defined as follows:

```rust
@@ -63,9 +66,10 @@ pub struct FuturePool {
}
```

-The logic of whether to call `on_tick` is, when each task finishes running, we check how much
-time has elapsed. If the elapsed time since the last tick is longer than our tick interval, we call
-`on_tick`. In order to do so in a clean way, we can wrap this logic into a `ContextDelegator`:
+The logic for deciding whether to call `on_tick` is as follows: when each task
+finishes running, we check how much time has elapsed. If the elapsed time since
+the last tick is longer than our tick interval, we call `on_tick`. To do this
+in a clean way, we can wrap the logic into a `ContextDelegator`:

```rust
struct ContextDelegator {
@@ -85,14 +89,15 @@ impl ContextDelegator {
}
```
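For illustration, a minimal sketch of this elapsed-time check could look like
the code below. The names here (`TickChecker`, `on_task_finish`) are
placeholders for this sketch, not the exact TiKV types:

```rust
use std::cell::{Cell, RefCell};
use std::time::{Duration, Instant};

// Hypothetical stand-in for the `ContextDelegator` described above;
// structure and field names are assumptions for illustration only.
struct TickChecker<C> {
    context: RefCell<C>,
    tick_interval: Duration,
    last_tick: Cell<Option<Instant>>,
}

impl<C> TickChecker<C> {
    // Called after each task finishes. `on_tick` fires only when the tick
    // interval has elapsed since the previous tick.
    fn on_task_finish(&self, on_tick: impl Fn(&mut C)) {
        let now = Instant::now();
        match self.last_tick.get() {
            // First finished task: start the clock without ticking.
            None => self.last_tick.set(Some(now)),
            Some(last) if now.duration_since(last) >= self.tick_interval => {
                self.last_tick.set(Some(now));
                on_tick(&mut self.context.borrow_mut());
            }
            Some(_) => {} // interval not reached yet; do nothing
        }
    }
}
```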
-`FuturePool` users should pass a `Future` to the `FuturePool` to execute, just like `CpuPool`.
-However, to obtain access to the `Context` inside `Future`, the provided `Future` should be wrapped
-in a closure which provides access to the `Context` via the parameter. In the further, the parameter
-may live multiple `Future`s, which may run on different threads and different `Context`s. Thus, the
-parameter is a `Context` accessor instead of a specific `Context`.
+`FuturePool` users should pass a `Future` to the `FuturePool` to execute, just
+like `CpuPool`. However, to obtain access to the `Context` inside `Future`, the
+provided `Future` should be wrapped in a closure which provides access to the
+`Context` via the parameter. In the future, the parameter may live across
+multiple `Future`s, which may run on different threads and different
+`Context`s. Thus, the parameter is a `Context` accessor instead of a specific
+`Context`.

-As a result, this RFC further introduces the following struct, which provides an access to the
-current `Context` based on running thread:
+As a result, this RFC further introduces the following struct, which provides
+access to the current `Context` based on the running thread:

```rust
pub struct ContextDelegators {
@@ -106,7 +111,8 @@ impl ContextDelegators {
}
```

-The `FuturePool` provides `spawn` interface to execute a `Future` while providing
+The `FuturePool` provides a `spawn` interface to execute a `Future` while
+providing
 `ContextDelegators`:

```rust
@@ -138,8 +144,8 @@ pub struct ReadPool {
}
```

-It also provides an interface similar to `FuturePool::spawn`, which specifies the priority and
-returns an error when `FuturePool` is full:
+It also provides an interface similar to `FuturePool::spawn`, which specifies
+the priority and returns an error when `FuturePool` is full:

```rust
pub enum Priority {
@@ -177,17 +183,18 @@ impl ReadPool {
}

 # Drawbacks

-The `on_tick` implementation in `FuturePool` is not perfect. It is driven by task completion, which
-means it will not be executed when there is no task. This is acceptable since we are going to use
-`on_tick` to report metrics.
+The `on_tick` implementation in `FuturePool` is not perfect. It is driven by
+task completion, which means it will not be executed when there is no task.
+This is acceptable since we are going to use `on_tick` to report metrics.

 # Alternatives

-We can implement our own future pool instead of utilizing `CpuPool`. In this way, we can have a true
-`on_tick`. However, this is complex and is not a necessary feature we need.
+We can implement our own future pool instead of utilizing `CpuPool`. In this
+way, we can have a true `on_tick`. However, this is complex and is not a
+feature we actually need.

-We can implement our own async model instead of utilizing `Future`. However, there is no benefits
-compared to the current solution.
+We can implement our own async model instead of utilizing `Future`. However,
+there are no benefits compared to the current solution.

 # Unresolved questions
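For illustration, here is a minimal, self-contained sketch of the underlying
`CpuPool` spawn pattern that `FuturePool` builds on (futures 0.1-era APIs;
this is a standalone example, not TiKV code):

```rust
// Assumed dependencies: futures = "0.1", futures-cpupool = "0.1".
extern crate futures;
extern crate futures_cpupool;

use futures::Future;
use futures_cpupool::CpuPool;

fn main() {
    let pool = CpuPool::new(4);
    // A FuturePool-style `spawn` would additionally hand the closure a
    // context accessor and bump a running-task counter before delegating
    // to the underlying CpuPool, as described in this RFC.
    let answer = pool
        .spawn_fn(|| -> Result<u64, ()> { Ok(6 * 7) })
        .wait()
        .unwrap();
    assert_eq!(answer, 42);
}
```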
diff --git a/text/2018-08-29-unsafe-destroy-range.md b/text/2018-08-29-unsafe-destroy-range.md
index 1295ac2c..15b337b5 100644
--- a/text/2018-08-29-unsafe-destroy-range.md
+++ b/text/2018-08-29-unsafe-destroy-range.md
@@ -1,16 +1,31 @@
 # Summary

-Support RPC `UnsafeDestroyRange`. This call is on the whole TiKV rather than a certain Region. When it is invoked, TiKV will use `delete_files_in_range` to quickly free a large amount of space, and then scan and delete all remaining keys in the range. Raft layer will be bypassed, and the range can cover multiple Regions. The invoker should promise that after invoking `UnsafeDestroyRange`, the range will **never** be accessed again. That is to say, the range is permanently scrapped.
+Support RPC `UnsafeDestroyRange`. This call is on the whole TiKV rather than a
+certain Region. When it is invoked, TiKV will use `delete_files_in_range` to
+quickly free a large amount of space, and then scan and delete all remaining
+keys in the range. The Raft layer will be bypassed, and the range can cover
+multiple Regions. The invoker should promise that after invoking
+`UnsafeDestroyRange`, the range will **never** be accessed again. That is to
+say, the range is permanently scrapped.

-This interface is only designed for TiDB. It's used to clean up data after dropping/truncating a huge table/index.
+This interface is only designed for TiDB. It's used to clean up data after
+dropping/truncating a huge table/index.

 # Motivation

-Currently, when TiDB drops/truncates a table/index, after GC lifetime it will invoke `DeleteRange` on all affected Regions respectively. Though TiKV will call `delete_files_in_range` to quickly free up the space for each Region, the SST files that span over multiple Regions cannot be removed immediately. As a result, the disk space is not able to be recycled in time. Another problem is that the process to delete all Regions respectively is too slow.
+Currently, when TiDB drops/truncates a table/index, after GC lifetime it will
+invoke `DeleteRange` on all affected Regions respectively. Though TiKV will
+call `delete_files_in_range` to quickly free up the space for each Region, the
+SST files that span over multiple Regions cannot be removed immediately. As a
+result, the disk space cannot be recycled in time. Another problem is that
+the process of deleting all Regions respectively is too slow.

-This issue has troubled some of our users. It should be fixed as soon as possible.
+This issue has troubled some of our users. It should be fixed as soon as
+possible.

-In our tests, this can finish deleting data of a hundreds-of-GB table in seconds, while the old implementation may cost hours to finish. Also the disk space is released very fast.
+In our tests, this approach can finish deleting the data of a hundreds-of-GB
+table in seconds, while the old implementation may cost hours to finish. The
+disk space is also released very fast.

 # Detailed design

@@ -24,15 +39,28 @@ There are several steps to implement it.
       bytes end_key = 3;
     }
   ```
-  The `context` in the request body does not indicate which Region to work. Fields about Region, Peer, etc. are ignored.
-
-  We keep the `context` here. One of the reasons is to make it uniform with other requests. Another reason is that, there are other parameters (like `fill_cache`) in `Context`. Though we don't need to use anything in it currently, it's possible that there will be more parameters added to `Context`, which is also possibly needed by `UnsafeDestroyRange`.
-
-2. When TiKV receives `UnsafeDestroyRangeRequest`, it immediately runs `delete_files_in_range` on the whole range on RocksDB, bypassing the Raft layer. Then it will scan and delete all remaining keys in the range.
-
-  * Due to that `UnsafeDestroyRange` only helps TiDB's GC, we put the logic of handling `UnsafeDestroyRange` request to `storage/gc_worker`.
-  * `GCWorker` needs a reference to raftstore's underlying RocksDB now. However, it's a component of `Storage` that knows nothing about the implementation of the `Engine`. Either, we cannot specialize it for `RaftKv` to get the RocksDB, since it's just a router. The simplest way is to pass the RocksDB's Arc pointer explicitly in `tikv-server.rs`.
-  * We regard `UnsafeDestroyRange` as a special type of GCTask, which will be executed in TiKV's GCWorker's thread:
+  The `context` in the request body does not indicate which Region to work
+on. Fields about Region, Peer, etc. are ignored.
+
+  We keep the `context` here. One of the reasons is to make it uniform with
+other requests.
+Another reason is that there are other parameters (like `fill_cache`) in
+`Context`. Though we don't need to use anything in it currently, it's possible
+that there will be more parameters added to `Context`, which may also be
+needed by `UnsafeDestroyRange`.
+
+2. When TiKV receives `UnsafeDestroyRangeRequest`, it immediately runs
+`delete_files_in_range` on the whole range on RocksDB, bypassing the Raft
+layer. Then it will scan and delete all remaining keys in the range.
+
+   * Because `UnsafeDestroyRange` only helps TiDB's GC, we put the logic of
+handling `UnsafeDestroyRange` requests in `storage/gc_worker`.
+   * `GCWorker` needs a reference to raftstore's underlying RocksDB now.
+However, it's a component of `Storage` that knows nothing about the
+implementation of the `Engine`. Also, we cannot specialize it for `RaftKv` to
+get the RocksDB, since it's just a router. The simplest way is to pass the
+RocksDB's Arc pointer explicitly in `tikv-server.rs`.
+   * We regard `UnsafeDestroyRange` as a special type of GCTask, which will be
+executed in TiKV's GCWorker's thread:
```rust
     enum GCTask {
         GC {
@@ -48,33 +76,69 @@ There are several steps to implement it.
         },
     }
```
-   * The logic of `UnsafeDestroyRange` is quite similar to `DeleteRange` in apply worker: First, call `delete_files_in_range` to quickly drop most of the data and also quickly release the space; Then, scan and delete all remaining keys in the range.
-
-     We should choose a proper batch size for the second step (scan and delete). Currently raftstore uses 4MB as the max batch size when doing scan-deleting. In our tests, we found that if we use 4MB, as the write batch size in the second step (scan and delete), running it will reduce OLTP QPS by 30% ~ 60%. However, if we set batch size smaller such as 32KB, then QLTP QPS is not affected any more.
-
-     The story of TiKV is ended here. But to show why this thing is useful, let's continue to see what we will do on TiDB:
-
-3. In TiDB, when you execute `TRUNCATE/DROP TABLE/INDEX`, the meta data of the related table will be modified at first, and then the range covered by the truncated/dropped table/index will be recorded in the system table `gc_delete_range` with timestamp when the `TRUNCATE/DROP` query is executed. TiDB GC worker will check this table every time it does GC. In the current implementation, if there are records in `gc_delete_range` with its timestamp smaller than the safe point, TiDB will delete ranges indicated by these records by invoking `DeleteRange`, and also move these records from `gc_delete_range` to `gc_delete_range_done`. So in the new implementation:
-
-   * When doing `DeleteRanges` (the second step of GC), invoke the new `UnsafeDestroyRange` on each TiKV, instead of running the old `DeleteRange` on each Region. Also we need to add an interface to PD so that TiDB can get a list of all TiKVs.
-   * Check the record again from `gc_delete_range_done` after 24 hours (24 hours since it was moved to table `gc_delete_range_done`), and invoke `UnsafeDestroyRange` again. Then this record can be removed from table `gc_delete_range_done`.
-
-   And why do we need to check it one more time after 24 hours? After deleting the range the first time, if coincidentally PD is trying to move a Region or something, some data may still appear in the range. So check it one more time after a proper time to greatly reduce the possibility.
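For illustration, the two-phase logic described in step 2 could be sketched as
follows. This is written against a made-up engine trait, so every name here is
a placeholder rather than TiKV's real API; the 32KB constant mirrors the batch
size discussed in this section:

```rust
// Batch size for the scan-and-delete phase; the RFC found 32KB safe for
// OLTP traffic, while 4MB caused a 30% ~ 60% QPS drop.
const MAX_BATCH_BYTES: usize = 32 * 1024;

// Minimal engine abstraction assumed for this sketch only.
trait Engine {
    // Drop whole SST files fully contained in [start, end).
    fn delete_files_in_range(&self, start: &[u8], end: &[u8]);
    // Scan up to roughly `limit` bytes worth of keys in [start, end).
    fn scan_keys(&self, start: &[u8], end: &[u8], limit: usize) -> Vec<Vec<u8>>;
    fn delete_batch(&self, keys: &[Vec<u8>]);
}

fn unsafe_destroy_range(engine: &dyn Engine, start: &[u8], end: &[u8]) {
    // Phase 1: reclaim most of the space immediately, bypassing Raft.
    engine.delete_files_in_range(start, end);

    // Phase 2: scan and delete the remaining keys in small batches so
    // foreground writes are not starved.
    let mut cursor = start.to_vec();
    loop {
        let keys = engine.scan_keys(&cursor, end, MAX_BATCH_BYTES);
        if keys.is_empty() {
            break; // no keys remain in [cursor, end)
        }
        engine.delete_batch(&keys);
        // Resume just after the last key we deleted.
        cursor = keys.last().unwrap().clone();
        cursor.push(0);
    }
}
```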
+   * The logic of `UnsafeDestroyRange` is quite similar to `DeleteRange` in
+apply worker: First, call `delete_files_in_range` to quickly drop most of the
+data and also quickly release the space; Then, scan and delete all remaining
+keys in the range.
+
+     We should choose a proper batch size for the second step (scan and
+delete). Currently raftstore uses 4MB as the max batch size when doing
+scan-deleting. In our tests, we found that if we use 4MB as the write batch
+size in the second step (scan and delete), running it will reduce OLTP QPS by
+30% ~ 60%. However, if we set a smaller batch size such as 32KB, then OLTP QPS
+is not affected any more.
+
+     The story of TiKV ends here. But to show why this is useful, let's
+continue to see what we will do in TiDB:
+
+3. In TiDB, when you execute `TRUNCATE/DROP TABLE/INDEX`, the metadata of the
+related table will be modified first, and then the range covered by the
+truncated/dropped table/index will be recorded in the system table
+`gc_delete_range` with the timestamp of when the `TRUNCATE/DROP` query is
+executed. TiDB GC worker will check this table every time it does GC. In the
+current implementation, if there are records in `gc_delete_range` whose
+timestamp is smaller than the safe point, TiDB will delete ranges indicated by
+these records by invoking `DeleteRange`, and also move these records from
+`gc_delete_range` to `gc_delete_range_done`. So in the new implementation:
+
+   * When doing `DeleteRanges` (the second step of GC), invoke the new
+`UnsafeDestroyRange` on each TiKV, instead of running the old `DeleteRange` on
+each Region. Also we need to add an interface to PD so that TiDB can get a list
+of all TiKVs.
+   * Check the record again from `gc_delete_range_done` after 24 hours (24
+hours since it was moved to table `gc_delete_range_done`), and invoke
+`UnsafeDestroyRange` again. Then this record can be removed from table
+`gc_delete_range_done`.
+
+   Why do we need to check one more time after 24 hours? After deleting the
+range the first time, if PD happens to be moving a Region, some data may still
+appear in the range. Checking one more time after a proper interval greatly
+reduces this possibility.

 # Drawbacks

 * It breaks the consistency among Raft replicas.
-  * But fortunately, in TiDB, when `DeleteRanges` happens, the deleted range will never be accessed again.
+  * But fortunately, in TiDB, when `DeleteRanges` happens, the deleted range
+will never be accessed again.
-* To bypass the Raft layer, the code might violate the layered architecture of TiKV.
+* To bypass the Raft layer, the code might violate the layered architecture of
+TiKV.
-* Sending requests to each TiKV server never occurs in TiDB before. This pattern is introduced to TiDB's code the first time.
+* Sending requests to each TiKV server never occurred in TiDB before. This
+pattern is introduced into TiDB's code for the first time.
-* After deleting all files in a very-large range, it may trigger RocksDB's compaction (even it is not really needed) and even cause stalling.
+* After deleting all files in a very-large range, it may trigger RocksDB's
+compaction (even if it is not really needed) and even cause stalling.

 # Alternatives

-We've considered several different ideas. But in order to solve the problem quickly, we prefer an easier approach.
+We've considered several different ideas. But in order to solve the problem
+quickly, we prefer an easier approach.
-For example, we tried to keep the consistency of Raft replicas. But in a scenario where TiKV is only used by TiDB, deleted range will never be accessed again. So in fact, if we add an interface that is specific for TiDB, we just needn't keep this kind of consistency. It's safe enough.
+For example, we tried to keep the consistency of Raft replicas. But in a
+scenario where TiKV is only used by TiDB, the deleted range will never be
+accessed again. So in fact, if we add an interface that is specific to TiDB,
+we simply don't need to keep this kind of consistency. It's safe enough.

 # Unresolved questions

-* A huge amount of empty regions will be left after invoking `UnsafeDestroyRange`. It may take a long time to merge them all. If possible, they should be merged faster.
+* A huge number of empty Regions will be left after invoking
+`UnsafeDestroyRange`. It may take a long time to merge them all. If possible,
+they should be merged faster.

From d9bbbbd19a1481c047d368a713f1547b2a25eada Mon Sep 17 00:00:00 2001
From: Neil Shen
Date: Mon, 26 Nov 2018 10:38:12 +0800
Subject: [PATCH 2/6] Fold README.md

Signed-off-by: Neil Shen
---
 README.md | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index abf65668..29959bd6 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,14 @@
 # TiKV RFCs

-Many changes, including bug fixes and documentation improvements can be implemented and reviewed via the normal GitHub pull request workflow.
+Many changes, including bug fixes and documentation improvements, can be
+implemented and reviewed via the normal GitHub pull request workflow.

-Some changes though are "substantial", and we ask that these be put through a bit of a design process and produce a consensus among the TiKV community.
+Some changes, though, are "substantial", and we ask that these be put through
+a bit of a design process and produce a consensus among the TiKV community.

-The "RFC" (request for comments) process is intended to provide a consistent and controlled path for new features to enter the project, so that all stakeholders can be confident about the direction the project is evolving in.
+The "RFC" (request for comments) process is intended to provide a consistent
+and controlled path for new features to enter the project, so that all
+stakeholders can be confident about the direction the project is evolving in.

 ### How to submit an RFC

@@ -16,13 +20,16 @@
 1. An RFC is submitted as a PR.
 2. Discussion takes place, and the text is revised in response.
-3. The PR is merged or closed when at least two project maintainers reach consensus.
+3. The PR is merged or closed when at least two project maintainers reach
+   consensus.

 ### License

-This content is licensed under Apachie License, Version 2.0, ([LICENSE](LICENSE) or
- http://www.apache.org/licenses/LICENSE-2.0)
+This content is licensed under the Apache License, Version 2.0,
+([LICENSE](LICENSE) or http://www.apache.org/licenses/LICENSE-2.0)

 ### Contributions

-Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
+Unless you explicitly state otherwise, any contribution intentionally submitted
+for inclusion in the work by you, as defined in the Apache-2.0 license, shall
+be dual licensed as above, without any additional terms or conditions.
From 8a08e9754d5ebd8a1ec82ae3e267c10a112a3c81 Mon Sep 17 00:00:00 2001
From: Neil Shen
Date: Mon, 26 Nov 2018 11:50:20 +0800
Subject: [PATCH 3/6] Check format in CI

Signed-off-by: Neil Shen
---
 .travis.yml | 6 ++++++
 1 file changed, 6 insertions(+)
 create mode 100644 .travis.yml

diff --git a/.travis.yml b/.travis.yml
new file mode 100644
index 00000000..f2e9a75e
--- /dev/null
+++ b/.travis.yml
@@ -0,0 +1,6 @@
+language: minimal
+script:
+  - for f in text/*; do
+      fold -s $f | diff - $f ||
+      ( echo "Please wrap rfcs at 80 characters" && exit 1 );
+    done

From e432214e4d680e8f659964fa772fdad1cea63994 Mon Sep 17 00:00:00 2001
From: Neil Shen
Date: Mon, 26 Nov 2018 17:21:45 +0800
Subject: [PATCH 4/6] Ignore urls

Signed-off-by: Neil Shen
---
 .travis.yml | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/.travis.yml b/.travis.yml
index f2e9a75e..dccf59c1 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -1,6 +1,10 @@
 language: minimal
+
+env:
+  - IGNORE_PAT="^\[.*\]:"
+
 script:
   - for f in text/*; do
-      fold -s $f | diff - $f ||
-      ( echo "Please wrap rfcs at 80 characters" && exit 1 );
+      grep -vE $IGNORE_PAT $f | fold -s | diff -I $IGNORE_PAT - $f ||
+      ( echo "Please wrap RFCs at 80 characters" && exit 1 );
    done

From e5012d88876825868d448e5b289a4dd7e7652349 Mon Sep 17 00:00:00 2001
From: Neil Shen
Date: Wed, 28 Nov 2018 14:54:17 +0800
Subject: [PATCH 5/6] Address comment

Signed-off-by: Neil Shen
---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 29959bd6..514cf8c5 100644
--- a/README.md
+++ b/README.md
@@ -23,6 +23,10 @@ stakeholders can be confident about the direction the project is evolving in.
 3. The PR is merged or closed when at least two project maintainers reach
    consensus.

+### Style of an RFC
+
+We should wrap lines at 80 characters.
+
 ### License

From 367e12ac816c1f3d61122e28a73781ca3c6c3b81 Mon Sep 17 00:00:00 2001
From: Neil Shen
Date: Thu, 29 Nov 2018 13:59:34 +0800
Subject: [PATCH 6/6] Format unified-key-format

Signed-off-by: Neil Shen
---
 text/2018-11-12-unified-key-format.md | 131 +++++++++++++++-----------
 1 file changed, 74 insertions(+), 57 deletions(-)

diff --git a/text/2018-11-12-unified-key-format.md b/text/2018-11-12-unified-key-format.md
index 380ee71e..dc88d3b0 100644
--- a/text/2018-11-12-unified-key-format.md
+++ b/text/2018-11-12-unified-key-format.md
@@ -1,8 +1,9 @@
 # Summary

-This RFC proposes a new unified and human readable format for keys. All TiDB software stack
-components (e.g. TiDB, TiKV, pd-ctl, tikv-ctl, ...) *must* follow this new format when printing out
-keys and accepting keys from user input like command line parameters.
+This RFC proposes a new unified and human readable format for keys. All TiDB
+software stack components (e.g. TiDB, TiKV, pd-ctl, tikv-ctl, ...) *must* follow
+this new format when printing out keys and accepting keys from user input like
+command line parameters.
 # Motivation

@@ -10,51 +11,59 @@ Currently we have multiple human readable formats for keys:

 - Hex format, like `74000000000000001C5F7200000000000000FA`

-- Protobuffer format, like `t\000\000\000\000\000\000\000\034_r\000\000\000\000\000\000\000\372`
+- Protobuffer format, like
+  `t\000\000\000\000\000\000\000\034_r\000\000\000\000\000\000\000\372`

-- Go format, like `t\x00\x00\x00\x00\x00\x00\x00\x1c_r\x00\x00\x00\x00\x00\x00\x00\xfa`
+- Go format, like
+  `t\x00\x00\x00\x00\x00\x00\x00\x1c_r\x00\x00\x00\x00\x00\x00\x00\xfa`

 Which creates the following problems for us:

-- Bad interoperability: These different formats are pain points when working with different tools
-  together. For example, pd-ctl produces keys in Go format while tikv-ctl accepts keys in
-  Protobuffer format.
+- Bad interoperability: These different formats are pain points when working
+  with different tools together. For example, pd-ctl produces keys in Go format
+  while tikv-ctl accepts keys in Protobuffer format.

-- Hard to implement: Actually there are some attempts to support different kind of formats, like
-  https://github.com/tikv/tikv/pull/3182 and https://github.com/pingcap/pd/pull/474. However it
-  turns out that these implementations won't work in real world scenario. For example, it is
-  not sufficient to decode the Golang format by only dealing with hex escapes `\x..` [(1)] and
-  decode the Protobuffer format by only dealing with oct escapes `\...` [(2)].
+- Hard to implement: Actually there are some attempts to support different
+  kinds of formats, like https://github.com/tikv/tikv/pull/3182 and
+  https://github.com/pingcap/pd/pull/474. However, it turns out that these
+  implementations won't work in real-world scenarios. For example, it is not
+  sufficient to decode the Golang format by only dealing with hex escapes
+  `\x..` [(1)] and decode the Protobuffer format by only dealing with oct
+  escapes `\...` [(2)].

-- Hard to recognize different parts manually: For hex format, it is hard to extract different parts
-  from it manually.
+- Hard to recognize different parts manually: For hex format, it is hard to
+  extract different parts from it manually.

-- Hard to be used in shells: For Protobuffer format and Go format, because there are escape
-  characters U+005C (\\) in the output, we need to do an escape (convert `\` to `\\`) when passing
-  it as a command line parameter in shells. As a result, current documentations are suggesting
-  wrong usages [(3)] that *happens* to work under specially crafted code, or won't work [(4)]
-  actually.
+- Hard to be used in shells: For Protobuffer format and Go format, because
+  there are escape characters U+005C (\\) in the output, we need to do an
+  escape (convert `\` to `\\`) when passing it as a command line parameter in
+  shells. As a result, current documentation suggests wrong usages [(3)] that
+  *happen* to work under specially crafted code, or that actually won't
+  work [(4)].

-- Contains ambiguous multi-bytes: For Go format, it directly outputs Unicode character if there is
-  a valid unicode sequence. For example, hex `00E6B58B01FF` will be printed as `\x00测\x01\xff` in
-  Go format. This may result in ambiguous bytes represents in different environments when such
-  characters are printed.
+- Contains ambiguous multi-bytes: For Go format, it directly outputs a Unicode
+  character if there is a valid Unicode sequence. For example, hex
+  `00E6B58B01FF` will be printed as `\x00测\x01\xff` in Go format.
+  This may result in ambiguous byte representations in different environments
+  when such characters are printed.

-To overcome above issues, we need a new human readable format and use it among all TiDB software stack components, which provides these feature:
+To overcome the above issues, we need a new human readable format and use it
+among all TiDB software stack components, which provides these features:

 - Easy to recognize different parts after encoding by human.

-- Easy to implement encoding and decoding. Better to have a lot of existing libraries.
+- Easy to implement encoding and decoding. It is better to have a lot of
+  existing libraries.

-- No escape character U+005C (\\) after encoding so that it can be easily used in shells.
+- No escape character U+005C (\\) after encoding so that it can be easily used
+  in shells.

 - No multi-byte characters after encoding to avoid encoding issues.

 # Detailed Design

-This RFC proposes to use hex format like `74000000000000001C5F7200000000000000FA` as unified a human
-readable format for keys:
+This RFC proposes to use hex format like
+`74000000000000001C5F7200000000000000FA` as a unified human readable format
+for keys:

 - Each byte must be encoded to a 2 digit hex in upper case.

@@ -64,8 +73,9 @@

 - Decoder must throw errors when meeting non-hex or odd number of characters.

-The encoding process accepts a byte array (the exact type varies according to languages) and output
-a string. It can be described by the JavaScript code below:
+The encoding process accepts a byte array (the exact type varies according to
+languages) and outputs a string. It can be described by the JavaScript code
+below:

```js
function encode(buffer) {
@@ -86,8 +96,8 @@ function encode(buffer) {
// ==> '627566666572'
```

-The decoding process accepts a string and outputs a byte array. It can be described by the
-JavaScript code below:
+The decoding process accepts a string and outputs a byte array. It can be
+described by the JavaScript code below:

```js
function decode(hexString) {
@@ -122,8 +132,9 @@ function decode(hexString) {
// ==> Error: Invalid hex string
```

-Note that the code above is for demonstration purpose, which is neither the simpliest nor a
-reference implementation. For example, a better encoding implementation in JavaScript can be:
+Note that the code above is for demonstration purposes; it is neither the
+simplest nor a reference implementation. For example, a better encoding
+implementation in JavaScript can be:

```js
function encode(buffer) {
@@ -134,34 +145,40 @@ function encode(buffer) {
// ==> '48656C6C6F20576F726C64'
```

-Although hex format has some defects as discussed in the motivation section above, it is still
-preferred over other formats (see alternatives section) with great advantages in simplicity and
-uniformity. In addition, it is very easy to be implemented in all languages and does not introduce
-escape character U+005C (\\).
+Although hex format has some defects as discussed in the motivation section
+above, it is still preferred over other formats (see alternatives section)
+with great advantages in simplicity and uniformity. In addition, it is very
+easy to implement in all languages and does not introduce the escape character
+U+005C (\\).

-In hex format, we are not able to easily recognize special parts like `t`, `_r` and `_i`. This can
-be solved by other utility functions that further operates on this key format, or just memorizing
-the hex representation of these magic bytes.
+In hex format, we are not able to easily recognize special parts like `t`, `_r`
+and `_i`. This can be solved by other utility functions that further operate on
+this key format, or just by memorizing the hex representation of these magic
+bytes.
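For illustration, the same encoding contract can be sketched in Rust. This is
a hypothetical helper, not an existing TiKV API; it mirrors the rules above
(uppercase two-digit hex, strict rejection of odd-length or non-hex input),
and the usage shows that magic markers stay recognizable, e.g. `745F72`
decodes to `t_r`:

```rust
// Encode bytes as uppercase two-digit hex, per the rules in this RFC.
fn encode(buffer: &[u8]) -> String {
    buffer.iter().map(|b| format!("{:02X}", b)).collect()
}

// Strict decoder: errors on odd length, non-ASCII, or non-hex characters.
fn decode(hex: &str) -> Result<Vec<u8>, String> {
    if !hex.is_ascii() || hex.len() % 2 != 0 {
        return Err("Invalid hex string".to_owned());
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| {
            u8::from_str_radix(&hex[i..i + 2], 16)
                .map_err(|_| "Invalid hex string".to_owned())
        })
        .collect()
}

fn main() {
    assert_eq!(encode(b"buffer"), "627566666572");
    assert_eq!(decode("627566666572").unwrap(), b"buffer".to_vec());
    // The magic markers remain easy to spot in hex form.
    assert_eq!(decode("745F72").unwrap(), b"t_r".to_vec());
    assert!(decode("ABC").is_err()); // odd number of characters
    assert!(decode("GG").is_err()); // non-hex characters
}
```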
+In hex format, we are not able to easily recognize special parts like `t`, `_r` +and `_i`. This can be solved by other utility functions that further operates on +this key format, or just memorizing the hex representation of these magic bytes. # Drawbacks -- We need to memorize hex representation of magic bytes, or implement additional helper tools. +- We need to memorize hex representation of magic bytes, or implement additional +helper tools. -- It is not compatible with current formats and we need to write format converters. +- It is not compatible with current formats and we need to write format + converters. # Alternatives -Other binary-to-text encodings like Base64, Percent-Encoding and punycode are investigated. +Other binary-to-text encodings like Base64, Percent-Encoding and punycode are +investigated. -For Base64 and punycode, it is nearly impossible to read magic bytes manually because there is no -fixed markers in encoded representations, while for current solution it is still approachable by -memorizing hex encoded markers, i.e. `5F72` for `_r` and `5F69` for `_i`. +For Base64 and punycode, it is nearly impossible to read magic bytes manually +because there is no fixed markers in encoded representations, while for current +solution it is still approachable by memorizing hex encoded markers, i.e. `5F72` +for `_r` and `5F69` for `_i`. -For Percent-Encoding, it is a nice format. But due to historical reasons its existing -implementations are varied and it is very hard to find a library that sticks to the RFC. In -addition, RFC allows non-identical encoded result so that it becomes impossible to just do simple -`grep` when searching for key existance in logs. This means that we must define and implement our -own Percent-Encoding derivative, which is more complicated than current solution. +For Percent-Encoding, it is a nice format. But due to historical reasons its +existing implementations are varied and it is very hard to find a library that +sticks to the RFC. In addition, RFC allows non-identical encoded result so that +it becomes impossible to just do simple `grep` when searching for key existance +in logs. This means that we must define and implement our own Percent-Encoding +derivative, which is more complicated than current solution. # Unresolved questions