# Summary

This RFC proposes a new unified and human-readable format for keys. All TiDB software stack
components (e.g. TiDB, TiKV, pd-ctl, tikv-ctl, ...) *must* follow this new format both when
printing out keys and when accepting keys from user input such as command line parameters.

# Motivation

Currently we have multiple human-readable formats for keys:

- Hex format, like `74000000000000001C5F7200000000000000FA`

- Protobuffer format, like `t\000\000\000\000\000\000\000\034_r\000\000\000\000\000\000\000\372`

- Go format, like `t\x00\x00\x00\x00\x00\x00\x00\x1c_r\x00\x00\x00\x00\x00\x00\x00\xfa`

This variety creates the following problems for us:

- Bad interoperability: The different formats are a pain point when using multiple tools
  together. For example, pd-ctl produces keys in Go format while tikv-ctl accepts keys in
  Protobuffer format.

- Hard to implement: There have been attempts to support several of these formats, like
  https://github.com/tikv/tikv/pull/3182 and https://github.com/pingcap/pd/pull/474, but it
  turns out that these implementations do not work in real-world scenarios. For example, it is
  not sufficient to decode the Go format by handling only hex escapes `\x..` [(1)], or to
  decode the Protobuffer format by handling only octal escapes `\...` [(2)].

- Hard to recognize different parts manually: In hex format, it is hard to pick out the
  different parts of a key by eye.

- Hard to use in shells: Because the Protobuffer and Go formats contain the escape character
  U+005C (\\), the output must be escaped again (converting `\` to `\\`) before it can be
  passed as a command line parameter in a shell. As a result, current documentation suggests
  usages that are wrong but *happen* to work under specially crafted code [(3)], or that do
  not work at all [(4)].

- Contains ambiguous multi-byte characters: The Go format outputs a Unicode character directly
  whenever the bytes form a valid Unicode sequence. For example, hex `00E6B58B01FF` is printed
  as `\x00测\x01\xff` in Go format. This may result in ambiguous byte representations across
  environments when such characters are printed, as the snippet below illustrates.
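
For illustration, the `测` above appears because the middle three bytes `E6B58B` are exactly
the UTF-8 encoding of that character; whether the printed character maps back to the same
bytes depends on the environment's encoding. A quick check in Node.js (the same environment
as the examples below):

```js
// E6 B5 8B is the UTF-8 encoding of '测'; a Go-style printer renders
// these three raw bytes as one visible character, which only maps back
// to the same bytes if the environment is also UTF-8.
console.log(Buffer.from('E6B58B', 'hex').toString('utf8')); // ==> '测'
console.log(Buffer.from('测', 'utf8')); // ==> <Buffer e6 b5 8b>
```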

To overcome the issues above, we need a new human-readable format, used uniformly across all
TiDB software stack components, which provides these features:

- Easy for a human to recognize the different parts after encoding.

- Easy to implement encoding and decoding, ideally with many existing libraries.

- No escape character U+005C (\\) after encoding so that it can be easily used in shells.

- No multi-byte characters after encoding to avoid encoding issues.

# Detailed Design

This RFC proposes to use a hex format like `74000000000000001C5F7200000000000000FA` as the
unified human-readable format for keys:

- Each byte must be encoded as two upper-case hex digits.

- The encoded hex digits must be concatenated without any other characters in between.

- The decoder must accept hex strings in either lower case or upper case.

- The decoder must throw an error when it encounters non-hex characters or an odd number of
  characters.

The encoding process accepts a byte array (the exact type varies across languages) and outputs
a string. It can be described by the JavaScript code below:

```js
function encode(buffer) {
return [...buffer]
.map(eachByte => eachByte.toString(16)) // convert to hex
.map(hex => hex.padStart(2, '0')) // pad with zero
.map(hex => hex.toUpperCase()) // convert to upper case
.join('');
}

// encode(Buffer.from('Hello World'))
// ==> '48656C6C6F20576F726C64'
// encode(Buffer.from([]))
// ==> ''
// encode(Buffer.from('测试', 'utf8'))
// ==> 'E6B58BE8AF95'
// encode(Buffer.from([0x62, 0x75, 0x66, 0x66, 0x65, 0x72]))
// ==> '627566666572'
```

The decoding process accepts a string and outputs a byte array. It can be described by the
JavaScript code below:

```js
function decode(hexString) {
// Check length
if (hexString.length % 2 !== 0) {
throw new Error('Invalid hex string');
}
// Check chars
if (!hexString.match(/^[a-fA-F0-9]*$/)) {
throw new Error('Invalid hex string');
}
const bytes = [];
for (let i = 0; i < hexString.length; i += 2) {
// Take two hex characters
let hexByte = hexString.slice(i, i + 2);
// Parse hex
let byte = parseInt(hexByte, 16);
bytes.push(byte);
}
return Buffer.from(bytes);
}

// decode('48656C6C6F20576F726C64').toString()
// ==> 'Hello World'
// decode('c0aa')
// ==> <Buffer c0 aa>
// decode('')
// ==> <Buffer >
// decode('c0ag')
// ==> Error: Invalid hex string
// decode('c0a')
// ==> Error: Invalid hex string
```
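
As a sanity check, encoding and decoding are inverse operations, so a round trip must
reproduce the input exactly. Using the two functions above:

```js
// Round trip: decoding an encoded buffer yields the original bytes, and
// re-encoding a decoded hex string yields the original string (modulo
// case, since the encoder always emits upper case).
const key = Buffer.from([0x74, 0x00, 0x00, 0x1C, 0x5F, 0x72]);
console.assert(decode(encode(key)).equals(key));
console.assert(encode(decode('74000000000000001c5f72')) === '74000000000000001C5F72');
```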

Note that the code above is for demonstration purposes only; it is neither the simplest nor a
reference implementation. For example, a better encoding implementation in JavaScript can be:

```js
function encode(buffer) {
return buffer.toString('hex').toUpperCase()
}

// encode(Buffer.from('Hello World'))
// ==> '48656C6C6F20576F726C64'
```
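
Decoding can similarly lean on the built-in hex codec, but note that Node.js
`Buffer.from(string, 'hex')` silently truncates at invalid input (a trailing odd digit or a
non-hex character) instead of throwing, so the explicit validation from the reference decoder
is still required. A sketch (the name `decodeSimple` is illustrative):

```js
function decodeSimple(hexString) {
  // Buffer.from(..., 'hex') drops a trailing odd digit and stops at the
  // first non-hex character, so validate explicitly before decoding.
  if (hexString.length % 2 !== 0 || !/^[a-fA-F0-9]*$/.test(hexString)) {
    throw new Error('Invalid hex string');
  }
  return Buffer.from(hexString, 'hex');
}

// decodeSimple('c0aa')
// ==> <Buffer c0 aa>
```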

Although the hex format has some defects, as discussed in the motivation section above, it is
still preferred over other formats (see the Alternatives section) for its great advantages in
simplicity and uniformity. In addition, it is very easy to implement in all languages and does
not introduce the escape character U+005C (\\).

In hex format, we are not able to easily recognize special parts like `t`, `_r` and `_i`. This
can be solved by utility functions that operate on top of this key format (see the sketch
below), or simply by memorizing the hex representation of these magic bytes.
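
For example, such a utility could scan a hex key for the well-known markers and annotate them.
A minimal sketch; the helper name and marker table are illustrative, not part of this RFC:

```js
// Hypothetical helper: point out well-known TiDB magic bytes in a hex key.
function annotateKey(hexKey) {
  const markers = { '5F72': '_r (row)', '5F69': '_i (index)' };
  const notes = [];
  const upper = hexKey.toUpperCase();
  if (upper.startsWith('74')) {
    notes.push("starts with 74 = 't' (table prefix)");
  }
  for (const [hex, meaning] of Object.entries(markers)) {
    const pos = upper.indexOf(hex);
    // Only report matches aligned to a byte boundary.
    if (pos !== -1 && pos % 2 === 0) {
      notes.push(`${hex} = ${meaning} at byte ${pos / 2}`);
    }
  }
  return notes;
}

// annotateKey('74000000000000001C5F7200000000000000FA')
// ==> [ "starts with 74 = 't' (table prefix)", '5F72 = _r (row) at byte 9' ]
```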

# Drawbacks

- We need to memorize the hex representation of magic bytes, or implement additional helper
  tools.

- It is not compatible with the current formats, so we need to write format converters.

# Alternatives

Other binary-to-text encodings like Base64, Percent-Encoding and Punycode were investigated.

For Base64 and Punycode, it is nearly impossible to read magic bytes manually because there
are no fixed markers in the encoded representations, while with the proposed solution it is
still approachable by memorizing the hex-encoded markers, i.e. `5F72` for `_r` and `5F69` for
`_i`.

Percent-Encoding is a nice format, but due to historical reasons its existing implementations
vary, and it is very hard to find a library that sticks to the RFC. In addition, the RFC
allows non-identical encoded results for the same input (as shown below), so it becomes
impossible to do a simple `grep` when searching for a key's existence in logs. This means we
would have to define and implement our own Percent-Encoding derivative, which is more
complicated than the proposed solution.
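
For instance, RFC 3986 treats upper-case and lower-case hex digits in percent escapes as
equivalent, so two different spellings of the same key are both valid. A quick Node.js
illustration:

```js
// '%2F' and '%2f' are both valid encodings of the byte 0x2F ('/'), so a
// literal grep for one spelling misses logs that use the other.
console.log(decodeURIComponent('%2F')); // ==> '/'
console.log(decodeURIComponent('%2f')); // ==> '/'
console.log('%2F' === '%2f');           // ==> false
```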

# Unresolved questions

None.

[(1)]: https://golang.org/ref/spec#Rune_literals
[(2)]: https://github.com/pingcap/pd/pull/1298/files#diff-ff78a54cb96e131d51e4628c92f70184R246
[(3)]: https://github.com/pingcap/docs/blob/e81f3225803d37ed4b23f3257dfa48fda38a22f4/tools/tikv-control.md#view-mvcc-of-a-given-key
[(4)]: https://github.com/pingcap/docs/blob/578c4cbb88e17ad55d0b6a99a1158710425f72fb/tools/pd-control.md#region-key---formatrawpbprotoprotobuf-key
