Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce latest cf #95

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 80 additions & 0 deletions text/0095-add-latest-cf.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
# Introduce latest cf

- RFC PR: https://github.com/tikv/rfcs/pull/95
- Tracking Issue: https://github.com/tikv/repo/issues/0000

## Summary

Add a cf (column family in RocksDB) named "latest" to store the latest version of keys in MVCC.

## Motivation

As we all know, currently TiKV stores all its data in RocksDB. It creates a cf (column family) named "write" to store all the available versions. The key format looks like following:

```
| key | version |
```

. Version is a 64 bits number and encoded in desc order (larger goes first).

When reading a key with version v0, TiKV is expected to return the largest version v1 that v1 <= v0. Because TiKV has no idea what version is available, so it has to create an iterator and use version v0 to encode a seek key. If any key is hit, then the key should be the requested key.

The procedure is straightforward but expensive. Creating an iterator in RocksDB is not free. Even for point get, an iterator is still necessary. And seek operation is also expensive, almost as expendsive as creating iterator. To avoid seeking too many times, we introduce `near_seek` to TiKV in the early days, which tries `next` several times before fallback to `seek`.

As explained, the reason why seek is necessary is because TiKV has no idea what versions are available. Otherwise it can just use `get` to query the specific key, which is a lot cheaper.

## Detailed design
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about lightning, cdc, br, tiflash's compatibility?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are not part of TiKV. I would like to talk about them in another doc.


TiKV doesn't need to know all existing versions of keys. In fact, most of the time, v0 is larger than any existing versions of keys in TiKV if there is more read than write. So it should be enough to just let TiKV knows the latest version of all keys.

The RFC propose to add a new cf named "latest". When a key is inserted using transaction API, it should update write cf (and default cf) as current does. In addition, it also insert a key to latest cf with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this double write may also double the total disk size in some scenarios.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for those new table(new range) in the existing cluster and the new created cluster we can use the new mechanism(only write latest cf, move the old version to write cf only when the older version existed) is a better choice?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would work, though it loses the ability to upgrade existing table to new format online.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can introduce 3 formats. One is the original one, second is double write, third is move on update. They can be upgraded online one by one. For new tables, third format is used by default.

- key set the original key without any encoding or version
- value set the same value in write cf but include the corresponding version.

For example, insert k1 with version v0 and value dummy will insert two keys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will double the write amplification, not only by size but also by key value pairs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's addressed in the drawback section.

- to write cf, k1|v0 -> (dummy and other meta)
- to latest cf, k1 -> (dummy, v0 and other meta)

So all keys in latest cf represent the latest version of all keys.

When all the versions of a key are gced, then it should also delete the key in latest cf only when it matches the last gc key version.

When a key is queried, it should query latest cf using `get` first. If nothing is found or the version is larger than requested, it should query the write cf as fallback. In most case, only one `get` is performed.

When a range scan is triggered, it should scan the latest cf directly. If a larger version is met, it should fallback to seek write cf for that specific key instead. Because only the latest versions are stored in latest cf, so the keys needs to be scanned will be way fewer than the write cf. And in most cases, only one `seek` is performed, and all other operations are `next`.

All of the fallback queries should be performed lazily.

The improvment should be very significant when update keys frequently.

### Compatibility

Because all existing cfs are updated just as before, so there are no major compatibility issues.

But using latest cf should be triggered explicitly by client. Client should ensure only when it updates all keys with latest cf will it ask TiKV to query using latest cf.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit of complicated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually it won't as the client already needs to manage different ranges, just like table in TiDB or key space.


Take TiDB as an example, it can add a new storage format at table level. Perhaps even add a new DDL job for table to changing the storage format. In the new format, latest cf is always updated. And only trigger TiKV to use latest cf when the target table is fully upgraded to the new format.

As a new cf is added, it needs to be also included in the raft snapshot between replicas.

### Why use a new cf?

If the latest keys are written to write cf instead, then it will break compatibility. It also makes range scan less efficient as more version need to be scanned and skipped.

## Drawbacks

It introduces a write amplification appearantly. That is also the reason why the access pattern needs to be controlled by client. Client should enable latest cf only when it knows a range of keys are updated very often and can be benificial from the change.

On the other hand, the additional write is just a key in a different cf and a value that is probably not larger than 255 bytes, the overhead may not be very signifiant. More experiments are needed.

## Alternatives

unistore separates the latest version and other versions by adjust file format. So when flushing or compacting, it will make latest versions key be the first part, and the rest in the second part. This approach doesn't have write overhead, but is not backward compatible in TiKV.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about create a history cf to store as many history version as it can, the write cf store the latest several versions, and background thread like GC thread move existed history versions to history cf.
For point get, using RocksDB's user timestamp feature to get the latest version can see, this also can prevent creating iterator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no compatibility issue for this solution.

Copy link
Member Author

@BusyJay BusyJay Jun 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compaction filter is not reliable, we have seen issue caused by not having compaction in time like tikv/tikv#12729. This proposal doesn't depend on the user timestamp, which is not used in production yet. And can be landed table by table instead of the whole cluster, have less impact on existing code. And this proposal should have better scan performance than relying on compaction filter.

Copy link
Member Author

@BusyJay BusyJay Jun 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added as an alternative with more argument.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using history cf doesn't need the compaction filter, gc just check the history cf sst file's max ts to determine wether to drop the whole sst file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compaction filter is not used for dropping history sst, but for moving data from write cf to history cf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compaction filter is not necessary, I think version statistics-based GC is much better.


Another proposal has also been discussed in the past that instead of adding latest cf, adding a history cf to store as many as versions. All keys are written to write cf first, and then using compaction filter to move all versions except the latest to history cf. This approach delay the additional writes to background job, so may have less impact on the foreground writes. But it has following shortcomings:
- compaction filter is not reliable. The timing it's triggered can be tricky. We have observed issue that introduced by compaction not in time. tikv/tikv#12729.
- compaction filter only works on SST files, versions in memory are still mixed.
- point get still requires seek unless we switch to user timestamp completely, which is not used in production yet.
- If we remove KV WAL completely, writing in compaction can be expensive as it needs to be either ingested as new SST or triggers flush, otherwise restarting TiKV may lose data.

## Unresolved questions