
physical isolation between region #93

Merged 5 commits on Aug 25, 2022

Conversation


@BusyJay BusyJay commented Jun 2, 2022

This is the followup design of #82.

Rendered version


### Data layout

A directory “tablets” will be created inside the data directory to store tablet data. A tablet is named `${region_id}_${tablet_index}`, where `tablet_index` is the raft log index at which the tablet was created. For example, when region 2 is created by split, its index is initialized to 5, so its name is 2_5. If a replica of region 2 catches up by snapshot and the snapshot index is 11, then after applying the snapshot the tablet name becomes 2_11. Including `tablet_index` allows data to be cleaned up and applied quickly. For example, if queries are still running on 2_5 while the follower is about to apply a snapshot at index 11, both directories can be kept, and 2_5 can be deleted once all existing queries finish.
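The naming rule above can be sketched in Rust; the function names are illustrative, not TiKV's actual code:

```rust
// Sketch of the tablet directory naming scheme: names like "2_5" encode
// (region_id, tablet_index). After applying a snapshot at index 11 the
// live tablet becomes "2_11", while "2_5" may linger until readers finish.

fn tablet_dir_name(region_id: u64, tablet_index: u64) -> String {
    format!("{}_{}", region_id, tablet_index)
}

fn parse_tablet_dir_name(name: &str) -> Option<(u64, u64)> {
    let (region, index) = name.split_once('_')?;
    Some((region.parse().ok()?, index.parse().ok()?))
}

/// Among all tablet directories of one region, the one with the largest
/// tablet_index is current; the rest are garbage once queries drain.
fn latest_tablet<'a>(names: &[&'a str], region_id: u64) -> Option<&'a str> {
    names
        .iter()
        .filter_map(|n| parse_tablet_dir_name(n).map(|(r, i)| (r, i, *n)))
        .filter(|(r, _, _)| *r == region_id)
        .max_by_key(|(_, i, _)| *i)
        .map(|(_, _, n)| n)
}

fn main() {
    assert_eq!(tablet_dir_name(2, 5), "2_5");
    // Region 2 has both an old and a snapshot-applied tablet on disk.
    assert_eq!(latest_tablet(&["2_5", "2_11", "3_5"], 2), Some("2_11"));
}
```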
Contributor

I have two questions here:

  • Does this mean one rocksdb corresponds to one raft group? Is it unlikely that we will have multiple regions sharing a rocksdb in the future?
  • Will space GC be more efficient than before?

Member Author

we will have multiple regions sharing a rocksdb in the future?

I would not close the door for the possibility. But for now I don't want to implement such support for simplicity.

space GC

I'm not sure what it is.

Contributor

By space GC I mean things like region removal, which needs to wait for compaction to physically delete the data.

Member Author

Yes, removing a peer is just deleting the directory.


Replication works almost the same as in v6.x, which uses the raft algorithm for log replication and snapshot replication.

The tablet itself is a complete snapshot of a region, so we no longer need to scan and generate snapshot files. To generate a snapshot, just use the checkpoint API of RocksDB. Because atomic flush is enabled, the checkpoint is complete and consistent without flushing first. Sending a snapshot needs a new protocol, as we now send a whole RocksDB instance instead of several SST files. In v6.x, snapshots are generated by scanning and writing, so the generated files are probably in the system cache and no flow control is needed. With this RFC, all files are just hard links; reading them may incur IO if they were generated by compactions long ago. Hence we need to do flow control on sending.
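The send-side flow control could be as simple as a token bucket over bytes read. The sketch below is a minimal illustration; the struct, rates, and names are assumptions, not TiKV's actual limiter:

```rust
use std::time::{Duration, Instant};

/// A token bucket: refills at `rate` bytes/sec, bursts up to `capacity`.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last: Instant,
}

impl TokenBucket {
    fn new(rate: f64, capacity: f64) -> Self {
        TokenBucket { capacity, tokens: capacity, rate, last: Instant::now() }
    }

    /// Try to take `bytes`; on shortfall, return how long the sender
    /// should wait before reading the next chunk of SST data.
    fn consume(&mut self, bytes: f64) -> Option<Duration> {
        let now = Instant::now();
        self.tokens = (self.tokens + self.rate * now.duration_since(self.last).as_secs_f64())
            .min(self.capacity);
        self.last = now;
        if self.tokens >= bytes {
            self.tokens -= bytes;
            None
        } else {
            Some(Duration::from_secs_f64((bytes - self.tokens) / self.rate))
        }
    }
}

fn main() {
    // 100 MiB/s limiter with an 8 MiB burst: the first chunk passes,
    // a second oversized chunk must wait.
    let mut tb = TokenBucket::new(100.0 * (1 << 20) as f64, 8.0 * (1 << 20) as f64);
    assert!(tb.consume(4.0 * (1 << 20) as f64).is_none());
    assert!(tb.consume(8.0 * (1 << 20) as f64).is_some());
}
```

Such a limiter could also have its `rate` adjusted at runtime, which is what the PD-tuning suggestion below amounts to.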
Contributor

Can we make the flow control tunable by PD? I'm planning to change the unit of store limit from count to IO size, so that it adapts to different region sizes and is more intuitive to users.
Also, we may need to feed the sending/applying IO back to PD to help balance network IO.

Member Author

Good idea!


“do flow control with sending”: can the existing snapshot flow control mechanism work here?

Contributor

@nolouch nolouch Jul 20, 2022

I think currently only the snapshot IO limiter in TiKV can limit sending, and it's also shared with receiving.

@BornChanger BornChanger left a comment

Can you also say something about the changes to snapshots?


BusyJay commented Jul 15, 2022

Member

@zhangjinpeng87 zhangjinpeng87 left a comment

LGTM, amazing feature

@BornChanger


Because bloom filters are allocated up front, as a fraction of capacity, when a memtable is created, the memory consumption can be large even without any workload if there are a lot of tablets. The configuration of v6.1 is tuned for mixed storage; for separate storage we can choose a smaller memtable size, as a memtable only contains data from one region, and the factor can also be reduced. We change it from 0.1 to 0.05 in the prototype.
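The effect of shrinking the factor can be seen with back-of-the-envelope arithmetic. The helper below is purely illustrative; the tablet count and memtable size are made-up numbers, not recommended settings:

```rust
/// Rough memory pre-allocated for bloom filters across tablets: each
/// memtable reserves about `memtable_bytes * ratio_percent / 100` up
/// front, so the idle footprint scales with the number of tablets.
fn idle_bloom_bytes(
    tablets: u64,
    memtables_per_tablet: u64,
    memtable_bytes: u64,
    ratio_percent: u64,
) -> u64 {
    tablets * memtables_per_tablet * memtable_bytes * ratio_percent / 100
}

fn main() {
    // Illustrative: 1000 tablets, 1 active memtable each, 32 MiB memtables.
    let before = idle_bloom_bytes(1000, 1, 32 << 20, 10); // factor 0.1
    let after = idle_bloom_bytes(1000, 1, 32 << 20, 5);   // factor 0.05
    // Halving the factor halves the idle reservation (~3.1 GiB -> ~1.6 GiB).
    assert_eq!(before, 2 * after);
    assert_eq!(before, 3_355_443_200);
}
```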

By separating RocksDB instances, only the apply state and region state remain in the raft CF of the KV DB. Writing the two states on every apply is inefficient and triggers unnecessary compaction. We will move these states to the raft engine after disabling the KV WAL.
@tonyxuqqi tonyxuqqi Jul 18, 2022

"Writing two states every time is inefficient and triggers unnecessary compaction": so what is your proposal in this RFC? (Moving these states to the raft engine alone doesn't seem to answer the question.) And if the apply state and region state are not persisted on each apply, how does that impact the recovery logic after a panic?

Member Author

It's covered in #94.


It's covered in #94.
OK. You may mention that this issue is covered by #94 to avoid confusion. Thanks.

Member Author

I think it's stated clearly in the "Disable KV WAL" section:

Disabling KV WAL is a complicated design, it will be proposed as a separate RFC.

I will add a link when #94 is merged.


For example, if region 2 is split into region 2 and region 3 at index 11, tablets 2_11 and 3_5 are created. Before the split command is persisted in all new tablets, new writes go to the new tablets, but the tablet index in the raft db of region 2 should still point to the old tablet, and the old tablet should be kept. After all tablets have persisted the split command, region 2 can point to the new tablet, and the old tablet can be removed after all queries are finished.
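The handover rule above can be sketched as a tiny state machine; the types and field names below are illustrative, not the actual raftstore code:

```rust
// Region 2's tablet pointer in the raft db keeps referencing the
// pre-split tablet until every new tablet has persisted the split
// command; only then does it advance, making the old directory garbage.

#[derive(Clone, Copy, PartialEq, Debug)]
struct TabletId {
    region: u64,
    index: u64,
}

struct SplitHandover {
    old: TabletId,
    new: TabletId,             // new tablet for the surviving region
    pending_persist: Vec<u64>, // regions whose tablet hasn't flushed the split yet
}

impl SplitHandover {
    /// Called when one new tablet reports the split command durable.
    fn persisted(&mut self, region: u64) {
        self.pending_persist.retain(|r| *r != region);
    }

    /// The tablet pointer recorded in the raft db for the surviving region.
    fn current_tablet(&self) -> TabletId {
        if self.pending_persist.is_empty() { self.new } else { self.old }
    }
}

fn main() {
    // Region 2 (tablet 2_5) splits into regions 2 and 3 at index 11.
    let mut h = SplitHandover {
        old: TabletId { region: 2, index: 5 },
        new: TabletId { region: 2, index: 11 },
        pending_persist: vec![2, 3],
    };
    assert_eq!(h.current_tablet().index, 5); // still the old tablet
    h.persisted(3);
    h.persisted(2);
    assert_eq!(h.current_tablet().index, 11); // now safe to advance
}
```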

FreezeAndClone can create multiple references to the same frozen memtables, which would cause duplicated flushes. To avoid the unnecessary IO, RocksDB needs to support a flush filter, which filters out all keys that are not in the current region and rebuilds the bloom filter. Because split creates a new tablet, the flush filter will not corrupt existing rocksdb snapshots. With the flush filter, every tablet only writes data in its own range, at a small extra CPU cost and no extra IO cost.
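The flush filter's behavior on a cloned memtable can be illustrated with plain sorted key/value pairs. This is a sketch assuming bytewise key comparison; the function name is made up:

```rust
// When a cloned frozen memtable is flushed, keys outside the tablet's
// range [start, end) are dropped instead of written, so each tablet's
// SSTs end up containing only its own data.

fn flush_with_filter<'a>(
    entries: &[(&'a [u8], &'a [u8])], // sorted (key, value) pairs of a frozen memtable
    start: &[u8],
    end: &[u8],
) -> Vec<(&'a [u8], &'a [u8])> {
    entries
        .iter()
        .filter(|(k, _)| *k >= start && *k < end)
        .cloned()
        .collect()
}

fn main() {
    // A parent memtable shared by two tablets after a split at key "n".
    let mem: &[(&[u8], &[u8])] = &[(b"a1", b"x"), (b"m1", b"y"), (b"z1", b"z")];
    // The left tablet owns [a, n): it flushes "a1" and "m1", drops "z1".
    let kept = flush_with_filter(mem, b"a", b"n");
    assert_eq!(kept.len(), 2);
}
```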


FreezeAndClone can create multiple references to the same frozen memtables
In the diagram above, it appears that all frozen memtables of the parent tablet are flushed into SSTs, and the new tablets do not have frozen memtables from the parent once created.

Member Author

Unfortunately mermaid doesn't allow me to include same rectangle in two graphs.


### Merge

The merge algorithm is almost the same as in v6.1. To make it easy, we introduce a new interface: FreezeAndMerge. It freezes the mutable memtables of the two tablets, and then merges their SST files by level and their frozen memtables. After the merge, the new tablet needs to adjust its Log Sequence Number (LSN) to the maximum of the two original tablets. To guarantee correctness, merge should be triggered only after cleanup has finished in both tablets, so that each contains only the data within its region; this can be verified by an additional check. For the raft cf, there is no need to merge; only the target tablet's version needs to be kept. After the merge, there may be LSN regression across frozen memtables. This can only happen when two frozen memtables come from the two different tablets. As explained, those memtables can't overlap with each other, so the LSN regression does not break correctness.
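Two of the rules above, the LSN adjustment and the no-overlap precondition, are small enough to state directly in code. These are illustrative helpers, not the FreezeAndMerge implementation:

```rust
/// After FreezeAndMerge, the merged tablet must issue sequence numbers
/// above both sources, so new writes order after everything in either.
fn merged_start_lsn(source_lsn: u64, target_lsn: u64) -> u64 {
    source_lsn.max(target_lsn)
}

/// The pre-merge check: each tablet's cleaned-up key range [start, end)
/// must not overlap the other's, which is what makes the residual LSN
/// regression between their frozen memtables harmless.
fn ranges_disjoint(a: (&[u8], &[u8]), b: (&[u8], &[u8])) -> bool {
    a.1 <= b.0 || b.1 <= a.0
}

fn main() {
    assert_eq!(merged_start_lsn(100, 250), 250);
    // Adjacent regions [a, m) and [m, z) are mergeable...
    assert!(ranges_disjoint((b"a", b"m"), (b"m", b"z")));
    // ...but overlapping ranges mean cleanup has not finished.
    assert!(!ranges_disjoint((b"a", b"p"), (b"m", b"z")));
}
```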
@tonyxuqqi tonyxuqqi Jul 18, 2022

Will both source and target regions need to stop writing before the merge? Also, will source regions need to flush memtables before the merge?
The current implementation in RocksDB assumes both; I think it would be better to mention them explicitly here.

Member Author

Will both source and target region need to stop writing before the merge?

Yes, but it requires no change, as it's part of the existing merge algorithm.

Also will source regions need to flush memtable before the merge?

No.

The current implementation in RocksDB assumes both of that,

I don't think so.


Also will source regions need to flush memtable before the merge? No.
In today's implementation (Xinye's PR in review), we try to avoid the complexity of merging WAL files and thus require that all source regions' memtables are flushed and there is no WAL data. (Merging only memtables without WAL files can lose data on a crash.)

Member Author

I think there is a misunderstanding. First, this RFC requires disabling the WAL, so there is no requirement to merge WALs. Second, it's OK to lose data, as addressed in the next paragraph.


#### Upgrade

The storage architecture is very different between v6.1 and dynamic regions. Metadata changes can be applied very easily during a rolling update; however, migrating data is nearly impossible. There are two problems:


every -> very

- Multiple regions share the same rocksdb in v5.x; splitting them into tablets takes too much time and can bring many additional writes;
- There are many small regions in v5.x, possibly more than 100k in a single instance. Upgrading them all at once would probably OOM.

To solve the problem, let's introduce a migration stage. Before upgrading, PD should merge small regions into large ones. When the region count is small enough, or the size of a region becomes large enough, PD triggers tablet split. After all regions store their data in tablets, the upgrade is finished. PD should provide a switch for users to pause, resume, check the progress of upgrading, and control its speed. Upgrading is enabled by default, and service should not be interrupted during that time.


“PD should merge small regions into large one”: at this stage, is the data moved to multiple rocksdbs or still in a single rocksdb? I suppose it's still in a single rocksdb; it's better to mention that explicitly here.

Member Author

The original sentence is "Before upgrading, PD should merge small regions into large one". It's not upgraded yet, so there is no tablet.


The original sentence is "Before upgrading, PD should merge small regions into large one". It's not upgraded, there is no tablet.

OK



“PD triggers tablet split”: here the tablet split will split regions from the single rocksdb into tablets with multiple rocksdbs, right? That's different from a tablet split within either a single rocksdb or multiple rocksdbs. It probably needs some details.

Also, during the upgrade we will have some regions sharing a rocksdb instance while other regions have dedicated rocksdb instances. It seems to me this state can last very long. So, back to Shuning's earlier question, we may need to support multiple regions sharing a single rocksdb, as that's the situation during upgrade anyway.

Member Author

It's different with tablet split of either single rocksdb or multi rocksdb.

Actually it can share the same method as the multi-rocksdb split. But I will leave that as an implementation decision.

we may support multiple regions sharing single rocksdb as it's the situation in upgrade anyway.

Yes, supporting 1 mixed storage + n isolated storages is possible. But that architecture is too complicated, and has few benefits compared to a fully isolated solution.

@tonyxuqqi tonyxuqqi Jul 19, 2022

It's different with tablet split of either single rocksdb or multi rocksdb.
I think there are some details here worth discussing. One option is to first move the region from the single rocksdb to a dedicated rocksdb and then use the multi-rocksdb method for splitting.
The other option is to read the data of the new regions directly and create a rocksdb instance from it. To me the first option is simpler and better.

@tonyxuqqi tonyxuqqi Jul 19, 2022

  • Yes, supporting 1 mixed storage + n isolated storage is possible. But it's a too complicated architecture, and has not much benefits compared to fully isolated solution.

My point is that during the upgrade (I suppose regions are upgraded one by one), we have 1 mixed storage and n isolated storages anyway. And this period can be very long depending on the data size, so we need to implement it anyway as part of the online upgrade?

Member Author

first move the region from single rocksdb to multi rocksdb and then use multi rocksdb's method for splitting.

I don't see the point of moving it to multi rocksdb first. Just using FreezeAndClone should be sufficient.

so we need to implement it anyway as part of the online upgrade?

Not really. If we find the implementation is too complicated, we can still fall back to offline upgrading or primary/secondary upgrading. Nevertheless, the upgrading mode is not supposed to be performant; it's a temporary state. For example, we may disable certain features while upgrading. All of this RFC is designed for physical isolation.



Is the displayed format expected?

Signed-off-by: Jay Lee <[email protected]>

### Data layout

A directory “tablets” will be created inside the data directory to store the data of the tablet. The tablet will be named as `${region_id}_${tablet_index}`. `tablet_index` is the raft log index that the tablet is corresponding to when being created. For example, when region 2 is created by split, its index is initialized to 5, so its name is 2_5. A replica of region 2 is catching up data by snapshot, and the snapshot index is 11, then after applying the snapshot, the tablet name becomes 2_11. Adding `tablet_index` is to allow quickly cleaning and applying of data. For example, there are still queries running on 2_5 while follower is about to apply a snapshot at index 11, the two directories can be kept and 2_5 can be deleted after all existing queries are finished.
Member

the rendered version is broken in ${region_id}_${tablet_index}

Member

seems you just forgot to update the link


FreezeAndClone will hardlink all SST files of the original tablet, so every tablet has data that is not in its own range. After split, DeleteFilesInRange can be used to quickly free the hard links, and then compaction filter + manual compaction can be used to clean up. Again, because new tablets are created, existing snapshots won't be corrupted. The cleanup can be done slowly in the background.
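The hardlink step itself is cheap and easy to demonstrate with the standard library alone; the paths and directory layout below are illustrative:

```rust
use std::fs;
use std::path::Path;

/// Clone a tablet directory by hard-linking its SST files: no data is
/// copied, and both directories share the same inodes afterwards.
fn clone_tablet_dir(src: &Path, dst: &Path) -> std::io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        if entry.path().extension().map_or(false, |e| e == "sst") {
            fs::hard_link(entry.path(), dst.join(entry.file_name()))?;
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let base = std::env::temp_dir().join("tablet_clone_demo");
    let _ = fs::remove_dir_all(&base);
    let (old, new) = (base.join("2_5"), base.join("2_11"));
    fs::create_dir_all(&old)?;
    fs::write(old.join("000001.sst"), b"sst bytes")?;

    clone_tablet_dir(&old, &new)?; // cheap: only hard links are created

    // Removing the old directory (e.g. after queries drain) does not
    // affect the linked file: the inode still has one reference left.
    fs::remove_dir_all(&old)?;
    assert_eq!(fs::read(new.join("000001.sst"))?, b"sst bytes");
    Ok(())
}
```

Unlinking a whole file this way is also why DeleteFilesInRange is fast here: dropping a link never rewrites SST data.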
Member

Can we just exclude files whose range does not belong to the new tablet in FreezeAndClone? DeleteFilesInRange has some limitations.

Member Author

Yes, it can. Though it's more complicated.


The maximum size of a tablet is supposed to be 10GiB, so its file count will usually be 20 ~ 60. Hard-linking tens of files takes only hundreds of microseconds, which is less than a network roundtrip. A tablet only contains about 2 or 3 levels, so cleanup may only need to rewrite about 2 or 3 files.
Member

do we consider decreasing the max level to 3?

Member Author

No, unless there is a clear benefit.


Member

How does it know that cleanup has finished in both tablets?

Member Author

As said in the paragraph, "This can be done by an additional check." The simplest way is to schedule a pre-merge check to see whether all replicas have finished cleaning up.


Member

What do you mean by "split command is persisted in all new tablets"?

Member Author

Data is flushed to SST.

Member

How does it know all tablets have persisted the split command?

Member Author

The same mechanism used for removing the WAL. More precisely, only after all tablets have persisted the split command can the WAL (raft log) be deleted, and only after that can the old tablet be deleted. An easy implementation is to freeze all the memtables and wait until all frozen memtables are flushed.

@BusyJay BusyJay added the Final Comment Period This RFC is in the final comment period, and has a limited amount of time to give input on. label Aug 16, 2022
Member

@Connor1996 Connor1996 left a comment

LGTM

Contributor

@nolouch nolouch left a comment

lgtm

@BusyJay BusyJay merged commit fe50384 into tikv:master Aug 25, 2022
@BusyJay BusyJay deleted the dynamic-region branch August 25, 2022 05:14
pingyu pushed a commit to pingyu/tikv-rfcs that referenced this pull request Nov 4, 2022
pingyu added a commit that referenced this pull request Nov 8, 2022