
physical isolation between region #93

Merged 5 commits on Aug 25, 2022

Conversation


@BusyJay BusyJay commented Jun 2, 2022

This is the followup design of #82.

Rendered version


### Data layout

A directory “tablets” will be created inside the data directory to store tablet data. A tablet is named `${region_id}_${tablet_index}`, where `tablet_index` is the raft log index at which the tablet was created. For example, when region 2 is created by split, its index is initialized to 5, so its name is 2_5. If a replica of region 2 catches up by snapshot and the snapshot index is 11, then after applying the snapshot the tablet name becomes 2_11. Including `tablet_index` allows data to be cleaned up and applied quickly. For example, if queries are still running on 2_5 while the follower is about to apply a snapshot at index 11, both directories can be kept, and 2_5 can be deleted once all existing queries finish.
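The naming rule above can be sketched in Rust; the function names are illustrative, not TiKV's actual code:

```rust
// Sketch of the tablet directory naming scheme: names like "2_5" encode
// (region_id, tablet_index). After applying a snapshot at index 11 the
// live tablet becomes "2_11", while "2_5" may linger until readers finish.

fn tablet_dir_name(region_id: u64, tablet_index: u64) -> String {
    format!("{}_{}", region_id, tablet_index)
}

fn parse_tablet_dir_name(name: &str) -> Option<(u64, u64)> {
    let (region, index) = name.split_once('_')?;
    Some((region.parse().ok()?, index.parse().ok()?))
}

/// Among all tablet directories of one region, the one with the largest
/// tablet_index is current; the rest are garbage once queries drain.
fn latest_tablet<'a>(names: &[&'a str], region_id: u64) -> Option<&'a str> {
    names
        .iter()
        .filter_map(|n| parse_tablet_dir_name(n).map(|(r, i)| (r, i, *n)))
        .filter(|(r, _, _)| *r == region_id)
        .max_by_key(|(_, i, _)| *i)
        .map(|(_, _, n)| n)
}

fn main() {
    assert_eq!(tablet_dir_name(2, 5), "2_5");
    // Region 2 has both an old and a snapshot-applied tablet on disk.
    assert_eq!(latest_tablet(&["2_5", "2_11", "3_5"], 2), Some("2_11"));
}
```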
Contributor

I have two questions here:

  • Does this mean one rocksdb corresponds to one raft group? Is it unlikely that we will have multiple regions sharing a rocksdb in the future?
  • Will space GC be more efficient than before?

Member Author

we will have multiple regions sharing a rocksdb in the future?

I would not close the door for the possibility. But for now I don't want to implement such support for simplicity.

space GC

I'm not sure what it is.

Contributor

By space GC I mean things like region removal, which needs to wait for compaction to physically delete the data.

Member Author

Yes, removing a peer is just deleting the directory.


Replication works almost the same as in v6.x, which uses the raft algorithm for log replication and snapshot replication.

The tablet itself is a complete snapshot of a region, so we no longer need to scan and generate snapshot files. To generate a snapshot, just use the checkpoint API of RocksDB. Because atomic flush is enabled, the checkpoint is complete and consistent without flushing first. Sending a snapshot needs a new protocol, as we now send a whole RocksDB instance instead of several SST files. In v6.x, snapshots are generated by scanning and writing, so the generated files are probably in the system cache and no flow control is needed. With this RFC, all files are just hard links; reading them may incur IO if they were generated by compactions long ago. Hence we need to do flow control on sending.
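The send-side flow control could be as simple as a token bucket over bytes read. The sketch below is a minimal illustration; the struct, rates, and names are assumptions, not TiKV's actual limiter:

```rust
use std::time::{Duration, Instant};

/// A token bucket: refills at `rate` bytes/sec, bursts up to `capacity`.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last: Instant,
}

impl TokenBucket {
    fn new(rate: f64, capacity: f64) -> Self {
        TokenBucket { capacity, tokens: capacity, rate, last: Instant::now() }
    }

    /// Try to take `bytes`; on shortfall, return how long the sender
    /// should wait before reading the next chunk of SST data.
    fn consume(&mut self, bytes: f64) -> Option<Duration> {
        let now = Instant::now();
        self.tokens = (self.tokens + self.rate * now.duration_since(self.last).as_secs_f64())
            .min(self.capacity);
        self.last = now;
        if self.tokens >= bytes {
            self.tokens -= bytes;
            None
        } else {
            Some(Duration::from_secs_f64((bytes - self.tokens) / self.rate))
        }
    }
}

fn main() {
    // 100 MiB/s limiter with an 8 MiB burst: the first chunk passes,
    // a second oversized chunk must wait.
    let mut tb = TokenBucket::new(100.0 * (1 << 20) as f64, 8.0 * (1 << 20) as f64);
    assert!(tb.consume(4.0 * (1 << 20) as f64).is_none());
    assert!(tb.consume(8.0 * (1 << 20) as f64).is_some());
}
```

Such a limiter could also have its `rate` adjusted at runtime, which is what the PD-tuning suggestion below amounts to.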
Contributor

Can we make the flow control tunable by PD? I'm planning to change the unit of store limit from count to IO size, so that it adapts to different region sizes and is more intuitive to users.
Also, we may need to feed the sending/applying IO back to PD to help balance network IO.

Member Author

Good idea!


“do flow control with sending”: can the existing snapshot flow control mechanism work here?

Contributor

@nolouch nolouch Jul 20, 2022

I think currently only the snapshot IO limiter in TiKV can limit sending, and it's also shared with receiving.

@BornChanger BornChanger left a comment

Can you also say something about the changes to snapshots?


BusyJay commented Jul 15, 2022

Member

@zhangjinpeng87 zhangjinpeng87 left a comment

LGTM, amazing feature

@BornChanger


Because bloom filters are allocated up front, as a fraction of capacity, when a memtable is created, the memory consumption can be large even without any workload if there are a lot of tablets. The configuration of v6.1 is tuned for mixed storage; for separate storage we can choose a smaller memtable size, as a memtable only contains data from one region, and the factor can also be reduced. We change it from 0.1 to 0.05 in the prototype.
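The effect of shrinking the factor can be seen with back-of-the-envelope arithmetic. The helper below is purely illustrative; the tablet count and memtable size are made-up numbers, not recommended settings:

```rust
/// Rough memory pre-allocated for bloom filters across tablets: each
/// memtable reserves about `memtable_bytes * ratio_percent / 100` up
/// front, so the idle footprint scales with the number of tablets.
fn idle_bloom_bytes(
    tablets: u64,
    memtables_per_tablet: u64,
    memtable_bytes: u64,
    ratio_percent: u64,
) -> u64 {
    tablets * memtables_per_tablet * memtable_bytes * ratio_percent / 100
}

fn main() {
    // Illustrative: 1000 tablets, 1 active memtable each, 32 MiB memtables.
    let before = idle_bloom_bytes(1000, 1, 32 << 20, 10); // factor 0.1
    let after = idle_bloom_bytes(1000, 1, 32 << 20, 5);   // factor 0.05
    // Halving the factor halves the idle reservation (~3.1 GiB -> ~1.6 GiB).
    assert_eq!(before, 2 * after);
    assert_eq!(before, 3_355_443_200);
}
```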

By separating RocksDB instances, only the apply state and region state remain in the raft CF of the KV DB. Writing the two states on every apply is inefficient and triggers unnecessary compaction. We will move these states to the raft engine after disabling the KV WAL.
@tonyxuqqi tonyxuqqi Jul 18, 2022

"Writing two states every time is inefficient and triggers unnecessary compaction": so what is your proposal in this RFC? (Moving these states to the raft engine alone doesn't seem to answer the question.) And if the apply state and region state are not persisted on each apply, how does that impact the recovery logic after a panic?

Member Author

It's covered in #94.


It's covered in #94.
OK. You may mention that this issue is covered by #94 to avoid confusion. Thanks.

Member Author

I think it's stated clearly in the "Disable KV WAL" section:

Disabling KV WAL is a complicated design, it will be proposed as a separate RFC.

I will add a link when #94 is merged.


For example, if region 2 is split into region 2 and region 3 at index 11, tablets 2_11 and 3_5 are created. Before the split command is persisted in all new tablets, new writes go to the new tablets, but the tablet index in the raft db of region 2 should still point to the old tablet, and the old tablet should be kept. After all tablets have persisted the split command, region 2 can point to the new tablet, and the old tablet can be removed after all queries are finished.
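The handover rule above can be sketched as a tiny state machine; the types and field names below are illustrative, not the actual raftstore code:

```rust
// Region 2's tablet pointer in the raft db keeps referencing the
// pre-split tablet until every new tablet has persisted the split
// command; only then does it advance, making the old directory garbage.

#[derive(Clone, Copy, PartialEq, Debug)]
struct TabletId {
    region: u64,
    index: u64,
}

struct SplitHandover {
    old: TabletId,
    new: TabletId,             // new tablet for the surviving region
    pending_persist: Vec<u64>, // regions whose tablet hasn't flushed the split yet
}

impl SplitHandover {
    /// Called when one new tablet reports the split command durable.
    fn persisted(&mut self, region: u64) {
        self.pending_persist.retain(|r| *r != region);
    }

    /// The tablet pointer recorded in the raft db for the surviving region.
    fn current_tablet(&self) -> TabletId {
        if self.pending_persist.is_empty() { self.new } else { self.old }
    }
}

fn main() {
    // Region 2 (tablet 2_5) splits into regions 2 and 3 at index 11.
    let mut h = SplitHandover {
        old: TabletId { region: 2, index: 5 },
        new: TabletId { region: 2, index: 11 },
        pending_persist: vec![2, 3],
    };
    assert_eq!(h.current_tablet().index, 5); // still the old tablet
    h.persisted(3);
    h.persisted(2);
    assert_eq!(h.current_tablet().index, 11); // now safe to advance
}
```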

FreezeAndClone can create multiple references to the same frozen memtables, which would cause duplicated flushes. To avoid the unnecessary IO, RocksDB needs to support a flush filter, which filters out all keys that are not in the current region and rebuilds the bloom filter. Because split creates a new tablet, the flush filter will not corrupt existing rocksdb snapshots. With the flush filter, every tablet only writes data in its own range, at a small extra CPU cost and no extra IO cost.
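The flush filter's behavior on a cloned memtable can be illustrated with plain sorted key/value pairs. This is a sketch assuming bytewise key comparison; the function name is made up:

```rust
// When a cloned frozen memtable is flushed, keys outside the tablet's
// range [start, end) are dropped instead of written, so each tablet's
// SSTs end up containing only its own data.

fn flush_with_filter<'a>(
    entries: &[(&'a [u8], &'a [u8])], // sorted (key, value) pairs of a frozen memtable
    start: &[u8],
    end: &[u8],
) -> Vec<(&'a [u8], &'a [u8])> {
    entries
        .iter()
        .filter(|(k, _)| *k >= start && *k < end)
        .cloned()
        .collect()
}

fn main() {
    // A parent memtable shared by two tablets after a split at key "n".
    let mem: &[(&[u8], &[u8])] = &[(b"a1", b"x"), (b"m1", b"y"), (b"z1", b"z")];
    // The left tablet owns [a, n): it flushes "a1" and "m1", drops "z1".
    let kept = flush_with_filter(mem, b"a", b"n");
    assert_eq!(kept.len(), 2);
}
```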


FreezeAndClone can create multiple references to the same frozen memtables
In the diagram above, it appears that all frozen memtables of the parent tablet are flushed into SSTs, and the new tablets do not have frozen memtables from the parent once created.

Member Author

Unfortunately mermaid doesn't allow me to include same rectangle in two graphs.


### Merge

The merge algorithm is almost the same as in v6.1. To make it easy, we introduce a new interface: FreezeAndMerge. It freezes the mutable memtables of the two tablets, and then merges their SST files by level and their frozen memtables. After the merge, the new tablet needs to adjust its Log Sequence Number (LSN) to the maximum of the two original tablets. To guarantee correctness, merge should be triggered only after cleanup has finished in both tablets, so that each contains only the data within its region; this can be verified by an additional check. For the raft cf, there is no need to merge; only the target tablet's version needs to be kept. After the merge, there may be LSN regression across frozen memtables. This can only happen when two frozen memtables come from the two different tablets. As explained, those memtables can't overlap with each other, so the LSN regression does not break correctness.
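Two of the rules above, the LSN adjustment and the no-overlap precondition, are small enough to state directly in code. These are illustrative helpers, not the FreezeAndMerge implementation:

```rust
/// After FreezeAndMerge, the merged tablet must issue sequence numbers
/// above both sources, so new writes order after everything in either.
fn merged_start_lsn(source_lsn: u64, target_lsn: u64) -> u64 {
    source_lsn.max(target_lsn)
}

/// The pre-merge check: each tablet's cleaned-up key range [start, end)
/// must not overlap the other's, which is what makes the residual LSN
/// regression between their frozen memtables harmless.
fn ranges_disjoint(a: (&[u8], &[u8]), b: (&[u8], &[u8])) -> bool {
    a.1 <= b.0 || b.1 <= a.0
}

fn main() {
    assert_eq!(merged_start_lsn(100, 250), 250);
    // Adjacent regions [a, m) and [m, z) are mergeable...
    assert!(ranges_disjoint((b"a", b"m"), (b"m", b"z")));
    // ...but overlapping ranges mean cleanup has not finished.
    assert!(!ranges_disjoint((b"a", b"p"), (b"m", b"z")));
}
```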
@tonyxuqqi tonyxuqqi Jul 18, 2022

Will both source and target regions need to stop writing before the merge? Also, will source regions need to flush memtables before the merge?
The current implementation in RocksDB assumes both; I think it would be better to mention them explicitly here.

Member Author

Will both source and target region need to stop writing before the merge?

Yes, but it requires no change, as it's part of the existing merge algorithm.

Also will source regions need to flush memtable before the merge?

No.

The current implementation in RocksDB assumes both of that,

I don't think so.


Also will source regions need to flush memtable before the merge? No.
In today's implementation (Xinye's PR in review), we try to avoid the complexity of merging WAL files and thus require that all source regions' memtables are flushed and there is no WAL data. (Merging only memtables without WAL files can lose data on a crash.)

Member Author

I think there is a misunderstanding. First, this RFC requires disabling the WAL, so there is no requirement to merge WALs. Second, it's OK to lose data, as addressed in the next paragraph.


#### Upgrade

The storage architecture is very different between v6.1 and dynamic regions. Metadata changes can be applied very easily during a rolling update; however, migrating data is nearly impossible. There are two problems:


every -> very

- Multiple regions share the same rocksdb in v5.x; splitting them into tablets takes too much time and can bring many additional writes;
- There are many small regions in v5.x, possibly more than 100k in a single instance. Upgrading them all at once would probably OOM.

To solve the problem, let's introduce a migration stage. Before upgrading, PD should merge small regions into large ones. When the region count is small enough, or the size of a region becomes large enough, PD triggers tablet split. After all regions store their data in tablets, the upgrade is finished. PD should provide a switch for users to pause, resume, check the progress of upgrading, and control its speed. Upgrading is enabled by default, and service should not be interrupted during that time.


“PD should merge small regions into large one”: at this stage, is the data moved to multiple rocksdbs or still in a single rocksdb? I suppose it's still in a single rocksdb; it's better to mention that explicitly here.

Member Author

The original sentence is "Before upgrading, PD should merge small regions into large one". It's not upgraded yet, so there is no tablet.


The original sentence is "Before upgrading, PD should merge small regions into large one". It's not upgraded, there is no tablet.

OK



“PD triggers tablet split”: here the tablet split will split regions from the single rocksdb into tablets with multiple rocksdbs, right? That's different from a tablet split within either a single rocksdb or multiple rocksdbs. It probably needs some details.

Also, during the upgrade we will have some regions sharing a rocksdb instance while other regions have dedicated rocksdb instances. It seems to me this state can last very long. So, back to Shuning's earlier question, we may need to support multiple regions sharing a single rocksdb, as that's the situation during upgrade anyway.

Member Author

It's different with tablet split of either single rocksdb or multi rocksdb.

Actually it can share the same method as the multi-rocksdb split. But I will leave that as an implementation decision.

we may support multiple regions sharing single rocksdb as it's the situation in upgrade anyway.

Yes, supporting 1 mixed storage + n isolated storages is possible. But that architecture is too complicated, and has few benefits compared to a fully isolated solution.

@tonyxuqqi tonyxuqqi Jul 19, 2022

It's different with tablet split of either single rocksdb or multi rocksdb.
I think there are some details here worth discussing. One option is to first move the region from the single rocksdb to a dedicated rocksdb and then use the multi-rocksdb method for splitting.
The other option is to read the data of the new regions directly and create a rocksdb instance from it. To me the first option is simpler and better.

@tonyxuqqi tonyxuqqi Jul 19, 2022

  • Yes, supporting 1 mixed storage + n isolated storage is possible. But it's a too complicated architecture, and has not much benefits compared to fully isolated solution.

My point is that during the upgrade (I suppose regions are upgraded one by one), we have 1 mixed storage and n isolated storages anyway. And this period can be very long depending on the data size, so we need to implement it anyway as part of the online upgrade?

Member Author

first move the region from single rocksdb to multi rocksdb and then use multi rocksdb's method for splitting.

I don't see the point of moving it to multi rocksdb first. Just using FreezeAndClone should be sufficient.

so we need to implement it anyway as part of the online upgrade?

Not really. If we find the implementation is too complicated, we can still fall back to offline upgrading or primary/secondary upgrading. Nevertheless, the upgrading mode is not supposed to be performant; it's a temporary state. For example, we may disable certain features while upgrading. All of this RFC is designed for physical isolation.



Is the displayed format expected?

Signed-off-by: Jay Lee <[email protected]>

### Data layout

A directory “tablets” will be created inside the data directory to store the data of the tablet. The tablet will be named as `${region_id}_${tablet_index}`. `tablet_index` is the raft log index that the tablet is corresponding to when being created. For example, when region 2 is created by split, its index is initialized to 5, so its name is 2_5. A replica of region 2 is catching up data by snapshot, and the snapshot index is 11, then after applying the snapshot, the tablet name becomes 2_11. Adding `tablet_index` is to allow quickly cleaning and applying of data. For example, there are still queries running on 2_5 while follower is about to apply a snapshot at index 11, the two directories can be kept and 2_5 can be deleted after all existing queries are finished.
Member

the rendered version is broken in ${region_id}_${tablet_index}

Member

seems you just forgot to update the link


FreezeAndClone will hardlink all SST files of the original tablet, so every tablet has data that is not in its own range. After split, DeleteFilesInRange can be used to quickly free the hard links, and then compaction filter + manual compaction can be used to clean up. Again, because new tablets are created, existing snapshots won't be corrupted. The cleanup can be done slowly in the background.
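The hardlink step itself is cheap and easy to demonstrate with the standard library alone; the paths and directory layout below are illustrative:

```rust
use std::fs;
use std::path::Path;

/// Clone a tablet directory by hard-linking its SST files: no data is
/// copied, and both directories share the same inodes afterwards.
fn clone_tablet_dir(src: &Path, dst: &Path) -> std::io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        if entry.path().extension().map_or(false, |e| e == "sst") {
            fs::hard_link(entry.path(), dst.join(entry.file_name()))?;
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let base = std::env::temp_dir().join("tablet_clone_demo");
    let _ = fs::remove_dir_all(&base);
    let (old, new) = (base.join("2_5"), base.join("2_11"));
    fs::create_dir_all(&old)?;
    fs::write(old.join("000001.sst"), b"sst bytes")?;

    clone_tablet_dir(&old, &new)?; // cheap: only hard links are created

    // Removing the old directory (e.g. after queries drain) does not
    // affect the linked file: the inode still has one reference left.
    fs::remove_dir_all(&old)?;
    assert_eq!(fs::read(new.join("000001.sst"))?, b"sst bytes");
    Ok(())
}
```

Unlinking a whole file this way is also why DeleteFilesInRange is fast here: dropping a link never rewrites SST data.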
Member

Can we just exclude files whose range does not belong to the new tablet in FreezeAndClone? DeleteFilesInRange has some limitations.

Member Author

Yes, it can. Though it's more complicated.


The maximum size of a tablet is supposed to be 10GiB, so its file count will usually be 20 ~ 60. Hard-linking tens of files takes only hundreds of microseconds, which is less than a network roundtrip. A tablet only contains about 2 or 3 levels, so cleanup may only need to rewrite about 2 or 3 files.
Member

do we consider decreasing the max level to 3?

Member Author

No, unless there is a clear benefit.


Member

How does it know that cleanup has finished in both tablets?

Member Author

As said in the paragraph, "This can be done by an additional check." The simplest way is to schedule a pre-merge check to see whether all replicas have finished cleaning up.


Member

What do you mean by "split command is persisted in all new tablets"?

Member Author

Data is flushed to SST.

Member

How does it know all tablets have persisted the split command?

Member Author

The same mechanism used for removing the WAL. More precisely, only after all tablets have persisted the split command can the WAL (raft log) be deleted, and only after that can the old tablet be deleted. An easy implementation is to freeze all the memtables and wait until all frozen memtables are flushed.

@BusyJay BusyJay added the Final Comment Period This RFC is in the final comment period, and has a limited amount of time to give input on. label Aug 16, 2022
Member

@Connor1996 Connor1996 left a comment

LGTM

Contributor

@nolouch nolouch left a comment

lgtm

@BusyJay BusyJay merged commit fe50384 into tikv:master Aug 25, 2022
@BusyJay BusyJay deleted the dynamic-region branch August 25, 2022 05:14
pingyu pushed a commit to pingyu/tikv-rfcs that referenced this pull request Nov 4, 2022
pingyu added a commit that referenced this pull request Nov 8, 2022