compaction: skip output level files with no data overlap #6021
The idea is to skip output level files that do not overlap with the data of the start level files during compaction. By an output level file overlapping with the data of a start level file, I mean that at least one key in the start level file falls inside the range of the output level file. For example, suppose an output level file *O* has range ["e", "f"] and keys "e" and "f", and a start level file *S* has range ["a", "z"] and keys "a" and "z". Although the range of file O overlaps with the range of file S, file O does not overlap with the data of file S.

So when is this idea useful? We know that when we do sequential writes, the generated SST files don't overlap with each other and all compactions are just trivial moves, which is perfect. However, if we do concurrent sequential writes in multiple ranges, life gets hard.

Take a relational database as an example. A common construction of the record keys is a table ID prefix concatenated with an auto-increment record ID (e.g. "1_1" means table 1, record 1). Now let's see what happens if we insert records into two tables (table 1 and table 2) in this order: "1_1", "2_1", "1_2", "2_2", "1_3", "2_3", "1_4", "2_4" ...

Assume that RocksDB uses level compaction and each memtable and SST file contains at most two keys. After putting eight keys, we get four level 0 files:

    L0: ["1_1", "2_1"], ["1_2", "2_2"], ["1_3", "2_3"], ["1_4", "2_4"]
    L1:

Then a level 0 compaction is triggered and we get this:

    L0:
    L1: ["1_1", "1_2"], ["1_3", "1_4"], ["2_1", "2_2"], ["2_3", "2_4"]

Then after putting eight more keys:

    L0: ["1_5", "2_5"], ["1_6", "2_6"], ["1_7", "2_7"], ["1_8", "2_8"]
    L1: ["1_1", "1_2"], ["1_3", "1_4"], ["2_1", "2_2"], ["2_3", "2_4"]

Now if a level 0 compaction is triggered, according to the current implementation, the start level inputs will be all files in level 0, which cover range ["1_5", "2_8"], and the output level inputs will be ["2_1", "2_2"] and ["2_3", "2_4"] because these two files overlap with the range of the start level. However, files ["2_1", "2_2"] and ["2_3", "2_4"] don't overlap with the data of the start level inputs at all.

So can we compact the start level inputs without rewriting these two output level files? The answer is yes, as long as we ensure that newly generated files don't overlap with existing files in the output level. We can use the ranges of skipped output level files as split points for the compaction output files. For this compaction, "2_1" will be a split point, which prevents the compaction from generating a file like ["1_8", "2_5"]. With this optimization, we save two file reads and writes (two of the six input files), which is 1/3 of the IO in this compaction.

While the above example seems a bit artificial, I also experimented with this idea on a real-world database. A simple sysbench insert benchmark on TiDB shows more than 30% compaction IO reduction in some cases. I think other similar databases can benefit from this optimization too.

Note that the current change is ugly, so just consider it a proof of concept implementation for now.
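The example above can be replayed with a small Python model. This is only a sketch of the idea, not the code in this PR: files are modeled as `(smallest, largest)` key pairs, and the names `range_overlaps`, `data_overlaps`, `pick_output_inputs`, and `split_outputs` are hypothetical helpers invented for illustration. It assumes keys compare as plain strings, as in the single-digit example.

```python
from bisect import bisect_left

def range_overlaps(file_range, lo, hi):
    """Range-based check: does the file range intersect [lo, hi]?"""
    smallest, largest = file_range
    return smallest <= hi and lo <= largest

def data_overlaps(file_range, sorted_keys):
    """Data-based check: is at least one key inside the file range?"""
    smallest, largest = file_range
    i = bisect_left(sorted_keys, smallest)
    return i < len(sorted_keys) and sorted_keys[i] <= largest

def pick_output_inputs(start_keys, output_files):
    """Split range-overlapping output files into rewritten and skipped."""
    keys = sorted(start_keys)
    lo, hi = keys[0], keys[-1]
    rewritten, skipped = [], []
    for f in output_files:
        if not range_overlaps(f, lo, hi):
            continue  # never an input, with or without the optimization
        (rewritten if data_overlaps(f, keys) else skipped).append(f)
    return rewritten, skipped

def split_outputs(merged_keys, skipped, max_keys=2):
    """Cut output files at max_keys and at the start of each skipped file."""
    split_points = [smallest for smallest, _ in skipped]
    files, current = [], []
    for key in merged_keys:
        # Cutting here keeps the current file from spanning a skipped file.
        crosses = bool(current) and any(
            current[0] < p <= key for p in split_points)
        if current and (len(current) >= max_keys or crosses):
            files.append((current[0], current[-1]))
            current = []
        current.append(key)
    if current:
        files.append((current[0], current[-1]))
    return files

l0_keys = ["1_5", "2_5", "1_6", "2_6", "1_7", "2_7", "1_8", "2_8"]
l1_files = [("1_1", "1_2"), ("1_3", "1_4"), ("2_1", "2_2"), ("2_3", "2_4")]

rewritten, skipped = pick_output_inputs(l0_keys, l1_files)
print("rewritten:", rewritten)
print("skipped:  ", skipped)
print("outputs:  ", split_outputs(sorted(l0_keys), skipped))
```

With two keys per file the outputs happen not to straddle the skipped files anyway. Passing `max_keys=3` (an assumption that departs from the example) shows the split point at "2_1" doing real work: it forces a cut between "1_8" and "2_5".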
@huachaohuang I considered this technique. It appeared to have bad interactions with range deletes. I did not prove or disprove a range delete problem; I simply took a different approach. I am only suggesting that you consider that potential problem.
@huachaohuang this idea is very similar to my proposal tikv/rust-rocksdb#375
@matthewvon can you give more details about the problem?
@zhangjinpeng1987 cool, I didn't notice that before.
This idea is cool. But |
@Little-Wallace that's a good point, just consider it an easy hack for now :)
@huachaohuang Range deletes for compactions are written within FinishCompactionOutputFiles(). There is no equivalent for flushes (I am rewriting that function to work with both flush and compaction). FinishCompactionOutputFiles() makes sure that range delete objects appropriately cover the key range of each .sst file being finished. My reading of the code suggests that omitting files from the middle of a large compaction could leave the omitted key range without range delete coverage. Hence, the range delete is "lost" for the files that were simply removed from the large compaction.
How about disabling this optimization when there is a range deletion in the range?
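A minimal sketch of such a guard, under stated assumptions: files are `(smallest, largest)` pairs, range tombstones from the start level are `(begin, end)` pairs with an exclusive end (as with RocksDB's DeleteRange), and `tombstone_touches` and `may_skip` are hypothetical names, not functions in RocksDB.

```python
from bisect import bisect_left

def data_overlaps(file_range, sorted_keys):
    """True if at least one key falls inside the file range."""
    smallest, largest = file_range
    i = bisect_left(sorted_keys, smallest)
    return i < len(sorted_keys) and sorted_keys[i] <= largest

def tombstone_touches(tombstone, file_range):
    """True if the tombstone [begin, end) intersects the file range."""
    begin, end = tombstone  # end is exclusive
    smallest, largest = file_range
    return begin <= largest and smallest < end

def may_skip(file_range, sorted_keys, tombstones):
    """Skip only if neither point keys nor range tombstones touch the file."""
    if data_overlaps(file_range, sorted_keys):
        return False
    return not any(tombstone_touches(t, file_range) for t in tombstones)

start_keys = sorted(["1_5", "1_8", "2_5", "2_8"])
tombstones = [("2_0", "2_3")]  # deletes keys in ["2_0", "2_3")

print(may_skip(("2_1", "2_2"), start_keys, tombstones))
print(may_skip(("2_3", "2_4"), start_keys, tombstones))
```

In this model the file ["2_1", "2_2"] must stay in the compaction because the tombstone covers it, while ["2_3", "2_4"] starts exactly at the exclusive end and can still be skipped.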
I had this PR: #1963, which I believe solves a similar problem with a different approach. I hesitated to push it through because I worried that in some special cases we would create a lot of small files that might never be compacted together. I think this PR may have the same risk. I think the risk can be mitigated by looking at the size of the current output file: if the file is too small, we skip this optimization.
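One way to read that mitigation, as a rough sketch rather than anything taken from #1963 or this PR: walk the compaction input in key order and, whenever a skippable output level file comes up, skip it only if cutting the current output file there would not leave a tiny file. The function `plan_skips`, the stream layout, and the threshold `min_output_bytes` are all hypothetical.

```python
def plan_skips(stream, min_output_bytes):
    """Decide, in key order, which skippable files are actually skipped.

    stream items are (kind, name, size): kind "key" is start level data
    appended to the current output file, kind "file" is an output level
    file that has no data overlap and is a candidate for skipping.
    """
    skipped, rewritten = [], []
    current_bytes = 0  # size of the output file being built
    for kind, name, size in stream:
        if kind == "key":
            current_bytes += size
        elif current_bytes == 0 or current_bytes >= min_output_bytes:
            # Cutting here is free (nothing open) or leaves a big enough file.
            skipped.append(name)
            current_bytes = 0
        else:
            # Current output is too small: rewrite the file instead of cutting.
            rewritten.append(name)
            current_bytes += size
    return skipped, rewritten

stream = [
    ("key", "k1", 100), ("file", "A", 200),
    ("key", "k2", 10), ("file", "B", 200),
    ("key", "k3", 80), ("file", "C", 200),
]
skipped, rewritten = plan_skips(stream, min_output_bytes=64)
print("skipped:", skipped, "rewritten:", rewritten)
```

The sketch ignores the usual target file size cut, so it only illustrates the skip decision itself. File B is rewritten because only 10 bytes had been written to the current output when it came up.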
So, I think we all understand the problem we want to solve here, and we now have four different PRs that address it. Let me put them together and see how we should proceed:
There are actually two paths here.

The first path is to provide some options that let users decide how to cut their compaction files. Since users have more application knowledge, they can do better than RocksDB can internally. But it also relies on users to understand their pattern and get it right. As for the implementation, #5201 is more flexible and #6016 is more convenient; I think we can do both. #5201 is a mechanism and #6016 is a strategy, like

The second path is to let RocksDB handle the problem internally so that users don't need to worry about it. As for the implementation, #1963 seems more straightforward since RocksDB already checks overlapped bytes at the grandparent level, while #6021 needs to build an iterator of input level files somewhere.

Both paths can result in creating a lot of small files, depending on the data pattern. IMO, the problems of small files:
OK, these are my opinions so far. I'm just trying to clear my mind here, but I actually have no direct interest in this problem, so I'm not going to work on it right now.
Possibly related: #5201 #6016 @yiwu-arbug @zhangjinpeng1987 @matthewvon