Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

local cas: achieve chunk deduplication for nydus. #1399

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

xwb1136021767
Copy link
Contributor

Relevant Issue (if applicable)

If there are Issues related to this PullRequest, please list it.

Details

Original version from #956.
The previous version of local cas was static dedup, which only modified the chunk information in bootstrap. There is a serious problem: it may reuse chunks that cannot be obtained by the backend of the current image, resulting in the container being unable to load the corresponding chunk data on demand during runtime. To address this issue, dynamic dedup was introduced. When nydusd initializes the blob cache, it reads the corresponding backend configuration information of the blob from the CAS database, enabling the blob cache to read chunk data from other backend.

What's more, the mismatch between dynamic dedup and nydus' chunk amplification can result in a larger cache size after dedup than without dedup. Because chunk amplification can cause reused chunks to be pulled multiple times, resulting in a larger cache size after dedup is enabled than when dedup is not enabled. To address this issue, a dedup_bitmap was introduced. When initializing rafs, dedup_bitmap is generated based on the chunk information in blob. The determination of whether a chunk in a blob is ready requires both the chunk map and deduplication bitmap to make a joint decision.

Types of changes

What types of changes does your PullRequest introduce? Put an x in all the boxes that apply:

  • [ x] Bug fix (non-breaking change which fixes an issue)
  • [ x] New feature (non-breaking change which adds functionality)
  • [ x] Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation Update (if none of the other choices apply)

Checklist

Go over all the following points, and put an x in all the boxes that apply.

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.

@codecov
Copy link

codecov bot commented Aug 9, 2023

Codecov Report

Merging #1399 (6c5147f) into master (0916979) will decrease coverage by 0.90%.
The diff coverage is 31.95%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1399      +/-   ##
==========================================
- Coverage   62.70%   61.81%   -0.90%     
==========================================
  Files         123      125       +2     
  Lines       43248    44530    +1282     
  Branches    43248    44530    +1282     
==========================================
+ Hits        27120    27526     +406     
- Misses      14817    15660     +843     
- Partials     1311     1344      +33     
Files Coverage Δ
builder/src/core/v6.rs 75.65% <100.00%> (ø)
builder/src/lib.rs 64.79% <ø> (ø)
utils/src/lib.rs 98.87% <ø> (ø)
storage/src/cache/dummycache.rs 94.33% <80.00%> (-0.21%) ⬇️
utils/src/compress/mod.rs 97.83% <0.00%> (ø)
rafs/src/metadata/mod.rs 70.99% <0.00%> (ø)
utils/src/crypt.rs 93.04% <0.00%> (ø)
utils/src/digest.rs 92.06% <0.00%> (ø)
builder/src/compact.rs 80.32% <0.00%> (-0.25%) ⬇️
rafs/src/metadata/cached_v5.rs 80.76% <0.00%> (-0.30%) ⬇️
... and 21 more

... and 1 file with indirect coverage changes

@xwb1136021767 xwb1136021767 force-pushed the master branch 3 times, most recently from 8383358 to 4be06b9 Compare August 9, 2023 09:03
Cargo.lock Outdated Show resolved Hide resolved
@xwb1136021767 xwb1136021767 force-pushed the master branch 3 times, most recently from 10ee6ca to 0292485 Compare August 15, 2023 15:31
@imeoer imeoer added the feature label Aug 17, 2023
@adamqqqplay adamqqqplay requested a review from mofishzz August 24, 2023 03:12
@imeoer
Copy link
Collaborator

imeoer commented Oct 24, 2023

@xwb1136021767 Sorry for the delayed reply, please help to add some docs about background and usage for users.

@@ -29,6 +29,8 @@ pub struct ConfigV2 {
pub cache: Option<CacheConfigV2>,
/// Configuration information for RAFS filesystem.
pub rafs: Option<RafsConfigV2>,
/// Configuration information for image deduplication.
pub deduplication: Option<DeduplicationConfigV2>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add smoke test and update doc

timing_tracer!(
{
tree.walk_bfs(true, &mut |n| {
n.lock_node().dedup_chunk_for_node(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_chunk_ofs seems not implemented with v5. Is dedup_chunk_for_node able to work with v5? If not please use is_v6 to avoid it.

)
)
.subcommand(
App::new("dedup")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smoke test is required. Please also update test doc.

@@ -1611,4 +1750,89 @@ mod tests {
assert_eq!(size, iovec.size());
assert_eq!(chunk_count, iovec.len() as u32);
}

#[test]
fn test_serialize_cipher_object() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assertions is necessary that ensures cipher ok.

println!("{}", algo);
println!("origin blob_info: {:?}", blob_info);

println!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove these debug information for merge

@isidentical
Copy link

isidentical commented Nov 6, 2023

This is a particularly cool feature that I think would benefit everyone who deals with multiple random images that essentially share the same stuff thats used at runtime (e.g. libpython3.so) but built by different methodologies (different base images etc) and would optimize cold starts in a very intriguing way. Any chance on knowing what are the next steps for this PR to get it streamlined aside from the existing reviews (which all seems easily addressable, so am not sure if they are a blocker)? Is the concept itself is OK for nydus maintainers? Are there any help needed which I could contribute to (am not super knowledgeable on nydus side, but would be happy to dive deep into fixing the easy stuff like the existing reviews) to push this forward?

@xwb1136021767 xwb1136021767 force-pushed the master branch 2 times, most recently from f2368b1 to 33b2206 Compare November 16, 2023 02:04
@xwb1136021767 xwb1136021767 force-pushed the master branch 3 times, most recently from 837b6e7 to b7f1f3d Compare November 20, 2023 02:02
@imeoer
Copy link
Collaborator

imeoer commented Nov 20, 2023

@xwb1136021767 Can we also give a performance test result in the doc to give users confidence?

@@ -17,6 +17,7 @@ libc = "0.2"
log = "0.4"
lz4-sys = "1.9.4"
lz4 = "1.24.0"
rusqlite = { version = "0.29.0", features = ["bundled"] }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update to latest "0.30.0".

// create blob and chunk table for nydus v5
conn.execute(
"CREATE TABLE IF NOT EXISTS BlobInfos_V5 (
BlobId TEXT NOT NULL PRIMARY KEY,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shoud it be CHAR(64)?

(),
)?;
conn.execute(
"CREATE INDEX IF NOT EXISTS BlobIndex_V5 ON BlobInfos_V5(BlobId)",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BlobInfos_V5(BlobId) is a primary key, should it have already been indexed?

ChunkId INTEGER PRIMARY KEY,
ChunkDigestValue TEXT NOT NULL,
ChunkInfo TEXT NOT NULL,
BlobId TEXT NOT NULL,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use an auto-incr integer as primary key for BlobInfos_V5, so we can store an integer instead of a text here.

let cas_mgr = CasMgr::new(path).unwrap();
let vec = cas_mgr.get_all_blobs(false).unwrap();

println!("v5 blobs");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The println! lines should be removed

@jiangliu
Copy link
Collaborator

This is a particularly cool feature that I think would benefit everyone who deals with multiple random images that essentially share the same stuff thats used at runtime (e.g. libpython3.so) but built by different methodologies (different base images etc) and would optimize cold starts in a very intriguing way. Any chance on knowing what are the next steps for this PR to get it streamlined aside from the existing reviews (which all seems easily addressable, so am not sure if they are a blocker)? Is the concept itself is OK for nydus maintainers? Are there any help needed which I could contribute to (am not super knowledgeable on nydus side, but would be happy to dive deep into fixing the easy stuff like the existing reviews) to push this forward?

This idea is warmly welcomed:)
The only issue is a little bigger for review, will put more effort on this.

xwb1136021767 and others added 4 commits November 27, 2023 16:12
Add the dedup option to nydus-image to support block level deduplication
between local images. A local database cas.db was introduced, which records
all chunks saved locally. Then We can redirect the chunks to those that already
exist locally to avoid duplicate downloads.

There is still some work needs to be done, including:
1. Insert the on-demand downloaded chunk into the database to ensure that
the database stores locally existing chunks.
2. Snapshotter's GC mechanism with the reference relationship of bootstrap
after deduplication.

Signed-off-by: Huang Jianan <[email protected]>
Signed-off-by: xwb1136021767 <[email protected]>
The previous version of local cas was static dedup, which only modified
the chunk information in bootstrap. There is a serious problem: it may
reuse chunks that cannot be obtained by the backend of the current
image, resulting in the container being unable to load the corresponding
chunk data on demand during runtime.

To address this issue, dynamic dedup was introduced. When nydusd
initializes the blob cache, it reads the corresponding backend
configuration information of the blob from the CAS database, enabling
the blob cache to read chunk data from other backend.

Signed-off-by: xwb1136021767 <[email protected]>
…namic dedup.

The mismatch between dynamic dedup and nydus' chunk amplification can
result in a larger cache size after dedup than without dedup. Because
chunk amplification can cause reused chunks to be pulled multiple
times, resulting in a larger cache size after dedup is enabled than when
dedup is not enabled.

To address this issue, a dedup_bitmap was introduced. When initializing
rafs, dedup_bitmap is generated based on the chunk information in blob.
The determination of whether a chunk in a blob is ready requires both
the chunk map and deduplication bitmap to make a joint decision.

Signed-off-by: xwb1136021767 <[email protected]>
@jiangliu
Copy link
Collaborator

@xwb1136021767 When reviewing the PR, I have worked out some enhancements/fixes over your work. How about holding on for while, then I will try to submit my patches to your repo?

@xwb1136021767
Copy link
Contributor Author

@xwb1136021767 When reviewing the PR, I have worked out some enhancements/fixes over your work. How about holding on for while, then I will try to submit my patches to your repo?

Thank you so much for your help!

true
}

pub fn get_enable(&self) -> bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a pub field, still need this? Also better be is_enable.

/// Whether to enable image dedup
#[serde(default)]
pub enable: bool,
/// Work fir for image dedup
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo dir.

@@ -2,16 +2,34 @@
// Copyright (C) 2023 Alibaba Cloud. All rights reserved.
//
// SPDX-License-Identifier: Apache-2.0

use anyhow::{Context, Error, Result};
#![allow(unused_variables, unused_imports)]
Copy link
Collaborator

@imeoer imeoer Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

} else {
(old_blob_table_offset, bootstrap_end)
};
//write rs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add space after //.

sb.set_extended_blob_table_entries(u32::try_from(blob_table.extended.entries())?);
writer.seek(SeekFrom::Start(0))?;
sb.store(writer).context("failed to store superblock")?;
//rewrite blob table
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

.context("failed to write 0 to padding of bootstrap's end")?;
writer.flush().context("failed to flush bootstrap")?;

//prepare devtable
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@@ -0,0 +1,166 @@
// // Copyright (C) 2022 Alibaba Cloud. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Copyright (C) 2023 Alibaba Cloud. All rights reserved.

Arg::new("output-bootstrap")
.long("output")
.short('O')
.help("Bootstrap to output, default is source bootstrap add suffix .debup")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo dedup.

Some(s) => PathBuf::from(s),
};

//config
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@@ -0,0 +1,450 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

// create blob and chunk table for nydus v5
conn.execute(
"CREATE TABLE IF NOT EXISTS BlobInfos_V5 (
BlobId TEXT NOT NULL PRIMARY KEY,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better be blob_id, same below.

conn: Connection,
}

impl CasMgr {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we implement trait ChunkDict for CasMgr ?

Copy link
Collaborator

@imeoer imeoer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't understand why we need to rebuild a new bootstrap, can we do runtime de-duplication by enhancing the blob cache? i.e. for example using a global (chunk_digest, chunk_info) db records.

@xwb1136021767
Copy link
Contributor Author

I can't understand why we need to rebuild a new bootstrap, can we do runtime de-duplication by enhancing the blob cache? i.e. for example using a global (chunk_digest, chunk_info) db records.

I have also considered the solution you mentioned before. My previous concern was whether runtime deduplication would make the CAS database a performance bottleneck in IO intensive scenarios. There are indeed problems with the current solution, and personally, I think the biggest problem is how to combine it with the GC mechanism of the image, especially in localfs or localdisk mode, because nydus snapshot cannot prevent the deletion operation of the image.

@imeoer
Copy link
Collaborator

imeoer commented Nov 29, 2023

I have also considered the solution you mentioned before. My previous concern was whether runtime deduplication would make the CAS database a performance bottleneck in IO intensive scenarios. There are indeed problems with the current solution, and personally, I think the biggest problem is how to combine it with the GC mechanism of the image, especially in localfs or localdisk mode, because nydus snapshot cannot prevent the deletion operation of the image.

This may require some testing, full dedup and rebuilding the whole bootstrap (not only chunk infos?) on booting may also affect cold start performance. Either full dedup or on-demand dedup, there will be GC issues as containerd checks if an image is referencing a blob based on the image manifest, references:

https://github.com/containerd/nydus-snapshotter/blob/5009c522df583cdf76ee37ae2a3c6440d5d79797/snapshot/snapshot.go#L583

https://github.com/containerd/nydus-snapshotter/blob/5009c522df583cdf76ee37ae2a3c6440d5d79797/pkg/cache/manager.go#L70

@jiangliu
Copy link
Collaborator

jiangliu commented Dec 7, 2023

A simplified version to implement chunk deduplication at runtime instead of at build time.
#1507

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants