fix: Correctly map deduplicated chunk indices when setting content #54
base: master
Conversation
I'm a bit worried that the implementations now deviate between Rust and JS, and I was trying to understand whether this was happening in Rust too. That is, is this a small difference between them that should be fixed, or did the algorithm have a flaw that you've uncovered? I haven't managed to replicate it in Rust by doing the following (which is slightly different to what you did):

```rust
let content = randombytes_deterministic(10 * 1024, &[0; 32]); // 10kb of pseudorandom data
// Flatten it to 40kb of repeating 10kb chunks
let content = vec![content.clone(), content.clone(), content.clone(), content].concat();
```

And this works when I set and get in a test that I thought was similar to yours. I also adjusted your tests to not need the stringifying, just so it's simpler (will make an inline comment in a sec), and your test still fails. So I suspect it's just in JS. Now I wonder: what's the difference between the Rust impl and the JS one, and can we keep them mostly identical for ease of maintenance?
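(For anyone wanting to try the same reproduction on the JS side, here's a rough TypeScript analogue. It assumes libsodium-wrappers exposes randombytes_buf_deterministic; the helper name is made up:)

```ts
import sodium from "libsodium-wrappers";

// Rough JS analogue of the Rust reproduction above: deterministic
// pseudorandom content repeated four times so chunks should dedupe.
async function makeRepeatingContent(): Promise<Uint8Array> {
  await sodium.ready;
  const seed = new Uint8Array(32); // all-zero seed, like &[0; 32]
  const block = sodium.randombytes_buf_deterministic(10 * 1024, seed);
  // Flatten to 40kb of repeating 10kb chunks
  const out = new Uint8Array(4 * block.length);
  for (let i = 0; i < 4; i++) {
    out.set(block, i * block.length);
  }
  return out;
}
```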
I added a test in Rust similar to yours and it seems like the chunking that happens above 'never' results in duplicate chunks, so there isn't anything to deduplicate?
> I added a test in Rust similar to yours and it seems like the chunking that happens above 'never' results in duplicate chunks, so there isn't anything to deduplicate?
Oh, interesting. I wonder why it would happen here but not happen there. I also tried the exact test in Rust, but I neglected to verify that things actually get deduplicated.
Anyhow, I'll take a better look here. Thanks!
Apologies again for the delay; it's just a very dangerous piece of code to get wrong, and I wanted to take multiple passes at it.
```js
const chunkKeys = [...uniqueChunksMap.keys()];

// Change the original (shuffled) indices to point at the deduplicated chunks
const newIndices = indices.map((i) => {
  const [id] = chunks[i];
  return chunkKeys.indexOf(id);
});

chunks = [...uniqueChunksMap.values()];
```
Sorry it took me so long to review this. It's a very intrusive change in a very sensitive place.
I think the change is overall correct, though one thing I'm very concerned about is the highlighted code I'm responding to. I'm not sure that, per spec, keys() and values() necessarily iterate in the same order, or that the order is deterministic. It could very well be, though I'm unsure. I'd much rather we created both in one go.
I'm also not sure what uniqueChunksMap actually does to the ordering.
Going back to Rust, this is how it's done there; do you think that makes sense?
```rust
// Filter duplicates and construct the index list.
let mut uid_indices: HashMap<String, usize> = HashMap::new();
chunks = chunks
    .into_iter()
    .enumerate()
    .filter_map(|(i, chunk)| {
        let uid = &chunk.0;
        match uid_indices.get(uid) {
            Some(previous_index) => {
                indices[i] = *previous_index;
                None
            }
            None => {
                uid_indices.insert(uid.to_string(), i);
                Some(chunk)
            }
        }
    })
    .collect();

// If we have more than one chunk we need to encode the mapping header in the last chunk
if indices.len() > 1 {
    // We encode it in an array so we can extend it later on if needed
    let buf = rmp_serde::to_vec_named(&(indices,))?;
    let hash = to_base64(&crypto_manager.0.calculate_mac(&buf)?)?;
    chunks.push(ChunkArrayItem(hash, Some(buf)));
}
```
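For comparison, here's one way to build the deduplicated list and the remapped indices in a single pass on the JS side. This is a minimal TypeScript sketch with made-up names, not the actual etebase-js code; it remaps every index to the chunk's position in the deduplicated array, so it doesn't rely on separate keys() and values() iterations agreeing:

```ts
type ChunkArrayItem = [string, Uint8Array?];

// One-pass variant: build the deduplicated chunk list and the
// position of every original chunk in it at the same time.
function dedupChunks(
  chunks: ChunkArrayItem[],
  indices: number[],
): { chunks: ChunkArrayItem[]; indices: number[] } {
  const uidToNewIndex = new Map<string, number>();
  const deduped: ChunkArrayItem[] = [];
  // newPosition[i] = where chunks[i] ended up in the deduplicated array
  const newPosition = chunks.map((chunk) => {
    const uid = chunk[0];
    let pos = uidToNewIndex.get(uid);
    if (pos === undefined) {
      pos = deduped.length;
      uidToNewIndex.set(uid, pos);
      deduped.push(chunk);
    }
    return pos;
  });
  // Remap the (shuffled) indices to point into the deduplicated array
  return { chunks: deduped, indices: indices.map((i) => newPosition[i]) };
}
```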
Previously, if a chunk was removed as a duplicate, the indices would be incorrect and getContent wouldn't be able to piece them back together.
This keeps the same logic, but ensures that the indices correctly point to the elements in the new chunks array. Also adds tests to illustrate the issue.
#51
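To illustrate the failure mode, a toy round-trip check against the single-pass sketch above (hypothetical uids and data, not the tests added in this PR):

```ts
const chunks: ChunkArrayItem[] = [
  ["uidA", new Uint8Array([1])],
  ["uidB", new Uint8Array([2])],
  ["uidA", new Uint8Array([1])], // duplicate of the first chunk
];
const indices = [2, 0, 1]; // shuffled: content position -> chunk index
const result = dedupChunks(chunks, indices);
// result.chunks keeps two entries (uidA, uidB); result.indices is
// [0, 0, 1], so every content position still resolves to a real chunk.
const reassembled = result.indices.map((i) => result.chunks[i][0]);
console.log(reassembled); // ["uidA", "uidA", "uidB"]
```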