fix: Correctly map deduplicated chunk indices when setting content #54
base: master
Conversation
I'm a bit worried that the implementations now deviate between Rust and JS, and I was trying to understand whether this was happening in Rust too. That is, is this a small difference between them that should be fixed, or did the algorithm have a flaw that you've uncovered? I haven't managed to replicate it in Rust by doing the following (which is slightly different to what you did):

```rust
let content = randombytes_deterministic(10 * 1024, &[0; 32]); // 10kb of pseudorandom data
// Flatten it to 40kb of repeating 10kb chunks
let content = vec![content.clone(), content.clone(), content.clone(), content].concat();
```

And this works when I set and get in a test that I thought was similar to yours. I also adjusted your tests to not need the stringifying, just so it's simpler (will make an inline comment in a sec), and your test still fails. So I suspect it's just in JS. Now I wonder: what's the difference between the Rust impl and the JS one, and can we keep them mostly identical for ease of maintenance?
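(For anyone wanting to try the same reproduction on the JS side, here's a rough TypeScript analogue. It assumes libsodium-wrappers exposes randombytes_buf_deterministic; the helper name is made up:)

```ts
import sodium from "libsodium-wrappers";

// Rough JS analogue of the Rust reproduction above: deterministic
// pseudorandom content repeated four times so chunks should dedupe.
async function makeRepeatingContent(): Promise<Uint8Array> {
  await sodium.ready;
  const seed = new Uint8Array(32); // all-zero seed, like &[0; 32]
  const block = sodium.randombytes_buf_deterministic(10 * 1024, seed);
  // Flatten to 40kb of repeating 10kb chunks
  const out = new Uint8Array(4 * block.length);
  for (let i = 0; i < 4; i++) {
    out.set(block, i * block.length);
  }
  return out;
}
```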
I added a test in Rust similar to yours and it seems like the chunking that happens above 'never' results in duplicate chunks, so there isn't anything to deduplicate?
> I added a test in Rust similar to yours and it seems like the chunking that happens above 'never' results in duplicate chunks, so there isn't anything to deduplicate?
Oh, interesting. I wonder why it would happen here but not happen there. I also tried the exact test in Rust, but I neglected to verify that things actually get deduplicated.
Anyhow, I'll take a better look here. Thanks!
Apologies again for the delay; it's just a very dangerous piece of code to get wrong, and I wanted to take multiple passes at it.
```js
const chunkKeys = [...uniqueChunksMap.keys()];

// Change the original (shuffled) indices to point at the deduplicated chunks
const newIndices = indices.map((i) => {
  const [id] = chunks[i];
  return chunkKeys.indexOf(id);
});

chunks = [...uniqueChunksMap.values()];
```
Sorry it took me so long to review this. It's a very intrusive change in a very sensitive place.
I think the change is overall correct, though one thing I'm very concerned about is the highlighted code I'm responding to. I'm not sure that, per spec, keys() and values() necessarily iterate in the same order, or that the order is deterministic. It could very well be, though I'm unsure. I'd much rather we created both in one go.
I'm also not sure what uniqueChunksMap actually does to the ordering.
Going back to Rust, this is how it's done there; do you think that makes sense?
```rust
// Filter duplicates and construct the index list.
let mut uid_indices: HashMap<String, usize> = HashMap::new();
chunks = chunks
    .into_iter()
    .enumerate()
    .filter_map(|(i, chunk)| {
        let uid = &chunk.0;
        match uid_indices.get(uid) {
            Some(previous_index) => {
                indices[i] = *previous_index;
                None
            }
            None => {
                uid_indices.insert(uid.to_string(), i);
                Some(chunk)
            }
        }
    })
    .collect();

// If we have more than one chunk we need to encode the mapping header in the last chunk
if indices.len() > 1 {
    // We encode it in an array so we can extend it later on if needed
    let buf = rmp_serde::to_vec_named(&(indices,))?;
    let hash = to_base64(&crypto_manager.0.calculate_mac(&buf)?)?;
    chunks.push(ChunkArrayItem(hash, Some(buf)));
}
```
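For comparison, here's one way to build the deduplicated list and the remapped indices in a single pass on the JS side. This is a minimal TypeScript sketch with made-up names, not the actual etebase-js code; it remaps every index to the chunk's position in the deduplicated array, so it doesn't rely on separate keys() and values() iterations agreeing:

```ts
type ChunkArrayItem = [string, Uint8Array?];

// One-pass variant: build the deduplicated chunk list and the
// position of every original chunk in it at the same time.
function dedupChunks(
  chunks: ChunkArrayItem[],
  indices: number[],
): { chunks: ChunkArrayItem[]; indices: number[] } {
  const uidToNewIndex = new Map<string, number>();
  const deduped: ChunkArrayItem[] = [];
  // newPosition[i] = where chunks[i] ended up in the deduplicated array
  const newPosition = chunks.map((chunk) => {
    const uid = chunk[0];
    let pos = uidToNewIndex.get(uid);
    if (pos === undefined) {
      pos = deduped.length;
      uidToNewIndex.set(uid, pos);
      deduped.push(chunk);
    }
    return pos;
  });
  // Remap the (shuffled) indices to point into the deduplicated array
  return { chunks: deduped, indices: indices.map((i) => newPosition[i]) };
}
```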
Previously, if a chunk was removed as a duplicate, the indices would be incorrect and getContent wouldn't be able to piece them back together.
This keeps the same logic, but ensures that the indices correctly point to the elements in the new chunks array. Also adds tests to illustrate the issue.
#51
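To illustrate the failure mode, a toy round-trip check against the single-pass sketch above (hypothetical uids and data, not the tests added in this PR):

```ts
const chunks: ChunkArrayItem[] = [
  ["uidA", new Uint8Array([1])],
  ["uidB", new Uint8Array([2])],
  ["uidA", new Uint8Array([1])], // duplicate of the first chunk
];
const indices = [2, 0, 1]; // shuffled: content position -> chunk index
const result = dedupChunks(chunks, indices);
// result.chunks keeps two entries (uidA, uidB); result.indices is
// [0, 0, 1], so every content position still resolves to a real chunk.
const reassembled = result.indices.map((i) => result.chunks[i][0]);
console.log(reassembled); // ["uidA", "uidA", "uidB"]
```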