[C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer #45360
base: main
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
cpp/src/parquet/column_chunker.h (outdated)
bool Roll(const T value) {
  constexpr size_t BYTE_WIDTH = sizeof(T);
  chunk_size_ += BYTE_WIDTH;
  // if (chunk_size_ < min_len_) {
Skipping bytes until the min size is reached speeds up the boundary detection quite a lot.
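For intuition, here is a minimal Python sketch (not the PR's C++ code) of a gear-hash boundary search that only starts testing the boundary mask once the minimum chunk size has been reached; the function and parameter names are illustrative.

def find_chunk_length(data: bytes, table, mask: int, min_len: int, max_len: int) -> int:
    # Return the length of the first content-defined chunk of `data`.
    end = min(len(data), max_len)
    h = 0
    for i in range(end):
        # Gear hash: shift and add a per-byte constant from the table.
        h = ((h << 1) + table[data[i]]) & 0xFFFFFFFFFFFFFFFF
        # Skip the boundary test until the minimum chunk size is reached;
        # chunks shorter than min_len are never emitted anyway.
        if i + 1 >= min_len and (h & mask) == 0:
            return i + 1
    return end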
cpp/src/parquet/column_chunker.h (outdated)
const uint64_t MASK = 0xffff00000000000;
// const int MIN_LEN = 65536 / 8;
// const int MAX_LEN = 65536 * 2;
const int64_t MIN_LEN = 256 * 1024;
These default values are subject to change, especially because the default maximum page size is 1 MiB.
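For reference, the mask determines the expected average chunk size: with a roughly uniform 64-bit hash, (hash & MASK) == 0 holds with probability 2^-popcount(MASK), so boundaries occur on average every 2^popcount(MASK) bytes. A quick sanity check with the constant quoted above:

MASK = 0xffff00000000000  # constant from the diff above
bits = bin(MASK).count("1")
print(bits, 2 ** bits)  # 16 bits set -> ~64 KiB expected average chunk size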
Is CDC part of the Parquet spec? Or is it a PoC?
It is not. You can think of it as an implementation-specific feature, similar to the existing options that control how record batches and pages are split.
Force-pushed from d1076a8 to c7a0b3a
Force-pushed from 9919b4f to 6fc058d
namespace parquet {
namespace internal {

constexpr uint64_t GEAR_HASH_TABLE[8][256] = {
Need to find a better way to embed these constants. I used the following script to generate this:
import hashlib
import sys


def gear(n: int, seed: int):
    # 64-bit gear value: the first 16 hex digits of md5(seed * 64 || n * 64).
    value = bytes([seed] * 64 + [n] * 64)
    hasher = hashlib.md5(value)
    return hasher.hexdigest()[:16]


def print_table(seed: int, length=256, comma=True):
    # Print one 256-entry table as a C++ initializer list.
    table = [gear(n, seed=seed) for n in range(length)]
    print(f"{{ // seed = {seed}")
    for i in range(0, length, 4):
        print(" ", end="")
        values = [f"0x{value}" for value in table[i:i + 4]]
        values = ", ".join(values)
        print(f" {values}", end=",\n" if i < length - 4 else "\n")
    print(" }", end=", " if comma else "")


if __name__ == "__main__":
    print("{")
    n = int(sys.argv[1])  # number of seeds, i.e. the outer dimension of the table
    for seed in range(n):
        print_table(seed, comma=seed < n)
    print("}")
cpp/src/parquet/column_chunker.h (outdated)
FastCDC(const LevelInfo& level_info, uint64_t avg_len, uint8_t granurality_level = 5)
    : level_info_(level_info),
      avg_len_(avg_len == 0 ? AVG_LEN : avg_len),
      min_len_(static_cast<uint64_t>(avg_len_ * 0.6)),
Maybe we should expose the min and max chunk sizes again; these factors try to normalize the chunk size.
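As a rough sketch of what deriving the bounds from the average could look like: the 0.6 factor for the minimum comes from the quoted diff, while the factor for the maximum is purely an assumption for illustration.

def normalized_bounds(avg_len: int, min_factor: float = 0.6, max_factor: float = 2.0):
    # min_factor mirrors the quoted code; max_factor is an assumed placeholder.
    return int(avg_len * min_factor), int(avg_len * max_factor)

print(normalized_bounds(1024 * 1024))  # (629145, 2097152) for a 1 MiB average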
Rationale for this change
I have been working on improving Parquet's deduplication efficiency for content-addressable storage systems. These systems generally use some kind of CDC (content-defined chunking) algorithm, which is better suited to uncompressed row-major formats. Thanks to Parquet's unique features, however, I was able to reach good deduplication results by chunking data pages consistently, maintaining a gearhash-based chunker for each column (a toy illustration of the effect is sketched below).
A purpose-built evaluation tool is available at https://github.com/kszucs/de
Figure: Deduplication efficiency for all revisions of openfoodfacts/product-database/food.parquet
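As a toy illustration of the effect (not the PR's code), the sketch below chunks a random byte stream with a gear hash, inserts a few bytes near the front, re-chunks, and counts how many chunks survive unchanged; the table construction, mask, and sizes are illustrative only.

import hashlib
import random

# Illustrative 256-entry gear table, one 64-bit constant per byte value.
TABLE = [int.from_bytes(hashlib.md5(bytes([b])).digest()[:8], "big") for b in range(256)]
MASK = (1 << 13) - 1  # 13 mask bits -> ~8 KiB average chunks in this toy example

def cdc_chunks(data: bytes):
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + TABLE[b]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK) == 0:  # content-defined boundary
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    chunks.append(data[start:])
    return chunks

random.seed(0)
base = random.randbytes(200_000)  # Python 3.9+
edited = base[:1000] + b"INSERTED" + base[1000:]

before, after = cdc_chunks(base), cdc_chunks(edited)
shared = len(set(before) & set(after))
print(f"{shared} of {len(after)} chunks are byte-identical after the insertion")

With fixed-size chunking, every chunk after the insertion point would shift and fail to deduplicate; with content-defined boundaries, only the chunks overlapping the edit change.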
What changes are included in this PR?
Are these changes tested?
Not yet.
Are there any user-facing changes?
There is a new Parquet writer property called content_defined_chunking, which is subject to renaming.
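For illustration only, usage from Python might look roughly like the sketch below; whether the option is exposed as a pyarrow.parquet.write_table keyword, and its final spelling, are assumptions on top of the property name mentioned above.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1_000_000))})
# `content_defined_chunking` mirrors the writer property named in this PR;
# the keyword and its accepted values are assumed and subject to renaming.
pq.write_table(table, "example.parquet", content_defined_chunking=True)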