[C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer #45360
base: main
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
cpp/src/parquet/column_chunker.h (outdated)
bool Roll(const T value) {
  constexpr size_t BYTE_WIDTH = sizeof(T);
  chunk_size_ += BYTE_WIDTH;
  // if (chunk_size_ < min_len_) {
Skipping bytes until the min size is reached speeds up the boundary detection quite a lot.
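For intuition, here is a minimal Python sketch (not the PR's C++ code) of a gear-hash boundary search that only starts testing the boundary mask once the minimum chunk size has been reached; the function and parameter names are illustrative.

def find_chunk_length(data: bytes, table, mask: int, min_len: int, max_len: int) -> int:
    # Return the length of the first content-defined chunk of `data`.
    end = min(len(data), max_len)
    h = 0
    for i in range(end):
        # Gear hash: shift and add a per-byte constant from the table.
        h = ((h << 1) + table[data[i]]) & 0xFFFFFFFFFFFFFFFF
        # Skip the boundary test until the minimum chunk size is reached;
        # chunks shorter than min_len are never emitted anyway.
        if i + 1 >= min_len and (h & mask) == 0:
            return i + 1
    return end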
cpp/src/parquet/column_chunker.h (outdated)
const uint64_t MASK = 0xffff00000000000;
// const int MIN_LEN = 65536 / 8;
// const int MAX_LEN = 65536 * 2;
const int64_t MIN_LEN = 256 * 1024;
These default values are subject to change, especially because the default maximum page size is 1 MiB.
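For reference, the mask determines the expected average chunk size: with a roughly uniform 64-bit hash, (hash & MASK) == 0 holds with probability 2^-popcount(MASK), so boundaries occur on average every 2^popcount(MASK) bytes. A quick sanity check with the constant quoted above:

MASK = 0xffff00000000000  # constant from the diff above
bits = bin(MASK).count("1")
print(bits, 2 ** bits)  # 16 bits set -> ~64 KiB expected average chunk size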
Is CDC part of the Parquet spec? Or is it a PoC?
It is not. You can think of it as an implementation-specific feature, similar to the existing options that control how record batches and pages are split.
Force-pushed from d1076a8 to c7a0b3a
Force-pushed from 9919b4f to 6fc058d
namespace parquet {
namespace internal {

constexpr uint64_t GEAR_HASH_TABLE[8][256] = {
Need to find a better way to embed these constants. I used the following script to generate this:
import hashlib
import sys


def gear(n: int, seed: int):
    # 64-bit gear value: the first 16 hex digits of md5(seed * 64 || n * 64).
    value = bytes([seed] * 64 + [n] * 64)
    hasher = hashlib.md5(value)
    return hasher.hexdigest()[:16]


def print_table(seed: int, length=256, comma=True):
    # Print one 256-entry table as a C++ initializer list.
    table = [gear(n, seed=seed) for n in range(length)]
    print(f"{{ // seed = {seed}")
    for i in range(0, length, 4):
        print(" ", end="")
        values = [f"0x{value}" for value in table[i:i + 4]]
        values = ", ".join(values)
        print(f" {values}", end=",\n" if i < length - 4 else "\n")
    print(" }", end=", " if comma else "")


if __name__ == "__main__":
    print("{")
    n = int(sys.argv[1])  # number of seeds, i.e. the outer dimension of the table
    for seed in range(n):
        print_table(seed, comma=seed < n)
    print("}")
cpp/src/parquet/column_chunker.h (outdated)
FastCDC(const LevelInfo& level_info, uint64_t avg_len, uint8_t granurality_level = 5)
    : level_info_(level_info),
      avg_len_(avg_len == 0 ? AVG_LEN : avg_len),
      min_len_(static_cast<uint64_t>(avg_len_ * 0.6)),
Maybe we should expose the min and max chunk sizes again; these factors try to normalize the chunk size.
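As a rough sketch of what deriving the bounds from the average could look like: the 0.6 factor for the minimum comes from the quoted diff, while the factor for the maximum is purely an assumption for illustration.

def normalized_bounds(avg_len: int, min_factor: float = 0.6, max_factor: float = 2.0):
    # min_factor mirrors the quoted code; max_factor is an assumed placeholder.
    return int(avg_len * min_factor), int(avg_len * max_factor)

print(normalized_bounds(1024 * 1024))  # (629145, 2097152) for a 1 MiB average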
Rationale for this change
I have been working on improving Parquet's deduplication efficiency for content-addressable storage systems. These systems generally use some kind of CDC (content-defined chunking) algorithm, which is better suited to uncompressed row-major formats. Thanks to Parquet's unique features, however, I was able to reach good deduplication results by chunking data pages consistently, maintaining a gearhash-based chunker for each column (a toy illustration of the effect is sketched below).
A purpose-built evaluation tool is available at https://github.com/kszucs/de
Figure: Deduplication efficiency for all revisions of openfoodfacts/product-database/food.parquet
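As a toy illustration of the effect (not the PR's code), the sketch below chunks a random byte stream with a gear hash, inserts a few bytes near the front, re-chunks, and counts how many chunks survive unchanged; the table construction, mask, and sizes are illustrative only.

import hashlib
import random

# Illustrative 256-entry gear table, one 64-bit constant per byte value.
TABLE = [int.from_bytes(hashlib.md5(bytes([b])).digest()[:8], "big") for b in range(256)]
MASK = (1 << 13) - 1  # 13 mask bits -> ~8 KiB average chunks in this toy example

def cdc_chunks(data: bytes):
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + TABLE[b]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK) == 0:  # content-defined boundary
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    chunks.append(data[start:])
    return chunks

random.seed(0)
base = random.randbytes(200_000)  # Python 3.9+
edited = base[:1000] + b"INSERTED" + base[1000:]

before, after = cdc_chunks(base), cdc_chunks(edited)
shared = len(set(before) & set(after))
print(f"{shared} of {len(after)} chunks are byte-identical after the insertion")

With fixed-size chunking, every chunk after the insertion point would shift and fail to deduplicate; with content-defined boundaries, only the chunks overlapping the edit change.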
What changes are included in this PR?
Are these changes tested?
Not yet.
Are there any user-facing changes?
There is a new Parquet writer property called content_defined_chunking, which is subject to renaming.
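For illustration only, usage from Python might look roughly like the sketch below; whether the option is exposed as a pyarrow.parquet.write_table keyword, and its final spelling, are assumptions on top of the property name mentioned above.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1_000_000))})
# `content_defined_chunking` mirrors the writer property named in this PR;
# the keyword and its accepted values are assumed and subject to renaming.
pq.write_table(table, "example.parquet", content_defined_chunking=True)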