Manifest Splitting #767
Merged
76 commits
de9ab79
[WIP] manifest sharding
dcherian 14d9048
WIP
dcherian 6aa7e29
More condition parsing work
dcherian 70412ec
thread config through
dcherian 267c9cf
clippied
dcherian acf8fa5
WIP proptest
dcherian 2542412
Revert "WIP proptest"
dcherian d816f8b
Simple test passes!
dcherian a64252a
Cleanup
dcherian 8ad20e8
Better test
dcherian ac42185
More tests
dcherian 34da81b
Lossen type
dcherian d935131
Update gc test
dcherian d27a2a2
DimensionName(regex)
dcherian 83fef67
more test
dcherian b47788c
Revert "Lossen type"
dcherian a56ccce
Add condition parsing tests
dcherian 87af732
WIP
dcherian 63fdc6d
Add notes
dcherian d491c3e
Iterator -> Stream
dcherian fc965d7
Optimize reads!
dcherian b2167f7
Python config
dcherian dd95c32
Add reprs
dcherian fb89826
Add from_dict
dcherian edb1d13
clippy
dcherian 4732a61
[revert] comment out bad test
dcherian 5b70006
Add doctest to Just
dcherian 0ce517e
Add doctest
dcherian dd9b77d
Merge branch 'main' into split-manifests
dcherian fc19e7f
Fix appends
dcherian 22deffe
Add updating config on read test
dcherian 2121f9a
Fix types
dcherian 56158c5
add comment to clarify monkey patch
dcherian 1a0215c
Update icechunk-python/python/icechunk/_icechunk_python.pyi
dcherian deb5a7e
Fix docstring types
dcherian d749f41
Update type, docstring
dcherian e648329
ShardDimCondition::Any -> Rest
dcherian aa82355
Minor cleanup
dcherian 5d2318b
More complex tests
dcherian 10e3b7e
Aggregate extents while grouping shards.
dcherian 137f283
Update reprs
dcherian 1119f81
Benchmarks cleanup
dcherian f3db41b
Add write benchmark
dcherian ac6a0ef
Add read benchmark
dcherian 2f42e71
Merge branch 'main' into split-manifests
dcherian ea50452
Add rust test for large numbers of refs
dcherian 2ea8e2a
Merge branch 'main' into split-manifests
dcherian 33966cb
Add to test_can_read_old.py
dcherian 80f8f5d
shard → split
dcherian 94bbf9e
Merge branch 'main' into split-manifests
dcherian 6fc7eb6
one more rename
dcherian a76558d
Merge branch 'main' into split-manifests
dcherian c35b589
Address minor comments.
dcherian f6156b9
Comment out handling sessionerror.
dcherian 48bebf2
Rest -> Any
dcherian 0d7e01f
Assert len(manifestextents) > 0
dcherian 8c4cc59
New ManifestSplitDim struct
dcherian 872a522
lint
dcherian 77329bf
Merge branch 'main' into split-manifests
dcherian 8d24d01
Remove unneeded Index
dcherian 5d6cefa
Add property test
dcherian 5e34e6c
Add docs
dcherian 315d1df
Add ManifestSplitCondition.AnyArray
dcherian 9954cda
fix docs
dcherian 0b3581d
Add Or condition test
dcherian c1c7688
fix docs
dcherian 9e625cf
fix docs
dcherian 39125a8
fix docs
dcherian b5812d5
Try speeding up docs build
dcherian 02c95b6
fix docs
dcherian 8051f92
try fixing docs rendering
dcherian bf8936e
Apply suggestions from code review
dcherian 371c045
more docs
dcherian 5a955f5
Fix test
dcherian 163d31c
tweak docs build
dcherian 0f44af8
Merge branch 'main' into split-manifests
dcherian
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,90 @@ | ||
| # Performance | ||
|
|
||
| !!! info | ||
|
|
||
| This is advanced material, and you will need it only if you have arrays with more than a million chunks. | ||
| Icechunk aims to provide an excellent experience out of the box. | ||
|
|
||
| ## Preloading manifests | ||
|
|
||
| Coming Soon. | ||
|
||
|
|
||
| ## Splitting manifests | ||
|
|
||
| Icechunk keeps chunk references in chunk manifest files under `manifests/`. | ||
| For very large arrays (millions of chunks), these files can get quite large. | ||
| By default, Icechunk stores all chunk references in a single manifest file per array. | ||
| Requesting even a single chunk requires downloading the entire manifest. | ||
| In some cases, this can result in a slow time-to-first-byte or large memory usage. | ||
|
|
||
| !!! note | ||
|
|
||
| Note that the chunk sizes in the following examples are tiny for demonstration purposes. | ||
|
|
||
| To avoid downloading an entire manifest just to read a few chunks, Icechunk lets you split the manifest files by specifying a `ManifestSplittingConfig`. | ||
|
|
||
| ```python exec="on" session="perf" source="material-block" | ||
| import icechunk as ic | ||
| from icechunk import ManifestSplitCondition, ManifestSplittingConfig, ManifestSplitDimCondition | ||
|
|
||
| split_config = ManifestSplittingConfig.from_dict( | ||
| { | ||
| ManifestSplitCondition.AnyArray(): { | ||
| ManifestSplitDimCondition.DimensionName("time"): 365 * 24 | ||
| } | ||
| } | ||
| ) | ||
| repo_config = ic.RepositoryConfig(manifest=ic.ManifestConfig(splitting=split_config)) | ||
| ``` | ||
|
|
||
| Then pass the config to `Repository.open` or `Repository.create`: | ||
| ```python | ||
| repo = ic.Repository.open(..., config=repo_config) | ||
| ``` | ||
|
|
||
| This particular example splits manifests so that each manifest file holds the references for `365 * 24` chunks along the time dimension and for every chunk along all other dimensions. | ||
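To get a feel for what a split size means in practice, the arithmetic can be sketched in plain Python. This is only an illustration, not Icechunk's implementation: `count_manifests` is a hypothetical helper, and it assumes each manifest covers a rectangular block of chunks, with `None` meaning "do not split along this axis". The chunk grid shape is made up for the example.

```python
def count_manifests(chunk_grid_shape, split_sizes):
    """Estimate how many manifest files one array produces.

    chunk_grid_shape: number of chunks along each axis.
    split_sizes: chunks per manifest along each axis, or None to keep
                 every chunk along that axis in the same manifest.
    (Hypothetical helper; assumes rectangular, axis-aligned splits.)
    """
    count = 1
    for n_chunks, size in zip(chunk_grid_shape, split_sizes):
        if size is not None:
            # Integer ceiling division: a partial block still needs a manifest.
            count *= -(-n_chunks // size)
    return count


# Assumed example: 10 years of hourly data with one time step per chunk,
# and a 4 x 8 grid of chunks in space.
grid = (10 * 365 * 24, 4, 8)
split = (365 * 24, None, None)

n_manifests = count_manifests(grid, split)
refs_per_manifest = (365 * 24) * 4 * 8
print(n_manifests, refs_per_manifest)
```

Under these assumptions, a reader that needs one year of data touches a single manifest instead of one file holding every reference for the array.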
|
|
||
| Options for specifying the arrays whose manifest you want to split are: | ||
|
|
||
| 1. [`ManifestSplitCondition.name_matches`](./reference.md#icechunk.ManifestSplitCondition.name_matches) takes a regular expression used to match an array's name; | ||
| 2. [`ManifestSplitCondition.path_matches`](./reference.md#icechunk.ManifestSplitCondition.path_matches) takes a regular expression used to match an array's path; | ||
| 3. [`ManifestSplitCondition.and_conditions`](./reference.md#icechunk.ManifestSplitCondition.and_conditions) to combine (1), (2), and (4) together; and | ||
| 4. [`ManifestSplitCondition.or_conditions`](./reference.md#icechunk.ManifestSplitCondition.or_conditions) to combine (1), (2), and (3) together. | ||
|
|
||
|
|
||
| `and_conditions` and `or_conditions` may be used to combine multiple path and/or name matches. For example, | ||
| ```python exec="on" session="perf" source="material-block" | ||
| array_condition = ManifestSplitCondition.or_conditions( | ||
| [ | ||
| ManifestSplitCondition.name_matches("temperature"), | ||
| ManifestSplitCondition.name_matches("salinity"), | ||
| ] | ||
| ) | ||
| sconfig = ManifestSplittingConfig.from_dict( | ||
| {array_condition: {ManifestSplitDimCondition.DimensionName("longitude"): 3}} | ||
| ) | ||
| ``` | ||
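The way such conditions compose and nest can be sketched with plain Python predicates. The functions below are stand-ins that only borrow the names of the `ManifestSplitCondition` constructors; they are not the Icechunk API. Two assumptions are made purely for illustration: an array's name is taken to be the last component of its path, and patterns are applied with `re.search` (the docs above do not say whether Icechunk anchors the regular expression).

```python
import re


def name_matches(pattern):
    # Stand-in: match against the last path component (assumed to be the name).
    return lambda path: re.search(pattern, path.rsplit("/", 1)[-1]) is not None


def path_matches(pattern):
    # Stand-in: match against the full array path.
    return lambda path: re.search(pattern, path) is not None


def or_conditions(conditions):
    # True if at least one sub-condition matches.
    return lambda path: any(cond(path) for cond in conditions)


def and_conditions(conditions):
    # True only if every sub-condition matches.
    return lambda path: all(cond(path) for cond in conditions)


tracer = or_conditions([name_matches("temperature"), name_matches("salinity")])
# Conditions nest: restrict the tracer arrays to a hypothetical /ocean group.
ocean_tracer = and_conditions([path_matches("^/ocean/"), tracer])

for path in ["/ocean/temperature", "/ocean/velocity", "/atmosphere/temperature"]:
    print(path, tracer(path), ocean_tracer(path))
```

The array paths here are invented; the point is only that an `or` of name matches can itself be one operand of an `and`.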
|
|
||
| Options for specifying how to split along a specific axis or dimension are: | ||
|
|
||
| 1. [`ManifestSplitDimCondition.Axis`](./reference.md#icechunk.ManifestSplitDimCondition.Axis) takes an integer axis; | ||
| 2. [`ManifestSplitDimCondition.DimensionName`](./reference.md#icechunk.ManifestSplitDimCondition.DimensionName) takes a regular expression used to match the dimension names of the array; | ||
| 3. [`ManifestSplitDimCondition.Any`](./reference.md#icechunk.ManifestSplitDimCondition.Any) matches any _remaining_ dimension name or axis. | ||
|
|
||
|
|
||
| For example, for an array with dimensions `time, latitude, longitude`, the following config | ||
| ```python exec="on" session="perf" source="material-block" | ||
| from icechunk import ManifestSplitDimCondition | ||
|
|
||
| { | ||
| ManifestSplitDimCondition.DimensionName("longitude"): 3, | ||
| ManifestSplitDimCondition.Axis(1): 2, | ||
| ManifestSplitDimCondition.Any(): 1, | ||
| } | ||
| ``` | ||
| will result in splitting manifests so that each manifest contains (3 longitude chunks x 2 latitude chunks x 1 time chunk) = 6 chunks per manifest file. | ||
|
||
|
|
||
|
|
||
| !!! note | ||
|
|
||
| Python dictionaries preserve insertion order, so the first matching condition takes priority. | ||
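The first-match rule can be sketched in plain Python. This is a hedged illustration rather than Icechunk's code: conditions are modelled as hypothetical tuples (`("name", regex)`, `("axis", index)`, `("any",)`), and dimension names are compared with `re.fullmatch`, which is an assumption about how the pattern is applied.

```python
import re


def resolve_split_sizes(dim_names, conditions):
    """Return the split size chosen for each dimension.

    conditions: dict mapping a hypothetical condition tuple to a split size.
    The dict is scanned in insertion order and the first match wins.
    """
    sizes = []
    for axis, name in enumerate(dim_names):
        chosen = None  # None: no condition matched this dimension
        for cond, size in conditions.items():
            kind = cond[0]
            matched = (
                kind == "any"
                or (kind == "axis" and cond[1] == axis)
                or (kind == "name" and re.fullmatch(cond[1], name) is not None)
            )
            if matched:
                chosen = size
                break
        sizes.append(chosen)
    return sizes


dims = ["time", "latitude", "longitude"]
conditions = {("name", "longitude"): 3, ("axis", 1): 2, ("any",): 1}
print(resolve_split_sizes(dims, conditions))

# Putting the catch-all first shadows the more specific conditions.
shadowed = {("any",): 1, ("name", "longitude"): 3, ("axis", 1): 2}
print(resolve_split_sizes(dims, shadowed))
```

In this sketch the first ordering yields one time chunk, two latitude chunks and three longitude chunks per manifest, while the second ordering collapses every dimension to the catch-all size, which is why a catch-all condition belongs last.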