PhysicsNemo Datapipes #1304

Open - coreyjadams wants to merge 13 commits into NVIDIA:main from coreyjadams:datapipes-rebased
Commits (13), all by coreyjadams:

- 80fa0e5 Passing tests for datapipes
- 036a979 Fix datapipe tests
- ec7cd21 First pass at examples, not quite ready yet.
- 322a40c Add datapipe tutorials.
- 94f3dd1 Update migration guide.
- 0180ecb Update codeowners for datapipes.
- 639e2fa Merge branch 'main' into datapipes-rebased
- 2e028c6 Address greptile review comments
- 9902ace Fix failing datapipe tests
- 745dbaf Update physicsnemo/datapipes/core/transforms/base.py
- 6003c7d Update physicsnemo/datapipes/core/readers/hdf5.py
- 5c4c4ce Merge branch 'NVIDIA:main' into datapipes-rebased
- 437ff79 Merge branch 'NVIDIA:main' into datapipes-rebased
New file (+4 lines):

```
*.json
*.npz
*.npy
*.zarr
```
New file (+209 lines):
# PhysicsNeMo DataPipes

Data loading is critical to SciML applications, both for training and inference,
and the PhysicsNeMo datapipe infrastructure aims to deliver a flexible, configurable
set of tools to support your application.
There are plenty of tools in the Python ecosystem for loading, preprocessing, and
preparing your data for training or inference. To compare and contrast some of these
tools with what PhysicsNeMo offers, and to see whether the PhysicsNeMo datapipe
interface might be valuable for your workload, consider the following design
principles we followed when building the PhysicsNeMo datapipes:
1. **GPU-first** - Many scientific datasets are *large* even for a single example:
   the data is high resolution, and the preprocessing benefits from GPU acceleration.
   Compare this to tools where data preprocessing is predominantly CPU-based, such
   as the PyTorch DataLoader: whereas CPU-based preprocessing may introduce GPU
   pipeline stalls on high-resolution data, GPU-based preprocessing maximizes
   throughput.
2. **Threading over Multiprocessing** - In Python, true concurrency is typically only
   available via multiprocessing, or when offloading to compiled libraries or GPU
   kernels. For this reason, many data loaders leverage multiprocessing for data
   concurrency: load images in separate processes, and collate a batch on the main
   thread. For simplicity, with a GPU-first paradigm, the PhysicsNeMo datapipe focuses
   on GPU concurrency via asynchronous execution and stream-based parallelism. IO is
   coordinated in multiple threads, instead of multiple processes, and streams enable
   multiple preprocessing pipelines to execute concurrently on the GPU.
3. **Unambiguous Configuration and Serialization** - Datapipes can be a particularly
   frustrating component in the reproducibility of AI results: the preprocessing,
   sampling, batching, and other parameters can be hard to infer from training
   scripts. Here, we make a deliberate design choice to treat datapipe configuration
   serialization as a first-class citizen. PhysicsNeMo Datapipes can be built directly
   in Python, but also instantiated from Hydra YAML files for version control and
   distribution.

> **Review comment (Collaborator):** Would be worth defining "datapipes" here for new readers.
>
> **Author:** Good idea, on it.
4. **Familiar Interfaces** - We built our tools from scratch, but they are meant
   to look familiar and interoperate with the tools you already know. Use
   PhysicsNeMo DataLoaders as a replacement for PyTorch's DataLoader; tools like
   DistributedSampler will still work. Users of `torchvision` will be familiar
   with the concept of chaining transformations together.
5. **Extensibility out of the box** - We want to provide a data pipeline that gives
   great performance and usability immediately, but no single codebase covers all
   possible data needs out of the box. Therefore, the PhysicsNeMo datapipe is
   extensible: you can build custom data readers for new data formats and plug them
   into datasets; you can build new transforms for your data that we might not have,
   and simply plug them into a transformation pipeline. You can even package all of
   this up as a pip-installable extension: the built-in registry lets you instantiate
   and version-control datapipes even when their components are not part of
   PhysicsNeMo.
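The thread-based IO coordination described in principle 2 can be sketched with a toy prefetcher. This is an illustrative simplification only, not the PhysicsNeMo implementation: the real datapipes layer CUDA streams and device transfers on top of this idea, and the `prefetch` name is made up for this sketch.

```python
# Toy sketch of thread-based preloading: a background thread fills a bounded
# queue while the consumer drains it, so loading overlaps with consumption.
import queue
import threading


def prefetch(load_fn, indices, maxsize=2):
    """Yield load_fn(i) for each index, loading ahead in a worker thread."""
    q = queue.Queue(maxsize=maxsize)

    def worker():
        for i in indices:
            q.put(load_fn(i))
        q.put(None)  # sentinel: no more items

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not None:
        yield item


loaded = list(prefetch(lambda i: i * i, range(5)))
print(loaded)  # [0, 1, 4, 9, 16]
```

Because a single worker feeds a FIFO queue, example order is preserved even though loading happens concurrently with consumption.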
## When should I use PhysicsNeMo datapipes over X/Y/Z data utility?

In general, the PhysicsNeMo datapipe utility is built to deliver good performance
on data that is large per example, as most scientific data is. If you want a
batch size of 512 small images, it may be more performant to use a CPU-centric
tool.

Another advantage of the PhysicsNeMo datapipe is the ability to build datapipes
directly from configuration files, allowing serializable and version-controlled
data configuration. This isn't the only tool that can do this, of course.
## Core Datapipe Design

Think of datasets as a hierarchy of data: at the highest level, an entire **dataset**
consists of independent **examples**. Each example has one or more **tensor components**:
image data may have input images and target labels; CFD data may have positions,
target pressures, a mesh object, boundary conditions, etc.; weather data may contain
sensor readings as a function of time. Each example may be the same size as the others,
or each example may be a unique size. Even the components of an example can be variable,
though this can require extra care in reading and using the dataset.
The PhysicsNeMo datapipe consists of the following components:

- `reader` objects contain the logic to understand a **dataset** on disk and
  load examples into CPU memory.

- The `dataset` object, which contains a `reader`, orchestrates threads that preload
  data **examples** from disk and move them to the GPU. On the GPU, a `dataset` can
  apply a series of transformations to each **example**. Each example is stored in
  `tensordict` format. The dataset also tracks metadata for understanding where each
  **example** came from (index, filepath, etc.).

- A `transform` is a callable class that accepts a `tensordict` as input and returns
  a `tensordict` as output. Chaining transformations together is the core way to
  manipulate data examples on the fly in a datapipe.

- The `dataloader` is a drop-in replacement for the PyTorch DataLoader, with additional
  optimizations for the GPU-centric processing here. The `dataloader` handles
  stream concurrency, batch collation, and triggering preloading of datasets.
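The transform contract described above can be sketched in a few lines. A plain dict stands in for the TensorDict here, and the `Scale` and `Compose` names are illustrative, not actual PhysicsNeMo classes:

```python
# Hypothetical sketch of the transform contract: a callable that maps one
# example (dict standing in for a TensorDict) to another example.


class Scale:
    """Multiply one named component of an example by a constant factor."""

    def __init__(self, key, factor):
        self.key = key
        self.factor = factor

    def __call__(self, example):
        example = dict(example)  # shallow copy; real transforms act on TensorDicts
        example[self.key] = [v * self.factor for v in example[self.key]]
        return example


class Compose:
    """Apply a sequence of transforms in order, torchvision-style."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, example):
        for transform in self.transforms:
            example = transform(example)
        return example


pipeline = Compose([Scale("pressure", 2.0), Scale("pressure", 0.5)])
print(pipeline({"pressure": [1.0, 2.0]}))  # {'pressure': [1.0, 2.0]}
```

Because every transform has the same signature, chains compose freely: any callable that maps one example to another can slot into the pipeline.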
---
## Tutorials

This directory contains progressive tutorials that teach you how to use the
PhysicsNeMo datapipe infrastructure effectively. Note that some of the tutorials
are deliberately repetitive and verbose, to highlight different features of the
datapipe ecosystem. We'll give an overview of what you can learn in each tutorial,
but they are meant to be run interactively and explored.
### Data Prerequisites

You do not need any specific data in hand for the tutorials. You can
generate synthetic data with the scripts `generate_regular_data.py` and
`generate_variable_points_data.py`.
### Tutorial 1: Getting Started with DataPipes

**File:** `tutorial_01_getting_started.py`

Learn the core concepts of data loading from disk:

- Creating a Reader to load data from files
- Understanding the `(TensorDict, metadata)` return format
- Wrapping a reader in a Dataset
- Iterating with a DataLoader
- Accessing batch data via TensorDict keys
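The reader idea and the `(TensorDict, metadata)` return format from the list above can be sketched with a toy in-memory reader. The class and method names here are assumptions for illustration, not the real reader API, and a plain dict stands in for the TensorDict:

```python
# Minimal sketch of the reader contract: each example comes back as a
# (data, metadata) pair, with metadata recording provenance.


class ListReader:
    """Toy reader over an in-memory list, mirroring read-by-index from disk."""

    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def read(self, index):
        data = {"features": self.records[index]}
        metadata = {"index": index}  # where the example came from
        return data, metadata


reader = ListReader([[1.0, 2.0], [3.0, 4.0]])
data, meta = reader.read(1)
print(data["features"], meta)  # [3.0, 4.0] {'index': 1}
```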
```bash
# Generate tutorial data first
python generate_regular_data.py -n 100 \
    -s "velocity:128,128,128,3 pressure:128,128,128,1 position:128,128,128,3" \
    -b zarr -o output/tutorial_data/

# Run the tutorial
python tutorial_01_getting_started.py
```
### Tutorial 2: Transforms and Data Preprocessing

**File:** `tutorial_02_transforms.py`

Build preprocessing pipelines with transforms:

- Apply a single transform (Normalize)
- Compose multiple transforms together
- Subsample point clouds with SubsamplePoints
- Use geometric transforms (Translate, ReScale)
- Save/load normalization statistics from files
- Denormalize data with the `inverse()` method
```bash
# Generate regular grid data (for most sections)
# Note: Tutorial 2 can reuse the data from Tutorial 1
python generate_regular_data.py -n 100 \
    -s "velocity:128,128,128,3 pressure:128,128,128,1 position:128,128,128,3" \
    -b zarr -o output/tutorial_data/

# Generate point cloud data (for subsampling sections)
python generate_variable_points_data.py -n 100 \
    -s "coords:3 features:8" --min-points 50000 \
    --max-points 100000 -b zarr -o output/pointcloud_data/

# Run the tutorial
python tutorial_02_transforms.py
```
### Tutorial 3: Custom Collation for GNNs

**File:** `tutorial_03_custom_gnn_datapipe.py`

Build a GNN-ready data pipeline with custom collation:

- Build a custom Transform for computing KNN graph edges
- Implement a custom Collator for PyG-style graph batching
- Understand how PyG batches graphs (offset edges, concatenate features, batch tensor)
- Put it all together in a complete GNN training pipeline
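The PyG-style batching steps listed above can be sketched without PyG itself, using plain Python lists. This is a simplified illustration of the idea, not the tutorial's actual collator:

```python
# Simplified sketch of PyG-style graph batching: concatenate node features,
# offset edge indices by the running node count, and record a `batch` vector
# mapping each node back to its source graph.


def collate_graphs(graphs):
    """Each graph is a dict with 'x' (per-node features) and
    'edge_index' (a list of [src, dst] pairs)."""
    x, edges, batch, offset = [], [], [], 0
    for graph_id, g in enumerate(graphs):
        x.extend(g["x"])
        edges.extend([[s + offset, d + offset] for s, d in g["edge_index"]])
        batch.extend([graph_id] * len(g["x"]))
        offset += len(g["x"])
    return {"x": x, "edge_index": edges, "batch": batch}


g1 = {"x": [[0.1], [0.2]], "edge_index": [[0, 1]]}
g2 = {"x": [[0.3], [0.4], [0.5]], "edge_index": [[0, 2], [1, 2]]}
out = collate_graphs([g1, g2])
print(out["edge_index"])  # [[0, 1], [2, 4], [3, 4]]
print(out["batch"])       # [0, 0, 1, 1, 1]
```

Offsetting the edge indices is what lets many variable-sized graphs live in one big disjoint graph, so message passing runs over the whole batch at once.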
```bash
# Generate point cloud data with coordinates and features (can be reused from tutorial 2)
python generate_variable_points_data.py -n 100 \
    -s "coords:3 features:8" --min-points 50000 \
    --max-points 100000 -b zarr -o output/pointcloud_data/

# Run the tutorial
python tutorial_03_custom_gnn_datapipe.py
```
### Tutorial 4: Hydra Configuration for DataPipes

**File:** `tutorial_04_hydra_config.py`

Build entire datapipes from YAML configuration with minimal Python code:

- Define reader, transforms, dataset, and dataloader in YAML
- Use `hydra.utils.instantiate()` to build components
- Override any parameter from the command line
- Switch between configurations easily
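The `_target_`-based instantiation that Hydra performs can be sketched in miniature. `hydra.utils.instantiate` does much more (recursive instantiation, interpolation, `_convert_` handling), and the `datetime.timedelta` target below is purely a stand-in for a real reader or transform class:

```python
# Simplified sketch of what `_target_`-style instantiation does: resolve a
# dotted import path, then call the resolved class with the remaining keys.
import importlib


def instantiate(config):
    """Build an object from a config dict containing a `_target_` path."""
    config = dict(config)
    module_path, _, name = config.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), name)
    return cls(**config)


# Illustrative stand-in target; a datapipe config would point at a reader,
# transform, dataset, or dataloader class instead.
obj = instantiate({"_target_": "datetime.timedelta", "days": 2})
print(obj.days)  # 2
```

This is why YAML configs are enough to build a full datapipe: every component is just a class path plus keyword arguments, which also makes the configuration trivially serializable.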
```bash
# Generate tutorial data (from tutorials 2 and 3)
python generate_variable_points_data.py -n 100 -s \
    "coords:3 features:8" --min-points 50000 \
    --max-points 100000 -b zarr -o output/pointcloud_data/

# Run with default config
python tutorial_04_hydra_config.py

# Override from command line
python tutorial_04_hydra_config.py dataloader.batch_size=8 dataloader.dataset.device=cuda

# Use point cloud configuration (this is the default)
python tutorial_04_hydra_config.py --config-name tutorial_04_pointcloud

# Override transform parameters
python tutorial_04_hydra_config.py --config-name tutorial_04_pointcloud \
    subsample.n_points=5000
```
New file (+18 lines):

```yaml
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

defaults:
  - reader/zarr
```
New file (+15 lines): the same Apache-2.0 license header as above, with no additional content.
`examples/minimal/datapipes/conf/reader/tensorstore_zarr.yaml` (+39 lines):

```yaml
# (same Apache-2.0 license header as above)

# TensorStore Zarr Reader Configuration
# High-performance async reader for Zarr files using TensorStore
_target_: physicsnemo.datapipes.core.readers.TensorStoreZarrReader
path: ???
group_pattern: "*.zarr"
fields: null
default_values: null
cache_bytes_limit: 10000000  # 10 MB cache
data_copy_concurrency: 72
file_io_concurrency: 72
pin_memory: false
include_index_in_metadata: true

# Coordinated subsampling configuration (optional)
# Reads contiguous chunks from large tensors to reduce IO bandwidth.
# Set to null to disable, or configure n_points and target_keys.
coordinated_subsampling: null
# Example:
# coordinated_subsampling:
#   n_points: 10000
#   target_keys:
#     - volume_coords
#     - volume_fields
```
New file (+25 lines):

```yaml
# (same Apache-2.0 license header as above)

# _target_: physicsnemo.datapipes.core.readers.TensorStoreZarrReader
_target_: physicsnemo.datapipes.core.readers.ZarrReader
path: ???
fields: null
default_values: null
group_pattern: "*.zarr"
pin_memory: false
include_index_in_metadata: true
```
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES. | ||
| # SPDX-FileCopyrightText: All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| # Normalize Transform Configuration | ||
| # Normalizes specified fields using mean-std or min-max scaling | ||
| # | ||
| # Method options: | ||
| # - mean_std: Applies (x - mean) / std | ||
| # - min_max: Applies (x - center) / half_range, normalizing to [-1, 1] | ||
| _target_: physicsnemo.datapipes.core.transforms.Normalize | ||
| _convert_: all | ||
| input_keys: | ||
| - features | ||
| method: mean_std # Required: "mean_std" or "min_max" | ||
| means: | ||
| features: 0.0 | ||
| stds: | ||
| features: 0.6 | ||
| eps: 1.0e-8 | ||
|
|
||
| # For min_max method, use these instead of means/stds: | ||
| # mins: | ||
| # features: -1.0 | ||
| # maxs: | ||
| # features: 1.0 |
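The two normalization methods named in the config above reduce to simple arithmetic. Here is a scalar sketch; the real transform operates on tensors and also provides an `inverse()`, and exactly how `eps` enters the computation is an assumption for this illustration:

```python
# Scalar sketch of the two normalization methods from the config.


def mean_std(x, mean, std, eps=1.0e-8):
    """mean_std method: (x - mean) / std, with eps guarding near-zero stds
    (one plausible use of the eps key in the config)."""
    return (x - mean) / (std + eps)


def min_max(x, lo, hi):
    """min_max method: (x - center) / half_range, mapping [lo, hi] to [-1, 1]."""
    center = (hi + lo) / 2.0
    half_range = (hi - lo) / 2.0
    return (x - center) / half_range


print(min_max(0.5, -1.0, 1.0))   # 0.5
print(min_max(-1.0, -1.0, 1.0))  # -1.0
```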
> **Review comment (Collaborator):** Somewhat minor, but I notice that `python`/`physicsnemo`/`pytorch` are consistently not capitalized throughout this *.md and others. My preference would be to capitalize them in any plaintext docs ("Python", "PhysicsNeMo", "PyTorch").
>
> **Author reply:** Good point, I'll fix it. At least you know that if an LLM had written this, it would have capitalized it :)