4 changes: 4 additions & 0 deletions examples/minimal/datapipes/.gitignore
@@ -0,0 +1,4 @@
*.json
*.npz
*.npy
*.zarr
209 changes: 209 additions & 0 deletions examples/minimal/datapipes/README.md
@@ -0,0 +1,209 @@
# PhysicsNeMo DataPipes

Data loading is critical to SciML applications, both for training and inference,
and the PhysicsNeMo datapipe infrastructure aims to deliver a flexible and configurable
set of tools to enable your application. Here, a *datapipe* is the full chain of
reader, dataset, transforms, and dataloader that moves examples from disk, through
GPU preprocessing, and into your training or inference loop.

> **Review comment (Collaborator):** Somewhat minor, but I notice that python / physicsnemo / pytorch are consistently not capitalized throughout this *.md and others. My preference would be to capitalize them in any plaintext docs ("Python", "PhysicsNeMo", "PyTorch").
>
> **Reply (Author):** Good point, I'll fix it. At least you know that if an LLM had written this, it would have capitalized it :)

There are plenty of tools in the Python ecosystem for loading, preprocessing, and
preparing your data for training or inference. To see how the PhysicsNeMo datapipe
interface compares with those tools, and whether it might be valuable for your
workload, consider the following design principles we followed when building the
PhysicsNeMo datapipes:

1. **GPU-first** - Many scientific datasets are *large* for even a single example:
the data is high resolution, and its preprocessing benefits from GPU acceleration.
Compare this to tools where data preprocessing is predominantly CPU-based,
such as the PyTorch DataLoader: whereas CPU-based preprocessing may introduce GPU
pipeline stalls on high-resolution data, GPU-based preprocessing maximizes
throughput.

2. **Threading over Multiprocessing** - In Python, true concurrency is typically only
available via multiprocessing or when offloading to compiled libraries or GPU kernels.
For this reason, many data loaders leverage multiprocessing for data concurrency:
load images in separate processes, and collate a batch on the main thread.
For simplicity, with a GPU-first paradigm, the PhysicsNeMo datapipe focuses on GPU
concurrency via asynchronous execution and stream-based parallelism. IO is coordinated
in multiple threads, instead of multiple processes, and streams enable multiple
preprocessing pipelines to execute concurrently on the GPU.

3. **Unambiguous Configuration and Serialization** - Datapipes can be a particularly
frustrating component in the reproducibility of AI results - the preprocessing, sampling,
batching, and other parameters can be hard to infer from training scripts. Here,
we make a deliberate design choice to treat datapipe configuration serialization
as a first-class citizen. PhysicsNeMo datapipes can be built directly in Python,
but also instantiated from Hydra YAML files for version control and distribution.

> **Review comment (Collaborator):** Would be worth defining "datapipes" here for new readers.
>
> **Reply (Author):** Good idea, on it.

4. **Familiar Interfaces** - We built our tools from scratch, but they are meant
to look familiar and interoperate with the tools you already know. Use
PhysicsNeMo DataLoaders as a replacement for PyTorch's DataLoader; tools like
DistributedSampler will still work. Users of `torchvision` will be familiar
with the concept of chaining transformations together.

5. **Extensibility out of the box** - We want to provide a data pipeline that gives
great performance and usability immediately - but it will never be the case that
one codebase covers all possible data needs out of the box. Therefore, the
PhysicsNeMo datapipe is extensible: you can build custom data readers for
new data formats and plug them into datasets; you can build new transforms
for your data that we might not have, and simply plug them into a transformation
pipeline. You can even package all of this up as a pip-installable extension: using
the built-in registry, you can still instantiate and version-control datapipes
even when their components are not part of PhysicsNeMo.

## When should I use PhysicsNeMo datapipes over X/Y/Z data utility?

In general, the PhysicsNeMo datapipe utility is built to deliver good performance
on data that is large per example, as most scientific data is. If you want a
batch size of 512 small images, it may be more performant to use a CPU-centric
tool.

Another advantage of the PhysicsNeMo datapipe is the ability to build datapipes
directly from configuration files, allowing serializable and version-controlled
data configuration. This isn't the only tool that can do this, of course.

## Core Datapipe Design

Think of datasets as a hierarchy of data: at the highest level, an entire **dataset**
consists of independent **examples**. Each example has one or more **tensor components**:
image data may have input images and target labels; CFD data may have positions,
target pressures, a mesh object, boundary conditions, etc.; weather data may contain
sensor readings as a function of time. Each example may be the same size as the others,
or each example may be a unique size. Even the components of an example can be variable,
though this can require extra care in reading and using the dataset.
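
To make the hierarchy concrete, here is a sketch of one **example** represented as a
`tensordict` of named tensor components; the component names and shapes are simply
those produced by the synthetic data generator used in the tutorials below.

```python
import torch
from tensordict import TensorDict

# One example: three named tensor components on a regular 128^3 grid.
example = TensorDict(
    {
        "velocity": torch.zeros(128, 128, 128, 3),
        "pressure": torch.zeros(128, 128, 128, 1),
        "position": torch.zeros(128, 128, 128, 3),
    },
    batch_size=[],
)
print(list(example.keys()))  # ['velocity', 'pressure', 'position']
```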

The PhysicsNeMo datapipe consists of the following components:

- `reader` objects contain the logic to understand a **dataset** on disk, and
load examples into CPU memory.

- The `dataset` object, which contains a `reader`, orchestrates threads that preload
data **examples** from disk and move them to the GPU. On the GPU, a `dataset` can apply a
series of transformations to each **example**. Each example is stored in `tensordict`
format. The dataset also tracks metadata for understanding where each **example**
came from (index, filepath, etc.).

- A `transform` is a callable class that accepts a `tensordict` as input and returns
a `tensordict` as output. Chaining transformations together is the core way to
manipulate data examples on the fly in a datapipe (a minimal sketch follows this list).

- The `dataloader` is a drop-in replacement for the PyTorch DataLoader, with additional
optimizations for the GPU-centric processing here. The `dataloader` handles
stream concurrency, batch collation, and triggering preloading of datasets.
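
As an illustration of the `transform` contract above, here is a minimal sketch of a
custom transform: a plain callable class that takes a `tensordict` and returns one.
The class name and the scaling it applies are invented for illustration; only the
call signature reflects the contract described in this list.

```python
import torch
from tensordict import TensorDict


class ScaleVelocity:
    """Hypothetical transform: rescale the "velocity" component by a constant factor."""

    def __init__(self, factor: float):
        self.factor = factor

    def __call__(self, sample: TensorDict) -> TensorDict:
        # Touch only the keys this transform cares about; everything else passes through.
        sample["velocity"] = sample["velocity"] * self.factor
        return sample


# Transforms chain naturally: the output of one is the input of the next.
pipeline = [ScaleVelocity(0.5), ScaleVelocity(2.0)]
sample = TensorDict({"velocity": torch.randn(1000, 3)}, batch_size=[])
for transform in pipeline:
    sample = transform(sample)
```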

---

## Tutorials

This directory contains progressive tutorials that teach you how to use the
PhysicsNeMo datapipe infrastructure effectively. Note that some of the tutorials
are repetitive and verbose, to highlight different features of the datapipe
ecosystem. We'll give some overview of what you can learn in each tutorial,
but they are meant to be run interactively and explored.

### Data Prerequisites

You do not need to have any specific data in hand for the tutorials. You can
generate synthetic data with the scripts `generate_regular_data.py` and
`generate_variable_points_data.py`.

### Tutorial 1: Getting Started with DataPipes

**File:** `tutorial_01_getting_started.py`

Learn the core concepts of data loading from disk:

- Creating a Reader to load data from files
- Understanding the `(TensorDict, metadata)` return format
- Wrapping a reader in a Dataset
- Iterating with a DataLoader
- Accessing batch data via TensorDict keys

```bash
# Generate tutorial data first
python generate_regular_data.py -n 100 \
-s "velocity:128,128,128,3 pressure:128,128,128,1 position:128,128,128,3" \
-b zarr -o output/tutorial_data/

# Run the tutorial
python tutorial_01_getting_started.py
```
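
For orientation, a heavily abbreviated sketch of the flow this tutorial walks through.
The `ZarrReader` target and its `path`/`group_pattern` arguments come from
`conf/reader/zarr.yaml` below, but the indexing shown here and the way the reader is
wrapped in a Dataset and DataLoader are assumptions - see the tutorial script for the
real interfaces.

```python
# Abbreviated sketch only; indexing and wrapping details are assumptions,
# not the tutorial's exact code.
from physicsnemo.datapipes.core.readers import ZarrReader

reader = ZarrReader(path="output/tutorial_data/", group_pattern="*.zarr")
sample, metadata = reader[0]        # assumed indexing; (TensorDict, metadata) format
print(sample["velocity"].shape)     # access components by TensorDict key
```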

### Tutorial 2: Transforms and Data Preprocessing

**File:** `tutorial_02_transforms.py`

Build preprocessing pipelines with transforms:

- Apply a single transform (Normalize)
- Compose multiple transforms together
- Subsample point clouds with SubsamplePoints
- Use geometric transforms (Translate, ReScale)
- Save/load normalization statistics from files
- Denormalize data with the `inverse()` method

```bash
# Generate regular grid data (for most sections)
# Note: Tutorial 2 can reuse the data from Tutorial 1
python generate_regular_data.py -n 100 \
-s "velocity:128,128,128,3 pressure:128,128,128,1 position:128,128,128,3" \
-b zarr -o output/tutorial_data/

# Generate point cloud data (for subsampling sections)
python generate_variable_points_data.py -n 100 \
-s "coords:3 features:8" --min-points 50000 \
--max-points 100000 -b zarr -o output/pointcloud_data/

# Run the tutorial
python tutorial_02_transforms.py
```
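
For reference, a sketch of the `Normalize` transform applied and inverted by hand.
The constructor arguments mirror `conf/transforms/normalize.yaml` below, and
`inverse()` comes from the bullet list above; the tutorial itself may wire these
pieces together differently.

```python
import torch
from tensordict import TensorDict
from physicsnemo.datapipes.core.transforms import Normalize

# Arguments mirror conf/transforms/normalize.yaml (mean_std method).
normalize = Normalize(
    input_keys=["features"],
    method="mean_std",
    means={"features": 0.0},
    stds={"features": 0.6},
)

sample = TensorDict({"features": torch.randn(10000, 8)}, batch_size=[])
normalized = normalize(sample)            # (x - mean) / std
restored = normalize.inverse(normalized)  # undo the normalization
```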

### Tutorial 3: Custom Collation for GNNs

**File:** `tutorial_03_custom_gnn_datapipe.py`

Build a GNN-ready data pipeline with custom collation:

- Build a custom Transform for computing KNN graph edges
- Implement a custom Collator for PyG-style graph batching
- Understand how PyG batches graphs (offset edges, concatenate features, batch tensor)
- Put it all together in a complete GNN training pipeline

```bash
# Generate point cloud data with coordinates and features (can be reused from tutorial 2)
python generate_variable_points_data.py -n 100 \
-s "coords:3 features:8" --min-points 50000 \
--max-points 100000 -b zarr -o output/pointcloud_data/

# Run the tutorial
python tutorial_03_custom_gnn_datapipe.py
```
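
The core batching idea the tutorial builds on, sketched as a standalone helper
(the `collate_graphs` name and signature are illustrative, not the tutorial's
`Collator` interface): node features are concatenated, edge indices are offset by
the running node count, and a `batch` tensor records which graph each node belongs to.

```python
import torch


def collate_graphs(node_feats: list[torch.Tensor], edge_indices: list[torch.Tensor]):
    """Illustrative PyG-style batching of several graphs into one disjoint graph."""
    sizes = [x.shape[0] for x in node_feats]
    offsets = torch.cumsum(torch.tensor([0] + sizes[:-1]), dim=0)

    x = torch.cat(node_feats, dim=0)  # concatenate node features
    edge_index = torch.cat(
        [e + off for e, off in zip(edge_indices, offsets)], dim=1
    )  # offset edge indices into the concatenated node numbering
    batch = torch.cat(
        [torch.full((n,), i, dtype=torch.long) for i, n in enumerate(sizes)]
    )  # graph id for every node
    return x, edge_index, batch
```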

### Tutorial 4: Hydra Configuration for DataPipes

**File:** `tutorial_04_hydra_config.py`

Build entire datapipes from YAML configuration with minimal Python code:

- Define reader, transforms, dataset, and dataloader in YAML
- Use `hydra.utils.instantiate()` to build components
- Override any parameter from the command line
- Switch between configurations easily

```bash
# Generate tutorial data (from tutorials 2 and 3)
python generate_variable_points_data.py -n 100 -s \
"coords:3 features:8" --min-points 50000 \
--max-points 100000 -b zarr -o output/pointcloud_data/

# Run with default config
python tutorial_04_hydra_config.py

# Override from command line
python tutorial_04_hydra_config.py dataloader.batch_size=8 dataloader.dataset.device=cuda

# Use point cloud configuration (this is the default)
python tutorial_04_hydra_config.py --config-name tutorial_04_pointcloud

# Override transform parameters
python tutorial_04_hydra_config.py --config-name tutorial_04_pointcloud \
subsample.n_points=5000
```
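
For reference, a minimal sketch of the config-driven pattern, assuming a Hydra app
whose config composes the YAML files under `conf/`; the `cfg.dataloader` node name
is illustrative - check the tutorial script and configs for the actual structure.

```python
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Instantiate the configured object graph (reader, dataset, dataloader, ...).
    dataloader = hydra.utils.instantiate(cfg.dataloader)
    for batch in dataloader:
        ...


if __name__ == "__main__":
    main()
```
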
18 changes: 18 additions & 0 deletions examples/minimal/datapipes/conf/config.yaml
@@ -0,0 +1,18 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

defaults:
- reader/zarr
15 changes: 15 additions & 0 deletions examples/minimal/datapipes/conf/reader/npz.yaml
@@ -0,0 +1,15 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
39 changes: 39 additions & 0 deletions examples/minimal/datapipes/conf/reader/tensorstore_zarr.yaml
@@ -0,0 +1,39 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# TensorStore Zarr Reader Configuration
# High-performance async reader for Zarr files using TensorStore
_target_: physicsnemo.datapipes.core.readers.TensorStoreZarrReader
path: ???
group_pattern: "*.zarr"
fields: null
default_values: null
cache_bytes_limit: 10000000 # 10 MB cache
data_copy_concurrency: 72
file_io_concurrency: 72
pin_memory: false
include_index_in_metadata: true

# Coordinated subsampling configuration (optional)
# Reads contiguous chunks from large tensors to reduce IO bandwidth.
# Set to null to disable, or configure n_points and target_keys.
coordinated_subsampling: null
# Example:
# coordinated_subsampling:
# n_points: 10000
# target_keys:
# - volume_coords
# - volume_fields
25 changes: 25 additions & 0 deletions examples/minimal/datapipes/conf/reader/zarr.yaml
@@ -0,0 +1,25 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# _target_: physicsnemo.datapipes.core.readers.TensorStoreZarrReader
_target_: physicsnemo.datapipes.core.readers.ZarrReader
path: ???
fields: null
default_values: null
group_pattern: "*.zarr"
pin_memory: false
include_index_in_metadata: true

38 changes: 38 additions & 0 deletions examples/minimal/datapipes/conf/transforms/normalize.yaml
@@ -0,0 +1,38 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Normalize Transform Configuration
# Normalizes specified fields using mean-std or min-max scaling
#
# Method options:
# - mean_std: Applies (x - mean) / std
# - min_max: Applies (x - center) / half_range, normalizing to [-1, 1]
_target_: physicsnemo.datapipes.core.transforms.Normalize
_convert_: all
input_keys:
- features
method: mean_std # Required: "mean_std" or "min_max"
means:
features: 0.0
stds:
features: 0.6
eps: 1.0e-8

# For min_max method, use these instead of means/stds:
# mins:
# features: -1.0
# maxs:
# features: 1.0
26 changes: 26 additions & 0 deletions examples/minimal/datapipes/conf/transforms/subsample.yaml
@@ -0,0 +1,26 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# SubsamplePoints Transform Configuration
# Subsamples point clouds to a fixed number of points
# Useful for handling variable-size point cloud data in batched training
_target_: physicsnemo.datapipes.core.transforms.SubsamplePoints
input_keys:
- coords
- features
n_points: 10000
algorithm: uniform # Options: "uniform" or "poisson_fixed"
weights_key: null # Optional: key for weighted sampling (e.g., "surface_areas")