4 changes: 4 additions & 0 deletions examples/minimal/datapipes/.gitignore
@@ -0,0 +1,4 @@
*.json
*.npz
*.npy
*.zarr
209 changes: 209 additions & 0 deletions examples/minimal/datapipes/README.md
@@ -0,0 +1,209 @@
# PhysicsNeMo DataPipes

Data loading is critical to SciML applications, both for training and inference,
and the PhysicsNeMo datapipe infrastructure aims to deliver a flexible and configurable
set of tools to enable your application. Here, a *datapipe* is the full chain of
reader, dataset, transforms, and dataloader that moves examples from disk, through
GPU preprocessing, and into your training or inference loop.

> **Review comment (Collaborator):** Somewhat minor, but I notice that python / physicsnemo / pytorch are consistently not capitalized throughout this *.md and others. My preference would be to capitalize them in any plaintext docs ("Python", "PhysicsNeMo", "PyTorch").
>
> **Reply (Author):** Good point, I'll fix it. At least you know that if an LLM had written this, it would have capitalized it :)

There are plenty of tools in the Python ecosystem for loading, preprocessing, and
preparing your data for training or inference. To see how the PhysicsNeMo datapipe
interface compares with those tools, and whether it might be valuable for your
workload, consider the following design principles we followed when building the
PhysicsNeMo datapipes:

1. **GPU-first** - Many scientific datasets are *large* for even a single example:
the data is high resolution, and its preprocessing benefits from GPU acceleration.
Compare this to tools where data preprocessing is predominantly CPU-based,
such as the PyTorch DataLoader: whereas CPU-based preprocessing may introduce GPU
pipeline stalls on high-resolution data, GPU-based preprocessing maximizes
throughput.

2. **Threading over Multiprocessing** - In Python, true concurrency is typically only
available via multiprocessing or when offloading to compiled libraries or GPU kernels.
For this reason, many data loaders leverage multiprocessing for data concurrency:
load images in separate processes, and collate a batch on the main thread.
For simplicity, with a GPU-first paradigm, the PhysicsNeMo datapipe focuses on GPU
concurrency via asynchronous execution and stream-based parallelism. IO is coordinated
in multiple threads, instead of multiple processes, and streams enable multiple
preprocessing pipelines to execute concurrently on the GPU.

3. **Unambiguous Configuration and Serialization** - Datapipes can be a particularly
frustrating component in the reproducibility of AI results - the preprocessing, sampling,
batching, and other parameters can be hard to infer from training scripts. Here,
we make a deliberate design choice to treat datapipe configuration serialization
as a first-class citizen. PhysicsNeMo datapipes can be built directly in Python,
but also instantiated from Hydra YAML files for version control and distribution.

> **Review comment (Collaborator):** Would be worth defining "datapipes" here for new readers.
>
> **Reply (Author):** Good idea, on it.

4. **Familiar Interfaces** - We built our tools from scratch, but they are meant
to look familiar and interoperate with the tools you already know. Use
PhysicsNeMo DataLoaders as a replacement for PyTorch's DataLoader; tools like
DistributedSampler will still work. Users of `torchvision` will be familiar
with the concept of chaining transformations together.

5. **Extensibility out of the box** - We want to provide a data pipeline that gives
great performance and usability immediately - but it will never be the case that
one codebase covers all possible data needs out of the box. Therefore, the
PhysicsNeMo datapipe is extensible: you can build custom data readers for
new data formats and plug them into datasets; you can build new transforms
for your data that we might not have, and simply plug them into a transformation
pipeline. You can even package all of this up as a pip-installable extension: using
the built-in registry, you can still instantiate and version-control datapipes
even when their components are not part of PhysicsNeMo.

## When should I use PhysicsNeMo datapipes over X/Y/Z data utility?

In general, the PhysicsNeMo datapipe utility is built to deliver good performance
on data that is large per example, as most scientific data is. If you want a
batch size of 512 small images, it may be more performant to use a CPU-centric
tool.

Another advantage of the PhysicsNeMo datapipe is the ability to build datapipes
directly from configuration files, allowing serializable and version-controlled
data configuration. This isn't the only tool that can do this, of course.

## Core Datapipe Design

Think of datasets as a hierarchy of data: at the highest level, an entire **dataset**
consists of independent **examples**. Each example has one or more **tensor components**:
image data may have input images and target labels; CFD data may have positions,
target pressures, a mesh object, boundary conditions, etc.; weather data may contain
sensor readings as a function of time. Each example may be the same size as the others,
or each example may be a unique size. Even the components of an example can be variable,
though this can require extra care in reading and using the dataset.
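
To make the hierarchy concrete, here is a sketch of one **example** represented as a
`tensordict` of named tensor components; the component names and shapes are simply
those produced by the synthetic data generator used in the tutorials below.

```python
import torch
from tensordict import TensorDict

# One example: three named tensor components on a regular 128^3 grid.
example = TensorDict(
    {
        "velocity": torch.zeros(128, 128, 128, 3),
        "pressure": torch.zeros(128, 128, 128, 1),
        "position": torch.zeros(128, 128, 128, 3),
    },
    batch_size=[],
)
print(list(example.keys()))  # ['velocity', 'pressure', 'position']
```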

The PhysicsNeMo datapipe consists of the following components:

- `reader` objects contain the logic to understand a **dataset** on disk, and
load examples into CPU memory.

- The `dataset` object, which contains a `reader`, orchestrates threads that preload
data **examples** from disk and move them to the GPU. On the GPU, a `dataset` can apply a
series of transformations to each **example**. Each example is stored in `tensordict`
format. The dataset also tracks metadata for understanding where each **example**
came from (index, filepath, etc.).

- A `transform` is a callable class that accepts a `tensordict` as input and returns
a `tensordict` as output. Chaining transformations together is the core way to
manipulate data examples on the fly in a datapipe (a minimal sketch follows this list).

- The `dataloader` is a drop-in replacement for the PyTorch DataLoader, with additional
optimizations for the GPU-centric processing here. The `dataloader` handles
stream concurrency, batch collation, and triggering preloading of datasets.
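
As an illustration of the `transform` contract above, here is a minimal sketch of a
custom transform: a plain callable class that takes a `tensordict` and returns one.
The class name and the scaling it applies are invented for illustration; only the
call signature reflects the contract described in this list.

```python
import torch
from tensordict import TensorDict


class ScaleVelocity:
    """Hypothetical transform: rescale the "velocity" component by a constant factor."""

    def __init__(self, factor: float):
        self.factor = factor

    def __call__(self, sample: TensorDict) -> TensorDict:
        # Touch only the keys this transform cares about; everything else passes through.
        sample["velocity"] = sample["velocity"] * self.factor
        return sample


# Transforms chain naturally: the output of one is the input of the next.
pipeline = [ScaleVelocity(0.5), ScaleVelocity(2.0)]
sample = TensorDict({"velocity": torch.randn(1000, 3)}, batch_size=[])
for transform in pipeline:
    sample = transform(sample)
```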

---

## Tutorials

This directory contains progressive tutorials that teach you how to use the
PhysicsNeMo datapipe infrastructure effectively. Note that some of the tutorials
are repetitive and verbose, to highlight different features of the datapipe
ecosystem. We'll give some overview of what you can learn in each tutorial,
but they are meant to be run interactively and explored.

### Data Prerequisites

You do not need to have any specific data in hand for the tutorials. You can
generate synthetic data with the scripts `generate_regular_data.py` and
`generate_variable_points_data.py`.

### Tutorial 1: Getting Started with DataPipes

**File:** `tutorial_01_getting_started.py`

Learn the core concepts of data loading from disk:

- Creating a Reader to load data from files
- Understanding the `(TensorDict, metadata)` return format
- Wrapping a reader in a Dataset
- Iterating with a DataLoader
- Accessing batch data via TensorDict keys

```bash
# Generate tutorial data first
python generate_regular_data.py -n 100 \
-s "velocity:128,128,128,3 pressure:128,128,128,1 position:128,128,128,3" \
-b zarr -o output/tutorial_data/

# Run the tutorial
python tutorial_01_getting_started.py
```
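
For orientation, a heavily abbreviated sketch of the flow this tutorial walks through.
The `ZarrReader` target and its `path`/`group_pattern` arguments come from
`conf/reader/zarr.yaml` below, but the indexing shown here and the way the reader is
wrapped in a Dataset and DataLoader are assumptions - see the tutorial script for the
real interfaces.

```python
# Abbreviated sketch only; indexing and wrapping details are assumptions,
# not the tutorial's exact code.
from physicsnemo.datapipes.core.readers import ZarrReader

reader = ZarrReader(path="output/tutorial_data/", group_pattern="*.zarr")
sample, metadata = reader[0]        # assumed indexing; (TensorDict, metadata) format
print(sample["velocity"].shape)     # access components by TensorDict key
```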

### Tutorial 2: Transforms and Data Preprocessing

**File:** `tutorial_02_transforms.py`

Build preprocessing pipelines with transforms:

- Apply a single transform (Normalize)
- Compose multiple transforms together
- Subsample point clouds with SubsamplePoints
- Use geometric transforms (Translate, ReScale)
- Save/load normalization statistics from files
- Denormalize data with the `inverse()` method

```bash
# Generate regular grid data (for most sections)
# Note: Tutorial 2 can reuse the data from Tutorial 1
python generate_regular_data.py -n 100 \
-s "velocity:128,128,128,3 pressure:128,128,128,1 position:128,128,128,3" \
-b zarr -o output/tutorial_data/

# Generate point cloud data (for subsampling sections)
python generate_variable_points_data.py -n 100 \
-s "coords:3 features:8" --min-points 50000 \
--max-points 100000 -b zarr -o output/pointcloud_data/

# Run the tutorial
python tutorial_02_transforms.py
```
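
For reference, a sketch of the `Normalize` transform applied and inverted by hand.
The constructor arguments mirror `conf/transforms/normalize.yaml` below, and
`inverse()` comes from the bullet list above; the tutorial itself may wire these
pieces together differently.

```python
import torch
from tensordict import TensorDict
from physicsnemo.datapipes.core.transforms import Normalize

# Arguments mirror conf/transforms/normalize.yaml (mean_std method).
normalize = Normalize(
    input_keys=["features"],
    method="mean_std",
    means={"features": 0.0},
    stds={"features": 0.6},
)

sample = TensorDict({"features": torch.randn(10000, 8)}, batch_size=[])
normalized = normalize(sample)            # (x - mean) / std
restored = normalize.inverse(normalized)  # undo the normalization
```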

### Tutorial 3: Custom Collation for GNNs

**File:** `tutorial_03_custom_gnn_datapipe.py`

Build a GNN-ready data pipeline with custom collation:

- Build a custom Transform for computing KNN graph edges
- Implement a custom Collator for PyG-style graph batching
- Understand how PyG batches graphs (offset edges, concatenate features, batch tensor)
- Put it all together in a complete GNN training pipeline

```bash
# Generate point cloud data with coordinates and features (can be reused from tutorial 2)
python generate_variable_points_data.py -n 100 \
-s "coords:3 features:8" --min-points 50000 \
--max-points 100000 -b zarr -o output/pointcloud_data/

# Run the tutorial
python tutorial_03_custom_gnn_datapipe.py
```
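
The core batching idea the tutorial builds on, sketched as a standalone helper
(the `collate_graphs` name and signature are illustrative, not the tutorial's
`Collator` interface): node features are concatenated, edge indices are offset by
the running node count, and a `batch` tensor records which graph each node belongs to.

```python
import torch


def collate_graphs(node_feats: list[torch.Tensor], edge_indices: list[torch.Tensor]):
    """Illustrative PyG-style batching of several graphs into one disjoint graph."""
    sizes = [x.shape[0] for x in node_feats]
    offsets = torch.cumsum(torch.tensor([0] + sizes[:-1]), dim=0)

    x = torch.cat(node_feats, dim=0)  # concatenate node features
    edge_index = torch.cat(
        [e + off for e, off in zip(edge_indices, offsets)], dim=1
    )  # offset edge indices into the concatenated node numbering
    batch = torch.cat(
        [torch.full((n,), i, dtype=torch.long) for i, n in enumerate(sizes)]
    )  # graph id for every node
    return x, edge_index, batch
```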

### Tutorial 4: Hydra Configuration for DataPipes

**File:** `tutorial_04_hydra_config.py`

Build entire datapipes from YAML configuration with minimal Python code:

- Define reader, transforms, dataset, and dataloader in YAML
- Use `hydra.utils.instantiate()` to build components
- Override any parameter from the command line
- Switch between configurations easily

```bash
# Generate tutorial data (from tutorials 2 and 3)
python generate_variable_points_data.py -n 100 -s \
"coords:3 features:8" --min-points 50000 \
--max-points 100000 -b zarr -o output/pointcloud_data/

# Run with default config
python tutorial_04_hydra_config.py

# Override from command line
python tutorial_04_hydra_config.py dataloader.batch_size=8 dataloader.dataset.device=cuda

# Use point cloud configuration (this is the default)
python tutorial_04_hydra_config.py --config-name tutorial_04_pointcloud

# Override transform parameters
python tutorial_04_hydra_config.py --config-name tutorial_04_pointcloud \
subsample.n_points=5000
```
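
For reference, a minimal sketch of the config-driven pattern, assuming a Hydra app
whose config composes the YAML files under `conf/`; the `cfg.dataloader` node name
is illustrative - check the tutorial script and configs for the actual structure.

```python
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Instantiate the configured object graph (reader, dataset, dataloader, ...).
    dataloader = hydra.utils.instantiate(cfg.dataloader)
    for batch in dataloader:
        ...


if __name__ == "__main__":
    main()
```
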
18 changes: 18 additions & 0 deletions examples/minimal/datapipes/conf/config.yaml
@@ -0,0 +1,18 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

defaults:
- reader/zarr
15 changes: 15 additions & 0 deletions examples/minimal/datapipes/conf/reader/npz.yaml
@@ -0,0 +1,15 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
39 changes: 39 additions & 0 deletions examples/minimal/datapipes/conf/reader/tensorstore_zarr.yaml
@@ -0,0 +1,39 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# TensorStore Zarr Reader Configuration
# High-performance async reader for Zarr files using TensorStore
_target_: physicsnemo.datapipes.core.readers.TensorStoreZarrReader
path: ???
group_pattern: "*.zarr"
fields: null
default_values: null
cache_bytes_limit: 10000000 # 10 MB cache
data_copy_concurrency: 72
file_io_concurrency: 72
pin_memory: false
include_index_in_metadata: true

# Coordinated subsampling configuration (optional)
# Reads contiguous chunks from large tensors to reduce IO bandwidth.
# Set to null to disable, or configure n_points and target_keys.
coordinated_subsampling: null
# Example:
# coordinated_subsampling:
# n_points: 10000
# target_keys:
# - volume_coords
# - volume_fields
25 changes: 25 additions & 0 deletions examples/minimal/datapipes/conf/reader/zarr.yaml
@@ -0,0 +1,25 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# _target_: physicsnemo.datapipes.core.readers.TensorStoreZarrReader
_target_: physicsnemo.datapipes.core.readers.ZarrReader
path: ???
fields: null
default_values: null
group_pattern: "*.zarr"
pin_memory: false
include_index_in_metadata: true

38 changes: 38 additions & 0 deletions examples/minimal/datapipes/conf/transforms/normalize.yaml
@@ -0,0 +1,38 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Normalize Transform Configuration
# Normalizes specified fields using mean-std or min-max scaling
#
# Method options:
# - mean_std: Applies (x - mean) / std
# - min_max: Applies (x - center) / half_range, normalizing to [-1, 1]
_target_: physicsnemo.datapipes.core.transforms.Normalize
_convert_: all
input_keys:
- features
method: mean_std # Required: "mean_std" or "min_max"
means:
features: 0.0
stds:
features: 0.6
eps: 1.0e-8

# For min_max method, use these instead of means/stds:
# mins:
# features: -1.0
# maxs:
# features: 1.0
26 changes: 26 additions & 0 deletions examples/minimal/datapipes/conf/transforms/subsample.yaml
@@ -0,0 +1,26 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2025 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# SubsamplePoints Transform Configuration
# Subsamples point clouds to a fixed number of points
# Useful for handling variable-size point cloud data in batched training
_target_: physicsnemo.datapipes.core.transforms.SubsamplePoints
input_keys:
- coords
- features
n_points: 10000
algorithm: uniform # Options: "uniform" or "poisson_fixed"
weights_key: null # Optional: key for weighted sampling (e.g., "surface_areas")