Data is essential in Machine Learning, and PyTorch offers a very Pythonic solution to load complex and heterogeneous datasets. However, data loading is merely the first step: data must also be preprocessed, batched, sampled, partitioned, and augmented. This tutorial explores the internals of `torch.utils.data` and describes patterns and best practices for elegant data solutions in Machine Learning with PyTorch.
Data processing is at the heart of every Machine Learning (ML) model's training and evaluation loop, and PyTorch has revolutionised the way in which data is managed. The very Pythonic `Dataset` and `DataLoader` classes replace (nested) lists of NumPy `ndarray`s. However, data loading is merely the first step: data preprocessing, sampling, batching, and partitioning are fundamental operations that are usually required in a complete ML pipeline. If not properly managed, this can lead to lots of boilerplate code and to re-inventing the wheel™.
This tutorial will dig into the internals of `torch.utils.data` to present patterns and best practices for loading heterogeneous and custom datasets in the most elegant and Pythonic way.
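As a minimal sketch of this idea, a map-style `Dataset` can wrap NumPy arrays directly and feed a `DataLoader` (the class name `ArrayDataset` and the toy data below are made up for illustration):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    """A hypothetical map-style Dataset wrapping a pair of NumPy arrays."""
    def __init__(self, X, y):
        self.X = torch.from_numpy(X).float()
        self.y = torch.from_numpy(y).long()

    def __len__(self):
        # number of samples in the dataset
        return len(self.X)

    def __getitem__(self, idx):
        # return one (features, label) pair; DataLoader handles batching
        return self.X[idx], self.y[idx]

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
loader = DataLoader(ArrayDataset(X, y), batch_size=16, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape)  # torch.Size([16, 4])
```

The two dunder methods `__len__` and `__getitem__` are all that is needed: batching, shuffling, and iteration come for free from `DataLoader`.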
The tutorial is organised in four parts, each focusing on specific patterns and scenarios for ML data. Each part shares the same internal structure: (I) a general introduction and (II) a case study. The introduction provides a technical overview of the problem and a description of the relevant `torch` internals. The case studies then deliver concrete examples and applications, while engaging the audience and fostering discussion. Off-the-shelf and/or custom heterogeneous datasets will be used to match the broadest possible range of interests from the audience (e.g. images, text, mixed-type datasets).
- Intro to `Dataset` and `DataLoader`
  - `torch.utils.data.Dataset` at a glance
  - Types of Dataset: `IterableDataset` and Map-Style `Dataset`
  - Case Study: File-based vs Database Dataset
    - Streaming data from MongoDB
  - Dataset Composition: `ConcatDataset`, `ChainDataset`, `__add__`
- Data Preprocessing and Transformation
  - `torchvision` transformers
  - Case Study: Custom transformers
    - Transformer pipelines with `torchvision.transforms.Compose`
- Data Partitioning (training / validation / test): the PyTorch way
  - One Dataset is One `Dataset`
  - `Subset` and `random_split`
  - Case Study: `Dataset` and Cross-Validation
    - How to combine `torch.utils.data.Dataset` and `sklearn.model_selection.KFold` (without using `skorch`)
    - Combining Data Partitioning and Transformers
- Data Loading and Sampling
  - `torch.utils.data.DataLoader` and data batching
  - Single- and Multi-process Data Loading
  - Data sampling: `SequentialSampler`, `RandomSampler`
  - Case Study: Cross-Validation Partitioning Revisited
    - Subset & Sampling with `SubsetRandomSampler`
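To give a flavour of the partitioning part above, here is a brief sketch of "the PyTorch way" using `random_split` (the toy dataset and split sizes are made up for the example):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# A toy dataset of 100 samples (features + binary labels)
full = TensorDataset(torch.randn(100, 4), torch.randint(0, 2, (100,)))

# Train/validation/test split, the PyTorch way:
# three Subset views over one Dataset, with no data copies
train_set, val_set, test_set = random_split(
    full, [70, 15, 15], generator=torch.Generator().manual_seed(0))
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

Passing an explicit `Generator` makes the split reproducible across runs.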
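The cross-validation case study can likewise be sketched by combining `sklearn.model_selection.KFold` with `SubsetRandomSampler`, so that a single `Dataset` is shared across folds (the shapes below are arbitrary):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, SubsetRandomSampler
from sklearn.model_selection import KFold

# One Dataset object; each fold merely re-samples indices (no data copies)
dataset = TensorDataset(torch.randn(20, 3), torch.randint(0, 2, (20,)))

fold_sizes = []
for train_idx, val_idx in KFold(n_splits=4).split(range(len(dataset))):
    train_loader = DataLoader(dataset, batch_size=5,
                              sampler=SubsetRandomSampler(train_idx.tolist()))
    val_loader = DataLoader(dataset, batch_size=5,
                            sampler=SubsetRandomSampler(val_idx.tolist()))
    fold_sizes.append((len(train_idx), len(val_idx)))

xb, yb = next(iter(train_loader))  # a batch from the last fold
print(fold_sizes)  # [(15, 5), (15, 5), (15, 5), (15, 5)]
```

Note that `shuffle=True` and `sampler=` are mutually exclusive in `DataLoader`: the sampler already randomises the order within the fold.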
Basic concepts of Machine/Deep Learning data processing are required to attend this tutorial. Similarly, proficiency with the Python language and the Python object model is also required. Basic knowledge of PyTorch's main features is preferable.