Beautiful (ML) Data: Patterns & Best Practices for Effective Data Solutions with PyTorch

Abstract

Data is essential in Machine Learning, and PyTorch offers a very Pythonic solution for loading complex and heterogeneous datasets. However, data loading is merely the first step; preprocessing, batching, sampling, partitioning, and augmenting follow. This tutorial explores the internals of torch.utils.data and describes patterns and best practices for elegant data solutions in Machine Learning with PyTorch.

Description

Data processing is at the heart of every Machine Learning (ML) model training and evaluation loop, and PyTorch has revolutionised the way in which data is managed. The very Pythonic Dataset and DataLoader classes replace (nested) lists of NumPy ndarrays. However, data loading is merely the first step: data preprocessing, sampling, batching, and partitioning are fundamental operations that are usually required in a complete ML pipeline.
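As a minimal sketch of the idea above (not material taken from the tutorial itself), the following wraps two in-memory NumPy arrays in a map-style Dataset and lets a DataLoader handle batching. The class name `ArrayDataset`, the array shapes, and the batch size are illustrative assumptions.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ArrayDataset(Dataset):
    """Hypothetical map-style dataset wrapping two in-memory NumPy arrays."""

    def __init__(self, features: np.ndarray, labels: np.ndarray):
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        # Convert once, so that __getitem__ only has to index tensors.
        self.features = torch.from_numpy(features).float()
        self.labels = torch.from_numpy(labels).long()

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, index: int):
        return self.features[index], self.labels[index]


# Illustrative random data: 10 samples with 4 features each.
rng = np.random.default_rng(seed=0)
features = rng.normal(size=(10, 4))
labels = rng.integers(0, 2, size=10)

dataset = ArrayDataset(features, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# The DataLoader stacks individual samples into batches; the last batch
# is smaller because 10 is not a multiple of 4.
batch_shapes = [tuple(x.shape) for x, _ in loader]
```

Implementing only `__len__` and `__getitem__` is enough for the DataLoader to take over batching, which is what makes the hand-rolled nested lists unnecessary.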

If not properly managed, these operations can lead to lots of boilerplate code and to re-inventing the wheel ™. This tutorial will dig into the internals of torch.utils.data to present patterns and best practices for loading heterogeneous and custom datasets in the most elegant and Pythonic way.

The tutorial is organised in four parts, each focusing on specific patterns and scenarios for ML data. All parts share the same internal structure: (I) general introduction; (II) case study. The introduction provides a technical overview of the problem and a description of the torch internals. The case studies then deliver concrete examples and applications, engage the audience, and foster discussion. Off-the-shelf and/or custom heterogeneous datasets will be used to cover the broadest possible range of audience interests (e.g. Images, Text, Mixed-Type Datasets).

Outline

Intro to Dataset and DataLoader
- torch.utils.data.Dataset at a glance
- Types of Dataset: IterableDataset and Map-Style Dataset
- Case Study: File-based vs Database Dataset
  - Streaming data from MongoDB
  - Dataset Composition: ConcatDataset, ChainDataset, __add__
Data PreProcessing and Transformation
- torchvision transforms
- Case Study: Custom transforms
  - Transform pipelines with torchvision.transforms.Compose
Data Partitioning (training / validation / test): the PyTorch way
- One Dataset is One Dataset
- Subset and random_split
- Case Study: Dataset and Cross-Validation
  - How to combine torch.utils.data.Dataset and sklearn.model_selection.KFold (without using skorch)
  - Combining Data Partitioning and Transforms
Data Loading and Sampling
- torch.utils.data.DataLoader and data batching
- Single- and Multi-processing Data Loading
- Data sampling: SequentialSampler, RandomSampler
- Case Study: Cross-Validation Partitioning Revisited
  - Subset & Sampling with a sequential subset sampler and SubsetRandomSampler
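To give a flavour of the cross-validation case studies listed in the outline, here is a hedged sketch, not the tutorial's actual material, that combines sklearn.model_selection.KFold with torch.utils.data.Subset and SubsetRandomSampler. The toy TensorDataset, the number of folds, and the batch size are assumptions made purely for illustration.

```python
import numpy as np
import torch
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, Subset, SubsetRandomSampler, TensorDataset

# Toy dataset: 12 samples with 3 features each (illustrative values only).
features = torch.arange(36, dtype=torch.float32).reshape(12, 3)
labels = torch.arange(12) % 2
dataset = TensorDataset(features, labels)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
fold_sizes = []

# KFold only needs to know how many samples exist; it yields index arrays.
for train_idx, valid_idx in kfold.split(np.arange(len(dataset))):
    # Option 1: a Subset is a lightweight view over the chosen indices.
    valid_set = Subset(dataset, valid_idx.tolist())
    valid_loader = DataLoader(valid_set, batch_size=4)

    # Option 2: keep the full dataset and restrict the indices drawn by
    # the DataLoader with a sampler (shuffled within the training fold).
    train_sampler = SubsetRandomSampler(train_idx.tolist())
    train_loader = DataLoader(dataset, batch_size=4, sampler=train_sampler)

    n_train = sum(len(x) for x, _ in train_loader)
    n_valid = sum(len(x) for x, _ in valid_loader)
    fold_sizes.append((n_train, n_valid))
```

Both options leave the underlying dataset untouched: the partition lives in the index lists, so the same Dataset object can be reused across folds.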

Pre-requisites

Basic concepts of Machine/Deep Learning data processing are required to attend this tutorial. Proficiency with the Python language and the Python object model is also required. Basic knowledge of the main PyTorch features is preferable.
