local dl = require 'dataload'
A collection of Torch dataset loaders. The library provides the following generic data loader classes :
- DataLoader : an abstract class inherited by the following classes;
- TensorLoader : for tensor or nested (i.e. tables of) tensor datasets;
- ImageClass : for image classification datasets stored in a flat folder structure;
- AsyncIterator : decorates a DataLoader for asynchronous multi-threaded iteration;
- SequenceLoader : for sequence datasets like language or time-series;
- MultiSequence : for shuffled sets of sequence datasets like shuffled sentences.
The library also provides functions for downloading specific datasets and preparing them using the above loaders :
- loadMNIST : load the MNIST handwritten digit dataset for image classification;
- loadImageNet : load the ILSVRC2014 dataset for image classification;
- loadPTB : load the Penn Tree Bank corpus for language modeling;
- loadGBW : load the Google Billion Words corpus for language modeling.
Also, we try to provide some useful preprocessing functions :
- fitImageNormalize : normalize images by channel.
dataloader = dl.DataLoader()
An abstract class inherited by all DataLoader instances. It wraps a data set to provide methods for accessing inputs and targets. The data itself may be loaded from disk or memory.
Returns the number of samples in the dataloader.
Returns the size of inputs. When excludedim is 1 (the default), the batch dimension is excluded from the returned size. When inputs is a tensor, the returned size is a table of numbers. When it is a table of tensors, the returned size is a table of tables of numbers.
Returns the size of targets. When excludedim is 1 (the default), the batch dimension is excluded from the returned size. When targets is a tensor, the returned size is a table of numbers. When it is a table of tensors, the returned size is a table of tables of numbers.
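For illustration, a minimal sketch of querying these sizes (the method names isize and tsize are assumed here; size() is the sample count, and the TensorLoader used below is described further down) :
local dl = require 'dataload'
-- dummy dataset : 100 samples of 3x8x8 images with classes 1 to 10
inputs = torch.randn(100, 3, 8, 8)
targets = torch.LongTensor(100):random(1,10)
dataloader = dl.TensorLoader(inputs, targets)
dataloader:size()  -- 100 : the number of samples
dataloader:isize() -- {3, 8, 8} : input sizes, batch dimension excluded
dataloader:tsize() -- {} : targets have no dimensions besides the batch dimension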
Returns inputs and targets containing samples indexed by indices. So for example :
indices = torch.LongTensor{1,2,3,4,5}
inputs, targets = dataloader:index(indices)
would return a batch of inputs and targets containing samples 1 through 5. When inputs and targets are provided as arguments, they are used as memory buffers for the returned inputs and targets, i.e. their allocated memory is reused.
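A sketch of this buffer-reuse pattern (the dataset here is a dummy one for illustration) :
local dl = require 'dataload'
dataloader = dl.TensorLoader(torch.randn(100, 4), torch.LongTensor(100):random(1,10))
local inputs, targets -- buffers : allocated on the first call, reused afterwards
for i = 1, 3 do
   local indices = torch.LongTensor(32):random(1, dataloader:size())
   inputs, targets = dataloader:index(indices, inputs, targets)
end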
Returns inputs and targets containing batchsize random samples. This method is equivalent to :
indices = torch.LongTensor(batchsize):random(1,dataloader:size())
inputs, targets = dataloader:index(indices)
Returns inputs and targets containing stop-start+1 samples between start and stop. This method is equivalent to :
indices = torch.LongTensor():range(start, stop)
inputs, targets = dataloader:index(indices)
Internally shuffles the inputs and targets. Note that not all subclasses support this method.
Splits the dataloader into two new DataLoader instances, where ds1 contains the first math.floor(ratio x dataloader:size()) samples and ds2 contains the remainder. Useful for splitting a training set into a new training set and validation set.
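For example, a sketch that holds out 10% of a dataset for validation (assuming the method is exposed as split(ratio)) :
local dl = require 'dataload'
dataloader = dl.TensorLoader(torch.randn(1000, 10), torch.LongTensor(1000):random(1,2))
train, valid = dataloader:split(0.9)
print(train:size(), valid:size()) -- 900  100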
Returns an iterator over validation and test sets. Each iteration returns 3 values :
- k : the number of samples processed so far. Each iteration returns a maximum of batchsize samples.
- inputs : a tensor (or nested table thereof) containing a maximum of batchsize inputs.
- targets : a tensor (or nested table thereof) containing targets for the commensurate inputs.
The iterator will return batches of inputs and targets of size at most batchsize until epochsize samples have been returned.
Note that the default implementation of this iterator is to call sub for each batch. Subclasses may override this behavior.
Example :
local dl = require 'dataload'
inputs, targets = torch.range(1,5), torch.range(1,5)
dataloader = dl.TensorLoader(inputs, targets)
local i = 0
for k, inputs, targets in dataloader:subiter(2,6) do
   i = i + 1
   print(string.format("batch %d, nsampled = %d", i, k))
   print(string.format("inputs:\n%stargets:\n%s", inputs, targets))
end
Output :
batch 1, nsampled = 2
inputs:
1
2
[torch.DoubleTensor of size 2]
targets:
1
2
[torch.DoubleTensor of size 2]
batch 2, nsampled = 4
inputs:
3
4
[torch.DoubleTensor of size 2]
targets:
3
4
[torch.DoubleTensor of size 2]
batch 3, nsampled = 5
inputs:
5
[torch.DoubleTensor of size 1]
targets:
5
[torch.DoubleTensor of size 1]
batch 4, nsampled = 6
inputs:
1
[torch.DoubleTensor of size 1]
targets:
1
[torch.DoubleTensor of size 1]
Note how the last two batches are of size 1 while those before are of size batchsize = 2. The reason is that the dataloader only has 5 samples, so the third batch contains only the last sample and the fourth wraps around to the first sample.
Returns an iterator over a training set. Each iteration returns 3 values :
- k : the number of samples processed so far. Each iteration returns a maximum of batchsize samples.
- inputs : a tensor (or nested table thereof) containing a maximum of batchsize inputs.
- targets : a tensor (or nested table thereof) containing targets for the commensurate inputs.
The iterator will return batches of inputs and targets of size at most batchsize until epochsize samples have been returned.
Note that the default implementation of this iterator is to call sample for each batch. Subclasses may override this behavior.
Example :
local dl = require 'dataload'
inputs, targets = torch.range(1,5), torch.range(1,5)
dataloader = dl.TensorLoader(inputs, targets)
local i = 0
for k, inputs, targets in dataloader:sampleiter(2,6) do
   i = i + 1
   print(string.format("batch %d, nsampled = %d", i, k))
   print(string.format("inputs:\n%stargets:\n%s", inputs, targets))
end
Output :
batch 1, nsampled = 2
inputs:
1
2
[torch.DoubleTensor of size 2]
targets:
1
2
[torch.DoubleTensor of size 2]
batch 2, nsampled = 4
inputs:
4
2
[torch.DoubleTensor of size 2]
targets:
4
2
[torch.DoubleTensor of size 2]
batch 3, nsampled = 6
inputs:
4
1
[torch.DoubleTensor of size 2]
targets:
4
1
[torch.DoubleTensor of size 2]
Resets all internal counters such as those used for iterators. Called by AsyncIterator before serializing the DataLoader to threads.
Collects garbage every self.gccdelay times this method is called.
Returns a deep copy clone of self.
dataloader = dl.TensorLoader(inputs, targets)
The TensorLoader can be used to encapsulate tensors of inputs and targets. As an example, consider a dummy 3 x 8 x 8 image classification dataset consisting of 1000 samples and 10 classes:
inputs = torch.randn(1000, 3, 8, 8)
targets = torch.LongTensor(1000):random(1,10)
dataloader = dl.TensorLoader(inputs, targets)
The TensorLoader can also be used to encapsulate nested tensors of inputs and targets. It uses recursive functions to handle nestings of arbitrary depth. As an example, let us modify the above example to include x,y GPS coordinates in the inputs and a parallel set of classification targets (7 classes):
inputs = {torch.randn(1000, 3, 8, 8), torch.randn(1000, 2)}
targets = {torch.LongTensor(1000):random(1,10), torch.LongTensor(1000):random(1,7)}
dataloader = dl.TensorLoader(inputs, targets)
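When iterating such a nested loader, each returned batch mirrors the nesting; a sketch :
for k, inputs, targets in dataloader:subiter(50, 1000) do
   -- inputs[1] : 50 x 3 x 8 x 8 images ; inputs[2] : 50 x 2 GPS coordinates
   -- targets[1] : 50 class indices in 1..10 ; targets[2] : 50 class indices in 1..7
end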
dataloader = dl.ImageClass(datapath, loadsize, [samplesize, samplefunc, sortfunc, verbose])
For loading an image classification data set stored in a flat folder structure :
(datapath)/(classdir)/(imagefile).(jpg|png|etc)
So directory classdir is expected to contain all the images belonging to that class. All image files are indexed into an efficient CharTensor during initialization. Images are only loaded into inputs and targets tensors upon calling batch sampling methods like index, sample and sub.
Note that for asynchronous loading of images (i.e. loading batches of images in different threads), the ImageClass loader can be decorated with an AsyncIterator.
Images on disk can have different height, width and number of channels.
Constructor arguments are as follows :
- datapath : one or many paths to directories of images;
- loadsize : initial size to load the images to. Example : {3, 256, 256};
- samplesize : consistent sample size to resize the images to. Defaults to loadsize;
- samplefunc : function f(self, dst, path) used to create sample(s) from an image path, storing them in CharTensor dst. Strings "sampleDefault" (the default), "sampleTrain" or "sampleTest" can also be provided as they refer to existing functions;
- verbose : display verbose messages (default is true);
- sortfunc : comparison operator used for sorting classdir to get class indices. Defaults to the < operator.
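A hedged example of constructing such a loader (the dataset path is hypothetical) :
local dl = require 'dataload'
-- /data/myimages/(classdir)/(imagefile).jpg is a hypothetical flat-folder dataset
train = dl.ImageClass('/data/myimages', {3, 256, 256}, {3, 224, 224}, 'sampleTrain')
inputs, targets = train:sample(32) -- 32 random samples : 32 x 3 x 224 x 224 inputs, 32 class targets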
dataloader = dl.AsyncIterator(dataloader, [nthread, verbose])
This DataLoader subclass overrides the subiter and sampleiter iterator methods. The implementation uses the threads package to build a pool of nthread worker threads. The main thread delegates the tasks of building inputs and targets tensors to the workers. The workers each have a deep copy of the decorated dataloader.
When a task is received from the main thread through the Queue, they call sample or sub to build the batch and return the inputs and targets to the main thread. The iteration is asynchronous as the first iteration will fill the Queue with nthread tasks.
Note that when nthread > 1, the order of tensors is not deterministic.
This loader is well suited for decorating a dl.ImageClass instance and other such I/O- and CPU-bound loaders.
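For example, a sketch that decorates a (hypothetical) ImageClass loader so that batches are built by 4 background threads :
local dl = require 'dataload'
train = dl.ImageClass('/data/myimages', {3, 256, 256}, {3, 224, 224}, 'sampleTrain')
train = dl.AsyncIterator(train, 4) -- 4 worker threads, each with a deep copy of the loader
for k, inputs, targets in train:sampleiter(32, 3200) do
   -- batches arrive as the workers finish them, so their order is not deterministic
end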
dataloader = dl.SequenceLoader(sequence, batchsize, [bidirectional])
This DataLoader subclass can be used to encapsulate a sequence for training time-series or language models. The sequence is a tensor where the first dimension indexes time. Internally, the loader will split the sequence into batchsize subsequences.
Calling the sub(start, stop, inputs, targets) method will return inputs and targets of size seqlen x batchsize [x inputsize] where stop - start + 1 <= seqlen. See RNNLM training script for an example.
The bidirectional argument should be set to true for bidirectional models like BRNNs/BLSTMs, in which case the returned inputs and targets will be aligned.
For example, using batchsize = 3 and seqlen = 5 :
print(inputs:t(), targets:t())
36 1516 853 94 1376
3193 433 553 805 521
512 434 57 1029 1962
[torch.IntTensor of size 3x5]
36 1516 853 94 1376
3193 433 553 805 521
512 434 57 1029 1962
[torch.IntTensor of size 3x5]
When bidirectional is false (the default), the targets will be one step in the future with respect to the inputs.
For example, using batchsize = 3 and seqlen = 5 :
print(inputs:t(), targets:t())
36 1516 853 94 1376
3193 433 553 805 521
512 434 57 1029 1962
[torch.IntTensor of size 3x5]
1516 853 94 1376 719
433 553 805 521 27
434 57 1029 1962 49
[torch.IntTensor of size 3x5]
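A minimal sketch of wrapping a dummy corpus and iterating over it (it is assumed here that the first argument to subiter plays the role of seqlen for sequence loaders) :
local dl = require 'dataload'
sequence = torch.LongTensor(10000):random(1,100) -- dummy corpus : 10000 time-steps, vocabulary of 100
loader = dl.SequenceLoader(sequence, 32)         -- split into batchsize = 32 subsequences
for k, inputs, targets in loader:subiter(5, 100) do
   -- inputs, targets : up to 5 x 32 tensors of word indices, targets one step ahead of inputs
end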
dataloader = dl.MultiSequence(sequences, batchsize)
This DataLoader subclass is used by the Billion Words dataset to encapsulate unordered sentences.
The sequences argument is a table or tds.Vec of tensors. Each such tensor is a single sequence, independent of the others.
When calling the sub(start, stop) or subiter(seqlen) methods, a column of the returned inputs and targets tensors (of size seqlen x batchsize) could contain multiple sequences. For example, a character-level language model could look like:
target : [ ] E L L O [ ] C R E E N ...
input : [ ] H E L L [ ] S C R E E ...
where HELLO and SCREEN would be two independent sequences. Note that [ ] is a zero mask used to separate independent sequences.
For most cases, the [ ] token is a 0, except for 1D targets, where it is a 1 (so that it works with ClassNLLCriterion).
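A sketch of encapsulating a few independent sequences :
local dl = require 'dataload'
-- three independent sequences, e.g. sentences encoded as word indices
sequences = {
   torch.LongTensor{424, 158, 115, 667, 28},
   torch.LongTensor{1062, 48, 94},
   torch.LongTensor{10, 20, 30, 40},
}
loader = dl.MultiSequence(sequences, 2) -- batchsize = 2
for k, inputs, targets in loader:subiter(3) do
   -- inputs, targets : 3 x 2 tensors ; zero-masked positions separate the sequences
end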
train, valid, test = dl.loadMNIST([datapath, validratio, scale, srcurl])
Returns the training, validation and testing sets as 3 TensorLoader instances.
Each such loader encapsulates a part of the MNIST dataset, which is located in datapath (defaults to dl.DATA_PATH/mnist).
The validratio argument, a number between 0 and 1, specifies the ratio of the 60000 training samples that will be allocated to the validation set.
The scale argument specifies the range within which pixel values will be scaled (defaults to {0,1}).
The srcurl argument specifies the URL from which the raw data can be downloaded if not found on disk.
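Typical usage might look like this (a sketch; the data is downloaded on the first call and the 1 x 28 x 28 input shape is assumed) :
local dl = require 'dataload'
train, valid, test = dl.loadMNIST()
for k, inputs, targets in train:sampleiter(32, 10000) do
   -- inputs : 32 x 1 x 28 x 28 images scaled to [0,1] ; targets : 32 classes in 1..10
end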
train, valid, test = dl.loadPTB(batchsize, [datapath, srcurl])
Returns the training, validation and testing sets as 3 SequenceLoader instances.
Each such loader encapsulates a part of the Penn Tree Bank dataset, which is located in datapath (defaults to dl.DATA_PATH/PennTreeBank).
If the files aren't found in the datapath, they will be automatically downloaded from the srcurl URL.
The batchsize specifies the number of samples that will be returned when iterating through the dataset. If specified as a table, its elements specify the batchsize of the commensurate train, valid and test loaders.
We recommend a batchsize of 1 for evaluation sets (e.g. {50,1,1}).
See RNNLM training script for an example.
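For example, a sketch using a batchsize of 50 for training and 1 for the evaluation sets (it is assumed that the first argument to subiter is the sequence length) :
local dl = require 'dataload'
train, valid, test = dl.loadPTB({50, 1, 1})
for k, inputs, targets in train:subiter(20, 10000) do
   -- inputs, targets : up to 20 x 50 tensors of word indices, targets one step ahead of inputs
end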
Ref.: A. http://image-net.org/challenges/LSVRC/2014/download-images-5jj5.php
train, valid = dl.loadImageNet(datapath, [nthread, loadsize, samplesize, verbose])
Returns the training and validation sets of the Large Scale Visual Recognition Challenge 2014 (ILSVRC2014) image classification dataset (commonly known as ImageNet). The dataset hasn't changed between 2012 and 2014.
The returned train and valid loaders do not read all images into memory when first loaded. Each dataset is implemented using an ImageClass loader decorated by an AsyncIterator.
The datapath should point to a directory containing the outputs of the downloadimagenet.lua and harmonizeimagenet.lua scripts (see below).
Due to its size, the data first needs to be prepared offline. Use downloadimagenet.lua to download and extract the data :
th downloadimagenet.lua --savePath '/path/to/diskspace/ImageNet'
The entire process requires about 360 GB of disk space to complete the download and extraction.
This can be reduced to about 150 GB if the training set is downloaded and extracted first, and all the .tar files are manually deleted. Repeat for the validation set, devkit and metadata.
If you still don't have enough space in one partition, you can divide the data among different partitions.
We recommend a good internet connection (>60 Mbps download) and a solid-state drive (SSD).
Use harmonizeimagenet.lua to harmonize the train and validation sets:
th harmonizeimagenet.lua --dataPath /path/to/diskspace/ImageNet --progress --forReal
Each set will then contain a directory of images for each class, named class[id], where [id] is a class index between 1 and 1000, as used for the ILSVRC2014 competition.
Then we need to install graphicsmagick :
luarocks install graphicsmagick
As in the famous (Krizhevsky et al. 2012) paper, training samples are random 224x224 crops taken from images resized so that their smallest dimension is 256. For the validation set, ten 224x224 patches are cropped per image (the center, the four corners and their horizontal flips) and their predictions are averaged.
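Once the data is prepared, usage might look roughly like this (a sketch; the path and thread count are illustrative) :
local dl = require 'dataload'
train, valid = dl.loadImageNet('/path/to/diskspace/ImageNet', 4)
for k, inputs, targets in train:sampleiter(64, 10000) do
   -- inputs : 64 x 3 x 224 x 224 crops ; targets : 64 class indices in 1..1000
end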
train, valid, test = dl.loadGBW(batchsize, [trainfile, datapath, srcurl, verbose])
Loads the Google Billion Words corpus as MultiSequence loaders.
The preprocessing specified in the Google Billion Words language modeling benchmark was applied to training-monolingual.tokenized/news.20??.en.shuffled.tokenized to generate the different subsets.
These subsets are automatically downloaded when not found on disk.
The task consists in predicting the next word given the previous ones.
The corpus contains approximately 30 million sentences of an average length of about 25 words.
In total, there are about 800 thousand (unique) words in the vocabulary, which makes it a very memory intensive problem.
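A sketch of loading and iterating over the corpus (it is assumed that the table form of batchsize is supported here, as it is for loadPTB, and that the first argument to subiter is the sequence length) :
local dl = require 'dataload'
train, valid, test = dl.loadGBW({32, 1, 1})
for k, inputs, targets in train:subiter(20, 10000) do
   -- inputs, targets : up to 20 x 32 tensors of word indices, zero-masked between sentences
end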
ppf = dl.fitImageNormalize(trainset, [nsample, cachepath, verbose])
Returns a ppf preprocessing function that can be used to in-place normalize a batch of images (inputs) channel-wise :
ppf(inputs)
The trainset argument is a DataLoader instance containing image inputs. The mean and standard deviation will be measured on nsample images (default 10000). When cachepath is provided, the mean and standard deviation are saved for the next function call.
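For example (a sketch; the cache path is hypothetical and the MNIST training set is used as an arbitrary image DataLoader) :
local dl = require 'dataload'
train = dl.loadMNIST()
-- measure channel-wise mean and standard deviation on 10000 training images
ppf = dl.fitImageNormalize(train, 10000, '/path/to/cache.t7')
inputs, targets = train:sample(32)
ppf(inputs) -- inputs are now normalized in place, channel-wise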