This article is essentially a translation of the official documentation: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
At the heart of PyTorch's data loading utilities is the torch.utils.data.DataLoader class. It represents a Python iterable over a dataset and supports the following:
- Map-style and iterable-style datasets;
- Customizable data loading order;
- Automatic batching;
- Single-process and multi-process data loading;
- Automatic memory pinning.
These options are configured through the constructor arguments of DataLoader, whose signature is:
    DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
               batch_sampler=None, num_workers=0, collate_fn=None,
               pin_memory=False, drop_last=False, timeout=0,
               worker_init_fn=None, *, prefetch_factor=2,
               persistent_workers=False)
The following sections describe the effects and usage of these options in detail.
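Before going through the options one by one, a minimal sketch of typical usage may help. The toy tensors below are made up for illustration; only batch_size, shuffle and drop_last from the signature above are used.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10 samples with 3 features each, plus one integer label per sample.
features = torch.arange(30, dtype=torch.float32).view(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# With batch_size=4 and drop_last=False, the last batch holds the 2 leftovers.
loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=False)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
for x_shape, y_shape in batch_shapes:
    print(x_shape, y_shape)
```

Setting drop_last=True instead would discard the final, incomplete batch.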
1. Dataset types
The most important constructor argument of DataLoader is dataset, which specifies the dataset object to load data from. PyTorch supports two different types of datasets:
- Map-style datasets
- Iterable-style datasets
1.1. Map-style datasets
A map-style dataset is a class that implements __getitem__() and __len__(), and represents a mapping from indices or keys to data samples.
For example, accessing such a dataset with dataset[idx] could read the idx-th image and its corresponding label from a folder on disk.
Refer to Dataset for details.
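As an illustration of the two required methods, here is a minimal sketch of a map-style dataset. The class name SquaresDataset and its contents are hypothetical; a real dataset would typically read from disk inside __getitem__().

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: index i maps to the pair (i, i**2)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of samples; samplers use it to know which indices are valid.
        return self.size

    def __getitem__(self, idx):
        # A real dataset might read the idx-th image and label from disk here.
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        return torch.tensor(idx), torch.tensor(idx ** 2)


squares = SquaresDataset(5)
print(len(squares))
print(squares[3])
```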
1.2. Iterable-style datasets
An iterable-style dataset is an instance of a subclass of IterableDataset that implements __iter__() and represents an iterable over data samples. This type of dataset is particularly suitable when random reads are expensive or even impossible, and when the batch size depends on the fetched data.
For example, calling iter(dataset) on such a dataset could return a stream of data read from a remote database, or even logs generated in real time.
For details, refer to IterableDataset.
Note:
When an IterableDataset is used with multi-process data loading, the same dataset object is replicated in every worker process, so the replicas must be configured differently to avoid duplicated data. See the IterableDataset documentation for how to achieve this.
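A minimal sketch of an iterable-style dataset follows. CountdownStream is a hypothetical stand-in for a real stream such as a database cursor, and it is loaded in a single process so the duplication issue from the note does not arise.

```python
from torch.utils.data import DataLoader, IterableDataset


class CountdownStream(IterableDataset):
    """Hypothetical iterable-style dataset that yields start, start-1, ..., 1."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # In practice this could wrap a database cursor or a log tail;
        # samples are produced in order and cannot be accessed by index.
        return iter(range(self.start, 0, -1))


stream = CountdownStream(5)
# Single-process loading (num_workers=0), so no samples are duplicated.
batches = [batch.tolist() for batch in DataLoader(stream, batch_size=2)]
print(batches)
```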
2. Data loading order and samplers
For iterable-style datasets, the data loading order is entirely controlled by the user-defined iterable. This makes chunked reads and dynamic batch sizes easy to implement (for example, by yielding a whole batch of samples at a time).
The rest of this section concerns map-style datasets. torch.utils.data.Sampler classes specify the sequence of indices or keys used in data loading. They represent iterable objects over the indices of a dataset. For example, in the common case of SGD, a Sampler could randomly permute the indices and yield them one at a time, or yield a small group of them for mini-batch SGD.
A sequential or shuffled sampler is constructed automatically based on the shuffle argument of DataLoader. Alternatively, users can pass a custom Sampler instance through the sampler argument; it yields the next index or key to fetch each time.
A custom sampler that yields a list of batch indices at a time can be passed as the batch_sampler argument. Automatic batching can also be enabled through the batch_size and drop_last arguments; see the next section for details.
Note:
Neither sampler nor batch_sampler is compatible with iterable-style datasets, because such datasets have no notion of a key or an index.
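To make the role of the sampler argument concrete, here is a small sketch of a custom sampler. ReverseSampler is a hypothetical example and not part of PyTorch; it only needs to yield indices and report its length.

```python
from torch.utils.data import DataLoader, Sampler


class ReverseSampler(Sampler):
    """Hypothetical sampler that visits indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


# A plain list has __getitem__ and __len__, so it can act as a map-style dataset.
data = [10, 11, 12, 13, 14]
loader = DataLoader(data, sampler=ReverseSampler(data), batch_size=2)
reversed_batches = [batch.tolist() for batch in loader]
print(reversed_batches)
```

Note that shuffle is left at its default here, since a custom sampler already determines the order.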
2.1. Loading batched and non-batched data
Through the batch_size, drop_last and batch_sampler arguments, DataLoader can automatically collate individually fetched samples into batches.
2.1.1. Automatic batching (default)
This is the most common case. It corresponds to fetching a mini-batch of samples and collating them into a batched sample, that is, tensors in which one dimension is the batch dimension (usually the first).
When batch_size (default 1) is not None, the data loader yields batched samples instead of individual samples. The batch_size and drop_last arguments specify how the data loader obtains batches of dataset keys. For map-style datasets, users can alternatively specify batch_sampler, which yields a list of keys at a time.
Note:
The batch_size and drop_last arguments are essentially used to construct a batch_sampler from sampler. For map-style datasets, the sampler is either provided by the user or constructed based on the shuffle argument. For iterable-style datasets, the sampler is a dummy infinite one. See section 2 for more on samplers.
Note:
When fetching from an iterable-style dataset with multiple worker processes, drop_last drops the last incomplete batch of each worker's dataset replica.
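The relationship between batch_size, drop_last and batch_sampler can be sketched as follows. This is an illustration for a map-style dataset without shuffling, not a description of the internal implementation; the toy data is made up.

```python
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

data = list(range(7))

# Automatic batching configured through batch_size and drop_last ...
auto = DataLoader(data, batch_size=3, drop_last=True)

# ... and an explicitly constructed batch sampler yielding the same index lists.
index_batches = BatchSampler(SequentialSampler(data), batch_size=3, drop_last=True)
explicit = DataLoader(data, batch_sampler=index_batches)

auto_batches = [b.tolist() for b in auto]
explicit_batches = [b.tolist() for b in explicit]
print(list(index_batches))
print(auto_batches == explicit_batches)
```

Because drop_last=True, the seventh sample does not fit into a full batch of three and is discarded in both loaders.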
After a list of samples has been fetched using the indices from the sampler, the function passed as the collate_fn argument collates that list into a batch.
In this case, loading from a map-style dataset is roughly equivalent to:

    for indices in batch_sampler:
        yield collate_fn([dataset[i] for i in indices])

and loading from an iterable-style dataset is roughly equivalent to:

    dataset_iter = iter(dataset)
    for indices in batch_sampler:
        yield collate_fn([next(dataset_iter) for _ in indices])

A custom collate_fn can be used to customize collation, for example to pad sequential data to the maximum length in the batch. See section 3 for more about collate_fn.
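The padding use case mentioned above might look like the following sketch. The function name pad_collate and the toy sequences are illustrative; pad_sequence comes from torch.nn.utils.rnn.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(samples):
    """Pad variable-length 1-D tensors to the longest one in the batch."""
    lengths = torch.tensor([len(s) for s in samples])
    padded = pad_sequence(samples, batch_first=True, padding_value=0)
    return padded, lengths


sequences = [torch.tensor([1, 2, 3]), torch.tensor([4]), torch.tensor([5, 6])]
loader = DataLoader(sequences, batch_size=3, collate_fn=pad_collate)
padded, lengths = next(iter(loader))
print(padded)
print(lengths)
```

Returning the original lengths alongside the padded tensor lets downstream code ignore the padding positions.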
2.1.2. Disabling automatic batching
In some cases, users may want to handle batching manually in the dataset code, or simply load individual samples. For example, it may be cheaper to load batched data directly (such as bulk reads from a database or reading contiguous chunks of memory), the batch size may depend on the data, or the program may be designed to work on individual samples. In these scenarios it is likely better not to use automatic batching (where collate_fn collates the samples), and instead let the data loader return each member of the dataset directly.
When both batch_size and batch_sampler are None (batch_sampler is None by default), automatic batching is disabled. Each sample obtained from the dataset is then processed by the function passed as the collate_fn argument.
When automatic batching is disabled, the default collate_fn simply converts NumPy arrays into PyTorch tensors and leaves everything else unchanged.
In this case, loading from a map-style dataset is roughly equivalent to:

    for index in sampler:
        yield collate_fn(dataset[index])

and loading from an iterable-style dataset is roughly equivalent to:

    for data in iter(dataset):
        yield collate_fn(data)

See section 3 for more about collate_fn.
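As a sketch of the "dataset already returns batches" scenario, the hypothetical ChunkDataset below serves pre-built chunks, and the loader is created with batch_size=None so that no extra batch dimension is added.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ChunkDataset(Dataset):
    """Hypothetical dataset whose items are already whole batches (chunks)."""

    def __init__(self, data, chunk_size):
        self.chunks = torch.split(data, chunk_size)

    def __len__(self):
        return len(self.chunks)

    def __getitem__(self, idx):
        return self.chunks[idx]


data = torch.arange(10)
# batch_size=None (with batch_sampler left at None) disables automatic batching,
# so each chunk is returned as-is, without an extra batch dimension.
loader = DataLoader(ChunkDataset(data, chunk_size=4), batch_size=None)
chunk_shapes = [tuple(chunk.shape) for chunk in loader]
print(chunk_shapes)
```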
3. Working with collate_fn
The role of collate_fn differs slightly depending on whether automatic batching is enabled or disabled.
When automatic batching is disabled, collate_fn is called with each individual data sample, and its output is yielded from the data loader iterator.
When automatic batching is enabled, collate_fn is called with a list of data samples each time. It is expected to collate the input samples into a batch, which is then yielded from the data loader iterator. The rest of this section describes the behavior of the default collate_fn in this case.
For example, if each data sample consists of a 3-channel image and an integer class label, so that each element of the dataset is a tuple (image, label), the default collate_fn collates a list of such tuples into a single tuple consisting of a batched image tensor and a batched label tensor. In particular, the default collate_fn has the following properties:
- It always prepends a new dimension as the batch dimension.
- It automatically converts NumPy arrays and Python numeric values into PyTorch tensors.
- It preserves the data structure. For example, if each sample is a dictionary, it outputs a dictionary with the same set of keys but with batched tensors as values (or lists, if the values cannot be converted into tensors). The same applies to lists, tuples, named tuples, and so on.
Users can supply a custom collate_fn to customize batching, for example to collate along a dimension other than the first, to pad sequences of different lengths, or to add support for custom data types.
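The properties of the default collate_fn listed above can be observed with a small sketch. The sample dictionaries are made up for illustration, and the list is used directly as a map-style dataset.

```python
import torch
from torch.utils.data import DataLoader

# Each sample is a dict holding a tensor, a Python number and a string.
samples = [
    {"image": torch.zeros(3, 4, 4), "label": 0, "name": "a"},
    {"image": torch.ones(3, 4, 4), "label": 1, "name": "b"},
]

batch = next(iter(DataLoader(samples, batch_size=2)))
print(batch["image"].shape)  # a batch dimension is prepended to the image shape
print(batch["label"])        # Python ints are gathered into one integer tensor
print(batch["name"])         # strings cannot become tensors, so a list is kept
```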
4. Single-process and multi-process data loading
A DataLoader uses single-process data loading by default.
Within a Python process, the Global Interpreter Lock (GIL) prevents Python code from running truly in parallel across threads. To keep data loading from blocking computation, PyTorch provides an easy switch to multi-process data loading: simply set the num_workers argument to a positive integer.
4.1. Single-process data loading
In this mode, data fetching is done in the same process in which the DataLoader is initialized. Data loading may therefore block computation. However, this mode may be preferable when the resources used for sharing data among processes (shared memory, file descriptors, and so on) are limited, or when the entire dataset is small enough to be loaded fully into memory. In addition, single-process loading often produces more readable error traces, which makes it useful for debugging.
4.2. Multi-process data loading
Setting the num_workers argument to a positive integer turns on multi-process data loading with the specified number of loader worker processes.
In this mode, each time an iterator of a DataLoader is created (for example, when calling enumerate(dataloader)), num_workers worker processes are created. The dataset, collate_fn and worker_init_fn are passed to each worker, where they are used to initialize the worker and fetch data. This means that dataset access, together with its internal I/O and transforms, runs in the worker process.
torch.utils.data.get_worker_info() returns useful information when called in a worker process (such as the worker id, the dataset replica and the initial seed), and returns None in the main process. Users can call this function in dataset code or in worker_init_fn to configure each dataset replica individually and to determine whether the code is running in a worker process. This is particularly helpful, for example, when sharding a dataset.
For map-style datasets, the main process generates the indices using the sampler and sends them to the workers. Any shuffling is therefore done in the main process, which guides loading by assigning the indices to load.
For iterable-style datasets, each worker process gets a replica of the dataset object, so naive multi-process loading will often produce duplicated data. Using torch.utils.data.get_worker_info() or worker_init_fn, users can configure each replica independently. (See the IterableDataset documentation.) For similar reasons, in multi-process loading the drop_last argument drops the last incomplete batch of each worker's iterable-style dataset replica.
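One way to avoid duplicated data with an iterable-style dataset is to let each replica yield only its own share, as in the sketch below. ShardedRange and shard_bounds are hypothetical names; the splitting rule (contiguous, equally sized shards) is just one possible choice.

```python
import math
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


def shard_bounds(start, end, worker_id, num_workers):
    """Return the [lo, hi) slice of range(start, end) owned by one worker."""
    per_worker = math.ceil((end - start) / num_workers)
    lo = start + worker_id * per_worker
    return lo, min(lo + per_worker, end)


class ShardedRange(IterableDataset):
    """Hypothetical iterable-style dataset yielding the integers in [start, end)."""

    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Main process (single-process loading): yield the full range.
            lo, hi = self.start, self.end
        else:
            # Worker process: yield only this worker's share of the range.
            lo, hi = shard_bounds(self.start, self.end, info.id, info.num_workers)
        return iter(range(lo, hi))


if __name__ == "__main__":
    loader = DataLoader(ShardedRange(0, 10), batch_size=None, num_workers=2)
    # With sharding in place, every value should appear exactly once.
    print(sorted(int(x) for x in loader))
```

Without the get_worker_info() branch, each of the two workers would iterate over the full range and every value would be returned twice.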
Workers are shut down once the end of the iteration is reached or when the iterator is garbage collected.
Warning:
Returning CUDA tensors in multi-process loading is generally not recommended, because using CUDA and sharing CUDA tensors in multiprocessing involves many subtleties (see the PyTorch notes on CUDA in multiprocessing). Instead, it is recommended to use automatic memory pinning (i.e., pin_memory=True), which enables fast data transfer to CUDA-enabled GPUs.
4.2.1. Platform-specific behavior
Because workers rely on Python's multiprocessing, worker launch behavior differs between Windows and Unix.
- On Unix, fork() is the default multiprocessing start method. With fork(), child workers can typically access the dataset and the Python argument functions directly through the cloned address space.
- On Windows, spawn() is the default multiprocessing start method. With spawn(), another interpreter is launched that runs the main script, followed by the internal worker function, which receives the dataset, collate_fn and other arguments through pickle serialization.
This separate serialization means that two steps are needed to stay compatible with Windows when using multi-process data loading:
- Wrap most of the main script's code in an if __name__ == '__main__': block, so that it is not executed again when each worker process is launched. The dataset and DataLoader instance creation logic can be placed here.
- Make sure that any custom collate_fn, worker_init_fn or dataset code is declared as a top-level definition, outside of the __main__ check. This ensures that they are available in the worker processes.
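The two steps above suggest a script layout along the following lines. This is a sketch only: RandomVectors, first_two_columns and main are hypothetical names chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


# Top-level definitions: importable (and therefore picklable by reference)
# from worker processes started with spawn.
class RandomVectors(Dataset):
    """Hypothetical dataset of fixed random vectors."""

    def __init__(self, n, dim):
        self.data = torch.randn(n, dim)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def first_two_columns(samples):
    """Custom collate_fn: stack the samples and keep only the first two columns."""
    return torch.stack(samples)[:, :2]


def main():
    loader = DataLoader(RandomVectors(8, 5), batch_size=4, num_workers=2,
                        collate_fn=first_two_columns)
    for batch in loader:
        print(batch.shape)


# Everything with side effects sits behind the guard, so a spawned worker that
# re-imports this script does not create datasets or loaders again.
if __name__ == "__main__":
    main()
```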
4.2.2. Randomness in multi-process data loading
By default, each worker has its PyTorch seed set to base_seed + worker_id, where base_seed is a long integer generated by the main process using its RNG. However, seeds for other libraries (such as NumPy) may be duplicated when the workers are initialized, causing every worker to return identical random numbers. (See the relevant section of the PyTorch FAQ.)
In worker_init_fn, the PyTorch seed set for each worker can be read with either torch.utils.data.get_worker_info().seed or torch.initial_seed(), and used to seed other libraries before data loading.
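A worker_init_fn that follows this advice could look like the sketch below. The names seed_worker and NoisyDataset are hypothetical, and reducing the seed modulo 2**32 is an assumption made here because NumPy's legacy seeding only accepts 32-bit values.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


def seed_worker(worker_id):
    """worker_init_fn: derive NumPy and random seeds from the PyTorch seed."""
    # Inside a worker, torch.initial_seed() is the per-worker PyTorch seed.
    # np.random.seed only accepts 32-bit seeds, hence the modulo.
    seed = torch.initial_seed() % 2**32
    np.random.seed(seed)
    random.seed(seed)


class NoisyDataset(Dataset):
    """Hypothetical dataset whose samples depend on NumPy's global RNG."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return np.random.randint(0, 1000)


if __name__ == "__main__":
    loader = DataLoader(NoisyDataset(), batch_size=2, num_workers=2,
                        worker_init_fn=seed_worker)
    for batch in loader:
        print(batch)
```

Because each worker has a different PyTorch seed, the seeds derived from it for NumPy and random differ between workers as well.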
5. Memory pinning
Host-to-GPU copies are much faster when they originate from pinned (page-locked) memory. See the PyTorch notes on pinned memory buffers for when and how to use it.
For data loading, passing pin_memory=True to a DataLoader automatically places the fetched data tensors in pinned memory, which enables faster data transfer to CUDA-enabled GPUs.
The default memory pinning logic only recognizes tensors, and maps and iterables that contain tensors. If the batch is of a custom type (which happens when collate_fn returns a custom batch type), or if each element of the batch is of a custom type, the pinning logic does not recognize them and returns the batch (or those elements) without pinning the memory. To enable memory pinning for custom batch or data types, define a pin_memory() method on the custom type.
For example:

    import torch
    from torch.utils.data import DataLoader, TensorDataset


    # Custom batch type
    class SimpleCustomBatch:
        def __init__(self, data):
            # data is a list of (input, target) pairs; regroup and stack them.
            transposed_data = list(zip(*data))
            self.inp = torch.stack(transposed_data[0], 0)
            self.tgt = torch.stack(transposed_data[1], 0)

        # Custom memory pinning method on the custom type
        def pin_memory(self):
            self.inp = self.inp.pin_memory()
            self.tgt = self.tgt.pin_memory()
            return self


    def collate_wrapper(batch):
        return SimpleCustomBatch(batch)


    inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
    tgts = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
    dataset = TensorDataset(inps, tgts)

    loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
                        pin_memory=True)

    for batch_ndx, sample in enumerate(loader):
        print(sample.inp.is_pinned())
        print(sample.tgt.is_pinned())

The second half of the official documentation lists and describes the dataset-related classes, some of which come with usage examples.
The classes with usage examples are:
- torch.utils.data.IterableDataset
- torch.utils.data.BufferedShuffleDataset(dataset, buffer_size)
- torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True, generator=None)
- torch.utils.data.BatchSampler(sampler, batch_size, drop_last)
- torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False)