torch.utils.data study notes

This is, in essence, a translation of the official documentation:
https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader

  At the heart of PyTorch's data loading utilities is the torch.utils.data.DataLoader class. It represents a Python iterable over a dataset and supports the following:

  • map-style and iterable-style datasets;
  • customized data loading order;
  • automatic batching;
  • single- and multi-process data loading;
  • automatic memory pinning.

  These options are configured through the DataLoader constructor, whose signature is:

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)

  The following sections describe the effects and usage of these options in detail.
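
  As a quick orientation, here is a minimal sketch of constructing and iterating a DataLoader (the shapes and values below are illustrative assumptions, not from the original text):

import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 samples with 3 features each, plus one integer label per sample
features = torch.randn(10, 3)
labels = torch.randint(0, 2, (10,))
dataset = TensorDataset(features, labels)

# shuffle=True builds a random sampler; batch_size=4 enables automatic batching
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for batch_features, batch_labels in loader:
    print(batch_features.shape, batch_labels.shape)  # e.g. torch.Size([4, 3]) torch.Size([4])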

1. Dataset types

  The most important argument of the DataLoader constructor is dataset, which indicates the dataset object to load data from. PyTorch provides two different types of datasets:

  • map-style datasets
  • iterable-style datasets

1.1. Map-style datasets

  A map-style dataset is one that implements the __getitem__() and __len__() protocols, and represents a map from keys or indices to data samples.
  For example, dataset[idx] could read the idx-th image and its corresponding label from a folder on disk.
  Refer to Dataset for details.
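
  As a minimal sketch of the map-style protocol (the dataset below is a made-up example, not from the original text):

import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    # Minimal map-style dataset: maps an index to the sample (x, x**2).
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n          # enables len(dataset)

    def __getitem__(self, idx):
        x = torch.tensor(float(idx))
        return x, x ** 2       # dataset[idx] -> one (input, target) sample

dataset = SquaresDataset(100)
print(len(dataset))  # 100
print(dataset[3])    # (tensor(3.), tensor(9.))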

1.2. Iterable-style datasets

  An iterable-style dataset is an instance of a subclass of IterableDataset that implements the __iter__() protocol, and represents an iterable over data samples. This type of dataset is particularly suited to cases where random reads are expensive or even impossible, and where the batch size depends on the data being fetched.
  For example, iter(dataset) could return a stream of data read from a remote database, or even logs generated in real time.
  For details, refer to IterableDataset.
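
  A minimal sketch of the iterable-style protocol (the range below stands in for a real stream such as a database cursor or log file; this example is an assumption, not from the original text):

from torch.utils.data import IterableDataset

class StreamDataset(IterableDataset):
    # Minimal iterable-style dataset: __iter__ yields samples from a stream.
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        # In practice this could read from a socket or remote database;
        # a plain range stands in for the stream here.
        return iter(range(self.start, self.end))

for sample in StreamDataset(0, 5):
    print(sample)  # 0, 1, 2, 3, 4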

Note:
  When an IterableDataset is used with multi-process data loading, each worker process receives an identical replica of the dataset, so the data must be partitioned differently in each worker to avoid duplicates. See the IterableDataset documentation for how to do this.

2. Data loading order and samplers

  For iterable-style datasets, the data loading order is entirely controlled by the user-defined iterable. This makes it easy to implement chunked reading and dynamic batch sizes (for example, yielding a whole batch of samples at a time).
  The rest of this section concerns map-style datasets. The torch.utils.data.Sampler class is used to specify the sequence of indices or keys used in data loading; it represents an iterable over indices into the dataset. For example, in the common case of SGD, a Sampler could randomly permute the indices and yield them one at a time, or yield a small number of them at a time for mini-batch SGD.
  A sequential or shuffled sampler is constructed automatically based on the shuffle argument of DataLoader. Alternatively, the sampler argument can be used to supply a custom Sampler instance that yields the next key or index to fetch.
  The batch_sampler argument can be used to supply a custom sampler that yields a list of indices at a time. Automatic batching can also be controlled via the batch_size and drop_last arguments; see the next section for details.
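
  A minimal sketch of a custom sampler (EvenFirstSampler is a hypothetical name; it simply yields even indices before odd ones):

import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class EvenFirstSampler(Sampler):
    # Hypothetical sampler: yields even indices first, then odd ones.
    def __init__(self, data_source):
        self.n = len(data_source)

    def __iter__(self):
        yield from range(0, self.n, 2)
        yield from range(1, self.n, 2)

    def __len__(self):
        return self.n

dataset = TensorDataset(torch.arange(6))
# sampler is mutually exclusive with shuffle; pass only one of them
loader = DataLoader(dataset, batch_size=2, sampler=EvenFirstSampler(dataset))
for (batch,) in loader:
    print(batch)  # tensor([0, 2]), then tensor([4, 1]), then tensor([3, 5])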

Note:
  Since iterable-style datasets have no notion of a key or an index, neither sampler nor batch_sampler is compatible with them.

2.1. Loading batched and non-batched data

  Via the batch_size, drop_last, and batch_sampler arguments, DataLoader supports automatically collating individually fetched samples into batches.

2.1.1. Automatic batching (default)

  This is the most common case, and corresponds to fetching a minibatch of data and collating its samples into a batch, i.e., into tensors with one dimension being the batch dimension (usually the first).
  When batch_size (default 1) is not None, the data loader yields batched data rather than individual samples. batch_size and drop_last specify how the loader obtains batches of dataset keys. For map-style datasets, users can alternatively set batch_sampler, which yields a list of keys at a time.

Note:
  batch_size and drop_last are essentially used to construct a batch_sampler from the sampler. For map-style datasets, the sampler is either provided by the user or constructed based on the shuffle argument; for iterable-style datasets, the sampler is a dummy infinite one. See here for details on samplers.

Note:
  When fetching from an iterable-style dataset with multiple worker processes, if the number of samples in a worker's last batch is smaller than batch_size, the drop_last argument drops that batch.

  After fetching a list of samples using the indices from the sampler, the function passed as collate_fn is used to collate the list of samples into a batch.
  In this case, loading from a map-style dataset is roughly equivalent to:

for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])

  Loading from an iterable-style dataset is roughly equivalent to:

dataset_iter = iter(dataset)
for indices in batch_sampler:
    yield collate_fn([next(dataset_iter) for _ in indices])

  A custom collate_fn can be passed to customize collation, e.g., padding sequential data to the maximum length within a batch (an example appears in section 3 below). See here for more details about collate_fn.

2.1.2. Disabling automatic batching

  Sometimes users may want to handle batching manually in the dataset code, or simply load individual samples. For example, it may be cheaper to load batched data directly (e.g., bulk reads from a database or reading contiguous chunks of memory), or the batch size may depend on the data, or the program may be designed to work on individual samples. In these scenarios, automatic batching (which uses collate_fn to collate the samples) may no longer be appropriate, and it can be more convenient to let the data loader directly return each member of the dataset object.
  When both batch_size and batch_sampler are None (the default for batch_sampler is already None), automatic batching is disabled. Each sample obtained from the dataset is then processed with the function passed as collate_fn.
  When automatic batching is disabled, the default collate_fn simply converts NumPy arrays into PyTorch tensors and leaves everything else untouched.
  In this case, loading from a map-style dataset is roughly equivalent to:

for index in sampler:
    yield collate_fn(dataset[index])

  Loading from an iterable-style dataset is roughly equivalent to:

for data in iter(dataset):
    yield collate_fn(data)

See here for more details about collate_fn.

3. Working with collate_fn

  The use of collate_fn differs slightly depending on whether automatic batching is enabled or disabled.
  When automatic batching is disabled, collate_fn is called with each individual data sample, and the output is yielded from the data loader iterator.
  When automatic batching is enabled, collate_fn is called with a list of data samples each time; it is expected to collate the input samples into a batch to be yielded from the data loader iterator. The rest of this section describes the behavior of the default collate_fn in this case.
  For instance, if each data sample consists of a 3-channel image and an integer class label, i.e., each element of the dataset returns a tuple (image, class_index), the default collate_fn collates a list of such tuples into a single tuple containing a batched image tensor and a batched label tensor. The default collate_fn has the following properties:

  • It always prepends a new dimension as the batch dimension (usually the first dimension of the tensor).
  • It automatically converts NumPy arrays and Python numerical values into PyTorch tensors.
  • It preserves the data structure. For example, if each sample is a dictionary, it outputs a dictionary with the same set of keys, but with batched tensors as values (or lists, if the values cannot be converted into tensors). The same applies to lists, tuples, namedtuples, and so on.

  Users can supply a custom collate_fn for custom batching, e.g., collating along a dimension other than the first, padding sequences of various lengths, or adding support for custom data types. A minimal padding example follows.
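
  This sketch assumes variable-length 1-D tensors with integer labels and uses torch.nn.utils.rnn.pad_sequence to pad each batch to its longest sequence:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical variable-length samples: (sequence, label) pairs
samples = [(torch.randn(n), i % 2) for i, n in enumerate([5, 3, 7, 2])]

def pad_collate(batch):
    # Custom collate_fn: pad the sequences in this batch to equal length.
    seqs, labels = zip(*batch)
    padded = pad_sequence(seqs, batch_first=True)   # (batch, max_len)
    lengths = torch.tensor([len(s) for s in seqs])  # original lengths
    return padded, lengths, torch.tensor(labels)

# A plain list works as a map-style dataset (it has __getitem__ and __len__)
loader = DataLoader(samples, batch_size=2, collate_fn=pad_collate)
for padded, lengths, labels in loader:
    print(padded.shape, lengths, labels)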

4. Single-process and multi-process data loading

  DataLoader uses single-process data loading by default.
  Python's GIL makes thread concurrency effectively pseudo-concurrent, so threads cannot fully overlap data loading with computation. To avoid blocking compute code with data loading, PyTorch provides an easy switch to multi-process data loading: simply set the num_workers argument to a positive integer.

4.1. Single-process data loading

  In this mode, data fetching is done in the same process in which the DataLoader is initialized. Therefore, data loading may block computation. However, this mode may be preferable when the resources used to share data between processes (shared memory, file descriptors, etc.) are limited, or when the dataset is small enough to be loaded entirely into memory. In addition, when single-process loading hits an error, the traceback it produces is easier to read, which makes debugging simpler.

4.2. Multi-process data loading

  Setting the num_workers argument to a positive integer turns on multi-process data loading with the specified number of loader worker processes.
  In this mode, each time an iterator of a DataLoader is created (e.g., when calling enumerate(dataloader)), num_workers worker processes are created. dataset, collate_fn, and worker_init_fn are passed to each worker, where they are used to initialize and fetch the data. This means that dataset access, together with its internal I/O and transforms, runs in the worker processes.
  In a worker process, torch.utils.data.get_worker_info() returns information about the worker (such as the worker ID, the dataset replica, and the initial seed); in the main process it returns None. Users may call this function in dataset code or in worker_init_fn to configure each dataset replica independently, or to determine whether the code is running in a worker. For example, this can be particularly helpful when sharding a dataset.
  For map-style datasets, the main process generates the indices using the sampler and sends them to the workers. Any shuffling is therefore done in the main process, which guides loading by assigning the indices to load.
  For iterable-style datasets, since each worker process gets a replica of the dataset object, naive multi-process loading will produce duplicated data. Using torch.utils.data.get_worker_info() or worker_init_fn, users may configure each replica independently. (See the IterableDataset documentation.) Similarly, in multi-process loading, the drop_last argument drops the last non-full batch of each worker's dataset replica.
  Workers are shut down once the end of the iteration is reached, or when the iterator is garbage collected. A minimal sharding sketch follows.
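
  This sketch shards an iterable-style dataset across workers with get_worker_info() (RangeDataset is a made-up example in the spirit of the IterableDataset documentation):

import math
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class RangeDataset(IterableDataset):
    # Iterable-style dataset that splits its range across worker processes.
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: yield the whole range
            return iter(range(self.start, self.end))
        # Multi-process loading: each worker takes a disjoint slice
        per_worker = int(math.ceil((self.end - self.start) / info.num_workers))
        start = self.start + info.id * per_worker
        end = min(start + per_worker, self.end)
        return iter(range(start, end))

if __name__ == '__main__':
    loader = DataLoader(RangeDataset(0, 8), num_workers=2)
    print(list(loader))  # every number appears exactly once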

Warning:
  Returning CUDA tensors from worker processes is generally not recommended, because of many subtleties in using CUDA and sharing CUDA tensors across processes (see CUDA in multiprocessing). Instead, when using multi-process loading, it is recommended to use automatic memory pinning (i.e., setting pin_memory=True), which enables faster data transfer to CUDA-enabled GPUs.

4.2.1. Platform-specific behaviors

  Since workers rely on Python's multiprocessing, worker launch behavior differs between Windows and Unix.

  • On Unix, fork() is the default multiprocessing start method. With fork(), child workers can directly access the dataset and Python argument functions through the cloned address space.
  • On Windows, spawn() is the default multiprocessing start method. With spawn(), another interpreter is launched which runs your main script, and the internal worker function then receives the dataset, collate_fn, and other arguments through pickle serialization.

  This separate serialization means that you should take two steps to ensure compatibility with Windows when using multi-process data loading:

  • Wrap most of your main script's code in an if __name__ == '__main__': block, to make sure it is not executed again when each worker process is launched. For example, the dataset and DataLoader instance creation logic can go here.
  • Make sure any custom collate_fn, worker_init_fn, or dataset code is declared at the top level of the module, outside the __main__ check, so that it is available in the worker processes. A sketch of such a script layout follows.
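
  A minimal sketch of a Windows-safe script layout (the dataset and collate function here are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Defined at the top level, outside the __main__ check,
# so spawned worker processes can import it.
def my_collate(batch):
    return torch.stack([x for (x,) in batch])

if __name__ == '__main__':
    # Everything that launches workers lives inside the guard, so it is
    # not re-executed when each worker re-imports this module.
    dataset = TensorDataset(torch.randn(8, 2))
    loader = DataLoader(dataset, batch_size=2, num_workers=2,
                        collate_fn=my_collate)
    for batch in loader:
        print(batch.shape)
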
4.2.2. Randomness in multi-process data loading

  By default, each worker's PyTorch seed is set to base_seed + worker_id, where base_seed is a long generated by the main process using its RNG. However, the seed states of other libraries (such as NumPy) may be duplicated upon initializing the workers, causing each worker to return identical random numbers. (See this section for how to resolve this.)
  In worker_init_fn, you can use torch.utils.data.get_worker_info().seed or torch.initial_seed() to access the PyTorch seed set for each worker, and use it to seed other libraries before data loading. A sketch:
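
  seed_numpy_worker below is a hypothetical name (on Windows, also wrap the DataLoader construction in the __main__ guard shown above):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_numpy_worker(worker_id):
    # Each worker's torch.initial_seed() is base_seed + worker_id;
    # reuse it so NumPy draws differ across workers too.
    np.random.seed(torch.initial_seed() % 2**32)

dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=2, num_workers=2,
                    worker_init_fn=seed_numpy_worker)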

5. Memory pinning

  Host-to-GPU copies are much faster when they originate from pinned (page-locked) memory. See here for how to use pinned memory in general.
  For data loading, passing pin_memory=True to DataLoader automatically puts the fetched data tensors in pinned memory, enabling faster data transfer to CUDA-enabled GPUs.
  The default memory pinning logic only recognizes tensors and maps/iterables containing tensors. If a batch is a custom type (which happens when collate_fn returns a custom type), or if each element of the batch is a custom type, the pinning logic will not recognize them, and the returned batch will not be pinned. To enable memory pinning for custom batch or data types, define a pin_memory() method on the custom type.
  For example:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Custom batch type
class SimpleCustomBatch:
    def __init__(self, data):
        transposed_data = list(zip(*data))
        self.inp = torch.stack(transposed_data[0], 0)
        self.tgt = torch.stack(transposed_data[1], 0)

    # Custom memory pinning method
    def pin_memory(self):
        self.inp = self.inp.pin_memory()
        self.tgt = self.tgt.pin_memory()
        return self

def collate_wrapper(batch):
    return SimpleCustomBatch(batch)

inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
tgts = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
dataset = TensorDataset(inps, tgts)

loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
                    pin_memory=True)

for batch_ndx, sample in enumerate(loader):
    print(sample.inp.is_pinned())
    print(sample.tgt.is_pinned())

The second half of the official documentation lists and describes the dataset-related classes, providing usage examples for some of them.
The classes with usage examples are:

  • torch.utils.data.IterableDataset
  • torch.utils.data.BufferedShuffleDataset(dataset, buffer_size)
  • torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True, generator=None)
  • torch.utils.data.BatchSampler(sampler, batch_size, drop_last)
  • torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False)

Source: blog.csdn.net/qq_29695701/article/details/115354889