01_Dataset in PyTorch

In PyTorch,
Dataset is used to define a dataset;
DataLoader is used to load and fetch batches of data from a dataset during training.

Here we first introduce the Dataset class in PyTorch.
torch/utils/data/dataset.py defines Dataset, an abstract class representing a dataset. Any custom dataset needs to inherit from this class and override the related methods.

A dataset is actually a class responsible for mapping from index to sample.

[Figure: the dataset class hierarchy in torch/utils/data/dataset.py]
As can be seen in torch/utils/data/dataset.py, PyTorch provides two kinds of datasets:

  • Map-style datasets (MapDataPipe() in the figure above)
  • Iterable-style datasets (IterDataPipe() in the figure above)

1. Map-style dataset

That is, the MapDataPipe() branch in the figure above.

1.1 Methods that need to be overridden

A map-style dataset must override the two built-in methods __getitem__(self, index) and __len__(self), which together define the mapping (Map) from index to sample.

Such a dataset then behaves as follows:

  • dataset[idx] reads the idx-th image of your dataset (and its label, if any) from your hard disk;
  • len(dataset) returns the size (capacity) of this dataset.

1.2 How to use

Example 1: an example written for my own experiment. Here the image files are stored in the "./data/faces/" folder; the image names do not simply start from 1, so both the image ids and the label information are read from the dictionary saved in the file final_train_tag_dict.txt. You can follow the comments to read this code.

from torch.utils import data
import numpy as np
from PIL import Image


class face_dataset(data.Dataset):
	def __init__(self):
		self.file_path = './data/faces/'
		# final_train_tag_dict.txt stores a python dict literal: {image_id: label, ...}
		with open("final_train_tag_dict.txt", "r") as f:
			self.label_dict = eval(f.read())

	def __getitem__(self, index):
		# __getitem__ receives an index in [0, len(self) - 1]
		img_id = list(self.label_dict.keys())[index]
		label = list(self.label_dict.values())[index]
		# build the image path from the id stored in the dict and read the image
		img_path = self.file_path + str(img_id) + ".jpg"
		img = np.array(Image.open(img_path))
		return img, label

	def __len__(self):
		return len(self.label_dict)
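
A brief usage sketch (this assumes the ./data/faces/ folder and final_train_tag_dict.txt described above actually exist, and that all images share the same size so they can be stacked into batches):

dataset = face_dataset()
img, label = dataset[0]     # __getitem__: read one sample from disk
print(len(dataset))         # __len__: number of samples in the dataset

# DataLoader uses these same two methods to assemble shuffled mini-batches.
loader = data.DataLoader(dataset, batch_size=4, shuffle=True)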

2. Iterable-style dataset

An iterable-style dataset is an instance of a subclass of the abstract class data.IterableDataset that overrides the __iter__ method to return an iterator.

This kind of dataset is mainly used when the size of the data is unknown in advance, or when the input arrives as a stream rather than from a fixed local file, so samples have to be obtained iteratively instead of by index.

An iterable-style dataset must override __iter__:

class IterableDataset(Dataset[T_co]):
    r"""An iterable Dataset.

    All datasets that represent an iterable of data samples should subclass it.
    Such form of datasets is particularly useful when data come from a stream.

    All subclasses should overwrite :meth:`__iter__`, which would return an
    iterator of samples in this dataset.

    When a subclass is used with :class:`~torch.utils.data.DataLoader`, each
    item in the dataset will be yielded from the :class:`~torch.utils.data.DataLoader`
    iterator. When :attr:`num_workers > 0`, each worker process will have a
    different copy of the dataset object, so it is often desired to configure
    each copy independently to avoid having duplicate data returned from the
    workers. :func:`~torch.utils.data.get_worker_info`, when called in a worker
    process, returns information about the worker. It can be used in either the
    dataset's :meth:`__iter__` method or the :class:`~torch.utils.data.DataLoader` 's
    :attr:`worker_init_fn` option to modify each copy's behavior.

    """
    def __iter__(self) -> Iterator[T_co]:
        raise NotImplementedError

    def __add__(self, other: Dataset[T_co]):
        return ChainDataset([self, other])

All datasets that represent an iterable of data samples should subclass it. This form of dataset is especially useful when the data comes from a stream.

All subclasses should override __iter__, which returns an iterator over the samples in this dataset.

  • When a subclass is used with torch.utils.data.DataLoader, each item in the dataset will be yielded from the DataLoader iterator.
  • When num_workers > 0, each worker process gets its own copy of the dataset object, so each copy usually needs to be configured independently to avoid the workers returning duplicate data.
  • torch.utils.data.get_worker_info(), when called inside a worker process, returns information about that worker. It can be used either in the dataset's __iter__ method or in the DataLoader's worker_init_fn option to modify each copy's behavior (see the sketch after this list).
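
A minimal sketch of that worker-aware pattern (the MyIterableDataset name and the start/end range are made up for illustration; the slicing logic follows the pattern shown in the PyTorch documentation):

import math
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __iter__(self):
        worker_info = get_worker_info()
        if worker_info is None:
            # single-process loading: iterate over the full range
            iter_start, iter_end = self.start, self.end
        else:
            # multi-process loading: give each worker its own slice of the range
            per_worker = int(math.ceil((self.end - self.start) / worker_info.num_workers))
            iter_start = self.start + worker_info.id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))

if __name__ == "__main__":
    loader = DataLoader(MyIterableDataset(0, 10), num_workers=2)
    print(list(loader))  # each worker yields only its own slice, so nothing is duplicated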

2.1 The relationship between iterators and generators

[Figure: the relationship between iterables, iterators, and generators]

2.2 Iterators in Python

As the name implies, an iterator is an object used for iteration (the for loop). Like a list, it can be used to obtain each of its elements in turn. Any object that implements the __next__ method (next in Python 2) can be called an iterator.

The difference from a list is that an iterator does not load all of its elements into memory at once when it is built; instead it returns elements lazily, which is exactly its advantage.

For example, a list containing 10 million integers occupies more than 400 MB of memory, while an iterator needs only a few tens of bytes, because it does not hold all the elements in memory but only produces the next element when the __next__ method is called (call by need; essentially, a for loop is just repeatedly calling the iterator's __next__ method).

Take the Fibonacci sequence as an example to implement an iterator:

class Fib:
    def __init__(self, n):
        self.prev = 0
        self.cur = 1
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n > 0:
            value = self.cur
            self.cur = self.cur + self.prev
            self.prev = value
            self.n -= 1
            return value
        else:
            raise StopIteration()
    # Python 2 compatibility
    def next(self):
        return self.__next__()

f = Fib(10)
print([i for i in f])
# [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
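
To make the point that a for loop just keeps calling __next__, here is the same Fib iterator driven by hand (a small illustrative snippet, not part of the original example):

f = Fib(3)
it = iter(f)        # __iter__ returns the object itself
print(next(it))     # 1
print(next(it))     # 1
print(next(it))     # 2
try:
    next(it)        # the sequence is exhausted...
except StopIteration:
    print("done")   # ...so StopIteration is raised, which is what ends a for loop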

2.3 Generators in python

Now that we understand iterators, we can move on to the real topic: generators.

2.3.1 Why generators are needed

With a list comprehension we can create a list directly, but memory is limited, so the capacity of a list is necessarily bounded. Creating a list with 1 million elements not only takes up a lot of storage space; if we only ever access the first few elements, the space occupied by most of the remaining elements is simply wasted.

So, if the list elements can be computed by some algorithm, can we compute the subsequent elements on the fly as we loop? Then there is no need to build the complete list, which saves a lot of space. In Python, this mechanism of computing while looping is called a generator.

A generator is a special kind of routine that controls the iteration behaviour of a loop. A generator in Python is a kind of iterator: it uses yield to return values, each yield pauses the function, and the next() and send() functions resume it.

A generator is similar to a function that returns an array: it can accept parameters and be called. However, unlike an ordinary function, which returns an array containing all the values at once, a generator produces only one value at a time. This greatly reduces memory consumption and lets the caller start processing the first few values very quickly. So a generator looks like a function but behaves like an iterator.

  • Generator function: also defined with def, but it uses the keyword yield to return one result at a time, suspending after each yield and resuming on the next request.

  • Generator expression: returns an object that produces results only when they are needed.

An ordinary function uses return to return a value, just as in other languages such as Java. Python also has functions that return values with the keyword yield; such a function is called a generator function. Calling it returns a generator object, which is essentially an iterator and is likewise used in iteration, so it has the same characteristics as an iterator. The only difference is the implementation, and the generator is more concise.

2.3.2 Generator functions

The simplest generator function:

>>> def func(n):
...     yield n*2
...
>>> func
<function func at 0x00000000029F6EB8>
>>> g = func(5)
>>> g
<generator object func at 0x0000000002908630>
>>>

func is a generator function. When this function is called, the object returned is the generator g. This generator object behaves very much like an iterator and can be used in contexts such as for loops. Note that the value after yield is not returned when the function is called; it is only returned when the next method is called (essentially, the for loop also calls the next method).

>>> g = func(5)
>>> next(g)
10

>>> g = func(5)
>>> for i in g:
...     print(i)
...
10

So why use a generator? Because generator code is less verbose while being just as efficient.

The downside is that it takes a little more effort to understand.

Let's see how easy it is to implement the Fibonacci sequence with generators.

def fib(n):
    prev, curr = 0, 1
    while n > 0:
        n -= 1
        yield curr
        prev, curr = curr, curr + prev

print([i for i in fib(10)])
#[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
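
The text above mentions that, besides next(), send() can also resume a generator. Here is a minimal sketch of that mechanism (the echo generator is purely illustrative):

def echo():
    while True:
        received = yield        # execution pauses here until next()/send() is called
        print("got:", received)

gen = echo()
next(gen)           # advance to the first yield ("prime" the generator)
gen.send("hello")   # resumes the generator; the yield expression evaluates to "hello"
gen.send(42)        # prints: got: 42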

2.3.3 Generator expressions

  • Generator expressions are very similar to list comprehensions, but they return different objects: the former returns a generator object, while the latter returns a list.
>>> g = (x*2 for x in range(10))
>>> type(g)
<class 'generator'>
>>> l = [x*2 for x in range(10)]
>>> type(l)
<class 'list'>

The advantage of the generator is that, when iterating over a large amount of data, it uses far less memory.
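
A quick way to see the difference (sys.getsizeof reports only the size of the container object itself, which is already enough to show the gap):

import sys

big_list = [x * 2 for x in range(1_000_000)]   # all elements materialised up front
big_gen  = (x * 2 for x in range(1_000_000))   # elements produced lazily, one at a time

print(sys.getsizeof(big_list))   # several megabytes for the list object alone
print(sys.getsizeof(big_gen))    # only a couple of hundred bytes for the generator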

2.4 The generator applied in DataLoader

The implementation of the DataLoader module in the deep learning framework PyTorch uses the generator mechanism to produce batches for training.

A detailed explanation of DataLoader is given in another of the blogger's articles, "Dataloader of Pytorch". Here we only discuss the Sampler classes, which use the generator mechanism.

First, RandomSampler: iter(random_sampler) returns an iterator, and each call to next on it yields the next index to be sampled. SequentialSampler works the same way, except that the indices it produces are sequential.

class RandomSampler(Sampler):

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(torch.randperm(len(self.data_source)).long())

    def __len__(self):
        return len(self.data_source)
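
A quick sketch of how these samplers behave (using the samplers exported from torch.utils.data; the tiny range(5) data source is just for illustration):

from torch.utils.data import SequentialSampler, RandomSampler

seq = iter(SequentialSampler(range(5)))
print(next(seq), next(seq))            # 0 1  -- indices come out in order

print(list(RandomSampler(range(5))))   # e.g. [3, 0, 4, 1, 2] -- a random permutation of the indices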

BatchSampler is a wrapper around an ordinary Sampler: an ordinary Sampler yields one index at a time, while BatchSampler yields a batch of indices at a time.

class BatchSampler(Sampler):
    """Wraps another sampler to yield a mini-batch of indices. Args: sampler (Sampler): Base sampler. batch_size (int): Size of mini-batch. drop_last (bool): If ``True``, the sampler will drop the last batch if its size would be less than ``batch_size`` Example: >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False)) [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]] >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True)) [[0, 1, 2], [3, 4, 5], [6, 7, 8]] """

    def __init__(self, sampler, batch_size, drop_last):
        if not isinstance(sampler, Sampler):
            raise ValueError("sampler should be an instance of "
                             "torch.utils.data.Sampler, but got sampler={}"
                             .format(sampler))
        if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \
                batch_size <= 0:
            raise ValueError("batch_size should be a positive integeral value, "
                             "but got batch_size={}".format(batch_size))
        if not isinstance(drop_last, bool):
            raise ValueError("drop_last should be a boolean value, but got "
                             "drop_last={}".format(drop_last))
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self):
        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size
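
Putting the pieces together, here is a sketch of how a DataLoader can be driven by such a batch sampler (the toy TensorDataset is synthetic, purely for illustration):

import torch
from torch.utils.data import DataLoader, TensorDataset, BatchSampler, SequentialSampler

# A toy dataset of 10 samples with 3 features each.
dataset = TensorDataset(torch.arange(30.0).reshape(10, 3), torch.arange(10))

# BatchSampler turns the per-index sampler into a generator of index batches.
batch_sampler = BatchSampler(SequentialSampler(dataset), batch_size=3, drop_last=False)

loader = DataLoader(dataset, batch_sampler=batch_sampler)
for features, labels in loader:
    print(features.shape, labels)   # batches of 3 samples; the last batch has only 1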

References:

https://chenllliang.github.io/2020/02/04/dataloader/
https://chenllliang.github.io/2020/02/06/PyIter/
https://www.zhihu.com/question/20829330/answer/213544776
https://www.cnblogs.com/wj-1314/p/8490822.html


Origin: blog.csdn.net/chumingqian/article/details/131002167