Notes on PyTorch samplers

After data labeling for a deep-learning task is complete, the samples are wrapped in a dataset and handed to the dataloader, which assembles batches and feeds them to the model. Inside the dataloader, a sampler decides which data indices are drawn. This article records several common samplers:

  • GroupSampler in mmdetection
  • Imbalanced Dataset Sampler (for imbalanced data)
  • Weighted Random Sampler (for imbalanced data)
  • Random Sampler
  • Sequential Sampler
  • Subset Random Sampler

1、Sampler

Sampler is the base class in torch: it only declares the interface and implements nothing, and every other sampler inherits from it. The base class has just an initializer, an __iter__ method, and a __len__ method that reports the number of samples. A subclass must implement __iter__ and __len__, where __iter__ keeps yielding data indices that are then used to fetch the corresponding samples.

from typing import Generic, Iterator, Optional, Sized, TypeVar

T_co = TypeVar('T_co', covariant=True)

class Sampler(Generic[T_co]):
    def __init__(self, data_source: Optional[Sized]) -> None:
        pass

    def __iter__(self) -> Iterator[T_co]:
        raise NotImplementedError

    def __len__(self) -> int:
        pass
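
For concreteness, here is a minimal sketch of a custom sampler built on this base class. The ReverseSampler name and behavior are my own illustration, not part of torch: it simply yields indices from the last sample to the first.

from torch.utils.data import Sampler

class ReverseSampler(Sampler):
    """Toy example: yields indices from the last sample to the first."""
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)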

2、GroupSampler

mmdetection implements many data preprocessing steps and finally pads images of different sizes to a common size in collate. Shrinking the padded area saves computation, so SenseTime implemented its own sampler, GroupSampler, which splits the raw images into two groups by aspect ratio (width to height): images with ratio greater than 1 go into one group, the rest into the other. Every batch the dataloader returns must come from a single group, so images with similar ratios are batched together and the padded area is effectively reduced.
The source code of its implementation is as follows:

import numpy as np
from torch.utils.data import Sampler

class GroupSampler(Sampler):

    def __init__(self, dataset, samples_per_gpu=1):
        assert hasattr(dataset, 'flag')
        self.dataset = dataset
        self.samples_per_gpu = samples_per_gpu
        self.flag = dataset.flag.astype(np.int64)
        # the flag is assigned by the dataset at init time, see CustomDataset
        # it takes only two values, splitting the images into two groups by whether ratio > 1

        self.group_sizes = np.bincount(self.flag)  # count the size of each group (see np.bincount)
        self.num_samples = 0  # returned by __len__

        for i, size in enumerate(self.group_sizes):
            self.num_samples += int(np.ceil(size / self.samples_per_gpu)) * self.samples_per_gpu
        # group sizes are not guaranteed to be divisible by samples_per_gpu, so round up
        # e.g. if group 0 has 100 images, group 1 has 200, and samples_per_gpu is 3,
        # then num_samples = 102 + 201 = 303

    def __iter__(self):  # returns an iterator that yields one integer index at a time
        indices = []
        for i, size in enumerate(self.group_sizes):
            if size == 0:
                continue
            indice = np.where(self.flag == i)[0]  # indices of the images in this group
            assert len(indice) == size
            np.random.shuffle(indice)  # shuffle within the group
            num_extra = int(np.ceil(size / self.samples_per_gpu)) * self.samples_per_gpu - len(indice)
            indice = np.concatenate([indice, np.random.choice(indice, num_extra)])
            indices.append(indice)
        # continuing the example above ("group 0 has 100, group 1 has 200, samples_per_gpu is 3"),
        # 102 > 100 and 201 > 200, so extra indices are drawn by re-sampling within each group
        # the result is 303 indices: the first 102 belong to group 0 and the last 201 to group 1,
        # which guarantees that every run of samples_per_gpu indices shares the same ratio group

        indices = np.concatenate(indices)

        indices = [
            indices[i * self.samples_per_gpu:(i + 1) * self.samples_per_gpu]
            for i in np.random.permutation(range(len(indices) // self.samples_per_gpu))
        ]  # shuffle the chunks of samples_per_gpu indices while keeping each chunk intact
        indices = np.concatenate(indices)
        indices = indices.astype(np.int64).tolist()
        assert len(indices) == self.num_samples
        return iter(indices)

    def __len__(self):
        return self.num_samples
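
For reference, the flag mentioned above is set by the dataset at construction time. Below is a simplified sketch of CustomDataset._set_group_flag from older mmdetection versions (field names may differ across releases): images with width/height ratio greater than 1 get flag 1, the rest flag 0.

import numpy as np

def _set_group_flag(self):
    """Assign flag 1 to images with aspect ratio (width / height) > 1, else 0."""
    self.flag = np.zeros(len(self), dtype=np.uint8)
    for i in range(len(self)):
        img_info = self.img_infos[i]
        if img_info['width'] / img_info['height'] > 1:
            self.flag[i] = 1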

3、Imbalanced Dataset Sampler

In deep learning we often run into datasets with imbalanced class distributions, which bias what the model learns. For example, in cat-vs-dog classification with 50 images in total, if 90% are cats and 10% are dogs, the trained model will tend to predict cat.
A common remedy for such extreme imbalance is resampling. Resampling has its own drawback: for classes with very few samples it effectively reuses the same examples many times, which can aggravate overfitting. A torch-based imbalanced sampling scheme is implemented in the Imbalanced Dataset Sampler repo; broadly speaking, it works much like WeightedRandomSampler. Its source is as follows:

from typing import Callable
import pandas as pd
import torch
import torch.utils.data
import torchvision

class ImbalancedDatasetSampler(torch.utils.data.sampler.Sampler):
    """Samples elements randomly from a given list of indices for imbalanced dataset
    Arguments:
        indices: a list of indices
        num_samples: number of samples to draw
        callback_get_label: a callback-like function which takes two arguments - dataset and index
    """

    def __init__(
        self,
        dataset, 
        labels: list = None,
        indices: list = None,
        num_samples: int = None,
        callback_get_label: Callable = None,
    ):
        # if indices is not provided, all elements in the dataset will be considered
        self.indices = list(range(len(dataset))) if indices is None else indices

        # define custom callback
        self.callback_get_label = callback_get_label 

        # if num_samples is not provided, draw `len(indices)` samples in each iteration
        self.num_samples = len(self.indices) if num_samples is None else num_samples

        # distribution of classes in the dataset
        df = pd.DataFrame()
        df["label"] = self._get_labels(dataset) if labels is None else labels
        df.index = self.indices
        df = df.sort_index()

        label_to_count = df["label"].value_counts()

        weights = 1.0 / label_to_count[df["label"]]

        self.weights = torch.DoubleTensor(weights.to_list())

    def _get_labels(self, dataset):
        if self.callback_get_label:
            return self.callback_get_label(dataset)
        elif isinstance(dataset, torch.utils.data.TensorDataset):
            return dataset.tensors[1]
        elif isinstance(dataset, torchvision.datasets.MNIST):
            return dataset.train_labels.tolist()
        elif isinstance(dataset, torchvision.datasets.ImageFolder):
            return [x[1] for x in dataset.imgs]
        elif isinstance(dataset, torchvision.datasets.DatasetFolder):
            return [x[1] for x in dataset.samples]
        elif isinstance(dataset, torch.utils.data.Subset):
            return [dataset.dataset.imgs[i][1] for i in dataset.indices]
        elif isinstance(dataset, torch.utils.data.Dataset):
            return dataset.get_labels()
        else:
            raise NotImplementedError

    def __iter__(self):
        return (self.indices[i] for i in torch.multinomial(self.weights, self.num_samples, replacement=True))

    def __len__(self):
        return self.num_samples

The principle is to record each sample's label, compute the proportion of each class, assign every sample the reciprocal of its class count as a weight, and then draw with torch.multinomial. Note that what __iter__ yields is still a dataset index: the multinomial draw picks a position, which is mapped back through self.indices. Also note that _get_labels only implements a few common ways of fetching labels; anything else is unimplemented, so extend or override it to fit your own data.
Question: in object detection, a single image can contain defects from multiple classes; how should a weight be assigned to each sample then?
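
A brief usage sketch (train_dataset is a placeholder for any dataset that _get_labels understands, e.g. an ImageFolder):

from torch.utils.data import DataLoader

# the sampler draws minority-class samples more often, so batches are rebalanced
train_loader = DataLoader(
    train_dataset,
    sampler=ImbalancedDatasetSampler(train_dataset),
    batch_size=32,
)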

4、Weighted Random Sampler

Weighted Random Sampler also addresses sample imbalance: for unevenly distributed data, the reciprocal of each class's proportion is used as its weight.
e.g.:
Assume a classification problem with 4 categories: cats, dogs, pigs, and sheep, with proportions [0.1, 0.1, 0.3, 0.5].
Their weights are [1/0.1, 1/0.1, 1/0.3, 1/0.5] = [10, 10, 3.33, 2].
So if the dataset is [cat, cat, cat, dog, sheep, sheep, sheep, pig, dog, dog],
the weights correspond to: [10, 10, 10, 10, 2, 2, 2, 3.33, 10, 10].

The source code is as follows:

from typing import Sequence
import torch
from torch import Tensor
from torch.utils.data import Sampler

_int_classes = int  # stand-in for torch._six.int_classes used by older torch versions

class WeightedRandomSampler(Sampler[int]):
    r"""Samples elements from ``[0,..,len(weights)-1]`` with given probabilities (weights).

    Args:
        weights (sequence)   : a sequence of weights, not necessary summing up to one
        num_samples (int): number of samples to draw
        replacement (bool): if ``True``, samples are drawn with replacement.
            If not, they are drawn without replacement, which means that when a
            sample index is drawn for a row, it cannot be drawn again for that row.
        generator (Generator): Generator used in sampling.

    Example:
        >>> list(WeightedRandomSampler([0.1, 0.9, 0.4, 0.7, 3.0, 0.6], 5, replacement=True))
        [4, 4, 1, 4, 5]
        >>> list(WeightedRandomSampler([0.9, 0.4, 0.05, 0.2, 0.3, 0.1], 5, replacement=False))
        [0, 1, 4, 3, 2]
    """
    weights: Tensor
    num_samples: int
    replacement: bool

    def __init__(self, weights: Sequence[float], num_samples: int,
                 replacement: bool = True, generator=None) -> None:
        if not isinstance(num_samples, _int_classes) or isinstance(num_samples, bool) or \
                num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(num_samples))
        if not isinstance(replacement, bool):
            raise ValueError("replacement should be a boolean value, but got "
                             "replacement={}".format(replacement))
        self.weights = torch.as_tensor(weights, dtype=torch.double)
        self.num_samples = num_samples
        self.replacement = replacement
        self.generator = generator

    def __iter__(self):
        rand_tensor = torch.multinomial(self.weights, self.num_samples, self.replacement, generator=self.generator)
        return iter(rand_tensor.tolist())

    def __len__(self):
        return self.num_samples
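
A short sketch tying this back to the cat/dog/pig/sheep example above (the label list and proportions are illustrative):

import torch
from torch.utils.data import WeightedRandomSampler

# class proportions from the example: cat/dog 0.1 each, pig 0.3, sheep 0.5
class_weight = {'cat': 1 / 0.1, 'dog': 1 / 0.1, 'pig': 1 / 0.3, 'sheep': 1 / 0.5}
labels = ['cat', 'cat', 'cat', 'dog', 'sheep', 'sheep', 'sheep', 'pig', 'dog', 'dog']
weights = [class_weight[l] for l in labels]  # [10, 10, 10, 10, 2, 2, 2, 3.33, 10, 10]

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
print(list(sampler))  # 10 indices; under-represented classes are drawn more often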

5、Random Sampler

The random sampling method. The key distinction is whether sampling is done with replacement, controlled by the replacement flag; the main functions used are torch.randint() (with replacement) and torch.randperm() (without).

from typing import Optional, Sized
import torch
from torch.utils.data import Sampler

class RandomSampler(Sampler[int]):
    r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.
    If with replacement, then user can specify :attr:`num_samples` to draw.

    Args:
        data_source (Dataset): dataset to sample from
        replacement (bool): samples are drawn on-demand with replacement if ``True``, default=``False``
        num_samples (int): number of samples to draw, default=`len(dataset)`. This argument
            is supposed to be specified only when `replacement` is ``True``.
        generator (Generator): Generator used in sampling.
    """
    data_source: Sized
    replacement: bool

    def __init__(self, data_source: Sized, replacement: bool = False,
                 num_samples: Optional[int] = None, generator=None) -> None:
        self.data_source = data_source
        self.replacement = replacement
        self._num_samples = num_samples
        self.generator = generator

        if not isinstance(self.replacement, bool):
            raise TypeError("replacement should be a boolean value, but got "
                            "replacement={}".format(self.replacement))

        if self._num_samples is not None and not replacement:
            raise ValueError("With replacement=False, num_samples should not be specified, "
                             "since a random permute will be performed.")

        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(self.num_samples))

    @property
    def num_samples(self) -> int:
        # dataset size might change at runtime
        if self._num_samples is None:
            return len(self.data_source)
        return self._num_samples

    def __iter__(self):
        n = len(self.data_source)
        if self.generator is None:
            generator = torch.Generator()
            generator.manual_seed(int(torch.empty((), dtype=torch.int64).random_().item()))
        else:
            generator = self.generator
        if self.replacement:
            # draw on demand in chunks of 32 rather than all num_samples at once
            for _ in range(self.num_samples // 32):
                yield from torch.randint(high=n, size=(32,), dtype=torch.int64, generator=generator).tolist()
            yield from torch.randint(high=n, size=(self.num_samples % 32,), dtype=torch.int64, generator=generator).tolist()
        else:
            yield from torch.randperm(n, generator=generator).tolist()

    def __len__(self):
        return self.num_samples
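
For context, this is the sampler DataLoader builds internally when shuffle=True, so the two loaders below are equivalent up to the random seed (dataset is a placeholder for any Dataset):

from torch.utils.data import DataLoader, RandomSampler

loader_a = DataLoader(dataset, batch_size=4, shuffle=True)
loader_b = DataLoader(dataset, batch_size=4, sampler=RandomSampler(dataset))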

6、Sequential Sampler

The sequential sampler, as the name implies, draws samples in the order in which the data is stored, returning each sample's index in turn. The source code is as follows:

from typing import Sized
from torch.utils.data import Sampler

class SequentialSampler(Sampler[int]):
    r"""Samples elements sequentially, always in the same order.

    Args:
        data_source (Dataset): dataset to sample from
    """
    data_source: Sized

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))  # return sample indices in ascending order

    def __len__(self) -> int:
        return len(self.data_source)
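
For context, DataLoader falls back to SequentialSampler internally when shuffle=False and no sampler is passed (dataset is a placeholder here):

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=4)  # shuffle defaults to False -> SequentialSampler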

7、Subset Random Sampler

Note that __iter__() does not return the random permutation itself: torch.randperm() generates a random order over positions, which is used to index into indices, so what comes back are the elements of indices, shuffled. Sampling is still without repetition, again thanks to randperm(). As commonly described online, SubsetRandomSampler is used to split a dataset into training, validation, and test sets: each split gets its own slice of indices, and the sampler shuffles within it. The class source is below, followed by a sketch that splits the data into train and val parts:

import torch
from torch.utils.data import Sampler

class SubsetRandomSampler(Sampler):
    r"""Samples elements randomly from a given list of indices, without replacement.
    Arguments:
        indices (sequence): a sequence of indices
    """
    def __init__(self, indices):
        # a slice of the dataset's indices, e.g. the train split or the val split
        self.indices = indices
    def __iter__(self):
        # yield the elements of `indices` in shuffled order, without repetition
        return (self.indices[i] for i in torch.randperm(len(self.indices)))
    def __len__(self):
        return len(self.indices)
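
A minimal train/val split sketch (the 80/20 ratio and names are illustrative; dataset is a placeholder for any Dataset):

import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

n = len(dataset)
perm = torch.randperm(n).tolist()     # shuffle once, then carve into splits
split = int(0.8 * n)                  # 80% train, 20% val

train_loader = DataLoader(dataset, batch_size=32, sampler=SubsetRandomSampler(perm[:split]))
val_loader = DataLoader(dataset, batch_size=32, sampler=SubsetRandomSampler(perm[split:]))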

Reference:
pytorch source code reading (3) Sampler class and 4 sampling methods

Origin: blog.csdn.net/caobin_cumt/article/details/127459117