Dataset reading and division, ImageFolder(), custom dataset, TensorDataset, StratifiedShuffleSplit

Table of contents

Import packages

data set

download dataset

Analysis of data set characteristics

torchvision.datasets.ImageFolder()

Dataset organization

train of thought

Read tags by image name

Create tag subfolders

Dataset partition

Call the data set processing function

read dataset 

torchvision.datasets.ImageFolder() source code and interpretation

source code

interpret

Features of torchvision.datasets.ImageFolder()

Handwritten ImageFolder()

custom data set

Dataset processing

Convert label to int type

Custom Dataset Functions 

read dataset 

Dataset partition function

train_test_split function

usage

read dataset

existing problems

StratifiedShuffleSplit function

usage

read dataset

Why do reset_index() operation

Analysis

random_split()

usage

to divide

read dataset

Analysis

Other datasets

TensorDataset

source code

 Create a dataset

use dataloader

One problem: wrapping a single tensor


Import packages

import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import pandas as pd
import os
import collections
import shutil
import math
from torch.utils.data import DataLoader, Dataset
from PIL import Image


data_dir = os.path.join('data', 'dog-breed-identification')  # folder that contains the data set
label_csv = 'labels.csv'  # name of the label file

data set

Original tutorial: 13.13. Kaggle Competition in Practice: Image Classification (CIFAR-10) and 13.14. Kaggle Competition in Practice: Dog Breed Identification (ImageNet Dogs), both from the Dive into Deep Learning 2.0.0-beta1 documentation

Reference: Hands-on Deep Learning Kaggle: Image Classification (CIFAR-10 and Dog Breed Identification)_iwill323's Blog-CSDN Blog

download dataset

The data set pages are: CIFAR-10 - Object Recognition in Images | Kaggle, and Dog Breed Identification | Kaggle

Download the data set and unzip it in ../data. The whole data set is then found at the following paths:

  •     ../data/dog-breed-identification/labels.csv
  •     ../data/dog-breed-identification/sample_submission.csv
  •     ../data/dog-breed-identification/train
  •     ../data/dog-breed-identification/test

The train/ and test/ folders contain the training and test dog images respectively, and labels.csv contains the labels of the training images. The image file names in the train folder are hash-like strings that carry no label information.

Analysis of data set characteristics

The competition data set is divided into a training set and a test set, which contain 10222 and 10357 three-channel RGB (color) JPEG images respectively. The training set covers 120 dog breeds, such as Labrador, Poodle, Dachshund, Samoyed, Husky, Chihuahua and Yorkshire Terrier.

  • Read labels.csv with pandas
df = pd.read_csv(os.path.join(data_dir, label_csv))
df.head()

  • Number of classes
breeds = df.breed.unique()
len(breeds)
120
  • How many samples of each category are there in the training set
count_train = collections.Counter(df['breed'])
count_train.most_common() 
[('scottish_deerhound', 126),
 ('maltese_dog', 117),
 ('afghan_hound', 116),
 ……
 ('komondor', 67),
 ('brabancon_griffon', 67),
 ('eskimo_dog', 66),
 ('briard', 66)]
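
To put a number on the imbalance, the counts can be summarized as below. This is only a sketch: the toy label list is an assumption that reuses three of the counts printed above, not the real labels.csv.

```python
import collections

# Toy stand-in for df['breed'], reusing three counts from the output above.
labels = (['scottish_deerhound'] * 126
          + ['maltese_dog'] * 117
          + ['briard'] * 66)

counts = collections.Counter(labels)
ranked = counts.most_common()        # sorted from most to least frequent
largest_name, largest = ranked[0]
smallest_name, smallest = ranked[-1]

# Ratio between the largest and the smallest class.
imbalance = largest / smallest
print(largest_name, largest, smallest_name, smallest, round(imbalance, 2))
```

A ratio close to 2 suggests that a split which ignores class sizes may leave some breeds noticeably under-represented in the validation set.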

torchvision.datasets.ImageFolder()

Dataset organization

train of thought

torchvision.datasets.ImageFolder() expects one subfolder per class label under the root directory, with the images of each class stored in the corresponding subfolder. It is therefore necessary to create a folder for each label, traverse the samples, and copy each sample into the folder of its label. In this example, the data set is also split into training and validation sets while the images are being copied.

Read tags by image name

To create the per-class subfolders, the class label has to be looked up from the file name of each sample image. pandas usually selects rows by index or position, but it can also look up one column by the value of another (for example by setting that column as the index, or with a boolean mask). The tutorial takes a simpler route: the read_csv_labels() function below returns a dictionary that maps each file name to its label.

def read_csv_labels(fname):
    """Read fname and return a dictionary mapping each file name to its label."""
    with open(fname, 'r') as f:
        # Skip the header line (column names)
        lines = f.readlines()[1:]
    tokens = [l.rstrip().split(',') for l in lines]
    return dict(((name, label) for name, label in tokens))
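
For comparison, the same lookup can be done with pandas. The sketch below is an alternative, not the tutorial's method; the CSV text and the shortened ids are made up for illustration, and only the two column names (id, breed) come from labels.csv.

```python
import io
import pandas as pd

# Toy CSV text standing in for labels.csv (same two columns: id, breed).
csv_text = "id,breed\n000bec18,boston_bull\n001513df,dingo\n001cdf01,pekinese\n"
df = pd.read_csv(io.StringIO(csv_text))

# Option 1: a plain dict, equivalent to what read_csv_labels() returns.
labels = dict(zip(df['id'], df['breed']))

# Option 2: a Series indexed by id, queried with .loc.
breed_by_id = df.set_index('id')['breed']

print(labels['001513df'])
print(breed_by_id.loc['001cdf01'])
```

Either form gives a name-to-label lookup; the dict is enough for the folder reorganization that follows.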

Create tag subfolders

The copyfile function copies an image from its original location filename into the folder target_dir. Passing the label folder as target_dir files the image under its label.

def copyfile(filename, target_dir):
    """Copy a file into the target directory."""
    os.makedirs(target_dir, exist_ok=True)  # create the folder if it does not exist
    shutil.copy(filename, target_dir)

Dataset partition

The data set only provides train and test sets, but training normally also requires a validation set, so a validation set has to be split off. On platforms such as Google Colab, the training, validation and test sets are often compressed and uploaded separately, so it can be useful to split them and save them in separate folders.

  • Define the reorg_train_valid function to split the validation set off from the original training set. Its valid_ratio parameter is the ratio of the number of validation samples to the number of samples in the original training set. More specifically, let n be the number of images in the class with the fewest samples and r the ratio; the validation set then takes max(⌊nr⌋, 1) images from each class. In the tutorial's CIFAR-10 example with valid_ratio=0.1, the original training set has 50,000 images, so 45,000 images end up in train_valid_test/train for training and the remaining 5,000 in train_valid_test/valid for validation. For the dog data set used here, n = 66, so 6 images per class (720 in total) go to valid.
  • Define the reorg_test function to copy the test set into a new folder. Note that the test folder must also contain a subfolder (named unknown here) that acts as the class folder. Otherwise torchvision.datasets.ImageFolder() raises an error, because its find_classes() function builds the list of classes from the subfolder names under the root folder and fails if it finds none.
def reorg_train_valid(data_dir, labels, valid_ratio):
    """Split the validation set off from the original training set."""
    # Number of samples in the smallest class of the training set
    n = collections.Counter(labels.values()).most_common()[-1][1]
    # Number of validation samples per class
    n_valid_per_label = max(1, math.floor(n * valid_ratio))
    label_count = {}
    for train_file in os.listdir(os.path.join(data_dir, 'train')):
        label = labels[train_file.split('.')[0]]  # look up the label by file name
        fname = os.path.join(data_dir, 'train', train_file)
        copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                     'train_valid', label))
        if label not in label_count or label_count[label] < n_valid_per_label:
            copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                         'valid', label))
            label_count[label] = label_count.get(label, 0) + 1
        else:
            copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                         'train', label))
    return n_valid_per_label


def reorg_test(data_dir):
    """Reorganize the test set so it is easy to read at prediction time."""
    for test_file in os.listdir(os.path.join(data_dir, 'test')):
        copyfile(os.path.join(data_dir, 'test', test_file),
                 os.path.join(data_dir, 'train_valid_test', 'test',
                              'unknown'))
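
The per-class validation count can be checked with a few lines of arithmetic. The sketch below assumes a toy labels dictionary whose two class sizes mirror the extremes reported earlier (126 and 66); the totals 120 and 10222 are the class and image counts given above.

```python
import collections
import math

# Toy stand-in for the dict returned by read_csv_labels(): file name -> label.
labels = {f'a{i}': 'scottish_deerhound' for i in range(126)}
labels.update({f'b{i}': 'briard' for i in range(66)})

valid_ratio = 0.1
# labels.values() is a dict_values view; Counter accepts any iterable.
n = collections.Counter(labels.values()).most_common()[-1][1]
n_valid_per_label = max(1, math.floor(n * valid_ratio))
print(n, n_valid_per_label)

# With 120 classes and 10222 training images, as in this data set:
n_valid_total = 120 * n_valid_per_label
n_train_total = 10222 - n_valid_total
print(n_valid_total, n_train_total)
```

Because every class contributes the same number of validation images, the validation set is balanced even though the training set is not.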

Call the data set processing function

labels.values() returns a dict_values view (it is labels.values without the call that is a builtin_function_or_method). Any iterable like this can be passed directly to collections.Counter().

def reorg_cifar10_data(data_dir, label_csv, valid_ratio):
    labels = read_csv_labels(os.path.join(data_dir, label_csv))
    reorg_train_valid(data_dir, labels, valid_ratio)
    reorg_test(data_dir)


batch_size = 128
valid_ratio = 0.1
reorg_cifar10_data(data_dir, label_csv, valid_ratio)

Running the code creates four folders: test, train (9502 samples), valid (720 samples) and train_valid, where train_valid is the union of train and valid. The train_valid folder exists because, once the validation set has been used to pick the best hyperparameters, the model is retrained on train_valid to obtain the final model.

Under each folder, 120 classified folders are created according to the category, which is the requirement of the torchvision.datasets.ImageFolder() function.

read dataset 

Read the data sets of raw images, where each sample consists of an image and a label. Note that when the validation set is used for model evaluation during hyperparameter tuning, no random image augmentation should be introduced, so the valid data set uses test_transform. Both train_transform and test_transform are defined further below and must be defined before this code runs.

train_ds, train_valid_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=train_transform) for folder in ['train', 'train_valid']]

valid_ds, test_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=test_transform) for folder in ['valid', 'test']]

train_iter, train_valid_iter = [torch.utils.data.DataLoader(
    dataset, batch_size, shuffle=True, drop_last=True)
    for dataset in (train_ds, train_valid_ds)]

valid_iter = torch.utils.data.DataLoader(valid_ds, batch_size, shuffle=False,
                                         drop_last=True)

test_iter = torch.utils.data.DataLoader(test_ds, batch_size, shuffle=False,
                                        drop_last=False)

The image augmentation used above is defined as follows:

img_size = 224  # other values also work
train_transform = transforms.Compose([    
    transforms.RandomResizedCrop(img_size, ratio=(3.0/4.0, 4.0/3.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(30),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], 
                         [0.229, 0.224, 0.225])
])

test_transform = transforms.Compose([
    transforms.Resize(img_size),
    transforms.CenterCrop(img_size),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], 
                         [0.229, 0.224, 0.225])
])

torchvision.datasets.ImageFolder() source code and interpretation

source code

ImageFolder is a subclass of the DatasetFolder class

IMG_EXTENSIONS = (".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp")

def pil_loader(path: str) -> Image.Image: # read the image at the given path
    with open(path, "rb") as f:
        img = Image.open(f)
        return img.convert("RGB")

class ImageFolder(DatasetFolder):    
    def __init__(
        self,
        root: str,
        transform: Optional[Callable] = None,
        target_transform: Optional[Callable] = None,
        loader: Callable[[str], Any] = default_loader,
        is_valid_file: Optional[Callable[[str], bool]] = None,
    ):
        super().__init__(
            root,
            loader,
            IMG_EXTENSIONS if is_valid_file is None else None,
            transform=transform,
            target_transform=target_transform,
            is_valid_file=is_valid_file,
        )
        self.imgs = self.samples

loader defaults to default_loader, which calls the pil_loader() function defined above unless the accimage backend is selected; it reads the image at the given path. IMG_EXTENSIONS lists the file extensions that are accepted as images. The remaining arguments forwarded to the parent class's __init__ method are the ones supplied by the caller, including root (the path) and transform (the transformation applied to each image); compare the arguments passed to ImageFolder() in the "read dataset" code above.
Next, look at the definition of the DatasetFolder class (source code):

class DatasetFolder(VisionDataset):
    def __init__(
        self,
        root: str,
        loader: Callable[[str], Any],
        extensions: Optional[Tuple[str, ...]] = None,
        transform: Optional[Callable] = None,
        target_transform: Optional[Callable] = None,
        is_valid_file: Optional[Callable[[str], bool]] = None,
    ) -> None:
        super().__init__(root, transform=transform, target_transform=target_transform)
        classes, class_to_idx = self.find_classes(self.root)
        samples = self.make_dataset(self.root, class_to_idx, extensions, is_valid_file)

        self.loader = loader
        self.extensions = extensions

        self.classes = classes
        self.class_to_idx = class_to_idx
        self.samples = samples
        self.targets = [s[1] for s in samples]

    @staticmethod
    def make_dataset(
        directory: str,
        class_to_idx: Dict[str, int],
        extensions: Optional[Tuple[str, ...]] = None,
        is_valid_file: Optional[Callable[[str], bool]] = None,
    ) -> List[Tuple[str, int]]:        
        if class_to_idx is None:
            raise ValueError("The class_to_idx parameter cannot be None.")
        return make_dataset(directory, class_to_idx, extensions=extensions, is_valid_file=is_valid_file)

    def find_classes(self, directory: str) -> Tuple[List[str], Dict[str, int]]:
        return find_classes(directory)

    def __getitem__(self, index: int) -> Tuple[Any, Any]:
        path, target = self.samples[index]
        sample = self.loader(path)
        if self.transform is not None:
            sample = self.transform(sample)
        if self.target_transform is not None:
            target = self.target_transform(target)

        return sample, target

    def __len__(self) -> int:
        return len(self.samples)

The following is the source code of the auxiliary function used

  • has_file_allowed_extension checks, from the file name, whether a file has one of the allowed image extensions
  • find_classes takes the root folder, collects the names of the class subfolders inside it, and assigns an index to each class
  • make_dataset takes the root folder, the class-to-index dictionary and the list of extensions, and returns a list of (path_to_sample, class_index) tuples
def has_file_allowed_extension(filename: str, extensions: Union[str, Tuple[str, ...]]) -> bool:
    return filename.lower().endswith(extensions if isinstance(extensions, str) else tuple(extensions))

# Checks if a file is an allowed image extension
def is_image_file(filename: str) -> bool:
    return has_file_allowed_extension(filename, IMG_EXTENSIONS)

def find_classes(directory: str) -> Tuple[List[str], Dict[str, int]]:    
    classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
    if not classes:
        raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")

    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
    return classes, class_to_idx

def make_dataset(
    directory: str,
    class_to_idx: Optional[Dict[str, int]] = None,
    extensions: Optional[Union[str, Tuple[str, ...]]] = None,
    is_valid_file: Optional[Callable[[str], bool]] = None,
) -> List[Tuple[str, int]]:    
    directory = os.path.expanduser(directory)

    if class_to_idx is None:
        _, class_to_idx = find_classes(directory)
    elif not class_to_idx:
        raise ValueError("'class_to_index' must have at least one entry to collect any samples.")

    both_none = extensions is None and is_valid_file is None
    both_something = extensions is not None and is_valid_file is not None
    if both_none or both_something:
        raise ValueError("Both extensions and is_valid_file cannot be None or not None at the same time")

    if extensions is not None:
        def is_valid_file(x: str) -> bool:
            return has_file_allowed_extension(x, extensions)  # type: ignore[arg-type]

    is_valid_file = cast(Callable[[str], bool], is_valid_file)

    instances = []
    available_classes = set()
    for target_class in sorted(class_to_idx.keys()):
        # 1st for loop: iterate over the class names and enter each class folder
        class_index = class_to_idx[target_class]
        target_dir = os.path.join(directory, target_class)
        if not os.path.isdir(target_dir):
            continue
        for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
            # 2nd for loop: walk each class folder and its subfolders; fnames are the files in them
            for fname in sorted(fnames):
                # 3rd for loop: read each file name
                path = os.path.join(root, fname)
                if is_valid_file(path):
                    item = path, class_index
                    instances.append(item)

                    if target_class not in available_classes:
                        available_classes.add(target_class)

    empty_classes = set(class_to_idx.keys()) - available_classes
    if empty_classes:
        msg = f"Found no valid file for the classes {', '.join(sorted(empty_classes))}. "
        if extensions is not None:
            msg += f"Supported extensions are: {extensions if isinstance(extensions, str) else ', '.join(extensions)}"
        raise FileNotFoundError(msg)

    return instances

interpret

 The source code is a bit complicated, here is a simplified version:

def find_classes(dir):
    classes = [d for d in os.listdir(dir) if os.path.isdir(os.path.join(dir, d))]
    classes.sort()
    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
    return classes, class_to_idx

def make_dataset(directory, class_to_idx, extensions) :    
    directory = os.path.expanduser(directory)
    
    instances = []
    available_classes = set()
    for target_class in sorted(class_to_idx.keys()):
        # 1st for loop: iterate over the class names and enter each class folder
        class_index = class_to_idx[target_class]
        target_dir = os.path.join(directory, target_class)
        if not os.path.isdir(target_dir):
            continue
        for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
            # 2nd for loop: walk each class folder and its subfolders; fnames are the files in them
            for fname in sorted(fnames):
                # 3rd for loop: read each file name
                path = os.path.join(root, fname)
                if has_file_allowed_extension(path, extensions):
                    item = path, class_index
                    instances.append(item)

                    if target_class not in available_classes:
                        available_classes.add(target_class)
    
    # Raise an error if some class has no matching file
    empty_classes = set(class_to_idx.keys()) - available_classes
    if empty_classes:
        msg = f"Found no valid file for the classes {', '.join(sorted(empty_classes))}. "
        if extensions is not None:
            msg += f"Supported extensions are: {extensions if isinstance(extensions, str) else ', '.join(extensions)}"
        raise FileNotFoundError(msg)

    return instances

IMG_EXTENSIONS = (".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp")

def has_file_allowed_extension(filename, extensions):
    return filename.lower().endswith(extensions if isinstance(extensions, str) else tuple(extensions))

The code uses os.walk(); for details, see: Detailed understanding of os.walk() (understand in seconds)_Unbearable Blog-CSDN Blog_os.walk()

>>classes, class_to_idx = find_classes(os.path.join(data_dir, 'train_valid_test', 'train'))
>>classes

['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

>>class_to_idx

{'airplane': 0, 'automobile': 1, 'bird': 2, 'cat': 3, 'deer': 4, 'dog': 5, 'frog': 6, 'horse': 7, 'ship': 8, 'truck': 9}

>>samples = make_dataset(os.path.join(data_dir, 'train_valid_test', 'train'), class_to_idx, IMG_EXTENSIONS)
>>samples[:4]

[('.\\data\\cifar-10\\train_valid_test\\train\\airplane\\14469.png', 0), 
('.\\data\\cifar-10\\train_valid_test\\train\\airplane\\14480.png', 0),
('.\\data\\cifar-10\\train_valid_test\\train\\airplane\\14483.png', 0), 
('.\\data\\cifar-10\\train_valid_test\\train\\airplane\\14487.png', 0)]

As can be seen from the output (produced on the CIFAR-10 data set of the same tutorial):

  • classes is a list of the names of the folders that hold the images of each class;
  • class_to_idx is a dictionary that maps each class name to the index assigned to it;
  • samples is a list of tuples, one per image across all classes. Each tuple holds the path of an image and its class index.
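
The behaviour described above can be reproduced without any real images. The following sketch builds a throwaway directory tree and runs a reduced version of the two helpers; list_samples is a hypothetical name for a cut-down make_dataset, and the folder and file names are made up.

```python
import os
import tempfile

def find_classes(directory):
    """Simplified version: sorted sub-folder names and their indices."""
    classes = sorted(e.name for e in os.scandir(directory) if e.is_dir())
    return classes, {name: i for i, name in enumerate(classes)}

def list_samples(directory, class_to_idx, extensions=('.jpg', '.png')):
    """Collect (path, class_index) pairs, sorted by class, then by file name."""
    samples = []
    for cls in sorted(class_to_idx):
        for root, _, fnames in sorted(os.walk(os.path.join(directory, cls))):
            for fname in sorted(fnames):
                if fname.lower().endswith(extensions):
                    samples.append((os.path.join(root, fname), class_to_idx[cls]))
    return samples

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: two class folders holding empty placeholder files.
    layout = {'dog': ['2.png', '10.png'], 'cat': ['a.png', 'notes.txt']}
    for cls, files in layout.items():
        os.makedirs(os.path.join(root, cls))
        for f in files:
            open(os.path.join(root, cls, f), 'w').close()

    classes, class_to_idx = find_classes(root)
    samples = list_samples(root, class_to_idx)
    summary = [(os.path.basename(p), idx) for p, idx in samples]

print(classes)
print(class_to_idx)
print(summary)
```

The non-image file is skipped, the class folders are numbered alphabetically, and within dog the file 10.png comes before 2.png because names are compared as strings.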

With this information, the first two lines of the __getitem__ method can load a sample:

path, target = self.samples[index]
sample = self.loader(path)

In this way the image and its class are obtained. The code also shows that ImageFolder sorts the class folders and the files in each folder before reading them, which explains the order of the samples when the test set is read.

Features of torchvision.datasets.ImageFolder()

  • Each element is a tuple (the outputs in this list were produced on the CIFAR-10 data set)

>>len(train_ds)

45000

>>type(train_ds[0]) # Each element of train_ds is a tuple

<class 'tuple'>
  • torchvision.datasets.ImageFolder() automatically converts string labels to int:

>>train_ds[0][0].shape # The first element of the tuple is the image tensor

torch.Size([3, 32, 32])

>>train_ds[0][1] # The second element of the tuple is the label, of type int

0
  • A data set created by torchvision.datasets.ImageFolder() automatically gets a classes attribute:

>>train_ds.classes

['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
  • ImageFolder() reads samples in sorted string (lexicographic) order
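
String order differs from numeric order when file names are numbers, which matters when predictions are matched back to test file names. A small sketch with made-up file names:

```python
# Hypothetical numeric file names, as found in data sets that number their images.
names = ['1.png', '2.png', '10.png', '11.png']

# Plain string sort: characters are compared one by one.
str_order = sorted(names)

# Numeric sort on the part before the extension, for comparison.
num_order = sorted(names, key=lambda n: int(n.split('.')[0]))

print(str_order)
print(num_order)
```

When saving predictions, the file names should be listed with the same plain string sort so that they line up with the order in which the samples were read.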

Handwritten ImageFolder()

Below is a hand-written dataset class that plays the same role as ImageFolder():

import glob

class AdvDataset(Dataset):
    def __init__(self, data_dir, transform):
        self.images = []
        self.labels = []
        self.names = []
        '''
        data_dir
        ├── class_dir
        │   ├── class1.png
        │   ├── ...
        │   ├── class20.png
        '''
        for i, class_dir in enumerate(sorted(glob.glob(f'{data_dir}/*'))):
            images = sorted(glob.glob(f'{class_dir}/*'))
            self.images += images
            self.labels += ([i] * len(images))  # the i-th class folder read gets class index i
            self.names += [os.path.relpath(imgs, data_dir) for imgs in images]  # path of imgs relative to data_dir
        self.transform = transform
    def __getitem__(self, idx):
        image = self.transform(Image.open(self.images[idx]))
        label = self.labels[idx]
        return image, label
    def __getname__(self):
        return self.names
    def __len__(self):
        return len(self.images)
 
# root is the directory that contains the class folders; transform is any torchvision transform
adv_set = AdvDataset(root, transform=transform)
adv_names = adv_set.__getname__()
adv_loader = DataLoader(adv_set, batch_size=batch_size, shuffle=False)
 
print(f'number of images = {len(adv_set)}')

custom data set

In a sense, writing a custom data set amounts to writing ImageFolder() by hand.

Dataset processing

Read the label file for the training images:

df = pd.read_csv(os.path.join(data_dir, label_csv))
df.head()
   id                                breed
0  000bec180eb18c7604dcecc8fe0dba07  boston_bull
1  001513dfcb2ffafc82cccf4d8bbaba97  dingo
2  001cdf01b096e06d78e9e5112d419397  pekinese
3  00214f311d5d2247d5dfe4fe24b2303d  bluetick
4  0021f9ceb3235effd7fcde7f7538ed62  golden_retriever

The label breed is str type and needs to be converted to int type

Convert label to int type

Get the list of breeds, build a breed-to-index dictionary from it, and then derive the numeric label column label_idx from the breed column.

breeds = df.breed.unique()  # length 120, i.e. the number of classes
breeds.sort()
breed2idx = dict((breed, i) for i, breed in enumerate(breeds))
df['label_idx'] = [breed2idx[b] for b in df.breed]

Before sorting, breeds is:

array(['boston_bull', 'dingo', 'pekinese', 'bluetick', 'golden_retriever',……])

The elements appear in the order in which they first occur in df.breed, so without sorting boston_bull would get number 0, dingo number 1, and so on. It is usually preferable to number the classes in alphabetical order, so breeds is sorted first, which gives the following df:

                    id                               breed           label_idx
0      000bec180eb18c7604dcecc8fe0dba07               boston_bull         19
1      001513dfcb2ffafc82cccf4d8bbaba97                     dingo         37
2      001cdf01b096e06d78e9e5112d419397                  pekinese         85
...                                 ...                       ...        ...
10219  ffe2ca6c940cddfee68fa3cc6c63213f                  airedale          3
10220  ffe5f6d8e2bff356e9482a80a6e29aac        miniature_pinscher         75
10221  fff43b07992508bc822f33d8ffd902ae  chesapeake_bay_retriever         28
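
The encoding step can be tried on a tiny frame. The sketch below uses a made-up five-row DataFrame and also shows pd.factorize(..., sort=True) as a possible one-call alternative; the reverse dictionary idx2breed is an added helper, not something the original defines.

```python
import pandas as pd

# Toy stand-in for df with only the breed column.
df = pd.DataFrame({'breed': ['boston_bull', 'dingo', 'pekinese', 'dingo', 'airedale']})

# Same idea as above: sorted unique breeds -> index dictionary -> numeric column.
breeds = sorted(df['breed'].unique())
breed2idx = {breed: i for i, breed in enumerate(breeds)}
df['label_idx'] = df['breed'].map(breed2idx)

# Alternative: factorize with sort=True assigns codes in sorted order of the values.
codes, uniques = pd.factorize(df['breed'], sort=True)

# Hypothetical reverse lookup, handy when turning predictions back into names.
idx2breed = {i: breed for breed, i in breed2idx.items()}

print(df)
print(list(codes), list(uniques))
```

Keeping the mapping sorted means the indices agree with what ImageFolder() would assign to folders of the same names.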

Custom Dataset Functions 

The sample name and the corresponding label are saved in df. Using df, you can read and process the picture from the training set folder (requires the path img_path), and return the picture and the corresponding label.

class DogDataset(Dataset):
    def __init__(self, df, img_path, transform=None):
        self.df = df
        self.img_path = img_path
        self.transform = transform
        
    def __len__(self):
        return self.df.shape[0]
    
    def __getitem__(self, idx):
        path = os.path.join(self.img_path, self.df.id[idx]) + '.jpg'        
        img = Image.open(path)
        if self.transform:
            img = self.transform(img)
        
        label = self.df.label_idx[idx]
        return img, label        

For the test set there is no df, so os.listdir() is used to get the list of image names, and the name at position idx is taken from that list. Note that the list should be sorted, so that when the predictions are saved, the (sorted) image names can be matched with the outputs of the model.

class DogDatasetTest(Dataset):
    def __init__(self, img_path, transform=None):            
        self.img_path = img_path
        self.img_list = os.listdir(img_path)
        self.img_list.sort()
        self.transform = transform
        
    def __len__(self):
        return len(self.img_list)
        
    def __getitem__(self, idx):
        path = os.path.join(self.img_path, self.img_list[idx])
        img = Image.open(path)
        if self.transform:     
            img = self.transform(img)
            
        return img

read dataset 

train_val_df = df
train_val_dataset = DogDataset(train_val_df, os.path.join(data_dir, 'train'), train_transform) 
test_dataset = DogDatasetTest(os.path.join(data_dir, 'test'), test_transform)

batch_size = 32
train_val_iter = DataLoader(train_val_dataset, batch_size, shuffle=True, drop_last = True)
test_iter = DataLoader(test_dataset, batch_size, shuffle=False, drop_last = False)

Dataset partition function

train_test_split function

usage

train_test_split is a commonly used function for hold-out validation: it randomly splits the samples into train data and test data in the given proportion.

X_train, X_test, y_train, y_test = train_test_split(
    train_data,
    train_target,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
)

train_data: the sample features to be split

train_target: the sample labels to be split

test_size: if a float, the proportion of samples placed in the test split; if an integer, the number of test samples

random_state: the seed of the random number generator. Fixing it guarantees the same split every time the experiment is repeated. To get a different split on each run, leave random_state unset; the split ratio stays the same, but the samples in each split differ.
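
The effect of random_state and of the two forms of test_size can be seen on toy data. In this sketch the ids and labels are made up; only the train_test_split parameters come from the signature above.

```python
from sklearn.model_selection import train_test_split

ids = list(range(20))            # hypothetical sample ids
labels = [i % 2 for i in ids]    # hypothetical binary labels

# Same seed twice: the two splits are identical.
a_train, a_val, _, _ = train_test_split(ids, labels, test_size=0.25, random_state=42)
b_train, b_val, _, _ = train_test_split(ids, labels, test_size=0.25, random_state=42)
same_split = (a_val == b_val)

# Integer test_size: an absolute number of validation samples.
c_train, c_val, _, _ = train_test_split(ids, labels, test_size=4, random_state=42)

print(same_split, len(a_train), len(a_val), len(c_val))
```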

from sklearn.model_selection import train_test_split
train_id, val_id, train_breed, val_breed=  train_test_split(df.id.values, df.breed.values, test_size=0.1)

>>len(val_id)

1023

>>len(train_id)

9199

>>val_id

['890efbec7147c2887c460be0af763381' 'c7441fba1f18864b59b1d474936def91' 
 '63dd3e15f7fe4b3b3e9a69530e8d36b3' ... 'e24af0affe6c7a51b3e8ed9c30b090b7' 
 '3d78ff549552e90b9a01eefb12548283' 'cc7ae3da3bebcc4acb10128078cdf29a']

Note that the function returns the values of the id column, not the row indices of the table.

read dataset

train_df = pd.DataFrame({'id':train_id})
train_df['label_idx'] = [breed2idx[breed] for breed in train_breed]
val_df = pd.DataFrame({'id':val_id})
val_df['label_idx'] = [breed2idx[breed] for breed in val_breed]
train_dataset = DogDataset(train_df, os.path.join(data_dir, 'train'), train_transform) 
val_dataset = DogDataset(val_df, os.path.join(data_dir, 'train'), test_transform) 

existing problems

Calculate, for each class, the ratio of its sample count in the whole labelled set to its count in the validation set:

train_id, val_id, train_breed, val_breed=  train_test_split(df.id.values, df.breed.values, test_size=0.1, random_state=42)
count_train = collections.Counter(df['breed'])
df_val = pd.DataFrame({'breed':val_breed})
count_val = collections.Counter(df_val['breed'])
ratio = []
for i in count_train:
        ratio.append(count_train[i]/ count_val[i])

print(min(ratio), max(ratio))
4.944444444444445 47.0

A proportional split would give a ratio of about 10 for every class, but here it ranges from about 4.9 to 47: the split does not take the class distribution of the training set into account.
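
One possible remedy within train_test_split itself is its stratify argument, which asks for a split that preserves class proportions. The sketch below uses made-up, deliberately imbalanced labels rather than the dog data.

```python
import collections
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 100 samples of 'big', 20 of 'small'.
labels = ['big'] * 100 + ['small'] * 20
ids = list(range(len(labels)))

# stratify=labels keeps the class proportions the same in both splits.
train_id, val_id, train_y, val_y = train_test_split(
    ids, labels, test_size=0.1, random_state=42, stratify=labels)

count_train = collections.Counter(train_y)
count_val = collections.Counter(val_y)
print(count_train, count_val)
```

Each class gives up the same fraction of its samples to the validation set, so the per-class ratio is the same for both classes.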

Reference: train_test_split() function_Peng Da Da Da Da Blog-CSDN Blog_train_test_split

StratifiedShuffleSplit function

Analyzing the data set shows that the largest class (126 images) has almost twice as many samples as the smallest (66 images). With the approach used in the torchvision.datasets.ImageFolder() example, the same number of samples is drawn from every class, so the class distributions of the validation set and the training set differ. It is often preferable to draw more validation samples from large classes and fewer from small ones. In that case the StratifiedShuffleSplit function can be used to split the validation set off from the training set, and the results are the subsets train_df and val_df of df.

Both StratifiedShuffleSplit and train_test_split come from the sklearn.model_selection module and are used to split a data set (here, the training set into a training set and a validation set). The difference is that StratifiedShuffleSplit performs stratified sampling, whereas train_test_split by default performs plain random sampling. For an intuitive illustration, see: Detailed understanding of the StratifiedShuffleSplit() function_wang_xuecheng's blog-CSDN blog_stratifiedshufflesplit

usage

StratifiedShuffleSplit(
    n_splits=10,
    *,
    test_size=None,
    train_size=None,
    random_state=None,
)

n_splits is the number of train/validation splits to generate, and test_size is the proportion of the validation set. The following code splits df once, with the validation set accounting for 10%.

from sklearn.model_selection import StratifiedShuffleSplit
stratified_split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
splits = stratified_split.split(df.id, df.breed)
train_split_id, val_split_id = next(iter(splits)) 

>>train_split_id.shape

(9199,)

>>val_split_id.shape

(1023,)

>>train_split_id

[9556 2055 5652 ... 7133  366 4846]
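
The same call can be checked on toy data where the expected class counts are easy to verify by hand. The arrays below are made up; the point is that split() returns positional indices and that the second argument drives the stratification.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 100 samples of 'big', 20 of 'small'.
y = np.array(['big'] * 100 + ['small'] * 20)
X = np.zeros((len(y), 1))   # placeholder features; only the number of rows is used

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, val_idx = next(sss.split(X, y))

# The results are integer positions into X and y, not the values themselves.
n_small_val = int((y[val_idx] == 'small').sum())
n_big_val = int((y[val_idx] == 'big').sum())
print(len(train_idx), len(val_idx), n_big_val, n_small_val)
```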

read dataset

train_df = df.iloc[train_split_id].reset_index()
val_df = df.iloc[val_split_id].reset_index()

train_dataset = DogDataset(train_df, os.path.join(data_dir, 'train'), train_transform) 
val_dataset = DogDataset(val_df, os.path.join(data_dir, 'train'), test_transform) 

batch_size = 32
train_iter = DataLoader(train_dataset, batch_size, shuffle=False, drop_last = True)
val_iter = DataLoader(val_dataset, batch_size, shuffle=False, drop_last = True)

Why do reset_index() operation

Note that reset_index() is called after the split. For train_df, before reset_index():

                                    id            breed  label_idx
9556  efbabde6fc97bb48c8c8b6b75bfaea59       eskimo_dog         78
2055  332c413119b474653ecca0f358c85e1f  giant_schnauzer         29
5652  8e7256b23446acbd33967122787c1eb3  tibetan_mastiff        116

After reset_index():

   index                                id            breed  label_idx
0   9556  efbabde6fc97bb48c8c8b6b75bfaea59       eskimo_dog         78
1   2055  332c413119b474653ecca0f358c85e1f  giant_schnauzer         29
2   5652  8e7256b23446acbd33967122787c1eb3  tibetan_mastiff        116

If the index is not reset, iterating with DataLoader raises an error. To see this, create train_dataset from train_df without resetting the index and run the following command:

>>for i in range(100):train_dataset[i]

This command raises an error at train_dataset[12].

View the index of train_df and val_df:

>>train_df.index.sort_values()[0:25]

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 24, 25, 27],
           dtype='int64')

>>val_df.index.sort_values()[0:25]

Int64Index([ 12,  23,  26,  36,  46,  53,  67,  70,  75,  80, 102, 103, 110,
            121, 122, 125, 133, 137, 145, 154, 165, 169, 177, 181, 209],
           dtype='int64')

train_df has no row with index label 12 (that row went to val_df), so fetching train_dataset[12] fails, and DataLoader hits the same error when it requests that element. After reset_index() the index labels run from 0 to len(train_df) - 1 again and the problem disappears.
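The failure can be reproduced on a tiny DataFrame. This is a minimal sketch that assumes the dataset looks rows up by index label (for example with .loc); the helper lookup is hypothetical and only stands in for that kind of access inside a custom dataset.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "b", "c", "d"], "breed": ["x", "y", "x", "y"]})

# Selecting rows by position keeps the original index labels: 3, 0, 2.
subset = df.iloc[[3, 0, 2]]


def lookup(frame, i):
    """Label-based lookup, standing in for what a custom dataset might do."""
    return frame.loc[i, "id"]


try:
    lookup(subset, 1)  # label 1 is not in the subset
    missing_label_failed = False
except KeyError:
    missing_label_failed = True

# After resetting, the labels are 0, 1, 2 again and every position resolves.
clean = subset.reset_index(drop=True)
print(missing_label_failed, lookup(clean, 1))
```

Passing drop=True is optional; it only avoids the extra index column shown in the table above.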

Analysis

train_split_id holds the positions, in the original dataset, of the rows assigned to the training set. Producing such positions only requires df.id, or even just the number of samples. So why does StratifiedShuffleSplit also take the breed column?

Let's look at, for each breed, the fraction of its samples that ended up in the validation set:

df_val = df.iloc[val_split_id]
count_val = collections.Counter(df_val['breed'])   # per-breed counts in the validation set
count_train = collections.Counter(df['breed'])     # per-breed counts in the full dataset df
ratio = []
for i in count_train:
    ratio.append(count_val[i] / count_train[i])

print(min(ratio), max(ratio))
0.09333333333333334    0.10606060606060606

All ratios are close to 0.1: every breed contributes about 10% of its samples to the validation set. This is what StratifiedShuffleSplit uses the breed column for.

Moreover, whatever value random_state takes, the ratios computed above stay the same. With random_state fixed, the first few rows of df_val are identical on every run; when random_state changes, the first few rows of df_val change as well. In other words, random_state only changes which samples are picked within each class, while the per-class proportions are fixed by the stratification rule.
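The effect of random_state can be checked on toy data. The sketch below uses made-up labels (the names labels, val_split, counts_by_seed and members_by_seed are illustrative) and records, for several seeds, the per-class counts and the chosen validation positions.

```python
import collections

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

labels = np.array(["husky"] * 80 + ["pug"] * 20)  # toy labels, hypothetical
placeholder_x = np.zeros(len(labels))


def val_split(seed):
    """Return the validation positions of one stratified 90/10 split."""
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=seed)
    _, val_idx = next(splitter.split(placeholder_x, labels))
    return val_idx


counts_by_seed = {}
members_by_seed = {}
for seed in (0, 1, 2):
    val_idx = val_split(seed)
    counts_by_seed[seed] = dict(collections.Counter(labels[val_idx].tolist()))
    members_by_seed[seed] = sorted(val_idx.tolist())
    print(seed, counts_by_seed[seed], members_by_seed[seed])
```

On this toy data the class counts are the same for every seed, while the printed positions show which rows each seed picked; calling val_split again with the same seed reproduces the same rows.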

random_split()

usage

torch.utils.data.random_split(dataset, lengths, generator=<torch._C.Generator object>)

Randomly splits a dataset into new non-overlapping datasets of the given lengths. The generator can optionally be fixed for reproducible results (the effect is the same as setting a random seed).

  • dataset (Dataset) – The dataset to split.
  • lengths (sequence) – The lengths of the splits to produce.
  • generator (Generator) – The generator used for the random permutation.

import numpy as np
import torch
from torch.utils.data import random_split

a = torch.arange(20)
x, y = random_split(a, [8,12])
print(x, y)         # two Subset objects
print(np.array(x))  # the elements selected for the first subset
<torch.utils.data.dataset.Subset object at 0x000002338056D2C8> <torch.utils.data.dataset.Subset object at 0x000002338056D308>
[ 4 19 16 14 10  3  5  7]

random_split returns Subset objects, which are themselves Datasets.

According to the official PyTorch 1.13 documentation for torch.utils.data, lengths can also be given as fractions:

If a list of fractions that sum up to 1 is given, the lengths will be computed automatically as floor(frac * len(dataset)) for each fraction provided.

After computing the lengths, if there are any remainders, 1 count will be distributed in round-robin fashion to the lengths until there are no remainders left.

An example is also given:

random_split(range(30), [0.3, 0.3, 0.4], generator=torch.Generator().manual_seed(42))

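However, running this example raised an error, most likely because the installed PyTorch version was older than 1.13 and only accepts integer lengths.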
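The rule quoted above can also be applied by hand, which yields integer lengths that random_split accepts regardless of version. A minimal sketch follows; lengths_from_fractions is a hypothetical helper written for illustration, not a PyTorch function.

```python
import math


def lengths_from_fractions(fractions, total):
    """Turn fractions that sum to 1 into integer lengths that sum to `total`."""
    if not math.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must sum to 1")
    # floor(frac * total) for each fraction, as in the quoted rule.
    lengths = [math.floor(frac * total) for frac in fractions]
    remainder = total - sum(lengths)
    # Hand out the leftover samples one at a time, round-robin from the first split.
    for i in range(remainder):
        lengths[i % len(lengths)] += 1
    return lengths


print(lengths_from_fractions([0.3, 0.3, 0.4], 30))
print(lengths_from_fractions([0.9, 0.1], 10222))
```

The resulting list can be passed as the lengths argument, for example random_split(range(30), lengths_from_fractions([0.3, 0.3, 0.4], 30)).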

to divide

valid_set_size = int(valid_ratio * len(df)) 
train_set_size = len(df) - valid_set_size
train_set, valid_set = random_split(df.values, [train_set_size, valid_set_size], generator=torch.Generator().manual_seed(99))
train_array = np.array(train_set)
valid_array = np.array(valid_set)
print(train_array.shape, valid_array.shape)
(9200, 3) (1022, 3)

Take a look at valid_array:

array([['29743dcc4d439615133f2024b50aab15', 'lhasa', 70],
       ['516e9ca19a0fd6c7eb5aa8566b249cb8', 'bloodhound', 14],
       ...,
       ['481f8e13336be2292ba30c45d14daf55', 'saluki', 93],
       ['e4f17a9e5ee1ed5385744cd6e8916a4e', 'bernese_mountain_dog', 11]],
      dtype=object)

read dataset

After the conversion above, the split data is in numpy.ndarray format. To use it with the custom dataset, it has to be turned back into DataFrames with pandas:

train_df = pd.DataFrame({'id':train_array[:, 0],'breed':train_array[:, 1], 'label_idx':train_array[:, 2]})
valid_df = pd.DataFrame({'id':valid_array[:, 0],'breed':valid_array[:, 1], 'label_idx':valid_array[:, 2]})

train_dataset = DogDataset(train_df, os.path.join(data_dir, 'train'), train_transform) 
val_dataset = DogDataset(valid_df, os.path.join(data_dir, 'train'), test_transform)

batch_size = 32
train_iter = DataLoader(train_dataset, batch_size, shuffle=False, drop_last = True)
val_iter = DataLoader(val_dataset, batch_size, shuffle=False, drop_last = True)
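An alternative that avoids the round trip through numpy is to split row positions instead of row values and then select the rows with df.iloc. This is a minimal sketch with a toy DataFrame standing in for df (its values are made up); it relies on the indices attribute of the Subset objects returned by random_split.

```python
import pandas as pd
import torch
from torch.utils.data import random_split

# Toy stand-in for the labels DataFrame (hypothetical values).
df = pd.DataFrame({
    "id": [f"img{i}" for i in range(10)],
    "breed": ["husky", "pug"] * 5,
    "label_idx": [0, 1] * 5,
})

valid_set_size = int(0.2 * len(df))
train_set_size = len(df) - valid_set_size

# Split the row positions 0..len(df)-1 rather than the row contents.
train_subset, valid_subset = random_split(
    range(len(df)),
    [train_set_size, valid_set_size],
    generator=torch.Generator().manual_seed(99),
)

# Subset.indices holds the selected positions in the original sequence.
train_df = df.iloc[list(train_subset.indices)].reset_index(drop=True)
valid_df = df.iloc[list(valid_subset.indices)].reset_index(drop=True)
print(train_df.shape, valid_df.shape)
```

The column dtypes of df are preserved this way, and the resulting train_df and valid_df can be passed to the custom dataset as before.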

Analysis

The split results turn out to be similar to those of the train_test_split function. They depend on the seed: different seeds give different splits, and the proportion drawn from each class is not consistent.

count_train = collections.Counter(df['breed'])        # per-breed counts in the full dataset df
count_valid = collections.Counter(valid_df['breed'])  # per-breed counts in the validation set
ratio = []
for i in count_train:
    ratio.append(count_valid[i] / count_train[i])

print(min(ratio), max(ratio))
0.043478260869565216 0.17582417582417584

Other datasets

TensorDataset

TensorDataset in torch.utils.data builds a dataset from a sequence of tensors. The shapes of these tensors can differ, but their first dimensions must have the same size, so that a batch of data can be returned correctly when the dataset is used with DataLoader.

The arguments passed to TensorDataset must be tensors.

source code

class TensorDataset(Dataset[Tuple[Tensor, ...]]):
    r"""Dataset wrapping tensors.

    Each sample will be retrieved by indexing tensors along the first dimension.

    Args:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """
    tensors: Tuple[Tensor, ...]

    def __init__(self, *tensors: Tensor) -> None:
        assert all(tensors[0].size(0) == tensor.size(0)
                   for tensor in tensors), "Size mismatch between tensors"
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)

  • *tensors means that a series of tensors is passed in when a TensorDataset is created, namely:

        dataset = TensorDataset(tensor_1, tensor_2, ..., tensor_n)

  • assert ensures that every tensor passed in has the same size in the first dimension as the first tensor, that is, all tensors are required to have the same size in the first dimension.
  • The result returned by the __getitem__ method is equivalent to

        return tensor_1[index], tensor_2[index], ..., tensor_n[index]

        As this line shows, if the n tensors did not all have the same size in the first dimension, an index that is valid for a longer tensor would raise an IndexError on a shorter one. Requiring the same first-dimension size also ensures that the data can be loaded batch by batch when the dataset is later passed to DataLoader.

  • __len__: since all tensors have the same size in the first dimension, it is sufficient to return the first-dimension size of the first tensor passed in.
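The behavior described in the points above can be checked directly. A minimal sketch with made-up tensors (features, good_labels and bad_labels are illustrative names): tensors of different shapes are accepted as long as the first dimension matches, and a mismatch is rejected by the assert in __init__.

```python
import torch
from torch.utils.data import TensorDataset

features = torch.zeros(6, 3)     # shape (6, 3)
good_labels = torch.arange(6)    # shape (6,): same first dimension
bad_labels = torch.arange(5)     # shape (5,): first dimension differs

# Different shapes are fine as long as the first dimension agrees.
dataset = TensorDataset(features, good_labels)
print(len(dataset), dataset[0])

try:
    TensorDataset(features, bad_labels)
    mismatch_rejected = False
except AssertionError as err:
    mismatch_rejected = True
    print("rejected:", err)
```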
     

 Create a dataset

from torch.utils import data

features = torch.tensor([
[ 0, 1, 2],
[ 1, 2, 3],
[ 2, 3, 4],
[ 3, 4, 5],
[ 4, 5, 6],
[ 5, 6, 7]], dtype=torch.int32)
label = torch.arange(6)
train_dataset = data.TensorDataset(features, label)
train_dataset[0]
(tensor([0, 1, 2], dtype=torch.int32), tensor(0))

use dataloader

train_iter = data.DataLoader(train_dataset, 3, shuffle=False)
# Use X, y as loop variables so the imported `data` module is not overwritten.
for X, y in train_iter:
    print(X, y)
tensor([[0, 1, 2],
        [1, 2, 3],
        [2, 3, 4]], dtype=torch.int32) tensor([0, 1, 2])
tensor([[3, 4, 5],
        [4, 5, 6],
        [5, 6, 7]], dtype=torch.int32) tensor([3, 4, 5])

One problem: wrapping a single tensor

When test-set data is wrapped, there are no labels, so the TensorDataset holds only one tensor, such as features below:

from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(features)
print(train_dataset[0])
train_iter = DataLoader(train_dataset, 3, shuffle=False)
for data in train_iter:
    print(data)

The first element of the dataset is shown below. It is still a tuple, but the tuple has only one element:

(tensor([0, 1, 2], dtype=torch.int32),)

The batches produced by the DataLoader are shown below. Each tensor is wrapped in a list, which is a problem: in a prediction function, for example, writing data = data.to(device) after fetching a batch does not work, because data is a list. I have not found a particularly elegant way around this, so I write data = data[0].to(device) instead.

[tensor([[0, 1, 2],
        [1, 2, 3],
        [2, 3, 4]], dtype=torch.int32)]
[tensor([[3, 4, 5],
        [4, 5, 6],
        [5, 6, 7]], dtype=torch.int32)]
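Another option is to unpack the one-element batch in the loop header, so that the loop body can treat the batch as a plain tensor. A minimal sketch; the device is set to CPU here only as an assumption, and batch_shapes is an illustrative name.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.arange(18, dtype=torch.int32).reshape(6, 3)
test_dataset = TensorDataset(features)
test_iter = DataLoader(test_dataset, batch_size=3, shuffle=False)

device = torch.device("cpu")  # assumption: replace with your own device

batch_shapes = []
for (batch,) in test_iter:  # unpack the single-element list right away
    batch = batch.to(device)
    batch_shapes.append(tuple(batch.shape))

print(batch_shapes)
```

This is equivalent to data[0].to(device); it only moves the indexing into the loop header.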

Reference: An article to understand TensorDataset_Lareges' blog in PyTorch-CSDN Blog_tensordataset

TensorDataset_anshiquanshu's Blog - CSDN Blog_tensordataset

Origin blog.csdn.net/iwill323/article/details/127684191