Source: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

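In machine learning, much of the effort usually goes into preparing data. PyTorch provides a number of tools that make data loading easier and keep the code readable. This introductory tutorial shows how to load and preprocess data from a non-trivial dataset.
Two additional packages need to be installed:

scikit-image : image I/O and transforms
pandas : easier csv parsing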

from __future__ import print_function, division
import os
import torch
import pandas as pd
from skimage import io, transform
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

plt.ion()   # interactive mode

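The dataset used in this example consists of facial poses. Each face image is annotated with 68 landmark points.
We use the dataset from the official tutorial and place it in the directory 'data/faces'. The dataset comes with a csv file of annotations.

Its format looks like this: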

image_name,part_0_x,part_0_y,part_1_x,part_1_y,part_2_x, ... ,part_67_x,part_67_y
0805personali01.jpg,27,83,27,98, ... ,84,134
1084239450_e76e00b7e7.jpg,70,236,71,257, ... ,128,312

So the annotation for each image is an array of shape (N, 2), where N is the number of landmarks (68) and the 2 columns are the x and y coordinates.
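
To make the (N, 2) layout concrete, here is a minimal sketch that uses only the standard library. The file name example.jpg, the shortened header and the third coordinate pair are made up for illustration; a real row has 68 pairs.

```python
import csv
import io

# Shortened, made-up annotation file with only 3 landmarks per image.
csv_text = (
    "image_name,part_0_x,part_0_y,part_1_x,part_1_y,part_2_x,part_2_y\n"
    "example.jpg,27,83,27,98,29,113\n"
)

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)   # column names
row = next(reader)      # first annotation row

image_name = row[0]
flat = [float(v) for v in row[1:]]
# Pair consecutive values: [x0, y0, x1, y1, ...] -> [(x0, y0), (x1, y1), ...]
landmarks = list(zip(flat[0::2], flat[1::2]))

print(image_name, len(landmarks), landmarks[0])
```

The pandas code that follows does the same pairing in one step with reshape(-1, 2).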

landmarks_frame = pd.read_csv('data/faces/face_landmarks.csv')

n = 65
img_name = landmarks_frame.iloc[n, 0]
landmarks = landmarks_frame.iloc[n, 1:].to_numpy()  # as_matrix() was removed in pandas 1.0
landmarks = landmarks.astype('float').reshape(-1, 2)

print('Image name: {}'.format(img_name))
print('Landmarks shape: {}'.format(landmarks.shape))
print('First 4 Landmarks: {}'.format(landmarks[:4]))

This prints:

Image name: person-7.jpg
Landmarks shape: (68, 2)
First 4 Landmarks: [[32. 65.]
 [33. 76.]
 [34. 86.]
 [34. 97.]]

Let's write a simple helper function that shows an image together with its landmarks, and use it to display one sample.

def show_landmarks(image, landmarks):
    """Show image with landmarks""" 
    plt.imshow(image)
    plt.scatter(landmarks[:, 0], landmarks[:, 1], s=10, marker='.', c='r')
    plt.pause(0.001)  # pause a bit so that plots are updated

plt.figure()
show_landmarks(io.imread(os.path.join('data/faces/', img_name)),
               landmarks)
plt.show()

The call above displays the image with its landmarks overlaid as red dots.

Dataset Class

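torch.utils.data.Dataset is an abstract class representing a dataset. A custom dataset should inherit from it and override the following methods:

__len__ so that len(dataset) returns the size of the dataset
__getitem__ to support indexing, so that dataset[i] returns the i-th sample

The example below uses these methods to build a dataset class for the face landmarks data. The csv file is read in __init__, but the images are read in __getitem__. This is memory efficient because images take up a lot of space: they are not all held in memory at once, only read when needed.
Each sample of the dataset is a dict {'image': image, 'landmarks': landmarks}. The dataset takes an optional argument transform so that any required processing can be applied to the sample. The next section shows what transform is useful for.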

class FaceLandmarksDataset(Dataset):
    """Face Landmarks dataset."""

    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.landmarks_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.landmarks_frame)

    def __getitem__(self, idx):
        img_name = os.path.join(self.root_dir,
                                self.landmarks_frame.iloc[idx, 0])
        image = io.imread(img_name)
        landmarks = self.landmarks_frame.iloc[idx, 1:].to_numpy()  # as_matrix() was removed in pandas 1.0
        landmarks = landmarks.astype('float').reshape(-1, 2)
        sample = {'image': image, 'landmarks': landmarks}

        if self.transform:
            sample = self.transform(sample)

        return sample
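
The lazy-loading idea can also be seen in isolation. The sketch below is a torch-free stand-in, not a real Dataset subclass: the class name LazySquares and its loads counter are hypothetical, and the counter takes the place of an expensive io.imread call.

```python
class LazySquares:
    """Toy dataset: item i is produced only when it is requested."""

    def __init__(self, size):
        self.size = size
        self.loads = 0  # how many items have actually been produced

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        self.loads += 1  # stands in for reading an image from disk
        return {'index': idx, 'value': idx * idx}


dataset = LazySquares(1000)
first = dataset[3]

print(len(dataset), first, dataset.loads)
```

Creating the dataset costs nothing per item; work is done only for the single index that was accessed.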

Next we instantiate this class and iterate over the data samples. We print the sizes of the first 4 samples and show their landmarks.

face_dataset = FaceLandmarksDataset(csv_file='data/faces/face_landmarks.csv',
                                    root_dir='data/faces/')

fig = plt.figure()

for i in range(len(face_dataset)):
    sample = face_dataset[i]

    print(i, sample['image'].shape, sample['landmarks'].shape)

    ax = plt.subplot(1, 4, i + 1)
    plt.tight_layout()
    ax.set_title('Sample #{}'.format(i))
    ax.axis('off')
    show_landmarks(**sample)

    if i == 3:
        plt.show()
        break

Output:

0 (324, 215, 3) (68, 2)
1 (500, 333, 3) (68, 2)
2 (250, 258, 3) (68, 2)
3 (434, 290, 3) (68, 2)

Transforms

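One issue is visible in the output above: the samples are not all the same size. Most neural networks expect images of a fixed size, so some preprocessing code is needed. We create three transforms:

Rescale: rescales the image.
RandomCrop: crops the image randomly, which serves as data augmentation.
ToTensor: converts the numpy image to a torch image (the axes need to be swapped).

We write them as callable classes rather than simple functions, so that the parameters of a transform do not have to be passed every time it is called. For this, we only need to implement __call__ and, if required, __init__. A transform can then be used like this: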

tsfm = Transform(params)
transformed_sample = tsfm(sample)
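
As a small illustration of the callable-class pattern, here is a hypothetical transform (AddOffset is not part of the tutorial or of torchvision) that stores its parameters once in __init__ and receives only the sample in __call__:

```python
class AddOffset:
    """Hypothetical transform: shifts every landmark by a fixed offset."""

    def __init__(self, dx, dy):
        # Parameters are stored once, when the transform is created.
        self.dx = dx
        self.dy = dy

    def __call__(self, sample):
        # Only the sample is passed at call time.
        shifted = [(x + self.dx, y + self.dy) for x, y in sample['landmarks']]
        return {'image': sample['image'], 'landmarks': shifted}


shift = AddOffset(dx=10, dy=-5)
sample = {'image': None, 'landmarks': [(32.0, 65.0), (33.0, 76.0)]}
result = shift(sample)

print(result['landmarks'])
```

Because the transform returns a new dict, the input sample is left unchanged and the same object can be reused on every sample.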

The code below shows how these transforms are applied to both the image and the landmarks:

class Rescale(object):
    """Rescale the image in a sample to a given size.

    Args:
        output_size (tuple or int): Desired output size. If tuple, output is
            matched to output_size. If int, smaller of image edges is matched
            to output_size keeping aspect ratio the same.
    """

    def __init__(self, output_size):
        assert isinstance(output_size, (int, tuple))
        self.output_size = output_size

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']

        h, w = image.shape[:2]
        if isinstance(self.output_size, int):
            if h > w:
                new_h, new_w = self.output_size * h / w, self.output_size
            else:
                new_h, new_w = self.output_size, self.output_size * w / h
        else:
            new_h, new_w = self.output_size

        new_h, new_w = int(new_h), int(new_w)

        img = transform.resize(image, (new_h, new_w))

        # h and w are swapped for landmarks because for images,
        # x and y axes are axis 1 and 0 respectively
        landmarks = landmarks * [new_w / w, new_h / h]

        return {'image': img, 'landmarks': landmarks}


class RandomCrop(object):
    """Crop randomly the image in a sample.

    Args:
        output_size (tuple or int): Desired output size. If int, square crop
            is made.
    """

    def __init__(self, output_size):
        assert isinstance(output_size, (int, tuple))
        if isinstance(output_size, int):
            self.output_size = (output_size, output_size)
        else:
            assert len(output_size) == 2
            self.output_size = output_size

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']

        h, w = image.shape[:2]
        new_h, new_w = self.output_size

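        # +1 because the upper bound of randint is exclusive; without it,
        # a crop as large as the image raises an error.
        top = np.random.randint(0, h - new_h + 1)
        left = np.random.randint(0, w - new_w + 1)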

        image = image[top: top + new_h,
                      left: left + new_w]

        landmarks = landmarks - [left, top]

        return {'image': image, 'landmarks': landmarks}


class ToTensor(object):
    """Convert ndarrays in sample to Tensors."""

    def __call__(self, sample):
        image, landmarks = sample['image'], sample['landmarks']

        # swap color axis because
        # numpy image: H x W x C
        # torch image: C X H X W
        image = image.transpose((2, 0, 1))
        return {'image': torch.from_numpy(image),
                'landmarks': torch.from_numpy(landmarks)}
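
The coordinate arithmetic in Rescale and RandomCrop can be checked by hand on a single point. The sketch below repeats the same formulas in plain Python for the 324 x 215 image from the first sample; the helper names and the crop offsets (left=16, top=100) are made up for illustration.

```python
def rescaled_size(h, w, output_size):
    """Same size rule as Rescale.__call__ for an int output_size."""
    if h > w:
        new_h, new_w = output_size * h / w, output_size
    else:
        new_h, new_w = output_size, output_size * w / h
    return int(new_h), int(new_w)


def rescale_point(x, y, h, w, new_h, new_w):
    """Scale one (x, y) landmark: x follows the width, y follows the height."""
    return x * new_w / w, y * new_h / h


def crop_point(x, y, left, top):
    """Shift one landmark into the coordinate frame of the crop."""
    return x - left, y - top


h, w = 324, 215                                      # shape of sample 0
new_h, new_w = rescaled_size(h, w, 256)              # shorter side becomes 256
x, y = rescale_point(215, 324, h, w, new_h, new_w)   # bottom-right corner
cx, cy = crop_point(x, y, left=16, top=100)          # made-up crop offsets

print((new_h, new_w), (x, y), (cx, cy))
```

The bottom-right corner of the original image lands on the bottom-right corner of the rescaled one, which is a quick sanity check that x is scaled by the width ratio and y by the height ratio.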

Compose transforms

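Now we apply the transforms to a sample.
Suppose we want to rescale the shorter side of the image to 256 and then randomly crop a square of size 224 from it. That is, we want to compose the Rescale and RandomCrop transforms. torchvision.transforms.Compose is a simple callable class that lets us do this.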

scale = Rescale(256)
crop = RandomCrop(128)
composed = transforms.Compose([Rescale(256),
                               RandomCrop(224)])

# Apply each of the above transforms on sample.
fig = plt.figure()
sample = face_dataset[65]
for i, tsfrm in enumerate([scale, crop, composed]):
    transformed_sample = tsfrm(sample)

    ax = plt.subplot(1, 3, i + 1)
    plt.tight_layout()
    ax.set_title(type(tsfrm).__name__)
    show_landmarks(**transformed_sample)

plt.show()
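
Conceptually, composing transforms just means calling them one after another on the same sample. The class below is a minimal stand-in written for illustration (SimpleCompose, double and add_one are hypothetical names), not the torchvision implementation:

```python
class SimpleCompose:
    """Apply a list of callables one after another (conceptual stand-in)."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, sample):
        for t in self.transforms:
            sample = t(sample)  # output of one step is input of the next
        return sample


def double(value):
    return value * 2


def add_one(value):
    return value + 1


pipeline = SimpleCompose([double, add_one])
result = pipeline(5)                                    # (5 * 2) + 1
reversed_result = SimpleCompose([add_one, double])(5)   # (5 + 1) * 2

print(result, reversed_result)
```

The order matters: rescaling before cropping, as in the example above, guarantees that the image is large enough for the 224 crop.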


Iterating through the dataset

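Let's put all of this together and create a dataset with composed transforms. Every time this dataset is sampled:

An image is read from file on the fly
The transforms are applied to the image that was read
Since one of the transforms is random, the data is augmented on sampling

We can iterate over the created dataset with a for i in range loop, as before.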

transformed_dataset = FaceLandmarksDataset(csv_file='data/faces/face_landmarks.csv',
                                           root_dir='data/faces/',
                                           transform=transforms.Compose([
                                               Rescale(256),
                                               RandomCrop(224),
                                               ToTensor()
                                           ]))

for i in range(len(transformed_dataset)):
    sample = transformed_dataset[i]

    print(i, sample['image'].size(), sample['landmarks'].size())

    if i == 3:
        break

Output:

0 torch.Size([3, 224, 224]) torch.Size([68, 2])
1 torch.Size([3, 224, 224]) torch.Size([68, 2])
2 torch.Size([3, 224, 224]) torch.Size([68, 2])
3 torch.Size([3, 224, 224]) torch.Size([68, 2])

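However, by iterating over the data with a simple for loop we lose a lot of features. In particular, we are missing out on:

Batching the data
Shuffling the data
Loading the data in parallel using multiprocessing workers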

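torch.utils.data.DataLoader is an iterator that provides all of these features. The parameters used below should be self-explanatory. One parameter of interest is collate_fn, which specifies exactly how samples are combined into a batch. The default collate function should work fine for most use cases.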

dataloader = DataLoader(transformed_dataset, batch_size=4,
                        shuffle=True, num_workers=4)


# Helper function to show a batch
def show_landmarks_batch(sample_batched):
    """Show image with landmarks for a batch of samples."""
    images_batch, landmarks_batch = \
            sample_batched['image'], sample_batched['landmarks']
    batch_size = len(images_batch)
    im_size = images_batch.size(2)
    grid_border_size = 2

    grid = utils.make_grid(images_batch)
    plt.imshow(grid.numpy().transpose((1, 2, 0)))

    for i in range(batch_size):
        plt.scatter(landmarks_batch[i, :, 0].numpy() + i * im_size + (i + 1) * grid_border_size,
                    landmarks_batch[i, :, 1].numpy() + grid_border_size,
                    s=10, marker='.', c='r')

        plt.title('Batch from dataloader')

for i_batch, sample_batched in enumerate(dataloader):
    print(i_batch, sample_batched['image'].size(),
          sample_batched['landmarks'].size())

    # observe 4th batch and stop.
    if i_batch == 3:
        plt.figure()
        show_landmarks_batch(sample_batched)
        plt.axis('off')
        plt.ioff()
        plt.show()
        break

Output:

0 torch.Size([4, 3, 224, 224]) torch.Size([4, 68, 2])
1 torch.Size([4, 3, 224, 224]) torch.Size([4, 68, 2])
2 torch.Size([4, 3, 224, 224]) torch.Size([4, 68, 2])
3 torch.Size([4, 3, 224, 224]) torch.Size([4, 68, 2])
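
To see what batching, shuffling and collating mean, here is a conceptual sketch in plain Python. It is not how DataLoader is implemented (there are no tensors and no worker processes), and iterate_batches is a hypothetical helper; it only shows the idea of turning a list of dict samples into dict batches.

```python
import random


def iterate_batches(dataset, batch_size, shuffle=False, seed=None):
    """Yield dict batches from a sequence of dict samples."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # visit samples in random order
    for start in range(0, len(indices), batch_size):
        chunk = [dataset[i] for i in indices[start:start + batch_size]]
        # Collate: one list per key instead of one dict per sample.
        yield {key: [s[key] for s in chunk] for key in chunk[0]}


samples = [{'image': 'img%d' % i, 'landmarks': [i, i]} for i in range(10)]
batches = list(iterate_batches(samples, batch_size=4))

print([len(b['image']) for b in batches])
```

With 10 samples and a batch size of 4 the last batch is smaller than the others. The real default collate function additionally stacks the tensors of a batch along a new first dimension, which is where the leading 4 in the sizes printed above comes from.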

torchvision

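In this tutorial we have seen how to write and use datasets, transforms and the data loader. The torchvision package provides some common datasets and transforms, so you may not even need to write custom classes. One of the more generic datasets in torchvision is ImageFolder. It assumes that the images are organized as follows: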

root/ants/xxx.png
root/ants/xxy.jpeg
root/ants/xxz.png
.
.
.
root/bees/123.jpg
root/bees/nsdf3.png
root/bees/asd932_.png

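Here 'ants', 'bees' and so on are class labels. Generic transforms that operate on PIL.Image, such as RandomHorizontalFlip and Resize (called Scale in older torchvision releases), are available as well. These can be used to build a data loader like the one below (the dataset has to be downloaded first):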

import torch
from torchvision import transforms, datasets

data_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),  # formerly RandomSizedCrop
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])
hymenoptera_dataset = datasets.ImageFolder(root='hymenoptera_data/train',
                                           transform=data_transform)
dataset_loader = torch.utils.data.DataLoader(hymenoptera_dataset,
                                             batch_size=4, shuffle=True,
                                             num_workers=4)
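
The way class labels follow from the directory layout can be sketched without torchvision. The snippet below assumes, as ImageFolder appears to do, that class indices are assigned to the sorted folder names; the path list is a shortened copy of the layout shown above.

```python
import posixpath

paths = [
    'root/bees/123.jpg',
    'root/ants/xxx.png',
    'root/ants/xxy.jpeg',
    'root/bees/nsdf3.png',
]


def class_of(path):
    """The class name is the directory that directly contains the file."""
    return posixpath.basename(posixpath.dirname(path))


class_names = sorted({class_of(p) for p in paths})
class_to_idx = {name: i for i, name in enumerate(class_names)}
labelled = [(p, class_to_idx[class_of(p)]) for p in paths]

print(class_to_idx)
print(labelled[0])
```

Each image is paired with an integer label, which is the form in which a classification loss expects its targets.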
