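Data Processing for Deep Learning: How to Shuffle Images and Labels and Split Them into Training and Test Sets

This is my first CSDN blog post.

I recently found the Office31 dataset online. It contains three sub-datasets (Amazon, dslr, and webcam), each with 31 classes. This post covers three aspects: the goal of the data processing, the processing steps, and the result.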

1. Data Processing Goal

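Load the selected sub-dataset (Amazon, dslr, or webcam), one-hot encode the labels, and finally split the data into a training set and a test set (called the validation set in the code).
Data format (OpenCV stores each image as height, width, channels):
train_image=[train_n, h, w, c]
train_label=[train_n, categories_n]
validate_image=[validate_n, h, w, c]
validate_label=[validate_n, categories_n]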
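To make these shapes concrete, here is a minimal sketch that uses blank arrays in place of real images. The sizes (8 images, 300x300 pixels, 3 channels, 31 classes) and all variable names are illustrative assumptions, not values taken from the real dataset.

```python
import numpy as np

# Illustrative sizes only; these are assumptions, not measured from Office31.
n_images, height, width, channels, n_categories = 8, 300, 300, 3, 31

# A list of same-sized images, the way a loading loop would collect them.
images = [np.zeros((height, width, channels), dtype=np.uint8) for _ in range(n_images)]

# One integer class index per image, turned into one-hot rows
# by picking rows of an identity matrix.
class_ids = np.arange(n_images) % n_categories
labels = np.eye(n_categories, dtype=np.int64)[class_ids]

# np.stack needs every image to have exactly the same shape.
image_batch = np.stack(images)

print(image_batch.shape)  # (8, 300, 300, 3)
print(labels.shape)       # (8, 31)
```

Stacking only works when all images share one shape, which is why resizing matters in the loading step.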

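2. Processing Steps

The process has three steps: loading the dataset, one-hot encoding the labels, and splitting the data into training and test sets.

Loading the dataset

Code first: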

import os
import glob
import cv2

img_label = []
img_feature = []

def get_datasets(args):
    i_categories = -1
    i_picture = -1
    img_path = os.path.join(args.data, args.datasets)
    # sorted() keeps the folder-to-label mapping identical across runs
    for name in sorted(glob.glob(img_path + '/images/*')):
        i_categories += 1
        for jpg in sorted(glob.glob(name + '/*')):
            img = cv2.imread(jpg)
            if img is None:
                # cv2.imread returns None for files it cannot read; skip them
                continue
            i_picture += 1
            # resize so that every image has the same shape
            img = cv2.resize(img, (300, 300))
            img_feature.append(img)
            img_label.append(i_categories)

    print("Total number of categories:" + str(i_categories + 1))
    print("Total number of pictures:" + str(i_picture + 1))

    # one-hot encode the labels
    onehot_label = one_hot(img_label)
    # split the dataset into a training set and a test (validation) set
    train_set_label, validate_set_label, train_set_feature, validate_set_feature = _split_train_val(onehot_label,
                                                                                                    img_feature, 0.1)
    return train_set_label, validate_set_label, train_set_feature, validate_set_feature

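First, the image path. I keep it in a central argparse configuration that holds all the program's parameters. My images are stored in the following layout:
[Figure: directory layout of the image folder]
The images folder contains 31 subfolders, and each subfolder holds the images of one class.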
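A minimal sketch of such an argparse setup might look like the following. The argument names `--data` and `--datasets` mirror `args.data` and `args.datasets` in the loading code; the default values, the `choices` list, and the helper name `build_parser` are assumptions made for illustration.

```python
import argparse
import os

def build_parser():
    parser = argparse.ArgumentParser(description="Office31 data loading (sketch)")
    # --data / --datasets mirror args.data / args.datasets in get_datasets;
    # the defaults below are hypothetical placeholders.
    parser.add_argument("--data", default="./data",
                        help="root folder that holds the sub-datasets")
    parser.add_argument("--datasets", default="amazon",
                        choices=["amazon", "dslr", "webcam"],
                        help="which sub-dataset to load")
    return parser

# Parse an explicit list so the sketch runs without real command-line input.
args = build_parser().parse_args(["--data", "./data", "--datasets", "webcam"])
image_root = os.path.join(args.data, args.datasets, "images")
print(image_root)
```

Keeping the paths in one parser means switching sub-datasets is a command-line change rather than a code change.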

The starting path is Original_images/amazon/images.
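Two nested for loops follow: the outer one collects the paths of all subfolders under images, and the inner one collects the path of every image in each subfolder.
OpenCV then reads the image data. One very important step here is cv2.resize. With datasets downloaded from the internet we rarely check whether all images are the same size, and that oversight cost me a lot of time: the program kept raising warnings until I used resize to bring every image to the same size. The image size can also be made a global parameter so that the program is easier to modify.
Finally, each image and its class index are appended to Python lists (list.append, not numpy.append), which gives the two collections label and feature.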
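The two-level directory walk can be tried out without any real images. The sketch below builds a tiny temporary folder-per-class layout and lists it with sorted glob results, so the class-to-index mapping is reproducible. The function name `list_image_paths` and the two class folders are hypothetical.

```python
import glob
import os
import tempfile

def list_image_paths(image_root):
    """Return (paths, labels, class_names) for a folder-per-class layout."""
    class_dirs = sorted(d for d in glob.glob(os.path.join(image_root, "*"))
                        if os.path.isdir(d))
    class_names = [os.path.basename(d) for d in class_dirs]
    paths, labels = [], []
    for class_index, class_dir in enumerate(class_dirs):
        for path in sorted(glob.glob(os.path.join(class_dir, "*"))):
            paths.append(path)
            labels.append(class_index)
    return paths, labels, class_names

# Build a tiny fake layout (two hypothetical classes) to exercise the function.
with tempfile.TemporaryDirectory() as root:
    for class_name, n_files in [("mug", 2), ("back_pack", 3)]:
        os.makedirs(os.path.join(root, class_name))
        for k in range(n_files):
            open(os.path.join(root, class_name, f"{k}.jpg"), "w").close()
    paths, labels, class_names = list_image_paths(root)

print(class_names)  # ['back_pack', 'mug']
print(labels)       # [0, 0, 0, 1, 1]
```

Returning `class_names` alongside the integer labels makes it possible to map a predicted index back to a folder name later.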

One-hot encoding

This step is quite simple. Again, code first:

from sklearn.preprocessing import LabelBinarizer

def one_hot(label):
    encoder = LabelBinarizer()
    onehot_label = encoder.fit_transform(label)
    return onehot_label

This uses LabelBinarizer from sklearn. The input label is a one-dimensional list of picture_n integer class indices, and the returned one-hot label has the shape [picture_n, categories_n].
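For comparison, here is a NumPy-only sketch that produces the same kind of [picture_n, categories_n] matrix for integer labels in the range 0 to categories_n-1, and recovers the class index with argmax. This is an alternative shown for illustration, not what the code above uses, and the function name `one_hot_numpy` is hypothetical.

```python
import numpy as np

def one_hot_numpy(labels, n_categories):
    """One-hot encode integer labels by selecting rows of an identity matrix."""
    labels = np.asarray(labels)
    return np.eye(n_categories, dtype=np.int64)[labels]

labels = [0, 2, 1, 2]
encoded = one_hot_numpy(labels, 3)
print(encoded.shape)     # (4, 3)

# argmax along each row turns a one-hot row back into its class index.
decoded = encoded.argmax(axis=1)
print(decoded.tolist())  # [0, 2, 1, 2]
```

One difference worth knowing: with exactly two classes, LabelBinarizer returns a single column, whereas this sketch always returns n_categories columns.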

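Splitting into training and test sets

For the datasets bundled with PyTorch or TensorFlow this is very simple, since a single API call does the job, but for your own dataset it takes more effort. Code first: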

import numpy as np

def _split_train_val(label, feature, val_fraction=0):
    n_train = int((1. - val_fraction) * len(label))
    n_val = len(label) - n_train
    # print(n_train, n_val)
    indices = np.arange(len(label))
    np.random.shuffle(indices)
    # print(np.shape(indices))
    # print(indices)
    train_label = []
    train_feature = []
    val_label = []
    val_feature = []
    for i in range(len(label)):
        if i < n_train:
            train_label.append(label[indices[i]])
            train_feature.append(feature[indices[i]])
        else:
            val_label.append(label[indices[i]])
            val_feature.append(feature[indices[i]])
    print("==> Split the dataset into train_set and validate_set")
    print("train_set: " + str(n_train), ",validate_set: " + str(n_val))
    return train_label, val_label, train_feature, val_feature

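To select the training and test sets at random, first use numpy to generate an index array, indices, running from 0 to picture_n-1, then shuffle it with the shuffle function. The dataset is then split according to indices: its first n_train entries point to the training samples and the rest point to the test samples. Because the same shuffled indices are used for both label and feature, every image stays paired with its own label.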
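The same idea can be written more compactly with array slicing. The following is a sketch under some assumptions: it converts the inputs to NumPy arrays, uses `np.random.default_rng` with a fixed seed for reproducibility, and the name `split_train_val` and the toy data are made up for the demonstration.

```python
import numpy as np

def split_train_val(labels, features, val_fraction=0.1, seed=0):
    """Shuffle indices once, then slice them into train and validation parts."""
    labels = np.asarray(labels)
    features = np.asarray(features)
    n_train = int((1.0 - val_fraction) * len(labels))
    rng = np.random.default_rng(seed)          # seeded for repeatable splits
    indices = rng.permutation(len(labels))     # shuffled 0 .. n-1
    train_idx, val_idx = indices[:n_train], indices[n_train:]
    return labels[train_idx], labels[val_idx], features[train_idx], features[val_idx]

# Toy data: feature i is simply 10 * label i, so pairing is easy to verify.
labels = np.arange(8)
features = labels * 10
train_y, val_y, train_x, val_x = split_train_val(labels, features, val_fraction=0.25)

print(len(train_y), len(val_y))            # 6 2
print(bool((train_x == train_y * 10).all()))  # True: pairs stayed together
```

Indexing both arrays with the same index slice is what keeps each feature aligned with its label after shuffling.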

3. Result

The final result is shown in the figure below.
[Figure: program output after loading and splitting the dataset]
