Data processing for deep learning: how to shuffle pictures and labels and split them into training and test sets

This is my first CSDN blog post.

Recently I found the Office31 dataset online. It contains three sub-datasets, namely Amazon, dslr, and webcam, and each sub-dataset contains 31 classes. The following covers three aspects: the data processing goals, the processing procedure, and the processing results.

1. Data processing goals

Import the selected sub-dataset (Amazon, dslr, or webcam), one-hot encode the labels, and finally split the data into a training set and a test set.
Data format:
train_image=[train_n, w, h, c]
train_label=[train_n, categories_n]
validate_image=[validate_n, w, h, c]
validate_label=[validate_n, categories_n]
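
To make these shapes concrete, here is a minimal sketch; the image counts are hypothetical, while w, h, and c follow the 300x300 BGR resize used later in this post:

import numpy as np

# hypothetical example: 900 training and 100 validation images,
# 31 categories, and the 300x300 3-channel resize used below
train_image = np.zeros((900, 300, 300, 3))     # [train_n, w, h, c]
train_label = np.zeros((900, 31))              # [train_n, categories_n]
validate_image = np.zeros((100, 300, 300, 3))  # [validate_n, w, h, c]
validate_label = np.zeros((100, 31))           # [validate_n, categories_n]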

2. Processing

The process is divided into three steps: importing the dataset, one-hot encoding, and splitting into training and test sets.

Importing the dataset

The code first:

import os
import glob
import cv2

img_label = []
img_feature = []

def get_datasets(args):
    i_categories = -1
    i_picture = -1
    img_path = os.path.join(args.data, args.datasets)
    # print(img_path)
    for name in glob.glob(img_path + '/images/*'):
        i_categories += 1
        # print(name)
        for jpg in glob.glob(name + '/*'):
            i_picture += 1
            # print(jpg)
            img = cv2.imread(jpg)
            # resize so that every image ends up with the same shape
            img = cv2.resize(img, (300, 300))
            img_feature.append(img)
            img_label.append(i_categories)

    print("Total number of categories: " + str(i_categories + 1))
    print("Total number of pictures: " + str(i_picture + 1))

    # one-hot encoding
    onehot_label = one_hot(img_label)
    # split the dataset into a training set and a test set
    train_set_label, validate_set_label, train_set_feature, validate_set_feature = _split_train_val(onehot_label,
                                                                                                    img_feature, 0.1)
    return train_set_label, validate_set_label, train_set_feature, validate_set_feature

First comes importing the image paths. I store the dataset path in a set of global parameters declared with argparse (a minimal sketch of this follows the list below). My images are saved in the directory layout shown in the figure below.
[Figure: directory layout of the dataset]
Below the images level there are 31 subfolders, and each subfolder contains one class of images.

  1. The initial path is Original-images/amazon/images
  2. Two nested for loops follow: the first gets the names of all subfolders under images, and the second gets the paths of the pictures in each subfolder
  3. OpenCV reads the image data. There is a very important operation here: cv2.resize. For a dataset downloaded from the Internet, we usually do not check whether all images have the same size. Because of this, I wasted a lot of time on constant warnings from the program; later I resized all pictures to a uniform size. This size can also be made a global parameter so the program is easy to modify
  4. Finally, each image and its label are appended to the img_feature and img_label lists, giving the feature and label arrays
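
The argparse setup itself is not shown in this post, so here is a minimal sketch, assuming only the two attributes args.data and args.datasets that get_datasets uses; the default values are hypothetical:

import argparse

parser = argparse.ArgumentParser(description='Office31 data processing')
# root directory that contains the sub-datasets (hypothetical default)
parser.add_argument('--data', type=str, default='Original-images')
# which sub-dataset to load: amazon, dslr, or webcam
parser.add_argument('--datasets', type=str, default='amazon')
args = parser.parse_args()

# os.path.join(args.data, args.datasets) -> Original-images/amazon
train_label, val_label, train_feature, val_feature = get_datasets(args)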

One-hot encoding

This part is still very simple. As before, the code first:

from sklearn.preprocessing import LabelBinarizer

def one_hot(label):
    # map integer class labels to one-hot row vectors
    encoder = LabelBinarizer()
    onehot_label = encoder.fit_transform(label)
    return onehot_label

It is implemented with LabelBinarizer from sklearn. The input label has shape [picture_n, 1] (a list of integer class indices), and the returned one-hot label has shape [picture_n, categories_n].
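
For example, a toy run with three classes instead of the 31 in Office31:

print(one_hot([0, 2, 1, 0]))
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]
#  [1 0 0]]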

Splitting into training and test sets

For the datasets that ship with PyTorch and TensorFlow this is very simple and can be done with a single API call, but for your own dataset it takes a little more effort. The code first:

import numpy as np

def _split_train_val(label, feature, val_fraction=0):
    n_train = int((1. - val_fraction) * len(label))
    n_val = len(label) - n_train
    # print(n_train, n_val)
    # shuffle the indices 0..picture_n-1, then split by position
    indices = np.arange(len(label))
    np.random.shuffle(indices)
    # print(np.shape(indices))
    # print(indices)
    train_label = []
    train_feature = []
    val_label = []
    val_feature = []
    for i in range(len(label)):
        if i < n_train:
            train_label.append(label[indices[i]])
            train_feature.append(feature[indices[i]])
        else:
            val_label.append(label[indices[i]])
            val_feature.append(feature[indices[i]])
    print("==> Split the dataset into train_set and validate_set")
    print("train_set: " + str(n_train) + ", validate_set: " + str(n_val))
    return train_label, val_label, train_feature, val_feature

To select the training and test sets randomly, first generate an index array from 0 to picture_n - 1 with numpy, then shuffle it with the shuffle function. The dataset can then be split according to this array: each shuffled value is used as an index into both label and feature, so the pairs stay aligned.
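
A toy run of the split on ten label/feature pairs (the exact assignment varies with the shuffle, but label and feature stay paired):

labels = list(range(10))             # stand-in labels 0..9
features = [l * 10 for l in labels]  # feature i is paired with label i
tr_l, va_l, tr_f, va_f = _split_train_val(labels, features, 0.1)
# ==> Split the dataset into train_set and validate_set
# train_set: 9, validate_set: 1
assert all(f == l * 10 for l, f in zip(tr_l, tr_f))  # pairs still aligned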

3. Processing results

The final result is shown below.
[Figure: the printed processing results]

Reference blogs:
Using python glob.glob
Shuffling order with numpy.random.shuffle

Origin: blog.csdn.net/weixin_42279314/article/details/109201792