Dive into Deep Learning (PyTorch version) study notes: Kaggle image classification 2 (ImageNet Dogs)

Dog breed recognition on Kaggle (ImageNet Dogs)

In this section, we tackle the dog breed identification challenge on Kaggle. The competition website is https://www.kaggle.com/c/dog-breed-identification. The task is to distinguish 120 different dog breeds. The dataset used in the competition is a subset of the well-known ImageNet dataset.

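# In this notebook, training the model on the full training set with the parameters set below takes roughly 40-50 minutes
# Budget your GPU time sensibly; try to switch to a GPU resource only for training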
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import torchvision.models as models
import os
import shutil
import time
import pandas as pd
import random
# Set the random seeds
random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed(0)

Organize the data set

We can download the dataset from the competition website. The directory structure is:

| Dog Breed Identification
    | train
    |   | 000bec180eb18c7604dcecc8fe0dba07.jpg
    |   | 00a338a92e4e7bf543340dc849230e75.jpg
    |   | ...
    | test
    |   | 00a3edd22dc7859c487a64777fc8d093.jpg
    |   | 00a6892e5c7f92c1f465e213fd904582.jpg
    |   | ...
    | labels.csv
    | sample_submission.csv

The train and test directories contain images of the training set and test set respectively. The training set contains 10,222 images, and the test set contains 10,357 images. The image format is JPEG, and the file name of each image is a unique id. labels.csv contains the labels of the images in the training set. The file contains 10,222 rows. Each row contains two columns. The first column is the image id and the second column is the dog category. There are 120 types of dogs.
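
As a minimal sketch of the label file described above, the snippet below parses a few rows in a two-column layout and counts images per breed. The header names `id` and `breed`, the helper `read_labels`, and the sample rows are assumptions made up for illustration; the real file has 10,222 rows.

```python
import csv
import io
from collections import Counter

# Made-up rows in the assumed two-column layout (image id, breed).
sample_csv = """id,breed
img_a,affenpinscher
img_b,afghan_hound
img_c,affenpinscher
"""

def read_labels(f):
    """Return a dict mapping image id -> breed from an open CSV file."""
    reader = csv.reader(f)
    next(reader)  # skip the header row
    return {img_id: breed for img_id, breed in reader}

id2label = read_labels(io.StringIO(sample_csv))
breed_counts = Counter(id2label.values())
print(len(id2label), "images,", len(breed_counts), "breeds")
print(breed_counts.most_common(1))
```

On the real labels.csv, the same counts should come out as 10,222 images and 120 breeds.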

We reorganize the data to make it easier to read later. The main goals are:

  • Split a validation set off the training set for tuning hyperparameters. After the split, the data consists of 4 parts: the reduced training set, the validation set, the complete training set, and the complete test set.
  • Create one folder for each of the 4 parts: train, valid, train_valid, test. Inside each folder, create one subfolder per category and store the images of that category in it. The labels of the first three parts are known, so each of them has 120 subfolders. The labels of the test set are unknown, so it has a single subfolder named unknown that holds all the test images.

The organized data set should have the following directory structure:

| train_valid_test
    | train
    |   | affenpinscher
    |   |   | 00ca18751837cd6a22813f8e221f7819.jpg
    |   |   | ...
    |   | afghan_hound
    |   |   | 0a4f1e17d720cdff35814651402b7cf4.jpg
    |   |   | ...
    |   | ...
    | valid
    |   | affenpinscher
    |   |   | 56af8255b46eb1fa5722f37729525405.jpg
    |   |   | ...
    |   | afghan_hound
    |   |   | 0df400016a7e7ab4abff824bf2743f02.jpg
    |   |   | ...
    |   | ...
    | train_valid
    |   | affenpinscher
    |   |   | 00ca18751837cd6a22813f8e221f7819.jpg
    |   |   | ...
    |   | afghan_hound
    |   |   | 0a4f1e17d720cdff35814651402b7cf4.jpg
    |   |   | ...
    |   | ...
    | test
    |   | unknown
    |   |   | 00a3edd22dc7859c487a64777fc8d093.jpg
    |   |   | ...
data_dir = '/home/kesci/input/Kaggle_Dog6357/dog-breed-identification'  # dataset directory
label_file, train_dir, test_dir = 'labels.csv', 'train', 'test'  # files and folders inside data_dir
new_data_dir = './train_valid_test'  # directory for the reorganized data
valid_ratio = 0.1  # fraction of the training data held out for validation

The code below copies the files from the downloaded Dog Breed Identification folder (which contains train, test, labels.csv, and sample_submission.csv) into a newly created train_valid_test folder (with the subfolders train, valid, train_valid, and test).

def mkdir_if_not_exist(path):
    # Create the directory given by the path components if it does not exist yet
    if not os.path.exists(os.path.join(*path)):
        os.makedirs(os.path.join(*path))
        
def reorg_dog_data(data_dir, label_file, train_dir, test_dir, new_data_dir, valid_ratio):
    # Read the labels of the training data
    labels = pd.read_csv(os.path.join(data_dir, label_file))
    id2label = {Id: label for Id, label in labels.values}  # (key: value): (id: label)

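    # Shuffle the training data randomly.
    # os.listdir() returns the names of the files and folders in a directory,
    # so train_files is the list of all images in the train folder.
    train_files = os.listdir(os.path.join(data_dir, train_dir))
    random.shuffle(train_files)  # shuffle the order of the .jpg files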

    # Original training set
    valid_ds_size = int(len(train_files) * valid_ratio)  # size of the validation set
    for i, file in enumerate(train_files):
        img_id = file.split('.')[0]  # file is a string of the form id.jpg
        img_label = id2label[img_id]
        if i < valid_ds_size:
            mkdir_if_not_exist([new_data_dir, 'valid', img_label])  # the list holds the path components; creates valid/<label>
            # Copy the file from the train folder into the img_label subfolder of train_valid_test/valid
            shutil.copy(os.path.join(data_dir, train_dir, file),
                        os.path.join(new_data_dir, 'valid', img_label))
        else:
            mkdir_if_not_exist([new_data_dir, 'train', img_label])
            shutil.copy(os.path.join(data_dir, train_dir, file),
                        os.path.join(new_data_dir, 'train', img_label))
        mkdir_if_not_exist([new_data_dir, 'train_valid', img_label])
        shutil.copy(os.path.join(data_dir, train_dir, file),
                    os.path.join(new_data_dir, 'train_valid', img_label))

    # Test set
    mkdir_if_not_exist([new_data_dir, 'test', 'unknown'])
    for test_file in os.listdir(os.path.join(data_dir, test_dir)):  # iterate over the .jpg files in the original test folder
        shutil.copy(os.path.join(data_dir, test_dir, test_file),
                    os.path.join(new_data_dir, 'test', 'unknown'))  # copy into the new folder test/unknown

reorg_dog_data(data_dir, label_file, train_dir, test_dir, new_data_dir, valid_ratio)
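
After the copy step it can be worth checking that each split folder holds what is expected. The following is a minimal sketch of such a check; the helper name `count_images` is hypothetical, and the demo runs on a tiny made-up folder tree in a temporary directory rather than on the real data.

```python
import os
import tempfile

def count_images(split_dir):
    """Return {class_name: number_of_files} for one split folder."""
    counts = {}
    for class_name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_name)
        if os.path.isdir(class_dir):
            counts[class_name] = len(os.listdir(class_dir))
    return counts

# Tiny stand-in for a folder such as ./train_valid_test/train.
with tempfile.TemporaryDirectory() as root:
    for class_name, n_files in [("affenpinscher", 2), ("afghan_hound", 1)]:
        os.makedirs(os.path.join(root, class_name))
        for i in range(n_files):
            open(os.path.join(root, class_name, "img%d.jpg" % i), "w").close()
    demo_counts = count_images(root)

print(demo_counts, sum(demo_counts.values()))
```

On the real data, the totals for train and valid should add up to the total for train_valid (10,222 images), and test/unknown should hold 10,357 images.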

Image augmentation

transform_train = transforms.Compose([
    # Randomly crop a region covering 0.08 to 1 times the area of the original image, with an aspect ratio between 3/4 and 4/3, then resize it to 224 x 224 pixels
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0),
                                 ratio=(3.0/4.0, 4.0/3.0)),
    # Flip horizontally with probability 0.5
    transforms.RandomHorizontalFlip(),
    # Randomly change brightness, contrast, and saturation
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
    # Standardize each channel; (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225) are the per-channel means and standard deviations computed on ImageNet
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# At test time, only deterministic preprocessing is applied
transform_test = transforms.Compose([
    transforms.Resize(256),
    # Crop the central 224 x 224 square region of the image
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
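
To make the normalization step concrete: ToTensor scales 8-bit pixel values to the range [0, 1], and Normalize then computes (x - mean) / std for each channel. The sketch below reproduces that arithmetic in plain Python for a single pixel; the helper `normalize_pixel` is a hypothetical name and is not part of torchvision.

```python
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Apply (x - mean) / std to one RGB pixel with 0-255 channel values."""
    scaled = [v / 255.0 for v in rgb]  # the [0, 1] scaling ToTensor applies to 8-bit images
    return [(x - m) / s for x, m, s in zip(scaled, mean, std)]

white = normalize_pixel([255, 255, 255])
black = normalize_pixel([0, 0, 0])
print([round(v, 3) for v in white])
print([round(v, 3) for v in black])
```

A white pixel maps to positive values and a black pixel to negative ones, so the network sees inputs roughly centered around zero, matching the statistics the pre-trained model was trained with.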

Read data

The data is read with torchvision.datasets.ImageFolder, which treats each subfolder of the root directory as one class.

# new_data_dir contains four directories: train, valid, train_valid, test
# In each of the four, every subdirectory represents one category and holds all images of that category
train_ds = torchvision.datasets.ImageFolder(root=os.path.join(new_data_dir, 'train'),
                                            transform=transform_train)
valid_ds = torchvision.datasets.ImageFolder(root=os.path.join(new_data_dir, 'valid'),
                                            transform=transform_test)
train_valid_ds = torchvision.datasets.ImageFolder(root=os.path.join(new_data_dir, 'train_valid'),
                                            transform=transform_train)
test_ds = torchvision.datasets.ImageFolder(root=os.path.join(new_data_dir, 'test'),
                                            transform=transform_test)
batch_size = 128
train_iter = torch.utils.data.DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_iter = torch.utils.data.DataLoader(valid_ds, batch_size=batch_size, shuffle=True)
train_valid_iter = torch.utils.data.DataLoader(train_valid_ds, batch_size=batch_size, shuffle=True)
test_iter = torch.utils.data.DataLoader(test_ds, batch_size=batch_size, shuffle=False)  # keep the dataset order so predictions can be matched to image ids
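
ImageFolder derives the integer labels from the folder names: the class subfolders are sorted by name and numbered from 0. The sketch below mimics that mapping with the standard library only; `find_classes` here is a stand-alone illustration written for this note, and the three folder names are made up.

```python
import os
import tempfile

def find_classes(root):
    """Sorted subfolder names become class indices 0..N-1."""
    classes = sorted(e.name for e in os.scandir(root) if e.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    return classes, class_to_idx

with tempfile.TemporaryDirectory() as root:
    for name in ["beagle", "affenpinscher", "afghan_hound"]:  # created out of order
        os.makedirs(os.path.join(root, name))
    classes, class_to_idx = find_classes(root)

print(classes)
print(class_to_idx)
```

Because the mapping depends only on the folder names, train, valid, and train_valid share the same label indices as long as all 120 breed folders exist in each of them.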

Define the model

The data for this competition is a subset of the ImageNet data set, so we use fine-tuning: a model pre-trained on the complete ImageNet data set extracts image features, which then serve as the input of a small custom output network.

Here we use a pre-trained ResNet-34 model. We reuse the input to its output layer, that is, the extracted features, and replace the output layer with a newly defined one. Only the parameters of the new output layer are trained; the parameters of the feature extraction part keep their pre-trained values.


def get_net(device):
    finetune_net = models.resnet34(pretrained=False)  # build ResNet-34 without downloading weights
    finetune_net.load_state_dict(torch.load('/home/kesci/input/resnet347742/resnet34-333f7ec4.pth'))  # load the pre-trained weights from a local file
    for param in finetune_net.parameters():  # freeze the parameters
        param.requires_grad = False
    # The original finetune_net.fc is a fully connected layer with 512 inputs and 1000 outputs
    # Replace it; the parameters of the new finetune_net.fc record gradients
    finetune_net.fc = nn.Sequential(
        nn.Linear(in_features=512, out_features=256),
        nn.ReLU(),
        nn.Linear(in_features=256, out_features=120)  # 120 is the number of output categories
    )
    return finetune_net
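
The freeze-then-replace pattern used in get_net can be tried out without downloading any weights. The following is a minimal sketch on a tiny stand-in network: `TinyNet` and its layer sizes are hypothetical and only play the role of the pre-trained ResNet-34.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical stand-in: 'features' plays the pre-trained part, 'fc' the output layer."""
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 4)
        self.fc = nn.Linear(4, 3)

    def forward(self, x):
        return self.fc(torch.relu(self.features(x)))

torch.manual_seed(0)
net = TinyNet()
for p in net.parameters():
    p.requires_grad = False      # freeze everything that exists so far
net.fc = nn.Linear(4, 2)         # a newly created layer has requires_grad=True by default

trainable = [name for name, p in net.named_parameters() if p.requires_grad]
print(trainable)

# One SGD step on the output layer only; the frozen weights must stay unchanged.
before = net.features.weight.clone()
optimizer = torch.optim.SGD(net.fc.parameters(), lr=0.1)
loss = nn.CrossEntropyLoss()(net(torch.randn(5, 8)), torch.tensor([0, 1, 0, 1, 1]))
optimizer.zero_grad()
loss.backward()
optimizer.step()
frozen_unchanged = torch.equal(before, net.features.weight)
print(frozen_unchanged)
```

Only the parameters of the replaced layer show up as trainable, which is why passing net.fc.parameters() to the optimizer is sufficient.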

Define the training function

def evaluate_loss_acc(data_iter, net, device):
    # Compute the average loss and accuracy on data_iter
    loss = nn.CrossEntropyLoss()
    is_training = net.training  # bool: whether net is in train mode
    net.eval()
    l_sum, acc_sum, n = 0, 0, 0
    with torch.no_grad():
        for X, y in data_iter:
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l_sum += l.item() * y.shape[0]
            acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
    net.train(is_training)  # restore the train/eval state of net
    return l_sum / n, acc_sum / n
def train(net, train_iter, valid_iter, num_epochs, lr, wd, device, lr_period,
          lr_decay):
    loss = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.fc.parameters(), lr=lr, momentum=0.9, weight_decay=wd)
    net = net.to(device)
    for epoch in range(num_epochs):
        train_l_sum, n, start = 0.0, 0, time.time()
        if epoch > 0 and epoch % lr_period == 0:  # decay the learning rate once every lr_period epochs
            lr = lr * lr_decay
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
        for X, y in train_iter:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            train_l_sum += l.item() * y.shape[0]
            n += y.shape[0]
        time_s = "time %.2f sec" % (time.time() - start)
        if valid_iter is not None:
            valid_loss, valid_acc = evaluate_loss_acc(valid_iter, net, device)
            epoch_s = ("epoch %d, train loss %f, valid loss %f, valid acc %f, "
                       % (epoch + 1, train_l_sum / n, valid_loss, valid_acc))
        else:
            epoch_s = ("epoch %d, train loss %f, "
                       % (epoch + 1, train_l_sum / n))
        print(epoch_s + time_s + ', lr ' + str(lr))
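
Both functions above accumulate `l.item() * y.shape[0]` rather than `l.item()` alone. nn.CrossEntropyLoss averages over the batch by default, so multiplying by the batch size recovers the summed loss, and dividing by the total sample count at the end gives a true per-sample mean even when the last batch is smaller. A small worked example with made-up loss values:

```python
# (mean loss of the batch, number of samples in the batch): made-up values
batches = [(2.0, 128), (1.0, 128), (4.0, 64)]

l_sum, n = 0.0, 0
for batch_mean_loss, batch_size in batches:
    l_sum += batch_mean_loss * batch_size  # undo the per-batch averaging
    n += batch_size

per_sample_mean = l_sum / n
naive_mean = sum(loss for loss, _ in batches) / len(batches)
print(per_sample_mean, naive_mean)
```

The naive average of batch means over-weights the smaller final batch, which is what the weighted accumulation avoids.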

Tuning

num_epochs, lr_period, lr_decay = 20, 10, 0.1
lr, wd = 0.03, 1e-4
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
net = get_net(device)
train(net, train_iter, valid_iter, num_epochs, lr, wd, device, lr_period, lr_decay)
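
With these settings the learning rate follows a step schedule. The sketch below replays only the decay rule from the train function to list the rate used in each epoch; `lr_schedule` is a hypothetical helper written for this illustration.

```python
def lr_schedule(lr, num_epochs, lr_period, lr_decay):
    """Return the learning rate used in each epoch under step decay."""
    rates = []
    for epoch in range(num_epochs):
        if epoch > 0 and epoch % lr_period == 0:
            lr = lr * lr_decay
        rates.append(lr)
    return rates

rates = lr_schedule(lr=0.03, num_epochs=20, lr_period=10, lr_decay=0.1)
print(rates[0], rates[9], rates[10], rates[19])
```

The first 10 epochs run at 0.03 and the remaining 10 at roughly 0.003, so the rate is decayed exactly once over the 20 epochs.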

Train the model on the complete data set

# With the parameter settings above, training on the complete data set takes roughly 40-50 minutes
net = get_net(device)
train(net, train_valid_iter, None, num_epochs, lr, wd, device, lr_period, lr_decay)

Classify the test set and submit the results

Use the trained model to predict on the test data. The competition requires, for each image in the test set, the predicted probability of it belonging to each category.
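
The network outputs raw scores, and softmax turns them into probabilities that sum to 1. As a minimal sketch of that conversion in plain Python (the three scores are made up, and `softmax` here is a stand-alone helper rather than the torch function):

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three classes
print([round(p, 3) for p in probs], sum(probs))
```

The ordering of the scores is preserved, so the most likely class is the same before and after the conversion.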

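preds = []
net.eval()  # use the stored BatchNorm statistics at inference time
with torch.no_grad():  # gradients are not needed for prediction
    for X, _ in test_iter:
        X = X.to(device)
        output = net(X)
        output = torch.softmax(output, dim=1)
        preds += output.tolist()  # convert the tensor to a nested Python list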
ids = sorted(os.listdir(os.path.join(new_data_dir, 'test/unknown')))
with open('submission.csv', 'w') as f:
    f.write('id,' + ','.join(train_valid_ds.classes) + '\n')
    for i, output in zip(ids, preds):
        f.write(i.split('.')[0] + ',' + ','.join(
            [str(num) for num in output]) + '\n')
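
The same file layout can also be produced with the standard csv module, which takes care of separators and line endings. This is a sketch under assumptions: `write_submission` is a hypothetical helper, and the ids, class names, and probabilities are made up to show the layout only.

```python
import csv
import os
import tempfile

def write_submission(path, ids, classes, preds):
    """Write one header row, then one row of probabilities per image id."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id"] + list(classes))
        for img_id, probs in zip(ids, preds):
            writer.writerow([img_id] + list(probs))

# Made-up ids, classes, and probabilities.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "submission.csv")
    write_submission(out_path, ["img_a", "img_b"],
                     ["affenpinscher", "afghan_hound"],
                     [[0.9, 0.1], [0.25, 0.75]])
    with open(out_path, newline="") as f:
        rows = list(csv.reader(f))

print(rows)
```

In the real run, ids would be the sorted file names without extension, classes would be train_valid_ds.classes, and preds the list built in the loop above.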
  • Practice questions

1. Regarding the train, valid, train_valid, and test data sets obtained after reorganizing the data, which of the following statements is wrong? (b)

  • a. After finding a suitable set of hyperparameters, use train_valid to retrain the network

  • b. You can use the train_valid data set to train the model, and adjust the hyperparameters by observing the loss and accuracy on the test data set

  • c. You can use the train data set to train the model, and adjust the hyperparameters by observing the loss and accuracy on the valid data set

  • d. The sample proportions of the corresponding categories in train and valid should be similar

Option a: Correct. train_valid contains all labeled data, so after the hyperparameters are fixed you can retrain on train_valid.

Option b: Incorrect. The test data set does not contain the labels of its samples, so loss and accuracy cannot be computed on it.

Option d: Correct. The data distribution should be as consistent as possible across the parts when splitting the data set.
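
The reorganization code earlier in these notes splits by a plain random shuffle, which keeps class proportions similar only on average. A stratified split enforces it per class. The following is a minimal sketch of that alternative, not the method used above; `stratified_split` is a hypothetical helper and the labels are made up.

```python
import random
from collections import defaultdict

def stratified_split(id2label, valid_ratio, seed=0):
    """Split image ids so each class sends about valid_ratio of its images to valid."""
    by_label = defaultdict(list)
    for img_id, label in id2label.items():
        by_label[label].append(img_id)
    rng = random.Random(seed)
    train_ids, valid_ids = [], []
    for label in sorted(by_label):
        ids = sorted(by_label[label])
        rng.shuffle(ids)
        n_valid = max(1, int(len(ids) * valid_ratio))  # at least one image per class
        valid_ids += ids[:n_valid]
        train_ids += ids[n_valid:]
    return train_ids, valid_ids

# Made-up labels: 20 images of one breed, 10 of another.
demo = {"a%02d" % i: "affenpinscher" for i in range(20)}
demo.update({"b%02d" % i: "afghan_hound" for i in range(10)})
train_ids, valid_ids = stratified_split(demo, valid_ratio=0.1)
print(len(train_ids), len(valid_ids))
```

Note that the `max(1, ...)` choice guarantees every class appears in valid, at the cost of leaving a class with a single image out of train entirely.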

2. Regarding fine-tuning the pre-trained ResNet-34 model for image classification, which of the following statements is wrong? (c)

  • a. The image categories have changed, so the output layer needs to be replaced

  • b. Since we do not want to change the parameters of the feature extraction part of the model, we can set requires_grad = False on the parameters of this part

  • c. If requires_grad = False is not set on the parameters of the feature extraction part of the model, the model cannot be trained

  • d. When defining the optimizer, you only need to pass in the model parameters of the output layer

Option b: Correct. These parameters do not take part in training; setting requires_grad = False freezes them.

Option c: Incorrect. The model can still be trained in this case.

Option d: Correct. Only the model parameters of the output layer need to be trained.

Supplement:

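  • Python shutil.copy() and shutil.copyfile() usage

shutil.copyfile(src, dst): copies the contents of the file (without metadata) from src to dst. dst must be the complete target file name. If src and dst are the same file, shutil.Error is raised (shutil.SameFileError, a subclass, in current Python 3 versions). dst must be writable, otherwise an IOError (an alias of OSError in Python 3) is raised. If dst already exists, it is replaced. Special files such as character or block devices and pipes cannot be copied with this function, because copyfile opens and reads the file. src and dst are path names given as strings. shutil.copy(src, dst), which the code above uses, works the same way but also accepts a directory as dst, in which case the file keeps its base name, and it additionally copies the permission bits.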
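
The practical difference between the two functions shows up in what may be passed as dst. A short sketch on throwaway files in a temporary directory (the file and folder names are made up):

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "a.jpg")
    with open(src, "w") as f:
        f.write("fake image bytes")
    dst_dir = os.path.join(tmp, "valid", "affenpinscher")
    os.makedirs(dst_dir)

    # shutil.copy accepts a directory and keeps the source file name.
    copied_path = shutil.copy(src, dst_dir)
    copied_name = os.path.basename(copied_path)

    # shutil.copyfile needs the full target file name.
    target = os.path.join(dst_dir, "b.jpg")
    shutil.copyfile(src, target)
    files_in_dst = sorted(os.listdir(dst_dir))

print(copied_name, files_in_dst)
```

This is why the reorganization code can pass a class folder as the destination and still end up with files named after their image ids.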

  • Python has two join functions, str.join() and os.path.join():

    str.join(): joins the elements of a sequence of strings (such as a list or tuple) with a given separator to produce a new string
    os.path.join(): combines multiple path components into one path and returns it

  • join() function

Syntax: 'sep'.join(seq)

Parameter description
sep: the separator. Can be empty
seq: the sequence of elements to be joined, for example a string, tuple, or dictionary; all elements must be strings

The syntax above uses sep as a separator to merge all elements of seq into a new string.

Return value: return a string generated by connecting each element with the separator sep
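
A few small cases, written in the style of the submission code in these notes (the class names and probabilities are made up):

```python
classes = ["affenpinscher", "afghan_hound"]  # made-up subset of class names
header = "id," + ",".join(classes)
row = ",".join(str(p) for p in [0.9, 0.1])   # numbers must be converted to str first
from_dict = "-".join({"a": 1, "b": 2})       # joining a dict joins its keys
print(header)
print(row)
print(from_dict)
```

Passing non-string elements directly raises a TypeError, which is why the submission code wraps each probability in str().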

  • os.path.join() function

Syntax: os.path.join(path1[,path2[,…]])

Return value: the combined path

Note: if a component is an absolute path, all components before it are discarded
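
The sketch below shows both behaviours. It uses posixpath, the POSIX implementation behind os.path on Linux and macOS, so that the printed result is the same on every operating system; the path components are made up apart from new_data_dir.

```python
import posixpath  # POSIX flavour of os.path

new_data_dir = "./train_valid_test"
p1 = posixpath.join(new_data_dir, "valid", "affenpinscher")
p2 = posixpath.join("ignored", "/abs", "file.jpg")       # restarts at the absolute component
p3 = posixpath.join(*[new_data_dir, "test", "unknown"])  # a list unpacked with *
print(p1, p2, p3)
```

The third call mirrors mkdir_if_not_exist, which receives the path components as a list and unpacks them with *.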


Origin blog.csdn.net/weixin_43901214/article/details/105520476