Image classification with PyTorch

Overview:

This article shows how to organize your own training data, train a model with the PyTorch deep learning framework, and finally perform your own image classification. Recognizing balconies is used as the running example.

1. Data preparation

Deep learning is built on data, and image classification is no exception. First, a crawler was used to collect 1,200 balcony pictures and 1,200 non-balcony pictures. The pictures are named 0.jpg through 2399.jpg and placed in a single folder named image (as shown in Figure 1 below).

Figure 1

For convenience, the crawler code for Baidu Images is included here; it can crawl pictures for any keyword:

import requests
import os
import urllib


class Spider_baidu_image():
    def __init__(self):
        self.url = 'http://image.baidu.com/search/acjson?'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/69.0.3497.81 Safari/537.36'}
        self.headers_image = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/69.0.3497.81 Safari/537.36',
            'Referer': 'http://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1557124645631_R&pv=&ic=&nc=1&z=&hd=1&latest=0&copyright=0&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&sid=&word=%E8%83%A1%E6%AD%8C'}
        self.keyword = input("Enter the image search keyword: ")
        self.paginator = int(input("Enter the number of pages to fetch (30 images per page): "))

    def get_param(self):
        """
        Build the query-parameter string for each results page and return them in a list.
        :return:
        """
        keyword = urllib.parse.quote(self.keyword)
        params = []
        for i in range(1, self.paginator + 1):
            params.append(
                'tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=1&latest=0&copyright=0&word={}&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&cg=star&pn={}&rn=30&gsm=78&1557125391211='.format(
                    keyword, keyword, 30 * i))
        return params

    def get_urls(self, params):
        """
        Join the base URL with each parameter string and return the full page URLs in a list.
        :return:
        """
        urls = []
        for i in params:
            urls.append(self.url + i)
        return urls

    def get_image_url(self, urls):
        """
        Request each results page and collect the thumbnail URL of every image.
        """
        image_url = []
        for url in urls:
            json_data = requests.get(url, headers=self.headers).json()
            json_data = json_data.get('data')
            for i in json_data:
                if i and i.get('thumbURL'):
                    image_url.append(i.get('thumbURL'))
        return image_url

    def get_image(self, image_url):
        """
        Create a folder named after the search keyword under the current
        directory, then save every image into it.
        :param image_url:
        :return:
        """
        cwd = os.getcwd()
        file_name = os.path.join(cwd, self.keyword)
        if not os.path.exists(file_name):
            os.mkdir(file_name)
        for index, url in enumerate(image_url, start=1):
            with open(os.path.join(file_name, '{}.jpg'.format(index)), 'wb') as f:
                f.write(requests.get(url, headers=self.headers_image).content)
            if index % 30 == 0:
                print('"{}": page {} downloaded'.format(self.keyword, index // 30))

    def __call__(self, *args, **kwargs):
        params = self.get_param()
        urls = self.get_urls(params)
        image_url = self.get_image_url(urls)
        self.get_image(image_url)


if __name__ == '__main__':
    spider = Spider_baidu_image()
    spider()

Each picture needs a corresponding label, so in a txt file each image name is followed by its tag: 1 if it is a balcony, 0 if it is not. The 2,400 pictures are split across two txt files, train.txt for the training set and val.txt for the validation set (as shown in Figures 2 and 3 below).

Figure 2

 

Figure 3
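The post does not show the script that produced the two label files, but a short sketch along these lines can generate them; the assumption that the balconies occupy 0.jpg through 1199.jpg is purely illustrative:

import random

# assumed layout: 0.jpg-1199.jpg are balconies (label 1), the rest are not (label 0)
samples = ['{}.jpg 1'.format(i) for i in range(1200)] + \
          ['{}.jpg 0'.format(i) for i in range(1200, 2400)]
random.shuffle(samples)

split = 2000  # 2000 training samples, 400 validation samples
with open('train.txt', 'w') as f:
    f.write('\n'.join(samples[:split]) + '\n')
with open('val.txt', 'w') as f:
    f.write('\n'.join(samples[split:]) + '\n')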

 

Looking through the crawled pictures, there are all kinds of balconies: some are semi-open, some are closed, and some are even mixed with other recognizable objects such as flowers and grass. The image sizes are also inconsistent: some are vertical rectangles, some horizontal, but in the end we need squares of a reasonable size. So we use the Resize transform, which scales the shorter edge of each image to 84 pixels. Besides scaling, the data needs further preprocessing:

torchvision.transforms is PyTorch's image preprocessing package. Compose is generally used to chain multiple preprocessing steps together, for example:

transforms.Compose([
    transforms.CenterCrop(84),
    transforms.ToTensor(),
])

This puts the two steps together.

CenterCrop crops the image around its center; the target here is a square with a side length of 84, which is convenient for the subsequent calculations. RandomCrop does the same kind of crop, but at a random position.

ToTensor() reads the image pixels and converts them to floating-point values in the range 0 to 1 (a form of normalization).

The code is as follows:

data_transforms = {
    'train': transforms.Compose([
        transforms.Resize(84),
        transforms.CenterCrop(84),
        # convert to a tensor
        transforms.ToTensor(),
        # normalize the image;
        # [0.485, 0.456, 0.406] / [0.229, 0.224, 0.225] are the RGB channel means and standard deviations
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(84),
        transforms.CenterCrop(84),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
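As a quick sanity check (the path image/0.jpg is only an assumed example), applying the pipeline to any PIL image should yield a normalized 3×84×84 float tensor:

from PIL import Image

img = Image.open('image/0.jpg').convert('RGB')   # assumed example path
tensor = data_transforms['train'](img)
print(tensor.shape)   # torch.Size([3, 84, 84])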

With the image preprocessing settled, the next thing to solve before training the network model is reading in the image data. PyTorch reads image data through a custom Dataset combined with a DataLoader. The code is as follows:

class my_Data_Set(Dataset):
    def __init__(self, txt, transform=None, target_transform=None, loader=None):
        super(my_Data_Set, self).__init__()
        images = []
        labels = []
        # read the txt file that stores each image name with its label
        with open(txt, 'r') as fp:
            for line in fp:
                line = line.strip()
                if not line:
                    continue
                information = line.split()
                images.append(information[0])
                labels.append(int(information[1]))
        self.images = images
        self.labels = labels
        self.transform = transform
        self.target_transform = target_transform
        self.loader = loader

    # called by the DataLoader to read one image and its label
    def __getitem__(self, item):
        imageName = self.images[item]
        label = self.labels[item]
        # load the image from disk
        image = self.loader(imageName)
        # apply the preprocessing transforms
        if self.transform is not None:
            image = self.transform(image)
        return image, label

    # returns the number of samples in the dataset
    def __len__(self):
        return len(self.images)


# Wrap the datasets in the DataLoader input format PyTorch expects
# (the Load_Image_Information loader passed in here is defined in the complete code in Section 3)
train_dataset = my_Data_Set('train.txt', transform=data_transforms['train'], loader=Load_Image_Information)
test_dataset = my_Data_Set('val.txt', transform=data_transforms['val'], loader=Load_Image_Information)
train_loader = DataLoader(train_dataset, batch_size=10, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=10, shuffle=True)

We can check that the DataLoaders yield data as expected:

# check that the DataLoaders yield data
for data in train_loader:
    inputs, labels = data
    print(inputs)
    print(labels)
for data in test_loader:
    inputs, labels = data
    print(inputs)
    print(labels)
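Printing whole tensors is noisy; a lighter check (not in the original post) is to pull a single batch and inspect its shape:

inputs, labels = next(iter(train_loader))
print(inputs.shape)   # torch.Size([10, 3, 84, 84]) with batch_size=10
print(labels)         # a tensor of ten 0/1 labels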

2. Define a convolutional neural network

A convolutional neural network (CNN) is a typical multi-layer neural network that excels at image-related machine learning problems, especially those involving large images. Through a series of operations, a CNN successfully reduces the dimensionality of an image recognition problem with a huge amount of data, until the problem can finally be trained on. The CNN was first proposed by Yann LeCun and applied to handwriting recognition.

A typical CNN architecture is shown in Figure 4:

Figure 4

First, import the required Python libraries:

import torch
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import os
from PIL import Image
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")

plt.ion()

Define the convolutional neural network:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 3 input channels (RGB), 6 output feature maps, 5x5 kernels
        self.conv1 = nn.Conv2d(3, 6, 5)
        # 2x2 max pooling halves the width and height
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # after two conv+pool stages an 84x84 input becomes 16 maps of 18x18
        self.fc1 = nn.Linear(16 * 18 * 18, 800)
        self.fc2 = nn.Linear(800, 120)
        # two output classes: balcony / not balcony
        self.fc3 = nn.Linear(120, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # flatten the feature maps for the fully connected layers
        x = x.view(-1, 16 * 18 * 18)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

We first define a Net class, which encapsulates all the operations of the network: convolution, pooling, activation, and the fully connected layers.

The __init__ function defines all the layers that will later be called in forward. Starting with conv1: it defines a convolutional layer. The 3 is the number of layers in the input image's pixel array, generally the number of channels of the input image; the images used here are color images composed of R, G, and B channels, hence 3. The 6 means we perform six convolutions, each of which produces a different feature map used to extract one kind of feature from the image; the resulting feature maps are stacked together to form the output, which then serves as the input of the next step. The 5 is the size of the filter, meaning a 5×5 kernel is multiplied element-wise with each 5×5 patch of the image and summed into a single value.

After the convolutional layer we define the pooling layer. What pooling does is, simply put, dimensionality reduction: the pixel matrix generated from a large image is too big, so we need a reasonable way to shrink it without losing the object's characteristics. With 2×2 max pooling, every 2×2 block of elements is replaced by a single element (its maximum), so the image shrinks to a quarter of its original size.

On the next line we encounter another convolutional layer, conv2. Like conv1, its input and output are both multi-layer pixel arrays, but this time the computation is larger. Its parameters are 6, 16, and 5. The input has 6 layers because conv1 outputs 6 layers; 16 is the number of conv2's output layers, meaning this convolution learns 16 kinds of feature mappings of the picture (in theory, the more features learned, the better the eventual effect). The filter size is the same 5×5 as in conv1, so it is not repeated here.

For fc1, the 16 is easy to understand: the feature stack produced by the last convolution is 16 layers deep. We crop the training images to 84×84 squares, so the network's initial input is a 3×84×84 array. After the first 5×5 convolution the result is a 6×80×80 matrix: 80 rather than 84 because a 5×5 filter has only 84 − 5 + 1 = 80 valid positions along each axis (its center travels from index 2 to index 81, not from 0 to 83). After one pooling layer the width and height are halved, so the size becomes 40×40. Another convolution again reduces the length and width by 4, to 36×36, and the last pooling layer halves that to 18×18. So the input size of the first fully connected layer is 16 × 18 × 18. The three fully connected layers then do very similar things: they keep transforming the features until the final binary classification value is output.
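This arithmetic can be checked directly (a small verification, not part of the original post) by pushing a dummy batch through the convolutional stages and printing the shapes:

x = torch.zeros(1, 3, 84, 84)        # one dummy RGB image of size 84x84
x = net.pool(F.relu(net.conv1(x)))   # conv: 84-5+1=80, pool: 80/2=40
print(x.shape)                       # torch.Size([1, 6, 40, 40])
x = net.pool(F.relu(net.conv2(x)))   # conv: 40-5+1=36, pool: 36/2=18
print(x.shape)                       # torch.Size([1, 16, 18, 18])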

The forward function of the Net class describes the entire forward computation: it accepts an input and returns the network's output, and every intermediate step is a call to one of the layers defined in __init__.

F.relu is an activation function that sets all negative values to zero and leaves positive values unchanged (for example, F.relu(torch.tensor([-1.0, 0.0, 2.5])) returns tensor([0., 0., 2.5])). The last critical step in this image recognition task is the actual training loop.

# training
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.5)
for epoch in range(50):
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        optimizer.zero_grad()                    # reset the gradients
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()                         # update the weights
        running_loss += loss.item()
        # report the average loss of every 200 mini-batches
        if i % 200 == 199:
            print('[%d %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0

print('Finished training!')

Here we train for 50 epochs. Each epoch fetches the training data from train_loader in batches, clears the gradients, computes the network outputs, computes the loss, backpropagates, and updates the model. The average loss over every 200 mini-batches is printed as the observed value.
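One optional step the original post does not include: saving the trained weights, so the model can be reloaded later without retraining (the file name is an arbitrary choice):

# persist the learned parameters
torch.save(net.state_dict(), 'balcony_net.pth')
# later: net.load_state_dict(torch.load('balcony_net.pth'))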

Then comes the test phase:

# testing
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print('Accuracy of the network on the 400 test images: %d %%' % (100 * correct / total))

Finally, this prints the recognition accuracy.
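To classify a new picture with the trained network, a minimal inference sketch looks like this (the file name is a hypothetical example):

net.eval()  # switch to evaluation mode
img = Image.open('new_picture.jpg').convert('RGB')   # hypothetical path
x = data_transforms['val'](img).unsqueeze(0)         # add a batch dimension
with torch.no_grad():
    pred = net(x).argmax(dim=1).item()
print('balcony' if pred == 1 else 'not a balcony')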

3. The complete code is as follows:

import torch
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import os
from PIL import Image
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")

plt.ion()



data_transforms = {
    'train': transforms.Compose([
        transforms.Resize(84),
        transforms.CenterCrop(84),
        # convert to a tensor
        transforms.ToTensor(),
        # normalize the image;
        # [0.485, 0.456, 0.406] / [0.229, 0.224, 0.225] are the RGB channel means and standard deviations
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(84),
        transforms.CenterCrop(84),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
def Load_Image_Information(path):
    # root directory where the crawled images are stored
    image_Root_Dir = r'C:/Users/wbl/Desktop/pythonProject1/image/'
    # build the full path of the image
    image_Dir = os.path.join(image_Root_Dir, path)
    # open the image in RGB format
    # (PIL images are the input format the transforms and Dataset expect)
    return Image.open(image_Dir).convert('RGB')


class my_Data_Set(Dataset):
    def __init__(self, txt, transform=None, target_transform=None, loader=None):
        super(my_Data_Set, self).__init__()
        images = []
        labels = []
        # read the txt file that stores each image name with its label
        with open(txt, 'r') as fp:
            for line in fp:
                line = line.strip()
                if not line:
                    continue
                information = line.split()
                images.append(information[0])
                labels.append(int(information[1]))
        self.images = images
        self.labels = labels
        self.transform = transform
        self.target_transform = target_transform
        self.loader = loader

    # called by the DataLoader to read one image and its label
    def __getitem__(self, item):
        imageName = self.images[item]
        label = self.labels[item]
        # load the image from disk
        image = self.loader(imageName)
        # apply the preprocessing transforms
        if self.transform is not None:
            image = self.transform(image)
        return image, label

    # returns the number of samples in the dataset
    def __len__(self):
        return len(self.images)


# Wrap the datasets in the DataLoader input format PyTorch expects
train_dataset = my_Data_Set('train.txt', transform=data_transforms['train'], loader=Load_Image_Information)
test_dataset = my_Data_Set('val.txt', transform=data_transforms['val'], loader=Load_Image_Information)
train_loader = DataLoader(train_dataset, batch_size=10, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=10, shuffle=True)

'''
# check that the DataLoaders yield data
for data in train_loader:
    inputs, labels = data
    print(inputs)
    print(labels)
for data in test_loader:
    inputs, labels = data
    print(inputs)
    print(labels)

'''

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 3 input channels (RGB), 6 output feature maps, 5x5 kernels
        self.conv1 = nn.Conv2d(3, 6, 5)
        # 2x2 max pooling halves the width and height
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # after two conv+pool stages an 84x84 input becomes 16 maps of 18x18
        self.fc1 = nn.Linear(16 * 18 * 18, 800)
        self.fc2 = nn.Linear(800, 120)
        # two output classes: balcony / not balcony
        self.fc3 = nn.Linear(120, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # flatten the feature maps for the fully connected layers
        x = x.view(-1, 16 * 18 * 18)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

# training
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.5)
for epoch in range(50):
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        optimizer.zero_grad()                    # reset the gradients
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()                         # update the weights
        running_loss += loss.item()
        # report the average loss of every 200 mini-batches
        if i % 200 == 199:
            print('[%d %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0

print('Finished training!')

# testing
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print('Accuracy of the network on the 400 test images: %d %%' % (100 * correct / total))

Criticism and corrections are welcome~

Origin blog.csdn.net/weixin_52188227/article/details/124511438