[Deep Vision] Chapter 2: Data for Convolutional Networks

4. Data of convolutional network

In the previous series, we covered the fully connected neural network (DNN) in detail under the PyTorch framework. In this series, we move on to convolutional neural networks (CNN, Convolutional Neural Networks).

In the last chapter, I included a picture from Teacher Lu Peng's courseware that showed in detail the fields related to computer vision. Clearly, this is an interdisciplinary subject, so feel free to expand your knowledge domain: for example, you should know more or less about camera performance, imaging principles, how image data is generated and acquired, video special effects, 3D, image restoration, image segmentation, recognition, geometry, optics, signal processing, and other technologies.

1. Data structure of convolutional network
Let's start with the data. When I talked about DNN earlier, I always emphasized that you have to convert your data into two-dimensional tabular data, that is, data with rows, columns, and labels. The rows represent samples and the columns represent features: one sample is one row of data, and multiple samples are multiple rows. Only with this kind of data can you package it and feed it into the DNN in mini-batches for training. In other words, DNN suits one- and two-dimensional data: one dimension is a single sample, two dimensions are multiple samples.

The CNN we are talking about now matches three-dimensional and four-dimensional data structures. That is to say, whatever data you have, you must first convert it into three or four dimensions before you can use a convolutional neural network. Image data is inherently three-dimensional, and CNNs were born to process images; we will come back to this when we talk about the architecture. Therefore, CNN data is generally image data. In other words, one CNN sample is one image.

If you want to feed a single picture to a CNN and let the data propagate forward through the network once, just to check that the data flow is connected, then the sample you feed should be three-dimensional: (channels, height, width).
If you want to feed the CNN a small batch of samples at a time, that is, several pictures at once to look at the result, then the input should be four-dimensional: (samples, channels, height, width).
If you want to feed the CNN mini-batch by mini-batch to train the network, we generally have to perform data preprocessing and data augmentation first, then package the samples and labels, and then split them into mini-batches before feeding them to the CNN for training. Data preprocessing and augmentation involve many details, and a dedicated chapter will cover them later. Small sketches of the first two input forms follow.
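As a concrete illustration, here is a minimal sketch of these two input forms in PyTorch; the channel count, image size, and batch size are just assumed example values.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

single = torch.rand(3, 28, 28)          # one sample: (channels, height, width)
batch = torch.rand(16, 3, 28, 28)       # a mini-batch: (samples, channels, height, width)

# For the forward pass, give the single sample a leading batch dimension so it becomes a batch of one:
out_single = conv(single.unsqueeze(0))  # -> (1, 8, 28, 28)
out_batch = conv(batch)                 # -> (16, 8, 28, 28)
print(out_single.shape, out_batch.shape)
```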

If your data is not image data at all and you still want to run it through a convolutional neural network, you can. Whatever method you use, as long as you finally transform the data into three or four dimensions, you can use a convolutional network. This involves many knowledge points, and I will demonstrate it under a separate heading later.

2. Concepts related to images.
Convolutional networks mainly process image data, so let’s start with images:

(1) What is an image?
Picture: the distribution of light reflected or transmitted by an object; it is an inherent characteristic of the object itself.
Image: the impression or understanding formed in the human brain from the picture received by the human visual system.

(2) Analog images and digital images
Analog images store data continuously; they are susceptible to noise interference and do not look very clear to the human eye.
Digital images store data in discrete levels, generally 256 levels, that is, 8 bits.
At present, analog images have basically all been replaced by digital images.

(3) Image representation method
Binary image: Each pixel is either white or black; a pixel can be represented by only one bit; each pixel is represented by 0 or 1.

Grayscale image: the image has only one color. For example, the image can be red, gray, blue, green, and so on, but whatever the color, there is only one. We divide that color into 256 levels, the 256 gray levels, which can be understood as 256 different degrees of lightness and darkness. For example, in a red grayscale image, pixel value = 0 is the darkest (black) and pixel value = 255 is the brightest (the brightest red). These 256 levels of brightness can be represented by 8 bits, that is, 1 byte.

Color image: the image is in color, and each pixel is a mixture of three colors, R, G, and B, each taking a value between 0 and 255. Each color is a channel, so color images generally have 3 channels. A few images have 4 channels because there is also a transparency value between 0 and 1.

This numerical representation of images is what allows us to perform image processing. For example, changing the value of a pixel changes how the image displays; changing the channel changes the color space of the image; slicing crops a specific area of the image; addition, subtraction, multiplication, division, bitwise operations, and so on perform numerical operations on images. A small sketch of these operations follows.
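A rough sketch using OpenCV and NumPy; the file name "cat.jpg" and the crop coordinates are just assumptions.

```python
import cv2
import numpy as np

img = cv2.imread("cat.jpg")                        # OpenCV loads a color image as (height, width, 3), BGR order
img[0:10, 0:10] = [0, 0, 255]                      # change pixel values: paint a 10x10 patch red (in BGR)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)       # change the color space: BGR -> grayscale
patch = img[100:200, 300:400]                      # slicing: crop a region of the image
brighter = cv2.add(img, np.full_like(img, 50))     # numerical operation: per-pixel addition (saturates at 255)
```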

3. Image data structure
The data structure of a binary image is a two-dimensional structure (rows, columns), that is, there is no channel dimension. The rows represent the height of the image and the columns its width, and every number is either 0 or 1. Of course, if you want to feed it to a CNN, you have to expand it to three dimensions: (1, rows, columns).

The data structure of a grayscale image is also two-dimensional, with no channel dimension. The rows again represent the height and the columns the width, but the numbers range from 0 to 255. In the same way, if you want to feed this kind of image into a CNN, you also have to expand it to three dimensions.

The data structure of a color image is generally three-dimensional: (height, width, channels). The first two values determine the size of the image, and the channels determine the color effect of the image.

img.shape returns (667, 1000, 3), indicating that this image has 667 rows, 1000 columns, and 3 channels: three planes of numbers, each with 667 rows and 1000 columns. That is, each channel has 667 pixels vertically and 1000 pixels horizontally; that is the size of this picture.
The 3 in the third dimension is the number of channels. The channel dimension is sometimes placed before the height and width. The number of channels determines the color of the image: a single channel is a grayscale picture, three or four channels are a color picture, and the channel count of image data generally takes only three values, 1, 3, or 4. The numbers inside each channel determine the outlines, lines, colors, edges, and other information of the image, which essentially determines what the image displays.

img.dtype returns 'uint8', indicating that the data type of img is uint8. Values of this type range from 0 to 255; they cannot go below 0 or above 255. For example, if a pixel has the value 100 and you add 200 to it, the result 300 wraps around to 300 - 256 = 44, so the pixel changes from 100 to 44. This data type has its own set of arithmetic rules, and those rules exist mainly to serve image processing. The sketch below shows these attributes and the wrap-around behavior.
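A small sketch of inspecting these attributes (the file name is again just an assumption):

```python
import cv2
import numpy as np

img = cv2.imread("cat.jpg")
print(img.shape, img.dtype)                  # e.g. (667, 1000, 3) uint8

gray = cv2.imread("cat.jpg", cv2.IMREAD_GRAYSCALE)
print(gray.shape)                            # e.g. (667, 1000): no channel dimension
print(np.expand_dims(gray, axis=0).shape)    # (1, 667, 1000): expanded to (channels, height, width) for a CNN

a = np.array([100], dtype=np.uint8)
print(a + 200)                               # [44]: 100 + 200 = 300 wraps around modulo 256 in NumPy
```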

4. Color space
Why does the number of channels take only those few values? This brings us to the concept of color space:

The world itself is colorless. The various colored lights we humans see are just electromagnetic waves of specific wavelengths that stimulate the cone cells of the human eye and then form color signals in the brain. The wavelength range of electromagnetic waves is actually very wide, and we humans can perceive only a small portion of it: the light-sensitive cells in our eyes pass the stimulation from electromagnetic waves of specific wavelengths to the brain, and the brain feeds back different colors. So for humans, the human color space is one particular space.

But for a printer, the color space is completely different from that of humans. For example, in the dark, the stronger the light, the more clearly humans can see colors; for a printer, more ink of a certain color means a darker result, and if several colors are piled up, what it produces is black.

Similarly, different monitors emit light by different physical principles. An LCD screen controls the color of each pixel by controlling voltage levels, while a CRT monitor relies on an electron beam striking the phosphor on the screen to produce the display. So different monitors, even when given the same signal data, will display differently.

In the same way, even if many different cameras built on different principles capture the same colors, the colors presented to our eyes differ. For example, military images, meteorological images, and pictures from the ordinary cameras we use every day all look different to the naked eye.

Therefore, we define color spaces so that the pictures the human eye sees, the pictures a printer prints, the pictures shown on different monitors, the pictures taken by different cameras, and the pictures displayed by different viewing software remain consistent for a human viewer, that is, so that the color accuracy of the same photo stays consistent: the scenery we see should match the scenery the printer prints, the scenery the monitor displays, and the scenery shown after transmission through viewing software such as WeChat. Consider the opposite case. Suppose a picture contains a cat, and in a certain color space a person sees it as a white cat. If the picture is sent to another person's phone via WeChat, the picture data has not changed, but the other phone displays it in a different color space; for example, that person may have switched the phone to eye-protection mode or night mode, which are different color spaces. Then the cat may well appear to that person as a cat of another color. When picture information passes between people, an inconsistency in the color space of the medium (here, WeChat and the phones) means the two people receive inconsistent information, that is, the color accuracy has changed. So we have to define color spaces, and the different color spaces need to be convertible into one another.

Here are several common color spaces:
(1) The RGB color space is the basic color space commonly used in computers. R, G, and B represent the three basic colors red, green, and blue.

(2) The RGBA color space has four channels: (red, green, blue, alpha). The transparency value alpha ranges between 0 and 1. When transparency is added to an RGB pixel, the color becomes "translucent", so RGBA can provide richer color styles, making images more vivid and better suited to how human eyes perceive them.

(3) The CMYK color space is the color space of color printers, the colors a printer understands. It is composed of Cyan, Magenta, Yellow, and Black, so it has four channels, expressed as (height, width, 4) in the image data structure. This is the printer's color world, which differs from ours: the deeper the colors we humans see, the more brilliant they appear, while for the printer deeper color means closer to black. A color space is simply a standard.

(4) HSV (or HSL) color space: H represents hue, S represents saturation, and V represents brightness. HSV color mode is another popular color mode besides RGB color mode.

The above color spaces can be converted into one another using corresponding conversion formulas, though with a slight loss of data. A small sketch of a few conversions follows.
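Here is how a few of these conversions look with OpenCV; the input file name is just an assumption, and OpenCV has no built-in CMYK conversion, since CMYK mostly lives in printing software.

```python
import cv2

img_bgr = cv2.imread("cat.jpg")                        # OpenCV reads color images in BGR order
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)     # BGR -> RGB
img_rgba = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGBA)   # add an alpha channel -> 4 channels
img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)     # -> HSV
img_gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)   # -> single-channel grayscale
print(img_rgba.shape, img_hsv.shape, img_gray.shape)
```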

5. Understand the concept of channels again.
In the computer world, the basic colors used to compose other colors are called "channels". A grayscale channel is a channel with only one color, and that color is not necessarily gray: in computer vision, grayscale refers to degrees of lightness and darkness.
Generally speaking, the colors in the computer world are formed by mixing the three basic colors R, G, and B (red, green, blue) to varying degrees. Each of these basic colors takes values in [0, 255]: 0 means the intensity of that color's light is essentially zero, that is, no light, and 255 means the intensity of that color's light is at its maximum. If the three basic colors have the same value, the displayed color is a neutral gray, and the larger that common value, the closer it gets to white. When a pixel's value is (0,0,0), the pixel is black; when it is (255,255,255), the pixel is white; when it is (255,0,0), the pixel is red; when it is (0,255,0), the pixel is green; when it is (0,0,255), the pixel is blue. The tiny sketch below builds exactly these pixels.
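A tiny sketch that builds a 1x5 RGB image from these pixel values:

```python
import numpy as np

pixels = np.array([[[0, 0, 0],          # black
                    [255, 255, 255],    # white
                    [255, 0, 0],        # red
                    [0, 255, 0],        # green
                    [0, 0, 255]]],      # blue
                  dtype=np.uint8)
print(pixels.shape)                     # (1, 5, 3) = (height, width, channels)
```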

6. Can image data be processed using DNN?
Sure. DNN is suited to two-dimensional tabular data, where rows represent samples and columns represent features. So you only need to convert your image data into two-dimensional data and then feed it into a DNN to see how it performs.
If your image is a binary or grayscale image, you only need to flatten each image into one-dimensional data, treat each image as one sample, and then test the effect with a DNN.
If your image is a three-dimensional color image, you flatten the image data along the channel dimension and splice the channels into one-dimensional data in channel order. One picture then becomes a one-dimensional structure, and multiple pictures a two-dimensional structure in which each row is one sample; this way you can run it with a DNN. When I talk about the architecture later, I will write a complete small case showing how a linear classifier classifies images; that is an example of DNN processing image data, and you can experience it in detail there. A small flattening sketch follows.
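A small sketch of this flattening in PyTorch, using assumed shapes (16 color images of size 32x32):

```python
import torch
import torch.nn as nn

batch = torch.rand(16, 3, 32, 32)          # (samples, channels, height, width)
flat = batch.view(16, -1)                  # each image flattened to one row: (16, 3*32*32) = (16, 3072)

dnn = nn.Sequential(
    nn.Flatten(),                          # does the same flattening inside the model
    nn.Linear(3 * 32 * 32, 128),
    nn.ReLU(),
    nn.Linear(128, 10),                    # e.g. 10 classes
)
print(flat.shape, dnn(batch).shape)        # torch.Size([16, 3072]) torch.Size([16, 10])
```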

7. Common data sets in the field of image recognition

The fully connected neural network (DNN) struggles with image data. An ordinary image nowadays can easily be 1000x1000 in size, which is 1 million pixels, so a three-channel image contains 3 million numbers. If we feed 3-million-dimensional data into a fully connected network, the network collapses immediately; it cannot handle that amount of data. Therefore, for image data sets a convolutional network is essentially required. The convolutional network model is where training huge models on massive data really began.

We know the troika of deep learning is data, algorithms, and computing power. Data refers to the data used to train a network, such as the training set, validation set, and test set; the algorithm is a collective name for your architecture, model, network, optimization algorithm, and so on; computing power means computing resources such as GPUs and other hardware. Here we are talking about the data of convolutional networks.

As mentioned before when discussing computer vision tasks, different tasks correspond to different labels. Once your task is clear, you have to start building your data set, and organizing a data set is a very complex, time-consuming, and expensive job. It looks simple but is actually complicated and tedious; it is normal for a vision project to spend two-thirds of its time and money, from start to delivery, on organizing the data set. No matter how powerful the model is, garbage in, garbage out, so organizing the data is an extremely important task. Look at Professor Li Feifei, a celebrity in the field of computer vision: what made her stand out was not some powerful algorithm she invented, but the fact that she organized the famous ImageNet data set, and that work alone is enough for her to shine.

The reason I say this is that someone once asked me to organize a video data set for him. He said he wanted to train a "cyborg", and when he heard there was no data, he asked me to find videos on the Internet and edit them for him, with no further details, which left me confused. Later, when their technical staff communicated with me, they did not consider how hard the data would be to obtain or how it might be obtained; they simply stated what kind of data they wanted and said that how to obtain it was my own business, since the person training the model has no obligation to collect the data set. I was speechless. The person who trains the model must be the person who knows the data best: if you do not know your data well, what model can you train? Any model with good results has done a great deal of processing and feature engineering on the data, even some tricks. Class balance, sample characteristics, whether there are abnormal samples, and the position, size, clarity, illumination, deformation, angle, occlusion, and so on of the object to be recognized in the image must all be understood in detail so that the model can be adjusted accordingly, rather than ignoring all of it, assuming that as long as you find a powerful model you can close your eyes and get results in one go, leaving the outcome to fate and blaming the annotators when it turns out badly. That kind of person is particularly annoying.

I am digressing a bit. I just want to say that if you want to train a huge model with massive data, it is almost impossible to collect the data set yourself; the cost of collecting and labeling data is very high. We are studying here, so we cannot spend a lot of energy organizing a data set. We usually generate some toy data ourselves, or use data sets built into the PyTorch framework. These data sets are relatively simple and classic. Moreover, we are currently talking about classification algorithms, so data sets with more complex labels are not involved. Some common data sets are listed below.

(1) MNIST, FashionMNIST, SVHN, Omniglot, and CIFAR10 data sets.
These five data sets can be downloaded from the relevant interfaces provided by pytorch. They are the easiest data sets to obtain, so we will introduce these five data sets first:
MNIST: a handwritten-digit data set with white digits on a black background. It is used for recognition tasks and cannot be used for detection or segmentation. There are 10 label categories, and the image size is 28x28.

FashionMNIST: a clothing-item data set used for recognition tasks; it cannot be used for detection or segmentation. It also has 10 label categories, and the image size is 28x28.

SVHN: the Street View House Numbers data set, built from real-world house-number pictures in Google Street View imagery. It is one of the harder digit data sets for recognition and detection because the images are relatively blurry. The data set supports recognition, detection, and unsupervised tasks, but not segmentation. Since it is also digits, there are 10 label categories, and the image size is 32x32.

Omniglot: a handwritten character data set spanning many writing systems. It contains 1623 different handwritten characters from 50 different alphabets, so there are 1623 categories in total, each with 20 samples, and each sample is 28x28. This data set is designed for one-shot learning. For example, face recognition typically follows a one-shot learning strategy: the training data consists of pairs of photos, and the label only needs to be "yes" or "no", that is, whether the pair shows the same person, which makes it a binary classification strategy. During training, the algorithm is given two photos and, by computing a distance or similarity, decides whether they show the same person, outputting "yes/no, similar/not similar". The test samples are also pairs of photos, and the samples in the test set do not need to appear in the training set. Face recognition projects deployed in practice are based on one-shot learning: first scan your ID card to obtain its photo, then take a photo of your face with the camera, and then decide whether the two photos show the same person; the model does not care about the person's name or gender. The Omniglot data set is built specifically for one-shot learning. It can only be used for recognition tasks, not detection or segmentation.

CIFAR10: a common ten-category data set covering animals and vehicles. It is the most commonly used teaching data set and is very suitable for recognition tasks. The pictures are in color, and each picture is 32x32x3.

(2) The acquisition process of FashionMNIST:
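A minimal sketch of the usual torchvision call (the root directory is just an assumption):

```python
import torchvision
from torchvision import transforms

train_set = torchvision.datasets.FashionMNIST(
    root="./data",                       # where the files are downloaded to
    train=True,                          # True -> training set, False -> test set
    download=True,
    transform=transforms.ToTensor(),     # converts each PIL image to a (1, 28, 28) float tensor
)
print(len(train_set))                    # 60000 training samples
img, label = train_set[0]
print(img.shape, label)                  # torch.Size([1, 28, 28]) and an integer class label
```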

In the same way, MNIST, SVHN, Omniglot, and CIFAR10 can also be obtained in the above way:

Only some details differ:
B: The split parameter (split='train' / 'test', plus 'extra' for SVHN) indicates which part of the data you want to load.
C: Because the paper that introduced the Omniglot data set calls the training set "background", the parameter background=True here means the training set is to be loaded, and background=False means the test set. Small sketches of these parameter differences follow.
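Sketches of these parameter differences, assuming the standard torchvision interfaces:

```python
import torchvision

# SVHN uses split= ('train' / 'test' / 'extra') instead of train=
svhn_train = torchvision.datasets.SVHN(root="./data", split="train", download=True)

# Omniglot uses background=: True loads the "background" (training) set, False the evaluation set
omniglot_train = torchvision.datasets.Omniglot(root="./data", background=True, download=True)

# MNIST and CIFAR10 use train= just like FashionMNIST
cifar_train = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
```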

As you can see, different data sets have different interfaces, and different data sets have different properties and methods. If you want to know which properties of a data set can be accessed, you have to go into the data set's source code and look. If you do not want to read the source code, you can only experiment with indexing and loops to see what works.
Reference: How to read PyTorch source code efficiently? (an article on Zhihu)

The data sets above are the most basic and simple ones for recognition tasks. They can only be used to test the baseline of our models, that is, to get a quick look at how a model performs; SOTA architectures in deep learning generally reach scores above 99% on them. If you feel these data sets are too small and too simple, and you want a more complex data set to run on the GPU, then the lowest-cost option is competition data sets, such as ImageNet, VOC, and LSUN.

(3) Acquisition channels for the ImageNet, VOC, and LSUN data sets.
ILSVRC stopped its recognition challenge in 2017 and moved the competition to Kaggle. It is said that the ImageNet 2019 data set can be downloaded for free on Kaggle and is mainly used for detection tasks, but the data sets from before 2017 can no longer be found there. It is also said that downloading requires an edu email address and that commercial use is not allowed; the data can only be used for research.

The VOC data sets, VOCSegmentation and VOCDetection, come from the Visual Object Classes tasks of the PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning) competition.
VOCSegmentation is used for segmentation tasks. In the 2012 version, there are 1464 training images, 1449 validation images, and 6929 segmentation object labels.
VOCDetection is used for detection tasks. In the 2012 version, there are 5717 training images, 5823 validation images, and 27450 annotated objects.
PyTorch supports downloading the yearly versions from 2007 to 2012. The data set is relatively small, only about 3.6 GB, but the download is very unstable.

The LSUN data set comes from the Large-Scale Scene Understanding challenge and focuses on scene categories. It is used for recognition tasks. The subsets for different scenes have different sizes: some are large, more than 40 GB, some are small, around 2 GB, and the entire data set is about 200 GB. Downloading even one category involves many pitfalls, and even once you have obtained the data set, reading it smoothly is still difficult. The code demonstrated later also uses this data set.

Apart from LSUN, the data sets above can just barely be run on a CPU; otherwise you really need a GPU. Without GPU computing resources it is difficult to train properly on this data: you would have to use a very small batch_size, and too small a batch_size greatly extends training time. If training ImageNet once takes 20 hours, you cannot afford the subsequent parameter tuning, grid search, and other operations. If you must use the ImageNet data set, it is recommended to use large GPUs from online platforms such as Colab.

8. Related functions and classes for reading data.
Obtaining data sets at low cost is already difficult in deep learning. And however you get the data, you usually receive a compressed file. After decompression, you may see a .pt file, a .mat file, an lmdb file, an Excel, csv, or txt file, or simply a folder of jpg or png pictures with the labels stored in another file. There is no universal way to read all of these cases; we can only handle them one by one:

(1) If you get a .mat file: that is the standard format for MATLAB data storage. A .mat file is a standard binary file; in a C environment you can read it directly, but when doing deep learning we usually work in a Python environment, so you use the loadmat function from the scipy library to read it:
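A sketch, assuming a file called data.mat that contains a variable X (both names are just examples):

```python
from scipy.io import loadmat

mat = loadmat("data.mat")     # returns a dict mapping MATLAB variable names to NumPy arrays
print(mat.keys())             # the variables plus '__header__', '__version__', '__globals__'
X = mat["X"]                  # assumed variable name; depends on how the .mat file was saved
print(X.shape, X.dtype)
```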

(2) If the file obtained is in .pt format: .pt files are the file format PyTorch uses to save models and data, so you can use the torch.load() function to load them:
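A sketch, assuming a file called dataset.pt (the name and contents are just examples):

```python
import torch

obj = torch.load("dataset.pt")    # may be a tensor, a dict of tensors, a state_dict, ...
print(type(obj))
if isinstance(obj, torch.Tensor):
    print(obj.shape, obj.dtype)
```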

(3) If you obtain a file in .mdb format: LMDB (Lightning Memory-Mapped Database Manager) is a B-tree-based, memory-mapped database, so a .mdb file is a database file.

At this point you first need to install the lmdb library, for example with pip install lmdb.

After installation, you have to write a class specifically for reading the lmdb format; a sketch of such a class follows:
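Below is a minimal sketch of such a reader class (not the author's original code). It assumes the lmdb database stores encoded image bytes under keys like b"0", b"1", ...; the key scheme and decoding must be adjusted to match how your .mdb file was actually written.

```python
import lmdb
import cv2
import numpy as np
from torch.utils.data import Dataset

class LMDBImageDataset(Dataset):
    def __init__(self, lmdb_path, transform=None):
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False)   # open the database read-only
        self.transform = transform
        with self.env.begin() as txn:
            self.length = txn.stat()["entries"]                      # number of key/value pairs stored

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        with self.env.begin() as txn:
            buf = txn.get(str(index).encode())                       # assumed key scheme: b"0", b"1", ...
        img = cv2.imdecode(np.frombuffer(buf, dtype=np.uint8), cv2.IMREAD_COLOR)
        if self.transform is not None:
            img = self.transform(img)
        return img
```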
 
A class along these lines is what I use to read lmdb files. In fact, we wrote an example of such a class back when we talked about DNN, and the idea is the same: inherit from Dataset, and the architecture part can be copied over. In addition, you need a good understanding of how to process lmdb files with Python. The code is fairly engineering-oriented, but it is not difficult once you understand it; it is a fixed procedure. Take a look at the effect below:

In addition, we mentioned earlier that the LSUN data set is in lmdb format. PyTorch has an interface specifically for this data set; here is a sketch of reading it through that interface:
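The sketch assumes the church_outdoor category has been downloaded and unpacked under ./lsun:

```python
import torchvision
from torchvision import transforms

lsun = torchvision.datasets.LSUN(
    root="./lsun",                        # directory containing the unpacked lmdb folders
    classes=["church_outdoor_train"],     # "<category>_train" / "<category>_val"
    transform=transforms.ToTensor(),
)
print(len(lsun))
img, label = lsun[0]
print(img.shape, label)
```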

(4) If what you get is simply a folder of jpg or png pictures, with the labels in another file or encoded in sub-folder names, you can write a custom Dataset, or use torchvision's ImageFolder interface when the images are already organized into one sub-folder per class, as sketched below:
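A sketch of the ImageFolder case, assuming the pictures are laid out as ./data/train/<class_name>/xxx.jpg so that labels come from the sub-folder names:

```python
import torchvision
from torchvision import transforms

train_set = torchvision.datasets.ImageFolder(
    root="./data/train",
    transform=transforms.Compose([
        transforms.Resize((224, 224)),    # images in a folder usually vary in size, so resize them
        transforms.ToTensor(),
    ]),
)
print(train_set.classes)                  # the class names, taken from the sub-folder names
img, label = train_set[0]
print(img.shape, label)                   # torch.Size([3, 224, 224]) and an integer label
```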

9. Classes in PyTorch for splitting training and test sets, packaging data and labels, and dividing them into mini-batches

To be continued...


Original article: blog.csdn.net/friday1203/article/details/135469813