[2017 CS231n] Course notes - Lecture 2: Image Classification


Follow the WeChat public accounts 'AI-ming3526' or 'Computer Vision This Little Thing' for more practical content on algorithms and machine learning.
CSDN: https://blog.csdn.net/baidu_31657889/
GitHub: https://github.com/aimi-cn/AILearners

Course Description

Stanford's CS231n (Convolutional Neural Networks for Visual Recognition) is a course we all know: a must-take introduction to deep learning.
Every semester's video release sets off a wave of excitement for this star course. These notes follow the 2017 edition.

Course resources

Course website: http://cs231n.stanford.edu/

Chinese-subtitled version: NetEase Cloud Classroom

Chinese-subtitled version: Bilibili

Course notes: GitHub repository

Course slides: follow the public accounts 'Computer Vision This Little Thing' or 'AI-ming3526' and reply with the keyword "cs231n" to get them for free

Coursework

Official notes and assignments: http://cs231n.github.io/

While taking these notes, I will also work through the 2018/2019 course assignments, which can be done in PyTorch or TensorFlow, as a way to exercise coding skills.

Section 2.1: Data-driven approach

The last lecture introduced the task of image classification, a genuinely core problem in computer vision and the focus of this course.

In image classification, the system receives an input image and has a fixed, predefined set of categories or labels: cat, dog, car, and so on. The computer's job is to look at the picture and assign it one of these fixed category labels. For a person this is trivial, but for a computer it is very hard.


When a computer looks at these pictures, it does not see a "cat" the way we do; to a computer, an image is just a big grid of numbers. The image above might be 600 pixels wide and 800 pixels high, with three color channels: red, green, and blue (RGB). So the image contains 600 × 800 × 3 = 1,440,000 numbers, each an integer in the range 0-255, where 0 means full black and 255 full white. Our task is to turn these million-plus numbers into a single label such as "cat."
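As a minimal sketch of what the classifier actually receives (NumPy, with a random array standing in for a real photo; only the shapes matter here, nothing is course-specific):

```python
import numpy as np

# A stand-in for a real 800 (high) x 600 (wide) RGB photo: a 3-D array of bytes.
image = np.random.randint(0, 256, size=(800, 600, 3), dtype=np.uint8)

print(image.shape)  # (800, 600, 3): height, width, color channels
print(image.size)   # 1440000 numbers in total
print(image[0, 0])  # one pixel, e.g. [ 12 200  31 ]: its R, G, B brightness
```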


So to the computer the image is a huge array of numbers, and extracting "cat features" from it is hard. This problem is called the "semantic gap": the label "cat" is a semantic description we assign to the image, and there is a large gap between that semantic label and the pixel values the computer actually sees.

Difficulties and challenges: for a person, recognizing a "cat" is an effortless visual concept, but from the standpoint of a computer vision algorithm it deserves careful thought. Below we list some of the difficulties image recognition algorithms run into; remember that an image is represented as a 3-dimensional array whose elements are brightness values.

  • Viewpoint variation: the camera can show the same object from many different angles.
  • Scale variation: objects usually vary in visual size (and not only in the image; their real-world size varies too).
  • Deformation: many things are not rigid and can take on dramatically different shapes.
  • Occlusion: the target object may be partially blocked; sometimes only a small portion (as little as a few pixels) is visible.
  • Illumination conditions: at the pixel level, the effect of lighting is enormous.
  • Background clutter: objects may blend into their background, making them hard to pick out.
  • Intra-class variation: individuals within one class can differ greatly, chairs for example; there are many different objects in the class, each with its own shape.

Faced with all these variations and their combinations, a good image classification model must keep its predictions stable while staying sensitive to the differences between classes.


If you write an image classifier in Python, you define a function that takes an image as input, performs some mysterious operations, and finally returns a label such as cat or dog. But unlike tasks with clear, simple, directly implementable algorithms, image recognition has no obvious algorithm to write down.


A cat has ears, eyes, a nose, and a mouth, and from Hubel and Wiesel's work covered in the previous lecture we know edges are very important for visual recognition. So one could try computing the edges of the image, sorting the edges and corners into shapes, and writing rules on top of them to recognize cats.


But to recognize something else, say trucks or other animals, you would have to start over from scratch. This approach does not scale. What we need is a recognition algorithm that extends to all kinds of objects in the world, and that leads us to the data-driven approach.

 We no longer write class-specific rules to recognize cats, fish, or anything else; instead the method is:

    (1) First, collect images of the different categories and build a labeled image dataset;

    (2) Then, train a classifier using machine learning;

    (3) Finally, run the classifier on new pictures and see whether it recognizes them.

So in code we can define two functions: a train function, which takes images and labels and outputs a model; and a predict function, which takes the model and predicts the class of a new image.
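A minimal sketch of that two-function interface (the names train/predict follow the lecture; the "model" here is a deliberately trivial placeholder that memorizes the most common label, just to show the shape of the API):

```python
import numpy as np
from collections import Counter

def train(images, labels):
    """Distill the training data into a model. As a placeholder, the
    'model' just remembers the most common training label."""
    return {"default_label": Counter(labels).most_common(1)[0][0]}

def predict(model, test_images):
    """Use the trained model to assign a label to every test image."""
    return np.array([model["default_label"]] * len(test_images))
```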


This data-driven approach is a broader concept than deep learning itself. The simplest classifier built this way is the nearest neighbor classifier: during training it simply memorizes all the training data, and at prediction time it compares a new image against the memorized training images to make its prediction.


Image classification dataset: CIFAR-10. A very popular image classification dataset is CIFAR-10. It contains 60,000 small 32×32 images, each labeled with one of 10 classes. The 60,000 images are split into a training set of 50,000 and a test set of 10,000. In the figure below (left) you can see 10 random images from each of the 10 classes.


Left: sample images from the CIFAR-10 dataset. Right: the first column shows test images; to the right of each are the 10 most similar training images selected by the Nearest Neighbor algorithm based on pixel-wise differences.

One detail we need to pin down: given two pictures, how do we compare them?

If we compare a test image against every training image, we have many choices for the comparison function. One option is the L1 distance (sometimes called the Manhattan distance), a simple way to compare images pixel by pixel:
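In standard notation (the sum runs over all pixels $p$ of the two images $I_1$ and $I_2$):

$$ d_1(I_1, I_2) = \sum_p \left| I_1^p - I_2^p \right| $$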


To compare a test image with a training image using L1, take the difference pixel by pixel, then add up all the absolute differences. If the two pictures are exactly the same, the L1 distance is 0; if they are very different, the L1 value will be large.

This method may seem naive, but it is sometimes reasonable, and it gives us a concrete way to compare two pictures.
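A tiny worked example of that pixel-wise comparison, on made-up 2 × 2 "images":

```python
import numpy as np

I1 = np.array([[56, 32],
               [90, 13]])
I2 = np.array([[10, 20],
               [24, 17]])

# Subtract pixel by pixel, take absolute values, then add everything up.
print(np.sum(np.abs(I1 - I2)))  # 46 + 12 + 66 + 4 = 128
```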

Here is Python code for the nearest neighbor classifier.

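The original post shows the code as a screenshot; below is a minimal sketch in the same spirit (L1 distance; labels stored as a NumPy array):

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        """X: N x D array, one flattened training image per row; y: N labels.
        'Training' is just memorizing all the data."""
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        """X: M x D array of test images; returns the M predicted labels."""
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # L1 distance from test image i to every training image at once.
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            Ypred[i] = self.ytr[np.argmin(distances)]  # label of the closest one
        return Ypred
```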

But the nearest neighbor algorithm raises a question: with a training set of N examples, what are the time complexities of training and testing? The answer is training: O(1), testing: O(N). From this angle, nearest neighbor has it backwards: it spends almost no time training and a great deal of time testing. Convolutional neural networks and other parametric models are the opposite: they spend a lot of time training, but testing is very fast. That is what we want in practice: tests should be fast, and slow training does not matter much, since it happens in the data center.

So how well does nearest neighbor actually do in practice? Look at the following image:


It shows the decision regions of a nearest neighbor classifier over a training set of points in the 2D plane; the colors of the points indicate their class labels, and there are five classes. For every location in the plane, the classifier finds the nearest training example and colors the background with that example's label; you can see that the nearest neighbor classifier carves up the space according to the adjacent points.

But in the picture you can see a yellow island in the middle of the green region (that point should really be green), and pieces of blue region pushing into the green zone; these show that the nearest neighbor classifier's decisions can be problematic.

To address this, we generalize to K-nearest neighbors: instead of looking only at the single closest point, we find the K closest points according to the distance metric, let those neighbors vote, and predict the label with the most votes.

Below, the same dataset is classified with K = 1, K = 3, and K = 5 nearest neighbor classifiers:


With K = 3 you can see that the stray yellow point in the green region no longer causes a yellow island: thanks to majority voting, the middle region is all classified green. With K = 5, the decision boundaries between the blue and red regions become smoother and look better.

So when using a nearest neighbor classifier, you generally want to assign K a larger value, which smooths the decision boundaries and tends to give better results. Of course K cannot be arbitrarily large; it has to be tuned to the size of your training and validation data.
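For K > 1, only the predict step changes: look up the K closest training points and let them vote. A sketch (assuming the same memorized Xtr/ytr arrays as in the class above; knn_predict is an illustrative name, not course code):

```python
import numpy as np
from collections import Counter

def knn_predict(Xtr, ytr, x, k=5):
    """Predict the label of one test point x by majority vote among its
    k nearest training points, measured with the L1 distance."""
    distances = np.sum(np.abs(Xtr - x), axis=1)
    nearest = np.argsort(distances)[:k]       # indices of the k closest points
    return Counter(ytr[nearest]).most_common(1)[0][0]
```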

I previously wrote a hands-on machine learning example of k-nearest neighbors, recognizing handwritten digits: https://blog.csdn.net/baidu_31657889/article/details/89095213

Student question: what do the white regions in the image above represent?

A: The white regions are areas where the K-nearest-neighbor vote produced no majority winner; one bold way to handle them is to treat them as a separate category of their own.

Section 2.2: K-nearest neighbor algorithm

Continuing with KNN (K-nearest neighbors): coming back to images, it actually does not perform well. Below, correct and incorrect classifications are marked in green and red:


The prediction depends entirely on the neighbors, and you can see KNN's performance is not great; with a larger value of K, the voting might reach a better classification result.

When we use K-nearest neighbors, we must decide how to measure the distance between nearby data points. We have already discussed the L1 distance, the sum of absolute pixel differences; another common choice is the L2 distance, i.e., the Euclidean distance (square root of the sum of squared differences).
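With the same notation as the L1 formula above, the L2 (Euclidean) distance is:

$$ d_2(I_1, I_2) = \sqrt{\sum_p \left( I_1^p - I_2^p \right)^2} $$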


The two metrics differ in how they depend on your choice of coordinate system: L1 depends on the coordinate frame, so rotating the axes changes the L1 distances between points, whereas changing the axes has no effect on the L2 distance.

The decision boundaries change shape considerably under the two metrics. With L1, boundaries tend to follow the coordinate axes, again because L1 depends on the choice of coordinate system; L2 does not care about the axes and just puts the boundaries where they naturally fall. (Honestly, I can't see much difference in this figure ==, but the demo at http://vision.stanford.edu/teaching/cs231n-demos/knn/ makes the effect really obvious; go try it. L2, the Euclidean distance, intuitively fits better and gives more natural boundaries. Playing with KNN there is genuinely fun; it can learn quite good decision boundaries.)


So, once you actually try to use this algorithm in practice, there are several choices to make: which value of K, which distance metric, and how to pick them for your problem and data. K and the distance metric are called hyperparameters, and they are not necessarily learned from the training data.

In practice, the k-NN classifier is often used. But how do we determine k and the other hyperparameters?

Two wrong ideas: Idea 1 and Idea 2


(1) Choose the hyperparameters that give the highest accuracy on the training set, i.e., the best-performing hyperparameters on the training data;

Do not do this. In machine learning the goal is not to fit the training set as well as possible, but to make the classifier perform well on unseen data outside the training set. In k-nearest neighbors, with k = 1 we can always classify the training set perfectly; in practice, a larger k will misclassify a few individual training points but perform better on data not in the training set.

(2) Split all the data into two parts, a training set and a test set; train the algorithm on the training set with different hyperparameters, apply the trained classifiers to the test set, and select the group of hyperparameters that performs best on the test set;

Do not do this either. The point of a test set in a machine learning system is to tell us how the algorithm will perform on data it has never seen; if we tune on the test set, we only guarantee good performance on that particular test set, which no longer represents performance on genuinely unseen data.

 Two right ideas: Idea 3 and Idea 4


(3) Split all the data into three parts: a training set, a validation set, and a test set, with most of the data in the training set. The usual procedure is to train the algorithm on the training set with different hyperparameters, evaluate on the validation set, select the group of hyperparameters that performs best on the validation set, and then run that single best classifier on the test set. This is the correct approach.

(4) Cross-validation: less common in deep learning. When the training set (and therefore the validation set) is small, this more elaborate method helps. Continuing the earlier example: instead of carving off 1,000 images for validation, split the training set evenly into 5 folds, use 4 folds for training and 1 for validation, then rotate which fold is used for validation; finally, average the 5 validation results to score the algorithm.
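A sketch of 5-fold cross-validation for choosing K, reusing the illustrative knn_predict helper from above (the fold splitting and candidate values are assumptions, not the course's actual code):

```python
import numpy as np

def cross_validate_k(Xtr, ytr, k_choices, num_folds=5):
    """Return each candidate k's validation accuracy, averaged over the folds."""
    X_folds = np.array_split(Xtr, num_folds)
    y_folds = np.array_split(ytr, num_folds)
    mean_acc = {}
    for k in k_choices:
        accs = []
        for i in range(num_folds):
            # Fold i is the validation set; the other 4 folds are for training.
            X_train = np.concatenate(X_folds[:i] + X_folds[i + 1:])
            y_train = np.concatenate(y_folds[:i] + y_folds[i + 1:])
            preds = [knn_predict(X_train, y_train, x, k) for x in X_folds[i]]
            accs.append(np.mean(np.array(preds) == y_folds[i]))
        mean_acc[k] = np.mean(accs)
    return mean_acc  # e.g. pick max(mean_acc, key=mean_acc.get)
```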

After cross-validation you might get a plot like this:


The horizontal axis is the value of K for the K-nearest-neighbor classifier; the vertical axis is the classifier's accuracy on the data for each K. With 5-fold cross-validation, each value of K is tested on 5 different splits to show how the algorithm performs. When training a machine learning model, you typically end up drawing a plot like this, which shows the relationship between performance and each hyperparameter, and from it you select the best model and hyperparameters on the validation set.

In fact, KNN is rarely used for image classification, for several reasons.

(1) Test time is very slow.

(2) Distance metrics such as the L1 or L2 distance are ill-suited to comparing images: these pixel-level distances do not reflect the visual similarity between images.

How do such distances fail to distinguish different images?


On the left is the original picture; the others are processed versions: the mouth occluded, the whole image shifted down by a few pixels, or the image tinted blue. If we compute the Euclidean distance from the original to the occluded, shifted, and tinted versions, the results are all the same: L2 does not capture the perceptual differences between these images.

Why are the L2 distances identical? Because these images were deliberately constructed so that their L2 distances to the original come out the same; but the fact that such examples can be constructed at all shows that L2 distance, and therefore KNN on raw pixels, is a poor measure of similarity between images.

(3) The curse of dimensionality: KNN effectively partitions the sample space into cells, which means that for the classifier to work well, the training data must cover the space densely. The problem is that dense coverage requires a number of training examples exponential in the dimension, which is unattainable in the high-dimensional space of pixels.


Note: each point here is a training example, colored by its category. In one dimension, 4 points may be enough to cover the space for two categories; in two dimensions you need 16 points, and in three dimensions 64. The number of training samples needed grows exponentially. Terrifying.
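In general, keeping the same density of $n$ points per axis in $d$ dimensions takes on the order of

$$ n^d \quad \text{examples:} \quad 4^1 = 4, \;\; 4^2 = 16, \;\; 4^3 = 64. $$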

KNN: summary

In image classification, we start from a training set of images and labels, and must predict the labels of a test set.

The k-nearest neighbor classifier predicts labels based on the nearest training examples.

The distance metric (L1 or L2) and K are hyperparameters.

Use the validation set to select hyperparameters; run on the test set only at the very end, and only once.

Section 2.3: Linear classification

Linear classification is very important, yet it is also a relatively simple learning algorithm; it helps us build up toward whole neural networks and convolutional networks.

An analogy with Lego: if an entire neural network is a big Lego castle, the linear classifier is the basic brick that castle is built from.


Linear classification takes a slightly different approach from K-nearest neighbors: it is the simplest example of a parametric model. Take the figures below as an example; we again use the 10 classes of the CIFAR-10 dataset, where each image is 32 * 32 * 3.


The 3 in 32 * 32 * 3 above refers to the three RGB channels: a color image has three channels, while a grayscale image is two-dimensional.

Conventionally, x is the input data and W the weights. We now write a function that takes the parameters x and W and outputs 10 numbers, the scores for the 10 CIFAR-10 categories. In this parametric approach, we summarize our knowledge of the training data into the parameters W; at test time the actual training data is no longer needed, and only the parameters are used to make predictions, which makes the model much more efficient.

Much of deep learning is about designing the right structure for the function F: the function can combine the weights and the data in different, more complex ways, corresponding to different neural network architectures. The simplest combination is just to multiply them, which gives the linear classifier.
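In symbols, this is the standard linear score function:

$$ f(x, W) = W x + b $$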


To explain the figure in my own words: the cat on the far left is the input image, corresponding to x in the formula. The input image is 32 * 32 * 3; stretching it out gives a 3072 * 1 column vector. W is a weight matrix; its role is to record what we learned during training, so that at test time the matrix W alone can predict the result. W has size 10 * 3072, so multiplying W and x yields a 10 * 1 column vector, the scores for the 10 categories; the class with the highest score is the classifier's prediction for the image. Sometimes we also add b, a bias term, which gives us some data-independent preference values, one per class.
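A shape-checking sketch of that computation, with random numbers standing in for a real image and trained weights:

```python
import numpy as np

x = np.random.rand(3072, 1)   # a 32*32*3 image stretched into a column vector
W = np.random.rand(10, 3072)  # one row of learned weights per class
b = np.random.rand(10, 1)     # per-class bias, independent of the input

scores = W.dot(x) + b         # 10 x 1 column vector of class scores
print(scores.shape)           # (10, 1)
print(int(np.argmax(scores))) # index of the highest-scoring class
```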

An example of the linear classifier in operation:


We stretch a 2 * 2 image into a column vector with four elements. In this example there are only three categories: cat, dog, and ship. The weight matrix W is then 3 rows by 4 columns (4 pixels, 3 categories), plus a 3-element bias vector, which provides a data-independent offset for each category. The cat score is the dot product between the input image's pixels and a row of the weight matrix, plus the bias term.
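A numeric version of that toy picture (the values are small made-up numbers for illustration; the arithmetic is the point, not the particular digits):

```python
import numpy as np

x = np.array([56, 231, 24, 2])           # the 2x2 image stretched into 4 pixels

W = np.array([[0.2, -0.5,  0.1,  2.0],   # cat weights
              [1.5,  1.3,  2.1,  0.0],   # dog weights
              [0.0,  0.25, 0.2, -0.3]])  # ship weights
b = np.array([1.1, 3.2, -1.2])           # per-class bias

scores = W.dot(x) + b
print(scores)  # cat: 0.2*56 - 0.5*231 + 0.1*24 + 2.0*2 + 1.1 = -96.8
```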

The figure below shows a trained linear classifier: the bottom row visualizes, for each of the 10 classes, the corresponding row of the weight matrix learned from the training dataset.


Another view of the linear classifier goes back to the images as points in a high-dimensional space: you can imagine each image as a point there, and the linear classifier draws linear decision boundaries, planes that separate one category from all the remaining categories, as shown below:


During training, these lines start out random and then shift quickly, trying to separate the data correctly. But from this high-dimensional view we can also see the problems a linear classifier runs into.

Suppose a dataset has two categories, blue and red: an image is blue if its number of pixels greater than 0 is odd, and red if that number is even. Drawing these decision regions, the odd-count blue category occupies two opposite quadrants of the plane, and there is no way to draw a single straight line separating blue from red; this is one place where a linear classifier fails. See the leftmost figure below.

There are other situations a linear classifier cannot resolve, such as multimodal data, where one class appears in several separate islands, as shown in the middle and rightmost figures below.


So linear classifiers do indeed have many problems, but they are a very simple algorithm, easy to use and to understand.

To sum up:

This section discussed the functional form of the linear classifier (a matrix-vector multiplication), which corresponds to template matching: learning a single template per category. Once the matrix is trained, it can score any new sample cheaply.

To think about: how do we choose the right weights for a dataset? That involves loss functions and optimization methods, which we will continue discussing in the next chapter.

AIMI-CN AI learning exchange group [1015286623]: join for more AI materials.

Sharing technology, enjoying life: our public account 'Computer Vision This Little Thing' pushes "AI" series articles every week. Your attention is welcome!


Origin: www.cnblogs.com/aimi-cn/p/11431693.html