CS231n Course Notes Translation: Image Classification Notes (Part 1)

Translator's Note: This article was first published on Intelligent Unit. It is translated from the Stanford CS231n course notes "image classification notes", with permission from the course instructor Andrej Karpathy. The translation was done by DUKE. ShiqingFan carefully proofread the translation and offered many rigorous, helpful suggestions for revision. Gong Zijia made good suggestions on the use of several terms and on polishing the translation. Zhang Xin and others also helped.

The original text is as follows

This is an introductory tutorial aimed at students from outside the computer vision field. It introduces the image classification problem and the data-driven approach. Contents:


  • Image classification, the data-driven approach, and the pipeline
  • Nearest Neighbor Classifier
    • k-Nearest Neighbor (Translator's Note: Part 1 of the translation ends here)
  • Validation Sets, Cross-Validation, and Hyperparameter Tuning
  • The pros and cons of Nearest Neighbor
  • Summary
  • Summary: Applying kNN in Practice
  • Extended reading

Image Classification

Objective: In this section we introduce the image classification problem: given a fixed set of class labels, assign to an input image one label from that set. Simple as it may seem, this is one of the core problems in computer vision and has a wide variety of practical applications. In later lessons we will see that many seemingly distinct computer vision tasks (such as object detection and segmentation) can be reduced to image classification.

Example: In the picture below, an image classification model takes the image and produces the probability that it belongs to each label in the set {cat, dog, hat, mug}. Keep in mind that to a computer, an image is one large 3-dimensional array of numbers. In this example, the cat image is 248 pixels wide and 400 pixels high, with 3 color channels: red, green and blue (RGB for short). The image therefore consists of 248 x 400 x 3 = 297,600 numbers, each an integer in the range 0-255, where 0 is black and 255 is white. Our task is to turn these roughly 300,000 numbers into a single label such as "cat".
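
To make the array view concrete, here is a minimal sketch in NumPy. It does not load a real cat picture: the array is filled with random values, and the (height, width, channels) axis order is a common convention assumed here rather than something these notes prescribe.

```python
import numpy as np

# Hypothetical stand-in for the cat photo: random pixel values, not a real image.
rng = np.random.default_rng(0)
height, width, channels = 400, 248, 3
image = rng.integers(0, 256, size=(height, width, channels), dtype=np.uint8)

print(image.shape)  # (400, 248, 3)
print(image.size)   # 297600 numbers in total
print(image.dtype)  # uint8, so every value lies in the range 0-255
```

Storing pixels as 8-bit unsigned integers is what pins each value to the range 0-255.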

—————————————————————————————————————————

The task of image classification is to predict a single class label for a given image (or a probability distribution over labels, indicating our confidence in each). An image is a 3-dimensional array of integers from 0 to 255, of size width x height x 3, where the 3 represents the red, green and blue color channels.

—————————————————————————————————————————

Difficulties and challenges: Recognizing a visual concept such as "cat" is trivial for a human, so it is worth considering the challenges from the perspective of a computer vision algorithm. Below is a list of difficulties such algorithms face in image recognition. Keep in mind that an image is represented as a 3-dimensional array whose elements are brightness values.

  • Viewpoint variation: The same object can be captured by the camera from many angles.
  • Scale variation: The visible size of an object often varies (both its size in the picture and its size in the real world).
  • Deformation: Many objects are not rigid and can change shape dramatically.
  • Occlusion: The target object may be partly hidden. Sometimes only a small part of it (as little as a few pixels) is visible.
  • Illumination conditions: At the pixel level, lighting has a drastic effect.
  • Background clutter: Objects may blend into the background, making them hard to identify.
  • Intra-class variation: Individuals within a class can look very different from one another. Chairs are an example: the category contains many different objects, each with its own shape.

A good image classification model must remain stable under all of these variations and their combinations, while staying sensitive to the differences between classes.


Data-Driven Approach: How would we write an image classification algorithm? This is very different from writing, say, a sorting algorithm. How would you write an algorithm that recognizes cats in images? There is no obvious way to do it. So instead of trying to specify in code what each kind of object looks like, we take an approach similar to teaching a child to recognize objects in pictures: give the computer a lot of data, then implement a learning algorithm that lets the computer learn the visual appearance of each class. This is the data-driven approach. Because its first step is to collect a training set of images that have already been classified and labeled, let's look at what such a dataset looks like:

—————————————————————————————————————————

An example training set with 4 visual categories. In practice, we may have thousands of categories, with thousands of images for each.

—————————————————————————————————————————

Image classification pipeline. As you have seen in the course video, image classification takes an array of pixel values as input and assigns it a classification label. The complete pipeline is as follows:

  • Input: The input is a collection of N images, each labeled with one of K classification labels. This collection is called the training set.
  • Learning: The task of this step is to use the training set to learn what each class looks like. This step is usually called training a classifier or learning a model.
  • Evaluation: Ask the classifier to predict labels for images it has never seen before, and use the results to evaluate its quality. We compare the predicted labels with the true labels of the images; the more predictions that match the true labels, the better the classifier.

Nearest Neighbor Classifier

As our first approach, we will implement a Nearest Neighbor classifier. Although this classifier has nothing to do with convolutional neural networks and is rarely used in practice, implementing it gives a basic understanding of how to approach an image classification problem.

Image Classification Dataset: CIFAR-10. A very popular dataset for image classification is CIFAR-10. It contains 60,000 small images of 32 x 32 pixels, each labeled with one of 10 classes. The 60,000 images are split into a training set of 50,000 images and a test set of 10,000 images. In the image below you can see 10 random example images from each of the 10 classes.

—————————————————————————————————————————

Left: Sample images from the CIFAR-10 dataset. Right: The first column shows test images; to the right of each one are its 10 nearest neighbors in the training set, found by the Nearest Neighbor algorithm using pixel-wise differences.

—————————————————————————————————————————

Suppose now that we are given the CIFAR-10 training set of 50,000 images (5,000 for each category) and want to label the remaining 10,000 test images. The Nearest Neighbor classifier takes a test image, compares it with every image in the training set, and assigns the test image the label of the closest training image. The image on the right above shows the result. Note that in only about 3 of the 10 examples is an image of the same class retrieved. For example, in row 8, the training image nearest to the horse's head is a red sports car, presumably because of the strong black background, so the horse is misclassified as a car.

So how exactly do we compare two images? In this case, each is a block of 32 x 32 x 3 pixel values. The simplest approach is to compare the images pixel by pixel and add up all the differences. In other words, represent the two images as two vectors I_1 and I_2, then compute the L1 distance between them:

\displaystyle d_1(I_1,I_2)=\sum_p|I^p_1-I^p_2|

The summation here is for all pixels. Below is an example of the entire comparison process:

—————————————————————————————————————————

An example using a single color channel. The two images are compared with the L1 distance: take the pixel-wise differences (in absolute value), then add them all up to get a single number. If the two images are identical, the L1 distance is 0; if they are very different, it is large.

—————————————————————————————————————————
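
The pixel-wise comparison can be worked through by hand on a tiny case. The sketch below uses two made-up 2 x 2 single-channel "images" (the values are illustrative only, not taken from the figure). It also assumes the pixels are stored as unsigned 8-bit integers, in which case they should be cast to a signed type before subtracting.

```python
import numpy as np

# Two made-up 2x2 single-channel "images"; the values are illustrative only.
test_image = np.array([[56, 32], [90, 23]], dtype=np.uint8)
train_image = np.array([[10, 20], [24, 17]], dtype=np.uint8)

# Cast to a signed type first: subtracting uint8 arrays directly would wrap around.
diff = np.abs(test_image.astype(np.int32) - train_image.astype(np.int32))
l1_distance = diff.sum()

print(diff)         # per-pixel absolute differences
print(l1_distance)  # 46 + 12 + 66 + 6 = 130
```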

Next, let's see how to implement this classifier in code. First, we load the CIFAR-10 data into memory and split it into 4 arrays: training data and labels, test data and labels. In the code below, Xtr (size 50000x32x32x3) holds all the images in the training set, and Ytr is the corresponding 1-dimensional array of length 50000 that holds the classification labels (from 0 to 9) corresponding to the images:

Xtr, Ytr, Xte, Yte = load_CIFAR10('data/cifar10/') # a magic function we provide
# flatten out all images to be one-dimensional
Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3) # Xtr_rows becomes 50000 x 3072
Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3) # Xte_rows becomes 10000 x 3072

Now that all the images are stretched out into row vectors, here is how to train and evaluate a classifier:

nn = NearestNeighbor() # create a Nearest Neighbor classifier class
nn.train(Xtr_rows, Ytr) # train the classifier on the training images and labels
Yte_predict = nn.predict(Xte_rows) # predict labels on the test images
# and now print the classification accuracy, which is the fraction
# of examples that are correctly predicted (i.e. label matches)
print('accuracy: %f' % np.mean(Yte_predict == Yte))

As an evaluation criterion, we commonly use accuracy, which measures the fraction of predictions that are correct. Note that all the classifiers we build in the future will share this API. They have a train(X, y) function that takes the training data and labels to learn from; internally, the class should build some kind of model of the labels and how they can be predicted from the data. They also have a predict(X) function, which takes new data and predicts its labels. We have not yet shown the classifier itself. Below is an implementation of a simple Nearest Neighbor classifier using the L1 distance that satisfies this template:

import numpy as np

class NearestNeighbor(object):
  def __init__(self):
    pass

  def train(self, X, y):
    """ X is N x D where each row is an example. y is 1-dimensional of size N """
    # the nearest neighbor classifier simply remembers all the training data
    self.Xtr = X
    self.ytr = y

  def predict(self, X):
    """ X is N x D where each row is an example we wish to predict label for """
    num_test = X.shape[0]
    # lets make sure that the output type matches the input type
    Ypred = np.zeros(num_test, dtype = self.ytr.dtype)

    # loop over all test rows
    for i in range(num_test):
      # find the nearest training image to the i'th test image
      # using the L1 distance (sum of absolute value differences)
      distances = np.sum(np.abs(self.Xtr - X[i,:]), axis = 1)
      min_index = np.argmin(distances) # get the index with smallest distance
      Ypred[i] = self.ytr[min_index] # predict the label of the nearest example

    return Ypred

If you run this code on CIFAR-10, you will see an accuracy of 38.6%. That is better than the 10% expected from random guessing (there are 10 classes), but far below human performance (estimated at about 94%) and the roughly 95% that convolutional neural networks can achieve. See the Kaggle competition leaderboard for CIFAR-10.

Distance selection: There are many other ways to compute the distance between vectors. Another common choice is the L2 distance, which has the geometric interpretation of the Euclidean distance between two vectors. Its formula is:

\displaystyle d_2(I_1,I_2)=\sqrt{ \sum_p(I^p_1-I^p_2)^2}

In other words, we still compute the pixel-wise differences, but this time we square each of them, add up the squares, and take the square root of the sum. In NumPy, only one line of the code above needs to change:

distances = np.sqrt(np.sum(np.square(self.Xtr - X[i,:]), axis = 1))

Note that np.sqrt is included here, but in a practical nearest neighbor application it can be omitted. The square root is a monotonic function: it changes the absolute size of the distances but preserves their ordering, so the nearest neighbors are the same with or without it. If you run this model on CIFAR-10, the accuracy is 35.4%, slightly lower than with the L1 distance.
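
The monotonicity argument can be checked numerically. The sketch below uses randomly generated vectors as hypothetical stand-ins for flattened images (the sizes 100 and 8 are arbitrary) and compares the neighbor ranking with and without the square root.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((100, 8))   # 100 hypothetical training vectors
query = rng.random(8)          # one hypothetical test vector

squared = np.sum(np.square(train - query), axis=1)  # squared L2 distances
euclid = np.sqrt(squared)                           # true L2 distances

# The square root is monotonic, so the ranking of neighbors should not change.
same_nearest = np.argmin(squared) == np.argmin(euclid)
same_order = np.array_equal(np.argsort(squared), np.argsort(euclid))
print(same_nearest, same_order)
```

Skipping the square root saves one operation per training example, which adds up when every test image is compared against the whole training set.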

L1 and L2 comparison. It is interesting to compare the two metrics. When measuring the difference between two vectors, L2 is much less forgiving than L1 of any single large difference. That is, the L2 distance prefers many moderate differences to one large one. L1 and L2 are the most commonly used special cases of a p-norm.
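
A small numeric illustration of this preference, using two hypothetical difference vectors chosen so that their L1 norms are equal. The helper p_norm is written here for illustration and is not part of the course code.

```python
import numpy as np

# Two hypothetical difference vectors with the same total absolute difference.
one_large = np.array([4.0, 0.0, 0.0, 0.0])      # a single big disagreement
many_moderate = np.array([1.0, 1.0, 1.0, 1.0])  # several moderate ones

def p_norm(v, p):
    """General p-norm; p=1 gives L1 and p=2 gives L2."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

print(p_norm(one_large, 1), p_norm(many_moderate, 1))  # 4.0 4.0
print(p_norm(one_large, 2), p_norm(many_moderate, 2))  # 4.0 2.0
```

Under L1 the two cases are equally far apart, while under L2 the single large difference counts as twice as far as the four moderate ones.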

k-Nearest Neighbor Classifier

You may have noticed that it is odd to use only the label of the single most similar image when making a prediction. Indeed, you can do better with a k-Nearest Neighbor classifier. The idea is very simple: instead of finding only the single closest training image, we find the k closest images and let them vote on the label of the test image; the label with the most votes is the prediction. In particular, when k=1, the k-Nearest Neighbor classifier reduces to the Nearest Neighbor classifier. Intuitively, a higher k smooths the decision boundaries and makes the classifier more resistant to outliers.
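
The voting idea can be sketched as a small function. This is one possible implementation under stated assumptions, not the course's reference code: the name knn_predict is made up for this example, labels are assumed to be non-negative integers, the L1 distance is used, and ties are broken in favor of the smallest label.

```python
import numpy as np

def knn_predict(Xtr, ytr, X, k=1):
    """Predict labels for rows of X by majority vote among the k nearest rows of Xtr."""
    Ypred = np.zeros(X.shape[0], dtype=ytr.dtype)
    for i in range(X.shape[0]):
        distances = np.sum(np.abs(Xtr - X[i, :]), axis=1)  # L1 distances
        nearest = np.argsort(distances)[:k]   # indices of the k closest training rows
        votes = np.bincount(ytr[nearest])     # number of votes per label
        Ypred[i] = np.argmax(votes)           # ties go to the smallest label
    return Ypred

# Toy 2-D data: class 0 near the origin, class 1 near (10, 10),
# plus one outlier labeled 1 sitting inside the class 0 cluster.
Xtr = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
                [10, 10], [9, 10], [10, 9], [0.4, 0.4]])
ytr = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.array([[0.5, 0.5]])

print(knn_predict(Xtr, ytr, X, k=1))  # [1]: the outlier decides
print(knn_predict(Xtr, ytr, X, k=5))  # [0]: the vote overrules the outlier
```

With k=1 the single outlier determines the prediction, while with k=5 the four surrounding class 0 points outvote it, which is the smoothing effect described above.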

—————————————————————————————————————————

The example above shows the difference between the Nearest Neighbor classifier and the 5-Nearest Neighbor classifier, using 2-dimensional points divided into 3 classes (red, blue and green). The colored regions show the decision boundaries induced by each classifier. The white areas mark points that are classified ambiguously (i.e. the votes are tied between at least two classes). Note that in the NN classifier, outlier data points (e.g. the green point in the blue area) create small islands of likely incorrect predictions. The 5-NN classifier smooths over these irregularities, which likely leads to better generalization on the test data (not shown in the example). Note also that the gray areas in the 5-NN image are caused by ties in the votes among the nearest neighbors (e.g. 2 neighbors are red, 2 are blue, and 1 is green).

—————————————————————————————————————————

In practice, k-NN is almost always preferred over plain Nearest Neighbor. But how should the value of k be chosen? That question is discussed next.

Image classification notes (Part 1) end here.

Click to view image classification notes (Part 2).

Translator feedback :

  1. A single CS231n course note is long, and reading it in full takes considerable time. Zhang Xin and other friends suggested splitting it up, so in this translation the image classification notes are divided into Part 1 and Part 2, keeping each article to about 10,000 words to reduce the reading cost. Please comment on how well this works;
  2. The formula editor of the Zhihu column supports LaTeX syntax, which is great. However, centering a formula requires adding spaces by hand. Is there a more elegant way? Please comment and advise;
  3. If you find any shortcomings in the translation, please point them out in the comments; we will discuss them and respond promptly;
  4. This translation is a volunteer effort, first published on the Zhihu column. Reprinting is permitted; please keep the full text and indicate the source.

Reprinted from:
https://zhuanlan.zhihu.com/p/20894041?refer=intelligentunit
