CS231n Assignment 1 implementation (kNN part, 2022)

The Assignment 1 page is here:

https://cs231n.github.io/assignments2022/assignment1/

When importing the training and test sets, I hit:

FileNotFoundError: [Errno 2] No such file or directory: 'cs231n/datasets/cifar-10-batches-py\\data_batch_1'

So I downloaded the dataset locally. Relative to knn.ipynb, my copy of the dataset sits at cs231n/datasets/CIFAR10 (the original post links its download source).

The first task is to complete the compute_distances_two_loops section:

Compute the distance between each test point in X and each training point in self.X_train using a nested loop over both the training data and the test data.

Just compute the Euclidean distance: traverse all 500 samples in X_test, compute the distance from each one to all 5000 samples in X_train, and store the results in the dists array (shape 500 x 5000):

dists[i, j] = np.sum((X[i] - self.X_train[j])**2)**0.5
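
For context, here is a minimal sketch of how that one line sits inside the nested loops, assuming the skeleton's usual variable names (num_test, num_train, self.X_train); the starter code already provides this surrounding scaffolding:

num_test = X.shape[0]
num_train = self.X_train.shape[0]
dists = np.zeros((num_test, num_train))
for i in range(num_test):
    for j in range(num_train):
        # L2 (Euclidean) distance between test sample i and training sample j
        dists[i, j] = np.sqrt(np.sum((X[i] - self.X_train[j]) ** 2))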

Some NumPy documentation can be found here: https://cs231n.github.io/python-numpy-tutorial/#jupyter-and-colab-notebooks

Then draw a heat map (which shows differences in the data through color and brightness) like this:

[Heat map of the distance matrix]

Inline Question 1

Notice the structured patterns in the distance matrix, where some rows or columns are visibly brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)

  • What in the data is the cause behind the distinctly bright rows?
  • What causes the columns?

Your Answer:

  • What causes the distinctly bright rows? That test image is very different from all of the training images: its features are unusual, or its background color is atypical, so the whole row is bright.
  • What causes the columns? That training image is an outlier: it has no similar points among any of the test samples.

Next, complete the predict_labels part.

Here I sort row i of dists, take the indices of the k nearest of the 5000 training points (call them idx), and then count how many of those k neighbors belong to each class: how many samples of class 0 are in idx, how many of class 1, and so on (there are 10 classes in total, and idx has k entries).

The first TODO is:

idx = np.argsort(dists[i])[:k]
closest_y = self.y_train[idx]
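
As an aside (not required by the assignment), np.argpartition can find the k smallest distances without fully sorting the row; note that, unlike argsort, the k indices it returns are not themselves ordered by distance:

idx = np.argpartition(dists[i], k)[:k]  # k nearest indices, in arbitrary order
closest_y = self.y_train[idx]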

Then the predicted class is the one with the most samples among the k neighbors in idx.

The second TODO is:

counter = np.zeros(10)

# closest_y has exactly k entries
for j in closest_y: 
  counter[j] += 1

# https://blog.csdn.net/t20134297/article/details/105007292/
y_pred[i] = np.argmax(counter)

You can also write:

counter = np.zeros(np.max(self.y_train) + 1)  # counter = np.zeros(10)

# for j in closest_y:
#     counter[j] += 1

# https://blog.csdn.net/kingsure001/article/details/108168331
np.add.at(counter, closest_y, 1)  # equivalent to the per-element loop above

y_pred[i] = np.argmax(counter)
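
Yet another equivalent formulation (my addition, not from the original post) uses np.bincount, which counts label occurrences directly; np.argmax then breaks ties in favor of the smaller label, matching the assignment's tie-breaking convention:

counts = np.bincount(closest_y, minlength=np.max(self.y_train) + 1)
y_pred[i] = np.argmax(counts)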

Inline Question 2

We can also use other distance metrics such as L1 distance. For pixel values $p_{ij}^{(k)}$ at location $(i,j)$ of some image $I_k$,

the mean $\mu$ across all pixels over all images is

$$\mu = \frac{1}{nhw}\sum_{k=1}^{n}\sum_{i=1}^{h}\sum_{j=1}^{w} p_{ij}^{(k)}$$

and the pixel-wise mean $\mu_{ij}$ across all images is
$$\mu_{ij} = \frac{1}{n}\sum_{k=1}^{n} p_{ij}^{(k)}.$$
The general standard deviation $\sigma$ and pixel-wise standard deviation $\sigma_{ij}$ are defined similarly.

Which of the following preprocessing steps will not change the performance of a Nearest Neighbor classifier that uses L1 distance? Select all that apply.

  1. Subtracting the mean $\mu$ ($\tilde{p}_{ij}^{(k)} = p_{ij}^{(k)} - \mu$.)
  2. Subtracting the per pixel mean $\mu_{ij}$ ($\tilde{p}_{ij}^{(k)} = p_{ij}^{(k)} - \mu_{ij}$.)
  3. Subtracting the mean $\mu$ and dividing by the standard deviation $\sigma$.
  4. Subtracting the pixel-wise mean $\mu_{ij}$ and dividing by the pixel-wise standard deviation $\sigma_{ij}$.
  5. Rotating the coordinate axes of the data.

Your Answer:

1,3

Your Explanation:

L1 distance: $d_{1}\left( I_{1},I_{2}\right) = \sum_{p}\left| I_{1}^{p}-I_{2}^{p}\right|$

L2 distance: $d_{2}\left( I_{1},I_{2}\right) = \sqrt{\sum_{p}\left( I_{1}^{p}-I_{2}^{p}\right)^{2}}$

  • Subtracting the same mean from every pixel: each term of the L1 and L2 distances has the same constant subtracted from both images, so the relative distances are unchanged and KNN performance is not affected.
  • Subtracting a different (per-pixel) mean: the constants subtracted from the L1 and L2 terms are not necessarily the same, so both the L1 and L2 distances can change and KNN performance is affected.
  • Scaling all values by the same factor: the L1 and L2 distances are all divided by the same constant, the ranking of results is unchanged, and KNN performance is not affected.
  • Scaling by different (per-pixel) factors: the constants dividing the L1 and L2 terms are not necessarily the same, so both the L1 and L2 distances can change and KNN performance is affected.
  • Rotation can be checked with a simple example (see the numeric check below): the point set $\{(0,0), (2,2)\}$ rotated counterclockwise by $45^{\circ}$ becomes $\{(0,0), (0, 2\sqrt{2})\}$, so rotation changes the L1 distance (and can affect KNN performance) while the L2 distance is unchanged.
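
A quick numeric check of that rotation example (my sketch, not part of the original post):

import numpy as np

a = np.array([0.0, 0.0])
b = np.array([2.0, 2.0])
theta = np.pi / 4  # rotate 45 degrees counterclockwise
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
b_rot = R @ b  # approximately (0, 2*sqrt(2))

print(np.sum(np.abs(b - a)), np.sum(np.abs(b_rot - a)))  # L1: 4.0 vs ~2.83 -> changed
print(np.linalg.norm(b - a), np.linalg.norm(b_rot - a))  # L2: ~2.83 vs ~2.83 -> unchanged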

A common L1 NumPy implementation:

for i in range(num_test):
  for j in range(num_train):
    a = X_test[i]-X_train[j]
    b = np.fabs(a)
    dists[i][j] = np.sum(b)

A common L2 NumPy implementation:

for i in range(num_test):
  for j in range(num_train):
    a = X_test[i]-X_train[j]
    b = np.square(a)
    c = np.sum(b)
    dists[i][j] = np.sqrt(c)
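
For completeness, an L1 version can also be written without loops via broadcasting (my sketch, not part of the assignment). Be warned that the intermediate (num_test, num_train, d) array is far too large to materialize for the full CIFAR-10 split, so this only illustrates the broadcasting pattern (in practice it would have to be applied in chunks):

dists_l1 = np.sum(np.abs(X_test[:, None, :] - X_train[None, :, :]), axis=2)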

After that, complete the compute_distances_one_loop part, which partially vectorizes the computation:

dists[i] = np.sum((X[i] - self.X_train)**2, axis=1)**0.5

The compute_distances_no_loops part then fully vectorizes the computation. Here the squared distance is expanded and computed with vector inner products (see np.dot):

Suppose the test set $X$ is $(m \times d)$ and the training set $Y$ is $(n \times d)$, where $m$ is the number of test samples, $n$ the number of training samples, and $d$ the feature dimension. The distances between the two are computed as:

$$\sum^{m}_{i=0}\sum^{n}_{j=0}\sqrt{\left( p_{i}-c_{j}\right)^{2}} = \sum^{m}_{i=0}\sum^{n}_{j=0}\sqrt{\left\| p_{i}\right\|^{2} + \left\| c_{j}\right\|^{2} - 2\, p_{i} c_{j}}$$

Here np.sum(X**2, axis=1, keepdims=True) gives an $(M, 1)$ matrix.

Setting keepdims=True preserves the reduced axis, so the result keeps its two-dimensional shape (and broadcasts correctly against the other terms).

np.sum(self.X_train**2, axis=1) gives an $(N,)$ vector, which broadcasts against the $(M, 1)$ column to form an $(M, N)$ array.

X.dot(self.X_train.T) gives an $(M, N)$ matrix ($(M, d) \cdot (d, N) \to (M, N)$).

dists = np.sum(X**2, axis=1, keepdims=True) + np.sum(self.X_train**2, axis=1) - 2*X.dot(self.X_train.T)
dists = dists**0.5
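
To make sure the fully vectorized result matches the two-loop version, a sanity check along these lines can be run (a sketch; dists_two is assumed to hold the two-loop result, and the notebook performs a similar Frobenius-norm comparison):

difference = np.linalg.norm(dists - dists_two, ord='fro')
print('Difference: %f' % difference)  # should be (near) zero
assert np.allclose(dists, dists_two)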

Comparing the running time of the three versions shows that the fully vectorized one is by far the fastest:

Two loop version took 60.576823 seconds
One loop version took 80.915230 seconds
No loop version took 0.343678 seconds
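
These timings come from a small helper along these lines (a sketch assuming the notebook's classifier and X_test; the notebook ships its own time_function with essentially this behavior):

import time

def time_function(f, *args):
    """Return how many seconds it takes to call f with the given args."""
    tic = time.time()
    f(*args)
    return time.time() - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)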

Hyperparameter optimization

Set up the validation set

The approach follows the course notes: https://cs231n.github.io/classification/

We split the training data in two: a slightly smaller training set, and a held-out portion called the validation set.

Using CIFAR-10 as an example, we might use 49,000 of the training images for training and set aside 1,000 for validation. This validation set is essentially used as a fake test set to tune hyperparameters. At the end of the process we can plot which values of k work best, then keep that value and evaluate once on the actual test set.
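
A sketch of that split (hypothetical variable names, following the idea in the course notes):

num_validation = 1000
X_val, y_val = X_train[:num_validation], y_train[:num_validation]  # 1,000 images for validation
X_tr, y_tr = X_train[num_validation:], y_train[num_validation:]    # remaining 49,000 for training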

Cross-validation

The idea is that by setting up different validation sets and averaging their performance, you get a better, less noisy estimate of which k works well, instead of arbitrarily choosing the first 1,000 data points as the validation set and the rest as the training set.

For example, in 5-fold cross-validation we split the training data into 5 equal parts, use 4 for training and 1 for validation. We then rotate which part serves as the validation set, evaluate performance each time, and finally average the 5 validation accuracies as the score of that value of k.

Divide the training set into 5 parts:

X_train_folds = np.array_split(X_train, num_folds)  # split into 5 folds
y_train_folds = np.array_split(y_train, num_folds)

Then, for each k, loop over the 5 folds and append each fold's accuracy to the corresponding k_to_accuracies[k]:

for k in k_choices:
    k_to_accuracies[k] = []
    for i in range(num_folds):
        X_train_cv = np.vstack(X_train_folds[:i] + X_train_folds[i + 1:])
        y_train_cv = np.hstack(y_train_folds[:i] + y_train_folds[i + 1:])
        X_test_cv = X_train_folds[i]
        y_test_cv = y_train_folds[i]

        classifier = KNearestNeighbor()
        classifier.train(X_train_cv, y_train_cv)
        dists = classifier.compute_distances_no_loops(X_test_cv)
        y_test_pred = classifier.predict_labels(dists, k)
        num_correct = np.sum(y_test_pred == y_test_cv)
        accuracy = float(num_correct) / X_test_cv.shape[0]
        k_to_accuracies[k].append(accuracy)
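
The per-fold accuracies below are printed with a loop like this (a sketch matching the output format that follows):

for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))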

Results for different values of k:

k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.239000
k = 3, accuracy = 0.249000
k = 3, accuracy = 0.240000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.254000
k = 5, accuracy = 0.248000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.280000
k = 5, accuracy = 0.292000
k = 5, accuracy = 0.280000
k = 8, accuracy = 0.262000
k = 8, accuracy = 0.282000
k = 8, accuracy = 0.273000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.273000
k = 10, accuracy = 0.265000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.284000
k = 10, accuracy = 0.280000
k = 12, accuracy = 0.260000
k = 12, accuracy = 0.295000
k = 12, accuracy = 0.279000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.252000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.278000
k = 15, accuracy = 0.282000
k = 15, accuracy = 0.274000
k = 20, accuracy = 0.270000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.285000
k = 50, accuracy = 0.271000
k = 50, accuracy = 0.288000
k = 50, accuracy = 0.278000
k = 50, accuracy = 0.269000
k = 50, accuracy = 0.266000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.270000
k = 100, accuracy = 0.263000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.263000

The resulting plot:

[Cross-validation accuracy for each k, with mean and standard deviation]
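
A plot like that can be produced with matplotlib (a sketch assuming the k_choices list and k_to_accuracies dict from the loop above; the notebook's own plotting cell is similar in spirit):

import matplotlib.pyplot as plt

for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# trend line with error bars showing the standard deviation per k
# (assumes k_choices is in ascending order so it lines up with the sorted dict)
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()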
Choosing k = 7, the accuracy on the test set:

Got 137 / 500 correct => accuracy: 0.274000

Choosing k = 10, the accuracy on the test set:

Got 141 / 500 correct => accuracy: 0.282000

predict_labels function optimization (acc: 27.4%->29.4%)

But the official notes state that k = 7 or so should be optimal, so I wanted to optimize the predict_labels function.

Link: https://cs231n.github.io/classification/#summary-applying-knn-in-practice


Optimization ideas:

The maximum vote count among the k neighbors is a single number, but several classes may all reach it. In that case the original code simply picks one of the tied classes in order. However, if that class's neighbors are relatively far away while another tied class's neighbors are all close, the other class is more likely to be the correct answer. So the strategy is: when the top vote count is tied, compare the tied classes by the total distance of their neighbors (equivalently the mean, since the tied classes have the same count) and predict the class with the smallest value.

At this point, the second TODO of predict_labels changes to:

counter = np.zeros(np.max(self.y_train) + 1)  # counter = np.zeros(10)

# bin_sum starts at infinity because we take an argmin over it later
bin_sum = np.zeros(np.max(self.y_train) + 1) + np.inf

# closest_y has exactly k entries
for j in closest_y[:k]:
    counter[j] += 1

max_num = np.max(counter)

# https://www.jb51.net/article/207257.htm
# np.where()[0] gives the row indices, np.where()[1] the column indices
class_idx = np.where(counter == max_num)[0]  # classes with the maximum vote count; there may be several

if len(class_idx) > 1:
    dis_k = dists[i, idx]  # distances of the k nearest neighbors
    for j in class_idx:  # sum the Euclidean distances within each tied class
        idx_j = np.where(closest_y[:k] == j)[0]
        bin_sum[j] = np.sum(dis_k[idx_j])
    # classes not in the tie keep bin_sum = inf, so argmin only picks among the tied ones
    y_pred[i] = np.argmin(bin_sum)
else:
    y_pred[i] = class_idx[0]

Results for different k values with the modified predict_labels:

k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.257000
k = 3, accuracy = 0.263000
k = 3, accuracy = 0.273000
k = 3, accuracy = 0.282000
k = 3, accuracy = 0.270000
k = 5, accuracy = 0.263000
k = 5, accuracy = 0.274000
k = 5, accuracy = 0.292000
k = 5, accuracy = 0.297000
k = 5, accuracy = 0.288000
k = 8, accuracy = 0.271000
k = 8, accuracy = 0.298000
k = 8, accuracy = 0.284000
k = 8, accuracy = 0.301000
k = 8, accuracy = 0.291000
k = 10, accuracy = 0.270000
k = 10, accuracy = 0.305000
k = 10, accuracy = 0.288000
k = 10, accuracy = 0.295000
k = 10, accuracy = 0.286000
k = 12, accuracy = 0.268000
k = 12, accuracy = 0.304000
k = 12, accuracy = 0.286000
k = 12, accuracy = 0.290000
k = 12, accuracy = 0.276000
k = 15, accuracy = 0.259000
k = 15, accuracy = 0.307000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.294000
k = 15, accuracy = 0.281000
k = 20, accuracy = 0.267000
k = 20, accuracy = 0.291000
k = 20, accuracy = 0.293000
k = 20, accuracy = 0.290000
k = 20, accuracy = 0.283000
k = 50, accuracy = 0.274000
k = 50, accuracy = 0.289000
k = 50, accuracy = 0.282000
k = 50, accuracy = 0.267000
k = 50, accuracy = 0.274000
k = 100, accuracy = 0.260000
k = 100, accuracy = 0.271000
k = 100, accuracy = 0.266000
k = 100, accuracy = 0.260000
k = 100, accuracy = 0.266000

The resulting plot:

[Cross-validation accuracy for each k, after the optimization]
Choosing k = 7, the accuracy on the test set:

Got 148 / 500 correct => accuracy: 0.296000

Choosing k = 10, the accuracy on the test set:

Got 147 / 500 correct => accuracy: 0.294000

The results have clearly improved.

Inline Question 3

Which of the following statements about $k$-Nearest Neighbor ($k$-NN) are true in a classification setting, and for all $k$? Select all that apply.

  1. The decision boundary of the k-NN classifier is linear.
  2. The training error of a 1-NN will always be lower than or equal to that of 5-NN.
  3. The test error of a 1-NN will always be lower than that of a 5-NN.
  4. The time needed to classify a test example with the k-NN classifier grows with the size of the training set.
  5. None of the above.

Your Answer:

2,4

Your Explanation:

  1. KNN is not a linear classifier, since there is no linear relationship between input and output. Its decision boundary is made up of many small linear pieces, so it is only locally linear.
  2. With k = 1, each training point's nearest neighbor is itself, so the training error is zero. With k = 5, depending on the voting (decision) rule, the training error can be nonzero, so the 1-NN training error is always less than or equal to that of 5-NN.
  3. With a small k, noisy samples can cause overfitting and poor generalization, so k = 1 is not necessarily better than k = 5 on the test set.
  4. At test time, KNN must compute distances over the whole training set and sort the points by distance, so the time needed grows with the size of the training set.

Reference information:

k-Nearest Neighbor (kNN) exercise

CS231N homework detailed zero-based version

Solutions to Stanford’s CS 231n Assignments 1 Inline Problems: KNN
