Implementing kNN from scratch for image classification (Python)

Dataset download

This is an assignment from the Stanford course CS231n. Following the instructions in assignment1 on the course syllabus, download the dataset needed for the exercise.

Contents

Preface

Main code

Part 1: Loading and preprocessing the data

Part 2: kNN training and prediction

Implementation details

3.1 Loading the data

3.2 Computing dists (three implementations, from easiest to hardest, same underlying idea)

3.3 Prediction

Choosing hyperparameters via cross-validation


Preface

1. The usual kNN preprocessing step is to normalize the data. For images, each pixel can be treated as a feature. Because pixel values are homogeneous (all on the same scale) and do not follow widely different distributions, data normalization is not needed here.

2. A second preprocessing step when applying kNN in practice is dimensionality reduction: kNN votes based on a distance metric, and distances become counter-intuitive in high dimensions; there is also a mathematical explanation for why they degrade (see section 6 of the referenced article). This exercise is only meant for learning kNN; in practice kNN is never used for image classification, for the reasons given in point 3. A hedged dimensionality-reduction sketch appears after this list.

3. The strength of kNN is that it is simple and intuitive. Its weaknesses: first, it spends almost no time on training (as shown below, training merely caches the training set) but a lot of time on prediction, which is the opposite of what we need in practice (there are extensions that trade accuracy for faster prediction, e.g. Approximate Nearest Neighbor (ANN); see FLANN). Second, kNN performs poorly on high-dimensional data.

4. You can start with a very simple example, like the toy sketch below.
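As a warm-up, here is a minimal toy kNN sketch on 2-D points; the points, labels, and choice of k are all invented for illustration:

import numpy as np

# Toy training data: four 2-D points with binary labels (invented for illustration).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_train = np.array([0, 0, 1, 1])

def knn_predict(x, k=3):
    # Euclidean distance from the query point to every training point (broadcasting).
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Labels of the k nearest neighbors; majority vote via bincount.
    nearest = y_train[np.argsort(dists)[:k]]
    return np.argmax(np.bincount(nearest))

print(knn_predict(np.array([0.9, 0.2])))  # prints 1: two of its three nearest neighbors have label 1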
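And on point 2 (dimensionality reduction before kNN), a minimal sketch of PCA via NumPy's SVD; the random stand-in data and the 100-component cutoff are assumptions for illustration only:

import numpy as np

# Stand-in for flattened image data (assumption: 500 samples, 3072 features).
X = np.random.randn(500, 3072)

# PCA via SVD: center the data, then project onto the top principal components.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
n_components = 100  # arbitrary illustrative choice
X_reduced = X_centered @ Vt[:n_components].T

print(X_reduced.shape)  # (500, 100): kNN distances would now be computed in 100-D

In real use you would fit the projection (the mean and Vt) on the training set only and apply the same projection to the test set.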

Main code

Part 1: Loading and preprocessing the data

from __future__ import print_function  # future imports must precede other statements

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

"""
This cell just configures the notebook.
"""

# Run some setup code for this notebook.

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2


# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'

# Cleaning up variables to prevent loading data multiple times (which may cause memory issue)
try:
   del X_train, y_train
   del X_test, y_test
   print('Clear previously loaded data.')
except NameError:
   pass

# load_CIFAR10 from cs231n.data_utils is implemented later, in section 3.1
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
# print('Training data shape: ', X_train.shape) >> Training data shape:  (50000, 32, 32, 3)
# print('Training labels shape: ', y_train.shape) >> Training labels shape:  (50000,)
# print('Test data shape: ', X_test.shape) >> Test data shape:  (10000, 32, 32, 3)
# print('Test labels shape: ', y_test.shape) >> Test labels shape:  (10000,)


# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = range(num_training)
# Either line below selects the first 5000 rows; this also works for
# higher-dimensional arrays, e.g. (10000, 32, 32, 3) becomes (5000, 32, 32, 3).
#X_train = X_train[mask]
X_train = X_train[np.arange(num_training)]
y_train = y_train[mask]
#print(X_train.shape) >> (5000, 32, 32, 3)
num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]


# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
# print(X_train.shape, X_test.shape) >> (5000, 3072) (500, 3072)
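As an aside, the -1 passed to np.reshape asks NumPy to infer that dimension from the array's total size; a quick toy check:

import numpy as np

a = np.zeros((5, 32, 32, 3))
print(np.reshape(a, (a.shape[0], -1)).shape)  # (5, 3072): 32*32*3 is inferred from -1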

Part 2: kNN training and prediction

from cs231n.classifiers import KNearestNeighbor


# the Classifier simply remembers the data and does no further processing 
# (implementation shown later, in section 3)
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)

# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)

# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
#plt.imshow(dists, interpolation='none')
#plt.show()

# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

"""
接下来是用不同的方式计算dist,并进行比较,可以不看
"""

# Now lets speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)

# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
# The Frobenius norm (ord='fro') of a matrix is the square root of the sum of the squared absolute values of its entries.
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

# Now implement the fully vectorized version inside compute_distances_no_loops
# and run the code
dists_two = classifier.compute_distances_no_loops(X_test)

# check that the distance matrix agrees with the one we computed before:
difference = np.linalg.norm(dists - dists_two, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')



# Let's compare how fast the implementations are
def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

# you should see significantly faster performance with the fully vectorized implementation

Implementation details

Computing the distance matrix naively requires a double loop, but NumPy broadcasting lets us cut it to a single loop, or even compute the whole matrix in one shot.
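A quick illustration of the broadcasting that makes this possible (toy arrays, invented for illustration):

import numpy as np

x = np.array([[1, 2], [3, 1]])  # 2 "training" points
y = np.array([[2, 2], [3, 3]])  # 2 "test" points

# One loop: y[0] has shape (2,) and broadcasts against the (2, 2) matrix x.
print(np.sum((y[0] - x) ** 2, axis=1))  # squared distances from y[0] to each row of x

# No loop at all: (2, 1, 2) minus (2, 2) broadcasts to (2, 2, 2); sum over the last axis.
print(np.sum((y[:, np.newaxis, :] - x) ** 2, axis=2))

Note that the fully broadcast version materializes a (num_test, num_train, D) array, which is why the assignment's no-loop solution below uses the algebraic expansion of the squared distance instead.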

3.1 Loading the data

from __future__ import print_function

from six.moves import cPickle as pickle
import numpy as np
import os
import platform

def load_pickle(f):
    version = platform.python_version_tuple()
    if version[0] == '2':
        return  pickle.load(f)
    elif version[0] == '3':
        return  pickle.load(f, encoding='latin1')
    raise ValueError("invalid python version: {}".format(version))

def load_CIFAR_batch(filename):
  """ load single batch of cifar """
  with open(filename, 'rb') as f:
    datadict = load_pickle(f)
    X = datadict['data']
    Y = datadict['labels']
    X = X.reshape(10000, 3, 32, 32).transpose(0,2,3,1).astype("float")
    Y = np.array(Y)
    return X, Y

def load_CIFAR10(ROOT):
  """ load all of cifar """
  xs = []
  ys = []
  for b in range(1,6):
    f = os.path.join(ROOT, 'data_batch_%d' % (b, ))
    X, Y = load_CIFAR_batch(f)
    xs.append(X)
    ys.append(Y)    
  Xtr = np.concatenate(xs)
  Ytr = np.concatenate(ys)
  del X, Y
  Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
  return Xtr, Ytr, Xte, Yte


def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000,
                     subtract_mean=True):
    """
    Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
    it for classifiers. These are the same steps as we used for the SVM, but
    condensed to a single function.
    """
    # Load the raw CIFAR-10 data
    cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
        
    # Subsample the data
    mask = list(range(num_training, num_training + num_validation))
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = list(range(num_training))
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = list(range(num_test))
    X_test = X_test[mask]
    y_test = y_test[mask]

    # Normalize the data: subtract the mean image
    if subtract_mean:
      mean_image = np.mean(X_train, axis=0)
      X_train -= mean_image
      X_val -= mean_image
      X_test -= mean_image
    
    # Transpose so that channels come first
    X_train = X_train.transpose(0, 3, 1, 2).copy()
    X_val = X_val.transpose(0, 3, 1, 2).copy()
    X_test = X_test.transpose(0, 3, 1, 2).copy()

    # Package data into a dictionary
    return {
      'X_train': X_train, 'y_train': y_train,
      'X_val': X_val, 'y_val': y_val,
      'X_test': X_test, 'y_test': y_test,
    }
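A hedged usage sketch (assuming the dataset sits at the default path hard-coded above):

data = get_CIFAR10_data()
for name, arr in sorted(data.items()):
    print(name, arr.shape)
# e.g. X_train (49000, 3, 32, 32), X_val (1000, 3, 32, 32), X_test (1000, 3, 32, 32)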
    

3.2 Computing dists (three implementations, from easiest to hardest, same underlying idea)

Double loop (the simplest and most intuitive implementation)

"""test for two loops
x=np.array([[1,2],[3,1]])
y=np.array([2,2]).reshape(1,2)
a=np.zeros((y.shape[0],x.shape[0]))
for i in range(y.shape[0]):
    for j in range(x.shape[0]):
        a[i,j]=np.sum((y[i,:]-x[j,:])**2)
"""        

def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the 
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      for j in range(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        
        # L2 distance.
        dists[i,j] = np.sqrt(np.sum((X[i,:]-self.X_train[j,:])**2))
        
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
        
    return dists

One loop

# test for one loop
#x = np.array([[1,2],[3,1]]) # training set
#y = np.array([[2,2],[3,3]]) # test set
#a = np.zeros((y.shape[0], x.shape[0]))
"""
# Row 0 of y operates against every row of x (broadcasting)
y[0,:]-x
Out[19]: 
array([[ 1,  0],
       [-1,  1]])
# axis=1 sums along each row; omitting axis would sum the whole matrix
np.sum((y[0,:]-x),axis=1)
Out[21]: array([1, 0])
"""
#for i in range(y.shape[0]):
#        a[i,:]=np.sum((y[i,:]-x)**2,axis=1)


  def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      #######################################################################
      # TODO:                                                               #
      # Compute the l2 distance between the ith test point and all training #
      # points, and store the result in dists[i, :].                        #
      #######################################################################
      # L2 distance.
      dists[i,:] = np.sqrt(np.sum((X[i,:] - self.X_train)**2, axis = 1))
      #######################################################################
      #                         END OF YOUR CODE                            #
      #######################################################################
    return dists

No loops (uses the identity (x - y)^2 = x^2 + y^2 - 2xy; note that x and y are matrices here, so the formula needs some adaptation)

  def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train)) 
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training      #
    # points without using any explicit loops, and store the result in      #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################
    """
    np.sum(y**2,axis=1)
    Out[23]: array([ 8, 18])
    
    """
    # L2 distance vectorized.
    X_squared = np.sum(X**2,axis=1)
    Y_squared = np.sum(self.X_train**2,axis=1)
    XY = np.dot(X, self.X_train.T)

    # Expand L2 distance formula to get L2(X,Y) = sqrt((X-Y)^2) = sqrt(X^2 + Y^2 -2XY)
    """
    X_squared[:,np.newaxis]增加了一维,例如原本为(10,)现在变为(10,1),可以在直觉上理解为转置操作
    但因为python对一维向量转置还是它自己(因为实际上一维就是数组,数组的转置就是它本身),所以不能直接
    使用T的操作
    """
    dists = np.sqrt(X_squared[:,np.newaxis] + Y_squared -2*XY)

    # Also useful https://medium.com/dataholiks-distillery/l2-distance-matrix-vectorization-trick-26aa3247ac6c
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists
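To convince yourself the expansion is correct, here is a tiny numeric check on random toy matrices (invented for illustration):

import numpy as np

X = np.random.randn(3, 4)  # 3 "test" points
T = np.random.randn(5, 4)  # 5 "training" points

direct = np.sqrt(np.sum((X[:, np.newaxis, :] - T) ** 2, axis=2))
expanded = np.sqrt(np.sum(X**2, axis=1)[:, np.newaxis]
                   + np.sum(T**2, axis=1) - 2 * X @ T.T)

print(np.allclose(direct, expanded))  # True, up to floating-point error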

3.3 Prediction

  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].  
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      # Select a test row.
      test_row = dists[i,:]
      
      # np.argsort returns the indices that would sort the input in ascending order.
      sorted_row = np.argsort(test_row)

      # Take the labels of the k closest training points.
      closest_y = self.y_train[sorted_row[0:k]]
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      # Find the most frequently occurring label among the closest k.
      """
      np.bincount([2, 2, 2, 2, 1, 1, 1])
      Out[54]: array([0, 3, 4])  # 0 occurs 0 times, 1 occurs 3 times, 2 occurs 4 times
      """
      # np.argmax returns the first (i.e. smallest) index with the maximum count,
      # which satisfies the "break ties by choosing the smaller label" requirement.
      y_pred[i] = np.argmax(np.bincount(closest_y))
      #########################################################################
      #                           END OF YOUR CODE                            # 
      #########################################################################

    return y_pred
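The tie-breaking behaviour is easy to check in isolation (toy labels, invented for illustration):

import numpy as np

# Labels 1 and 3 tie with two votes each; argmax returns the first
# (smallest) index with the maximum count.
votes = np.bincount(np.array([1, 1, 3, 3]))
print(votes)             # [0 2 0 2]
print(np.argmax(votes))  # 1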


"""
或者写成汇总的形式
"""
  def predict(self, X, k=1, num_loops=0):
    """
    Predict labels for test data using this classifier.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data consisting
         of num_test samples each of dimension D.
    - k: The number of nearest neighbors that vote for the predicted labels.
    - num_loops: Determines which implementation to use to compute distances
      between training points and testing points.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].  
    """
    if num_loops == 0:
      dists = self.compute_distances_no_loops(X)
    elif num_loops == 1:
      dists = self.compute_distances_one_loop(X)
    elif num_loops == 2:
      dists = self.compute_distances_two_loops(X)
    else:
      raise ValueError('Invalid value %d for num_loops' % num_loops)

    return self.predict_labels(dists, k=k)

Choosing hyperparameters via cross-validation

Why do we need a validation set? Because the model has hyperparameters, and to judge a setting's quality we evaluate it on a validation set (a held-out portion of the training set). Crucially, treat the test set as if it belongs to someone else: it may only be used to evaluate the final model (put differently, you touch it once, right before release, and the accuracy measured there is the accuracy you report).

Cross-validation works like this: split the training set into several folds, say 5. First use fold 1 as the validation set and the remaining folds as the training set to score the current choice of k; then use fold 2 as the validation set, and so on, for 5 rounds in total.

In the end, each value of k gets 5 scores.

The results are visualized below: each vertical bar shows the standard deviation of the 5 accuracies for a given k, and the trend line connects their means.

In practice, cross-validation is rarely used because it is computationally expensive; but when the training set is small, it is a good way to pick hyperparameters.

In particular, we cannot use the test set for the purpose of tweaking hyperparameters.

Luckily, there is a correct way of tuning the hyperparameters and it does not touch the test set at all. The idea is to split our training set in two: a slightly smaller training set, and what we call a validation set

Cross-validation. In cases where the size of your training data (and therefore also the validation data) might be small, people sometimes use a more sophisticated technique for hyperparameter tuning called cross-validation
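The simpler validation-split approach described in the quote (without cross-validation) might look like the sketch below, reusing X_train, y_train, and KNearestNeighbor from earlier; the 4000/1000 split is an illustrative assumption:

# Hold out the last 1000 of our 5000 subsampled training rows as a validation set.
num_val = 1000
X_tr, y_tr = X_train[:-num_val], y_train[:-num_val]
X_val, y_val = X_train[-num_val:], y_train[-num_val:]

for k in [1, 3, 5, 10]:
    classifier = KNearestNeighbor()
    classifier.train(X_tr, y_tr)
    acc = np.mean(classifier.predict(X_val, k=k) == y_val)
    print('k = %d, validation accuracy = %f' % (k, acc))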

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################

X_train_folds = np.array_split(X_train,num_folds)
y_train_folds = np.array_split(y_train,num_folds)

################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################

for k in k_choices:
    for n in range(num_folds):
        # Concat all our folds together except for the nth fold for training.
        current_train_fold_x = np.concatenate(tuple([X_train_folds[i] for i in range(num_folds) if i!=n]))
        current_train_fold_y = np.concatenate(tuple([y_train_folds[i] for i in range(num_folds) if i!=n]))
        
        # Select the held out fold to be our test data.
        current_test_fold_x = X_train_folds[n]
        current_test_fold_y = y_train_folds[n]
        
        classifier.train(current_train_fold_x, current_train_fold_y)
        
        # Perform prediction on our test set, default is to use no loop version.
        y_test_pred = classifier.predict(current_test_fold_x, k=k)
        
        # Evaluate and store in k_to_accuracies dict.
        num_correct = np.sum(y_test_pred == current_test_fold_y)
        if k not in k_to_accuracies:
            k_to_accuracies[k] = [float(num_correct) / current_test_fold_x.shape[0]]
        else:
            k_to_accuracies[k].append(float(num_correct) / current_test_fold_x.shape[0])

################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

"""
# output
k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.239000
k = 3, accuracy = 0.249000
k = 3, accuracy = 0.240000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.254000
k = 5, accuracy = 0.248000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.280000
k = 5, accuracy = 0.292000
k = 5, accuracy = 0.280000
k = 8, accuracy = 0.262000
k = 8, accuracy = 0.282000
k = 8, accuracy = 0.273000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.273000
k = 10, accuracy = 0.265000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.284000
k = 10, accuracy = 0.280000
k = 12, accuracy = 0.260000
k = 12, accuracy = 0.295000
k = 12, accuracy = 0.279000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.252000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.278000
k = 15, accuracy = 0.282000
k = 15, accuracy = 0.274000
k = 20, accuracy = 0.270000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.285000
k = 50, accuracy = 0.271000
k = 50, accuracy = 0.288000
k = 50, accuracy = 0.278000
k = 50, accuracy = 0.269000
k = 50, accuracy = 0.266000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.270000
k = 100, accuracy = 0.263000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.263000
"""

We can visualize the results:

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

Based on the chart, choose the best k and measure the model's accuracy on the test set.

# Based on the cross-validation results above, choose the best value for k,   
# retrain the classifier using all the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.
best_k = 10

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

Reposted from blog.csdn.net/normol/article/details/84145071