cs231n-assignment1-knn

I originally planned to write up all of Assignment 1 in a single blog post, but it turned out far too long, so I have split it into separate parts:

part0: environment setup and configuration

part1: KNN

part2: SVM

part3: softmax

part4: two_layer_net

part5: feature


part1: KNN

Here is the explanation and walkthrough for part 1:

I. Before starting, I recommend reading this material: http://cs231n.github.io/classification/

II. How KNN works

KNN is a simple classifier with three steps (a minimal NumPy sketch follows these steps):

step1: read in the training data; "training" is nothing more than storing it;

step2: test: for each test image, compute the L2 distance to every training sample, find the k closest training images, and assign the label that occurs most often among those k as the prediction for the test sample;

step3: choose the best value of k using cross-validation.
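
The whole algorithm fits in a few lines of NumPy. Below is a minimal standalone sketch of the idea (my own toy code, not the assignment's class; it assumes labels are non-negative integers):

import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    # (num_test, num_train) matrix of L2 distances via broadcasting
    dists = np.sqrt(((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))
    # labels of the k nearest training points for each test point
    nearest = y_train[np.argsort(dists, axis=1)[:, :k]]
    # majority vote per test point (ties go to the smaller label)
    return np.array([np.bincount(row).argmax() for row in nearest])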

III. Walking through the code

knn.ipynb:

In[1]:

# Setup code

from __future__ import print_function

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default figure size
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Automatically reload external Python modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In[2]:

# Load the CIFAR-10 dataset
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# Print the shapes of the training and test data as a sanity check
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
Output:

Training data shape:  (50000, 32, 32, 3)
Training labels shape:  (50000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)

In[3]:

np.flatnonzero(a): returns the indices of the nonzero elements in the flattened version of a; here it picks out the indices where y_train equals the current class label y.

np.random.choice(a, size, replace=True, p=None): draws random samples; here it selects seven images of class y to display.

a: if an int, samples are drawn from np.arange(a); if an array, they are drawn from the array itself

size: the number of elements to draw

replace: if True, each drawn element is put back, so duplicates are possible; if False, all drawn elements are distinct

p: the probabilities with which elements are drawn (uniform if None)
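
A quick toy demonstration of these two functions (made-up arrays, purely for illustration):

import numpy as np
a = np.array([3, 0, 0, 1, 3, 0, 3])
print(np.flatnonzero(a == 3))  # indices where a equals 3 -> [0 4 6]
print(np.random.choice(np.flatnonzero(a == 3), 2, replace=False))  # e.g. [6 0], two distinct indices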

# Visualize the dataset: show seven examples from each class
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()

Result: (figure: a grid of seven random samples from each of the ten classes, with the class name above each column)

In[4]: take the first 5000 training samples and the first 500 test samples

# To make the code run faster, we use a subset of the data: 5000 training samples and 500 test samples
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]

In[5]:

# Reshape each 32*32*3 image in the dataset into a 1*3072 vector
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)
Result:

(5000, 3072) (500, 3072)
The 5000 and 500 here are the training- and test-set sizes we chose above.


In[6]:

from cs231n.classifiers import KNearestNeighbor

# Create a KNN classifier instance. Training a KNN classifier is a no-op:
# it simply stores the data and performs no computation
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)

In[7]: the code you must fill in inside k_nearest_neighbor.py is covered later in this post!

# Open cs231n/classifiers/k_nearest_neighbor.py and fill in the missing code
# that computes the L2 distances, here using two nested loops

# We pass in the test set. Note that if the test set has Nte samples and the
# training set has Ntr samples, the resulting distance matrix has shape Nte x Ntr.
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)
Result: the shape of dists, exactly as expected:

(500, 5000)

In[8]:

# Visualize the distance matrix: each row shows one test example's distances to all training examples
plt.imshow(dists, interpolation='none')
plt.show()

Result: (figure: the 500x5000 distance matrix rendered as a grayscale image, with some visibly brighter rows and columns)
Question 1: In the image above, some rows and some columns are noticeably brighter. (Dark means small distance, bright means large distance.) What causes these bright rows and columns?

Answer 1: A bright row means that test sample is far from every training sample; that test image is probably unusually bright, unusually dark, or color-shifted.
A bright column means every test sample is far from that one training sample, so that training image is likely an outlier in brightness or color in the same way.
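
If you want to find those outliers numerically instead of by eye, something like the following works (my own addition, not part of the assignment):

# indices of the 3 test samples that are, on average, farthest from all training samples
print(np.argsort(dists.mean(axis=1))[-3:])
# indices of the 3 training samples that are, on average, farthest from all test samples
print(np.argsort(dists.mean(axis=0))[-3:])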


In[9]: classifier.predict_labels takes the distance matrix dists we computed above with two loops. Now let's verify that the code is correct:

# Run KNN with k=1, which is equivalent to plain nearest-neighbor
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the number of correctly classified samples and the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
Result: we expect an accuracy of roughly 27%

Got 137 / 500 correct => accuracy: 0.274000

Next we try KNN with k=5, hoping for an accuracy slightly above 27%:


In[10]:

y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
Result:

Got 139 / 500 correct => accuracy: 0.278000

In[11]:

For np.linalg.norm(), see my blog post: http://blog.csdn.net/m0_37393514/article/details/79529168

# Now fill in the code for compute_distances_one_loop (also in
# k_nearest_neighbor.py), which uses only a single loop over the test data
dists_one = classifier.compute_distances_one_loop(X_test)
# To verify that the vectorized implementation is correct, we compare its result
# with that of the previous method. There are many ways to check whether two
# matrices are equal; one of the simplest is the Frobenius norm: the square root
# of the sum of the squared differences of all elements. Equivalently, it is the
# Euclidean distance between the two matrices after reshaping them into vectors.
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

Result: the two implementations of dists produce identical results!

Difference was: 0.000000
Good! The distance matrices are the same
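
In other words, for any matrix A, np.linalg.norm(A, ord='fro') equals np.sqrt(np.sum(A ** 2)). A quick check on random data (my own illustration):

A = np.random.randn(4, 5)
print(np.allclose(np.linalg.norm(A, ord='fro'), np.sqrt(np.sum(A ** 2))))  # True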

In[12]:

# Fill in compute_distances_no_loops, which computes dists without any explicit loops
dists_two = classifier.compute_distances_no_loops(X_test)

# Again compare against the previously computed dists to check correctness
difference = np.linalg.norm(dists - dists_two, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

Result:

Difference was: 0.000000
Good! The distance matrices are the same

In[13]:

# Compare how much time the different dists implementations take
def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

# As expected, the fully vectorized version is by far the fastest

Result:

Two loop version took 33.193914 seconds
One loop version took 62.247426 seconds
No loop version took 0.503530 seconds

(Interestingly, the one-loop version is even slower than the two-loop version here: each iteration of its single loop allocates a large (num_train, D) intermediate array for the broadcast subtraction, which costs more than the loop it saves.)


Next, we use cross-validation to choose the best k:

In[14]:

For np.vstack and np.hstack, see my blog post: http://blog.csdn.net/m0_37393514/article/details/79538748

For np.array_split(), see my blog post: http://blog.csdn.net/m0_37393514/article/details/79537639
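
A toy demonstration of how these functions fit together for cross-validation (illustrative shapes only):

folds = np.array_split(np.arange(10).reshape(5, 2), 5)  # 5 folds, each of shape (1, 2)
i = 2                                                   # hold out fold i as validation data
train = np.vstack(folds[:i] + folds[i + 1:])            # remaining folds stacked -> shape (4, 2)
val = folds[i]                                          # held-out fold -> shape (1, 2)
print(train.shape, val.shape)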

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
#####################################################################################
# Split the training data into num_folds folds, stored in X_train_folds and
# y_train_folds; y_train_folds[i] is the label vector for the points in
# X_train_folds[i]. Hint: try numpy's array_split method.
#####################################################################################
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################
# k_to_accuracies is a dictionary holding the cross-validation accuracies for
# each k: k_to_accuracies[k] is a list of length num_folds giving the accuracy
# obtained for each choice of held-out fold when using that value of k
k_to_accuracies = {}

################################################################################
# For each possible k, run the kNN algorithm num_folds times; each time, use
# num_folds-1 of the folds as training data and the remaining fold as the
# validation set. Store the accuracies in k_to_accuracies.                     #
################################################################################
for k in k_choices:
    k_to_accuracies[k] = []
    for i in range(num_folds):
        X_train_cv = np.vstack(X_train_folds[:i] + X_train_folds[i + 1:])
        X_test_cv = X_train_folds[i]

        y_train_cv = np.hstack(y_train_folds[:i] + y_train_folds[i + 1:])
        y_test_cv = y_train_folds[i]
        classifier.train(X_train_cv, y_train_cv)
        dists_cv = classifier.compute_distances_no_loops(X_test_cv)
        y_pred = classifier.predict_labels(dists_cv, k)
        num_correct = np.sum(y_pred == y_test_cv)
        accuracy = float(num_correct) / y_test_cv.shape[0]
        k_to_accuracies[k].append(accuracy)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

Result:

k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.239000
k = 3, accuracy = 0.249000
k = 3, accuracy = 0.240000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.254000
k = 5, accuracy = 0.248000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.280000
k = 5, accuracy = 0.292000
k = 5, accuracy = 0.280000
k = 8, accuracy = 0.262000
k = 8, accuracy = 0.282000
k = 8, accuracy = 0.273000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.273000
k = 10, accuracy = 0.265000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.284000
k = 10, accuracy = 0.280000
k = 12, accuracy = 0.260000
k = 12, accuracy = 0.295000
k = 12, accuracy = 0.279000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.252000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.278000
k = 15, accuracy = 0.282000
k = 15, accuracy = 0.274000
k = 20, accuracy = 0.270000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.285000
k = 50, accuracy = 0.271000
k = 50, accuracy = 0.288000
k = 50, accuracy = 0.278000
k = 50, accuracy = 0.269000
k = 50, accuracy = 0.266000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.270000
k = 100, accuracy = 0.263000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.263000

Plot accuracy on the cross-validation folds as a function of k:

In[15]:

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

Result: (figure: a scatter plot of the per-fold accuracies for each k, with a trend line and standard-deviation error bars)

The best value turns out to be k = 10.

In[16]:

# Based on the best k found by cross-validation above, you should get above 28% accuracy on the test data
best_k = 10

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))


Result:

Got 141 / 500 correct => accuracy: 0.282000

Now let's take a closer look at the three distance-computation functions that we skipped over earlier:

k_nearest_neighbor.py:

import numpy as np
from past.builtins import xrange


class KNearestNeighbor(object):
    """ 使用L2来进行KNN算法 """

    def __init__(self):
        pass

    def train(self, X, y):
        """
        在训练数据的时候,knn仅仅完成记忆的功能,而不会实现进行其他的一些处理
        输入:
        -X:一个大小为(num_train,D)的数组作为训练数据集,包含num_train个样本,每个样本都是D维度
        -y:数组的大小是(N,)包含的训练样本的标签,y[i]是X[i]的标签
        """
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1, num_loops=0):
        """
        使用knn分类器进行对测试集的预测

        输入:
        - X:是一个大小为(num_test,D)的数组,包含num_test个测试样本,每一个测试样本的维度是D
        - K:KNN选择的K
        - num_loops: 决定计算dists是采用几层的循环

        返回:
        - y: 一个数组大小为(num_test,),是为了测试集输出的标签的大小,y[i]表示X[i]的预测的类别的结果
        """
        if num_loops == 0:
            dists = self.compute_distances_no_loops(X)
        elif num_loops == 1:
            dists = self.compute_distances_one_loop(X)
        elif num_loops == 2:
            dists = self.compute_distances_two_loops(X)
        else:
            raise ValueError('Invalid value %d for num_loops' % num_loops)

        return self.predict_labels(dists, k=k)

    def compute_distances_two_loops(self, X):
        """
        计算X(测试集)的每一个样本和self.X_train的每一个样本的之间的距离,使用两层循环

        输入:
        - X: 一个大小为(num_test,D)的测试集

        返回:
        - dists: 一个大小为(num_test,num_train)的距离矩阵,dists[i][j]表示第i个测试样本和第j个训练样本之间的欧氏距离
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in xrange(num_test):
            for j in xrange(num_train):
                #####################################################################
                # TODO:                                                             #
                # Compute the l2 distance between the ith test point and the jth    #
                # training point, and store the result in dists[i, j]. You should   #
                # not use a loop over dimension.                                    #
                #####################################################################
                dists[i, j] = np.linalg.norm(X[i] - self.X_train[j])
                #####################################################################
                #                       END OF YOUR CODE                            #
                #####################################################################
        return dists

    def compute_distances_one_loop(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using a single loop over the test data.

        Input / Output: Same as compute_distances_two_loops
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in xrange(num_test):
            #######################################################################
            # Compute the L2 distance between the ith test point and every      #
            # training point at once, and store the result in dists[i, :].      #
            #######################################################################
            dists[i, :] = np.linalg.norm(X[i] - self.X_train, axis=1)
            #######################################################################
            #                         END OF YOUR CODE                            #
            #######################################################################
        return dists

    def compute_distances_no_loops(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using no explicit loops.

        Input / Output: Same as compute_distances_two_loops
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        #########################################################################
        # Compute the L2 distance between all test points and all training     #
        # points without using any explicit loops. You may not use functions   #
        # from scipy; only basic array operations are allowed.                 #
        # Hint: use matrix multiplication and two broadcast sums.              #
        #########################################################################
        dists += np.sum(X ** 2, axis=1).reshape(num_test, 1)
        dists += np.sum(self.X_train ** 2, axis=1).reshape(1, num_train)
        dists -= 2 * np.dot(X, self.X_train.T)
        dists = np.sqrt(dists)
        #########################################################################
        #                         END OF YOUR CODE                              #
        #########################################################################
        return dists

    def predict_labels(self, dists, k=1):
        """
        给定测试样本和训练样本之间的距离,求解每一个测试样本的标签

        输入:dists,大小为(num_test,num_train)

        返回:
        - y,大小为(num_test,),y[i]表示预测的X[i]的类别
        """
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # Stores the labels of the k nearest neighbors of the ith test sample
            closest_y = []
            #########################################################################
            # Use the distance matrix to find the k nearest neighbors of the ith  #
            # test point, and use self.y_train to fetch their labels; store them  #
            # in closest_y.                                                       #
            # Hint: look up the function numpy.argsort.                           #
            #########################################################################
            closest_y = self.y_train[np.argsort(dists[i])[:k]]
            #########################################################################
            # Now that you have the labels of the k nearest neighbors, find the   #
            # most common label in closest_y and store it in y_pred[i]. Note that #
            # np.bincount needs non-negative integer labels, and np.argmax breaks #
            # ties toward the smaller label.                                      #
            #########################################################################
            y_pred[i] = np.argmax(np.bincount(closest_y))
            #########################################################################
            #                           END OF YOUR CODE                            #
            #########################################################################

        return y_pred


The two-loop version is straightforward, so I won't dwell on it; it just calls np.linalg.norm(),
which I describe on my blog: http://blog.csdn.net/m0_37393514/article/details/79529168

The one-loop version computes, for each test sample, the distances to all training samples at once,
again with np.linalg.norm() but relying on broadcasting: X[i] has shape (D,) and is broadcast against
X_train of shape (num_train, D), and taking the norm along axis=1 (the D direction) yields
dists[i, :] of shape (num_train,).

The no-loop version: honestly, I couldn't come up with it myself and adapted it from an explanation
I found online: http://blog.csdn.net/geekmanong/article/details/51524402
The key identity (this replaces the figure I originally meant to include) is that for a test sample
x_i and a training sample x_j:

||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 * (x_i . x_j)

So with test set X -> (num_test, D) and training set self.X_train -> (num_train, D): the elementwise
square X ** 2 (equivalently np.multiply(X, X)) has shape (num_test, D); summing it along axis=1 gives
a (num_test,) vector of squared norms, which we reshape to (num_test, 1). Likewise self.X_train ** 2
summed along axis=1 gives a (num_train,) vector, which we reshape to (1, num_train). The cross term
-2 * np.dot(X, self.X_train.T) is a matrix product of shape (num_test, num_train). Adding the three
terms broadcasts the two reshaped vectors across columns and rows respectively, producing the full
(num_test, num_train) matrix of squared distances; a final np.sqrt gives dists.
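
A quick sanity check of this identity on random data (my own test, with small made-up shapes):

Xa = np.random.randn(3, 4)  # 3 "test" points
Xb = np.random.randn(5, 4)  # 5 "training" points
d2 = np.sum(Xa ** 2, axis=1).reshape(3, 1) \
   + np.sum(Xb ** 2, axis=1).reshape(1, 5) \
   - 2 * np.dot(Xa, Xb.T)
naive = np.array([[np.sum((x - y) ** 2) for y in Xb] for x in Xa])
print(np.allclose(d2, naive))  # True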


For more on NumPy's broadcasting mechanism, see this excellent blog post: http://blog.csdn.net/weixin_39449570/article/details/78696991
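
As a minimal illustration of broadcasting itself (toy shapes):

col = np.arange(3).reshape(3, 1)  # shape (3, 1)
row = np.arange(4).reshape(1, 4)  # shape (1, 4)
print((col + row).shape)          # (3, 4): each array is stretched to match the other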

