Convolutional Neural Networks + Computer Vision: L3_Loss Functions and Optimization (Stanford course)

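Before a Linear Classifier can be put to use, two things are needed:

A Loss Function that quantifies how unhappy we are with the scores the classifier produces on the training set; in short, a yardstick that separates good predictions from bad ones.
A method for finding the parameters that make the value of the Loss Function as small as possible.

-- Step 1: flatten all the pixels of an image into a single vector, which we call the data sample Xi.
-- Step 2: take the dot product of the weight matrix with this flattened vector. The result is one score per class, and each weight says how much a given pixel matters for that class.
-- Step 3: add a bias term, which lets the classifier shift its decision boundaries and so makes it easier to draw the lines that separate the different classes.

This computation involves both a dot and an add, which is inconvenient to handle in code. So we append a "1" to the end of the flattened vector and fold the bias vector into the weight matrix as an extra column. A single dot then does all the work.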
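The bias trick can be checked numerically. The sketch below is only an illustration: the toy sizes and the names `W_ext` and `x_ext` are assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes, num_pixels = 3, 4          # toy sizes, chosen only for illustration
x = rng.random(num_pixels)              # Xi: a flattened image
W = rng.standard_normal((num_classes, num_pixels))  # weight matrix
b = rng.standard_normal(num_classes)    # bias vector

# Two operations: a dot followed by an add.
scores_two_ops = W.dot(x) + b

# Bias trick: append 1 to x and append b to W as an extra column.
x_ext = np.append(x, 1.0)               # shape (num_pixels + 1,)
W_ext = np.hstack([W, b[:, None]])      # shape (num_classes, num_pixels + 1)
scores_one_op = W_ext.dot(x_ext)

print(np.allclose(scores_two_ops, scores_one_op))
```

Both routes give the same class scores, so the rest of the code only ever needs the single extended matrix.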

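The importance of the Loss Function

The Loss Function is our main tool for judging how bad the predictions are that the linear classifier produces after scoring with the weights W. The first method introduced below is the Multiclass Support Vector Machine Loss, SVM Loss for short. The loss over the whole dataset is defined as the average of the per-example losses. The formula is as follows: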
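Written out, an average of per-example losses presumably takes the standard form below. The symbols N (number of training examples), f (the score function of the classifier) and L_i (the loss on example i) are notational assumptions, since the text does not name them.

```latex
L = \frac{1}{N} \sum_{i=1}^{N} L_i\bigl(f(x_i, W),\, y_i\bigr)
```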

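Among the parameters,
x is an image flattened into a vector; it is the raw input and the basis for the prediction;
y is the true label of that image, against which the output of the linear classifier is compared.
The goal is to find the most suitable weights W so that the whole loss function reaches its minimum.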

The form generally used is as follows:
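A common way to write the per-example SVM Loss, consistent with the worked example and the Python code in these notes, is sketched here. The symbol names are assumptions: s = f(x_i, W) is the vector of class scores, y_i is the index of the correct class and Δ is the margin.

```latex
L_i = \sum_{j \neq y_i} \max\bigl(0,\; s_j - s_{y_i} + \Delta\bigr)
```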

For reference, the wiki page: https://en.wikipedia.org/wiki/Hinge_loss

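Hinge Loss

Think of it as a passing threshold: the classifier's result for an image counts as fully correct, with no loss, only when the score of the correct class is higher than the score of every incorrect class by at least a fixed margin. Here the margin is 1. A worked calculation of the Loss Function makes this more concrete: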

	Cat image	Car image	Frog image
Cat score	3.2	1.3	2.2
Car score	5.1	4.9	2.5
Frog score	-1.7	2.0	-3.1

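This classifier clearly has problems: for the cat image the car score comes out higher than the cat score, the frog image is even further off, and only the car image is classified correctly. Plugging the scores into the formula (margin = 1) gives:
    1. Loss Function for cat: max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9 + 0 = 2.9
    2. Loss Function for car: max(0, 1.3-4.9+1) + max(0, 2.0-4.9+1) = 0 + 0 = 0
    3. Loss Function for frog: max(0, 2.2-(-3.1)+1) + max(0, 2.5-(-3.1)+1) = 6.3 + 6.6 = 12.9
    4. Entire Loss: (2.9 + 0 + 12.9)/3 = 5.27
** Note: max() compares the numbers inside the parentheses and returns the largest one.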
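The arithmetic above can be reproduced in a few lines of NumPy. This is a minimal sketch that assumes the table layout used here (rows are class scores, columns are images) and a margin of 1.

```python
import numpy as np

# Rows are class scores (cat, car, frog); columns are the cat, car and frog images.
scores = np.array([
    [ 3.2, 1.3,  2.2],
    [ 5.1, 4.9,  2.5],
    [-1.7, 2.0, -3.1],
])
labels = np.array([0, 1, 2])   # correct class index for each image (column)
delta = 1.0                    # the margin

losses = []
for i, y in enumerate(labels):
    s = scores[:, i]
    margins = np.maximum(0, s - s[y] + delta)
    margins[y] = 0             # the correct class does not count against itself
    losses.append(margins.sum())

print(np.round(losses, 2))                 # per-image losses
print(round(float(np.mean(losses)), 2))    # average over the three images
```

The per-image values come out as 2.9, 0 and 12.9, and the average as 5.27, matching the hand calculation.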

The example shows clearly what the margin does. The larger it is, the larger the gap between the correct score and the incorrect scores must be before the classifier passes, that is, before every term inside max() is at most 0 and the loss vanishes.

The Δ (delta) is a fixed margin: the score of the correctly labeled class must be higher than the scores of all the other, incorrect classes by at least this amount for the loss to be zero. If we plot the formula as a line, it looks like this:

Going one step further, the formula can be expressed in Python code as:

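import numpy as np

def Li_vectorized(x, y, W, delta=1.0):
    # x: flattened image, y: index of the correct class, W: weight matrix
    scores = W.dot(x)                                    # one score per class
    margins = np.maximum(0, scores - scores[y] + delta)  # hinge term for every class
    margins[y] = 0                                       # skip the correct class
    loss_i = np.sum(margins)
    return loss_i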

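From the formula, the minimum of the Loss Function is 0 and the maximum is +∞: every term is clamped at 0 by max(), while a wrong score can exceed the correct one by any amount.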

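Special case 1:
If W is very small at the start, all the scores come out close to 0 and close to each other, and the Loss is then about the number of classes - 1 (with delta = 1). Each incorrect class contributes roughly max(0, tiny number + delta) = delta, and the correct class's own S_yi term is not subtracted from itself in the sum, so the answer is number of classes - 1. This makes a handy sanity check for the first iteration of training.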
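A quick way to see this is to draw a tiny random W and evaluate the loss on one input. In the sketch below, the sizes (10 classes, 3073 inputs) are borrowed from the random-search code in these notes, while the random input and the choice of correct class are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 10, 3073
W = rng.standard_normal((num_classes, dim)) * 1e-4   # very small initial weights
x = rng.random(dim)                                  # stand-in for a flattened image
y = 3                                                # arbitrary correct class
delta = 1.0

scores = W.dot(x)
margins = np.maximum(0, scores - scores[y] + delta)
margins[y] = 0
loss = margins.sum()
print(loss)   # close to num_classes - 1
```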

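Special case 2:

If the max() term is squared, the resulting Loss weights the violations differently: large violations are penalized much more heavily than small ones. The first-power and squared versions are two different losses, and which one to use depends on the problem at hand.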
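To make the difference concrete, the sketch below evaluates both versions on the cat image from the worked example. The variable names are illustrative only.

```python
import numpy as np

s = np.array([3.2, 5.1, -1.7])   # cat, car, frog scores for the cat image
y = 0                            # the correct class is cat
delta = 1.0

margins = np.maximum(0, s - s[y] + delta)
margins[y] = 0

hinge_loss = margins.sum()                   # first power
squared_hinge_loss = (margins ** 2).sum()    # second power

print(round(float(hinge_loss), 2), round(float(squared_hinge_loss), 2))
```

The single violated margin of 2.9 becomes 8.41 once squared, so the same mistake costs far more under the squared version.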

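One point is important to keep in mind: even if the Loss reaches its optimum of 0, that does not mean we have the best classifier. The loss only measures how well we classify the data we already have, whereas the purpose of building the classifier is to handle unseen data. The existing data is merely the basis for predicting the unknown, and driving the training loss all the way to 0 often signals overfitting, which is not a good thing.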

Regularization

Regularization addresses the overfitting problem just mentioned, which is common throughout machine learning. Its purpose is to steer the choice of W toward a simpler set of parameters where the data allows it. A complete Loss Function therefore contains two terms, a data loss and a Regularization term, with the Regularization term multiplied by λ (a hyperparameter).

Although a linear classifier looks good once we find a W that fits the dataset perfectly, that W is not unique. For example, if some W gives zero loss, multiplying it by any factor greater than 1 also gives zero loss. The Regularization term removes this ambiguity by expressing a preference for certain sets of weights W over others.

There are many kinds of Regularization; only L2 Regularization (Weight Decay) is introduced here. The formula is as follows:
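In its usual form, the L2 penalty is the sum of the squared entries of W. The index names k and l are assumptions made for this sketch.

```latex
R(W) = \sum_{k} \sum_{l} W_{k,l}^{2}
```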

This term depends only on W, not on the data. Added to the original data term, the complete MSVM Loss Function becomes:
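Combining the averaged hinge loss with the L2 penalty gives, as a sketch in standard notation (N, f, Δ and the indices are notational assumptions):

```latex
L = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \neq y_i}
    \max\bigl(0,\; f(x_i, W)_j - f(x_i, W)_{y_i} + \Delta\bigr)
    + \lambda \sum_{k} \sum_{l} W_{k,l}^{2}
```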

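By penalizing large weights we improve generalization overall: a W with smaller weights has a smaller Regularization term, which makes the final result harder to overfit.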
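The following sketch shows how the Regularization term breaks the tie between W and 2W. The toy data, the helper names `svm_data_loss` and `l2_penalty`, and the value of λ are all assumptions chosen for illustration.

```python
import numpy as np

def svm_data_loss(W, X, y, delta=1.0):
    """Average multiclass SVM loss. X holds one flattened example per row."""
    n = X.shape[0]
    rows = np.arange(n)
    scores = X.dot(W.T)                              # shape (n, num_classes)
    correct = scores[rows, y][:, None]               # score of the true class
    margins = np.maximum(0, scores - correct + delta)
    margins[rows, y] = 0
    return margins.sum() / n

def l2_penalty(W):
    return np.sum(W ** 2)

# Toy data, assumed for illustration: 2 classes, 2 features, perfectly separable.
X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([0, 1])
W = np.array([[1.0, 0.0], [0.0, 1.0]])
lam = 0.1                                            # the hyperparameter lambda

results = {}
for name, weights in [("W", W), ("2W", 2 * W)]:
    data = svm_data_loss(weights, X, y)
    total = data + lam * l2_penalty(weights)
    results[name] = (data, total)
    print(name, data, total)
```

Both weight matrices reach a data loss of 0, but the total loss is lower for W than for 2W, so the regularized objective prefers the smaller weights.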

The other classifier: Softmax classifier
It treats the scores in a way similar to the SVM, but the Softmax output is more intuitive because the formula has a probabilistic interpretation. The loss function looks like this:
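In standard notation the Softmax function and its loss can be sketched as below, where s = f(x_i, W) is the score vector and y_i the correct class (symbol names assumed):

```latex
P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_{j} e^{s_j}},
\qquad
L_i = -\log P(Y = y_i \mid X = x_i)
    = -\log \frac{e^{s_{y_i}}}{\sum_{j} e^{s_j}}
```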

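We can take a vector of arbitrary real values and feed it through the Softmax function; the output is a new vector that preserves the ordering of the original numbers, with every entry between 0 and 1 and all entries summing to 1.

We use each computed class score as the exponent of an exponential, so the result is always a positive number, and then normalize by the sum of all these values to obtain something that behaves like a probability distribution. The exponential amplifies the largest score, for example the cat score when the image is a cat, and so widens the gap to the scores of the incorrect classes.

Compared with the previous method, Softmax needs a "-" in front because the Loss Function measures how "bad" a prediction is rather than how good. The following example shows what this means: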

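The calculation for the cat image, using the scores 3.2, 5.1 and -1.7 (the logs here are base 10; with the natural log, -log(0.13) would be 2.04):
    1. cat: exp(3.2) = 24.5 ; 24.5/(24.5+164.0+0.18) = 0.13 ; -log(0.13) = 0.89
    2. car: exp(5.1) = 164.0 ; 164.0/(24.5+164.0+0.18) = 0.87 ; -log(0.87) = 0.06
    3. frog: exp(-1.7) = 0.18 ; 0.18/(24.5+164.0+0.18) ≈ 0 ; ...
Since the image is a cat, the loss for this example is the cat term, 0.89.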
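The same numbers can be reproduced with NumPy. The sketch below subtracts the maximum score before exponentiating, a common stability trick that is an addition here and not part of the hand calculation. It also prints the loss under both log bases.

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # cat, car, frog scores for the cat image
y = 0                                 # the correct class is cat

# Subtracting the maximum score leaves the result unchanged but avoids overflow in exp().
shifted = scores - scores.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

loss_log10 = -np.log10(probs[y])      # base-10 log, matching the 0.89 in these notes
loss_ln = -np.log(probs[y])           # natural log, the more common convention

print(np.round(probs, 2))
print(round(float(loss_log10), 2), round(float(loss_ln), 2))
```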

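With the "-" in front, the loss of this method has minimum = 0, approached when the correct class gets a probability of 1, and maximum = +∞, approached when that probability goes to 0.
More details can be found at the following link:
https://blog.csdn.net/u010976453/article/details/78488279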

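Comparison of the two methods above

o SVM Loss: works directly on the raw differences between the scores and keeps them as they are. It only asks that the correct score beat the others by the margin; beyond that, gaps that are already large can be made even larger without changing the loss. In other words, this method can only "embellish" differences that are already present in the data, much like the wealth gap in the real world.

o Softmax Loss: the output is confined to a fixed interval, 0 to 1. However large the original differences are, they are squeezed into this interval in the proportions set by the formula. The emphasis is therefore on pushing the probabilities toward the two extremes, 0 or 1, and the loss keeps improving as the correct class moves toward 1 and the others toward 0.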
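A small experiment illustrates the contrast. The two score vectors below are hypothetical and chosen only so that the margin is already satisfied in both; the helper names are assumptions as well.

```python
import numpy as np

def svm_loss(s, y, delta=1.0):
    margins = np.maximum(0, s - s[y] + delta)
    margins[y] = 0
    return margins.sum()

def softmax_loss(s, y):
    shifted = s - s.max()                       # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[y]

y = 0                                           # the correct class in both cases
a = np.array([10.0, -2.0, 3.0])                 # margin already satisfied
b = np.array([10.0, -100.0, -100.0])            # incorrect scores pushed much lower

print(svm_loss(a, y), svm_loss(b, y))           # both are zero
print(softmax_loss(a, y), softmax_loss(b, y))   # the second is smaller
```

The SVM Loss is indifferent between the two cases, while the Softmax Loss still rewards pushing the incorrect scores further down.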

Optimization

This part introduces the methods used in practice to find the best value of the weights W in the Loss Function.

Strategy #1. Random search, a very bad idea though. Here are the lines of code:

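import numpy as np

# L, X_train and Y_train are assumed to be defined elsewhere:
# L returns the loss over the entire training set for the given weights.
bestloss = float('inf')        # start from infinity so that any real loss is lower
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001        # generate random parameters
    loss = L(X_train, Y_train, W)        # get the loss over the entire training set
    if loss < bestloss:        # keep track of the best solution so far
        bestloss = loss
        bestW = W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))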

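This method is generally discarded right at the outset: it is like a headless fly, relies purely on luck, and is of no practical use.

Strategy #2. Follow the slope

Differentiate the Loss Function to find its slope. In more than one dimension, the Gradient is the vector of partial derivatives along each dimension of W, and stepping in the direction of the negative gradient lowers the loss. The code is as follows: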

# evaluate_gradient, loss_fun, data, weights and step_size are placeholders.
while True:        # in practice we typically use a for loop with a fixed number of steps
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad        # perform parameter update
# step_size is also a hyperparameter; it sets how fast each update moves toward the target.

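The limit definition taught in calculus class is correct in both principle and procedure, but it is far too slow to be practical: the computer has to handle a huge amount of data, the formula to be differentiated can be extremely complex, and the loss has to be re-evaluated for every dimension of W. (It is, however, a very handy way to check that an analytic gradient has been implemented correctly, so it still has its use.) In practice we therefore derive the gradient analytically and compute it with Backpropagation; the details are covered in a later note.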
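As an illustration of that checking role, the sketch below compares a numerical gradient with a known analytic one. It uses centered differences rather than the one-sided limit definition, and the toy loss f(w) = sum(w^2) is an assumption picked because its gradient, 2w, is known exactly.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Centered finite differences: one pair of evaluations of f per dimension of w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + h
        f_plus = f(w)
        w[idx] = old - h
        f_minus = f(w)
        w[idx] = old                      # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

# Toy loss, assumed for illustration: f(w) = sum(w^2), whose gradient is 2w.
f = lambda w: np.sum(w ** 2)
w = np.array([[1.0, -2.0], [0.5, 3.0]])

numeric = numerical_gradient(f, w.copy())
analytic = 2 * w
print(np.max(np.abs(numeric - analytic)))   # should be tiny
```

The loop makes the cost visible: two loss evaluations per entry of w, which is why this approach does not scale to the full weight matrix of a real model.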

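We need the gradient of the loss, but under pressure to stay efficient while still doing the job we cannot push the whole dataset through the computation at once; it has to be broken up and handled piece by piece. The solution is Stochastic Gradient Descent (SGD).

In every iteration we sample a small set of training examples, called a minibatch, and use it to estimate the full sum and the true gradient. The code goes like this: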

# sample_training_data and evaluate_gradient are placeholders.
while True:
    data_batch = sample_training_data(data, 256)        # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += -step_size * weights_grad        # perform parameter update
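For a version that actually runs, here is a minimal sketch of minibatch SGD on the multiclass SVM loss. The toy 2-D data, the batch size of 32, the step size and the helper `svm_loss_and_grad` are all assumptions made for this illustration; they stand in for the placeholders in the loop above.

```python
import numpy as np

def svm_loss_and_grad(W, X, y, delta=1.0):
    """Average multiclass SVM loss and its gradient with respect to W."""
    n = X.shape[0]
    rows = np.arange(n)
    scores = X.dot(W.T)                               # (n, num_classes)
    margins = np.maximum(0, scores - scores[rows, y][:, None] + delta)
    margins[rows, y] = 0
    loss = margins.sum() / n
    mask = (margins > 0).astype(float)                # 1 where the margin is violated
    mask[rows, y] = -mask.sum(axis=1)                 # correct class: minus the violation count
    grad = mask.T.dot(X) / n                          # (num_classes, dim)
    return loss, grad

rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 4.0]])   # toy class centers
y_all = rng.integers(0, 3, size=600)
X_all = centers[y_all] + 0.5 * rng.standard_normal((600, 2))
X_all = np.hstack([X_all, np.ones((600, 1))])        # bias trick: append a column of ones

W = np.zeros((3, 3))
step_size = 0.1
initial_loss, _ = svm_loss_and_grad(W, X_all, y_all)

for step in range(200):
    batch = rng.choice(600, size=32, replace=False)  # sample a minibatch
    _, grad = svm_loss_and_grad(W, X_all[batch], y_all[batch])
    W += -step_size * grad                           # parameter update

final_loss, _ = svm_loss_and_grad(W, X_all, y_all)
print(initial_loss, final_loss)
```

With W initialized to zero the starting loss equals the number of classes - 1, and the loss on the full toy set drops well below that after the updates, even though each step only sees 32 examples.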

Some earlier approaches to image recognition, for comparison:
- Histogram of Oriented Gradients (HoG): divide the image into 8*8 pixel regions and, within each region, quantize the edge directions into 9 bins.
- Color Histogram
- Bag of words, inspired by natural language processing. The idea is to describe an image by counting how often different visual characteristics occur in it, the way a paragraph can be described by its word counts. Since there is no ready-made vocabulary for describing images to a computer, a vocabulary of "visual words" has to be built from the images first.
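As a rough idea of what such a hand-designed feature looks like in code, here is a sketch of a color histogram. The notes only name the technique, so the details are assumptions: this simplified variant counts per-channel RGB intensities in 8 bins, and the function name and the random stand-in image are hypothetical.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Per-channel intensity histogram, concatenated into one feature vector.

    image: uint8 array of shape (height, width, 3).
    """
    features = []
    for channel in range(3):
        counts, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        features.append(counts)
    features = np.concatenate(features).astype(float)
    return features / features.sum()          # normalize so the entries sum to 1

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)   # random stand-in image
feat = color_histogram(image)
print(feat.shape)
```

The resulting fixed-length vector is what would be handed to the linear classifier in place of the raw pixels.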

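Image features vs ConvNets
Image features: get an image >>> feature extraction >>> linear classifier >>> 10 numbers giving scores for classes; training updates only the classifier.
ConvNets: get an image >>> convolutional network >>> 10 numbers giving scores for classes; training updates the whole network.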

With hand-designed features, the feature extractor is a fixed block that stays unchanged during training. The only things that change are the parameters of the linear classifier, which are fitted to minimize the loss.

A CNN is quite similar to this earlier approach, but it learns the features directly from the data instead of building visual words first.

Next note: Convolutional Neural Networks + Computer Vision: L4_Backpropagation_Inside Neural Networks (Stanford course)


