Andrew Ng's Improving Deep Neural Networks Course Notes (3) - Hyperparameter Tuning and Batch Normalization


Video course link:
https://www.bilibili.com/video/BV1FT4y1E74V?
Note reference link:
https://blog.csdn.net/weixin_36815313/article/details/105728919

1. Hyperparameter Tuning (Tuning Process)

Training a neural network involves setting many different hyperparameters. A reasonable priority ordering for tuning them is as follows:

  • First priority
    • Learning rate $\alpha$
  • Second priority
    • Momentum parameter $\beta$ (typically 0.9)
    • The mini-batch size
    • The number of hidden units
  • Third priority
    • The number of layers of the neural network
    • The learning rate decay factor
  • Fourth priority
    • Adam parameter $\beta_1$ (typically 0.9)
    • Adam parameter $\beta_2$ (typically 0.999)
    • Adam parameter $\epsilon$ (typically $10^{-8}$)

[Figure: grid search over two hyperparameters]

In the earlier era of machine learning, if you had two hyperparameters, call them hyperparameter 1 and hyperparameter 2, the common practice was to sample points on a grid and then systematically try those values (as in the figure above).
[Figure: random sampling of two hyperparameters]

In deep learning, the common approach is instead to sample points at random (as in the figure above); you can sample the same number of points and then evaluate the hyperparameters at those randomly chosen values. The reason is that it is hard to know in advance which hyperparameters matter most for the problem you are trying to solve.
[Figure: random sampling inside a cube of three hyperparameters]

In practice you may be searching over more than two hyperparameters. With three hyperparameters you search over a cube rather than a square (as in the figure above), with hyperparameter 3 on the third axis; sampling inside this three-dimensional cube lets you try many more values.
[Figure: coarse-to-fine search, zooming into a promising region]

Another common practice is a coarse-to-fine search strategy when sampling hyperparameter values.
Take two hyperparameters as an example. After sampling values, you may find that one point works best and that some nearby points also work well. The next step is to zoom in on that smaller region (the small blue box), sample more densely or randomly within it, and concentrate your search resources there. In other words, once a coarse search over the whole square suggests where the best values lie, you focus on a smaller square and sample points more densely inside it. This coarse-to-fine search is used very often.
By trying different hyperparameter values, you can pick the value that is best for the training-set objective, best on the development set, or best for whatever you most want to optimize during the hyperparameter search.
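
As an illustration, here is a minimal sketch of random search over two hyperparameters; train_and_evaluate is a hypothetical stand-in for training a model with the given settings and returning a dev-set metric to maximize:

import numpy as np

def random_search(train_and_evaluate, n_trials=25):
    # Randomly sample (learning rate, hidden units) pairs and keep the best one.
    best_score, best_params = -np.inf, None
    for _ in range(n_trials):
        alpha = np.power(10, np.random.uniform(-4, 0))   # learning rate on a log scale (see Section 2)
        n_hidden = np.random.randint(50, 101)            # hidden units on a linear scale
        score = train_and_evaluate(alpha, n_hidden)      # hypothetical: train and return a dev-set metric
        if score > best_score:
            best_score, best_params = score, (alpha, n_hidden)
    return best_params, best_score

A coarse-to-fine search simply repeats this loop with the sampling ranges narrowed around the best point found so far.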

2. Select the appropriate range for the hyperparameters (Using an Appropriate Scale to Pick Hyperparameters)

Random sampling over the hyperparameter range improves your search efficiency, but random sampling does not mean sampling uniformly over the valid range; it means choosing an appropriate scale on which to explore each hyperparameter.

2.1 Linear axis scaling

[Figure: number line from 50 to 100 with uniformly sampled points]

Suppose you want to choose the number of hidden units $n^{[l]}$ for a given layer, and you think the range $[50, 100]$ is reasonable. In this case, drawing a number line from 50 to 100 and sampling points uniformly at random along it is a natural way to search for this hyperparameter.
[Figure: choosing the number of layers from 2, 3, 4]

If you want to choose the number of layers of the neural network, say a value between 2 and 4, then sampling uniformly at random over 2, 3, and 4, or even applying a grid search over these three values, is perfectly reasonable.
These are a few examples where sampling uniformly at random within the range you are considering is reasonable (a short sketch follows); for some other hyperparameters, however, this approach is not appropriate.
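
A minimal sketch of such linear-scale sampling for these two hyperparameters:

import numpy as np

# Number of hidden units: any integer in [50, 100] is equally plausible,
# so uniform random sampling on a linear scale is fine.
n_hidden = np.random.randint(50, 101)

# Number of layers: a small discrete set, so uniform sampling over
# {2, 3, 4} (or a grid search over these values) is reasonable.
n_layers = np.random.choice([2, 3, 4])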

2.2 Log axis scale

[Figure: sampling the learning rate on linear vs. logarithmic scales]

Suppose you are searching for the hyperparameter $\alpha$ (the learning rate), whose range might be $[0.0001, 1]$. If you draw a number line from 0.0001 to 1 and sample uniformly at random along it, about 90% of the values will fall between 0.1 and 1; that is, 90% of the search resources go to the interval from 0.1 to 1, and only 10% to the interval from 0.0001 to 0.1.
It is therefore more reasonable to search this hyperparameter on a logarithmic scale rather than a linear one: mark 0.0001, 0.001, 0.01, 0.1, and 1 on the axis and sample uniformly at random on this log axis. That way, more search resources are available between 0.0001 and 0.001, between 0.001 and 0.01, and so on.
One way to implement this log-scale sampling in Python:

import numpy as np

r = -4 * np.random.rand()   # r is uniform in (-4, 0]
a = np.power(10, r)         # a = 10^r, the sampled learning rate

np.random.rand() returns a random value in $[0,1)$, so $r$ lies in $(-4, 0]$ and $a = 10^r$ lies in $(10^{-4}, 10^0]$, i.e., $a$ effectively covers the range from 0.0001 to 1.
More generally, to sample on a log scale between $10^a$ and $10^b$: in this example $10^a = 0.0001$, so $a = \log_{10}{0.0001} = -4$, and similarly $b = \log_{10}{1} = 0$. All you have to do is sample $r$ uniformly at random in the interval $[a, b]$ and then set the hyperparameter to $10^r$.
[Figure: sampling $1-\beta$ on a logarithmic scale]

Another example is choosing $\beta$ for computing exponentially weighted averages. Suppose the range of $\beta$ is $[0.9, 0.999]$. When computing an exponentially weighted average, $\beta = 0.9$ is roughly like averaging over the last 10 values, while $\beta = 0.999$ is roughly like averaging over the last 1000 values. So what we should explore is $1-\beta$, whose range is $[0.001, 0.1]$. On a logarithmic scale, 0.1 corresponds to $10^{-1}$ and 0.001 corresponds to $10^{-3}$, so all you have to do is sample $r$ uniformly at random in $[-3, -1]$ and set $1-\beta = 10^r$, i.e., $\beta = 1 - 10^r$, which turns this into a random value within the desired range. With this approach you spend as many resources searching for $\beta$ between 0.9 and 0.99 as between 0.99 and 0.999.
The reason not to use a linear scale is that as $\beta$ approaches 1, the results become very sensitive to even slight changes in $\beta$. If $\beta$ moves between 0.9 and 0.9005, then by the formula $\frac{1}{1-\beta}$ you are still averaging over roughly 10 values, so your results barely change. But if $\beta$ moves between 0.999 and 0.9995, this has a huge effect on your algorithm: $\beta = 0.999$ averages over roughly 1000 values, while $\beta = 0.9995$ averages over roughly 2000 values. So in that part of the range you need to sample more densely.
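
A minimal Python sketch of this sampling scheme:

import numpy as np

# Sample beta on a log scale of 1 - beta: r is uniform in [-3, -1],
# so 1 - beta spans [0.001, 0.1] and beta spans [0.9, 0.999].
r = np.random.uniform(-3, -1)
beta = 1 - np.power(10, r)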

3. Hyperparameter Tuning in Practice: Pandas vs. Caviar

There are roughly two major schools of thought about how to search for hyperparameters, or two important but different ways people typically go about it.

3.1 Babysitting one model

[Figure: babysitting one model, learning curve adjusted day by day]

One approach is to babysit a single model. This is usually the case when you have a huge dataset but not a lot of computational resources, not many CPUs and GPUs, so you can basically afford to train only one model, or a very small number of models, at a time. In that case you can gradually improve the model even while it is training. For example, on day 0 you initialize the parameters randomly and start training, then you watch the learning curve, which might be the cost function $J$, the dev-set error, or something else, gradually decrease over the first day. At the end of that day you might try increasing the learning rate a little to see how it does, and perhaps it does better on day 2. Two days later it is still doing well, so maybe you add a bit of Momentum or decrease the learning rate a bit. You keep watching it every day and keep adjusting the parameters. Maybe one day you find the learning rate was too large, so you go back to the previous day's model. Either way, you spend time looking after this one model every day, even over many days or weeks.
So this is one approach: you babysit a single model, watch how it behaves, and patiently nudge the learning rate up or down. People usually do it this way because they do not have enough computational capacity to try many models at the same time.

3.2 Simultaneous training of multiple models (Training many models in parallel)

[Figure: learning curves of many models trained in parallel]

The other approach is to train many models in parallel. You pick some hyperparameter settings, let the model run on its own for a day or even several days, and you get a learning curve like the one above (the blue curve); this could be the cost function $J$, the training error, or the dev-set error, any measure of how the curve evolves. At the same time you can start a second model with different hyperparameter settings, which produces a different learning curve (the purple curve), and maybe that one looks better. Meanwhile you can train a third model, which produces yet another learning curve, and so on. You can train many different models in parallel, where the different orange curves correspond to different models, try many hyperparameter settings this way, and at the end quickly pick the one that works best.

4. Normalizing Activations in a Network

4.1 The role of Batch normalization

After the rise of deep learning, one of the most important ideas has been an algorithm called Batch normalization. Batch normalization makes your hyperparameter search problem easier and makes the neural network much more robust to the choice of hyperparameters: a much larger range of hyperparameters works well. It also makes it much easier to train very deep networks.
[Figure: normalizing the input features]

When you train a model, normalizing the input features can speed up the learning process: compute the mean $\mu=\frac{1}{m}\sum_i x^{(i)}$, subtract it, $x = x - \mu$, compute the variance $\sigma^2=\frac{1}{m}\sum_i (x^{(i)})^2$, and then scale the inputs, $x = x / \sigma^2$. As we saw in earlier chapters, normalization changes the contours of the learning problem from elongated to more round, which makes optimization easier. So normalizing the input feature values is very effective for logistic regression and for neural networks.
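
A minimal NumPy sketch of this input normalization, following the formulas above (the notes divide by the variance $\sigma^2$; dividing by the standard deviation $\sqrt{\sigma^2}$ is also common):

import numpy as np

def normalize_inputs(X):
    # X has shape (n_x, m): one column per training example.
    mu = np.mean(X, axis=1, keepdims=True)            # per-feature mean
    X = X - mu                                        # zero-center
    sigma2 = np.mean(X ** 2, axis=1, keepdims=True)   # per-feature variance of the centered data
    X = X / sigma2                                    # scale, as in the formula above
    return X, mu, sigma2                              # keep mu, sigma2 to reuse on dev/test data
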
[Figure: a deep network whose hidden-layer values are not normalized]

However, when training a neural network this way we only normalize the data at the input layer, not at the intermediate layers. Even though the input data are normalized, after passing through $\sigma(W^TX+b)$, i.e., a matrix multiplication followed by a nonlinearity, the distribution of the data is likely to change, and as the data pass through more and more layers of a deep network, their distribution drifts further and further.
So if you want to train parameters such as $w^{[3]}, b^{[3]}$, it helps to normalize the mean and variance of $a^{[2]}$ so that the training of $w^{[3]}, b^{[3]}$ is more effective. Strictly speaking, though, what we really normalize is not $a^{[2]}$ but $z^{[2]}$. There is some debate in the deep learning literature about whether the values should be normalized before the activation function, i.e., $z^{[2]}$, or after applying the activation function, i.e., $a^{[2]}$. In practice, normalizing $z^{[2]}$ is done far more often, and I recommend it as the default choice.

4.2 How to apply Batch normalization

In a neural network, suppose the hidden-unit values $z^{[l](i)}$ of layer $l$ are given, where $i \in [1, m]$ indexes the examples. The formulas below all refer to layer $l$, but to simplify the notation the superscript $[l]$ is omitted.
First, take each $z^{(i)}$ value and normalize it: compute the mean of the $z^{(i)}$, subtract it from each $z^{(i)}$, and divide by the standard deviation. For numerical stability, $\epsilon$ is usually added in the denominator in case $\sigma = 0$: $\mu = \frac{1}{m}\sum_{i=1}^{m} z^{(i)}$, $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (z^{(i)}-\mu)^2$, $z_{norm}^{(i)} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$. After this normalization, each component of $z$ has mean 0 and variance 1. But we do not always want the hidden units pinned to mean 0 and variance 1, since that can limit what the layer can express, so the next step is a scale and shift: $\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta$. Intuitively, multiplying $z_{norm}^{(i)}$ by $\gamma$ adjusts the scale and adding $\beta$ shifts the values, giving $\tilde{z}^{(i)}$; here $\gamma$ is a scale factor and $\beta$ is a shift factor. Note that if $\gamma = \sqrt{\sigma^2+\epsilon}$ (the denominator of $z_{norm}^{(i)}$) and $\beta = \mu$ (the mean subtracted above), then $\tilde{z}^{(i)} = z^{(i)}$ exactly. So the real effect of $\gamma z_{norm}^{(i)} + \beta$ is that, by setting $\gamma$ and $\beta$ appropriately, you can construct hidden-unit values $\tilde{z}^{(i)}$ with whatever mean and variance you want.
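
A minimal NumPy sketch of these Batch Norm formulas for one layer, vectorized over a mini-batch of m examples:

import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    # Z has shape (n_units, m); gamma and beta have shape (n_units, 1) and are learned.
    mu = np.mean(Z, axis=1, keepdims=True)        # per-unit mean over the mini-batch
    sigma2 = np.var(Z, axis=1, keepdims=True)     # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)     # mean 0, variance 1
    Z_tilde = gamma * Z_norm + beta               # scale and shift
    return Z_tilde, Z_norm, mu, sigma2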

5. Fitting Batch Normalization into a Neural Network (Fitting Batch Norm into a Neural Network)

[Figure: Batch Norm applied between z and a at each layer of the network]

In practice, Batch normalization is usually applied together with mini-batches of the training set. Given a neural network, the way you apply Batch normalization is to take the first mini-batch $X^{\{1\}}$ as input, use the parameters $w^{[1]}$ and $b^{[1]}$ to compute $z^{[1]}$, have Batch normalization subtract the mean and divide by the standard deviation and then rescale with $\beta^{[1]}$ and $\gamma^{[1]}$ to get $\tilde{z}^{[1]}$, and then apply the activation function to get $a^{[1]} = g^{[1]}(\tilde{z}^{[1]})$. Next, the parameters $w^{[2]}$ and $b^{[2]}$ are used to compute $z^{[2]}$, Batch normalization (introducing the two parameters $\beta^{[2]}$ and $\gamma^{[2]}$) gives $\tilde{z}^{[2]}$, and the activation function gives $a^{[2]} = g^{[2]}(\tilde{z}^{[2]})$; and so on. The second mini-batch $X^{\{2\}}$, the third mini-batch $X^{\{3\}}$, and so on are trained in the same way.
A few points about Batch normalization to emphasize:

  • Batch normalization happens between the computation of $z$ and the computation of $a$.
  • The parameters $\beta^{[1]}, \beta^{[2]}, \dots$ here have nothing to do with the Momentum hyperparameter $\beta$.
  • $z$ is computed as $z^{(i)} = w^T a^{(i)} + b$, and the mean used in Batch normalization is $\mu = \frac{1}{m}\sum_{i=1}^{m} z^{(i)} = E(w^T a^{(i)}) + b$. In $z_{norm}^{(i)} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$, the numerator is $z^{(i)}-\mu = (w^T a^{(i)} + b) - (E(w^T a^{(i)}) + b) = w^T a^{(i)} - E(w^T a^{(i)})$, which means that whatever value the bias term $b$ takes, it gets cancelled out. Therefore, when using Batch normalization, the bias term $b$ can be dropped or set to 0 (a quick numerical check follows this list).
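
The cancellation of $b$ is easy to check numerically; below is a minimal sketch with made-up shapes and random values:

import numpy as np

def standardize(Z, eps=1e-8):
    mu = np.mean(Z, axis=1, keepdims=True)
    sigma2 = np.var(Z, axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(sigma2 + eps)

np.random.seed(0)
W = np.random.randn(3, 5)
A_prev = np.random.randn(5, 8)          # a mini-batch of 8 examples
b = np.random.randn(3, 1)               # bias term

# Subtracting the per-unit mean removes b, so both versions normalize identically.
print(np.allclose(standardize(W @ A_prev + b), standardize(W @ A_prev)))   # True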

To summarize how to apply gradient descent with Batch normalization: assuming you are using mini-batch gradient descent, run a for loop over $t$ from 1 to the number of mini-batches and perform the following operations inside the loop (a minimal code sketch appears after the list):

  1. Apply forward propagation to the mini-batch $X^{\{t\}}$, computing $z^{[l]}$ in every hidden layer.
  2. Use Batch normalization to compute $\tilde{z}^{[l]}$, which replaces $z^{[l]}$.
  3. Use backpropagation to compute the gradients of all the parameters of each layer $l$: $dw^{[l]}$, $db^{[l]}$, $d\beta^{[l]}$, $d\gamma^{[l]}$.
  4. Update the parameters: $w^{[l]} = w^{[l]} - \alpha\,dw^{[l]}$, $\beta^{[l]} = \beta^{[l]} - \alpha\,d\beta^{[l]}$, $\gamma^{[l]} = \gamma^{[l]} - \alpha\,d\gamma^{[l]}$.
    (Optimization algorithms such as Momentum, RMSprop, and Adam work here as well.)

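Below is a minimal sketch of this training loop. The helpers forward_with_batchnorm and backward_with_batchnorm are hypothetical placeholders for the forward pass (steps 1-2) and the backward pass (step 3); params holds W[l], gamma[l], beta[l] for layers l = 1..L:

def train_one_epoch(mini_batches, params, alpha,
                    forward_with_batchnorm, backward_with_batchnorm):
    L = len(params['W'])                                     # number of layers
    for X_t, Y_t in mini_batches:                            # t = 1, ..., number of mini-batches
        cache = forward_with_batchnorm(X_t, params)          # steps 1-2: z[l] -> z_tilde[l] -> a[l]
        grads = backward_with_batchnorm(Y_t, params, cache)  # step 3: dW[l], dbeta[l], dgamma[l]
        for l in range(1, L + 1):                            # step 4: gradient descent updates
            params['W'][l] -= alpha * grads['dW'][l]
            params['beta'][l] -= alpha * grads['dbeta'][l]
            params['gamma'][l] -= alpha * grads['dgamma'][l]
    return params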

6. Why Does Batch Norm Work? (Why does Batch Norm work?)

6.1 Covariate shift

[Figure: cat classifier trained on images of black cats]

Suppose you have a neural network built for cat detection, and you have trained it on a dataset of images of black cats only, but now you want to apply the network to images of colored cats. In this new setting the positive examples are not only black cats but cats of other colors as well.
[Figure: positive/negative example distributions, black cats (left) vs. mixed-color cats (right)]

Suppose the left plot shows the distribution of positive and negative examples for the black-cat training set, and the right plot shows the distribution for a mixed set of black and colored cats. In practice, a network that trains well on data distributed like the left plot will not necessarily do well when run on data distributed like the right plot, even if there exists a single function that works well on both; you cannot expect the learning algorithm to find it from the left-hand data alone.
Covariate shift refers to this situation: the distribution of the input data differs between the training set and the test set, which hurts the network's generalization and training speed.

6.2 How does covariate shift affect a neural network?

[Figure: a deep network, focusing on the third hidden layer]

Consider a deep neural network like the one above and focus on the third hidden layer; suppose the network has already learned the parameters $w^{[3]}$ and $b^{[3]}$.
[Figure: the third hidden layer with the earlier layers covered]

Now cover the left part of the network. From the perspective of the third hidden layer, it receives some values from the previous layer, namely $a_1^{[2]}, a_2^{[2]}, a_3^{[2]}, a_4^{[2]}$, which it might as well treat as input features $x_1, x_2, x_3, x_4$. The job of the third hidden layer is to find a way to map these values to $\hat{y}$.
[Figure: the full network with the earlier layers uncovered]

Now uncover the left side of the network. The network also has parameters $w^{[1]}, b^{[1]}$ and $w^{[2]}, b^{[2]}$, and every time these parameters are updated, the values $a_1^{[2]}, a_2^{[2]}, a_3^{[2]}, a_4^{[2]}$ computed by the first and second hidden layers change as well. So from the point of view of the third hidden layer, the distribution of its inputs keeps changing, and it suffers from exactly the covariate shift problem described above.

6.3 How does Batch normalization solve the Covariate shift problem?

[Figure: shifting distribution of the hidden-unit values $z^{[2]}_1$ and $z^{[2]}_2$]

Draw the distribution of the second hidden layer's values (for ease of understanding, only the two hidden-unit values $z^{[2]}_1$ and $z^{[2]}_2$ are considered here). Because the values of $z^{[2]}_1$ and $z^{[2]}_2$ change as the earlier parameters change, their distribution changes as well.
What Batch normalization does is limit the amount by which the distribution of these hidden-unit values can shift. However the updates to the earlier layers change $z^{[2]}_1$ and $z^{[2]}_2$, Batch normalization ensures that their mean and variance stay the same (the mean and variance can be 0 and 1 respectively, or whatever values the parameters $\beta$ and $\gamma$ determine).
Batch normalization reduces the problem of the input values shifting around: it makes those values more stable, so the later layers of the network have firmer ground to stand on. Even if the input distribution changes a little, it changes less. Batch normalization lets each layer keep learning while reducing how much the later layers are forced to adapt when the input distribution shifts; in other words, it weakens the coupling between what the earlier-layer parameters do and what the later-layer parameters do, so each layer of the network can learn somewhat independently of the others, and this helps speed up learning in the whole network.

6.4 Other effects of Batch normalization

Batch normalization also has a slight regularization effect. It is usually used together with mini-batch gradient descent; since each mini-batch $X^{\{t\}}$ is effectively a different sample of the data and the mini-batch size is fairly small, the mean and variance computed on a mini-batch carry some noise. The scaling from $z^{[l]}$ to $\tilde{z}^{[l]}$ is therefore also a little noisy, because it uses that noisy mean and variance.
So, similarly to dropout, Batch normalization adds some noise to each hidden layer's activations, which forces the downstream units not to rely too heavily on any single hidden unit. For the hidden units, adding noise with very small variance to the inputs is comparable to imposing a norm penalty on the weights, so it has a regularizing effect. But because the added noise is tiny, the regularization effect is not large.
If you want the stronger regularization effect of dropout, you can use Batch normalization together with dropout. Also, using a larger mini-batch size reduces the noise and with it the regularization effect. In any case, it is not recommended to treat Batch normalization as a form of regularization.

7. Batch Norm at Test Time

Batch normalization processes your data one mini-batch at a time, but at test time you may need to process examples one by one.
[Figure: Batch Norm computed on a mini-batch during training vs. at test time]

First, recall how Batch normalization is carried out on a mini-batch during training. Within one mini-batch, you sum the $z^{(i)}$ values of all the examples in that mini-batch to compute the mean; here $m$ denotes the number of examples in the mini-batch, not the whole training set. You then compute the variance and compute $z^{(i)}_{norm}$ by rescaling with the mean and standard deviation, with $\epsilon$ added to the denominator for numerical stability. Finally, $\gamma$ and $\beta$ are used to rescale $z_{norm}$, giving $\tilde{z}$.
Note that the mean $\mu$ and variance $\sigma^2$ used in these calculations are computed over the whole mini-batch during training, but at test time you may not be able to process a whole mini-batch of examples at once, so you need some other way to obtain $\mu$ and $\sigma^2$; and if you only have a single example, the mean and variance of one example make no sense. In a typical application of Batch normalization, you estimate them with an exponentially weighted average that runs across the mini-batches.
Suppose at layer $l$ you train the first mini-batch $X^{\{1\}}$ and obtain its mean $\mu^{\{1\}[l]}$. Then you train the second mini-batch $X^{\{2\}}$ at this layer and obtain the mean $\mu^{\{2\}[l]}$, and then the third mini-batch $X^{\{3\}}$ gives the mean $\mu^{\{3\}[l]}$. Just as we previously used an exponentially weighted average to compute the mean of $\theta_1, \theta_2, \theta_3$, here you use an exponentially weighted average to estimate the mean of the hidden units' $z$ values at this layer. Similarly, you use an exponentially weighted average to track the $\sigma^2$ observed on each mini-batch. So while training the network on different mini-batches, you maintain a running estimate of the $\mu$ and $\sigma^2$ of each layer.
Finally, at test time, in the equation $z_{norm}^{(i)}=\frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$ you simply compute $z_{norm}^{(i)}$ from the test example's own $z$ value, using the exponentially weighted averages of $\mu$ and $\sigma^2$ for the adjustment, and then use this $z_{norm}$ together with the $\beta$ and $\gamma$ learned during training to compute $\tilde{z}$ for that test example.
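
A minimal sketch of the two pieces involved; the function names and the running-average momentum of 0.9 are illustrative choices, not fixed by the course:

import numpy as np

def update_running_stats(running_mu, running_var, mu_batch, var_batch, momentum=0.9):
    # Exponentially weighted averages of the mini-batch statistics, updated during training.
    running_mu = momentum * running_mu + (1 - momentum) * mu_batch
    running_var = momentum * running_var + (1 - momentum) * var_batch
    return running_mu, running_var

def batch_norm_inference(z, gamma, beta, running_mu, running_var, eps=1e-8):
    # Normalize a single test example using the running estimates of mu and sigma^2.
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta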

8. Softmax Regression

All the classification examples we have encountered so far use binary classification, with only two possible labels, 0 or 1. There is a generalization of logistic regression called Softmax regression that is used for multi-class classification problems.
[Figure: example images of the four classes: other, cat, dog, baby chick]

Suppose you need to recognize cats, dogs, and baby chicks. Here cats are class 1, dogs are class 2, and baby chicks are class 3; if an image belongs to none of these, it is class 0 (as in the figure above). The symbol $C$ denotes the total number of classes.
[Figure: a network with C = 4 output units]

First build a neural network whose output layer has 4 units, or more generally $C$ units, so that $n^{[L]}$, the number of units in the output layer, equals 4, or in general equals $C$.
[Figure: the output layer giving a probability for each of the four classes]

Each unit of the output layer gives the probability of one class (as in the figure above). That is, for an input $X$, the first node outputs $P(\text{other} \mid X)$, the second node outputs the probability that the input is a cat (class 1), $P(\text{cat} \mid X)$, the third node the probability that it is a dog (class 2), $P(\text{dog} \mid X)$, and the fourth node the probability that it is a baby chick (class 3), $P(\text{baby chick} \mid X)$. Therefore $\hat{y}$ is a 4×1 vector, and the four output probabilities should sum to 1.
To get your network to do this, you need a Softmax layer as the output layer to generate these outputs. In the last layer of the neural network you compute the linear part as usual and then apply the Softmax activation, as follows:
(1) Compute $z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]}$.
(2) Apply the Softmax activation function:

  1. Compute a temporary variable $t = e^{z^{[L]}}$, i.e., exponentiate element-wise. Since $z^{[L]}$ has dimension 4×1, $t = e^{z^{[L]}}$ is also a 4×1 vector.
  2. Normalize $t$ so that its entries sum to 1, and output the result as $a^{[L]}$. That is, $a^{[L]} = \frac{e^{z^{[L]}}}{\sum_{j=1}^{4} t_j}$; in other words, $a^{[L]}$ is also a 4×1 vector, and its $i$-th element is $a^{[L]}_i = \frac{t_i}{\sum_{j=1}^{4} t_j}$.

As a concrete example, suppose the output layer computes $z^{[L]} = [5, 2, -1, 3]^T$. Exponentiating element-wise gives $t = e^{z^{[L]}} = [e^5, e^2, e^{-1}, e^3]^T \approx [148.4, 7.4, 0.4, 20.1]^T$. To go from the vector $t$ to the vector $a^{[L]}$ you just normalize these elements so they sum to 1: adding up all the elements of $t$ gives about 176.3, so $a^{[L]} = \hat{y} = \frac{t}{176.3} \approx [0.842, 0.042, 0.002, 0.114]^T$. Therefore, at the output layer, the first node outputs 0.842, i.e., the probability that the input $X$ is class 0 is 84.2%. Likewise, the second node outputs 0.042, i.e., the probability that $X$ is class 1 is 4.2%; the third node outputs 0.002, i.e., the probability that $X$ is class 2 is 0.2%; and the fourth node outputs 0.114, i.e., the probability that $X$ is class 3 is 11.4%.
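
A minimal sketch of the Softmax activation that reproduces this example:

import numpy as np

def softmax(z):
    # Exponentiate element-wise, then normalize so the outputs sum to 1.
    # (In practice, subtracting np.max(z) first improves numerical stability.)
    t = np.exp(z)
    return t / np.sum(t)

z_L = np.array([5.0, 2.0, -1.0, 3.0])
print(softmax(z_L))   # approximately [0.842, 0.042, 0.002, 0.114]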

9. Training a Softmax Classifier

To train a neural network with a Softmax output layer, we first define the loss function used during training. Following the example from the previous section, suppose the input is a picture of a cat (class 1), so the true label is $y = [0, 1, 0, 0]^T$. Suppose your network outputs $\hat{y} = a^{[L]} = [0.3, 0.2, 0.1, 0.4]^T$, where $\hat{y}$ is a vector whose elements sum to 1. The network is not doing well on this example: the picture really is a cat, but the network assigns the cat class only a 20% probability.
In Softmax classification, the loss function we generally use is $L(\hat{y},y) = -\sum_{j=1}^{4} y_j \log \hat{y}_j$. In this example $y_1 = y_3 = y_4 = 0$ and $y_2 = 1$, so the loss reduces to $L(\hat{y},y) = -\sum_{j=1}^{4} y_j \log \hat{y}_j = -y_2 \log \hat{y}_2 = -\log \hat{y}_2$. This means that if your learning algorithm tries to make the loss small (gradient descent is used to reduce the training-set loss), then making $L(\hat{y},y)$ smaller means making $-\log \hat{y}_2$ smaller, which in turn means making $\hat{y}_2$ as large as possible.
In a nutshell, what the loss function does is look at whichever class is the true class in your training set and try to make the probability assigned to that class as high as possible.
This is the loss for a single training example. The cost $J$ for the whole training set is the average of the losses over all training examples, $J(w^{[1]},b^{[1]},\cdots)=\frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)})$, and what you then do is use gradient descent to minimize it. In a neural network with a Softmax output layer, one step of gradient descent proceeds as follows (a small sketch of the loss and of the first backpropagation step appears after the list):

  • Forward propagation process:
    • ① The output layer computes $z^{[L]}$, whose dimension is C×1.
    • ② Apply the Softmax activation function to obtain $a^{[L]}$, i.e., $\hat{y}$.
    • ③ Calculate the loss.
  • Backpropagation process:
    • ① $dz^{[L]} = \hat{y} - y$, i.e., the partial derivative of the cost with respect to $z^{[L]}$: $dz^{[L]} = \frac{\partial J}{\partial z^{[L]}}$.
    • ② Calculate all derivatives required in the entire neural network.
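
A minimal sketch of the loss for the example above and of the first backpropagation step (the variable names are illustrative):

import numpy as np

def cross_entropy_loss(y_hat, y):
    # L(y_hat, y) = -sum_j y_j * log(y_hat_j) for a single example with one-hot label y.
    return -np.sum(y * np.log(y_hat))

y = np.array([0, 1, 0, 0])               # true label: cat (class 1)
y_hat = np.array([0.3, 0.2, 0.1, 0.4])   # network output from the example above
print(cross_entropy_loss(y_hat, y))      # -log(0.2), approximately 1.609

dz_L = y_hat - y                         # dz[L] = y_hat - y, the start of backpropagation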
