Deep Learning (II) --- The Deep Neural Network Training Trilogy

Learning deep learning is like practicing martial arts: with a good manual, you follow the steps, train the fundamentals first, and then extend them by analogy until you eventually master the art. The same applies here. Build a solid foundation first and thoroughly understand the principles and procedures; afterwards, the many variants of the basic algorithms can all be understood by analogy. Today we finish with the training trilogy of deep neural networks, that is, how a neural network is actually trained.

The training trilogy consists of three steps: forward propagation, backpropagation, and gradient descent.

Let us now walk through each step and explain what it does.

1. Forward propagation

Let's start with a closer look at forward propagation. Forward propagation is the process of computing a predicted value from the input, passing it forward through the network.

Recall from the previous chapter that each neuron performs two operations: a linear transformation followed by a nonlinear transformation. Looking at these two steps, we see the parameters w and b. To produce a prediction at the very beginning, we must plug some values of w and b into the computation, so the first thing to do is initialize the parameters. This is why, no matter which neural network architecture we use in practice, the parameters must be initialized. Since the initial values are set by hand, what values are appropriate? There is no one-size-fits-all answer, only a theoretical guideline: w is usually initialized to small values near zero (but not too large), and b is usually initialized to zero. With an initialized w and b we obtain a value from the linear transformation; this value then goes through the nonlinear transformation, which uses the nonlinear function g mentioned above, namely the activation function. A small initialization sketch is given below.
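
As a minimal illustrative sketch (not part of the original post), here is how such an initialization might look in NumPy; the layer sizes n_in and n_out are placeholder names:

```python
import numpy as np

def init_params(n_in, n_out, seed=0):
    """Initialize one layer: small random w near zero, b set to zero."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_out, n_in)) * 0.01  # small values around zero, not all identical
    b = np.zeros((n_out, 1))                       # biases start at zero
    return w, b

w, b = init_params(n_in=3, n_out=4)  # e.g. a layer with 3 inputs and 4 neurons
```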

Here are the common activation functions and their typical use cases.

1. Sigmoid function

f(z)=\frac{1}{1+e^{-z}}       f'(z)=f(z)(1-f(z))

The sigmoid function maps a real number into the interval [0,1] and is generally used for binary (yes/no) classification; its derivative can be expressed in terms of the function itself.

It is relatively expensive to compute, however, and it is prone to vanishing gradients where the curve flattens out: toward both ends the derivative tends to 0, so the function easily enters its saturation region.

In practice, this activation function is rarely used in multilayer neural networks. A minimal sketch of it is shown below.
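
A minimal NumPy sketch of the sigmoid and its derivative (an illustrative addition, not from the original post):

```python
import numpy as np

def sigmoid(z):
    """Map any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative expressed through the function itself: f'(z) = f(z)(1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))       # values near 0, 0.5, 1
print(sigmoid_grad(np.array([-10.0, 0.0, 10.0])))  # gradient almost 0 at both ends (saturation)
```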

 

2. ReLU function

f(z)=max(z,0)   

For inputs greater than 0 it returns the input itself; anything less than zero is set to 0. It is cheap to compute, helps gradients converge well, and the vanishing-gradient problem does not arise in the positive region.

It is often used as the activation of deep hidden layers, making the expressive power of the whole network stronger.

Its drawback is that it is rather crude and discards some feature information, which is why many ReLU variants have evolved from it, such as Leaky ReLU.

When using ReLU in practice, be careful when setting the learning rate; otherwise many neurons in the network can easily "die" outright. A small sketch of ReLU and Leaky ReLU follows.
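
A minimal sketch of ReLU and the Leaky ReLU variant mentioned above (an illustrative addition; the slope alpha=0.01 is a common but arbitrary choice):

```python
import numpy as np

def relu(z):
    """f(z) = max(z, 0): keep positive inputs, set the rest to 0."""
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: a small slope alpha for negative inputs helps avoid 'dead' neurons."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # [0.  0.  0.  1.5]
print(leaky_relu(z))  # [-0.02  -0.005  0.  1.5]
```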

3. Tanh function

f(z)=tanh(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}        tanh(x)=2\,sigmoid(2x)-1

f'(z)=1-{f(z)}^{2}

Its range is [-1,1] and its output is zero-centered, which makes it especially suitable for RNNs, where differences between features get amplified across the recurrent steps and the practical effect is noticeably better. While the sigmoid is commonly used for binary classification, tanh converges faster than the sigmoid function.

Its drawback is that vanishing gradients still occur: looking at the tanh curve, the derivative likewise tends to 0 at both ends, and learning is very slow in the regions where the curve is almost horizontal. A small sketch follows.
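
A minimal sketch of tanh and its derivative (an illustrative addition, not from the original post):

```python
import numpy as np

def tanh(z):
    """Zero-centered activation with range (-1, 1)."""
    return np.tanh(z)

def tanh_grad(z):
    """f'(z) = 1 - f(z)^2, which tends to 0 at both ends (vanishing gradient)."""
    t = np.tanh(z)
    return 1.0 - t ** 2

z = np.array([-5.0, 0.0, 5.0])
print(tanh(z))       # close to [-1, 0, 1]
print(tanh_grad(z))  # gradients near the two ends are almost 0
```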

4. Softmax function

y_c=\sigma(z)_j=\frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}

It is mainly used for multi-class classification (binary classification being a special case of multi-class): when there are multiple inputs, each is converted into a probability, and the input with the largest value wins with the highest probability. It is mainly used at the output layer of multi-class neural networks. The exponential is used chiefly to make large values even larger, and it also makes the function easy to differentiate. A sketch is shown below.
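
A minimal NumPy sketch of the formula above (an illustrative addition); subtracting the maximum before exponentiating is a standard trick for numerical stability and does not change the result:

```python
import numpy as np

def softmax(z):
    """Turn a vector of K scores into a probability distribution over K classes."""
    shifted = z - np.max(z)   # subtract the max for numerical stability
    exp = np.exp(shifted)     # the exponential makes large scores even larger
    return exp / np.sum(exp)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # roughly [0.659 0.242 0.099], sums to 1; the largest score wins
```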

5. Linear function, a special activation function that applies only to linear regression

y = x

It is effectively equivalent to applying no nonlinear transformation at all, so it cannot fit nonlinear functions and cannot build a nonlinear model.

Having introduced all these activation functions, what exactly is their role? The activation function gives the neural network stronger expressive power: without it, no matter how many layers of neurons we stack, the network would just keep performing linear transformations; adding a nonlinear transformation after each linear one makes the output more complex and the network far more expressive.

Those are the common activation functions. After the linear transformation and the nonlinear transformation, the final output is a predicted value; producing it completes one forward propagation. Throughout forward propagation the parameters w and b are treated as known.
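
Putting the pieces together, here is a minimal sketch of one forward pass (an illustrative addition with made-up layer sizes): a ReLU hidden layer followed by a sigmoid output for binary classification:

```python
import numpy as np

def forward(X, w1, b1, w2, b2):
    """One forward pass: linear -> ReLU for the hidden layer, linear -> sigmoid for the output."""
    z1 = w1 @ X + b1                    # linear transformation, layer 1
    a1 = np.maximum(z1, 0.0)            # nonlinear transformation: ReLU
    z2 = w2 @ a1 + b2                   # linear transformation, layer 2
    y_hat = 1.0 / (1.0 + np.exp(-z2))   # nonlinear transformation: sigmoid
    return y_hat, (z1, a1, z2)          # cache intermediates for backpropagation later

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))                                # 3 features, 5 samples
w1, b1 = rng.standard_normal((4, 3)) * 0.01, np.zeros((4, 1))  # initialized as described above
w2, b2 = rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1))
y_hat, _ = forward(X, w1, b1, w2, b2)                          # predictions close to 0.5 at start
```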

2. Backpropagation

Next comes backpropagation. Backpropagation takes the predicted values obtained from forward propagation, measures the gap between them and the true values with a loss function, and then works backwards to compute the partial derivatives of that loss with respect to each neuron's parameters.

At this point let us introduce the commonly used loss functions (cost functions):

For linear regression problems, the mean squared error cost function is used:

J(w) = \frac{1}{m}\sum_{i=1}^{m}(y^i - \hat{y}^i)^2

There is an interesting way to see where this mean squared error cost function comes from. We know the cost function measures the gap between the predicted value and the true value, y^i \rightarrow \hat{y}^i. To remove the effect of the sign we take the absolute value \left| y^i-\hat{y}^i \right|. That measures a single sample; with multiple samples we need to sum: \sum_{i=1}^{m}\left| y^i-\hat{y}^i \right|. But if m is large enough, the resulting cost will also be large without actually telling us the error is large, so to remove the influence of the data size we take the average: \frac{1}{m}\sum_{i=1}^{m}\left| y^i-\hat{y}^i \right|. Because the absolute value is not differentiable everywhere, we square the difference instead to guarantee differentiability, arriving at \frac{1}{m}\sum_{i=1}^{m}(y^i - \hat{y}^i)^2.

Seen this way, the formula no longer looks so daunting; sometimes it pays to learn to simplify the complex. A small code sketch follows.
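
A minimal NumPy sketch of this cost function (an illustrative addition, not from the original post):

```python
import numpy as np

def mse_cost(y, y_hat):
    """Mean squared error: the average squared gap between true and predicted values."""
    m = y.shape[0]
    return np.sum((y - y_hat) ** 2) / m

y     = np.array([3.0, -0.5, 2.0])
y_hat = np.array([2.5,  0.0, 2.0])
print(mse_cost(y, y_hat))  # (0.25 + 0.25 + 0.0) / 3, roughly 0.167
```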

For classification problems, the cross-entropy loss function is used:

lnL(w)=-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\ln\hat{y}^{(i)}+(1-y^{(i)})\ln(1-\hat{y}^{(i)})\right)    This is the cross-entropy loss used for binary classification.

 

lnL(w)=-\frac{1}{m}\sum_{i=1}^{m}y^{(i)}\ln\hat{y}^{(i)}    This is the cross-entropy loss used for multi-class classification.
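
A minimal NumPy sketch of the binary cross-entropy formula above (an illustrative addition); the small constant eps is a common guard against taking the log of zero:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """-1/m * sum(y*ln(y_hat) + (1-y)*ln(1-y_hat)) for labels y in {0, 1}."""
    m = y.shape[0]
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # keep predictions away from exactly 0 or 1
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)) / m

y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y, y_hat))  # roughly 0.228
```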

Once the loss function is established, we can take partial derivatives with respect to the parameters.

Here we regard the parameters w and b as unknown, while X is known. In other words, backpropagation establishes the loss function and at the same time computes the partial derivatives, i.e. the gradients.
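
As an illustrative sketch (an addition, not the original author's code), here are the gradients for the simplest case: a single sigmoid output trained with binary cross-entropy, where dL/dz conveniently simplifies to (y_hat - y):

```python
import numpy as np

def backward(X, y, y_hat):
    """Gradients of the binary cross-entropy loss w.r.t. w and b for one sigmoid layer.

    Shapes assumed: X is (n_features, m); y and y_hat are (1, m).
    """
    m = X.shape[1]
    dz = y_hat - y                               # dL/dz for sigmoid + cross-entropy
    dw = (dz @ X.T) / m                          # dL/dw, shape (1, n_features)
    db = np.sum(dz, axis=1, keepdims=True) / m   # dL/db, shape (1, 1)
    return dw, db
```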

3. Gradient descent

Once the partial derivatives have been obtained in backpropagation, we move on to gradient descent. Gradient descent updates w and b by descending in the direction opposite to the largest directional derivative at the current position; the step size is determined by the learning rate, while the gradients computed by backpropagation decide in which direction to step.

w = w-\alpha\,dw    where \alpha is the learning rate; the initial w and b come from the initialization, and dw comes from backpropagation.

b = b-\alpha\,db
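
A minimal sketch of this update step (an illustrative addition; lr stands for the learning rate \alpha):

```python
def gradient_descent_step(w, b, dw, db, lr=0.01):
    """Move the parameters one small step against the gradient."""
    w = w - lr * dw
    b = b - lr * db
    return w, b
```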

That is the gradient descent process: it updates the parameters. Completing these three steps together finishes one round of training, and the cycle then repeats: forward propagation gives the loss, backpropagation gives the gradients, and gradient descent updates the parameters, iterating again and again. The ultimate goal of running a deep learning algorithm is to find the minimum of the loss function J(w,b), and the way to find that minimum is gradient descent, whose central role is to keep updating the parameters; in other words, gradient descent keeps trying, stepping its way toward the minimum. That completes the three-step training process of deep learning.
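
Tying the trilogy together, here is a minimal end-to-end sketch (an illustrative addition with made-up data and sizes) that trains a single sigmoid neuron with binary cross-entropy by repeating forward propagation, backpropagation, and gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))                 # 2 features, 200 samples
y = (X[0:1, :] + X[1:2, :] > 0).astype(float)     # a simple linearly separable label

w = rng.standard_normal((1, 2)) * 0.01            # initialization: small w, zero b
b = np.zeros((1, 1))
lr, m = 0.5, X.shape[1]

for epoch in range(1000):
    # 1. forward propagation: linear transformation + sigmoid
    y_hat = 1.0 / (1.0 + np.exp(-(w @ X + b)))
    # 2. backpropagation: gradients of the cross-entropy loss
    dz = y_hat - y
    dw = (dz @ X.T) / m
    db = np.sum(dz, axis=1, keepdims=True) / m
    # 3. gradient descent: update the parameters
    w -= lr * dw
    b -= lr * db

loss = -np.mean(y * np.log(y_hat + 1e-12) + (1 - y) * np.log(1 - y_hat + 1e-12))
print(f"final loss after training: {loss:.4f}")   # should be small if training worked
```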


Origin blog.csdn.net/qq_27575895/article/details/90544564