"Hands-on deep learning" notes (3) multi-layer perceptron

4. Multilayer perceptron

4.1 Introduction and Implementation

4.1.1 Introduction

  1. [ A linear model may be wrong ] Linearity in the affine transformation is a strong assumption, and in fact linear models may simply be wrong. Linearity implies monotonicity: any increase in a feature must always lead to an increase or a decrease in the model's output (determined by the sign of the weight), so operations such as inverting an image's pixels cannot be handled.
  2. [ Considering interactions between features ] In reality, the importance of any pixel depends in a complex way on its context, and the data may admit a representation that accounts for interactions between features, on top of which a linear model would be appropriate; such a representation cannot be constructed by hand. In deep neural networks, we use observed data to jointly learn a hidden-layer representation and a linear predictor applied to that representation.
  3. [ Introducing hidden layers ] Add one or more hidden layers to the network to handle more general kinds of functional relationships. The easiest way to do this is to stack multiple fully connected layers, an architecture often called a multilayer perceptron, or MLP for short.

A single hidden layer multilayer perceptron with 5 hidden units

  4. [ Parameter overhead ] However, the parameter overhead of such a fully connected multilayer perceptron can be extremely high, so a trade-off must be made between parameter saving and model effectiveness (Zhang et al., 2021).

  5. [ A multilayer perceptron may degenerate into a linear model ] Use $\mathbf{X}\in\mathbb{R}^{n\times d}$ to denote a mini-batch of $n$ samples, each with $d$ input features. Use $\mathbf{H}\in\mathbb{R}^{n\times h}$ to denote the hidden-layer output of a single-hidden-layer multilayer perceptron with $h$ hidden units, also known as the hidden representation (in mathematics/code it is also called the hidden-layer variable or hidden variable). The output of the multilayer perceptron is computed as follows:
$$\begin{aligned} \mathbf{H}&=\mathbf{X}\mathbf{W}^{(1)}+\mathbf{b}^{(1)},\\ \mathbf{O}&=\mathbf{H}\mathbf{W}^{(2)}+\mathbf{b}^{(2)}.\end{aligned}$$
  6. [ Introducing the activation function ] However, the definition above readily collapses: $\mathbf{O}=\mathbf{X}\mathbf{W}^{(1)}\mathbf{W}^{(2)}+\mathbf{b}^{(1)}\mathbf{W}^{(2)}+\mathbf{b}^{(2)}=\mathbf{X}\mathbf{W}+\mathbf{b}$, which is back to the affine transformation of the previous chapter, without any benefit from the hidden layer. Therefore one additional key element is required: applying a non-linear activation function $\sigma$ after each affine transformation. The outputs of the activation function $\sigma(\cdot)$ are called activations; with them, the multilayer perceptron no longer degenerates into a linear model.
$$\begin{aligned} \mathbf{H}&=\sigma(\mathbf{X}\mathbf{W}^{(1)}+\mathbf{b}^{(1)}),\\ \mathbf{O}&=\mathbf{H}\mathbf{W}^{(2)}+\mathbf{b}^{(2)}.\end{aligned}$$
  7. [ Generalization ] The activation function operates not only row-wise but element-wise: after the linear part of each layer is computed, each activation is calculated without looking at the values of the other hidden units. Stacking such hidden layers builds a more general multilayer perceptron.
  8. [ Deeper, not wider ] Multilayer perceptrons can capture complex interactions among inputs through their hidden layers, and can model arbitrary functions given enough neurons and the right weights. In practice, however, many functions are more easily approximated with deeper (rather than wider) networks.
  9. [ Activation functions ] The activation function decides whether a neuron should be activated by computing a weighted sum and adding a bias; it is a foundation of deep learning. Three common activation functions:
(1) The ReLU function (rectified linear unit), $\text{ReLU}(x)=\max(x,0)$, a piecewise linear function whose derivative at 0 is conventionally taken to be 0. The reason for using ReLU is that it either lets an argument through or zeroes it out; its derivatives are well behaved, it optimizes well, and it alleviates the vanishing-gradient problem (described in detail later).
(2) The sigmoid function, also called the squashing function, transforms inputs over $(-\infty,\infty)$ into the interval $(0,1)$: $\text{sigmoid}(x)=\frac{1}{1+\exp(-x)}$. The sigmoid is a smooth, differentiable approximation to a threshold unit and is widely used as the activation function of output units; when the input is close to 0, the sigmoid is close to a linear transformation.
(3) The tanh function (hyperbolic tangent) squashes inputs into $(-1,1)$: $\tanh(x)=\frac{1-\exp(-2x)}{1+\exp(-2x)}$.
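  All three functions are built into PyTorch. A minimal sketch (the tensor `x` and the autograd check are illustrative, not from the notes) evaluating them and one derivative:

```python
import torch

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)

relu_y = torch.relu(x)        # max(x, 0)
sigmoid_y = torch.sigmoid(x)  # 1 / (1 + exp(-x))
tanh_y = torch.tanh(x)        # (1 - exp(-2x)) / (1 + exp(-2x))

# The ReLU derivative via autograd: 0 for x < 0, 1 for x > 0
relu_y.backward(torch.ones_like(x))
print(x.grad[:5])  # all zeros, since x starts at -8.0
```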
  10. From the above we have learned how to combine nonlinear functions to build multilayer neural network architectures with stronger expressive power, close to what deep learning practitioners were using around 1990.

4.1.2 Implementation of multi-layer perceptron from scratch
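  (The notes leave this section empty; the following is a minimal from-scratch sketch in the spirit of the book, assuming Fashion-MNIST-shaped inputs with 784 features, 256 hidden units, and 10 classes — all sizes illustrative.)

```python
import torch
from torch import nn

num_inputs, num_hiddens, num_outputs = 784, 256, 10

# Parameters of a single-hidden-layer MLP, initialized with small Gaussians
W1 = nn.Parameter(torch.randn(num_inputs, num_hiddens) * 0.01)
b1 = nn.Parameter(torch.zeros(num_hiddens))
W2 = nn.Parameter(torch.randn(num_hiddens, num_outputs) * 0.01)
b2 = nn.Parameter(torch.zeros(num_outputs))

def relu(X):
    return torch.max(X, torch.zeros_like(X))

def net(X):
    X = X.reshape(-1, num_inputs)  # flatten images into vectors
    H = relu(X @ W1 + b1)          # hidden representation H = sigma(X W1 + b1)
    return H @ W2 + b2             # output O = H W2 + b2

loss = nn.CrossEntropyLoss()       # softmax + cross-entropy in one op
```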

4.1.3 Simple implementation of multi-layer perceptron
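  (Also empty in the notes; a sketch of the concise version, where `nn.Sequential` stacks the same layers — the learning rate is illustrative.)

```python
import torch
from torch import nn

net = nn.Sequential(nn.Flatten(),
                    nn.Linear(784, 256),
                    nn.ReLU(),
                    nn.Linear(256, 10))

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)
trainer = torch.optim.SGD(net.parameters(), lr=0.1)
```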

4.2 Model selection, underfitting and overfitting

  1. [ Discovering patterns rather than memorizing data ] The goal of machine learning scientists is for the model to truly discover generalizable patterns rather than simply memorize the data. For instance, when finding patterns linking patients' genetic data to dementia status, the model should not just do things like "That's Bob! I remember him! He has dementia!" More formally, the hope is to capture regularities of the underlying population of the training set, so that the model works on individuals it has never encountered before.

  2. [ Overfitting and regularization ] The phenomenon in which a model fits the training data more closely than the underlying distribution is called overfitting. Concretely: after adjusting the model architecture or hyperparameters, with enough neurons, layers, and training iterations, the model eventually achieves perfect accuracy on the training set while accuracy on the test set declines. Techniques used to combat overfitting are called regularization.

  3. The training error is the error of the model computed on the training dataset; the generalization error is the expected error of the model when applied to an infinite stream of additional samples drawn from the same distribution as the original. The latter is impossible to compute exactly and can only be estimated with an independent test set.

  4. Many mathematicians have devoted their lives to studying generalization. Glivenko and Cantelli derived the rate at which the training error converges to the generalization error in their eponymous theorem; Vapnik and Chervonenkis extended this theory to more general classes of functions, laying the groundwork for statistical learning theory. In the scenarios explored earlier, the training and test data were drawn independently from the same distribution, known as the i.i.d. assumption, which means there is no "memory" in the sampling process. In the real world, factors such as location and time can cause sample collection to violate the independence assumption. Sometimes the violation is slight and models continue to work well (as in face recognition, speech recognition, and language translation); sometimes it is severe and causes trouble. Many heuristics in deep learning are designed to guard against overfitting.

  5. [ Factors affecting model generalization ] When the model is more complex and the samples fewer, we expect training error to decrease while generalization error increases. How should model complexity be judged? A model with more parameters, a wider range of parameter values, or more training iterations is generally considered more complex. The number of training samples also affects generalization.

  6. [ Model selection ] To pick the best among candidate models, a test set is usually used. But we must not rely on the test data for model selection: ideally the test data is used only once, and in practice we rarely have enough data to use a fresh test set for every round of experiments. The common solution is to split the data into three parts: a training set, a test set, and a validation set.

  7. [ K-fold cross-validation ] When training data is scarce, there may not be enough to carve out a proper validation set. In K-fold cross-validation, the original training data is split into K non-overlapping subsets, and model training and validation are performed K times: each time the model is trained on $K-1$ subsets and validated on the remaining, so-far-unused subset. Finally, the results of the $K$ experiments are averaged to estimate the training and validation errors; a minimal split helper is sketched below.
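```python
import torch

def k_fold_split(X, y, k, i):
    """Return the i-th (train, validation) split for K-fold cross-validation.
    A minimal sketch: X and y are tensors indexed by sample along dim 0;
    any remainder samples beyond k * fold_size are ignored for simplicity."""
    fold_size = X.shape[0] // k
    start, stop = i * fold_size, (i + 1) * fold_size
    X_valid, y_valid = X[start:stop], y[start:stop]
    X_train = torch.cat([X[:start], X[stop:]], dim=0)
    y_train = torch.cat([y[:start], y[stop:]], dim=0)
    return X_train, y_train, X_valid, y_valid

# Train K times and average the K validation errors, e.g. with a
# hypothetical train_and_eval helper:
# errors = [train_and_eval(*k_fold_split(X, y, k=5, i=i)) for i in range(5)]
```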

  8. [ Underfitting or overfitting ] When comparing training and validation errors, watch for two common situations:
(1) Both the training error and the validation error are large, with only a small gap between them: this likely means the model is too simple and lacks expressive power, i.e. underfitting.
(2) The training error is markedly lower than the validation error: this indicates severe overfitting. That is not always a bad thing, especially in deep learning, where the best predictive models often perform better on the training set than on the validation set; what ultimately matters is the validation error itself, not the gap between training and validation errors.

  When the data samples contain distinct values of $x$, a polynomial whose order equals the number of data samples can fit the training set perfectly. The figure below depicts the relationship between polynomial order and under/overfitting.

Effect of model complexity on underfitting/overfitting

  The fewer the samples in the training dataset, the more likely (and more severe) overfitting is, while the generalization error generally decreases as the amount of training data increases.

  9. [ Polynomial regression ] An example that generates training and test labels with a third-order polynomial. The final result shows that when a polynomial of too high an order is used to train the model, there is not enough data to learn that the coefficients of the higher-order terms should be close to zero, so the model is susceptible to noise in the training data: although the training loss is effectively reduced, the test loss remains high, and the overly complex model overfits the data. The next sections discuss ways to deal with overfitting, such as weight decay and dropout.
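  A sketch of how such data can be generated (the third-order coefficients and noise scale below follow the book's setup as I recall it; treat the exact numbers as illustrative):

```python
import numpy as np
from math import gamma

max_degree = 20                        # candidate feature degrees 0..19
n_train, n_test = 100, 100
true_w = np.zeros(max_degree)          # only the first 4 coefficients are nonzero
true_w[:4] = [5.0, 1.2, -3.4, 5.6]     # a third-order polynomial

features = np.random.normal(size=(n_train + n_test, 1))
# Build columns [x^0, x^1, ..., x^19], rescaled by 1/i! to keep magnitudes tame
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
for i in range(max_degree):
    poly_features[:, i] /= gamma(i + 1)  # gamma(i+1) = i!

labels = poly_features @ true_w + np.random.normal(scale=0.1,
                                                   size=n_train + n_test)
# Fitting with only the first 1-2 columns underfits; all 20 columns overfit.
```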

4.3 Weight Decay

  1. [ From adjusting the order to weight decay ] Assuming we already have as much high-quality data as we can get, we can focus on regularization techniques. Looking back at the previous example, limiting the number of features is a common technique for mitigating overfitting. The natural extension of polynomials to multiple variables is the monomial, i.e. a product of powers of variables, whose degree is the sum of the powers: for example, $x_1^2x_2$ and $x_3x_5^2$ are both monomials of degree 3. Even a small change in the order significantly increases or decreases model complexity, leaving the model hovering between oversimplified and overcomplicated, so a finer-grained tool for adjusting model complexity is needed: weight decay.

  2. Weight decay is one of the most widely used regularization techniques, often called $L_2$ regularization. It measures the complexity of a function by its distance from zero. How, then, to measure the distance between a function and zero precisely? One simple way, for a linear function $f(\mathbf{x})=\mathbf{w}^\top\mathbf{x}$, is the norm of its weight vector, $\Vert\mathbf{w}\Vert^2$. Therefore the norm of the weight vector is added to the optimization objective as a penalty term, restraining the weight vector from growing too large. A regularization constant $\lambda$ balances this new penalty against the loss, yielding the following objective:
$$L(\mathbf{w},b)+\frac{\lambda}{2}\Vert\mathbf{w}\Vert^2$$

  3. [ Comparing regularization with different norms ] $L_2$-regularized linear models constitute the classic ridge regression algorithm; one reason for using it is that it imposes a large penalty on large components of the weight vector, biasing the learning algorithm toward models that distribute weight evenly across many features. In contrast, the $L_1$ penalty of lasso regression ($L_1$-regularized linear models) causes the model to concentrate weight on a small set of features and zero out the rest; this is called feature selection and may be desirable in other scenarios.

  4. [ $L_2$-regularized minibatch stochastic gradient descent update ]
$$\mathbf{w}\leftarrow(1-\eta\lambda)\mathbf{w}-\frac{\eta}{\vert\mathcal{B}\vert}\sum_{i\in\mathcal{B}}\mathbf{x}^{(i)}\left(\mathbf{w}^\top\mathbf{x}^{(i)}+b-y^{(i)}\right)$$
  Thus, weight decay provides a continuous mechanism for adjusting function complexity. In addition, whether to also penalize the corresponding bias $b^2$ depends on the application; generally the bias of the network's output layer is not regularized.
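  In a concise implementation this shrinkage $(1-\eta\lambda)$ is applied by PyTorch's SGD via its `weight_decay` argument. A minimal sketch, using per-group parameters to exempt the bias as discussed above (the dimensions and hyperparameters are illustrative):

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(200, 1))
wd = 3.0  # the regularization constant lambda

# Penalize only the weights, not the bias
trainer = torch.optim.SGD([
    {"params": net[0].weight, "weight_decay": wd},
    {"params": net[0].bias},  # no decay on the bias term
], lr=0.003)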

  5. [ Example of weight decay ] Generate data using the following formula:
$$y=0.05+\sum_{i=1}^{d}0.01x_i+\epsilon\ \text{ where }\ \epsilon\sim\mathcal{N}(0,0.01^2).$$
  The default in this book is a simple heuristic: apply weight decay across all layers of a deep network.

4.4 Dropout

  We want the model to mine features deeply, i.e. spread its weights across many features, rather than relying too heavily on a few potentially spurious associations.
  [ Generalization vs. flexibility ] When given more samples than features, linear models usually do not overfit, but this reliable generalization comes at a price: interactions between features are not considered. Each feature must be assigned a positive or negative weight while the other features are ignored. This fundamental tension between generalization and flexibility is described as the bias-variance trade-off:
(1) Linear models: high bias, low variance, similar results can be obtained on different random data samples.
(2) Deep neural network: high variance and low bias.
  This section focuses on practical tools that tend to improve the generalization of deep networks. Classical generalization theory holds that, to close the gap between training and testing performance, we should aim for simple models. The norm of the parameters is one useful measure of simplicity. Another angle on simplicity is smoothness, i.e. a function should not be sensitive to small changes in its input, such as image noise in image classification. In 2014, Srivastava et al. (Srivastava et al., 2014) proposed an idea: during training, before computing subsequent layers, inject noise into each layer of the network in an unbiased manner to encourage smoothness. This idea is called dropout: each intermediate activation $h$ is replaced, with dropout probability $p$, by a random variable $h'$ whose expected value equals $h$, so that some neurons appear to be discarded during training...
$$h'=\begin{cases}0 & \text{with probability } p\\ \dfrac{h}{1-p} & \text{otherwise}\end{cases}$$

  [ Dropout in practice ] For the multilayer perceptron with 1 hidden layer and 5 hidden units shown in the figure, applying dropout zeroes out hidden units with probability $p$, which is equivalent to removing some units during the forward pass and backpropagation. This way, the computation of the output layer cannot depend too heavily on any one hidden unit, making the network more robust.

Multilayer perceptron before and after dropout

Implemented from scratch...
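  A minimal from-scratch sketch of the dropout layer defined by the equation above (PyTorch; the function name is mine):

```python
import torch

def dropout_layer(X, p):
    """Zero each element of X with probability p and scale the rest by
    1/(1-p), so the expected value of every activation is unchanged."""
    assert 0 <= p <= 1
    if p == 1:
        return torch.zeros_like(X)
    if p == 0:
        return X
    mask = (torch.rand(X.shape) > p).float()
    return mask * X / (1.0 - p)

# In a concise implementation, nn.Dropout(p) does the same and is disabled
# automatically in evaluation mode (net.eval()).
```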

4.5 Forward Propagation, Back Propagation and Computational Graphs

  This section will dive into the details of backpropagation with some basic mathematics and computational graphs.

  1. [ Forward Propagation ] Calculate and store the results of each layer in the neural network in order (from the input layer to the output layer).
  Intermediate variable: $\mathbf{z}=\mathbf{W}^{(1)}\mathbf{x}$; activation vector: $\mathbf{h}=\phi(\mathbf{z})$; output-layer vector: $\mathbf{o}=\mathbf{W}^{(2)}\mathbf{h}$; loss term for a single data sample: $L=l(\mathbf{o},y)$; $L_2$ regularization term: $s=\frac{\lambda}{2}(\Vert\mathbf{W}^{(1)}\Vert^2_F+\Vert\mathbf{W}^{(2)}\Vert^2_F)$; final regularized loss/objective function: $J=L+s$. The entire forward propagation process can be visualized with the following computational graph:

Computational graph of forward propagation

  2. [ Backpropagation ] A method of computing the gradients of a neural network's parameters by traversing the network in reverse order, from the output layer to the input layer, according to the chain rule. Suppose we have functions $Y=f(X)$ and $Z=g(Y)$, where the inputs $X$, $Y$, $Z$ are tensors of arbitrary shape; the derivative of $Z$ with respect to $X$ can be computed via the chain rule:
$$\frac{\partial Z}{\partial X}=\text{prod}\left(\frac{\partial Z}{\partial Y},\frac{\partial Y}{\partial X}\right)$$
The purpose of backpropagation is to compute the gradients $\partial J/\partial\mathbf{W}^{(1)}$ and $\partial J/\partial\mathbf{W}^{(2)}$ of the objective with respect to the network parameters $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$. The computation order is opposite to forward propagation: applying the chain rule starts from the result of the computational graph. The first step is to compute the gradients of the objective function $J=L+s$ with respect to the loss term $L$ and the regularization term $s$:
$$\frac{\partial J}{\partial L}=1\ \text{and}\ \frac{\partial J}{\partial s}=1$$
Next, compute the gradient of the objective function with respect to the output-layer variable $\mathbf{o}$:
$$\frac{\partial J}{\partial\mathbf{o}}=\text{prod}\left(\frac{\partial J}{\partial L},\frac{\partial L}{\partial\mathbf{o}}\right)=\frac{\partial L}{\partial\mathbf{o}}\in\mathbb{R}^q$$
Next, compute the gradients of the regularization term with respect to the two parameters:
$$\frac{\partial s}{\partial\mathbf{W}^{(1)}}=\lambda\mathbf{W}^{(1)}\ \text{and}\ \frac{\partial s}{\partial\mathbf{W}^{(2)}}=\lambda\mathbf{W}^{(2)}$$
The gradient of the model parameters closest to the output layer can now be computed:
$$\frac{\partial J}{\partial\mathbf{W}^{(2)}}=\text{prod}\left(\frac{\partial J}{\partial\mathbf{o}},\frac{\partial\mathbf{o}}{\partial\mathbf{W}^{(2)}}\right)+\text{prod}\left(\frac{\partial J}{\partial s},\frac{\partial s}{\partial\mathbf{W}^{(2)}}\right)=\frac{\partial J}{\partial\mathbf{o}}\mathbf{h}^\top+\lambda\mathbf{W}^{(2)}$$
The gradient with respect to the parameters $\mathbf{W}^{(1)}$ requires backpropagating from the output layer to the hidden layer. The gradient with respect to the hidden-layer output is:
$$\frac{\partial J}{\partial\mathbf{h}}=\text{prod}\left(\frac{\partial J}{\partial\mathbf{o}},\frac{\partial\mathbf{o}}{\partial\mathbf{h}}\right)=\mathbf{W}^{(2)\top}\frac{\partial J}{\partial\mathbf{o}}$$
The gradient $\partial J/\partial\mathbf{z}\in\mathbb{R}^h$ of the intermediate variable $\mathbf{z}$ requires the element-wise multiplication operator, denoted $\odot$:
$$\frac{\partial J}{\partial\mathbf{z}}=\text{prod}\left(\frac{\partial J}{\partial\mathbf{h}},\frac{\partial\mathbf{h}}{\partial\mathbf{z}}\right)=\frac{\partial J}{\partial\mathbf{h}}\odot\phi'(\mathbf{z})$$
Finally we obtain the gradient $\partial J/\partial\mathbf{W}^{(1)}\in\mathbb{R}^{h\times d}$ of the model parameters closest to the input layer:
$$\frac{\partial J}{\partial\mathbf{W}^{(1)}}=\text{prod}\left(\frac{\partial J}{\partial\mathbf{z}},\frac{\partial\mathbf{z}}{\partial\mathbf{W}^{(1)}}\right)+\text{prod}\left(\frac{\partial J}{\partial s},\frac{\partial s}{\partial\mathbf{W}^{(1)}}\right)=\frac{\partial J}{\partial\mathbf{z}}\mathbf{x}^\top+\lambda\mathbf{W}^{(1)}$$

When training a neural network, forward propagation and backpropagation are interdependent, such as:
(1) Computing the regularization term during forward propagation depends on the current values of the model parameters $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$, which were given by the optimization algorithm based on the most recent backpropagation.
(2) Computing the gradient $\partial J/\partial\mathbf{h}$ during backpropagation depends on the current value of the hidden variable $\mathbf{h}$ produced by forward propagation.
Therefore, after initializing the model parameters, forward propagation and backpropagation are used alternately: the gradients given by backpropagation update the model parameters, and backpropagation reuses the intermediate values stored during forward propagation (to avoid recomputation). As a consequence, training requires significantly more memory (video memory) than pure prediction.
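  The interdependence is easy to see in code: the forward pass builds and stores the intermediates, and `backward()` consumes them. A minimal sketch mirroring the computational graph above (all sizes illustrative):

```python
import torch
import torch.nn.functional as F

d, h, q, lam = 4, 3, 2, 0.1
x = torch.randn(d)
y = torch.tensor(0)  # class label
W1 = torch.randn(h, d, requires_grad=True)
W2 = torch.randn(q, h, requires_grad=True)

# Forward propagation: each intermediate is kept for the backward pass
z = W1 @ x                        # z = W1 x
h_ = torch.relu(z)                # h = phi(z)
o = W2 @ h_                       # o = W2 h
L = F.cross_entropy(o.unsqueeze(0), y.unsqueeze(0))
s = lam / 2 * (W1.pow(2).sum() + W2.pow(2).sum())  # L2 term uses current W
J = L + s

J.backward()                      # backprop reuses the stored intermediates
print(W1.grad.shape, W2.grad.shape)
```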

4.6 Numerical stability and model initialization

  The previous models all initialize model parameters with a preset distribution. However, the selection of the initialization scheme actually plays a pivotal role in neural network learning, which is crucial to maintaining numerical stability . Which activation function we choose and how we initialize the parameters can determine how quickly the optimization algorithm converges . A poor choice can lead to issues like exploding or vanishing gradients during training, a topic that is discussed in detail in this section, including some heuristics.

  1. [ Vanishing and exploding gradients ] Consider a deep network with $L$ layers, input $\mathbf{x}$ and output $\mathbf{o}$. Each layer $l$ is defined by a transformation $f_l$ parameterized by weights $\mathbf{W}^{(l)}$, with hidden variable $\mathbf{h}^{(l)}$ (let $\mathbf{h}^{(0)}=\mathbf{x}$); the network can be expressed as:
$$\mathbf{h}^{(l)}=f_l(\mathbf{h}^{(l-1)})\ \text{ and thus }\ \mathbf{o}=f_L\circ\dots\circ f_1(\mathbf{x})$$
  The gradient of $\mathbf{o}$ with respect to any set of parameters $\mathbf{W}^{(l)}$ can be written as:
$$\partial_{\mathbf{W}^{(l)}}\mathbf{o}=\underbrace{\partial_{\mathbf{h}^{(L-1)}}\mathbf{h}^{(L)}}_{\mathbf{M}^{(L)}}\cdot\dots\cdot\underbrace{\partial_{\mathbf{h}^{(l)}}\mathbf{h}^{(l+1)}}_{\mathbf{M}^{(l+1)}}\underbrace{\partial_{\mathbf{W}^{(l)}}\mathbf{h}^{(l)}}_{\mathbf{v}^{(l)}}$$
  That is, the gradient is the product of $L-l$ matrices $\mathbf{M}^{(L)}\cdot\dots\cdot\mathbf{M}^{(l+1)}$ and the gradient vector $\mathbf{v}^{(l)}$, which makes it susceptible to numerical underflow. A common trick is to switch to log space, shifting the pressure from the mantissa to the exponent. Unfortunately, the matrices $\mathbf{M}^{(l)}$ may have eigenvalues of widely varying size, so the product can come out very large or very small. If the parameter updates are too large, the model's stable convergence is destroyed — the exploding-gradient problem; if they are too small, the parameters barely move on each update and the model fails to learn — the vanishing-gradient problem.

  The sigmoid function is a common cause of vanishing gradients, because its gradient is close to 0 when the input is very large or very small. As a result, the more stable ReLU family of functions has become the default choice for practitioners. Products of randomly generated Gaussian matrices, on the other hand, can explode; when this is caused by the initialization of a deep network, the gradient-descent optimizer never gets a chance to converge.

  2. [ Symmetry ] Another problem in neural network design is the symmetry inherent in the parameterization. Take a simple multilayer perceptron with one hidden layer and two hidden units: if the first layer's weights $\mathbf{W}^{(1)}$ are permuted and the output-layer weights are permuted correspondingly, the same function is obtained. That is, there is a permutation symmetry among the hidden units of each layer, and if iterative optimization cannot break this symmetry, the network's expressive power is never realized. This problem can be addressed by careful parameter initialization; proper regularization during optimization also helps.

  3. [ Parameter initialization ] In the previous chapters, weights were by default initialized from a normal distribution; this is usually effective for moderately difficult problems. The Xavier initialization method is introduced below.
  The output of a fully connected layer without nonlinearity is given by:
$$o_i=\sum^{n_{\text{in}}}_{j=1}w_{ij}x_j$$
  The weights $w_{ij}$ are drawn independently from the same distribution, assumed to have zero mean and variance $\sigma^2$ (not necessarily Gaussian). Assume the layer inputs $x_j$ also have zero mean and variance $\gamma^2$ and are independent of the $w_{ij}$; the mean and variance of $o_i$ can then be computed:
$$E[o_i]=0,\qquad \text{Var}[o_i]=n_{\text{in}}\sigma^2\gamma^2$$
  One way to keep the variance from growing is to set $n_{\text{in}}\sigma^2=1$. Since backpropagation faces the analogous problem, we would also need $n_{\text{out}}\sigma^2=1$. Both conditions cannot be satisfied simultaneously, so instead we satisfy:
$$\frac{1}{2}(n_{\text{in}}+n_{\text{out}})\sigma^2=1\ \text{ or equivalently }\ \sigma=\sqrt{\frac{2}{n_{\text{in}}+n_{\text{out}}}}$$
  This is the basis of the standard and practical Xavier initialization (Glorot and Bengio, 2010): sample weights from a Gaussian distribution with mean 0 and variance $\sigma^2=\frac{2}{n_{\text{in}}+n_{\text{out}}}$, or from a uniform distribution. Noting that the uniform distribution $U(-a,a)$ has variance $\frac{a^2}{3}$, substituting into the condition on $\sigma^2$ yields the initialization range:
$$U\left(-\sqrt{\frac{6}{n_{\text{in}}+n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}}+n_{\text{out}}}}\right)$$
  Although the "no nonlinearity" assumption is easily violated in neural networks,the Xavier initialization method has been shown to be effective in practice.

  4. [ Further material ] The above only scratches the surface of modern parameter initialization; deep learning frameworks typically implement more than a dozen heuristics covering other cases. Interested readers can study the papers that analyze these heuristic methods.

4.7 Environment and distribution shift

  Where the data comes from, and what ultimately happens to the model's outputs, are fundamental questions that demand attention. Many failed machine learning deployments can be traced to ignoring them: models that look good by test-set accuracy but fail catastrophically when the data distribution suddenly changes, or deployments in which the model itself perturbs the data distribution. For example:
  Train an applicant default-risk model to predict who will repay a loan and who will default. Suppose the model finds that default correlates with footwear (leather-shoe wearers repay, sneaker wearers default); it will then tend to grant loans to all applicants in leather shoes and reject applicants in sneakers. Once the model starts making decisions based on footwear and customers understand this and change their behavior, it will not be long before every applicant wears leather shoes, without any corresponding rise in creditworthiness. In short: by injecting the model's decisions into the environment, we may break the model itself.
  This section aims to expose these common problems, stimulate thinking, identify risks early, and mitigate losses. Some solutions are simple (acquire the right data), others are difficult (implement a reinforcement learning system), and some require stepping entirely outside statistical prediction to grapple with thorny ethical questions about the application of algorithms.
  Five areas:
  - Types of distribution shift
  - Distribution shift examples
  - Distribution shift correction
  - Taxonomy of learning problems
  - Fairness, accountability, and transparency in machine learning

4.7.1 Types of distribution shifts

  Consider a binary classification problem: Distinguish between cats and dogs
  1. Covariate shift: the input distribution changes over time while the labeling function does not change. For example, the training set consists of real photos while the test set contains only cartoons; without a way to adapt to the new domain, there may be trouble.
  2. Label shift, the converse of covariate shift, where the marginal probability of the labels can change. For example, when predicting a patient's disease from symptoms, even if the relative prevalence of the disease changes over time, label shift remains an appropriate assumption, because the disease causes the symptoms.
  3. Concept shift. For example, when the data source moves geographically, the distribution of names for "soft drink" exhibits considerable concept shift, so it is best to exploit the knowledge that the shift is gradual in time or space.

4.7.2 Example of distribution shift

  Discuss some specific cases where covariate shift or concept shift is not apparent.

  1. Medical diagnosis: the distribution of the collected training data and the distribution of data encountered in practice may differ greatly. For example, a company develops a blood test, primarily for a disease that affects elderly men, and hopes to study it using blood samples from patients. However, obtaining blood samples from healthy men is much harder than from existing patients, so the company solicits paid blood donations from university students. Conceivably, these samples exhibit extreme covariate shift, which is hard to correct by conventional methods.

  2. Self-driving cars: a self-driving company wants to use machine learning to develop a "roadside detector". Because real annotated data is expensive to obtain, they hit on the idea of using synthetic data from a game engine as additional training data. This turns out to be a disaster on real cars, because every curb is rendered with one and the same simple texture, and the roadside detector quickly learns exactly this "feature".

  3. Non-stationary distributions, which arise when the distribution changes slowly and the model is not updated accordingly. For example: (1) build a spam filter that detects all current spam well, but spammers wise up and craft new messages that look unlike previous spam; (2) build a product recommendation system that works well all winter but keeps recommending Santa hats after Christmas.

  4. More examples: (1) a face detector that performs well on all benchmarks but fails on test data; the problematic examples are close-ups in which a face fills the entire image; (2) an image classifier trained on a dataset in which the classes are nearly balanced, while the actual label distribution of real-world photos is clearly uneven.

4.7.3 Distribution shift correction

  In some cases, principled strategies can be used to account for these shifts and do better.
  1. [ Empirical risk vs. true risk ], approximating the true risk...
  2. [ Covariate shift correction ]
  3. [ Label shift correction ]
  4. [ Concept shift correction ]

4.7.4 Taxonomy of Learning Problems

  1. [ Batch learning ] Access a set of training features and labels, use them to train $f(\mathbf{x})$, then deploy the model to score new data drawn from the same distribution; the system is rarely updated after deployment.
  2. [ Online learning ] Besides "batch" learning, samples can also be learned "online", one at a time. First observe $\mathbf{x}_i$, derive an estimate $f(\mathbf{x}_i)$, and only then receive a reward or loss. Many practical problems fall into this category, e.g. stock forecasting: predict tomorrow's price, trade on the prediction, and at the end of the day evaluate whether the prediction was profitable. That is, online learning continuously improves the model through the following cycle:
$$\text{model } f_t\rightarrow\text{data } \mathbf{x}_t\rightarrow\text{estimate } f_t(\mathbf{x}_t)\rightarrow\text{observation } y_t\rightarrow\text{loss } l(y_t,f_t(\mathbf{x}_t))\rightarrow\text{model } f_{t+1}$$
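  The cycle can be written as a plain loop; everything here (`stream`, `loss`, `update`) is a hypothetical placeholder for the application's data feed, loss, and update rule:

```python
def online_learning(model, stream, loss, update):
    """One pass of the online-learning cycle: observe, predict, incur loss,
    update. `stream` yields (x_t, y_t) pairs one at a time; `update` returns
    the improved model f_{t+1}."""
    for x_t, y_t in stream:
        estimate = model(x_t)                  # estimate f_t(x_t)
        l_t = loss(y_t, estimate)              # observe y_t, incur loss
        model = update(model, x_t, y_t, l_t)   # improve: f_t -> f_{t+1}
    return model
```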
  3. [ Bandits ] A special case of the above: most learning problems involve a continuously parameterized function $f$, but in the bandit problem only a finite number of arms can be pulled, i.e. the set of available actions is finite, which allows stronger theoretical guarantees of optimality.
  4. [ Control ] In many cases the environment remembers what the model did before, though not necessarily adversarially. For example, a coffee-boiler controller observes different temperatures depending on whether the boiler was heated previously; PID (proportional-integral-derivative) controllers are a popular choice here. Likewise, a user's behavior on a news site depends on what she has already been shown (most news items are read only once, etc.).
  5. [Reinforcement learning] emphasizes how to act based on the environment to maximize the expected benefits. Chess, Go, backgammon, and StarCraft are examples of applications of reinforcement learning.

4.7.5 Fairness, Accountability and Transparency in Machine Learning

  When we deploy machine learning, the goal is often not merely to optimize a predictive model but to provide a tool used (partially or fully) for automated decision-making. This leap from prediction to decision raises new technical questions, and ethical ones must also be considered: (1) for which groups the model works and for which it fails; (2) when the model turns predictions into actions, the potential cost-sensitivity of erring in various ways, and which social values are affected; (3) attention to how predictive models can create feedback loops.
  For example: should the news an individual sees be determined by the Facebook pages they have liked? This is one case study in the ethical dilemmas a machine learning career must confront.

4.8 Kaggle in Practice: Predicting House Prices
