Introduction to Deep Learning Basics (6): Model tuning: attention mechanism (multi-head attention, self-attention), regularization (L1, L2, Dropout, DropConnect), etc.

This column introduces in detail the [Introduction to advanced deep learning] must-see series: activation functions, optimization strategies, loss functions, model tuning, normalization algorithms, convolution models, sequence models, pre-training models, adversarial neural networks, and more.

This column is mainly intended to help beginners quickly grasp the relevant knowledge. Disclaimer: some projects are classic online projects included for quick learning; practical material (competitions, papers, real applications, etc.) will be added later.


1. Attention mechanism

In deep learning, a model often has to take in and process large amounts of data, yet at any given moment only a small portion of that data matters. This is exactly the situation where the Attention mechanism shines.

For example, Figure 2 shows the result of a machine translation in which we want to translate "who are you" into the Chinese sentence "你是谁". The traditional approach is a Seq-to-Seq model with an encoder end and a decoder end: the encoder encodes "who are you" and passes the information of the whole sentence to the decoder, which then decodes "你是谁" word by word. During each decoding step, if too much information is received, the model may become internally confused and produce wrong results. The Attention mechanism can solve this problem. As Figure 2 shows, when generating "你", the output depends heavily on the word "you" and has little to do with "who are", so we would like the Attention mechanism to focus more on "you" at that step instead of "who are", improving the performance of the overall model.

Since the Attention mechanism was proposed, many different ways of applying Attention have appeared, but they all share the same core idea: focusing the model's attention on the important things. In the remainder of this article, some classic or commonly used Attention mechanisms are selected for discussion.

Remark: saliency-based (unconscious, bottom-up) attention is the more common kind in deep learning.

1.1 Using the machine translation task to show the calculation of the Attention mechanism

Discussing the Attention mechanism in the abstract is a bit dry, so let us take the machine translation task as an example and understand the Attention mechanism by explaining how it is applied there.

What is a machine translation task? Taking Chinese-to-English translation as an example, machine translation translates a string of Chinese sentences into corresponding English sentences, as shown in Figure 1.

Figure 1 shows a classic machine translation structure, Seq-to-Seq, with Attention calculation added to it. The Seq-to-Seq structure consists of two parts, the Encoder and the Decoder: the Encoder encodes the Chinese sentence, and these encodings are later provided to the Decoder, which decodes based on them. Let us take Figure 1 as an example and explain the Decoder's decoding process in detail.

More specifically, Figure 1 shows how the calculation is performed when generating the word "machine". First, the output state $q_2$ of the previous moment and the Encoder outputs $h=[h_1,h_2,h_3,h_4]$ go through an Attention calculation to obtain a context for the current moment, which can be organized into the formula:

$$[a_1,a_2,a_3,a_4]=\mathrm{softmax}([s(q_2,h_1),s(q_2,h_2),s(q_2,h_3),s(q_2,h_4)])$$

$$\text{context}=\sum_{i=1}^{4}a_i\cdot h_i$$

Let us explain: $s(q_i,h_j)$ is the attention scoring function, a scalar whose size describes how much attention is paid to each Encoder output at the current moment (this function is discussed later). Softmax is then used to normalize these scores, and finally a weighted sum yields the context vector for the current moment. This context can be interpreted as: given that "I love" has been generated so far, it captures the content of the source Chinese sentence that we should pay more attention to at the next moment. This is one complete Attention calculation.

Finally, this context is fused with "love", the output of the previous moment, and used as the input of the RNN unit at the current moment.

In Figure 1, the output of the previous step ("love" in the description above) is fused in. Some implementations do not fuse the previous output, on the grounds that $q_2$ already carries the "love" information, which is also reasonable.

1.2 Formal introduction of attention mechanism

Earlier we walked through the overall calculation of the Attention mechanism using the machine translation task. One loose end remains: the calculation of the attention scoring function. We address it now. But before discussing that function, let us first summarize the Attention calculation above. Figure 2 describes the calculation principle of the Attention mechanism in detail.

Suppose we now have a set of inputs $H=[h_1,h_2,h_3,\ldots,h_n]$ from which the Attention mechanism should extract the important content. This usually requires a query vector q (often related to the task at hand, such as the $q_2$ in the machine translation example above). A scoring function then computes the correlation between the query vector q and each input $h_i$, giving one score per input. Next, softmax normalizes these scores; the normalized result is the attention distribution $a=[a_1,a_2,a_3,\ldots,a_n]$ of the query vector q over the inputs, whose values correspond one-to-one with the original inputs $H=[h_1,h_2,h_3,\ldots,h_n]$. Taking $a_i$ as an example, the relevant calculation formula is:

$$a_i=\mathrm{softmax}(s(h_i,q))=\dfrac{\exp(s(h_i,q))}{\sum_{j=1}^n \exp(s(h_j,q))}$$

Finally, given the attention distribution, information can be selectively extracted from the input H. The most common extraction method is "soft" (Figure 2 shows soft Attention): the inputs are weighted and summed according to the attention distribution, and the resulting context reflects what the model should currently pay attention to:

$$\text{context}=\sum_{i=1}^n a_i\cdot h_i$$

Now let us tie up the loose end from before: the scoring function. It can be computed in any of the following ways:

Additive model: $s(h,q)=v^T\tanh(Wh+Uq)$

Dot-product model: $s(h,q)=h^Tq$

Scaled dot-product model: $s(h,q)=\dfrac{h^Tq}{\sqrt{D}}$

Bilinear model: $s(h,q)=h^TWq$

The parameters W, U, and v in the above formula are all learnable parameter matrices or vectors, and D is the dimension of the input vector. Next, let's analyze the differences in the calculation methods of these scores.

  • The additive model introduces learnable parameters and maps the query vector q and the original input vector h into a new vector space before scoring. Compared with the additive model, the dot-product model clearly has better computational efficiency.

  • In addition, when the dimension of the input vector is high, the dot-product scores usually have a large variance, which pushes the Softmax function into regions with very small gradients. The scaled dot-product model alleviates this by dividing by $\sqrt{D}$, which smooths the score values and, in effect, smooths the final attention distribution.

Finally, the bilinear model can be rewritten as $s(h,q)=h^TWq=h^T(U^TV)q=(Uh)^T(Vq)$, i.e. the dot product is computed after linearly transforming the query vector q and the input vector h separately. Compared with the dot-product model, the bilinear model introduces asymmetry when computing similarity.
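To make these scoring functions concrete, here is a minimal NumPy sketch (not from the original article; the dimension D and the random parameters W, U, v are illustrative placeholders):

import numpy as np

np.random.seed(0)
D = 8                        # dimension of the input/query vectors (assumed)
q = np.random.randn(D)       # query vector q
W = np.random.randn(D, D)    # learnable parameters, here random placeholders
U = np.random.randn(D, D)
v = np.random.randn(D)

def additive(h, q):
    return v @ np.tanh(W @ h + U @ q)   # s(h,q) = v^T tanh(Wh + Uq)

def dot_product(h, q):
    return h @ q                        # s(h,q) = h^T q

def scaled_dot_product(h, q):
    return (h @ q) / np.sqrt(D)         # s(h,q) = h^T q / sqrt(D)

def bilinear(h, q):
    return h @ W @ q                    # s(h,q) = h^T W q

# Full soft attention over a set of inputs H, using the scaled dot product:
H = np.random.randn(4, D)                                  # h_1..h_4
scores = np.array([scaled_dot_product(h_i, q) for h_i in H])
a = np.exp(scores) / np.exp(scores).sum()                  # softmax -> attention distribution
context = a @ H                                            # weighted sum of the inputs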

1.3 Attention mechanism-related variants

1.3.1 Hard Attention Mechanism

In the classic attention mechanism above we used soft attention, which fuses the input vectors by weighting them with the attention distribution. The Hard Attention mechanism does not do this: it selects a single input vector as the output according to the attention distribution. There are two options:

  • In the attention distribution, the input vector corresponding to the item with the largest score is selected as the output of the Attention mechanism.

  • Random sampling is performed according to the attention distribution, and the sampling results are used as the output of the Attention mechanism.

Because hard attention selects the output in one of these two ways, the functional relationship between the final loss function and the attention distribution is not differentiable, so the model cannot be trained with backpropagation; hard attention usually has to be trained with reinforcement learning. For this reason, deep learning algorithms generally use soft attention.
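A small sketch contrasting the two hard-attention options with soft attention, assuming an attention distribution a and inputs H (the values are made up for illustration):

import numpy as np

np.random.seed(0)
H = np.random.randn(4, 8)              # input vectors h_1..h_4 (assumed)
a = np.array([0.1, 0.6, 0.2, 0.1])     # attention distribution (assumed)

soft_out = a @ H                       # soft attention: weighted sum, differentiable
argmax_out = H[np.argmax(a)]           # hard, option 1: pick the max-score input
sample_out = H[np.random.choice(len(a), p=a)]   # hard, option 2: sample according to a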

1.3.2 Key-value pair attention mechanism

Suppose the input information is no longer $H=[h_1,h_2,h_3,\ldots,h_n]$ but the more general key-value pairs $(K,V)=[(k_1,v_1),(k_2,v_2),\ldots,(k_n,v_n)]$, while the query vector is still q. In this mode, the attention weight $a_i$ is computed between the query vector q and the corresponding key $k_i$:

$$a_i=\mathrm{softmax}(s(k_i,q))=\dfrac{\exp(s(k_i,q))}{\sum_{j=1}^n \exp(s(k_j,q))}$$

After calculating the attention distribution on the input data, use the attention distribution and the corresponding value in the key-value pair to perform weighted fusion calculation:

$$\text{context}=\sum_{i=1}^{n}a_i\cdot v_i$$

Obviously, when keys and values are identical ($K=V$), key-value pair attention degenerates into the ordinary classic attention mechanism.

1.3.3 Multi-head attention mechanism

Multi-Head Attention uses multiple query vectors $Q=[q_1,q_2,\ldots,q_m]$ in parallel to select multiple groups of information from the input $(K,V)=[(k_1,v_1),(k_2,v_2),\ldots,(k_n,v_n)]$. During the query process, each query vector $q_i$ focuses on a different part of the input information, i.e. analyzes the current input from a different angle.

Suppose $a_{ij}$ denotes the attention weight between the $i$-th query vector $q_i$ and the $j$-th input $k_j$, and $\text{context}_i$ denotes the Attention output vector computed from the query vector $q_i$. They are calculated as:

$$a_{ij}=\mathrm{softmax}(s(k_j,q_i))=\frac{\exp(s(k_j,q_i))}{\sum_{j'=1}^n \exp(s(k_{j'},q_i))},\qquad \text{context}_i=\sum_{j=1}^n a_{ij}\cdot v_j$$

Finally, the results of all query vectors are concatenated as the final result:

$$\text{context}=\text{context}_1\oplus\text{context}_2\oplus\text{context}_3\oplus\ldots\oplus\text{context}_m$$

⊕ in the formula represents the vector concatenation operation.
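The following NumPy sketch computes multi-head attention exactly as in the formulas above, using the scaled dot-product score; the shapes and random data are assumptions for illustration:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
n, m, D = 5, 3, 8                 # n key-value pairs, m query vectors, dim D (assumed)
K = np.random.randn(n, D)         # keys k_1..k_n
V = np.random.randn(n, D)         # values v_1..v_n
Q = np.random.randn(m, D)         # queries q_1..q_m

scores = Q @ K.T / np.sqrt(D)     # s(k_j, q_i) for all i, j -> shape (m, n)
A = softmax(scores, axis=-1)      # a_ij: each row is one query's attention distribution
contexts = A @ V                  # context_i = sum_j a_ij * v_j -> shape (m, D)
context = contexts.reshape(-1)    # concatenate context_1 ⊕ ... ⊕ context_m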

1.4 Self-attention mechanism

In the previous content, we used a query vector q together with the inputs $H=[h_1,h_2,\ldots,h_n]$ for the attention calculation, where the query vector q is usually task-related. For example, in machine translation based on Seq-to-Seq, the query vector q can be the output state vector of the Decoder at the previous moment, as shown in Figure 1.

However, in the self-attention mechanism (self-Attention), the query vector can also be generated from the input information itself, rather than chosen according to the task. In effect, after the model reads the input, it determines the most important information based on the input itself.

The self-attention mechanism usually adopts the Query-Key-Value (QKV) model. Here we discuss it in the form used by the self-attention in BERT, as shown in Figure 2.

In Figure 2, the input information is $H=[h_1,h_2]$, where each row of the blue matrix represents an input vector. In addition, Figure 2 contains three matrices $W_q$, $W_k$ and $W_v$, which map the input H into the corresponding query space $Q=[q_1,q_2]$, key space $K=[k_1,k_2]$ and value space $V=[v_1,v_2]$:

$$\begin{bmatrix}q_1=h_1W_q\\ q_2=h_2W_q\end{bmatrix}\Rightarrow Q=HW_q$$

$$\begin{bmatrix}k_1=h_1W_k\\ k_2=h_2W_k\end{bmatrix}\Rightarrow K=HW_k$$

$$\begin{bmatrix}v_1=h_1W_v\\ v_2=h_2W_v\end{bmatrix}\Rightarrow V=HW_v$$

After obtaining the representations Q, K and V of the input information in the different spaces, let us take $h_1$ as an example and compute the attention output vector $\text{context}_1$ for that position, which represents the content the model should focus on there, as shown in Figure 3.

As can be seen, after obtaining the representations Q, K and V of the original input H in the query, key and value spaces, we compute the scores $s_{11}$ and $s_{12}$ of $q_1$ against $h_1$ and $h_2$ (the score calculation here uses the dot product). The scores are then scaled and normalized with softmax to obtain the attention distribution for position $h_1$: $a_{11}$ and $a_{12}$, which represent how much attention the model, currently at position $h_1$, should pay to the inputs $h_1$ and $h_2$. Finally, $v_1$ and $v_2$ are weighted and averaged according to this attention distribution to obtain the final Attention vector $\text{context}_1$ for position $h_1$.

In the same way, the Attention vector $\text{context}_2$ of the second position can be obtained, and for a longer input sequence more $\text{context}_i$ vectors follow by the same principle. To summarize the calculation process of the self-attention mechanism:

Suppose the current input is $H=[h_1,h_2,\ldots,h_n]$, and we want the self-attention mechanism to produce the output $\text{context}=[\text{context}_1,\text{context}_2,\ldots,\text{context}_n]$ for every position.

  • First, the original input needs to be mapped to query space Q, key space K, and value space V. The relevant calculation formulas are as follows:

$$\begin{array}{c}Q=HW_q=[q_1,q_2,\ldots,q_n]\\ K=HW_k=[k_1,k_2,\ldots,k_n]\\ V=HW_v=[v_1,v_2,\ldots,v_n]\end{array}$$

  • Next, we will calculate the attention distribution for each position and weight the corresponding results:

$$\text{context}_i=\sum_{j=1}^n \mathrm{softmax}(s(q_i,k_j))\cdot v_j$$

where $s(q_i,k_j)$ is the score value after the dot product and scaling described above.

  • Finally, for efficiency, the Attention output vectors of all positions can actually be computed at once using matrix operations:

$$\text{context}=\mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{D_k}}\right)V$$
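To make the matrix form concrete, here is a minimal NumPy sketch of self-attention; the input H, the three projection matrices, and all shapes are random placeholders, not values from the article:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
n, D, Dk = 4, 16, 8                       # sequence length, input dim, head dim (assumed)
H = np.random.randn(n, D)                 # input vectors h_1..h_n
W_q = np.random.randn(D, Dk)              # projection into the query space
W_k = np.random.randn(D, Dk)              # projection into the key space
W_v = np.random.randn(D, Dk)              # projection into the value space

Q, K, V = H @ W_q, H @ W_k, H @ W_v       # Q = HWq, K = HWk, V = HWv
context = softmax(Q @ K.T / np.sqrt(Dk), axis=-1) @ V   # (n, Dk): one output per position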

Congratulations: you now have a solid grasp of the principle of the self-attention mechanism.

2. Regularization

As we move to the right (toward higher model complexity), the model starts to learn the fine details and noise of the training data and ends up performing poorly on unseen data. That is, as complexity increases, the training error decreases but the test error does not, as shown below.

If you've built neural networks before, you know how complex they are. This makes them more prone to overfitting.

Regularization is a technique of slightly modifying a learning algorithm to make the model generalize better. This in turn improves the performance of the model on unseen data.

2.1 How Regularization Helps Reduce Overfitting

Let us consider a neural network that is overfitting to the training data as shown in the figure below.

If you've studied the concept of regularization in machine learning, you'll have a fair idea of the regularization penalty coefficient. In deep learning, it actually penalizes the weight matrices of the nodes.

Suppose our regularization coefficients are so high that some weight matrices are almost equal to zero.

This will result in a simpler linear network and a slight underfitting of the training data.

Such large regularization coefficient values are not very useful. We need to tune the regularization coefficient to obtain a well-fitted model, as shown in the figure below.

Regularization can avoid algorithm overfitting, which usually occurs when the input data learned by the algorithm cannot reflect the real distribution and there is some noise. In the past few years, researchers have proposed and developed a variety of regularization methods suitable for machine learning algorithms, such as data enhancement, L2 regularization (weight decay), L1 regularization, Dropout, Drop Connect, random pooling, and early stopping.

In addition to generalization reasons, Occam's razor and Bayesian estimation also support regularization. According to the principle of Occam's razor, among all possible models, the model that can explain the known data well and is very simple is the best model. From the perspective of Bayesian estimation, the regularization term corresponds to the prior probability of the model.

2.2 Data Augmentation

Data augmentation is an important tool to improve the performance of algorithms and meet the needs of deep learning models for large amounts of data. Data augmentation artificially augments the training dataset by adding transformations or perturbations to the training data. Data augmentation techniques such as flipping images horizontally or vertically, cropping, color shifting, dilation and rotation are commonly applied in visual representation and image classification.

For details on data augmentation methods in the vision field, see the follow-up article on data augmentation.
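Since this column's code uses PaddlePaddle, here is a hedged sketch of common image augmentations with paddle.vision.transforms; the specific transforms and parameter values are arbitrary examples, not prescriptions:

import paddle.vision.transforms as T

train_transforms = T.Compose([
    T.RandomHorizontalFlip(prob=0.5),             # random horizontal flip
    T.RandomRotation(degrees=15),                 # random rotation within ±15°
    T.RandomCrop(size=28, padding=2),             # pad, then randomly crop back to 28x28
    T.ColorJitter(brightness=0.2, contrast=0.2),  # color shifting
    T.ToTensor(),                                 # HWC uint8 -> CHW float tensor
])
# Pass `transform=train_transforms` to a dataset, e.g. paddle.vision.datasets.MNIST.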

2.3 L1 and L2 regularization

L1 and L2 regularization are the most commonly used regularization methods. L1 regularization adds a regularization term to the objective function that reduces the sum of the absolute values of the parameters, while L2 regularization adds a term that reduces the sum of the squares of the parameters. Previous research shows that L1 regularization tends to produce sparse parameter vectors, since it drives many parameters toward 0, so it is often used for feature selection. The most common regularization method in machine learning is to impose an L2 norm constraint on the weights.

The standard regularized cost function is as follows:

$$\theta=\arg\min_\theta\dfrac{1}{N}\sum_{i=1}^N\left(L(\hat{y}_i,y_i)+\lambda R(w)\right)$$

where the regularization term R(w) is:

$$R_{L_2}(w)=\|W\|_2^2$$

Another way to penalize the absolute sum of weights is L1 regularization:

$$R_{L_1}(w)=\|W\|_1$$

L1 regularization is not differentiable at zero, so during training the weights decay toward zero by a constant amount. Many neural networks use a first-order step in the weight-decay formulation to handle the non-differentiable L1 penalty. A smooth approximate variant of the L1 norm is:

$$\|W\|_1=\sum_{k=1}^Q\sqrt{w_k^2+\epsilon}$$

Another regularization method is a mixture of L1 and L2 regularization, the elastic net penalty.
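As a quick illustration, the penalties above can be computed in a few lines of NumPy (the weight vector, epsilon, and lambda values are made up):

import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])       # example weight vector (assumed)

l2_penalty = np.sum(w ** 2)               # R_L2(w) = ||w||_2^2
l1_penalty = np.sum(np.abs(w))            # R_L1(w) = ||w||_1
eps = 1e-8
l1_smooth = np.sum(np.sqrt(w ** 2 + eps)) # differentiable approximation of ||w||_1

lam1, lam2 = 0.01, 0.01                   # arbitrary mixing coefficients
elastic_net = lam1 * l1_penalty + lam2 * l2_penalty   # elastic net mixture
# total objective = data loss L(y_hat, y) + the chosen penalty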

From a Bayesian point of view, the whole optimization problem is a maximum a posteriori (MAP) estimation: the regularization term corresponds to the prior, the loss function corresponds to the likelihood, and their product gives the posterior that the MAP estimate maximizes.

2.3.1 Bayesian inference analysis method

For the L1 norm and the L2 norm, the following conclusions hold:

  • The L1 norm is equivalent to placing a zero-mean Laplace prior with scale parameter $\frac{1}{\alpha}$ on the model parameters θ.

  • The L2 norm is equivalent to placing a zero-mean Gaussian prior with covariance $\frac{1}{\alpha}$ on the model parameters θ.

1. The L2 norm is equivalent to setting a zero-mean Gaussian prior distribution for the model parameter θ

Taking the linear model as an example, the conclusion can be extended to any model, and the linear model equation can be expressed as:

$$Y=\theta^TX+\epsilon$$

In particular, $\epsilon\sim N(0,\sigma^2)$ and $\theta_i\sim N(0,\tau^2)$, so:

$$p(\epsilon_i)=\dfrac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\dfrac{\epsilon_i^2}{2\sigma^2}\right)$$

$$p(y_i|x_i;\theta)=\dfrac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\dfrac{(y_i-\theta^Tx_i)^2}{2\sigma^2}\right)$$

Compute the maximum a posteriori estimate:

$$\begin{aligned}\arg\max_{\theta}\ln L(\theta)&=\arg\max_\theta\left(\ln\prod_{i=1}^n p(y_i|x_i;\theta)+\ln p(\theta)\right)\\ &=\ln\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\theta^Tx_i)^2}{2\sigma^2}\right)+\ln\prod_{j=1}^d\frac{1}{\sqrt{2\pi\tau^2}}\exp\left(-\frac{\theta_j^2}{2\tau^2}\right)\\ &=-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\theta^Tx_i)^2-\frac{1}{2\tau^2}\sum_{j=1}^d\theta_j^2-n\ln\sigma\sqrt{2\pi}-d\ln\tau\sqrt{2\pi}\end{aligned}$$

Maximizing the expression above (dropping the negative sign and constants, and unifying the parameters) is equivalent to the minimization:

$$\arg\min_\theta\sum_{i=1}^n(y_i-\theta^Tx_i)^2+\lambda\sum_{j=1}^d\theta_j^2$$

This is exactly the cost function of L2-regularized linear regression, which verifies the conclusion.

2. The L1 norm is equivalent to setting a Laplace prior distribution for the model parameter θ

Taking the linear model as an example (the conclusion extends to any model), assume $\epsilon\sim N(0,\sigma^2)$ and $\theta_i\sim \mathrm{Laplace}(0,b)$. Then:

$$\begin{aligned}\arg\max_{\theta}\ln L(\theta)&=\ln\prod_{i=1}^n p(y_i|x_i;\theta)+\ln p(\theta)\\ &=\ln\prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\theta^Tx_i)^2}{2\sigma^2}\right)+\ln\prod_{j=1}^d\frac{1}{2b}\exp\left(-\frac{|\theta_j|}{b}\right)\\ &=-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\theta^Tx_i)^2-\frac{1}{b}\sum_{j=1}^d|\theta_j|-n\ln\sigma\sqrt{2\pi}-d\ln 2b\end{aligned}$$

Maximizing the above expression (dropping the negative sign and constants, and unifying the parameters) is equivalent to the minimization:

$$\arg\min_\theta\sum_{i=1}^n(y_i-\theta^Tx_i)^2+\lambda\sum_{j=1}^d|\theta_j|$$

The above formula is just the cost function of the linear regression problem under the L1 norm regularization, so the conclusion is verified.

If the error follows a zero-mean Gaussian distribution, maximum likelihood estimation reduces to the least squares method. This is why the error is usually defined as $\sum_{i=1}^{n}(y_i-\theta^Tx_i)^2$: the formula is derived from probability.

2.4 Dropout

Dropout refers to randomly discarding some neurons during the training process of the neural network to reduce the complexity of the neural network and prevent overfitting. The implementation method of Dropout is very simple: in each iterative training, a certain number of neurons in each layer are randomly shielded with a certain probability, and the network formed by the remaining neurons is used to continue training.

Figure 1 is a schematic diagram of Dropout. The left side is a complete neural network, and the right side is the network structure after applying Dropout. After applying Dropout, the neurons marked with × will be deleted from the network so that they do not transmit signals to the subsequent layers. During the learning process, which neurons are discarded is randomly determined, so the model does not rely too much on certain neurons, which can inhibit overfitting to a certain extent.

  • Application example

At prediction time, all neurons transmit their signals, which can lead to a new problem: because some neurons were randomly dropped during training, the overall magnitude of the output data changes. For example, its L1 norm during training is smaller than when Dropout is not used, yet no neurons are dropped at prediction time, so the data distributions during training and prediction differ. To solve this, Paddle supports the following two modes (a sketch of both follows the list):

  • downscale_in_infer

During training, a fraction r of the neurons is randomly dropped and their signals are not passed on to subsequent layers; during prediction, all neurons pass their signals on, but the value of each neuron is multiplied by (1−r).

  • upscale_in_train

During training, a fraction r of the neurons is randomly dropped and their signals are not passed on, but the values of the retained neurons are divided by (1−r); during prediction, all neurons pass their signals on without any further processing.
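The two strategies can be summarized in a small NumPy sketch; this illustrates the math only and is not Paddle's internal implementation:

import numpy as np

def dropout(x, r, training, mode):
    if training:
        mask = (np.random.rand(*x.shape) >= r).astype(x.dtype)  # keep with prob 1-r
        if mode == 'upscale_in_train':
            return x * mask / (1 - r)     # scale kept values up during training
        return x * mask                   # downscale_in_infer: no scaling during training
    else:
        if mode == 'downscale_in_infer':
            return x * (1 - r)            # scale everything down at inference
        return x                          # upscale_in_train: identity at inference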

Taking the PaddlePaddle framework as an example: in the Dropout API, the mode parameter specifies which strategy is used.

paddle.nn.Dropout(p=0.5, axis=None, mode="upscale_in_train", name=None)

The main parameters are as follows:

  • p (float): The probability of setting an input element to 0, i.e. the drop probability; default 0.5. The drop probability applies to each element independently, not to the tensor as a whole. For example, for a matrix of 12 numbers, dropout with probability 0.5 does not necessarily produce exactly 6 zeros.

  • mode(str) : The implementation of the discarding method, there are two types of 'downscale_in_infer' and 'upscale_in_train', and the default is 'upscale_in_train'.

# Dropout demonstration
import paddle
import numpy as np

# Fix the random seed so that every run gives the same result
np.random.seed(100)
# Create data [N, C, H, W], typically the output of a convolutional layer
data1 = np.random.rand(2,3,3,3).astype('float32')
# Create data [N, K], typically the output of a fully connected layer
data2 = np.arange(1,13).reshape([-1, 3]).astype('float32')
# Apply dropout to the input data
x1 = paddle.to_tensor(data1)
# downscale_in_infer mode
drop11 = paddle.nn.Dropout(p = 0.5, mode = 'downscale_in_infer')
droped_train11 = drop11(x1)
# Switch to eval mode. In dynamic graph mode, eval() switches to evaluation mode, which disables dropout.
drop11.eval()
droped_eval11 = drop11(x1)
# upscale_in_train mode
drop12 = paddle.nn.Dropout(p = 0.5, mode = 'upscale_in_train')
droped_train12 = drop12(x1)
# Switch to eval mode
drop12.eval()
droped_eval12 = drop12(x1)

x2 = paddle.to_tensor(data2)
drop21 = paddle.nn.Dropout(p = 0.5, mode = 'downscale_in_infer')
droped_train21 = drop21(x2)
# Switch to eval mode
drop21.eval()
droped_eval21 = drop21(x2)
drop22 = paddle.nn.Dropout(p = 0.5, mode = 'upscale_in_train')
droped_train22 = drop22(x2)
# Switch to eval mode
drop22.eval()
droped_eval22 = drop22(x2)

print('x1 {}, \n droped_train11 \n {}, \n droped_eval11 \n {}'.format(data1, droped_train11.numpy(),  droped_eval11.numpy()))
print('x1 {}, \n droped_train12 \n {}, \n droped_eval12 \n {}'.format(data1, droped_train12.numpy(),  droped_eval12.numpy()))
print('x2 {}, \n droped_train21 \n {}, \n droped_eval21 \n {}'.format(data2, droped_train21.numpy(),  droped_eval21.numpy()))
print('x2 {}, \n droped_train22 \n {}, \n droped_eval22 \n {}'.format(data2, droped_train22.numpy(),  droped_eval22.numpy()))

The result of the program running is as follows:

x1 
 [[[[0.54340494 0.2783694  0.4245176] [0.84477615 0.00471886 0.12156912] [0.67074907 0.82585275 0.13670659]]
 		 [[0.5750933  0.89132196 0.20920213] [0.18532822 0.10837689 0.21969749] [0.9786238  0.8116832  0.17194101]]
		 [[0.81622475 0.27407375 0.4317042 ] [0.9400298  0.81764936 0.33611196] [0.17541045 0.37283206 0.00568851]]]
		[[[0.25242636 0.7956625  0.01525497] [0.5988434  0.6038045  0.10514768] [0.38194343 0.03647606 0.89041156]]
		 [[0.98092085 0.05994199 0.89054596] [0.5769015  0.7424797  0.63018394] [0.5818422  0.02043913 0.21002658]]
		 [[0.5446849  0.76911515 0.25069523] [0.2858957  0.8523951  0.9750065 ] [0.8848533 0.35950786 0.59885895]]]] 
 droped_train11 
 [[[[0.         0.2783694  0.4245176 ] [0.         0.00471886 0.        ] [0.         0.82585275 0.        ]]
	 [[0.         0.         0.20920213] [0.18532822 0.10837689 0.        ] [0.9786238  0.         0.17194101]]
	 [[0.81622475 0.27407375 0.        ] [0.         0.         0.33611196] [0.17541045 0.37283206 0.00568851]]]
	[[[0.25242636 0.         0.        ] [0.5988434  0.6038045  0.10514768] [0.38194343 0.         0.89041156]]
	 [[0.98092085 0.         0.        ] [0.5769015  0.7424797  0.        ] [0.5818422  0.02043913 0.        ]]
	 [[0.5446849  0.76911515 0.        ] [0.         0.8523951  0.9750065 ] [0.         0.35950786 0.59885895]]]], 
 droped_eval11 
 [[[[0.27170247 0.1391847  0.2122588 ] [0.42238808 0.00235943 0.06078456] [0.33537453 0.41292638 0.0683533 ]]
	 [[0.28754666 0.44566098 0.10460106] [0.09266411 0.05418845 0.10984875] [0.4893119  0.4058416  0.08597051]]
	 [[0.40811238 0.13703687 0.2158521 ] [0.4700149  0.40882468 0.16805598] [0.08770522 0.18641603 0.00284425]]]
	[[[0.12621318 0.39783126 0.00762749] [0.2994217  0.30190226 0.05257384] [0.19097172 0.01823803 0.44520578]]
	 [[0.49046043 0.02997099 0.44527298] [0.28845075 0.37123984 0.31509197] [0.2909211  0.01021957 0.10501329]]
	 [[0.27234244 0.38455757 0.12534761] [0.14294785 0.42619756 0.48750326] [0.44242665 0.17975393 0.29942948]]]]
 x1
 [[[[0.54340494 0.2783694  0.4245176 ] [0.84477615 0.00471886 0.12156912] [0.67074907 0.82585275 0.13670659]]
   [[0.5750933  0.89132196 0.20920213] [0.18532822 0.10837689 0.21969749] [0.9786238  0.8116832  0.17194101]]
   [[0.81622475 0.27407375 0.4317042 ] [0.9400298  0.81764936 0.33611196] [0.17541045 0.37283206 0.00568851]]]
  [[[0.25242636 0.7956625  0.01525497] [0.5988434  0.6038045  0.10514768] [0.38194343 0.03647606 0.89041156]]
   [[0.98092085 0.05994199 0.89054596] [0.5769015  0.7424797  0.63018394] [0.5818422  0.02043913 0.21002658]]
   [[0.5446849  0.76911515 0.25069523] [0.2858957  0.8523951  0.9750065 ] [0.8848533  0.35950786 0.59885895]]]]
 droped_train12 
 [[[[0.         0.5567388  0.8490352 ] [0.         0.         0.24313824] [0.         0.         0.        ]]
   [[0.         0.         0.41840425] [0.37065643 0.         0.        ] [1.9572476  0.         0.        ]]
   [[0.         0.         0.        ] [0.         1.6352987  0.6722239 ] [0.3508209  0.         0.01137702]]]
  [[[0.         1.591325   0.03050994] [1.1976868  1.207609   0.        ] [0.76388687 0.         1.7808231 ]]
   [[0.         0.         0.        ] [1.153803   0.         0.        ] [1.1636844  0.         0.42005315]]
   [[1.0893698  0.         0.50139046] [0.5717914  1.7047902  0.        ] [0.         0.7190157  0.        ]]]]
 droped_eval12 
 [[[[0.54340494 0.2783694  0.4245176 ] [0.84477615 0.00471886 0.12156912] [0.67074907 0.82585275 0.13670659]]
   [[0.5750933  0.89132196 0.20920213] [0.18532822 0.10837689 0.21969749] [0.9786238  0.8116832  0.17194101]]
   [[0.81622475 0.27407375 0.4317042 ] [0.9400298  0.81764936 0.33611196] [0.17541045 0.37283206 0.00568851]]]
  [[[0.25242636 0.7956625  0.01525497] [0.5988434  0.6038045  0.10514768] [0.38194343 0.03647606 0.89041156]]
   [[0.98092085 0.05994199 0.89054596] [0.5769015  0.7424797  0.63018394] [0.5818422  0.02043913 0.21002658]]
   [[0.5446849  0.76911515 0.25069523] [0.2858957  0.8523951  0.9750065 ] [0.8848533  0.35950786 0.59885895]]]]
 x2 
 [[ 1.  2.  3.] [ 4.  5.  6.] [ 7.  8.  9.] [10. 11. 12.]], 
 droped_train21 
 [[ 1.  2.  3.] [ 4.  5.  6.] [ 0.  0.  9.] [ 0. 11.  0.]]
 droped_eval21 
 [[0.5 1.  1.5] [2.  2.5 3. ] [3.5 4.  4.5] [5.  5.5 6. ]]
 x2 
 [[ 1.  2.  3.] [ 4.  5.  6.] [ 7.  8.  9.] [10. 11. 12.]]
 droped_train22 
 [[ 2.  0.  6.] [ 0. 10.  0.] [14. 16. 18.] [ 0. 22. 24.]]
 droped_eval22 
 [[ 1.  2.  3.] [ 4.  5.  6.] [ 7.  8.  9.] [10. 11. 12.]]

From the output above we can see that after dropout some elements of the tensor become 0. This is exactly what dropout does: by randomly setting elements of the input data to 0, it eliminates and weakens the co-adaptation between neuron nodes and enhances the model's generalization ability.

In the program, we set the random dropout ratio to 0.5, use two different strategies for dropout, and print the output of the network layer in training and evaluation mode respectively. Among them, data x1 simulates the output data of the convolutional layer, and data x2 simulates the input data of the fully connected layer. Normally, we will add dropout to the fully connected layer, so here we analyze the case where the output of the previous layer is x2, and the case where the output of the previous layer is x1 is basically similar.

x2 is defined as follows:

$$x_2=\begin{bmatrix}1&2&3\\ 4&5&6\\ 7&8&9\\ 10&11&12\end{bmatrix}$$

When the mode of the paddle.nn.Dropout API is set to 'downscale_in_infer', we observe that in training mode some elements become 0 while the other elements keep their values. At this point $x_{2\_\text{train}}$ is:

$$x_{2\_\text{train}}=\begin{bmatrix}1&2&3\\ 4&5&6\\ 0&0&9\\ 0&11&0\end{bmatrix}$$

In evaluation mode, all elements are retained, but every value is scaled by the coefficient (1−r), i.e. (1−0.5)=0.5. At this point $x_{2\_\text{eval}}$ is:

$$x_{2\_\text{eval}}=\begin{bmatrix}0.5&1&1.5\\ 2&2.5&3\\ 3.5&4&4.5\\ 5&5.5&6\end{bmatrix}$$

And when the mode of the paddle.nn.Dropout API is set to 'upscale_in_train', we observe that in training mode some elements become 0 while the remaining values are scaled by 1/(1−r), i.e. 1/(1−0.5)=2. At this point $x_{2\_\text{train}}$ is:

$$x_{2\_\text{train}}=\begin{bmatrix}2&0&6\\ 0&10&0\\ 14&16&18\\ 0&22&24\end{bmatrix}$$

In evaluation mode, all elements are retained and unchanged. At this point $x_{2\_\text{eval}}$ is:

$$x_{2\_\text{eval}}=\begin{bmatrix}1&2&3\\ 4&5&6\\ 7&8&9\\ 10&11&12\end{bmatrix}$$

2.5 DropConnect

DropConnect is another regularization strategy for reducing overfitting, published at ICML 2013, and is a generalization of Dropout. DropConnect sets a randomly selected subset of the network's weights to zero, instead of zeroing a randomly selected subset of each layer's activations as Dropout does. Both DropConnect and Dropout can improve generalization, since each unit receives input from a random subset of the previous layer's units. DropConnect resembles Dropout in that it introduces sparsity into the model, except that the sparsity is in the weights rather than in a layer's output vectors. For a DropConnect layer, the output can be written as:

$$r=a((M*W)v)$$

where r is the layer output, v is the layer input, W are the weight parameters, and M is a binary matrix encoding the connection information, with $M_{ij}\sim \mathrm{Bernoulli}(p)$. During training, each element of M is sampled independently for each example, essentially instantiating a different connectivity pattern for each example. In addition, the biases are also masked during training.
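A minimal NumPy sketch of this forward pass, using tanh as a stand-in for the activation a(·); the shapes and the keep probability are assumptions:

import numpy as np

def dropconnect_forward(W, v, p):
    M = (np.random.rand(*W.shape) < p).astype(W.dtype)  # M_ij ~ Bernoulli(p)
    return np.tanh((M * W) @ v)                         # mask the weights, then activate

np.random.seed(0)
W = np.random.randn(4, 8)   # weight matrix of a fully connected layer (assumed)
v = np.random.randn(8)      # layer input (assumed)
r = dropconnect_forward(W, v, p=0.5)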

2.5.1 The difference between Dropout and DropConnect

  • Dropout randomly zeroes the outputs of hidden-layer nodes; it acts on the outputs.

  • DropConnect zeroes each input weight connected to a node with probability 1−p; it acts on the inputs (the weights).

2.5.2 Training of DropConnect

When using DropConnect, a mask matrix M (with elements 0 or 1) must be randomly sampled for each example and each epoch. The training procedure is as follows:

DropConnect can only be used on fully connected layers (the same as Dropout). If the network contains convolutions, the hidden nodes produced by convolution do not use DropConnect; this is why the procedure above has an "extract features" step, which covers the propagation through the non-fully-connected layers at the front of the network, such as convolution + pooling.

2.5.3 Inference with DropConnect

At inference time, a Dropout network scales all weights W by the coefficient p (the authors show this approximation is problematic in some cases). DropConnect inference instead samples each input to a hidden node (each node is connected to several inputs) from a Gaussian distribution whose mean and variance depend on the probability p. The Gaussian is:

$$u\sim N\left(pWv,\ p(1-p)(W*W)(v*v)\right)$$

The inference procedure is as follows:

As can be seen from the above procedure, every weight must be sampled at inference time, so DropConnect inference is slower.
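A hedged sketch of this sampling-based inference, directly following the Gaussian above; the number of samples and the tanh activation are assumptions:

import numpy as np

def dropconnect_infer(W, v, p, n_samples=100):
    mean = p * (W @ v)                            # E[(M*W)v]
    var = p * (1 - p) * ((W * W) @ (v * v))       # Var[(M*W)v], elementwise
    outs = [np.tanh(np.random.normal(mean, np.sqrt(var))) for _ in range(n_samples)]
    return np.mean(outs, axis=0)                  # average the sampled activations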

According to the authors, both Dropout and DropConnect behave like model averaging: Dropout averages $2^{|m|}$ models while DropConnect averages $2^{|M|}$ models (m is a vector, M is a matrix, and $|\cdot|$ denotes the number of elements). From this point of view, DropConnect has the stronger averaging capacity, since $|M|>|m|$.

2.6 Early stopping

Early stopping limits the number of training iterations used to minimize the cost function. It is often used to prevent poor generalization when an over-expressive model is trained too long. Too few iterations and the algorithm tends to underfit (low variance, high bias); too many and it tends to overfit (high variance, low bias). Early stopping resolves this by determining the number of iterations automatically, without setting a specific value by hand.

Early stopping is a cross-validation strategy: part of the training set is held out as a validation set, and training stops as soon as the model's performance on the validation set starts to degrade.

In the figure above, we stop training at the dashed line, because after that point the model starts to overfit the training data.
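To show the mechanics, here is a generic early-stopping loop in Python; train_one_epoch and evaluate are placeholder stubs, and the patience value is arbitrary:

import random

def train_one_epoch(): pass                      # placeholder for real training
def evaluate(): return random.random()           # placeholder validation loss

best_val_loss = float('inf')
patience, bad_epochs, max_epochs = 5, 0, 100

for epoch in range(max_epochs):
    train_one_epoch()                            # one pass over the training set
    val_loss = evaluate()                        # loss on the held-out validation set
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0  # improvement: reset the counter
        # in practice, save a checkpoint of the best model here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:               # no improvement for `patience` epochs
            print(f'early stop at epoch {epoch}')
            break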
