Gradient descent and recovering training data from gradient information

The goal is to connect the mathematical notion of a gradient with the way gradients are used in deep learning.

What is a gradient?

A gradient is a vector: it has both a magnitude and a direction.
Magnitude: the norm of the gradient equals the maximum rate of change of the function at that point.
Direction: the direction along which that maximum rate of change is attained.

What is it used for?
It can be used to quickly find extrema of a multivariate function; in deep learning it is typically used to find the minimum of the loss function.
For minimization problems, the first idea that usually comes to mind is to set the derivative to zero and solve. However, that is not easy when the function is very complex. What a computer can do instead is use its strong computing power to evaluate the function many times and approach the solution step by step. This trial process was summarized by earlier researchers as the Delta rule: a heuristic algorithm whose core idea is to use gradient descent to drive the target to converge near the optimal solution.

In practice, the algorithm adjusts each weight $w_i$ step by step and gradually drives the loss function $l$ toward its minimum.

A question arises: why use $\frac{\partial l}{\partial w_i}$ as the step? More generally, for the gradient $\mathrm{grad}\ f = (\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots)$, why does this vector of partial derivatives have the property that its norm is the maximum rate of change and its direction is the direction of that maximum change?

Before answering this, we first need to recall what the rate of change is.

Take a one-variable function $y = f(x)$ as an example.
The average rate of change of the function: $\frac{\triangle y}{\triangle x} = \frac{f(x_0 + \triangle x) - f(x_0)}{\triangle x}$
The rate of change of the function at a point: $\lim_{\triangle x \to 0} \frac{f(x_0 + \triangle x) - f(x_0)}{\triangle x} = f'(x_0)$

Rate of change: the limit of the ratio of the increment of the function to the increment of the independent variable along a given direction (for a one-variable function, that direction is the x-axis).

Now extend this to a two-variable function $z = f(x, y)$.
By the definition above, we need a direction for the independent variable; here the direction is determined jointly by $x$ and $y$. Suppose the unit vector in this direction is $\vec{e_l} = (\cos\alpha, \cos\beta)$ and the increment along this direction is $t$, so the increment of the independent variable is $(t\cos\alpha, t\cos\beta)$. Then the rate of change is
$$\lim_{t \to 0^{+}} \frac{f(x_0 + t\cos\alpha, y_0 + t\cos\beta) - f(x_0, y_0)}{t} = f_x(x_0, y_0)\cos\alpha + f_y(x_0, y_0)\cos\beta$$

The derivation process:
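The original post shows this derivation as an image; the following is a sketch of the standard argument, assuming $f$ is differentiable at $(x_0, y_0)$:
$$f(x_0 + \triangle x, y_0 + \triangle y) - f(x_0, y_0) = f_x(x_0, y_0)\triangle x + f_y(x_0, y_0)\triangle y + o\!\left(\sqrt{\triangle x^2 + \triangle y^2}\right)$$
Substituting $\triangle x = t\cos\alpha$, $\triangle y = t\cos\beta$ and noting that $\sqrt{\triangle x^2 + \triangle y^2} = t$ (because $\cos^2\alpha + \cos^2\beta = 1$ for a unit vector), dividing both sides by $t$ and letting $t \to 0^{+}$ gives
$$\lim_{t \to 0^{+}} \frac{f(x_0 + t\cos\alpha, y_0 + t\cos\beta) - f(x_0, y_0)}{t} = f_x(x_0, y_0)\cos\alpha + f_y(x_0, y_0)\cos\beta$$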

In which direction $\vec{e_l} = (\cos\alpha, \cos\beta)$ is the rate of change maximal?
Let us express the rate of change as the inner product of two vectors, where $\vec{g} = (f_x(x_0, y_0), f_y(x_0, y_0))$ and $\vec{e_l} = (\cos\alpha, \cos\beta)$:
$$\lim_{t \to 0^{+}} \frac{f(x_0 + t\cos\alpha, y_0 + t\cos\beta) - f(x_0, y_0)}{t} = f_x(x_0, y_0)\cos\alpha + f_y(x_0, y_0)\cos\beta = \vec{g} \cdot \vec{e_l} = \lvert \vec{g} \rvert \, \lvert \vec{e_l} \rvert \cos\theta$$
where $\theta$ is the angle between $\vec{g}$ and $\vec{e_l}$. Once the coordinates of the point are fixed, $\vec{g} = (f_x(x_0, y_0), f_y(x_0, y_0))$ is determined, so $\lvert \vec{g} \rvert$ is a fixed value, and the unit vector satisfies $\lvert \vec{e_l} \rvert = 1$. The rate of change is therefore maximal when $\cos\theta = 1$, i.e. when $\vec{g} \parallel \vec{e_l}$.
That is, the rate of change is largest when the direction is that of $\vec{g} = (f_x(x_0, y_0), f_y(x_0, y_0))$.
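As a concrete check (a made-up example, not from the original post), take $f(x, y) = x^2 + y^2$ at the point $(1, 2)$:
$$\vec{g} = (f_x, f_y)\big|_{(1,2)} = (2, 4), \qquad \lvert \vec{g} \rvert = \sqrt{20} \approx 4.47$$
Along the x-axis direction $\vec{e_l} = (1, 0)$ the rate of change is $2\cdot 1 + 4\cdot 0 = 2$, while along $\vec{e_l} = (1, 2)/\sqrt{5}$, the direction of $\vec{g}$, it is $(2\cdot 1 + 4\cdot 2)/\sqrt{5} = \sqrt{20}$, the maximum possible value.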

This generalizes to higher-dimensional functions.
For a function $f(a, b, c, \ldots)$, the vector $(\frac{\partial f}{\partial a}, \frac{\partial f}{\partial b}, \frac{\partial f}{\partial c}, \ldots)$ is the gradient $\mathrm{grad}\ f$, whose norm is the maximum rate of change and whose direction is the direction of that maximum change:
$$\mathrm{grad}\ f = \left(\frac{\partial f}{\partial a}, \frac{\partial f}{\partial b}, \frac{\partial f}{\partial c}, \ldots\right)$$
Along the direction of $\mathrm{grad}\ f$, the value of $f(a, b, c, \ldots)$ changes fastest.
Taking the point $(a_0, b_0, c_0, \ldots)$ as an example, the gradient at that point is $\left(\frac{\partial f}{\partial a}, \frac{\partial f}{\partial b}, \frac{\partial f}{\partial c}, \ldots\right)\big|_{(a_0, b_0, c_0, \ldots)}$, so moving from $(a_0, b_0, c_0, \ldots)$ to $(a_0 + t\frac{\partial f}{\partial a}, b_0 + t\frac{\partial f}{\partial b}, c_0 + t\frac{\partial f}{\partial c}, \ldots)$ is moving in the direction in which the function value $f$ changes the most.

In the gradient descent algorithm, the weight update works the same way. We want to find the minimum of the loss function $l$, so each weight is updated step by step as $w_i \leftarrow w_i - \eta \frac{\partial l}{\partial w_i}$, using $\frac{\partial l}{\partial w_i}$, scaled by the learning rate $\eta$, as the step.
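A minimal sketch of this update rule in code; the quadratic toy loss, data sizes, learning rate, and iteration count below are illustrative assumptions, not from the post:

```python
import numpy as np

# Toy loss l(w) = ||X w - y||^2 / n, minimized by repeatedly stepping
# each weight w_i against its partial derivative dl/dw_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)      # initial weights
eta = 0.1            # learning rate
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # vector of dl/dw_i
    w -= eta * grad                         # w_i <- w_i - eta * dl/dw_i
print(w)             # approaches true_w, where the loss is minimal
```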

Some information hidden in the gradient

In the paper "Soteria: Provable Defense against Privacy Leakage in Federated Learning from Representation Perspective", the authors explore further what information is hidden in the gradient $\frac{\partial l}{\partial w_i}$. This implicit information may allow an eavesdropper who intercepts the gradient to recover the original data and its label.

(The gradient equations (1) and (2) were shown as images in the original post and are omitted here.)

If the batches are divided exactly along class boundaries, i.e. each batch corresponds to one class, equation (1) reduces to equation (2).

The batch division assumed in equations (1) and (2) is illustrated in a figure in the original post: all training samples within a batch share the same class label.

The authors mainly analyze the implicit information carried by the fully connected (linear) layer.

The linear layer as the last layer

When the linear layer is the last layer, $\frac{\partial l}{\partial w_i}$ can be split further.

In the layer's forward pass (shown as a figure in the original post), $\mathbf{r}$ is the input of the layer, $\mathbf{b}$ is its output, and $\mathbf{y}$ is the normalization of $\mathbf{b}$ (the entries of $\mathbf{y}$ sum to 1), which is then compared with the true label of class $c$ to compute the loss function $l$.

The weight $W$ is a matrix; we now work out $\frac{\partial l}{\partial W}$.

$$W = [W_1, W_2, \ldots, W_k], \qquad W_1 = [w_{11}, w_{21}, \ldots, w_{n1}]^T$$

$$\frac{\partial l}{\partial w_{11}} = \frac{\partial l}{\partial b_1}\frac{\partial b_1}{\partial w_{11}} = \frac{\partial l}{\partial b_1}\frac{\partial (w_{11}r_1 + w_{12}r_2 + \ldots)}{\partial w_{11}} = \frac{\partial l}{\partial b_1}r_1$$
$$\frac{\partial l}{\partial w_{21}} = \frac{\partial l}{\partial b_2}\frac{\partial b_2}{\partial w_{21}} = \frac{\partial l}{\partial b_2}\frac{\partial (w_{21}r_1 + w_{22}r_2 + \ldots)}{\partial w_{21}} = \frac{\partial l}{\partial b_2}r_1$$
$$\vdots$$
$$\frac{\partial l}{\partial w_{n1}} = \frac{\partial l}{\partial b_n}\frac{\partial b_n}{\partial w_{n1}} = \frac{\partial l}{\partial b_n}r_1$$

Combining these:
$$\frac{\partial l}{\partial W_1} = \begin{bmatrix} \frac{\partial l}{\partial b_1}r_1 \\ \frac{\partial l}{\partial b_2}r_1 \\ \vdots \\ \frac{\partial l}{\partial b_n}r_1 \end{bmatrix} = \begin{bmatrix} \frac{\partial l}{\partial b_1} \\ \frac{\partial l}{\partial b_2} \\ \vdots \\ \frac{\partial l}{\partial b_n} \end{bmatrix} r_1$$
Generalizing to an arbitrary column $W_i$:
$$\frac{\partial l}{\partial W_i} = \begin{bmatrix} \frac{\partial l}{\partial b_1}r_i \\ \frac{\partial l}{\partial b_2}r_i \\ \vdots \\ \frac{\partial l}{\partial b_n}r_i \end{bmatrix} = \begin{bmatrix} \frac{\partial l}{\partial b_1} \\ \frac{\partial l}{\partial b_2} \\ \vdots \\ \frac{\partial l}{\partial b_n} \end{bmatrix} r_i = \frac{\partial l}{\partial \mathbf{b}}\, r_i$$
Assembling these columns into a matrix gives the following formula:
$$\frac{\partial l}{\partial W} = \left[\frac{\partial l}{\partial W_1}, \frac{\partial l}{\partial W_2}, \ldots, \frac{\partial l}{\partial W_k}\right] = \begin{bmatrix} \frac{\partial l}{\partial b_1} \\ \frac{\partial l}{\partial b_2} \\ \vdots \\ \frac{\partial l}{\partial b_n} \end{bmatrix} [r_1, r_2, \ldots, r_k] = \frac{\partial l}{\partial \mathbf{b}}\,\mathbf{r}^T$$
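This factorization $\frac{\partial l}{\partial W} = \frac{\partial l}{\partial \mathbf{b}}\,\mathbf{r}^T$ can be checked numerically. The sketch below uses PyTorch with illustrative sizes and random data (my own example, not the paper's code); it relies on the cross-entropy gradient derived in the next paragraphs:

```python
import torch

# Last linear layer: b = W r, loss = cross-entropy of b against class c.
torch.manual_seed(0)
n, k, c = 5, 4, 2                       # n classes, k input features, true class c
r = torch.randn(k)                      # input of the last layer
W = torch.randn(n, k, requires_grad=True)

b = W @ r                               # layer output (logits)
loss = torch.nn.functional.cross_entropy(b.unsqueeze(0), torch.tensor([c]))
loss.backward()

dl_db = torch.softmax(b.detach(), dim=0)
dl_db[c] -= 1.0                         # dl/db: the softmax vector y, with y_c - 1 at row c
print(torch.allclose(W.grad, torch.outer(dl_db, r)))   # True
```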

If the loss function $l$ is the cross-entropy loss (for a sample of class $c$):
$$loss_c = -\log\frac{e^{b_c}}{\sum_{k=1}^{n}e^{b_k}}$$

Then $\frac{\partial l}{\partial \mathbf{b}}$ in the formula above can be written out explicitly. When the label of the data is class $c$:
$$\frac{\partial l_c}{\partial b_1} = \frac{\partial \left(- \log \frac{e^{b_c}}{e^{b_1} + \ldots + e^{b_c} + \ldots + e^{b_n}}\right)}{\partial b_1} = \frac{\partial \left(- \log e^{b_c} + \log (e^{b_1} + \ldots + e^{b_c} + \ldots + e^{b_n})\right)}{\partial b_1} = \frac{e^{b_1}}{e^{b_1} + \ldots + e^{b_c} + \ldots + e^{b_n}} = y_1$$
$$\frac{\partial l_c}{\partial b_c} = \frac{\partial \left(- \log \frac{e^{b_c}}{e^{b_1} + \ldots + e^{b_c} + \ldots + e^{b_n}}\right)}{\partial b_c} = \frac{\partial \left(- \log e^{b_c} + \log (e^{b_1} + \ldots + e^{b_c} + \ldots + e^{b_n})\right)}{\partial b_c} = -1 + \frac{e^{b_c}}{e^{b_1} + \ldots + e^{b_c} + \ldots + e^{b_n}} = y_c - 1$$
$$\frac{\partial l_c}{\partial \mathbf{b}} = \begin{bmatrix} \frac{\partial l_c}{\partial b_1} \\ \vdots \\ \frac{\partial l_c}{\partial b_c} \\ \vdots \\ \frac{\partial l_c}{\partial b_n} \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_c - 1 \\ \vdots \\ y_n \end{bmatrix}$$
Moreover, $|y_c - 1| = |y_1| + \ldots + |y_{c-1}| + |y_{c+1}| + \ldots + |y_n|$,

because $\sum_{i} y_i = 1$ and every $y_i \ge 0$, so $1 - y_c = \sum_{i \neq c} y_i$.
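As a made-up numerical illustration: with $n = 3$, $\mathbf{b} = (1, 2, 3)$ and true class $c = 3$, the softmax gives $\mathbf{y} \approx (0.090, 0.245, 0.665)$, so $\frac{\partial l_c}{\partial \mathbf{b}} \approx (0.090, 0.245, -0.335)$. The entry for class $c$ has the largest magnitude, and that magnitude equals the sum of the magnitudes of the other entries.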

Observing $\frac{\partial l_c}{\partial \mathbf{b}}$, we see that when the training sample belongs to class $c$, row $c$ has the largest magnitude. Since every row is multiplied by the same row vector $\mathbf{r}^T$ in $\frac{\partial l}{\partial W} = \frac{\partial l}{\partial \mathbf{b}}\,\mathbf{r}^T$, row $c$ of the gradient still has the largest magnitude. An attacker who eavesdrops on the gradient can therefore pick the row with the largest magnitude; that row equals $(y_c - 1)\mathbf{r}^T$, and since $(y_c - 1)$ is a scalar, $(y_c - 1)\mathbf{r}^T$ can be viewed as $\mathbf{r}^T$ up to a scale factor. In this way the attacker obtains the input $\mathbf{r}$ of this layer.
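A sketch of this recovery step with illustrative toy sizes (my own example, not the paper's implementation): given only the gradient of the last linear layer, pick the row with the largest norm; its index reveals the label $c$, and the row itself is $\mathbf{r}$ up to an unknown scale.

```python
import torch

torch.manual_seed(0)
n, k, c = 5, 4, 2
r = torch.randn(k)                                  # layer input the attacker wants
W = torch.randn(n, k, requires_grad=True)
loss = torch.nn.functional.cross_entropy((W @ r).unsqueeze(0), torch.tensor([c]))
loss.backward()

grad = W.grad                                       # what an eavesdropper observes
row = int(grad.norm(dim=1).argmax())                # row with largest magnitude -> label c
recovered = grad[row]                               # equals (y_c - 1) * r, i.e. r up to scale
scale = (recovered @ r) / (r @ r)                   # the unknown scale factor (y_c - 1)
print(row == c, torch.allclose(recovered, scale * r))   # True True
```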

Conclusion: from the gradient $\nabla W$ alone, an attacker can obtain the input data $\mathbf{r}$ of this layer and also learn the class $c$ to which the sample belongs.

The linear layer as an intermediate layer

When the linear layer is used as an intermediate layer, it performs the data transformation shown in the original post's figures; simplified, it is an affine map followed by an activation, of the form used in the three-layer example below.

The activation function $\sigma$ plays a very important role: its main purpose is to add a nonlinear operation to the hidden layers and the output layer.

Take a three-layer fully connected neural network as an example (figure omitted):
$$\begin{aligned} z_1 &= W_1 x + B_1 \\ z_2 &= W_2\,\sigma(z_1) + B_2 \\ z_3 &= W_3\,\sigma(z_2) + B_3 \end{aligned}$$
Then every $\frac{\partial l}{\partial W_i}$ can be expressed as follows.

Note that $W_i$ here is different from the previous section: there it denoted a column of a weight matrix (a vector), while here it denotes the weight matrix of layer $i$.

$$\begin{aligned}
\frac{\partial l}{\partial W_1} &= \frac{\partial l}{\partial z_3}\frac{\partial z_3}{\partial W_1} = \frac{\partial l}{\partial z_3}\left(\frac{\partial(W_3\sigma(z_2) + B_3)}{\partial \sigma(z_2)}\,\frac{\partial \sigma(z_2)}{\partial z_2}\,\frac{\partial(W_2\sigma(z_1) + B_2)}{\partial \sigma(z_1)}\,\frac{\partial\sigma(z_1)}{\partial z_1}\,\frac{\partial(W_1 x + B_1)}{\partial W_1}\right) = \frac{\partial l}{\partial z_3}\bigl(W_3 \cdot \sigma'(z_2) \cdot W_2 \cdot \sigma'(z_1) \cdot x\bigr) \\
\frac{\partial l}{\partial W_2} &= \frac{\partial l}{\partial z_3}\frac{\partial z_3}{\partial W_2} = \frac{\partial l}{\partial z_3}\left(\frac{\partial(W_3\sigma(z_2) + B_3)}{\partial \sigma(z_2)}\,\frac{\partial \sigma(z_2)}{\partial z_2}\,\frac{\partial(W_2\sigma(z_1) + B_2)}{\partial W_2}\right) = \frac{\partial l}{\partial z_3}\bigl(W_3 \cdot \sigma'(z_2) \cdot \sigma(z_1)\bigr) \\
\frac{\partial l}{\partial W_3} &= \frac{\partial l}{\partial z_3}\frac{\partial z_3}{\partial W_3} = \frac{\partial l}{\partial z_3}\frac{\partial(W_3\sigma(z_2) + B_3)}{\partial W_3} = \frac{\partial l}{\partial z_3}\,\sigma(z_2)
\end{aligned}$$
Rearranging:
$$\begin{aligned}
\frac{\partial l}{\partial W_3} &= \frac{\partial l}{\partial z_3}\,\sigma(z_2) \\
\frac{\partial l}{\partial W_2} &= \frac{\partial l}{\partial z_3}\bigl(W_3 \cdot \sigma'(z_2) \cdot \sigma(z_1)\bigr) = \frac{\partial l}{\partial z_3}\left(\frac{\partial z_3}{\partial z'_2} \cdot \sigma'(z_2) \cdot \sigma(z_1)\right) \\
\frac{\partial l}{\partial W_1} &= \frac{\partial l}{\partial z_3}\bigl(W_3 \cdot \sigma'(z_2) \cdot W_2 \cdot \sigma'(z_1) \cdot x\bigr) = \frac{\partial l}{\partial z_3}\left(\frac{\partial z_3}{\partial z'_2} \cdot \sigma'(z_2) \cdot \frac{\partial z_2}{\partial z'_1} \cdot \sigma'(z_1) \cdot x\right)
\end{aligned}$$
where $z'_i$ denotes $\sigma(z_i)$.
If all terms in an equation are known except one, that remaining term can be solved for.
In the original post's figure, quantities marked in blue are known and those marked in red can be solved for. Working backwards through the formulas above, one can successively solve for the unknowns and eventually recover the network's input data $x$.
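The same outer-product structure appears at every linear layer, which is what makes this backward recovery possible. A small PyTorch check (sigmoid activation, illustrative sizes; my own sketch, not the paper's code) showing that $\frac{\partial l}{\partial W_3} = \frac{\partial l}{\partial z_3}\,\sigma(z_2)^T$, i.e. the last layer's gradient already exposes the hidden representation $\sigma(z_2)$:

```python
import torch

torch.manual_seed(0)
d_in, d_h, n_cls = 6, 4, 3
x = torch.randn(d_in)
W1 = torch.randn(d_h, d_in, requires_grad=True)
B1 = torch.randn(d_h, requires_grad=True)
W2 = torch.randn(d_h, d_h, requires_grad=True)
B2 = torch.randn(d_h, requires_grad=True)
W3 = torch.randn(n_cls, d_h, requires_grad=True)
B3 = torch.randn(n_cls, requires_grad=True)
sigma = torch.sigmoid

z1 = W1 @ x + B1
z2 = W2 @ sigma(z1) + B2
z3 = W3 @ sigma(z2) + B3
z3.retain_grad()                        # keep dl/dz3 for the comparison below
loss = torch.nn.functional.cross_entropy(z3.unsqueeze(0), torch.tensor([1]))
loss.backward()

# dl/dW3 factors into the error at z3 times the layer input sigma(z2)
print(torch.allclose(W3.grad, torch.outer(z3.grad, sigma(z2).detach())))   # True
```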

Because most activation functions preserve a similar structure and sparsity between input and output, the paper uses $z$ to approximate $z'$ in this derivation.

Therefore, an attacker who eavesdrops on the gradient information may well be able to reconstruct the original training data through computation. To protect the original data safely, the training data or the intermediate quantities of training must be perturbed to some extent.

Source: blog.csdn.net/x_fengmo/article/details/131069059