GAN: Generative Adversarial Networks, and the difference between forward propagation and backpropagation

Table of contents

GAN: Generative Adversarial Networks

Loss function

When the discriminator fluctuates greatly, adjust the discriminator first

 Generated samples converge toward real samples: real and fake become difficult to distinguish

 Text-to-image

 Avatar to emoticon pack

 Avatar to 3D

Backpropagation

1. Forward propagation (forward)

2. Backward propagation (backward): derive the weight-update formulas and find the optimal path

The Four Fundamental Equations of Backpropagation

Chain rule: error summation

 Gradient descent: weight parameter update

GAN: Generative Adversarial Networks

 

 

 

Loss function

 

When the discriminator fluctuates greatly, adjust the discriminator first

 Generated samples converge toward real samples: real and fake become difficult to distinguish
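The state where real and fake become difficult to distinguish can be made concrete with the standard GAN losses. The sketch below is a minimal illustration, assuming the usual minimax value function with a non-saturating generator loss; the function name and the toy probability values are my own, not from the text:

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-12):
    """Standard GAN losses for one batch.

    d_real: discriminator outputs D(x) on real samples, probabilities in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    # Discriminator maximizes log D(x) + log(1 - D(G(z))); as a loss we minimize the negative.
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    # Generator (non-saturating form) minimizes -log D(G(z)).
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss

# At the equilibrium described above, D(x) = D(G(z)) = 0.5: real and fake
# are indistinguishable, and the discriminator loss settles at 2*log(2).
d_loss, g_loss = gan_losses([0.5, 0.5], [0.5, 0.5])
```

When the discriminator is winning, d_fake is near 0 and g_loss blows up, which matches the advice above to rebalance training toward whichever network is lagging.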

 

 

 Image dataset generation

 

 Text-to-image

 Avatar to emoticon pack

 Avatar to 3D

 Bayesian: Posterior

 

 

 

Backpropagation

 

 


Forward propagation: the input enters through the input layer, flows forward through the network, and a result comes out of the output layer. As shown in the figure, the inputs 1, x1, x2, …, xn are multiplied by their weights, the bias b0 is added, the terms are summed, and the sum is passed through the activation function to produce the result. This process is called forward propagation.
Backpropagation: the process of updating the weights backward from the output. Specifically, the network produces an output, the difference between this output and the target data is computed, and this error is propagated back along the reverse of the forward computation. The weights are updated according to the error and the learning rate.

1. Forward propagation (forward)

Put simply, the output of each layer is used as the input of the next layer, and we compute layer by layer until we reach the output layer. Let's describe this mathematically:

Weights

Bias

Let w_{jk}^{(l)} be the weight from the k-th neuron in layer l−1 to the j-th neuron in layer l, b_j^{(l)} the bias of the j-th neuron in layer l, and a_j^{(l)} the activation value of the j-th neuron in layer l (the output of the activation function, which provides the model's non-linearity).

For the Layer 2 outputs a_1^{(2)}, a_2^{(2)}, a_3^{(2)}:

a_1^{(2)} = σ(z_1^{(2)}) = σ(w_{11}^{(2)} x_1 + w_{12}^{(2)} x_2 + w_{13}^{(2)} x_3 + b_1^{(2)})

a_2^{(2)} = σ(z_2^{(2)}) = σ(w_{21}^{(2)} x_1 + w_{22}^{(2)} x_2 + w_{23}^{(2)} x_3 + b_2^{(2)})

a_3^{(2)} = σ(z_3^{(2)}) = σ(w_{31}^{(2)} x_1 + w_{32}^{(2)} x_2 + w_{33}^{(2)} x_3 + b_3^{(2)})

For the Layer 3 outputs a_1^{(3)}, a_2^{(3)}:

a_1^{(3)} = σ(z_1^{(3)}) = σ(w_{11}^{(3)} a_1^{(2)} + w_{12}^{(3)} a_2^{(2)} + w_{13}^{(3)} a_3^{(2)} + b_1^{(3)})

a_2^{(3)} = σ(z_2^{(3)}) = σ(w_{21}^{(3)} a_1^{(2)} + w_{22}^{(3)} a_2^{(2)} + w_{23}^{(3)} a_3^{(2)} + b_2^{(3)})

As the above shows, expressing each output one by one algebraically is cumbersome; matrix notation is much more concise. Generalizing the example and writing it as matrix multiplication:

z^{(l)} = W^{(l)} a^{(l−1)} + b^{(l)}

a^{(l)} = σ(z^{(l)})

Where σ is an activation function, such as Sigmoid, ReLU, PReLU, etc.
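The matrix form above translates almost directly into code. Here is a minimal NumPy sketch of forward propagation; the 3-2-1 layer sizes and random weights are illustrative assumptions, not values from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward propagation: a^(l) = sigma(W^(l) a^(l-1) + b^(l)).

    weights[i] and biases[i] hold W and b for layer i+2 (layer 1 is the input).
    Returns the activations of every layer, input included.
    """
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b       # weighted input z^(l)
        activations.append(sigmoid(z))    # activation a^(l)
    return activations

# Toy 3-2-1 network with hypothetical random weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
biases = [np.zeros(2), np.zeros(1)]
acts = forward(np.array([1.0, 0.5, -0.5]), weights, biases)
```

Each loop iteration is exactly one application of z^{(l)} = W^{(l)} a^{(l−1)} + b^{(l)} followed by the activation.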

2. Backward propagation (backward): derive the weight-update formulas and find the optimal path

Strictly speaking, backpropagation refers only to the method used to compute gradients. Another algorithm, such as stochastic gradient descent, uses those gradients to learn. In principle, backpropagation can compute the derivative of any differentiable function.

Before understanding the backpropagation algorithm, let's briefly introduce the chain rule:

The chain rule in calculus (not to be confused with the chain rule in probability) is used to compute derivatives of composite functions. Backpropagation is an algorithm that applies the chain rule in a particular, highly efficient order of operations.

Let x be a real number, and let f and g be functions mapping real numbers to real numbers. Suppose y = g(x) and z = f(g(x)) = f(y). Then the chain rule states: dz/dx = (dz/dy)(dy/dx).
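The chain rule can be checked numerically. The small sketch below (my own illustrative choice of g and f, not from the text) compares the analytic chain-rule derivative with a central finite difference:

```python
import math

# Chain rule check for z = f(g(x)) with g(x) = x**2 and f(y) = sin(y):
# dz/dx = (dz/dy)(dy/dx) = cos(x**2) * 2*x.
def g(x):
    return x * x

def f(y):
    return math.sin(y)

x = 1.3
analytic = math.cos(g(x)) * 2 * x                 # chain rule
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)   # central finite difference
```

The two values agree to many decimal places, which is exactly the guarantee backpropagation relies on when it chains such local derivatives layer by layer.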

The core of the backpropagation algorithm is the partial-derivative expressions ∂C/∂W and ∂C/∂b of the cost function C with respect to the network parameters (the weights W and biases b of each layer). These expressions describe how much the cost C changes as W or b changes. A simple way to understand the BP algorithm: if the current cost is far from the expected value, we adjust W or b to bring the new cost closer to it (the larger the difference from the expected value, the larger the adjustment to W or b). This process repeats until the cost is within the error tolerance, at which point the algorithm stops.

The BP algorithm tells us how the network's parameters change in each iteration. Understanding this process is very helpful for analyzing network performance and optimizing training, so it is worth understanding thoroughly.

To derive ∂C/∂W and ∂C/∂b during backpropagation, we first make two assumptions about the cost function, taking the quadratic cost as an example:

C = 1/(2n) Σ_x ||y(x) − a^L(x)||²

where n is the total number of training samples x, y = y(x) is the desired output (the ground truth), L is the number of layers of the network, and a^L(x) is the network's output vector.

Assumption 1: the total cost function can be expressed as the average of the costs of the individual samples:

C = (1/n) Σ_x C_x

The significance of this assumption is that during backpropagation we can only compute ∂C_x/∂W and ∂C_x/∂b for a single training sample x; under this assumption, we obtain the overall ∂C/∂W and ∂C/∂b by averaging over all samples.

Assumption 2: the cost can be expressed as a function of the network output, C = C(a^L). For example, the quadratic cost of a single sample x can be written as:

C_x = 1/2 ||y − a^L||² = 1/2 Σ_j (y_j − a_j^L)²
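Both assumptions are easy to see in code. This minimal sketch (function name and the two toy samples are my own) computes the per-sample quadratic cost C_x and then averages, exactly as Assumption 1 requires:

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum_x ||y(x) - a^L(x)||^2, averaged over the n samples."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n = outputs.shape[0]
    # C_x for each sample: a function of the network output only (Assumption 2).
    per_sample = 0.5 * np.sum((targets - outputs) ** 2, axis=1)
    # Total cost: the average of the per-sample costs (Assumption 1).
    return per_sample.sum() / n

# Two toy samples, two output neurons each.
C = quadratic_cost([[0.8, 0.2], [0.4, 0.6]], [[1.0, 0.0], [0.0, 1.0]])
```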

The Four Fundamental Equations of Backpropagation

How changes to the weights W or biases b affect the cost function C is the key to understanding backpropagation. Ultimately, this means we need to compute ∂C/∂w_{jk}^l and ∂C/∂b_j^l for every parameter. Before presenting the basic equations, we introduce the concept of the error δ, where δ_j^l denotes the error of neuron j in layer l.

As shown in the figure above, suppose a little demon meddles with the j-th neuron in layer l and changes its weighted input by Δz_j^l; the neuron's activation then becomes σ(z_j^l + Δz_j^l). This perturbation propagates through the subsequent layers, changing the final cost by approximately (∂C/∂z_j^l)Δz_j^l. Now suppose the demon changes sides and wants to help us reduce the cost as much as possible (make the network output closer to expectations). If ∂C/∂z_j^l starts out as a large positive or negative value, the demon can make the cost smaller by choosing Δz_j^l with the opposite sign to ∂C/∂z_j^l (this is exactly the well-known gradient descent method). As the iterations progress, ∂C/∂z_j^l gradually tends to 0, so Δz_j^l can improve the cost only marginally; at that point the demon proudly announces: "I have found the optimal solution (a local optimum)". This motivates using ∂C/∂z_j^l to measure the neuron's error: δ_j^l = ∂C/∂z_j^l.

Let's look at where the four basic equations come from.

1. The error equation of the output layer

δ_j^L = (∂C/∂a_j^L) · σ′(z_j^L)

If you followed the discussion above, this equation should not be hard to understand. The first term on the right, ∂C/∂a_j^L, measures how quickly the cost function changes with the network's final output, while the second term, σ′(z_j^L), measures how quickly the activation output changes with z_j^L. When the activation function saturates, i.e. σ′(z_j^L) ≈ 0, then no matter how large ∂C/∂a_j^L is, in the end δ_j^L ≈ 0: the output neuron enters the saturation region and stops learning.

Both terms in the equation are easy to compute if the cost function is the quadratic cost:

C = 1/2 Σ_j (y_j − a_j^L)²

from which we get:

∂C/∂a_j^L = a_j^L − y_j

In the same way, σ′(z_j^L) is obtained by differentiating the activation function σ(z) with respect to z_j^L. The equation can be rewritten in matrix form:

δ^L = ∇_a C ⊙ σ′(z^L)

⊙ denotes the Hadamard product, i.e. the element-wise product of two matrices.
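The output-layer error equation is a one-liner in NumPy, where `*` between arrays is precisely the Hadamard product. A minimal sketch, assuming a sigmoid activation and the quadratic cost above (the toy values of z^L and y are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(a_L, y, z_L):
    """delta^L = (a^L - y) ⊙ sigma'(z^L), for the quadratic cost."""
    return (a_L - y) * sigmoid_prime(z_L)  # '*' is the element-wise (Hadamard) product

# Two output neurons: one unsaturated (z = 0), one approaching saturation (z = 2).
z_L = np.array([0.0, 2.0])
a_L = sigmoid(z_L)
y = np.array([1.0, 0.0])
delta = output_error(a_L, y, z_L)
```

Note how the saturating neuron's σ′(z) shrinks its error term, illustrating the "stops learning" effect described above.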

Chain rule: error summation

 Gradient descent: weight parameter update

 


Origin blog.csdn.net/qq_38998213/article/details/132390963