[Deep Learning Notes] Shallow Neural Network

This column contains study notes for the artificial intelligence course "Neural Networks and Deep Learning" on NetEase Cloud Classroom. The video course is jointly produced by NetEase Cloud Classroom and deeplearning.ai, and the lecturer is Professor Andrew Ng. Readers who are interested can watch the videos on NetEase Cloud Classroom for in-depth study. The link to the course is as follows:

https://mooc.study.163.com/course/2001281002

Readers who are interested in neural networks and deep learning are also welcome to exchange ideas together~

Table of contents

1 The structure of the neural network

2 Activation function

3 Random initialization


1 The structure of the neural network

        You can stack many sigmoid units to build a neural network. Each node of a neural network corresponds to two computational steps: a linear combination of the outputs of the previous layer (the z-value), followed by a nonlinear activation (the a-value).

        For a neural network containing 2 layers of sigmoid units, with input features x, first-layer parameters W^{[1]}, \, b^{[1]}, and second-layer parameters W^{[2]}, \, b^{[2]}, we have

z^{[1]} = W^{[1]} \, x + b^{[1]}

a^{[1]} = \sigma(z^{[1]})

z^{[2]} = W^{[2]} \, a^{[1]} + b^{[2]}

a^{[2]} = \sigma(z^{[2]})
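
        As a concrete illustration (not from the course itself), a minimal NumPy sketch of this forward pass could look like the following; the shape conventions (n_x input features, n_h hidden units, one output unit) are assumptions made for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of the 2-layer sigmoid network above.

    Assumed shapes: x (n_x, 1); W1 (n_h, n_x), b1 (n_h, 1);
    W2 (1, n_h), b2 (1, 1).
    """
    z1 = W1 @ x + b1      # linear step of the hidden layer
    a1 = sigmoid(z1)      # nonlinear activation of the hidden layer
    z2 = W2 @ a1 + b2     # linear step of the output layer (its input is a1, not x)
    a2 = sigmoid(z2)      # output activation
    return a2
```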

        A neural network can be divided into an input layer (Input Layer), hidden layer (Hidden Layer) and output layer (Output Layer). The network described above is called a two-layer neural network (2 Layer Neural Network); the input layer is not counted because it contains no parameters and performs no nonlinear activation.

        In a neural network trained with supervised learning, the training set contains the inputs x and the outputs y. The middle layer is called "hidden" because the values of its intermediate nodes are not observed in the training set.

2 Activation function

        When building a neural network, you can choose which activation function to use for the hidden layers, and which activation function to use for the output units.

        The tanh function is a shifted and rescaled version of the sigmoid function. In general, the tanh function performs better than the sigmoid function. But both functions share a disadvantage: when z is very large or very small, the gradient of the function is close to 0. This is called the "vanishing gradient" problem.
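
        Concretely, tanh is exactly the sigmoid function rescaled and shifted:

\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2 \, \sigma(2z) - 1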

        Two other commonly used activation functions are the ReLU function and the leaky ReLU function.
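
        A small sketch of these four activation functions in plain NumPy (the function names and the 0.01 slope for leaky ReLU are choices made for this example) might be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha is the small slope used for negative inputs (0.01 is a common choice)
    return np.where(z > 0, z, alpha * z)
```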

3 Random initialization

        When training a neural network, the choice of initial weights is very important. For logistic regression, you can set the initial weights to 0. But for a neural network, if the initial weights are all 0, every hidden unit performs exactly the same computation, so having multiple hidden units becomes meaningless.

        The solution is to initialize the weights W randomly. The usual method is to generate the values with a random function. To avoid slowing down gradient descent with initial weights that are too large, you can multiply by a small coefficient such as 0.01; the bias values b can be initialized to 0.
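
        A minimal sketch of this initialization for the two-layer network, assuming layer sizes n_x, n_h, n_y:

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    # Small random weights break the symmetry between hidden units;
    # the 0.01 factor keeps activations away from the flat regions of sigmoid/tanh.
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))          # biases can safely start at 0
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```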
