1, Analogy with biological neurons
2, Layered structure
Input layer, hidden layers (1, 2, ...), and output layer; the connecting lines can be understood as the weight parameters w. The sizes of the weight matrices W determine the size of the neural network.
Neural network training process: input the data; forward-propagate to compute the loss value; backpropagate to compute the gradients; update the parameters with the gradients.
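The four steps above can be sketched as a minimal training loop. This is an illustrative example only: the toy data, layer sizes, and learning rate are all made up, and the backward pass is written out by hand with the chain rule.

```python
import numpy as np

# A minimal sketch of the four-step process: forward propagation,
# loss, gradient backpropagation, parameter update. One hidden layer,
# toy regression data; all sizes and hyperparameters are assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))                    # input data
y = X @ np.array([[1.0], [-2.0], [0.5]])        # toy regression target

W1 = rng.normal(scale=0.1, size=(3, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 1)); b2 = np.zeros(1)
lr, losses = 0.1, []

for step in range(500):
    # 1) forward propagation
    h = np.maximum(0.0, X @ W1 + b1)            # ReLU hidden layer
    pred = h @ W2 + b2
    # 2) compute the loss value (mean squared error)
    loss = np.mean((pred - y) ** 2)
    losses.append(loss)
    # 3) backpropagate the gradients (chain rule, by hand)
    d_pred = 2.0 * (pred - y) / len(X)
    dW2, db2 = h.T @ d_pred, d_pred.sum(axis=0)
    d_h = d_pred @ W2.T
    d_h[h <= 0] = 0.0                           # ReLU gradient mask
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    # 4) update the parameters with the gradients
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

Each pass through the loop is one iteration of the process described above; the loss should shrink as the parameters are updated.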
3, Nonlinear structure (activation functions)
The activation function is applied after the weighted sum computed with the previous layer's weight parameters:
4, Activation functions
4.1 Sigmoid activation function
The derivative used in the backpropagation derivation:
When the absolute value of x is large, the derivative is close to zero, so the gradient tends to vanish as it is multiplied through the chain rule; the weight parameters then stop updating and the network cannot converge. For this reason, most later neural networks no longer use sigmoid as the activation function.
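The vanishing gradient is easy to see numerically. The sketch below uses the standard sigmoid derivative sigma'(x) = sigma(x) * (1 - sigma(x)), which peaks at 0.25 and is effectively zero for large |x|:

```python
import numpy as np

# Sketch: the sigmoid derivative is at most 0.25 (at x = 0) and nearly
# zero for large |x|, so repeated multiplication through the chain rule
# drives the gradient toward zero.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)       # 0.25, the maximum possible value
tail = sigmoid_grad(10.0)      # nearly zero: the gradient vanishes
```

Even at its peak, chaining ten sigmoid layers scales the gradient by at most 0.25^10, which is already tiny.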
4.2 ReLU activation function
The ReLU activation function solves the vanishing-gradient problem, and its derivative is trivial to compute, so subsequent neural networks typically use it as the activation function.
5, The important role of the regularization term in neural networks
Because of outliers in the data, neural networks are prone to overfitting; a regularization penalty term can effectively suppress overfitting and improve the network's generalization ability.
The more neurons (equivalently, weight parameters) a network has, the more complex the models it can express, but the greater the risk of overfitting.
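A common form of the penalty term is L2 regularization, sketched below. The squared weights are added to the data loss, scaled by a strength `lam`; the name and the default value are illustrative assumptions, not from the original text.

```python
import numpy as np

# Sketch of a loss with an L2 regularization penalty: large weights
# are penalized, which discourages the network from fitting outliers.
# `lam` (regularization strength) is a hypothetical hyperparameter.
def l2_regularized_loss(pred, y, weights, lam=1e-3):
    data_loss = np.mean((pred - y) ** 2)
    reg_loss = lam * sum(np.sum(W ** 2) for W in weights)
    return data_loss + reg_loss

pred = np.array([1.0, 2.0])
y = np.array([1.0, 2.0])
small_w = [np.array([[0.1]])]
large_w = [np.array([[10.0]])]
```

With identical predictions, the model with larger weights pays a larger total loss, which is exactly the pressure that keeps the weights small.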
6, Data preprocessing
Zero-centering (subtracting the mean) and normalization (dividing by the standard deviation, to eliminate the different scales along the x and y axes).
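Both steps can be sketched in a few lines of NumPy; the toy matrix below (one small-scale feature, one large-scale feature) is made up for illustration.

```python
import numpy as np

# Zero-centering and normalization as described above: subtract the
# per-feature mean, then divide by the per-feature standard deviation
# so features on very different scales become comparable.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_centered = X - X.mean(axis=0)      # each column now has mean 0
X_norm = X_centered / X.std(axis=0)  # each column now has std 1
```

After this, every feature contributes on the same scale, which usually makes gradient descent better conditioned.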
7, Initialization of the weights w and the bias terms b
The weights must not all be initialized to the same value; otherwise backpropagation updates them all in the same direction, and the network iterates far too slowly. Weights are typically initialized with small Gaussian random values.
b can be initialized to a constant value (1 or 0).
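A minimal sketch of this initialization scheme; the layer sizes and the 0.01 scale are common but arbitrary choices, not values from the original text.

```python
import numpy as np

# Gaussian random initialization for the weights (small scale so the
# activations start in a reasonable range) and a constant for the bias.
rng = np.random.default_rng(42)
W = rng.normal(loc=0.0, scale=0.01, size=(784, 100))  # random: breaks symmetry
b = np.zeros(100)                                     # constant 0 is fine for b
```

The random weights break the symmetry between neurons, so each one receives a different gradient; the bias has no such symmetry problem, which is why a constant is acceptable there.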
8, Dropout
Full connection: for layers n-1 and n, every node in layer n-1 is connected to every node in layer n. That is, when each node in layer n is computed, the input to its activation function is the weighted sum over all nodes in layer n-1.
A fully connected network is expressive, but it has many parameters, so training is very slow and overfitting is likely.
To address these problems, a random subset of neurons is ignored on each training pass (their weight parameters are not updated); this is the dropout operation shown in the figure below:
Although fewer parameters take part in each training pass, we can increase the number of iterations to make up for this.
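The operation can be sketched as "inverted dropout", a standard formulation: each activation is kept with probability `keep_prob` and the survivors are rescaled so the expected output is unchanged. The keep probability and array sizes below are illustrative assumptions.

```python
import numpy as np

# Sketch of inverted dropout on a layer's activations: zero out each
# neuron with probability 1 - keep_prob during training, and divide
# the survivors by keep_prob so the expected activation is unchanged.
# At test time the layer is used as-is, with no mask.
def dropout(h, keep_prob=0.5, seed=0):
    rng = np.random.default_rng(seed)
    mask = (rng.random(h.shape) < keep_prob) / keep_prob
    return h * mask

h = np.ones((4, 6))          # pretend these are a layer's activations
h_train = dropout(h)         # roughly half the entries become zero
```

Because a different random mask is drawn each pass, every neuron is sometimes silenced, which prevents the network from relying too heavily on any single unit.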