Deep learning scattered knowledge points (continuously updated)

1. Steps of the gradient descent algorithm:

a. Initialize weights and biases with random values

b. Pass the input into the neural network to get the output value

c. Calculate the error between the predicted value and the true value

d. Adjust the weights of each neuron that contributed to the error, in the direction that reduces the error

e. Repeat the iteration until the best weight is obtained
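
A minimal sketch of these steps on a single-layer model, assuming a mean-squared-error loss and plain NumPy (the data, names, and hyperparameters are illustrative, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples, 3 features, linear target with noise
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# a. Initialize weights and bias with random values
w = rng.normal(size=3)
b = 0.0
lr = 0.1

for epoch in range(200):           # e. repeat the iteration
    y_pred = X @ w + b             # b. forward pass to get the output
    err = y_pred - y               # c. error between prediction and truth
    grad_w = X.T @ err / len(y)    # d. gradient of the mean squared error
    grad_b = err.mean()
    w -= lr * grad_w               # d. adjust weights to reduce the error
    b -= lr * grad_b

print(w, b)  # should approach [1.5, -2.0, 0.5] and 0
```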


2. A series of data preprocessing steps (rotation, translation, scaling) needs to be applied before the data is fed into the neural network; the network itself cannot perform these transformations


3. Dropout in a neural network is similar in spirit to Bagging. Bagging (bootstrap aggregating, the parallel-ensemble counterpart to boosting) is a technique that repeatedly samples from the data with replacement according to a uniform probability distribution and trains one base model per resampled set.
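
A minimal sketch of the resampling step, assuming plain NumPy (names and data are illustrative): each bootstrap sample draws n indices uniformly with replacement, so each base model sees a slightly different dataset, much as Dropout gives each mini-batch a slightly different subnetwork.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # toy dataset: 1000 samples, 5 features

n_models = 10
for m in range(n_models):
    # Uniform sampling with replacement: the core Bagging operation
    idx = rng.integers(0, len(X), size=len(X))
    X_boot = X[idx]
    # ... train base model m on X_boot ...
    unique_frac = len(np.unique(idx)) / len(X)
    print(f"model {m}: {unique_frac:.1%} unique samples")  # ~63% on average
```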


4. When training a neural network, if the loss does not drop during the first few epochs, possible causes include: the learning rate is too low, the regularization parameter is too high, or training is stuck in a local optimum


5. For many high-dimensional non-convex functions, local minima (and maxima) are actually far rarer than another class of points with zero gradient: saddle points. Some points near a saddle point have a higher cost than the saddle point, while others have a lower cost.
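
A standard two-dimensional example (illustrative, not from the original): the gradient vanishes at the origin, yet the cost rises along the x-axis and falls along the y-axis, so the origin is a saddle point rather than a minimum or maximum.

```latex
f(x, y) = x^2 - y^2, \qquad
\nabla f = (2x,\, -2y)\big|_{(0,0)} = (0, 0)
```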



6. PCA (Principal Component Analysis) extracts the directions along which the variance of the data distribution is largest, and thereby also performs dimensionality reduction. If a hidden layer in a neural network achieves dimensionality reduction, it can extract features with predictive power.
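
A minimal PCA sketch via NumPy's SVD (illustrative names; in practice one would typically use sklearn.decomposition.PCA): project the centered data onto its top-k largest-variance directions.

```python
import numpy as np

def pca(X, k):
    """Project X onto its k directions of largest variance."""
    X_centered = X - X.mean(axis=0)
    # Right singular vectors of the centered data = principal directions
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T  # reduced to k dimensions

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X_reduced = pca(X, k=2)
print(X_reduced.shape)  # (200, 2)
```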


7. Both CNNs and RNNs share weights: a CNN shares the same convolution kernel across spatial positions, and an RNN shares the same transition weights across time steps.


8. Batch normalization (BN) in a neural network can mitigate overfitting because the same sample receives different normalized values in different mini-batches (the batch statistics differ), which is equivalent to a form of data augmentation
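
A minimal sketch of the BN forward pass at training time, assuming NumPy (illustrative; learnable scale/shift updates and running statistics are omitted): the same sample x is normalized with whatever batch it lands in, so its normalized value changes from batch to batch.

```python
import numpy as np

def batch_norm(x_batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch per feature using its own mean and variance."""
    mu = x_batch.mean(axis=0)
    var = x_batch.var(axis=0)
    x_hat = (x_batch - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))           # one fixed sample
batch_a = np.vstack([x, rng.normal(size=(31, 4))])
batch_b = np.vstack([x, rng.normal(size=(31, 4))])

# The same sample gets different normalized values in different batches
print(batch_norm(batch_a)[0])
print(batch_norm(batch_b)[0])
```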


9. The input image is 200×200. It passes through one convolution layer (kernel size 5×5, padding 1, stride 2), a pooling layer (kernel size 3×3, padding 0, stride 1), and another convolution layer (kernel size 3×3, padding 1, stride 1). What is the output feature-map size? Answer: 97×97

Analysis:

Padding is the number of pixels added around the border; stride is the step size of each move.

output size = (input size + 2 × padding − kernel size) / stride + 1

First convolution layer: output = (200 + 2 − 5)/2 + 1 = 99.5, rounded down to 99

Second layer (pooling): output = (99 − 3)/1 + 1 = 97

Third layer (convolution): output = (97 + 2 − 3)/1 + 1 = 97

Convolution layers round down while pooling layers round up: the convolution formula uses integer division, discarding the remainder (floor), whereas pooling layers use the ceil function (round up). A sketch reproducing the calculation follows.
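
A small sketch with illustrative helper names, flooring for convolution and ceiling for pooling:

```python
import math

def conv_out(size, kernel, padding, stride):
    # Convolution layers round down (integer division)
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel, padding, stride):
    # Pooling layers round up (ceil)
    return math.ceil((size + 2 * padding - kernel) / stride) + 1

size = conv_out(200, kernel=5, padding=1, stride=2)   # 99
size = pool_out(size, kernel=3, padding=0, stride=1)  # 97
size = conv_out(size, kernel=3, padding=1, stride=1)  # 97
print(size)  # 97
```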


10. The Ho-Kashyap (HK) algorithm computes the weight vector under the minimum mean-square-error criterion and applies to both linearly separable and nonlinearly separable cases. In the linearly separable case it yields the optimal weight vector; in the nonlinearly separable case it can detect the nonseparability and exit the iteration.


11. Given three dense matrices A(m×n), B(n×p), C(p×q) with m < n < p < q, what is the most efficient way to compute ABC? Answer: (AB)C

(AB)C: m·n·p + m·p·q multiplications

A(BC): n·p·q + m·n·q multiplications

Dividing both costs by m·n·p·q, (AB)C costs 1/n + 1/q while A(BC) costs 1/m + 1/p; since m < n and p < q, (AB)C is always cheaper. A sketch checking this is below.
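
A quick check of the two costs, with illustrative dimensions satisfying m < n < p < q:

```python
def cost_AB_then_C(m, n, p, q):
    return m * n * p + m * p * q   # (AB) is m*p, then (AB)C

def cost_BC_then_A(m, n, p, q):
    return n * p * q + m * n * q   # (BC) is n*q, then A(BC)

m, n, p, q = 10, 20, 30, 40        # any m < n < p < q
print(cost_AB_then_C(m, n, p, q))  # 18000
print(cost_BC_then_A(m, n, p, q))  # 32000 -> (AB)C wins
```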


12. The referenced figure (not reproduced here) plots per-layer learning-speed curves A through D for a neural network with four hidden layers, trained by gradient descent with the sigmoid activation function. This network suffers from vanishing gradients: the first hidden layer corresponds to curve D, the second to C, the third to B, and the fourth to A (A is the first layer reached by backpropagation, so it learns fastest).
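
A minimal numeric illustration (an assumed setup, not the figure's actual network): the sigmoid derivative is at most 0.25, so the backpropagated gradient shrinks roughly by that factor per layer, and earlier layers (D) learn far more slowly than later ones (A).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
grad = np.ones(100)  # gradient arriving at the output side

# Backpropagate through 4 sigmoid layers
for layer in ["A", "B", "C", "D"]:               # A = last hidden layer
    z = rng.normal(size=100)
    grad = grad * sigmoid(z) * (1 - sigmoid(z))  # sigmoid' <= 0.25
    print(layer, np.abs(grad).mean())            # magnitude shrinks each layer
```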


13. Suppose we suddenly run into a problem during training: after a few epochs, the error drops sharply all at once. Suspecting a data problem, we plot the data and find that the cause may be excessive skewness in the data. How can this be solved?


Solution: apply Principal Component Analysis (PCA) and then normalize the data. A sudden drop in error usually means that several strongly correlated samples, or samples with large variance, are suddenly fitted at once; PCA followed by normalization of the data alleviates this problem.


14. We can observe many small "fluctuations" in the error. Should we be worried about this?


No. As long as the error shows a cumulative decline on both the training set and the cross-validation set, this is fine.

To reduce these fluctuations while the overall curve is still trending downward, try increasing the batch size, which narrows the range over which the combined mini-batch gradient direction swings. If noticeable fluctuations remain once the overall curve has flattened, try lowering the learning rate to converge further. Once the fluctuations become negligible, stop training early to avoid overfitting. A sketch of the batch-size effect follows.
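
A small sketch (assumed setup, illustrative numbers) of why a larger batch size reduces the swing: the standard deviation of the mini-batch gradient estimate shrinks roughly as 1/sqrt(batch size).

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend per-sample gradients: mean 1.0, per-sample noise std 2.0
per_sample_grads = rng.normal(loc=1.0, scale=2.0, size=100_000)

for batch_size in [8, 64, 512]:
    batches = per_sample_grads[:99_840].reshape(-1, batch_size)
    batch_means = batches.mean(axis=1)
    print(batch_size, batch_means.std())  # noise ~ 2 / sqrt(batch_size)
```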
