Common problems in machine learning (3)

This article continues the series collecting popular machine learning questions and reference answers from various platforms. If you have additions or corrections, please share them~
1. What are the advantages of the activation function ReLU compared to Sigmoid?
(1) ReLU is cheap to compute;
(2) ReLU has no saturation region (for positive inputs), so it alleviates the vanishing gradient problem;
(3) Networks using ReLU converge faster when fitting nonlinear functions. According to the AlexNet paper, a ReLU network reached the same training error roughly 6 times faster than an equivalent network with a saturating activation.
Finally, ReLU has one drawback: once a unit's output is stuck at 0, no gradient can flow back through it, so the unit can "die" (the dying ReLU problem).
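To make the saturation point concrete, here is a minimal NumPy sketch (not from the original answer) comparing the gradients of sigmoid and ReLU:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # at most 0.25, and ~0 for large |x| (saturation)

def relu_grad(x):
    return (x > 0).astype(float)   # 1 wherever the unit is active, 0 otherwise

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(x))   # tiny values at both ends: the gradient vanishes
print(relu_grad(x))      # [0. 0. 1. 1.]: constant gradient while active, zero when dead
```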
2. How big is the dropout rate generally set?
It mainly depends on your needs; the dropout rate is a hyperparameter and can be tuned according to validation results.

Generally, little dropout is used at the input layer: a rate of 0.1 or even 0.
In the middle layers it can be somewhat larger, for example 0.5. One purpose of dropout is to produce a different thinned network structure in each training batch; with a rate of 0.5 the structure varies the most, which is why 0.5 is so commonly used.
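For concreteness, a minimal PyTorch sketch of this rule of thumb; the layer sizes here are made up purely for illustration:

```python
import torch.nn as nn

# Hypothetical feedforward network: light (or no) dropout near the input,
# around 0.5 in the middle layers, none right before the output.
model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(p=0.1),   # small dropout rate close to the input
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # larger dropout rate in the middle of the network
    nn.Linear(256, 10),
)
```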
3. How does a feedforward neural network choose the number of hidden layers?
Generally this is decided by cross-validation; there is no absolutely correct rule.
However, there are some widely shared rules of thumb (including from Jeff Heaton) that are worth consulting.
The number of hidden layers
Generally one, because it is sufficient in most cases. In theory, a feedforward neural network with two hidden layers can represent arbitrary nonlinear decision boundaries. So, let's say 1 to 2 hidden layers.
The number of nodes in the hidden layer
Assuming the input layer has Nx nodes and the output layer has Ny nodes, the number of nodes Nh in each hidden layer is generally chosen between Nx and Ny. The exact choice depends on the specific problem: analyze it first, then cross-validate.
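A minimal scikit-learn sketch of choosing the hidden layer size by cross-validation; the candidate sizes are only examples:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)   # Nx = 64 input features, Ny = 10 classes

# Candidate architectures: 1 to 2 hidden layers, with Nh between Ny and Nx.
param_grid = {"hidden_layer_sizes": [(16,), (32,), (64,), (32, 16)]}

search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)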
4. What are the assumptions of logistic regression?
Logistic regression has relatively few prerequisites.
Compared with linear regression, which assumes linearity, independence, normality, and homoscedasticity (equal variance), logistic regression only requires linearity and independence.
Independence: the errors (observations) must be independent.
Linearity: the independent variables are linearly related to the logit of the outcome.
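Written out, the linearity assumption says the log-odds (logit) of the positive class is a linear function of the independent variables:

$$\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$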
5. How to parallelize large linear regression?
(1) If linear regression is solved via the normal equations (the projection matrix), the problem reduces to parallelizing the matrix computations;
(2) If a numerical (gradient-based) method is used, consider mini-batches: for example, with a batch size of 40 and 4 CPUs, each CPU processes 10 points and the partial results are summed;
(3) If distributed computing is used (the data is split into N blocks), each machine or CPU is assigned one N-th of the data, solves its regression separately, and the N sets of regression coefficients are finally averaged (see the sketch below). This approach is commonly used at big-data companies such as LinkedIn and Google.
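A minimal NumPy sketch of approach (3), run sequentially here for simplicity; in practice each block would be solved on a separate machine or CPU, and the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=len(X))

# Split the data into N blocks, solve least squares on each block
# independently, then average the N sets of regression coefficients.
N = 4
coefs = []
for X_block, y_block in zip(np.array_split(X, N), np.array_split(y, N)):
    w_block, *_ = np.linalg.lstsq(X_block, y_block, rcond=None)
    coefs.append(w_block)

w_avg = np.mean(coefs, axis=0)
print(w_avg)   # close to true_w
```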
6. What does the robustness of a machine learning algorithm mean?
There is no strict quantitative definition of robustness for machine learning algorithms. Intuitively, a robust model is one that stays sturdy under imperfect conditions, mainly in two respects.
(1) Erroneous or mislabeled data. Training sets often contain mislabeled or corrupted samples, and the samples to be predicted may contain errors as well. A robust model is not thrown off by such erroneous data and can still capture the underlying pattern (see the sketch below).
(2) The training and prediction samples follow different distributions. A robust model still gives reasonable predictions even when the test-set distribution differs from the training-set distribution.
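As one concrete illustration of point (1), a robust loss such as the Huber loss is far less affected by grossly wrong labels than ordinary least squares; a minimal scikit-learn sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.3, size=200)
y[X.ravel() > 2.5] += 50.0   # corrupt the labels of a handful of points

print(LinearRegression().fit(X, y).coef_)   # slope distorted by the bad labels
print(HuberRegressor().fit(X, y).coef_)     # much closer to the true slope of 2
```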
7. How to perform cross-validation on time series?
Ordinary (shuffled) cross-validation cannot be used, because it would leak future information into the training folds.
One solution is to preserve the temporal order and always use the later data as the validation set.
For example, the data is from January to December.
Then you can:
train on months 1-6, validate on month 7
train on months 2-7, validate on month 8
train on months 3-8, validate on month 9
train on months 4-9, validate on month 10
train on months 5-10, validate on month 11
train on months 6-11, validate on month 12

The drawback is that the data from January to June is never used as a validation set, so the cross-validation estimate is somewhat biased.
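scikit-learn's TimeSeriesSplit implements exactly this rolling scheme; a minimal sketch reproducing the folds above, with month numbers standing in for the real samples:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

months = np.arange(1, 13)   # 12 months of data, already in time order
tscv = TimeSeriesSplit(n_splits=6, max_train_size=6, test_size=1)

for train_idx, val_idx in tscv.split(months):
    print("train:", months[train_idx], "validate:", months[val_idx])
```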
If you have other good methods, please correct me~
8. What is the difference between cosine similarity and inner product?
They express different things and should not be conflated.
Cosine similarity only considers the difference in direction (angle), while the inner product reflects both the angle and the lengths of the vectors.
For example, take two objects A and B with vector representations A(1,1,0) and B(0,1,1). Cosine similarity ignores vector length, so the similarity between A(1,1,0) and C(0,3,3) is the same as the similarity between A and B.
If the length of the vectors carries real meaning for similarity (given an understanding of the data), the inner product is recommended. For example, suppose several attributes of a product are scored from 1 (very uncertain) to 5 (very certain). For three products A(1,1,1), B(4,4,4), and C(5,5,5), the inner product says B is more similar to C, whereas cosine similarity cannot distinguish A, B, and C at all.
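A quick NumPy check of the product-rating example:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

A, B, C = np.array([1, 1, 1]), np.array([4, 4, 4]), np.array([5, 5, 5])
print(np.dot(A, B), np.dot(B, C))   # 12 vs 60: the inner product says B is closer to C
print(cosine(A, B), cosine(B, C))   # 1.0 vs 1.0: cosine cannot tell the three apart
```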
