Common machine learning interview questions and answers

1. In linear regression, the goal is to minimize the residual sum of squares (RSS). The RSS is a function of the model parameters; to find its minimum, set the partial derivative of the RSS with respect to each parameter to zero. For a model that includes an intercept, the equation for the intercept forces the residuals to sum to zero, i.e., the residuals have zero mean.
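As a minimal sketch of this fact (using NumPy and synthetic data, not from the original answer), fitting ordinary least squares with an intercept via the normal equations gives residuals whose mean is numerically zero:

```python
import numpy as np

# Fit OLS with an intercept and check that the residuals average to zero,
# as the normal equation for the intercept implies.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # intercept + one feature
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Normal equations: beta = (X^T X)^{-1} X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta

print(abs(residuals.mean()) < 1e-8)  # True: residuals have zero mean
```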

2. The number of mappings from an m-element set to an n-element set is n^m.
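The n^m count can be verified by brute force, since a mapping is just a choice of image for each of the m elements; a small sketch:

```python
from itertools import product

# Enumerate all functions from an m-element set to an n-element set:
# one tuple entry per domain element, each entry an image in range(n).
m, n = 3, 4
mappings = list(product(range(n), repeat=m))
print(len(mappings) == n ** m)  # True: 4^3 = 64 mappings
```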

3. The number of bijections from an m-element set to an n-element set: when m = n, it is A(m, m) = m!; when m ≠ n, it is 0.

4. The number of surjections from an m-element set onto an n-element set requires a case analysis. When m < n, it is 0. When m = n, it is A(n, n) = n!. When m = n + 1, exactly two elements must share an image, so it is C(m, 2) · A(n, n) = m(m − 1) · n! / 2.
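These special cases can be checked by brute-force enumeration (a sketch; the helper `surjections` is illustrative, not from the original):

```python
from itertools import product
from math import comb, factorial

# Count surjections from an m-element set onto an n-element set
# by enumerating all n^m functions and keeping those whose image
# covers all n targets.
def surjections(m, n):
    return sum(1 for f in product(range(n), repeat=m)
               if len(set(f)) == n)

print(surjections(2, 3))                                   # 0: m < n
print(surjections(3, 3) == factorial(3))                   # True: m = n gives n!
print(surjections(4, 3) == comb(4, 2) * factorial(3))      # True: m = n + 1 case
```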

5. The number of positive integer solutions of x1 + x2 + ... + xn = m is, in the general case, C(m − 1, n − 1);
the number of non-negative integer solutions is C(m + n − 1, n − 1) (stars and bars).
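Both stars-and-bars counts can be sanity-checked by enumeration (a sketch; `count_solutions` is an illustrative helper):

```python
from itertools import product
from math import comb

# Count integer solutions of x1 + ... + xn = m with every x_i >= lowest
# (lowest = 1 for positive solutions, 0 for non-negative ones).
def count_solutions(m, n, lowest):
    return sum(1 for xs in product(range(lowest, m + 1), repeat=n)
               if sum(xs) == m)

m, n = 6, 3
print(count_solutions(m, n, 1) == comb(m - 1, n - 1))      # True: C(5, 2) = 10
print(count_solutions(m, n, 0) == comb(m + n - 1, n - 1))  # True: C(8, 2) = 28
```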

6. If a neuron's output is −0.01, the activation function may be tanh: its range is (−1, 1), so it can produce negative outputs, unlike sigmoid or ReLU.
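The ranges of the common activations can be checked directly (a minimal sketch; `sigmoid` and `relu` are helper definitions for illustration):

```python
import numpy as np

# Of the common activations, only tanh can produce a negative value
# such as -0.01: sigmoid maps to (0, 1) and ReLU to [0, inf).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

z = -0.01  # example pre-activation
print(np.tanh(z) < 0)   # True: tanh range is (-1, 1)
print(sigmoid(z) > 0)   # True: sigmoid range is (0, 1)
print(relu(z) == 0)     # True: ReLU cannot go negative
```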

7. If the parameters trained by gradient descent become NaN, possible reasons include:

1). Gradient explosion.
2). A division by zero occurred somewhere in the computation.
3). The batch size or learning rate is set too large. The logits grow too large and the output becomes inf; taking its log in the loss then yields NaN (NaN is short for "not a number", a value that cannot be expressed as a number). Besides lowering the learning rate, another solution is to add a regularization term to the loss.
4). Forgetting to add a small constant (epsilon) to the argument of the log in the cost function.
5). Errors in the data: the data itself contains NaN values, and bad data prevents the network from converging.
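Cause 4) can be reproduced in a few lines (a sketch, assuming a cross-entropy-style loss on a saturated prediction):

```python
import numpy as np

# 0 * log(0) evaluates to 0 * (-inf), which is NaN; adding a small
# epsilon inside the log keeps the loss finite.
eps = 1e-12
p = np.array([0.0, 1.0])   # a fully saturated predicted distribution
t = np.array([0.0, 1.0])   # one-hot target

with np.errstate(divide="ignore", invalid="ignore"):
    bad = -(t * np.log(p)).sum()       # 0 * log(0) -> NaN
good = -(t * np.log(p + eps)).sum()    # epsilon keeps log finite

print(np.isnan(bad))   # True
print(np.isnan(good))  # False: loss is ~0
```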

8. When should an SVM use a linear kernel, and when a Gaussian kernel?

   When feature extraction is done well and the features carry enough information, many problems are linearly separable and a linear kernel works. If the number of features is small, the number of samples is moderate, training time is not a concern, and the problem encountered is linearly inseparable, a Gaussian kernel may achieve better results.
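As a minimal sketch of the Gaussian (RBF) kernel itself, which is what lets the SVM separate linearly inseparable data (the `gamma` value here is an arbitrary illustrative choice):

```python
import numpy as np

# Gaussian (RBF) kernel: k(x, y) = exp(-gamma * ||x - y||^2).
# Identical points get similarity 1; distant points decay toward 0.
def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 0.0])
y = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])

print(rbf_kernel(x, y) == 1.0)        # True: identical points
print(0.0 < rbf_kernel(x, z) < 1.0)   # True: similarity decays with distance
```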

9. Given a data set, how do you choose a classifier?

   Note: consider the data size, the features, and whether values are missing.

   A: Select a model such as LR, SVM, or a decision tree according to the data. If the feature dimensionality is high, an SVM may be chosen; if the number of samples is large, LR may be chosen, though LR requires data preprocessing; if there are many missing values, a decision tree may be chosen. After the model is selected, the corresponding objective function is determined.

10. Explain how to judge whether a matrix is positive definite, and how the positive definiteness of the Hessian matrix is applied in gradient descent.

10.1 Judging positive definiteness of a matrix

  To determine whether a matrix is positive definite, examine its eigenvalues: if all eigenvalues are at least 0, the matrix is positive semi-definite; if all eigenvalues are strictly greater than 0, it is positive definite.
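The eigenvalue test above translates directly into code (a sketch for symmetric matrices; the two helper functions are illustrative):

```python
import numpy as np

# Eigenvalue test for a symmetric matrix: positive definite if all
# eigenvalues > 0, positive semi-definite if all eigenvalues >= 0.
def is_positive_definite(A):
    return bool(np.all(np.linalg.eigvalsh(A) > 0))

def is_positive_semidefinite(A):
    return bool(np.all(np.linalg.eigvalsh(A) >= 0))

A = np.array([[2.0, 0.0], [0.0, 3.0]])   # eigenvalues 2, 3
B = np.array([[1.0, 0.0], [0.0, 0.0]])   # eigenvalues 1, 0

print(is_positive_definite(A))      # True
print(is_positive_definite(B))      # False
print(is_positive_semidefinite(B))  # True
```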

10.2 Applying the positive definiteness of the Hessian matrix in gradient descent

  The positive definiteness of the Hessian matrix plays a significant role in judging the feasibility of an optimization algorithm. If the Hessian is positive definite, the function's second-order derivatives are greater than 0, meaning the rate of change of the function is increasing, i.e., the function is locally convex. In gradient descent, Newton's method, and similar algorithms, the Hessian therefore makes it easy to determine whether the function will converge to a local or global optimum.
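As a minimal sketch of why a positive definite Hessian matters, consider a quadratic f(x) = ½ xᵀAx − bᵀx with positive definite Hessian A: a single Newton step x − A⁻¹∇f(x) lands exactly on the unique global minimum (the matrices here are illustrative):

```python
import numpy as np

# Quadratic objective f(x) = 0.5 x^T A x - b^T x.
# A is the (constant) Hessian; it is positive definite
# (trace 5 > 0, determinant 5 > 0), so f has a unique global minimum
# at the point where the gradient A x - b vanishes.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

x = np.zeros(2)
grad = A @ x - b
x_new = x - np.linalg.solve(A, grad)   # Newton step: x - H^{-1} grad

print(np.allclose(A @ x_new, b))  # True: gradient is zero at x_new
```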


Origin blog.csdn.net/weixin_43283397/article/details/104932399