Machine Learning Algorithms: common interview questions and answers, compiled and continuously updated

1. What is the difference between decision tree ID3 and C4.5? What are their advantages?

First, we need to know what ID3 and C4.5 are. Together with the CART algorithm, they are heuristic algorithms for building decision trees.

Note: CART uses the Gini index to split the training subset. CART (Classification and Regression Trees) performs binary splits (producing a binary tree). For classification, the splitting criteria are the Gini index (minimizing Gini impurity) and Twoing; for regression, the criterion is the least-squares residual.
Gini index (Gini impurity): the probability that a randomly selected sample from the set would be classified incorrectly (if labeled according to the class distribution of the set).
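As a quick illustration of the Gini impurity mentioned in the note above, here is a minimal sketch (assuming plain NumPy; the label lists are just toy examples):

```python
# Gini impurity of a label set: the probability that a randomly drawn sample
# would be misclassified if labeled according to the set's class distribution.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 1, 1]))  # 0.5: maximally impure for two classes
print(gini([0, 0, 0, 0]))  # 0.0: a pure node
```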

Because selecting the optimal decision tree from among all possible decision trees is an NP-complete problem, it is difficult (or impossible) to solve in polynomial time. In practice, decision tree learning algorithms therefore use heuristic methods to approximate this optimization problem. The heuristic method here means recursively selecting the (locally) optimal feature and splitting the training data according to that feature.

ID3 algorithm:

ID3 uses information gain to select features. Information gain reflects how much the uncertainty (entropy) is reduced once the value of a feature is known. At every split of the decision tree, the attribute with the highest information gain is chosen as the splitting attribute, so that the impurity of the tree decreases as quickly as possible.

Disadvantage: information gain is biased towards features with many distinct values.

Reason: when a feature has many distinct values, splitting on it more easily yields purer subsets, so the entropy after the split is lower. Since the entropy before the split is fixed, the information gain is larger. This is why information gain favors features with many values.
Hence the information gain ratio described below is used instead.
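A minimal sketch of how ID3's splitting criterion could be computed (plain NumPy assumed; the feature and label arrays are purely illustrative):

```python
# Information gain = entropy before the split - weighted entropy after the split.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    feature, labels = np.asarray(feature), np.asarray(labels)
    after = 0.0
    for v in np.unique(feature):
        mask = feature == v
        after += mask.mean() * entropy(labels[mask])  # weighted entropy of each subset
    return entropy(labels) - after

labels  = np.array([1, 0, 1, 0, 1, 0])
id_like = np.array([0, 1, 2, 3, 4, 5])  # a many-valued, ID-like feature: every subset is pure
coarse  = np.array([0, 0, 0, 1, 1, 1])  # a two-valued feature

# The ID-like feature gets the maximal gain even though it is useless for new
# samples, which is exactly the bias described above.
print(information_gain(id_like, labels), information_gain(coarse, labels))
```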

C4.5 algorithm:

The C4.5 algorithm is similar to ID3 but with some improvements: when generating the decision tree, C4.5 uses the information gain ratio to select features.

Information gain ratio: information gain ratio = penalty parameter × information gain, where the penalty parameter is 1 / H_A(D), the reciprocal of the entropy of the dataset D with respect to the values of feature A.

The essence of the information gain ratio is to multiply the information gain by this penalty parameter: when the feature has many values, the penalty parameter is small; when the feature has few values, the penalty parameter is large.

Disadvantage: the information gain ratio is biased towards features with fewer values.
Reason: when a feature has fewer values, H_A(D) is smaller, so its reciprocal (the penalty parameter) is larger and the information gain ratio becomes larger. Hence the bias towards features with fewer values.

Using the information gain ratio in practice: because of this drawback, C4.5 does not simply pick the feature with the largest information gain ratio. It first selects the candidate features whose information gain is above average, and then, among these, chooses the feature with the highest information gain ratio.
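A hedged sketch of the gain ratio on the same kind of toy data (plain NumPy assumed), showing how the split information H_A(D) penalizes the many-valued feature:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    feature, labels = np.asarray(feature), np.asarray(labels)
    after, split_info = 0.0, entropy(feature)      # split_info = H_A(D)
    for v in np.unique(feature):
        mask = feature == v
        after += mask.mean() * entropy(labels[mask])
    gain = entropy(labels) - after
    return gain / split_info                        # penalty parameter = 1 / H_A(D)

labels  = np.array([1, 0, 1, 0, 1, 0])
id_like = np.array([0, 1, 2, 3, 4, 5])   # many values: large H_A(D), heavy penalty
coarse  = np.array([0, 0, 0, 1, 1, 1])   # few values: small H_A(D), light penalty
print(gain_ratio(id_like, labels), gain_ratio(coarse, labels))
```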

Differences:

The ID3 algorithm uses information gain for feature selection. Information gain reflects the reduction in uncertainty once a feature's value is known: the more values a feature has, the higher the certainty after splitting on it, so ID3 tends to prefer features with many values and generalizes poorly.

The C4.5 algorithm uses the information gain ratio to select features, which to some extent penalizes features with many values, but it in turn tends to prefer features with fewer values.

ID3 can only handle discrete variables, while the other two can handle continuous variables. When C4.5 handles a continuous variable, it sorts the data, takes the boundaries between different classes as candidate split points, and converts the continuous attribute into a Boolean attribute according to the chosen split point, thereby turning the continuous variable into a discrete one over multiple value intervals. Since CART always performs binary splits on features during construction (the binary cutting method), it applies naturally to continuous variables.

ID3 and C4.5 can produce multi-way splits at each node, and a feature is not reused across levels, whereas CART produces only two branches per node, so the final result is a binary tree, and each feature can be reused.

ID3 and C4.5 use pruning to balance the accuracy and generalization ability of the tree, while CART directly uses all the data to compare all possible tree structures.

ID3 and C4.5 can only be used for classification, and CART can also be used for regression (the regression tree uses the least square error criterion).

2. Reasons for over-fitting and how to prevent it

Take the decision tree as an example:

There are several reasons for the overfitting phenomenon:

First: during the construction of the decision tree, its growth is not reasonably restricted (no pruning);
Second: too many variables are used in the modeling process, and more variables make overfitting more likely;
Third: the sample contains noisy data, which strongly interferes with the construction of the decision tree and is not effectively removed.

To prevent overfitting, the following methods can be used:

First: prune the tree with reasonable parameters; pruning is divided into pre-pruning and post-pruning, and post-pruning is generally used;
Second: K-fold cross-validation: divide the training set into K parts and perform K rounds of validation, each time using K-1 parts as the training data and the remaining part as the test set (see the sketch after this list);
Third: reduce the number of features: compute the correlation between each feature and the target variable (the Pearson correlation coefficient is a common choice) and remove weakly correlated variables. There are of course other feature-selection methods, such as feature selection based on decision trees or via regularization.
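A minimal sketch of the first two points (assuming scikit-learn is available): restricting tree depth as a simple pre-pruning step and checking it with 5-fold cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unrestricted tree tends to overfit; limiting max_depth is a simple pre-pruning step.
unrestricted = DecisionTreeClassifier(random_state=0)
pruned = DecisionTreeClassifier(max_depth=3, random_state=0)

for name, clf in [("unrestricted", unrestricted), ("max_depth=3", pruned)]:
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(name, scores.mean())
```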

3. Principles and formula derivation of several models (SVM, LR, GBDT, EM)

1. SVM principle

The goal of SVM is to find an optimal hyperplane in the feature space that separates the data into two classes. The optimal hyperplane maximizes the distance from the nearest points to it; those nearest points are called support vectors. SVM can be used for two-class or multi-class problems.

Specific formula derivation:
https://blog.csdn.net/qq_22613769/article/details/106723679
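A minimal usage sketch (assuming scikit-learn; the generated blobs are just toy data): a linear SVM whose `support_vectors_` attribute exposes the points that define the maximum-margin hyperplane.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)       # linear SVM, no kernel trick
print("number of support vectors:", len(clf.support_vectors_))
print("predicted class:", clf.predict(X[:1]))
```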

2. The principle of LR (logistic regression)

The core idea of logistic regression is to fit a linear combination of the features and pass it through the sigmoid function to model the probability that a sample belongs to the positive class; the resulting linear decision boundary is the one that best separates the data.

Formula derivation:
https://www.cnblogs.com/mantch/p/10135708.html
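A minimal sketch (assuming scikit-learn; the synthetic data is illustrative): logistic regression as a linear classifier trained by maximizing the (regularized) likelihood.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)
# The model outputs class probabilities via the sigmoid of a linear score.
print("predicted probability of class 1:", clf.predict_proba(X[:1])[0, 1])
```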

2.5 Similarities and differences between SVM and LR

Similarities:

  1. Both SVM and LR are classification algorithms
  2. Without a kernel function, both are linear classifiers
  3. Both are supervised learning algorithms
  4. Both are discriminative models

Differences:

  1. The biggest difference is the loss function (SVM: maximize the geometric margin around the support vectors; LR: maximum likelihood estimation)
  2. The data considered differ (SVM: only the support vectors near the hyperplane; LR: all data)
  3. For nonlinear problems (SVM: kernel functions; LR: kernel functions are usually not used)
  4. Data standardization (SVM: required, because distances are computed; LR: not required)
  5. Regularization (SVM: has a built-in regularization term; LR: a regularization term must be added), which is why SVM is a structural risk minimization algorithm. Structural risk minimization means seeking a balance between training error and model complexity to prevent overfitting and thus minimize the true error. (A small loss-function comparison follows this list.)
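A hedged numerical sketch of the loss-function difference in item 1 (plain NumPy assumed), for a label y in {-1, +1} and a raw model score f(x):

```python
import numpy as np

def hinge_loss(y, score):
    # SVM: only points inside the margin (y * score < 1) contribute.
    return np.maximum(0.0, 1.0 - y * score)

def log_loss(y, score):
    # LR: every point contributes; equivalent to the negative log-likelihood.
    return np.log(1.0 + np.exp(-y * score))

scores = np.array([-2.0, -0.5, 0.5, 2.0])
y = 1
print("hinge:", hinge_loss(y, scores))  # exactly zero for points far beyond the margin
print("log  :", log_loss(y, scores))    # small but nonzero even for confident predictions
```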

3. Principle of GBDT

GBDT stands for Gradient Boosting Decision Tree.
The output of a GBDT model is the sum of the decision trees it contains. Each decision tree fits the residual of the combined prediction of the preceding trees, i.e. it "corrects" the result of the previous model. GBDT is an iterative decision-tree algorithm: the model consists of multiple decision trees, and the conclusions of all trees are added up to give the final answer. Gradient boosting trees can be used for regression problems (in which case CART regression trees are used) and for classification problems (classification trees).

Algorithm steps:

  1. Initialize with the constant value that minimizes the loss function; this is a tree with only a root node, i.e. gamma is a constant.
  2. For each iteration:
    (a) compute the negative gradient of the loss function at the current model and use it as an estimate of the residual;
    (b) fit a regression tree to this residual estimate to obtain the leaf-node regions;
    (c) use a line search to estimate the value of each leaf-node region so as to minimize the loss function;
    (d) update the regression tree.
  3. Obtain the final output model f(x). (A minimal sketch follows the derivation link below.)

Formula derivation: https://www.jianshu.com/p/005a4e6ac775
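A minimal from-scratch sketch of the steps above for squared-error loss (assuming scikit-learn for the base regression trees; the sine data is illustrative). With squared error, the negative gradient is simply the residual y - f(x), and the regression tree's leaf means already perform the line-search step.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

f = np.full(len(y), y.mean())              # step 1: initialize with a constant
trees, learning_rate = [], 0.1
for m in range(100):
    residual = y - f                        # step 2(a): negative gradient = residual
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # steps 2(b)/(c)
    trees.append(tree)
    f += learning_rate * tree.predict(X)    # step 2(d): update the model
print("training MSE:", np.mean((y - f) ** 2))
```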

4. Principle of EM (Maximum Expectation)

EM stands for the Expectation-Maximization algorithm. It is an iterative algorithm used for maximum likelihood estimation or maximum a posteriori estimation in probabilistic models that contain hidden (latent) variables.

The core idea of the EM algorithm is simple and consists of two alternating steps: the Expectation step (E-step) and the Maximization step (M-step). The E-step uses the observed data and the current model parameters to estimate the hidden variables and then computes the expected value of the log-likelihood under that estimate; the M-step finds the parameters that maximize this expected likelihood. Since each iteration is guaranteed not to decrease the likelihood, the algorithm eventually converges.

Formula derivation: https://zhuanlan.zhihu.com/p/78311644
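A minimal sketch (plain NumPy assumed; the mixture data is synthetic) of EM for a two-component 1-D Gaussian mixture, where the hidden variable is the component assignment of each point:

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

pi, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])  # initial guesses

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibility of component 1 for each point
    p1 = pi * normal_pdf(x, mu[0], var[0])
    p2 = (1 - pi) * normal_pdf(x, mu[1], var[1])
    r = p1 / (p1 + p2)
    # M-step: re-estimate parameters that maximize the expected log-likelihood
    pi = r.mean()
    mu = np.array([(r * x).sum() / r.sum(), ((1 - r) * x).sum() / (1 - r).sum()])
    var = np.array([(r * (x - mu[0]) ** 2).sum() / r.sum(),
                    ((1 - r) * (x - mu[1]) ** 2).sum() / (1 - r).sum()])

print("estimated means:", mu)
```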

5. XGBoost principle

XGBoost is an improvement on the gradient boosting algorithm: Newton's method is used to find the extremum of the loss function, so the loss function is Taylor-expanded to second order, and a regularization term is added to the loss function. The training objective therefore consists of two parts: the loss of the gradient boosting algorithm and the regularization term.

XGBoost is essentially a GBDT, but strives to maximize speed and efficiency.

The core algorithmic idea of XGBoost is:

  1. Keep adding trees and performing feature splits to grow them. Each newly added tree learns a new function f(x) that fits the residual of the previous prediction.
  2. After training we have k trees. To predict the score of a sample, the sample's features send it to one leaf node in each tree, and each leaf node corresponds to a score.
  3. Finally, the scores from all trees are added up to give the predicted value of the sample.
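A minimal usage sketch (assuming the xgboost package is installed; parameter values are illustrative): training a small model whose prediction is the sum of the tree scores, as described above.

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = xgb.XGBClassifier(
    n_estimators=50,     # number of trees k
    max_depth=3,
    learning_rate=0.1,
    reg_lambda=1.0,      # L2 regularization on leaf weights (the regularization term)
)
clf.fit(X, y)
print("predicted class:", clf.predict(X[:1]))
```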

How is XGBoost different from GBDT?

In addition to the algorithmic differences from traditional GBDT, XGBoost has also done a lot of optimizations in engineering implementation. In general, the differences and connections between the two can be summarized into the following aspects:

  1. GBDT is a machine learning algorithm, and XGBoost is an engineering implementation of this algorithm.
  2. When CART is used as the base classifier, XGBoost explicitly adds a regularization term (the number of leaf nodes plus L2 regularization of the leaf-node weights) to control the complexity of the model, which helps prevent overfitting and improves the model's generalization ability.
  3. GBDT only uses the first-order derivative information of the cost function during model training. XGBoost performs a second-order Taylor expansion on the cost function, and can use both the first-order and second-order derivatives at the same time.
  4. Traditional GBDT uses CART as the base classifier, and XGBoost supports multiple types of base classifiers, such as linear classifiers.
  5. The traditional GBDT uses all the data in each iteration, while XGBoost uses a strategy similar to that of a random forest and supports sampling of the data.
  6. Traditional GBDT is not designed to handle missing values; XGBoost can automatically learn a strategy for handling them.

For details, see: https://www.cnblogs.com/mantch/p/11164221.html

6. RF (random forest) principle

Random forest, as the name implies, builds a forest in a random way. The forest contains many decision trees, and the trees in a random forest are not correlated with one another. After the forest is built, when a new sample arrives, each decision tree in the forest classifies it separately (for classification); the class that receives the most votes is the predicted class of the sample.

The two sources of randomness in a random forest:

  1. Random sampling (bootstrap): the bootstrap method draws, with replacement, k new bootstrap sample sets from the original data, and k classification trees (ID3, C4.5, CART) are built from them; this is sample perturbation. Note that this differs from GBDT's subsampling: GBDT subsamples without replacement, while Bagging samples with replacement.
  2. Random attributes: at each split, randomly select a subset of k attributes and then choose the optimal attribute from this subset for the split; this is attribute perturbation.
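A minimal sketch (assuming scikit-learn; the synthetic data is illustrative) showing both sources of randomness: each tree is trained on a bootstrap sample and considers a random attribute subset at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # size of the random attribute subset at each split
    bootstrap=True,        # random sampling with replacement
    random_state=0,
).fit(X, y)
print("majority-vote prediction:", clf.predict(X[:1]))
```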

4. The principle, advantages, disadvantages and improvements of K-means

Principle:
The idea of the K-Means algorithm is very simple: for a given sample set, divide it into K clusters according to the distances between samples, so that points within a cluster are as close together as possible and the distance between clusters is as large as possible.

Expressed mathematically, assuming the samples are divided into clusters (C_1, C_2, ..., C_k) with cluster centers mu_1, ..., mu_k, the goal is to minimize the squared error E = sum over i of sum over x in C_i of ||x - mu_i||^2.
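A minimal from-scratch sketch (plain NumPy assumed; the two-blob data is illustrative) of Lloyd's algorithm, which alternately assigns points to the nearest center and recomputes centers to reduce the squared error E above:

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
k = 2
centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers

for _ in range(20):
    # assignment step: each point goes to its nearest center
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    # update step: each center becomes the mean of its assigned points
    centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])

E = ((X - centers[labels]) ** 2).sum()
print("squared error E:", E)
```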

The main advantages of K-Means are:

1) The principle is relatively simple, the implementation is also very easy, and the convergence speed is fast.

2) The clustering effect is better.

3) The interpretability of the algorithm is relatively strong.

4) The main parameter that needs to be adjusted is only the number of clusters k.

The main disadvantages of K-Means are:

1) The choice of K is hard to get right (improvement: start by giving k a suitable, relatively large value and run K-means to obtain cluster centers; then, based on the distances between the obtained centers, merge the closest clusters, so the number of cluster centers decreases and the next clustering run uses a correspondingly smaller number of clusters, eventually arriving at an appropriate number. A criterion value E can be used to decide when to stop merging cluster centers; repeat this loop until the criterion converges, finally obtaining a clustering result with a better number of clusters).

2) It is difficult to converge to a good result on data sets that are not convex (improvement: density-based clustering algorithms such as DBSCAN are more suitable).

3) If the hidden categories are unbalanced, for example if the amount of data per category is severely unbalanced or the variances of the categories differ, the clustering result is poor.

4) Using an iterative method, the result obtained is only a local optimum.

5) Sensitive to noise and outliers (improvement 1: use the LOF outlier-detection algorithm to remove outliers before clustering, which reduces their impact on the clustering result; improvement 2: use the median of the points instead of the mean; this variant is K-Medoids (K-median) clustering).

6) Sensitive to the choice of the initial cluster centers (improvement 1: k-means++; improvement 2: bisecting K-means).

For details, please see: https://blog.csdn.net/u014465639/article/details/71342072

5. Bagging and Boosting

Bagging and Boosting are both model-ensemble (fusion) methods: weak classifiers are combined into a strong classifier, and the combined model usually performs better than the best individual weak classifier.

1. Bagging

Bagging is the bootstrap aggregating ("bagging") method; the algorithm proceeds as follows:

  1. Draw training sets from the original sample set using bootstrapping: in each round, n training samples are drawn from the original data with replacement (some samples may be drawn many times and some not at all). A total of k rounds are performed, yielding k training sets that are independent of each other.
  2. Each training set is used to train one model, so k training sets yield k models in total. (Note: no specific classification or regression algorithm is prescribed here; different methods, such as decision trees or perceptrons, can be used depending on the problem.)
  3. For classification problems, the k models obtained in the previous step vote to produce the final class; for regression problems, the mean of the models' outputs is taken as the final result. (All models carry equal weight; see the sketch after this list.)
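A minimal sketch (assuming scikit-learn; the synthetic data is illustrative): bagging of decision trees, where each model is trained on a bootstrap sample and the final class is decided by voting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = BaggingClassifier(
    n_estimators=10,   # k models, each trained on a bootstrap sample
    random_state=0,
).fit(X, y)            # the default base model is a decision tree
print("voted prediction:", clf.predict(X[:1]))
```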

2. Boosting

The AdaBoost-style method uses all the samples in every round and changes the sample weights in each round of training. The goal of the next round is to find a function f that fits the residual of the previous round; training stops when the residual is small enough or the maximum number of iterations is reached. Boosting lowers the weights of the samples that were classified correctly in the previous round and raises the weights of the misclassified ones (correct samples have small residuals, wrong samples have large residuals).

The gradient boosting variant instead uses the partial derivative of the cost function with respect to the model f trained in the previous round (the negative gradient) to fit the residual.
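A minimal sketch (assuming scikit-learn; the synthetic data is illustrative) of the AdaBoost-style approach, which reweights samples each round so later trees focus on previously misclassified points:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print("boosted prediction:", clf.predict(X[:1]))
```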

The difference between Bagging and Boosting: https://blog.csdn.net/qq_24753293/article/details/81067692

To be continued...


Origin blog.csdn.net/weixin_44414948/article/details/114867459