Common problems in machine learning (2)

1. What are the disadvantages of the KNN algorithm?
(1) The computational cost is very high.
① KNN must compute the distance between the point being classified and every training point, which is time-consuming; ② KNN must keep the entire training set, so a large training set requires a lot of storage space.
(2) It cannot handle categorical variables directly.
(3) It is very sensitive to the scaling of variables.
(4) It has difficulty handling variables with different units and different numerical ranges.
(5) It performs poorly on high-dimensional data.
(6) It has poor interpretability, and the choice of the decision rule (the distance metric) is itself a problem.
(7) Unlike a decision tree, KNN offers no natural way to set k; k always walks a line between overfitting and underfitting. A k chosen by cross-validation is applied globally, so KNN inevitably overfits in some regions and underfits in others. Moreover, KNN cannot learn which features are important and which are not, so feature selection must be done first; otherwise irrelevant features will degrade the classification.
(8) If the k-nearest-neighbor model is used for regression, a more obvious defect is that KNN cannot extrapolate beyond the training sample. Because the prediction is the average of the target values of the surrounding neighbors, the predicted value can never exceed the maximum target value in the training sample, nor fall below the minimum. Regression trees share this disadvantage.
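As a minimal sketch of point (8) (scikit-learn assumed; the data and values are made up for illustration): a KNN regression prediction is the average of neighbouring target values, so it never goes beyond the range of targets seen in training.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = np.linspace(0, 10, 100).reshape(-1, 1)
y_train = 3 * X_train.ravel() + rng.normal(0, 1, 100)   # roughly y = 3x

knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# A query far outside the training range: the prediction is the mean of the
# 5 nearest training targets, so it stays near max(y_train) ~ 30, not ~ 300.
print(knn.predict(np.array([[100.0]])))
print(y_train.max())
```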

2. What does nonparametric model mean? Which models are considered nonparametric?
If a machine learning model is determined by only a limited number of parameters, then the model is a parametric model. "A limited number" here means a small number that does not change with the number of samples: before seeing the data, we already know how many parameters need to be estimated.
A parametric model has a relatively simple structure and only needs to estimate a small number of parameters. This is usually because the model makes strong assumptions about the underlying probability distribution before the parameters are estimated.
For example, linear regression assumes a linear relationship and normally distributed residuals.
A Gaussian mixture model assumes that each cluster follows a Gaussian distribution.
Logistic regression is another example.

If a machine learning model cannot be determined by a limited number of parameters in this sense, then the model is a non-parametric model.
For example, the k-nearest-neighbor model is non-parametric: the model is effectively determined by every data point.
k-means is also non-parametric in the same sense: the model is determined by the data points themselves.
Other examples include decision trees, random forests, and SVMs.
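A rough illustration of the distinction, assuming scikit-learn and synthetic data: linear regression estimates the same small number of coefficients no matter how many samples it sees, while fitting KNN essentially memorizes the entire training set, so its "size" grows with the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

for n in (100, 10_000):
    X = np.random.randn(n, 5)
    y = X @ np.arange(1.0, 6.0) + 0.1 * np.random.randn(n)

    lr = LinearRegression().fit(X, y)
    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

    # Parametric: 5 coefficients + 1 intercept, regardless of n.
    print(n, "parameters in linear regression:", lr.coef_.size + 1)
    # Nonparametric: the fitted model is essentially the stored data itself.
    print(n, "training points stored by KNN:", knn.n_samples_fit_)
```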

3. What is the difference between hyperparameter and parameter?
The parameters are usually obtained automatically from the training set data during the model training process.
Hyperparameters are usually set manually before the model is trained; their purpose is to help the model learn better parameters during training.
When we talk about parameter tuning, we generally refer to hyperparameter tuning.
Taking LASSO regression as an example, the coefficients in the regression model are parameters, and the penalty coefficient of the regularization term is a hyperparameter.
Simply put, the parameters inside the model are parameters, and the parameters input from the outside are hyperparameters.
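A small sketch of the LASSO example (scikit-learn assumed; the alpha grid is illustrative): alpha, the penalty strength, is the hyperparameter supplied from outside, while the fitted coefficients are the parameters learned from the data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X = np.random.randn(200, 10)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * np.random.randn(200)

# Hyperparameter tuning searches over alpha, not over the coefficients.
search = GridSearchCV(Lasso(), param_grid={"alpha": [0.01, 0.1, 1.0]}, cv=5)
search.fit(X, y)

print("chosen hyperparameter:", search.best_params_)        # e.g. {'alpha': 0.01}
print("learned parameters:", search.best_estimator_.coef_)  # regression coefficients
```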

4. What does data leakage mean?
Data leakage means using data or information that should not be used, for example:
(1) Using data or information from the test set when training the model.
(2) Using future information when predicting the present.
(3) When tuning parameters with cross-validation, letting information from the validation set participate in building the model.
To elaborate on the third point: when standardizing features, for example, the correct approach is to fit the standardization on the training set and then apply it to the validation set, rather than standardizing first and then splitting off the validation set. Likewise, to apply PCA for dimensionality reduction, fit PCA on the training set and then apply it to the validation set, rather than running PCA on the entire data set. This is often overlooked.
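A sketch of how to avoid the leakage described in point (3), assuming scikit-learn: putting the scaler and PCA inside a Pipeline ensures they are fit on each training fold only, never on the validation fold or on the full data set.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.randn(300, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),    # fit on the training fold only
    ("pca", PCA(n_components=5)),   # likewise
    ("clf", LogisticRegression()),
])

# cross_val_score refits the whole pipeline on each training fold, so no
# information from the corresponding validation fold leaks into preprocessing.
print(cross_val_score(pipe, X, y, cv=5).mean())
```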

5. How does gbdt discretize continuous features?
The discretization of continuous features is done by the decision trees themselves, not by the gradient boosting procedure.
A decision tree does it as follows:
(1) First sort the continuous variable. For age, for example, sort the age values of all samples from smallest to largest.
(2) Assuming there are no repeated values, n samples give n-1 intervals between the sorted values, i.e. n-1 candidate split points.
(3) The decision tree tries each of these n-1 candidate splits in turn. Each split yields a different Gini index, and the split with the smallest Gini index is chosen as the optimal split.
That is the theory; in practice there are computational optimizations, possibly random searches rather than a full traversal.
The process above splits the continuous variable in two (the first discretization); for example, age is split into 0-20 and 20-100.
Then, as the decision tree keeps growing, the previously split continuous feature may be selected again. For example, if the 20-100 branch is selected, the three steps above are repeated, and this time the result could be 20-35 and 35-100.
Iterating like this, the continuous variable is discretized step by step until the model reaches its stopping condition.
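A minimal sketch of the split search described above (plain NumPy, not actual GBDT library code; the toy age/label data is made up): sort the feature, try the midpoint of every adjacent pair of values, and keep the threshold with the smallest weighted Gini index.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Exhaustively try the n-1 candidate thresholds of a sorted feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(len(x_sorted) - 1):
        threshold = (x_sorted[i] + x_sorted[i + 1]) / 2.0
        left, right = y_sorted[: i + 1], y_sorted[i + 1 :]
        # Weighted Gini index of the two-way split at this threshold.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y_sorted)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

age = np.array([12.0, 18.0, 25.0, 33.0, 41.0, 52.0, 60.0, 70.0])
label = np.array([0, 0, 1, 1, 1, 1, 0, 0])
print(best_split(age, label))   # threshold that best separates the labels
```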

6. Categorical variables, one hot coding, increased dimension, how to deal with it?
There are several ideas to try:
(1) Look at the logic and meaning behind the variable: the categories may be mergeable according to the meaning of the categorical variable itself.
(2) Merge according to the target value. For example, if the target is a 0-1 binary prediction and the target is 1 for 90% of the rows where this variable takes value A and for 89% of the rows where it takes value B, then A and B can be merged before one-hot encoding. A similar approach works when the target is continuous (regression).
(3) Sort the categories by frequency, keep the categories that cumulatively cover 90% or 95% of the rows, and merge the smallest 10% or 5% into a single class (see the sketch after this list).
(4) The hashing trick, i.e. a random merge.
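A sketch of idea (3), assuming pandas; the column name, data, and coverage threshold are illustrative: keep the categories that cumulatively cover most of the rows and collapse the long tail into a single "other" level before one-hot encoding.

```python
import pandas as pd

def merge_rare_categories(series, coverage=0.95):
    """Keep the most frequent categories covering `coverage` of the rows;
    merge everything else into a single 'other' category."""
    freq = series.value_counts(normalize=True)
    kept = set(freq[freq.cumsum() <= coverage].index)
    return series.where(series.isin(kept), other="other")

df = pd.DataFrame({"city": ["NY", "NY", "NY", "LA", "LA", "SF", "Boise", "Fargo"]})
df["city_merged"] = merge_rare_categories(df["city"], coverage=0.7)
print(pd.get_dummies(df["city_merged"]))   # columns: NY, LA, other (instead of five)
```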

7. Using PCA to reduce dimensionality, how many dimensions is more appropriate?
(1) If the goal is data visualization, reduce to 1 dimension (a line), 2 dimensions (a plane), or 3 dimensions (a 3-D space).
(2) If the goal is building a predictive model, the more common approach is to keep as many principal components as needed to explain a given percentage of the variance, such as 99%, 95%, or 90% (see the sketch after this list).
(3) Another method is Kaiser's rule, which retains all components with eigenvalues greater than 1.
(4) There is also a method similar to the elbow method: plot the number of principal components against the percentage of variance explained and find the elbow point.
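A sketch of approaches (2) and (4), assuming scikit-learn and synthetic data: compute the cumulative explained-variance ratio and pick the smallest number of components reaching, say, 95%.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(500, 30) @ np.random.randn(30, 30)   # correlated features

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print("components needed to explain 95% of the variance:", n_components)

# scikit-learn can also do this directly: PCA(n_components=0.95) keeps just
# enough components to explain 95% of the variance.
```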

8. Besides PCA, what other dimensionality reduction methods are there?
(1) High correlation
If the correlation between two features is greater than a certain threshold (set by yourself, such as 0.3), delete one of them (a combined sketch of (1)-(3) follows this list).
(2) Low variance
If the data variance of a feature is less than a certain threshold (set by yourself), delete it.
(3) Missing
If there are many missings in this column, delete it.
(4) Random Forests
After Random Forests training, the importance of all features can be returned, and we can select a part of the features with the highest importance, such as 20%.
(5) Stepwise selection
Select features step by step, either by forward selection or backward elimination.
(6) Random projection
Similar to PCA, but the projection is random rather than orthogonal as in PCA.
(7) t-SNE
t-SNE is a "distance-preserving" transformation from a high-dimensional space to a low-dimensional space. If two points have a "distance" of 1 in a 100-dimensional space, we want a mapping into a low-dimensional (say 2-dimensional) space in which their distance is also roughly 1. The effect is that points far apart in the original space stay far apart in the new low-dimensional space, and points close together stay close together. This "distance" is not a true distance but a similarity: the similarity of two points is computed mainly from their Euclidean distance, with some normalization applied, and a t-distribution is assumed in the low-dimensional space. The mapping starts from a random initialization of the low-dimensional points and is then optimized so that the two sets of "distances" match.
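A combined sketch of the simple filter methods (1)-(3) above, assuming pandas; the thresholds and the helper name simple_filters are illustrative: drop columns with too many missing values, drop near-constant columns, and drop one feature from each highly correlated pair.

```python
import numpy as np
import pandas as pd

def simple_filters(df, corr_threshold=0.9, var_threshold=1e-3, missing_threshold=0.5):
    df = df.copy()

    # (3) Missing: drop columns with too large a fraction of missing values.
    df = df.loc[:, df.isna().mean() <= missing_threshold]

    # (2) Low variance: drop near-constant numeric columns.
    variances = df.var(numeric_only=True)
    df = df.drop(columns=variances[variances < var_threshold].index)

    # (1) High correlation: for each highly correlated pair, drop the later column.
    corr = df.corr(numeric_only=True).abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)

df = pd.DataFrame(np.random.randn(100, 4), columns=list("abcd"))
df["e"] = 0.99 * df["a"] + 0.01 * np.random.randn(100)   # nearly duplicates "a"
print(simple_filters(df).columns.tolist())               # "e" is dropped
```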
