[Machine learning notes] Summary of key points of machine learning knowledge

Machine learning knowledge summary

 

1. What are the common categories of machine learning, and which algorithms are commonly used in each?

Machine learning is divided into four types, namely supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning.

(1) Supervised Learning: every training sample fed to the algorithm has a corresponding expected value, i.e. the target value. The learning process is essentially the process of mapping feature values to the target column: a model is trained on the features of historical data together with their known outcomes. The training data for a supervised learning algorithm therefore has to consist of feature values plus a target column.

 

Because supervised learning relies on every sample being labeled, the exact target value that each feature vector maps to is known, so it is commonly used in regression and classification scenarios. Common supervised learning algorithms are listed below:

 

  •     Classification algorithms: K-Nearest Neighbors (KNN), Naive Bayes, Decision Tree, Random Forest, GBDT (Gradient Boosting Decision Tree), Support Vector Machine (SVM), etc.

  •     Regression algorithms: Logistic Regression, Linear Regression, etc.

 

One problem with supervised learning is that the cost of obtaining the target value is relatively high.

 

(2) Unsupervised Learning: a class of machine learning algorithms that do not rely on labeled training samples; it is mainly used in clustering scenarios. Common unsupervised learning algorithms are listed below:

 

  •     Clustering algorithms: K-Means (k-means clustering), DBSCAN (Density-Based Spatial Clustering of Applications with Noise), etc.

  •     Recommendation algorithms: Collaborative Filtering, etc.

 

 

Compared with supervised learning, a major advantage of unsupervised learning is that it does not depend on labeled data.

 

(3) Semi-supervised Learning: only part of the training samples are labeled. Many semi-supervised learning algorithms are variants of supervised learning algorithms.

 

(4) Reinforcement Learning: a more complex type of machine learning, in which the system continuously interacts with the environment, obtains feedback from it, and then decides its own behavior.

 

In summary, supervised learning mainly addresses classification and regression scenarios, unsupervised learning mainly addresses clustering scenarios, semi-supervised learning addresses classification scenarios where labels are hard to obtain, and reinforcement learning mainly targets scenarios that require continuous reasoning. The specific classification is listed below:

 

  •     Supervised learning: logistic regression, K-nearest neighbors, naive Bayes, random forest, support vector machine

  •     Unsupervised learning: K-means, DBSCAN, collaborative filtering, LDA

  •     Semi-supervised learning: label propagation

  •     Reinforcement learning: hidden Markov models

 

2. The difference between supervised learning and unsupervised learning

Supervised learning: an optimal model is obtained by training on existing training samples (known data together with their corresponding outputs); this model is then used to map new data samples to output results, and a simple judgment on those outputs achieves classification. Such a model therefore also has the ability to classify unknown data.

Unsupervised learning: no labeled training samples are available in advance, so the data must be modeled directly.

Supervised learning: learns from labeled training samples so as to classify and predict data outside the training set as accurately as possible (LR, SVM, BP neural networks, RF, GBDT).

 

3. The causes and solutions of overfitting

If you blindly chase predictive accuracy on the training data, the complexity of the selected model tends to become very high. This phenomenon is called overfitting: the error during training is small, but the error during testing is large.

Causes:

(1) Problems with the sample data

  •     The sample size is too small
  •     The sampling method is wrong, so the sampled data cannot adequately represent the business logic or business scenario; for example, the population follows a normal distribution but is sampled as if it were uniform, or the sample simply does not represent the distribution of the overall data
  •     The noise in the sample data is too disruptive

(2) Problems with the model

  •     The model complexity is too high, with too many parameters
  •     The decision tree model is not pruned
  •     Too many weight-learning iterations (overtraining), so the model fits the noise in the training data and the unrepresentative features of the training examples

Solutions:

(1) Sample data:

  •     Increase the number of samples, reduce the sample dimensionality, and add validation data
  •     Use a sampling method that is consistent with the business scenario
  •     Clean the noisy data

(2) Model or training:

  •     Control the complexity of the model: prefer simple models, or use model-fusion (ensemble) techniques.
  •     Use prior knowledge by adding regularization terms. L1 regularization is more likely to produce sparse solutions, while L2 regularization tends to push the parameters w toward 0 (see the sketch after this list).
  •     Cross-validation
  •     Do not overtrain: stop iterating before full convergence when optimizing (early stopping).
  •     Prune the decision tree model
  •     Weight decay
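A minimal sketch of the L1-vs-L2 regularization point above, using scikit-learn; the synthetic data, the regularization strength alpha=1.0, and the choice of Lasso/Ridge are illustrative assumptions, not part of the original notes:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Illustrative synthetic data: only a few of the 50 features are informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 tends to drive many coefficients exactly to zero (sparse solution);
# L2 shrinks coefficients toward zero but rarely makes them exactly zero.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```

On data like this, Lasso typically zeros out most of the uninformative coefficients, while Ridge keeps them small but non-zero, which is exactly the contrast between sparse solutions and shrinkage toward 0.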

 

4. The difference between linear and non-linear classifiers, and their advantages and disadvantages

If the model is a linear function of its parameters and its decision surface is linear, it is a linear classifier; otherwise it is not.

Common linear classifiers are: LR, Bayesian classification, single-layer perceptron, linear regression

Common non-linear classifiers: decision tree, RF, GBDT, multi-layer perceptron. SVM can be either, depending on whether a linear or a Gaussian kernel is used.

Linear classifiers are fast and easy to implement, but their fitting ability may be limited.

Non-linear classifiers are more complicated to implement, but their fitting ability is strong.
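A minimal sketch of this contrast, fitting an SVM with a linear kernel and with a Gaussian (RBF) kernel on data that is not linearly separable; the dataset and hyperparameters are illustrative assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative data: two interleaving half-moons, not separable by a straight line
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

# The linear decision surface underfits this data; the RBF kernel fits it much better.
print("Linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))
```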

 

5. The difference between LR (Logistic Regression) and Linear SVM

Linear SVM and LR are both linear classifiers

Linear SVM does not directly depend on the data distribution: its separating plane is not affected by every point of a class, only by the support vectors. LR is affected by all data points, so if the classes are strongly imbalanced, the data generally needs to be balanced first.

Linear SVM depends on the distance metric of the data representation, so the data needs to be normalized first; LR is not affected by this. Linear SVM also depends on the penalty coefficient, which needs to be tuned with validation experiments.

The performance of both Linear SVM and LR is affected by outliers; in terms of sensitivity, it is hard to draw a clear conclusion about which is better.
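A small sketch of the practical differences above: the Linear SVM is wrapped in a scaling step and its penalty coefficient C is tuned by validation, while LR compensates for class imbalance via class weighting. The dataset, parameter grid, and weighting choice are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Illustrative imbalanced two-class problem
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Linear SVM: normalize first, then tune the penalty coefficient C by cross-validation
svm_pipe = make_pipeline(StandardScaler(), LinearSVC(dual=False))
svm_search = GridSearchCV(svm_pipe, {"linearsvc__C": [0.01, 0.1, 1, 10]}, cv=5)
svm_search.fit(X, y)

# LR: handle the strong class imbalance by re-weighting the classes
lr = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

print("Best C for Linear SVM:", svm_search.best_params_)
print("LR training accuracy:", lr.score(X, y))
```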

 

6. Common classification algorithms

SVM, neural network, decision tree, random forest, logistic regression, KNN, Bayes

 

7. Comparison of SVM, LR and decision tree

Model complexity: SVM supports kernel functions and can handle both linear and non-linear problems; LR is a simple model with fast training, suitable for linear problems; decision trees overfit easily and need pruning.

Loss function: SVM uses hinge loss; LR uses logistic (log) loss, usually with L2 regularization; AdaBoost uses exponential loss.

Data sensitivity: SVM with a slack (tolerance) term is not sensitive to outliers and only cares about the support vectors, but the data needs to be normalized first; LR is sensitive to far-away points.

Data volume: with large amounts of data, use LR; with small amounts of data and few features, an SVM with a non-linear kernel can be used.

 

8. Distance measurement in clustering algorithms

The distance measure used in clustering algorithms is generally the Minkowski distance, which corresponds to different distances for different values of p: the Manhattan distance when p = 1, the Euclidean distance when p = 2, and the Chebyshev distance when p = inf. Other usable measures include the Jaccard distance, the power distance (a more general form of the Minkowski distance), cosine similarity, the weighted distance, and the Mahalanobis distance (a kind of weighted distance). A distance measure must satisfy non-negativity, identity, symmetry, and the triangle inequality; the Minkowski distance satisfies these properties when p >= 1. For discrete attributes such as {airplane, train, ship}, distances cannot be computed directly on the attribute values; these are called unordered attributes, and the VDM (Value Difference Metric) can be used instead. The VDM distance between two discrete values a and b of attribute u is defined as: VDM_p(a, b) = sum over clusters i of | m(u,a,i) / m(u,a) - m(u,b,i) / m(u,b) |^p

Here m(u,a,i) denotes the number of samples in the i-th cluster whose value on attribute u is a, and m(u,a) is the total number of samples whose value on attribute u is a. When different attributes have different importance in the sample space, a weighted distance can be used; if all attributes are considered equally important, the features should generally be normalized. In general, a distance is used as a similarity measure: the larger the distance, the smaller the similarity. A distance used for similarity does not necessarily have to satisfy all the properties of a distance metric, in particular the triangle inequality: for example, "human" and "centaur" are close, and "horse" and "centaur" are close, yet "human" and "horse" can be far apart.
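A short sketch of the Minkowski family of distances; the example vectors are arbitrary and SciPy is assumed to be available:

```python
import numpy as np
from scipy.spatial import distance

# Illustrative vectors
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

manhattan = distance.minkowski(x, y, p=1)   # p = 1: Manhattan distance
euclidean = distance.minkowski(x, y, p=2)   # p = 2: Euclidean distance
chebyshev = distance.chebyshev(x, y)        # p -> inf: Chebyshev distance

print(manhattan, euclidean, chebyshev)
print(distance.cityblock(x, y) == manhattan)   # cityblock is the same as Manhattan
print(np.max(np.abs(x - y)) == chebyshev)      # Chebyshev = max coordinate difference
```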

 

9. Explain the Bayes formula and how naive Bayes classification works

Bayes formula: P(c | x) = P(c) P(x | c) / P(x), where c is a class and x is a feature vector.

The Bayes optimal classifier, which minimizes the classification error, is equivalent to maximizing the posterior probability P(c | x).

The main difficulty in estimating the posterior probability from the Bayes formula is that the class-conditional probability P(x | c) is a joint probability over all attributes, which is hard to estimate directly from a limited set of training samples. The naive Bayes classifier therefore adopts the attribute conditional independence assumption: given the class, all attributes are assumed to be independent of each other. Under this assumption the naive Bayes classifier becomes h(x) = argmax over c of P(c) * product over attributes i of P(x_i | c); given enough independent and identically distributed samples, P(c) and each P(x_i | c) can be estimated directly from sample counts.

 

For discrete attributes, the prior probability and the conditional probabilities can be estimated from sample counts; for continuous attributes, an assumed probability density function can be used with maximum likelihood estimation. Naive Bayes can therefore handle both continuous and discrete variables. If the estimates are based directly on counts of occurrences, a count of 0 makes the whole product 0, so smoothing methods such as Laplace correction are generally used.
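A minimal sketch of naive Bayes with Laplace correction on discrete count features, using scikit-learn; the toy data is an illustrative assumption, and alpha=1.0 is the Laplace smoothing term:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Illustrative discrete count features (e.g. word counts) for two classes
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 3],
              [0, 2, 2]])
y = np.array([0, 0, 1, 1])

# alpha=1.0 is the Laplace correction: it prevents any conditional probability
# from being estimated as exactly 0, which would zero out the whole product.
clf = MultinomialNB(alpha=1.0).fit(X, y)

# Predicted class = argmax over classes c of P(c) * product_i P(x_i | c)
print(clf.predict([[1, 0, 2]]))
print(clf.predict_proba([[1, 0, 2]]))
```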

 

10. Why do some machine learning models need to normalize the data?

http://blog.csdn.net/xbmatrix/article/details/56695825

Normalization means transforming the data to be processed (through some algorithm) so that it is restricted to a certain range.

1) After normalization, gradient descent finds the optimal solution faster. The contours of the loss become more regular, so gradient descent converges more quickly; without normalization, the descent path tends to zigzag and may converge very slowly or not at all.

2) Converting dimensional quantities into dimensionless ones may also improve accuracy. Some classifiers need to compute distances between samples (e.g. the Euclidean distance in KNN); if one feature has a very large value range, the distance is dominated by that feature, which may contradict the actual situation (for example, when the feature with the smaller range is actually the more important one).

3) Some models, such as logistic regression, have prior assumptions that the data follow a normal distribution.

The types of normalization are: linear normalization, standard deviation normalization, nonlinear normalization
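A small sketch showing why distance-based models such as KNN benefit from scaling; the synthetic data, in which one feature is given a much larger range, is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; blow up one feature's range so it dominates the Euclidean distance
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X[:, 0] *= 1000.0

knn_raw = KNeighborsClassifier()
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())

print("KNN without scaling:", cross_val_score(knn_raw, X, y, cv=5).mean())
print("KNN with scaling:   ", cross_val_score(knn_scaled, X, y, cv=5).mean())
```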

 

11. The difference between normalization and standardization?

Normalization:

1) Transform the data into decimals in the interval (0, 1)

2) Turn dimensional expression into dimensionless expression

Common ones include linear conversion, logarithmic function conversion, inverse cotangent function conversion, etc.

Standardization:

Data standardization scales the data so that it falls within a small, specific interval. It is often used when processing comparison and evaluation indicators: it removes the units of the data and converts values into dimensionless pure numbers, so that indicators with different units or magnitudes can be compared and weighted.

1) Min-max normalization (linear transformation)

y = (x - MinValue) / (MaxValue - MinValue) * (new_MaxValue - new_MinValue) + new_MinValue

2) z-score normalization (or zero-mean normalization)

y = (x - mean of X) / standard deviation of X

3) Decimal scaling normalization: normalization by moving the decimal position of X

y = x / 10^j (where j is the smallest integer such that Max(|y|) < 1)

4) Logistic (sigmoid) mode:

New data = 1 / (1 + e ^ (-original data))

5) Fuzzy quantization mode

New data = 1/2 + (1/2) * sin[pi / (maximum value - minimum value) * (x - (maximum value + minimum value) / 2)]
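A compact sketch of the first two formulas above in NumPy; new_MinValue = 0 and new_MaxValue = 1 are assumed for the min-max case:

```python
import numpy as np

# Illustrative data
x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])

# 1) Min-max normalization to [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# 2) z-score standardization: zero mean, unit standard deviation
zscore = (x - x.mean()) / x.std()

print(minmax)                        # values rescaled into [0, 1]
print(zscore.mean(), zscore.std())   # approximately 0 and 1
```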

 

12. Missing value processing of feature vectors

1. If a feature has many missing values, discard it directly; otherwise it may introduce a large amount of noise and adversely affect the result.

2. If a feature has only a few missing values (say, within 10%), several approaches can be used (a short sketch follows this list):

1) Treat NaN itself as a feature value, e.g. encode it as 0;

2) Fill with the mean;

3) Use random forest or other algorithms to predict the missing values and fill them in.
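A minimal sketch of options 1) and 2) with scikit-learn; the toy matrix is an illustrative assumption, and option 3) would replace SimpleImputer with a model-based imputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative feature matrix with missing entries
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Option 1): encode "missing" as a constant value such as 0
constant_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

# Option 2): fill missing entries with the column mean
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

print(constant_filled)
print(mean_filled)
```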

 

13. Stopping conditions for decision trees

 Stop splitting when each leaf node contains records of only one class (this approach overfits easily).

 Alternatively, stop when the number of records in a leaf node falls below a certain threshold, or when the information gain of a split falls below a certain threshold (see the sketch below).
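These stopping conditions map directly onto common decision-tree hyperparameters; a brief sketch with scikit-learn, where the parameter values are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # stop if a leaf would contain fewer than 5 records
    min_impurity_decrease=0.01,  # stop if a split's impurity (information) gain is below this threshold
    random_state=0,
).fit(X, y)

print("Tree depth:", tree.get_depth(), "Leaves:", tree.get_n_leaves())
```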

 

14. What is the difference between GBDT and random forest?

Random forest uses the idea of bagging. Bagging, also called bootstrap aggregating, draws multiple sample sets from the training set by sampling with replacement, trains a base learner on each sample set, and then combines the base learners.

On top of bagging over decision trees, random forest further introduces random attribute selection during tree training. A traditional decision tree selects the optimal attribute from the full attribute set of the current node, whereas a random forest first randomly selects a subset of k attributes for the node and then chooses the best attribute within that subset; k is a parameter that controls the degree of randomness introduced.

In addition, GBDT training is based on the idea of boosting: each iteration updates the model according to the errors of the previous iterations, so the trees must be generated serially, making it a sequential method; random forest follows the bagging idea, so it can be parallelized.
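A brief sketch comparing the two ensembles in scikit-learn; the dataset and hyperparameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging-style ensemble: trees are independent, so training parallelizes (n_jobs=-1);
# max_features plays the role of the attribute-subset size k.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", n_jobs=-1, random_state=0)

# Boosting-style ensemble: each tree corrects the errors of the previous ones, so training is sequential.
gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=0)

print("RF  :", cross_val_score(rf, X, y, cv=5).mean())
print("GBDT:", cross_val_score(gbdt, X, y, cv=5).mean())
```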

 

15. Supervised learning generally uses two types of target variables

Nominal and numeric

Nominal: a nominal target variable takes values only from a finite set, such as true and false (nominal target variables are mainly used for classification).

Numeric: a numeric target variable can take values from an infinite set, such as 0.100, 42.001, etc. (numeric target variables are mainly used for regression analysis).

 

16. How to determine the value of k in K-means?

Choose a suitable clustering index, such as the average radius or diameter of the clusters. As long as the assumed number of clusters is equal to or greater than the true number of clusters, this index rises only slowly, but once we try to cluster with fewer clusters than the true number, the index rises sharply. The diameter of a cluster is the maximum distance between any two points within the cluster; the radius of a cluster is the maximum distance from any point in the cluster to the cluster center.
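A common practical variant of this idea is the elbow method: plot a cluster-quality index against k and look for the point where it stops improving quickly. A minimal sketch using K-means inertia (within-cluster sum of squared distances) as the index, on synthetic data with 4 true clusters; the data and the range of k are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data with 4 true clusters
X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

# Inertia drops sharply until k reaches the true number of clusters,
# then only slowly afterwards: the "elbow" suggests k = 4 here.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```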

 

17. Using Bayesian probability to illustrate the principles of Dropout

Dropout is a model-selection (regularization) technique designed to avoid overfitting during training. The basic approach of Dropout is to randomly zero out dimensions of a layer's input X with a given probability p. It is therefore instructive to discuss how this affects the underlying loss function and the optimization problem.
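A minimal sketch of (inverted) dropout applied to a layer's activations in NumPy; the keep-probability convention and the rescaling by 1/(1-p) are standard practice, assumed here rather than taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, training=True):
    # Randomly zero each element of x with probability p during training (illustrative helper).
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p   # keep an element with probability 1 - p
    return x * mask / (1.0 - p)       # inverted dropout: rescale so the expectation is unchanged

activations = rng.normal(size=(4, 5))
print(dropout(activations, p=0.5))                   # training: some units zeroed, the rest scaled up
print(dropout(activations, p=0.5, training=False))   # inference: unchanged
```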

 

18. What is collinearity and how is it related to overfitting?

Collinearity: in multivariate linear regression, high correlation between variables makes the regression estimates inaccurate and unstable.

Collinearity will cause redundancy and lead to overfitting.

Solutions: remove correlated variables / add regularization (weight penalties)
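A small sketch of how collinearity destabilizes ordinary least squares and how an L2 penalty (ridge regression) counteracts it; the synthetic collinear data and the alpha value are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is almost identical to x1: strong collinearity
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS coefficients become unstable and can take large opposite-signed values;
# the ridge penalty keeps them small and stable.
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```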

 

(updating……)
