[Machine Learning] A comprehensive summary of machine learning knowledge points (supervised learning + unsupervised learning)


A simple rule of thumb is that whether learning is supervised depends on whether the input data carries labels: if the training data is labeled, it is supervised learning; if not, it is unsupervised learning.

By model type, machine learning can be divided into two categories: supervised learning models and unsupervised learning models.

1. Supervised learning

Supervised learning usually uses expert-labeled training data to learn a function mapping an input variable $X$ to an output variable $Y$, that is, $Y = f(X)$. The training data typically consists of $n$ pairs $(x_i, y_i)$, where $n$ is the size of the training sample and $x_i$, $y_i$ are observed values of $X$ and $Y$ respectively.

Supervised learning can be divided into two categories:

Classification problem: predict the (discrete) category to which a sample belongs, such as judging gender or health status.

Regression problem: predict the corresponding real output (continuous) of a certain sample. For example, predict the average height of people in a certain area.

In addition, ensemble learning is also a kind of supervised learning: it combines the predictions of multiple relatively weak machine learning models to predict new samples.

1.1 Single model

1.1.1 Linear regression

Linear regression refers to a regression model composed entirely of linear terms in the variables. If the analysis includes only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called simple (unary) linear regression analysis. If the analysis includes two or more independent variables and the dependent variable is linearly related to them, it is called multiple linear regression analysis.
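
A minimal sketch of multiple linear regression, assuming scikit-learn and a small synthetic dataset (not part of the original text):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # two independent variables
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)      # multiple linear regression
print(model.coef_, model.intercept_)      # should be close to [3, -2] and 1
```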

1.1.2 Logistic regression

Logistic regression studies the influence of $X$ on $Y$ when $Y$ is categorical. If $Y$ has two categories such as 0 and 1 (for example, 1 = willing and 0 = unwilling, or 1 = buy and 0 = not buy), this is called binary logistic regression; if $Y$ has three or more categories, it is called multi-class logistic regression.

The independent variables do not have to be categorical; they can also be quantitative. If $X$ is categorical, it needs to be encoded as dummy variables.
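
A minimal sketch of binary logistic regression with a dummy-encoded categorical predictor (scikit-learn and pandas assumed; the toy data is made up for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data: one quantitative X (age), one categorical X (city), binary Y (buy)
df = pd.DataFrame({
    "age":  [22, 35, 47, 51, 28, 39, 60, 33],
    "city": ["A", "B", "A", "C", "B", "C", "A", "B"],
    "buy":  [0, 1, 0, 1, 0, 1, 1, 0],
})
X = pd.get_dummies(df[["age", "city"]], columns=["city"])  # dummy-variable encoding
y = df["buy"]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:3])  # predicted probabilities for the first 3 samples
```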

1.1.3 Lasso

The Lasso method is a shrinkage estimation method that replaces ordinary least squares. The basic idea of Lasso is to build an $L_1$-regularized model that shrinks some coefficients and sets others to exactly zero during model fitting. After training, the parameters with weight equal to 0 can be discarded, which makes the model simpler and effectively prevents overfitting. Lasso is widely used for fitting and variable selection with multicollinear data.
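
A minimal sketch of how $L_1$ regularization drives some coefficients to zero (scikit-learn assumed; the data and the alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features actually matter
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # most coefficients are driven to exactly 0
```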

1.1.4 K Nearest Neighbors (KNN)

The main difference between KNN for regression and KNN for classification lies in how the final prediction is made. For classification, KNN generally uses majority voting: among the K training samples closest to the query sample in feature space, the most frequent class is taken as the prediction. For regression, KNN generally uses averaging: the mean of the outputs of the K nearest samples is taken as the predicted value. The underlying theory is the same in both cases.
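
A minimal sketch contrasting the voting and averaging rules (scikit-learn assumed; data is synthetic):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

Xc, yc = make_classification(n_samples=200, n_features=4, random_state=0)
Xr, yr = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)

knn_clf = KNeighborsClassifier(n_neighbors=5).fit(Xc, yc)  # majority vote of 5 neighbors
knn_reg = KNeighborsRegressor(n_neighbors=5).fit(Xr, yr)   # mean output of 5 neighbors

print(knn_clf.predict(Xc[:3]))
print(knn_reg.predict(Xr[:3]))
```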

1.1.5 Decision tree

Each internal node in a decision tree poses a splitting question: it specifies a test on one attribute of the instance and splits the samples reaching that node according to that attribute, with each branch leaving the node corresponding to one possible value of the attribute. In a classification tree, the mode of the output variable among the samples in a leaf node is the classification result; in a regression tree, the mean of the output variable among the samples in a leaf node is the predicted result.
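
A minimal sketch of a classification tree (leaf = mode) and a regression tree (leaf = mean), assuming scikit-learn and standard toy datasets:

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X_iris, y_iris = load_iris(return_X_y=True)
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_iris, y_iris)

X_reg, y_reg = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
tree_reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_reg, y_reg)

print(tree_clf.predict(X_iris[:3]))  # class label = mode of the leaf
print(tree_reg.predict(X_reg[:3]))   # prediction = mean of the leaf
```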

1.1.6 BP neural network

The BP neural network is a multilayer feed-forward network trained with the error backpropagation algorithm, and it is one of the most widely used neural network models. Its learning rule uses steepest (gradient) descent to continuously adjust the network's weights and thresholds through backpropagation so as to minimize the network's error (i.e. the sum of squared errors).

A BP neural network is a multilayer feed-forward neural network whose main characteristic is that the signal propagates forward while the error propagates backward. Specifically, consider a network with only one hidden layer:

The BP process is divided into two main stages. The first stage is forward propagation of the signal, from the input layer through the hidden layer to the output layer. The second stage is backpropagation of the error, from the output layer back through the hidden layer to the input layer, adjusting in turn the weights and biases from the hidden layer to the output layer and then the weights and biases from the input layer to the hidden layer.
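
A minimal numpy sketch of those two stages for a single-hidden-layer network; the architecture, learning rate, and toy data are illustrative assumptions, not from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)  # toy binary target

W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros((1, 8))   # input -> hidden
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros((1, 1))   # hidden -> output
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):
    # Stage 1: forward propagation of the signal
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Stage 2: backpropagation of the error (squared-error loss)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Steepest descent: adjust hidden->output first, then input->hidden
    W2 -= lr * h.T @ d_out / len(X)
    b2 -= lr * d_out.mean(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h / len(X)
    b1 -= lr * d_h.mean(axis=0, keepdims=True)

print("training accuracy:", ((out > 0.5) == y).mean())
```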

1.1.7 Support Vector Machine (SVM)

Support vector regression (SVR) uses a nonlinear mapping to project the data into a high-dimensional feature space, so that the independent and dependent variables exhibit a good linear relationship in that space; the model is fitted in the feature space and then mapped back to the original space.

Support vector machine classification (SVM) is a generalized linear classifier that performs binary classification of data in a supervised manner; its decision boundary is the maximum-margin hyperplane solved for the training samples.
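
A minimal sketch of SVM classification and SVR regression with an RBF kernel (the kernel implicitly performs the nonlinear mapping into a higher-dimensional feature space); scikit-learn and synthetic data assumed:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.svm import SVC, SVR

Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
Xr, yr = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

svm_clf = SVC(kernel="rbf", C=1.0).fit(Xc, yc)          # maximum-margin classifier
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(Xr, yr)  # regression in feature space

print(svm_clf.predict(Xc[:3]))
print(svr.predict(Xr[:3]))
```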

1.1.8 Naive Bayes

To calculate the probability of one event occurring given that another event has occurred, we use Bayes' theorem. Given observed data $d$, the probability that a hypothesis $h$ is true is

$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{P(d)}$$

The naive Bayes algorithm assumes that all features are independent of each other given the class.
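
A minimal sketch using Gaussian naive Bayes, one common variant, from scikit-learn; the iris dataset is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)            # estimates P(h) and P(d|h) per class
print("test accuracy:", nb.score(X_test, y_test))  # predicts via the posterior P(h|d)
```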

1.2 Ensemble Learning

Ensemble learning is a method that combines the results of different learning models (such as classifiers) by voting or averaging to further improve accuracy. Generally, voting is used for classification problems and averaging for regression problems. The approach stems from the idea that "when everyone gathers firewood, the flames rise high".

There are three main types of ensemble algorithms: Bagging, Boosting, and Stacking. This article will not talk about stacking.


1.2.1 Boosting


1.2.1.1 GBDT

GBDT is a Boosting algorithm that uses CART regression trees as base learners. It is an additive model: it trains a set of CART regression trees serially and sums the predictions of all trees to obtain a strong learner, with each new tree fitted to the negative gradient of the current loss function. The final output is the sum of this group of regression trees, which directly gives the regression result, or is passed through a sigmoid or softmax function to obtain binary or multi-class classification results.
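
A minimal sketch of gradient-boosted CART trees for regression, using scikit-learn's GradientBoostingRegressor; the parameters and data are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)

# Each new tree is fitted to the negative gradient of the current ensemble's loss
gbdt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3,
                                 random_state=0).fit(X, y)
print(gbdt.predict(X[:3]))  # prediction = sum of all trees' outputs
```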

1.2.1.2 Adaboost

AdaBoost gives a high weight to learners with a low error rate and a low weight to learners with a high error rate, then combines the weak learners with their corresponding weights to form a strong learner. The regression and classification versions of the algorithm differ mainly in how the error rate is calculated: classification generally uses a 0/1 loss function, while regression generally uses a squared or linear loss function.
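
A minimal sketch of AdaBoost for classification and regression (scikit-learn assumed; parameters and data are illustrative):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

Xc, yc = make_classification(n_samples=300, n_features=6, random_state=0)
Xr, yr = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)

# Weak learners with low error receive large weights in the final combination
ada_clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(Xc, yc)
ada_reg = AdaBoostRegressor(n_estimators=100, loss="square", random_state=0).fit(Xr, yr)

print(ada_clf.predict(Xc[:3]), ada_reg.predict(Xr[:3]))
```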

1.2.1.3 XGBoost

XGBoost is short for "Extreme Gradient Boosting". The XGBoost algorithm is an ensemble method that combines basis functions and weights to fit the data well. Because the XGBoost model has strong generalization ability, high scalability, and fast computation, it has been popular in statistics, data mining, and machine learning since it was proposed in 2015.

XGBoost is an efficient implementation of GBDT. Unlike plain GBDT, XGBoost adds a regularization term to the loss function; and because the derivatives of some loss functions are difficult to compute directly, XGBoost uses a second-order Taylor expansion of the loss function as an approximation when fitting.
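
A minimal sketch assuming the xgboost Python package and its scikit-learn-style wrapper are installed; the parameter values (including `reg_lambda`, which corresponds to the regularization term mentioned above) are illustrative:

```python
from sklearn.datasets import make_regression
import xgboost as xgb

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# reg_lambda / reg_alpha set the regularization term added to the loss;
# internally the loss is approximated by its second-order Taylor expansion
model = xgb.XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4,
                         reg_lambda=1.0, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))
```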

1.2.1.4 LightGBM

LightGBM, like XGBoost, is an efficient implementation of gradient boosting. Its idea is to discretize continuous floating-point features into k discrete values and build a histogram of width k; it then iterates over the training data and accumulates statistics for each discrete bin of the histogram. When searching for a split, only the k histogram bins need to be traversed to find the optimal split point. LightGBM also uses a leaf-wise growth strategy with a depth limit, which saves a great deal of time and memory.
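
A minimal sketch assuming the lightgbm package; `max_bin` relates to the histogram width and `num_leaves`/`max_depth` to the depth-limited leaf-wise growth, and the values shown are illustrative:

```python
from sklearn.datasets import make_classification
import lightgbm as lgb

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# max_bin controls the histogram width k; num_leaves/max_depth control leaf-wise growth
clf = lgb.LGBMClassifier(n_estimators=200, max_bin=255, num_leaves=31,
                         max_depth=7, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```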

1.2.1.5 CatBoost

CatBoost is a GBDT framework based on symmetric (oblivious) decision trees as base learners. It mainly addresses the pain points of handling categorical features efficiently and sensibly, and of gradient bias and prediction shift, thereby improving the accuracy and generalization ability of the algorithm.
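
A minimal sketch assuming the catboost package; the `cat_features` argument marks which columns are categorical, and the toy data is made up:

```python
import pandas as pd
from catboost import CatBoostClassifier

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],  # categorical feature
    "size":  [1.0, 2.5, 0.7, 3.1, 2.2, 1.8],
    "label": [0, 1, 0, 1, 1, 0],
})
X, y = df[["color", "size"]], df["label"]

# CatBoost handles the categorical column natively via cat_features
model = CatBoostClassifier(iterations=100, depth=4, verbose=0)
model.fit(X, y, cat_features=["color"])
print(model.predict(X))
```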

1.2.2 Bagging


1.2.2.1 Random Forest

Random forest classification generates many decision trees by randomly sampling both the observations and the feature variables of the modeling data set. Each sampling result yields one tree, and each tree produces its own rules and classification results (predicted values); the forest finally aggregates the rules and outputs of all the decision trees to perform the classification (or regression) of the random forest algorithm.
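
A minimal sketch of bagging with random feature sampling, using scikit-learn's RandomForestClassifier; the parameters and data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Each tree sees a bootstrap sample of rows and a random subset of features per split
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0).fit(X, y)
print(rf.predict(X[:5]))  # class decided by aggregating all trees' votes
```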

2. Unsupervised learning

Unsupervised learning deals with training data that contains only the input variable $X$ and no corresponding output variable. It models the structure of the data using training data that has not been labeled by experts.

2.1 Clustering

Clustering groups similar samples into the same cluster. Unlike classification, clustering does not know the categories in advance, and naturally the training data carries no category labels.

2.1.1 K-means algorithm

K-means is a center-based clustering algorithm that iteratively partitions the samples into K classes so that the sum of the distances between each sample and the center (mean) of its class is minimized. Unlike hierarchical clustering and other algorithms that can also cluster by fields (variables), this fast cluster analysis clusters by samples.
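
A minimal sketch on synthetic blobs, assuming scikit-learn:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Iteratively assigns samples to the nearest center and recomputes the centers
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the K class centers
print(km.labels_[:10])       # cluster assignment of the first 10 samples
```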

2.1.2 Hierarchical clustering

Hierarchical clustering performs a hierarchical decomposition of a given set of data objects, building clusters level by level to form a tree whose nodes are clusters. If the decomposition proceeds bottom-up, it is called agglomerative hierarchical clustering, e.g. AGNES; top-down decomposition is called divisive hierarchical clustering, e.g. DIANA. Agglomerative hierarchical clustering is the most commonly used.
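
A minimal sketch of agglomerative (bottom-up) clustering, assuming scikit-learn and synthetic data:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Bottom-up: start from single-sample clusters and repeatedly merge the closest pair
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(agg.labels_[:10])
```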

2.2 Dimensionality reduction

Dimensionality reduction means reducing the dimensionality of the data while ensuring that meaningful information is not lost. Both feature extraction and feature selection methods can achieve this. Feature selection selects a subset of the original variables; feature extraction transforms the data from a high-dimensional space to a low-dimensional one. The well-known principal component analysis algorithm is a feature extraction method.

2.2.1 PCA principal component analysis

Principal component analysis linearly combines multiple correlated indicators, using the fewest dimensions to explain as much of the information in the original data as possible. The variables obtained after dimensionality reduction are linearly uncorrelated with each other, and each new variable is a linear combination of the original variables; the later a principal component appears, the smaller its share of the variance and the weaker its ability to summarize the original information.
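
A minimal sketch showing the decreasing variance share of later components (scikit-learn; the iris data is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # new variables are linear combinations of the old
print(pca.explained_variance_ratio_)   # later components explain less of the variance
print(X_reduced[:3])
```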

2.2.2 SVD singular value decomposition

Singular value decomposition (SVD) is an algorithm widely used in machine learning. It can be used not only for the matrix factorization inside dimensionality reduction algorithms, but also in recommendation systems and natural language processing; it is the cornerstone of many algorithms.
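
A minimal numpy sketch of a truncated SVD used as a low-rank approximation (the rank k and the matrix are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                  # e.g. a small user-item rating matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep the 2 largest singular values
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(s)                                     # singular values, sorted descending
print(np.linalg.norm(A - A_approx))          # reconstruction error of the rank-2 approx
```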

2.2.3 LDA linear discriminant analysis

The principle of linear discriminant analysis is to project the samples onto a line so that the projections of samples of the same class are as close together as possible and the projections of samples of different classes are as far apart as possible. To classify a new sample, it is projected onto the same line and its category is determined from the position of its projection.
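
A minimal sketch of LDA as both a projection and a classifier (scikit-learn; the iris data is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
X_proj = lda.transform(X)        # project samples so that the classes separate well
print(X_proj[:3])
print(lda.predict(X[:3]))        # classify new samples from their projected position
```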

Origin blog.csdn.net/wzk4869/article/details/127896823