This article walks you through the common machine learning algorithms

1. What are the common machine learning algorithms?

Common ones include the KNN algorithm, linear regression, the decision tree algorithm, the random forest algorithm, the PCA algorithm, the SVM algorithm, and so on.

2. What is machine learning

Simply put, machine learning lets a machine learn from data and derive a model that better reflects the regularities of the real world. By using that model, the machine can then perform a task better than before. This is machine learning.

Breaking this sentence down:

Data: digitized representations of things, or regular features abstracted from real life.

Learning: based on the data, the machine repeatedly executes a specific set of steps (the learning algorithm) to extract the characteristics of things and arrive at a description that is closer to reality (this description is a model, which may itself be a function). This function that approximately describes reality is what we call the learned model.

Better: by using the model, we can better explain the world and solve the problems the model relates to.

3. Explain the difference between supervised and unsupervised machine learning?

Supervised learning requires labeled data for training. In other words, supervised learning uses ground truth, meaning we have prior knowledge about the output and the samples. The goal here is to learn a function that approximates the relationship between input and output.

Unsupervised learning, on the other hand, does not use labeled outputs. The goal here is to infer the natural structure in the dataset.
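
As a concrete illustration, here is a minimal sketch (assuming scikit-learn is installed and using tiny made-up data): the supervised model is fit on features X together with labels y, while the unsupervised model is fit on X alone.

```python
# Minimal sketch: supervised vs. unsupervised fitting (assumes scikit-learn is installed).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])  # features
y = np.array([0, 0, 1, 1])                                      # labels (ground truth)

# Supervised: learn a mapping from X to the known labels y.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.2, 1.9]]))   # -> predicted class label

# Unsupervised: no labels; infer structure (here, 2 clusters) from X alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                  # -> cluster assignments discovered from the data
```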

4. Introduction to KNN algorithm

The K-Nearest Neighbor (KNN) classification algorithm, also called the proximity algorithm, is one of the simplest methods in data mining classification. "K nearest neighbors" means that each sample can be represented by its K closest neighbors; the algorithm classifies each record in the data set based on those neighbors.

The k-nearest neighbor method is a basic classification and regression method and a commonly used supervised learning technique. It assumes that a training data set is given in which the category of each instance is already known. To classify a new instance, its class is predicted by majority voting among the categories of its k nearest training instances.

The three elements of the k-nearest neighbor method are the distance measure, the choice of k, and the classification decision rule. Commonly used distance measures are the Euclidean distance and the more general Lp (Minkowski) distance. When k is small, the k-nearest neighbor model is more complex and prone to overfitting; when k is large, the model is simpler and prone to underfitting. The choice of k therefore has a significant impact on the classification results: it reflects a trade-off between approximation error and estimation error, and the optimal k is usually selected by cross-validation.
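
A minimal sketch of these ideas, assuming scikit-learn is installed and using its built-in iris data set: Euclidean distance (p=2), majority voting, and k chosen by cross-validation.

```python
# Minimal KNN sketch (assumes scikit-learn is installed): Euclidean distance,
# majority voting, and k chosen by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# p=2 gives the Euclidean distance (a special case of the Lp/Minkowski distance).
search = GridSearchCV(
    KNeighborsClassifier(p=2),
    param_grid={"n_neighbors": list(range(1, 16))},  # candidate k values
    cv=5,                                            # 5-fold cross-validation
)
search.fit(X_train, y_train)

print("best k:", search.best_params_["n_neighbors"])
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```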

Advantages

  1. Simple, easy to understand, easy to implement, no need to estimate parameters, no training required;

  2. Suitable for classifying rare events;

  3. Especially suitable for multi-class problems (multi-modal; objects have multiple category labels), where kNN performs better than SVM.

Disadvantages

  1. The main disadvantage in classification is that when the classes are imbalanced, for example one class has many samples while the other classes have few, the samples of the majority class may dominate the K neighbors of a new input sample.
  2. Another disadvantage is the large amount of computation: for each sample to be classified, the distance to all known samples must be calculated in order to find its K nearest neighbors.

5. Introduction to Linear Regression

Linear regression is an analysis method that uses a regression equation (function) to model the relationship between one or more independent variables (feature values) and a dependent variable (target value).

  • Characteristics: the case with only one independent variable is called univariate regression, and the case with more than one independent variable is called multiple regression.

Linear regression mainly models two kinds of relationships between the features and the target: a linear relationship and a nonlinear relationship.
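
A minimal sketch of multiple linear regression, assuming scikit-learn is installed; the data below are synthetic toy values generated just for illustration.

```python
# Minimal linear regression sketch (assumes scikit-learn is installed):
# fit a regression equation relating features (independent variables) to a target.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: target generated as y = 3*x1 + 2*x2 + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))   # two independent variables -> multiple regression
y = 3 * X[:, 0] + 2 * X[:, 1] + 1 + rng.normal(0, 0.5, size=100)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)    # estimated weights for x1, x2 (close to 3 and 2)
print("intercept:", model.intercept_)  # close to 1
print("prediction for [4, 5]:", model.predict([[4.0, 5.0]]))
```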

6. Introduction to PCA Algorithm

PCA (principal component analysis) is a technique that uses the idea of dimensionality reduction to convert many indicators into a few comprehensive indicators.

Advantages of the PCA algorithm:
1. Make the data set easier to use;
2. Reduce the computational overhead of the algorithm;
3. Remove noise;
4. Make the results easy to understand;
5. Completely parameter-free (no parameters need to be set).

Disadvantages of the PCA algorithm:
1. The interpretation of the principal components often carries some ambiguity and is not as complete as that of the original features.
2. Principal components with small contribution rates may still contain important information about differences between samples, i.e., they may be more useful for distinguishing sample categories (labels).
3. Whether the orthogonal vector space of the eigenvalue matrix is unique remains open to discussion.
4. It is an unsupervised method, so class label information is not used.

Steps of the PCA algorithm:

  1. Subtract the mean

  2. Compute the covariance matrix

  3. Compute the eigenvalues and eigenvectors of the covariance matrix

  4. Sort the eigenvalues

  5. Keep the eigenvectors corresponding to the N largest eigenvalues

  6. Transform the original features into the new space constructed by the N eigenvectors obtained above (the last two steps achieve feature compression)
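
The steps above can be sketched directly with NumPy; this is a minimal illustration on synthetic toy data, not a production implementation.

```python
# Sketch of the PCA steps above using NumPy (toy illustration only).
import numpy as np

def pca(X, n_components):
    # 1. Subtract the mean of each feature.
    X_centered = X - X.mean(axis=0)
    # 2. Compute the covariance matrix (rows are samples, so rowvar=False).
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigenvalues and eigenvectors of the (symmetric) covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4.-5. Sort eigenvalues in descending order and keep the top n_components eigenvectors.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # 6. Project the centered data onto the new space (feature compression).
    return X_centered @ components, components

# Toy 2-D data that mostly varies along one direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
X_reduced, components = pca(X, n_components=1)
print(X_reduced.shape)   # (200, 1): each sample compressed to a single principal component
```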

PCA is a commonly used data analysis method. Through a linear transformation, PCA converts the original data into a set of linearly independent representations along each dimension, which can be used to identify and extract the main feature components of the data: the coordinate axes are rotated toward the directions that matter most for the data (maximum variance), eigenvalue analysis then determines how many principal components need to be retained, and the remaining non-principal components are discarded, achieving dimensionality reduction. Dimensionality reduction makes data simpler and more efficient to work with, which speeds up data processing and saves a great deal of time and cost, and it has become a very widely used preprocessing step. PCA is widely applied to the exploration and visualization of high-dimensional data sets, and is also used for data compression, data preprocessing, and the analysis and processing of images, speech, and communication signals, among other fields.
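
In practice a library implementation is usually preferred; a minimal sketch with scikit-learn's PCA (assuming it is installed), reducing its built-in 64-dimensional digits data to 2 dimensions for visualization:

```python
# Minimal sketch using scikit-learn's PCA for dimensionality reduction / visualization.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)     # 64-dimensional handwritten-digit features
pca = PCA(n_components=2)               # keep the 2 directions of maximum variance
X_2d = pca.fit_transform(X)

print(X_2d.shape)                       # (1797, 2): ready for a 2-D scatter plot
print(pca.explained_variance_ratio_)    # share of variance captured by each component
```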

7. Introduction to Support Vector Machine-SVM

SVM (Support Vector Machine) is a common discriminative method. In the field of machine learning it is a supervised learning model, usually used for pattern recognition, classification, and regression analysis. It shows unique advantages in solving small-sample, nonlinear, and high-dimensional pattern recognition problems, and can be extended to other problems such as function fitting.
More precisely, an SVM is a generalized linear classifier that performs binary classification on data by supervised learning; its decision boundary is the maximum-margin hyperplane solved for over the training samples.

  1. The main features of SVM:
    The main idea of SVM is, for a two-class classification problem, to find a hyperplane in a high-dimensional space that separates the two categories while ensuring the minimum classification error rate.
    SVM looks for a hyperplane that meets the classification requirements and keeps the points of the training set as far from it as possible, that is, it finds a separating surface that maximizes the margin on both sides.
    The training samples lying on the hyperplanes that pass through the points of each class closest to the separating surface, and that are parallel to the optimal separating surface, are called support vectors.
    The optimal separating surface must not only separate the two classes correctly (zero training error) but also maximize the classification margin.
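
A minimal sketch of a linear maximum-margin classifier, assuming scikit-learn is installed; the two-class data below are synthetic and only for illustration.

```python
# Minimal SVM sketch (assumes scikit-learn is installed): a linear maximum-margin classifier.
import numpy as np
from sklearn.svm import SVC

# Two roughly linearly separable classes in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=2, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # larger C penalizes margin violations more strongly

# The decision boundary is determined only by the support vectors,
# the training points lying closest to the separating hyperplane.
print("support vectors per class:", clf.n_support_)
print("a few support vectors:\n", clf.support_vectors_[:3])
print("prediction for [0.5, 0.5]:", clf.predict([[0.5, 0.5]]))
```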

8. Introduction to Random Forest Algorithm

A random forest is a classifier that uses multiple decision trees to train on and predict samples.

The advantages of random forests are:

1) For many kinds of data, it can generate high-accuracy classifiers;

2) It can handle a large number of input variables;

3) It can evaluate the importance of variables when deciding on categories;

4) While building the forest, it can internally produce an unbiased estimate of the generalization error;

5) It includes a good method for estimating missing data and maintains accuracy even when a large proportion of the data is missing;

6) The learning process is very fast.

Algorithm process

1. From a data set of N samples, draw N times with replacement, one sample at a time, to form a bootstrap sample of N samples. These N samples serve as the training data at the root node of one decision tree.

2. When each sample has M attributes and a node of the decision tree needs to be split, randomly select m attributes from the M attributes, with m << M. Then use some strategy (for example, information gain) to choose 1 of these m attributes as the splitting attribute of the node.

3. While the decision tree is being grown, every node is split according to step 2, until no further split is possible (intuitively, if the attribute chosen for a node is the same one its parent node was just split on, the node has become a leaf and there is no need to split further). Note that no pruning is performed at any point while the tree is grown.

4. Build a large number of decision trees according to steps 1~3, thus forming a random forest.
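
A minimal sketch of this procedure using scikit-learn's RandomForestClassifier on its built-in iris data (assuming scikit-learn is installed); the parameter choices below are illustrative, and the out-of-bag score corresponds to the internal, unbiased error estimate mentioned above.

```python
# Minimal random forest sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,       # number of decision trees
    max_features="sqrt",    # m features tried at each split, with m << M
    bootstrap=True,         # each tree is trained on a bootstrap sample of the data
    oob_score=True,         # out-of-bag estimate of the generalization error
    random_state=0,
).fit(X, y)

print("out-of-bag accuracy:", forest.oob_score_)
print("feature importances:", forest.feature_importances_)  # variable-importance estimates
```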

Random selection of data
First, a sub-data set is constructed by sampling with replacement from the original data set, with the same number of samples as the original; elements may be repeated within a sub-data set, and the same element may appear in different sub-data sets. Second, each sub-data set is used to build a sub-decision tree; the data are passed to every sub-decision tree, and each tree outputs a result. Finally, when new data needs to be classified by the random forest, the output is obtained by voting over the results of the sub-decision trees.

Random selection of features to be selected

Similar to the random selection of the data, each split of a subtree in the random forest does not consider all candidate features; instead, a subset of features is randomly selected from all candidate features, and the best feature is then chosen from that subset. This makes the decision trees in the random forest different from one another, improves the diversity of the system, and thus improves the classification performance.
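
Both sources of randomness can be sketched directly with NumPy; this is a toy illustration of the sampling only (no tree building), with hypothetical sizes N=150 and M=10.

```python
# Toy sketch of the two sources of randomness in a random forest (sampling only).
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 150, 10          # N samples, M features (hypothetical sizes)
X = rng.normal(size=(n_samples, n_features))

# Random selection of data: a bootstrap sample the same size as the original set,
# drawn with replacement, so some rows repeat and others are left out.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
X_bootstrap = X[bootstrap_idx]

# Random selection of features: at each split, only m of the M features are candidates.
m = int(np.sqrt(n_features))             # a common choice, m = sqrt(M)
candidate_features = rng.choice(n_features, size=m, replace=False)

print("bootstrap sample shape:", X_bootstrap.shape)
print("candidate features at this split:", candidate_features)
```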
