【Machine Learning Series】A Quick Tour of Ten Machine Learning Algorithms

Foreword

Machine learning algorithms are a class of algorithms that learn patterns and regularities from data. They use the inputs and outputs of training samples to infer the parameters of a model, and the resulting model can then be used to make predictions on new, unseen data.



Machine Learning Algorithms

There are many machine learning algorithms to choose from; here are ten basic and widely used ones:

  • Linear Regression
  • Support Vector Machine (SVM)
  • K-Nearest Neighbors (KNN)
  • Logistic Regression
  • Decision Tree
  • K-Means
  • Random Forest
  • Naive Bayes
  • Dimensionality Reduction
  • Gradient Boosting

Machine learning algorithms can be roughly divided into three categories:

  • Supervised learning algorithms: during training, a model (a function) is learned from a labeled training data set, and new instances are then classified or predicted with that model. The algorithm requires specific input/output pairs, starting with a decision about what kind of data to use as examples — for instance, a handwritten character, or a line of handwritten text in a text-recognition application. The main algorithms include neural networks, support vector machines, nearest neighbors, naive Bayes, and decision trees.
  • Unsupervised learning algorithms: these have no specific target output; the algorithm partitions the data set into different groups.
  • Reinforcement learning algorithms: these are broadly applicable and are trained around decision-making. The algorithm adjusts itself according to whether each output (decision) succeeds or fails, and with enough experience it learns to produce better decisions. Under the rewards and punishments given by the environment, the agent — much like a living organism — gradually forms expectations about those stimuli and develops habitual behaviors that maximize benefit. In operations research and control theory, reinforcement learning is also known as approximate dynamic programming (ADP).

These algorithms have different strengths and are suited to different problems and data sets; choosing the right one depends on the specific task and the characteristics of the data. Let's walk through each of them:

1. Linear Regression

Regression analysis is a statistical method for determining whether two or more variables are related, and the direction and strength of that relationship, and for building a mathematical model that uses observed variables to predict changes in others.

Linear regression models the data by finding the line of best fit through the data points. The formula is y = m*x + c, where y is the dependent variable and x is the independent variable; the algorithm uses the given data set to find the values of m and c.
Linear regression comes in two types: simple linear regression, with a single independent variable, and multiple regression, with two or more independent variables.


Here is an example of linear regression based on the Python scikit-learn toolkit.
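As a minimal sketch (the toy data below is made up for illustration), fitting y = m*x + c with scikit-learn might look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 2*x + 1 with a little noise.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression()
model.fit(X, y)

# m (slope) and c (intercept) from the formula y = m*x + c
m, c = model.coef_[0], model.intercept_
print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
print("prediction for x=6:", model.predict([[6.0]])[0])
```

The fitted slope and intercept should land close to the values used to generate the data.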


2. Support Vector Machine (SVM)

The support vector machine (SVM) is a classification algorithm. An SVM model represents instances as points in space and finds a straight line (more generally, a hyperplane) that separates the data points. Note that SVMs require fully labeled input data and are directly applicable only to two-class tasks; in practice, multi-class tasks are handled by reducing them to several binary problems.
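A minimal two-class sketch with scikit-learn (the two point clusters below are made-up toy data):

```python
from sklearn.svm import SVC

# Fully labeled two-class toy data: two separated clusters in 2-D.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
     [3.0, 3.0], [3.1, 2.8], [2.9, 3.2]]
y = [0, 0, 0, 1, 1, 1]

# kernel="linear": a straight line separates the two classes.
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[0.1, 0.2], [3.0, 3.0]]))  # one point near each cluster
```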


3. K-Nearest Neighbors (KNN)

The KNN algorithm is an instance-based (or "lazy") learning method: it approximates locally and defers all computation until classification time. It predicts an unknown data point from its k nearest neighbors, and the value of k is a key factor in prediction accuracy. For both classification and regression, it is useful to weight the neighbors so that closer neighbors contribute more than distant ones.

The drawbacks of KNN are that it is very sensitive to the local structure of the data, it is computationally heavy, and the data must be normalized so that every feature lies in the same range.
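A minimal sketch (toy data assumed) showing both points above — normalizing the features and weighting neighbors by distance:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Toy data with features on very different scales.
X = [[1.0, 200], [2.0, 180], [1.5, 220],   # class 0
     [8.0, 900], [9.0, 1000], [8.5, 950]]  # class 1
y = [0, 0, 0, 1, 1, 1]

# Normalize so each feature lies in the same [0, 1] range;
# KNN distances are otherwise dominated by the larger-scale feature.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# weights="distance" gives closer neighbors more influence than distant ones.
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(X_scaled, y)

print(knn.predict(scaler.transform([[1.2, 210]])))
```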


Extension: one disadvantage of KNN is that it depends on the entire training data set. Learning vector quantization (LVQ) is a supervised artificial-neural-network algorithm that lets you choose which training examples to keep. LVQ is data-driven: it searches for the two neurons closest to each input, attracts neurons of the same class and repels neurons of different classes, and finally obtains the distribution pattern of the data. If KNN already classifies a data set well, LVQ can reduce the storage required for the training set. Typical learning vector quantization algorithms include LVQ1, LVQ2, and LVQ3, with LVQ2 the most widely used.


4. Logistic Regression

Logistic regression is generally used when a categorical output is required, such as predicting whether an event will occur (for example, whether it will rain). Typically, logistic regression uses a function to squash values into a probability range.
For example, the sigmoid function (S-function), which has an S-shaped curve, is used for binary classification: it maps the value for an event to a probability in the range (0, 1).

Y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

The above is the simple logistic regression equation, where b0 and b1 are coefficients. Their values are estimated so that the error between predicted and actual values is minimized.
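A minimal sketch of the rainfall example (the humidity data below is made up): fit b0 and b1 with scikit-learn, then evaluate the sigmoid equation above directly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # S-shaped curve mapping any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: predict rain (1) from humidity percentage.
X = np.array([[10], [20], [30], [70], [80], [90]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

b1, b0 = clf.coef_[0][0], clf.intercept_[0]
# P(rain | x) = sigmoid(b0 + b1*x), matching the equation above
p = sigmoid(b0 + b1 * 85)
print(f"P(rain | humidity=85) = {p:.2f}")
```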


5. Decision Tree

A decision tree is a tree structure consisting of decisions and their possible outcomes (such as costs and risks) that is used to assist decision-making. In machine learning, a decision tree is a predictive model: each internal node tests an attribute, each branch represents a possible value of that attribute, and each leaf node holds the predicted value for the objects that follow the path from the root to that leaf. A decision tree has a single output, and the algorithm is typically used to solve classification problems.

A decision tree contains three types of nodes:

  • Decision nodes: usually represented by rectangles
  • Chance nodes: usually represented by circles
  • End nodes: usually represented by triangles

As a simple example, consider a decision tree that determines who in a population is likely to use a credit card, based on age and marital status: people who are over 30 or married are more inclined to choose a credit card, and vice versa.
The tree can be extended by identifying suitable attributes that define further categories. In this example, a married person over 30 is the most likely to own a credit card (100% preference in the sample). Training data is used to generate the decision tree.
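A minimal sketch of this example with scikit-learn (the age/marital-status samples below are invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Features: [age, married (1 = yes, 0 = no)]; target: owns a credit card.
X = [[25, 0], [22, 0], [35, 0], [40, 1], [28, 1], [45, 1]]
y = [0, 0, 1, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# An unmarried 32-year-old vs. an unmarried 24-year-old.
print(tree.predict([[32, 0], [24, 0]]))
```

On this toy sample the tree learns an age-based split, so the older person is predicted to own a credit card and the younger one is not.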


Note: information gain in a decision tree is biased toward features with many distinct values, so when sample sizes are uneven such features can dominate the splits.

6. K-Means

The K-Means algorithm is an unsupervised learning algorithm that solves clustering problems.
It partitions n points (observations or sample instances) into k clusters so that each point belongs to the cluster with the nearest mean (the cluster center, or centroid). The assignment and centroid-update steps are repeated until the centroids no longer change.
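A minimal sketch with scikit-learn (the two point clusters below are toy data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious clusters of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])

# k = 2 clusters; fitting repeats assign/update until centroids converge.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

print("labels:", labels)
print("centroids:\n", km.cluster_centers_)
```

Note that K-Means labels the clusters arbitrarily (0 or 1); what matters is which points are grouped together.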


7. Random Forest

The random forest algorithm takes its name from the random decision forests proposed at Bell Labs in 1995. As the name suggests, a random forest can be regarded as an ensemble of decision trees.
Each decision tree in the forest predicts a class, a process called "voting"; the forest then chooses the class that receives the most votes across all trees.
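A minimal sketch on scikit-learn's built-in iris data set (the split ratio and tree count below are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 decision trees each cast a "vote"; the majority class wins.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

acc = forest.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```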


8. Naive Bayes

The Naive Bayes algorithm is based on Bayes' theorem from probability theory and is widely used, from text classification and spam filtering to medical diagnosis. It is suited to scenarios where features are independent of one another, such as predicting a flower's species from its petal length and width. The "naive" in the name refers to this strong independence assumption between features.

A concept closely related to naive Bayes is maximum likelihood estimation; historically, much of maximum likelihood theory was also developed further within Bayesian statistics. For example, to build a model of a population's heights, it would be impractical to measure every person in the country, but one can sample some people's heights and then estimate the mean and variance of the distribution by maximum likelihood.
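A minimal sketch of the petal example using Gaussian naive Bayes on scikit-learn's iris data (restricting to the two petal features is my choice, mirroring the example above):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Use petal length and petal width (columns 2 and 3) to predict the species.
X, y = load_iris(return_X_y=True)
X_petal = X[:, 2:4]

# GaussianNB models each feature as an independent Gaussian per class.
nb = GaussianNB()
nb.fit(X_petal, y)

acc = nb.score(X_petal, y)
print(f"training accuracy: {acc:.2f}")
```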


9. Dimensionality Reduction

In machine learning and statistics, dimensionality reduction is the process of reducing the number of random variables under consideration to obtain a set of "uncorrelated" principal variables. It can be further divided into two approaches: feature selection and feature extraction.

Some data sets contain an unmanageable number of variables. When data collection is abundant, a system may record thousands of variables, most of them unnecessary, and it becomes nearly impossible to identify the ones that most influence our predictions. This is where dimensionality reduction algorithms come in; other algorithms can also assist in the process, for example random forests and decision trees can be borrowed to identify the most important variables.
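A minimal feature-extraction sketch using principal component analysis (PCA), one common dimensionality reduction technique, on scikit-learn's iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Project 4 correlated features onto 2 uncorrelated principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("shape:", X.shape, "->", X_2d.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())
```

Most of the variance in the original four features survives in just two components, which is why the projection is useful.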

10. Gradient Boosting

Gradient boosting combines multiple weak learners into a single more powerful and accurate model. Using multiple estimators instead of one yields a more stable and robust algorithm. Popular gradient boosting implementations include:

  • XGBoost — uses linear and tree-based learners
  • LightGBM — uses only tree-based learners

Gradient boosting algorithms are characterized by high accuracy, and the LightGBM algorithm in particular is remarkably fast.
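As a minimal sketch, here is scikit-learn's own gradient boosting classifier (used instead of XGBoost/LightGBM so the example needs no extra packages) on the built-in breast cancer data set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each new shallow tree (a weak learner) corrects the errors
# of the ensemble built so far.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                random_state=0)
gb.fit(X_tr, y_tr)

acc = gb.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```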

Summary

Learning machine learning algorithms is a long-term process that requires continuous study and practice. With persistence, you will be able to get up to speed with these algorithms and apply them to practical problems.


Recommended Books



"Machine Learning Python Edition"

Introduction: a beginner's guide to machine learning that uses the Python language and the scikit-learn library to teach the processes, patterns, and strategies needed to develop machine learning systems.




Origin blog.csdn.net/m0_63947499/article/details/131504045