Machine Learning Common Sense: Top 10 Machine Learning Algorithms Beginners Should Know

1. Brief introduction

In machine learning, there is a theorem known as "no free lunch." In short, it states that no single machine learning algorithm works best for every problem, and this is especially relevant to supervised learning (i.e., predictive modeling).

For example, you cannot say that neural networks are always better than decision trees, or vice versa. There are many factors at play, such as the size and structure of the dataset. You should therefore try many different algorithms on your problem, using a held-out "test set" of data to evaluate performance and pick the winner.

Of course, the algorithm you try has to fit your problem, and that's where choosing the right machine learning task comes in. For example, if you need to clean the house, you might use a vacuum cleaner, a broom, or a mop, but you wouldn't get out a shovel and start digging.

2. General principles

All supervised machine learning algorithms for predictive modeling share a common principle: the algorithm learns a target function (f) that best maps an input variable (X) to an output variable (Y): Y = f(X)

This is a general learning task where we want to make predictions (Y) for new examples of the input variable (X). We do not know what the function (f) looks like or what form it takes; if we did, we would use it directly and would not need a machine learning algorithm to learn it from data.

The most common type of machine learning is learning the mapping Y = f(X) to predict Y for a new X. This is called predictive modeling or predictive analytics, and our goal is to make the most accurate predictions possible.

For machine learning newbies eager to learn the basics of machine learning, here is a quick look at the top 10 machine learning algorithms used by data scientists.

3. Overview of ten commonly used algorithms

  • Linear Regression
  • Logistic Regression
  • Linear Discriminant Analysis
  • Classification and Regression Trees
  • Naive Bayes
  • K-Nearest Neighbors (KNN)
  • Learning Vector Quantization (LVQ)
  • Support Vector Machines (SVM)
  • Bagging and Random Forests
  • Boosting and AdaBoost

3.1 Linear regression

Linear regression is probably one of the best-known and best-understood algorithms in statistics and machine learning. Predictive modeling is primarily concerned with minimizing the error of a model, or making the most accurate predictions possible, even at the expense of interpretability.

The representation of linear regression is an equation that describes the line that best fits the relationship between the input variable (x) and the output variable (y) by finding specific weights for the input variables called coefficients (B).


For example: y = B0 + B1 * x. We predict y given an input x, and the goal of the linear regression learning algorithm is to find the values of the coefficients B0 and B1.

Linear regression models can be learned from data using different techniques, such as the ordinary least squares solution from linear algebra or gradient descent optimization.
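As a minimal sketch (using NumPy and synthetic data that are not part of the original article), the single-input case y = B0 + B1 * x can be fit with ordinary least squares like this:

```python
import numpy as np

# Synthetic data: y is roughly 2 + 3*x plus noise (for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=50)

# Ordinary least squares for y = B0 + B1 * x
X = np.column_stack([np.ones_like(x), x])      # column of ones gives the intercept B0
B0, B1 = np.linalg.lstsq(X, y, rcond=None)[0]  # solve the least-squares problem

print(f"B0 ≈ {B0:.2f}, B1 ≈ {B1:.2f}")
print("prediction for x = 4:", B0 + B1 * 4.0)  # predict y for a new input
```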

Linear regression has been around for over 200 years and has been studied extensively. Good rules of thumb when using this technique are to remove variables that are very similar (correlated) and to remove as much noise as possible from the data. It is a quick, simple technique and a good first algorithm to try.

3.2 Logistic regression

Logistic regression is another technique that machine learning borrows from the field of statistics. It is the preferred method for binary classification problems (problems with two class values).

Logistic regression is similar to linear regression in that its goal is to find the coefficient values that weight each input variable. Unlike linear regression, the prediction of the output is transformed using a non-linear function called the logistic function.

The logistic function looks like a big S and converts any value into the range 0 to 1. This is useful because we can apply a rule to the output of the logistic function to snap values to 0 or 1 (e.g., predict class 1 if the output is greater than 0.5, otherwise class 0) and thereby predict a class value.
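As a minimal sketch (the coefficients B0 and B1 below are made up for illustration, not learned from data), the logistic function and the 0.5 threshold look like this:

```python
import numpy as np

def logistic(z):
    """The S-shaped logistic (sigmoid) function: squashes any value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for "probability of passing vs. hours studied"
B0, B1 = -4.0, 1.5
study_hours = np.array([1.0, 2.0, 3.0, 4.0])

prob_pass = logistic(B0 + B1 * study_hours)       # probability of class 1
predicted_class = (prob_pass > 0.5).astype(int)   # snap to 0 or 1 at the 0.5 threshold

print(prob_pass)         # e.g. [0.076 0.269 0.622 0.881]
print(predicted_class)   # e.g. [0 0 1 1]
```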

[Figure: a logistic regression curve showing the probability of passing an exam as a function of study time.]

Because of the way the model is learned, the predictions made by logistic regression can also be used as the probability that a given data instance belongs to class 0 or class 1. This is useful for problems where you need to give more rationale for a prediction.

Like linear regression, logistic regression works better when you remove attributes that are unrelated to the output variable and attributes that are very similar (correlated) to each other. This is a fast-learning model and works well for binary classification problems.

3.3 Linear Discriminant Analysis

Logistic regression is a classification algorithm traditionally limited to two-class classification problems. If you have more than two classes, the Linear Discriminant Analysis algorithm is the preferred linear classification technique.

The representation of LDA is very simple. It consists of statistical properties of your data, computed for each class. For a single input variable, this includes:

  1. Average value for each category.
  2. Variance computed across all classes.

Predictions are made by computing a discriminant value for each class and choosing the class with the largest value. The technique assumes the data have a Gaussian distribution (bell curve), so it is a good idea to remove outliers from the data beforehand. It is a simple and powerful method for classification predictive modeling problems.
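A quick illustration (using scikit-learn and made-up toy values, not data from the original article) might look like this:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: one input variable, three classes (values are illustrative only)
X = np.array([[1.0], [1.2], [0.8], [3.1], [2.9], [3.0], [5.2], [4.8], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)  # estimates per-class means and the variance shared across classes

# Predict the class whose discriminant value is largest for a new instance
print(lda.predict([[2.7]]))        # predicted class label
print(lda.predict_proba([[2.7]]))  # per-class probabilities
```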

3.4 Decision trees

Decision trees are an important algorithm for predictive modeling machine learning. The representation of a decision tree model is a binary tree. This is a binary tree from algorithms and data structures, nothing fancy. Each node represents an input variable (x) and a split point on that variable (assuming the variable is a number).
The leaf nodes of the tree contain an output variable (y) that is used to make predictions. Predictions are made by traversing the splits of the tree until a leaf node is reached and outputting the class value at that leaf node.

Trees learn quickly and make predictions quickly. They are also often accurate for a wide range of problems and do not require any special preparation of your data.
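As a brief illustration (using scikit-learn and its built-in iris dataset, which are not part of the original article), a classification tree could be trained like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each internal node splits on one input variable at a chosen split point;
# each leaf holds the output value used for prediction.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
```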

3.5 Naive Bayes

Naive Bayes is a simple but powerful predictive modeling algorithm.

The model consists of two types of probabilities that can be computed directly from your training data: 1) the probability of each class; and 2) the conditional probability of each input (x) value given each class. Once calculated, the probability model can be used to make predictions on new data using Bayes' theorem. When your data are real-valued, a Gaussian distribution (bell curve) is usually assumed so that these probabilities are easy to estimate.

Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption that is unrealistic for real data; nevertheless, the technique is very effective on a wide range of complex problems.
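A minimal sketch with real-valued inputs (using scikit-learn's GaussianNB and toy values that are not from the original article):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Real-valued inputs, so a Gaussian distribution is assumed per class and feature
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0],
              [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB()
nb.fit(X, y)  # estimates class priors plus per-class means and variances

print(nb.class_prior_)               # P(class) for each class
print(nb.predict([[1.1, 2.0]]))      # apply Bayes' theorem to a new instance
print(nb.predict_proba([[1.1, 2.0]]))
```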

3.6 K-nearest neighbors

The KNN algorithm is very simple and very effective. The model representation of KNN is the entire training dataset. Simple, right?

Predictions for a new data point are made by searching the entire training set for the K most similar instances (the neighbors) and summarizing the output variable of those K instances. For regression problems this might be the mean of the output values; for classification problems it might be the mode (most common) class value.

The trick is how to determine the similarity between data instances. If your attributes all have the same scale (e.g. all in inches), the easiest technique is to use Euclidean distance, where you can directly calculate a number based on the difference between each input variable.
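A bare-bones sketch of the idea (plain NumPy, with made-up points for illustration):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbors (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training instance
    nearest = np.argsort(distances)[:k]                  # indices of the k closest instances
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]                     # the mode of the neighbors' classes

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.4]), k=3))  # -> 0
```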
KNN can require a lot of memory or space to store all of the data, but it only performs computation (or learning) at the moment a prediction is needed, just in time. You can also update and curate your training instances over time to keep predictions accurate.

The idea of distance or closeness can break down in very high dimensions (many input variables), which can hurt the performance of the algorithm on your problem. This is called the curse of dimensionality. It suggests that you should use only the input variables that are most relevant to predicting the output variable.

3.7 Learning Vector Quantization

One disadvantage of K-Nearest Neighbors is that you need to keep the entire training dataset. The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that lets you choose how many training instances to hold on to and then learns exactly what those instances should look like.

The representation of LVQ is a collection of codebook vectors. These are chosen randomly at the start and are adapted to best summarize the training dataset over many iterations of the learning algorithm. After learning, the codebook vectors are used to make predictions just like K-Nearest Neighbors: find the most similar neighbor (the best matching codebook vector) by computing the distance between each codebook vector and the new data instance, then return the class value (or, for regression, the real value) of that best matching unit as the prediction. Best results are obtained if you rescale your data to have the same range, e.g., between 0 and 1.
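LVQ is not included in scikit-learn, so as a rough illustration of the idea only (a simplified LVQ1 update with made-up hyperparameters, not the exact procedure from the article):

```python
import numpy as np

def train_lvq1(X, y, n_per_class=2, lrate=0.3, epochs=20, seed=0):
    """Simplified LVQ1: codebook vectors move toward same-class instances
    and away from different-class instances."""
    rng = np.random.default_rng(seed)
    cb_X, cb_y = [], []
    for c in np.unique(y):  # initialise codebooks by sampling instances from each class
        idx = rng.choice(np.flatnonzero(y == c), n_per_class, replace=False)
        cb_X.append(X[idx]); cb_y.append(y[idx])
    cb_X, cb_y = np.vstack(cb_X).astype(float), np.concatenate(cb_y)

    for epoch in range(epochs):
        rate = lrate * (1.0 - epoch / epochs)                 # decay the learning rate
        for xi, yi in zip(X, y):
            b = np.argmin(np.linalg.norm(cb_X - xi, axis=1))  # best matching unit
            step = rate * (xi - cb_X[b])
            cb_X[b] += step if cb_y[b] == yi else -step       # attract or repel
    return cb_X, cb_y

def lvq_predict(cb_X, cb_y, x_new):
    """Predict with the class of the best matching codebook vector."""
    return cb_y[np.argmin(np.linalg.norm(cb_X - x_new, axis=1))]
```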

If you find that KNN provides good results on your dataset, try using LVQ to reduce the memory requirements for storing the entire training dataset.

3.8 Support Vector Machines

Support Vector Machines are probably one of the most popular and talked about machine learning algorithms.

A hyperplane is a line that divides the input variable space. In the SVM, a hyperplane is chosen that best separates the points in the input variable space by their class (class 0 or class 1). In 2D, you can visualize this as a line, assuming all our input points can be completely separated by this line. The SVM learning algorithm finds the coefficients that lead to the best separation of classes through the hyperplane.
The distance between the hyperplane and the closest data points is called the margin. The best or optimal hyperplane separating the two classes is the line with the largest margin. Only these closest points are relevant for defining the hyperplane and building the classifier; they are called support vectors, because they support (define) the hyperplane. In practice, an optimization algorithm is used to find the coefficient values that maximize the margin.
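A quick illustration (scikit-learn's SVC with a linear kernel and made-up points, not from the original article):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2D (illustrative points only)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1.0)  # C trades off margin width against violations
svm.fit(X, y)

print("support vectors:\n", svm.support_vectors_)  # the points that define the hyperplane
print("prediction for [4, 4]:", svm.predict([[4, 4]]))
```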

SVM is probably one of the most powerful out-of-the-box classifiers and worth trying on your dataset.

3.9 Bagging and random forests

Random Forest is one of the most popular and powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation, or bagging.

The bootstrap is a powerful statistical method for estimating a quantity from a data sample, such as a mean: take many samples of your data, compute the mean of each sample, then average all of those means to get a better estimate of the true mean.

In bagging, the same approach is used, but for estimating entire statistical models, most commonly decision trees. Multiple samples of the training data are taken, and a model is built for each sample. When you need to make a prediction on new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value.
Random forest is an adjustment of this approach in which the decision trees are created so that, rather than selecting the optimal split point, suboptimal splits are made by introducing randomness (for example, considering only a random subset of the features at each split).

Therefore, the models created for the different data samples differ more from one another, yet each is still accurate in its own way. Combining their predictions gives a better estimate of the true underlying output value.
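As a quick comparison (scikit-learn and its built-in iris dataset, not part of the original article), bagged trees and a random forest can be built like this:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: many trees, each trained on a bootstrap sample of the data
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Random forest: bagging plus a random subset of features considered at each split
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("bagged trees:", cross_val_score(bagging, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```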

If you get good results with an algorithm with high variance (like decision trees), you can usually get better results by bagging that algorithm.

3.10 Boosting and AdaBoost

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. This is done by building a model from the training data and then creating a second model that tries to correct the errors of the first. Models are added until the training set is predicted perfectly or a maximum number of models has been added.

AdaBoost was the first truly successful boosting algorithm developed for binary classification. This is the best starting point for understanding boosting. Modern boosting methods build on top of AdaBoost, most notably stochastic gradient boosting machines.

AdaBoost is used with short decision trees. After the first tree is created, its performance on each training instance is used to decide how much attention the next tree should pay to each instance. Training data that are hard to predict are given more weight, while instances that are easy to predict are given less. Models are created sequentially, one after another, and each updates the weights on the training instances, which affects the learning performed by the next tree in the sequence. After all trees are built, predictions are made on new data, and each tree's contribution is weighted by how accurate it was on the training data.
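A short illustration (scikit-learn's AdaBoostClassifier on its built-in breast-cancer dataset, which is not part of the original article), using depth-1 trees (stumps) as the weak learners:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Short trees (stumps) are added one after another; each new stump pays more
# attention to the training instances the previous stumps got wrong.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
ada.fit(X_train, y_train)

print("test accuracy:", ada.score(X_test, y_test))
```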

Because the algorithm puts so much emphasis on correcting errors, it is important that your data is clean and that outliers have been removed.

4. Final thoughts

When faced with such a wide variety of machine learning algorithms, a typical question a beginner asks is "Which algorithm should I use?" The answer depends on many factors, including: (1) the size, quality, and nature of the data; (2) the available computing time; (3) the urgency of the task; and (4) what you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform best before trying different algorithms. Although there are many other machine learning algorithms, these are the most popular ones, and if you are new to machine learning they are a good starting point. This column has implementations of all of the algorithms above; search CSDN for Sichuan Cainiao to find the machine learning column on the home page.

Source: blog.csdn.net/weixin_46211269/article/details/126415917