An introduction to machine learning algorithms

Reprinted from: https://blog.csdn.net/xiaochendefendoushi/article/details/81905111

In the field of machine learning there is a saying: "There is no free lunch." Simply put, it means that no single algorithm works best on every problem, and this is especially true for supervised learning.

For example, you cannot say that neural networks are always better than decision trees, or vice versa. Which model performs best depends on many factors, such as the size and structure of the dataset.

Therefore, you should try many different algorithms for your problem, evaluate their performance on a held-out test set, and select the best performer.

Of course, the algorithms you try must suit your problem, and that is where choosing the right machine learning task comes in. As an analogy, if you need to clean your house, you might use a vacuum cleaner, a broom, or a mop, but you certainly would not pick up a shovel and start digging.

For newcomers eager to learn the basics of machine learning, here are the ten machine learning algorithms most used by data scientists, along with an introduction to the characteristics of each, so that you can understand and apply them more easily. Let's take a look.

 

1. Linear regression

Linear regression is probably one of the most famous and most easily understood algorithms in statistics and machine learning.

Predictive modeling is primarily concerned with minimizing the error of a model, or making the most accurate predictions possible at the expense of interpretability. We borrow, reuse, and steal algorithms from many different fields, including statistics, and use them toward these ends.

Linear regression is represented by an equation that describes the linear relationship between the input variables (x) and the output variable (y) by finding specific weightings, called coefficients (B), for the input variables.

 

[Figure: Linear Regression]

Example: y = B0 + B1 * x

Given the input x, we predict y. The goal of the linear regression learning algorithm is to find the values of the coefficients B0 and B1.

Different techniques can be used to learn the linear regression model from data, for example the linear algebra solution of ordinary least squares, or gradient descent optimization.

Linear regression has been around for more than 200 years and has been studied extensively. Two useful rules of thumb when using this technique are to remove very similar (correlated) variables and, if possible, to remove noise from the data. It is a fast, simple technique and a good first algorithm to try.
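
As a minimal sketch in Python (all numbers are made up for illustration), here is how the coefficients B0 and B1 could be estimated with ordinary least squares:

    # Minimal ordinary least squares fit for y = B0 + B1 * x.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # toy inputs, made up for illustration
    ys = [1.2, 1.9, 3.1, 4.2, 4.8]          # toy outputs

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # B1 = covariance(x, y) / variance(x); B0 makes the line pass through the means.
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    b0 = mean_y - b1 * mean_x

    print(f"y = {b0:.3f} + {b1:.3f} * x")   # predict a new x with b0 + b1 * x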

 

2. Logistic regression

Logistic regression is another technique machine learning borrowed from the field of statistics. It is the go-to method for binary classification problems (problems with two class values).

Logistic regression is similar to linear regression in that the goal is to find the coefficient values that weight each input variable. Unlike linear regression, however, the prediction for the output is transformed using a nonlinear function called the logistic function.

The logistic function looks like a big S and can transform any value into the range 0 to 1. This is useful because we can apply a rule to the output of the logistic function to snap values to a 0/1 classification (e.g., IF less than 0.5 THEN output 0, otherwise output 1) and predict a class value.

 

[Figure: Logistic Regression]

Because of the way the model is learned, the predictions made by logistic regression can also be used as the probability of a data instance belonging to class 0 or class 1. This is useful for problems where you need to give a rationale for a prediction.

As with linear regression, logistic regression does better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other. It is a fast model to learn and effective on binary classification problems.
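
As a toy sketch (with made-up coefficients; a real model learns these from data), this is what the logistic transformation and the 0.5 rule look like:

    import math

    def logistic(z: float) -> float:
        """The S-shaped logistic function: maps any real value into (0, 1)."""
        return 1.0 / (1.0 + math.exp(-z))

    # Toy coefficients, made up for illustration.
    b0, b1 = -4.0, 1.5
    x = 3.2
    p = logistic(b0 + b1 * x)        # predicted probability of class 1
    label = 1 if p >= 0.5 else 0     # snap the probability to a class value
    print(f"p(class=1) = {p:.3f} -> class {label}")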

 

3. Linear discriminant analysis

Traditional logistic regression is limited to binary classification. If you have more than two classes, Linear Discriminant Analysis (LDA) is the preferred linear classification technique.

The representation of LDA is very simple. It consists of statistical properties of your data, calculated for each class. For a single input variable, this includes:

  • The mean value for each class.

  • The variance calculated across all classes.

[Figure: Linear Discriminant Analysis]

LDA makes predictions by calculating a discriminant value for each class and predicting the class with the largest value. The technique assumes that the data has a Gaussian distribution (bell curve), so it is a good idea to remove outliers from your data beforehand. It is a simple and powerful method for classification predictive modeling problems.
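
For the single-input case above, a minimal sketch of the discriminant calculation might look like this (assuming Gaussian classes with a shared variance; all numbers are made up for illustration):

    import math

    # Toy per-class statistics, made up for illustration.
    stats = {0: {"mean": 2.0, "prior": 0.5},
             1: {"mean": 6.0, "prior": 0.5}}
    variance = 1.5  # the single variance calculated across all classes

    def discriminant(x, mean, prior):
        # Discriminant for one class under the shared-variance Gaussian assumption.
        return x * (mean / variance) - mean ** 2 / (2 * variance) + math.log(prior)

    x = 4.5
    scores = {c: discriminant(x, s["mean"], s["prior"]) for c, s in stats.items()}
    print(max(scores, key=scores.get))  # predict the class with the largest value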

 

4. Classification and regression trees

A decision tree is an important machine learning algorithm.

The decision tree model can be represented as a binary tree. Yes, this is the binary tree from algorithms and data structures, nothing special. Each node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).

[Figure: Decision Tree]

The leaf nodes of the tree contain the output variable (y) used to make a prediction. Predictions are made by traversing the tree, stopping when a leaf node is reached, and outputting the class value at that leaf node.

Decision trees are fast to learn and fast at making predictions. They are often accurate for a broad range of problems, and you do not need to do anything special to prepare your data.
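
A toy sketch of the traversal just described, with a hand-built tree (split points and class values are made up for illustration):

    # Internal nodes hold an input variable index and a split point;
    # leaf nodes hold the class value that is output as the prediction.
    tree = {"var": 0, "split": 3.0,
            "left": {"leaf": "A"},
            "right": {"var": 0, "split": 7.0,
                      "left": {"leaf": "B"}, "right": {"leaf": "C"}}}

    def predict(node, row):
        """Walk the binary tree until a leaf is reached, then output its value."""
        while "leaf" not in node:
            node = node["left"] if row[node["var"]] < node["split"] else node["right"]
        return node["leaf"]

    print(predict(tree, [5.5]))  # -> "B"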

 

5. Naive Bayes

Naive Bayes is a simple but extremely powerful predictive modeling algorithm.

The model consists of two kinds of probabilities that can be calculated directly from your training data: 1) the probability of each class; and 2) the conditional probability for each class given each x value. Once calculated, the probability model can be used to make predictions on new data using Bayes' theorem. When your data is numeric, a Gaussian distribution (bell curve) is usually assumed so that these probabilities are easy to estimate.

[Figure: Bayes Theorem]

Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption that is unrealistic for real data, yet the technique remains very effective on a large range of complex problems.
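
A toy sketch of the two kinds of probabilities at work (all values are made up for illustration; a real model would count them from training data):

    # P(class) and P(x | class), made up for illustration.
    priors = {"spam": 0.4, "ham": 0.6}
    conditionals = {"spam": {"offer": 0.7, "meeting": 0.1},
                    "ham":  {"offer": 0.2, "meeting": 0.6}}

    def predict(words):
        """Score each class as P(class) * product of P(word | class); pick the max."""
        scores = {}
        for c, prior in priors.items():
            score = prior
            for w in words:
                score *= conditionals[c].get(w, 1e-6)  # small floor for unseen words
            scores[c] = score
        return max(scores, key=scores.get)

    print(predict(["offer"]))  # -> "spam"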

 

6. K-nearest neighbors

The KNN algorithm is very simple and very effective. The model for KNN is the entire training dataset. Simple, isn't it?

A prediction for a new data point is made by searching the entire training set for the K most similar instances (the neighbors) and summarizing the output variable of those K instances. For regression problems this might be the mean output value; for classification problems it might be the most common (mode) class value.

The trick lies in how to determine the similarity between data instances. If your attributes are all on the same scale, the simplest technique is Euclidean distance, which can be calculated directly from the differences between each pair of input variables.

 

[Figure: K-Nearest Neighbors]

KNN may require a lot of memory or space to store all of the data, but it only performs computation (or learning) when a prediction is needed. You can also update and curate your training set over time to keep predictions accurate.

The notion of distance or closeness can break down in high-dimensional settings (with a large number of input variables), which hurts the algorithm. This is known as the curse of dimensionality. It also suggests that you should only use those input variables that are most relevant to predicting the output variable.
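
A toy sketch of the KNN prediction described above, using Euclidean distance and a majority vote (the tiny dataset is made up for illustration):

    import math
    from collections import Counter

    # Toy training set of (features, class) pairs, made up for illustration.
    train = [([1.0, 1.1], "red"), ([1.2, 0.9], "red"),
             ([5.0, 5.2], "blue"), ([4.8, 5.1], "blue")]

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_predict(point, k=3):
        """Find the k nearest training instances and return the most common class."""
        neighbors = sorted(train, key=lambda item: euclidean(item[0], point))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    print(knn_predict([1.1, 1.0]))  # -> "red"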

 

7. Learning vector quantization

A downside of K-nearest neighbors is that you have to keep the entire training dataset. The Learning Vector Quantization algorithm (LVQ for short) is an artificial neural network algorithm that lets you choose how many training instances to hold on to and learns exactly what those instances should look like.

[Figure: Learning Vector Quantization]

LVQ is represented by a collection of codebook vectors. The vectors are chosen randomly at the start and then adapted over a number of iterations of the training dataset. After learning, the codebook vectors can be used to make predictions just like K-nearest neighbors: the most similar neighbor (best match) is found by calculating the distance between each codebook vector and the new data instance, and the class value of the best matching unit (or the real value in the regression case) is returned as the prediction. You get the best results if you rescale your data to the same range, such as between 0 and 1.

If you find that KNN gives good results on your dataset, try LVQ to reduce the memory required to store the entire training dataset.
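
A toy sketch of one LVQ-style update step: the best matching codebook vector is pulled toward a training instance when their classes agree, and pushed away otherwise (vectors and learning rate are made up for illustration):

    import math

    # Toy codebook vectors as (features, class) pairs, made up for illustration.
    codebook = [([0.2, 0.3], "red"), ([0.8, 0.7], "blue")]
    learning_rate = 0.3

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def lvq_update(instance, label):
        """Move the best matching unit toward the instance if classes match, away if not."""
        features, bmu_label = min(codebook, key=lambda c: euclidean(c[0], instance))
        sign = 1.0 if bmu_label == label else -1.0
        for i in range(len(features)):
            features[i] += sign * learning_rate * (instance[i] - features[i])

    lvq_update([0.25, 0.35], "red")
    print(codebook)  # the "red" vector has moved toward the instance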

 

8. Support vector machines

Support vector machines are perhaps among the most popular and most discussed machine learning algorithms.

A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to separate the points in the input variable space by their class (class 0 or class 1). In two dimensions you can visualize this as a line, and all of the input points can be completely separated by it. The SVM learning algorithm finds the coefficients that give the hyperplane the best separation of the classes.

[Figure: Support Vector Machine]

The distance between the hyperplane and the closest data points is called the margin, and the hyperplane with the largest margin is the best choice. Only these closest points matter in defining the hyperplane and constructing the classifier; they are called the support vectors, because they support, or define, the hyperplane. In practice, an optimization algorithm is used to find the coefficient values that maximize the margin.

SVM may be one of the most powerful out-of-the-box classifiers and is worth trying on your dataset.
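
A toy sketch of how a learned hyperplane makes predictions: the sign of w . x + b decides which side of the line a point falls on (the coefficients here are made up for illustration; a real SVM finds them with an optimizer that maximizes the margin):

    # Toy hyperplane coefficients, made up for illustration.
    w = [0.9, -0.4]
    b = -0.2

    def svm_predict(x):
        """Classify by which side of the hyperplane w . x + b = 0 the point lies."""
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if score >= 0 else 0

    print(svm_predict([1.0, 0.5]))  # -> 1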

 

9. Bagging and random forests

Random forests are among the most popular and most powerful machine learning algorithms. A random forest is a type of ensemble machine learning algorithm called Bootstrap Aggregation, or bagging.

The bootstrap is a powerful statistical method for estimating a quantity, such as a mean, from a data sample. You take lots of resamples of your data, calculate the mean of each, and then average all of the means to get a better estimate of the true mean.
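
A toy sketch of that bootstrap estimate of a mean (the sample values are made up for illustration):

    import random
    random.seed(0)

    data = [2.1, 3.4, 1.8, 4.0, 2.9, 3.3]  # toy sample, made up for illustration

    def bootstrap_mean(sample, n_resamples=1000):
        """Resample with replacement many times and average the resample means."""
        means = []
        for _ in range(n_resamples):
            resample = [random.choice(sample) for _ in sample]
            means.append(sum(resample) / len(resample))
        return sum(means) / len(means)

    print(f"bootstrap estimate of the mean: {bootstrap_mean(data):.3f}")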

Bagging uses the same approach, but most commonly with decision trees rather than estimating a whole statistical model. Multiple samples of your training data are taken, and a model is built for each of them. When you need to make a prediction on new data, each model makes a prediction, and the predictions are averaged to give a better estimate of the true output value.

 

[Figure: Random Forest]

A random forest is a tweak on this approach: instead of selecting the optimal split point, the decision trees are made to take suboptimal splits by introducing randomness.

The models created for each sample of the data are therefore more different from one another than they otherwise would be, yet still accurate in their own ways. Combining their predictions gives a better estimate of the true underlying output value.

If you get good results with a high-variance algorithm (such as decision trees), you can often get even better results by bagging that algorithm.

 

10. Boosting and AdaBoost

Boosting is an ensemble technique that creates a strong classifier from a number of weak classifiers. It works by building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models has been added.

AdaBoost was the first truly successful boosting algorithm developed for binary classification, and it is the best starting point for understanding boosting. The most famous algorithms built on AdaBoost today are stochastic gradient boosting machines.

 

[Figure: AdaBoost]

AdaBoost is often used with short decision trees. After the first tree is created, the performance of the tree on each training instance determines how much attention the next tree must pay to that instance. Training data that is hard to predict is given more weight, while instances that are easy to predict are given less weight. Models are created sequentially, one after another, each updating the weights that affect the learning performed by the next tree in the sequence. After all of the trees are built, predictions are made for new data, and the vote of each tree is weighted by how accurate it was on the training data.

Because the algorithm puts so much attention on correcting errors, it is important to have clean data with outliers removed.
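
A toy sketch of one round of the instance re-weighting just described, in the common exponential form (the labels and the weak learner's predictions are made up for illustration):

    import math

    # Toy labels and one weak learner's predictions, made up for illustration.
    labels      = [1, 1, -1, -1, 1]
    predictions = [1, -1, -1, -1, 1]             # instance 1 was misclassified
    weights = [1.0 / len(labels)] * len(labels)  # start with uniform weights

    # Weighted error, and the learner's say (alpha) in the final weighted vote.
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    alpha = 0.5 * math.log((1 - error) / error)

    # Re-weight: misclassified instances gain weight, correct ones lose weight.
    weights = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, labels, predictions)]
    total = sum(weights)
    weights = [w / total for w in weights]       # renormalize to sum to 1

    print([round(w, 3) for w in weights])        # the hard instance now dominates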

 

In closing

A typical question a beginner asks when facing the wide variety of machine learning algorithms is "Which algorithm should I use?" The answer depends on many factors, including:

  • The size, quality, and nature of the data;

  • The available computation time;

  • The urgency of the task;

  • What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform best before trying different ones. Although there are many other machine learning algorithms, these are the most popular. If you are new to machine learning, they are a good starting point.
