Data Mining Algorithms Comparison

Go to: http://www.36dsj.com/archives/68363

There are many machine learning algorithms, covering classification, regression, clustering, recommendation, image recognition, and more, so finding a suitable one is not easy. In practice we usually proceed heuristically, by experiment. A common starting point is an algorithm that is widely regarded as strong, such as SVM, GBDT, or Adaboost. Deep learning is also very popular now, so neural networks are a good choice as well.

If accuracy is what matters most, the best approach is to test each algorithm in turn with cross-validation, compare them, tune the parameters so that each algorithm performs at its best, and then pick the best one. But if you are only looking for a "good enough" algorithm for your problem, the tips below may serve as a reference. The rest of this article analyzes the advantages and disadvantages of each algorithm, which makes choosing among them easier.
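The cross-validation comparison described above can be sketched in a few lines of standard-library Python. This is only a minimal illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_val_accuracy`, `fit_majority`, `fit_threshold`) and the toy one-dimensional data are hypothetical, the folds are contiguous and unshuffled, and the threshold model assumes class 1 has the larger feature values. Real experiments would normally shuffle or stratify the folds.

```python
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[float], int]
Fitter = Callable[[Sequence[float], Sequence[int]], Predictor]


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n_samples-1 into k contiguous (train, test) folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    folds = []
    start = 0
    for i in range(k):
        # Spread the remainder over the first folds so sizes differ by at most 1.
        size = n_samples // k + (1 if i < n_samples % k else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n_samples) if j < start or j >= start + size]
        folds.append((train, test))
        start += size
    return folds


def cross_val_accuracy(fit: Fitter, X: Sequence[float], y: Sequence[int], k: int = 5) -> float:
    """Average test accuracy over k folds; fit() must return a predict(x) function."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


def fit_majority(X: Sequence[float], y: Sequence[int]) -> Predictor:
    """Baseline candidate: always predict the most frequent training label."""
    labels = list(y)
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda x: majority


def fit_threshold(X: Sequence[float], y: Sequence[int]) -> Predictor:
    """Toy candidate: threshold halfway between the two class means.

    Assumes labels are 0/1, both classes occur, and class 1 has larger values.
    """
    zeros = [x for x, label in zip(X, y) if label == 0]
    ones = [x for x, label in zip(X, y) if label == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > threshold)


# Hypothetical toy data: small values belong to class 0, large ones to class 1.
X = [float(v) for v in range(1, 11)]
y = [0] * 5 + [1] * 5

candidates = {"majority": fit_majority, "threshold": fit_threshold}
scores = {name: cross_val_accuracy(fit, X, y, k=5) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
```

The same loop works for any candidate that exposes a fit-then-predict interface, so parameter tuning can be added by treating each parameter setting as a separate candidate.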

Bias & Variance

In statistics, the quality of a model is measured by its bias and variance, so let's review these two concepts first:

Bias: the difference between the expected value E' of the prediction (the estimate) and the true value Y. The larger the bias, the further the predictions are from the real data.

Variance: the spread (dispersion) of the predicted value P, that is, how far the predictions fall from their expected value E'. The larger the variance, the more spread out the predictions are.

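The expected error of the model decomposes into these two parts (with the bias entering as its square), plus a noise term that no model can remove:

$$\mathrm{Error} = \mathrm{Bias}^2 + \mathrm{Variance} + \text{irreducible noise}$$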
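A small numerical check can make the bias and variance definitions concrete. The sketch below is an illustration only: `bias_variance` is a hypothetical helper, the three predictions stand in for the outputs of a model trained on three different training sets, and the target is assumed to be noise-free, so the mean squared error equals squared bias plus variance exactly.

```python
from statistics import fmean


def bias_variance(predictions, truth):
    """Return (bias_squared, variance, mse) for repeated predictions of one true value."""
    expected = fmean(predictions)                      # E', the expected prediction
    bias_sq = (expected - truth) ** 2                  # squared distance from the truth Y
    variance = fmean([(p - expected) ** 2 for p in predictions])
    mse = fmean([(p - truth) ** 2 for p in predictions])
    return bias_sq, variance, mse


# Hypothetical predictions of the same point by models fit on different training sets.
bias_sq, variance, mse = bias_variance([2.0, 4.0, 6.0], truth=5.0)
```

Here the predictions average to 4.0, one unit away from the true value, and scatter around that average; the two effects add up to the total squared error.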

If the training set is small, a high-bias/low-variance classifier (e.g., Naive Bayes, NB) has an advantage over a low-bias/high-variance classifier (e.g., KNN), because the latter will overfit. However, as the training set grows, the model can fit the underlying data better, so low-bias/high-variance classifiers gradually show their advantage (because they have lower asymptotic error), while high-bias classifiers are no longer sufficient to provide an accurate model.

Of course, you can also think of this as a difference between generative models (NB) and discriminative models (KNN).

Why is Naive Bayes high bias and low variance?

The following content is quoted from Zhihu:

First, consider the relationship between the training set and the test set. Simply put, we learn a model on the training set and then apply it to the test set; how well it works is judged by the error rate on the test set. In many cases, however, we can only assume that the test set follows the same distribution as the training set, and we do not have the real test data. How, then, can we gauge the test error rate when all we can see is the training error rate?

Because the training samples are few (or at least not enough), a model learned from the training set is never exactly right. (Even 100% accuracy on the training set does not mean the model captures the true data distribution, and capturing the true distribution, not just the finite set of training points, is the goal.) Moreover, real training samples usually contain some noise, so if you chase perfection on the training set with a very complex model, the model will treat the noise in the training set as a feature of the true data distribution and end up with a wrong estimate of that distribution.

Such a model then performs very badly on the real test set (this phenomenon is called overfitting). But the model cannot be too simple either: when the data distribution is complex, a model that is too simple cannot describe it (the error rate is high even on the training set, which is called underfitting). Overfitting means the model is more complex than the true data distribution; underfitting means the model is simpler than the true data distribution.

In statistical learning, model complexity is often discussed through the view that Error = Bias + Variance. Here Error can be roughly understood as the prediction error rate of the model. It has two parts: the inaccuracy caused by the model being too simple (Bias), and the extra room for variation and uncertainty caused by the model being too complex (Variance).

This makes Naive Bayes easy to analyze. It simply assumes that the features are independent of one another, which is a heavily simplified model. For such a simple model, the Bias part is larger than the Variance part in most cases, that is, high bias and low variance.

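In practice, to keep Error as small as possible, we need to balance the shares of Bias and Variance when choosing a model, that is, to balance overfitting against underfitting.

[Figure: bias and variance as a function of model complexity]

As model complexity increases, bias gradually decreases while variance gradually increases.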

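Advantages and Disadvantages of Common Algorithms

1. Naive Bayes

Naive Bayes is a generative model (the distinction between generative and discriminative models comes down mainly to whether the joint distribution is modeled). It is very simple: you are just doing a bunch of counting. If the conditional independence assumption holds (a fairly strict condition), a Naive Bayes classifier converges faster than discriminative models such as logistic regression, so you need less training data. Even when the conditional independence assumption does not hold, the NB classifier often still performs very well in practice. Its main disadvantage is that it cannot learn interactions between features; in mRMR terms (the R), this is feature redundancy. To borrow a classic example: although you like movies with Brad Pitt and movies with Tom Cruise, it cannot learn that you dislike movies in which they appear together.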

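Advantages:

  1. The Naive Bayes model originates from classical mathematical theory, so it has a solid mathematical foundation and stable classification efficiency;
  2. It performs well on small datasets, can handle multi-class tasks, and is suitable for incremental training;
  3. It is not very sensitive to missing data, and the algorithm is simple; it is often used for text classification.

Disadvantages:

  1. It requires computing prior probabilities;
  2. Its classification decisions have an error rate;
  3. It is very sensitive to the representation of the input data.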
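The remark that Naive Bayes is "just a bunch of counting" can be illustrated with a tiny text classifier. This is a sketch under assumptions, not a reference implementation: the function names `train_nb` and `predict_nb`, the toy documents, and the use of add-one (Laplace) smoothing are all choices made for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Training is pure counting: documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict_nb(model, words):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)          # prior from class counts
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus.
docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "now"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting", "tomorrow"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
```

Because the model is only a set of counters, new labeled documents can be folded in by updating the counts, which is why the method suits incremental training.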

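2. Logistic Regression

Logistic regression is a discriminative model. There are many ways to regularize it (L0, L1, L2, etc.), and you do not have to worry about whether your features are correlated, as you do with Naive Bayes. Compared with decision trees and SVMs, you also get a nice probabilistic interpretation, and you can easily update the model with new data (using online gradient descent). Use it if you need a probabilistic framework (for example, to adjust the classification threshold easily, to express uncertainty, or to obtain confidence intervals), or if you expect to fold more training data into the model quickly in the future.

Sigmoid function:

$$g(z) = \frac{1}{1 + e^{-z}}$$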

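Advantages:

  1. Simple to implement and widely used in industry;
  2. Very cheap to compute at classification time, fast, and low in storage requirements;
  3. Convenient probability scores for observed samples;
  4. Multicollinearity is not a problem for logistic regression; it can be handled by adding L2 regularization.

Disadvantages:

  1. Performance is not very good when the feature space is very large;
  2. Prone to underfitting, and accuracy is generally not very high;
  3. Does not handle a large number of multi-class features or variables well;
  4. Handles only binary classification (softmax, derived from it, can be used for multi-class problems), and works well only when the classes are linearly separable;
  5. Nonlinear features need to be transformed first.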
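The sigmoid function and the online gradient descent update mentioned for logistic regression can be sketched as follows. The function names, the learning rate `lr`, and the optional L2 strength `l2` are assumptions chosen for illustration; the update is the usual single-example gradient step on the log loss.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Probability that example x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sgd_update(weights, bias, x, y, lr=0.1, l2=0.0):
    """One online gradient-descent step for a single example with label y in {0, 1}."""
    error = predict_proba(weights, bias, x) - y        # gradient of log loss w.r.t. the logit
    new_weights = [w - lr * (error * xi + l2 * w) for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias


# Start from an uninformed model and show one positive example.
weights, bias = [0.0], 0.0
before = predict_proba(weights, bias, [1.0])
weights, bias = sgd_update(weights, bias, [1.0], 1)
after = predict_proba(weights, bias, [1.0])
```

Each new observation costs one such step, which is what makes it cheap to keep the model current as data arrives.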

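3. Linear Regression

Linear regression is used for regression, unlike logistic regression, which is used for classification. The basic idea is to minimize a least-squares error function with gradient descent; the parameters can also be solved for directly with the normal equation, which gives:

$$\hat{\theta} = (X^T X)^{-1} X^T y$$

In LWLR (locally weighted linear regression), the parameters are computed as:

$$\hat{\theta} = (X^T W X)^{-1} X^T W y$$

where X is the matrix of training inputs, y is the vector of targets, and W is a diagonal matrix of per-sample weights. So LWLR differs from ordinary linear regression: LWLR is a non-parametric model, because every regression computation has to traverse the training samples at least once.

Advantages: simple to implement and simple to compute.

Disadvantages: cannot fit nonlinear data.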
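The difference between the normal-equation solution and LWLR can be sketched for a single feature, where the matrix formulas reduce to a few weighted sums. This is an illustration under assumptions: the helper names are hypothetical, and the Gaussian kernel with bandwidth `tau` is one common weighting choice that the text does not specify.

```python
import math


def weighted_line_fit(xs, ys, weights):
    """Weighted least squares for y = theta0 + theta1 * x with one feature."""
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, xs))
    swy = sum(w * y for w, y in zip(weights, ys))
    swxx = sum(w * x * x for w, x in zip(weights, xs))
    swxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    det = sw * swxx - swx * swx
    if det == 0:
        raise ValueError("degenerate data: cannot fit a line")
    theta1 = (sw * swxy - swx * swy) / det
    theta0 = (swy - theta1 * swx) / sw
    return theta0, theta1


def ols_fit(xs, ys):
    """Ordinary linear regression: the same solve with all weights equal to 1."""
    return weighted_line_fit(xs, ys, [1.0] * len(xs))


def lwlr_predict(xs, ys, x0, tau=1.0):
    """LWLR prediction at x0: re-weight every training sample, then refit."""
    # Gaussian kernel weights (an assumption of this sketch); tau is the bandwidth.
    weights = [math.exp(-((x - x0) ** 2) / (2 * tau ** 2)) for x in xs]
    theta0, theta1 = weighted_line_fit(xs, ys, weights)
    return theta0 + theta1 * x0


# Hypothetical nonlinear data: y = x^2, whose true value at x = 0 is 0.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [4.0, 1.0, 0.0, 1.0, 4.0]
theta0, theta1 = ols_fit(xs, ys)
global_prediction = theta0 + theta1 * 0.0
local_prediction = lwlr_predict(xs, ys, 0.0, tau=0.5)
```

The global line predicts the overall average at x = 0, while the locally weighted fit lands much closer to the true value, at the cost of a pass over the training set for every query.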

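4. K-Nearest Neighbors (KNN)

KNN is the k-nearest-neighbor algorithm. Its main steps are:

1. Compute the distance from the test sample to every training sample (common distance measures include Euclidean distance and Mahalanobis distance);

2. Sort all of these distances;

3. Select the k samples with the smallest distances;

4. Vote using the labels of these k samples to obtain the final class.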
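The four steps above map almost line for line onto code. The following is a minimal sketch: `knn_classify` and the toy points are hypothetical, Euclidean distance is assumed, and ties in the vote are broken by whichever label `Counter` encountered first.

```python
import math
from collections import Counter


def knn_classify(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training samples."""
    # Step 1: distance from the query to every training sample (Euclidean).
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 2: sort by distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 3: keep the k closest samples.
    nearest = [label for _, label in distances[:k]]
    # Step 4: vote on the label.
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_y = ["A", "A", "A", "B", "B", "B"]
```

For example, `knn_classify(train_X, train_y, (2, 2))` votes among the three points near the origin. Note that there is no training step at all: all the work, and all the stored data, is needed at prediction time.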

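How to choose the best K depends on the data. In general, a larger K reduces the influence of noise on classification but blurs the boundaries between classes. A good K can be found with various heuristic techniques, such as cross-validation. In addition, noisy and irrelevant features reduce the accuracy of the k-nearest-neighbor algorithm.

The nearest-neighbor algorithm has strong consistency results. As the amount of data approaches infinity, its error rate is guaranteed to be no more than twice the Bayes error rate. For a suitable choice of K, the k-nearest-neighbor error rate is guaranteed to approach the Bayes error rate.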

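Advantages of KNN

  1. The theory is mature and the idea is simple; it can be used for both classification and regression;
  2. It can be used for nonlinear classification;
  3. Training time complexity is O(n);
  4. It makes no assumptions about the data, has high accuracy, and is not sensitive to outliers.

Disadvantages

  1. Computationally expensive;
  2. Suffers from class imbalance (some classes have many samples while others have very few);
  3. Requires a large amount of memory.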

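5. Decision Trees

Decision trees are easy to interpret. They handle interactions between features without any trouble, and they are non-parametric, so you do not have to worry about outliers or about whether the data is linearly separable (for example, a decision tree easily handles the case where class A appears at one end of some feature dimension x, class B in the middle, and class A again at the other end). One disadvantage is that they do not support online learning, so the tree has to be rebuilt entirely when new samples arrive.

Another disadvantage is that they overfit easily, but this is exactly where ensemble methods such as random forests (RF) or boosted trees come in. Moreover, random forests are often the winner on many classification problems (usually a little better than SVMs); they train quickly and are tunable, and you do not have to tune a pile of parameters as you do with SVMs, so they have long been popular.

An important point in decision trees is choosing an attribute to split on, so pay attention to the formula for information gain and understand it thoroughly.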

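The information entropy is computed as:

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$

Here n is the number of classes (for a two-class problem, n = 2). Compute the probabilities p1 and p2 with which the two classes occur in the whole sample; this gives the entropy before any attribute is chosen for splitting.

Now choose an attribute x_i to split on. The split rule is: if x_i = v, the sample goes to one branch of the tree; otherwise it goes to the other branch. Each branch will most likely still contain both classes, so compute the entropies H1 and H2 of the two branches and the total entropy after the split, H' = p1 H1 + p2 H2, where p1 and p2 are now the fractions of samples that fall into each branch; the information gain is then ΔH = H - H'. Using information gain as the criterion, test every attribute and choose the one with the largest gain as the attribute for this split.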
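The entropy and information-gain computation described above can be sketched directly. The function names and the toy weather-style rows are hypothetical; the split is the binary "equals v or not" rule from the text, and base-2 logarithms are assumed.

```python
import math
from collections import Counter


def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute, value):
    """Gain of the binary split rows[attribute] == value versus != value."""
    left = [y for row, y in zip(rows, labels) if row[attribute] == value]
    right = [y for row, y in zip(rows, labels) if row[attribute] != value]
    # H' weights each branch entropy by the fraction of samples in that branch.
    h_after = sum(len(part) / len(labels) * entropy(part) for part in (left, right) if part)
    return entropy(labels) - h_after


def best_split(rows, labels):
    """Try every (attribute, value) pair and keep the one with the largest gain."""
    candidates = sorted({(a, row[a]) for row in rows for a in row})
    return max(candidates, key=lambda c: information_gain(rows, labels, c[0], c[1]))


# Hypothetical toy data: the label depends on "outlook" only.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
```

On this data a split on `outlook` separates the classes completely, while a split on `windy` leaves both branches as mixed as before, so the gain criterion picks `outlook`.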

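Advantages of decision trees

  1. Simple to compute, easy to understand, and highly interpretable;
  2. Well suited to samples with missing attributes;
  3. Able to handle irrelevant features;
  4. Able to produce feasible, good results on large data sources in a relatively short time.

Disadvantages

  1. Prone to overfitting (random forests reduce overfitting considerably);
  2. Ignore correlations among the data;
  3. For data with unequal numbers of samples per class, information gain is biased toward features with more distinct values (this applies to anything that uses information gain, such as RF).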

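5.1 Adaboost

Adaboost is an additive model in which each model is built on the error rate of the previous one: it focuses heavily on misclassified samples and pays less attention to correctly classified ones, and after successive iterations a relatively good model is obtained. It is a typical boosting algorithm. Its advantages and disadvantages are summarized below.

Advantages

  1. Adaboost is a classifier with very high accuracy;
  2. The sub-classifiers can be built with a variety of methods; the Adaboost algorithm provides the framework;
  3. When simple classifiers are used, the computed results are understandable, and the weak classifiers are extremely simple to construct;
  4. It is simple and needs no feature selection;
  5. It is not prone to overfitting.

Disadvantages: sensitive to outliers.

For ensemble algorithms such as random forests and GBDT, see the article "Machine Learning: Summary of Ensemble Algorithms".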
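The way Adaboost shifts attention toward misclassified samples can be shown with one reweighting round. This sketch uses the standard discrete AdaBoost update, which the text above does not spell out, so treat the formula as an assumption of the example; `adaboost_round` is a hypothetical name, labels are assumed to be -1 or +1, and the incoming weights are assumed to sum to 1.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step for labels in {-1, +1}.

    Returns (alpha, new_weights), where alpha is the vote of this weak learner.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction (weight shrinks), -1 for a mistake (weight grows).
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


# Hypothetical round: five equally weighted samples, the last one misclassified.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. The same mechanism explains the sensitivity to outliers: a point that is repeatedly misclassified keeps gaining weight.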

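6. SVM (Support Vector Machine)

SVMs offer high accuracy and good theoretical guarantees against overfitting, and even if the data is not linearly separable in the original feature space, they work well given a suitable kernel function. They are especially popular for text classification, where very high-dimensional spaces are the norm. Unfortunately, they are memory-intensive, hard to interpret, and somewhat tedious to run and tune, whereas random forests avoid exactly these drawbacks and are more practical.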

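Advantages

  1. Can solve high-dimensional problems, that is, large feature spaces;
  2. Can handle interactions of nonlinear features;
  3. Do not need to rely on the entire dataset;
  4. Can improve generalization.

Disadvantages

  1. Not very efficient when there are many observations;
  2. There is no general solution for nonlinear problems, and it is sometimes hard to find a suitable kernel function;
  3. Sensitive to missing data.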

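Choosing a kernel also takes some skill (libsvm ships with four kernel functions: linear, polynomial, RBF, and sigmoid):

First, if the number of samples is smaller than the number of features, there is no need for a nonlinear kernel; simply use the linear kernel.

Second, if the number of samples is larger than the number of features, a nonlinear kernel can be used to map the samples into a higher-dimensional space, which generally gives better results.

Third, if the number of samples equals the number of features, a nonlinear kernel can be used, for the same reason as in the second case.

For the first case, you can also reduce the dimensionality of the data first and then use a nonlinear kernel.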
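The three rules above amount to a one-line decision, sketched here as a starting point for experiments rather than a definitive policy. The function name `suggest_kernel` is hypothetical, and returning "rbf" as the nonlinear choice is an assumption: the text only says "a nonlinear kernel", and RBF is one of the four kernels it lists.

```python
def suggest_kernel(n_samples: int, n_features: int) -> str:
    """Rule-of-thumb kernel choice based on the shape of the training data."""
    if n_samples < n_features:
        # Fewer samples than features: a linear kernel is usually enough.
        return "linear"
    # As many or more samples than features: try a nonlinear kernel.
    # "rbf" is this sketch's assumed default among the nonlinear options.
    return "rbf"
```

For example, a text dataset with 2,000 documents and 50,000 term features would start with the linear kernel, while 50,000 samples with 20 features would start with a nonlinear one.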

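7. Artificial Neural Networks

Advantages of artificial neural networks:

  1. High classification accuracy;
  2. Strong parallel distributed processing, distributed storage, and learning capabilities;
  3. Strong robustness and fault tolerance toward noisy inputs, and the ability to approximate complex nonlinear relationships closely;
  4. Associative memory capability.

Disadvantages of artificial neural networks:

  1. Neural networks need a large number of parameters, such as the network topology and the initial values of the weights and thresholds;
  2. The learning process cannot be observed and the output is hard to interpret, which affects the credibility and acceptability of the results;
  3. Training takes too long and may even fail to achieve the learning goal.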

8. K-Means Clustering

I previously wrote an article on K-Means clustering, "Machine Learning Algorithms: K-means Clustering". The derivation of K-Means embodies the powerful idea of EM.

Advantages

  1. The algorithm is simple and easy to implement;
  2. On large datasets the algorithm is relatively scalable and efficient, because its complexity is approximately O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Usually k << n. The algorithm usually converges to a local optimum.
  3. The algorithm tries to find the k partitions that minimize the squared-error function. It works well when the clusters are dense, spherical or blob-shaped, and clearly separated from one another.

Disadvantages

  1. Demanding about data types; suited to numerical data;
  2. May converge to a local minimum, and converges slowly on large-scale data;
  3. The value of K is hard to choose;
  4. Sensitive to the initial cluster centers; different initial values may lead to different clustering results;
  5. Not suitable for finding clusters with non-convex shapes or clusters that differ greatly in size;
  6. Sensitive to "noise" and outliers; a small amount of such data can greatly affect the mean.
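The sensitivity to initial centers and the convergence to a local minimum can both be seen in a few lines. This is a minimal sketch of the usual assign-then-average iteration (Lloyd's algorithm), with hypothetical function names and toy data; the initial centers are passed in explicitly so that the effect of initialization is visible.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    centers = [tuple(float(v) for v in c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster (keep it if empty).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


def sse(centers, clusters):
    """Squared-error objective that K-Means tries to minimize."""
    return sum(math.dist(p, c) ** 2 for c, cluster in zip(centers, clusters) for p in cluster)


# Hypothetical data: corners of a 4 x 1 rectangle, so the natural clusters are left and right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
good = kmeans(points, [(0, 0.5), (4, 0.5)])
bad = kmeans(points, [(2, 0), (2, 1)])
good_sse, bad_sse = sse(*good), sse(*bad)
```

Both runs converge immediately, but the second one is stuck grouping the bottom pair and the top pair, with a much larger squared error. Running the algorithm from several initializations and keeping the lowest-error result is a common way to reduce this risk.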

Algorithm Selection Reference

I have translated some foreign articles before, and one of them gives a simple technique for choosing an algorithm:

Start with logistic regression. If it does not work very well, its results can still serve as a baseline against which to compare other algorithms;

Then try decision trees (random forests) to see if you can drastically improve your model performance. Even if you don't use it as the final model in the end, you can use random forest to remove noise variables and do feature selection;

If the number of features and observation samples are particularly large, then when resources and time are sufficient (this premise is very important), using SVM may be an option.

As a rule of thumb: [GBDT >= SVM >= RF >= Adaboost >= Other…]. Deep learning is very popular now and is used in many fields. It is based on neural networks, and I am currently studying it myself, but my theoretical grounding is not yet solid and my understanding is not deep enough, so I will not cover it here.

Algorithms are important, but good data beats good algorithms, and designing good features pays off greatly. If you have a very large dataset, the choice of algorithm may not affect classification performance much (in that case, choose based on speed and ease of use).
