Notes on Common Machine Learning Algorithms: Formulas and Practical Experience

 

I found this write-up particularly good, so I have excerpted it here.

In fact, in the interview process, understanding only the basic ideas and procedures of these algorithms is not enough. The questions interviewers ask are often tied to the company's internal business, which requires you not only to understand the theory behind these algorithms, but also to be very familiar with how to use them, in which situations to use them, their advantages and disadvantages, tuning experience, and so on. To put it plainly, you need both theory and some application, both a bit of depth and a bit of breadth; otherwise, with bad luck, you can easily be screened out, because every interviewer has their own preferences.

 

Intelligent decision-making is an important capability. It is similar to the database concept of a "trigger"; its meaning can be expressed as: what do we do under which circumstances? The "classification" concept in the ML field is similar: an intelligent decision goes from decision criteria to a decision result, much like a classifier does. Classification can be expressed as an if () then () statement, pictured as a logical combination of booleans, e.g. if (Feature > X) then (ClassLabel = 1); decisions and triggers can be described the same way. Pattern recognition and machine learning obtain a classification function directly from data, skipping the step of extracting rules by hand; the rules are expressed implicitly through the model.

 

Discovering rules is the goal that science pursues; rules are simplified representations of complex things, just as complicated semantics are captured by a relatively small set of grammatical constraints. Master the rules of formal semantics, the syntax, and you can handle complex scenes with a small amount of storage. Machine learning is a procedure that explicitly or implicitly looks for the rules hidden in complex scenes and expresses them through a model.

 The first goal of an expert system is completeness, meaning that the rules in the system do not contradict each other. The first goal of a machine learning model is generalization: a complete decision model can classify more of the sample space.

A rule-based expert system is a hard-rule system; knowledge is expressed as the rules in the system, and an expert system can be seen as the ultimate explicit-rule form of a machine learning model. But a rule cannot be extended without limit; there are always new targets it cannot handle. This restriction comes from the source of knowledge acquisition, the data, and it does not mean the rules are wrong; it can be handled by subdividing the rules, the way a decision tree splits.

       When the rules of an expert system fail on a new target, the rules can be rebuilt from feedback to restore the completeness of the rule system. This corresponds to "online machine learning" in the machine learning field, and maps, for example, onto an "online decision tree" algorithm.

 

  Extensibility is one of the performance metrics of a machine learning algorithm. Machine learning algorithms follow samples -> model -> decision on a new instance, extending the cause-and-effect association learned from samples to new scenes. Depending on the type of model and whether the intermediate steps are known, they divide into various families. "Scalability" and "generalization" appear, in a homomorphic mathematical system, as extending the range over which a function applies.

 

   If, in samples -> model -> decision on a new scene instance, the model is treated as a black box, the representative is the neural network: the model is described by parameters, and good results can be obtained even without knowing what the parameters mean. Is this kind of intelligence, which uses structure in place of explicit "decisions", worth pursuing?

       If, in samples -> model -> decision on a new scene instance, the model is the relationship between prior and posterior probabilities, the method is naive Bayes. Naive Bayes uses observed results to estimate their causes, describing the rule with a numerical "possibility"; a certain error rate is unavoidable, and this is also the accuracy limit that other models can at best reach.

       If, in samples -> model -> decision on a new scene instance, the model is a set of additional boolean logic rules, the method is the decision tree. A decision tree uses "AND" operations: features are prioritized to build a hierarchical tree structure, and the decision is completed by recursing from the root to a leaf through "AND" operations. Decision trees are bound to overfit; their generalization is clearly constrained by the topology of the sample distribution, because their splits carve the decision space into blocks, and the classification surfaces between leaves under different parents are unrelated.

  If, in samples -> model -> decision on a new scene instance, the model is a continuous function, fitting is required: the linear regression family of methods.

 

 

1. Naive Bayes:

  There are several things to note:

  1. If the given feature vectors may have different lengths, they need to be normalized to a common-length vector (using text classification as an example): for instance, if the instance is a sentence of words, then the vector length is the size of the whole vocabulary, and the entry at each position is the number of times that word appears.

  2. The formula is as follows:

  p(c_i \mid w) = \frac{p(w \mid c_i)\, p(c_i)}{p(w)}

  The conditional probability in it can be expanded using the naive Bayes conditional-independence assumption. One point to note is the way p(w | c_i) is computed: from the naive Bayes premise,

  p(w_0, w_1, \dots, w_N \mid c_i) = p(w_0 \mid c_i)\, p(w_1 \mid c_i) \cdots p(w_N \mid c_i)

  There are therefore two common ways to estimate p(w_j | c_i): one is to take the samples of class c_i, count the total number of occurrences of w_j, and divide by the total number of those samples; the second is to take the samples of class c_i, count the total number of occurrences of w_j, and divide by the total number of occurrences of all features in those samples.

  3. If one of these terms is 0, the product in the joint probability may also become 0, i.e. the numerator of the formula in 2 becomes 0. To avoid this, the counts are normally initialized to 1, and to keep the probabilities consistent the denominator is correspondingly initialized to 2 (2 here because there are 2 classes; with k classes you add k). This trick is called Laplace smoothing; adding k to the denominator keeps the total-probability formula satisfied.
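  As an illustration of the counting scheme and Laplace smoothing described above, here is a minimal Python sketch (not from the original post; the function names and data layout are assumptions for illustration). It uses the second estimate, dividing each word's count in a class by the total word count of that class, with counts initialized to 1 and the denominator to 2:

```python
import numpy as np

def train_naive_bayes(doc_vectors, labels):
    """doc_vectors: (num_docs, vocab_size) word-count matrix; labels: 0/1 per doc."""
    doc_vectors = np.asarray(doc_vectors, dtype=float)
    labels = np.asarray(labels)
    p_c1 = labels.mean()                                  # prior p(c = 1)
    # Laplace smoothing: word counts start at 1, denominators start at 2
    counts0 = np.ones(doc_vectors.shape[1]); denom0 = 2.0
    counts1 = np.ones(doc_vectors.shape[1]); denom1 = 2.0
    for vec, y in zip(doc_vectors, labels):
        if y == 1:
            counts1 += vec; denom1 += vec.sum()
        else:
            counts0 += vec; denom0 += vec.sum()
    # store log p(w_j | c) to avoid underflow when multiplying many small terms
    log_pw0 = np.log(counts0 / denom0)
    log_pw1 = np.log(counts1 / denom1)
    return log_pw0, log_pw1, p_c1

def classify(vec, log_pw0, log_pw1, p_c1):
    score1 = np.dot(vec, log_pw1) + np.log(p_c1)
    score0 = np.dot(vec, log_pw0) + np.log(1.0 - p_c1)
    return 1 if score1 > score0 else 0
```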

  Advantages of naive Bayes:

  Performs well on small-scale data, handles multi-class tasks, and is suitable for incremental training.

  Disadvantages:

  Sensitive to the form in which the input data is expressed.


 

 

2. Decision Tree:

  A very important point in decision trees is how to choose the attribute to split on, so pay attention to how the information gain is calculated and understand it in depth.

  Entropy is calculated as follows:

  H = -\sum_{i=1}^{n} p_i \log_2 p_i

  Here n is the number of classes (for a 2-class problem, n = 2); p1 and p2 are the fractions of the two classes in the overall sample, from which we can compute the entropy before any split attribute is chosen.

  Now pick an attribute x_i to use for a split. The split rule is: if x_i = v_x, the sample is assigned to one branch of the tree; otherwise it goes to the other branch. Each branch is likely to contain samples from both classes. Compute the entropies H1 and H2 of the two branches, and the total entropy after the split as H' = p1'·H1 + p2'·H2, where p1' and p2' are the fractions of samples falling into each branch. The information gain is then ΔH = H - H'. Following the information-gain criterion, try every attribute and select the one with the largest gain as the attribute for this split.
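  A small sketch of the entropy and information-gain computation just described (illustrative only; the dataset layout, a list of samples with the class label as the last element, and an equal/not-equal split are assumptions):

```python
from math import log

def entropy(dataset):
    """dataset: list of samples; the class label is the last element of each sample."""
    counts = {}
    for sample in dataset:
        counts[sample[-1]] = counts.get(sample[-1], 0) + 1
    n = len(dataset)
    return -sum((c / n) * log(c / n, 2) for c in counts.values())

def info_gain(dataset, feature_index, value):
    """Gain of splitting on dataset[feature_index] == value vs. != value."""
    base = entropy(dataset)
    left = [s for s in dataset if s[feature_index] == value]
    right = [s for s in dataset if s[feature_index] != value]
    n = len(dataset)
    split_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return base - split_entropy
```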

  Decision tree advantages:

  Simple to compute, highly interpretable, relatively good at handling samples with missing attribute values, and able to cope with irrelevant features;

  Disadvantages:

  Prone to overfitting (random forests appeared later and reduce overfitting);

 

 

3. Logistic Regression:

  Logistic regression is used for classification; it is a linear classifier. Points to note:

  1. The expression for the logistic (sigmoid) function is:

  g(z) = \frac{1}{1 + e^{-z}}

  Its derivative is:

  g'(z) = g(z)\,\bigl(1 - g(z)\bigr)

  2. Logistic regression learns its parameters mainly by maximum likelihood estimation, so the posterior probability of a single sample is:

  p(y \mid x; \theta) = \bigl(h_\theta(x)\bigr)^{y}\, \bigl(1 - h_\theta(x)\bigr)^{1-y}

  The posterior probability (likelihood) of the whole sample set is:

  L(\theta) = \prod_{i=1}^{m} \bigl(h_\theta(x^{(i)})\bigr)^{y^{(i)}}\, \bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}}

  where:

  h_\theta(x) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}

  Taking the logarithm, this simplifies further to:

  \ell(\theta) = \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr]

  3. The loss function is in fact -\ell(\theta), so we want to minimize this loss, which can be done by gradient descent. The gradient update is:

  \theta_j := \theta_j + \alpha \sum_{i=1}^{m} \bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)\, x_j^{(i)}

  

  Logistic regression advantages:

  1. Simple to implement;

  2. Very little computation at classification time, fast, and low storage requirements;

  Disadvantages:

  1. Prone to underfitting; the accuracy is generally not very high;

  2. Can only handle binary classification (softmax, derived from it, can be used for multi-class), and the classes must be linearly separable;
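  As a quick illustration of the gradient update given above, a minimal batch-gradient sketch (illustrative only; the learning rate and iteration count are arbitrary assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.01, n_iters=500):
    """X: (m, n) feature matrix, y: (m,) labels in {0, 1}. Returns theta of shape (n,)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)           # h_theta(x) for every sample
        theta += lr * X.T @ (y - h)      # ascent on the log-likelihood = descent on -l(theta)
    return theta

def predict(X, theta):
    return (sigmoid(np.asarray(X) @ theta) >= 0.5).astype(int)
```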

 

 

4. Linear Regression:

  Linear regression really is used for regression, unlike logistic regression, which is used for classification. The basic idea is to optimize the least-squares error function with gradient descent; the parameters can of course also be obtained directly from the normal equation, the result being:

  \hat{\theta} = (X^{T} X)^{-1} X^{T} \vec{y}

  In LWLR (locally weighted linear regression), the expression for the parameters is:

  \hat{\theta} = (X^{T} W X)^{-1} X^{T} W \vec{y}

  because the objective being optimized in this case is:

  \sum_{i} w^{(i)} \bigl(y^{(i)} - \theta^{T} x^{(i)}\bigr)^{2}

  Thus LWLR differs from LR in that LWLR is a non-parametric model: every prediction requires at least one pass over the training samples.
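  A small sketch of the normal equation and the LWLR weighting above (illustrative only; a Gaussian kernel with bandwidth k is an assumed choice for the per-sample weights):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y; assumes X^T X is invertible."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    return np.linalg.solve(X.T @ X, X.T @ y)

def lwlr_predict(x_query, X, y, k=1.0):
    """Locally weighted linear regression prediction for one query point,
    weighting each training sample with a Gaussian kernel of bandwidth k."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    diff = X - x_query
    w = np.exp(-np.sum(diff * diff, axis=1) / (2.0 * k ** 2))   # per-sample weights
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)           # (X^T W X)^{-1} X^T W y
    return x_query @ theta
```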

  Linear regression advantages:

  Simple to implement and cheap to compute;

  Disadvantages:

  Cannot fit non-linear data;

 

 

5. KNN Algorithm:

  KNN is the k-Nearest Neighbors algorithm. The main procedure is:

  1. Compute the distance from the test sample to every training sample (common distance metrics include the Euclidean distance, the Mahalanobis distance, etc.);

  2. Sort all of the distance values;

  3. Select the k samples with the smallest distances;

  4. Vote over the labels of those k samples to obtain the final classification;
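  A minimal sketch of these four steps using the Euclidean distance (illustrative only):

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    train_X = np.asarray(train_X, dtype=float)
    dists = np.sqrt(((train_X - np.asarray(x, dtype=float)) ** 2).sum(axis=1))  # step 1
    nearest = np.argsort(dists)[:k]                                             # steps 2-3
    votes = Counter(train_y[i] for i in nearest)                                # step 4
    return votes.most_common(1)[0][0]
```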

  How to choose the best value of K depends on the data. In general, a larger K reduces the effect of noise during classification, but blurs the boundaries between classes. A good K can be chosen with various heuristic techniques, for example cross-validation. The presence of noise and of irrelevant feature dimensions also reduces the accuracy of the K-nearest-neighbors algorithm.

  The nearest-neighbor algorithm has fairly strong consistency results. As the amount of data tends to infinity, the algorithm is guaranteed an error rate no worse than twice the Bayes error rate. For some good values of K, K-nearest neighbors is guaranteed an error rate no worse than the Bayes theoretical error rate.

  Note: the Mahalanobis distance requires the statistical properties of the sample set, such as the mean vector and covariance matrix, to be given first. The Mahalanobis distance is defined as:

  D_M(x) = \sqrt{(x - \mu)^{T} \Sigma^{-1} (x - \mu)}

  Advantages of KNN:

  1. Simple idea, mature theory; can be used for classification as well as regression;

  2. Can be used for non-linear classification;

  3. Training time complexity is O(n);

  4. High accuracy, makes no assumptions about the data, not sensitive to outliers;

  Disadvantages:

  1. Computationally heavy;

  2. The sample-imbalance problem (some classes have many samples while others have very few);

  3. Requires a large amount of memory;

 

 

6. SVM:

  You should learn how to use libsvm and gain some experience tuning its parameters; you should also sort out the main line of reasoning of the SVM algorithm:

  1. The optimal separating hyperplane in an SVM maximizes the geometric margin over all samples. (Why choose the maximum-margin classifier? Explain it from a mathematical point of view; this was asked in a NetEase deep learning interview. The answer is that there is a relation between the geometric margin and the number of misclassifications: number of misclassifications ≤ (2R/δ)², where the denominator δ is the distance from the samples to the separating margin and R in the numerator is the largest norm among all sample vectors.) That is:

  \max_{\gamma, w, b}\ \gamma \quad \text{s.t.}\ \ y^{(i)} \bigl(w^{T} x^{(i)} + b\bigr) \ge \gamma,\ \ \|w\| = 1

  After a series of derivations, this becomes the following primal optimization objective:

  \min_{w,b}\ \frac{1}{2}\|w\|^{2} \quad \text{s.t.}\ \ y^{(i)} \bigl(w^{T} x^{(i)} + b\bigr) \ge 1,\ \ i = 1, \dots, m

  2. Now a look at the Lagrangian theory:

  \mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i} \alpha_i\, g_i(w) + \sum_{i} \beta_i\, h_i(w)

  The optimization objective in 1 can be converted into Lagrangian form (through dual optimization and the KKT conditions), and the final objective function is:

  \mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^{2} - \sum_{i=1}^{m} \alpha_i \bigl[ y^{(i)} \bigl(w^{T} x^{(i)} + b\bigr) - 1 \bigr]

  We only need to minimize this objective function over w and b; the α here are the Lagrange multipliers of the inequality constraints in the original optimization problem.

  3. Setting the derivatives of the final expression in 2 with respect to w and b to zero gives:

  w = \sum_{i=1}^{m} \alpha_i\, y^{(i)}\, x^{(i)}

  \sum_{i=1}^{m} \alpha_i\, y^{(i)} = 0

  From the first equation above, once α has been optimized, w can be obtained directly, i.e. the model parameters are settled. The second equation serves as a constraint in the subsequent optimization.

  4. Applying dual-optimization theory to the final objective in 2 converts it into optimizing the following objective:

  \max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \bigl\langle x^{(i)}, x^{(j)} \bigr\rangle \quad \text{s.t.}\ \ \alpha_i \ge 0,\ \ \sum_{i=1}^{m} \alpha_i\, y^{(i)} = 0

  This function can be solved for α with common optimization methods, after which w and b follow.

  5. In principle, the basic theory of the SVM could stop here. One more point, though: at prediction time we have:

  w^{T} x + b = \sum_{i=1}^{m} \alpha_i\, y^{(i)} \bigl\langle x^{(i)}, x \bigr\rangle + b

  The angle bracket (the inner product) can be replaced by a kernel function, which is why SVMs are so often discussed together with kernels.

  6. Finally, slack variables are introduced, so the original optimization objective becomes:

  \min_{w,b,\xi}\ \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.}\ \ y^{(i)} \bigl(w^{T} x^{(i)} + b\bigr) \ge 1 - \xi_i,\ \ \xi_i \ge 0

  The corresponding dual optimization problem becomes:

  \max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \bigl\langle x^{(i)}, x^{(j)} \bigr\rangle \quad \text{s.t.}\ \ 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{m} \alpha_i\, y^{(i)} = 0

  Compared with the earlier dual, the only difference is that α now has an upper bound.
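  The post recommends getting hands-on with libsvm and tuning its parameters; a hedged sketch of the same workflow using scikit-learn's SVC (which wraps libsvm) is shown below. The dataset and the parameter grid are assumptions for illustration, not recommendations from the original post:

```python
# pip install scikit-learn
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# C is the slack penalty (the upper bound on alpha in the dual);
# gamma is the RBF kernel width. Both are tuned by cross-validation.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```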

  Advantages of SVM:

  Can be used for linear or non-linear classification, and also for regression;

  Low generalization error;

  Easy to interpret;

  Relatively low computational complexity;

  Disadvantages:

  Rather sensitive to the choice of parameters and of the kernel function;

  The original SVM is only really good at binary classification;

 

 

7. Boosting:

  Taking Adaboost as the main example, first look at Adaboost's flow chart:

  (figure: Adaboost training flow, three weak classifiers trained on five weighted samples and combined by a weighted vote)

  As the figure shows, during training we need to train several weak classifiers (three in the figure). Each weak classifier is trained on samples carrying different weights (five training samples in the figure), and for the first weak classifier the input samples all have the same weight. Each weak classifier also contributes differently to the final classification result: the output is a weighted average, with the weights shown inside the triangles in the figure. So how are these weak classifiers and their weights trained?

  A simple example illustrates the process.

  The book (Machine Learning in Action) assumes 5 training samples, each two-dimensional. When training the first classifier, the 5 samples each have weight 0.2. Note that the sample weights are different from the weights α attached to the trained weak classifiers: the sample weights are used only during training, whereas α is used both during training and at test time.

  Now assume the weak classifier is a simple one-node decision tree (a stump), which picks one of the 2 attributes (assume there are only 2) and finds the best threshold on that attribute for classification.

  The simplified Adaboost training process is as follows:

  1. Train the first classifier. The sample weights D are all equal. Through a weak classifier we obtain predicted labels for these 5 samples (following the example in the book, still Machine Learning in Action). Comparing them with the given true labels, some may be wrong. If a sample is predicted incorrectly, its error value is that sample's weight; if it is classified correctly, the error value is 0. Finally, sum the errors of the 5 samples; the result is the error rate ε.

  2. Use ε to compute this weak classifier's weight α, with the formula:

  \alpha = \frac{1}{2} \ln\!\Bigl(\frac{1 - \varepsilon}{\varepsilon}\Bigr)

  3. Use α to compute the sample weights D for training the next weak classifier. If a sample was classified correctly, its weight is decreased, using:

  D_i^{(t+1)} = \frac{D_i^{(t)}\, e^{-\alpha}}{\operatorname{Sum}(D)}

  If a sample was classified incorrectly, its weight is increased, using:

  D_i^{(t+1)} = \frac{D_i^{(t)}\, e^{\alpha}}{\operatorname{Sum}(D)}

  4. Repeat steps 1, 2, 3 to keep training more classifiers; only the values of D change between rounds.

  The test process is as follows:

  Feed a sample to each of the trained weak classifiers. Each weak classifier outputs a label, which is multiplied by its corresponding α; the sign of the sum of these products is the predicted label.
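  A compact sketch of the training and test steps just described, using decision stumps as the weak classifiers (illustrative only; labels are assumed to be ±1, as in Machine Learning in Action):

```python
import numpy as np

def stump_predict(X, dim, thresh, ineq):
    """Weak classifier: label +1/-1 depending on which side of thresh the value falls."""
    pred = np.ones(X.shape[0])
    if ineq == "lt":
        pred[X[:, dim] <= thresh] = -1.0
    else:
        pred[X[:, dim] > thresh] = -1.0
    return pred

def build_stump(X, y, D):
    """Find the stump (dimension, threshold, inequality) with the lowest weighted error."""
    best, best_err, best_pred = None, np.inf, None
    for dim in range(X.shape[1]):
        for thresh in np.unique(X[:, dim]):
            for ineq in ("lt", "gt"):
                pred = stump_predict(X, dim, thresh, ineq)
                err = D[pred != y].sum()                  # weighted error epsilon
                if err < best_err:
                    best, best_err, best_pred = (dim, thresh, ineq), err, pred
    return best, best_err, best_pred

def adaboost_train(X, y, n_rounds=10):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                               # equal sample weights to start
    classifiers = []
    for _ in range(n_rounds):
        stump, eps, pred = build_stump(X, y, D)
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-16))
        D *= np.exp(-alpha * y * pred)                    # shrink correct, grow wrong samples
        D /= D.sum()
        classifiers.append((stump, alpha))
    return classifiers

def adaboost_predict(X, classifiers):
    agg = np.zeros(X.shape[0])
    for (dim, thresh, ineq), alpha in classifiers:
        agg += alpha * stump_predict(X, dim, thresh, ineq)
    return np.sign(agg)                                   # sign of the weighted vote
```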

  Advantages of boosting:

  Low generalization error;

  Easy to implement, fairly high classification accuracy, not many parameters to tune;

  Disadvantages:

  Rather sensitive to outliers;

 

 

8. Clustering:

  Grouped by the underlying clustering idea:

  1. Partition-based clustering:

  K-means, k-medoids (find one sample point to represent each cluster), CLARANS.

  K-means minimizes the value of the following expression:

  J = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^{2}

   Advantages of k-means:

  (1) K-means is a classic algorithm for the clustering problem; it is simple and fast.

  (2) For processing large data sets the algorithm is relatively scalable and efficient, because its complexity is roughly O(nkt), where n is the number of objects, k the number of clusters, and t the number of iterations; usually k << n. The algorithm normally converges to a local optimum.

  (3) The algorithm tries to find the k partitions that minimize the squared-error function. The clustering works well when the clusters are dense, spherical or blob-like, and clearly distinct from one another.

   Disadvantages:

  (1) The k-means method can only be used when the mean of a cluster is defined, so it is not suitable for some data with categorical attributes.

  (2) The user must specify in advance the number of clusters k to generate.

  (3) Sensitive to the initial values; different initializations may lead to different clustering results.

  (4) Not suited to discovering clusters with non-convex shapes, or clusters of very different sizes.

  (5) Sensitive to "noise" and isolated points; a small amount of such data can have a huge influence on the means.
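  A short sketch of the k-means loop that minimizes the objective above (illustrative only; the random initialization and fixed iteration cap are assumptions):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Return (centroids, assignments) approximately minimizing the sum of squared distances."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # random initial centers
    for _ in range(n_iters):
        # assignment step: each point goes to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign
```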

  2. Hierarchy-based clustering:

  Bottom-up agglomerative methods, e.g. AGNES.

  Top-down divisive methods, e.g. DIANA.

  3. Density-based clustering:

  DBSCAN, OPTICS, BIRCH (CF-Tree), CURE.

  4. Grid-based methods:

  STING, WaveCluster.

  5. Model-based clustering:

  EM, SOM, COBWEB.

  Brief introductions to these algorithms can be found in the 聚类 (clustering) entry of Baidu Encyclopedia.

 

9. Recommender Systems:

  Recommender systems are mainly implemented in two ways: content-based and collaborative filtering.

  Content-based implementation:

  Take the example of different people rating different movies. It can be treated as an ordinary regression problem: each movie needs a feature vector extracted in advance (the x value), then a model is built for each user, with the scores that user gives as the y values. Using the existing scores y and the movie features x, a regression model (most commonly linear regression) can be trained and then used to predict the scores of the movies the user has not rated. (Note that a separate regression model is built for every user.)

  From another angle, one can instead first fix each user's degree of preference for each kind of movie (i.e. the weights), then learn each movie's features, and finally use regression to predict the ratings of the movies that have not been rated.

  Of course, one can also jointly optimize both each user's preference for different kinds of movies and each movie's features. See Ng's ML course on Coursera: https://www.coursera.org/course/ml

  Collaborative-filtering implementation:

  Collaborative filtering (CF) can be viewed as a classification problem, or as a matrix-factorization problem. Collaborative filtering rests mainly on the fact that people's preferences cluster into similar patterns, and it does not depend on a user's personal profile information. In the movie-rating example above, predicting the scores of unrated movies depends only on the scores that have already been given; there is no need to learn features of the movies.

  SVD factorizes a matrix into the product of three matrices, as follows:

  Data_{m \times n} = U_{m \times m}\, \Sigma_{m \times n}\, V^{T}_{n \times n}

  The middle matrix Σ is a diagonal matrix whose diagonal entries are the singular values of the Data matrix (note that singular values and eigenvalues are different things), already sorted from largest to smallest. Even if the dimensions with small singular values are dropped, the original matrix can still be reconstructed well, as illustrated below:

  (figure: truncated SVD reconstruction)

  The darker colors mark the three matrices used when reconstructing after the small singular values have been removed.

  If m is the number of items and n the number of users, then each row of the U matrix represents an item's attributes. After reducing the dimensionality of U (taking the dark part), each item's attributes can be expressed in a lower dimension (say k dimensions). When a new user's item vector X arrives, it can be mapped to a k-dimensional vector with the formula X' * U1 * inv(S1); then look in V' for the most similar user (similarity can be measured with the cosine formula, etc.) and make recommendations based on that user's ratings (mainly recommending the items the new user has not yet rated). For a concrete example see: SVD在推荐系统中的应用 (SVD applied in recommender systems).
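  A hedged numpy sketch of the X' * U1 * inv(S1) projection described above (illustrative only; the toy rating matrix, the choice of k, and the use of cosine similarity are assumptions):

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# toy rating matrix: rows = items (m), columns = users (n); 0 means "not rated"
data = np.array([[5, 5, 0, 1],
                 [4, 5, 0, 1],
                 [1, 1, 0, 5],
                 [0, 1, 5, 4],
                 [1, 0, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(data, full_matrices=False)
k = 2                                                  # keep the k largest singular values
U1, S1, V1 = U[:, :k], np.diag(s[:k]), Vt[:k, :].T     # rows of V1 = users in k-dim space

x_new = np.array([5, 4, 0, 0, 1], dtype=float)         # a new user's ratings over the m items
x_low = x_new @ U1 @ np.linalg.inv(S1)                 # the X' * U1 * inv(S1) projection

sims = [cosine_sim(x_low, V1[j]) for j in range(V1.shape[0])]
most_similar_user = int(np.argmax(sims))
print("most similar existing user:", most_similar_user)
```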

  For the practical meaning of each matrix in the SVD decomposition, see Wu Jun's (Google) book The Beauty of Mathematics (数学之美) (though personally I feel his explanations of the U and V matrices may be swapped; I'm not sure what others think), or the SVD chapter of Machine Learning in Action.

 

 

10. pLSA:

  pLSA evolved from LSA, and early LSA was implemented mainly through SVD decomposition. The pLSA model diagram is:

  (figure: pLSA graphical model)

  The quantities in the formula have the following meaning: d is a document, z a latent topic, and w a word; the joint probability decomposes as

  p(d, w) = p(d) \sum_{z} p(w \mid z)\, p(z \mid d)

  For details see the topic-models lecture of the 2010 Dragon Star Program: Machine Learning.

 

 

11. LDA:

  A topic model; its probabilistic graphical model is:

  (figure: LDA graphical model)

  Unlike pLSA, LDA assumes many prior distributions, and the priors on the parameters are generally assumed to be Dirichlet distributions. The reason is that with conjugate distributions the prior and the posterior have the same functional form.

 

 

  GBDT:

  GBDT (Gradient Boosting Decision Tree), also called MART (Multiple Additive Regression Tree), seems to be used quite a lot inside Alibaba (so it may come up in Alibaba algorithm-position interviews). It is an iterative decision-tree algorithm composed of multiple decision trees; the outputs of all the trees are summed to give the final answer. When it was first proposed it was, together with SVM, regarded as an algorithm with strong generalization ability. In recent years it has drawn further attention because of its use as the machine-learning model behind search ranking.

  GBDT uses regression trees, not classification trees. Its core idea is that each tree learns from the residuals of all the trees before it. To prevent overfitting, boosting is incorporated as well, just as in Adaboost.
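  A minimal sketch of the residual-fitting idea (illustrative only; scikit-learn's DecisionTreeRegressor is used as the base tree and the learning rate is an assumed value; in general GBDT fits the gradient of a chosen loss, which for squared error is exactly the residual):

```python
# pip install scikit-learn
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=50, lr=0.1, max_depth=3):
    """Each tree fits the residual left over by the sum of all previous trees."""
    trees, pred = [], np.zeros(len(y), dtype=float)
    for _ in range(n_trees):
        residual = y - pred                     # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        pred += lr * tree.predict(X)            # shrinkage via the learning rate
        trees.append(tree)
    return trees

def gbdt_predict(X, trees, lr=0.1):
    return lr * sum(tree.predict(X) for tree in trees)
```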

  For an introduction to GBDT see: GBDT(MART) 迭代决策树入门教程 | 简介 (an introductory tutorial on iterative decision trees).

 

 

  Regularization:

  Its purposes are (this was asked in a NetEase phone interview):

  1. Makes the problem numerically easier to solve;

  2. More stable when the number of features is very large;

  3. Controls the complexity and smoothness of the model. The lower the complexity and the smoother the objective function, the stronger its generalization ability; adding the regularization term reduces the complexity of the objective and makes it smoother.

  4. Shrinks the parameter space; the smaller the parameter space, the lower the complexity.

  5. The smaller the coefficients, the simpler the model, and the simpler the model, the better it generalizes (Ng's high-level explanation).

  6. It can be seen as a Gaussian prior on the weights.
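  As a concrete illustration of points 3-6 (an addition, not from the original post), the L2-regularized least-squares objective, where the second term penalizes large weights and λ controls the trade-off:

  J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^{2} + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^{2}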

 

 

  Anomaly Detection:

  One can estimate the density function of the samples and then, for a new sample, directly compute its density; if the density is below some threshold, the sample is flagged as anomalous. The density function is usually a multivariate Gaussian. If a sample has n dimensions, each feature dimension can be treated as following a Gaussian; even if a feature does not look very Gaussian when visualized, it can be mathematically transformed to look more Gaussian, e.g. x = log(x + c), x = x^(1/c), and so on. The anomaly-detection algorithm flow is as follows:

  1. Estimate \mu_j and \sigma_j^{2} for each feature dimension j from the training samples;
  2. For a new sample x, compute p(x) = \prod_{j=1}^{n} p\bigl(x_j; \mu_j, \sigma_j^{2}\bigr);
  3. Flag x as an anomaly if p(x) < \varepsilon.

   The threshold ε is also obtained by cross-validation. In other words, in anomaly detection the learning of p(x) is unsupervised, while the learning of the parameter ε afterwards is supervised. So why not learn everything with an ordinary supervised method (i.e. treat it as an ordinary binary classification problem)? Mainly because in anomaly detection anomalous samples are extremely few while normal samples are very many, so there is not enough data to learn good parameters for a model of anomalous behaviour, and new anomalies may follow patterns completely different from those seen in training.

  Also, the above treats each feature dimension as an independent Gaussian. This approximation is not the best one, but it is computationally cheap, so it is often used. A better approach is to fit a full multivariate Gaussian, which captures the correlations between features, but the computation becomes more complex, and the sample covariance matrix may even be non-invertible (mainly when the number of samples is smaller than the number of features, or when there are linear dependencies among the feature dimensions).
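  A minimal sketch of the per-dimension Gaussian version described above (illustrative only; ε would be chosen by cross-validation as the text says, but here it is just an assumed parameter):

```python
import numpy as np

def fit_gaussians(X):
    """Estimate a mean and variance for every feature dimension independently."""
    X = np.asarray(X, dtype=float)
    return X.mean(axis=0), X.var(axis=0)

def density(x, mu, var):
    """p(x) = product over dimensions of the 1-D Gaussian densities."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    p = coef * np.exp(-((x - mu) ** 2) / (2.0 * var))
    return float(np.prod(p))

def is_anomaly(x, mu, var, eps=1e-3):
    return density(np.asarray(x, dtype=float), mu, var) < eps
```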

  The material above can be found in Ng's course: https://www.coursera.org/course/ml

 

 

  EM Algorithm:

  Sometimes the generation of the samples involves hidden variables (which cannot be observed). Model parameters are usually estimated by maximum likelihood, but because of the hidden variables, simply differentiating the likelihood with respect to the parameters does not yield a solution. In this case the EM algorithm can be used to find the model parameters (there may be several of them). EM generally has 2 steps:

  E-step: pick a set of parameter values and compute the conditional probabilities of the hidden variables under those parameters;

  M-step: using the conditional probabilities of the hidden variables from the E-step, maximize the lower-bound function of the likelihood (essentially an expectation).

  Repeat the two steps above until convergence.

  The formulas are as follows:

  E-step: \; Q_i\bigl(z^{(i)}\bigr) := p\bigl(z^{(i)} \mid x^{(i)}; \theta\bigr)

  M-step: \; \theta := \arg\max_{\theta} \sum_{i} \sum_{z^{(i)}} Q_i\bigl(z^{(i)}\bigr) \log \frac{p\bigl(x^{(i)}, z^{(i)}; \theta\bigr)}{Q_i\bigl(z^{(i)}\bigr)}

  The derivation of the lower-bound function used in the M-step:

  \ell(\theta) = \sum_{i} \log \sum_{z^{(i)}} p\bigl(x^{(i)}, z^{(i)}; \theta\bigr) = \sum_{i} \log \sum_{z^{(i)}} Q_i\bigl(z^{(i)}\bigr) \frac{p\bigl(x^{(i)}, z^{(i)}; \theta\bigr)}{Q_i\bigl(z^{(i)}\bigr)} \ge \sum_{i} \sum_{z^{(i)}} Q_i\bigl(z^{(i)}\bigr) \log \frac{p\bigl(x^{(i)}, z^{(i)}; \theta\bigr)}{Q_i\bigl(z^{(i)}\bigr)}

  where the last step follows from Jensen's inequality, since log is concave.

  A common example of the EM algorithm is the GMM model. Each sample may have been generated by any one of k Gaussians, only with different probabilities, so each sample corresponds to some Gaussian (one of the k); the hidden variable is which Gaussian each sample corresponds to.

  The E-step formula for a GMM (compute the probability of each sample under each Gaussian):

  w_j^{(i)} = Q_i\bigl(z^{(i)} = j\bigr) = p\bigl(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma\bigr)

  More concretely, the calculation is:

  w_j^{(i)} = \frac{\phi_j\, \mathcal{N}\bigl(x^{(i)}; \mu_j, \Sigma_j\bigr)}{\sum_{l=1}^{k} \phi_l\, \mathcal{N}\bigl(x^{(i)}; \mu_l, \Sigma_l\bigr)}

  The M-step formulas (compute each Gaussian's mixing weight, mean, and covariance):

  \phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}, \qquad \mu_j = \frac{\sum_{i=1}^{m} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{m} w_j^{(i)} \bigl(x^{(i)} - \mu_j\bigr)\bigl(x^{(i)} - \mu_j\bigr)^{T}}{\sum_{i=1}^{m} w_j^{(i)}}

  For more on the EM algorithm, see Ng's cs229 course notes or the NetEase Open Courses video: Stanford University's Machine Learning course.
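  A compact sketch of the E and M steps above for a GMM (illustrative only; the initialization and iteration count are assumptions, and scipy's multivariate_normal is used for the Gaussian density):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iters=100, seed=0):
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)                         # mixing weights
    mu = X[rng.choice(m, size=k, replace=False)]      # means initialized from data points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(n) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: responsibilities w[i, j] = p(z = j | x_i)
        dens = np.column_stack([phi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                                for j in range(k)])
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances from the responsibilities
        nk = w.sum(axis=0)
        phi = nk / m
        mu = (w.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (w[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(n)
    return phi, mu, sigma, w
```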

 

 

  Apriori:

  Apriori is one of the earlier methods in association analysis, mainly used to mine frequent itemsets. Its ideas are:

  1. If an itemset is not frequent, then any itemset that contains it is also not frequent;

  2. If an itemset is frequent, then all of its non-empty subsets are also frequent;

  Apriori needs to scan the transaction table several times. It starts by scanning single items and discarding those that are not frequent, giving a set called L. The elements of L are then combined with each other to generate itemsets with one more item than in the previous scan, a set called C; another scan removes the infrequent itemsets, and the process repeats…
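  A compact sketch of this scan-combine-prune loop (illustrative only; the transaction format, a list of item lists, and the toy data in the usage example are assumptions):

```python
from itertools import combinations

def apriori(transactions, min_support=0.5):
    """Return a dict mapping each frequent itemset (frozenset) to its support."""
    transactions = [set(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent single items
    items = {frozenset([x]) for t in transactions for x in t}
    current = {s for s in items if support(s) >= min_support}
    frequent = {s: support(s) for s in current}

    size = 2
    while current:
        # candidate generation: union pairs of frequent itemsets into size-k candidates
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        current = {c for c in candidates if support(c) >= min_support}   # prune by support
        frequent.update({c: support(c) for c in current})
        size += 1
    return frequent

# usage example with a tiny, made-up transaction list
txs = [["milk", "bread"], ["bread", "beer"], ["milk", "bread", "beer"], ["milk"]]
print(apriori(txs, min_support=0.5))
```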

  Look at the following example.

  The item transaction table:

  (table: example transaction data from Machine Learning in Action)

  If the infrequent itemsets are not removed at each step, the tree structure of the scanning process looks like this:

  (figure: the full candidate-itemset lattice)

  At some step infrequent itemsets may appear; removing them (shown shaded) gives:

  (figure: the lattice with the infrequent itemsets shaded out)

  The content above mainly follows the book Machine Learning in Action.

 

 

  FP Growth:

  FP-Growth is a frequent-pattern mining method that is more efficient than Apriori; it only needs to scan the transaction table twice. The first scan obtains the frequency of each individual item, removes the items that do not meet the support requirement, and sorts the remaining items. The second scan builds an FP-Tree (frequent-pattern tree).

  The remaining work is then to mine on the FP-Tree.

  For example, given the following table:

  (table: example transactions)

  Its corresponding FP-Tree is:

  (figure: the constructed FP-Tree)

  Then, starting from the least frequent item P, find P's conditional pattern base, and build P's conditional FP-tree in the same way the original FP-tree was built; on this tree, find the frequent itemsets containing P.

  Work through m, b, a, c, f in turn in the same way, mining the frequent itemsets from each item's conditional pattern base. For some items the recursive mining is more involved, e.g. the m node; for the detailed procedure see the blog post 频繁项集挖掘之二 (FP Growth 算法), which is very detailed.

 

 

  References:

  Harrington, P. (2012). Machine Learning in Action, Manning Publications Co.

      Nearest Neighbor algorithm (Wikipedia)

      Mahalanobis distance (Wikipedia)

  Clustering (Baidu Encyclopedia)

      https://www.coursera.org/course/ml

      SVD在推荐系统中的应用 (SVD applied in recommender systems)

      Wu, Jun (2012). The Beauty of Mathematics (数学之美), Posts & Telecom Press.

      2010 Dragon Star Program: Machine Learning video tutorials

      GBDT (MART) 迭代决策树入门教程 | 简介 (introductory tutorial on iterative decision trees)

      Ng's cs229 course materials

      Stanford University public courses: Machine Learning Course

      频繁项集挖掘之二 (Frequent Pattern Mining, Part 2: the FP Growth algorithm)
