A compilation of machine learning algorithm knowledge points

1 Generative model and discriminative model

Given an input variable x, a generative model estimates the joint probability distribution P(x, y) of the observations and their labels, and derives the estimate of y from it. A discriminative model predicts y by modeling the conditional probability distribution P(y | x), or by computing the value of y directly.

Common discriminative models include Linear Regression, Logistic Regression, Support Vector Machines (SVM), traditional neural networks, Linear Discriminant Analysis, and Conditional Random Fields (CRF); common generative models include Naive Bayes, Hidden Markov Models (HMM), Bayesian Networks, and Latent Dirichlet Allocation (LDA).
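As a minimal sketch of the distinction, assuming scikit-learn is available (the toy data below is made up): GaussianNB reaches P(y | x) through the generative route P(y)P(x | y), while LogisticRegression models P(y | x) directly.

```python
# A minimal sketch contrasting a generative and a discriminative model.
# Both predict y, but GaussianNB goes via P(y) * P(x | y), while
# LogisticRegression models P(y | x) directly. Toy data for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative
from sklearn.linear_model import LogisticRegression   # discriminative

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

gen = GaussianNB().fit(X, y)
dis = LogisticRegression().fit(X, y)

x_new = np.array([[1.5, 1.5]])
print("generative     P(y|x):", gen.predict_proba(x_new))
print("discriminative P(y|x):", dis.predict_proba(x_new))
```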

2 Basic methods of Chinese word segmentation

The basic methods of Chinese word segmentation can be divided into methods based on grammatical rules, methods based on dictionaries, and methods based on statistics.

The basic idea of grammar-rule-based word segmentation is to perform syntactic and semantic analysis while segmenting, using syntactic and semantic information for part-of-speech tagging in order to resolve segmentation ambiguities. Because existing grammatical knowledge and syntactic rules are general and complex, the accuracy achieved by grammar- and rule-based segmentation is far from satisfactory, and such segmentation systems are now rarely used.

Dictionary-based methods can be further divided into the maximum matching method, the maximum probability method, the shortest path method, and so on. The maximum matching method selects several characters from the string in a certain order as a candidate word and looks it up in a dictionary; by scanning direction it can be subdivided into forward maximum matching, reverse maximum matching, bidirectional maximum matching, and minimum segmentation. The maximum probability method observes that a Chinese string to be segmented may have multiple segmentation results, and takes the one with the highest probability as the result. The shortest path method selects the path with the fewest words on the word graph.
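A minimal sketch of forward maximum matching follows; the dictionary and the maximum word length here are made up for illustration.

```python
# Forward maximum matching: scan left to right, at each position trying
# the longest dictionary word first. Toy dictionary for illustration only.
DICT = {"研究", "研究生", "生命", "命", "的", "起源"}
MAX_LEN = 3  # longest word length in the dictionary

def forward_max_match(text):
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in DICT:
                words.append(piece)
                i += size
                break
    return words

print(forward_max_match("研究生命的起源"))
# Forward matching greedily takes "研究生": ['研究生', '命', '的', '起源'];
# reverse matching would instead prefer ['研究', '生命', '的', '起源'],
# which is why bidirectional matching compares both.
```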

The basic principle of statistics-based word segmentation is to decide whether a string constitutes a word from the statistical frequency of that string in a corpus. A word is a combination of characters: the more often adjacent characters co-occur, the more likely they are to form a word. The frequency or probability of co-occurrence of adjacent characters therefore reflects the credibility of their forming a word. Commonly used models are the HMM (Hidden Markov Model), MAXENT (Maximum Entropy Model), MEMM (Maximum Entropy Markov Model), and CRF (Conditional Random Field).
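As a rough illustration of the statistical idea only (the corpus here is made up, and real systems train HMM/CRF models on large corpora rather than using raw counts), one can count how often adjacent character pairs co-occur:

```python
# Counting adjacent-character co-occurrence frequencies in a toy corpus.
# Pairs that co-occur often are more credible as words.
from collections import Counter

corpus = ["机器学习很有趣", "机器翻译依赖学习", "深度学习是机器学习的分支"]
pair_counts = Counter()
for sent in corpus:
    for a, b in zip(sent, sent[1:]):
        pair_counts[a + b] += 1

for pair, n in pair_counts.most_common(5):
    print(pair, n)  # "学习" and "机器" should rank near the top
```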


3 Comparative analysis of CRF model, HMM model and MEMM model

Reference: https://www.cnblogs.com/hellochennan/p/6624509.html

In brief: HMM is a generative model of the joint probability of states and observations; MEMM is discriminative but normalizes per state, so it suffers from the label bias problem; CRF is discriminative with global normalization over the whole sequence, which avoids label bias at the cost of more expensive training.


4 Viterbi algorithm
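The Viterbi algorithm uses dynamic programming to find the most probable hidden-state sequence for a given observation sequence and HMM parameters (the decoding problem; see also section 9). Below is a minimal sketch; the two-state model and all probabilities are made up for illustration.

```python
# Viterbi decoding for a toy 2-state HMM; all probabilities are made up.
# delta[t, j] holds the best path probability ending in state j at step t;
# psi holds backpointers for recovering the path.
import numpy as np

states = ["Rainy", "Sunny"]
start = np.array([0.6, 0.4])                  # initial state probabilities
trans = np.array([[0.7, 0.3], [0.4, 0.6]])    # trans[i, j] = P(s_j | s_i)
emit = np.array([[0.1, 0.4, 0.5],             # P(observation | Rainy)
                 [0.6, 0.3, 0.1]])            # P(observation | Sunny)
obs = [0, 1, 2]                               # observation indices

def viterbi(obs, start, trans, emit):
    n, T = trans.shape[0], len(obs)
    delta = np.zeros((T, n))
    psi = np.zeros((T, n), dtype=int)
    delta[0] = start * emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans   # n x n candidate scores
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(obs, start, trans, emit))  # most probable hidden state sequence
```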


5 ID3 algorithm

The core idea of the ID3 algorithm is to use information gain as the attribute-selection measure: at each split, choose the attribute whose split yields the largest information gain. A limitation of ID3 is that its attributes can only take discrete values; to make the decision tree applicable to continuous attribute values, C4.5, an extension of ID3, can be used. The decision tree generated by ID3 is a multi-way tree, and the number of branches equals the number of distinct values the split attribute has.
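A minimal sketch of the information-gain computation ID3 relies on; the tiny dataset is made up for illustration.

```python
# Information gain for one discrete attribute, as used by ID3:
# Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v). Toy data made up.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    base = entropy(labels)
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # outlook splits perfectly: 1.0
print(information_gain(rows, labels, 1))  # temperature is uninformative: 0.0
```

ID3 would split on the first attribute here, creating one branch per distinct attribute value, which is exactly why the resulting tree is multi-way.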

6 Overfitting problems

The main reasons for overfitting in machine learning are: (1) using an overly complex model; (2) noisy data; and (3) too little training data.
The corresponding ways to reduce overfitting are: (1) simplify the model assumptions, or use penalty terms to limit model complexity; (2) clean the data to reduce noise; and (3) collect more training data.


7 Calculating the conditional entropy H(Y | X)

There are two calculation formulas for conditional entropy:

H(Y | X) = Σ_x p(x) H(Y | X = x) = − Σ_x Σ_y p(x, y) log p(y | x)

H(Y | X) = H(X, Y) − H(X)

Choose the appropriate formula according to the probabilities given in the problem: the first is convenient when the conditional distributions p(y | x) are given, the second when the joint distribution p(x, y) is given.
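A small numeric sketch (the joint distribution table is made up) verifying that both formulas agree:

```python
# Conditional entropy H(Y|X) computed two ways from a toy joint table
# p[x, y]; both formulas give the same answer.
import numpy as np

p = np.array([[0.25, 0.25],   # rows: values of X
              [0.40, 0.10]])  # cols: values of Y

px = p.sum(axis=1)             # marginal P(X)
p_y_given_x = p / px[:, None]  # conditional P(Y | X)

# Formula 1: H(Y|X) = -sum_{x,y} p(x,y) log p(y|x)
h1 = -np.sum(p * np.log2(p_y_given_x))

# Formula 2: H(Y|X) = H(X,Y) - H(X)
h_xy = -np.sum(p * np.log2(p))
h_x = -np.sum(px * np.log2(px))
h2 = h_xy - h_x

print(h1, h2)  # identical up to floating point
```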


8 Fisher linear discriminant function

http://blog.csdn.net/yujianmin1990/article/details/48007589

Fisher's linear discriminant projects feature vectors from a multi-dimensional space onto a straight line, i.e., compresses the dimensionality to one. The criterion for finding this optimal line is the Fisher criterion: after projection onto one dimension, the two classes of samples should be as compact as possible within each class and as far apart as possible between classes; that is, the difference between the projected class means should be as large as possible while the within-class variances are as small as possible. Generally speaking, when the data distribution is close to Gaussian, the Fisher linear criterion achieves a good classification result.
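A minimal sketch of computing the Fisher direction on made-up Gaussian data; for two classes, the optimal projection is w ∝ Sw⁻¹(m1 − m2), where Sw is the within-class scatter matrix.

```python
# Fisher's linear discriminant for two classes: project onto
# w = Sw^{-1} (m1 - m2), which maximizes between-class separation
# relative to within-class scatter. Toy Gaussian data for illustration.
import numpy as np

rng = np.random.RandomState(0)
X1 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 100)
X2 = rng.multivariate_normal([3, 2], [[1, 0.5], [0.5, 1]], 100)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
w = np.linalg.solve(Sw, m1 - m2)                        # Fisher direction

z1, z2 = X1 @ w, X2 @ w   # 1-D projections of the two classes
print("projected means:", z1.mean(), z2.mean())
print("projected stds: ", z1.std(), z2.std())
```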

9 HMM parameter estimation methods

EM algorithm: learns the model parameters when only the observation sequence is available and the state sequence is hidden; this is the Baum-Welch algorithm.
Viterbi algorithm: uses dynamic programming to solve the HMM prediction (decoding) problem; it is not a parameter estimation method.
Forward-backward algorithm: used to compute the probability of an observation sequence, not to estimate parameters.
Maximum likelihood estimation: the supervised learning algorithm for estimating the parameters when both the observation sequence and the corresponding state sequence are available.
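A minimal sketch of the supervised case: with labeled state sequences, the transition and emission probabilities are just normalized counts. The toy sequences below are made up, and a real estimator would add smoothing for unseen events.

```python
# Supervised MLE for HMM parameters: count transitions and emissions in
# labeled (state, observation) sequences, then normalize. Toy data.
from collections import Counter, defaultdict

sequences = [  # each sequence is a list of (state, observation) pairs
    [("R", "umbrella"), ("R", "umbrella"), ("S", "walk")],
    [("S", "walk"), ("S", "shop"), ("R", "umbrella")],
]

trans, emit = defaultdict(Counter), defaultdict(Counter)
for seq in sequences:
    for (s, o) in seq:
        emit[s][o] += 1
    for (s, _), (s2, _) in zip(seq, seq[1:]):
        trans[s][s2] += 1

def normalize(table):
    return {s: {k: v / sum(c.values()) for k, v in c.items()}
            for s, c in table.items()}

print("A =", normalize(trans))  # estimated transition probabilities
print("B =", normalize(emit))   # estimated emission probabilities
```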


10 Understanding of Naive Bayes Classifier
The premise of Naive Bayes is that the variables are mutually independent. In the Bayesian framework this is the important conditional independence assumption: all features are assumed to be independent of one another given the class, so that the joint probability can be factorized.
In addition, if a highly correlated feature is effectively introduced into the model twice, its importance is inflated, and performance drops because the data contains highly correlated features. The correct approach is to evaluate the correlation matrix of the features and remove those that are highly correlated.
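A small sketch of what the factorization buys: P(y | x) ∝ P(y) Π_i P(x_i | y). The probability tables below are made up, and a real classifier would also apply smoothing.

```python
# Naive Bayes scoring: P(y | x) ∝ P(y) * Π_i P(x_i | y), valid only
# because features are assumed conditionally independent given y.
# Toy conditional probability tables for illustration.
priors = {"spam": 0.4, "ham": 0.6}
cond = {  # P(feature present | class)
    "spam": {"free": 0.7, "meeting": 0.1},
    "ham":  {"free": 0.1, "meeting": 0.5},
}

def score(features, label):
    p = priors[label]
    for f in features:
        p *= cond[label][f]   # the independence assumption at work
    return p

x = ["free", "meeting"]
scores = {y: score(x, y) for y in priors}
total = sum(scores.values())
print({y: s / total for y, s in scores.items()})  # normalized posteriors
```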

11 Mahalanobis distance application

The Mahalanobis distance is based on the chi-square distribution and is a statistical method for detecting multivariate outliers.

If the covariance matrix is the identity matrix (the components are independent and identically distributed with unit variance), the Mahalanobis distance reduces to the Euclidean distance.
If the covariance matrix is diagonal, the formula becomes the standardized Euclidean distance.

Advantages of the Mahalanobis distance: it is independent of the measurement scale (dimensionless), and it removes the interference of correlations between the variables.
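A minimal numpy sketch on made-up data, tying the distance back to the chi-square view above:

```python
# Mahalanobis distance of points from the sample mean. With an identity
# covariance it reduces to Euclidean distance; with a diagonal covariance,
# to the standardized Euclidean distance. Toy data for illustration.
import numpy as np

rng = np.random.RandomState(0)
X = rng.multivariate_normal([0, 0], [[2, 1], [1, 2]], 200)

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

dists = np.array([mahalanobis(x) for x in X])
# Under multivariate normality, squared distances follow chi-square(df=2),
# so a chi-square quantile can be used to flag multivariate outliers.
print("suspected outliers:", np.sum(dists**2 > 9.21))  # chi2(2) 99% ≈ 9.21
```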

12 The difference between "bootstrap" and "boosting"

Bootstrap is a resampling technique: it draws training sets of the same size from the original data by sampling with replacement, and it underlies bagging methods such as random forests. Boosting is an ensemble learning strategy: weak learners are trained sequentially, with each new learner focusing on the examples its predecessors handled poorly (e.g., by reweighting them, as in AdaBoost), and the learners are then combined into a strong one.

13 Understanding overfitting / high variance and underfitting / high bias

Overfitting occurs because the trained model is too complex: the error on the training set is small, but the generalization ability is weak. Common remedies are:

Collect more training data; simplify the features; increase the coefficient lambda of the regularization term.

Underfitting means the model has not sufficiently learned the information in the data: the error is large on both the training set and the test set. Common remedies are:

Add features; add polynomial features; decrease the coefficient of the regularization term.
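A small sketch of both failure modes, under toy assumptions (the data, polynomial degrees, and penalty values are all made up):

```python
# Under- vs over-fitting on noisy 1-D data: a degree-1 model underfits,
# a high-degree model with a tiny penalty overfits, and a larger ridge
# penalty (the coefficient lambda, called alpha here) tames it.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30))[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, 30)

for degree, alpha in [(1, 1e-8), (15, 1e-8), (15, 1e-2)]:
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
    model.fit(X, y)
    print(degree, alpha, "train R^2 =", round(model.score(X, y), 3))
```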

14 Understanding of several kernel functions commonly used in SVM

SVM kernel functions include the linear kernel, polynomial kernel, radial basis function (RBF) kernel, Gaussian kernel, power exponential kernel, Laplacian kernel, ANOVA kernel, rational quadratic kernel, multiquadric kernel, inverse multiquadric kernel, and sigmoid kernel.

Defining a kernel function is not difficult: by the relevant theory of functional analysis, any function K(x_i, x_j) that satisfies Mercer's condition corresponds to the inner product of some transformed space. Important progress has also been made on judging which functions are kernel functions, yielding Mercer's theorem and the following commonly used kernel types:

(1) Linear kernel: K(x, x_i) = x · x_i

(2) Polynomial kernel: K(x, x_i) = ((x · x_i) + 1)^d

(3) Radial basis (RBF) kernel: K(x, x_i) = exp(−‖x − x_i‖² / σ²). The Gaussian radial basis kernel is a strongly local kernel whose extrapolation ability weakens as the parameter σ increases; polynomial kernels, by contrast, have good global properties but poor locality.

(4) Fourier kernel: K(x, x_i) = (1 − q²) / (2 (1 − 2q cos(x − x_i) + q²))

(5) Spline kernel: K(x, x_i) = B_{2n+1}(x − x_i)

(6) Sigmoid kernel: K(x, x_i) = tanh(κ (x · x_i) − δ). When the sigmoid function is used as the kernel, the support vector machine implements a kind of multilayer perceptron: the number of hidden nodes (which determines the network structure) and the weights from the input nodes to the hidden layer are determined automatically during design (training). Moreover, the theoretical foundation of SVM guarantees that it finds the global optimum rather than a local minimum, which also ensures good generalization to unknown samples without overfitting.

Choosing a kernel. When selecting a kernel for a practical problem, the usual approaches are: first, pre-select a kernel using expert prior knowledge; second, use cross-validation, i.e., try different kernels and take the one with the smallest generalization error as the best (for example, for function regression problems in signal processing, simulation experiments have shown that under the same data conditions an SVM with a Fourier kernel can have much smaller error than one with an RBF kernel); third, use the hybrid kernel method proposed by Smits et al., currently the mainstream way to choose kernels and another pioneering piece of work on kernel construction. Its basic idea is that combining different kernel functions yields better properties.
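A small sketch comparing some of these kernels via scikit-learn on a made-up nonlinear problem; cross-validation here plays the role of selection method two above.

```python
# Comparing SVM kernels by cross-validation on a toy nonlinear problem;
# the "best" kernel is simply the one with the smallest validation error.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale")
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{kernel:8s} mean accuracy = {scores.mean():.3f}")
```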

15 Applicable scenarios for the KNN algorithm

KNN works well when the samples are relatively few but typical (representative of their classes).

16 Understanding of random forest parameters

Increasing the depth of the trees may lead to overfitting; using too few trees may lead to underfitting.
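A small sketch (made-up data) varying the two parameters discussed above:

```python
# Effect of max_depth and n_estimators on a random forest (toy data):
# deeper trees fit the data more closely; more trees stabilize the forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

for n_estimators, max_depth in [(5, 2), (5, None), (200, 2), (200, None)]:
    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 max_depth=max_depth, random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(n_estimators, max_depth, round(score, 3))
```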

17 Understanding of time series models

The AR (autoregressive) model is a form of linear prediction: given N data points, the model can infer the data before or after point N (say, P points), so it is essentially similar to interpolation.
The MA (moving average) model builds a linear-trend forecasting model using the trend moving average method.
The ARMA (autoregressive moving average) model is one of the model-parameter methods for high-resolution spectral analysis, and is the classic method for studying the rational spectra of stationary random processes. Compared with the AR or MA model alone, it gives more accurate spectral estimates and better spectral resolution, but its parameter estimation is more laborious.
The GARCH model, the generalized ARCH model, is an extension of the ARCH model developed by Bollerslev (1986). A GARCH(p, 0) model is equivalent to an ARCH(p) model. GARCH is a regression model tailored to financial data: beyond what an ordinary regression model does, it further models the variance of the errors. It is particularly suited to analyzing and forecasting volatility, and such analysis can play a very important guiding role in investors' decisions, often more significant than the analysis and prediction of the values themselves.
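A minimal sketch of fitting an ARMA model with statsmodels on a simulated series, assuming a recent statsmodels where ARMA(p, q) is fit as ARIMA with d = 0 (the process coefficients below are made up):

```python
# Fitting an ARMA(1, 1) model to a simulated series; with statsmodels,
# ARMA(p, q) is expressed as ARIMA(order=(p, 0, q)).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import ArmaProcess

# Simulate an ARMA(1, 1) process: (1 - 0.6 L) y_t = (1 + 0.3 L) e_t
ar, ma = np.array([1, -0.6]), np.array([1, 0.3])
y = ArmaProcess(ar, ma).generate_sample(nsample=500)

model = ARIMA(y, order=(1, 0, 1)).fit()
print(model.params)             # estimated AR and MA coefficients
print(model.forecast(steps=5))  # 5-step-ahead forecast
```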
The questions and analysis in this section are from @刘炫320; link: http://blog.csdn.net/column/details/16442.html


