Machine Learning topics: Overview

First, commonly used algorithms

(A) k-nearest neighbor (kNN):

The kNN algorithm is a well-known statistical pattern-recognition method. It is one of the best text-categorization algorithms, occupies an important place among machine-learning classification algorithms, and is one of the simplest machine-learning algorithms.

Idea: the formal explanation is that, for a given test sample, we find the k training samples closest to it in the training set according to some distance measure, and then make a prediction based on the information from these k "neighbors". Put simply: compute the distance between the query point and every point in the sample space, take the k nearest points, and count which category makes up the largest share of those k points (for regression, use the average of their values); the query point is then assigned to that category.

Three basic elements: the choice of k, the distance measure, and the classification decision rule.

Algorithm steps:

  1. Compute the distances: for the given test object, compute its distance to every object in the training set;

  2. Find the neighbors: take the k nearest training objects as the test object's neighbors;

  3. Classify: assign the test object to the majority category among its k neighbors.

Distance calculation (similarity measure): Euclidean distance or Manhattan distance.

Deciding the category:

Voting: majority rule; the test point is assigned to whichever category appears most often among the neighboring points.

Weighted voting: the neighbors' votes are weighted by distance, the closer the neighbor the greater the weight (for example, the weight can be the inverse of the squared distance).
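A minimal sketch of the procedure above (the three steps, Euclidean distance, and distance-weighted voting), written in Python with NumPy; the toy data and the function name knn_predict are illustrative, not taken from the original post.

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x, k=3, weighted=True):
    # Step 1: Euclidean distance from the query point to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # Step 2: indices of the k nearest neighbors
    idx = np.argsort(dists)[:k]
    # Step 3: (weighted) vote among the neighbors' categories
    votes = defaultdict(float)
    for i in idx:
        w = 1.0 / (dists[i] ** 2 + 1e-9) if weighted else 1.0   # inverse squared distance
        votes[y_train[i]] += w
    return max(votes, key=votes.get)

# Toy usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 1.2]])
y_train = np.array([0, 1, 0, 1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.2])))   # -> 0
```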

Advantages:

  1. Simple, easy to understand, and easy to implement; no parameters to estimate and no training phase;

  2. Suitable for classifying rare events;

  3. Particularly suitable for multi-class problems (multi-modal data, where an object has multiple category labels), for which kNN performs better than SVM.

Disadvantages:

       1. For classes with small sample sizes, the algorithm is relatively prone to misclassification.

The main drawback of this classification algorithm is that when the samples are imbalanced, for example one class has a very large sample size while the other classes are very small, the k-neighborhood of a newly entered sample is likely to be dominated by samples of the bulky class.

The algorithm considers only the "nearest" neighbor samples: if one class has a huge number of samples, those samples may or may not actually be close to the target sample, but either way it is their sheer quantity that sways the vote, whereas quantity alone should not decide the result.

       2. Another weakness of this method is its computational cost: for each sample to be classified, its distance to all known samples must be computed in order to find its k nearest neighbors.

       3. Poor interpretability: it cannot produce explicit rules the way a decision tree can.

Common issues:

1. Setting the value of k

If the chosen k is too small, the number of neighbors is too small, which lowers classification accuracy and amplifies the interference of noisy data; if the chosen k is too large and the class that the sample to be classified actually belongs to has few examples in the training set, then dissimilar data will also be included among the k neighbors, increasing the effect of noise and lowering accuracy. How to select an appropriate value of k has become a research topic for kNN. The value of k is usually determined by cross-validation (with k = 1 as the baseline).

A rule of thumb: k is generally lower than the square root of the number of training samples.
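One possible illustration of the cross-validation approach just described (not from the original post): the sketch below assumes scikit-learn and uses the iris data set purely as an example, with the candidate range following the square-root rule of thumb.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
ks = range(1, int(np.sqrt(len(X))) + 1)   # rule of thumb: stay below sqrt(n_samples)
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean() for k in ks]
best_k = ks[int(np.argmax(scores))]       # k with the best cross-validated accuracy
print(best_k)
```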

2. How to decide the category

Plain voting does not take the distances to the neighbors into account, yet closer neighbors should perhaps have more say in the final classification, so the weighted voting method is more appropriate.

3. Choice of distance measure

Effect of high dimensionality on the distance measure: it is well known that as the number of variables grows, the discriminating power of the Euclidean distance gets worse.

Effect of variable ranges on the distance: variables with larger ranges often play a dominant role in the distance calculation, so the variables should first be standardized.
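A small sketch of the standardization point above; the two features and their values are made up for illustration.

```python
import numpy as np

X = np.array([[170.0, 30000.0],    # e.g. height (cm) and income: very different ranges
              [160.0, 31000.0],
              [180.0, 29000.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score: each column now has mean 0, std 1
print(np.linalg.norm(X[0] - X[1]))             # raw distance is dominated by income
print(np.linalg.norm(X_std[0] - X_std[1]))     # standardized distance weighs both features
```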

4. Principles for selecting training samples

Researchers have studied how to select training samples with the aim of reducing computation; these algorithms can be broadly divided into two categories. The first reduces the size of the training set. The kNN algorithm stores all the sample data, which contains a great deal of redundancy; this redundancy increases both storage overhead and computational cost. Methods for reducing the training set include: deleting from the original set those samples that matter little for classification and keeping the remaining samples as the new training set; selecting a number of representative samples from the original training set as the new training set; or clustering the original samples and using the generated cluster centers as the new training samples.

The second is weighting. In the training set, some samples may be more trustworthy than others. Different weights can be applied to different samples, strengthening the influence of reliable samples and reducing the impact of unreliable ones.

5. Performance issues

kNN is a lazy-learning algorithm, and the consequence of laziness is that building the model is very simple, while classifying a test sample carries a large system overhead, because all the training samples must be scanned and distances computed. There are already some ways to improve computational efficiency, such as compressing the training set.

References:

The difference between KNN and SVM in artificial intelligence - CSDN blog

K-nearest neighbor (KNN) - Chuan Shan Jia - cnblogs


(B) Random forest:

Random forest is an algorithm that integrates multiple trees through the idea of ensemble learning; its basic unit is the decision tree, and in essence it belongs to a major branch of machine learning, ensemble learning. The name "random forest" contains two key words: one is "random", the other is "forest".

"Forest" is easy to understand: one of them is called a tree, and hundreds or thousands of trees together can be called a forest. The analogy is apt, and in fact this is the main idea of random forests: it embodies the idea of ensembling. From an intuitive perspective, each decision tree is a classifier (assuming a classification problem for now), so for an input sample, N trees produce N classification results. The random forest integrates all of the trees' voting results and designates the category with the most votes as the final output; this is the simplest form of Bagging.

"Random" refers to two sources of randomness: one is bootstrap sampling (random sampling with replacement), the other is random feature selection, described in the generation steps below. If the samples were not drawn at random, every tree's training set would be identical, and the classification results of the trained trees would be exactly the same, so there would be no need for bagging. If the sampling were done without replacement, each tree's training samples would be different and disjoint, so every tree would be "biased" and absolutely "one-sided" (of course, this may be putting it too strongly); that is, each tree would be trained to be very different from the others. The introduction of these two kinds of randomness is critical to the classification performance of random forests. Thanks to them, random forests are not prone to overfitting and have good noise immunity (for example, insensitivity to missing values).

Generating a random forest

1) If the training set has size N, then for each tree, N training samples are drawn at random with replacement from the training set (this sampling method is called a bootstrap sample) and used as that tree's training set;

2) If the feature dimension of each sample is M, specify a constant m << M; at each split of each tree, randomly select a subset of m features out of the M features, and choose the optimal feature from among those m;

3) Grow each tree to the greatest extent possible, with no pruning.
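A hedged sketch of the three steps above, written with NumPy and scikit-learn decision trees; setting m to the square root of M and using 100 trees are common defaults assumed here, not prescriptions from the original post.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, random_state=0):
    rng = np.random.default_rng(random_state)
    N, M = X.shape
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)        # step 1: bootstrap sample of size N
        tree = DecisionTreeClassifier(
            max_features="sqrt",                # step 2: m randomly chosen features per split
            max_depth=None,                     # step 3: grow fully, no pruning
        ).fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def forest_predict(forest, X):
    votes = np.stack([t.predict(X) for t in forest])   # each tree votes on every sample
    majority = lambda v: np.bincount(v.astype(int)).argmax()
    return np.apply_along_axis(majority, 0, votes)     # category with the most votes wins
```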

The classification result (error rate) of a random forest depends on two factors:

The correlation between any two trees in the forest: the greater the correlation, the higher the error rate;

The classification ability of each tree in the forest: the stronger each tree's classification ability, the lower the error rate of the whole forest.

  Reducing the number m of selected features reduces both the correlation between trees and the classification ability of each tree; increasing m increases both. So the key question is how to choose the optimal m (or a range for it), and this is the only essential parameter of a random forest.
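One possible way to search for m is to compare out-of-bag (OOB) error over several candidate values. The sketch below assumes scikit-learn; the candidate values and the number of trees are illustrative choices, not from the original post.

```python
from sklearn.ensemble import RandomForestClassifier

def pick_m(X, y, candidates=("sqrt", "log2", 0.3, 0.6)):
    best_m, best_oob = None, -1.0
    for m in candidates:
        rf = RandomForestClassifier(
            n_estimators=200, max_features=m, oob_score=True, random_state=0
        ).fit(X, y)
        if rf.oob_score_ > best_oob:            # higher OOB accuracy = lower OOB error
            best_m, best_oob = m, rf.oob_score_
    return best_m, best_oob
```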

References:

[Machine Learning] Random Forest (Random Forest) - CSDN blog

Random Forest pros and cons - keepreder - CSDN blog


(C) SVM:

SVM can be seen as logistic regression strengthened: by imposing stricter optimization conditions on logistic regression, the support vector machine algorithm can obtain a better classification boundary than logistic regression. Without some kind of kernel trick, however, the support vector machine algorithm is at most a better linear classification technique.

However, combined with a Gaussian "kernel", the support vector machine can express very complex classification boundaries and thus achieve good classification results. A "kernel" is in fact a special function whose most typical feature is that it can map a low-dimensional space into a high-dimensional one.

SVM uses a nonlinear mapping p to map the sample space into a high-dimensional or even infinite-dimensional feature space (a Hilbert space), so that a problem that is nonlinearly separable in the original sample space becomes linearly separable in the feature space. In short, the mapping lifts the samples into a high-dimensional space. In general, raising the dimension increases computational complexity and may even cause the "curse of dimensionality", so people rarely do it willingly. But for classification and regression problems, a sample set that cannot be processed linearly in a low-dimensional sample space can very likely be linearly divided (or regressed) by a linear hyperplane in a high-dimensional feature space. Raising the dimension would normally bring extra computational complexity, and the SVM method solves this cleverly: by applying the kernel-function expansion theorem, one does not need to know the explicit expression of the nonlinear mapping; and because the linear learning machine is built in the high-dimensional feature space, compared with a linear model it hardly increases the computational complexity and, in a certain sense, it also avoids the "curse of dimensionality". All of this is thanks to the theory and computation of kernel expansion.

Selecting different kernel functions produces different SVMs; four common kernel functions are:

- Linear kernel: K(x, y) = x · y

- Polynomial kernel: K(x, y) = [(x · y) + 1]^d

- Radial basis (Gaussian) kernel: K(x, y) = exp(-|x - y|^2 / d^2)

- Two-layer neural network (sigmoid) kernel: K(x, y) = tanh(a(x · y) + b)
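A minimal sketch of the four kernels above, written with NumPy; the parameter names d, a, b follow the formulas, and the default values are illustrative.

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, d=3):
    return (np.dot(x, y) + 1) ** d

def rbf_kernel(x, y, d=1.0):
    return np.exp(-np.sum((x - y) ** 2) / d ** 2)

def sigmoid_kernel(x, y, a=1.0, b=0.0):
    return np.tanh(a * np.dot(x, y) + b)
```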

How would we draw a circular classification boundary in a two-dimensional plane? In the two-dimensional plane this can be difficult, but with a "kernel" the two-dimensional space can be mapped into a three-dimensional space, where a linear plane achieves a similar effect. That is, a nonlinear classification boundary in the two-dimensional plane can be equivalent to a linear classifier in three-dimensional space. Thus we can achieve the effect of a nonlinear division of the two-dimensional plane by performing a simple linear division in the three-dimensional space.
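A hedged illustration of the circular-boundary example above, assuming scikit-learn: an RBF-kernel SVM separates two concentric circles that no straight line in two dimensions can; the data set and parameters are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)   # struggles: the true boundary is a circle
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)         # near perfect after the implicit mapping
print(linear_acc, rbf_acc)
```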

SVM is a machine-learning algorithm with a very strong mathematical component (by comparison, neural networks have a biological-science flavor). In a key step of the algorithm there is a proof that mapping the data from a low-dimensional into a high-dimensional space does not increase the complexity of the final computation. Thus the support vector machine algorithm can not only keep its computational efficiency but also obtain very good classification results. For this reason, support vector machines held the dominant, core position in machine learning from the late 1990s, replacing neural network algorithms, until the recent re-emergence of neural networks in the form of deep learning caused a subtle shift in the balance between the two.


(D) neural network

Neural networks were among the earliest methods for nonlinear classification. An Artificial Neural Network (ANN), neural network for short, is a mathematical or computational model that mimics the structure and function of biological neural networks. A neural network performs computation through a large number of interconnected artificial neurons. In most cases an artificial neural network can change its internal structure on the basis of external information; it is an adaptive system. Modern neural networks are nonlinear statistical data-modeling tools, often used to model complex relationships between inputs and outputs or to discover patterns in data.

Overall, k-nearest neighbor and decision trees are naturally nonlinear classifiers: they perform classification within the feature space you give them. NN and SVM, although linear classifiers at heart (the perceptron model is therefore the ancestor of both), very cleverly build a new feature space of their own in which the data sets become linearly separable, and so in the end they present the effect of nonlinear classification. The two do this differently: the SVM projects the data into a high-dimensional space and uses a hyperplane to divide the data linearly, while the NN uses activation functions (sigmoid, tanh, softmax, etc.) to project the data onto something like a surface of the feature space (this space is decided by the hidden layers).
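A minimal sketch of the activation functions named above, written with NumPy (not from the original post).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()
```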

- SVM has a more solid theoretical foundation than NN; it is more like a rigorous "science" (three elements: posing the problem, solving the problem, proving the solution).

- SVM: rigorous mathematical reasoning.

- ANN: strongly dependent on engineering skill.

- Generalization ability depends on the "empirical risk" and the "confidence interval"; ANN cannot control either of the two.

- ANN designers use superb engineering skill to make up for the deficiencies of the mathematics: specially designed structures and heuristic algorithms can sometimes give unexpectedly good results.

As Feynman pointed out, "We must make clear from the outset that if something is not a science, it is not necessarily bad. For example, love is not a science. So, if we say something is not a science, it does not mean there is something wrong with it, but only that it is not a science." Compared with SVM, ANN is less like a science and more like an engineering skill, but that does not mean it is necessarily bad.

References:

"Machine Learning" reading notes | The connection between neural networks and support vector machines - amazingmango - CSDN blog

Neural network algorithms in the R language - cnblogs


(E) Naive Bayes

The naive Bayesian classifier is a generative learning algorithm. For a sample to be classified, it determines, given that each of the sample's features appears, the probability that the sample belongs to each category, P(Yi | X), and the sample is assigned to whichever category has the largest probability.
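A minimal sketch of the decision rule above for discrete features; the class structure, the Laplace smoothing, and the fallback for unseen feature values are standard choices assumed here, not details from the original post.

```python
import numpy as np
from collections import defaultdict

class SimpleNaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = {c: np.mean(y == c) for c in self.classes}        # P(Y)
        self.cond = defaultdict(dict)                                  # P(Xj = v | Y), Laplace-smoothed
        for c in self.classes:
            Xc = X[y == c]
            for j in range(X.shape[1]):
                n_vals = len(np.unique(X[:, j]))
                vals, counts = np.unique(Xc[:, j], return_counts=True)
                for v, n in zip(vals, counts):
                    self.cond[(j, c)][v] = (n + 1) / (len(Xc) + n_vals)
        return self

    def predict_one(self, x):
        scores = {}
        for c in self.classes:
            log_p = np.log(self.prior[c])
            for j, v in enumerate(x):
                log_p += np.log(self.cond[(j, c)].get(v, 1e-9))        # tiny prob. for unseen values
            scores[c] = log_p
        return max(scores, key=scores.get)                             # argmax of P(Yi | X)
```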

References:

Naive Bayes classifier in R - cnblogs


Second, summary

References:

A summary of commonly used classification algorithms: decision trees, Bayes, artificial neural networks, k-nearest neighbor, support vector machines - seu_yang - CSDN blog

A very complete review of machine learning (with many figures) - Jianshu

A comparison of the major machine-learning classification algorithms - upstreamL - cnblogs

A literature review of artificial intelligence and machine learning - a1742326479 - CSDN blog

The differences among classification, regression, clustering, and dimensionality reduction - kiss__soul - CSDN blog


Third, basic concepts

1. Artificial intelligence:

Machine learning: machine learning is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and many other subjects. It specializes in how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills, and reorganize existing knowledge structures so as to continuously improve their own performance. It is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications run through every area and branch of artificial intelligence. A frequently cited English definition is: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Deep learning: deep learning algorithms are a development of artificial neural networks.


[Figure: artificial intelligence, machine learning, deep learning]

2. Classification by the presence or absence of human-provided labels

Supervised learning: supervised learning refers to the process of using a set of samples of known category to adjust a classifier's parameters until it reaches the required performance; it is also called supervised training or learning with a teacher. During supervised learning, indications of right and wrong are provided; through repeated training, a pattern or rule is found in the training data set, so that when new data arrives the result can be predicted from this function. Supervised learning requires a training set that includes both inputs and outputs, and it is mainly used for classification and prediction. Common supervised learning algorithms include regression analysis and statistical classification.

Unsupervised learning: unsupervised learning uses data sets that do not need to be labeled, that is, there is no output. It must discover some implicit structure in the data set itself, thereby obtaining the structural features of the sample data and determining which data are most similar to each other. Therefore, the goal of unsupervised learning is not to tell the computer what to do, but to let it learn how to do it on its own. Typical unsupervised learning algorithms include autoencoders, restricted Boltzmann machines, and deep belief networks; typical applications are clustering and anomaly detection.

Semi-supervised learning: semi-supervised learning is a combination of supervised and unsupervised learning; during training it uses both unlabeled and labeled data, so it must not only learn the structural relationships between attributes but also output a classification model for prediction. Compared with a model that uses fully labeled data, a model trained on such a training set can be more accurate while being cheaper to train, and it is also more common in practical applications.
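A hedged, minimal contrast of the first two settings above using scikit-learn; the toy data and the model choices (logistic regression for the supervised case, k-means for the unsupervised case) are illustrative assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=2, random_state=0)

clf = LogisticRegression().fit(X, y)                                         # supervised: labels y are provided
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)    # unsupervised: only X is used
```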


[Figure: supervised learning algorithms vs unsupervised learning algorithms]

3. Bias and variance

Bias: describes the gap between the average of the algorithm's predictions and the true value (the algorithm's fitting ability). Low bias corresponds to a more complex model, but an overly complex model easily overfits; high bias (usually underfitting; note the contrast with the model complexity that accompanies low bias) means the model's error is large on both the training set and the validation set.

Variance: describes the relationship between the predictions of the same algorithm on different data sets and the average prediction over all data sets (the algorithm's stability). Low variance corresponds to a simpler model, but an overly simple model easily underfits; high variance means that the parameters fitted on different training sets differ greatly (usually overfitting; note the contrast with the model simplicity that accompanies low variance).

Remedies for high bias: use more features, add polynomial features, decrease the regularization strength λ.

Remedies for high variance: add more training samples, reduce the number of features, increase the regularization strength λ.
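A hedged illustration of the bias/variance trade-off described above: polynomials of increasing degree are fitted to noisy data and the training and validation errors compared. The data-generating function and the chosen degrees are illustrative assumptions, not from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)
x_tr, y_tr, x_va, y_va = x[::2], y[::2], x[1::2], y[1::2]

for degree in (1, 4, 15):
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse = lambda a, b: np.mean((np.polyval(coefs, a) - b) ** 2)
    # degree 1: high bias (both errors large); degree 15: high variance (train error small, validation large)
    print(degree, round(mse(x_tr, y_tr), 3), round(mse(x_va, y_va), 3))
```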


[Figure: variance vs bias]

4. Feature selection:

Feature selection is crucial to machine learning. Reducing the number of features guards against the curse of dimensionality and shortens training time; it strengthens the model's generalization ability and reduces overfitting; and it improves understanding of the features and their values. Personally, I think that in most machine-learning tasks the features determine the upper bound of performance, and the choice and combination of models only approaches that upper bound.

Common feature-selection methods: (1) Remove features whose values vary little: if a feature takes the same value for the vast majority of instances, its usefulness is probably limited; in the extreme case where it takes the same value for all instances, it is essentially useless. (2) Univariate feature selection: test each feature individually, measure the relationship between the feature and the response variable, and discard poorly scoring features; common methods include the chi-squared test, mutual information, the Pearson correlation coefficient, the distance correlation coefficient, and model-based ranking. (3) Regularization: L1 regularization and L2 regularization. (4) Random forest feature selection: mainly mean decrease impurity and mean decrease accuracy. (5) Top-level feature selection: mainly stability selection and recursive feature elimination.
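A hedged sketch of three of the methods above (variance threshold, univariate selection, and L1 regularization), assuming scikit-learn; the data set, threshold, and k are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

X_var = VarianceThreshold(threshold=0.01).fit_transform(X)           # (1) drop near-constant features
X_uni = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)   # (2) univariate (mutual information)
l1 = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)  # (3) L1 drives some weights to zero
print(X_var.shape, X_uni.shape, (l1.coef_ != 0).sum())
```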



5. Loss functions

A loss function evaluates the degree to which the model's predictions differ from the true values; the better the loss function, the better the model's performance usually is. Different models generally use different loss functions. Loss functions are divided into empirical-risk loss functions and structural-risk loss functions: the empirical-risk loss measures the difference between predicted and actual results, while the structural-risk loss is the empirical-risk loss plus a regularization term.


[Figure: loss functions]

They include: 0-1 loss, absolute loss, log (logarithmic) loss, exponential loss, hinge loss, and so on.
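A minimal sketch of the losses just listed, for a single example; the label conventions (y in {-1, +1} for the 0-1, exponential, and hinge losses, and y in {0, 1} with a predicted probability p for the log loss) are standard assumptions, not from the original post.

```python
import numpy as np

def zero_one_loss(y, f):        # 1 if the prediction's sign is wrong, else 0
    return float(y * f <= 0)

def absolute_loss(y, f):
    return abs(y - f)

def log_loss(y01, p):           # y01 in {0, 1}, p = predicted probability of class 1
    return -(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

def exponential_loss(y, f):
    return np.exp(-y * f)

def hinge_loss(y, f):
    return max(0.0, 1 - y * f)
```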

Fourth, applications and research progress in the medical field

References:




Origin www.cnblogs.com/940310lxd/p/12370252.html