Reading notes on Baimian Machine Learning (One Hundred Faces of Machine Learning)

1 Feature Engineering

1.1 Feature Normalization
Purpose: eliminate the effect of differing scales (dimensions) among data features.
There are two common methods for normalizing numerical data:
(1) Linear function normalization (Min-Max Scaling)
A linear transformation of the original data that maps the result into the range [0, 1], i.e., an equal-proportion rescaling of the original data.
X_norm = (X - X_min) / (X_max - X_min)
(2) Zero-mean normalization (Z-Score Normalization)
Maps the original data onto a distribution with mean 0 and standard deviation 1.
Benefit of normalizing numerical data:
Gradient descent can find the optimal solution in fewer iterations.
[Figure: gradient-descent contours before and after normalization]
Models solved by gradient descent generally require normalization, including linear regression, logistic regression, SVM, and neural networks.
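A minimal sketch of the two normalizations in numpy (the toy feature values are made up for illustration):

```python
import numpy as np

# Toy feature column; values are illustrative.
x = np.array([1.0, 5.0, 3.0, 8.0, 10.0])

# (1) Min-max scaling: maps values into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# (2) Z-score normalization: zero mean, unit standard deviation.
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)  # values in [0, 1]
print(x_zscore)  # mean ~0, std ~1
```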

Decision trees do not require normalization. Reason:
Taking C4.5 as an example, node splits in a decision tree depend mainly on the information gain ratio of feature x with respect to the data set, and the information gain ratio is unaffected by whether the feature is normalized.
1.2 Categorical Features
Encoding approaches:
(1) Ordinal (label) encoding
(2) One-hot encoding
(3) Binary encoding
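A small sketch of the three encodings in plain numpy (the category values are made up; real pipelines would typically use pandas or scikit-learn encoders):

```python
import numpy as np

categories = ["low", "medium", "high", "medium"]

# (1) Ordinal encoding: map each category to an integer ID.
vocab = {c: i for i, c in enumerate(dict.fromkeys(categories))}  # {"low": 0, "medium": 1, "high": 2}
ordinal = np.array([vocab[c] for c in categories])

# (2) One-hot encoding: one indicator column per category.
one_hot = np.eye(len(vocab), dtype=int)[ordinal]

# (3) Binary encoding: write the ordinal ID in binary, one column per bit.
n_bits = int(np.ceil(np.log2(len(vocab))))
binary = np.array([[(i >> b) & 1 for b in reversed(range(n_bits))] for i in ordinal])

print(ordinal)  # [0 1 2 1]
print(one_hot)  # 4 x 3 indicator matrix
print(binary)   # 4 x 2 bit matrix
```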
1.7 Handling Insufficient Image Data
Transformations that keep the image category unchanged:
(1) Random rotation, translation, scaling, cropping, etc., within a certain range.
(2) Adding noise perturbations.
(3) Color transformations.
(4) Changing image brightness, sharpness, contrast, etc.
Alternatively, extract features from the image first and then apply transformations in the feature space.
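A hedged sketch of such label-preserving augmentations using torchvision (the specific transform parameters are arbitrary examples, not values from the book):

```python
from torchvision import transforms

# Random, label-preserving augmentations: rotation, crop/scale,
# color/brightness/contrast jitter, plus a horizontal flip.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Usage: augmented = augment(pil_image)  # pil_image is a PIL.Image
```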

2 Model Evaluation

2.3 Applications of Cosine Distance
Definition of cosine similarity:
cos(A, B) = (A · B) / (||A|| ||B||)
i.e., the cosine of the angle between two vectors; it captures the angular relationship between the vectors rather than their absolute magnitudes.
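A short numpy sketch of cosine similarity and cosine distance (the vectors are arbitrary examples):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(A, B) = (A · B) / (||A|| * ||B||)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))      # 1.0: same direction, different magnitudes
print(1 - cosine_similarity(a, b))  # cosine distance
```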
2.5 Model Evaluation Methods
(1) Holdout
The original samples are randomly split into two parts: a training set and a validation set.
Disadvantage: because the split is random, the final evaluation metric computed on the validation set depends heavily on how the data happened to be divided.
(2) Cross-validation
Motivation: eliminate the randomness of the holdout test.
k-fold cross-validation:
Divide all samples into k subsets of equal size, then iterate over the k subsets, each time using the current subset as the validation set and the remaining subsets as the training set. The average of the k evaluation metrics is the final evaluation metric.
Leave-one-out validation:
Each time, one sample is left out as the validation set and all other samples are used as the training set.
(3) Bootstrap method
Motivation: when the sample size is small, splitting off a validation set shrinks the training set even further, which may hurt model training.
For a sample set of total size n, draw n random samples with replacement to form the training set; the samples that are never drawn serve as the validation set.
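A brief sketch of the three schemes with scikit-learn and numpy (the data set X, y and the split sizes are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X, y = np.random.rand(100, 5), np.random.randint(0, 2, 100)  # placeholder data

# (1) Holdout: one random train/validation split.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# (2) k-fold cross-validation: average the metric over k folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    pass  # fit on X[train_idx], evaluate on X[val_idx], then average the k scores

# (3) Bootstrap: n draws with replacement; never-drawn samples form the validation set.
n = len(X)
boot_idx = np.random.choice(n, size=n, replace=True)
oob_idx = np.setdiff1d(np.arange(n), boot_idx)  # "out-of-bag" validation samples
```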

2.6 Hyperparameter Tuning
(1) Grid search
In practice, grid search usually starts with a wide search range and a large step size to locate the region of the likely global optimum, then narrows the range and the step size to pin down a more precise optimum (see the sketch after this list).
Drawback: since the objective function is usually non-convex, this strategy can easily miss the global optimum.
(2) Random search
Randomly sample points within the search range.
Rationale: if enough points are sampled, random sampling also finds the global optimum, or a point close to it, with high probability.
(3) Bayesian optimization
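A sketch of grid and random search over an SVM's hyperparameters with scikit-learn (the parameter ranges, data set, and fold count are arbitrary illustrations):

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# (1) Grid search: a coarse grid first; a finer grid around the best point would follow.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

# (2) Random search: sample points from a (log-uniform) search range.
rand = RandomizedSearchCV(
    SVC(), {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=20, cv=5, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```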

2.7 Underfitting and Overfitting
Overfitting: the model fits the training data too closely; the evaluation metrics look good on the training set but performance is poor on the test set and on new data.
Ways to reduce overfitting:
(1) Obtain more training data, e.g., by image translation, rotation, scaling, etc.
(2) Reduce model complexity, e.g., fewer network layers or neurons, shallower trees, pruning.
(3) Regularization: add regularization constraints to the objective.
(4) Ensemble learning, such as bagging.

3 Classic Machine Learning Algorithms

3.3 Decision Trees
(1) ID3: maximum information gain
Empirical entropy of data set D:
H(D) = -Σ_{k=1}^{K} (|Ck|/|D|) log2(|Ck|/|D|)
D - sample set
K - number of classes
H(D) - empirical entropy of data set D
Ck - the subset of samples in D that belong to class k
|Ck| - number of elements in the subset Ck
|D| - number of elements in the sample set D

(2) C4.5: maximum information gain ratio
(3) CART: maximum Gini index (Gini)
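As a small sketch of item (1): computing the empirical entropy H(D) and the information gain of a feature in numpy (the toy labels and feature values are made up):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) = -sum_k (|Ck|/|D|) * log2(|Ck|/|D|)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """g(D, A) = H(D) - H(D|A): split D by each value of feature A."""
    labels, feature = np.asarray(labels), np.asarray(feature)
    h_d_a = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        h_d_a += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - h_d_a

y = ["yes", "yes", "no", "no", "yes"]           # toy class labels
x = ["sunny", "rain", "rain", "sunny", "rain"]  # toy feature values
print(entropy(y), information_gain(x, y))
```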

Decision trees are prone to overfitting. Solutions are as follows:
Pre-pruning
Stop tree growth early while the decision tree is being generated.
Core idea: before expanding a node, first check whether the current split improves the model's generalization ability; if not, stop growing that subtree and assign the node's class by majority vote.
(1) Stop growing when the tree reaches a certain depth.
(2) Stop growing when the number of samples reaching the current node falls below a threshold.
(3) Compute the accuracy improvement on the test set for each split; if it is below a threshold, do not expand further.
Risk: underfitting.

Post-pruning
Prune the fully grown decision tree after generation to remove overfitted branches and obtain a simplified tree.
4 Dimensionality Reduction

PCA
Steps:
(1) Center the sample data.
(2) Compute the sample covariance matrix.
(3) Perform eigenvalue decomposition of the covariance matrix and sort the eigenvalues in descending order.
(4) Take the eigenvectors w1, w2, ..., wd corresponding to the d largest eigenvalues, and map the n-dimensional samples down to d dimensions:
x_i' = (w1^T x_i, w2^T x_i, ..., wd^T x_i)^T
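A sketch of these four steps in numpy (random data stands in for real samples; the target dimension d is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 samples, n = 5 dimensions
d = 2                           # target dimension

# (1) Center the data.
X_centered = X - X.mean(axis=0)

# (2) Sample covariance matrix.
cov = np.cov(X_centered, rowvar=False)

# (3) Eigenvalue decomposition, eigenvalues sorted in descending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]

# (4) Project onto the eigenvectors of the d largest eigenvalues.
W = eigvecs[:, order[:d]]       # n x d projection matrix [w1, ..., wd]
X_reduced = X_centered @ W      # each row: (w1^T x_i, ..., wd^T x_i)
```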

5 Unsupervised Learning

Unsupervised learning methods:
Data clustering and feature-variable association.
Data clustering: find the optimal partition of the data through multiple iterations.
Feature-variable association: use various correlation-analysis methods to find relationships between variables.
(1) K-means clustering
Idea: iteratively search for a partition into K clusters such that the cost function of the resulting clustering is minimized.
Cost function: the sum of squared distances between each sample and the center of the cluster it belongs to.
J(c, μ) = Σ_{i=1}^{N} ||x_i - μ_{ci}||^2
x_i - the i-th sample
ci - the cluster that x_i belongs to
μ_{ci} - the center of that cluster
Drawbacks: the result is affected by the initial values and by outliers, so it is not stable across runs, and is usually only a local optimum.
It cannot handle clusters whose distributions differ greatly, and it is not suitable for discrete (categorical) data.
K must be set manually in advance, and the chosen value may not match the true data distribution.
Each sample point can only be assigned to a single cluster.
Advantages: for large data sets, K-means is scalable and efficient; its computational complexity is O(NKt), which is close to linear,
where N is the number of samples, K is the number of clusters, and t is the number of iterations.
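A compact sketch of the K-means iteration that minimizes the cost J above (pure numpy; K, the data, and the stopping rule are illustrative, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]  # random initial centers
    for _ in range(n_iters):
        # Assignment step: each sample joins its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: move each center to the mean of its cluster.
        new_centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    cost = ((X - centers[assign]) ** 2).sum()  # J(c, mu): sum of squared distances
    return assign, centers, cost

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers, cost = kmeans(X, K=2)
```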

7 Optimization

Convex optimization problems: linear models such as support vector machines and linear regression.
Non-convex optimization problems: low-rank models (matrix factorization), deep neural network models.

Classic optimization algorithms: direct methods and iterative methods.

Direct methods
Conditions that must be satisfied:
(1) The objective is a convex function.
(2) A closed-form solution exists.
∇L(θ*) = 0 (the optimum θ* is where the gradient of the objective vanishes)
Example: Ridge Regression
Objective function:
L(θ) = ||y - Xθ||_2^2 + λ||θ||_2^2
Optimal solution:
θ* = (X^T X + λI)^{-1} X^T y
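A sketch of this closed-form ridge solution in numpy (the synthetic data and λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 1.0

# Direct (closed-form) solution: theta = (X^T X + lam * I)^{-1} X^T y.
theta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(theta)
```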
Iterative methods

First-order methods: based on a first-order Taylor expansion.
Second-order methods: based on a second-order Taylor expansion.
7.5 Stochastic Gradient Descent
Classic gradient descent
BGD (Batch Gradient Descent)
Approximates the objective function by the average loss over all training data, i.e.:
L(θ) = (1/M) Σ_{i=1}^{M} L(f(x_i, θ), y_i)
∇L(θ) = (1/M) Σ_{i=1}^{M} ∇L(f(x_i, θ), y_i)
The model parameter update rule is:
θ_{t+1} = θ_t - α ∇L(θ_t)
where α is the learning rate.
Drawback: every parameter update requires a pass over all the training data, which is computationally expensive and time-consuming.

To solve this problem, stochastic gradient descent (SGD) was proposed: it approximates the average loss with the loss of a single training sample.
Stochastic gradient descent
Updates the model parameters using a single training example at a time, which greatly speeds up convergence.
L(θ; x_i, y_i) = L(f(x_i, θ), y_i)
θ_{t+1} = θ_t - α ∇L(θ_t; x_i, y_i)
Drawback: because the parameters are updated with one training example at a time, the gradient direction changes easily from step to step.
Mini-batch gradient descent
Mini-Batch Gradient Descent
Reduces the variance of the stochastic gradient and makes the iterative algorithm more stable.
Processes a batch of m training examples at a time and updates the parameters.
L(θ) = (1/m) Σ_{j=1}^{m} L(f(x_{i_j}, θ), y_{i_j})
θ_{t+1} = θ_t - (α/m) Σ_{j=1}^{m} ∇L(f(x_{i_j}, θ), y_{i_j})
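A sketch of the mini-batch update on least-squares linear regression (numpy; the learning rate, batch size, and synthetic data are illustrative; setting m = 1 recovers SGD and m = len(X) recovers BGD):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

theta = np.zeros(3)
alpha, m, epochs = 0.1, 32, 50          # learning rate, batch size, passes over the data

for _ in range(epochs):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), m):
        batch = idx[start:start + m]
        # Average gradient of the squared loss over the mini-batch.
        grad = X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
        theta -= alpha * grad           # theta_{t+1} = theta_t - alpha * grad

print(theta)  # should approach [2.0, -1.0, 0.5]
```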
7.7 L1 Regularization and Sparsity
L1 regularization makes the model parameters sparse.
Reasons:
Shape of the solution space:
The solution space constrained by the L2 regularizer is circular, while that of the L1 regularizer is a polygon. The polygonal solution space is far more likely to intersect the loss contours at one of its sharp corners, where some coordinates are exactly zero, yielding a sparse solution.
[Figure: L1 (polygonal) vs. L2 (circular) constraint regions intersecting the loss contours]
Function superposition:
[Figure: the objective before and after adding the L1 penalty]
As the figure shows, after adding the L1 regularization constraint the minimum moves to the red point, where the corresponding w is 0, producing sparsity.
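A quick illustration of the sparsity effect with scikit-learn (synthetic data; the regularization strengths are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + 0.1 * rng.normal(size=100)  # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 penalty

print("L1 zero coefficients:", np.sum(lasso.coef_ == 0))  # most coefficients exactly 0
print("L2 zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none exactly 0
```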

12 Ensemble Learning

12.2 Steps and Examples of Ensemble Learning
Steps of ensemble learning
(1) Find base classifiers whose errors are mutually independent.
(2) Train the base classifiers.
(3) Combine the results of the base classifiers.
Combination methods
(1) Voting
The result that receives the most votes is taken as the final result, e.g., bagging.
(2) Stacking
A serial approach: the output of the previous base classifier is fed to the next classifier, and the outputs of all base classifiers are added up as the final output, e.g., boosting.
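A hedged sketch of the two combination styles with scikit-learn, using a voting ensemble and AdaBoost as a boosting example (the chosen estimators and data set are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# (1) Voting: the class with the most votes among the base classifiers wins.
voting = VotingClassifier([
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
], voting="hard").fit(X, y)

# (2) Boosting: base classifiers are trained serially, each focusing on the
#     previous ones' errors, and their outputs are combined additively.
boost = AdaBoostClassifier(n_estimators=50).fit(X, y)

print(voting.score(X, y), boost.score(X, y))
```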

12.3 Base Classifiers
The most commonly used base classifier is the decision tree, because:
(1) Sample weights can be conveniently incorporated into the training process, without needing over-sampling to adjust them.
(2) The expressive power and generalization ability of a tree can be traded off by adjusting its depth.
(3) Decision trees naturally introduce randomness.
The base classifier of a random forest is the decision tree; it cannot be replaced by a linear classifier or K-nearest neighbors.
Reason:
(1) Random Forest belongs to the bagging family of ensemble learning. The benefit of bagging is that the variance of the ensemble is smaller than that of the base classifiers, so the base classifiers should ideally be unstable and sensitive to the sample distribution; only then does bagging have room to help.
Linear classifiers and K-nearest neighbors are themselves relatively stable classifiers with low variance.
12.4 Bias and Variance
Bias
The deviation between the average output of all models trained on every sampled training set of size m and the output of the true model.
Usually caused by making wrong assumptions in the learning algorithm.

Variance
The variance of the outputs of all models trained on every sampled training set of size m.
Usually caused by the model complexity being too high relative to the number of training samples m.
For example, with only 100 training samples, using a polynomial hypothesis of degree up to 200.
The figure below shows the intuitive difference between the two:
[Figure: intuitive illustration of bias vs. variance]
Bagging focuses on variance and can reduce it.
Boosting focuses on bias and reduces it.
Variance and bias are two sides of a trade-off; their relationship is as follows:
[Figure: the bias-variance trade-off]
12.5 Basic Principles of GBDT
Basic idea: train each newly added weak learner on the negative-gradient information of the current model's loss function, then combine the trained weak learner into the existing model in an additive fashion.
GBDT trains iteratively on the residuals; during prediction, the predictions of all trees are summed to obtain the final prediction.
In GBDT the (negative) gradient equals the residual.
Reason: GBDT uses the least-squares (squared-error) loss function; taking the partial derivative of the loss with respect to the prediction function gives F(x_i) - y_i, i.e., predicted value minus actual value, so the negative gradient is exactly the residual.
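Spelled out for the standard squared-error loss (the 1/2 factor is the usual convention, not necessarily the book's exact form):

L(y_i, F(x_i)) = (1/2) (y_i - F(x_i))^2
∂L/∂F(x_i) = F(x_i) - y_i
-∂L/∂F(x_i) = y_i - F(x_i), which is exactly the residual the next tree is fit to.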

The only difference between the GBDT regression algorithm and the GBDT classification algorithm is the loss function.
GBDT still sorts features and splits on feature values in the same way, and the trees themselves are the same (regression trees).
During prediction, the result is obtained by adding up the residual predictions of each layer (tree).
[Figure: GBDT prediction as the sum of the trees' residual predictions]
12.6 Differences and Connections Between XGBoost and GBDT
(1) When CART is used as the base classifier, XGBoost adds a regularization term that controls model complexity, which helps prevent overfitting.
(2) GBDT uses only first-derivative information of the cost function during training, while XGBoost performs a second-order Taylor expansion of the cost function and uses both the first and second derivatives.
(3) GBDT uses CART as the base classifier, while XGBoost supports multiple types of base classifiers, such as linear classifiers.
(4) GBDT uses all of the data at each iteration, while XGBoost adopts a strategy similar to random forests and supports sampling of the data.
(5) GBDT does not handle missing values, while XGBoost can automatically learn a strategy for handling missing values.


Origin blog.csdn.net/c2250645962/article/details/98474245