LDA and PCA Algorithms

1. Questions

The PCA and ICA we discussed before do not use the class label y of the sample data. Recall that in regression, having too many features leads to problems such as irrelevant features and overfitting. We can use PCA to reduce the dimensionality, but PCA does not take the class labels into account; it is unsupervised.

For example, going back to the earlier example of a document containing the features "learn" and "study": after applying PCA, these two features may be merged into one, reducing the dimension. But suppose our class label y indicates whether the topic of the article is related to learning; then these two features have little effect on y and could be removed entirely.

For another example, suppose we perform face recognition on 100*100 pixel images. Each pixel is a feature, so there are 10,000 features, while the corresponding class label y is just a 0/1 value (1 means the image is a human face). With so many features, training is not only complicated, but unnecessary features can have unpredictable effects on the results. What we want is to keep only the best features (those most closely related to y) after dimensionality reduction. What should we do?

2. Linear discriminant analysis (two-class case)

Recalling our earlier logistic regression approach: we are given m training examples with n-dimensional features, $x^{(i)}$ (i ranging from 1 to m), each $x^{(i)}$ corresponding to a class label $y^{(i)}$. We want to learn the parameters $\theta$ such that $h_\theta(x) = g(\theta^T x)$ (g is the sigmoid function).

Now only the binary classification case is considered, that is, y=1 or y=0.

To simplify the notation, we redefine the problem: we are given N samples with d-dimensional features, $\{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, of which $N_1$ belong to class $\omega_1$ and the other $N_2$ belong to class $\omega_2$.

Now the number of original features feels too large, and we want to reduce the d-dimensional features to a single dimension, while still ensuring that the classes are "clearly" reflected in the low-dimensional data, that is, that this one dimension can still determine the class of each sample.

We call this optimal direction w (a d-dimensional vector). The projection of a sample x (d-dimensional) onto w is then

$$y = w^T x$$

The y value obtained here is not the 0/1 class label, but the (signed) distance from the origin of the point where x lands after being projected onto the line defined by w.
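As a concrete illustration, here is a minimal NumPy sketch of this projection. The toy data, the direction w, and the variable names are made up for illustration; they are not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two classes of 2-D points (N1 = 30, N2 = 40).
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))  # class omega_1
X2 = rng.normal(loc=[2.0, 1.5], scale=0.5, size=(40, 2))  # class omega_2

# A candidate projection direction w (d-dimensional, here d = 2).
w = np.array([1.0, 0.5])

# y = w^T x for every sample: one scalar per sample, the signed distance
# of the projected point from the origin along the line spanned by w.
y1 = X1 @ w
y2 = X2 @ w
print(y1.shape, y2.shape)  # (30,) (40,)
```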

When x is two-dimensional, we want to find a line (direction w) to project onto, such that the projected sample points of the two classes are separated as well as possible. As shown below:

[Figure: left and right panels show the same two-dimensional, two-class samples projected onto two different directions w]

Intuitively, the picture on the right is better, and it can well separate the sample points of different categories.

Next we find this optimal w from a quantitative point of view.

First, we find the mean (center point) of the samples of each class; here there are only two classes (i = 1, 2):

$$\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$$

The mean of the sample points after projecting onto w is

$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i$$

It can be seen that the mean value after projection is also the projection of the sample center point.
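A quick numerical check of this identity, again with made-up toy data (any X1 and w will do):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))  # toy class omega_1
w = np.array([1.0, 0.5])                                   # any direction

mu1 = X1.mean(axis=0)        # class center in the original space
mu1_tilde = (X1 @ w).mean()  # mean of the projected samples

# The mean after projection equals the projection of the class center.
print(np.isclose(mu1_tilde, w @ mu1))  # True
```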

What is the best straight line (direction w)? A first thought is that a good line should separate the projected centers of the two classes as far as possible; quantitatively:

$$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$$

The larger the J(w), the better.

But is it enough to consider only J(w)? No; see the figure below.

[Figure: two overlapping elliptical sample clusters; their projections onto the x1 and x2 axes are compared]

The sample points are evenly distributed in two ellipses. Projecting onto the horizontal axis x1 gives a larger spacing between the projected centers (larger J(w)), but the projections overlap, so x1 cannot separate the sample points. Projecting onto the vertical axis x2 gives a smaller J(w), yet it does separate the sample points. Therefore, we also need to consider the variance of the sample points within each class: the larger that variance, the harder it is to separate the classes.

We use another quantity, called the scatter, to measure how spread out the projected points of each class are:

$$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2$$

It can be seen from the formula that this is just the variance without the division by the number of samples. Geometrically, the scatter measures how dispersed the projected sample points are: the larger the value, the more scattered; the smaller the value, the more concentrated.

We want the projected sample points to look like this: sample points of different classes as far apart as possible, and sample points of the same class as tightly clustered as possible. In other words, the larger the difference between the projected means, the better, and the smaller the scatter of each class, the better. Combining the two, the final criterion is

$$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

The next step is clear: we just need to find the w that maximizes J(w).
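Before deriving the closed-form solution below, the criterion can already be evaluated numerically for any candidate direction. A small sketch with hypothetical toy data (the helper name fisher_criterion is my own):

```python
import numpy as np

def fisher_criterion(X1, X2, w):
    """J(w) = (mu1_tilde - mu2_tilde)^2 / (s1_tilde^2 + s2_tilde^2)."""
    y1, y2 = X1 @ w, X2 @ w
    num = (y1.mean() - y2.mean()) ** 2
    den = ((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum()
    return num / den

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [1.0, 0.2], size=(50, 2))  # class omega_1
X2 = rng.normal([0.0, 1.5], [1.0, 0.2], size=(50, 2))  # class omega_2

# The classes differ mainly along x2, so projecting onto e2 should score higher.
print(fisher_criterion(X1, X2, np.array([1.0, 0.0])))  # small J
print(fisher_criterion(X1, X2, np.array([0.0, 1.0])))  # large J
```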

First, expand the scatter formula:

$$\tilde{s}_i^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w$$

We define the middle part of the above formula as

$$S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$$

This is just the covariance matrix without the division by the number of samples; matrices of this form are called scatter matrices.

We continue and define

$$S_W = S_1 + S_2$$

$S_W$ is called the within-class scatter matrix.

Going back to the formula for $\tilde{s}_i^2$ above and substituting $S_i$ for the middle part, we get

$$\tilde{s}_i^2 = w^T S_i w$$

$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$
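A short check, in the same hypothetical toy-data style, that $w^T S_W w$ really equals the sum of the projected scatters (the helper scatter_matrix is my own name):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 0.5, size=(30, 2))  # class omega_1
X2 = rng.normal([2.0, 1.5], 0.5, size=(40, 2))  # class omega_2
w = np.array([1.0, 0.5])

def scatter_matrix(X):
    """S_i = sum_x (x - mu_i)(x - mu_i)^T  (covariance without the 1/N factor)."""
    D = X - X.mean(axis=0)
    return D.T @ D

S1, S2 = scatter_matrix(X1), scatter_matrix(X2)
SW = S1 + S2                                   # within-class scatter matrix

# Projected scatter computed directly vs. via w^T S_W w.
s_tilde = sum(((X @ w - (X @ w).mean()) ** 2).sum() for X in (X1, X2))
print(np.isclose(s_tilde, w @ SW @ w))         # True
```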

Next, we expand the numerator:

$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$$

$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is called the between-class scatter matrix. It is the outer product of two vectors; although it is a matrix, its rank is 1.

J(w) can then finally be written as

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

Before differentiating, we need to normalize the denominator: otherwise J(w) is unchanged when w is scaled by any factor, and w cannot be determined uniquely. We therefore impose the constraint $w^T S_W w = 1$; adding a Lagrange multiplier and differentiating gives

$$c(w) = w^T S_B w - \lambda (w^T S_W w - 1)$$

$$\Rightarrow \frac{dc}{dw} = 2 S_B w - 2\lambda S_W w = 0$$

$$\Rightarrow S_B w = \lambda S_W w$$

Here we used matrix calculus: when differentiating, $w^T S_W w$ can simply be treated like the scalar case $S_W w^2$, whose derivative is $2 S_W w$.

If $S_W$ is invertible, multiplying both sides of the differentiated result by $S_W^{-1}$ gives

$$S_W^{-1} S_B w = \lambda w$$

This pleasing result says that w is an eigenvector of the matrix $S_W^{-1} S_B$.

This formulation is called the Fisher linear discriminant.

Wait; let us look again at the formula for $S_B$:

$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$$

Then

$$S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = (\mu_1 - \mu_2) \cdot \lambda_w$$

where $\lambda_w = (\mu_1 - \mu_2)^T w$ is a scalar.

Substituting into the eigenvalue equation above gives

$$S_W^{-1} S_B w = S_W^{-1} (\mu_1 - \mu_2) \cdot \lambda_w = \lambda w$$

Since scaling w by any factor does not affect the result, we can cancel the unknown constants $\lambda$ and $\lambda_w$ on both sides and obtain

$$w = S_W^{-1} (\mu_1 - \mu_2)$$

So we only need the class means and the within-class scatter of the original samples to find the best direction w. This is the linear discriminant analysis proposed by Fisher in 1936.
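Putting the two-class result together, here is a minimal NumPy sketch (the toy data and helper names are my own, not from the text) that computes $w = S_W^{-1}(\mu_1 - \mu_2)$ and checks that it is indeed an eigenvector of $S_W^{-1} S_B$. Projecting the data onto this w and thresholding the projected value then gives a simple two-class classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(60, 2))   # class omega_1
X2 = rng.normal([2.0, 1.0], [1.0, 0.3], size=(80, 2))   # class omega_2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

def scatter_matrix(X):
    D = X - X.mean(axis=0)
    return D.T @ D

SW = scatter_matrix(X1) + scatter_matrix(X2)   # within-class scatter
SB = np.outer(mu1 - mu2, mu1 - mu2)            # between-class scatter (rank 1)

# Fisher's closed-form direction (its scale is irrelevant).
w = np.linalg.solve(SW, mu1 - mu2)

# Check: S_W^{-1} S_B w is parallel to w, i.e. w is an eigenvector.
lhs = np.linalg.solve(SW, SB @ w)
print(np.allclose(lhs / np.linalg.norm(lhs), w / np.linalg.norm(w)))  # True
```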

The figure below shows the projection result for the two-dimensional samples above:

[Figure: two-dimensional two-class samples projected onto the Fisher direction w]

3. Linear discriminant analysis (multi-class case)

The previous section handled only two classes. If the number of classes grows, how must the method change so that the classes can still be separated after projection?

Earlier we discussed how to reduce d dimensions to one dimension. With more classes, one dimension may no longer be enough. Suppose we have C classes and need K projection vectors (also called basis vectors).

Denote these K basis vectors by $W = [w_1\,|\,w_2\,|\,\cdots\,|\,w_K]$, a d×K matrix.

We denote the result of projecting a sample point onto these K basis vectors by $y = [y_1, y_2, \ldots, y_K]^T$; the following hold:

$$y_i = w_i^T x$$

$$y = W^T x$$

To measure J(W) as in the previous section, we again start from the between-class scatter and the within-class scatter.

When the samples are two-dimensional, consider the geometric picture first:

[Figure: several two-dimensional classes with class means $\mu_i$, the overall mean $\mu$, within-class scatters $S_{W_i}$ and between-class scatters $S_{B_i}$]

Here $\mu_i$ and $S_W$ have the same meaning as in the previous section; $S_{W_1}$ is the scatter of the class-1 sample points relative to their class center $\mu_1$. $S_{B_1}$ becomes the covariance matrix of the class-1 center relative to the overall sample center $\mu$, that is, the scatter of class 1 relative to $\mu$.

$$S_W = \sum_{i=1}^{C} S_{W_i}$$

The formula for $S_{W_i}$ is unchanged; it is still essentially the covariance matrix of the sample points inside class i:

$$S_{W_i} = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$$

$S_B$ must change: previously it measured the scatter between the two class means, but now it measures the scatter of each class mean relative to the overall sample center. It is as if we treat the class means $\mu_i$ as sample points and $S_B$ as their covariance matrix around $\mu$. If a class contains more sample points, it should carry a slightly larger weight; the weight is $N_i/N$, but since J(W) is insensitive to constant factors, we simply use $N_i$:

$$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T$$

where

$$\mu = \frac{1}{N} \sum_{\forall x} x = \frac{1}{N} \sum_{i=1}^{C} N_i \mu_i$$

is the mean of all samples.
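A sketch of building these multi-class scatter matrices in NumPy; the three classes, their sizes, and the dimension are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3-class, 4-dimensional data.
classes = [rng.normal(loc=c, scale=0.7, size=(n, 4))
           for c, n in [(0.0, 40), (2.0, 55), (4.0, 35)]]

mu = np.vstack(classes).mean(axis=0)               # overall sample mean

SW = np.zeros((4, 4))
SB = np.zeros((4, 4))
for X in classes:
    mu_i = X.mean(axis=0)
    D = X - mu_i
    SW += D.T @ D                                  # within-class part S_Wi
    SB += len(X) * np.outer(mu_i - mu, mu_i - mu)  # N_i-weighted between-class part

print(SW.shape, SB.shape)  # (4, 4) (4, 4)
```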

All the formulas above describe the quantities before projection, but the numerator and denominator of the actual J(W) are computed after projection. Let us now look at how the formulas change once the sample points are projected.

The next two formulas are the mean of the class-i sample points and the mean of all sample points after projection onto the basis vectors:

$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y$$

$$\tilde{\mu} = \frac{1}{N} \sum_{\forall y} y$$

The following two are $S_W$ and $S_B$ after projection onto the basis vectors:

$$\tilde{S}_W = \sum_{i=1}^{C} \sum_{y \in \omega_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T$$

$$\tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$$

In effect, we have just replaced x with y and $\mu_i$, $\mu$ with $\tilde{\mu}_i$, $\tilde{\mu}$.

Combining the contributions along all the projection vectors (the columns of W), these two quantities can be written as

$$\tilde{S}_W = W^T S_W W$$

$$\tilde{S}_B = W^T S_B W$$

W is the matrix of basis vectors; $\tilde{S}_W$ is the sum of the within-class scatter matrices of the classes after projection, and $\tilde{S}_B$ is the sum of the scatter matrices of each projected class center relative to the projected overall center.
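A quick numerical check of $\tilde{S}_W = W^T S_W W$ with the same kind of toy data (any d×K matrix W works; the analogous check for $\tilde{S}_B$ has the same form):

```python
import numpy as np

rng = np.random.default_rng(0)
classes = [rng.normal(loc=c, scale=0.7, size=(n, 4))
           for c, n in [(0.0, 40), (2.0, 55), (4.0, 35)]]
W = rng.normal(size=(4, 2))                     # any d x K matrix of basis vectors

SW = np.zeros((4, 4))
SW_proj = np.zeros((2, 2))
for X in classes:
    D = X - X.mean(axis=0)
    SW += D.T @ D                               # within-class scatter before projection
    Y = X @ W                                   # projected samples y = W^T x
    E = Y - Y.mean(axis=0)
    SW_proj += E.T @ E                          # within-class scatter after projection

print(np.allclose(SW_proj, W.T @ SW @ W))       # True
```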

Recall the formula J(w) from the previous section: the numerator was the distance between the two class centers, and the denominator was each class's own scatter. Now the projection is multi-dimensional (several lines at once), so the numerator must change: instead of summing the pairwise distances between class centers (which does not describe how dispersed the classes are), we sum the scatter of each class center relative to the overall sample center.

The final form of J(W) is then

$$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$$

Since the numerator and denominator we obtain are both scatter matrices, we need to turn the matrices into real numbers, so we take their determinants. The determinant equals the product of the matrix's eigenvalues, and each eigenvalue measures the spread along its eigenvector, so the determinant is used as the measure (I feel this justification is a bit forced and not entirely convincing).

The whole problem again reduces to maximizing J(W). We fix the denominator at 1, differentiate, and obtain the final result (I searched many lecture notes and articles but did not find the full derivation):

$$S_B w_i = \lambda S_W w_i$$

This is the same conclusion as in the previous section:

$$S_W^{-1} S_B w_i = \lambda w_i$$

In the end it again comes down to a matrix eigenvalue problem: first compute the eigenvalues of $S_W^{-1} S_B$, then take the eigenvectors corresponding to the K largest eigenvalues to form the matrix W.

Note: since each term $(\mu_i - \mu)(\mu_i - \mu)^T$ in $S_B$ has rank 1, the rank of $S_B$ is at most C (the rank of a sum is at most the sum of the ranks). Moreover, once the first C-1 means $\mu_i$ are known, the last one can be expressed linearly in terms of them (because the overall mean $\mu$ is a weighted average of all class means), so the rank of $S_B$ is at most C-1. Therefore K is at most C-1, i.e., there are at most C-1 useful eigenvectors. The eigenvectors with the largest eigenvalues give the best separation.

Since $S_W^{-1} S_B$ is not necessarily symmetric, the K eigenvectors obtained are not necessarily orthogonal; this is another difference from PCA.
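A minimal multi-class LDA sketch under the same toy-data assumptions. Following the text, it forms $S_W^{-1} S_B$ directly and keeps the top K = C - 1 eigenvectors; in practice one might prefer a generalized eigensolver, but that is my note, not the text's.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 3, 4
classes = [rng.normal(loc=c, scale=0.7, size=(n, d))
           for c, n in [(0.0, 40), (2.0, 55), (4.0, 35)]]

mu = np.vstack(classes).mean(axis=0)
SW = np.zeros((d, d))
SB = np.zeros((d, d))
for X in classes:
    mu_i = X.mean(axis=0)
    D = X - mu_i
    SW += D.T @ D
    SB += len(X) * np.outer(mu_i - mu, mu_i - mu)

# Eigen-decompose S_W^{-1} S_B; it is not symmetric, so the eigenvectors
# need not be orthogonal (unlike PCA).
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(SW, SB))
order = np.argsort(eigvals.real)[::-1]

K = C - 1                               # rank(S_B) <= C - 1, so at most C-1 useful directions
W = eigvecs.real[:, order[:K]]          # d x K projection matrix

print(np.linalg.matrix_rank(SB))        # 2, i.e. C - 1
print(W.shape)                          # (4, 2)
```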

 

 

 

A note on multi-class classification in general: one approach is "one against the rest", which constructs C classifiers (each one binary) and then combines their results; another approach is pairwise classification, where each classifier separates one pair of classes, producing C(C-1)/2 classifiers... (see Wikipedia)
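As an illustration only, the two strategies can be tried with scikit-learn's generic wrappers around any binary classifier; the choice of LogisticRegression and the toy data here are my own, not something the text prescribes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

rng = np.random.default_rng(0)
# Hypothetical 3-class toy data.
X = np.vstack([rng.normal(loc=c, scale=0.7, size=(50, 2)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 50)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)  # C binary classifiers
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)   # C(C-1)/2 binary classifiers

print(len(ovr.estimators_))  # 3
print(len(ovo.estimators_))  # 3 = C(C-1)/2 for C = 3
```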

  

To avoid $S_W$ being a singular matrix when constructing the transformation matrix, the number of samples should be larger than the dimension of the samples.
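A small numerical illustration (dimensions and sample counts are made up) of why the sample count matters: with fewer samples than dimensions, $S_W$ cannot reach full rank and is therefore singular.

```python
import numpy as np

rng = np.random.default_rng(0)

def within_class_scatter(classes):
    SW = np.zeros((classes[0].shape[1],) * 2)
    for X in classes:
        D = X - X.mean(axis=0)
        SW += D.T @ D
    return SW

d = 10
few  = [rng.normal(size=(3, d)), rng.normal(size=(3, d))]    # N = 6  < d = 10
many = [rng.normal(size=(30, d)), rng.normal(size=(30, d))]  # N = 60 > d

print(np.linalg.matrix_rank(within_class_scatter(few)))   # < 10: S_W is singular
print(np.linalg.matrix_rank(within_class_scatter(many)))  # 10: S_W is invertible
```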

 

 

 

Vector projection

Given vectors u and v, find the projection vector of u onto v.

Suppose the projection vector of u onto v is u', and the angle between u and v is theta. A vector has two attributes, magnitude and direction. We first determine the magnitude (i.e., the length, or modulus) of u': drop a perpendicular from the tip of u onto v; the distance d from the origin to the foot of the perpendicular is the length of u'. The direction of u' is the same as that of v, and v/|v| is the unit vector in that direction. So we have

$$u' = d \cdot \frac{v}{|v|} \qquad (1)$$

Next, find the length d.

$$d = |u| \cos\theta \qquad (2)$$

Finally find cos(theta)

$$\cos\theta = \frac{u \cdot v}{|u|\,|v|} \qquad (3)$$

Solving equations (1), (2) and (3) jointly yields

$$u' = \frac{u \cdot v}{|v|^2}\, v$$

This is the final projection vector.

And the length d of this vector is

$$d = \frac{u \cdot v}{|v|}$$
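The same projection written out in NumPy, with arbitrary example vectors:

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([2.0, 0.0])

# Projection vector of u onto v:  u' = (u . v / |v|^2) * v
u_proj = (u @ v) / (v @ v) * v
print(u_proj)                      # [3. 0.]

# Its length d = u . v / |v|  (equivalently |u| * cos(theta))
d = (u @ v) / np.linalg.norm(v)
print(d, np.linalg.norm(u_proj))   # 3.0 3.0
```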

