1. Questions

The PCA and ICA discussed earlier make no use of the class label y of the sample data. Recall that in regression, too many features cause problems such as irrelevant features and overfitting. We can use PCA to reduce dimensionality, but PCA ignores class labels and is unsupervised. For example, returning to the earlier example of documents containing "learn" and "study": after applying PCA, these two features may be combined into one, reducing the dimension. But suppose our class label y indicates whether the topic of the article is related to learning; then these two features have little effect on y and could be removed entirely. As another example, suppose we perform face recognition on a 100×100-pixel image. Each pixel is a feature, so there are 10,000 features, while the class label y is only a 0/1 value, with 1 meaning the image is a human face. So many features make training complicated, and the unnecessary ones can affect the results unpredictably. What we want after dimensionality reduction is a small set of the best features, the ones most closely related to y. How do we get them?

2. Linear discriminant analysis (two-class case)

Recall our earlier logistic regression setting: we are given m training examples with n-dimensional features $x^{(i)}$ (i ranging from 1 to m), each with a class label $y^{(i)}$, and we learn parameters $\theta$ such that $h_\theta(x) = g(\theta^T x)$ (g is the sigmoid function). Here we consider only binary classification, y = 1 or y = 0. For convenience we change notation and restate the problem: we are given N samples $\{x^{(1)}, \dots, x^{(N)}\}$ with d-dimensional features, $N_1$ of which belong to class $\omega_1$ and $N_2$ to class $\omega_2$.
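To make the problem concrete, here is a small sketch on synthetic data (an assumed example, not from the original article): two classes share a high-variance but label-irrelevant direction, and PCA, being label-blind, picks exactly that direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data: large shared variance along x1 (irrelevant to the label),
# class separation only along x2.
n = 200
x1 = rng.normal(0.0, 5.0, size=(2 * n, 1))            # high-variance, uninformative
x2 = np.concatenate([rng.normal(-1.0, 0.3, size=(n, 1)),
                     rng.normal(+1.0, 0.3, size=(n, 1))])
X = np.hstack([x1, x2])
y = np.array([0] * n + [1] * n)                        # labels PCA never sees

# PCA: top eigenvector of the (label-free) covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, np.argmax(eigvals)]

# PCA picks the high-variance x1 axis, which carries no class information.
print(pc1)
```

The first principal component comes out close to the x1 axis, so projecting onto it mixes the two classes together, which is exactly the problem motivating LDA.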
Now we feel the original features are too many, and we want to reduce the d-dimensional features to a single dimension while still ensuring that the classes are "clearly" reflected in the low-dimensional data; that is, this one dimension should determine the class of each sample. Call this optimal vector w (d-dimensional). The projection of a sample x (d-dimensional) onto w is

$y = w^T x$

The y obtained here is not a 0/1 value but the distance from the origin of the point where x projects onto the line. When x is two-dimensional, we are looking for a line (direction w) to project onto, such that the projections best separate the sample points. As shown below: intuitively, the picture on the right is better, since it separates the sample points of the different classes well. Next we find this optimal w quantitatively. First, compute the mean (center point) of each class of samples, where i takes only two values:

$\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x$

The mean of the sample points after projecting onto w is

$\tilde{\mu}_i = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i$

so the mean after projection is just the projection of the class center point. What is the best line (w)? A first idea: a good line separates the projected centers of the two classes as far as possible. Quantitatively,

$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$

and the larger J(w), the better. But is considering J(w) alone enough? No; see the picture below. The sample points are distributed in ellipses: projecting onto the horizontal axis x1 gives a larger center spacing J(w), but because of overlap, x1 cannot separate the sample points. Projecting onto the vertical axis x2 gives a smaller J(w), yet it does separate the sample points. Therefore we must also consider the variance of the projected sample points: the larger the variance, the harder it is to separate the classes.
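The projection $y = w^T x$ and the identity $\tilde{\mu}_i = w^T \mu_i$ can be checked numerically. The following sketch (on assumed synthetic data) compares the projected mean gap for the two candidate directions from the ellipse example: the high-variance axis versus the discriminative axis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two 2-D classes: wide spread along x1, separated along x2.
X1 = rng.normal([0.0, -2.0], [3.0, 0.5], size=(100, 2))
X2 = rng.normal([0.0, +2.0], [3.0, 0.5], size=(100, 2))

def projected_mean_gap(w, Xa, Xb):
    """|mu~1 - mu~2| after projecting each sample as y = w^T x (w normalized)."""
    w = w / np.linalg.norm(w)
    return abs((Xa @ w).mean() - (Xb @ w).mean())

# Gap along x1 is small (class centers coincide there); gap along x2 is ~4.
print(projected_mean_gap(np.array([1.0, 0.0]), X1, X2))
print(projected_mean_gap(np.array([0.0, 1.0]), X1, X2))

# Projected mean equals the projection of the mean: w^T mu_i.
w = np.array([0.0, 1.0])
print(np.isclose((X1 @ w).mean(), w @ X1.mean(axis=0)))
```

Note, as the ellipse figure argues, a large mean gap alone is not sufficient: the x1 direction can still have heavy class overlap when within-class variance is large, which is why the scatter term enters next.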
We use another measure, called scatter, for the projected classes:

$\tilde{s}_i^2 = \sum_{x \in \omega_i} (w^T x - \tilde{\mu}_i)^2$

From the formula, this is just the variance without the division by the number of samples. Geometrically, the scatter measures the density of the sample points: the larger the value, the more spread out they are; the smaller, the more concentrated. What we want from the projected samples: points of different classes as separated as possible, points of the same class as concentrated as possible; that is, the larger the difference between the means the better, and the smaller the scatter the better. We can combine exactly these two quantities into one criterion:

$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$

All that remains is to find the w that maximizes J(w). First expand the scatter:

$\tilde{s}_i^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w$

We define the middle part of this formula as

$S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$

Isn't this just a covariance matrix without the division by the number of samples? The $S_i$ are called scatter matrices. We continue to define

$S_W = S_1 + S_2$

called the within-class scatter matrix. Going back to the formula above and substituting the middle part, we get

$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$

Then we expand the numerator:

$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$

where $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ is called the between-class scatter. It is the outer product of two vectors; although it is a matrix, its rank is 1. So J(w) can finally be written as

$J(w) = \frac{w^T S_B w}{w^T S_W w}$

Before taking derivatives we must normalize the denominator: without normalization, scaling w by any factor leaves J(w) unchanged, so w cannot be pinned down. We therefore set $w^T S_W w = 1$, add a Lagrange multiplier, and differentiate:

$c(w) = w^T S_B w - \lambda (w^T S_W w - 1), \quad \frac{dc}{dw} = 2 S_B w - 2\lambda S_W w = 0 \;\Rightarrow\; S_B w = \lambda S_W w$

This formula is called the Fisher linear discriminant. If $S_W$ is invertible, then

$S_W^{-1} S_B w = \lambda w$

Moreover, $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = (\mu_1 - \mu_2)\,\lambda_w$, where $\lambda_w = (\mu_1 - \mu_2)^T w$ is a scalar. Substituting into the eigenvalue formula gives

$\lambda_w\, S_W^{-1} (\mu_1 - \mu_2) = \lambda w$

Since scaling w up or down does not affect the result, we can cancel the unknown constants $\lambda$ and $\lambda_w$ on both sides to obtain

$w = S_W^{-1} (\mu_1 - \mu_2)$

At this point, we only need the means and within-class scatter of the original samples to obtain the best direction w. This is the linear discriminant analysis proposed by Fisher in 1936. See the projection result for the two-dimensional samples in the figure above.

3. Linear discriminant analysis (multi-class case)

The previous discussion covered only two classes. Suppose the number of classes grows: how must the method change so that the classes can still be separated after projection?
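The closed-form solution $w = S_W^{-1}(\mu_1 - \mu_2)$ is short enough to implement directly. A minimal sketch, assuming synthetic two-class data (not from the original article):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two classes: wide spread along x1, separated along x2 (as in the ellipse figure).
X1 = rng.normal([0.0, -2.0], [3.0, 0.5], size=(200, 2))   # class omega_1
X2 = rng.normal([0.0, +2.0], [3.0, 0.5], size=(200, 2))   # class omega_2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter S_W = S_1 + S_2, with S_i = sum (x - mu_i)(x - mu_i)^T.
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)
SW = S1 + S2

# Fisher's solution (up to scale): w ∝ S_W^{-1} (mu_1 - mu_2).
w = np.linalg.solve(SW, mu1 - mu2)
w /= np.linalg.norm(w)
print(w)
```

The resulting direction is close to the x2 axis: the large within-class variance along x1 in $S_W$ suppresses that component, exactly as the criterion intends.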
Previously we discussed how to reduce d dimensions to one. Now that there are more classes, one dimension may no longer be enough. Suppose we have C classes and need K projection vectors (also called basis vectors), collected as the columns of a matrix W.

To measure J(W) as in the previous section, we again work from between-class scatter and within-class scatter. When the samples are two-dimensional, we consider the geometric meaning:

$S_{W_i} = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T$

has the same meaning as in the previous section: the scatter of the class-i sample points around their class center $\mu_i$. The within-class scatter becomes the sum over classes:

$S_W = \sum_{i=1}^{C} S_{W_i}$

$S_B$ must change: before, it measured the scatter of the two mean points; now it measures the scatter of each class mean around the overall sample mean $\mu$. Treat each $\mu_i$ as a sample point; $S_B$ is then the covariance matrix of the means. If some class contains more sample points, its weight should be slightly larger, namely $N_i/N$; but since J(w) is insensitive to scaling, we use $N_i$:

$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T$, where $\mu = \frac{1}{N}\sum_{x} x = \frac{1}{N}\sum_{i=1}^{C} N_i \mu_i$

Everything above is the formula change before projection, but the numerator and denominator of the real J(W) are computed after projection. Now look at how the formulas change after projecting the samples:

$\tilde{\mu}_i = \frac{1}{N_i}\sum_{x\in\omega_i} W^T x, \qquad \tilde{\mu} = \frac{1}{N}\sum_{x} W^T x$

These are the means of the class-i sample points after projection onto the basis vectors. W is the basis-vector matrix; $\tilde{S}_W = W^T S_W W$ is the sum of the within-class scatter matrices after projection, and $\tilde{S}_B = W^T S_B W$ is the sum of the scatter matrices of the projected class centers around the projected overall center.

Recall J(w) from the previous section: the numerator was the distance between the two class centers, the denominator each class's own scatter. Now the projection direction is multi-dimensional (several lines), so the numerator must change: instead of summing pairwise distances between class centers (which is not useful for describing the between-class spread), we sum the scatter of each class center around the overall sample center. The final form of J(W) is

$J(W) = \frac{|W^T S_B W|}{|W^T S_W W|}$

Since the numerator and denominator we obtain are both scatter matrices, we need to take determinants to turn the matrices into real numbers. Because the determinant is the product of the matrix's eigenvalues, and an eigenvalue expresses the spread along its eigenvector, we use the determinant (this feels a bit forced to me; the reasoning is not all that convincing).

The whole problem again reduces to maximizing J(W). We fix the denominator to 1 and differentiate (I searched many lecture notes and articles and did not find the derivation), arriving at the same conclusion as the previous section:

$S_B w_i = \lambda_i S_W w_i, \qquad \text{i.e.}\quad S_W^{-1} S_B w_i = \lambda_i w_i$

In the end it again comes down to a matrix eigenvalue problem: first compute the eigenvalues of $S_W^{-1} S_B$, then take the eigenvectors of the top K eigenvalues as the columns of W.

Note: since each term $(\mu_i - \mu)(\mu_i - \mu)^T$ in $S_B$ has rank 1, the rank of $S_B$ is at most C (the rank of a sum of matrices is at most the sum of the ranks). And since once the first C−1 of the $\mu_i - \mu$ are known, the last can be expressed linearly in terms of the others, the rank of $S_B$ is at most C−1. Hence K is at most C−1: there are at most C−1 useful eigenvectors, and the eigenvectors with the largest eigenvalues give the best separation.

Since $S_W^{-1} S_B$ is not necessarily a symmetric matrix, the K eigenvectors obtained are not necessarily orthogonal; this is another difference from PCA.
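The multi-class recipe above (build $S_W$ and $S_B$, eigendecompose $S_W^{-1} S_B$, keep the top C−1 eigenvectors) can be sketched as follows, on assumed synthetic data with C = 3 classes in three dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Three 3-D classes (C = 3), so at most C - 1 = 2 discriminant directions.
means = [np.array([0., 0., 0.]), np.array([4., 0., 0.]), np.array([0., 4., 0.])]
X = np.vstack([rng.normal(m, 1.0, size=(100, 3)) for m in means])
labels = np.repeat([0, 1, 2], 100)

mu = X.mean(axis=0)                                   # overall sample mean
SW = np.zeros((3, 3))
SB = np.zeros((3, 3))
for c in range(3):
    Xc = X[labels == c]
    mc = Xc.mean(axis=0)
    SW += (Xc - mc).T @ (Xc - mc)                     # within-class scatter
    SB += len(Xc) * np.outer(mc - mu, mc - mu)        # between-class scatter, N_i-weighted

# Solve S_W^{-1} S_B w = lambda w; keep the top K = C - 1 = 2 eigenvectors.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs.real[:, order[:2]]                        # projection basis (3 x 2)

# S_B has rank at most C - 1 = 2, so the third eigenvalue is ~0.
print(np.round(np.sort(eigvals.real)[::-1], 3))
```

The near-zero third eigenvalue confirms the rank argument in the text: only C−1 directions carry between-class information, so K = 2 here. Note `np.linalg.eig` (not `eigh`) is used, since $S_W^{-1} S_B$ is generally not symmetric.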
On performing multi-class classification: one approach is the "one against the rest" method, which builds C classifiers (each of them binary) and then combines their results; another is pairwise classification, where each classifier separates one pair of classes, producing C(C−1)/2 classifiers (see Wikipedia).
To avoid $S_W$ being a singular matrix when constructing the transformation matrix W, the number of samples should be larger than the dimensionality of the samples.
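This singularity is easy to check numerically. In the sketch below (an assumed example), two classes of 3 samples each live in 10 dimensions; each centered class contributes rank at most $N_i - 1$, so rank($S_W$) ≤ 4 < d and $S_W$ cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_per_class = 10, 3              # 6 samples total in 10 dimensions

SW = np.zeros((d, d))
for _ in range(2):                  # two classes
    Xc = rng.normal(size=(n_per_class, d))
    mc = Xc.mean(axis=0)
    SW += (Xc - mc).T @ (Xc - mc)   # centering removes one rank per class

# rank(SW) <= 2 * (n_per_class - 1) = 4 < d, so SW is singular.
print(np.linalg.matrix_rank(SW))
```

In practice this is often handled by reducing dimensionality first (e.g. with PCA) or by regularizing $S_W$; the simplest remedy, as the text says, is more samples than dimensions.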
Vector projection

Given vectors u and v, find the projection vector of u onto v, as shown below. Suppose the projection of u onto v is u', and the angle between u and v is $\theta$. A vector has two attributes, magnitude and direction. First determine the magnitude (i.e. length, or modulus) of u': drop a perpendicular from the tip of u onto v; then d is the length of u'. The direction of u' is the same as that of v, and v's direction is v/|v|, which is also the direction of u'. So we have

$u' = d \cdot \frac{v}{|v|}$  (1)

Next, find the length d:

$d = |u| \cos\theta$  (2)

Finally, find $\cos\theta$:

$\cos\theta = \frac{u \cdot v}{|u|\,|v|}$  (3)

Solving equations (1), (2), (3) jointly yields

$u' = \frac{u \cdot v}{|v|^2}\, v$

This is the final projection vector, and its length is $d = \frac{u \cdot v}{|v|}$.
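The two formulas above can be verified with a quick numerical check (an assumed example with u = (3, 4) and v along the x-axis):

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])

# Projection vector: u' = (u·v / |v|^2) v; its length: d = u·v / |v|.
u_proj = (u @ v) / (v @ v) * v
d = (u @ v) / np.linalg.norm(v)

print(u_proj)  # [3. 0.]
print(d)       # 3.0
```

Projecting (3, 4) onto the x-axis drops the second component, as expected, and the length 3 agrees with $|u|\cos\theta = 5 \cdot 3/5$.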