Dimensionality Reduction (降维)

Foreword

       Earlier we introduced the K-means algorithm, an unsupervised learning method for solving clustering problems. In this chapter I will introduce another unsupervised learning algorithm, one used to solve the dimensionality reduction problem, called Principal Component Analysis (PCA).

      Finally, if anything here is unclear or mistaken, I hope you will not hesitate to point it out. Thank you!

Chapter 12 Dimensionality Reduction

12.1 Data Compression

      Before introducing dimensionality reduction, we should first ask: why reduce the dimension at all? Sometimes the data we work with look like this: suppose we measure an object and record its length in centimeters as feature x1 and in feet as feature x2. We obtain two sets of numbers, but they are clearly redundant, since both express the same quantity, and if we plot the two variables against each other the data fall almost on a straight line. You might ask how anyone would create such a problem when choosing features, but in a large project where several people collect data, it is easy to end up with similar features and hence with redundancy. In that case we want to reduce the dimension of the data. For the 2D example just described, we can reduce it to 1D as shown in Figure 1: we fit a line through the data such that the sum of the distances from the data points to the line is minimized, take the direction of that line as the new coordinate axis, i.e. the axis of the 1D space (there are two possible directions), and then map each data point onto this new axis. The 2D data have now been reduced to 1D.

Figure 1. Reducing 2D data to 1D

       Similarly, a 3D problem can be reduced to 2D. As shown in Figure 2, we have a 3D data set whose points lie roughly in a plane, so we can choose a 2D plane that minimizes the sum of the distances from each data point to the plane, as shown in Figure 3. The chosen plane is spanned by two directions z1 and z2, which become the horizontal and vertical axes of the new 2D space. Mapping each data point onto this plane gives a new data set in a 2D space, shown in Figure 4. Each new data point z^{(i)} is now 2D, i.e. z\in R^{2} and z^{(i)}=\begin{bmatrix} z_{1}^{(i)}\\ z_{2}^{(i)} \end{bmatrix}.

Figure 2. A 3D data set

Figure 3. Selecting a 2D plane

Figure 4. The new 2D data set

       After dimensionality reduction, the data occupy much less memory, and our algorithms also run faster.

12.2 Data Visualization

      The data compression described above is only one benefit of dimensionality reduction; another benefit is data visualization. In many practical problems a data set has a large number of features. For example, when analyzing each country's GDP, happiness index and so on, we end up with many features per country, as shown in Figure 5, so the input x might satisfy x\in R^{50}. Plotting such data in a single chart is clearly impossible: in practice we can display at most a 3D space, and anything higher-dimensional cannot be drawn. So if we want to visualize the data, we must first reduce its dimension. Figure 6 shows a possible result after reduction: if z1 represents the size of a country and z2 represents its GDP, then the USA might be the point marked in Figure 6, and the relationships between some of the features become easy to see. z1 and z2 could stand for different features and would then give different pictures, but either way we gain a visual sense of the relationship between the features and the data.

Figure 5. A table of statistics about various countries

Figure 6. The data reduced to 2D

12.3 Principal Component Analysis Problem Formulation

        Earlier we explained why we perform dimensionality reduction: one reason is to compress the data, saving memory and making algorithms run faster; the other is to visualize high-dimensional data. In this section I will describe how the reduction is actually carried out, using Principal Component Analysis (PCA). First, what exactly does PCA do? For the earlier problem of reducing 2D data to 1D, shown in Figure 7, we choose a line such that the sum of the distances from each data point to the line is minimized; the line has two possible directions, which we write as u^{(1)} and -u^{(1)}. In the same way, to reduce n-dimensional data to K dimensions we choose K vectors u^{(1)}, u^{(2)}, ..., u^{(k)} that span a new K-dimensional space, such that the sum of the distances from the original n-dimensional data to that K-dimensional space is minimized, and we then map the n-dimensional data into the K dimensions.

Figure 7. PCA reducing 2D data to 1D
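Stated as a formula, this is just a restatement of the description above, using the same notation as section 12.6 where x_{approx}^{(i)} denotes the projection of x^{(i)} onto the space spanned by u^{(1)},...,u^{(k)}: PCA chooses the directions that minimize the average squared projection error,

\min_{u^{(1)},...,u^{(k)}}\ \frac{1}{m}\sum_{i=1}^{m}\left \| x^{(i)}-x_{approx}^{(i)} \right \|^2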

     Here you may wonder: this dimensionality reduction process looks rather similar to linear regression; are they the same? The answer is definitely no. Comparing the two, as shown in Figure 8: in linear regression there is a special output variable y, and the quantity we minimize is the sum of the distances between each data point's predicted output and its actual output, which in the figure are the vertical distances; in dimensionality reduction all of the input variables x are treated equally, and the quantity we minimize is the sum of the perpendicular distances from the data points to the line. The two should not be confused.

Figure 8. Comparison of linear regression and dimensionality reduction
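To make the contrast concrete (the regression hypothesis h_{\theta}(x) below is notation I am assuming here, not something taken from the figure): linear regression minimizes the vertical errors against the output y, \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2, whereas PCA has no output variable and minimizes the perpendicular projection error \frac{1}{m}\sum_{i=1}^{m}\left \| x^{(i)}-x_{approx}^{(i)} \right \|^2.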

12.4 Implementation of the PCA Algorithm

       So far we have only built an intuitive picture of what PCA does; now I will walk through the mathematical steps of the algorithm. For a training set x^{(1)},x^{(2)},...,x^{(m)}, the first step is feature scaling. We already discussed this when predicting house prices, so I will not explain its meaning again and simply give the formulas: \mu _{j}=\frac{1}{m}\sum_{i=1}^{m}x_{j}^{(i)}, and the new x_{j}^{(i)} is x_{j}^{(i)}-\mu _{j}; sometimes we instead use \frac{x_{j}^{(i)}-\mu _{j}}{s_{j}} as the new x_{j}^{(i)}. Either way, the goal is to bring every feature into a comparable range.

       Then comes the PCA algorithm itself. I will not explain the theory behind it here, as it is somewhat involved (look it up if you are interested), and will only describe the implementation. First we compute \Sigma =\frac{1}{m}\sum_{i=1}^{m}(x^{(i)})(x^{(i)})^{\top }, which is the covariance matrix; note that the result is an n*n matrix, and note that \Sigma here is a capital Sigma, not the summation symbol. Next, a single function call gives us all the matrices we need: [U,S,V]=svd(Sigma). The matrix U is n*n, and to obtain a K-dimensional result we simply take the first K columns of U. The K-dimensional representation z of the n-dimensional data x is then z=X*U(:,1:K); note that X is m*n and U(:,1:K) is n*K, so z is m*K, i.e. the data have been reduced from n dimensions to K dimensions.
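The steps above can be collected into a short Octave/MATLAB-style sketch. This is only a minimal illustration under my own naming (the function runPCA and the variable X_norm are not from the course material); it assumes X is an m*n data matrix with one example per row:

    function [Z, U, mu] = runPCA(X, K)
      % X: m*n data matrix (one example per row), K: target dimension
      m = size(X, 1);

      % Mean normalization (optionally also divide each feature by its std s_j)
      mu = mean(X);               % 1*n vector of feature means
      X_norm = X - mu;            % relies on implicit broadcasting

      % Covariance matrix, n*n
      Sigma = (1 / m) * (X_norm' * X_norm);

      % Singular value decomposition
      [U, S, V] = svd(Sigma);

      % Project onto the first K columns of U; Z is m*K
      Z = X_norm * U(:, 1:K);
    end

Here the projection is applied to the mean-normalized X_norm rather than the raw X, which is consistent with the scaling step described above.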

12.5 Reconstructing the Original Data from the Compressed Data

       So far we have discussed how to reduce the dimension of high-dimensional data. Given the reduced data, how can we recover the original data? First, it is impossible to recover exactly the same data we started with; what we get is only an approximation of the original data. As shown in Figure 9, the left side shows PCA reducing 2D data to 1D, and the right side shows the original data reconstructed from the reduced data, which is only an approximation. The computation is x_{approx}=z*U(:,1:K)'; note that z is m*K and U(:,1:K) is n*K, so x_{approx} is m*n, which matches the dimensions of the original data x.

Figure 9. Dimensionality reduction and reconstruction of the data
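Continuing the earlier sketch, the reconstruction described above is a single matrix product (again only an illustration with my own function name; if the mean \mu was subtracted before PCA, it has to be added back to return to the original coordinates):

    function X_rec = recoverData(Z, U, K, mu)
      % Z: m*K reduced data, U: n*n matrix returned by svd, mu: 1*n feature means
      X_rec = Z * U(:, 1:K)';     % (m*K) times (K*n) gives an m*n approximation
      X_rec = X_rec + mu;         % undo the mean normalization
    end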

12.6 Choosing the Number of Principal Components

      Choosing the number of principal components simply means choosing K. We know we want to reduce the dimension, but how many dimensions are enough? Here is the criterion. First define the average squared projection error, \frac{1}{m}\sum_{i=1}^{m}\left \| x^{(i)}-x_{approx}^{(i)} \right \|^2, and the total variation of the data, \frac{1}{m}\sum_{i=1}^{m}\left \| x^{(i)}\right \|^2. We then choose a K such that \frac{\frac{1}{m}\sum_{i=1}^{m}\left \| x^{(i)}-x_{approx}^{(i)} \right \|^2}{\frac{1}{m}\sum_{i=1}^{m}\left \| x^{(i)}\right \|^2}\leq 0.01. The threshold does not have to be 0.01; other values can be used. A threshold of 0.01 means that 99% of the variance is retained, so we set the bound according to how much variance we need to keep.

      How do we do this in the algorithm? Since we want K to be as small as possible, we could start at K=1 and increase it, each time computing \frac{\frac{1}{m}\sum_{i=1}^{m}\left \| x^{(i)}-x_{approx}^{(i)} \right \|^2}{\frac{1}{m}\sum_{i=1}^{m}\left \| x^{(i)}\right \|^2} and checking whether it is \leq 0.01, stopping at the first K that satisfies the condition. That is correct in principle, but recall that [U,S,V]=svd(Sigma) also gives us the matrix S, an n*n diagonal matrix (only the entries on the diagonal are nonzero). Using S, the ratio above can be written as 1-\frac{\sum_{i=1}^{k}S_{ii}}{\sum_{i=1}^{n}S_{ii}}, where S_{ii} are the diagonal entries of S. The condition therefore becomes 1-\frac{\sum_{i=1}^{k}S_{ii}}{\sum_{i=1}^{n}S_{ii}}\leq 0.01, i.e. \frac{\sum_{i=1}^{k}S_{ii}}{\sum_{i=1}^{n}S_{ii}}\geq 0.99, which lets us evaluate the condition directly from S without recomputing the projections for every K.
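That selection rule can be sketched as a small loop that uses only the diagonal of S from a single call to svd (variable names here are my own; Sigma is the covariance matrix from section 12.4):

    [U, S, V] = svd(Sigma);
    s = diag(S);                          % the diagonal entries S_ii
    total = sum(s);
    for K = 1:length(s)
      retained = sum(s(1:K)) / total;     % fraction of variance retained
      if retained >= 0.99                 % equivalent to the error ratio <= 0.01
        break;                            % smallest K meeting the requirement
      end
    end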

12.7 Some Advice on Applying PCA

      First, must PCA always be applied to a training set? When we get a training set, the first thing to try is training on the raw data and checking whether the algorithm is fast enough; if that works well there is no need to use PCA, since PCA itself involves a fair amount of computation and takes extra time. If we decide PCA is needed, we should run it on the training set only, not on the cross-validation or test data; the mapping learned from the training set is then applied to the cross-validation and test sets. When the purpose is data compression, we choose K by deciding how much variance to retain; when the purpose is visualization, K is chosen to be 2 or 3. One last question: since PCA reduces the number of features, it is tempting to think we could use it to address the overfitting problem discussed earlier. The answer is that it is not impossible, but it usually works less well than the regularization we used before, so I do not recommend using PCA to deal with overfitting; use regularization instead.
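The point about learning the mapping only on the training set can be shown with a short sketch (reusing the hypothetical runPCA from section 12.4; Xtrain, Xcv and Xtest are assumed data matrices):

    % Learn mu and U on the training set only
    [Ztrain, U, mu] = runPCA(Xtrain, K);

    % Apply the same mu and U(:, 1:K) to the cross-validation and test data
    Zcv   = (Xcv   - mu) * U(:, 1:K);
    Ztest = (Xtest - mu) * U(:, 1:K);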

