Machine Learning Trio (Series 10) ---- A Dimensionality Reduction Powerhouse (with code)

Welcome to follow the WeChat public account "智能算法" (Smart Algorithm); the original article link is below (for a better reading experience):

Machine Learning Trio (Series 10) ---- A Dimensionality Reduction Powerhouse (with code)

In Series 9 we learned some practical knowledge about combining algorithms (ensembles); for details, see the link below:

Machine Learning Trio (Series 9) ---- Ever-Changing Algorithm Combinations (with code)

However, we also know that combining algorithms increases the overall running time, so today we change our angle and look at the problem from the dimensionality point of view: how to reduce an algorithm's time cost by reducing the number of dimensions.

Many machine learning problems involve thousands or even millions of features for each training instance. As we will see, this not only makes training very slow, it can also make it much harder to find a good solution. This problem is often called the curse of dimensionality.

Fortunately, in real-world problems it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one. For example, consider the MNIST images (described in Series 4): the pixels on the image borders are almost always white, so you could drop these pixels from the training set entirely without losing much information. Moreover, two neighboring pixels are usually highly correlated: if you merge them into a single pixel (e.g., by averaging their intensities), you will not lose much information either.

Besides speeding up training, dimensionality reduction is also extremely useful for data visualization (DataViz). Reducing the number of dimensions down to two (or three) makes it possible to plot a high-dimensional training set on a graph and often gain important insights by visually detecting patterns, such as clusters.

In this issue, we will cover the following main topics:

  • Curse of dimensionality
  • The main approaches to dimensionality reduction
  • PCA (Principal Component Analysis)
  • Kernel PCA
  • LLE (Locally Linear Embedding)

The keyword for the accompanying code is given at the end of this article; reply with that keyword on the public account to download the code.

I. Curse of dimensionality

We are so used to living in three dimensions that our intuition fails us when we try to imagine a higher-dimensional space. Even a basic 4D hypercube is incredibly hard to picture in our minds, as shown below, let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space.

It turns out that many things behave very differently in high-dimensional space. For example, if you pick a random point in a unit square (a 1 × 1 square), it will have only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be "extreme" along any dimension). But in a 10,000-dimensional unit hypercube (a 1 × 1 × ... × 1 cube, with ten thousand 1s), this probability is greater than 99.999999%. Most points in such a high-dimensional hypercube are very close to the border.

Here is a more striking difference: if you pick two points randomly in a unit square, the distance between these two points will be, on average, roughly 0.52. If you pick two random points in a 3D unit cube, the average distance will be roughly 0.66. But what about two points picked randomly in a 1,000,000-dimensional unit hypercube? The average distance will be about 408.25 (roughly the square root of 1,000,000 / 6)! This is quite counterintuitive: how can two points be so far apart when they both lie within the same unit hypercube? This fact implies that high-dimensional datasets can be very sparse: most training instances are likely to be far away from each other. Of course, this also means that a new instance will likely be far away from any training instance, which makes predictions much less reliable than in lower dimensions. In short, the more dimensions the training set has, the greater the risk of overfitting it.
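
Here is a quick simulation sketch you can run to check these numbers yourself: it estimates the average distance between random points in unit hypercubes of increasing dimension, where sqrt(dim / 6) is the large-dimension approximation behind the 408.25 figure above.

```python
import numpy as np

rng = np.random.default_rng(42)

def avg_pairwise_distance(dim, n_pairs=2000):
    """Estimate the mean distance between two uniform random points
    in a dim-dimensional unit hypercube."""
    a = rng.random((n_pairs, dim))
    b = rng.random((n_pairs, dim))
    return np.linalg.norm(a - b, axis=1).mean()

for dim in (2, 3, 1000):
    print(f"dim={dim:>5}: avg distance ~ {avg_pairwise_distance(dim):.2f}, "
          f"sqrt(dim / 6) = {np.sqrt(dim / 6):.2f}")

# Roughly 0.52 for dim=2 and 0.66 for dim=3; for large dim the average
# distance approaches sqrt(dim / 6), which gives ~408.25 at dim=1,000,000.
```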

In theory, one solution to the curse of dimensionality could be to increase the size of the training set so that the training instances reach a sufficient density. Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions. With just 100 features (far fewer than in the MNIST problem), you would need more training instances than there are atoms in the observable universe for the instances to be, on average, within 0.1 of each other, assuming they were spread out uniformly across all dimensions.

II. The main approaches to dimensionality reduction

Before diving into specific dimensionality reduction algorithms, let's take a look at the two main approaches to reducing dimensionality: projection and manifold learning.

2.1 Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated (as discussed earlier for MNIST). As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace of the high-dimensional space. This sounds very abstract, so let's look at an example. In the figure below, a 3D dataset is represented by circles.

Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space. Now, if we project every training instance perpendicularly onto this subspace (as represented by the short lines connecting the instances to the plane), we get the new 2D dataset shown in the figure below. Ta-da! We have just reduced the dataset's dimensionality from 3D to 2D. Note that the axes correspond to new features z1 and z2 (the coordinates of the projections on the plane).

However, projection is not always the best approach to dimensionality reduction. In many cases the subspace may twist and turn, such as in the famous Swiss roll toy dataset shown below.

Simply projecting onto a plane (e.g., by dropping x3) would squash different layers of the Swiss roll together, as shown on the left of the figure below. What you really want instead is to unroll the Swiss roll, in order to obtain the 2D dataset on the right of the figure below.

2.2 Manifold Learning

The Swiss roll is an example of a 2D manifold. Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane. In the case of the Swiss roll, d = 2 and n = 3: it locally resembles a 2D plane, but it is rolled in the third dimension.

Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called manifold learning. It relies on the manifold assumption (also called the manifold hypothesis), which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is very often observed empirically.

Think about the MNIST dataset again: all handwritten digit images share some similarities. They are made up of connected lines, the borders are white, they are more or less centered, and so on. If you generated images at random, only a tiny fraction of them would look like handwritten digits. In other words, the degrees of freedom available to you when you try to create a digit image are far fewer than the degrees of freedom you would have if you could generate any image you wanted. These constraints tend to squeeze the dataset into a lower-dimensional manifold.

The manifold assumption is often accompanied by another implicit assumption: that the task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold. For example, in the first row of the figure below, the Swiss roll is split into two classes: in the 3D space (on the left), the decision boundary is fairly complex, but in the unrolled 2D manifold space (on the right), the decision boundary is a simple straight line.

However, this assumption does not always hold. For example, in the second row of the figure above, the decision boundary is located at x1 = 5. This decision boundary looks very simple in the original 3D space (a vertical plane), but it looks more complex in the unrolled manifold (a collection of four independent line segments).

In short, reducing the dimensionality of your training set before training a model will certainly speed up training, but it will not always lead to a better or simpler solution; it all depends on the dataset.

At this point we have a good grasp of what the curse of dimensionality is and how dimensionality reduction algorithms can fight it, especially when the manifold assumption holds. Next, let's go through the most common dimensionality reduction algorithms together.

III. PCA (Principal Component Analysis)

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. It works by identifying the hyperplane that lies closest to the data and then projecting the data onto it.

3.1 Preserving the variance

Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane. For example, a simple 2D dataset is shown on the left of the figure below, together with three different axes (i.e., one-dimensional hyperplanes). On the right is the result of projecting the dataset onto each of these axes. As you can see, the projection onto one of the axes (the solid line) preserves the maximum variance, while the projection onto another preserves very little variance.

It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the rather simple idea behind PCA.

3.2 Principal Components (PCs)

PCA identifies the axis that accounts for the largest amount of variance in the training set. In the figure above, it is the solid line. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of the remaining variance. If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to the first two, as well as a fourth, a fifth, and so on, up to as many axes as the dataset has dimensions.

The unit vector that defines the i-th axis is called the i-th principal component (PC). In the figure above, the first PC is c1 and the second PC is c2. In the figure in Section 2.1, the first two PCs are represented by the orthogonal arrows in the plane, and the third PC is orthogonal to the plane (pointing up or down).

The direction of the principal components is not stable: if you perturb the training set slightly and run PCA again, some of the new PCs may point in the opposite direction of the original PCs. However, they will generally still lie on the same axes. In some cases, a pair of PCs may even rotate or swap, but the plane they define will usually remain the same.

So how do you find the principal components of a training set? Luckily, there is a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix X into the dot product of three matrices, X = U · Σ · VT, where VT contains all the principal components we are looking for.

The following Python code uses NumPy's svd() function to obtain all the principal components of the training set, then extracts the first two PCs:
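
A minimal sketch of what this might look like, assuming X is a NumPy array holding the small 3D dataset from Section 2.1; note that PCA assumes the data is centered around the origin, so we center it first:

```python
import numpy as np

# Assumption: X is an (m, 3) NumPy array holding the 3D dataset from Section 2.1.
X_centered = X - X.mean(axis=0)       # PCA assumes the data is centered
U, s, Vt = np.linalg.svd(X_centered)  # the rows of Vt are the principal components
c1 = Vt.T[:, 0]                       # first principal component
c2 = Vt.T[:, 1]                       # second principal component
```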

3.3 Projecting down to d dimensions

Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal components. Selecting this hyperplane ensures that the projection preserves as much variance as possible. For example, in the dataset from Section 2.1, the 3D dataset is projected down onto the 2D plane defined by the first two principal components, preserving most of the dataset's variance. As a result, the 2D projection looks very much like the original 3D dataset.

To project the training set onto the hyperplane, you simply compute the dot product of the training set matrix X with the matrix Wd, defined as the matrix containing the first d principal components (i.e., the matrix composed of the first d columns of VT): X_d-proj = X · Wd.

The following Python code projects the training set onto the plane defined by the first two principal components:
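
A sketch under the same assumption as above (X is the 3D dataset):

```python
import numpy as np

# Reusing X (the 3D dataset) from the previous sketch
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)

W2 = Vt.T[:, :2]          # Wd with d = 2: its columns are the first two PCs
X2D = X_centered.dot(W2)  # this computes X_d-proj = X_centered dot Wd
```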

We now know how to reduce the dimensionality of any dataset down to any number of dimensions while preserving as much variance as possible.

3.4 Using Scikit-Learn

Scikit-Learn's PCA class implements PCA using SVD decomposition, just as we did before. The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions:
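
A sketch, again assuming X holds the 3D dataset; unlike the manual NumPy version, the PCA class takes care of centering the data for you:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)   # keep the first two principal components
X2D = pca.fit_transform(X)  # the data is centered automatically
```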

After fitting the PCA transformer to the dataset, you can access the principal components via the components_ attribute (note that it contains the PCs as horizontal vectors, so, for example, the first principal component is equal to pca.components_.T[:, 0]).

3.5 Explained variance ratio

Another very useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ attribute. It indicates the proportion of the dataset's variance that lies along each principal component's axis. For example, let's look at the explained variance ratios of the first two components of the 3D dataset introduced in Section 2.1:
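
A sketch, assuming the pca transformer fitted in the previous snippet; the exact numbers depend on the data, but for the 3D example they should be close to the values quoted below:

```python
print(pca.explained_variance_ratio_)
# Prints something close to: [0.842  0.146]
# i.e. about 84.2% of the variance lies along the first PC,
# and about 14.6% along the second PC.
```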

This tells us that 84.2% of the dataset's variance lies along the first axis and 14.6% lies along the second axis. This leaves less than 1.2% for the third axis, so it is reasonable to assume that it probably carries little information.

3.6 Choosing the right number of dimensions

Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%). Unless, of course, you are reducing dimensionality for data visualization, in which case you will generally want to reduce the dimensionality down to 2 or 3.

The following code computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set's variance:
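
A sketch, assuming X_train holds the training set (e.g., MNIST as an (m, 784) array):

```python
import numpy as np
from sklearn.decomposition import PCA

pca = PCA()          # keep all components
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1   # smallest d preserving at least 95% of the variance
```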

You could then set n_components = d and run PCA again. However, there is a better option: instead of specifying the number of principal components you want to preserve, you can set n_components to a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:
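
A sketch, with the same X_train assumption as above:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)            # keep enough PCs to preserve 95% of the variance
X_reduced = pca.fit_transform(X_train)
```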

3.7 PCA for compression

Obviously, after dimensionality reduction the training set takes up much less space. For example, try applying PCA to the MNIST dataset while preserving 95% of its variance. You will find that each instance has just over 150 features instead of the original 784. So while most of the variance is preserved, the dataset is now less than 20% of its original size! This is a reasonable compression ratio, and you can see how such a size reduction can speed up a classification algorithm (such as an SVM classifier).

It is also possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection. Of course, this will not give you back the original data, since the projection lost a bit of information (within the 5% of variance that was dropped), but it will likely be quite close to the original data. The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error. For example, the following code compresses the MNIST dataset (reply "mnist" on the public account to get it) down to 154 dimensions, then decompresses it back to 784 dimensions using the inverse_transform() method. The figure below shows a few digits from the original training set (on the left), along with the corresponding digits after compression and decompression. You can see that there is a slight loss of image quality, but the digits are still mostly intact.
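
A sketch of the compression and decompression steps, under the same X_train assumption:

```python
from sklearn.decomposition import PCA

# Assumption: X_train is the MNIST training set as an (m, 784) array.
pca = PCA(n_components=154)
X_reduced = pca.fit_transform(X_train)          # compress: 784 -> 154 dimensions
X_recovered = pca.inverse_transform(X_reduced)  # decompress: 154 -> 784 dimensions
```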

The equation below shows this inverse transformation.
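
In the same notation as Section 3.3, with Wd containing the first d principal components, the standard PCA inverse transformation is:

X_recovered = X_d-proj · Wd^T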

3.8 Incremental PCA

One problem with the preceding implementation of PCA is that it requires the whole training set to fit in memory in order for the SVD algorithm to run. Fortunately, Incremental PCA (IPCA) algorithms have been developed: you can split the training set into mini-batches and feed the IPCA algorithm one mini-batch at a time. This is useful for large training sets, and it also makes it possible to apply PCA online (i.e., on the fly, as new instances arrive).

The following code splits the MNIST dataset into 100 mini-batches (using NumPy's array_split() function) and feeds them to Scikit-Learn's IncrementalPCA class to reduce the dimensionality of the MNIST dataset down to 154 dimensions (just like before). Note that you must call the partial_fit() method with each mini-batch, rather than the fit() method with the whole training set.
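
A sketch under the same X_train assumption:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Assumption: X_train is the MNIST training set as an (m, 784) array.
n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)   # feed the algorithm one mini-batch at a time

X_reduced = inc_pca.transform(X_train)
```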

Alternatively, you can use NumPy's memmap class, which lets you manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs, when it needs it. Since the IncrementalPCA class uses only a small part of the array at any given time, memory usage remains under control. This makes it possible to call the usual fit() method, as shown in the following code:
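
A sketch; the filename "my_mnist.data" and the variables m and n (number of instances and features) are placeholders, assuming the training set was previously saved to disk as raw float32 values:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Placeholder file and shape: adjust to however your training set was saved.
X_mm = np.memmap("my_mnist.data", dtype="float32", mode="r", shape=(m, n))

batch_size = m // 100
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)   # the usual fit() works here: only small chunks are loaded at a time
```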

3.9 Randomized PCA

Scikit-Learn offers yet another option for performing PCA, called Randomized PCA. It is a stochastic algorithm that quickly finds an approximation of the first d principal components, and it is much faster than the previous algorithms.
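
In current versions of Scikit-Learn this is exposed through the svd_solver hyperparameter of the PCA class (older versions had a separate RandomizedPCA class); a sketch with the same X_train assumption:

```python
from sklearn.decomposition import PCA

rnd_pca = PCA(n_components=154, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train)
```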

IV. Kernel PCA

In an earlier series we discussed the kernel trick, a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines. Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space.

It turns out that the same trick can be applied to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction. This is called Kernel PCA (kPCA). It is often good at preserving clusters of instances after projection, and it can sometimes even unroll datasets that lie close to a twisted manifold.

For example, the following code uses Scikit-Learn's KernelPCA class to perform kPCA with an RBF kernel (for more details on the RBF kernel and other kernels, see the earlier series articles):
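
A sketch that builds a Swiss roll with Scikit-Learn's make_swiss_roll helper; the dataset parameters and the gamma value are illustrative assumptions:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

# Illustrative Swiss roll dataset
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
```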

The figure below shows the Swiss roll reduced to two dimensions using a linear kernel (equivalent to simply using the PCA class), an RBF kernel, and a sigmoid kernel (logistic).

V. LLE (Locally Linear Embedding)

Locally Linear Embedding (LLE) is another very powerful nonlinear dimensionality reduction (NLDR) technique. It is a manifold learning technique that does not rely on projections like the previous algorithms. In a nutshell, LLE works by first measuring how each training instance linearly relates to its closest neighbors (c.n.), and then looking for a low-dimensional representation of the training set where these local relationships are best preserved. This makes it particularly good at unrolling twisted manifolds, especially when there is not too much noise.

For example, the following code uses Scikit-Learn's LocallyLinearEmbedding class to unroll the Swiss roll. The resulting 2D dataset is shown in the figure below. As you can see, the Swiss roll is completely unrolled and the distances between instances are well preserved locally. However, the left part of the unrolled Swiss roll is squeezed, while the right part is stretched. Nevertheless, LLE does a pretty good job of modeling the manifold.
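
A sketch, again with an illustrative Swiss roll and hyperparameter values:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_reduced = lle.fit_transform(X)
```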

VI. Summary

In this issue we started from the curse of dimensionality, learned about the two main approaches to dimensionality reduction, projection and manifold learning, and then studied Principal Component Analysis, Kernel PCA, and LLE. We hope this gives you a more detailed understanding of dimensionality reduction and helps you apply it to your own projects.

 

(To learn more about related topics, you are welcome to join the Smart Algorithm community: send "社区" (community) to the "智能算法" public account to join the algorithm WeChat and QQ groups.)

Code keyword for this issue: dim_redu


Origin blog.csdn.net/x454045816/article/details/92113989