The Relationship Between Linear Algebra and Deep Learning

  Linear algebra, probability and mathematical statistics, and optimization are the three mathematical pillars of machine learning. They are the prerequisite courses for studying neural networks.

  Neural network visualization: alexlenail.me/NN-SVG. A handwritten digit recognition demo: http://www.teachyourmachine.com.

  Linear algebra includes several special types of matrices: symmetric, orthogonal, triangular, banded, permutation, and circulant matrices. Symmetric positive definite matrices are the most rewarding to study in depth. A positive definite matrix $S$ has positive eigenvalues $\lambda$ and orthonormal eigenvectors $q$. It is a combination $S = \lambda_1 q_1 q_1^T + \lambda_2 q_2 q_2^T + \dots$ of simple rank-one projections $qq^T$ onto those eigenvectors. If $\lambda_1 \geq \lambda_2 \geq \dots$, then $\lambda_1 q_1 q_1^T$ is the most important part of $S$. For a sample covariance matrix, that part has the greatest variance.
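
  The rank-one decomposition above is easy to check numerically. Here is a minimal NumPy sketch (my own illustration, not code from the book) that builds a small positive definite $S$, computes its eigenvalues and orthonormal eigenvectors, and reconstructs $S$ from the rank-one pieces $\lambda_i q_i q_i^T$:

```python
# Minimal sketch (not from the book): spectral decomposition of a symmetric
# positive definite matrix S = λ1 q1 q1ᵀ + λ2 q2 q2ᵀ + ...
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
S = A @ A.T + 4 * np.eye(4)             # symmetric positive definite by construction

eigvals, Q = np.linalg.eigh(S)          # eigh returns orthonormal eigenvectors in Q
order = np.argsort(eigvals)[::-1]       # reorder so that λ1 >= λ2 >= ...
eigvals, Q = eigvals[order], Q[:, order]

# Rebuild S as a sum of rank-one projections λ_i q_i q_iᵀ
S_rebuilt = sum(lam * np.outer(q, q) for lam, q in zip(eigvals, Q.T))
print(np.allclose(S, S_rebuilt))        # True: the rank-one pieces reconstruct S
```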

  Chapter I covers the SVD (singular value decomposition). Chapter II covers efficient computation methods for large matrices.

Chapter I   In our lifetimes, the most important step has been to extend those ideas from symmetric matrices to all matrices. Now we need two sets of singular vectors, the u's and the v's. Singular values $\sigma$ replace eigenvalues $\lambda$. The decomposition $A = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$ remains correct (this is the SVD). With decreasing $\sigma$'s, those rank-one pieces of $A$ still come in order of importance. That Eckart-Young Theorem about $A$ complements what we have long known about the symmetric matrix $A^TA$: for a rank-$k$ approximation, stop at $\sigma_k u_k v_k^T$.
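
To make the Eckart-Young statement concrete, here is a small NumPy sketch (an illustration of mine, not from the text) that truncates the SVD after $k$ terms and checks that the 2-norm error of the best rank-$k$ approximation is $\sigma_{k+1}$:

```python
# Sketch of the SVD A = σ1 u1 v1ᵀ + σ2 u2 v2ᵀ + ... and the rank-k truncation.
# Sizes are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s is sorted: σ1 >= σ2 >= ...

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # keep the k largest rank-one pieces

# Eckart-Young: A_k is the best rank-k approximation; its 2-norm error is σ_{k+1}
print(np.linalg.norm(A - A_k, 2), s[k])            # these two numbers agree
```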

The ideas in Chapter I become algorithms in Chapter II. For quite large matrices, the $\sigma$'s and u's and v's are computable. For very large matrices, we resort to randomization: sample the columns and the rows. For wide classes of big matrices this works well.
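
As a rough illustration of the randomized idea (this is a standard randomized range-finder sketch under my own assumptions, not necessarily the exact algorithm of Chapter II), multiplying $A$ by a random test matrix samples its column space, and a small SVD finishes the job:

```python
# Rough sketch of randomized low-rank approximation; details vary in practice.
import numpy as np

def randomized_svd(A, k, oversample=10, seed=None):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal basis for the sampled range
    B = Q.T @ A                                       # small (k + oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(2)
A = rng.standard_normal((2000, 20)) @ rng.standard_normal((20, 300))  # a rank-20 matrix
U, s, Vt = randomized_svd(A, k=20, seed=2)
print(np.linalg.norm(A - U @ np.diag(s) @ Vt))        # near zero: the sketch captured A
```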

III-IV   Chapter III focuses on low rank matrices, and Chapter IV on many important examples. We are looking for properties that make the computations especially fast (in III) or especially useful (in IV). The Fourier matrix is fundamental for every problem with constant coefficients (not changing with position). That discrete transform is superfast because of the FFT: the Fast Fourier Transform.
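
To see why the FFT matters, here is a tiny NumPy comparison (my own illustration): applying the dense $n \times n$ Fourier matrix costs $O(n^2)$ operations, while `np.fft.fft` computes the same transform in $O(n \log n)$:

```python
# Small sketch: the dense Fourier matrix versus the FFT.
import numpy as np

n = 512
rows, cols = np.meshgrid(np.arange(n), np.arange(n))
F = np.exp(-2j * np.pi * rows * cols / n)   # the n x n discrete Fourier matrix

x = np.random.default_rng(3).standard_normal(n)
print(np.allclose(F @ x, np.fft.fft(x)))    # True: same transform, far fewer operations
```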

Chapter V explains, as simply as possible, the statistics we need. The central ideas are always mean and variance: the average and the spread around that average. Usually we can reduce the mean to zero by a simple shift. Reducing the variance (the uncertainty) is the real problem. For random vectors and matrices and tensors, that problem becomes deeper. It is understood that the linear algebra of statistics is essential to machine learning.
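
The mean shift and the spread are easy to write down. The sketch below (illustrative data of my own, not the book's) centers a small data matrix and forms its sample covariance, whose diagonal holds the variances:

```python
# Sketch of the two central ideas: shift to zero mean, then measure the spread
# with the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 3)) @ np.diag([2.0, 1.0, 0.2])   # 500 samples, 3 features

X_centered = X - X.mean(axis=0)                    # reduce the mean to zero by a shift
S = X_centered.T @ X_centered / (len(X) - 1)       # sample covariance matrix
print(np.diag(S))                                  # variances: the spread in each direction
```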

VI   Chapter VI presents two types of optimization problems. First come the nice problems of linear and quadratic programming and game theory. Duality and saddle points are key ideas. But the goals of deep learning and of this book are elsewhere: very large problems with a structure that is as simple as possible. "Derivative equals zero" is still the fundamental equation. The second derivatives that Newton would have used are too numerous and too complicated to compute. Even using all the data (when we take a descent step to reduce the loss) is often impossible. That is why we choose only a minibatch of input data in each step of stochastic gradient descent.
The success of large scale learning comes from the wonderful fact that randomization often produces reliability when there are thousands or millions of variables.
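
Here is a minimal sketch of minibatch stochastic gradient descent on a least-squares loss (an illustrative problem, step size, and batch size of my own choosing, not the book's code): each step uses only a small random sample of the data, yet the iterates still converge:

```python
# Minimal sketch of stochastic gradient descent with minibatches.
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((10000, 20))
x_true = rng.standard_normal(20)
b = A @ x_true + 0.01 * rng.standard_normal(10000)   # noisy linear measurements

x = np.zeros(20)
lr, batch = 0.01, 32
for step in range(2000):
    idx = rng.integers(0, len(A), size=batch)        # a random minibatch, not all the data
    grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch  # gradient of the minibatch loss
    x -= lr * grad                                   # one descent step
print(np.linalg.norm(x - x_true))                    # small: SGD recovers x_true
```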

VII   Chapter VII begins with the architecture of a neural net. An input layer is connected to hidden layers and finally to the output layer. For the training data, input vectors v are known. Also the correct outputs are known (often w is the correct classification of v). We optimize the weights a in the learning function F so that F(a, v) is close to w for almost every training input v.
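
As a picture in code, here is a hedged sketch (my own, not the book's notation) of a learning function F with one hidden layer: the weights a are the matrices and biases, ReLU is the nonlinearity, and training would adjust a so that F(a, v) comes close to the correct w:

```python
# Minimal sketch of a learning function F(a, v): one hidden layer with ReLU,
# weights a = (W1, b1, W2, b2). Shapes are illustrative only.
import numpy as np

rng = np.random.default_rng(6)
W1, b1 = 0.5 * rng.standard_normal((16, 4)), np.zeros(16)   # input layer -> hidden layer
W2, b2 = 0.5 * rng.standard_normal((3, 16)), np.zeros(3)    # hidden layer -> output layer

def F(v):
    """Forward pass: the output that training tries to push close to the correct w."""
    hidden = np.maximum(0, W1 @ v + b1)   # ReLU nonlinearity at the hidden layer
    return W2 @ hidden + b2               # output layer

v = rng.standard_normal(4)                # one training input
print(F(v))                               # training adjusts the weights so F(v) ≈ w
```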


Reprinted from: blog.csdn.net/weixin_47532216/article/details/121557377