[Machine Learning] Linear Algebra Basics (A Detailed Summary)


Scalars, Vectors, Matrices and Tensors

  • Scalar $(\rm scalar)$: a single number, usually written as an italic lowercase letter. When defining a scalar we usually state which set it belongs to; for example, a natural-number scalar is introduced as "let $n \in \mathbb{N}$ denote the number of elements".
  • Vector $(\rm vector)$: an ordered list of numbers; each individual number can be located by its index. Vectors are usually written as bold italic lowercase letters, and their type is also stated. If every element belongs to $\mathbb{R}$ and the vector has $n$ elements, then the vector lies in the set formed by the $n$-fold Cartesian product of $\mathbb{R}$, denoted $\mathbb{R}^n$, i.e. $\boldsymbol x \in \mathbb{R}^n$:

$$\boldsymbol x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

You can think of a vector as a point in space, with each element being its coordinate along a different coordinate axis.

  • Matrix $(\rm matrix)$: a two-dimensional array in which each element is uniquely identified by two indices. Matrices are usually written as bold italic uppercase letters, such as $\boldsymbol A$. If a matrix has height $m$ and width $n$, we write $\boldsymbol A \in \mathbb{R}^{m \times n}$. $A_{m,n}$ denotes the element in row $m$ and column $n$ of the matrix; $A_{i,:}$ denotes the elements of row $i$, and $A_{:,j}$ denotes the elements of column $j$.

$$\boldsymbol A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}$$

Sometimes we need to operate on a matrix element by element; $f(\boldsymbol A)_{i,j}$ denotes the function $f$ applied to the element $A_{i,j}$.

  • Tensor $(\rm tensor)$: an array with more than two dimensions. A tensor is written as $\mathsf{A}$, and $\mathsf{A}_{i,j,k}$ denotes one of its elements.
import numpy as np
# scalar
s = 5
print(s)
# vector
v = np.array([1, 2])
print(v)
# matrix
m = np.array([[1, 3], [2, 4]])
print(m)
# tensor (a 3-dimensional array)
t = np.array([
    [[1,2,3], [2,3,4]],
    [[3,4,5], [4,5,6]],
    [[5,6,7], [6,7,8]]
])
print(t)
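A quick way to make these four objects concrete is to look at NumPy's ndim and shape attributes. This is a minimal sketch mirroring the arrays above (the scalar is wrapped in np.array here only so that it carries ndim/shape):

import numpy as np

s = np.array(5)                 # scalar: 0-dimensional array
v = np.array([1, 2])            # vector: 1-dimensional array
m = np.array([[1, 3], [2, 4]])  # matrix: 2-dimensional array
t = np.ones((3, 2, 3))          # tensor: 3-dimensional array, same shape as t above

for name, a in [("scalar", s), ("vector", v), ("matrix", m), ("tensor", t)]:
    print(name, a.ndim, a.shape)
# scalar 0 ()
# vector 1 (2,)
# matrix 2 (2, 2)
# tensor 3 (3, 2, 3)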

Matrix transpose

Matrix transposition flips the matrix along its main diagonal. The transpose of a matrix $\boldsymbol A$ is written $\boldsymbol A^{\rm T}$ and is defined by
$$(\boldsymbol A^{\rm T})_{i,j} = A_{j,i}$$
A vector can be viewed as a matrix with a single column, and its transpose as a matrix with a single row. A scalar can be viewed as a matrix with one row and one column, so its transpose equals itself: $a^{\rm T} = a$.

A = np.array([[1, 2, 3],[1, 0, 2]])
A_t = A.transpose()

Matrix addition

Addition is element-wise and requires the two matrices to have the same shape:
$$\boldsymbol C = \boldsymbol A + \boldsymbol B, \quad C_{i,j} = A_{i,j} + B_{i,j}$$
When a scalar multiplies a matrix, or is added to a matrix, it is simply multiplied with or added to every element of the matrix:
$$\boldsymbol D = a \cdot \boldsymbol B + c, \quad \text{where } D_{i,j} = a \cdot B_{i,j} + c$$

A = np.array([
    [1, 2, 3], [4, 5, 6]
])

B = np.array([
    [1, 2, 3], [4, 5, 6]
])

C = A + B
print(C)
# [[ 2  4  6]
#  [ 8 10 12]]

D = 2 * B + 1
print(D)
# [[ 3  5  7]
#  [ 9 11 13]]

Multiplying Matrices and Vectors

Vector inner product: multiply the corresponding elements of two vectors and add them up.
$$\boldsymbol a \cdot \boldsymbol b = |\boldsymbol a|\,|\boldsymbol b|\cos\theta, \quad \theta \text{ is the angle between the two vectors}$$
From this expression we can see that, geometrically, the inner product measures the projection of $\boldsymbol a$ onto the direction of $\boldsymbol b$: if the modulus of $\boldsymbol b$ is $1$, then $\boldsymbol a \cdot \boldsymbol b = |\boldsymbol a|\cos\theta$. This observation carries over to matrix multiplication, and we will use it below when discussing matrix linear transformations.

Matrix multiplication: the product of the matrices $\boldsymbol A_{m \times p}$ and $\boldsymbol B_{p \times n}$ can be written as
$$\boldsymbol C_{m \times n} = \boldsymbol A_{m \times p}\,\boldsymbol B_{p \times n}$$
The number of columns of $\boldsymbol A$ must equal the number of rows of $\boldsymbol B$, and the resulting matrix $\boldsymbol C$ has $m$ rows and $n$ columns. Concretely,
$$C_{i,j} = \sum_{k} A_{i,k} B_{k,j}$$
This matrix product is not the element-wise product of the two matrices, although that operation also exists; it is called the element-wise product, or Hadamard product, and is written $\boldsymbol A \odot \boldsymbol B$.

Matrix multiplication is distributive and associative, but not commutative. The dot product of two vectors, however, is commutative:
$$\boldsymbol x^{\rm T}\boldsymbol y = \boldsymbol y^{\rm T}\boldsymbol x$$
Here $\boldsymbol x$ has $n$ rows and $1$ column, so its transpose has $1$ row and $n$ columns; multiplying by $\boldsymbol y$ ($n$ rows, $1$ column) gives a $1 \times 1$ result, the same as on the right-hand side.
The transpose of a matrix product is
$$(\boldsymbol{AB})^{\rm T} = \boldsymbol B^{\rm T}\boldsymbol A^{\rm T}$$
The dot product of two vectors is a scalar, and the transpose of a scalar is itself, so the commutativity above also follows from this rule:
$$\boldsymbol x^{\rm T}\boldsymbol y = (\boldsymbol x^{\rm T}\boldsymbol y)^{\rm T} = \boldsymbol y^{\rm T}\boldsymbol x$$
With the matrix product in hand, we can write a system of linear equations compactly as $\boldsymbol{Ax} = \boldsymbol b$, where $\boldsymbol A \in \mathbb{R}^{m \times n}$ is a known matrix with $m$ rows and $n$ columns, $\boldsymbol x \in \mathbb{R}^n$ is the unknown vector of $n$ entries to be solved for, and $\boldsymbol b \in \mathbb{R}^m$ is a known vector.

A = np.array([
    [1, 2, 3], [4, 5, 6]
])

A1 = np.array([
    [1, 2, 3], [2, 3, 4]
])

B = np.array([
    [1, 2, 3], [3, 4, 5], [5, 6, 7]
])

C = A.dot(B)  # matrix (dot) product
C = np.dot(A, B)
# C = np.dot(B, A)  # matrix multiplication is not commutative

print("Matrix (dot) product: \n", C)
# [[22 28 34]
#  [49 64 79]]

print("Element-wise (Hadamard) product: \n", np.multiply(A, A1))
print("Element-wise (Hadamard) product: \n", A * A1)
# [[1  4  9]
#  [8 15 24]]

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v = v1.dot(v2)  # vector inner product; the result is a scalar
v = np.dot(v1, v2)
v = np.dot(v2, v1)  # the dot product of vectors is commutative
print("Vector inner product: ", v)
# 32

Inverse and Identity Matrix

The identity matrix $(\rm identity\ matrix)$ has ones on its main diagonal and zeros everywhere else; multiplying any matrix by the identity leaves it unchanged. The $n \times n$ identity matrix is written $\boldsymbol I_n \in \mathbb{R}^{n \times n}$.
The matrix inverse of $\boldsymbol A$ is written $\boldsymbol A^{-1}$ and satisfies
$$\boldsymbol A^{-1}\boldsymbol A = \boldsymbol I_n$$
Therefore, from $\boldsymbol{Ax} = \boldsymbol b$ we can obtain $\boldsymbol x = \boldsymbol A^{-1}\boldsymbol b$.

A = [[1.0,2.0],[3.0,4.0]]
A_inv = np.linalg.inv(A)
print(A_inv)
# [[-2.   1. ]
#  [ 1.5 -0.5]]
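In practice, forming the inverse explicitly is rarely the best way to solve $\boldsymbol{Ax} = \boldsymbol b$; a linear solver is usually preferred. A minimal sketch, with a right-hand side $\boldsymbol b$ made up purely for illustration:

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 2.0])  # illustrative right-hand side

x = np.linalg.solve(A, b)                       # solves Ax = b directly
print(x)                                        # approximately [0.  0.5]
print(np.allclose(np.linalg.inv(A).dot(b), x))  # True: same solution as x = A^{-1} b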

Linear Dependence and Span

Linear dependence: a set of vectors $\boldsymbol a = [\boldsymbol a_1, \boldsymbol a_2, \cdots, \boldsymbol a_n]$ is linearly dependent if at least one of the vectors can be written as a linear combination of the others. Equivalently, there exist numbers $k_1, k_2, \cdots, k_n$, not all zero, such that
$$k_1\boldsymbol a_1 + k_2\boldsymbol a_2 + \cdots + k_n\boldsymbol a_n = \boldsymbol 0$$
holds. Otherwise the vectors are linearly independent. Mutually orthogonal non-zero vectors are always linearly independent.

If the inverse $\boldsymbol A^{-1}$ exists, then $\boldsymbol{Ax} = \boldsymbol b$ has a solution for every vector $\boldsymbol b$. In general, however, a system of equations may have no solution, or infinitely many solutions, for some values of $\boldsymbol b$. It cannot have more than one but fewer than infinitely many solutions, because if $\boldsymbol x$ and $\boldsymbol y$ are both solutions, then
$$\boldsymbol z = \alpha\boldsymbol x + (1 - \alpha)\boldsymbol y$$
is also a solution, for any $\alpha$.

For $\boldsymbol A^{-1}$ to exist, the matrix must be square and all of its column vectors must be linearly independent. A square matrix whose column vectors are linearly dependent is called singular. If $\boldsymbol A$ is not square, or is a singular square matrix, the system may still have a solution, but it cannot be found by matrix inversion.
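A simple numerical way to check whether the columns of a matrix are linearly dependent is to compare its rank with the number of columns. A small sketch with an arbitrarily chosen matrix:

import numpy as np

A = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [1.0, 2.0, 3.0]
])  # the second column is exactly twice the first, so the columns are linearly dependent

rank = np.linalg.matrix_rank(A)
print(rank)                # 2
print(rank == A.shape[1])  # False: rank < number of columns, so A is singular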

Norm

The norm $(\rm norm)$ measures the size of a vector. Formally, the $L^p$ norm is defined as
$$\left\| \boldsymbol x \right\|_p = \left(\sum_i |x_i|^p\right)^{\frac{1}{p}}$$
where $p \in \mathbb{R}$, $p \ge 1$.

A norm is a function that maps a vector to a non-negative value. Intuitively, the norm of a vector $\boldsymbol x$ measures the distance from the origin to the point $\boldsymbol x$.

$L^2$ norm: also called the Euclidean norm; $\|\boldsymbol x\|_2$ is the Euclidean distance from the origin. The squared $L^2$ norm, $\boldsymbol x^{\rm T}\boldsymbol x$, is also often used to measure a vector and is more convenient to compute with: each component of its gradient with respect to $\boldsymbol x$ depends only on the corresponding component of $\boldsymbol x$, whereas each component of the gradient of the $L^2$ norm itself depends on the whole vector $\boldsymbol x$.
$$\left\| \boldsymbol x \right\|_2 = \sqrt{\sum_i x_i^2}$$
$L^1$ norm: the $L^2$ norm is not suitable in every situation. It grows very slowly near the origin, so it is a poor choice when we need to distinguish elements that are exactly $0$ from elements that are small but non-zero. The $L^1$ norm is a better choice there; it grows at the same rate in every direction. It is defined as
$$\left\| \boldsymbol x \right\|_1 = \sum_i \left| x_i \right|$$
$L^{\infty}$ norm: the maximum absolute value of the elements of the vector, also called the max norm $(\rm max\ norm)$:
$$\left\| \boldsymbol x \right\|_{\infty} = \max_i |x_i|$$
The norm most commonly used to measure a matrix in machine learning is the Frobenius norm $(\rm Frobenius\ norm)$, defined as
$$\left\| \boldsymbol A \right\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$$
which is analogous to the $L^2$ norm of a vector. The dot product of two vectors can also be written in terms of norms:
$$\boldsymbol x^{\rm T}\boldsymbol y = \left\| \boldsymbol x \right\|_2 \left\| \boldsymbol y \right\|_2 \cos\theta$$

v = np.array([1, 2, 3, 4])
print("L1 norm of v: ", np.linalg.norm(v, ord=1))
print("L2 norm of v: ", np.linalg.norm(v, ord=2))
print("L-infinity norm of v: ", np.linalg.norm(v, ord=np.inf))
m = np.array([
    [1, 2],
    [3, 4]
])
print("Frobenius norm of m: ", np.linalg.norm(m, ord="fro"))

# L1 norm of v:  10.0
# L2 norm of v:  5.477225575051661
# L-infinity norm of v:  4.0
# Frobenius norm of m:  5.477225575051661
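To connect the last identity back to the inner product, here is a small sketch that recovers $\cos\theta$ from two arbitrary example vectors:

import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

cos_theta = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_theta)                         # approximately 0.7071 (= cos 45 degrees)
print(np.degrees(np.arccos(cos_theta)))  # approximately 45.0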

Linear Spaces

  • Definition: let $V$ be a non-empty set and $\mathbf{R}$ the field of real numbers. Suppose that for any two elements $\boldsymbol\alpha, \boldsymbol\beta \in V$ there is always a unique element $\boldsymbol\gamma \in V$ corresponding to them, called the sum of $\boldsymbol\alpha$ and $\boldsymbol\beta$ and written $\boldsymbol\gamma = \boldsymbol\alpha + \boldsymbol\beta$; and that for any number $\lambda \in \mathbf{R}$ and any element $\boldsymbol\alpha \in V$ there is always a unique element $\boldsymbol\delta \in V$ corresponding to them, called the product of $\lambda$ and $\boldsymbol\alpha$ and written $\boldsymbol\delta = \lambda\boldsymbol\alpha$. These two operations must satisfy the following rules:

  1. Commutativity: $\boldsymbol\alpha + \boldsymbol\beta = \boldsymbol\beta + \boldsymbol\alpha$;

  2. Associativity: $(\boldsymbol\alpha + \boldsymbol\beta) + \boldsymbol\gamma = \boldsymbol\alpha + (\boldsymbol\beta + \boldsymbol\gamma)$;

  3. There is a zero element $\boldsymbol 0 \in V$ such that $\boldsymbol 0 + \boldsymbol\alpha = \boldsymbol\alpha$ for every $\boldsymbol\alpha \in V$;

  4. There are negative elements: for every $\boldsymbol\alpha \in V$ there exists a negative element $\boldsymbol\beta \in V$ such that $\boldsymbol\alpha + \boldsymbol\beta = \boldsymbol 0$;

  5. Identity: $1 \cdot \boldsymbol\alpha = \boldsymbol\alpha$;

  6. Associativity of scalar multiplication: $\lambda(\mu\boldsymbol\alpha) = (\lambda\mu)\boldsymbol\alpha$;

  7. Distributivity over scalar addition: $(\lambda + \mu)\boldsymbol\alpha = \lambda\boldsymbol\alpha + \mu\boldsymbol\alpha$;

  8. Distributivity over element addition: $\lambda(\boldsymbol\alpha + \boldsymbol\beta) = \lambda\boldsymbol\alpha + \lambda\boldsymbol\beta$.

    Then $V$ is called a linear space (or vector space), and the elements of $V$ are collectively called vectors.

  • Basis of a linear space: if a linear space $V$ contains $n$ elements $[\boldsymbol\alpha_1, \boldsymbol\alpha_2, \cdots, \boldsymbol\alpha_n]$ that are linearly independent, and every element $\boldsymbol\alpha \in V$ can be written as a linear combination of $[\boldsymbol\alpha_1, \boldsymbol\alpha_2, \cdots, \boldsymbol\alpha_n]$, then $[\boldsymbol\alpha_1, \boldsymbol\alpha_2, \cdots, \boldsymbol\alpha_n]$ is called a basis of the linear space $V$, and $n$ is the dimension of the space.

  • Coordinates: let $[\boldsymbol\alpha_1, \boldsymbol\alpha_2, \cdots, \boldsymbol\alpha_n]$ be a basis of the linear space $V$. For any element $\boldsymbol\alpha \in V$ there is an ordered set of numbers $(x_1, x_2, x_3, \cdots, x_n)$ such that
    $$\boldsymbol\alpha = x_1\boldsymbol\alpha_1 + x_2\boldsymbol\alpha_2 + \cdots + x_n\boldsymbol\alpha_n$$
    $(x_1, x_2, x_3, \cdots, x_n)^{\rm T}$ is called the coordinate vector of $\boldsymbol\alpha$ in the basis $[\boldsymbol\alpha_1, \boldsymbol\alpha_2, \cdots, \boldsymbol\alpha_n]$, written
    $$\boldsymbol\alpha = (x_1, x_2, \cdots, x_n)^{\rm T}$$

  • Basis transformation: let $[\boldsymbol\alpha_1, \boldsymbol\alpha_2, \cdots, \boldsymbol\alpha_n]$ and $[\boldsymbol\beta_1, \boldsymbol\beta_2, \cdots, \boldsymbol\beta_n]$ both be bases of the linear space, with
    $$\left\{\begin{array}{l} \boldsymbol\beta_1 = p_{11}\boldsymbol\alpha_1 + p_{21}\boldsymbol\alpha_2 + \cdots + p_{n1}\boldsymbol\alpha_n \\ \boldsymbol\beta_2 = p_{12}\boldsymbol\alpha_1 + p_{22}\boldsymbol\alpha_2 + \cdots + p_{n2}\boldsymbol\alpha_n \\ \vdots \\ \boldsymbol\beta_n = p_{1n}\boldsymbol\alpha_1 + p_{2n}\boldsymbol\alpha_2 + \cdots + p_{nn}\boldsymbol\alpha_n \end{array}\right.$$
    which is equivalent to
    $$\begin{bmatrix} \boldsymbol\beta_1 \\ \boldsymbol\beta_2 \\ \vdots \\ \boldsymbol\beta_n \end{bmatrix} = \boldsymbol P^{\rm T} \begin{bmatrix} \boldsymbol\alpha_1 \\ \boldsymbol\alpha_2 \\ \vdots \\ \boldsymbol\alpha_n \end{bmatrix}$$
    or
    $$(\boldsymbol\beta_1, \boldsymbol\beta_2, \cdots, \boldsymbol\beta_n) = (\boldsymbol\alpha_1, \boldsymbol\alpha_2, \cdots, \boldsymbol\alpha_n)\,\boldsymbol P$$
    The formula above is called the basis (coordinate) transformation formula, and $\boldsymbol P$ is called the transition matrix. Because $(\boldsymbol\beta_1, \boldsymbol\beta_2, \cdots, \boldsymbol\beta_n)$ is linearly independent, the matrix $\boldsymbol P$ is invertible.

  • Coordinate transformation: suppose a vector $\boldsymbol a$ in the linear space $V$ has coordinates $(x_1, x_2, x_3, \cdots, x_n)^{\rm T}$ in the basis $[\boldsymbol\alpha_1, \boldsymbol\alpha_2, \cdots, \boldsymbol\alpha_n]$ and coordinates $(x_1', x_2', x_3', \cdots, x_n')^{\rm T}$ in the basis $[\boldsymbol\beta_1, \boldsymbol\beta_2, \cdots, \boldsymbol\beta_n]$, where the two bases satisfy the basis-transformation relation above. Then
    $$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \boldsymbol P \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix}, \quad \text{or} \quad \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix} = \boldsymbol P^{-1} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$
    Proof: let $\boldsymbol x = (x_1, x_2, \cdots, x_n)^{\rm T}$, $\boldsymbol x' = (x_1', x_2', \cdots, x_n')^{\rm T}$, $\mathrm{A} = [\boldsymbol\alpha_1, \boldsymbol\alpha_2, \cdots, \boldsymbol\alpha_n]$ and $\mathrm{B} = [\boldsymbol\beta_1, \boldsymbol\beta_2, \cdots, \boldsymbol\beta_n]$ (here $\mathrm{A}$ is a capital alpha, not the Latin letter $A$). Then
    $$\mathrm{A}\boldsymbol x = \boldsymbol a = \mathrm{B}\boldsymbol x' = \mathrm{A}\boldsymbol P\boldsymbol x' \;\Rightarrow\; \boldsymbol x = \boldsymbol P\boldsymbol x'$$
    which establishes the formula above.

  • Linear mapping (transformation): let $V_n$ and $U_m$ be two linear spaces and let $T$ be a mapping from $V_n$ to $U_m$. If the mapping $T$ satisfies:

    $\forall\, \boldsymbol\alpha_1, \boldsymbol\alpha_2 \in V_n$: $T(\boldsymbol\alpha_1 + \boldsymbol\alpha_2) = T(\boldsymbol\alpha_1) + T(\boldsymbol\alpha_2)$;

    $\forall\, \boldsymbol\alpha \in V_n, \lambda \in \mathbb{R}$: $T(\lambda\boldsymbol\alpha) = \lambda T(\boldsymbol\alpha)$,

    then $T$ is called a linear mapping, or linear transformation, from $V_n$ to $U_m$.

    The geometric meaning of a matrix transformation: to describe a vector precisely, we must fix the basis of the vector and find the projection of the vector onto each component of the basis. Suppose we work in the two-dimensional Cartesian coordinate system, with $\boldsymbol x = (1, 0)^{\rm T}$ and $\boldsymbol y = (0, 1)^{\rm T}$ as a basis of the space. Written as a matrix, this basis is
    $$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
    According to the geometric meaning of the inner product, the vector $(4, 2)$ in this coordinate system is expressed as
    $$(4, 2)\,\boldsymbol x = 4, \quad (4, 2)\,\boldsymbol y = 2, \quad \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}^{\rm T} \begin{pmatrix} 4 \\ 2 \end{pmatrix} = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$$
    We see that by computing the projection of a vector onto each component of the basis, we obtain its coordinates in the space defined by that basis. We usually want the basis vectors to have modulus $1$: then, by the geometric meaning of the inner product, the projections directly give the coordinates of the vector in that basis.
    Similarly, suppose we want to map $m$ $n$-dimensional vectors into a space represented by a new basis consisting of $k$ $n$-dimensional vectors. Write $\boldsymbol X_{n \times m} = [\boldsymbol x_1, \boldsymbol x_2, \cdots, \boldsymbol x_m]$, where each $\boldsymbol x_i$ is an $n$-dimensional column vector, and $\boldsymbol P_{n \times k} = [\boldsymbol p_1, \boldsymbol p_2, \cdots, \boldsymbol p_k]$, where each $\boldsymbol p_i$ is an $n$-dimensional column vector and each column is a basis vector. Then
    $$\boldsymbol P_{n \times k}^{\rm T}\boldsymbol X_{n \times m} = \begin{bmatrix} \boldsymbol p_1^{\rm T} \\ \boldsymbol p_2^{\rm T} \\ \vdots \\ \boldsymbol p_k^{\rm T} \end{bmatrix} \begin{bmatrix} \boldsymbol x_1, \boldsymbol x_2, \cdots, \boldsymbol x_m \end{bmatrix} = \begin{bmatrix} \boldsymbol p_1^{\rm T}\boldsymbol x_1 & \boldsymbol p_1^{\rm T}\boldsymbol x_2 & \cdots & \boldsymbol p_1^{\rm T}\boldsymbol x_m \\ \boldsymbol p_2^{\rm T}\boldsymbol x_1 & \boldsymbol p_2^{\rm T}\boldsymbol x_2 & \cdots & \boldsymbol p_2^{\rm T}\boldsymbol x_m \\ \vdots & \vdots & \ddots & \vdots \\ \boldsymbol p_k^{\rm T}\boldsymbol x_1 & \boldsymbol p_k^{\rm T}\boldsymbol x_2 & \cdots & \boldsymbol p_k^{\rm T}\boldsymbol x_m \end{bmatrix}_{k \times m} = \boldsymbol Z_{k \times m}$$
    From this expression, the resulting matrix $\boldsymbol Z_{k \times m}$ is the matrix $\boldsymbol X$ after transformation by the basis $\boldsymbol P$, where each $\boldsymbol p_i^{\rm T}\boldsymbol x_j$ is a scalar. Matrix multiplication therefore amounts to transforming every column vector of the right-hand matrix into the space whose basis is given by the row vectors of the left-hand matrix. The dimensionality of the transformed matrix may change, which is how matrix-based dimensionality reduction works. To map the transformed matrix back to the space of the original basis, simply left-multiply it by $\boldsymbol P^{-1}$. $\boldsymbol P$ is usually assumed to be an orthonormal basis, which makes the computation more convenient, since we can use the properties $\boldsymbol P^{-1}\boldsymbol P = \boldsymbol I$, $\boldsymbol P^{\rm T}\boldsymbol P = \boldsymbol I$, $\boldsymbol P^{-1} = \boldsymbol P^{\rm T}$. However, a coordinate transformation is possible as long as the column vectors of $\boldsymbol P$ are linearly independent. A small numerical sketch of this basis change follows below.
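A small numerical sketch of the basis change $\boldsymbol Z = \boldsymbol P^{\rm T}\boldsymbol X$ just described, using a 45-degree rotation as an assumed orthonormal basis and a few arbitrary column vectors:

import numpy as np

theta = np.pi / 4
P = np.array([                      # columns are the new (orthonormal) basis vectors
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta),  np.cos(theta)]
])

X = np.array([                      # three 2-dimensional column vectors
    [4.0, 1.0, 0.0],
    [2.0, 1.0, 3.0]
])

Z = P.T.dot(X)                      # coordinates of the columns of X in the basis P
X_back = P.dot(Z)                   # map back to the original basis (here P^{-1} = P^T)
print(np.allclose(X, X_back))       # True: nothing is lost when k = n and P is invertible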

Special Matrices and Vectors

  • Diagonal matrix $(\rm diagonal\ matrix)$: contains non-zero elements only on the main diagonal; all other entries are zero. The identity matrix is a diagonal matrix. We use $\mathrm{diag}(\boldsymbol v)$ to denote the diagonal square matrix whose diagonal elements are given by the entries of the vector $\boldsymbol v$.

  • Multiplying by a diagonal matrix is very efficient: to compute $\mathrm{diag}(\boldsymbol v)\,\boldsymbol x$, simply scale each element $x_i$ of $\boldsymbol x$ by $v_i$.

  • Not all diagonal matrices are square; rectangular matrices can be diagonal as well. Diagonal matrices that are not square do not have an inverse, but matrix multiplication can still be computed efficiently.

  • A symmetric matrix $(\rm symmetric\ matrix)$ is a matrix whose transpose equals itself: $\boldsymbol A = \boldsymbol A^{\rm T}$.

  • A unit vector $(\rm unit\ vector)$ is a vector with unit norm, i.e. its $L^2$ norm equals $1$.

  • Orthogonality: two vectors are orthogonal if their dot product is $0$. If both vectors have non-zero norm, the angle between them is $90^{\circ}$. In $\mathbb{R}^n$, at most $n$ vectors with non-zero norm can be mutually orthogonal. If such vectors are not only mutually orthogonal but also all have norm $1$, they are called orthonormal.

  • Orthogonal matrix $(\rm orthogonal\ matrix)$: a square matrix whose row vectors and column vectors are each orthonormal, i.e.
    $$\boldsymbol A^{\rm T}\boldsymbol A = \boldsymbol A\boldsymbol A^{\rm T} = \boldsymbol I, \qquad \boldsymbol A^{-1} = \boldsymbol A^{\rm T}$$
    Inverting an orthogonal matrix is therefore cheap, and $|\det\boldsymbol A| = 1$.

    Suppose $\boldsymbol A$ is written in terms of its column vectors, $\boldsymbol A = [\boldsymbol x_1, \boldsymbol x_2, \boldsymbol x_3, \cdots, \boldsymbol x_n]$.

    Then
    $$\boldsymbol A^{\rm T}\boldsymbol A = \begin{bmatrix} \boldsymbol x_1^{\rm T} \\ \boldsymbol x_2^{\rm T} \\ \boldsymbol x_3^{\rm T} \\ \vdots \\ \boldsymbol x_n^{\rm T} \end{bmatrix} \begin{bmatrix} \boldsymbol x_1, \boldsymbol x_2, \boldsymbol x_3, \cdots, \boldsymbol x_n \end{bmatrix} = \begin{bmatrix} \boldsymbol x_1^{\rm T}\boldsymbol x_1 & \boldsymbol x_1^{\rm T}\boldsymbol x_2 & \boldsymbol x_1^{\rm T}\boldsymbol x_3 & \cdots & \boldsymbol x_1^{\rm T}\boldsymbol x_n \\ \boldsymbol x_2^{\rm T}\boldsymbol x_1 & \boldsymbol x_2^{\rm T}\boldsymbol x_2 & \boldsymbol x_2^{\rm T}\boldsymbol x_3 & \cdots & \boldsymbol x_2^{\rm T}\boldsymbol x_n \\ \boldsymbol x_3^{\rm T}\boldsymbol x_1 & \boldsymbol x_3^{\rm T}\boldsymbol x_2 & \boldsymbol x_3^{\rm T}\boldsymbol x_3 & \cdots & \boldsymbol x_3^{\rm T}\boldsymbol x_n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \boldsymbol x_n^{\rm T}\boldsymbol x_1 & \boldsymbol x_n^{\rm T}\boldsymbol x_2 & \boldsymbol x_n^{\rm T}\boldsymbol x_3 & \cdots & \boldsymbol x_n^{\rm T}\boldsymbol x_n \end{bmatrix} = \boldsymbol I = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}$$
    From this derivation, $\boldsymbol x_i^{\rm T}\boldsymbol x_j = 0$ for all $i \neq j$ and $\boldsymbol x_i^{\rm T}\boldsymbol x_j = 1$ for $i = j$: if the column vectors of a matrix are unit vectors and pairwise orthogonal, then the matrix is orthogonal. The same holds for the row vectors.
    The $n$ row vectors (or column vectors) of an $n$-th order orthogonal matrix form an orthonormal basis of the vector space $\mathbb{R}^n$.

      v = np.array([1, 2, 3])
      print("Diagonal matrix with v on the diagonal: \n", np.diag(v))
      # [[1 0 0]
      #  [0 2 0]
      #  [0 0 3]]
    
      v2 = np.arange(1, 10).reshape(3, 3)
      print(v2)
      # [[1 2 3]
      #  [4 5 6]
      #  [7 8 9]]
      print("Diagonal elements of a 2-D matrix: \n", np.diag(v2))
      # Diagonal elements of a 2-D matrix:
      #  [1 5 9]
    
  • Similar matrices: let $\boldsymbol A$ and $\boldsymbol B$ both be matrices of order $n$. If there exists an invertible matrix $\boldsymbol P$ such that $\boldsymbol P^{-1}\boldsymbol A\boldsymbol P = \boldsymbol B$, then $\boldsymbol B$ is said to be similar to $\boldsymbol A$ (the two matrices are similar). In plain terms, two matrices are similar when they represent the same linear transformation under different bases.
    Proof: suppose there is a linear transformation $T$ and two different bases $\boldsymbol\alpha = [\boldsymbol\alpha_1, \boldsymbol\alpha_2, \cdots, \boldsymbol\alpha_n]$ and $\boldsymbol\beta = [\boldsymbol\beta_1, \boldsymbol\beta_2, \cdots, \boldsymbol\beta_n]$, related by the transition matrix $\boldsymbol P$ introduced in "basis transformation" above. Then
    $$\boldsymbol\beta = \boldsymbol\alpha\boldsymbol P, \quad T(\boldsymbol\alpha) = \boldsymbol\alpha\boldsymbol A, \quad T(\boldsymbol\beta) = \boldsymbol\beta\boldsymbol B$$
    $$\Rightarrow\; T(\boldsymbol\alpha\boldsymbol P) = \boldsymbol\beta\boldsymbol B \;\Rightarrow\; T(\boldsymbol\alpha)\boldsymbol P = \boldsymbol\alpha\boldsymbol P\boldsymbol B \;\Rightarrow\; \boldsymbol\alpha\boldsymbol A\boldsymbol P = \boldsymbol\alpha\boldsymbol P\boldsymbol B \;\Rightarrow\; \boldsymbol A = \boldsymbol P\boldsymbol B\boldsymbol P^{-1} \;\Rightarrow\; \boldsymbol P^{-1}\boldsymbol A\boldsymbol P = \boldsymbol B$$
    which completes the proof.

    The same linear transformation, written in different bases, gives matrices that are similar to each other.

    There is also a more intuitive way to prove it.

    Proof: suppose the vector $\boldsymbol a$ in the linear space $V$ has coordinates $\boldsymbol x = (x_1, x_2, x_3, \cdots, x_n)^{\rm T}$ in the basis $\boldsymbol\alpha = [\boldsymbol\alpha_1, \boldsymbol\alpha_2, \cdots, \boldsymbol\alpha_n]$ and coordinates $\boldsymbol x' = (x_1', x_2', x_3', \cdots, x_n')^{\rm T}$ in the basis $\boldsymbol\beta = [\boldsymbol\beta_1, \boldsymbol\beta_2, \cdots, \boldsymbol\beta_n]$, where the two bases satisfy $\boldsymbol\beta = \boldsymbol\alpha\boldsymbol P$.

    • Starting in the basis $\boldsymbol\beta$, $\boldsymbol x'$ is turned by $\boldsymbol P$ into the coordinate vector in $\boldsymbol\alpha$: $\boldsymbol P\boldsymbol x'$;
    • In the basis $\boldsymbol\alpha$, the linear transformation is carried out by the matrix $\boldsymbol A$: $\boldsymbol{AP}\boldsymbol x'$;
    • Via $\boldsymbol P^{-1}$ the result is turned back into the coordinate vector in $\boldsymbol\beta$: $\boldsymbol P^{-1}\boldsymbol{AP}\boldsymbol x'$;
    • Therefore, in the basis $\boldsymbol\beta$, $\boldsymbol B\boldsymbol x' = \boldsymbol P^{-1}\boldsymbol{AP}\boldsymbol x'$;
    • Hence $\boldsymbol B = \boldsymbol P^{-1}\boldsymbol{AP}$.

    This completes the proof.
    Alternatively, under the same assumptions,
    $$\boldsymbol{Ax} = \boldsymbol P\boldsymbol B\boldsymbol x', \quad \boldsymbol x = \boldsymbol P\boldsymbol x' \;\Rightarrow\; \boldsymbol{AP}\boldsymbol x' = \boldsymbol P\boldsymbol B\boldsymbol x' \;\Rightarrow\; \boldsymbol B = \boldsymbol P^{-1}\boldsymbol{AP}$$
    Geometric interpretation:
    (Figure: the same vector $\boldsymbol a$ drawn in the two bases $\boldsymbol\alpha$ and $\boldsymbol\beta$, which are related by the transition matrix $\boldsymbol P$.)

    As the figure shows, $\boldsymbol x$ and $\boldsymbol x'$ are the coordinates of the same vector $\boldsymbol a$ in different bases; they are essentially the same object, just seen from different angles. $\boldsymbol{Ax}$ is the coordinate vector of $\boldsymbol x$ after the linear transformation by the matrix in the basis $\boldsymbol\alpha$, and $\boldsymbol{Bx'}$ is the coordinate vector of $\boldsymbol x'$ after the linear transformation by the matrix in the basis $\boldsymbol\beta$; the coordinate spaces defined by the two bases are connected by the transition matrix $\boldsymbol P$, so that $\boldsymbol{Ax} = \boldsymbol P\boldsymbol{Bx'}$. A numerical check of this relation is sketched below.
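A minimal numerical sketch of this relation, with an invertible transition matrix $\boldsymbol P$ and a transformation $\boldsymbol A$ chosen arbitrarily for illustration:

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])          # the transformation written in the basis alpha
P = np.array([[1.0, 1.0],
              [0.0, 1.0]])          # invertible transition matrix (beta = alpha P)

B = np.linalg.inv(P).dot(A).dot(P)  # the similar matrix B = P^{-1} A P

x_prime = np.array([1.0, 2.0])      # coordinates of some vector in the basis beta
x = P.dot(x_prime)                  # the same vector's coordinates in the basis alpha

# applying A in alpha and applying B in beta describe the same transformation:
print(np.allclose(A.dot(x), P.dot(B.dot(x_prime))))  # True, i.e. Ax = P B x'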

Eigenvectors and Eigenvalues

The essence of matrix multiplication is space transformation.

Eigenvector: geometrically, a vector whose direction does not change under the space transformation. It indicates the direction of the transformation; for a stretching transformation, for example, it is the direction of the stretch.

Eigenvalue: geometrically, the degree of the space transformation. It indicates how strong the transformation is, for example by how many times the space is scaled in a given direction.

Eigendecomposition: decomposes a matrix into a set of eigenvectors and eigenvalues. Suppose $\boldsymbol A$ is an $n$-th order matrix. If a number $\lambda$ and an $n$-dimensional non-zero vector $\boldsymbol x$ satisfy
$$\boldsymbol{Ax} = \lambda\boldsymbol x \quad (\boldsymbol x \neq \boldsymbol 0)$$
then $\lambda$ is called an eigenvalue of the matrix $\boldsymbol A$, and the non-zero vector $\boldsymbol x$ is called an eigenvector of $\boldsymbol A$.

To understand the formula intuitively: the vector $\boldsymbol x$, after the spatial linear transformation by the matrix $\boldsymbol A$, is only stretched or shrunk along its own direction and is otherwise unchanged. This is the real meaning of eigenvectors and eigenvalues. Solving for the eigenvalues and eigenvectors of a matrix means finding which vectors the matrix only rescales (the eigenvectors) and by how much (the eigenvalues), so that further analysis can be carried out on the extracted eigenvectors.

Geometric interpretation:

(Figure: the unit square spanned by $\boldsymbol i$ and $\boldsymbol j$ is stretched by the matrix $\boldsymbol A$ into a rectangle; $\boldsymbol i$ and $\boldsymbol j$ keep their directions.)

As can be seen from the figure, the square on the left is linearly transformed by a matrix $\boldsymbol A$ into the rectangle on the right, yet the vectors $\boldsymbol i$ and $\boldsymbol j$ are only rescaled along their own directions; clearly, they are eigenvectors of $\boldsymbol A$. The transformation is
$$\boldsymbol A = \begin{bmatrix} 2 & 0 \\ 0 & 4 \end{bmatrix}$$

$$\boldsymbol A \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 4 \end{bmatrix} = 4\begin{bmatrix} 0 \\ 1 \end{bmatrix} \;\Rightarrow\; \boldsymbol j' = 4\boldsymbol j$$

$$\boldsymbol A \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} = 2\begin{bmatrix} 1 \\ 0 \end{bmatrix} \;\Rightarrow\; \boldsymbol i' = 2\boldsymbol i$$

so $2$ and $4$ are the corresponding eigenvalues.

Another way to state eigendecomposition: suppose the matrix $\boldsymbol A$ has $n$ linearly independent eigenvectors $\boldsymbol V = [\boldsymbol v^{(1)}, \boldsymbol v^{(2)}, \cdots, \boldsymbol v^{(n)}]$ with corresponding eigenvalues $\boldsymbol\lambda = [\lambda^{(1)}, \lambda^{(2)}, \cdots, \lambda^{(n)}]^{\rm T}$. Then the eigendecomposition of $\boldsymbol A$ can be written as
$$\boldsymbol A = \boldsymbol V\,\mathrm{diag}(\boldsymbol\lambda)\,\boldsymbol V^{-1}$$
Not every matrix has an eigendecomposition, and in some cases the eigendecomposition of a real matrix involves complex eigenvalues and eigenvectors.

m = np.array(
    [[1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]])

print("Eigenvalues: \n", np.linalg.eigvals(m))
eigvals, eigvecs = np.linalg.eig(m)
print("Eigenvalues: \n", eigvals)
print("Eigenvectors: \n", eigvecs)

# Eigenvalues: 
#  [ 1.61168440e+01 -1.11684397e+00 -1.30367773e-15]
# Eigenvalues: 
#  [ 1.61168440e+01 -1.11684397e+00 -1.30367773e-15]
# Eigenvectors: 
#  [[-0.23197069 -0.78583024  0.40824829]
#  [-0.52532209 -0.08675134 -0.81649658]
#  [-0.8186735   0.61232756  0.40824829]]
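As a quick check of $\boldsymbol A = \boldsymbol V\,\mathrm{diag}(\boldsymbol\lambda)\,\boldsymbol V^{-1}$, here is a minimal sketch using the diagonal example matrix from the figure above (any diagonalizable matrix would work):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 4.0]])

eigvals, V = np.linalg.eig(A)     # the columns of V are the eigenvectors
A_rebuilt = V.dot(np.diag(eigvals)).dot(np.linalg.inv(V))

print(eigvals)                    # [2. 4.] (order may vary)
print(np.allclose(A, A_rebuilt))  # True: V diag(lambda) V^{-1} reconstructs A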

Singular Value Decomposition

Singular value decomposition $(\rm SVD)$ factors a matrix into singular vectors $(\rm singular\ vector)$ and singular values $(\rm singular\ value)$. It provides some of the same kinds of information as eigendecomposition, but is more widely applicable: every real matrix has a singular value decomposition, while not every matrix has an eigendecomposition. For example, a non-square matrix has no eigendecomposition, and in that case only SVD can be used.

The singular value decomposition has the form
$$\boldsymbol A = \boldsymbol U\boldsymbol\Sigma\boldsymbol V^{\rm T}$$
If $\boldsymbol A$ is an $m \times n$ matrix, then $\boldsymbol U$ is $m \times m$ and its columns are called the left singular vectors, $\boldsymbol V$ is $n \times n$ and its columns are called the right singular vectors, and $\boldsymbol\Sigma$ is an $m \times n$ diagonal matrix whose diagonal elements are the singular values of $\boldsymbol A$.

Singular values often correspond to the important information hidden in a matrix, and their importance is positively correlated with their magnitude.

In image processing, singular values can be used not only for data compression but also for denoising. If an image contains noise, we have reason to believe that the smallest singular values are due to the noise; forcing those small singular values to 0 removes the noise from the picture.

The main application areas include:

  • Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI);
  • Recommender systems, arguably its most valuable application;
  • Compression of matrix-form data (mainly image data).

A singular matrix must be square: if the determinant of a square matrix is 0, the matrix is singular, otherwise it is non-singular. A non-singular matrix is invertible, and an invertible matrix is non-singular. If the matrix $\boldsymbol A$ is non-singular, then $\boldsymbol{Ax} = \boldsymbol 0$ has only the zero solution, and $\boldsymbol{Ax} = \boldsymbol b$ has a unique solution.

m = np.array([
    [1, 2, 3, 4],
    [4, 5, 6, 8]
])
U, Sigma, V = np.linalg.svd(m)  # note: NumPy returns Sigma as a 1-D array and V already transposed (V^T)

print("Left singular vector matrix U: \n", U)
print("Singular values (diagonal of Sigma): \n", Sigma)
print("Right singular vector matrix (V^T as returned by NumPy): \n", V)

# Left singular vector matrix U: 
#  [[-0.41523775 -0.90971293]
#  [-0.90971293  0.41523775]]

# Singular values (diagonal of Sigma): 
#  [13.04656085  0.88727114]

# Right singular vector matrix (V^T as returned by NumPy): 
#  [[-0.31074009 -0.41229564 -0.51385119 -0.68513492]
#  [ 0.84668379  0.28938495 -0.26791389 -0.35721851]
#  [-0.25916053  0.51832106  0.54670221 -0.60439705]
#  [-0.34554737  0.69109474 -0.60439705  0.19413726]]
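Following the compression idea described above, here is a minimal sketch of a rank-$k$ reconstruction from the truncated SVD ($k$ is chosen arbitrarily for illustration):

import numpy as np

m = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0, 8.0]
])
U, Sigma, Vt = np.linalg.svd(m, full_matrices=False)

k = 1                                            # keep only the largest singular value
m_approx = U[:, :k].dot(np.diag(Sigma[:k])).dot(Vt[:k, :])

print(m_approx)                                  # best rank-1 approximation of m
print(np.linalg.norm(m - m_approx, ord="fro"))   # error equals the dropped singular value, about 0.887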

An Application of Linear Algebra: Principal Component Analysis $(\rm PCA)$

Dimensionality reduction: reducing high-dimensional feature data to a lower dimension, keeping the more important dimensions while removing noise and unimportant information, so as to speed up data processing. Although reduction comes at the cost of losing some information, it preserves as much of the important information in the original data as possible; this saves computation time and makes the data set easier to use and to understand.

Principal Component Analysis $(\rm Principal\ Component\ Analysis)$: the main idea of the algorithm is to reduce the original data from $n$-dimensional features to $k$ dimensions. These $k$ orthogonal features, called principal components, are $k$-dimensional features reconstructed from the original $n$-dimensional ones. The main job of PCA is to find $k$ pairwise-orthogonal basis vectors in the space such that, after the original data are mapped (transformed) onto this new basis, they become $k$-dimensional features. The choice of the new basis depends on the distribution of the original data.

The question: how should the $k$ orthogonal basis vectors be chosen so that the transformed data retain as much of the original information as possible? How can the information loss be minimized? Put plainly: in which directions should the original data be projected so as to preserve the most information?

Initial idea: after projection, the data should be as spread out as possible along the projection direction, and the spread is measured by the variance. This covers only the reduction from two dimensions to one. For higher-dimensional data, say from three dimensions to two, the first projection direction is certainly the one with the largest variance after projection, but the second direction cannot simply be the direction of largest variance again; otherwise the two projection directions would coincide and the data could not be distinguished. To capture more information, we want the variance after projecting onto the second direction to be as large as possible while keeping that direction linearly uncorrelated with the first one (if the directions are mutually orthogonal, there is no correlation, which is exactly the best choice of projection directions we are after).

$\star\star\star$ In linear-algebra terms: compute the covariance matrix of the original data (its diagonal elements are the variances of each dimension), perform an eigendecomposition of that covariance matrix, select the $k$ largest eigenvalues (variances) and their corresponding eigenvectors, and use these $k$ eigenvectors as the new basis. Projecting the original data onto this basis reduces the features from $n$ dimensions to $k$.

Principle: before defining the optimization objective, we need the two properties a hyperplane should have in order to represent the sample points appropriately in the orthogonal attribute space:

  • Minimal reconstruction error: every sample point is close enough to the hyperplane;
  • Maximal separability: the projections of the sample points onto the hyperplane are as spread out as possible.

Hence PCA has two equivalent derivations. From the minimal-reconstruction-error viewpoint:

Suppose there are $m$ $n$-dimensional column vectors $\boldsymbol X = [\boldsymbol x^{(1)}, \boldsymbol x^{(2)}, \cdots, \boldsymbol x^{(m)}]$ that have already been centered, i.e. $\sum_{i=1}^{m}\boldsymbol x^{(i)} = \boldsymbol 0$. To reduce the data from $n$ dimensions to $k$, assume the projection matrix is $\boldsymbol P = [\boldsymbol p_1, \boldsymbol p_2, \cdots, \boldsymbol p_k]$, where $\boldsymbol P$ is an orthonormal basis with $\|\boldsymbol p_i\|_2 = 1$ and $\boldsymbol p_i^{\rm T}\boldsymbol p_j = 0$ for $i \neq j$.

By the coordinate projection formula above, $\boldsymbol z^{(i)} = \boldsymbol P^{\rm T}\boldsymbol x^{(i)}$ gives the coordinates $\boldsymbol z^{(i)} = [z_1^{(i)}, z_2^{(i)}, \cdots, z_k^{(i)}]^{\rm T}$ of the original data point $\boldsymbol x^{(i)}$ in the new basis $\boldsymbol P$. Reconstructing it in terms of the basis vectors gives $\hat{\boldsymbol x}^{(i)} = \boldsymbol P\boldsymbol z^{(i)}$:
$$\begin{bmatrix} \boldsymbol p_1 & \boldsymbol p_2 & \cdots & \boldsymbol p_k \end{bmatrix} \begin{bmatrix} z_1^{(i)} \\ z_2^{(i)} \\ \vdots \\ z_k^{(i)} \end{bmatrix} = z_1^{(i)}\boldsymbol p_1 + z_2^{(i)}\boldsymbol p_2 + \cdots + z_k^{(i)}\boldsymbol p_k$$

This is easy to picture in a two-dimensional Cartesian coordinate system: suppose we know the coordinates of the vector $\boldsymbol x = (3, 2)^{\rm T}$ and the basis vectors $\boldsymbol P = [\boldsymbol p_1, \boldsymbol p_2]$ with $\boldsymbol p_1 = [1, 0]^{\rm T}$ and $\boldsymbol p_2 = [0, 1]^{\rm T}$; then, expressed in terms of the basis vectors, $\boldsymbol x = 3\boldsymbol p_1 + 2\boldsymbol p_2$.

Our optimization objective is therefore to minimize $\sum_{i=1}^{m}\left\|\hat{\boldsymbol x}^{(i)} - \boldsymbol x^{(i)}\right\|_2$:
$$\min \sum_{i=1}^{m}\left\|\hat{\boldsymbol x}^{(i)} - \boldsymbol x^{(i)}\right\|_2 \iff \min \sum_{i=1}^{m}\left\|\hat{\boldsymbol x}^{(i)} - \boldsymbol x^{(i)}\right\|_2^2$$

$$\begin{aligned} \sum_{i=1}^{m}\left\|\hat{\boldsymbol x}^{(i)} - \boldsymbol x^{(i)}\right\|_2^2 &= \sum_{i=1}^{m} (\hat{\boldsymbol x}^{(i)})^{\rm T}(\hat{\boldsymbol x}^{(i)}) - 2(\hat{\boldsymbol x}^{(i)})^{\rm T}(\boldsymbol x^{(i)}) + (\boldsymbol x^{(i)})^{\rm T}(\boldsymbol x^{(i)}) \\ &= \sum_{i=1}^{m} (\boldsymbol P\boldsymbol z^{(i)})^{\rm T}(\boldsymbol P\boldsymbol z^{(i)}) - 2(\boldsymbol P\boldsymbol z^{(i)})^{\rm T}(\boldsymbol x^{(i)}) + (\boldsymbol x^{(i)})^{\rm T}(\boldsymbol x^{(i)}) \\ &= \sum_{i=1}^{m} (\boldsymbol z^{(i)})^{\rm T}(\boldsymbol P^{\rm T}\boldsymbol P)(\boldsymbol z^{(i)}) - 2(\boldsymbol z^{(i)})^{\rm T}\boldsymbol P^{\rm T}(\boldsymbol x^{(i)}) + (\boldsymbol x^{(i)})^{\rm T}(\boldsymbol x^{(i)}) \\ &= \sum_{i=1}^{m} (\boldsymbol z^{(i)})^{\rm T}(\boldsymbol z^{(i)}) - 2(\boldsymbol z^{(i)})^{\rm T}(\boldsymbol z^{(i)}) + (\boldsymbol x^{(i)})^{\rm T}(\boldsymbol x^{(i)}) \\ &= -\sum_{i=1}^{m} (\boldsymbol z^{(i)})^{\rm T}(\boldsymbol z^{(i)}) + \sum_{i=1}^{m} (\boldsymbol x^{(i)})^{\rm T}(\boldsymbol x^{(i)}) \\ &= -\mathrm{tr}\left(\boldsymbol P^{\rm T}\left(\sum_{i=1}^{m} (\boldsymbol x^{(i)})(\boldsymbol x^{(i)})^{\rm T}\right)\boldsymbol P\right) + \sum_{i=1}^{m} (\boldsymbol x^{(i)})^{\rm T}(\boldsymbol x^{(i)}) \\ &= -\mathrm{tr}\left(\boldsymbol P^{\rm T}\boldsymbol{XX}^{\rm T}\boldsymbol P\right) + \sum_{i=1}^{m} (\boldsymbol x^{(i)})^{\rm T}(\boldsymbol x^{(i)}) \end{aligned}$$

The step from the fifth to the sixth line uses the fact that $\sum_{i=1}^{m}(\boldsymbol z^{(i)})^{\rm T}(\boldsymbol z^{(i)})$ is the sum of the diagonal elements of $\boldsymbol P^{\rm T}\left(\sum_{i=1}^{m}(\boldsymbol x^{(i)})(\boldsymbol x^{(i)})^{\rm T}\right)\boldsymbol P$, so it can be written as a trace; the last line then uses $\sum_{i=1}^{m}(\boldsymbol x^{(i)})(\boldsymbol x^{(i)})^{\rm T} = \boldsymbol X\boldsymbol X^{\rm T}$.

Observing this expression, $\sum_{i=1}^{m}(\boldsymbol x^{(i)})^{\rm T}(\boldsymbol x^{(i)})$ is a constant, and $\sum_{i=1}^{m}(\boldsymbol x^{(i)})(\boldsymbol x^{(i)})^{\rm T} = \boldsymbol X\boldsymbol X^{\rm T}$ is (up to a constant factor) the covariance matrix of the centered data, so minimizing the objective is equivalent to
$$\arg\min_{\boldsymbol P} -\mathrm{tr}\left(\boldsymbol P^{\rm T}\boldsymbol{XX}^{\rm T}\boldsymbol P\right) \quad \mathrm{s.t.} \quad \boldsymbol P^{\rm T}\boldsymbol P = \boldsymbol I$$

Commonly used trace identities and derivative formulas (used below):
$$\begin{aligned} \mathrm{tr}(\boldsymbol A) &= \sum_i A_{i,i} \\ \mathrm{tr}(\boldsymbol{AB}) &= \mathrm{tr}(\boldsymbol{BA}) \\ \mathrm{tr}(\boldsymbol A) &= \mathrm{tr}(\boldsymbol A^{\rm T}) \\ \mathrm{tr}(\boldsymbol{ABC}) &= \mathrm{tr}(\boldsymbol{CAB}) = \mathrm{tr}(\boldsymbol{BCA}) \\ \frac{\partial\,\mathrm{tr}(\boldsymbol{AB})}{\partial\boldsymbol A} &= \frac{\partial\,\mathrm{tr}(\boldsymbol{BA})}{\partial\boldsymbol A} = \boldsymbol B^{\rm T} \end{aligned}$$
A quick numerical check of the cyclic property is sketched below.
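A minimal numerical check of the cyclic property $\mathrm{tr}(\boldsymbol{AB}) = \mathrm{tr}(\boldsymbol{BA})$ with arbitrary random matrices:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))

print(np.trace(A.dot(B)))      # tr(AB), from a 3x3 product
print(np.trace(B.dot(A)))      # tr(BA), from a 5x5 product, yet the same value
print(np.isclose(np.trace(A.dot(B)), np.trace(B.dot(A))))  # True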

Optimizing the objective above with a Lagrange multiplier gives
$$J(\boldsymbol P) = -\mathrm{tr}\left(\boldsymbol P^{\rm T}\boldsymbol{XX}^{\rm T}\boldsymbol P + \lambda(\boldsymbol P^{\rm T}\boldsymbol P - \boldsymbol I)\right)$$
Taking the partial derivative with respect to $\boldsymbol P$ and setting it to zero,
$$\frac{\partial J}{\partial\boldsymbol P} = -\boldsymbol{XX}^{\rm T}\boldsymbol P + \lambda\boldsymbol P = 0 \iff \boldsymbol{XX}^{\rm T}\boldsymbol P = \lambda\boldsymbol P$$
From this final expression and the eigendecomposition of a matrix, we conclude that $\boldsymbol P$ is a matrix made up of $k$ eigenvectors of $\boldsymbol{XX}^{\rm T}$, with $\lambda$ the corresponding eigenvalues. The problem therefore reduces to finding the $k$ largest eigenvalues of $\boldsymbol{XX}^{\rm T}$ and their corresponding eigenvectors, where $\boldsymbol{XX}^{\rm T}$ is (up to a constant factor) the covariance matrix of $\boldsymbol X$.

Finally, we only need to map the raw data into the space defined by $\boldsymbol P$, i.e.
$$\boldsymbol Z = \boldsymbol P^{\rm T}\boldsymbol X$$
$\boldsymbol Z$ is the data after dimensionality reduction.

Now consider the second derivation, based on maximal separability: the variance of the projected data should be as large as possible.

Under the same assumptions as in the previous derivation, any data point $\boldsymbol x^{(i)}$ has coordinates $\boldsymbol z^{(i)} = \boldsymbol P^{\rm T}\boldsymbol x^{(i)}$ under the projection matrix $\boldsymbol P$, and its contribution to the variance in the new coordinates is $(\boldsymbol z^{(i)})^{\rm T}(\boldsymbol z^{(i)})$ (the data are centered). The goal is to maximize the total variance over all data points, i.e.
$$\max \sum_{i=1}^{m}(\boldsymbol z^{(i)})^{\rm T}(\boldsymbol z^{(i)})$$

$$\begin{aligned} \sum_{i=1}^{m}(\boldsymbol z^{(i)})^{\rm T}(\boldsymbol z^{(i)}) &= \sum_{i=1}^{m}(\boldsymbol P^{\rm T}\boldsymbol x^{(i)})^{\rm T}(\boldsymbol P^{\rm T}\boldsymbol x^{(i)}) \\ &= \sum_{i=1}^{m}(\boldsymbol x^{(i)})^{\rm T}\boldsymbol P\boldsymbol P^{\rm T}(\boldsymbol x^{(i)}) \\ &= \mathrm{tr}\left(\boldsymbol P^{\rm T}\left(\sum_{i=1}^{m}(\boldsymbol x^{(i)})(\boldsymbol x^{(i)})^{\rm T}\right)\boldsymbol P\right) \\ &= \mathrm{tr}\left(\boldsymbol P^{\rm T}\boldsymbol{XX}^{\rm T}\boldsymbol P\right) \end{aligned}$$

The expression in the objective is a sum of variances: each $(\boldsymbol z^{(i)})^{\rm T}(\boldsymbol z^{(i)})$ is a scalar, equal to the sum of the diagonal elements of the matrix $(\boldsymbol z^{(i)})(\boldsymbol z^{(i)})^{\rm T}$, which is what relates it to the trace of a matrix. That is,
$$(\boldsymbol z^{(i)})^{\rm T}(\boldsymbol z^{(i)}) = \mathrm{tr}\left((\boldsymbol z^{(i)})(\boldsymbol z^{(i)})^{\rm T}\right)$$
so that
$$\sum_{i=1}^{m}(\boldsymbol z^{(i)})^{\rm T}(\boldsymbol z^{(i)}) = \mathrm{tr}\left(\boldsymbol Z\boldsymbol Z^{\rm T}\right) = \mathrm{tr}\left(\boldsymbol P^{\rm T}\boldsymbol X\boldsymbol X^{\rm T}\boldsymbol P\right)$$
This also makes the trace step in the previous derivation easier to see.

Therefore, maximizing the objective means maximizing $\mathrm{tr}\left(\boldsymbol P^{\rm T}\boldsymbol{XX}^{\rm T}\boldsymbol P\right)$, i.e.
$$\arg\max_{\boldsymbol P} \mathrm{tr}\left(\boldsymbol P^{\rm T}\boldsymbol{XX}^{\rm T}\boldsymbol P\right) \quad \mathrm{s.t.} \quad \boldsymbol P^{\rm T}\boldsymbol P = \boldsymbol I$$
which is equivalent to the objective in the previous derivation.
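A minimal NumPy sketch of the procedure derived above: center the data, eigendecompose the (scaled) covariance matrix $\boldsymbol{XX}^{\rm T}$, keep the top-$k$ eigenvectors as $\boldsymbol P$, and project $\boldsymbol Z = \boldsymbol P^{\rm T}\boldsymbol X$. The random data and the $1/m$ scaling are illustrative choices only (the scaling does not change the eigenvectors):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))        # m = 100 samples as 5-dimensional column vectors
X = X - X.mean(axis=1, keepdims=True)    # center the data so that sum_i x^(i) = 0

C = X.dot(X.T) / X.shape[1]              # covariance matrix (n x n)
eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric; eigenvalues come in ascending order

k = 2
P = eigvecs[:, ::-1][:, :k]              # top-k eigenvectors as the new basis (n x k)
Z = P.T.dot(X)                           # reduced data, k x m

print(Z.shape)                           # (2, 100)
print(eigvals[::-1][:k])                 # the k largest variances along the new axes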
