[Machine Learning] Sparse coding and matrix factorization


Dictionary Learning

Dictionary learning is a representation learning method that aims to represent high-dimensional data (such as images, audio, etc.) in a low-dimensional and sparse manner, while trying to retain the key information of the original data. Sparsity means that most of the coefficients are zero and only a few are non-zero. Such a representation can be more efficient, while capturing key information in the data and filtering out noise. Furthermore, sparse representations can also be used for compression, denoising, and other tasks when we have a proper dictionary.

Consider a data point $x$. We want to use a "dictionary" $D$ (a matrix in which each column is a basis element, or atom) and a sparse coefficient vector $\alpha$ to approximately represent this data point. Mathematically, $x \approx D\alpha$.

The reconstruction error is the difference between the actual data point $x$ and the data reconstructed from the dictionary and its corresponding sparse coefficient vector. Mathematically, this error can be expressed as $\|x - D\alpha\|^2$.

Our goal is to find a coefficient vector $\alpha$ that minimizes this error, that is:

$$\alpha^* = \argmin_{\alpha} \|x - D\alpha\|^2$$
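With a fixed dictionary and no sparsity constraint yet, this is an ordinary least-squares problem. A minimal numpy sketch (the dictionary and data point here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 50))   # dictionary: 20-dimensional data, 50 atoms
x = rng.normal(size=20)         # a single data point

# alpha* = argmin_alpha ||x - D alpha||^2  (plain least squares, no sparsity yet)
alpha, *_ = np.linalg.lstsq(D, x, rcond=None)

error = np.linalg.norm(x - D @ alpha) ** 2
print(error)   # with an overcomplete D, the residual is essentially zero
```

Note that this unconstrained solution is generally dense; it is the $\ell_1$ penalty introduced later that makes $\alpha$ sparse.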

For multiple data points $x_1, x_2, \dots, x_n$, we can similarly define a global reconstruction error; the goal is to find a common dictionary $D$ and a sparse representation of each data point. We stack all data points into a matrix $X$ and all representations into a matrix $R$, then minimize the overall error:

$$\argmin_{D \in \mathcal{D},\, R \in \mathcal{R}} \|X - DR\|_F^2$$

Here, $\mathcal{D}$ and $\mathcal{R}$ are constrained spaces of dictionaries and representations. For example, $\mathcal{D}$ may contain only matrices whose columns have unit norm, while $\mathcal{R}$ may contain only coefficient matrices satisfying a sparsity constraint.

$\|\cdot\|_F$ is the Frobenius norm, which measures the difference between two matrices. For any matrix $A \in \mathbb{R}^{m \times n}$, its Frobenius norm is defined as:

$$\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^2}$$
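As a quick numerical check, this is exactly what numpy's default matrix norm computes:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])

# Frobenius norm: square root of the sum of squared entries
fro_manual = np.sqrt(np.sum(np.abs(A) ** 2))
fro_numpy = np.linalg.norm(A, ord="fro")   # numpy's default matrix norm

print(fro_manual)   # sqrt(1 + 4 + 9 + 16) = sqrt(30) ≈ 5.477
```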

The goal of dictionary learning is to find an overcomplete dictionary D and a sparse representation R so as to minimize the reconstruction error.

An overcomplete dictionary D means that it has more columns than its rows, that is, the dictionary D contains more atoms (or bases) than the dimensions of the data. This means that it has multiple base elements to choose from to approximate the input data X. The sparsity of R ensures that most elements in R are zero or close to zero. This means that, although D provides many possible base elements, only a few will be activated or used in any particular representation. This not only makes the representation more concise and computationally efficient, but also helps avoid overfitting and makes it more interpretable.

To introduce sparsity, we can modify the optimization problem and add a regularization term:

$$\argmin_{D \in \mathcal{D},\, R \in \mathcal{R}} \|X - DR\|_F^2 + \lambda \|R\|_p^p$$

First, we define the $\ell_p$ norm. For any vector $\mathbf{v} \in \mathbb{R}^n$, its $\ell_p$ norm is defined as:

$$\|\mathbf{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{\frac{1}{p}}$$

Here, $|v_i|$ denotes the absolute value of the $i$-th element of the vector, and $p$ is a positive real number.

As we adjust the value of p, this norm will emphasize different properties of the vector.

The $\ell_0$ "norm" directly counts the number of non-zero elements; however, optimization problems involving the $\ell_0$ norm are NP-hard, so in practice we rarely optimize it directly. The $\ell_1$ norm is the tightest convex relaxation of the $\ell_0$ norm and is much easier to handle in optimization problems.

When p = 1, we have:

$$\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i|$$

This is exactly the sum of the absolute values of all elements of $\mathbf{v}$. Therefore, the $\ell_1$ norm is also called the sum of absolute values.
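A small numerical illustration of how these norms behave on a sparse vector (treating the $\ell_0$ count as a "norm"):

```python
import numpy as np

v = np.array([0.0, 3.0, 0.0, -4.0, 0.0])

l0 = np.count_nonzero(v)       # l0 "norm": number of non-zero entries
l1 = np.sum(np.abs(v))         # l1 norm: sum of absolute values
l2 = np.sqrt(np.sum(v ** 2))   # l2 norm: Euclidean length

print(l0, l1, l2)  # 2 7.0 5.0
```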

Consider the $\ell_1$ unit ball in $\mathbb{R}^2$. It is a rhombus with its vertices on the axes: $(1,0)$, $(-1,0)$, $(0,1)$, and $(0,-1)$.

Now, consider an optimization problem in which we minimize some loss function subject to an $\ell_1$ norm constraint. Assume that the contours of the loss function are elliptical. The solution is the first point at which a loss contour touches the $\ell_1$ unit ball.

Because of the geometry of the $\ell_1$ unit ball, its sharp corners make the first contact with the loss contours likely to occur at a corner. When the solution lies at a corner of the rhombus in $\mathbb{R}^2$, one of the coordinates is zero, so the solution is sparse.

In higher-dimensional spaces $\mathbb{R}^n$, the unit ball has even more corners and edges, which lie on coordinate axes or coordinate subspaces; solutions at these points have zero values in one or more dimensions, resulting in sparse solutions.

Therefore, by adding the regularizer $\|R\|_1 = \sum_{i,j} |R_{ij}|$ to our optimization problem, we encourage sparsity in $R$ so that many of its entries are zero, where $R_{ij}$ denotes the element of $R$ in the $i$-th row and $j$-th column.
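To see the sparsifying effect of the $\ell_1$ penalty concretely, here is a minimal ISTA (proximal gradient) sketch for the lasso subproblem. The dictionary, data point, $\lambda$, and iteration count are arbitrary placeholder choices, not a recommended setup:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t*||.||_1: shrinks each entry toward zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(1)
D = rng.normal(size=(30, 60))
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms
x = 2.0 * D[:, 0] - 1.5 * D[:, 7]     # a point built from just two atoms

lam = 0.1
step = 1.0 / np.linalg.norm(D.T @ D, 2)   # 1/L, L = largest eigenvalue of D^T D
alpha = np.zeros(60)
for _ in range(500):
    grad = D.T @ (D @ alpha - x)          # gradient of the smooth squared-error part
    alpha = soft_threshold(alpha - step * grad, step * lam)

print(np.count_nonzero(np.abs(alpha) > 1e-6))  # only a handful of atoms are active
```

The recovered coefficient vector concentrates its mass on the two atoms that generated $x$, with most other entries exactly zero, which is precisely the corner-of-the-ball geometry described above.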

Regularization and algorithm stability

Stability refers to how much the output of a learning algorithm changes when you make small modifications in the training data (such as changing or removing a sample). Stability is an important indicator of generalization performance because if the algorithm is very sensitive to small changes in the data, then it may be prone to overfitting.

To formalize this concept, consider two training sets $S$ and $S^i$, where $S^i$ is a variant of $S$ that differs only in the $i$-th sample. If the outputs an algorithm produces on these two training sets differ only slightly, we say the algorithm is stable.

Mathematically, stability can be defined as:

$$|\ell(X, Y, h_S) - \ell(X, Y, h_{S^i})| \leq \epsilon(n)$$

Here, $\ell$ is the loss function, and $h_S$ and $h_{S^i}$ are the outputs of the algorithm on the two data sets. If the bound $\epsilon(n)$ approaches zero as the number of training samples $n$ increases, the algorithm is uniformly stable.

Generalization error describes the expected error of a model on unseen data. More specifically, it is the difference between the model's average error over the entire data distribution and its error on the training data.

A common bound on generalization error is:

$$R(h_S) - R(h^*) \leq 2 \sup_{h \in H} |R(h) - R_S(h)|$$

Here, $R$ is the expected error over the entire data distribution, and $R_S$ is the average error on the training data. $h^*$ is the best possible model, while $H$ is the hypothesis space of all possible models.


Sparse algorithms are not stable

A learning algorithm is said to be stable if slight perturbations in the training data result in small changes in the output of the algorithm, and these changes vanish as the data set grows bigger and bigger.

Algorithmic Stability

We have two training sets:

$$S = \{(X_1, Y_1), \dots, (X_{i-1}, Y_{i-1}), (X_i, Y_i), (X_{i+1}, Y_{i+1}), \dots, (X_n, Y_n)\}$$

$$S^i = \{(X_1, Y_1), \dots, (X_{i-1}, Y_{i-1}), (X_i', Y_i'), (X_{i+1}, Y_{i+1}), \dots, (X_n, Y_n)\}$$

They differ in only one training example.

An algorithm is uniformly stable if, for any example $(X, Y)$:

$$|\ell(X, Y, h_S) - \ell(X, Y, h_{S^i})| \leq \epsilon(n)$$

Note that $\epsilon(n)$ vanishes as $n$ goes to infinity.

Generalisation Error

$$R(h_S) - \min_{h \in H} R(h) = R(h_S) - R(h^*) \leq 2 \sup_{h \in H} |R(h) - R_S(h)|$$


Define $D^*$ and $R^*$ to be a local minimum of $\argmin_{D \in \mathcal{D},\, R \in \mathcal{R}} \|X - DR\|_F^2$, which means $X \approx D^* R^*$.

But we can also find another matrix pair, such as $D^*A$ and $A^{-1}R^*$ for any invertible matrix $A$, that approximates $X$ equally well, because $D^*R^* = (D^*A)(A^{-1}R^*)$.

So although $(D^*, R^*)$ is a local minimum of the original problem, we can multiply by an invertible matrix $A$ and its inverse $A^{-1}$ to obtain different matrix pairs that produce the same product, and therefore the same reconstruction error. This is why the problem is non-convex.

Therefore, dictionary learning is usually solved by alternating optimization, that is, fixing one variable and optimizing the other. This approach is also known as block coordinate descent:

  1. Initialize the dictionary $D$.

  2. Fix $D$, optimize $R$:

    Use Lasso or another sparse coding method to solve the following problem:

    $$R^* = \argmin_R \|X - DR\|_F^2 + \lambda\|R\|_1$$

    where $\lambda$ is a regularization parameter that controls the sparsity of $R$.

  3. Fix $R$, optimize $D$:

    This step can be complicated because we want to find the $D$ that minimizes the reconstruction error. A common approach is to optimize $D$ with gradient-based methods, or to use other, more specialized techniques. The problem can be written as:

    $$D^* = \argmin_D \|X - DR\|_F^2$$

    possibly with constraints ensuring that the columns of $D$ have unit norm.

  4. Repeat steps 2 and 3 until the value of the objective function changes very little or other stopping criteria are met.

This alternating optimization method can give good results in most practical applications, although it may only find local optimal solutions rather than global optimal solutions. However, due to the non-convex nature of the problem, it is very difficult to find the global optimal solution.
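The four steps above can be sketched end-to-end in numpy. This is a toy implementation under simplifying assumptions (ISTA for the sparse-coding step, an unconstrained least-squares fit plus column renormalization for the dictionary step; all sizes and parameters are arbitrary), not a production algorithm:

```python
import numpy as np

def soft_threshold(Z, t):
    # Proximal operator of t*||.||_1, applied entrywise
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def dict_learn(X, k, lam=0.1, n_outer=20, n_inner=100, seed=0):
    """Toy alternating minimization: ISTA for R, least squares for D."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    D = rng.normal(size=(d, k))
    D /= np.linalg.norm(D, axis=0)                 # unit-norm columns
    R = np.zeros((k, n))
    for _ in range(n_outer):
        # Step 2: fix D, sparse-code R (ISTA on the lasso subproblem)
        step = 1.0 / np.linalg.norm(D.T @ D, 2)
        for _ in range(n_inner):
            R = soft_threshold(R - step * D.T @ (D @ R - X), step * lam)
        # Step 3: fix R, update D by least squares, then renormalize columns
        D = X @ np.linalg.pinv(R)
        norms = np.maximum(np.linalg.norm(D, axis=0), 1e-12)
        D /= norms
        R *= norms[:, None]                        # keeps the product D @ R unchanged
    return D, R

# Synthetic data generated from a ground-truth sparse model
rng = np.random.default_rng(42)
D_true = rng.normal(size=(15, 8))
D_true /= np.linalg.norm(D_true, axis=0)
R_true = np.where(rng.random((8, 100)) < 0.2, rng.normal(size=(8, 100)), 0.0)
X = D_true @ R_true

D, R = dict_learn(X, k=8)
rel = np.linalg.norm(X - D @ R) / np.linalg.norm(X)
print(rel)   # well below 1: the learned pair reconstructs X closely
```

In practice, more refined dictionary updates (such as K-SVD's per-atom updates, discussed next) replace the plain least-squares step.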


K-SVD
K-SVD imposes a hard sparsity constraint on each column of the representation: $\|R_i\|_0 \leq k'$, meaning that the number of non-zero elements in $R_i$ is at most $k'$, with $k' \ll k$.


Principal Components Analysis (PCA)

When we only care about the reconstruction error, the goal of dictionary learning becomes finding the best linear combination to represent the data.

In PCA, the dictionary D consists of the first k principal components, and the representation coefficient R is the projection of the data onto these principal components. If we do not enforce the sparsity of R and allow D to consist of the eigenvectors of the covariance matrix of the data, then the reconstruction from PCA is identical to that from dictionary learning. Therefore, PCA can be considered to be a special dictionary learning case that does not consider the sparsity of R, where D is composed of principal components.

The goal of PCA is to find orthonormal bases of the data that maximize the variance of the data. What it produces is a fixed basis set, which means that the representation of each data point is linear and global.
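This equivalence is easy to verify numerically: take the top-$k$ left singular vectors of the centered data as the dictionary and the projections as the (dense) representation. A small numpy sketch with random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 200))          # 10-dim data, 200 points (columns)
X = X - X.mean(axis=1, keepdims=True)   # center the data

k = 3
U, s, Vt = np.linalg.svd(X, full_matrices=False)
D = U[:, :k]          # "dictionary": top-k principal directions, orthonormal columns
R = D.T @ X           # "representation": projections onto those directions (dense)

X_hat = D @ R         # rank-k reconstruction, optimal in Frobenius norm
err = np.linalg.norm(X - X_hat, "fro") ** 2
print(err)            # equals the sum of the squared discarded singular values
```

By the Eckart-Young theorem, no other rank-$k$ pair $(D, R)$ achieves a smaller Frobenius reconstruction error, which is why PCA is the sparsity-free optimum of the dictionary learning objective.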

K-means

Without considering the sparsity of the representation R, the goal of dictionary learning is to find the dictionary D and coefficients R in order to approximately reconstruct the data with some vectors in the dictionary.

If we restrict each column $R_i$ to be a standard basis vector (i.e., exactly one element is 1 and the rest are 0), then each data point can be represented by only one entry of the dictionary. This can be expressed by the following conditions:

  • $\|R_i\|_0 = 1$: $R_i$ has exactly one non-zero element.
  • $\|R_i\|_1 = 1$: the absolute values of the elements of $R_i$ sum to 1.

In K-means, each data point is assigned to its nearest cluster center. This means that a data point $x_i$ is represented entirely by its nearest cluster center, while the contribution of all other cluster centers is zero. This is an extreme form of sparse representation, in which there is only one non-zero element (the one for the nearest cluster center) and all other elements are zero.

When we do not enforce sparsity, the cluster centers of K-means can be viewed as dictionary atoms in dictionary learning. A given data point $x_i$, which K-means associates with its nearest cluster center, can in dictionary learning be viewed as a linear combination of all dictionary atoms. However, due to the hard-assignment property of K-means, this linear combination becomes maximally sparse, with only one non-zero element.
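The correspondence can be made concrete: given fixed centers, hard assignment followed by reconstruction is exactly dictionary reconstruction with one-hot columns of $R$. A toy numpy sketch (the "centers" here are random placeholders rather than a fitted K-means model):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(2, 3))      # "dictionary": 3 cluster centers as columns
X = rng.normal(size=(2, 10))     # 10 data points as columns

# Hard assignment: index of the nearest center for each point
dists = np.linalg.norm(X[:, None, :] - D[:, :, None], axis=0)  # shape (3, 10)
assign = np.argmin(dists, axis=0)

# One-hot representation: exactly one non-zero (a 1) per column
R = np.zeros((3, 10))
R[assign, np.arange(10)] = 1.0

X_hat = D @ R                    # each point reconstructed as its nearest center
print(np.count_nonzero(R, axis=0))  # [1 1 1 1 1 1 1 1 1 1]
```

Each column of $R$ satisfies both conditions above ($\|R_i\|_0 = \|R_i\|_1 = 1$), so K-means reconstruction is the extreme one-atom case of $X \approx DR$.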

Non-negative Matrix Factorisation (NMF)

NMF is a matrix factorization technique where we constrain all elements of the factor matrix to be non-negative. It provides us with a way to clearly interpret and visualize the structure hidden in the data.

When one factor matrix is treated as the dictionary, NMF can be viewed as a special case of dictionary learning in which both the dictionary and the representation are non-negative. NMF does not directly pursue sparsity, but sparsity constraints can be added through regularization.

Consider the following optimization problem:

$$\argmin_{D, R} \|X - DR\|_F^2$$

Here, the constraints are $D \in \mathbb{R}^{d \times k}_{+}$ and $R \in \mathbb{R}^{k \times n}_{+}$.

We want to find two matrices D and R such that their product is as close as possible to the given matrix X, while all elements of D and R are non-negative. Since every element of every column is non-negative, all of these column vectors lie in the non-negative orthant.

Each column of X can be thought of as a linear combination of the columns of D (also called dictionary elements or bases), with the combination coefficients coming from R. When we linearly combine vectors from the non-negative orthant using non-negative coefficients, elements cannot cancel each other out, since there are no negative coefficients or elements.

This means NMF can only capture patterns in the data through additive combinations: the decomposition can only "add" features, never "subtract" or negate them. When applied to real-world data such as images, the decomposed features therefore tend to represent discernible parts of the data rather than holistic patterns. This part-based representation is often more interpretable; for example, with image data, the non-negativity constraints encourage each basis column of $D$ to represent a part or feature of an image rather than a blurred combination of whole images.
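One standard way to fit such a factorization is the multiplicative update rule of Lee and Seung, which preserves non-negativity at every step because each update multiplies by a ratio of non-negative quantities. A compact sketch on random non-negative data ($k$, the iteration count, and the epsilon guard are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 50))          # non-negative data matrix
k = 5
D = rng.random((20, k)) + 1e-3    # non-negative initial factors
R = rng.random((k, 50)) + 1e-3
eps = 1e-12                       # guards against division by zero

for _ in range(200):
    # Multiplicative updates: the Frobenius objective is non-increasing
    R *= (D.T @ X) / (D.T @ D @ R + eps)
    D *= (X @ R.T) / (D @ R @ R.T + eps)

err = np.linalg.norm(X - D @ R, "fro") / np.linalg.norm(X, "fro")
print(err)   # relative reconstruction error of the non-negative factorization
```

Because no subtraction ever occurs, every entry of $D$ and $R$ stays non-negative throughout, which is exactly the additive, parts-based behavior described above.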


Origin blog.csdn.net/weixin_45427144/article/details/132676029