Sparsity: a history from sparse signals to sparse matrices

Author: over-fitting
Link: https://www.zhihu.com/question/26602796/answer/36470745
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.
 

In the beginning, sparsity arose naturally in signal processing, because signals in nature are mostly low-frequency, while the high-frequency part is basically noise. Therefore, when a wavelet or Fourier basis is used, only the coefficients on a few low-frequency basis functions are relatively large, and the coefficients corresponding to the high-frequency basis functions are basically close to zero. For this reason, Donoho et al. proposed soft-thresholding the expansion coefficients to remove the high-frequency components, which filters out noise and improves signal recovery. Since these basis matrices are orthogonal, it can be proved that the solution of \min\|y-Ax\|_2^2+\lambda\|x\|_1 is exactly the soft-thresholding of A'y. Later, people departed from the signal-processing background and began to consider a general basis matrix A, treating the above problem as a regularization of least squares, which led to the work on LASSO (Tibshirani), compressed sensing (Candès), and so on.
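As a quick numerical illustration (my own minimal sketch, not from the original answer; it uses the common convention with a 1/2 factor in front of the squared loss, under which the threshold is lambda itself, whereas with the scaling written above it would be lambda/2), the snippet below checks that for an orthonormal basis A the lasso solution is simply the soft-thresholding of A'y. The variable names are illustrative.

```python
import numpy as np

def soft_threshold(z, lam):
    """Entrywise soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# With A^T A = I, the minimizer of (1/2)||y - A x||_2^2 + lam * ||x||_1
# is the entrywise soft-thresholding of A^T y.
rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # random orthonormal basis
x_true = np.zeros(64)
x_true[:5] = rng.standard_normal(5)                  # a few large "low-frequency" coefficients
y = A @ x_true + 0.05 * rng.standard_normal(64)      # noisy observation
x_hat = soft_threshold(A.T @ y, 0.1)                 # closed-form lasso solution
print("non-zero coefficients recovered:", np.count_nonzero(x_hat))
```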

In fact, l2 regularization of least squares already existed (Stein, 1956; James & Stein), so considering l1 regularization was a natural idea. The basic principle behind this type of regularization is the bias-variance tradeoff: by reducing the complexity of the model, a little bias is sacrificed, but overall prediction accuracy improves because the variance is greatly reduced. This is a very common approach when variance is high, and it runs through the entire book The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition). What is new about l1 regularization is that it introduces sparsity, which brings interpretability to the model: the model can be explained in terms of the actual meaning of the bases corresponding to the non-zero coefficients (for example, Face Recognition via Sparse Representation). Note that traditional methods such as l2 can produce a large number of very small coefficients, and it may seem that an additional truncation step would yield just as many zero coefficients. But it must be emphasized that there is a fundamental difference between exact zeros and small non-zero values. First, we must decide what counts as "small enough", which is equivalent to introducing an additional parameter (i.e., the cut-off threshold) and therefore additional error (in practice, the threshold has to be tuned by hand). The coefficients may also live on different scales: sometimes 0.001 is actually a very large coefficient but gets truncated, and sometimes 0.1 is actually very small but gets kept. In addition, some solvers introduce approximations in their numerical computations, so a small coefficient may simply be an artifact of numerical instability, making it even harder to tell whether it is really zero or not. For the l1 solution, zero versus non-zero is exact. Using LARS or similar methods to plot how the solution changes with lambda, you can even see individual coefficients switch from zero to non-zero at particular values of lambda. This can fairly be called an advantage.
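To make the last point concrete, here is a small sketch (my own, assuming scikit-learn's `lars_path` is available; the data and parameter choices are illustrative) that computes the lasso solution path with LARS and prints the lambda values at which coefficients enter the model.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
beta = np.zeros(10)
beta[[0, 3, 7]] = [2.0, -1.5, 1.0]                 # sparse ground-truth coefficients
y = X @ beta + 0.5 * rng.standard_normal(100)

# Full lasso path via LARS: `alphas` are the lambda values at which the active
# set changes, and coefs[:, k] holds the coefficients at alphas[k].
alphas, active, coefs = lars_path(X, y, method="lasso")
for a, c in zip(alphas, coefs.T):
    print(f"lambda = {a:.3f}, non-zero indices = {np.flatnonzero(c).tolist()}")
```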

After 2006, sparse representation produced several interesting new ideas. One is to extend sparsity from the coefficients to the singular values of a matrix. If each row (or column) of the matrix is a data point, then the number of non-zero singular values is the dimension of the low-dimensional subspace where the data really lives. Traditional PCA follows from this: inspect the decay curve of the singular values and truncate manually to decide the reduced dimension (so it can be regarded as the matrix counterpart of Donoho's idea). Candès et al. proposed Robust PCA, which imposes sparsity on the singular values of the matrix (moving from l1 to the nuclear norm), so the reduced dimension is learned automatically while the original matrix is denoised at the same time (see my other answer to "How to understand the 'rank' of a matrix?"). This leads to another new idea: in general, the data matrix may contain outliers, i.e. a small fraction of the rows are contaminated, so one can solve for a sparse outlier matrix that "purifies" the original matrix; here the sparsity is imposed on the outlier matrix to be solved. Yet another new idea is the Sparse PCA proposed by Zou & Hastie, which, roughly speaking, applies sparsity to the loading vectors so that the principal components themselves become interpretable. In general, people in this field are now mostly developing new ideas on matrices (some have even begun to play with higher-order tensors).
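To connect the two kinds of shrinkage, here is a toy sketch (my own naive alternating scheme, not Candès et al.'s actual algorithm; `tau`, `lam`, and the iteration count are arbitrary illustrative choices) in which the nuclear-norm proximal step soft-thresholds singular values and the l1 proximal step soft-thresholds entries, splitting a data matrix into a low-rank part plus a sparse outlier part.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of tau * nuclear norm,
    i.e. soft-thresholding applied to the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, lam):
    """Entrywise soft-thresholding: the proximal operator of lam * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - lam, 0.0)

def naive_low_rank_plus_sparse(X, tau=1.0, lam=0.1, n_iter=100):
    """Alternately shrink singular values (low-rank part L) and entries
    (sparse outlier part S) so that X is approximated by L + S."""
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    for _ in range(n_iter):
        L = svt(X - S, tau)
        S = soft(X - L, lam)
    return L, S
```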

It is worth mentioning that all of the figures named in this article are from the Stanford Department of Statistics. In fact, most of the classic papers in this field also come from there.

Origin: blog.csdn.net/a493823882/article/details/104542031