PCA learning diary

This is written for myself only. The information is very messy, the content is chaotic, and the quality is poor; it is only a diary. If you want to learn, I recommend not reading this article... go learn the systematic material in other, higher-quality blogs.

Two days ago I wanted to project the word vectors I had computed into three dimensions, and it turned out I needed a dimensionality-reduction step first. The candidates were t-SNE and PCA. Although the former can retain more information, it is hard to interpret, and sociology cares more about "interpretation"; besides, I don't know what its internal mechanism is, so I went with PCA. The three-dimensional result I got from PCA has only 14% explanatory power, which is miserable. For the projection itself, see this article: https://www.douban.com/note/740965280/ .
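Note to self, a sketch of my own (using R's built-in USArrests data as a stand-in for the word vectors): the "14% explanatory power" is the cumulative proportion of variance of the first three components, which can be read straight off summary(), and the 3-D coordinates to plot are the first three columns of x.

# Stand-in sketch: how much variance do the first 3 PCs explain, and what gets plotted
data(USArrests)
pr <- prcomp(USArrests, scale. = TRUE)
summary(pr)$importance      # read the "Cumulative Proportion" row at column PC3
head(pr$x[, 1:3])           # the 3-D coordinates that would go into the projection plot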

PCA

There are two functions in R that can be used for PCA. According to http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/ they are prcomp() and princomp(), and the difference is:

General methods for principal component analysis

There are two general methods to perform PCA in R :

  • Spectral decomposition which examines the covariances / correlations between variables
  • Singular value decomposition which examines the covariances / correlations between individuals

The function princomp() uses the spectral decomposition approach. The functions prcomp() and PCA()[FactoMineR] use the singular value decomposition (SVD).

  • princomp(): uses spectral decomposition, which examines the covariances/correlations between variables
  • prcomp(): uses singular value decomposition (SVD), which examines the covariances/correlations between individuals/observations
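To convince myself that the two routes really agree, a small sketch of my own (not from the STHDA article), on R's built-in USArrests data: the spectral-decomposition route diagonalizes the correlation matrix, the SVD route decomposes the centred-and-scaled data matrix, and the directions and variances come out the same.

# My own check that the two routes agree (USArrests is a built-in data set)
data(USArrests)
X <- scale(USArrests)                          # centre and z-score each variable

# Route 1: spectral decomposition of the correlation matrix (princomp()'s approach)
eig <- eigen(cor(USArrests))

# Route 2: singular value decomposition of the scaled data (prcomp()'s approach)
sv <- svd(X)

max(abs(abs(eig$vectors) - abs(sv$v)))         # same directions up to sign, ~0
max(abs(eig$values - sv$d^2 / (nrow(X) - 1)))  # same variances, ~0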

Basic format of prcomp() and princomp() functions

The simplified formats of these two functions are:

prcomp(x, scale = FALSE)
princomp(x, cor = FALSE, scores = TRUE)
  1. Arguments for prcomp():
  • x: a numeric matrix or data frame
  • scale: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place
    1. ... Actually, "unit variance" sounds like z-score scaling. Yes, unit-variance scaling and z-score scaling are the same thing: the mean becomes 0, and then each value is divided by the standard deviation. "The data for each variable (metabolite) is mean centered and then divided by the standard deviation of the variable. This way each variable will have zero mean and unit standard deviation." -------> wiki: https://en.wikipedia.org/wiki/Feature_scaling
    2. Regarding why we scale: the real reason is to give every dimension the same importance. Mathematically, singular value decomposition itself does not require standardizing or centering the entries of the matrix. However, PCA is usually used to reduce the dimensionality of high-dimensional data: it projects the original high-dimensional data onto a low-dimensional space while keeping the variance as large as possible. If some feature of the data (some column of the matrix) has particularly large values, it accounts for a large share of the overall error, so after projecting into the low-dimensional space, in order for the low-rank decomposition to approximate the original data, the whole projection will try to approximate the largest feature while ignoring features with smaller values. Since we do not know the importance of each feature before modeling, this can lose a lot of information. For the sake of "fairness", and to avoid over-weighting features with large values, we first standardize every feature so that they are all on the same scale, and only then run PCA. /// In addition, from a computational perspective, there is another benefit to standardizing the data before PCA: PCA is usually computed by numerical approximation rather than by solving for eigenvalues or singular values analytically, so when we use gradient descent or similar algorithms for PCA, standardizing the data first helps the gradient method converge. (http://sofasofa.io/forum_main_post.php?postid=1000375)
    3. According to the sofasofa page above, there is more interesting material. It is not strictly relevant here, but I excerpt it anyway: unstandardized PCA takes the eigenvectors of the covariance matrix, while standardized PCA takes the eigenvectors of the correlation matrix. As Qingfeng said in the first point, without standardization the eigenvectors are biased towards the variable with the largest variance and deviate from the theoretically best direction. For example, assume a 2-dimensional Gaussian with correlation matrix [1 0.4; 0.4 1], std(x1)=10, std(x2)=1. Theoretically the best decomposition vector is the long axis of the ellipse; without standardization, the vector computed by PCA deviates from the long axis, and after standardization the deviation becomes small. (A small sketch of this appears right after the argument lists below.)
    4. It was also noted there that PCA can in fact be implemented in four ways: 1. the covariance matrix of the standardized data; 2. the correlation matrix of the standardized data; 3. the correlation matrix of the unstandardized data; 4. SVD of the standardized data. The four methods are equivalent.
    5. For the relationship between PCA and SVD, see http://wap.sciencenet.cn/blog-2866696-1136451.html
    6. One more PCA primer: http://wap.sciencenet.cn/blog-2866696-1136447.html (aside: ScienceNet is really good, although its rule that "blog comments are prohibited from 23:00 to 7:00 the next day" confuses me.)
    7. Regarding eigenvectors and eigenvalues: this question came up because I want to compute the SIF model among the sentence-embedding models, which says to "compute the first principal component u of the sentence-vector matrix, and subtract from each sentence vector its projection on u (similar to PCA)" (a brief introduction: https://developer.aliyun.com/article/714547 , original paper: https://openreview.net/pdf?id=SyK00v5xx ). So what is this "first principal component"? Another implementation of the model on GitHub uses u = pca.components_[0] to represent the first principal component, so what is pca.components_[0]? https://blog.csdn.net/sinat_31188625/article/details/72677088 // https://github.com/jx00109/sentence2vec/blob/master/s2v-python3.py . Judging from the Python implementation (I can't upload the screenshot), in short, it has as many dimensions as there were original variables. (A small sketch of the "subtract its projection on u" step is in the projection section near the end.)
    8. So what exactly is the first principal component? According to StackExchange https://stats.stackexchange.com/questions/311908/what-is-pca-components-in-sk-learn :
      • Annoyingly there is no SKLearn documentation for this attribute, beyond the general description of the PCA method.
      • Here is a useful application of pca.components_ in a classic facial-recognition project (using data bundled with SKL, so you don't have to download anything extra). Working through this concise notebook is the best way to get a feel for the definition & application of pca.components_
      • From that project, and this answer over on StackOverflow, we can learn that pca.components_ is the set of all eigenvectors (aka loadings) for your projection space (one eigenvector for each principal component). Once you have the eigenvectors using pca.components_, here's how to get eigenvalues.
      • For further info on the definitions & applications of eigenvectors vs loadings (including the equation that links all three concepts), see here.
      • For a 2nd project/notebook applying pca.components_ to (the same) facial recognition data, see here. It features a more traditional scree plot than the first project cited above 
    9. Look at the part marked in red: pca.components_ is the set of eigenvectors (also known as loadings) in the projection space; each principal component corresponds to one eigenvector. (A small R check of this appears after the output table below.)
  2. Arguments for princomp():
  • x: a numeric matrix or data frame
  • cor: a logical value. If TRUE, the data will be centered and scaled before the analysis
  • scores: a logical value. If TRUE, the coordinates on each principal component are calculated
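As promised in the notes under scale above, a small sketch of my own (toy data, not my word vectors): scale = TRUE is just z-score scaling, and without it the first principal component chases the variable with the largest variance, as in the std(x1)=10, std(x2)=1 example.

# My own toy demonstration of why scaling matters (correlation 0.4, sd 10 vs sd 1)
set.seed(1)
n  <- 5000
x1 <- rnorm(n)
x2 <- 0.4 * x1 + sqrt(1 - 0.4^2) * rnorm(n)
X  <- cbind(x1 = 10 * x1, x2 = x2)

round(colMeans(scale(X)), 10); apply(scale(X), 2, sd)   # z-scores: mean 0, sd 1

p1 <- prcomp(X, scale. = TRUE)                 # scale = TRUE ...
p2 <- prcomp(scale(X))                         # ... is the same as z-scoring by hand
max(abs(abs(p1$rotation) - abs(p2$rotation)))  # ~0

prcomp(X, scale. = FALSE)$rotation[, 1]  # unscaled: PC1 is almost entirely x1
p1$rotation[, 1]                         # scaled: both variables contribute comparably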

The elements of the outputs returned by the functions prcomp() and princomp() include:

prcomp() name / princomp() name: description
  • sdev / sdev: the standard deviations of the principal components
  • rotation / loadings: the matrix of variable loadings (columns are eigenvectors)
  • center / center: the variable means (the means that were subtracted)
  • scale / scale: the variable standard deviations (the scaling applied to each variable)
  • x / scores: the coordinates of the individuals (observations) on the principal components
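To make the table concrete (and to check point 9 above), a small sketch of my own on the built-in USArrests data, poking at the returned elements; the last lines check that the rotation columns are the eigenvectors of the correlation matrix and that sklearn's pca.components_[0] corresponds to the first column of rotation.

# My own sketch: inspect what prcomp()/princomp() return (USArrests is built in)
data(USArrests)
pr <- prcomp(USArrests, scale. = TRUE)
pc <- princomp(USArrests, cor = TRUE)

pr$sdev        # standard deviations of the PCs      (princomp: pc$sdev)
pr$rotation    # variable loadings / eigenvectors    (princomp: pc$loadings)
pr$center      # variable means that were subtracted (princomp: pc$center)
pr$scale       # standard deviations used to scale   (princomp: pc$scale)
head(pr$x)     # coordinates of the observations     (princomp: pc$scores)

# rotation columns are eigenvectors of the correlation matrix,
# and sdev^2 are the corresponding eigenvalues
e <- eigen(cor(USArrests))
max(abs(abs(pr$rotation) - abs(e$vectors)))   # ~0
max(abs(pr$sdev^2 - e$values))                # ~0

# the "first principal component" u of notes 7-9 (sklearn's pca.components_[0])
# corresponds to the first column of rotation
u <- pr$rotation[, 1]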

Finally, some additional useful information: https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another

In PCA, you split covariance (or correlation) matrix into scale part (eigenvalues) and direction part (eigenvectors). You may then endow eigenvectors with the scale: loadings. So, loadings are thus become comparable by magnitude with the covariances/correlations observed between the variables, - because what had been drawn out from the variables' covariation now returns back - in the form of the covariation between the variables and the principal components. Actually, loadings are the covariances/correlations between the original variables and the unit-scaled components. This answer shows geometrically what loadings are and what are coefficients associating components with variables in PCA or factor analysis.

In other words: the covariance (or correlation) matrix is split into a scale part (the eigenvalues) and a direction part (the eigenvectors); endowing the eigenvectors with the scale gives the loadings, which are the covariances/correlations between the original variables and the unit-scaled components.

Loadings:

  1. Help you interpret principal components or factors; Because they are the linear combination weights (coefficients) whereby unit-scaled components or factors define or "load" a variable.

    (Eigenvector is just a coefficient of orthogonal transformation or projection, it is devoid of "load" within its value. "Load" is (information of the amount of) variance, magnitude. PCs are extracted to explain variance of the variables. Eigenvalues are the variances of (= explained by) PCs. When we multiply an eigenvector by the square root of the eigenvalue we "load" the bare coefficient by the amount of variance. By that virtue we make the coefficient to be the measure of association, co-variability.)

  2. Loadings sometimes are "rotated" (e.g. varimax) afterwards to facilitate interpretability (see also);

  3. It is loadings which "restore" the original covariance/correlation matrix (see also this thread discussing nuances of PCA and FA in that respect);

  4. While in PCA you can compute values of components both from eigenvectors and loadings, in factor analysis you compute factor scores out of loadings.

  5. And, above all, loading matrix is informative: its vertical sums of squares are the eigenvalues, components' variances, and its horizontal sums of squares are portions of the variables' variances being "explained" by the components.

  6. Rescaled or standardized loading is the loading divided by the variable's st. deviation; it is the correlation. (If your PCA is correlation-based PCA, loading is equal to the rescaled one, because correlation-based PCA is the PCA on standardized variables.) Rescaled loading squared has the meaning of the contribution of a pr. component into a variable; if it is high (close to 1) the variable is well defined by that component alone.

An example of computations done in PCA and FA for you to see.

Eigenvectors are unit-scaled loadings; and they are the coefficients (the cosines) of orthogonal transformation (rotation) of variables into principal components or back. Therefore it is easy to compute the components' values (not standardized) with them. Besides that their usage is limited. Eigenvector value squared has the meaning of the contribution of a variable into a pr. component; if it is high (close to 1) the component is well defined by that variable alone.

Although eigenvectors and loadings are simply two different ways to normalize coordinates of the same points representing columns (variables) of the data on a biplot, it is not a good idea to mix the two terms. This answer explained why. See also.
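To convince myself of points 1-6 above, a small numeric check of my own (correlation-based PCA on the built-in USArrests data): loadings are eigenvectors scaled by the square roots of the eigenvalues, their column sums of squares recover the eigenvalues, their row sums of squares are the explained parts of each variable's variance, and for correlation-based PCA the loadings are exactly the variable-component correlations.

# My own numeric check of the loadings-vs-eigenvectors relations quoted above
data(USArrests)
Z  <- scale(USArrests)                # correlation-based PCA = PCA on z-scores
pr <- prcomp(Z)

eigvec <- pr$rotation                 # eigenvectors (unit-scaled loadings)
eigval <- pr$sdev^2                   # eigenvalues = variances of the PCs

loadings <- eigvec %*% diag(sqrt(eigval))   # loading = eigenvector * sqrt(eigenvalue)

colSums(loadings^2); eigval           # vertical sums of squares = eigenvalues
rowSums(loadings^2)                   # horizontal sums of squares = explained parts of
                                      # each variable's variance (~1 here: standardized data)
max(abs(loadings - cor(Z, pr$x)))     # loadings = correlations of variables with PCs, ~0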

 

Calculating the projection of a vector:

A hand-written implementation: https://www.jianshu.com/p/b26c1eb2abb1
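Alongside that link, a little sketch of my own of the operation described in note 7 above: take the first principal-component direction u of a matrix of toy "sentence vectors" and subtract from each row its projection on u. This only illustrates the projection arithmetic; it is not the reference SIF implementation.

# My own toy sketch of "subtract the projection on the first principal component u"
set.seed(2)
S <- matrix(rnorm(20 * 5), nrow = 20)        # 20 toy "sentence vectors", 5 dimensions

u <- prcomp(S, center = FALSE)$rotation[, 1] # first PC direction, unit length
                                             # (analogue of sklearn's pca.components_[0];
                                             # the real SIF code may handle centering differently)

proj    <- S %*% u %*% t(u)                  # each row's projection on u: (s . u) u
S_clean <- S - proj

max(abs(S_clean %*% u))                      # every row is now orthogonal to u, ~0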

I want to watch "Downton Abbey"; the name alone sounds appealing, and it feels like "The Age of Innocence".

Take a look at this: "Writing Your Research in English", Jay Wang, March 3, 2020, https://www.bilibili.com/video/av93800074/ . There is a certain threshold to watching it, but if you go through it carefully a few times, it will help with academic writing. Reposted from Douban: https://www.youtube.com/watch?v=QR7m9GR7Iic

 

Long-winded; I know neither the principle nor the code, but I have plenty of sociological imagination.

This records the process of thinking and learning. There are probably a couple of contradictions in it, but I am too lazy to fix them for now.

In short, I tried to record my own learning process and found that my thinking is really something: intuitive, jumping from one thing to the next......

I keep trotting back to look at this one. Even I am shocked!!

 

This is written for myself only. The information is plentiful but the content is disordered. Readers are advised to learn from other, more systematic materials.

 
