Principal Component Analysis (PCA) and the percentage of variance explained

Plotting: a scatterplot of PC1 against PC2, with the percentage of variance explained by PC1 and PC2 shown on the axis labels.

1. Approach

Steps:
1. Run PCA on the PLINK files
2. Calculate the percentage of variance explained by PC1 and PC2 from the eigenvalues
3. Draw the PCA plot from the eigenvectors

2. Notes

Note:

Each eigenvalue is the variance of the data along the corresponding eigenvector, and the ratio of an eigenvalue to the sum of all eigenvalues is the variance contribution rate (the proportion of variance explained) of that eigenvector.

Simply put:

  • PC1 is the eigenvector whose variance is the first eigenvalue; its variance contribution rate is the first eigenvalue as a percentage of the sum of all eigenvalues
  • PC2 is the eigenvector whose variance is the second eigenvalue; its variance contribution rate is the second eigenvalue as a percentage of the sum of all eigenvalues
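
Written as a formula, with \lambda_i denoting the i-th eigenvalue and n the total number of eigenvalues (these symbols are introduced here only for illustration and do not appear in the PLINK output):

```latex
\[
  \text{explained}_i \,(\%) \;=\; \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \times 100
\]
```

The denominator runs over all n eigenvalues, so leaving some of them out makes every percentage larger.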

3. Example

For example, to compute 3 PCs from a PLINK file:

plink --bfile geno/b --pca 3

The results include:

  • plink.eigenval: the eigenvalues, 3 rows in total, one per PC
  • plink.eigenvec: the eigenvectors; columns 1 and 2 are the family and individual IDs, columns 3, 4 and 5 hold PC1, PC2 and PC3, and the first two PCs are used for plotting

$ head plink.eigenvec
0 ID1 -0.032 0.0185407 0.0351135
0 ID2 -0.0330665 0.0213082 0.0575101
0 ID3 -0.0340043 0.0209365 -0.00264537
0 ID4 -0.0323621 0.0203962 0.0503156
0 ID5 -0.0325016 0.0191183 0.0426273
0 ID6 -0.0346765 0.0196053 -0.0408817

$ head plink.eigenval
145.367
74.7594
6.10604

4. Calculating the PCA percentages

To calculate the percentage explained by each PC accurately, we need the eigenvalues of all PCs, because each percentage is taken relative to their sum. The number of PCs here equals the number of samples.

For example, with 575 samples, the command is:

plink --bfile geno/b --pca 575

As expected, the .fam file (one line per sample), plink.eigenvec and plink.eigenval all have 575 rows:

$ wc -l geno/b.fam
575 geno/b.fam
$ wc -l plink.eigenvec
575 plink.eigenvec
$ wc -l plink.eigenval
575 plink.eigenval

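Calculate the percentage for each PC in R and visualize the PCA:

library(tidyverse)   # ggplot2 for plotting
library(data.table)  # fread() for reading the PLINK output

re1a = fread("plink.eigenval")  # one eigenvalue per row, in column V1
re1b = fread("plink.eigenvec")  # V1, V2: sample IDs; V3, V4, ...: PC1, PC2, ...

# percentage of variance explained by each PC
re1a$por = re1a$V1/sum(re1a$V1)*100
head(re1a)

# PC1 vs. PC2, with the percentages in the axis labels
ggplot(re1b,aes(x = V3,y = V4)) + geom_point() + 
  xlab(paste0("PC1 (",round(re1a$por[1],2),"%)")) + 
  ylab(paste0("PC2 (",round(re1a$por[2],2),"%)")) 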

Result: [figure: scatterplot of PC1 vs. PC2]

5. Using only the first 10 eigenvalues to calculate the percentages

Because the PCA eigenvalues are sorted from largest to smallest, one might use only the first 3 or the first 10 as representatives when calculating the percentages of PC1 and PC2. Let's test:

Taking the first three: the deviation is too large. PC1 goes from 21% to 64%, which is unreliable!

Taking the first ten: still unreliable, and the change is relatively large. It is better to calculate the percentages honestly from all the eigenvalues. Each of the remaining eigenvalues is small, but together they add up to a lot!

Taking all: this is the most correct!
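
The jump to 64% can be checked by hand from the three eigenvalues listed in section 3. The following is a minimal shell sketch, not the author's code: the function name explained_pct and its optional second argument (how many leading eigenvalues go into the denominator) are made up for illustration, and it assumes one eigenvalue per line, as in plink.eigenval.

```shell
# explained_pct FILE [K]
# Hypothetical helper: print the percentage of each of the first K eigenvalues
# in FILE, using the sum of those K values as the denominator.
# With K omitted, every line of FILE is used.
explained_pct() {
  awk -v k="${2:-0}" '
    k == 0 || NR <= k { val[++n] = $1; total += $1 }
    END { for (i = 1; i <= n; i++) printf "PC%d %.2f\n", i, 100 * val[i] / total }
  ' "$1"
}

# The three eigenvalues from section 3, one per line.
demo=$(mktemp)
printf '%s\n' 145.367 74.7594 6.10604 > "$demo"
explained_pct "$demo"
```

With only these three values in the denominator, PC1 comes out at 64.26%, in line with the 64% above. Pointing the same function at the full 575-line plink.eigenval uses every eigenvalue instead.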

6. All in one step

The problem now is that we first have to check the number of samples, then set the --pca number accordingly, and then read the results. All of this can be done in one step in R:

Steps:

  • Read the .fam file of the PLINK dataset and count the samples
  • Call PLINK from R, passing that number as the --pca argument
  • Plot

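library(tidyverse)   # ggplot2 for plotting
library(data.table)  # fread() for reading files

args = "geno/b"  # PLINK file prefix
# number of samples = number of rows in the .fam file
nn = nrow(fread(paste0(args[1],".fam"),header=FALSE))
# run PLINK, requesting as many PCs as there are samples
system(sprintf("~/bin/plink --bfile %s --allow-extra-chr --chr-set 30 --pca %s",args[1],nn))
re1a = fread("plink.eigenval")
re1b = fread("plink.eigenvec")

# percentage of variance explained by each PC
re1a$por = re1a$V1/sum(re1a$V1)*100
head(re1a)

ggplot(re1b,aes(x = V3,y = V4)) + geom_point() + 
  xlab(paste0("PC1 (",round(re1a$por[1],2),"%)")) + 
  ylab(paste0("PC2 (",round(re1a$por[2],2),"%)")) 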
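
If only the PLINK step is wanted outside R, the same idea can be sketched in shell. This is an illustration under assumptions, not part of the original workflow: n_samples and pca_command are hypothetical helper names, plink is assumed to be on the PATH, and the extra --allow-extra-chr --chr-set 30 options from the R call are left out for brevity.

```shell
# Hypothetical helper: the number of samples is the number of lines in PREFIX.fam.
# awk counts the last line even if the file lacks a trailing newline.
n_samples() {
  awk 'END { print NR }' "$1.fam"
}

# Hypothetical helper: print the PLINK command instead of running it,
# so it can be inspected before execution.
pca_command() {
  printf 'plink --bfile %s --pca %s\n' "$1" "$(n_samples "$1")"
}
```

For the 575-sample example, pca_command geno/b prints plink --bfile geno/b --pca 575, and piping that output to sh runs it.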

Origin blog.csdn.net/yijiaobani/article/details/127770834