GWAS data acquisition: how to obtain complete GWAS summary data (1), the GWAS Catalog database

OpenGWAS project (mrcieu.ac.uk)

UK Biobank

GWAS Catalog

In Mendelian randomization (MR) studies, the exposure data only needs the SNPs that are significantly associated with the exposure, and that information is easy to find in the various GWAS databases. The outcome data is different: the instrument SNPs are generally not significantly associated with the outcome, so these non-significant results often cannot be looked up directly in the article or database. In that case we need to download the complete GWAS summary data. Such a file generally contains millions or even tens of millions of SNPs, so it is relatively large (about 200 MB after compression). Be aware of this and prepare accordingly.

Next, I will show how to download the complete GWAS summary data from the GWAS Catalog.

First, open the official GWAS Catalog website (https://www.ebi.ac.uk/gwas/) and click Summary statistics (as shown in the figure below).

On the Summary statistics page, click Available studies (as shown in the figure below).

Finally, you will reach the following page (link: https://www.ebi.ac.uk/gwas/downloads/summary-statistics).

The page consists of three main parts:

The first block is "List of published studies with summary statistics" (as shown in the figure below). The GWAS studies here are all published, so their quality is comparatively assured. You can enter keywords in the search box (marked in red) to search for the phenotype of interest.

The second block is "List of prepublished/unpublished studies with summary statistics" (as shown in the figure below). The GWAS studies here are unpublished (they may come from preprints), so their quality cannot be guaranteed. You can enter keywords in the search box (marked in red) to search for the phenotype of interest. The phenotypes here are likely to be relatively new and complement the published data. When you really cannot find the data elsewhere, it is worth trying here.

The third block is "Additional sources of summary statistics" (as shown in the figure below). It lists information about current GWAS research collaborations (consortia). These consortia generally host their data on their own websites, where the complete GWAS summary data can be downloaded. Marked in red in the figure are the coronary heart disease consortia.

The GWAS Catalog database is a treasure trove. This post is only a starting point meant to prompt further ideas; I hope everyone studies and uses the database in more depth. You are also welcome to exchange ideas by private message (WeChat: MedGen16).

PS: The GWAS Catalog can sometimes only be opened through a proxy, so be prepared in advance.

SSGAC

Source of the GWAS data

Data included

 

1 Read exposure data

1.2 Save exposure data

Start practicing

Read exposure data

Read outcome data

Harmonise data

MR

Sensitivity analysis

Instrumental variables must be significant and independent

The advantage is speed; the disadvantage is that the SNPs may not be independent of each other (linkage disequilibrium)

Significance threshold: p < 5 × 10^-8

This shows that the instrumental variable is associated with the exposure but not with the outcome.

Some SNPs may be lost

Step 1: read the exposure data in R

Relevance: use the subset function with p < 5 × 10^-8

Independence: use the clump function to remove linkage disequilibrium (LD). The smaller the LD r2 threshold the better; it is usually 0.001 and at most 0.1.

The choice depends on the number of SNPs; a distance of 500 kb is also acceptable

Statistical strength: F > 10 is preferred
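The relevance and strength criteria above can be sketched in a few lines. This is a plain Python illustration of the arithmetic, not the R workflow itself. It assumes the per-SNP F-statistic is approximated as (beta / se)^2, and the field names and SNP IDs are hypothetical.

```python
# Sketch: keep SNPs with p < 5e-8 and an approximate F-statistic above 10.
# Assumption: F is approximated per SNP as (beta / se)^2.
# The field names ("SNP", "beta", "se", "pval") and the IDs are hypothetical.
P_THRESHOLD = 5e-8
F_THRESHOLD = 10.0


def f_statistic(beta, se):
    """Approximate per-SNP F-statistic from summary statistics."""
    return (beta / se) ** 2


def select_instruments(snps, p_threshold=P_THRESHOLD, f_threshold=F_THRESHOLD):
    """Return the SNPs that pass both the p-value and the F-statistic filter."""
    kept = []
    for snp in snps:
        f_value = f_statistic(snp["beta"], snp["se"])
        if snp["pval"] < p_threshold and f_value > f_threshold:
            kept.append({**snp, "F": f_value})
    return kept


exposure = [
    {"SNP": "rs_a", "beta": 0.080, "se": 0.010, "pval": 1e-15},
    {"SNP": "rs_b", "beta": 0.020, "se": 0.010, "pval": 0.045},
    {"SNP": "rs_c", "beta": -0.060, "se": 0.009, "pval": 3e-11},
]
instruments = select_instruments(exposure)
print([snp["SNP"] for snp in instruments])
```

Only the SNPs that pass both filters are kept; the independence (clumping) step still has to be applied afterwards.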

1.1 Relevance: use the subset function with p < 5 × 10^-8

1.2 Rename the columns of the file

1.3 Independence: after subsetting, re-read the exposure data with read_exposure_data

The default LD r2 threshold of clump_data is 0.001

You can also clump afterwards with clump_data
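To make the idea behind clumping concrete, here is a minimal Python sketch of a greedy procedure: SNPs are visited from the smallest p-value upwards, and a SNP is dropped if it sits within the window of an already kept SNP and is in LD with it above the threshold. Real clumping estimates LD from a reference panel; the `ld_r2` lookup table, the positions and the SNP IDs here are hypothetical.

```python
# Sketch of greedy LD clumping on toy data.
# Assumption: pairwise LD r2 values are supplied in a hypothetical lookup
# table; in practice they are estimated from a reference panel.


def clump(snps, ld_r2, r2_threshold=0.001, window_kb=10000):
    """Keep the most significant SNP of each LD cluster."""
    kept = []
    for snp in sorted(snps, key=lambda s: s["pval"]):
        in_ld_with_lead = False
        for lead in kept:
            same_region = (
                snp["chr"] == lead["chr"]
                and abs(snp["pos"] - lead["pos"]) <= window_kb * 1000
            )
            r2 = ld_r2.get(frozenset((snp["SNP"], lead["SNP"])), 0.0)
            if same_region and r2 > r2_threshold:
                in_ld_with_lead = True
                break
        if not in_ld_with_lead:
            kept.append(snp)
    return kept


candidates = [
    {"SNP": "rs_b", "chr": 1, "pos": 1_050_000, "pval": 1e-10},
    {"SNP": "rs_a", "chr": 1, "pos": 1_000_000, "pval": 1e-20},
    {"SNP": "rs_c", "chr": 2, "pos": 500_000, "pval": 1e-9},
]
toy_ld = {frozenset(("rs_a", "rs_b")): 0.6}  # hypothetical r2 value
independent = clump(candidates, toy_ld)
print([snp["SNP"] for snp in independent])
```

In this toy example rs_b is removed because it is in LD with the more significant rs_a, while rs_c on another chromosome is kept.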

Step 2: read the outcome data

1 read.table

2 merge to get the intersection with the exposure SNPs

2.1 Rename the columns

3 read_outcome_data
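Because the full outcome file can hold millions of rows, the "intersection" step can also be done while streaming the file instead of loading it whole. The following Python sketch shows that idea on a tiny demo file it writes itself. The column names and SNP IDs are hypothetical; real summary statistics files use varying headers.

```python
# Sketch: stream a gzipped, tab-separated summary statistics file and keep
# only the rows for the instrument SNPs.
# Assumption: the file has a header row with a "SNP" column (hypothetical).
import csv
import gzip
import os
import tempfile


def extract_outcome_rows(path, wanted_snps, snp_col="SNP", sep="\t"):
    """Return the rows whose SNP ID is in wanted_snps, reading line by line."""
    wanted = set(wanted_snps)
    rows = []
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter=sep):
            if row[snp_col] in wanted:
                rows.append(row)
    return rows


# Build a tiny demo file so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "outcome_demo.tsv.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["SNP", "effect_allele", "other_allele", "beta", "se", "pval"])
    writer.writerow(["rs_a", "A", "G", "0.011", "0.006", "0.07"])
    writer.writerow(["rs_b", "C", "T", "-0.004", "0.005", "0.42"])
    writer.writerow(["rs_c", "G", "A", "0.009", "0.007", "0.20"])

exposure_snps = ["rs_a", "rs_c", "rs_missing"]
outcome_rows = extract_outcome_rows(demo_path, exposure_snps)
missing = set(exposure_snps) - {row["SNP"] for row in outcome_rows}
print([row["SNP"] for row in outcome_rows], sorted(missing))
```

The `missing` set corresponds to instrument SNPs that are absent from the outcome data, which is how SNPs get lost at this step.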

Summary

Effect allele

The effect alleles need to be aligned in code (for example A -> T)

Proxy SNP

The LD r2 threshold for a proxy SNP is set to 0.8. The larger the r2, the stronger the linkage disequilibrium between the two SNPs, and the more likely it is that one can stand in for the other.

When setting independence, however, make the LD r2 threshold as small as possible (0.001)

Sample overlap

Exposure data: 500,000

Outcome data: 1 million

The summary data should contain more than 5 million SNPs to be usable; normally it reaches 10 million.

Step 3: harmonise

Remove palindromic SNPs

Save the file

Ensure that the exposure SNPs are not associated with the outcome

The SNPs are associated with the exposure

The SNPs are not associated with the outcome, consistent with the assumption
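The allele alignment done in the harmonise step can be illustrated with a small Python sketch. It is a simplified rule set, not the R function's actual logic: it assumes single-nucleotide, upper-case alleles, drops every palindromic SNP (A/T or C/G) outright, and the function name `harmonise_snp` is hypothetical.

```python
# Sketch: align the outcome effect to the exposure effect allele for one SNP.
# Assumptions: alleles are single upper-case nucleotides; all palindromic
# SNPs are dropped instead of being resolved by allele frequency.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(allele_1, allele_2):
    """A/T and C/G SNPs look identical on both strands."""
    return COMPLEMENT[allele_1] == allele_2


def harmonise_snp(ea_x, oa_x, ea_y, oa_y, beta_y):
    """Return beta_y aligned to the exposure effect allele, or None to drop."""
    if is_palindromic(ea_x, oa_x):
        return None
    for flip_strand in (False, True):
        a1 = COMPLEMENT[ea_y] if flip_strand else ea_y
        a2 = COMPLEMENT[oa_y] if flip_strand else oa_y
        if (a1, a2) == (ea_x, oa_x):
            return beta_y  # same effect allele
        if (a1, a2) == (oa_x, ea_x):
            return -beta_y  # effect and other allele are swapped
    return None  # alleles cannot be reconciled


print(harmonise_snp("A", "G", "A", "G", 0.1))  # already aligned
print(harmonise_snp("A", "G", "G", "A", 0.1))  # swapped alleles, sign flips
print(harmonise_snp("A", "G", "T", "C", 0.1))  # other strand, same allele
print(harmonise_snp("A", "T", "A", "T", 0.1))  # palindromic, dropped
```

Dropping all palindromic SNPs is the most conservative choice; it costs some instruments but avoids guessing the strand.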

Step 4: MR

IVW is a random-effects model

When the outcome is a continuous variable, use the beta value, with 0 as the null value

When the outcome is a categorical (binary) variable, the beta is on the log-odds scale; convert it to an odds ratio (OR) and use 1 as the null value.
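The IVW estimate and the conversion to an odds ratio can be written out by hand. The Python sketch below assumes the common formulation in which each SNP's ratio estimate (beta_y / beta_x) is weighted by beta_x^2 / se_y^2, and the standard error is inflated when the residual dispersion exceeds 1 (a multiplicative random-effects model). The input numbers are made up for illustration.

```python
# Sketch: inverse-variance weighted (IVW) MR estimate from summary statistics.
# Assumptions: first-order weights beta_x^2 / se_y^2 and a multiplicative
# random-effects standard error; the data below are invented.
import math


def ivw(beta_x, beta_y, se_y):
    """Return (estimate, standard error, Cochran's Q)."""
    weights = [bx ** 2 / sy ** 2 for bx, sy in zip(beta_x, se_y)]
    ratios = [by / bx for bx, by in zip(beta_x, beta_y)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    q = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
    df = len(ratios) - 1
    # Never shrink the standard error below the fixed-effect value.
    dispersion = max(1.0, math.sqrt(q / df)) if df > 0 else 1.0
    return estimate, math.sqrt(1.0 / total) * dispersion, q


beta_x = [0.10, 0.20, 0.15]
beta_y = [0.020, 0.050, 0.030]
se_y = [0.010, 0.012, 0.011]

b, se, q = ivw(beta_x, beta_y, se_y)
# Binary outcome: beta is a log odds ratio, so exponentiate it.
odds_ratio = math.exp(b)
ci_low, ci_high = math.exp(b - 1.96 * se), math.exp(b + 1.96 * se)
print(round(b, 4), round(odds_ratio, 4), round(ci_low, 4), round(ci_high, 4))
```

For a continuous outcome, read `b` directly and check whether its interval excludes 0; for a binary outcome, check whether the OR interval excludes 1.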

Use other methods

mr(dat,method_list=c())

When drawing the scatter plot, choose which methods to draw.

5 Visualization of results

6 Sensitivity analysis includes heterogeneity testing and pleiotropy testing

Heterogeneity testing

If the heterogeneity p-value is < 0.05, heterogeneity is present.

The presence of heterogeneity does not by itself undermine the reliability of the results when a random-effects IVW model is used.
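The statistic behind the heterogeneity test is Cochran's Q. As a rough Python sketch, assuming the same weights as in the IVW estimate (beta_x^2 / se_y^2), Q and the derived I^2 can be computed as follows. The p-value comes from a chi-square distribution with (number of SNPs - 1) degrees of freedom and is left out here to stay within the standard library; the input numbers are invented.

```python
# Sketch: Cochran's Q and I^2 for the per-SNP ratio estimates.
# Assumption: first-order IVW weights beta_x^2 / se_y^2; data are invented.


def cochran_q(beta_x, beta_y, se_y):
    """Return (Q, degrees of freedom, I^2)."""
    weights = [bx ** 2 / sy ** 2 for bx, sy in zip(beta_x, se_y)]
    ratios = [by / bx for bx, by in zip(beta_x, beta_y)]
    pooled = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    q = sum(w * (r - pooled) ** 2 for w, r in zip(weights, ratios))
    df = len(ratios) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2


# All ratio estimates agree: no heterogeneity.
q_hom, df_hom, i2_hom = cochran_q([0.1, 0.2, 0.3], [0.05, 0.10, 0.15], [0.01] * 3)
# The third SNP gives a very different ratio estimate: strong heterogeneity.
q_het, df_het, i2_het = cochran_q([0.1, 0.2, 0.3], [0.05, 0.10, 0.30], [0.01] * 3)
print(round(q_hom, 3), round(i2_hom, 3))
print(round(q_het, 3), round(i2_het, 3))
```

A Q value far above its degrees of freedom corresponds to a small heterogeneity p-value.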

Setting NbDistribution to 10,000 gives a more accurate result

6.1 Find the SNPs that contribute most to heterogeneity with run_mr_presso

Do the outliers change the estimate? If not, the p-value is > 0.05

The outliers are listed; a p-value below 0.05 indicates heterogeneity

If heterogeneity is strong, it may still be present after removing a few SNPs and recalculating.

6.2 Heterogeneity visualization: funnel plot

The more symmetrical the better

The funnel plot can be asymmetrical even when there is no heterogeneity

6.2 Pleiotropy: mr_pleiotropy_test(). If this test gives a bad result, the analysis cannot be used and the article cannot be published.

Functional pleiotropy; horizontal pleiotropy

For example, a SNP may affect AD through other phenotypes rather than through the BMI phenotype.

p = 0.078 > 0.05: no pleiotropy

Use egger_intercept to evaluate pleiotropy

The p-value tests whether the MR-Egger regression line has a non-zero intercept on the y-axis

If p > 0.05, the intercept is not significant, so there is no evidence that it differs from zero

If p < 0.05, the intercept is significant, which indicates horizontal pleiotropy: the SNPs may affect the outcome through other phenotypes. Such results cannot be used.

(That is, when the effect of a SNP on the exposure is 0, it still has a non-zero effect on the outcome, so other intermediate factors must be affecting the outcome.)
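The intercept described above comes from a weighted regression of the SNP-outcome effects on the SNP-exposure effects. A minimal Python sketch follows, with several assumptions: weights are 1 / se_y^2, every SNP is first oriented so that its exposure effect is positive, the standard error is never shrunk below the value implied by a residual dispersion of 1, and the p-value uses a normal approximation where a t distribution would be more exact for few SNPs. The data are invented.

```python
# Sketch: MR-Egger intercept via weighted least squares.
# Assumptions: weights 1 / se_y^2, SNPs oriented to positive exposure effect,
# normal approximation for the p-value; the data are invented.
import math
from statistics import NormalDist


def egger_intercept(beta_x, beta_y, se_y):
    """Return intercept, slope, intercept SE and an approximate p-value."""
    x = [abs(bx) for bx in beta_x]
    y = [by if bx >= 0 else -by for bx, by in zip(beta_x, beta_y)]
    w = [1.0 / s ** 2 for s in se_y]
    n = len(x)
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    denom = sw * swxx - swx ** 2
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swxx * swy - swx * swxy) / denom
    rss = sum(wi * (yi - intercept - slope * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    sigma = math.sqrt(rss / (n - 2))
    se_intercept = math.sqrt(swxx / denom) * max(1.0, sigma)
    z = intercept / se_intercept
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return {"intercept": intercept, "slope": slope, "se": se_intercept, "pval": p_value}


result = egger_intercept(
    beta_x=[0.1, 0.2, 0.3, 0.4],
    beta_y=[0.07, 0.12, 0.17, 0.22],
    se_y=[0.01, 0.01, 0.02, 0.02],
)
print({key: round(value, 4) for key, value in result.items()})
```

An intercept whose p-value is above 0.05 gives no evidence of directional horizontal pleiotropy; a small p-value is the warning sign described above.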

6.3 Leave-one-out

If the result is robust, the confidence intervals should all lie to the right of the dotted line

The first row is the estimate obtained when rs3817334 is left out and the MR is rerun on the remaining SNPs.
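Leave-one-out simply repeats the IVW estimate with each SNP removed in turn. A short Python sketch of that loop follows, assuming the same IVW weights as before (beta_x^2 / se_y^2); the SNP IDs and numbers are hypothetical.

```python
# Sketch: leave-one-out IVW estimates.
# Assumptions: IVW weights beta_x^2 / se_y^2; SNP IDs and values are invented.


def ivw_estimate(snps):
    """IVW point estimate from a list of per-SNP summary statistics."""
    weights = [s["beta_x"] ** 2 / s["se_y"] ** 2 for s in snps]
    ratios = [s["beta_y"] / s["beta_x"] for s in snps]
    return sum(w * r for w, r in zip(weights, ratios)) / sum(weights)


def leave_one_out(snps):
    """Map each SNP ID to the IVW estimate computed without that SNP."""
    results = {}
    for index, left_out in enumerate(snps):
        remaining = snps[:index] + snps[index + 1:]
        results[left_out["SNP"]] = ivw_estimate(remaining)
    return results


snps = [
    {"SNP": "rs_a", "beta_x": 0.10, "beta_y": 0.020, "se_y": 0.010},
    {"SNP": "rs_b", "beta_x": 0.20, "beta_y": 0.050, "se_y": 0.012},
    {"SNP": "rs_c", "beta_x": 0.15, "beta_y": 0.030, "se_y": 0.011},
    {"SNP": "rs_d", "beta_x": 0.12, "beta_y": 0.027, "se_y": 0.010},
]
loo = leave_one_out(snps)
for snp_id, estimate in loo.items():
    print(snp_id, round(estimate, 4))
```

If every leave-one-out estimate keeps the same sign and similar size as the overall estimate, no single SNP is driving the result.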

Summary

Analyze with R

1 Extract the exposure data

2 Import the outcome data

The subsequent steps are the same

Screen each SNP for a second phenotype. If a second phenotype exists, the SNP may need to be excluded.

7 Statistical power calculation

Sample size: the total sample size

α: default 0.05

K: proportion of cases in the total sample

OR: the calculated OR value

R2: the sum of the r2 of all SNPs (60)
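The parameters listed above can be combined in a power formula. The Python sketch below uses one commonly used normal approximation for a binary outcome, power = Φ(|ln OR| · sqrt(N · R2 · K · (1 − K)) − z), where z is the critical value for a two-sided α. This is an assumption chosen for illustration: online calculators may use a different formula, and the OR is taken to be per standard deviation of the exposure. The input values are invented.

```python
# Sketch: approximate MR power for a binary outcome.
# Assumption: normal approximation
#   power = Phi(|ln OR| * sqrt(N * R2 * K * (1 - K)) - z_(1 - alpha/2)),
# with the OR expressed per standard deviation of the exposure.
import math
from statistics import NormalDist


def mr_power_binary(n_total, case_fraction, odds_ratio, r2, alpha=0.05):
    """Approximate power to detect the given causal odds ratio."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    signal = abs(math.log(odds_ratio)) * math.sqrt(
        n_total * r2 * case_fraction * (1.0 - case_fraction)
    )
    return NormalDist().cdf(signal - z_alpha)


power_small = mr_power_binary(n_total=10_000, case_fraction=0.3, odds_ratio=1.2, r2=0.02)
power_large = mr_power_binary(n_total=100_000, case_fraction=0.3, odds_ratio=1.2, r2=0.02)
print(round(power_small, 3), round(power_large, 3))
```

Power rises with the total sample size, with R2, and as the case proportion K approaches 0.5.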

Origin blog.csdn.net/qq_52813185/article/details/134521955