R data analysis: Mendelian randomization in practice

Many readers have asked about Mendelian randomization, so I will go through it once more and hope it helps. First, look at the figure below for a minute and fix it in your mind:

The figure above is the instrumental variable model (if instrumental variables are new to you, see the previous article). One point must be clear: in Mendelian randomization we care about the relationship between the exposure x and the outcome y, that is, the coefficient b. Regressing y on x directly gives a biased estimate, because many unobserved confounders U affect both. So we use an instrumental variable G (explained in detail in the earlier article on instrumental variables) to estimate the b we care about.

Nature gives us a good instrumental variable G: genetic variants. That gives us the figure above. To repeat, the quantity we want to estimate accurately is b. According to the figure, G affects y only through x, so the gene-outcome association (GY) equals the gene-exposure association (GX) multiplied by b, which gives b = GY / GX.

To use this formula, we need to know both GY and GX.

GY, the association between genetic variants and the outcome, has usually been studied already. We can take published GWAS summary statistics and use them directly.

The same is true of GX, the association between genetic variants and the exposure: published GWAS summary statistics are available for it as well.

In other words, Mendelian randomization lets us estimate the b we need, the effect of the exposure on the outcome, from published summary data. This is the Mendelian randomization study I am introducing again today.

That is the whole idea. If it has not sunk in yet, read it through a few more times.
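To make the ratio b = GY / GX concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration (sample size, effect sizes, a single variant G, one confounder U); it is not taken from any real GWAS.

```r
# Simulated data only: G is a genetic variant, U an unobserved confounder.
set.seed(1)
n      <- 50000
b_true <- 0.5                       # causal effect of X on Y (chosen for the simulation)
G <- rbinom(n, 2, 0.3)              # genotype coded 0/1/2
U <- rnorm(n)                       # confounder that affects both X and Y
X <- 0.4 * G + U + rnorm(n)         # exposure
Y <- b_true * X + U + rnorm(n)      # outcome

b_naive <- unname(coef(lm(Y ~ X))["X"])   # direct regression, biased by U
beta_gx <- unname(coef(lm(X ~ G))["G"])   # gene-exposure association (GX)
beta_gy <- unname(coef(lm(Y ~ G))["G"])   # gene-outcome association (GY)
b_wald  <- beta_gy / beta_gx              # ratio estimate of b

print(c(naive = b_naive, ratio = b_wald))
```

In this setup the direct regression lands well above 0.5 because U pushes X and Y in the same direction, while the ratio estimate stays close to the simulated value.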

Terminology

To make the idea easier to follow, a few terms that come up in Mendelian randomization practice need explaining:

Linkage disequilibrium: as noted above, a GWAS gives us gene-outcome and gene-exposure associations for many variants, so many SNPs can serve as instruments. We do not want these instruments to be correlated with each other, because correlated instruments count the same signal more than once and bias the result:

In practice the pattern looks like the figure above, and we cannot simply assume that SNPs are independent. When alleles at two loci occur together more or less often than expected if they were combined independently at random, the loci are correlated. This correlation between instruments is called linkage disequilibrium (LD), and its size can be measured by the LD r². It is one of the parameters we have to set during the analysis.
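As a small worked example of the r² measure, the sketch below computes it from haplotype frequencies for two biallelic SNPs. The four frequencies are hypothetical numbers picked for the illustration.

```r
# Hypothetical haplotype frequencies for two SNPs with alleles A/a and B/b.
p_AB <- 0.40; p_Ab <- 0.10; p_aB <- 0.10; p_ab <- 0.40

p_A <- p_AB + p_Ab          # frequency of allele A
p_B <- p_AB + p_aB          # frequency of allele B

# D: observed frequency of the A-B haplotype minus the frequency
# expected if the two alleles were combined independently.
D  <- p_AB - p_A * p_B
r2 <- D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))

print(c(D = D, r2 = r2))
```

Here the A-B haplotype is more common than independence would predict (0.40 against 0.25), so D is positive and r² is well above the small thresholds typically used when selecting instruments.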

Horizontal pleiotropy: to understand this concept, first look at the figure below:

Ideally we estimate b as ab / a. But looking at the figure above, can we rule out the path f, a direct effect of the variant on the outcome that bypasses the exposure? If f exists, the gene-outcome association is f + ab, and the original method then estimates (f + ab) / a = b + f/a rather than b, which is wrong (always remember that b is what we care about).

With many genetic variants there are many f values. If their expected mean is 0, the pooled estimate is still approximately b and the pleiotropy does little harm. The problem arises when the f values all lean to one side (all greater than 0 or all less than 0). This is called directional pleiotropy, and it is the reason we draw a funnel plot at the end.

If the instruments are spread symmetrically in the funnel plot, there is no sign of such bias, and we take it that directional pleiotropy is not distorting the result.
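The difference between balanced and directional pleiotropy can be sketched with simulated per-SNP effects. All numbers below are assumptions for illustration, and the pooled estimate is a plain average of the per-SNP ratios rather than any particular MR method.

```r
# Simulated per-SNP effects; a = SNP -> exposure, f = direct SNP -> outcome path.
set.seed(42)
k      <- 200
b_true <- 0.3
a <- runif(k, 0.05, 0.15)

f_balanced    <- rnorm(k, mean = 0,    sd = 0.01)  # centred on zero
f_directional <- rnorm(k, mean = 0.01, sd = 0.01)  # shifted to one side

# The gene-outcome association is f + a * b, so each ratio equals b + f / a.
ratio <- function(f) (f + a * b_true) / a

est_balanced    <- mean(ratio(f_balanced))
est_directional <- mean(ratio(f_directional))

print(c(balanced = est_balanced, directional = est_directional))
```

With balanced f the average stays near the simulated b, while the shifted f values move the average away from it, which is the situation the funnel plot is meant to reveal.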

Alright, now that some of the terms above are explained, let's get into practice.

Practice

The most basic example is BMI and CHD: BMI is the exposure and CHD is the outcome. The MR analysis takes only a few lines of code:

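library(TwoSampleMR)
bmi_exp_dat <- extract_instruments(outcomes = 'ieu-a-2')  # instruments for BMI (exposure)
chd_out_dat <- extract_outcome_data(snps = bmi_exp_dat$SNP, outcomes = 'ieu-a-7')  # same SNPs in the CHD GWAS
dat <- harmonise_data(bmi_exp_dat, chd_out_dat)  # align effect alleles
res <- mr(dat)  # run the default set of MR methods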

The results are as follows. The figure below shows the estimate of b from each method:

That completes the basic analysis. It really is that simple and fast.

Next comes the sensitivity analysis. The first step is a heterogeneity test across the instruments:

mr_heterogeneity(dat)

Running the code returns Cochran's Q statistic.
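For intuition about what Cochran's Q measures, here is a hand calculation in base R on hypothetical summary statistics for five SNPs. It uses one common form of the statistic (inverse-variance weights with a first-order standard error for each ratio) and may differ in detail from what mr_heterogeneity() computes.

```r
# Hypothetical per-SNP summary statistics (not from any real GWAS).
beta_exp <- c(0.10, 0.08, 0.12, 0.09, 0.11)       # SNP -> exposure
beta_out <- c(0.050, 0.036, 0.066, 0.041, 0.058)  # SNP -> outcome
se_out   <- c(0.010, 0.012, 0.011, 0.010, 0.013)  # SE of beta_out

ratio    <- beta_out / beta_exp        # per-SNP estimate of b
se_ratio <- se_out / abs(beta_exp)     # first-order approximation
w        <- 1 / se_ratio^2             # inverse-variance weights

b_ivw <- sum(w * ratio) / sum(w)       # pooled estimate
Q     <- sum(w * (ratio - b_ivw)^2)    # weighted spread around the pooled estimate
Q_df  <- length(ratio) - 1
p_value <- pchisq(Q, df = Q_df, lower.tail = FALSE)

print(c(b_ivw = b_ivw, Q = Q, df = Q_df, p = p_value))
```

A Q that is large relative to its degrees of freedom suggests that the SNPs disagree about b by more than sampling error would explain.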

Next is the horizontal pleiotropy test:

mr_pleiotropy_test(dat)

Running the code returns the MR-Egger intercept (egger_intercept) with its standard error and p-value.
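The Egger intercept can be pictured as the intercept of a weighted regression of the SNP-outcome effects on the SNP-exposure effects. The sketch below is a simplified version on hypothetical data; the helper egger_fit is made up for this example and leaves out the standard-error handling of the package function.

```r
# egger_fit is a hypothetical helper written for this illustration.
egger_fit <- function(beta_exp, beta_out, se_out) {
  flip <- ifelse(beta_exp < 0, -1, 1)   # orient SNPs so exposure effects are positive
  bx <- beta_exp * flip
  by <- beta_out * flip
  fit <- lm(by ~ bx, weights = 1 / se_out^2)
  c(intercept = unname(coef(fit)[1]), slope = unname(coef(fit)[2]))
}

# Hypothetical per-SNP summary statistics (not from any real GWAS).
beta_exp <- c(0.10, 0.08, 0.12, 0.09, 0.11)
beta_out <- c(0.050, 0.036, 0.066, 0.041, 0.058)
se_out   <- c(0.010, 0.012, 0.011, 0.010, 0.013)

est <- egger_fit(beta_exp, beta_out, se_out)
print(est)
```

An intercept close to zero is consistent with no directional pleiotropy, since the fitted line then passes near the origin as the basic ratio method assumes.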

Next is the single-SNP analysis:

res_single <- mr_singlesnp(dat)

This returns the estimate of b from each SNP on its own.

Next is the leave-one-out analysis:

mr_leaveoneout(dat)
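The idea behind leave-one-out is to re-estimate b with each SNP removed in turn and check whether any single SNP drives the result. A minimal base R sketch on hypothetical summary statistics, with a small inverse-variance weighted helper named ivw defined just for this example:

```r
# Hypothetical per-SNP summary statistics (not from any real GWAS).
beta_exp <- c(0.10, 0.08, 0.12, 0.09, 0.11)
beta_out <- c(0.050, 0.036, 0.066, 0.041, 0.058)
se_out   <- c(0.010, 0.012, 0.011, 0.010, 0.013)

# Inverse-variance weighted estimate of b (helper defined for this example).
ivw <- function(bx, by, se) sum(bx * by / se^2) / sum(bx^2 / se^2)

b_all <- ivw(beta_exp, beta_out, se_out)

# Drop SNP j, re-estimate, repeat for every j.
b_loo <- sapply(seq_along(beta_exp), function(j) {
  ivw(beta_exp[-j], beta_out[-j], se_out[-j])
})

print(round(c(all = b_all, b_loo), 4))
```

If the estimates barely move as each SNP is dropped, no single instrument dominates the pooled result.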

A paper usually includes several figures. The first is the scatter plot:

mr_scatter_plot(res, dat)

In the scatter plot, each SNP's effect on the exposure is on the horizontal axis and its effect on the outcome is on the vertical axis. The slope of each fitted line is the estimate of b.
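To see why a slope in this plot is an estimate of b, the sketch below fits the inverse-variance weighted line as a weighted regression through the origin and compares it with the closed-form weighted average. The data are hypothetical, and this is only one of the lines such a plot can show.

```r
# Hypothetical per-SNP summary statistics (not from any real GWAS).
beta_exp <- c(0.10, 0.08, 0.12, 0.09, 0.11)       # x-axis: SNP -> exposure
beta_out <- c(0.050, 0.036, 0.066, 0.041, 0.058)  # y-axis: SNP -> outcome
se_out   <- c(0.010, 0.012, 0.011, 0.010, 0.013)

# Regression through the origin, weighting each SNP by 1 / se_out^2.
fit <- lm(beta_out ~ 0 + beta_exp, weights = 1 / se_out^2)
slope_lm <- coef(fit)[["beta_exp"]]

# The same quantity written as a closed-form weighted sum.
slope_closed <- sum(beta_exp * beta_out / se_out^2) / sum(beta_exp^2 / se_out^2)

print(c(lm = slope_lm, closed_form = slope_closed))
```

Forcing the line through the origin encodes the assumption that a SNP with no effect on the exposure has no effect on the outcome.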

The forest plot of single-SNP effects comes from mr_forest_plot (applied to the mr_singlesnp result), mr_leaveoneout_plot draws the forest plot of the leave-one-out analysis (applied to the mr_leaveoneout result), and mr_funnel_plot draws the funnel plot (also from the mr_singlesnp result).
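For readers who want to see what goes into a funnel plot, here is a base R sketch that plots each SNP's estimate against its precision. It uses hypothetical summary statistics and a first-order standard error, so it is a picture of the idea rather than a reproduction of mr_funnel_plot.

```r
# Hypothetical per-SNP summary statistics (not from any real GWAS).
beta_exp <- c(0.10, 0.08, 0.12, 0.09, 0.11)
beta_out <- c(0.050, 0.036, 0.066, 0.041, 0.058)
se_out   <- c(0.010, 0.012, 0.011, 0.010, 0.013)

ratio     <- beta_out / beta_exp            # per-SNP estimate of b
precision <- abs(beta_exp) / se_out         # 1 / SE of the ratio (first-order)
b_ivw     <- sum(precision^2 * ratio) / sum(precision^2)

if (!interactive()) pdf(NULL)               # draw without writing a file in scripts
plot(ratio, precision,
     xlab = "Per-SNP estimate of b", ylab = "1 / SE")
abline(v = b_ivw, lty = 2)                  # pooled estimate as a reference line
if (!interactive()) invisible(dev.off())
```

Points scattered evenly on both sides of the dashed line suggest balanced pleiotropy; a cloud leaning to one side is the asymmetry to worry about.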

That is everything that needs to be reported.

The process above has prerequisites, though. For example, you need to know the GWAS IDs of the exposure and the outcome. There are many GWASs to choose from. The code above uses the GWASs in the MR Base GWAS catalog, but you can choose others or use the latest GWAS you have found.

The first step is to find summary data for the exposure in a suitable GWAS.

Which GWASs can we use? We can load the GWAS catalog directly:

library(MRInstruments)
data(gwas_catalog)

This loads about 150,000 GWAS association records. A screenshot follows:

Now we need to find the GWAS for the exposure we care about. For example, to find GWAS records related to a "blood" phenotype, I can write:

exposure_gwas <- subset(gwas_catalog, grepl("Blood", Phenotype_simple))

The code above filters on the Phenotype_simple column only. You can also combine other columns, such as population, author, or region.

After selecting the exposure-related GWAS records, the next step is to check how strongly the genetic instruments are associated with the exposure. Papers usually describe this as: "First, relevance assumption was met considering that all SNPs have reached genome-wide significance (p < 5 × 10^-8)".

The specific operation is as follows:

exposure_gwas <- exposure_gwas[exposure_gwas$pval < 5e-8, ]

These steps ensure that our genetic instruments are strongly associated with the exposure.
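The two filtering steps can be tried on a small stand-in table before running them on the full catalog. The data frame below is a toy with made-up rows; only the column names Phenotype_simple and pval come from the text above, and the Author column and SNP identifiers are assumptions for the example.

```r
# Toy stand-in for the catalog; every row is invented for illustration.
toy_catalog <- data.frame(
  Phenotype_simple = c("Blood pressure", "Blood pressure",
                       "Body mass index", "Blood lipids"),
  Author = c("Smith", "Lee", "Smith", "Lee"),
  SNP    = c("rs0001", "rs0002", "rs0003", "rs0004"),
  pval   = c(1e-10, 3e-6, 2e-9, 4e-12),
  stringsAsFactors = FALSE
)

# Step 1: filter on phenotype, here combined with a second column.
hits <- subset(toy_catalog, grepl("Blood", Phenotype_simple) & Author == "Lee")

# Step 2: keep only genome-wide significant associations.
hits <- hits[hits$pval < 5e-8, ]

print(hits)
```

Two rows match the phenotype and author filter, and only the one below the 5e-8 threshold survives the second step.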

Next, the prepared exposure GWAS data must be converted into a format that can be used for MR analysis, using the format_data() function:

exposure_data <- format_data(exposure_gwas)

exposure_data now looks like this:

There are many SNPs among the genetic instruments, so we now need to deal with linkage disequilibrium:

exposure_data<-clump_data(exposure_data, clump_r2 = 0.001)

In the code above, clump_r2 is the maximum LD r² allowed between retained SNPs. At this point we have screened the instruments by hand, which solves the problem of finding instruments. The other approach is to select instruments automatically. For example, if my exposure is BMI, I can write:

ao <- available_outcomes()  # table of GWASs that can be queried
subset(ao, grepl("body mass", trait))

From the output I see that one GWAS ID I can choose is ieu-b-40, and I can then extract the instruments automatically. Both approaches serve the same purpose:

exposure_data <- extract_instruments('ieu-b-40')
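Either route should leave a set of approximately independent SNPs. As a rough picture of what LD clumping does, the sketch below keeps the most significant SNP, drops everything correlated with it above the threshold, and repeats. The SNP names, p-values, and r² matrix are invented, and the function clump is a hypothetical simplification: real clumping also uses a distance window and a reference panel.

```r
# Toy inputs: four SNPs with made-up p-values and pairwise r² values.
snps <- data.frame(SNP  = c("s1", "s2", "s3", "s4"),
                   pval = c(1e-12, 1e-9, 1e-10, 1e-8),
                   stringsAsFactors = FALSE)
r2 <- matrix(0, 4, 4, dimnames = list(snps$SNP, snps$SNP))
diag(r2) <- 1
r2["s1", "s2"] <- r2["s2", "s1"] <- 0.8      # s1 and s2 are in strong LD
r2["s3", "s4"] <- r2["s4", "s3"] <- 0.0005   # s3 and s4 are nearly independent

# Hypothetical greedy clumping helper, for illustration only.
clump <- function(snps, r2, threshold = 0.001) {
  remaining <- snps$SNP[order(snps$pval)]    # most significant first
  kept <- character(0)
  while (length(remaining) > 0) {
    lead <- remaining[1]
    kept <- c(kept, lead)
    in_ld <- r2[lead, remaining] > threshold # includes the lead SNP itself
    remaining <- remaining[!in_ld]
  }
  kept
}

print(clump(snps, r2))
```

In this toy case s2 is dropped because it is in strong LD with the more significant s1, while s3 and s4 are both kept.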

Then extract the summary estimates for the outcome at those instruments. This also requires the GWAS ID of the outcome. For example, if the outcome I care about is systolic blood pressure, I can write:

outcome_gwas <- subset(ao, grepl("Systolic", trait))

The output lists all GWAS IDs related to systolic blood pressure. I choose the most recent one, from 2021, shown below:

The figure shows that its ID is ieu-b-5075, so I write:

outcome_data <- extract_outcome_data(
    snps = exposure_data$SNP, outcomes = "ieu-b-5075")

The rest of the MR analysis follows directly once the exposure and outcome data are merged with harmonise_data(), and the process is the same as before.
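The core of the merging step is making sure both datasets report effects for the same allele. The sketch below shows that single idea in base R on made-up rows: when the outcome data uses the opposite effect allele, the sign of its effect is flipped. It is a simplification and ignores strand issues and palindromic SNPs, which a full harmonisation has to handle.

```r
# Made-up exposure and outcome summary rows for three SNPs.
exposure <- data.frame(SNP = c("rs1", "rs2", "rs3"),
                       effect_allele = c("A", "C", "G"),
                       other_allele  = c("G", "T", "A"),
                       beta = c(0.10, 0.08, 0.12),
                       stringsAsFactors = FALSE)
outcome  <- data.frame(SNP = c("rs1", "rs2", "rs3"),
                       effect_allele = c("A", "T", "G"),   # rs2 is reported for the other allele
                       other_allele  = c("G", "C", "A"),
                       beta = c(0.05, -0.04, 0.06),
                       stringsAsFactors = FALSE)

m <- merge(exposure, outcome, by = "SNP", suffixes = c(".exposure", ".outcome"))

# A SNP is flipped when the two datasets swap effect and other allele.
flipped <- m$effect_allele.exposure == m$other_allele.outcome &
           m$other_allele.exposure  == m$effect_allele.outcome

# Flip the outcome effect so both betas refer to the exposure effect allele.
m$beta.outcome[flipped] <- -m$beta.outcome[flipped]

print(m[, c("SNP", "beta.exposure", "beta.outcome")])
```

After the flip, all three outcome effects are expressed per copy of the exposure effect allele, so per-SNP ratios are comparable.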

Summary

Today I walked through Mendelian randomization in practice. The examples come from the bilibili video "Mendelian Randomization in R Language", Chapter 1: "Using the MRBase web tool and the R package TwoSampleMR to do two-sample Mendelian randomization", by the MRC-IEU of the University of Bristol, UK (Chinese Mendelian Randomization series). Thank you for reading patiently.


Origin blog.csdn.net/tm_ggplot2/article/details/127812640