Applying the R package gWQS to weighted quantile sum (WQS) regression models

In epidemiological studies, simultaneous exposure to multiple factors is more common than exposure to a single factor. When traditional regression models are used to evaluate joint exposure, they run into problems such as high dimensionality and multicollinearity. Weighted quantile sum (WQS) regression addresses this by converting each exposure into quantile scores, combining the scores into a single weighted index, and then regressing the outcome on that index. The weight assigned to each factor reflects its relative contribution to the effect on the outcome. The model rests on one basic assumption: all factors in the mixture affect the outcome in the same direction.
The general form of the model is:
$$g(\mu) = \beta_0 + \beta_1 \left( \sum_{i=1}^{c} w_i q_i \right) + z'\varphi$$
In the formula: c is the number of factors (for example, pollutants) in the mixture; β0 is the intercept; β1 is the regression coefficient of the weighted index, whose sign is constrained to fix the direction of the joint effect on the outcome; wi is the unknown weight of the i-th factor, with a value range of [0, 1] and ∑wi = 1; qi is the quantile score of factor i (for example, its tertile or quartile category);
$$\mathrm{WQS} = \sum_{i=1}^{c} w_i q_i$$
The expression above is the weighted index of the c factors; z is the covariate matrix and φ is its vector of regression coefficients; g( ) is the link function, and μ is the mean of the outcome.
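To make the notation concrete, here is a minimal base-R sketch of how a weighted index of this kind could be built by hand. Everything in it is an assumption made for illustration: the data are simulated, the weights are picked arbitrarily, and the helper quantile_score() is hypothetical. It does not use gWQS, which estimates the weights from the data instead of fixing them.

```r
# Illustrative only: simulated exposures and arbitrary weights
set.seed(1)
n <- 200
X <- data.frame(x1 = rlnorm(n), x2 = rlnorm(n), x3 = rlnorm(n))

# Hypothetical helper: score an exposure into q quantile groups coded 0 to q - 1
quantile_score <- function(x, q = 4) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = q + 1))
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE)) - 1L
}
Q <- sapply(X, quantile_score)   # n x 3 matrix of quartile scores

# Arbitrary weights: each in [0, 1], summing to 1
w <- c(x1 = 0.6, x2 = 0.3, x3 = 0.1)

# Weighted index: sum over factors of w_i * q_i
wqs <- as.vector(Q %*% w)

# Simulated outcome, then the regression g(mu) = b0 + b1 * index (identity link)
y <- 1 + 0.5 * wqs + rnorm(n)
fit <- glm(y ~ wqs, family = gaussian)
coef(fit)
```

With quartile scores coded 0 to 3 and weights summing to 1, the index itself always lies between 0 and 3, which is why β1 can be read as the effect of a one-quantile increase in the whole mixture.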

Let's walk through an example. First, load the packages and the data. We use the wqs_data dataset that ships with gWQS.

library(gWQS)
library(ggplot2)
library(reshape2)
data(wqs_data)

The dataset is fairly large. It contains 59 exposure concentrations simulated from the distributions of 34 PCBs and 25 phthalate biomarkers measured in participants of the NHANES study (2001-2002). It also includes continuous and categorical outcome variables, with sex as a covariate.
The idea of the WQS regression model is to combine the exposures into a single index, so the first step is to decide which exposures to study. Here we use the first 34 columns, which are the PCBs:

PCBs <- names(wqs_data)[1:34]
PCBs

Next, fit the model. The formula y ~ wqs+sex regresses the outcome y on the joint effect of the 34 PCBs, adjusted for sex. The term wqs is mandatory: it stands for the weighted index and must appear in the formula. The remaining arguments are as follows.
mix_name=PCBs specifies the exposures that make up the mixture.
data=wqs_data specifies the input dataset.
q=10 scores each exposure into 10 quantile groups (deciles); in practice, researchers can choose a different number of quantiles.
validation=0.6 randomly selects 60% of the dataset as the validation set and uses the remaining 40% as the training set.
b=100 is the number of bootstrap samples; at least 100 is recommended.
b1_pos=TRUE assumes that the joint effect (β1) is positive; set it to FALSE if the effect is assumed to be negative.
b1_constr=FALSE applies no constraint on the sign of β1 when the optimization algorithm estimates the weights; set it to TRUE to apply the constraint.
family="gaussian" fits a Gaussian model; binomial, multinomial, or Poisson families can be used instead, depending on the type of outcome.
seed=2021 fixes the random seed, because the bootstrap involves random sampling.

results2i <- gwqs(y ~ wqs + sex, mix_name = PCBs, data = wqs_data,
                  q = 10, validation = 0.6, b = 100, b1_pos = TRUE,
                  b1_constr = FALSE, family = "gaussian", seed = 2021)
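To show roughly what the validation, b, and b1_pos arguments control, here is a simplified base-R sketch of the procedure on simulated data. It is not gWQS's implementation. The softmax reparameterization, the least-squares loss, the plain averaging of bootstrap weights, and the helper names softmax() and fit_once() are all assumptions made for illustration, and gWQS's own optimizer and averaging may differ.

```r
set.seed(2021)
n <- 300

# Simulated quartile scores (0 to 3) for three made-up exposures
Q <- matrix(sample(0:3, n * 3, replace = TRUE), ncol = 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))
y <- 1 + as.vector(Q %*% c(0.7, 0.2, 0.1)) + rnorm(n)

# validation = 0.6: 60% of rows for validation, the remaining 40% for training
val_rows   <- sample(n, size = round(0.6 * n))
train_rows <- setdiff(seq_len(n), val_rows)

# Softmax keeps every weight in [0, 1] and makes the weights sum to 1
softmax <- function(theta) {
  e <- exp(theta - max(theta))
  e / sum(e)
}

# Least-squares estimate of b0, b1 and the weights on one set of rows
fit_once <- function(rows) {
  loss <- function(par) {
    w <- softmax(par[-(1:2)])
    index <- as.vector(Q[rows, , drop = FALSE] %*% w)
    sum((y[rows] - par[1] - par[2] * index)^2)
  }
  start <- c(mean(y[rows]), 0.1, rep(0, ncol(Q)))
  opt <- optim(start, loss, method = "BFGS")
  list(b1 = opt$par[2], w = softmax(opt$par[-(1:2)]))
}

# b = 20 bootstrap samples drawn from the training rows
b <- 20
fits <- lapply(seq_len(b), function(i) fit_once(sample(train_rows, replace = TRUE)))

# Analogue of b1_pos = TRUE: keep only samples whose b1 estimate is positive
keep <- vapply(fits, function(f) f$b1 > 0, logical(1))
stopifnot(any(keep))
W <- do.call(rbind, lapply(fits[keep], function(f) f$w))
colnames(W) <- colnames(Q)
mean_weight <- colMeans(W)

# Final model on the validation rows, using the averaged weights
wqs <- as.vector(Q[val_rows, ] %*% mean_weight)
final_fit <- glm(y[val_rows] ~ wqs, family = gaussian)
coef(final_fit)
```

Estimating the weights on one part of the data and testing the index on another is what keeps the final coefficient from being overly optimistic, which is the reason for the training/validation split.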

Inspect the fitted model. In the summary, the coefficient of wqs shows that the joint index is associated with the outcome.

summary(results2i)

You can also use gwqs_summary_tab() to produce a formatted summary table:

gwqs_summary_tab(results2i)

You can also extract the coefficients and their confidence intervals:

summary(results2i)[["coefficients"]]
confint(results2i)

Next, let's look at the weight of each pollutant in the index:

gwqs_weights_tab(results2i)

The weights can also be read directly from the fitted object:

results2i$final_weights
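One common way to read the weights is to compare each one with the equal-weight value 1/c, where c is the number of components. The sketch below does this in base R on a small made-up data frame that has the same two columns (mix_name and mean_weight) as results2i$final_weights. The component names and numbers are hypothetical, and the 1/c cutoff is a convention, not a formal test.

```r
# Hypothetical stand-in with the same two columns as results2i$final_weights
final_weights <- data.frame(
  mix_name    = c("chem_a", "chem_b", "chem_c", "chem_d", "chem_e"),
  mean_weight = c(0.40, 0.25, 0.18, 0.12, 0.05)
)

# Equal-weight cutoff: 1 / number of components
threshold <- 1 / nrow(final_weights)

# Components whose weight exceeds the cutoff, largest first
important <- final_weights[final_weights$mean_weight > threshold, ]
important <- important[order(important$mean_weight, decreasing = TRUE), ]

# Share of the index explained by these components so far
important$cum_weight <- cumsum(important$mean_weight)
important
```

To apply the same steps to the fitted model, replace the made-up data frame with results2i$final_weights.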

We can visualize the weights with a bar chart. It shows that the four pollutants with the largest weights contribute most to the effect on the outcome.

gwqs_barplot(results2i)

We can also extract the weights and draw the chart with ggplot2, which gives more control over its appearance.

w_ord <- order(results2i$final_weights$mean_weight)
mean_weight <- results2i$final_weights$mean_weight[w_ord]

mix_name <- factor(results2i$final_weights$mix_name[w_ord],
                   levels = results2i$final_weights$mix_name[w_ord])
dataplot <- data.frame(mean_weight, mix_name)

ggplot(dataplot, aes(x = mix_name, y = mean_weight, fill = mix_name)) +
  geom_bar(stat = "identity", color = "black") + theme_bw() +
  theme(axis.ticks = element_blank(),
        axis.title = element_blank(),
        axis.text.x = element_text(color='black'),
        legend.position = "none") + coord_flip()

Plot the outcome against the WQS index. The fitted line shows a positive association.

gwqs_scatterplot(results2i)

Plot the residuals against the fitted values to check whether they scatter randomly around 0 or show a trend:

gwqs_fitted_vs_resid(results2i)

We can also draw boxplots of the weights. These require repeated holdout validation, so we first fit the model with the gwqsrh function:

results3i <- gwqsrh(y ~ wqs + sex, mix_name = PCBs, data = wqs_data,
                    q = 10, validation = 0.6, b = 5, b1_pos = TRUE, seed = 2021,
                    b1_constr = FALSE, family = "gaussian", future.seed = TRUE)

Then draw the boxplot:

gWQS::gwqsrh_boxplot(results3i)

The repeated holdout weights can also be extracted and plotted with ggplot2:

wboxplot <- melt(results3i$wmat, varnames = c("rh", "mix_name"))

wboxplot$mix_name <- factor(wboxplot$mix_name, levels = results3i$final_weights$mix_name)

ggplot(wboxplot, aes(x = mix_name,  y = value,fill=mix_name))+
  geom_boxplot()+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45,  hjust = 1)) 

The plot can be refined further, for example by adding the mean and the individual points:

ggplot(wboxplot, aes(x = mix_name,  y = value,fill=mix_name))+
  geom_boxplot()+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45,  hjust = 1))+
  ylab("Weight (%)") + stat_summary(fun = mean, geom = "point", shape = 18, size = 3) + 
  geom_jitter(alpha = 0.3)
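The spread shown in the boxplots can also be summarized numerically. The sketch below computes the median and the 2.5th and 97.5th percentiles of each component's weight across repetitions. It runs on a small made-up matrix laid out like results3i$wmat (one row per repetition, one column per component), and all names and values in it are hypothetical.

```r
# Hypothetical stand-in for results3i$wmat: 5 repetitions x 3 components
wmat <- matrix(
  c(0.50, 0.30, 0.20,
    0.60, 0.30, 0.10,
    0.40, 0.40, 0.20,
    0.55, 0.25, 0.20,
    0.50, 0.35, 0.15),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("chem_a", "chem_b", "chem_c"))
)

# One row per component: 2.5th percentile, median, 97.5th percentile
weight_summary <- t(apply(wmat, 2, quantile, probs = c(0.025, 0.5, 0.975)))
weight_summary
```

With only a handful of repetitions the percentiles are rough, so in practice the number of repetitions would be much larger than in this toy matrix.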

If the outcome is binary, we can also draw a ROC curve. To do so, we refit the model with the binary outcome ybin and the binomial family.

results4i <- gwqs(ybin ~ wqs + sex, mix_name = PCBs, data = wqs_data,
                  q = 10, validation = 0.6, b = 100, b1_pos = TRUE,
                  b1_constr = FALSE, family = "binomial", seed = 2021)

gwqs_ROC(results4i, wqs_data)
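For readers who want to see what a ROC curve measures, here is a base-R sketch that computes the ROC points and the area under the curve (AUC) by hand for a logistic model. The data are simulated and the variable names are chosen for illustration. It does not reproduce the internals of gwqs_ROC.

```r
set.seed(2021)
n <- 400

# Simulated index and binary outcome (illustrative values only)
wqs  <- runif(n, 0, 9)
ybin <- rbinom(n, 1, plogis(-2 + 0.4 * wqs))

# Logistic regression and predicted probabilities
fit <- glm(ybin ~ wqs, family = binomial)
p   <- fitted(fit)

# ROC points: true and false positive rates at every observed cutoff
cutoffs <- sort(unique(p), decreasing = TRUE)
tpr <- vapply(cutoffs, function(k) mean(p[ybin == 1] >= k), numeric(1))
fpr <- vapply(cutoffs, function(k) mean(p[ybin == 0] >= k), numeric(1))

# AUC from the rank-sum (Mann-Whitney) formula
r   <- rank(p)
n1  <- sum(ybin == 1)
n0  <- sum(ybin == 0)
auc <- (sum(r[ybin == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```

Calling plot(fpr, tpr, type = "l") on these vectors draws the curve. An AUC of 0.5 corresponds to random guessing and 1 to perfect discrimination.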

References

  1. gWQS package documentation.
  2. Carrico C, Gennings C, Wheeler D C, et al. Characterization of Weighted Quantile Sum Regression for Highly Correlated Data in a Risk Analysis Setting [J]. Journal of Agricultural, Biological, and Environmental Statistics, 2014. DOI: 10.1007/s13253-014-0180-3.
  3. Li Juejun, Huang Junli, Chen Haijian, Mo Chunbao. Application of the weighted quantile sum regression model and its implementation with R software [J]. Preventive Medicine, 2023, 35(3): 275-276. DOI: 10.19485/j.cnki.issn2096-5087.2023.03.021.
  4. https://blog.csdn.net/qq_42458954/article/details/120157806
  5. https://blog.csdn.net/weixin_42812146/article/details/126192945

Origin blog.csdn.net/dege857/article/details/134684122