In epidemiological studies, simultaneous exposure to multiple factors is more common than exposure to a single factor. When traditional regression models are used to evaluate joint exposure to multiple factors, they run into problems such as high dimensionality and multicollinearity. The basic principle of the weighted quantile sum (WQS) regression model is to combine the effects of multiple factors into a single index, by scoring each factor into quantiles and weighting the scores, and then to regress the outcome on that index. The weight assigned to each factor reflects its relative contribution to the effect on the outcome. The model rests on the basic assumption that all factors affect the outcome in the same direction.
The general form of the model is:
$$g(\mu) = \beta_0 + \beta_1 \left( \sum_{i=1}^{c} w_i q_i \right) + z'\varphi$$
In the formula, c is the number of pollutants (research factors) in the mixture; \beta_0 is the intercept; \beta_1 is the regression coefficient of the index, whose sign is constrained to fix the direction of the joint effect on the outcome; w_i is the unknown weight of the i-th factor, with 0 \le w_i \le 1 and \sum_{i=1}^{c} w_i = 1; q_i is the quantile score of factor i (for example its tertile or quartile). The term
$$\mathrm{WQS} = \sum_{i=1}^{c} w_i q_i$$
is the weighted index of the c research factors; z is the covariate matrix and \varphi is its vector of regression coefficients; g(\cdot) is the link function, and \mu is the mean of the outcome.
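To make the index concrete, here is a minimal base-R sketch of how quantile scores and a weighted sum could be computed. The exposures are simulated, the weights are made up, and the helper quantile_score() is a hypothetical name; the 0 to q-1 coding of the scores is an assumption of this sketch, not a statement about gWQS internals.

```r
# Illustrative sketch only: simulated exposures and made-up weights.
set.seed(1)
n <- 200
X <- data.frame(x1 = rlnorm(n), x2 = rlnorm(n), x3 = rlnorm(n))

# Hypothetical helper: score an exposure into q quantile bins coded 0..q-1.
quantile_score <- function(x, q = 4) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = q + 1))
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE)) - 1L
}
Q <- sapply(X, quantile_score, q = 4)   # n x 3 matrix of quartile scores

# Made-up weights: each lies in [0, 1] and together they sum to 1.
w <- c(x1 = 0.5, x2 = 0.3, x3 = 0.2)

# WQS index: the weighted sum of quantile scores for each subject.
wqs <- as.vector(Q %*% w)
range(wqs)
```

Because the weights sum to 1, the index stays on the same scale as the quantile scores, which is why the coefficient of the index can be read as the effect of moving the whole mixture up by one quantile.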
Let's demonstrate it below. First, load the R packages and the data. We use wqs_data, the example data set that comes with gWQS.
library(gWQS)
library(ggplot2)
library(reshape2)
data(wqs_data)
The data set is fairly large, so any preview shows only part of it. It contains 59 exposure concentrations (34 PCBs and 25 phthalates) simulated from the distribution of the biomarkers measured in subjects participating in the NHANES study (2001-2002). The outcomes include both continuous and categorical variables, and sex is available as a covariate.
The idea of the WQS regression model is to combine the indicators into a single index, so the first step is to decide which indicators to study. Suppose we study the first 34 indicators, that is, the PCBs:
PCBs <- names(wqs_data)[1:34]
PCBs
Then you can fit the model. The formula y ~ wqs+sex regresses the outcome y on the joint effect of the 34 PCBs and adjusts for sex. The arguments are as follows:
- wqs is a fixed term that must appear in the formula; it stands for the WQS index.
- mix_name=PCBs specifies the pollutants that make up the joint exposure.
- data=wqs_data specifies the input data set.
- q=10 scores each exposure into 10 quantiles (deciles); in practice, researchers can choose a different number of quantiles.
- validation=0.6 randomly selects 60% of the data set as the validation set; the remaining 40% is used as the training set.
- b is the number of bootstrap samples; it should be at least 100.
- b1_pos=TRUE sets the direction of the joint effect (the coefficient β1) to positive; set it to FALSE if the effect is negative.
- b1_constr=FALSE means that no constraint is imposed when the optimization algorithm estimates the weights; set it to TRUE to impose the constraint.
- family="gaussian" fits a Gaussian model; binomial, multinomial or Poisson families can be used instead, depending on the type of outcome.
- seed=2021 fixes the random seed, because the bootstrap involves random sampling.
results2i <- gwqs(y ~ wqs + sex, mix_name = PCBs, data = wqs_data,
                  q = 10, validation = 0.6, b = 100, b1_pos = TRUE,
                  b1_constr = FALSE, family = "gaussian", seed = 2021)
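The roles of validation, b and the weight constraints are easier to see in a stripped-down version of the workflow. The following is a simplified base-R sketch on simulated data, not the gWQS implementation: the softmax reparameterization, the least-squares objective and the plain averaging of bootstrap weights are assumptions chosen to keep the example short, and all names (softmax, rss, n_boot, w_hat) are hypothetical.

```r
# Simplified sketch of split -> bootstrap -> average weights -> validate.
set.seed(2021)
n <- 500; k <- 3
Q <- matrix(sample(0:3, n * k, replace = TRUE), nrow = n, ncol = k)  # quantile scores
true_w <- c(0.6, 0.3, 0.1)                                           # made-up weights
y <- 1 + 0.8 * as.vector(Q %*% true_w) + rnorm(n)

# 60% validation, 40% training (the analogue of validation = 0.6).
valid_idx <- sample(n, size = round(0.6 * n))
train_idx <- setdiff(seq_len(n), valid_idx)

# Softmax keeps every weight in [0, 1] and makes the weights sum to 1.
softmax <- function(a) { e <- exp(a - max(a)); e / sum(e) }

# Residual sum of squares for par = (beta0, beta1, unconstrained weight parameters).
rss <- function(par, Q, y) {
  w <- softmax(par[-(1:2)])
  sum((y - par[1] - par[2] * as.vector(Q %*% w))^2)
}

# Estimate the weights on bootstrap samples of the training set.
n_boot <- 20   # kept small for speed; the text above asks for at least 100
W <- t(replicate(n_boot, {
  idx <- sample(train_idx, replace = TRUE)
  start <- c(mean(y[idx]), 0.1, rep(0, k))
  fit <- optim(start, rss, Q = Q[idx, , drop = FALSE], y = y[idx], method = "BFGS")
  softmax(fit$par[-(1:2)])
}))
w_hat <- colMeans(W)   # average the weights over bootstrap samples

# Build the index on the validation set and test its association with y.
wqs_valid <- as.vector(Q[valid_idx, , drop = FALSE] %*% w_hat)
y_valid <- y[valid_idx]
fit_valid <- lm(y_valid ~ wqs_valid)
round(w_hat, 3)
coef(fit_valid)
```

Estimating the weights on one part of the data and testing the index on another is what keeps the reported coefficient from being overly optimistic.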
In the summary, the coefficient of the wqs term shows that the joint index is associated with the outcome.
summary(results2i)
You can also use the gwqs_summary_tab function to produce a formatted summary table
gwqs_summary_tab(results2i)
You can also view coefficients and confidence intervals
summary(results2i)[["coefficients"]]
confint(results2i)
Next, let's check the weight that each pollutant contributes to the index
gwqs_weights_tab(results2i)
Alternatively, you can access the weights directly from the fitted object
results2i$final_weights
We can visualize the weights with a bar chart. It shows that the four indicators with the largest weights contribute most to the effect on the outcome.
gwqs_barplot(results2i)
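A common way to read such a chart is to compare each weight with the equal-weight reference 1/c, where c is the number of components. The sketch below applies that cutoff to a small made-up table; the column names mix_name and mean_weight follow the final_weights object used in this article, but the values are invented for illustration and are not output from the model above.

```r
# Made-up weights for illustration; not output from the fitted model.
final_weights <- data.frame(
  mix_name    = c("A", "B", "C", "D", "E"),
  mean_weight = c(0.40, 0.25, 0.22, 0.08, 0.05)
)

# Equal-weight reference: 1 / (number of components in the mixture).
tau <- 1 / nrow(final_weights)

# Keep the components whose mean weight exceeds the reference.
important <- final_weights[final_weights$mean_weight > tau, ]
as.character(important$mix_name)
```

With a real fit, the same lines could be run on results2i$final_weights instead of the made-up table.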
We can also extract the weights and draw the chart with ggplot2, which gives more control over its appearance.
w_ord <- order(results2i$final_weights$mean_weight)
mean_weight <- results2i$final_weights$mean_weight[w_ord]
mix_name <- factor(results2i$final_weights$mix_name[w_ord],
                   levels = results2i$final_weights$mix_name[w_ord])
dataplot <- data.frame(mean_weight, mix_name)
ggplot(dataplot, aes(x = mix_name, y = mean_weight, fill = mix_name)) +
  geom_bar(stat = "identity", color = "black") +
  theme_bw() +
  theme(axis.ticks = element_blank(),
        axis.title = element_blank(),
        axis.text.x = element_text(color = "black"),
        legend.position = "none") +
  coord_flip()
Plot the outcome against the WQS index, and you can see that the two are positively associated.
gwqs_scatterplot(results2i)
Plot the residuals and you can check if they are randomly distributed around 0 or if there is a trend
gwqs_fitted_vs_resid(results2i)
We can also draw boxplots of the weights. For this we first need to fit the model with the gwqsrh function (the repeated holdout version of WQS), which produces the results used below
results3i <- gwqsrh(y ~ wqs + sex, mix_name = PCBs, data = wqs_data,
                    q = 10, validation = 0.6, b = 5, b1_pos = TRUE, seed = 2021,
                    b1_constr = FALSE, family = "gaussian", future.seed = TRUE)
Plot after generating results
gWQS::gwqsrh_boxplot(results3i)
You can also extract the weight matrix and plot it with ggplot2
wboxplot <- melt(results3i$wmat, varnames = c("rh", "mix_name"))
wboxplot$mix_name <- factor(wboxplot$mix_name,
                            levels = results3i$final_weights$mix_name)
ggplot(wboxplot, aes(x = mix_name, y = value, fill = mix_name)) +
  geom_boxplot() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
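You can also refine the plot, for example by marking the mean and overlaying the individual points
ggplot(wboxplot, aes(x = mix_name, y = value, fill = mix_name)) +
  geom_boxplot() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("Weight (%)") +
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3) +
  geom_jitter(alpha = 0.3)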
If the outcome is binary, we can also draw a ROC curve. To do so, we refit the model with the binary outcome ybin and family="binomial".
results4i <- gwqs(ybin ~ wqs + sex, mix_name = PCBs, data = wqs_data,
                  q = 10, validation = 0.6, b = 100, b1_pos = TRUE,
                  b1_constr = FALSE, family = "binomial", seed = 2021)
gwqs_ROC(results4i, wqs_data)
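For readers who want to see what a ROC curve summarizes, here is a small base-R sketch that computes one by hand. Everything in it is an assumption for illustration: the index is simulated with runif as a stand-in for a WQS index, the outcome is simulated, and the curve is evaluated in-sample, so it is not a reproduction of what gwqs_ROC does.

```r
# Illustrative sketch: ROC curve and AUC from a logistic model on simulated data.
set.seed(2021)
n <- 300
wqs <- runif(n, 0, 3)                        # stand-in for a WQS index
ybin <- rbinom(n, 1, plogis(-1.5 + wqs))     # simulated binary outcome
fit <- glm(ybin ~ wqs, family = binomial)
p <- predict(fit, type = "response")         # predicted probabilities

# Sweep the classification threshold from 1 down to 0.
thr <- sort(unique(c(0, p, 1)), decreasing = TRUE)
tpr <- sapply(thr, function(t) mean(p[ybin == 1] >= t))   # sensitivity
fpr <- sapply(thr, function(t) mean(p[ybin == 0] >= t))   # 1 - specificity

# Area under the curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate")
abline(0, 1, lty = 2)                        # reference line for a random classifier
auc
```

An AUC close to 0.5 means the index discriminates no better than chance, while values closer to 1 indicate better discrimination.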