NHANES数据库的介绍及使用(二)

前一篇介绍了NHANES数据库的加权及数据的下载NHANSE数据库的介绍及使用(一)_Christina-CSDN博客,这一篇主要介绍数据库如何导入软件进行下一步计算合并。

例一:

以NHANSE数据库的文章为例(Brody DJ, Pratt LA, Hughes J. Prevalence of depression among adults aged 20 and over: United States, 2013-2016. NCHS Data Brief, no 303. Hyattsville, MD: National Center for Health Statistics. 2018.)

1.加载安装包

library(dplyr)
library(survey)

2.下载数据

此步骤可以在官网上下载,或使用软件下载。

# Download & Read SAS Transport Files
# Demographic (DEMO)
download.file("https://wwwn.cdc.gov/nchs/nhanes/2013-2014/DEMO_H.XPT", tf <- tempfile(), mode="wb")
DEMO_H <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU","WTMEC2YR")]
download.file("https://wwwn.cdc.gov/nchs/nhanes/2015-2016/DEMO_I.XPT", tf <- tempfile(), mode="wb")
DEMO_I <- foreign::read.xport(tf)[,c("SEQN","RIAGENDR","RIDAGEYR","SDMVSTRA","SDMVPSU","WTMEC2YR")]

# Mental Health - Depression Screener (DPQ) 
download.file("http://wwwn.cdc.gov/nchs/nhanes/2013-2014/DPQ_H.XPT", tf <- tempfile(), mode="wb")
DPQ_H <- foreign::read.xport(tf)
download.file("http://wwwn.cdc.gov/nchs/nhanes/2015-2016/DPQ_I.XPT", tf <- tempfile(), mode="wb")
DPQ_I <- foreign::read.xport(tf)

3.合并数据

# Append Files
DEMO <- bind_rows(DEMO_H, DEMO_I)
DPQ <- bind_rows(DPQ_H, DPQ_I)

# Merge DEMO and DPQ files and create derived variables

One <- left_join(DEMO, DPQ, by="SEQN") %>%
  # Set 7=Refused and 9=Don't Know To Missing for variables DPQ010 thru DPQ090 ##
  mutate_at(vars(DPQ010:DPQ090), ~ifelse(. >=7, NA, .)) %>%
  mutate(. , 
         # create indicator for overall summary
         one = 1,
         # Create depression score as sum of variables DPQ010 -- DPQ090
         Depression.Score = rowSums(select(. , DPQ010:DPQ090)),
         # Create depression indicator as binary 0/100 variable. (is missing if Depression.Score is missing)
         Depression= ifelse(Depression.Score >=10, 100, 0), 
         # Create factor variables
         Gender = factor(RIAGENDR, labels=c("Men", "Women")),
         Age.Group = cut(RIDAGEYR, breaks=c(-Inf,19,39,59,Inf),labels=c("Under 20", "20-39","40-59","60 and over")),
         # Generate 4-year MEC weight (Divide weight by 2 because we are appending 2 survey cycles)                                                             
         # Note: using the MEC Exam Weights (WTMEC2YR), per the analytic notes on the               
         #       Mental Health - Depression Screener (DPQ_H) documentation                          
         WTMEC4YR = WTMEC2YR/2 ,
         # Define indicator for analysis population of interest: adults aged 20 and over with a valid depression score
         inAnalysis= (RIDAGEYR >= 20 & !is.na(Depression.Score))
  ) %>% 
  # drop DPQ variables
  select(., -starts_with("DPQ"))

由于使用了两年的数据,因此weight需要计算,WTMEC4YR=WTMEC2YR

4.定义survey数据集

NHANES_all <- svydesign(data=One, id=~SDMVPSU, strata=~SDMVSTRA, 
                        weights=~WTMEC4YR, nest=TRUE)

选择子集

NHANES <- subset(NHANES_all, inAnalysis)

5.统计分析

计算加权均值及标准差,定义函数

getSummary <- function(varformula, byformula, design){
  # Get mean, stderr, and unweighted sample size
  c <- svyby(varformula, byformula, design, unwtd.count ) 
  p <- svyby(varformula, byformula, design, svymean ) 
  outSum <- left_join(select(c,-se), p) 
  outSum
}

计算抑郁分层的结果 

getSummary(~Depression, ~one, NHANES)
#' By sex
getSummary(~Depression, ~Gender, NHANES)
#' By age
getSummary(~Depression, ~Age.Group, NHANES)
#' By sex and age
getSummary(~Depression, ~Gender + Age.Group, NHANES)

注意,在NHANSE数据库使用过程中,首先要定义survey数据集,再进行subset运算,  不能直接subset取子集计算,否则会导致有偏估计。

例二:以今年发表在EST上的文献为例

Exposure: chloroform (TCM); bromodichloromethane (BDCM); dibromochloromethane (DBCM); bromoform (TBM)

Outcome: thyroid function (FT4;FT3; TT4;TT3; TPOAb; TgAb)

Exclusion criterion: thyroid diseases; prescription medications, pregnant status, <20 years old

Covariates: demographic data; serum cotinine

Year: 2007-2008

1.数据下载

根据文献中的暴露变量及结局变量,协变量,下载相应数据集。

2.数据导入


setwd("C:\\Users\\18896\\Desktop\\NHANSE20211110\\example1")
library(foreign)

DEMO_E <- read.xport("DEMO_E.XPT")
BMX_E<-read.xport("BMX_E.XPT")
MCQ_E <- read.xport("MCQ_E.XPT")
RHQ_E <- read.xport("RHQ_E.XPT")
RXQ_RX_E <- read.xport("RXQ_RX_E.XPT")
RXQ_RX_E<-subset(RXQ_RX_E,!duplicated(RXQ_RX_E$SEQN))

THYROD_E <- read.xport("THYROD_E.XPT")
VOCMWB_E <- read.xport("VOCMWB_E.XPT")

3.数据合并 

##数据库整合
##并集(Union)
data_E <- DEMO_E
data_E <- merge(data_E, MCQ_E, by = "SEQN", all = T)
data_E <- merge(data_E, BMX_E, by = "SEQN", all = T)
data_E <- merge(data_E, RHQ_E, by = "SEQN", all = T)
data_E <- merge(data_E, RXQ_RX_E, by = "SEQN", all = T)
data_E <- merge(data_E, THYROD_E, by = "SEQN", all = T)
data_E <- merge(data_E, VOCMWB_E, by = "SEQN", all = T)

SEQN为唯一ID识别码,注意merge时,使用all=T,否则会丢失样本。

4.数据重命名

在进行数据计算时,需要将我们选择的变量进行重新命名以更好识别

data_new <- plyr::rename(data_E,
                         c(RIDAGEYR="age",DMDEDUC2="Education",
                           RIDEXPRG="pregnant.status",RIAGENDR="Gender",
                           RIDRETH1="race",
                           BMXWT="weight",BMXHT="height",BMXBMI="BMI",
                           LBXVBF="Bromoform",LBXVBM="Bromodichloromethane",LBXVCF="Chloroform",
                           LBXVCM="Dibromochloromethane",
                           LBXT3F="FT3",LBXT4F="FT4",LBXTT3="TT3",LBXTT4="TT4",
                           LBXTPO="TPOAb",LBXATG="TgAb",
                           RXDUSE="medication",MCQ160M="thyroid.deseases"
                         ))

5.数据加权


library(survey)
design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, 
                    weights=~WTMEC2YR, nest=TRUE,data=data_new)

design_new<-subset(design,SEQN%in%VOCMWB_E$SEQN & age>=20 & thyroid.deseases!=1 &
                     !is.na(FT4) & !is.na(FT3) &!is.na(TT3) &!is.na(TT4) &
                     !is.na(TPOAb) & !is.na(TgAb))

data_2<-subset(data_new,SEQN%in%VOCMWB_E$SEQN & age>=20 & 
                 thyroid.deseases!=1 &
                 !is.na(FT4) & !is.na(FT3) &!is.na(TT3) &!is.na(TT4) &
                 !is.na(TPOAb) & !is.na(TgAb))

6.一般情况分析

计算了数据中的年龄及种族加权及未加权的均值或比例,可以看出加权及未加权结果有很大差异,对数据进行基线信息描述时,应该使用加权结果。

#unweighted age and se
mean(data_2$age,na.rm=T)
#49.54916

# weighted age and se
svymean(~age, design_new, na.rm = TRUE)
#45.874

#' Proportion of unweighted interview sample 
data_2 %>% count(race) %>% 
  mutate(prop= round(n / sum(n)*100, digits=1))

#' Proportion of weighted interview sample
data_2 %>% count(race, wt=WTMEC2YR) %>%
  mutate(prop= round(n / sum(n)*100, digits=1))

具体在论文中呈现时,可以参考以下方式

7.svyglm分析

使用常规的glm和weighted glm会对结果进行有偏估计,应该在构建survey数据库的基础上,进行svyglm分析,以下是三个方法的比较。

#glm
Result2 <- glm(TT4~Bromoform+age+Gender+race+BMI+Education,
                  family = gaussian(), data=data_2)
summary(Result2)

#weighted glm
Result3 <- glm(TT4~Bromoform+age+Gender+race+BMI+Education,
               family = gaussian(), data=data_2,weights =WTMEC2YR )
summary(Result3)

#survey-weighted glm
Result1 <- svyglm(TT4~Bromoform+age+Gender+race+BMI+Education,
                  family = gaussian(), data=data_new,design=design_new)
summary(Result1)

ref:

Sun Y, Xia PF, Korevaar TI, Mustieles V, Zhang Y, Pan XF, Wang YX, Messerlian C. Relationship between Blood Trihalomethane Concentrations and Serum Thyroid Function Measures in US Adults. Environmental Science & Technology. 2021 Oct 7.

Brody DJ, Pratt LA, Hughes JP. Prevalence of depression among adults aged 20 and over: United States, 2013-2016.

Emecen-Huja P, Li HF, Ebersole JL, Lambert J, Bush H. Epidemiologic evaluation of Nhanes for environmental Factors and periodontal disease. Scientific reports. 2019 Jun 3;9(1):1-1.

Guess you like

Origin blog.csdn.net/qq_42458954/article/details/121296965