Exploratory Analysis of BRFSS Data in R

Executive Summary

This document is the report from the final course project for the Introduction to Probability and Data course, as part of the Duke/Coursera Statistics with R specialization. The project consisted of exploring a real-world dataset - CDC’s 2013 Behavioral Risk Factor Surveillance System - and creating a report on three student-chosen research questions.

The research questions chosen - and their respective results - were:

  • Is a respondent’s opinion of their health status related to their Body Mass Index (BMI)? Is there any difference between genders?
    • Yes, there were noticeable relations between health perception and BMI, as well as gender-specific differences.
  • How does being a parent of a young child affect the amount of sleep time reported? How is this reported differently between genders?
    • Parents of young children reported less sleep than the general population, and there was a difference between the genders.
  • Are responses to general health perception related to the time of year the survey was conducted? How do any differences show up across states?
    • There were no notable differences between winter and non-winter responses at the national level, but there were indications of differences in per-state responses.

Setup

The initial phase consisted of loading the required packages and data. This was done as per the project instructions.

Load packages

library(ggplot2)
library(dplyr)

Load data

The data was loaded from a local copy of the file, as per course instructions.

load("brfss2013.RData")
dim(brfss2013)
## [1] 491775    330

As can be seen above, the dataset consisted of almost 500,000 observations with 330 possible variables. Not all observations included all variables, so data quality was handled individually on each question below.
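
The per-question handling of missing values can be sketched on a small made-up data frame. The `toy` data and its values are purely illustrative and are not taken from the BRFSS; the sketch uses base R only, so it runs without the survey file:

```r
# Toy data frame with made-up values; NA marks a missing response.
toy <- data.frame(
  genhlth  = c("Good", NA, "Poor", "Excellent", "Fair"),
  sleptim1 = c(7, 8, NA, 6, NA),
  sex      = c("Male", "Female", "Female", "Male", "Female")
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the columns a question needs, then drop incomplete rows.
# Selecting first means a row is lost only if a needed value is missing.
q_health <- na.omit(toy[, c("genhlth", "sex")])
q_sleep  <- na.omit(toy[, c("sleptim1", "sex")])
print(nrow(q_health))
print(nrow(q_sleep))
```

Selecting the columns before calling `na.omit()` is the same order of operations used for each question below: a missing sleep time does not cost a row in the health question, and vice versa.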


Part 1: Data

Background on the BRFSS

According to the CDC website, “The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.”

Methodology

According to the CDC, “BRFSS is a cross-sectional telephone survey that state health departments conduct monthly over landline telephones and cellular telephones with a standardized questionnaire and technical and methodological assistance from CDC. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.”

Observations on Generalizability, Causality, and Bias

While the course material makes brief references to more advanced statistical content (causal inference), given the author’s current knowledge about causality, the following statements can be made:

  • On the topic of Generalizability: given the breadth of the survey - across all 50 states and other US territories, coordinated by the CDC with each state’s health agency, … - it does seem to capture enough of a random sample to make it generalizable to the broad US population.

  • On Causality: given that the BRFSS is an observational exercise - with no explicit random assignments to treatments - all relationships indicated may indicate association, but not causation.

Also, given the methodology for the BRFSS, there are several concerns about bias:

  • By using a telephone survey, there is the possibility of under-representing several types of individuals:

    • Those that do not have access to a landline or mobile phone.

    • Those that do not respond to phone surveys on principle.

    • Those that were not available for survey when the survey was conducted.

  • Since the answers to the interview questions are not validated, respondents may alter their responses in a variety of ways:
    • over-reporting desirable behaviours and traits, while under-reporting undesirable ones.

    • systematically exaggerating traits such as height or income.

    • misremembering key information when asked to recall details from 30 days ago or more.

  • Finally, there are possible inconsistencies in interview practices and question sets between the participating state agencies. See the CDC website for details.

For future reference, it would be useful if the dataset included details about each interview in terms of what time of day it was collected, and how long it took. That would provide further insight about those who might or might not have taken part in the survey.


Part 2: Research questions

Research question 1:

Is a respondent’s opinion of their health status related to their Body Mass Index (BMI)? Is there any difference between genders?

This is an interesting question because it looks for a link between a respondent’s opinion of their own health and a slightly more objective measure of overall health. BMI is not without controversy, but it is widely recognized. The difference between genders is also interesting, as it may tease out different perceptions and pressures within society.

The analysis was done using the following variables:

  • genhlth - Corresponds to General Health
  • X_bmi5cat - Computed variable that categorizes BMI into 4 categories. BMI is derived from reported height & weight.
  • sex - Reported gender

Research question 2:

How does being a parent of a young child affect the amount of sleep time reported? How is this reported differently between genders?

This question is interesting because it estimates the impact that being a parent of young children might have on a respondent. Understanding this is useful to help others better understand and possibly empathize with parents. It is also useful to understand if this effect is markedly different between males and females.

The analysis was done using the following variables:

  • sleptim1 - Reported time slept per night
  • rcsrltn2 - Relationship of respondent to random child from same household
  • X_impcage - Imputed variable that classifies child age into 4 possible categories.
  • sex - Reported gender

Research question 3:

Are responses to general health perception related to the time of year the survey was conducted? How do any differences show up across states?

This question looks at how possible seasonal aspects may affect the responses. In this case, the interest is in the potential impact of winter months on general health response. As a follow-on, it looks at a sample of US states to consider possible regional differences.

The analysis was done using the following variables:

  • genhlth - Corresponds to General Health
  • imonth - Month that the interview was conducted
  • X_state - State of residency for respondent

Part 3: Exploratory data analysis

Research question 1:

Is a respondent’s opinion of their health status related to their Body Mass Index (BMI)? Is there any difference between genders?

# Select appropriate variables from dataset and omit NAs
q1 <- select(brfss2013,genhlth,sex,X_bmi5cat) %>% na.omit()
dim(q1)
## [1] 463274      3
prop.table(table(q1$genhlth,q1$X_bmi5cat),2)
##            
##             Underweight Normal weight Overweight      Obese
##   Excellent  0.19990243    0.26019496 0.17373887 0.07933813
##   Very good  0.26393463    0.35069868 0.35401238 0.26824837
##   Good       0.26149530    0.24667514 0.30698451 0.37088006
##   Fair       0.15831199    0.09751640 0.11943759 0.19913468
##   Poor       0.11635565    0.04491484 0.04582665 0.08239876

After the initial load of data (over 460,000 observations), we can take an initial look at the frequency of responses and then consider their proportion.

The table above reads column by column: for each BMI category (“Underweight”, “Normal weight”, …), it gives the proportion of respondents who rated their health as “Excellent”, “Very good”, and so on. In other words, each column sums to 1.
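
As a small illustration of why the columns sum to 1, the sketch below applies `prop.table()` with `margin = 2` to a made-up table of counts. The numbers are invented for the example and are not BRFSS figures:

```r
# Made-up counts: rows are health ratings, columns are BMI categories.
# matrix() fills column by column, so the first three numbers form column 1.
counts <- matrix(
  c(20, 60, 20,
    10, 50, 40),
  nrow = 3,
  dimnames = list(health = c("Excellent", "Good", "Poor"),
                  bmi    = c("Normal weight", "Obese"))
)

# margin = 2 divides every cell by its column total;
# margin = 1 would divide by the row total instead.
col_props <- prop.table(counts, margin = 2)
row_props <- prop.table(counts, margin = 1)

print(col_props)
print(colSums(col_props))  # each column sums to 1
print(rowSums(row_props))  # each row sums to 1
```

With `margin = 2` the question answered is "within a BMI category, how are health ratings distributed?", which is the comparison this research question needs.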

An easier graphical representation can be seen below:

g <- ggplot(q1) + aes(x=X_bmi5cat,fill=genhlth) + geom_bar(position = "fill") 
g <- g + xlab("BMI category") + ylab("Proportion") + scale_fill_discrete(name="Reported Health")
g

There are interesting trends to observe:

  • The proportion of reports of “Excellent” health increases from those that are Underweight (about 20%) to Normal weight (about 26%), but then decreases markedly from Normal weight to Obese (about 8%). This indicates there is a possible awareness of overall health status.

  • The decrease in “Excellent” (about 18 percentage points from Normal weight to Obese) is considerably larger than the corresponding increase in those reporting “Poor” health (about 4 percentage points, from 4.5% to 8.2%). This may indicate a lack of awareness/education on what constitutes good health.
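
The two trends can be checked with a little arithmetic on the proportion table. The sketch below re-enters the printed column proportions, rounded to four decimals, rather than recomputing them from the survey file; the object names are chosen for this example only:

```r
# Column proportions copied from the table above, rounded to 4 decimals.
ratings    <- c("Excellent", "Very good", "Good", "Fair", "Poor")
categories <- c("Underweight", "Normal weight", "Overweight", "Obese")
props <- matrix(
  c(0.1999, 0.2639, 0.2615, 0.1583, 0.1164,   # Underweight
    0.2602, 0.3507, 0.2467, 0.0975, 0.0449,   # Normal weight
    0.1737, 0.3540, 0.3070, 0.1194, 0.0458,   # Overweight
    0.0793, 0.2682, 0.3709, 0.1991, 0.0824),  # Obese
  nrow = 5,
  dimnames = list(health = ratings, bmi = categories)
)

# Change from "Normal weight" to "Obese" for the two extreme ratings.
excellent_drop <- props["Excellent", "Normal weight"] - props["Excellent", "Obese"]
poor_rise      <- props["Poor", "Obese"] - props["Poor", "Normal weight"]

print(round(100 * excellent_drop, 1))  # percentage points
print(round(100 * poor_rise, 1))       # percentage points
```

These are descriptive differences between column proportions only; they say nothing about statistical significance.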

What about the effect of gender?

g <- ggplot(q1) + aes(x=sex,fill=genhlth) + geom_bar(position = "fill") + facet_grid(.~X_bmi5cat)
g <- g + xlab("BMI category per Gender") + ylab("Proportion") + scale_fill_discrete(name="Reported Health")
g

In this case, we can observe the following:

  • Females report a higher proportion of “Excellent” health status than males when their BMI categorization is “Underweight” or “Normal weight”. This could indicate a stronger association of health with slimness, reflecting broader societal opinions.

  • Females report a lower proportion of “Excellent” health status than males when their BMI categorization is “Overweight” or “Obese”. This could indicate an oversensitivity to weight as a component of overall health.
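
A numeric counterpart to the faceted plot is a three-way table normalised within each combination of sex and BMI category. The sketch below uses invented counts (the `toy` data frame is hypothetical) simply to show the mechanics:

```r
# Made-up counts for every combination of rating, sex and BMI category.
# expand.grid() varies the first factor fastest, which fixes the order of n.
toy <- expand.grid(
  health = c("Excellent", "Other"),
  sex    = c("Male", "Female"),
  bmi    = c("Normal weight", "Obese")
)
toy$n <- c(30, 70,   # Male,   Normal weight
           35, 65,   # Female, Normal weight
           10, 90,   # Male,   Obese
            8, 92)   # Female, Obese

# Three-way contingency table built from the counts.
tab <- xtabs(n ~ health + sex + bmi, data = toy)

# Normalise within each sex x BMI cell, so the ratings in a cell sum to 1.
p <- prop.table(tab, margin = c(2, 3))
print(p["Excellent", , ])  # share of "Excellent" by sex and BMI category
```

On the real data the same call would be applied to `table(q1$genhlth, q1$sex, q1$X_bmi5cat)`; that step is left out here because it needs the survey file.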

In summary, the analysis seems to indicate that, given the research question: yes, there were noticeable relations between health perception and BMI, as well as gender-specific differences.

Given the analysis performed, however, these relations cannot be used to infer causality.


Research question 2:

How does being a parent of a young child affect the amount of sleep time reported? How is this reported differently between genders?

q2 <- select(brfss2013,sleptim1,sex,rcsrltn2,X_impcage)
table(q2$sleptim1)
## 
##      0      1      2      3      4      5      6      7      8      9 
##      1    228   1076   3496  14261  33436 106197 142469 141102  23800 
##     10     11     12     13     14     15     16     17     18     19 
##  12102    833   3675    199    447    367    369     35    164     13 
##     20     21     22     23     24    103    450 
##     64      3     10      4     35      1      1

The frequency table shows that there are coding errors in the data: values such as 103 and 450 hours per day are impossible. The cleanup involves removing reported sleep times longer than 16 hours per day. This cutoff was an arbitrary decision based on the data.
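
To see how much data the 16-hour cutoff removes, the frequency table can be re-entered by hand and summarised. This is a sketch that works from the printed counts only, not from the survey file; the variable names are chosen for this example:

```r
# Reported hours and their counts, copied from the table(q2$sleptim1) output.
hours  <- c(0:24, 103, 450)
counts <- c(1, 228, 1076, 3496, 14261, 33436, 106197, 142469, 141102, 23800,
            12102, 833, 3675, 199, 447, 367, 369, 35, 164, 13,
            64, 3, 10, 4, 35, 1, 1)

cutoff <- 16
keep   <- hours <= cutoff

impossible <- sum(counts[hours > 24])  # more hours than a day has
dropped    <- sum(counts[!keep])       # removed by the cutoff
kept       <- sum(counts[keep])

# Mean sleep time among the kept responses, weighting each hour by its count.
mean_kept <- weighted.mean(hours[keep], counts[keep])

print(c(impossible = impossible, dropped = dropped, kept = kept))
print(dropped / sum(counts))  # share of non-missing responses removed
print(mean_kept)
```

The totals here come from the frequency table alone, so they need not match the dimensions of the data frames built next, which also drop rows with missing values in other columns.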

q2_pop <- select(q2,sex,sleptim1) %>% na.omit() %>% filter(sleptim1 <= 16)
dim(q2_pop)
## [1] 484056      2
q2_parent <- na.omit(q2) %>% filter(rcsrltn2=="Parent" & sleptim1 <= 16) %>%
  mutate(young=X_impcage %in% c("0-4 Years old","5-9 Years old"))
dim(q2_parent)
## [1] 57857     5

This data load performs two data selection operations:

  • First, it selects the proper columns from the original dataset into the q2 data frame.

  • It then creates two separate dataframes for analysis:

    • q2_pop : for the broader population, omitting miscoded values.

    • q2_parent : leverages the Random Child Selection set of questions from the BRFSS and selects those that identified themselves as “Parents”. Furthermore, it adds a column for identifying children less than 10 years old.

It is important to note that while the broad population is approximately 480,000 samples, the Random Child Selection module of the BRFSS yields a little less than 60,000 samples.

For the general population, we have the following reported sleep distribution (red line corresponds to mean):

summarize(q2_pop,avg=mean(sleptim1),sd=sd(sleptim1))
##        avg       sd
## 1 7.042784 1.431061
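g <- ggplot(q2_pop) + aes(x=sleptim1) 
g <- g + geom_histogram(binwidth = 1, color="black", fill="white")
# Mark the mean reported sleep time with a red vertical line
g <- g + geom_vline(xintercept = mean(q2_pop$sleptim1), color="red")
g <- g + xlab("Sleep Time (hrs)") + ylab("Reported Count")
g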

For the parents group, the characteristics of the distribution are:

summarize(q2_parent,avg=mean(sleptim1),sd=sd(sleptim1))
##        avg       sd
## 1 6.854521 1.315791

And for parents of small children, the distribution looks like:

filter(q2_parent,young==TRUE) %>% summarize(avg=mean(sleptim1),sd=sd(sleptim1))
##        avg      sd
## 1 6.847745 1.31827

Finally, looking at gender differences for parents of small children:

filter(q2_parent,young==TRUE) %>% group_by(sex) %>% summarize(avg=mean(sleptim1),sd=sd(sleptim1))
## # A tibble: 2 x 3
##      sex      avg       sd
##   <fctr>    <dbl>    <dbl>
## 1   Male 6.755862 1.230122
## 2 Female 6.909699 1.371082

Looking at these summary statistics and the original research question, reported sleep appears to differ in two ways: parents report slightly less sleep on average than the general population (about 6.85 vs. 7.04 hours), and among parents of small children males report less sleep than females (about 6.76 vs. 6.91 hours). Parents of small children, however, report nearly the same average as parents overall (6.85 hours in both cases). Further statistical techniques would be needed to quantify the significance of these differences.
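
To put the gaps between these averages on a more intuitive scale, the sketch below converts them to minutes using the summary values printed above. It is descriptive arithmetic only, not a significance test, and note that the parents are themselves part of the general population, so the groups overlap:

```r
# Means and standard deviation copied from the summaries above (hours).
pop_mean     <- 7.042784   # general population
pop_sd       <- 1.431061
parent_mean  <- 6.854521   # all respondents identified as parents
young_mean   <- 6.847745   # parents of children under 10
male_mean    <- 6.755862   # parents of young children, male
female_mean  <- 6.909699   # parents of young children, female

# Differences expressed in minutes of reported sleep per night.
gap_parent_min <- (pop_mean - parent_mean) * 60
gap_young_min  <- (parent_mean - young_mean) * 60
gap_gender_min <- (female_mean - male_mean) * 60

# Population-vs-parent gap in units of the population standard deviation.
gap_parent_sd <- (pop_mean - parent_mean) / pop_sd

print(round(c(parent = gap_parent_min,
              young  = gap_young_min,
              gender = gap_gender_min), 1))
print(round(gap_parent_sd, 2))
```

The gap between parents of young children and parents overall is well under a minute, which is why the difference that stands out is the one between parents and the general population.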


Research question 3:

Are responses to general health perception related to the time of year the survey was conducted? How do any differences show up across states?

# Define Winter months
winter <- c("December","January","February")

q3 <- select(brfss2013,genhlth,imonth,X_state) %>%
  na.omit() %>%
  mutate(winter=imonth %in% winter)
dim(q3)
## [1] 489790      4
prop.table(table(q3$genhlth,q3$winter),2)
##            
##                  FALSE       TRUE
##   Excellent 0.17393076 0.17643433
##   Very good 0.32401281 0.32724673
##   Good      0.30769272 0.30641019
##   Fair      0.13705171 0.13362268
##   Poor      0.05731200 0.05628606

The initial data load for this question resulted in approximately 490,000 samples. As per the research question, the variables extracted were the general health reported, the month the interview took place, and the respondent’s state of residence.

For this analysis, an extra column was added indicating if the interview took place in the months typically associated with winter.
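
The winter flag relies on `%in%`, whose behaviour with factors and missing values is worth seeing on a tiny example. The `imonth` vector below is made up for illustration; `month.name` is a constant built into R:

```r
# Winter months, as defined in the analysis above.
winter <- c("December", "January", "February")

# A made-up factor of interview months, including one missing value.
imonth <- factor(c("January", "July", "December", "March", NA),
                 levels = month.name)

# %in% compares factor labels as text and returns FALSE (not NA) for NA.
is_winter <- imonth %in% winter
print(is_winter)

# Count of interviews by flag.
print(table(is_winter))
```

Because a missing month comes out as `FALSE` rather than `NA`, dropping incomplete rows before creating the flag, as the analysis does, keeps unknown months from being counted as non-winter.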

Comparing the FALSE and TRUE columns of the proportion table, the reported health is very similar whether or not the interview took place in winter. This can also be visualized in the following plot:

g <- ggplot(q3)+aes(x=winter,fill=genhlth)+geom_bar(position = "fill")
g <- g + xlab("Winter interview") + ylab("Proportion") + scale_fill_discrete(name="Reported Health")
g
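
The similarity between winter and non-winter responses can also be expressed as a single number: the largest gap between the two columns of the proportion table. The sketch below re-enters the printed proportions by hand (the vector names are chosen for this example):

```r
# Column proportions copied from the winter / non-winter table above.
non_winter <- c(Excellent = 0.17393076, `Very good` = 0.32401281,
                Good = 0.30769272, Fair = 0.13705171, Poor = 0.05731200)
in_winter  <- c(Excellent = 0.17643433, `Very good` = 0.32724673,
                Good = 0.30641019, Fair = 0.13362268, Poor = 0.05628606)

# Difference for each rating, in percentage points.
diff_pts <- (in_winter - non_winter) * 100
print(round(diff_pts, 2))

# Largest absolute gap and the rating where it occurs.
max_gap   <- max(abs(diff_pts))
max_level <- names(which.max(abs(diff_pts)))
print(max_gap)
print(max_level)
```

Every rating differs by well under half a percentage point between the two columns, which matches the visual impression from the plot.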

Interestingly, when we look at state-specific data, a slightly different picture appears. A sample of US states was selected for further analysis:

# Define states of interest
states <- c("Alaska","California","Massachusetts","New Hampshire","Wyoming")

q3_states <- filter(q3,X_state %in% states)
dim(q3_states)
## [1] 43608     4
group_by(q3_states,X_state,winter) %>% summarise(count=n())
## Source: local data frame [10 x 3]
## Groups: X_state [?]
## 
##          X_state winter count
##           <fctr>  <lgl> <int>
## 1         Alaska  FALSE  3432
## 2         Alaska   TRUE  1129
## 3     California  FALSE 11105
## 4     California   TRUE   403
## 5  Massachusetts  FALSE 10631
## 6  Massachusetts   TRUE  4411
## 7  New Hampshire  FALSE  4525
## 8  New Hampshire   TRUE  1539
## 9        Wyoming  FALSE  5685
## 10       Wyoming   TRUE   748
g <- ggplot(q3_states)+aes(x=winter,fill=genhlth)+geom_bar(position = "fill")+facet_grid(.~X_state)
g <- g + xlab("Winter interview per state") + ylab("Proportion") + scale_fill_discrete(name="Reported Health")
g

In this case, the plot shows some noticeable differences in the proportion of respondents indicating “Excellent” health in the winter months. This might be attributed to different factors such as:

  • mood during the winter months (either enjoying colder temperatures or enjoying warmer temperatures compared to the rest of the country)

  • discrepancies in per-state data collection - California, for example, shows a very small number of cases in the winter

  • additional factors.
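
The point about per-state data collection can be made concrete by computing the share of each state's interviews that took place in winter, using the counts printed above. This is a sketch on re-entered counts; the 25% reference figure assumes interviews spread evenly over twelve months, which is an assumption of this example rather than something the report states:

```r
# Interview counts per state, copied from the grouped summary above.
state_counts <- data.frame(
  state      = c("Alaska", "California", "Massachusetts",
                 "New Hampshire", "Wyoming"),
  non_winter = c(3432, 11105, 10631, 4525, 5685),
  winter     = c(1129, 403, 4411, 1539, 748)
)

# Share of each state's interviews conducted in December-February.
state_counts$total        <- state_counts$non_winter + state_counts$winter
state_counts$winter_share <- state_counts$winter / state_counts$total

# Three winter months out of twelve would give 25% under even collection.
expected_share <- 3 / 12

print(state_counts[order(state_counts$winter_share), ])
lowest_state <- state_counts$state[which.min(state_counts$winter_share)]
print(lowest_state)
```

California's winter share is far below the even-collection reference, so its winter proportions rest on comparatively few interviews and should be read with more caution than those of the other states.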

Origin www.cnblogs.com/tecdat/p/11982833.html