Descriptive analysis of numerical data in R

The article and code are archived in the GitHub repository https://github.com/timerring/dive-into-AI. They can also be obtained by replying "R language" to the public account AIShareLab.

Before analysis, the categorical variables low, race, smoke, ht and ui in the data set birthwt were converted into factors.

library(MASS)
data(birthwt)
str(birthwt)

# Suppress warnings for this session (note that this also hides genuine problems)
options(warn = -1)
library(dplyr)
birthwt <- birthwt %>% 
  mutate(low = factor(low, labels = c("no", "yes")),
         race = factor(race, labels = c("white", "black", "other")),
         smoke = factor(smoke, labels = c("no", "yes")),
         ht = factor(ht, labels = c("no", "yes")),
         ui = factor(ui, labels = c("no", "yes")))
str(birthwt)
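The conversion above relies on factor() matching the labels to the sorted codes of each variable. The following is a minimal base R sketch that spells out the codes through the levels argument, so the mapping from code to label is visible. The object name bw is only illustrative, and the coding (0/1 for the binary variables, 1 to 3 for race) is assumed from the labels used above.

```r
# Load the data without attaching MASS
data(birthwt, package = "MASS")
bw <- birthwt  # illustrative copy; the name bw is not from the article

# The binary 0/1 indicators all share the same labels
yes_no <- c("low", "smoke", "ht", "ui")
bw[yes_no] <- lapply(bw[yes_no], factor, levels = c(0, 1), labels = c("no", "yes"))

# race is coded 1, 2, 3
bw$race <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

# Cross-check the recoding against the original numeric codes
table(birthwt$race, bw$race)
```

If a code outside the stated levels were present, it would become NA, which makes unexpected values easy to spot.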

Obtaining common statistics for each variable in the data frame is a quick way to explore the data set, which can be achieved with one of the following commands.

summary(birthwt)

The function summary() computes summary statistics for each variable. For numeric variables, such as age, lwt, ptl, ftv and bwt, it gives the minimum, lower quartile, median, mean, upper quartile and maximum; for categorical variables, such as low, race, smoke, ht and ui, it gives a frequency table.
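As a rough illustration of what summary() reports for a numeric variable, the six values can be rebuilt with separate base R calls. This is a sketch for the variable bwt only; the vector name by_hand is illustrative.

```r
data(birthwt, package = "MASS")
x <- birthwt$bwt  # birth weight

# The six values reported by summary() for a numeric vector
by_hand <- c(
  Min    = min(x),
  Q1     = unname(quantile(x, 0.25)),
  Median = median(x),
  Mean   = mean(x),
  Q3     = unname(quantile(x, 0.75)),
  Max    = max(x)
)
by_hand
summary(x)  # compare with the built-in result
```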

The function summ() of the epiDisplay package can be used on the data frame to obtain summary output in another format. It arranges the variables in rows and puts the minimum and maximum values in the last two columns to facilitate viewing the full range of the data.

library(epiDisplay)
summ(birthwt)

Note that for factor variables, the function summ() treats the levels as numbers and computes the statistics on those numeric codes, so the resulting values should be interpreted with care.
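To see what treating levels as numbers means, here is a small base R sketch. It only mimics the idea with the integer codes that a factor stores internally; it is not the internal code of summ().

```r
data(birthwt, package = "MASS")
smoke <- factor(birthwt$smoke, labels = c("no", "yes"))

# A factor stores integer codes 1, 2, ... behind its labels
codes <- as.integer(smoke)
table(smoke, codes)

# Statistics of the codes, not of the labels
mean(codes)
range(codes)
```

With a two-level factor coded 1 and 2, the mean of the codes minus 1 is simply the proportion of observations in the second level.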

Descriptive statistical analysis of numerical variables

This section discusses the central tendency, dispersion, and distribution shape of numerical variables. Here we focus on three continuous variables: the mother's age (age), the mother's pre-pregnancy weight (lwt) and the baby's birth weight (bwt).

cont.vars <- dplyr::select(birthwt, age, lwt, bwt)

We first calculate descriptive statistics for these three variables and then examine them by the mother's smoking status (smoke). Here smoke is a binary variable whose two levels were labelled "no" and "yes" when it was converted into a factor.

In addition to the function summary() mentioned above, R has many functions for calculating specific statistics (see Chapter 2). For example, to calculate the sample size, sample mean, and sample standard deviation of the variable age:

length(cont.vars$age)
mean(cont.vars$age)
sd(cont.vars$age)
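For reference, the same three quantities can be written out from their definitions. This sketch assumes there are no missing values in age, and it shows that sd() uses n - 1 in the denominator.

```r
data(birthwt, package = "MASS")
age <- birthwt$age

n    <- length(age)
xbar <- sum(age) / n
s    <- sqrt(sum((age - xbar)^2) / (n - 1))  # n - 1 in the denominator

c(n = n, mean = xbar, sd = s)
all.equal(s, sd(age))  # agrees with the built-in function
```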

We can also use the function sapply() to calculate specified statistics for multiple variables in the data frame at the same time. For example, calculate the sample standard deviation of each variable in the data frame cont.vars:

sapply(cont.vars, sd)
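sapply() also accepts a user-defined function, which makes it possible to collect several statistics in one call. The helper name basic_stats below is hypothetical, and cont.vars is rebuilt with base R indexing so that the sketch runs on its own.

```r
data(birthwt, package = "MASS")
cont.vars <- birthwt[, c("age", "lwt", "bwt")]

# Hypothetical helper returning several statistics at once
basic_stats <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x))
}

# One column per variable, one row per statistic
stats <- sapply(cont.vars, basic_stats)
round(stats, 2)
```

Because the helper returns a named vector of fixed length, sapply() simplifies the result to a matrix.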

Base R does not provide functions for calculating skewness and kurtosis. We can compute them from the formulas ourselves or call functions from other packages, such as Hmisc, psych, and pastecs. These packages provide a wide variety of functions for calculating statistics and must be installed before first use. The following uses the psych package, which is widely used in quantitative psychology, as an example.

The function describe() in the psych package ignores missing values and reports, for each variable, the sample size, mean, standard deviation, median, trimmed mean, median absolute deviation, minimum, maximum, range, skewness, kurtosis, and standard error of the mean.

For example:

R.Version()
library(psych)
describe(cont.vars)
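For readers who prefer to compute skewness and kurtosis from the formulas, here is a minimal sketch based on central moments. The function names skewness and kurtosis are defined here for illustration. Packages may apply different small-sample corrections, so their values can differ slightly from these.

```r
data(birthwt, package = "MASS")

# Moment-based skewness: m3 / m2^(3/2)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: m4 / m2^2 - 3
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

sapply(birthwt[, c("age", "lwt", "bwt")],
       function(x) c(skew = skewness(x), kurtosis = kurtosis(x)))
```

A skewness near 0 suggests a roughly symmetric distribution, and subtracting 3 makes the kurtosis of a normal distribution equal to 0.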

We often also want to calculate statistics within each category of a categorical variable. There are many ways to accomplish this in R. Let's start with the base R functions aggregate() and tapply().

aggregate(cont.vars, by = list(smoke = birthwt$smoke), mean)
aggregate(cont.vars, by = list(smoke = birthwt$smoke), sd)

The argument by in the function aggregate() must be a list. If the unnamed list(birthwt$smoke) is used, the grouping column in the output is named "Group.1" instead of "smoke". We can also put several categorical variables in the list, for example:

aggregate(cont.vars, 
          by = list(smoke = birthwt$smoke, race = birthwt$race), 
          mean)

There are 2 categorical variables here, of which smoke has 2 categories and race has 3 categories. The above command calculates the mean according to all combinations of each category of these two variables (a total of 6 groups).
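The effect of naming the list elements can be checked directly. This sketch rebuilds the two factors with the same labels as above so that it runs on its own; the object names unnamed and named are illustrative.

```r
data(birthwt, package = "MASS")
smoke <- factor(birthwt$smoke, labels = c("no", "yes"))
race  <- factor(birthwt$race, labels = c("white", "black", "other"))
cont.vars <- birthwt[, c("age", "lwt", "bwt")]

# Unnamed list: grouping columns get default names
unnamed <- aggregate(cont.vars, by = list(smoke, race), mean)
names(unnamed)

# Named list: grouping columns keep the names we chose
named <- aggregate(cont.vars, by = list(smoke = smoke, race = race), mean)
names(named)
nrow(named)  # one row per smoke x race combination present in the data
```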

Of course, you can also write in any of the following ways:

aggregate(birthwt[,c("age","lwt","bwt")], 
          by = list(smoke = birthwt$smoke, race = birthwt$race), 
          mean)

aggregate(cbind(age, lwt, bwt)~smoke+race, birthwt, mean)

The function tapply() works similarly, except that its first argument must be a single vector rather than a data frame, and its grouping argument is named INDEX instead of by. For example, to calculate the mean of the variable bwt by the mother's smoking status, enter:

tapply(birthwt$bwt, INDEX = birthwt$smoke, mean)
#       no      yes
# 3055.696 2771.919
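tapply() can also cross-classify by more than one factor when INDEX is given a list. A short sketch, with the factors rebuilt using the same labels as above so that it runs on its own:

```r
data(birthwt, package = "MASS")
smoke <- factor(birthwt$smoke, labels = c("no", "yes"))
race  <- factor(birthwt$race, labels = c("white", "black", "other"))

# Passing a list to INDEX cross-classifies by both factors
mean_bwt <- tapply(birthwt$bwt, INDEX = list(smoke = smoke, race = race), mean)
round(mean_bwt, 1)
```

The result is a matrix with one row per smoking status and one column per race, which is often easier to read than the long format returned by aggregate().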

The function summ() in the epiDisplay package offers similar functionality. The difference is that its set of statistics is fixed, and its output includes a sorted dot plot drawn separately for each group of the categorical variable, as shown in the figure below.

summ(birthwt$bwt, by = birthwt$smoke)

The sorted dot plot produced by the function summ() is a convenient way to explore the distribution of a numerical variable, especially where the data cluster and whether there are outliers.

The function describeBy() in the psych package can also calculate the same statistics as the function describe() in groups, for example:

describeBy(cont.vars, birthwt$smoke)

Although the function describeBy() is very convenient, it does not accept an arbitrary function, so it is not very extensible. The functions group_by() and summarise() in the dplyr package, introduced in Chapter 3, can calculate grouped statistics very flexibly. For example:

library(dplyr)
birthwt %>%
  group_by(smoke) %>% 
  summarise(Mean.bwt = mean(bwt), Sd.bwt = sd(bwt))
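For comparison, a similar table can be built without dplyr. This is only a base R sketch using split() and sapply(); the object names groups and result are illustrative.

```r
data(birthwt, package = "MASS")
smoke <- factor(birthwt$smoke, labels = c("no", "yes"))

# Split bwt by smoking status, then summarise each piece
groups <- split(birthwt$bwt, smoke)
result <- data.frame(
  smoke     = names(groups),
  Mean.bwt  = sapply(groups, mean),
  Sd.bwt    = sapply(groups, sd),
  row.names = NULL
)
result
```

Any function of a numeric vector can be added as a further column in the same way.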

Data analysts can choose whichever way of calculating and displaying descriptive statistics they are most comfortable with. The last method, using dplyr, expresses the logic most clearly and gives the most concise output.

Origin blog.csdn.net/m0_52316372/article/details/132618830