R language: ggplot2 and other graphics

The article and code are archived in the GitHub repository https://github.com/timerring/dive-into-AI and can also be obtained by replying "R language" to the public account AIShareLab.

1. First introduction to the ggplot2 package

The ggplot2 package provides a plotting system built on a layered grammar of graphics. It makes up for the inconsistency among the functions of R's base graphics system and takes plotting in R to a new level. Every visualization in ggplot2 follows the same basic principle: data are mapped from a mathematical space onto the space of graphical elements. Imagine a blank canvas on which we define the data to be visualized (data) and the mapping from variables in the data to graphical attributes.

The examples below use the data set mtcars.

This data set was extracted from the 1974 Motor Trend US magazine and contains 11 measures of fuel consumption, design and performance for 32 cars: mpg (fuel consumption, in miles per US gallon), cyl (number of cylinders), disp (displacement), hp (gross horsepower), drat (rear axle ratio), wt (vehicle weight), qsec (quarter-mile time), vs (engine type), am (transmission type), gear (number of forward gears) and carb (number of carburetors). We first explore the relationship between vehicle weight and fuel consumption by mapping the variable wt to the x-axis and the variable mpg to the y-axis.

library(ggplot2)
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) 

In the command above, aes stands for aesthetics; all variables to be mapped go inside this function. Running p by itself produces only an empty panel (axes but no data), because we have not yet said which kind of graphic should represent the data. A family of functions whose names start with geom specifies the graphical elements: points, lines, areas, polygons and so on. Below, points are used to display the data; the result is shown in the figure below.

p + geom_point()

In addition to coordinate axes, variables can also be mapped to attributes such as color, size, and shape.

For example, to show the relationship between vehicle weight and fuel consumption for the two transmission types, we can map the variable am to color (left panel of the figure below) or to shape (right panel). In the original data set am is a numeric variable with values 0 and 1, but it is really a categorical variable, so we first convert it to a two-level factor.

library(gridExtra)
mtcars$am <- factor(mtcars$am)
p1 <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = am)) + geom_point()
p2 <- ggplot(data = mtcars, aes(x = wt, y = mpg, shape = am)) + geom_point()
grid.arrange(p1, p2, nrow=1)
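The factor conversion above leaves the legend entries as 0 and 1. As a small sketch (not part of the original article), the levels can be given readable labels at conversion time. It relies on the coding documented in the help page of mtcars (0 = automatic, 1 = manual); the names mt and p_labeled are only illustrative, and a copy of the data is used so that mtcars itself is not modified.

```r
library(ggplot2)

# Work on a copy so the original mtcars data set is left untouched.
mt <- datasets::mtcars

# In mtcars, am = 0 is an automatic and am = 1 a manual transmission.
mt$am <- factor(mt$am, levels = c(0, 1),
                labels = c("automatic", "manual"))

# The legend now shows the labels instead of 0 and 1.
p_labeled <- ggplot(mt, aes(x = wt, y = mpg, color = am)) +
        geom_point() +
        labs(color = "Transmission")
p_labeled
```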

The graphs above all display the raw data. Sometimes we want to summarize the data in some way before plotting, for example by fitting a smooth curve to the scatter points in the figure above.

ggplot(data = mtcars, aes(x = wt, y = mpg, color = am)) + geom_smooth()

By default, geom_smooth() picks the smoothing method automatically from the size of the data; for fewer than 1,000 observations, as here, it uses "loess", that is, locally weighted regression (LOESS).

To change how the curve is fitted, change the value of the parameter method. For example, to use linear regression:

ggplot(data = mtcars, aes(x = wt, y = mpg, color = am)) + 
        geom_smooth(method = "lm")

Both figures above contain two fitted lines because the variable am is mapped to color inside ggplot(), so the mapping applies to every layer. To draw a single smooth line, move the color mapping into geom_point() so that it applies to the points only. The result is shown in the figure below.

ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
        geom_point(aes(color = am)) +
        geom_smooth() 
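To see what the method = "lm" option discussed above actually draws, the line can be compared with a model fitted directly by lm(). The following is a minimal sketch, not taken from the original article: it assumes that layer_data() returns the computed values of a layer and that the smooth is the second layer of the plot; the names mt, p_lm, fit and smooth_df are only illustrative.

```r
library(ggplot2)

mt <- datasets::mtcars

# A single straight-line fit through all 32 cars (no grouping by am).
p_lm <- ggplot(mt, aes(x = wt, y = mpg)) +
        geom_point() +
        geom_smooth(method = "lm", formula = y ~ x)

# The same model fitted directly with lm().
fit <- lm(mpg ~ wt, data = mt)

# layer_data() returns the values computed for a layer; layer 2 is the smooth.
smooth_df <- layer_data(p_lm, 2)
pred <- predict(fit, newdata = data.frame(wt = smooth_df$x))

# Largest difference between the drawn line and the lm() predictions.
max(abs(smooth_df$y - pred))
```

The difference should be zero up to rounding error, which suggests that the line in the plot is just the ordinary least-squares fit of mpg on wt.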

This brings us to the concept of layers. A layer is like a transparent sheet carrying graphical elements; we can create several layers separately and then stack them to form the final graphic.

The function aes() is like the brain of ggplot2, responsible for the aesthetic design, and the many functions starting with geom are like its hands, responsible for rendering that design. The ggplot2 package contains more than 30 functions starting with geom; they are listed in the help documentation of the package. A mapping only associates a variable with a graphical attribute; it does not decide the specific values used. For example, in the figure above the variable am is mapped to color, but the actual colors are chosen automatically by ggplot2. To choose the colors yourself, use a scale function.

Scale functions adjust the details of a graphic, much as a TV remote control adjusts the volume, picture and color. ggplot2 offers a wide variety of functions starting with scale, which control the colors of a graphic, the size and shape of points, and so on. For example, the scale function below sets the desired colors manually; the result is shown in the figure below.

ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
        geom_point(aes(color = am)) +
        scale_color_manual(values = c("blue", "red")) +
        geom_smooth() 
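In the command above the two colors are matched to the factor levels by position. A slightly more explicit variant, sketched here as an illustration rather than as the article's own code, names each color after the level it belongs to, so the result does not depend on the order in which the colors are listed. The names mt, p_named and point_cols are only illustrative.

```r
library(ggplot2)

mt <- datasets::mtcars
mt$am <- factor(mt$am)

# Naming the colors ties each one to a factor level explicitly,
# so the result does not depend on the order of the values.
p_named <- ggplot(mt, aes(x = wt, y = mpg)) +
        geom_point(aes(color = am)) +
        scale_color_manual(values = c("1" = "red", "0" = "blue"))
p_named

# Colors actually assigned to the points of the first layer.
point_cols <- layer_data(p_named, 1)$colour
table(point_cols)
```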

The ggplot2 package can also draw grouped plots like those of the lattice package, using facets. Faceting divides the data into subsets according to one or more categorical variables and then draws one panel for each subset. For example, to display the figure above separately for the two levels of the variable am, use the following command. The result is shown in the figure below.

ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
        geom_point() +
        stat_smooth() +
        facet_grid(~ am)
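The panel strips produced by the command above show only the bare levels 0 and 1. As a hedged sketch, the labeller label_both can be passed to facet_grid() so that each strip also shows the variable name. The straight-line fit (method = "lm") is a substitution made here for brevity and is not what the article's command uses; the names mt and p_facet are only illustrative.

```r
library(ggplot2)

mt <- datasets::mtcars
mt$am <- factor(mt$am)

# label_both prints "am: 0" and "am: 1" in the panel strips instead of
# the bare level; a linear fit is used here instead of the default smoother.
p_facet <- ggplot(mt, aes(x = wt, y = mpg)) +
        geom_point() +
        stat_smooth(method = "lm", formula = y ~ x) +
        facet_grid(~ am, labeller = label_both)
p_facet

# Each row of the point layer records the panel it is drawn in.
panel_counts <- table(layer_data(p_facet, 1)$PANEL)
panel_counts
```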

The theme functions in the ggplot2 package define the overall style of a plot, such as the background of the canvas. The figure below uses the black-and-white theme:

ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
        geom_point(aes(color = am)) +
        stat_smooth() +
        theme_bw()

In addition to the themes that come with the ggplot2 package, some extension packages provide a variety of theme styles, such as the ggthemes package and the artyfarty package. These packages must be installed before use, and interested readers can explore them on their own.

The above introduced the concepts of mappings, geometric objects (geom), scales, facets and themes in the ggplot2 package and demonstrated their basic usage. Next we explore how to draw common statistical graphics with ggplot2.

2. Characteristics of distribution

When exploring data, the most basic step is to look at the distribution of a single variable. For a continuous variable, you can draw a histogram or a density plot.

The plots below use the data set anorexia in the MASS package mentioned earlier. First load the data and create a new variable wt.change (weight change, unit: lb).

data(anorexia, package = "MASS")
anorexia$wt.change <- anorexia$Postwt - anorexia$Prewt

Next, use the ggplot2 package to draw a histogram of the variable wt.change. The code is as follows:

library(ggplot2)
p1 <- ggplot(anorexia, aes(x = wt.change)) +
        geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
        labs(x = "Weight change (lbs)") +
        theme_bw()
p1

Here the parameter binwidth sets the bin width. If it is not given, ggplot2 uses 30 bins, so the default width is the range of the data divided by 30. Try different values when plotting to obtain a more satisfactory result. The parameter fill sets the fill color of the bars, and the parameter color sets the color of their borders. We can also display the histogram and a density curve together, as shown in the figure below.

p2 <- ggplot(anorexia, aes(x = wt.change, y = ..density..)) +
        geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
        stat_density(geom = "line", linetype = "dashed", size = 1) +
        labs(x = "Weight change (lbs)") +
        theme_bw()
p2

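Here "y = ..density.." puts density rather than counts on the y-axis, so that the histogram and the density curve share the same scale, and stat_density() is a statistical transformation that computes the density estimate. In ggplot2 3.4.0 and later the ..density.. notation is deprecated in favor of after_stat(density), and the parameter size for lines is deprecated in favor of linewidth; the code above still runs but gives a deprecation warning.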
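For readers on a recent version of ggplot2, the same figure can be written with the current notation. This is a sketch under the assumption that ggplot2 3.4.0 or later is installed; the names p_dens, bars and bar_area are only illustrative. The last lines check a property of the density scale: the areas of the bars (height times bin width) add up to 1.

```r
library(ggplot2)

data(anorexia, package = "MASS")
anorexia$wt.change <- anorexia$Postwt - anorexia$Prewt

# after_stat(density) is the current spelling of ..density..;
# linewidth replaces size for lines (assumes ggplot2 >= 3.4.0).
p_dens <- ggplot(anorexia, aes(x = wt.change, y = after_stat(density))) +
        geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
        stat_density(geom = "line", linetype = "dashed", linewidth = 1) +
        labs(x = "Weight change (lbs)") +
        theme_bw()
p_dens

# On the density scale the bar areas (height x bin width) add up to 1.
bars <- layer_data(p_dens, 1)
bar_area <- sum(bars$density * (bars$xmax - bars$xmin))
bar_area
```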

Density curves can also be used to compare distributions between groups. For example, to compare the distribution of weight change across treatments, enter the following code:

p3 <- ggplot(anorexia, aes(x = wt.change, color = Treat, linetype = Treat)) +
        stat_density(geom = "line", size = 1) +
        labs(x = "Weight change (lbs)") +
        theme_bw()
p3

The above command first maps the variable Treat to color and line type, and then draws the density curve of the weight change wt.change under the three treatment methods, as shown in the figure above.

In addition to histograms and density curves, boxplots are often used to display the distribution of numerical variables, especially for comparison of distributions between groups. For example:

p4 <- ggplot(anorexia, aes(x= Treat, y = wt.change)) +
        geom_boxplot() +
        theme_bw()
p4

As the figure above shows, the weight change in the FT group is higher than in the other two groups, but whether the difference is statistically significant can only be decided by a statistical test.
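Both points can be checked numerically with base R alone. The sketch below is an illustration added here, not the article's own analysis: it computes the mean and median weight change per group and then runs a Welch two-sample t test for every pair of groups, with no adjustment for multiple comparisons. The names group_means, group_medians, group_pairs and p_values are only illustrative.

```r
data(anorexia, package = "MASS")
anorexia$wt.change <- anorexia$Postwt - anorexia$Prewt

# Mean and median weight change (lb) within each treatment group.
group_means <- tapply(anorexia$wt.change, anorexia$Treat, mean)
group_medians <- tapply(anorexia$wt.change, anorexia$Treat, median)
round(rbind(mean = group_means, median = group_medians), 2)

# Welch two-sample t tests for every pair of groups, without any
# adjustment for multiple comparisons.
group_pairs <- combn(levels(anorexia$Treat), 2, simplify = FALSE)
p_values <- sapply(group_pairs, function(g) {
        x <- anorexia$wt.change[anorexia$Treat == g[1]]
        y <- anorexia$wt.change[anorexia$Treat == g[2]]
        t.test(x, y)$p.value
})
names(p_values) <- sapply(group_pairs, paste, collapse = " vs ")
round(p_values, 4)
```

With three pairwise tests, an adjustment such as p.adjust(p_values, method = "holm") may be appropriate before drawing conclusions.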

The ggpubr package can annotate side-by-side boxplots with the results of between-group comparisons. It is an extension of ggplot2 aimed at producing statistical graphics for publication and is worth exploring for medical researchers. Next, we add the results of pairwise comparisons of means to the figure above.

library(ggpubr)
my_comparisons <- list(c("CBT", "Cont"), c("CBT", "FT"), c("Cont", "FT"))
p5 <- ggplot(anorexia, aes(x= Treat, y = wt.change)) +
        geom_boxplot() +
        stat_compare_means(comparisons = my_comparisons,
                           method = "t.test",
                           color = "blue") +
        theme_bw()
p5

The p values in the figure above come from pairwise t tests between groups. We can also use ggplot2 to draw a violin plot of the same data; the result is shown in the figure below.

p6 <- ggplot(anorexia, aes(x= Treat, y = wt.change)) +
        geom_violin() +
        geom_point(position = position_jitter(0.1), alpha = 0.5) +
        theme_bw()
p6

3. The composition of proportion

Many data sets involve proportions, and extracting proportion information lets us understand the importance of each component relative to the whole. The composition of proportions is commonly displayed with bar charts, for example:

library(vcd)
data(Arthritis)
ggplot(Arthritis, aes(x = Treatment, fill = Improved)) +
        geom_bar(color = "black") +
        scale_fill_brewer() +
        theme_bw()

The figure above is called a stacked bar chart; it displays two variables in one figure at the same time. The vertical axis shows absolute counts. Sometimes we would rather look at relative proportions, which can be done by setting the parameter position to "fill". The result is shown in the figure below.

ggplot(Arthritis, aes(x = Treatment, fill = Improved)) +
        geom_bar(color = "black", position = "fill") +
        scale_fill_brewer() +
        theme_bw()
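The proportions drawn with position = "fill" can also be computed directly as a table. The following sketch, added here for illustration, uses table() and prop.table() from base R; the names counts and props are only illustrative.

```r
library(vcd)   # provides the Arthritis data set
data(Arthritis)

# Counts of each improvement level within each treatment group.
counts <- table(Arthritis$Treatment, Arthritis$Improved)
counts

# Row-wise proportions: the bar heights drawn with position = "fill".
props <- prop.table(counts, margin = 1)
round(props, 3)
```

Each row of props adds up to 1, which matches the fact that every bar in the filled chart has the same total height.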

We can also set the parameter position to "dodge" to place the bars side by side, as shown in the figure below.

ggplot(Arthritis, aes(x = Treatment, fill = Improved)) +
        geom_bar(color = "black", position = "dodge") +
        scale_fill_brewer() +
        theme_bw()

4. Use the function ggsave() to save the graph

The function ggsave() is specially used to save graphics drawn by the ggplot2 package. This function can export pictures in a variety of different formats. For example:

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
ggsave("myplot.png", p)
ggsave("myplot.pdf", p)

The above command first creates a scatter plot and saves the result as p, and then uses the function ggsave() to save the graph as png and pdf format files respectively. You can see these two files by opening the current working directory.

If the picture is intended for publication, we can set its size and resolution. For example, to save the graphic object p above in TIFF format with a height of 12 cm, a width of 15 cm and a resolution of 500 dpi, use the following code:

ggsave("myplot.tiff", plot = p, width = 15, height = 12, units = "cm", dpi = 500)
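When experimenting with export settings it can be convenient to write to a temporary file instead of the working directory. The sketch below is only an illustration: it saves to PDF because that device ships with R, and the name out_file is hypothetical.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Write to a temporary file so the working directory stays clean.
out_file <- tempfile(fileext = ".pdf")

# Passing plot = p makes the call independent of which plot was drawn last.
ggsave(out_file, plot = p, width = 15, height = 12, units = "cm")

# Confirm that the file was written.
file.exists(out_file)
```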

2. Other graphics

2.1 Pyramid diagram

A pyramid chart is a back-to-back bar chart that is often used to display the demographic structure of a research population, so it is also called a population pyramid chart. The PlotPyramid() function in the DescTools package and the pyramid() function in the epiDisplay package can be used to draw pyramid diagrams. The following uses the Oswego data set in the epiDisplay package as an example to draw a pyramid diagram. Here we need to use the two variables age and sex in the data set.

options(warn = -1)  # suppress warning messages for the rest of the session
library(epiDisplay)
data(Oswego)
pyramid(Oswego$age, Oswego$sex, col.gender = c(2, 4), bar.label = TRUE)

The figure above shows the frequency distribution of the age groups for each sex. The function pyramid() has many parameters that control the details of the graphic. See the help documentation of the function and try different parameter settings to obtain a satisfactory result.

2.2 Horizontal stacked bar chart

Questionnaires used in epidemiological surveys often contain many multiple-choice questions. For a set of such questions, the function plot_stackfrq() in the sjPlot package can visualize the proportions of the different answers. The following uses the data set efc in that package as an example. Nine variables are needed here, corresponding to nine multiple-choice questions in the questionnaire. Please install the sjPlot package before running the following code.

library(sjPlot)
data(efc)
names(efc)

head(efc)

qdata <- dplyr::select(efc, c82cop1:c90cop9)
plot_stackfrq(qdata)

The result is shown in the figure above. From the figure we can read the wording of each question, the number of respondents, and the percentage who chose each option.

The sjPlot package brings together many functions for visualizing data in the fields of epidemiology and social sciences. Using these functions, you can easily draw beautiful and practical statistical graphics, which are worthy of further exploration by readers.

2.3 Heat map

A heat map is a color image in which the values of a matrix are represented by colors and the rows or columns of the matrix are reordered by hierarchical clustering. A heat map therefore shows both the distribution of values in the matrix and the clustering result. See Chapter 10 for a further introduction to cluster analysis. Heat maps are often used in bioinformatics data analysis. In RNA-seq analysis, for example, a heat map can show changes in global expression across multiple samples or genes as well as the clustering relationships among them.

The function heatmap() in the stats package can be used to create heatmaps. The following uses the data set mtcars as an example to introduce the usage of this function. Since the measurement scales of variables in this data set are quite different, we first need to use the function scale() to standardize the variables. The matrix composed of standardized variables can be used as the input of the function heatmap(), and the drawing results are as shown in the figure below.

data(mtcars)
dat <- scale(mtcars)
class(dat)
heatmap(dat)
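One detail is worth checking when the data have already been standardized: heatmap() has its own scale argument, which defaults to "row" and therefore rescales the rows again. The sketch below, added as an illustration, verifies the column standardization and then passes scale = "none" so that the colors reflect the standardized values as they are; the name hm and the margins value are only illustrative choices.

```r
# Column-wise standardization: each variable gets mean 0 and sd 1.
dat <- scale(datasets::mtcars)
round(colMeans(dat), 10)
apply(dat, 2, sd)

# heatmap() rescales the rows by default (scale = "row"); scale = "none"
# keeps the column-standardized values exactly as computed above.
hm <- heatmap(dat, scale = "none", margins = c(5, 8))

# The return value records the row and column order after clustering.
rownames(dat)[hm$rowInd]
colnames(dat)[hm$colInd]
```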

2.4 Three-dimensional scatter plot

The previously mentioned graphics are all two-dimensional. If you want to visualize the relationship between three numerical variables, you can use the scatterplot3d() function of the scatterplot3d package. Please install the package before use.

The parameter options provided by the function scatterplot3d() include setting graphic symbols, highlighting, angles, colors, lines, coordinate axes, grid lines, etc. The following uses the dataset trees in the datasets package as an example to illustrate the usage of this function. This data set contains 3 numerical variables Girth, Height and Volume. We draw a three-dimensional scatter plot using these three variables as the coordinate axes, and the results are shown in the figure below.

library(scatterplot3d)
data(trees)
scatterplot3d(trees, type = "h", highlight.3d = TRUE, angle = 55, pch = 16)

The parameter type in the function scatterplot3d() above sets the type of plot. The default is "p" (points); here it is set to "h", which draws a vertical line from each point down to the x-y plane. The parameter angle sets the angle between the x-axis and the y-axis. Note that a static three-dimensional scatter plot can give a different impression of the relationship among the three variables depending on the viewing angle.
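The effect of the viewing angle can be explored by drawing the same data several times. The following sketch is an illustration rather than the article's own code: it loops over three arbitrarily chosen angles and adds a fitted regression plane to each plot through the plane3d() function that scatterplot3d() returns. The names fit, s3d and a are only illustrative.

```r
library(scatterplot3d)

# Fit Volume on Girth and Height; the fitted plane is added to each plot.
fit <- lm(Volume ~ Girth + Height, data = trees)

# Draw the same data from three viewing angles for comparison.
for (a in c(30, 55, 80)) {
        s3d <- scatterplot3d(trees, type = "h", angle = a, pch = 16,
                             main = paste("angle =", a))
        # plane3d() accepts a fitted lm object and draws its regression plane.
        s3d$plane3d(fit)
}
```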

2.5 Summary

Some other specialized graphics, such as scatter plot matrices, correlation plots, normal Q-Q plots, survival curves, cluster plots, scree plots, ROC curves and meta-analysis forest plots, will be introduced in later chapters together with the corresponding statistical methods. Visualization is a very active area of R, with new packages appearing all the time. The website The R Graph Gallery collects a wide variety of graphics with sample code and is worth following for readers interested in visualization.

Origin blog.csdn.net/m0_52316372/article/details/132575441