Basic drawing in R language

The article and code are archived in the [GitHub repository: https://github.com/timerring/dive-into-AI ]; they can also be obtained by replying "R language" to the public account [AIShareLab].

R's basic drawing system was written by Ross Ihaka and is very powerful. It consists mainly of the graphics and grDevices packages, which are loaded automatically when R starts. The basic drawing system has two types of functions: high-level drawing functions and low-level drawing functions.

High-level drawing functions generate a complete graphic directly; they include plot(), hist(), boxplot(), pairs(), etc. Low-level drawing functions add new elements to a graphic already drawn by a high-level drawing function; they include points(), lines(), text(), title(), legend(), axis(), etc.

1. Function plot()

The function plot() is a generic function that draws different graphics for different types of data. For example, for two numerical variables it draws a scatter plot; for a numerical variable grouped by a categorical variable (a factor) it draws box plots; for some statistical models it draws the corresponding graphic, such as a survival curve for a survival analysis. The function plot() is therefore used very frequently. It is recommended that you open its help document to review its commonly used parameters.

A sample data set is created below to represent the response of patients with a certain disease to two drugs (drugA and drugB) at five dose levels.
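As a small sketch of this dispatch behavior, the code below calls plot() on different argument types. It assumes the built-in iris data set, which the article itself does not use; it is only a convenient source of a factor and numeric columns.

```r
# plot() picks a drawing method based on the class of its arguments.
x_num <- iris$Sepal.Length   # numeric
y_num <- iris$Petal.Length   # numeric
grp   <- iris$Species        # factor with 3 levels

plot(x_num, y_num)   # numeric vs numeric: scatter plot
plot(grp, x_num)     # factor vs numeric: one box plot per level
plot(grp)            # a single factor: bar plot of the level counts

# List the classes that provide their own plot() method in this session.
methods(plot)
```

The list printed by methods(plot) grows as more packages are loaded, which is how model objects from other packages get their own plots.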

dose <- c(20, 30, 40, 45, 60)
drugA <- c(16, 20, 27, 40, 60)
drugB <- c(15, 18, 25, 31, 40)

Use the data above to graph the dose and response relationship for Drug A:

plot(dose, drugA)
plot(dose, drugA, type = "b")

The above commands create two graphs. The parameter type in the function plot() defaults to "p" (points), so the first graph is a scatter plot. In the second command, type is changed to "b" (both points and lines), so the second graph is a point-line graph.
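Beyond "p" and "b", the parameter type accepts several other values. The sketch below draws the same dose-response data with six of them in a 2 x 3 grid; the grid layout via par(mfrow = ...) is an addition for side-by-side comparison, not part of the original example.

```r
dose  <- c(20, 30, 40, 45, 60)
drugA <- c(16, 20, 27, 40, 60)

# "p" points, "l" lines, "b" both, "o" lines drawn over points,
# "h" vertical lines from the axis, "s" stair steps
types <- c("p", "l", "b", "o", "h", "s")

old_par <- par(mfrow = c(2, 3))   # 2 rows x 3 columns of panels
for (tp in types) {
  plot(dose, drugA, type = tp, main = paste0('type = "', tp, '"'))
}
par(old_par)                      # restore the previous layout
```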

The function plot() is used to create a new graph. We can also use low-level plotting functions, such as lines(), legend(), etc., to add new graphic elements to an existing graph. For example:

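# To compare the responses to the two drugs at different doses, show two point-line graphs in one figure and distinguish them by line type (lty) and point symbol (pch).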
plot(dose, drugA, type = "b", lty = 1, pch = 15)
lines(dose, drugB, type = "b", lty = 2, pch = 17)
# A legend is also added to improve readability.
# Note that the point and line attributes in legend() must match those set in plot() and lines() above.
legend("topleft", title = "Drug Type",
       legend = c("A", "B"), 
       lty = c(1, 2), 
       pch = c(15, 17))
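One detail worth keeping in mind: plot() sets the axis limits from the first series only, so a second series added with lines() can be clipped if it falls outside that range. A minimal sketch of one way to guard against this is to compute ylim from both series up front; the axis labels "Dose" and "Response" are placeholders chosen here.

```r
dose  <- c(20, 30, 40, 45, 60)
drugA <- c(16, 20, 27, 40, 60)
drugB <- c(15, 18, 25, 31, 40)

# Vertical range that covers both series
y_range <- range(drugA, drugB)

plot(dose, drugA, type = "b", lty = 1, pch = 15,
     ylim = y_range, xlab = "Dose", ylab = "Response")
lines(dose, drugB, type = "b", lty = 2, pch = 17)
legend("topleft", title = "Drug Type",
       legend = c("A", "B"),
       lty = c(1, 2),
       pch = c(15, 17))
```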

2. Histograms and density curves

The histogram is the most commonly used tool for displaying the distribution of a continuous variable. It is essentially an estimate of the density function. Histograms and density plots are generally used to explore distributions and rarely to report results. The function hist() can be used to draw a histogram.

The dataset anorexia is located in the MASS package and comes from a study on weight changes in young female patients with anorexia. The data set contains 72 observed subjects and 3 variables. The variable Treat (treatment method) is a factor with 3 levels. The variables Prewt and Postwt are both numerical and represent the weight before and after treatment (unit: lb) respectively. The following is a histogram of the variable Prewt. The code is as follows:

library(MASS)
data(anorexia)
str(anorexia)

attach(anorexia)
hist(Prewt)

The above figure shows the frequency distribution of the variable Prewt. Since no other parameters are set in the function hist(), the default intervals, axis labels, and title are used. Note that the shape of a histogram depends on the intervals chosen; sometimes we need to try different values of the parameter breaks to get a suitable graph. The function hist() also returns the values it computed, such as the interval endpoints, the counts (or densities), and the interval midpoints, which can be used for further graphing or analysis.

A density curve provides a smoother description of the distribution of the data. A density curve can be drawn as follows:
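The returned values and the effect of breaks can be inspected with a short sketch like the following. The specific breaks values (10, 5, 20) are arbitrary choices for illustration.

```r
library(MASS)   # provides the anorexia data set

# plot = FALSE computes the histogram without drawing it
h <- hist(anorexia$Prewt, breaks = 10, plot = FALSE)
h$breaks    # interval endpoints
h$counts    # number of observations in each interval
h$density   # counts rescaled so that the bar areas sum to 1
h$mids      # interval midpoints

# A single number for breaks is treated as a suggestion; R rounds
# it to "pretty" endpoints, so the actual bar count may differ.
old_par <- par(mfrow = c(1, 2))
hist(anorexia$Prewt, breaks = 5,  main = "breaks = 5")
hist(anorexia$Prewt, breaks = 20, main = "breaks = 20")
par(old_par)
```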

plot(density(Prewt))

As can be seen from the above figure, the distribution of the variable Prewt is unimodal and roughly symmetrical. We can also add a density curve and a rug plot to a histogram. To do so, set the parameter freq to FALSE in the function hist() so that the vertical axis shows density instead of frequency; otherwise the density curve will be almost invisible. Setting the parameter las (label style) to 1 displays the tick labels of the vertical axis horizontally.

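library("showtext") # The R data-analysis image has poor Chinese support; showtext is needed only if labels contain Chinese text
showtext_auto() # enable automatic font handling
# Fill the bars in red, add more informative axis labels and a title, and set las = 1 so that the tick labels of the vertical axis are horizontal.
hist(Prewt, freq = FALSE, col = "red",
     xlab = "Weight (lbs)", 
     main = "Histogram of weight before treatment",
     las = 1)
# Use lines() to overlay a blue density curve, twice the default line width, on the histogram.
lines(density(Prewt), col = "blue", lwd = 2)
# Finally, use rug() to add a rug plot along the horizontal axis to show where the data are concentrated.
rug(Prewt)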
detach(anorexia)
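To see why the frequency scale hides the density curve, compare the two scales numerically. This is a small check that draws nothing; it uses anorexia$Prewt directly rather than attach().

```r
library(MASS)

h <- hist(anorexia$Prewt, plot = FALSE)
d <- density(anorexia$Prewt)

range(h$counts)   # frequency scale: whole-number counts per interval
range(d$y)        # density scale: small fractions

# On the density scale the bar areas (height x width) add up to 1,
# which puts the bars and the density curve on a comparable scale.
bar_area <- sum(h$density * diff(h$breaks))
bar_area
```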

3. Bar chart

Bar charts are often used in medical papers; they display the frequency distribution of a categorical variable with vertical or horizontal rectangles. The function barplot() can be used to draw a bar chart.

The following uses the Arthritis data set in the vcd package as an example to introduce the usage of the function barplot(). This data set comes from a group-controlled, double-blind clinical trial study of a new treatment for rheumatoid arthritis. The response variable Improved records the treatment effect of each patient who received drug treatment (Treated, 41 cases) or placebo (Placebo, 43 cases), divided into 3 levels (None, Some, Marked).

library(vcd)
data(Arthritis)
attach(Arthritis)
counts <- table(Improved)
counts
# Improved
#   None   Some Marked 
#     42     14     28

The function table() is used to generate frequency statistics tables for categorical variables. As can be seen from the output above, 28 patients showed significant improvement, 14 patients showed partial improvement, and 42 patients showed no improvement. A bar chart can be used to display this frequency distribution, as shown below:
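Counts are often easier to read as percentages. The sketch below re-enters the three counts printed above as a named vector (so that it runs without the vcd package) and converts them with prop.table().

```r
# Counts copied from the table() output above
counts <- c(None = 42, Some = 14, Marked = 28)

# prop.table() divides each count by the total
pct <- round(100 * prop.table(counts), 1)
pct

barplot(pct, xlab = "Improvement", ylab = "Percent", las = 1)
```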

barplot(counts, xlab = "Improvement", ylab = "Frequency", las = 1)

The function barplot() can also be used to display data from a two-dimensional contingency table. The figure below draws a grouped bar chart and adds colors and legends. The code is as follows:

counts <- table(Improved, Treatment)
barplot(counts, 
        col = c("red", "yellow", "green"),
        xlab = "Treatment", 
        ylab = "Frequency", 
        beside = TRUE, las = 1)
legend("top", legend = rownames(counts), 
       fill = c("red", "yellow", "green"))

Bar charts can also be used to display means, medians, standard deviations, confidence intervals, etc. for different categories. This can be done with functions in the base packages, but it takes several steps. The function aggregate.plot() in the epiDisplay package simplifies the process.

The following code takes the data set anorexia as an example to draw a bar chart of the mean weight after treatment under different treatment methods. The results are shown in the figure below.

library(epiDisplay)
aggregate.plot(anorexia$Postwt, by = list(anorexia$Treat), 
               error = "sd", legend = FALSE, 
               bar.col = c("red", "yellow", "green"),
               ylim = c(0,100), las = 1,
               main = "")

The error bars above represent the standard deviation. To display the standard error or a confidence interval instead, change the error parameter in the function aggregate.plot().
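For comparison, here is a rough sketch of the base-graphics route mentioned earlier: compute the group means and standard deviations with tapply(), draw the bars, then add the error bars with arrows(). It is one possible approach, not a reproduction of what aggregate.plot() does internally, and the axis labels are chosen here.

```r
library(MASS)

# Mean and standard deviation of post-treatment weight per treatment
means <- tapply(anorexia$Postwt, anorexia$Treat, mean)
sds   <- tapply(anorexia$Postwt, anorexia$Treat, sd)

# barplot() returns the x positions of the bar centers
mid <- barplot(means, ylim = c(0, 100), las = 1,
               col = c("red", "yellow", "green"),
               xlab = "Treatment",
               ylab = "Mean weight after treatment (lbs)")

# angle = 90 with code = 3 draws a flat cap at both ends of each bar
arrows(mid, means - sds, mid, means + sds,
       angle = 90, code = 3, length = 0.05)
```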

4. Pie chart

Pie charts can be used to display the proportion of categorical data. For example, the following code draws a pie chart showing the distribution of disease types for emergency admissions to a hospital within a week.

percent <- c(5.8, 27.0, 0.5, 20.8, 12.8, 33.1)
disease <- c("Upper respiratory infection", "Stroke", "Trauma", "Syncope", "Food poisoning", "Other")
# Build labels such as "Stroke 27%"
lbs <- paste0(disease, " ", percent, "%")
pie(percent, labels = lbs, col = rainbow(6))


Most statisticians do not recommend pie charts and prefer bar charts or dot charts, because people judge length more accurately than area. For this reason, the function pie() in the base packages offers only limited options.

However, some contributed packages extend R's capabilities for drawing pie charts, such as the plotrix package. Its function pie3D() draws a three-dimensional pie chart, and its function fan.plot() draws a fan chart, which serves a similar purpose to a pie chart. Interested readers can install the package and view its help documentation.
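As an illustration of that recommendation, the same admissions data can be drawn as a sorted dot chart, where each share is read off as a position along a common axis. The English category names are renderings of the labels in the pie chart example, and the axis label is chosen here.

```r
percent <- c(5.8, 27.0, 0.5, 20.8, 12.8, 33.1)
names(percent) <- c("Upper respiratory infection", "Stroke", "Trauma",
                    "Syncope", "Food poisoning", "Other")

# Sorting puts the categories in order of size
sorted_pct <- sort(percent)
dotchart(sorted_pct, pch = 19, xlab = "Share of emergency admissions (%)")
```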

5. Box plots and violin plots

The box plot, also known as the box-and-whisker plot, is often used to display the approximate distribution of the data and to identify outliers. The function boxplot() can be used to draw box plots.

The following uses a box plot to show the distribution of the weight change from before to after treatment in the data set anorexia.

anorexia$wt.change <- anorexia$Postwt - anorexia$Prewt
boxplot(anorexia$wt.change, ylab = "Weight change (lbs)", las = 1)

To help readers understand each part of the box plot, manual annotations have been added to the figure below. If the data are symmetrically distributed, the median should lie midway between the upper quartile and the lower quartile, that is, the box is symmetrical about the median line. By default, each whisker extends to the most extreme value that lies within 1.5 times the height of the box (the interquartile range) from the box edge; values beyond the ends of the whiskers are generally considered outliers.

fivenum(anorexia$wt.change)

anorexia$wt.change <- anorexia$Postwt - anorexia$Prewt
b <- boxplot(anorexia$wt.change, ylab = "Weight change (lbs)", las = 1)
# Annotate each part of the box plot
text(1.27, 21.5, "Upper whisker end")
text(1.13, 15.5, "←—— Whisker")
text(1.31, 9.2, "Upper quartile")
text(1.26, 1.65, "Median")
text(1.31, -2.45, "Lower quartile")
text(1.13, -7, "←—— Whisker")
text(1.27, -12.2, "Lower whisker end")
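The numbers behind the drawing can also be inspected directly. The sketch below reads them from the value returned by boxplot() and recomputes the default outlier rule (1.5 times the height of the box beyond the box edges) by hand; the variable names are chosen here for illustration.

```r
library(MASS)

wt_change <- anorexia$Postwt - anorexia$Prewt

# plot = FALSE returns the statistics without drawing anything
b <- boxplot(wt_change, plot = FALSE)
b$stats   # lower whisker end, lower box edge, median, upper box edge, upper whisker end
b$out     # values drawn as individual points beyond the whiskers

# Recompute the default rule by hand
five        <- fivenum(wt_change)
box_height  <- five[4] - five[2]
upper_limit <- five[4] + 1.5 * box_height
lower_limit <- five[2] - 1.5 * box_height
outliers    <- wt_change[wt_change > upper_limit | wt_change < lower_limit]
outliers
```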


Parallel box plots can be used to compare the distribution of an indicator under each category of a categorical variable. For example, to compare weight changes under different treatments, you can use the following command:

boxplot(wt.change ~ Treat, data = anorexia,
        ylab = "Weight change (lbs)", las = 1)

The first parameter of the function boxplot() is a formula. Formulas in R generally use the symbol ~ to connect variables: the left side of ~ can be regarded as the dependent variable, and the right side as the independent variable. As can be seen from figure (a) below, the weight change in the "FT" (family treatment) group is higher than in the other two groups. Whether the difference is statistically significant, however, must be determined by a formal test.

A violin plot can be viewed as a combination of a box plot and a density plot. The function vioplot() in the vioplot package can be used to draw violin plots; install and load the package before use. For example, the figure above can be redrawn as a violin plot:
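The same formula notation works in many other functions. As a small sketch, aggregate() can use it to compute the median weight change per treatment, which gives the numbers behind the center lines of the parallel box plots.

```r
library(MASS)

anorexia$wt.change <- anorexia$Postwt - anorexia$Prewt

# Read as "wt.change explained by Treat": one median per treatment level
med <- aggregate(wt.change ~ Treat, data = anorexia, FUN = median)
med
```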

options(warn = -1) # suppress warnings for cleaner output
library(vioplot)
vioplot(wt.change ~ Treat, data = anorexia, 
        ylab = "Weight change (lbs)",
        col = "gold", las = 1)

6. Cleveland dot plot

The Cleveland dot plot is essentially a scatter plot that shows the size of each value by the position of a point. It is a way of plotting a large number of labeled values on a simple horizontal scale. It serves a similar purpose to a bar chart, but emphasizes the ordering of the data and the gaps between values.

The function dotchart() can be used to draw Cleveland dot plots. The data set VADeaths in the datasets package contains the mortality rates (in ‰) of people in different age groups in urban and rural Virginia, United States, in 1940.

VADeaths
dotchart(VADeaths)
dotchart(t(VADeaths),pch = 19)


As can be seen from the figure above, the mortality rate increases with age; within the same age group, the mortality rate of urban men is higher than that of rural men, while the urban-rural difference among women is small and inconsistent; within the same age group and the same area, the mortality rate of men is higher than that of women.
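Because the dot plot emphasizes ordering, it often helps to sort the values before plotting. A minimal sketch, taking a single age group from VADeaths (the choice of the "70-74" row and the axis label are made here for illustration):

```r
# One row of the matrix: the four population groups at ages 70-74
rates_70_74 <- sort(VADeaths["70-74", ])
rates_70_74

# Sorted values make the ranking and the gaps easy to read
dotchart(rates_70_74, pch = 19, xlab = "Deaths per 1000 (ages 70-74)")
```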

7. Export graphics

Graphics can be saved through the graphical user interface or with code. In the "Plots" pane in the lower right corner of RStudio, click "Export" and select "Save as Image" or "Save as PDF" to save the graphic to a chosen folder. We can also select "Copy to Clipboard" to paste the graphic directly into a Word or PowerPoint document. Note that a graphic saved this way depends on the size of the RStudio plot window, so windows of different sizes produce different graphics. (In ModelWhale, you can right-click the picture and save it directly.)

To save graphics for use in reports or papers, the author recommends using code: place the drawing statements between the statement that opens the target graphics device and the statement that closes it. For example, the following code saves the graph to the current working directory under the name "mygraph.pdf":

pdf("mygraph.pdf")
boxplot(wt.change ~ Treat, 
        data = anorexia, 
        ylab = "Weight change (lbs)",
        las = 1)
dev.off() # the PDF now appears under the work directory

In addition to the function pdf(), we can also use the functions png(), jpeg(), tiff(), and postscript() to save graphics in other formats.
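If the drawing code fails between opening and closing the device, the device stays open and later plots go to the wrong place. One way to guard against this is a small wrapper that closes the device with on.exit(). The helper name save_plot and the file name below are hypothetical, and the sketch writes to a temporary directory rather than the working directory.

```r
# Hypothetical helper: open a PDF device, run the drawing code,
# and always close the device, even if the drawing code fails.
save_plot <- function(file, draw, ...) {
  pdf(file, ...)
  on.exit(dev.off())
  draw()
  invisible(file)
}

out_file <- file.path(tempdir(), "dose_response.pdf")
save_plot(out_file, function() {
  plot(c(20, 30, 40, 45, 60), c(16, 20, 27, 40, 60), type = "b",
       xlab = "Dose", ylab = "Response to drug A")
}, width = 6, height = 4)   # pdf() takes width and height in inches

file.exists(out_file)
```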

Graphic files in bmp, png and jpeg formats are non-vector (raster) formats; their quality depends on resolution, but they take up very little space and are suitable for use in Word and PowerPoint documents. Graphic files in ps format are vector files, which are resolution-independent and suitable for typesetting and printing. Graphic files in tiff (or tif) format support many color systems, are independent of the operating system, and are the most widely used in publications. For example:

tiff(filename = "mygraph.tiff",
     width = 15, height = 12, units = "cm", res = 300)
boxplot(wt.change ~ Treat, data = anorexia, ylab = "Weight change (lbs)")
dev.off() # the TIFF now appears under the work directory

The above commands generate a graphics file named "mygraph.tiff". The parameters width and height set the width and height of the graph, the parameter units sets the unit of width and height, and the parameter res sets the resolution. Here it is set to 300, the minimum value required by most publications.

Summary
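The relationship between physical size, resolution, and pixel dimensions is simple arithmetic: pixels = centimeters / 2.54 x resolution in pixels per inch. The helper cm_to_px below is a hypothetical name used only to illustrate the calculation for the 15 cm x 12 cm figure above.

```r
# Hypothetical helper: convert a length in cm to pixels at a given
# resolution (pixels per inch); 1 inch = 2.54 cm.
cm_to_px <- function(cm, res) round(cm / 2.54 * res)

width_px  <- cm_to_px(15, 300)
height_px <- cm_to_px(12, 300)
c(width = width_px, height = height_px)
```

Doubling res doubles both pixel dimensions and so roughly quadruples the number of pixels in the image.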

Some other specialized graphics, such as the scatter plot matrix, correlation plot, normal Q-Q plot, survival curve, cluster plot, scree plot, ROC curve, and meta-analysis forest plot, are not covered here. Visualization is a very active area of R applications, with new packages emerging one after another. The website The R Graph Gallery collects a variety of novel graphics with sample code and is worth the attention of readers interested in visualization.

Origin blog.csdn.net/m0_52316372/article/details/132556392