R language: a detailed guide to the ggplot2 package and drawing polished graphics

1.1 Introduction to ggplot2

ggplot2 is an R package for generating statistical or data graphics.

Unlike most other graphics packages, ggplot2 has an underlying syntax based on the Grammar of Graphics that allows graphics to be composed by combining independent components.

This is the power of ggplot2: you can create new graphs tailored to a specific problem rather than being limited to a predefined set of graphs.

ggplot2 is actually easy to learn: there is a simple set of core principles, with few special cases.

1.2 Features of ggplot2
Defaults

ggplot2 produces attractive graphics with little effort, and users don't have to take care of tedious details such as drawing legends.

It provides a large number of sensible defaults, which means that users can produce publication-quality graphics in a short time. Instead of spending time making graphs look pretty, users can focus on creating graphs that best reveal the information in their data.

If you do have special formatting requirements, ggplot2 also provides many ways to override the defaults.

Iteration

ggplot2 is designed to work iteratively. Start with a layer that displays the raw data, then add layers of annotations and statistical summaries.

This lets users build graphs with the same structured thinking they would use to design an analysis, which shortens the distance between the picture in your head and the picture on the page. This is especially helpful for students who have not yet mastered the structured approach to analysis used by experts.
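
The iterative workflow described above can be sketched as follows. This is only an illustration: it uses the mpg dataset bundled with ggplot2 (introduced in section 1.4), and the names p_raw, p_stat and p_final, as well as the title and axis labels, are assumptions made for this example.

```r
library(ggplot2)

# Step 1: start by displaying the raw data
p_raw <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Step 2: add a statistical layer (a linear fit) on top of the raw data
p_stat <- p_raw +
  geom_smooth(method = "lm", formula = y ~ x)

# Step 3: add annotations such as a title and axis labels
p_final <- p_stat +
  labs(
    title = "Highway fuel economy by engine size",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  )

p_final
```

Each step reuses the previous plot object, so every intermediate result can be inspected before the next layer is added.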

Higher-level elements

Most graphics packages are just a collection of special-purpose graphics. For example, in base R, a graph is composed of primitive graphic elements such as lines and points, so it is difficult to design new components that combine with existing graphs. In ggplot2, the expressions used to create new graphs consist of higher-level elements, such as representations of raw data and statistical transformations, that can be easily combined with new datasets and other plots.

1.3 ggplot2 mapping components

All plots are made up of data (the information you want to visualize) and a mapping (a description of how the variables in the data map to aesthetic attributes). The mapping components are as follows:

Layer

A layer is a collection of geometric elements and statistical transformations. Geometric elements (geoms for short) represent what you actually see in a plot: points, lines, polygons, etc. Statistical transformations (stats for short) summarize data: for example, binning and counting observations to create a histogram, or fitting a linear model.

Scale

A scale maps values in data space to values in aesthetic space. This includes the use of colour, shape and size. Scales also draw legends and axes, which make it possible to read the original data values from the plot (an inverse mapping).

Coord

A coordinate or coordinate system describes how data coordinates map to the plane of the graph. It also provides axes and gridlines to aid in reading the chart. We usually use the Cartesian coordinate system, but there are a few others available, including polar coordinates and map projections.

Theme

Themes control details of the display, such as font size and background color. Although the defaults in ggplot2 have been carefully chosen by the author, users may still need to refer to other sources to create a more attractive plot.
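
As a rough illustration of how these components fit together, the sketch below adds one of each to a single plot. The particular choices (the "Set1" palette, the x limits, and theme_bw()) are assumptions made for this example, not recommendations from the original text.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, colour = drv)) +  # data and mapping
  geom_point() +                                   # layer: point geom
  scale_colour_brewer(palette = "Set1") +          # scale: maps drv to colours and draws the legend
  coord_cartesian(xlim = c(1, 7)) +                # coord: Cartesian system with explicit x limits
  theme_bw(base_size = 12)                         # theme: font size and background

p
```

Because each component is independent, any one of them can be swapped (for example, coord_polar() instead of coord_cartesian()) without touching the others.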

1.4 Datasets bundled with ggplot2

This article uses one of the datasets bundled with ggplot2: mpg. It includes fuel economy information for popular car models from 1999 and 2008, collected by the U.S. Environmental Protection Agency (http://fueleconomy.gov). You can access the data by loading ggplot2:

> library(ggplot2)
> mpg
# A tibble: 234 x 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
 2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
 3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
 4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
 5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
 6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
 7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
 8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
 9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
# ... with 224 more rows

These variables are mostly self-explanatory:

  • cty and hwy record miles per gallon for city and highway driving.
  • displ is the engine displacement in liters.
  • drv is the drivetrain: front wheels (f), rear wheels (r) or four wheels (4).
  • model is the model of the car. There are 38 models, selected because they had a new release every year between 1999 and 2008.
  • class is a categorical variable describing the "type" of the car: two-seater, SUV, compact, etc.
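
A quick way to check these descriptions against the data is to inspect the dataset directly. The calls below are a small sketch using only base R functions; the results are stored in variables (names chosen for this example) and then printed.

```r
library(ggplot2)

mpg_dim <- dim(mpg)                     # number of rows and columns
drv_counts <- table(mpg$drv)            # how many cars have each drivetrain type
n_models <- length(unique(mpg$model))   # number of distinct car models
year_range <- range(mpg$year)           # earliest and latest model year

mpg_dim
drv_counts
n_models
year_range
```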

1.5 ggplot2 components

Every ggplot2 plot has three key components:

  • Data
  • A set of aesthetic mappings between variables in the data and visual attributes
  • At least one layer that describes how to render each observation. Layers are usually created with a geom function

For example:

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point()

Note the structure of the code above: the data and aesthetic mappings are passed to ggplot(), and layers are added with +.

The result is a scatterplot with displ on the x-axis and hwy on the y-axis.

Components in this case:

  • Data: mpg
  • Mapping: displ maps to x position, hwy maps to y position.
  • Layer: geom_point()

The plot shows a strong correlation: as the engine size (displ) increases, fuel economy (hwy) gets worse.
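
The visual impression can be checked numerically. The sketch below computes the Pearson correlation between the two variables with base R's cor(); the variable name displ_hwy_cor is an assumption made for this example.

```r
library(ggplot2)

# Pearson correlation between engine displacement and highway mileage;
# a value close to -1 indicates a strong negative linear relationship
displ_hwy_cor <- cor(mpg$displ, mpg$hwy)
displ_hwy_cor
```

A correlation coefficient only measures the linear part of the relationship, so it complements the scatterplot rather than replacing it.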

1.6 Aesthetic parameters

To show additional variables on a plot, map them to other aesthetics such as colour, shape and size. Like x and y, these are specified inside aes(), for example:

aes(displ, hwy, colour = class)
aes(displ, hwy, shape = drv)
aes(displ, hwy, size = cyl)

colour = class gives each point a colour corresponding to its class. The legend allows us to read the data values from the colours.
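
The aes() fragments above are not complete plots on their own. A minimal sketch of the full calls, assuming the same mpg data and a point layer (the names p_colour, p_shape and p_size are illustrative):

```r
library(ggplot2)

# Map a categorical variable to colour
p_colour <- ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()

# Map a categorical variable to point shape
p_shape <- ggplot(mpg, aes(displ, hwy, shape = drv)) + geom_point()

# Map a numeric variable to point size
p_size <- ggplot(mpg, aes(displ, hwy, size = cyl)) + geom_point()

p_colour
p_shape
p_size
```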

If you want to set an aesthetic to a fixed value without scaling it, do so in the individual layer, outside of aes(). Compare the following two calls:

ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")

Both calls pass "blue", but they behave differently. In the first plot, the value "blue" is treated as data and scaled, so the points are drawn in a pinkish colour and a legend is added. In the second plot, the points are actually drawn in blue.
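
One way to see the difference without looking at the plots is to inspect the colours ggplot2 actually computes for each layer. The sketch below uses layer_data(), which returns a layer's data after the scales have been applied; the names p_mapped and p_fixed are assumptions made for this example.

```r
library(ggplot2)

p_mapped <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
p_fixed  <- ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")

# Inside aes(): "blue" is a data value, so the colour scale replaces it
# with the first colour of the default palette
mapped_colours <- unique(layer_data(p_mapped)$colour)

# Outside aes(): the literal value is used as the point colour
fixed_colours <- unique(layer_data(p_fixed)$colour)

mapped_colours
fixed_colours
```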

When using aesthetic properties in graphics, less is more. It is difficult to see the relationship between color, shape and size at the same time, so use restraint when using aesthetics. Instead of trying to make a very complex graphic that shows everything at the same time, create a series of simple graphics that tell a story and lead the reader from ignorance to knowledge.

1.7 Faceting (multi-panel plots)

Another technique for displaying additional categorical variables on a plot is faceting. Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset.

There are two types of faceting: grid and wrapped.

To facet a plot, append a faceting specification to the end of the plot:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  facet_wrap(~class)
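
The example above uses wrapped faceting. For comparison, a minimal sketch of grid faceting with facet_grid() follows; the choice of drv for the rows and cyl for the columns is an assumption made for this example.

```r
library(ggplot2)

# Grid faceting: rows are drivetrain types, columns are cylinder counts
p_grid <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

p_grid
```

facet_wrap() lays out the panels for one variable in a ribbon that wraps onto several rows, whereas facet_grid() arranges panels in a two-dimensional grid defined by a row variable and a column variable.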

1.8 Graph Types and Functions

geom_point() draws the data as points on the coordinate axes, producing a scatterplot that shows the distribution of the data.

geom_smooth() fits a smoother to the data and displays the smooth curve and its standard error.

geom_boxplot() generates a boxplot to summarize the distribution of a set of points.

geom_histogram() and geom_freqpoly() show the distribution of continuous variables.

geom_bar() shows the distribution of a categorical variable.

geom_path() and geom_line() draw lines between data points. Line graphs are limited to lines that move from left to right, while paths can move in any direction. Lines are often used to explore how things change over time.
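
The difference between geom_line() and geom_path() is easiest to see when the x values are not sorted. The data frame below is hypothetical and exists only for this sketch.

```r
library(ggplot2)

# Hypothetical data: the x values are deliberately out of order
df <- data.frame(x = c(1, 3, 2, 4), y = c(1, 2, 3, 4))

# geom_line() connects the points in order of x (left to right)
p_line <- ggplot(df, aes(x, y)) + geom_line()

# geom_path() connects the points in the order they appear in the data
p_path <- ggplot(df, aes(x, y)) + geom_path()

p_line
p_path
```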

1.9 Curve Fitting

Geoms can be used alone or in combination, for example:

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth()

This overlays the scatterplot with a smooth curve, including an assessment of uncertainty in the form of point-wise confidence intervals, shown in grey. If you are not interested in the confidence interval, turn it off with geom_smooth(se = FALSE).

Parameters:

An important parameter to geom_smooth() is method, which lets you choose which type of model is used to fit the smooth curve.

Available values include "loess", "lm", "gam" (requires the mgcv package) and "rlm" (requires the MASS package).

For method = "loess", span controls the wiggliness of the line; it ranges from 0 (very wiggly) to 1 (much smoother).
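
The effect of these parameters can be sketched as follows. The span values 0.2 and 1 and the plot names are assumptions chosen for illustration; formula = y ~ x is spelled out only to make the fitted model explicit. With a small span, loess may emit warnings on this dataset because many cars share the same displ value.

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# loess smoother with a small span: a wiggly curve
p_wiggly <- p_base + geom_smooth(method = "loess", formula = y ~ x, span = 0.2)

# loess smoother with a large span: a much smoother curve
p_smooth <- p_base + geom_smooth(method = "loess", formula = y ~ x, span = 1)

# linear model: a straight line of best fit, without the confidence band
p_lm <- p_base + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_smooth
p_lm
```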

1.10 Boxplots

The most basic boxplot and violin plot code:

ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
ggplot(mpg, aes(drv, hwy)) + geom_violin()

Their style can easily be adjusted through properties such as size, colour and fill. For example, here we set the fill colour:

ggplot(mpg, aes(drv, hwy)) + geom_violin(fill = "blue")

1.11 Histograms and frequency polygons

One line of code:

ggplot(data, aes(numeric_variable)) + geom_histogram()  # or geom_freqpoly()

Example:

ggplot(mpg, aes(hwy)) + geom_histogram()

ggplot(mpg, aes(hwy)) + geom_freqpoly()

Histograms and frequency polygons work the same way: they bin the data and then count the number of observations in each bin. The only difference is the display: the histogram uses bars and the frequency polygon uses lines.

Parameters:

You can control the width of the bins with the binwidth parameter.

If you don't want evenly spaced bins, use the breaks parameter.

ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 10.5)

With such a wide bin width, the number of bins is greatly reduced.
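
For comparison, the sketch below uses a narrow bin width and then unevenly spaced bins via breaks. The bin width of 1 and the break points are assumptions chosen for illustration; layer_data() is used to count the bins that were actually computed.

```r
library(ggplot2)

# A narrow bin width shows much more detail than binwidth = 10.5
p_narrow <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 1)

# Unevenly spaced bins: breaks gives the bin boundaries explicitly
p_breaks <- ggplot(mpg, aes(hwy)) +
  geom_histogram(breaks = c(10, 20, 25, 30, 45))

# Number of bins computed for each plot
n_bins_narrow <- nrow(layer_data(p_narrow))
n_bins_breaks <- nrow(layer_data(p_breaks))

n_bins_narrow
n_bins_breaks
```

It is worth trying several bin widths, because the apparent shape of a distribution can change considerably with the binning.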

1.12 Bar charts

A bar chart counts the observations in each category, or displays precomputed values.

ggplot(mpg, aes(manufacturer)) + 
  geom_bar()
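
The example above counts rows. To display precomputed values instead, the bar heights have to be taken from the data as-is. The sketch below assumes a small hypothetical summary table (drugs) invented for this illustration.

```r
library(ggplot2)

# Hypothetical precomputed summary, one row per category
drugs <- data.frame(
  drug = c("a", "b", "c"),
  effect = c(4.2, 9.7, 6.1)
)

# stat = "identity" tells geom_bar() to use the y values as bar heights
p_identity <- ggplot(drugs, aes(drug, effect)) + geom_bar(stat = "identity")

# geom_col() is a shorthand for the same thing
p_col <- ggplot(drugs, aes(drug, effect)) + geom_col()

p_col
```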

1.13 Time series

A time series plot puts time on the x-axis and shows how a single variable changes over time, usually as a line chart.

The code is similar to the previous examples:

ggplot(economics, aes(date, unemploy / pop)) +
  geom_line()
ggplot(economics, aes(date, uempmed)) +
  geom_line()

The two plots above show the unemployment rate and the median length of unemployment over time. To study the relationship between them in more detail, we want to draw both time series on the same plot. We could draw a scatterplot of the unemployment rate versus the length of unemployment, but then we could no longer see the evolution over time. The solution is to join points adjacent in time with line segments, forming a path plot, and to colour the points by time so that the direction of time is visible.
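
The original text does not include the code for this path plot. A minimal sketch of one way to draw it follows; the helper function year() and the grey path colour are assumptions made for this example.

```r
library(ggplot2)

# Illustrative helper: extract the calendar year from a Date
year <- function(x) as.POSIXlt(x)$year + 1900

p_path <- ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_path(colour = "grey50") +           # join points adjacent in time
  geom_point(aes(colour = year(date)))     # colour the points by year

p_path
```

Because year(date) is numeric, ggplot2 uses a continuous colour gradient, so the shade of each point indicates how recent the observation is.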

A path plot makes it more intuitive to observe how the two variables change together over time.

1.14 Scatter plot

ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")

Given a dataset, select the x-axis and y-axis variables from it and add a geom_point() layer to draw a scatter plot.

Origin blog.csdn.net/yt266666/article/details/127394061