Article directory
- 1.1 Introduction to Ggplot2
- 1.2 Features of Ggplot2
- 1.3 Ggplot2 mapping component
- 1.4 Ggplot2 comes with data set
- 1.5 Ggplot2 components
- 1.6 Aesthetic parameters
- 1.7 Multi-subgraph drawing
- 1.8 Graph Types and Functions
- 1.9 Curve Fitting
- 1.10 Boxplots
- 1.11 Histograms and frequency polygons
- 1.12 Bar chart
- 1.13 Time series
- 1.14 Scatter plot
1.1 Introduction to Ggplot2
ggplot2 is an R package for generating statistical or data graphics.
Unlike most other graphics packages, ggplot2 has an underlying syntax based on the Grammar of Graphics that allows graphics to be composed by combining independent components.
The ability to create new graphs based on a specific problem rather than being limited to a predefined set of graphs is the power of ggplot2.
ggplot2 is also easy to learn: there is a simple set of core principles, with very few special cases.
1.2 Features of Ggplot2
Defaults
ggplot2 produces attractive graphics with little effort, and users don't have to worry about tedious details such as drawing legends.
It provides a large number of defaults, which means that users can generate publication-quality graphics in a short time. Instead of spending time making graphs look pretty, users can focus on creating graphs that best reveal the information in their data.
If you do have special formatting requirements, ggplot2 also offers many ways to customize a plot.
Iteration
ggplot2 is designed to work iteratively. Start with a layer that shows the raw data, then add layers of annotations and statistical summaries.
This lets you build graphics using the same structured thinking you would use to design an analysis, which shortens the distance between the plot in your head and the one on the page. It is especially helpful for students who have not yet developed the structured approach to analysis used by experts.
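The iterative workflow can be sketched with the mpg dataset that ships with ggplot2. This is only an illustration: the variable names p_raw, p_smooth and p_final, and the choice of a linear fit, are assumptions made for the example.

```r
library(ggplot2)

# Step 1: display the raw data.
p_raw <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Step 2: add a statistical layer (here a linear fit) on top of the raw data.
p_smooth <- p_raw +
  geom_smooth(method = "lm", formula = y ~ x)

# Step 3: add annotations (axis labels).
p_final <- p_smooth +
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon")

p_final
```

Because each step is stored in its own variable, any intermediate plot can be printed and inspected before the next layer is added.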
High-level elements
Most graphics packages are just a collection of special-case plots. For example, in base R, a plot is assembled from primitive graphic elements such as lines and points, which makes it hard to design new components that combine with existing plots. In ggplot2, the expressions used to create a new plot are composed of higher-level elements, such as representations of the raw data and statistical transformations, that can easily be combined with new datasets and other plots.
1.3 Ggplot2 mapping component
All plots are composed of the data, the information you want to visualize, and a mapping, the description of how the data's variables are mapped to aesthetic attributes. The mapping components are as follows:
Layer
A layer is a collection of geometric elements and statistical transformations. Geometric elements (geoms for short) represent what you actually see in the plot: points, lines, polygons, etc. Statistical transformations (stats for short) summarize the data: for example, binning and counting observations to create a histogram, or fitting a linear model.
Scale
A scale maps values in the data space to values in the aesthetic space. This includes the use of colour, shape and size. Scales also draw the legends and axes, which make it possible to read the original data values from the plot (an inverse mapping).
Coord
A coord, or coordinate system, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to help read the graph. We normally use the Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.
Theme
A theme controls the finer points of display, such as font size and background colour. Although the defaults in ggplot2 have been chosen with care, you may still need to consult other references to create an attractive plot.
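As a rough sketch, the four components can all appear in one plot specification. The particular choices below (the "Set1" palette, the x limits and theme_bw()) are arbitrary picks for the illustration, not recommendations from the text.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +                           # layer: point geom
  scale_colour_brewer(palette = "Set1") +  # scale: drv values -> colours
  coord_cartesian(xlim = c(1, 7)) +        # coord: Cartesian, fixed x range
  theme_bw(base_size = 12)                 # theme: font size and background

p
```

Each component is added with + and can be swapped independently of the others, for example replacing theme_bw() with another theme without touching the layer or the scale.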
1.4 Ggplot2 comes with data set
The examples below use mpg, one of the datasets bundled with ggplot2. It includes fuel economy information for popular car models from 1999 and 2008, collected by the U.S. Environmental Protection Agency (http://fueleconomy.gov). You can access the data by loading ggplot2:
> library(ggplot2)
> mpg
# A tibble: 234 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
2 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
3 audi a4 2 2008 4 manu~ f 20 31 p comp~
4 audi a4 2 2008 4 auto~ f 21 30 p comp~
5 audi a4 2.8 1999 6 auto~ f 16 26 p comp~
6 audi a4 2.8 1999 6 manu~ f 18 26 p comp~
7 audi a4 3.1 2008 6 auto~ f 18 27 p comp~
8 audi a4 quattro 1.8 1999 4 manu~ 4 18 26 p comp~
9 audi a4 quattro 1.8 1999 4 auto~ 4 16 25 p comp~
10 audi a4 quattro 2 2008 4 manu~ 4 20 28 p comp~
# ... with 224 more rows
These variables are mostly self-explanatory:
- cty and hwy record miles per gallon (mpg) for city and highway driving.
- displ is the engine displacement in litres.
- drv is the drivetrain: front wheel (f), rear wheel (r) or four wheel (4).
- model is the model of the car. There are 38 models, selected because they had a new edition every year between 1999 and 2008.
- class is a categorical variable describing the "type" of the car: two-seater, SUV, compact, etc.
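Before plotting, it can help to check the variables described above directly in R. The following is a minimal sketch that uses only base R functions on the bundled dataset.

```r
library(ggplot2)

dim(mpg)                    # number of rows and columns
table(mpg$drv)              # counts per drivetrain code: 4, f, r
range(mpg$year)             # model years covered
length(unique(mpg$model))   # number of distinct models
summary(mpg[, c("displ", "cty", "hwy")])
```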
1.5 Ggplot2 components
Every ggplot2 plot has three key components:
- Data
- A set of aesthetic mappings between variables in the data and visual properties
- At least one layer that describes how to render each observation. Layers are usually created with a geom function
For example:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
Note the structure of the code above:
Data and aesthetic mappings are supplied in ggplot(), and layers are added with +.
The result is a scatterplot of hwy against displ.
The components in this example:
- Data: mpg
- Mapping: displ maps to x position, hwy maps to y position.
- Layer: geom_point()
The plot shows a strong negative relationship: as engine size (displ) increases, fuel economy (hwy) gets worse.
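The visual impression can be checked numerically. The sketch below uses base R's cor() and lm(); treating the relationship as linear is an assumption made here for simplicity.

```r
library(ggplot2)

# Pearson correlation between engine size and highway mileage.
r <- cor(mpg$displ, mpg$hwy)
r  # negative: larger engines go with fewer miles per gallon

# A simple linear model gives the average change in hwy per extra litre.
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)
```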
1.6 Aesthetic parameters
Like x and y, additional aesthetics such as colour, shape and size are specified inside aes(), for example:
aes(displ, hwy, colour = class)
aes(displ, hwy, shape = drv)
aes(displ, hwy, size = cyl)
colour = class gives each point a colour corresponding to its class. The legend allows us to read the data values from the colours.
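The aes() calls above are only the mapping part of a plot. As a sketch, complete plots built from them might look like this (the variable names p_colour, p_shape and p_size are illustrative):

```r
library(ggplot2)

# Each plot maps one extra variable to a different aesthetic.
p_colour <- ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()
p_shape  <- ggplot(mpg, aes(displ, hwy, shape = drv)) + geom_point()
p_size   <- ggplot(mpg, aes(displ, hwy, size = cyl)) + geom_point()

p_colour
p_shape
p_size
```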
If you want to set an aesthetic to a fixed value without scaling it, do so in the individual layer, outside of aes(). Compare the following two plots:
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")
Both calls use "blue" as input, but in the first plot the value is scaled like data, so the points are drawn in a pinkish default colour and a legend is added; in the second plot the points are actually drawn in blue.
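One way to see the difference without looking at the picture is to inspect the data ggplot2 actually draws. The sketch below uses layer_data(), which returns a layer's data after scales have been applied; the names p_mapped and p_fixed are illustrative.

```r
library(ggplot2)

p_mapped <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
p_fixed  <- ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")

# Mapped inside aes(): "blue" is treated as a data value and replaced
# by a colour from the default palette.
unique(layer_data(p_mapped)$colour)

# Set outside aes(): the literal colour is used as given.
unique(layer_data(p_fixed)$colour)
```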
When using aesthetic properties in graphics, less is more. It is difficult to see the relationship between color, shape and size at the same time, so use restraint when using aesthetics. Instead of trying to make a very complex graphic that shows everything at the same time, create a series of simple graphics that tell a story and lead the reader from ignorance to knowledge.
1.7 Multi-subgraph drawing
Another technique for displaying additional categorical variables on a plot is faceting. Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset.
There are two types of faceting: grid and wrapped.
To facet a plot, simply add a faceting specification at the end:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(~class)
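For comparison, here is a sketch of both faceting types side by side. The choice of drv and cyl as the grid variables and nrow = 2 for the wrapped layout are arbitrary picks for the example.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Wrapped: a one-dimensional sequence of panels, wrapped into 2 rows.
p_wrap <- base + facet_wrap(~class, nrow = 2)

# Grid: a two-dimensional layout, rows by drv and columns by cyl.
p_grid <- base + facet_grid(drv ~ cyl)

p_wrap
p_grid
```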
1.8 Graph Types and Functions
geom_point() draws the data as points, producing a scatterplot that shows how the data are distributed.
geom_smooth() fits a smoother to the data and displays the smooth and its standard error.
geom_boxplot() generates a box-and-whisker plot to summarize the distribution of a set of points.
geom_histogram() and geom_freqpoly() show the distribution of continuous variables.
geom_bar() shows the distribution of a categorical variable.
geom_path() and geom_line() draw lines between the data points. A line plot is constrained to produce lines that travel from left to right, while paths can go in any direction. Lines are typically used to explore how things change over time.
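The difference between geom_line() and geom_path() shows up when the x values are not sorted. The sketch below uses a small made-up data frame (df is hypothetical, not part of ggplot2) and layer_data() to look at the order in which points are connected.

```r
library(ggplot2)

# Hypothetical toy data: x is deliberately out of order.
df <- data.frame(x = c(1, 3, 2, 4), y = c(1, 2, 4, 3))

p_line <- ggplot(df, aes(x, y)) + geom_line()  # connects points in order of x
p_path <- ggplot(df, aes(x, y)) + geom_path()  # connects points in row order

layer_data(p_line)$x  # sorted by x
layer_data(p_path)$x  # original row order
```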
1.9 Curve Fitting
Geoms can be used alone or in combination, for example:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth()
This overlays the scatterplot with a smooth curve, including an assessment of uncertainty in the form of pointwise confidence intervals shown in grey. If you are not interested in the confidence interval, turn it off with geom_smooth(se = FALSE).
Parameters:
An important argument to geom_smooth() is method, which selects the type of model used to fit the smooth curve.
Possible values include "loess", "lm", "gam" and "rlm" ("rlm" comes from the MASS package, which must be loaded first).
For method = "loess", span controls the wiggliness of the line, ranging from 0 (very wiggly) to 1 (much smoother).
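A short sketch of these arguments in use follows. The span values 0.2 and 1 are simply the two extremes chosen for contrast, and formula = y ~ x is spelled out only to avoid the message ggplot2 prints when it picks a default formula.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# loess with a small span: a wiggly curve that follows local structure.
p_wiggly <- base + geom_smooth(method = "loess", formula = y ~ x, span = 0.2)

# loess with a large span: a much smoother curve.
p_smooth <- base + geom_smooth(method = "loess", formula = y ~ x, span = 1)

# Straight-line fit without the confidence band.
p_lm <- base + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_wiggly
p_smooth
p_lm
```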
1.10 Boxplots
The most basic boxplot, and the closely related violin plot:
ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
ggplot(mpg, aes(drv, hwy)) + geom_violin()
Their appearance can be adjusted through aesthetics such as size, colour, shape and fill; for example, here we set the fill colour:
ggplot(mpg, aes(drv, hwy)) + geom_violin(fill = "blue")
1.11 Histograms and frequency polygons
The general pattern is a single line of code:
ggplot(data, aes(numeric_variable)) + geom_histogram()   # or geom_freqpoly()
Example:
ggplot(mpg, aes(hwy)) + geom_histogram()
ggplot(mpg, aes(hwy)) + geom_freqpoly()
Histograms and frequency polygons work the same way: they bin the data and then count the number of observations in each bin. The only difference is the display: the histogram uses bars and the frequency polygon uses lines.
Parameters:
You can control the width of the bins with the binwidth parameter.
If you don't want evenly spaced bins, use the breaks parameter.
ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 10.5)
With such a wide binwidth, the plot has far fewer bins than the default.
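The breaks parameter can be sketched as follows. The inner bin edges (15, 20, 25, 30) are arbitrary values chosen for the illustration; the outer edges are taken from the data so that no observation is dropped.

```r
library(ggplot2)

# Unevenly spaced bin edges; the outer edges come from the data range.
edges <- c(min(mpg$hwy), 15, 20, 25, 30, max(mpg$hwy))
p_breaks <- ggplot(mpg, aes(hwy)) + geom_histogram(breaks = edges)

# Narrow, evenly spaced bins for comparison.
p_narrow <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 1)

# Every observation falls into exactly one bin, so the counts add up
# to the number of rows in mpg.
sum(layer_data(p_breaks)$count)

p_breaks
p_narrow
```

It is usually worth trying several bin widths, since the default settings may hide or exaggerate features of the distribution.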
1.12 Bar chart
A bar chart either counts the observations in each category of a discrete variable or displays precomputed values.
ggplot(mpg, aes(manufacturer)) +
geom_bar()
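The example above counts rows per manufacturer. When the values are already summarized, one option is geom_col(), which draws the values as given. The data frame drugs below is made up purely for illustration.

```r
library(ggplot2)

# Hypothetical precomputed summary: one row per category.
drugs <- data.frame(
  drug   = c("a", "b", "c"),
  effect = c(4.2, 9.7, 6.1)
)

# geom_bar() would count rows (one per drug here);
# geom_col() uses the effect values as bar heights.
p_col <- ggplot(drugs, aes(drug, effect)) + geom_col()

# An equivalent spelling with geom_bar():
p_identity <- ggplot(drugs, aes(drug, effect)) + geom_bar(stat = "identity")

p_col
```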
1.13 Time series
A time series plot puts time on the x-axis and shows how a single variable changes over time, usually as a line chart.
The code follows the same pattern as before, here using the economics dataset bundled with ggplot2:
ggplot(economics, aes(date, unemploy / pop)) +
geom_line()
ggplot(economics, aes(date, uempmed)) +
geom_line()
The two plots above show the unemployment rate and the median length of unemployment over time. To examine the relationship between them in more detail, we would like to draw both time series on the same plot. We could draw a scatterplot of the unemployment rate against the length of unemployment, but then we could no longer see the evolution over time. The solution is to join points that are adjacent in time with line segments, forming a path plot, and to use colour to show the year.
This makes it easier to see how the two variables change together over time.
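The path plot described above could be written roughly as follows. The helper year() is defined here for the example and is not part of ggplot2; the grey path colour is an arbitrary choice.

```r
library(ggplot2)

# Helper (defined for this example): extract the calendar year from a Date.
year <- function(x) as.POSIXlt(x)$year + 1900

p_path <- ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_path(colour = "grey50") +        # join points adjacent in time
  geom_point(aes(colour = year(date)))  # colour the points by year

p_path
```

Because year(date) is continuous, the points are coloured on a gradient, so the direction of time can be read from the change in colour along the path.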
1.14 Scatter plot
ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")
Given a dataset, map two of its variables to the x and y axes and add a geom_point() layer to draw a scatter plot.