Data reshaping in R (converting between long and wide tables)

Study notes are for study use only.

Table of contents

1- What is tidy data?

2- Converting a wide table to a long table

Example 1:

Example 2:

Example 3:

3- Converting a long table to a wide table

Example 1:

Example 2:

1- What is tidy data?

According to Hadley Wickham, tidy data has the following characteristics:

  1. Each variable forms a column; that is, values with the same attribute go in one column;
  2. Each observation forms a row;
  3. Each value, meaning one variable for one observation, occupies its own cell.

Data that do not meet these conditions are called dirty or untidy data. Such data often show one or more of the following problems:

  1. Column names are values rather than variable names;
  2. Multiple variables are stored in one column;
  3. Variables are stored in both rows and columns;
  4. Multiple types of observational units are stored in the same table;
  5. A single observational unit is spread across multiple tables.

Data reshaping: The functions in the tidyverse packages operate on tidy data frames, so untidy data must first be transformed into tidy data. This process is called data reshaping.

Data reshaping includes converting between long and wide tables, splitting and merging columns, and rectangling. Converting between long and wide tables is done with the pivot_longer() and pivot_wider() functions.

Dirty data example and description:

In this example, male and female are both values of gender, so the two columns should be combined into a single gender variable. As it stands, the table violates the first tidy-data rule: each variable forms a column.
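The situation described above can be sketched with a small made-up table. The data, the class column, and the count column name are assumptions for illustration only; they are not the table from the original example.

```r
library(tidyr)

# Hypothetical counts: the column names male/female are really
# values of a single "gender" variable
untidy <- tibble::tibble(
  class  = c("A", "B"),
  male   = c(12, 9),
  female = c(10, 14)
)

# Move the two column names into a gender column and
# their cell values into a count column
tidy <- pivot_longer(untidy, cols = c(male, female),
                     names_to = "gender", values_to = "count")
tidy
```

After reshaping, gender is an ordinary column and each row describes one class-gender combination.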

In this example, the two variables age and weight are stored in a single column. They are separated by a backslash, which a human reader can interpret from common sense, but the computer cannot: it treats what should be two numeric columns as a single string column. This violates the third tidy-data rule: each value occupies its own cell.
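One possible way to repair such a column is tidyr's separate() function. The sketch below uses made-up data, and it assumes a forward slash "/" as the separator to keep the pattern simple. Note that separate() is marked as superseded in recent tidyr releases but still works.

```r
library(tidyr)

# Hypothetical data; the column name age_weight and the "/" separator are assumptions
untidy <- tibble::tibble(
  name       = c("Ann", "Bob"),
  age_weight = c("23/55.5", "31/70")
)

# Split the string column into two columns;
# convert = TRUE turns the pieces back into numbers
tidy <- separate(untidy, age_weight, into = c("age", "weight"),
                 sep = "/", convert = TRUE)
str(tidy)
```

With convert = TRUE the new columns are numeric again, so they can be used in calculations.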

In this example, none of the three tidy-data requirements is met.

The key to making data tidy is to learn to distinguish variables, observations, and values.

2- Converting a wide table to a long table

Wide table: a dataset in which every variable is broken out into its own columns, which makes the table relatively wide. Values that belong in cells have been used as column names. For example, male and female should be cell values of a gender column, but here they have become the names of two separate columns.

Long table: a dataset in which such categories are stored as the values of a categorical variable column.

Use the pivot_longer() function in the tidyr package to convert a wide table into a long table. The pivot_wider() function converts a long table into a wide table; it is the inverse of pivot_longer().

Syntax:

pivot_longer(data, cols, names_to, values_to, values_drop_na, ...)

where:

  • data: the data frame to reshape;
  • cols: the columns to reshape, chosen with column-selection (tidyselect) syntax;
  • names_to: the name of the new column (or columns, depending on the problem) that will store the names of the reshaped columns;
  • values_to: the name of the new column that will store the cell values of the reshaped columns;
  • values_drop_na: whether to drop rows whose value is missing (NA, not available) in the reshaped columns;
  • names_prefix, names_sep, names_pattern: if the names of the reshaped columns contain more than the wanted "content" (a prefix, a variable name plus a separator, or a structure best described by regular-expression capture groups), these parameters extract the wanted part of each column name.
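As a minimal sketch of the parameters listed above, the following uses names_prefix to strip a leading "X" from the column names. The data frame and its values are made up for illustration, and names_transform is an additional pivot_longer() argument (not mentioned in the list) used here to convert the remaining text to integers.

```r
library(tidyr)

# Hypothetical figures; the year columns carry an "X" prefix
wide <- tibble::tibble(
  region = c("North", "South"),
  X2019  = c(35.4, 14.1),
  X2018  = c(33.1, 13.4)
)

long <- pivot_longer(
  wide,
  cols            = starts_with("X"),         # columns to reshape
  names_to        = "year",                   # new column holding the old names
  names_prefix    = "X",                      # drop the leading "X"
  names_transform = list(year = as.integer),  # "2019" -> 2019L
  values_to       = "gdp"                     # new column holding the cell values
)
long
```

Stripping the prefix during the reshape avoids a separate string-cleaning step afterwards.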

Example 1:

Wide to long (the names of the reshaped columns are stored in a single column)

> library(tidyverse)  # provides pivot_longer(), pivot_wider() and %>%
> df <- read.csv("配套数据/分省年度GDP.csv")
> df
      地区  X2019年  X2018年  X2017年
1   北京市 35371.28 33105.97 28014.94
2   天津市 14104.28 13362.92 18549.19
3   河北省 35104.52 32494.61 34016.32
4 黑龙江省 13612.68 12846.48 15902.68


> df %>%
+   pivot_longer(-地区, names_to = "年份", values_to="GDP")
# A tibble: 12 × 3
   地区     年份       GDP
   <chr>    <chr>    <dbl>
 1 北京市   X2019年 35371.
 2 北京市   X2018年 33106.
 3 北京市   X2017年 28015.
 4 天津市   X2019年 14104.
 5 天津市   X2018年 13363.
 6 天津市   X2017年 18549.
 7 河北省   X2019年 35105.
 8 河北省   X2018年 32495.
 9 河北省   X2017年 34016.
10 黑龙江省 X2019年 13613.
11 黑龙江省 X2018年 12846.
12 黑龙江省 X2017年 15903.

df is a wide table. Every column except 地区 (region) needs to be reshaped. The arguments of pivot_longer() are used as follows:

  • The first argument, data, is omitted because the pipe supplies df.
  • The second argument, -地区, selects the columns to reshape: all columns except 地区.
  • names_to = "年份" (year) creates a new column named 年份 that stores the names of the reshaped columns.
  • values_to = "GDP" creates a new column named GDP that stores the cell values of the reshaped columns.

As this example shows, the names of the reshaped columns are moved into the cells of a new column and repeat in a cycle for each row of the original table: X2019年, X2018年, X2017年 form one cycle.

Example 2:

Wide to long (the names of the reshaped columns are split across multiple columns)

Raw data: a wide table

Goal: convert it into a long table with one row per child

Analysis:

The data record information about the children in each family. For example, family 1 has two children, child1 and child2, and the date of birth and gender of each child are recorded.

In the raw data, the columns to reshape are all columns except family. Their names consist of two parts separated by an underscore. The goal is a table with three new columns: child, dob (date of birth), and gender. The first part of each column name (dob or gender) becomes the name of a value column, while the second part (child1 or child2) is stored in a new column named child.

> load("配套数据/family.rda")
> knitr::kable(family, align="c")


| family | dob_child1 | dob_child2 | gender_child1 | gender_child2 |
|:------:|:----------:|:----------:|:-------------:|:-------------:|
|   1    | 1998-11-26 | 2000-01-29 |       1       |       2       |
|   2    | 1996-06-22 |     NA     |       2       |      NA       |
|   3    | 2002-07-11 | 2004-04-05 |       2       |       2       |
|   4    | 2004-10-10 | 2009-08-27 |       1       |       1       |
|   5    | 2000-12-05 | 2005-02-28 |       2       |       1       |
> 
> family %>%
+   pivot_longer(-family,
+                names_to = c(".value", "child"),
+                names_sep="_",
+                values_drop_na = TRUE)
# A tibble: 9 × 4
  family child  dob        gender
   <int> <chr>  <date>      <int>
1      1 child1 1998-11-26      1
2      1 child2 2000-01-29      2
3      2 child1 1996-06-22      2
4      3 child1 2002-07-11      2
5      3 child2 2004-04-05      2
6      4 child1 2004-10-10      1
7      4 child2 2009-08-27      1
8      5 child1 2000-12-05      2
9      5 child2 2005-02-28      1

Code explanation:

  • -family selects the columns to reshape: all columns except family.
  • names_sep = "_" states that the names of the reshaped columns are split at the underscore.
  • names_to = c(".value", "child") sets the names of the new columns in the long table. Each reshaped column name is split at the underscore into two parts: the first part is dob or gender, and the second part is child1 or child2.
    • ".value" means that the first part is used as the name of an output column, and the corresponding cell values go into that column, so dob and gender stay as columns;
    • "child" is the name of the new column that stores the second part of each reshaped column name, namely child1 and child2.
  • values_drop_na = TRUE drops the rows that contain only missing values (NA) in the reshaped columns.
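To see the effect of values_drop_na without the family.rda file, the sketch below rebuilds the first two families by hand (values copied from the table above; the object name fam is an assumption) and reshapes them with and without dropping missing rows.

```r
library(tidyr)

# First two families from the example, typed in by hand
fam <- tibble::tibble(
  family        = 1:2,
  dob_child1    = as.Date(c("1998-11-26", "1996-06-22")),
  dob_child2    = as.Date(c("2000-01-29", NA)),
  gender_child1 = c(1L, 2L),
  gender_child2 = c(2L, NA)
)

# Keep every family-child combination, including the all-NA one
keep_na <- pivot_longer(fam, -family,
                        names_to = c(".value", "child"),
                        names_sep = "_")

# Drop rows in which dob and gender are both missing
drop_na <- pivot_longer(fam, -family,
                        names_to = c(".value", "child"),
                        names_sep = "_",
                        values_drop_na = TRUE)

nrow(keep_na)
nrow(drop_na)
```

Family 2 has no second child, so its child2 row contains only NA values and disappears when values_drop_na = TRUE.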

Example 3:

Wide to long (the names of the reshaped columns are split across multiple columns)

Raw data: a wide table

Goal: convert it into a long table with one row per team member

> df <- read.csv("配套数据/参赛队信息.csv")
> df
  队员1姓名 队员1专业 队员2姓名 队员2专业 队员3姓名 队员3专业
1      张三      数学      李四      英语      王五    统计学
2      赵六    经济学      钱七      数学      孙八    计算机
> 
> df %>%
+   pivot_longer(everything(),
+                names_to=c("队员", ".value"),
+                names_pattern = "(.*\\d)(.*)")
# A tibble: 6 × 3
  队员  姓名  专业  
  <chr> <chr> <chr> 
1 队员1 张三  数学  
2 队员2 李四  英语  
3 队员3 王五  统计学
4 队员1 赵六  经济学
5 队员2 钱七  数学  
6 队员3 孙八  计算机

 Code explanation:

  • everything() selects all columns, so every column is reshaped;
  • names_pattern = "(.*\\d)(.*)" splits each column name with regular-expression capture groups. \\d matches a digit (0-9), . matches any character except a newline, and * repeats the preceding element zero or more times. The first group (.*\\d) therefore captures everything up to and including the digit, and the second group (.*) captures the rest.
  • names_to = c("队员", ".value") names the new columns. Each reshaped column name is split by the regular expression into two parts. The first part (队员1, 队员2, 队员3) is stored in a new column named 队员 (team member). The second part (姓名, name; 专业, major) is matched with ".value", so it becomes the name of an output column that holds the corresponding cell values.
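The way the capture groups split a column name can be checked directly. The sketch below uses English stand-in column names (player1name, player1major, and so on) and made-up rows; these are assumptions for illustration, not the original CSV.

```r
library(tidyr)

# Hypothetical team table with English stand-in column names
teams <- tibble::tibble(
  player1name  = c("Zhang San", "Zhao Liu"),
  player1major = c("Math", "Economics"),
  player2name  = c("Li Si", "Qian Qi"),
  player2major = c("English", "Math")
)

# How the pattern splits one column name:
# element 1 is the full match, elements 2 and 3 are the two groups
parts <- regmatches("player1name",
                    regexec("(.*\\d)(.*)", "player1name"))[[1]]
parts

# Group 1 goes into the player column; group 2 (.value) names the output columns
long <- pivot_longer(teams, everything(),
                     names_to = c("player", ".value"),
                     names_pattern = "(.*\\d)(.*)")
long
```

Testing the pattern on a single name first makes it easier to confirm that the groups capture what names_to expects.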

3- Converting a long table to a wide table

Use the pivot_wider() function in the tidyr package to convert a long table into a wide table:

pivot_wider(data, id_cols, names_from, values_from, values_fill, ...)

where:

  • data: the data frame to reshape;
  • id_cols: the columns that uniquely identify each observation; by default, all columns other than those given in names_from and values_from;
  • names_from: the column (or columns) whose values become the new column names;
  • values_from: the column (or columns) whose values fill the cells of the new columns;
  • values_fill: the value used to fill cells that are missing after the table is widened;
  • names_prefix, names_sep, names_glue: parameters that help adjust the new column names.

Example 1:

There is only one names column and one values column.

Names column: Type

Values column: Heads

In a tidy data frame, a bare column name gives access to the whole column.

> load("配套数据/animals.rda")
> animals
# A tibble: 228 × 3
   Type    Year  Heads
   <chr>  <int>  <dbl>
 1 Sheep   2015 24943.
 2 Cattle  1972  2189.
 3 Camel   1985   559 
 4 Camel   1995   368.
 5 Camel   1997   355.
 6 Goat    1977  4411.
 7 Cattle  1979  2477.
 8 Cattle  2014  3414.
 9 Cattle  1996  3476.
10 Cattle  2017  4388.
# ℹ 218 more rows
# ℹ Use `print(n = ...)` to see more rows
> 
> animals %>%
+   pivot_wider(names_from=Type, values_from=Heads, values_fill = 0)
# A tibble: 48 × 6
    Year  Sheep Cattle Camel   Goat Horse
   <int>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
 1  2015 24943.  3780.  368. 23593. 3295.
 2  1972 13716.  2189.  625.  4338. 2239.
 3  1985 13249.  2408.  559   4299. 1971 
 4  1995     0   3317.  368.  8521. 2684.
 5  1997 14166.  3613.  355. 10265. 2893.
 6  1977 13430.  2388.  609   4411. 2104.
 7  1979 14400.  2477.  614.  4715. 2079.
 8  2014 23215.  3414.  349. 22009.    0 
 9  1996 13561.  3476.  358.  9135. 2770.
10  2017 30110.  4388.  434. 27347. 3940.
# ℹ 38 more rows
# ℹ Use `print(n = ...)` to see more rows

As you can see, the values in the Type column of animals repeat. Each distinct value, that is, each type of animal, becomes a new column, so there are as many new columns as there are animal types. names_from specifies which column of the original data supplies the new column names; it takes a bare column name, which refers to the whole column (its name plus its cell contents). values_from specifies which column of the original data supplies the cell contents of the new columns.
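The role of values_fill is easiest to see on a table with a gap. The sketch below uses a tiny made-up table (the object name animals_small and the figures are illustrative assumptions) in which Cattle has no record for 2016.

```r
library(tidyr)

# Illustrative long table: there is no Cattle row for 2016
animals_small <- tibble::tibble(
  Type  = c("Sheep", "Cattle", "Sheep"),
  Year  = c(2015L, 2015L, 2016L),
  Heads = c(24943, 3780, 27857)
)

# Without values_fill, the missing combination becomes NA
wide_na <- pivot_wider(animals_small,
                       names_from = Type, values_from = Heads)

# With values_fill = 0, the missing combination is filled with 0
wide_zero <- pivot_wider(animals_small,
                         names_from = Type, values_from = Heads,
                         values_fill = 0)
wide_na
wide_zero
```

Whether 0 is an appropriate replacement depends on the data: a missing record is not always the same as a count of zero.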

Example 2:

There are multiple names columns or multiple values columns. The following example has two values columns, estimate and moe.

> us_rent_income  # a dataset that ships with tidyr
# A tibble: 104 × 5
   GEOID NAME       variable estimate   moe
   <chr> <chr>      <chr>       <dbl> <dbl>
 1 01    Alabama    income      24476   136
 2 01    Alabama    rent          747     3
 3 02    Alaska     income      32940   508
 4 02    Alaska     rent         1200    13
 5 04    Arizona    income      27517   148
 6 04    Arizona    rent          972     4
 7 05    Arkansas   income      23789   165
 8 05    Arkansas   rent          709     5
 9 06    California income      29454   109
10 06    California rent         1358     3
# ℹ 94 more rows
# ℹ Use `print(n = ...)` to see more rows
> us_rent_income%>%
+   pivot_wider(names_from=variable, values_from=c(estimate, moe))
# A tibble: 52 × 6
   GEOID NAME             estimate_income estimate_rent moe_income moe_rent
   <chr> <chr>                      <dbl>         <dbl>      <dbl>    <dbl>
 1 01    Alabama                    24476           747        136        3
 2 02    Alaska                     32940          1200        508       13
 3 04    Arizona                    27517           972        148        4
 4 05    Arkansas                   23789           709        165        5
 5 06    California                 29454          1358        109        3
 6 08    Colorado                   32401          1125        109        5
 7 09    Connecticut                35326          1123        195        5
 8 10    Delaware                   31560          1076        247       10
 9 11    District of Col…           43198          1424        681       17
10 12    Florida                    25952          1077         70        3
# ℹ 42 more rows
# ℹ Use `print(n = ...)` to see more rows
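When there are several values columns, the default new names have the form value_name (for example estimate_income). As a sketch of how names_glue can change this, the following puts the variable first. The glue template is an illustrative choice, not something the original notes prescribe.

```r
library(tidyr)

# us_rent_income ships with tidyr
wide <- pivot_wider(
  us_rent_income,
  names_from  = variable,
  values_from = c(estimate, moe),
  names_glue  = "{variable}_{.value}"  # e.g. income_estimate instead of estimate_income
)
names(wide)
```

Inside the template, {.value} stands for the name of the values column and {variable} for the value taken from the names_from column.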

Summary of wide and long tables:

Converting a wide table to a long table "integrates" the columns to be reshaped into fewer columns than before, which is why the table becomes longer. "Integrate" is in quotation marks because it is the information in those columns that is combined: new columns are created and some existing columns are kept. Take a table in which male and female each form a column of their own. Converting it to a long table combines male and female into one variable named gender. A newly created column needs a name, and that name does not exist in the data yet, so it is supplied as a string, "gender"; the cells of this column cycle through male and female. The cell values that used to sit under the male and female columns are gathered into another new column, whose name is likewise supplied as a string.

Converting a long table to a wide table starts from a column whose contents repeat. The distinct values of that column become the new column names, so names_from states which column of the original data supplies them. Because that column already exists, it is given as a bare column name rather than a string. The cells of the new columns are then filled with values from one or more columns of the original long table (which ones depends on the problem), so values_from also takes bare column names, without quotation marks.
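The two directions can be checked against each other with a round trip. The sketch below uses a small made-up table; note that the new column names are passed to pivot_longer() as strings, while pivot_wider() receives existing columns as bare names.

```r
library(tidyr)

# Hypothetical wide table
wide <- tibble::tibble(
  region = c("A", "B"),
  y2019  = c(1.5, 2.5),
  y2018  = c(1.0, 2.0)
)

# Wide -> long: the new column names do not exist yet, so they are strings
long <- pivot_longer(wide, -region, names_to = "year", values_to = "gdp")

# Long -> wide: year and gdp already exist, so they are bare column names
back <- pivot_wider(long, names_from = year, values_from = gdp)

# The round trip should reproduce the original table
all.equal(as.data.frame(back), as.data.frame(wide))
```

If the comparison is not TRUE on real data, a likely cause is rows dropped by values_drop_na or duplicated identifier rows in the long table.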

Sample data source:

R Language Programming: Based on tidyverse, Asynchronous Community (epubit.com)

References:

"R Language Programming" (People's Posts and Telecommunications Press, February 2023)

"R Data Science in Practice: Detailed Explanation of Tools and Case Analysis" (Machinery Industry Press, June 2019)

"R Language Data Visualization in Practice (Micro-Video Full Solution Edition): Professional Big-Data Charts from Beginner to Mastery" (Electronic Industry Press, February 2022)

Origin blog.csdn.net/u011375991/article/details/132025047