A pitfall when filtering rows that contain missing values in the tidyverse

Hello everyone, I am Deng Fei. I haven't updated my blog for a long time, because I haven't made progress for a long time.

I used to think Lu Xun was right when he wrote in "Wild Grass": "When I am silent, I feel full; when I speak, I feel empty at the same time." My situation now is exactly that. When I stop updating, I feel full and stress-free, so I want to update less and less, and in the end I find there is nothing to write. Once I do want to write something, I feel very empty. My stomach is empty, yet I have started to grow a big belly, as if there were nothing in it and yet it were all fat. The sadness of an adult...

Back on track: when I used dplyr's filter to subset data today, I got unexpected results. In line with the principle that "learning should be shared, and output is the best learning", I simulated a data set to demonstrate this pitfall and how to avoid it.

1. First simulate a set of data


set.seed(123)
df = data.frame(ID = 1:10, Sex = c("F","F","F","F","NA","F","F","NA","M","M"), y1 = c(rnorm(9),NA))
df

Data are as follows:

> df
   ID Sex          y1
1   1   F -0.56047565
2   2   F -0.23017749
3   3   F  1.55870831
4   4   F  0.07050839
5   5  NA  0.12928774
6   6   F  1.71506499
7   7   F  0.46091621
8   8  NA -1.26506123
9   9   M -0.68685285
10 10   M          NA

2. Extract the rows where Sex is not F

I have three methods:

  • The first: use !=
  • The second: use ! in front of ==
  • The third: use ! in front of %in%

The sample code is as follows:

library(tidyverse)  # provides %>% and filter()
df %>% filter(Sex != "F")
df %>% filter(!Sex == "F")
df %>% filter(!Sex %in% "F")

Result: all three return the same rows, namely those with ID 5, 8, 9 and 10.
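A note on the notation: in R the ! operator binds more loosely than == and %in%, so !Sex == "F" is read as !(Sex == "F") and !Sex %in% "F" as !(Sex %in% "F"). A small base R sketch, using a made-up vector sex rather than the data frame above, to confirm this:

```r
# Made-up vector for illustration; "NA" here is ordinary text, as in step 1
sex <- c("F", "M", "NA", "F")

neg_eq        <- !sex == "F"       # parsed as !(sex == "F")
neg_eq_paren  <- !(sex == "F")
neg_in        <- !sex %in% "F"     # parsed as !(sex %in% "F")
neg_in_paren  <- !(sex %in% "F")

identical(neg_eq, neg_eq_paren)    # TRUE
identical(neg_in, neg_in_paren)    # TRUE

# With no real missing values, all three conditions agree
identical(sex != "F", neg_eq)      # TRUE
identical(sex != "F", neg_in)      # TRUE
```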


3. Save the data to Excel and read it back

library(openxlsx)  # provides write.xlsx() and read.xlsx()
write.xlsx(df, "df_test.xlsx")

Read the Excel data back:

df = read.xlsx("df_test.xlsx")
df

4. The weird moment: different results after reading from Excel

library(tidyverse)
df %>% filter(Sex != "F")
df %>% filter(!Sex == "F")
df %>% filter(!Sex %in% "F")

The first two no longer give what I expected: the rows where Sex is NA are silently dropped...

Only the third still returns the NA rows.

5. The data frame built in R is fine, but it breaks after a round trip through Excel

It's that weird.

Complete code:

set.seed(123)
df = data.frame(ID = 1:10, Sex = c("F","F","F","F","NA","F","F","NA","M","M"), y1 = c(rnorm(9),NA))
df

# Load the packages, run the three filters, and write the data to Excel
library(tidyverse)
library(openxlsx)
df %>% filter(Sex != "F")
df %>% filter(!Sex == "F")
df %>% filter(!Sex %in% "F")
write.xlsx(df,"df_test.xlsx")

# Read the data back
df1 = read.xlsx("df_test.xlsx")
df1
str(df1)
library(tidyverse)
df1 %>% filter(Sex != "F")
df1 %>% filter(!Sex == "F")
df1 %>% filter(!Sex %in% "F")

Compare the two data frames: in the one built in R the value prints as NA, while in the one read from Excel it prints as <NA>.

Use drop_na to check whether these are real missing values.

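It turns out that when I built the vector in R, I wrote "NA" (a character string) instead of NA (a real missing value), so filter with != could still return those rows: to R they were ordinary text. After the round trip through Excel they came back as real missing values, apparently because read.xlsx converts the text "NA" to NA on reading (its na.strings argument defaults to "NA").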
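The difference between the two spellings is easy to see on a small vector. This is only a base R sketch; the names sex_chr and sex_na are made up for illustration:

```r
sex_chr <- c("F", "NA", "M")   # "NA" typed with quotes: ordinary text
sex_na  <- c("F", NA, "M")     # NA without quotes: a real missing value

is.na(sex_chr)   # FALSE FALSE FALSE
is.na(sex_na)    # FALSE  TRUE FALSE

# Comparing text with "F" always gives TRUE or FALSE
cmp_chr <- sex_chr != "F"      # FALSE  TRUE  TRUE

# Comparing a missing value with "F" gives NA, not TRUE
cmp_na <- sex_na != "F"        # FALSE    NA  TRUE
```

A quick is.na() on the column is therefore a cheap way to tell which kind of NA a data frame actually holds.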

6. Rebuild the R data frame

set.seed(123)
df2 = data.frame(ID = 1:10, Sex = c("F","F","F","F",NA,"F","F",NA,"M","M"), y1 = c(rnorm(9),NA))
df2


Check it with drop_na; this time the rows with a missing Sex are dropped:

> df2 %>% drop_na(Sex)
  ID Sex         y1
1  1   F -0.4456620
2  2   F  1.2240818
3  3   F  0.3598138
4  4   F  0.4007715
5  6   F -0.5558411
6  7   F  1.7869131
7  9   M -1.9666172
8 10   M         NA

Now filter with the three methods: the first two drop the NA rows, which is not what I want.

> df2 %>% filter(Sex != "F")
  ID Sex        y1
1  9   M -1.966617
2 10   M        NA
> df2 %>% filter(!Sex == "F")
  ID Sex        y1
1  9   M -1.966617
2 10   M        NA
> df2 %>% filter(!Sex %in% "F")
  ID  Sex         y1
1  5 <NA>  0.1106827
2  8 <NA>  0.4978505
3  9    M -1.9666172
4 10    M         NA
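The reason lies in how R evaluates each condition when a value is missing: filter keeps a row only when the condition is TRUE, and a comparison with NA is NA rather than TRUE. A minimal base R sketch, with a made-up vector sex and which() standing in for the "keep only TRUE" rule:

```r
sex <- c("F", NA, "M")

eq      <- sex == "F"      # TRUE    NA FALSE
neg_eq  <- !eq             # FALSE   NA  TRUE   (negating NA is still NA)
is_in   <- sex %in% "F"    # TRUE FALSE FALSE   (%in% never returns NA)
neg_in  <- !is_in          # FALSE TRUE  TRUE

# which() keeps only the positions that are exactly TRUE,
# the same rule filter() applies to rows
kept_eq <- sex[which(neg_eq)]   # "M"
kept_in <- sex[which(neg_in)]   # NA "M"
```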

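Conclusion: filter keeps only the rows where the condition is TRUE, so with Sex != "F" or !Sex == "F" the rows where Sex is NA are silently dropped. If you want to keep those rows, !Sex %in% "F" is the reliable choice!!!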
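The same intent can also be spelled out explicitly with is.na(). The sketch below assumes dplyr is installed and uses a small made-up data frame, df_demo, with the same Sex column as df2:

```r
library(dplyr)

# Made-up demo data: Sex contains real missing values (NA, not "NA")
df_demo <- data.frame(
  ID  = 1:10,
  Sex = c("F", "F", "F", "F", NA, "F", "F", NA, "M", "M")
)

# Drops the NA rows: NA != "F" evaluates to NA, and filter() keeps only TRUE
res_neq <- df_demo %>% filter(Sex != "F")

# Keeps the NA rows: %in% returns FALSE for NA, so its negation is TRUE
res_in <- df_demo %>% filter(!Sex %in% "F")

# Keeps the NA rows by asking for them explicitly
res_explicit <- df_demo %>% filter(is.na(Sex) | Sex != "F")

res_neq$ID        # 9 10
res_in$ID         # 5 8 9 10
res_explicit$ID   # 5 8 9 10
```

The explicit form is longer, but it makes it obvious to a later reader that the missing rows are kept on purpose.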


Origin blog.csdn.net/yijiaobani/article/details/130934349