Joining Dataframes

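  • A variable that appears in two tables and uniquely identifies each observation in one of them is a key. For the table whose observations it identifies, it is the primary key; for the other table, it is a foreign key.
  • A key that is added artificially is called a surrogate key.
    For example:
library(dplyr)
library(nycflights13)
flights %>% mutate(iden = row_number())
# iden is a surrogate key: it numbers the observations 1, 2, 3, ...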
# Check whether a set of variables forms a primary key
Lahman::Batting %>%
  count(playerID, yearID, stint) %>%
  filter(n > 1) %>%
  nrow()
#> [1] 0
# No combination occurs more than once, so (playerID, yearID, stint) together form a primary key

# A more direct check: compare the number of distinct rows with the total number of rows
ggplot2::diamonds %>%
  distinct() %>%
  nrow()
#> [1] 53794
nrow(ggplot2::diamonds)
#> [1] 53940
# If the second number is larger, the dataset contains fully duplicated rows, so no combination of its variables can be a primary key
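
The two checks above answer slightly different questions, which a small example may make clearer. This is only a sketch on a hypothetical table `people` (not part of the original notes): the `count()` check tests one candidate key at a time, while the `distinct()` check only detects rows that are duplicated in full.

```r
library(dplyr)

# Hypothetical toy table: `id` is unique, `name` is not.
people <- tibble(id = c(1, 2, 3), name = c("Ann", "Bob", "Ann"))

# count() check: number of key values that occur more than once.
dup_id <- people %>% count(id) %>% filter(n > 1) %>% nrow()
dup_name <- people %>% count(name) %>% filter(n > 1) %>% nrow()

# distinct() check: TRUE when no row is duplicated in full.
no_duplicate_rows <- nrow(distinct(people)) == nrow(people)
```

Here `dup_id` is 0, so `id` is a primary key, whereas `dup_name` is 1, so `name` is not. Passing the `distinct()` check does not say which variables form the key; it only shows that a key can exist.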

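Mutating joins

1. Join types
Inner join (keeps only observations whose keys match in both tables), left join (keeps every observation in the left table), right join (keeps every observation in the right table), full join (keeps every observation in both tables)
# Where an observation has no match, the missing values are filled with NA
# Where a key value is duplicated, each copy is joined to every matching row in the other table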
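
The four join types can be compared side by side on two small tables. The tables `x`, `y` and `x_dup` below are hypothetical illustrations rather than data from the original notes.

```r
library(dplyr)

# Hypothetical toy tables; `key` is the join variable.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

inner <- inner_join(x, y, by = "key")  # keys 1 and 2 only
left  <- left_join(x, y, by = "key")   # keys 1, 2, 3; val_y is NA for key 3
right <- right_join(x, y, by = "key")  # keys 1, 2, 4; val_x is NA for key 4
full  <- full_join(x, y, by = "key")   # keys 1, 2, 3, 4

# Duplicate keys in the left table: each copy is matched to the
# same row of the right table, so "y1" appears twice in the result.
x_dup <- tibble(key = c(1, 1, 2), val_x = c("a", "b", "c"))
dup <- left_join(x_dup, y, by = "key")
```

The inner join is the only one that can return fewer rows than both inputs; the other three differ in which side's unmatched observations they keep.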

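All of them share the same form:

inner_join(x, y, by = "key")
left_join(x, y, by = "key")
# right_join() and full_join() work the same way. by = may be omitted, in which case all variables common to both tables are used as keys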

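2. Key columns

  • by = NULL (the default when by is omitted): all variables common to both tables are used as keys
  • by = c("a" = "b"): variable a in the left table is matched against variable b in the right table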
# Add the origin and destination coordinates to each flight, e.g. to draw its route
flights_latlon <- flights %>%
  inner_join(select(airports, origin = faa, origin_lat = lat, origin_lon = lon),
    by = "origin"
  ) %>%
  inner_join(select(airports, dest = faa, dest_lat = lat, dest_lon = lon),
    by = "dest"
  )
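
The example above renames `faa` before joining so that a plain `by = "origin"` works. The named-vector form of `by` achieves the same match without renaming. A minimal sketch, using hypothetical miniature tables `flights_small` and `airports_small` in place of the real data:

```r
library(dplyr)

# Hypothetical miniature versions of the flights and airports tables.
flights_small <- tibble(flight = c(1, 2, 3), dest = c("IAH", "MIA", "XXX"))
airports_small <- tibble(faa = c("IAH", "MIA"), name = c("Houston", "Miami"))

# Match `dest` in the left table against `faa` in the right table.
joined <- left_join(flights_small, airports_small, by = c("dest" = "faa"))

# The key column keeps the left table's name.
names(joined)
```

The destination "XXX" has no entry in `airports_small`, so its `name` is NA in the result.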

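Filtering joins

  • semi_join(x, y): keeps the observations in the left table x that have a match in the right table y
  • anti_join(x, y): drops the observations in the left table x that have a match in the right table y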
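
A short sketch of the two filtering joins, again on hypothetical toy tables. Unlike the mutating joins, they add no columns from the right table and never duplicate rows of the left table, even when the right table repeats a key.

```r
library(dplyr)

# Hypothetical toy tables; key 2 appears twice in y.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

kept <- semi_join(x, y, by = "key")     # rows of x with keys 1 and 2
dropped <- anti_join(x, y, by = "key")  # the row of x with key 3
```

`anti_join()` is a convenient way to look for mismatches, for example foreign-key values that have no counterpart in the other table.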

Set operations

intersect(a, b) returns the observations that appear in both a and b
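
As an illustration, assuming two hypothetical tables `a` and `b` with the same columns, `intersect()` compares whole rows rather than individual values:

```r
library(dplyr)

# Hypothetical toy tables with identical column names.
a <- tibble(x = c(1, 2), y = c(1, 1))
b <- tibble(x = c(1, 1), y = c(1, 2))

# Only the row (x = 1, y = 1) appears in both tables.
common <- intersect(a, b)
```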

Reposted from blog.csdn.net/weixin_51674826/article/details/116044094