Common R Packages for Web Scraping and How to Use Them

1. xml2

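xml2 parses XML and lets you work with XML files through a simple, consistent interface.
It is built on top of the libxml2 C library. The xml2 package is a binding to libxml2, which makes it easy to work with HTML and XML from R. The API is somewhat inspired by jQuery.
Usage: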

library("xml2")
x <- read_xml("<foo> <bar> text <baz/> </bar> </foo>")
x
xml_name(x)
xml_children(x)
xml_text(x)
xml_find_all(x, ".//baz")
h <- read_html("<html><p>Hi <b>!")
h
xml_name(h)
xml_text(h)
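
The calls above can be combined to pull attributes and child values out of a document. The following is a minimal sketch, assuming a small made-up catalog document (the element and attribute names are hypothetical, not from any real feed):

```r
library(xml2)

# Hypothetical catalog document, invented for this illustration
doc <- read_xml('<catalog>
  <book id="b1"><title>R Basics</title><price>10.5</price></book>
  <book id="b2"><title>Web Scraping</title><price>20</price></book>
</catalog>')

books  <- xml_find_all(doc, ".//book")                  # node set of all <book> elements
ids    <- xml_attr(books, "id")                         # value of the id attribute per book
titles <- xml_text(xml_find_first(books, "./title"))    # first <title> child of each book
prices <- xml_double(xml_find_first(books, "./price"))  # <price> text parsed as numbers

# Collect the extracted fields into one data frame
result <- data.frame(id = ids, title = titles, price = prices)
result
```

xml_find_first() returns one node per input node, so the extracted vectors stay aligned with the books even if a book has several matching children.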

2. rvest

The rvest package makes it easy to scrape web pages. It provides wrappers around the xml2 and httr packages that make it easy to download, then manipulate, HTML and XML.
rvest helps you scrape information from web pages. It is designed to work with magrittr to make it easy to express common web scraping tasks, and is inspired by libraries like Beautiful Soup.
Usage:

library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
rating <- lego_movie %>% 
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating
#> [1] 7.8

cast <- lego_movie %>%
  html_nodes("#titleCast .primary_photo img") %>%
  html_attr("alt")
cast
#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"    
#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson" 
#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"    
#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

poster <- lego_movie %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster
#> [1] "https://m.media-amazon.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_UX182_CR0,0,182,268_AL_.jpg"
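
Because a live page can change its layout at any time, it may help to try the same selector pattern on a fixed HTML string first. This is a minimal offline sketch; the markup, the #cast id, and the .photo class are hypothetical stand-ins, not IMDb's real structure:

```r
library(rvest)

# Hypothetical page fragment standing in for a downloaded page
page <- read_html('<html><body>
  <div id="cast">
    <img class="photo" alt="Actor A" src="a.jpg">
    <img class="photo" alt="Actor B" src="b.jpg">
  </div>
  <p class="rating"><strong><span>7.8</span></strong></p>
</body></html>')

# CSS selector -> node text -> number
rating <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

# CSS selector -> attribute value of every matched node
cast <- page %>%
  html_nodes("#cast .photo") %>%
  html_attr("alt")

rating
cast
```

Once the selectors behave as expected on the fixed string, the same pipeline can be pointed at the result of read_html() on a URL.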

3. dplyr

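dplyr is a grammar of data manipulation. It provides a consistent set of verbs that help you solve the most common data manipulation challenges:
mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
These all combine with group_by(), which lets you perform any operation by group.
You can learn more in vignette("dplyr"). Besides these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").
dplyr is designed to abstract over how the data is stored. This means that, in addition to local data frames, you can work with remote database tables using exactly the same R code. Install the dbplyr package, then read vignette("databases", package = "dbplyr"). If you are new to dplyr, the best place to start is the data transformation chapter of R for Data Science.
Usage: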

library(dplyr)

starwars %>% 
  filter(species == "Droid")
#> # A tibble: 5 x 13
#>   name  height  mass hair_color skin_color  eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> 
#> 1 C-3PO    167   75. <NA>       gold        yellow          112. <NA>  
#> 2 R2-D2     96   32. <NA>       white, blue red              33. <NA>  
#> 3 R5-D4     97   32. <NA>       white, red  red              NA  <NA>  
#> 4 IG-88    200  140. none       metal       red              15. none  
#> 5 BB8       NA   NA  none       none        black            NA  none  
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

starwars %>% 
  select(name, ends_with("color"))
#> # A tibble: 87 x 4
#>   name           hair_color skin_color  eye_color
#>   <chr>          <chr>      <chr>       <chr>    
#> 1 Luke Skywalker blond      fair        blue     
#> 2 C-3PO          <NA>       gold        yellow   
#> 3 R2-D2          <NA>       white, blue red      
#> 4 Darth Vader    none       white       yellow   
#> 5 Leia Organa    brown      light       brown    
#> # ... with 82 more rows

starwars %>% 
  mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)
#> # A tibble: 87 x 4
#>   name           height  mass   bmi
#>   <chr>           <int> <dbl> <dbl>
#> 1 Luke Skywalker    172   77.  26.0
#> 2 C-3PO             167   75.  26.9
#> 3 R2-D2              96   32.  34.7
#> 4 Darth Vader       202  136.  33.3
#> 5 Leia Organa       150   49.  21.8
#> # ... with 82 more rows

starwars %>% 
  arrange(desc(mass))
#> # A tibble: 87 x 13
#>   name    height  mass hair_color skin_color  eye_color  birth_year gender
#>   <chr>    <int> <dbl> <chr>      <chr>       <chr>           <dbl> <chr> 
#> 1 Jabba …    175 1358. <NA>       green-tan,… orange          600.  herma…
#> 2 Grievo…    216  159. none       brown, whi… green, ye…       NA   male  
#> 3 IG-88      200  140. none       metal       red              15.0 none  
#> 4 Darth …    202  136. none       white       yellow           41.9 male  
#> 5 Tarfful    234  136. brown      brown       blue             NA   male  
#> # ... with 82 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1)
#> # A tibble: 9 x 3
#>   species      n  mass
#>   <chr>    <int> <dbl>
#> 1 Droid        5  69.8
#> 2 Gungan       3  74.0
#> 3 Human       35  82.8
#> 4 Kaminoan     2  88.0
#> 5 Mirialan     2  53.1
#> # ... with 4 more rows
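
The two-table verbs mentioned earlier are not covered by the examples above. The following is a minimal sketch of left_join() and semi_join(), assuming two small made-up tables (characters and species_info are hypothetical, not part of the starwars data):

```r
library(dplyr)

# Hypothetical tables, invented for this illustration
characters <- tibble(
  name    = c("C-3PO", "R2-D2", "Luke Skywalker"),
  species = c("Droid", "Droid", "Human")
)
species_info <- tibble(
  species        = c("Droid", "Human"),
  classification = c("artificial", "mammal")
)

# left_join(): keep every row of characters, add matching columns from species_info
joined <- characters %>%
  left_join(species_info, by = "species")

# semi_join(): keep only the characters whose species appears in the filtered table
artificial  <- filter(species_info, classification == "artificial")
only_droids <- characters %>%
  semi_join(artificial, by = "species")

joined
only_droids
```

left_join() is a mutating join (it adds columns), while semi_join() is a filtering join (it only drops rows).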

4. stringr

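The stringr package is designed as a consistent, simple, and easy-to-use set of string tools. All of its functions and arguments are defined consistently. For example, NA values and zero-length vectors are handled the same way everywhere.
String handling is not R's main strength, but it is indispensable: data cleaning, visualization, and similar tasks all depend on it. The basic string functions in base R have grown inconsistent over time, with irregular names and nonstandard argument definitions, so they are hard to pick up at a glance. String processing is convenient in most other languages, and R has lagged behind in this respect. stringr addresses this by offering a friendly interface that makes string operations simple.
Usage: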

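String joining and manipulation functions

str_c:	joins strings together.
str_join:	joins strings; an older alias of str_c that is deprecated in later stringr releases, so prefer str_c.
str_trim:	removes whitespace, including tabs (\t), from the start and end of a string.
str_pad:	pads a string to a given width.
str_dup:	repeats a string.
str_wrap:	wraps a string into formatted lines of a given width.
str_sub:	extracts a substring.
str_sub<-:	replaces a substring by assignment, using the same indexing as str_sub.
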
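As a quick illustration of the functions just listed, here is a minimal sketch; the input strings are arbitrary examples chosen for this note:

```r
library(stringr)

joined  <- str_c("a", "b", "c", sep = "-")    # "a-b-c"
trimmed <- str_trim("  hello\t")              # "hello"
padded  <- str_pad("7", width = 3, pad = "0") # "007" (pads on the left by default)
dupped  <- str_dup("ab", 3)                   # "ababab"
part    <- str_sub("scraping", 1, 5)          # "scrap"

# Assignment form: overwrite the first character in place
x <- "scraping"
str_sub(x, 1, 1) <- "S"
x                                             # "Scraping"
```
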
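String counting and ordering functions

str_count:	counts the number of matches in a string.
str_length:	returns the length of a string.
str_sort:	sorts strings by value.
str_order:	returns the indices that would sort the strings, using the same rules as str_sort.
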
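A minimal sketch of these four functions on an arbitrary example vector (the fruit names are just sample data):

```r
library(stringr)

fruit <- c("banana", "apple", "cherry")

lens   <- str_length(fruit)      # 6 5 6
counts <- str_count(fruit, "a")  # 3 1 0: number of "a" matches per element
sorted <- str_sort(fruit)        # "apple" "banana" "cherry"
ord    <- str_order(fruit)       # 2 1 3: indices that would sort the vector

# str_order() and str_sort() agree: indexing by the order gives the sorted vector
fruit[ord]
```
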
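String matching functions

str_split:	splits a string into pieces, returning a list.
str_split_fixed:	splits a string into a fixed number of pieces, returning a character matrix.
str_subset:	returns the strings that contain a match.
word:	extracts words from text.
str_detect:	tests whether each string contains a match.
str_match:	extracts the capture groups of the first match.
str_match_all:	extracts the capture groups of every match.
str_replace:	replaces the first match in each string.
str_replace_all:	replaces every match in each string.
str_replace_na:	turns NA into the string "NA".
str_locate:	finds the position of the first match.
str_locate_all:	finds the positions of every match.
str_extract:	extracts the first matching text.
str_extract_all:	extracts every matching text.
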
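The difference between the plain and the _all variants is easiest to see side by side. This is a minimal sketch; the sample strings and the phone-number pattern are invented for illustration:

```r
library(stringr)

x <- c("tel: 555-1234", "no number", "fax: 555-9876 or 555-0000")
pattern <- "(\\d{3})-(\\d{4})"   # two capture groups: prefix and line number

detected <- str_detect(x, pattern)       # TRUE FALSE TRUE
matching <- str_subset(x, pattern)       # only the elements that contain a match
first    <- str_extract(x, pattern)      # "555-1234" NA "555-9876"
every    <- str_extract_all(x, pattern)  # list; third element holds two matches
groups   <- str_match(x, pattern)        # matrix: full match plus one column per group
where    <- str_locate(x[1], pattern)    # start = 6, end = 13

masked_first <- str_replace(x[1], "\\d", "#")      # "tel: #55-1234"
masked_all   <- str_replace_all(x[1], "\\d", "#")  # "tel: ###-####"

pieces <- str_split("a,b,c", ",")[[1]]         # "a" "b" "c"
fixed2 <- str_split_fixed("a,b,c", ",", 2)     # 1 x 2 matrix: "a" and "b,c"
no_na  <- str_replace_na(c("a", NA))           # "a" "NA"
second <- word("web scraping with R", 2)       # "scraping"
```
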
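String transformation functions

str_conv:	converts the character encoding of a string.
str_to_upper:	converts a string to upper case.
str_to_lower:	converts a string to lower case, following the same rules as str_to_upper.
str_to_title:	converts a string to title case (first letter of each word capitalized), following the same rules as str_to_upper.
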
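A minimal sketch of the transformation functions. The str_conv() part follows the pattern used in the stringr documentation and assumes the input byte is encoded as ISO-8859-2:

```r
library(stringr)

upper <- str_to_upper("web scraping")  # "WEB SCRAPING"
lower <- str_to_lower("Web Scraping")  # "web scraping"
title <- str_to_title("web scraping")  # "Web Scraping"

# A single byte (177) that means "a with ogonek" in ISO-8859-2;
# str_conv() re-encodes it as UTF-8 so it displays correctly
raw_text  <- rawToChar(as.raw(177))
converted <- str_conv(raw_text, "ISO-8859-2")
converted
```
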
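Pattern-control functions. These only build the pattern argument of other functions and cannot be used on their own.

boundary:	matches boundaries between characters, words, lines, or sentences.
coll:	compares strings using standard collation rules.
fixed:	matches the pattern as a literal string, so regular expression metacharacters and escapes are not interpreted.
regex:	defines a regular expression (the default interpretation of a pattern).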
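
These helpers are passed as the pattern argument of the matching functions. A minimal sketch, with arbitrary sample strings, of how the same pattern text behaves under each one:

```r
library(stringr)

x <- c("a.b", "axb")

# regex(): "." is a metacharacter that matches any character
as_regex <- str_detect(x, regex("a.b"))   # TRUE TRUE

# fixed(): "." is a literal dot
as_fixed <- str_detect(x, fixed("a.b"))   # TRUE FALSE

# regex() options, here case-insensitive matching
no_case <- str_detect("ABC", regex("abc", ignore_case = TRUE))  # TRUE

# coll(): locale-aware comparison, here ignoring case
collated <- str_detect("I", coll("i", ignore_case = TRUE))      # TRUE

# boundary(): count or split on word boundaries instead of a text pattern
n_words <- str_count("Web scraping with R.", boundary("word"))  # 4
words   <- str_split("Web scraping with R.", boundary("word"))[[1]]
```

fixed() is usually the fastest choice when the pattern is plain text, while coll() is slower but respects locale rules.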

For detailed stringr usage, see: http://blog.fens.me/r-stringr/