Python Web Scraping ---- Beautiful Soup (Part 2)

Traversing the Document Tree

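To work with a parsed Beautiful Soup object, you first need to know how to traverse the document tree. Traversal operations fall into the following four parts: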

I. Child Nodes

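A Tag may contain multiple strings or other Tags; these are all children of that Tag. Beautiful Soup provides many attributes for navigating and iterating over children. The simplest approach is to name the Tag you want: to get the <head> tag, just write soup.head. First, parse the HTML document: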

  from bs4 import BeautifulSoup

  # The HTML document
  html_doc = """
  <html><head><title>The Dormouse's story</title></head>
  <body>
  <p class="title"><b>The Dormouse's story</b></p>
  <b><!--Hey, buddy. Want to buy a used parser?--></b>
  <p class="story">Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.</p>
  <p class="story">...</p>
  """
  # Parse the HTML document into a BS object (requires the lxml package)
  soup = BeautifulSoup(html_doc, 'lxml')
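
Accessing a tag by name, as described above, can be sketched on a smaller document. The `doc` string below is a made-up stand-in (not the article's html_doc), and the built-in "html.parser" is used only so the snippet runs without lxml; both are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical mini document, used only for this illustration.
doc = (
    "<html><head><title>Demo</title></head>"
    "<body><p><b>one</b></p><p><b>two</b></p></body></html>"
)
soup = BeautifulSoup(doc, "html.parser")

head_tag = soup.head      # the <head> tag
title_tag = soup.title    # the first <title> tag found in the document
first_bold = soup.body.b  # attribute access can be chained; returns the first <b> under <body>

print(head_tag.name)      # head
print(title_tag.string)   # Demo
print(first_bold.string)  # one
```

Note that access by name only returns the first matching tag, so the second <b> is not reachable this way.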

Below are some ways to traverse the child nodes of a BS object (the text after '#' is the output):

  # 1 .contents
  # A Tag's .contents attribute returns its children as a list:
  head_tag = soup.head
  head_tag.contents  # [<title>The Dormouse's story</title>]
  # The BeautifulSoup object itself has children too; here the <html> tag
  # is a child of the BeautifulSoup object:
  len(soup.contents)  # 1
  soup.contents[0].name  # 'html'
  
  # 2 .children
  # A Tag's .children generator lets you loop over its direct children:
  title_tag = soup.title
  for child in title_tag.children:
      print(child)  # The Dormouse's story
      
  # 3 .descendants
  # .contents and .children only return a Tag's direct children. The .descendants
  # attribute iterates recursively over all of a Tag's descendants:
  for child in head_tag.descendants:
      print(child)
      # <title>The Dormouse's story</title>
      # The Dormouse's story
      
  # 4 .string
  # If a Tag has exactly one child, its .string attribute gives the same result as
  # that child's .string. If a Tag contains more than one child, it is ambiguous
  # which child .string should refer to, so .string is None:
  print(soup.html.string)  # None
  
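  # 5 .strings and .stripped_strings
  # If a tag contains more than one string, use .strings to loop over them:
  for string in soup.strings:
      print(repr(string))
  # The output may contain a lot of extra spaces and blank lines;
  # .stripped_strings removes the surplus whitespace:
  for string in soup.stripped_strings:
      print(repr(string))
  # Strings consisting entirely of whitespace are skipped, and whitespace at the
  # beginning and end of each string is removed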
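
To make the differences between these attributes concrete, here is a minimal sketch on a small fragment. The `fragment` string is invented for illustration, and "html.parser" is assumed instead of lxml so that no extra wrapper tags are added around the fragment.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a <div> with two <p> children.
fragment = "<div><p>Hello <b>world</b></p><p> again </p></div>"
soup = BeautifulSoup(fragment, "html.parser")
div = soup.div

# .children yields only the direct children (the two <p> tags).
direct = [child.name for child in div.children]
print(direct)                  # ['p', 'p']

# .descendants also walks into the children: tags and strings at every depth.
all_nodes = list(div.descendants)
print(len(all_nodes))          # 6

# .string is None when a tag has more than one child ...
print(div.string)              # None
# ... and is the contained string when there is exactly one.
print(div.b.string)            # world

# .strings keeps whitespace as-is; .stripped_strings trims it.
print(list(div.strings))           # ['Hello ', 'world', ' again ']
print(list(div.stripped_strings))  # ['Hello', 'world', 'again']
```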

II. Parent Nodes

Every Tag or string has a parent: the Tag that contains it.

  # 1 .parent
  # Use the .parent attribute to get an element's parent. The <head> tag is the
  # parent of the <title> tag:
  title_tag = soup.title
  title_tag  # <title>The Dormouse's story</title>
  title_tag.parent  # <head><title>The Dormouse's story</title></head>
  # The title string also has a parent: the <title> tag
  title_tag.string.parent  # <title>The Dormouse's story</title>
  # The parent of a top-level tag such as <html> is the BeautifulSoup object:
  html_tag = soup.html
  type(html_tag.parent)  # <class 'bs4.BeautifulSoup'>
  # The .parent of the BeautifulSoup object is None:
  print(soup.parent)  # None
  
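  # 2 .parents
  # An element's .parents attribute iterates over all of its ancestors,
  # from the nearest parent up to the BeautifulSoup object
  link = soup.a  # the first <a> tag in the document
  for parent in link.parents:
      if parent is None:
          print(parent)
      else:
          print(parent.name)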
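
The upward walk can be sketched in a self-contained way as follows. The one-line document is made up for illustration, and "html.parser" is an assumption chosen so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a single link nested three levels deep.
doc = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(doc, "html.parser")
link = soup.a

# .parent is the immediate container of the element.
print(link.parent.name)   # p

# .parents walks from the nearest ancestor up to the BeautifulSoup object,
# whose name is the special value '[document]'.
ancestor_names = [parent.name for parent in link.parents]
print(ancestor_names)     # ['p', 'body', 'html', '[document]']

# The BeautifulSoup object itself has no parent.
print(soup.parent)        # None
```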

III. Sibling Nodes

  sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')

The <b> and <c> tags are at the same level, since both are children of the same element, so <b> and <c> are called siblings. The available operations are:

  # 1 .next_sibling and .previous_sibling
  # Within the document tree, use .next_sibling and .previous_sibling to get the
  # next or previous sibling:
  sibling_soup.b.next_sibling  # <c>text2</c>
  sibling_soup.c.previous_sibling  # <b>text1</b>
  # 2 .next_siblings and .previous_siblings
  # The .next_siblings and .previous_siblings attributes iterate over the siblings
  # that follow or precede the current node
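
Since no loop is shown for the plural forms, here is a minimal sketch of iterating over siblings. The list markup is invented for illustration and "html.parser" is assumed so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical list with three items and no whitespace between them.
soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")
first = soup.li                # first <li>
last = soup.ul.contents[-1]    # last <li>

# .next_siblings moves forward from the current node.
following = [li.string for li in first.next_siblings]
print(following)   # ['b', 'c']

# .previous_siblings moves backward, nearest sibling first.
preceding = [li.string for li in last.previous_siblings]
print(preceding)   # ['b', 'a']

# In real documents, whitespace between tags is a sibling too.
spaced = BeautifulSoup("<p><a>x</a>\n<a>y</a></p>", "html.parser")
print(repr(spaced.a.next_sibling))   # '\n'
```

The last case explains why the next sibling of a tag in a formatted page is often a string of whitespace or punctuation rather than the next tag.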

IV. Going Back and Forth

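  # The HTML parser turns the document string into a series of events: "open an <html>
  # tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title>
  # tag", "open a <p> tag", and so on. Beautiful Soup offers attributes for
  # reconstructing this initial parse of the document.
  # 1 .next_element and .previous_element
  # The .next_element attribute points to whatever was parsed immediately afterwards
  # (a string or a tag). It may be the same as .next_sibling, but usually it is not.
  # 2 .next_elements and .previous_elements
  # The .next_elements and .previous_elements iterators move forward or backward
  # through the document in the order it was parsed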
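
The difference between parse order and sibling order can be sketched on a two-tag fragment. The markup is made up for illustration and "html.parser" is assumed so that no wrapper tags are added.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <a> and <b> are siblings inside <p>.
soup = BeautifulSoup("<p><a>one</a><b>two</b></p>", "html.parser")
a_tag = soup.a

# The next sibling is the tag at the same level ...
print(a_tag.next_sibling)         # <b>two</b>
# ... but the next parsed element is the string inside <a>,
# because it was parsed right after <a> was opened.
print(repr(a_tag.next_element))   # 'one'

# .next_elements continues in parse order: string, then <b>, then its string.
elements = list(a_tag.next_elements)
for element in elements[:3]:
    print(repr(element))

# .previous_element steps backward in parse order: before <a> came <p>.
print(a_tag.previous_element.name)   # p
```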