Python Web Scraping: BeautifulSoup Guide

One, BeautifulSoup Introduction and Installation

1. Introduction

In simple terms, BeautifulSoup is a Python parsing library whose main job is to parse web pages and extract HTML data. The official description is as follows:

Beautiful Soup provides some simple, Python-style functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands the user the data they need; because it is simple, a complete application does not take much code to write.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not specify one, in which case Beautiful Soup cannot detect the encoding automatically and you only need to state the original encoding.
Beautiful Soup has become as excellent a Python parser as lxml and html5lib, giving users the flexibility to choose different parsing strategies or to trade them for speed.

2. Installation

It can be installed directly with pip:

pip install beautifulsoup4

Two, How to Use BeautifulSoup

1. Notes

When using BeautifulSoup, you need to specify a parser:

  • html.parser - ships with Python, but its fault tolerance is not great: parts of a page written with non-standard markup may be lost

  • lxml - fast parsing; requires additional installation

  • xml - part of the lxml library; supports XML documents

  • html5lib - the best fault tolerance, but slightly slower

Here, lxml and html5lib require additional installation, which can also be done with pip (lxml is recommended).
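
For example, both optional parsers can be installed in one command:

pip install lxml html5lib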

2. Usage

Take the following HTML document fragment as an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="beautiful title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
"""

Initialize a BeautifulSoup object, specifying lxml as the parser:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
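
If lxml is not installed, the parser that ships with Python can be used as a fallback (with the fault-tolerance caveat noted above). A minimal alternative initialization, not required for the examples below:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')  # built-in parser, no extra installation needed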

2.1 Getting tag information

Pretty-print (complete) the HTML code and get tag names, attributes, and text:

print(soup.prettify())  # Pretty-print the completed HTML code
print(soup.p)  # Get the entire first p tag
print(soup.p.name)  # Get the p tag's name
# Get tag attributes
print(soup.p.attrs)  # Get all attributes of the p tag, returned as a dict
print(soup.p['class'])  # Get the p tag's class attribute, returned as a list
# Get tag text
print(soup.p.string)  # Get the p tag's text; returns None if the tag contains multiple children with more than one text node
print(soup.p.strings)  # Get all text inside the p tag, returned as a generator
print(soup.p.text)  # Get all text inside the p tag, returned as a single string
print(soup.stripped_strings)  # Get all text in the document with whitespace stripped, returned as a generator
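
For reference, a few of the calls above evaluate to values along these lines on the html_doc in this example (a sketch of expected results, not captured output):

soup.p.name       # 'p'
soup.p.attrs      # {'class': ['beautiful', 'title']}
soup.p['class']   # ['beautiful', 'title']  (class is a multi-valued attribute, hence the list)
soup.p.string     # "The Dormouse's story"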

2.2 Getting element nodes

Get a given element's parent/ancestor nodes, child/descendant nodes, and sibling nodes:

# Get parent/ancestor nodes
print(soup.p.parent)  # Get the direct parent of the p tag
print(soup.p.parents)  # Get the ancestors of the p tag, returned as a generator
# Get child/descendant nodes
print(soup.p.contents)  # Get the direct children of the p tag, returned as a list
print(soup.p.children)  # Get the direct children of the p tag, returned as a generator
print(soup.p.descendants)  # Get the descendants of the p tag, returned as a generator
# Get sibling nodes
print(soup.a.previous_sibling)  # Get the previous sibling of the a tag
print(soup.a.previous_siblings)  # Get all siblings before the a tag, returned as a generator
print(soup.a.next_sibling)  # Get the next sibling of the a tag
print(soup.a.next_siblings)  # Get all siblings after the a tag, returned as a generator
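
One point worth noting (a small sketch, not part of the original listing): sibling navigation also yields the text nodes between tags, so the next sibling of the first a tag is a string rather than a tag; find_next_sibling() jumps straight to the next tag:

print(repr(soup.a.next_sibling))      # ',\n' -- the text between the first and second a tags
print(soup.a.find_next_sibling('a'))  # the a tag with id 'link2' (Lacie)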

2.3 Using selectors

Not all information can be obtained conveniently just by walking the node structure; usually the find() and find_all() methods are used to search:

  • find() - returns the first matching result
  • find_all() - returns a list of all matching results

Since find() and find_all() are used in almost the same way, only find_all() usage is listed here:

import re

print(soup.find_all(text=re.compile('Lacie'), limit=2))  # Use a regex to get all text nodes containing 'Lacie' (limit: cap on the number of matches)
print(soup.find_all('a', text='Lacie'))  # Get all a tags whose text is exactly 'Lacie' (full text match)
print(soup.find_all('a', id='link2'))  # Get all a tags whose id is 'link2'
print(soup.find_all('a', class_='sister'))  # Get all a tags whose class is 'sister'
print(soup.find_all('a', class_='sister', id='link2'))  # Combine multiple search conditions
print(soup.find_all(name='a'))  # Get all a tags
print(soup.find_all(attrs={'class': 'sister'}))  # Get all tags whose class attribute is 'sister'
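
Since find() works the same way but returns only the first match (or None when nothing matches), a quick sketch for comparison (note that newer bs4 releases also accept the text= argument above under the name string=):

print(soup.find('a'))                   # the first a tag (Elsie)
print(soup.find('a', id='link3'))       # the a tag with id 'link3' (Tillie)
print(soup.find('a', id='no-such-id'))  # None -- find() returns None when nothing matches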

2.4 Using CSS selectors

If you're familiar with CSS selectors, BeautifulSoup also provides a corresponding method:

  • . - stands for class
  • # - stands for id

print(soup.select('p'))  # Get all p tags, returned as a list
print(soup.select('p a'))  # Get all a tags inside p tags, returned as a list
print(soup.select('p.story'))  # Get all p tags whose class is 'story', returned as a list
print(soup.select('.story'))  # Get all elements whose class is 'story', returned as a list
print(soup.select('.beautiful.title'))  # Get all elements with both classes 'beautiful' and 'title', returned as a list
print(soup.select('#link1'))  # Get all elements whose id is 'link1', returned as a list
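
To pull data out of the selected elements, for example the href attribute and the link text, something along these lines works (select_one() returns a single element instead of a list):

for a in soup.select('a.sister'):
    print(a['href'], a.get_text())  # e.g. http://example.com/elsie Elsie

print(soup.select_one('#link1').get_text())  # Elsie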
Origin: blog.csdn.net/weixin_43750377/article/details/103210617