Python bs4 BeautifulSoup library usage record

Table of contents

Introduction

Installation

Initialization

Advantages of each parser

Python standard library

lxml HTML

lxml XML

html5lib

Formatted output

Objects

Tag

Name

Attributes

Multi-valued attributes

Text

Other methods

NavigableString

BeautifulSoup

Comment

Traversal

Child nodes

Parent nodes

Sibling nodes

Going back and forward

Searching

Filters

String

Regular expression

List

True

Method

find and find_all

Calling a tag like find_all()

Other search methods

CSS selectors


Introduction

bs4 is the import name of the Beautiful Soup library. It is one of the most commonly used libraries for writing Python crawlers, and is mainly used to parse HTML.

Installation

pip install beautifulsoup4

(The package name on PyPI is beautifulsoup4; pip install bs4 also works, since bs4 is a stub package that pulls in beautifulsoup4.)

Initialization

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html>A Html Text</html>", "html.parser")

Two parameters: the first is the HTML text to parse; the second is the parser to use. For HTML, "html.parser" selects the parser from the Python standard library.

If an HTML or XML document is not in the correct format, the results returned by different parsers may be different.
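As a quick illustration (a sketch using the stray-tag example from the official documentation), the same malformed fragment comes out differently depending on the parser:

```python
from bs4 import BeautifulSoup

# Malformed markup: a stray </p> and an unclosed <a>.
markup = "<a></p>"

# The standard-library parser drops the stray </p> and closes the <a>.
print(BeautifulSoup(markup, "html.parser"))  # <a></a>

# If lxml or html5lib were installed, "lxml" would additionally wrap the
# result in <html><body>...</body></html>, and "html5lib" would also add
# <head></head>, because it repairs documents the way a browser does.
```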

Advantages of each parser

Python standard library

BeautifulSoup(html, "html.parser")

1. Python's built-in standard library

2. Moderate speed

3. Good fault tolerance

lxml HTML

BeautifulSoup(html, "lxml")

1. Fast

2. Good fault tolerance

lxml XML

BeautifulSoup(html, ["lxml-xml"])

BeautifulSoup(html, "xml")

1. Fast

2. The only parser that supports XML

html5lib

BeautifulSoup(html, "html5lib")

1. Best fault tolerance

2. Parses documents the way a browser does

3. Generates HTML5-format documents

Formatted output

soup.prettify()  # prettify() is a method; it returns the document as a nicely indented string

Objects

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into 4 types: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to a tag in the original XML or HTML document.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

tag = soup.b

type(tag)

# <class 'bs4.element.Tag'>

Accessing a tag as an attribute (e.g. soup.b) returns None if no such tag exists; if there are multiple matches, it returns the first one.
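A minimal sketch of this behavior:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><b>first</b><b>second</b></div>", "html.parser")

print(soup.b)  # <b>first</b> -- multiple matches, the first one is returned
print(soup.i)  # None -- no <i> tag exists in the document
```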

Name

Each tag has its own name

tag.name
# 'b'
Attributes

A tag's attributes behave like a dictionary.

tag['class']
# ['boldest']  (class is a multi-valued attribute, so a list is returned)

tag.attrs
# {'class': ['boldest']}

type(tag.attrs)
# <class 'dict'>

Multi-valued attributes

The most common multi-valued attribute is class, and the multi-valued attribute returns a list.

soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

print(soup.p['class'])  # ['body', 'strikeout']

print(soup.p.attrs)     # {'class': ['body', 'strikeout']}

If an attribute looks like it has multiple values but is not defined as a multi-valued attribute in any version of the HTML standard, Beautiful Soup returns it as a plain string.

soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
print(soup.p['id'])    # 'my id'

Text

The text attribute returns all the strings inside a tag, concatenated into a single string.
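For example (a small sketch; get_text() is the method form of text and accepts an optional separator):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

# text concatenates every string inside the tag.
print(soup.p.text)           # Hello world!

# get_text() does the same and can insert a separator between the pieces.
print(soup.p.get_text("|"))  # Hello |world|!
```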

Other methods

tag.has_attr('id')  # whether the tag has an id attribute

Of course, the above can also be written as 'id' in tag.attrs; as mentioned before, a tag's attributes are a dictionary. Incidentally, has_key is a legacy dictionary API kept to support pre-2.2 code, and it has been removed in Python 3.

NavigableString

Strings are often contained in tags, and Beautiful Soup wraps them in the NavigableString class. A NavigableString cannot contain other tags.

soup = BeautifulSoup('<b>Extremely bold</b>', 'html.parser')

s = soup.b.string

print(s)        # Extremely bold

print(type(s))  # <class 'bs4.element.NavigableString'>

BeautifulSoup

A BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a Tag object, but it is not a real HTML or XML tag: it has no attributes, and its name attribute holds the special value "[document]".
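A quick check of these properties:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><b>text</b></html>", "html.parser")

print(soup.name)              # [document] -- the BeautifulSoup object's special name
print(type(soup.b).__name__)  # Tag
```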

Comment

Comment generally represents the comment part of the document.

soup = BeautifulSoup("<b><!--This is a comment--></b>", 'html.parser')

comment = soup.b.string

print(comment)          # This is a comment

print(type(comment))    # <class 'bs4.element.Comment'>

Traversal

Child nodes

The contents attribute

The contents attribute returns a list of all child nodes, including NavigableString nodes. Newline characters between tags count as NavigableString child nodes too.

Nodes of type NavigableString have no contents attribute because they have no child nodes.

soup = BeautifulSoup("""<div>
<span>test</span>
</div>
""", 'html.parser')

element = soup.div.contents

print(element)          # ['\n', <span>test</span>, '\n']

The children attribute

The children attribute is essentially the same as contents, except that it returns an iterator over the child nodes instead of a list.

The descendants attribute

The descendants attribute iterates over all descendant nodes of the tag, depth first.
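A short sketch contrasting children and descendants:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>test</b></p></div>", "html.parser")

# children yields only the direct child nodes (here, just the <p> tag).
print([child.name for child in soup.div.children])  # ['p']

# descendants walks the whole subtree: <p>, <b>, and the string 'test'.
print(list(soup.div.descendants))  # [<p><b>test</b></p>, <b>test</b>, 'test']
```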

The string attribute

If a tag has exactly one child node and that child is a NavigableString, .string returns that child.

If a tag has exactly one child node that is itself a tag, .string returns the .string of that child.

If the tag contains multiple child nodes, it is ambiguous which child .string should refer to, so .string returns None.

soup = BeautifulSoup("""<div>
    <p><span><b>test</b></span></p>
</div>
""", 'html.parser')

element = soup.p.string

print(element)          # test

print(type(element))    # <class 'bs4.element.NavigableString'>

Pay special attention to the fact that HTML is usually written with line breaks and indentation for readability, but BeautifulSoup treats that whitespace as NavigableString child nodes. In the example above, soup.div.string would return None, because the div has several children (the whitespace strings plus the p tag).

The strings and stripped_strings attributes

If the tag contains multiple strings, you can use the strings attribute to obtain them. If you want to remove blank lines from the returned results, you can use the stripped_strings attribute.

soup = BeautifulSoup("""<div>
    <p>      </p>
    <p>test 1</p>
    <p>test 2</p>
</div>
""", 'html.parser')

element = soup.div.stripped_strings

print(list(element))          # ['test 1', 'test 2']

Parent nodes

The parent attribute

The parent attribute returns the parent node of an element (tag or NavigableString). The parent of the document's top-level node is the BeautifulSoup object, and the parent of the BeautifulSoup object is None.

The parents attribute recursively yields all ancestors of the element, up to and including the BeautifulSoup object.
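For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><div><b>text</b></div></body></html>",
                     "html.parser")

print(soup.b.parent.name)  # div

# parents walks upward all the way to the BeautifulSoup object itself.
print([p.name for p in soup.b.parents])  # ['div', 'body', 'html', '[document]']
```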

Sibling nodes

next_sibling and previous_sibling

next_sibling returns the next sibling node, and previous_sibling returns the previous one. Let's look at an example directly; be careful not to be tripped up by line breaks and indentation.

soup = BeautifulSoup("""<div>
    <p>test 1</p><b>test 2</b><h>test 3</h></div>
""", 'html.parser')

print(soup.b.next_sibling)      # <h>test 3</h>

print(soup.b.previous_sibling)  # <p>test 1</p>

print(soup.h.next_sibling)      # None

next_siblings and previous_siblings

next_siblings iterates over the following sibling nodes.

previous_siblings iterates over the preceding sibling nodes.
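Reusing the markup from the example above (a sketch; note that previous_siblings iterates backwards, starting from the node):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>test 1</p><b>test 2</b><h>test 3</h></div>',
                     'html.parser')

print(list(soup.p.next_siblings))      # [<b>test 2</b>, <h>test 3</h>]
print(list(soup.h.previous_siblings))  # [<b>test 2</b>, <p>test 1</p>]
```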

Going back and forward

If you think of HTML parsing as a series of events in which tags and strings are parsed in sequence, BeautifulSoup provides attributes that replay that parsing order.

The next_element attribute points to the next parsed object (tag or NavigableString) during the parsing process.

The previous_element attribute points to the previous parsed object during the parsing process.

There are also next_elements and previous_elements attributes, which will not be described in detail.
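The difference from next_sibling shows up when a tag has children; next_element descends into the tag first:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b>two</p>", "html.parser")

b = soup.b
print(repr(b.next_sibling))  # 'two' -- the next node in the parent's child list
print(repr(b.next_element))  # 'one' -- the next node in parse order: <b>'s own string
```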

Searching

Filters

Before introducing the find_all() method, let's look at the types of filters. These filters run through the entire search API. They can be applied to tag names, node attributes, strings, or any combination of them.

The html document used in the example is as follows:

html = """
<div>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

String

Find all <b> tags:

soup.find_all('b')  # [<b>The Dormouse's story</b>]

Regular expression

Pass a compiled regular expression as the parameter, and tags whose names match it are returned. The following example (which requires import re) finds all tags whose names start with "b":

soup.find_all(re.compile("^b"))  # [<b>The Dormouse's story</b>]

List

Passing a list returns content matching any element of the list. The following finds all <a> and <b> tags:

soup.find_all(["a", "b"])

True

True matches any tag. The following finds all tags but returns no string nodes:

soup.find_all(True)

Method

If no built-in filter fits, you can define your own function that accepts a single tag argument and returns True when the tag matches. The following example finds all tags that have a class attribute but no id attribute.

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')


print(soup.find_all(has_class_but_no_id))

Return results:

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a></p>]

This result may look wrong at first glance, since the <a> tags clearly contain id attributes. In fact the returned list holds only two elements, both of them <p> tags; the <a> tags are child nodes of the second <p> and are merely printed as part of it.

find and find_all

Both methods search the descendants of the current tag and check which ones match the filter conditions.

Syntax:

  find(name=None, attrs={}, recursive=True, text=None, **kwargs)

  find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

Parameters:

  name: find all tags whose name is name; string nodes are automatically ignored. The filter examples above all use the name parameter, but filters can be used in the other parameters as well.

  attrs: find by attribute name and value. Pass a dictionary whose keys are attribute names and values are attribute values.

  recursive: whether to search all descendant nodes recursively; defaults to True.

  text: search string nodes; finds tags whose .string matches the given value, usually combined with a regular expression. In other words, although the parameter is named text, it actually matches the string attribute (newer bs4 versions also accept string as the parameter name).

  limit: the maximum number of results to return.

  kwargs: any keyword argument that is not a built-in parameter name is treated as a tag attribute to search on. Note that to search by the class attribute you must write class_, because class is a Python reserved word.
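A sketch combining these parameters on a small document (the markup below is a trimmed version of the earlier example):

```python
import re
from bs4 import BeautifulSoup

html = """
<div>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs: match by attribute name and value
print(len(soup.find_all("a", attrs={"id": "link1"})))  # 1

# limit: cap the number of results
print(len(soup.find_all("a", limit=1)))                # 1

# kwargs: class_ instead of class, since class is a reserved word
print(soup.find("a", class_="sister")["id"])           # link1

# text: match string nodes, typically with a regular expression
print(soup.find_all(text=re.compile("Lacie")))         # ['Lacie']
```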

Some tag attributes cannot be used as kwargs parameters in a search, such as the data-* attributes in HTML5.

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')

print(data_soup.find_all(data-foo="value"))

# SyntaxError: keyword can't be an expression

But it can be passed through the attrs parameter:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')

print(data_soup.find_all(attrs={"data-foo": "value"}))

# [<div data-foo="value">foo!</div>]

When searching with class_, a single CSS class name is enough to match. You can also pass several class names as one string, but that is then an exact match against the tag's class attribute: all names must be present, in the same order, with none skipped. In the following example the first three searches succeed, while the last two return nothing.

css_soup = BeautifulSoup('<p class="body bold strikeout"></p>', 'html.parser')

print(css_soup.find_all("p", class_="strikeout"))

print(css_soup.find_all("p", class_="body"))

print(css_soup.find_all("p", class_="body bold strikeout"))

# [<p class="body bold strikeout"></p>]

print(css_soup.find_all("p", class_="body strikeout"))

print(css_soup.find_all("p", class_="strikeout body"))

# []

Calling a tag like find_all()

find_all() is by far the most commonly used search method in BeautifulSoup, so a shorthand is provided: calling a BeautifulSoup or Tag object as if it were a function is equivalent to calling that object's find_all() method. The following two lines are equivalent:

soup.find_all('b')

soup('b')

Other search methods

find_parents()           returns all ancestor nodes

find_parent()            returns the direct parent node

find_next_siblings()     returns all following sibling nodes

find_next_sibling()      returns the first following sibling node

find_previous_siblings() returns all preceding sibling nodes

find_previous_sibling()  returns the first preceding sibling node

find_all_next()          returns all matching nodes after this node

find_next()              returns the first matching node after this node

find_all_previous()      returns all matching nodes before this node

find_previous()          returns the first matching node before this node
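A brief sketch of a few of these methods, reusing the sibling example from earlier:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>test 1</p><b>test 2</b><h>test 3</h></div>',
                     'html.parser')

print(soup.b.find_parent("div").name)     # div
print(soup.p.find_next_sibling())         # <b>test 2</b>
print(soup.h.find_previous_sibling("p"))  # <p>test 1</p>
print(soup.p.find_next("h"))              # <h>test 3</h>
```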

CSS selectors

BeautifulSoup supports most CSS selectors through the select() and select_one() methods, demonstrated directly in code below.

from bs4 import BeautifulSoup

 
html = """
<html>
<head><title>Title</title></head>
<body>
 <p class="title" name="dromouse"><b>Title</b></p>
 <div name="divlink">
  <p>
   <a href="http://example.com/1" class="sister" id="link1">Link 1</a>
   <a href="http://example.com/2" class="sister" id="link2">Link 2</a>
   <a href="http://example.com/3" class="sister" id="link3">Link 3</a>
  </p>
 </div>
 <p></p>
 <div name='dv2'></div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Find by tag name
print(soup.select('title'))             # [<title>Title</title>]

# Find by tag name, level by level
print(soup.select("html head title"))   # [<title>Title</title>]

# Find by class
print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/1" id="link1">Link 1</a>,
# <a class="sister" href="http://example.com/2" id="link2">Link 2</a>,
# <a class="sister" href="http://example.com/3" id="link3">Link 3</a>]


# Find by id
print(soup.select('#link1, #link2'))
# [<a class="sister" href="http://example.com/1" id="link1">Link 1</a>,
# <a class="sister" href="http://example.com/2" id="link2">Link 2</a>]


# Combined lookup: descendants of <p> with id link1
print(soup.select('p #link1'))    # [<a class="sister" href="http://example.com/1" id="link1">Link 1</a>]


# Direct child tags only
print(soup.select("head > title"))  # [<title>Title</title>]

print(soup.select("p > #link1"))   # [<a class="sister" href="http://example.com/1" id="link1">Link 1</a>]

print(soup.select("p > a:nth-of-type(2)"))  # [<a class="sister" href="http://example.com/2" id="link2">Link 2</a>]
# nth-of-type is a standard CSS selector


# Sibling nodes (searching forward)
print(soup.select("#link1 ~ .sister"))
# [<a class="sister" href="http://example.com/2" id="link2">Link 2</a>,
# <a class="sister" href="http://example.com/3" id="link3">Link 3</a>]

print(soup.select("#link1 + .sister"))
# [<a class="sister" href="http://example.com/2" id="link2">Link 2</a>]


# Find by attribute value
print(soup.select('a[href="http://example.com/1"]'))

# ^ means "starts with"
print(soup.select('a[href^="http://example.com/"]'))

# * means "contains"
print(soup.select('a[href*=".com/"]'))

# Tags that have the given attribute at all
print(soup.select('[name]'))


# Find only the first matching element
print(soup.select_one(".sister"))


Origin blog.csdn.net/u012206617/article/details/132875817