Beautiful Soup 4

Beautiful Soup is a flexible, convenient library for parsing web pages. With it you can extract information from a page easily, without writing regular expressions.
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Parsing library

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers.
The main options are:

from bs4 import BeautifulSoup
BeautifulSoup(markup, "html.parser")        # Python standard library
BeautifulSoup(markup, "lxml")               # lxml HTML parser
BeautifulSoup(markup, "xml")                # lxml XML parser, same as BeautifulSoup(markup, ["lxml", "xml"])
BeautifulSoup(markup, "html5lib")           # html5lib

The official documentation recommends lxml.

Basic use

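from bs4 import BeautifulSoup

markup = "<html><body><div id='i1'><p>Hello</p></div></body></html>"  # sample input
soup = BeautifulSoup(markup, "lxml")  # requires the lxml package
print(soup.prettify())                # print the parsed tree, one tag per line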

Objects

Beautiful Soup converts a complex HTML document into a tree structure in which each node is a Python object. All objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
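As a rough illustration of the four kinds of objects, the sketch below parses a tiny document and prints the class of each node. The markup is a made-up sample, and "html.parser" is used only so that no third-party parser is needed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical sample markup, not taken from any real page.
markup = "<p>Hello<!-- a note --></p>"
soup = BeautifulSoup(markup, "html.parser")

p = soup.p                    # the <p> element
text_node = p.contents[0]     # the text "Hello"
comment_node = p.contents[1]  # the HTML comment

print(type(soup).__name__)          # BeautifulSoup
print(type(p).__name__)             # Tag
print(type(text_node).__name__)     # NavigableString
print(type(comment_node).__name__)  # Comment
```

Comment is a special kind of NavigableString, so code that only wants visible text usually needs to check for it explicitly.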

Tag object

Attributes

Tag name

Tag.name returns the name of a Tag object.
Tag.name can be modified; the change is reflected in the current BeautifulSoup object.
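A minimal sketch of reading and changing a tag name, assuming a one-line sample document (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

markup = "<b>bold text</b>"  # hypothetical sample markup
soup = BeautifulSoup(markup, "html.parser")

tag = soup.b
print(tag.name)      # b

tag.name = "strong"  # rename the tag in place
print(soup)          # <strong>bold text</strong>
```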

Attributes

Tag attributes are stored in a dictionary.
Tag.attrs returns the attribute dictionary, and a single value can be looked up directly with Tag[key].
If an attribute has multiple values (for example class), a list is returned.
Attributes can be added, deleted, and modified:

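from bs4 import BeautifulSoup

markup = '<div id="old" class="a b">text</div>'  # sample input
soup = BeautifulSoup(markup, "lxml")
tag = soup.div
tag['id'] = 'i1'               # modify an existing attribute
tag['class'] = ['c1', 'c2']    # assign a multi-valued attribute
del tag['class']               # delete an attribute
print(tag.get('id'))           # i1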
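The read side of the attribute dictionary might look like the sketch below. The markup and attribute values are assumptions chosen for illustration; note how the multi-valued class attribute comes back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with a multi-valued class attribute.
markup = '<a id="link1" class="c1 c2" href="/page">text</a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

print(tag.attrs)         # the full attribute dictionary
print(tag["href"])       # /page
print(tag["class"])      # ['c1', 'c2']
print(tag.get("title"))  # None, the attribute is absent
```

Tag[key] raises KeyError for a missing attribute, so tag.get(key) is the safer choice when the attribute may not exist.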

NavigableString objects

A string contained in a tag is wrapped in the NavigableString class.
It can be obtained with tag.string.
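A short sketch of tag.string, using invented markup. It also shows the case where a tag has more than one child, in which .string gives None.

```python
from bs4 import BeautifulSoup, NavigableString

markup = "<p>Hello, <b>world</b></p>"  # hypothetical sample markup
soup = BeautifulSoup(markup, "html.parser")

s = soup.b.string
print(type(s).__name__)  # NavigableString
print(s)                 # world

plain = str(s)           # convert to an ordinary Python string
print(soup.p.string)     # None, because <p> has more than one child
```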

BeautifulSoup objects

A BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a Tag object.

Selector

Tag selector

Accessing a tag name as an attribute finds the first matching tag, searching all descendant tags.
The access can be chained to reach a child tag.
A tag selector returns a Tag object (or None if nothing matches).

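from bs4 import BeautifulSoup

markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"  # sample input
soup = BeautifulSoup(markup, "lxml")
print(soup.title)     # first <title> tag anywhere in the document
print(soup.body.p)    # first <p> tag inside <body>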

Descendant node

contents

The .contents attribute returns a tag's child nodes as a list.

children

The .children attribute iterates over a tag's child nodes.

list(tag.children) == tag.contents
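The difference between direct children and all descendants can be sketched as follows, assuming a small invented list with no whitespace between tags:

```python
from bs4 import BeautifulSoup

markup = "<ul><li>one</li><li>two</li></ul>"  # hypothetical sample markup
soup = BeautifulSoup(markup, "html.parser")
ul = soup.ul

print(ul.contents)  # [<li>one</li>, <li>two</li>]

for child in ul.children:  # direct children only
    print(child.name, child.string)

# Descendants include the two <li> tags and the two strings inside them.
descendant_count = len(list(ul.descendants))
print(descendant_count)    # 4
```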

descendants

The .descendants attribute returns a generator over all descendant nodes of a tag.

from bs4 import BeautifulSoup

markup = "<div><p>one</p><p>two</p></div>"  # sample input
soup = BeautifulSoup(markup, "lxml")
for i, node in enumerate(soup.descendants):  # walk every descendant node
    print(i, node)

Ancestor node

parent

The .parent attribute returns an element's parent node.
The parent of the document's top-level node is the BeautifulSoup object, and the BeautifulSoup object itself has no parent (None).

parents

The .parents attribute returns a generator over all ancestor nodes of a tag.
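A sketch of walking upward from a tag, using invented markup. The name '[document]' is how the BeautifulSoup object identifies itself at the top of the chain.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
markup = "<html><body><div><p>text</p></div></body></html>"
soup = BeautifulSoup(markup, "html.parser")
p = soup.p

print(p.parent.name)  # div

names = [ancestor.name for ancestor in p.parents]
print(names)          # ['div', 'body', 'html', '[document]']

print(soup.parent)    # None
```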

Sibling

next_sibling & previous_sibling

The .next_sibling and .previous_sibling attributes return the next and previous sibling of an element.
If there is no such sibling, None is returned.

next_siblings & previous_siblings

The .next_siblings and .previous_siblings attributes iterate over the siblings that follow or precede the current node.
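Sibling navigation might be used as in the sketch below. The markup is invented and deliberately has no whitespace between tags, because whitespace between tags is itself a string node and would show up as a sibling.

```python
from bs4 import BeautifulSoup

markup = "<div><a>1</a><b>2</b><i>3</i></div>"  # hypothetical sample markup
soup = BeautifulSoup(markup, "html.parser")
b = soup.b

print(b.next_sibling)       # <i>3</i>
print(b.previous_sibling)   # <a>1</a>
print(soup.i.next_sibling)  # None, <i> is the last child

following = [s.name for s in soup.a.next_siblings]
preceding = [s.name for s in soup.i.previous_siblings]
print(following)            # ['b', 'i']
print(preceding)            # ['b', 'a']
```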

Standard selector

find_all

The find_all() method searches all descendants of the current tag and returns those that match the filter conditions.
Usage (releases before 4.4.0 call the string argument text):

find_all(name, attrs, recursive, string, limit, **kwargs)

By name:

find_all('div')

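By attribute:

find_all(id='i1')
find_all(class_='c1')                               # class is a Python keyword, so use class_
find_all(id=True)                                   # any tag that has an id attribute
find_all(href=re.compile('cnblogs.com/'))           # requires import re
find_all(attrs={'attr1': '1', 'attr2': '2'})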
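The filters above can be tried together in a sketch like the following. The markup, ids, classes, and URLs are all made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup.
markup = """
<div id="i1" class="c1">first</div>
<div class="c2">second</div>
<a href="https://www.cnblogs.com/example">link</a>
<a href="https://example.org/">other</a>
"""
soup = BeautifulSoup(markup, "html.parser")

divs = soup.find_all("div")                               # by tag name
by_id = soup.find_all(id="i1")                            # by id
by_class = soup.find_all(class_="c1")                     # by CSS class
with_id = soup.find_all(id=True)                          # any tag with an id
by_href = soup.find_all(href=re.compile("cnblogs.com/"))  # regex on an attribute
by_attrs = soup.find_all(attrs={"id": "i1", "class": "c1"})

print(len(divs), len(by_id), len(by_class), len(with_id), len(by_href), len(by_attrs))
# 2 1 1 1 1 1
```

find_all() always returns a list-like result, which is empty when nothing matches.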

Other methods

find(name, attrs, recursive, string, **kwargs)                      # return the first match

find_parents(name, attrs, string, limit, **kwargs)                  # search the ancestors of the current tag, return all matches
find_parent(name, attrs, string, **kwargs)                          # search the ancestors of the current tag, return the first match

find_next_siblings(name, attrs, string, limit, **kwargs)            # search the siblings after the current tag, return all matches
find_next_sibling(name, attrs, string, **kwargs)                    # search the siblings after the current tag, return the first match

find_previous_siblings(name, attrs, string, limit, **kwargs)        # search the siblings before the current tag, return all matches
find_previous_sibling(name, attrs, string, **kwargs)                # search the siblings before the current tag, return the first match

find_all_next(name, attrs, string, limit, **kwargs)                 # search the tags and strings after the current tag, return all matches
find_next(name, attrs, string, **kwargs)                            # search the tags and strings after the current tag, return the first match

find_all_previous(name, attrs, string, limit, **kwargs)             # search the tags and strings before the current tag, return all matches
find_previous(name, attrs, string, **kwargs)                        # search the tags and strings before the current tag, return the first match
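A few of these methods are sketched below on an invented three-paragraph document. The ids are assumptions used only to make the results easy to read.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
markup = "<div><p id='a'>one</p><p id='b'>two</p><p id='c'>three</p></div>"
soup = BeautifulSoup(markup, "html.parser")

b = soup.find("p", id="b")

print(b.find_parent("div").name)           # div
print(b.find_next_sibling("p")["id"])      # c
print(b.find_previous_sibling("p")["id"])  # a

later = [p["id"] for p in soup.find("p").find_next_siblings("p")]
print(later)                               # ['b', 'c']

print(b.find_next("p")["id"])              # c
print(b.find_previous("p")["id"])          # a
print(soup.find("span"))                   # None, nothing matches
```

The singular methods return None when there is no match, so their result should be checked before it is used.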

CSS selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

soup.select('#i1')              # id selector: tags with id = 'i1'
soup.select('.c1')              # class selector: tags with class = 'c1'
soup.select('body')             # tag selector: all body tags
soup.select('body a')           # descendant selector: all a tags anywhere under body
soup.select('body > a')         # child selector: a tags directly under body
soup.select('[attr1="attr1"]')  # attribute selector: tags whose attr1 attribute equals attr1
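The selectors listed above can be exercised in a sketch like this one. The markup is invented, and attr1 is the same placeholder attribute name used in the list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
markup = """
<body>
  <a id="i1" class="c1" attr1="attr1" href="#">direct</a>
  <div><a class="c2" href="#">nested</a></div>
</body>
"""
soup = BeautifulSoup(markup, "html.parser")

print(len(soup.select("#i1")))              # 1
print(len(soup.select(".c1")))              # 1
print(len(soup.select("body a")))           # 2, any depth under <body>
print(len(soup.select("body > a")))         # 1, direct children only
print(len(soup.select('[attr1="attr1"]')))  # 1

first_nested = soup.select_one("div a")     # first match, or None
print(first_nested.string)                  # nested
```

select() returns a list of all matches, while select_one() returns only the first match.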

Output

Formatted output

The prettify() method formats the Beautiful Soup document tree as a Unicode string, with each XML/HTML tag on its own line.
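A minimal sketch of prettify() on an invented one-line document, showing how nested tags are split across lines and indented:

```python
from bs4 import BeautifulSoup

markup = "<div><p>Hello</p></div>"  # hypothetical sample markup
soup = BeautifulSoup(markup, "html.parser")

pretty = soup.prettify()  # returns a str
print(pretty)
# <div>
#  <p>
#   Hello
#  </p>
# </div>
```

Because prettify() adds whitespace, it is meant for reading the structure; str(soup) gives the document without the extra line breaks.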
