Beautiful Soup 4
Beautiful Soup is a flexible and convenient library for parsing web pages. With it, you can easily extract information from a page without writing regular expressions.
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Parsing library
Beautiful Soup supports the HTML parser in the Python standard library as well as a number of third-party parsers.
The main options:
from bs4 import BeautifulSoup
BeautifulSoup(markup, "html.parser") # Python standard library
BeautifulSoup(markup, "lxml") # lxml HTML parser
BeautifulSoup(markup, "xml") # lxml XML parser, same as BeautifulSoup(markup, ["lxml", "xml"])
BeautifulSoup(markup, "html5lib") # html5lib
The official documentation recommends lxml.
Basic use
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "lxml")
print(soup.prettify())
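The snippet above leaves markup undefined. A minimal runnable sketch follows; the sample markup is made up for illustration, and "html.parser" is used instead of "lxml" only so that nothing beyond bs4 itself has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python; swap in "lxml" if it is installed.
soup = BeautifulSoup(markup, "html.parser")

title_text = soup.title.string
print(title_text)       # Demo
print(soup.prettify())  # the whole document, one tag per line
```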
Objects
Beautiful Soup transforms a complex HTML document into a tree structure in which each node is a Python object. All objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
Tag object
Attributes
Tag name
Tag.name returns the tag name of a Tag object.
Tag.name can be modified; the change is reflected in the current BeautifulSoup object.
Attributes
A tag's attributes are stored in a dictionary.
Tag.attrs returns the attribute dictionary; you can also look up a single key directly with Tag[key].
If an attribute holds multiple values (for example class), a list is returned.
Attributes can be added, deleted, and modified:
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "lxml")
tag = soup.div
tag['id'] = 'i1'
tag['class'] = ['c1', 'c2']
del tag['class']
print(tag.get('id'))
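A self-contained sketch of reading and changing a tag's name and attributes. The markup is invented for the example, and "html.parser" is used so that no third-party parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = '<div id="old" class="c1 c2">text</div>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.div

print(tag.name)    # div
print(tag.attrs)   # {'id': 'old', 'class': ['c1', 'c2']}

# class is multi-valued, so it comes back as a list.
original_classes = tag["class"]

tag.name = "section"   # rename the tag
tag["id"] = "i1"       # modify an attribute
del tag["class"]       # delete an attribute

print(tag)  # <section id="i1">text</section>
```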
NavigableString objects
A string contained in a tag is wrapped in the NavigableString class.
It can be obtained with tag.string.
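A short sketch of what tag.string returns, using made-up markup. Note that .string is None when a tag has more than one child, which is easy to trip over.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup for illustration.
soup = BeautifulSoup("<p>Hello, <b>world</b></p>", "html.parser")

bold_text = soup.b.string
print(type(bold_text).__name__)  # NavigableString
print(str(bold_text))            # world

# <p> has two children (a string and <b>), so .string is None here.
paragraph_string = soup.p.string
print(paragraph_string)          # None
```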
BeautifulSoup objects
A BeautifulSoup object represents the entire contents of a document. Most of the time you can treat it as a Tag object.
Selector
Tag selector
Using a tag name as an attribute finds the first matching tag; the search includes nested tags.
The lookup can be chained to reach a child tag.
A tag selector returns a Tag object.
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "lxml")
print(soup.tag_name)
print(soup.parent_tag.child_tag)
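Here tag_name, parent_tag, and child_tag are placeholders. A concrete sketch with invented markup might look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = (
    "<html><head><title>Demo</title></head>"
    "<body><p>first</p><p>second</p></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

first_p = soup.p          # only the first <p> is returned
title = soup.head.title   # chained lookup: <title> inside <head>
missing = soup.table      # no <table> in the document

print(first_p)   # <p>first</p>
print(title)     # <title>Demo</title>
print(missing)   # None
```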
Child and descendant nodes
contents
A tag's .contents attribute returns its child nodes as a list.
children
A tag's .children attribute is an iterator over its child nodes:
list(tag.children) == tag.contents
descendants
A tag's .descendants attribute is a generator over all of its descendant nodes.
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "lxml")
for i, child in enumerate(soup.children):
    print(i, child)
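To make the difference between the three attributes concrete, here is a small sketch on invented markup. The Tag import is only used to filter strings out of the descendants.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration.
markup = "<div><p>one <b>two</b></p><span>three</span></div>"
soup = BeautifulSoup(markup, "html.parser")
div = soup.div

# Direct children only: <p> and <span>.
child_count = len(div.contents)
same_nodes = list(div.children) == div.contents

# All descendants, including strings: p, "one ", b, "two", span, "three".
all_descendants = list(div.descendants)
descendant_tags = [node.name for node in all_descendants if isinstance(node, Tag)]

print(child_count)           # 2
print(same_nodes)            # True
print(len(all_descendants))  # 6
print(descendant_tags)       # ['p', 'b', 'span']
```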
Ancestor node
parent
The .parent attribute returns the parent node of an element.
The parent of the document's top-level node is the BeautifulSoup object, and the BeautifulSoup object itself has no parent (its .parent is None).
parents
A tag's .parents attribute is a generator over all of its ancestor nodes.
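A brief sketch of walking up the tree, on invented markup. The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = "<html><body><div><p>text</p></div></body></html>"
soup = BeautifulSoup(markup, "html.parser")
p = soup.p

parent_name = p.parent.name
ancestor_names = [ancestor.name for ancestor in p.parents]

print(parent_name)               # div
print(ancestor_names)            # ['div', 'body', 'html', '[document]']
print(soup.html.parent is soup)  # True
print(soup.parent)               # None
```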
Sibling
next_sibling & previous_sibling
The .next_sibling and .previous_sibling attributes return the next and the previous sibling of an element.
If there is no such sibling, None is returned.
next_siblings & previous_siblings
The .next_siblings and .previous_siblings attributes iterate over the siblings that follow and precede the current node.
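A small sketch of sibling navigation. The markup is invented and deliberately contains no whitespace between tags, because whitespace would otherwise show up as string siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; no whitespace between <li> tags on purpose.
markup = "<ul><li>a</li><li>b</li><li>c</li></ul>"
soup = BeautifulSoup(markup, "html.parser")

first = soup.li
last = soup.ul.contents[-1]

following = [sibling.string for sibling in first.next_siblings]
preceding = [sibling.string for sibling in last.previous_siblings]

print(first.next_sibling)      # <li>b</li>
print(first.previous_sibling)  # None
print(following)               # ['b', 'c']
print(preceding)               # ['b', 'a']
```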
Standard selector
find_all
The find_all() method searches all descendants of the current tag and returns those that match the filter conditions.
Usage:
find_all(name, attrs, recursive, text, **kwargs)
By name:
find_all('div')
By attr:
find_all(id='i1')
find_all(class_='c1')
find_all(id=True)
find_all(href=re.compile('cnblogs.com/'))
find_all(attrs={'attr1': '1', 'attr2': '2'})
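The filters above can be tried out on a small document. This is a sketch with invented markup; the data-role attribute is hypothetical and is there to show a case where the attrs dictionary is needed because the attribute name is not a valid Python keyword argument.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = """
<div id="i1" class="c1">
  <a href="https://www.cnblogs.com/post/1">one</a>
  <a href="https://example.com/">two</a>
  <p data-role="note" class="c1 c2">three</p>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

by_name = soup.find_all("a")
by_id = soup.find_all(id="i1")
by_class = soup.find_all(class_="c1")  # matches any tag carrying class c1
has_id = soup.find_all(id=True)        # any tag that has an id attribute
by_regex = soup.find_all(href=re.compile(r"cnblogs\.com/"))
by_attrs = soup.find_all(attrs={"data-role": "note"})

print(len(by_name))                 # 2
print([t.name for t in by_class])   # ['div', 'p']
print([t.string for t in by_regex]) # ['one']
print([t.name for t in by_attrs])   # ['p']
```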
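Other methods
find(name, attrs, recursive, text, **kwargs) # returns the first match
find_parents(name, attrs, recursive, text, **kwargs) # iterates over the ancestors of the current tag, returns all matching nodes
find_parent(name, attrs, recursive, text, **kwargs) # iterates over the ancestors of the current tag, returns the first matching node
find_next_siblings(name, attrs, recursive, text, **kwargs) # iterates over the siblings after the current tag, returns all matching nodes
find_next_sibling(name, attrs, recursive, text, **kwargs) # iterates over the siblings after the current tag, returns the first matching node
find_previous_siblings(name, attrs, recursive, text, **kwargs) # iterates over the siblings before the current tag, returns all matching nodes
find_previous_sibling(name, attrs, recursive, text, **kwargs) # iterates over the siblings before the current tag, returns the first matching node
find_all_next(name, attrs, recursive, text, **kwargs) # iterates over the tags and strings after the current tag, returns all matching nodes
find_next(name, attrs, recursive, text, **kwargs) # iterates over the tags and strings after the current tag, returns the first matching node
find_all_previous(name, attrs, recursive, text, **kwargs) # iterates over the tags and strings before the current tag, returns all matching nodes
find_previous(name, attrs, recursive, text, **kwargs) # iterates over the tags and strings before the current tag, returns the first matching node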
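A sketch of a few of these methods on invented markup, starting from the middle paragraph. Only the name and keyword-attribute filters are used here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = "<div id='outer'><p id='p1'>a</p><p id='p2'>b</p><p id='p3'>c</p></div>"
soup = BeautifulSoup(markup, "html.parser")

first_p = soup.find("p")   # first match in the document
p2 = soup.find(id="p2")    # start point for the relative searches

parent_id = p2.find_parent("div")["id"]
next_id = p2.find_next_sibling("p")["id"]
previous_id = p2.find_previous_sibling("p")["id"]
later_ids = [tag["id"] for tag in p2.find_all_next("p")]
earlier_id = p2.find_previous("p")["id"]

print(first_p["id"])  # p1
print(parent_id)      # outer
print(next_id)        # p3
print(previous_id)    # p1
print(later_ids)      # ['p3']
print(earlier_id)     # p1
```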
CSS selectors
Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.
soup.select('#i1') # id selector: tags whose id is 'i1'
soup.select('.c1') # class selector: tags whose class is 'c1'
soup.select('body') # tag selector: all body tags
soup.select('body a') # descendant selector: all a tags anywhere under body
soup.select('body > a') # child selector: all a tags directly under body
soup.select('[attr1="attr1"]') # attribute selector: all tags whose attr1 attribute is 'attr1'
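A runnable sketch of these selectors on invented markup. It assumes a Beautiful Soup release where .select() is available with its CSS selector dependency installed, which is normally the case after a standard pip install.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; attr1 is a made-up attribute.
markup = """
<body>
  <a id="i1" href="/top">top</a>
  <div class="c1"><a href="/nested" attr1="attr1">nested</a></div>
</body>
"""
soup = BeautifulSoup(markup, "html.parser")

by_id = soup.select("#i1")
by_class = soup.select(".c1")
descendants = soup.select("body a")       # any depth under <body>
direct_children = soup.select("body > a") # direct children of <body> only
by_attr = soup.select('[attr1="attr1"]')

print([t.string for t in by_id])            # ['top']
print([t.name for t in by_class])           # ['div']
print([t.string for t in descendants])      # ['top', 'nested']
print([t.string for t in direct_children])  # ['top']
print([t.string for t in by_attr])          # ['nested']
```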
Export
Formatted output
The prettify() method formats the Beautiful Soup document tree as a Unicode string, with each XML/HTML tag on its own line.
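A minimal sketch of prettify() on invented markup. Exact indentation may vary between Beautiful Soup versions, so the example only relies on each tag and string landing on its own line.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup("<div><p>Hello</p></div>", "html.parser")

pretty = soup.prettify()
print(pretty)

# Strip the indentation to see the one-item-per-line structure.
stripped_lines = [line.strip() for line in pretty.splitlines()]
print(stripped_lines)  # ['<div>', '<p>', 'Hello', '</p>', '</div>']
```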