Beautiful Soup Library - Python Web Crawler (Part 2)

Beautiful Soup 4.4.0 documentation (Chinese translation)

import requests
from bs4 import BeautifulSoup

url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

1. Basic elements of the BeautifulSoup class


Basic element Explanation
Tag A tag, the basic unit of information; a pair of <> and </> marks its start and end
Name The tag's name; for <p>...</p> the name is 'p'; usage: <tag>.name
Attributes The tag's attributes, organized as a dictionary; usage: <tag>.attrs
NavigableString The non-attribute string inside a tag, i.e. the text between <> and </>; usage: <tag>.string
Comment A comment string inside a tag, a special Comment type
# Tag
# Get the page title
print(soup.title)
# <title>This is a python demo page</title>
# Get the content of an <a> tag in the HTML
# By default only the first matching tag is returned
print(soup.a)

# Name
# Get the tag's name
print('Tag name:', soup.a.name)

# Attributes
# Get the attribute information
tag = soup.a
print(tag.attrs)

# NavigableString
# Get the string inside the <a> tag
print(soup.a.string)

# Comment
new_soup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")

print(new_soup.b.string)
# This is a comment
print(type(new_soup.b.string))
# <class 'bs4.element.Comment'>
print(new_soup.p.string)
# This is not a comment
print(type(new_soup.p.string))
# <class 'bs4.element.NavigableString'>
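Because Comment is a subclass of NavigableString, a comment and ordinary text look the same when printed; the only reliable way to tell them apart is an explicit type check. A minimal sketch:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

# Comment subclasses NavigableString, so isinstance() is the
# reliable way to distinguish a comment from real text.
for tag in soup.find_all(['b', 'p']):
    if isinstance(tag.string, Comment):
        print("comment:", tag.string)
    else:
        print("text:   ", tag.string)
```

This pattern is useful when extracting page text, where comments should usually be skipped.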

2. Downward traversal of the tag tree


Attribute Explanation
.contents A list of the tag's direct children; all child nodes are stored in the list
.children An iterator over the direct children, similar to .contents; used in for loops
.descendants An iterator over all descendant nodes, not just direct children; used in for loops
import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

# Get all child nodes under the <body> tag and store them in a list
print(soup.body.contents)
print(type(soup.body.contents))
# <class 'list'>

# Iterate over the direct children
for child in soup.body.children:
    print(child)

# Iterate over all descendants
for desc in soup.body.descendants:
    print(desc)
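The difference between the three attributes is easiest to see on a tiny document. A minimal offline sketch (using an inline HTML string rather than the remote demo page):

```python
from bs4 import BeautifulSoup

# Small inline document so the example runs without network access.
html = "<div><p>Hello <b>world</b></p><p>Bye</p></div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.div
# .contents is a real list holding only the direct children.
print(len(div.contents))             # 2: the two <p> tags
# .children yields the same direct children lazily.
print(sum(1 for _ in div.children))  # 2
# .descendants walks the whole subtree: tags *and* strings.
print(sum(1 for _ in div.descendants))  # 6
```

Note that .descendants counts NavigableStrings ("Hello ", "world", "Bye") as well as the nested tags.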

3. Upward traversal of the tag tree


Attribute Explanation
.parent The parent node of the tag
.parents An iterator over the tag's ancestor nodes, for looping through all ancestors
# Upward traversal of the tag tree
# Iterate over all ancestors of the <a> tag
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

# The parent tag of <title>
print(soup.title.parent)
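The None check in the loop above exists because the chain of ancestors ends at the BeautifulSoup object itself, whose own .parent is None. A minimal sketch with an inline document:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents walks all the way up to the BeautifulSoup object,
# whose name is '[document]' and whose own .parent is None.
names = [p.name for p in soup.a.parents]
print(names)        # ['p', 'body', 'html', '[document]']
print(soup.parent)  # None
```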

4. Sideways traversal of the tag tree (siblings)


Attribute Explanation
.next_sibling Returns the next sibling node in HTML document order
.previous_sibling Returns the previous sibling node in HTML document order
.next_siblings An iterator over all subsequent sibling nodes, in document order
.previous_siblings An iterator over all preceding sibling nodes, in document order
# Iterate over the following sibling nodes
for sibling in soup.a.next_siblings:
    print(sibling)

# Iterate over the preceding sibling nodes
for sibling in soup.a.previous_siblings:
    print(sibling)
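Sibling traversal yields NavigableStrings between tags as well as the tags themselves, so the loops above often print stray text fragments. A minimal sketch of filtering to tags only, using an inline document:

```python
from bs4 import BeautifulSoup, Tag

html = "<p><a>one</a> and <b>two</b> and <i>three</i></p>"
soup = BeautifulSoup(html, "html.parser")

# The siblings of <a> include the text " and " between the tags.
for sib in soup.a.next_siblings:
    print(repr(sib))

# Keep only the tag siblings with an isinstance() check.
tags = [s for s in soup.a.next_siblings if isinstance(s, Tag)]
print([t.name for t in tags])  # ['b', 'i']
```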


5. The bs4 library's prettify() method


Formats the HTML content of the whole document, or of a single tag, adding a newline after each tag.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

print(soup.prettify())
print(soup.a.prettify())
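On a small inline document the effect is easy to see: prettify() puts each node on its own line, indented one space per nesting level. A minimal offline sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text</b></a>", "html.parser")

# prettify() re-serializes the tree with one node per line,
# indenting one space per nesting level.
print(soup.prettify())
# <a>
#  <b>
#   text
#  </b>
# </a>
```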

6. Search methods


find_all(name, attrs, recursive, string, **kwargs):
Returns a list type (ResultSet) storing the search results.

  • name: retrieve tags whose name matches the given string
  • attrs: retrieve tags by attribute value; attribute-qualified searches are supported
  • recursive: whether to search all descendants (default True)
  • string: search the string regions between <> and </>
import requests
import re
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
soup = BeautifulSoup(r.text, "html.parser")

# Find all <a> tags
print(soup.find_all('a'))
print(type(soup.find_all('a')))
# <class 'bs4.element.ResultSet'>

for tag in soup.find_all('a'):
    print(tag.string)
# Find both <a> and <b> tags
print(soup.find_all(['a', 'b']))

# Show the names of all tags in soup
for tag in soup.find_all(True):
    print(tag.name)

# Use a regular expression to find tags whose names contain 'b'
for tag in soup.find_all(re.compile('b')):
    print(tag.name)

# Find <p> tags whose class is 'course'
print(soup.find_all('p', 'course'))

# Find the element whose id attribute is 'link1'
print(soup.find_all(id='link1'))

# Find elements whose id attribute is 'link'; returns [] if there are none
print(soup.find_all(id='link'))

# Use the re module to find elements whose id attribute contains 'link'
print(soup.find_all(id=re.compile('link')))

# With recursive=False only soup's direct children are searched, not the descendants
print(soup.find_all('a', recursive=False))

# Check whether the string 'Basic Python' exists
print(soup.find_all(string="Basic Python"))

# Use re to find strings containing 'Python'
print(soup.find_all(string=re.compile('Python')))

Tip:
<tag>(...) is equivalent to <tag>.find_all(...)
soup(...) is equivalent to soup.find_all(...)
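The shorthand can be verified directly: calling a tag (or the soup) is the same as calling its find_all(). A minimal sketch with an inline document:

```python
from bs4 import BeautifulSoup

html = "<p><a id='x'>one</a><a id='y'>two</a></p>"
soup = BeautifulSoup(html, "html.parser")

# Calling the soup or a tag is shorthand for .find_all():
print(soup("a") == soup.find_all("a"))      # True
print(soup.p("a") == soup.p.find_all("a"))  # True
```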

Extension methods

Method Explanation
<>.find() Searches and returns only one result; same parameters as .find_all()
<>.find_parents() Searches the ancestor nodes; returns a list; same parameters as .find_all()
<>.find_parent() Returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings() Searches the subsequent sibling nodes; returns a list; same parameters as .find_all()
<>.find_next_sibling() Returns one result from the subsequent siblings; same parameters as .find()
<>.find_previous_siblings() Searches the preceding sibling nodes; returns a list; same parameters as .find_all()
<>.find_previous_sibling() Returns one result from the preceding siblings; same parameters as .find()
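The key practical difference is that find() returns a single Tag (or None on a miss) rather than a list, so there is no need to index into a result. A minimal sketch with an inline document:

```python
from bs4 import BeautifulSoup

html = "<body><div><p class='course'>Python</p><p>Java</p></div></body>"
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag, or None -- no exception on a miss.
first = soup.find("p")
print(first.string)     # Python
print(soup.find("h1"))  # None

# find_parent() searches upward with the same filters as find().
print(first.find_parent("div").name)  # div
```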




Crawler case studies

  1. A directed crawler for Chinese university rankings (requests + bs4)
Source: blog.csdn.net/qq_36852780/article/details/104330215