import requests
from bs4 import BeautifulSoup
url = 'http://python123.io/ws/demo.html'
r = requests.get(url)
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
1. Basic elements of the Beautiful Soup class
Fundamental element | Explanation |
---|---|
Tag | A tag, the basic unit of information; <> and </> mark its beginning and end |
Name | The tag's name; for <p>...</p> the name is 'p'; format: <tag>.name |
Attributes | The tag's attributes, organized as a dictionary; format: <tag>.attrs |
NavigableString | The non-attribute string inside a tag, i.e. the text between <>...</>; format: <tag>.string |
Comment | A comment string inside a tag, a special Comment type |
# Tag
# Get the page title
print(soup.title)
# <title>This is a python demo page</title>
# Get the contents of the html <a> tag
# By default the first matching tag is returned
print(soup.a)
# Name
# Get the tag's name
print('Tag name:', soup.a.name)
# Attributes
# Get the attribute information
tag = soup.a
print(tag.attrs)
# NavigableString
# Get the string inside the <a> tag
print(soup.a.string)
# Comment
new_soup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
print(new_soup.b.string)
# This is a comment
print(type(new_soup.b.string))
# <class 'bs4.element.Comment'>
print(new_soup.p.string)
# This is not a comment
print(type(new_soup.p.string))
# <class 'bs4.element.NavigableString'>
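One subtlety worth noting alongside the types above (a minimal sketch on a local snippet, not part of the demo page): `.string` is only set when a tag has exactly one child string; for mixed content it returns `None`, while `.get_text()` joins all descendant strings.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(soup.b.string)      # <b> has a single string child
# world
print(soup.p.string)      # <p> has mixed children, so .string is None
# None
print(soup.p.get_text())  # .get_text() joins every descendant string
# Hello world
```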
2. Downward traversal of the tag tree
Attribute | Explanation |
---|---|
.contents | A list of the tag's child nodes; stores all direct children |
.children | An iterator over the child nodes, similar to .contents; used to loop over direct children |
.descendants | An iterator over the descendant nodes; contains all descendants, used for looping |
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# Get all nodes under the body tag and store them in a list
print(soup.body.contents)
print(type(soup.body.contents))
# <class 'list'>
# Iterate over the child nodes
for child in soup.body.children:
    print(child)
# Iterate over the descendant nodes
for desc in soup.body.descendants:
    print(desc)
3. Upward traversal of the tag tree
Attribute | Explanation |
---|---|
.parent | The tag's parent node |
.parents | An iterator over the tag's ancestor nodes, used to loop over ancestors |
# Upward traversal of the tag tree
# Iterate over all ancestor nodes of the <a> tag
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# The parent tag of title
print(soup.title.parent)
4. Sideways traversal of the tag tree
Attribute | Explanation |
---|---|
.next_sibling | Returns the next sibling node, following the order of the HTML text |
.previous_sibling | Returns the previous sibling node, following the order of the HTML text |
.next_siblings | An iterator over all subsequent sibling nodes, in HTML text order |
.previous_siblings | An iterator over all preceding sibling nodes, in HTML text order |
# Iterate over subsequent sibling nodes
for sibling in soup.a.next_siblings:
    print(sibling)
# Iterate over preceding sibling nodes
for sibling in soup.a.previous_siblings:
    print(sibling)
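Be aware that sibling traversal also yields the plain-text NavigableStrings between tags. A small sketch on a local snippet (the HTML below is an assumption, not the demo page) shows how to filter for element siblings only:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<p><a href='#1'>one</a> and <a href='#2'>two</a>.</p>"
soup = BeautifulSoup(html, "html.parser")

# next_siblings yields ' and ', <a>two</a>, and '.'; keep only Tag nodes
tag_siblings = [s for s in soup.a.next_siblings if isinstance(s, Tag)]
print([t.get_text() for t in tag_siblings])
# ['two']
```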
In summary, these attributes let you traverse the tag tree downward (.contents, .children, .descendants), upward (.parent, .parents), and sideways (the sibling attributes).
5. The bs4 library's prettify() method
Formats HTML content, or an individual tag's subtree, as indented text (a newline is added after each tag).
import requests
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())
print(soup.a.prettify())
6. The find methods
find_all(name, attrs, recursive, string, **kwargs)
Returns a list type (ResultSet) storing the results of the search.
- name: a string with the tag name(s) to retrieve
- attrs: tag attribute values to search for; supports attribute-based retrieval
- recursive: whether to search all descendants; defaults to True
- string: searches the string regions between <>...</>
import requests
import re
from bs4 import BeautifulSoup
r = requests.get("http://python123.io/ws/demo.html")
soup = BeautifulSoup(r.text, "html.parser")
# Find all <a> tags
print(soup.find_all('a'))
print(type(soup.find_all('a')))
# <class 'bs4.element.ResultSet'>
for tag in soup.find_all('a'):
    print(tag.string)
# Show both <a> and <b> tags
print(soup.find_all(['a', 'b']))
# Show the names of all tags in the soup
for tag in soup.find_all(True):
    print(tag.name)
# Use a regular expression to find tags whose names contain 'b'
for tag in soup.find_all(re.compile('b')):
    print(tag.name)
# Find <p> tags whose class attribute contains 'course'
print(soup.find_all('p', 'course'))
# Find elements whose id attribute is 'link1'
print(soup.find_all(id='link1'))
# Find elements whose id attribute is 'link'; returns [] when nothing matches
print(soup.find_all(id='link'))
# Use the re module to find elements whose id attribute contains 'link'
print(soup.find_all(id=re.compile('link')))
# With recursive=False only soup's direct children are searched,
# not the descendants below them
print(soup.find_all('a', recursive=False))
# Check whether an exact string exists
print(soup.find_all(string="Basic Python"))
# Use re to search for strings containing 'Python'
print(soup.find_all(string=re.compile('Python')))
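Since 'class' is a reserved word in Python, find_all() also accepts the class_ keyword argument, and an attrs dictionary works for any attribute. A short sketch on a local snippet (the HTML and class names below are assumptions, not the demo page):

```python
from bs4 import BeautifulSoup

html = """
<p class="course">Course list</p>
<a class="py1" id="link1" href="#">Basic Python</a>
<a class="py2" id="link2" href="#">Advanced Python</a>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ avoids the clash with the Python keyword 'class'
print([t.get_text() for t in soup.find_all(class_="py1")])
# ['Basic Python']
# an attrs dictionary matches attribute values the same way
print([t["id"] for t in soup.find_all("a", attrs={"class": "py2"})])
# ['link2']
```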
Tip:
<tag>(...) is equivalent to <tag>.find_all(...)
soup(...) is equivalent to soup.find_all(...)
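A quick check of this shorthand on a local snippet (the HTML below is an assumption):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a href='#1'>one</a><a href='#2'>two</a></p>",
                     "html.parser")

# calling the soup or a tag directly dispatches to find_all()
print(soup("a") == soup.find_all("a"))
# True
print(soup.p("a") == soup.p.find_all("a"))
# True
```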
Extension methods
Method | Explanation |
---|---|
<>.find() | Searches and returns a single result (a Tag, or None if nothing matches); same parameters as .find_all() |
<>.find_parents() | Searches the ancestor nodes, returns a list type; same parameters as .find_all() |
<>.find_parent() | Returns a single result from the ancestor nodes; same parameters as .find() |
<>.find_next_siblings() | Searches the subsequent sibling nodes, returns a list type; same parameters as .find_all() |
<>.find_next_sibling() | Returns a single result from the subsequent sibling nodes; same parameters as .find() |
<>.find_previous_siblings() | Searches the preceding sibling nodes, returns a list type; same parameters as .find_all() |
<>.find_previous_sibling() | Returns a single result from the preceding sibling nodes; same parameters as .find() |
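The table above can be tried out on a local snippet (a minimal sketch; the ids below are assumptions, not from the demo page). Note that the single-result variants return a Tag or None, never a list:

```python
from bs4 import BeautifulSoup

html = "<div id='box'><a id='link1'>one</a><a id='link2'>two</a></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                     # single result, not a list
print(first["id"])
# link1
print(first.find_parent("div")["id"])      # nearest matching ancestor
# box
print(first.find_next_sibling("a")["id"])  # next matching sibling
# link2
print(soup.find("span"))                   # no match returns None, not []
# None
```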
Crawler example