Python.BeautifulSoup4

About BeautifulSoup

Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

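Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.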

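Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not specify one and Beautiful Soup cannot detect it. In that case, you only need to state the original encoding.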

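Beautiful Soup has become a Python parsing library on a par with lxml and html5lib, giving users the flexibility to choose different parsing strategies or to favor raw speed.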
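As a small illustration of the encoding behavior described above, the sketch below hands Beautiful Soup a byte string with no declared encoding and names the encoding explicitly through the from_encoding argument. The sample markup and the choice of GBK are made up for this example, and html.parser is used only so the snippet runs without extra packages.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes with no charset declaration
raw_bytes = "<p>博客园</p>".encode("gbk")

# from_encoding tells Beautiful Soup how to decode the bytes
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="gbk")

# Inside the tree, text is ordinary Unicode (a Python str)
paragraph_text = soup.p.string
print(paragraph_text)

# encode() with no arguments writes the document back out as UTF-8 bytes
utf8_bytes = soup.encode()
print(utf8_bytes.decode("utf-8"))
```

When the bytes come from requests, response.content is the undecoded body and can be passed in the same way.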

Official documentation

Installing the BeautifulSoup library

pip install bs4

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:

  • Tag
  • NavigableString
  • BeautifulSoup
  • Comment
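To make the four kinds of object concrete, here is a minimal sketch that parses a tiny, made-up snippet and prints the type of each node. The markup is invented for illustration, and html.parser is chosen only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up markup containing a tag, some text, and a comment
html = "<p class='intro'>Hello<!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.p               # Tag: an HTML element
text = tag.contents[0]     # NavigableString: the text inside the tag
comment = tag.contents[1]  # Comment: a special kind of NavigableString

for obj in (soup, tag, text, comment):
    print(type(obj).__name__)
```

Because Comment is a subclass of NavigableString, code that should skip comments needs to check for Comment explicitly.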

The examples below use Beautiful Soup to parse the home page of the author's own blog:

import requests
from bs4 import BeautifulSoup

blog_url = "https://www.cnblogs.com/youngleesin/"
# Send a mobile-browser User-Agent with the request
header = {
    "User-Agent": (
        "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-cn; BLA-AL00 "
        "Build/HUAWEIBLA-AL00) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.9 Mobile Safari/537.36"
    )
}

response = requests.get(blog_url, headers=header)

# response.text is the HTML to parse; "lxml" selects the parser to use
blog_html = BeautifulSoup(response.text, "lxml")

Tag selector

Getting tag information

# print(blog_html.prettify())  # prettify() pretty-prints the document; output omitted because it is long
print("Title of the blog page:", blog_html.title.string)
print("First a tag:", blog_html.a)
print("Name of the first a tag:", blog_html.a.name)
print("Name of the a tag's parent:", blog_html.a.parent.name)
print("Name of the parent's parent:", blog_html.a.parent.parent.name)
print("Children of the title tag:", blog_html.title.contents)
print("First link tag:", blog_html.link)
print("Type of the link tag:", type(blog_html.link))
print("Name of the link tag:", blog_html.link.name)
print("Attributes of the link tag:", blog_html.link.attrs)
print("href attribute of the link tag:", blog_html.link.attrs["href"])

Output

Title of the blog page: yonugleesin - 博客园
First a tag: <a name="top"></a>
Name of the first a tag: a
Name of the a tag's parent: body
Name of the parent's parent: html
Children of the title tag: ['yonugleesin - 博客园']
First link tag: <link href="/bundles/blog-common.css?v=KOZafwuaDasEedEenI5aTy8aXH0epbm6VUJ0v3vsT_Q1" rel="stylesheet" type="text/css"/>
Type of the link tag: <class 'bs4.element.Tag'>
Name of the link tag: link
Attributes of the link tag: {'type': 'text/css', 'rel': ['stylesheet'], 'href': '/bundles/blog-common.css?v=KOZafwuaDasEedEenI5aTy8aXH0epbm6VUJ0v3vsT_Q1'}
href attribute of the link tag: /bundles/blog-common.css?v=KOZafwuaDasEedEenI5aTy8aXH0epbm6VUJ0v3vsT_Q1
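Since the live page can change or be unreachable, the same navigation can be tried offline. The sketch below is a hedged stand-in: the HTML is invented to resemble the structure shown in the output above, and html.parser replaces lxml so nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Invented stand-in for a downloaded page
html = """
<html>
  <head>
    <title>Sample blog</title>
    <link href="/style.css" rel="stylesheet" type="text/css"/>
  </head>
  <body><a name="top"></a><a href="/home">Home</a></body>
</html>
"""
page = BeautifulSoup(html, "html.parser")

print(page.title.string)        # text inside <title>
print(page.a)                   # dotted access returns only the first <a>
print(page.a.parent.name)       # name of the enclosing tag
print(page.link.attrs["href"])  # attrs is a dict of the tag's attributes
print(page.link["rel"])         # rel is multi-valued, so it comes back as a list
```

Dotted access such as page.a always stops at the first match; the find_all() method is the way to reach the rest.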

The find_all() method

find_all() searches for tag elements. Its signature is find_all(name, attrs, recursive, text, **kwargs), and it returns a list-like object holding the search results.

• name: the tag name to search for
• attrs: attribute values to search for; a specific attribute can be named
• recursive: whether to search all descendants; defaults to True
• text: a string to search for in the text between <> and </> (in Beautiful Soup 4.4.0 and later this argument is also accepted under the name string)
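The parameters listed above can be tried on a small document. This is a sketch under stated assumptions: the markup is invented, string is used as the newer spelling of text (available from Beautiful Soup 4.4.0), and limit is an extra find_all() argument that the signature quoted above does not list.

```python
from bs4 import BeautifulSoup

# Invented markup: one link directly under <body>, one nested in a <div>
html = """
<body>
  <a id="outer" href="/a">Home</a>
  <div><a id="inner" href="/b">Edit</a></div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# name: every <a> tag anywhere in the document
all_links = soup.find_all("a")

# recursive=False: only direct children of <body>, so the nested <a> is skipped
direct_links = soup.body.find_all("a", recursive=False)

# string: match on the text itself; the results are strings, not tags
edit_strings = soup.find_all(string="Edit")

# limit: stop after this many matches
first_only = soup.find_all("a", limit=1)

print([a["id"] for a in all_links])
print([a["id"] for a in direct_links])
print(edit_strings)
print([a["id"] for a in first_only])
```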

name: search by tag name

print("Find all a tags and print the first one:", blog_html.find_all("a")[0])
print("All div and a tags:", blog_html.find_all(["div", "a"]))  # output omitted because it is long
print(len(blog_html.find_all(["div", "a"])[1]))  # len() of a tag is its number of direct children
print(len(blog_html.find_all(["div", "a"])[2]))

Output

Find all a tags and print the first one: <a name="top"></a>
12
7

Iterate over the first three a tags and get each href link

for data in blog_html.find_all("a")[0:3]:
    print(data.get("href"))

Output

None
https://www.cnblogs.com/youngleesin/
https://www.cnblogs.com/youngleesin/
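The None in the result above comes from an a tag that has no href attribute. One way to skip such tags, sketched here on invented markup, is to pass href=True so that find_all() only returns tags that carry the attribute.

```python
from bs4 import BeautifulSoup

# Invented markup: the first <a> is a named anchor without an href
html = '<a name="top"></a><a href="/home">Home</a><a href="/rss">RSS</a>'
soup = BeautifulSoup(html, "html.parser")

# get() returns None for a missing attribute
all_hrefs = [a.get("href") for a in soup.find_all("a")]
print(all_hrefs)

# href=True keeps only the tags that actually have an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```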

attrs: search by tag attribute

print("a tags whose class is headermaintitle:", blog_html.find_all("a", class_="headermaintitle"))  # class is a Python keyword, so the argument takes a trailing underscore
print("a tags whose name attribute is top:", blog_html.find_all("a", attrs={"name": "top"}))

Output

a tags whose class is headermaintitle: [<a class="headermaintitle" href="https://www.cnblogs.com/youngleesin/" id="Header1_HeaderTitle">Young_Leesin</a>]
a tags whose name attribute is top: [<a name="top"></a>]
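The two calls above use different spellings for a reason, which the following sketch tries to show on invented markup: class_ works as a keyword argument, while an attribute called name has to go through the attrs dictionary because name is already the tag-name parameter of find_all().

```python
from bs4 import BeautifulSoup

# Invented markup: a named anchor and two menu links
html = """
<a name="top"></a>
<a class="menu active" href="/home">Home</a>
<a class="menu" href="/rss">RSS</a>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any one of a tag's classes, so "menu active" also matches
menu_links = soup.find_all("a", class_="menu")

# find_all(name="top") would look for <top> tags, so the name attribute
# is passed through the attrs dictionary instead
anchors = soup.find_all("a", attrs={"name": "top"})

# attrs also accepts class as a plain dictionary key
same_menu_links = soup.find_all("a", attrs={"class": "menu"})

print([a["href"] for a in menu_links])
print(anchors)
```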

Iterate over the tags whose class attribute is menu and get each href link

for data in blog_html.find_all(class_="menu"):
    print(data.get("href"))

Output

https://www.cnblogs.com/
https://www.cnblogs.com/youngleesin/
https://i.cnblogs.com/EditPosts.aspx?opt=1
https://msg.cnblogs.com/send/yonugleesin
https://www.cnblogs.com/youngleesin/rss
https://i.cnblogs.com/

text: search by text content

print(blog_html.find_all(text = "博客园")) # handy for assertions

Output

['博客园']

The find() method

find(name, attrs, recursive, text, **kwargs)
The difference between find() and find_all() is that find() returns only the first matching element, while find_all() returns every matching element.

print("Search by tag attribute")
print(blog_html.find_all(class_="clear"))
print("---------------------------")
print(blog_html.find(class_="clear"), "\n", "\n")

print("Search by text content")
print(blog_html.find_all(text = "编辑"))
print("---------------------------")
print(blog_html.find(text = "编辑"))

Output

Search by tag attribute
[<div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>, <div class="clear"></div>]
---------------------------
<div class="clear"></div> 
 

Search by text content
['编辑', '编辑', '编辑', '编辑', '编辑', '编辑', '编辑', '编辑', '编辑', '编辑']
---------------------------
编辑
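The two methods also differ when nothing matches, which matters for error handling. A minimal sketch on invented markup: find() returns None, while find_all() returns an empty list.

```python
from bs4 import BeautifulSoup

# Invented markup with two matching elements
html = '<div class="clear"></div><div class="clear"></div>'
soup = BeautifulSoup(html, "html.parser")

first_match = soup.find(class_="clear")        # a single Tag
all_matches = soup.find_all(class_="clear")    # a list-like ResultSet
missing_one = soup.find(class_="missing")      # None
missing_all = soup.find_all(class_="missing")  # empty list

# Check for None before touching attributes of a find() result
if missing_one is None:
    print("no element with class 'missing'")
print(len(all_matches), len(missing_all))
```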

CSS selectors

Pass a CSS selector directly to select() to make the selection.
A leading . matches by class and a leading # matches by id; for other selector syntax, consult a CSS selector reference.

print(blog_html.select(".headermaintitle"))
print(blog_html.select("#navigator #navList #blog_nav_sitehome"))

Output

[<a class="headermaintitle" href="https://www.cnblogs.com/youngleesin/" id="Header1_HeaderTitle">Young_Leesin</a>]
[<a class="menu" href="https://www.cnblogs.com/" id="blog_nav_sitehome">博客园</a>]
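A few more selector forms can be tried offline. The sketch below uses invented markup loosely modeled on the navigation list above; the id blog_nav_rss is hypothetical, and select_one() is used as the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Invented navigation markup; blog_nav_rss is a hypothetical id
html = """
<div id="navigator">
  <ul id="navList">
    <li><a class="menu" id="blog_nav_sitehome" href="/">Home</a></li>
    <li><a class="menu" id="blog_nav_rss" href="/rss">RSS</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".menu")                # every tag with class "menu"
by_id = soup.select("#navList #blog_nav_rss")  # descendant of #navList with this id
with_attr = soup.select('a[href="/rss"]')      # attribute selector
first_menu = soup.select_one("li > a.menu")    # first match only, or None

print([a["id"] for a in by_class])
print(by_id[0]["href"], with_attr[0]["id"], first_menu["id"])
```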

Getting content and attributes

print("The 2nd a tag:", blog_html.select("a")[1])
print("Text of the 1st li tag:", blog_html.select("li")[0].get_text())

Output

The 2nd a tag: <a href="https://www.cnblogs.com/youngleesin/" id="lnkBlogLogo"><img alt="返回主页" id="blogLogo" src="/Skins/custom/images/logo.gif"/></a>
Text of the 1st li tag: 博客园

Iterating over the contents of all li tags

for li in blog_html.select("li"):
    print(li.get_text())

Output

博客园
首页
新随笔
联系
订阅

管理
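The blank line in the result above comes from an li whose text is empty. As a hedged sketch on invented markup, get_text() takes strip and separator arguments that help tidy this kind of output.

```python
from bs4 import BeautifulSoup

# Invented list: padded text, a nested link, and a whitespace-only item
html = "<ul><li> Home </li><li><a href='/rss'>RSS</a></li><li>  </li></ul>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims whitespace from each piece of text
texts = [li.get_text(strip=True) for li in soup.select("li")]
print(texts)

# Drop the entries that are empty after stripping
non_empty = [t for t in texts if t]
print(non_empty)

# separator joins the text pieces found under one tag
joined = soup.ul.get_text(separator=" | ", strip=True)
print(joined)
```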

Iterating over the id attribute of a tags

for i, a in enumerate(blog_html.select("a")[0:5], start=1):
    try:
        print("a tag number", i, "has id:", a["id"])
    except KeyError:  # raised when the tag has no id attribute
        print("a tag number", i, "has id:", "sorry, this tag has no id attribute")

Output

a tag number 1 has id: sorry, this tag has no id attribute
a tag number 2 has id: lnkBlogLogo
a tag number 3 has id: Header1_HeaderTitle
a tag number 4 has id: blog_nav_sitehome
a tag number 5 has id: blog_nav_myhome
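An alternative to catching the exception, sketched here on invented markup, is to ask for the attribute with get(), which accepts a fallback value, or to test for it with has_attr().

```python
from bs4 import BeautifulSoup

# Invented markup: the first <a> has no id attribute
html = '<a name="top"></a><a id="lnkBlogLogo" href="/"></a>'
soup = BeautifulSoup(html, "html.parser")

for i, a in enumerate(soup.find_all("a"), start=1):
    # get() returns the fallback instead of raising KeyError
    tag_id = a.get("id", "no id attribute")
    print(i, tag_id, a.has_attr("href"))

# Without a fallback, get() returns None for a missing attribute
ids = [a.get("id") for a in soup.find_all("a")]
print(ids)
```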

Origin www.cnblogs.com/youngleesin/p/11298639.html