3. bs4 package learning (HTML source code parsing)

The previous blog, 2. urllib library learning (anti-crawler web source code crawling), introduced how to crawl a page's source code; this blog introduces how to parse the crawled HTML files.

Use Beautiful Soup to parse html files

1. First, crawl the Baidu homepage and save the source code locally

import pickle  # used to save the html file
import urllib.request

url = "http://www.baidu.com"
response = urllib.request.urlopen(url)
pickle.dump(response.read(), open('./baidu.html', 'wb'))  # dump the raw bytes to a local file
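
Incidentally, response.read() returns raw bytes, so pickle is not strictly required here; a minimal equivalent sketch that replaces the pickle.dump line with a plain binary write:

# equivalent sketch: write the response bytes directly instead of pickle.dump
with open('./baidu.html', 'wb') as f:
    f.write(response.read())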

In case Baidu's homepage is no longer the same someday, you can download the html source code used by this blog and follow along with the rest of the tutorial:
2021.02.02 12:57 crawl of the Baidu homepage source code: https://wwx.lanzoui.com/i5QOvl7peyb

2. Use html.parser to parse

from bs4 import BeautifulSoup

file = open("./baidu.html", "rb")
html = file.read()  # read the file content
bs = BeautifulSoup(html, "html.parser")  # build a BeautifulSoup object with the html.parser parser

bs is the BeautifulSoup parsing object we just created; its type is <class 'bs4.BeautifulSoup'>. BeautifulSoup4 converts a complex HTML document into a tree structure in which every node is a Python object, and all objects fall into 4 types:

  • 1.Tag

  • 2.NavigableString

  • 3.BeautifulSoup

  • 4.Comment


1. Tag: a tag together with its content; only the first matching tag is returned
print(bs.title)
print(type(bs.title))

bs.title matches the first title tag, whose content is 百度一下,你就知道

type(bs.title) is: <class 'bs4.element.Tag'>

In short: parsing object.tag gets the tag and its content

2. NavigableString: the content of the tag
print(bs.title.string)
print(type(bs.title.string))  # NavigableString

bs.title still matches the first title tag: 百度一下,你就知道

bs.title.string is: 百度一下,你就知道

type(bs.title.string) is: <class 'bs4.element.NavigableString'>

In short: parsing object.tag.string gets the content inside the tag

Besides getting the tag with its content, or the content alone, you can also get a tag's attributes:
print(bs.a)
print(bs.a.attrs)
print(type(bs.a.attrs))  # <class 'dict'>

bs.a is: <a class="toindex" href="/">百度首页</a>

bs.a.attrs is: {'class': ['toindex'], 'href': '/'}

type(bs.a.attrs) is: <class 'dict'>

In short: parsing object.tag.attrs returns all attributes of the tag (class, href, name, and so on) as a dictionary of key-value pairs.
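
As a side note, a single attribute can also be read without going through attrs; subscript access and get() are both standard bs4 Tag operations:

print(bs.a["href"])      # subscript access; raises KeyError if the attribute is missing
print(bs.a.get("href"))  # get() returns None instead of raising when the attribute is missing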

3. BeautifulSoup: Represents the entire document
print(bs)  # prints the entire document
print(type(bs))
print(bs.name)  # [document]
print(bs.a.name)

print(bs) prints the content of the entire html document

type(bs) is: <class 'bs4.BeautifulSoup'>

bs.name is: [document]

bs.a.name is: a

In short: parsing object.tag.name is the tag's name, and parsing object.name is the name of the parsing object itself

4. Comment: a special kind of NavigableString; the printed output does not include the comment markers

To see a Comment, manually edit the 百度首页 link in baidu.html so that its content becomes an HTML comment, i.e. <a class="toindex" href="/"><!--百度首页--></a>

Note: when executing again, skip the part of the code that crawls and saves baidu.html, otherwise the manual edit will be overwritten

print(bs.a)
print(bs.a.string)
print(type(bs.a.string))  # Comment

bs.a is: <a class="toindex" href="/"><!--百度首页--></a>

bs.a.string is: 百度首页

type(bs.a.string) is: <class 'bs4.element.Comment'>

Notice that when the content of a tag is commented out, parsing object.tag.string automatically strips the comment markers, and its type becomes <class 'bs4.element.Comment'>. In short, Comment is a special case of NavigableString.
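
Since a Comment prints exactly like an ordinary string, it is easy to mistake one for visible text. A small defensive sketch using the Comment class that bs4 exports:

from bs4 import Comment

content = bs.a.string
if isinstance(content, Comment):
    print("this .string is a comment, not visible text:", content)
else:
    print("ordinary text:", content)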


Practical application

The parser organizes the html source code into a tree-like data structure; all nodes can then be obtained by traversing the tree

Node acquisition: traversal

print(bs.head.contents)
print(bs.head.contents[1])

bs.head.contents is a list of all direct children of the head tag

bs.head.contents[1] gets the second of those direct children

In short: parsing object.head.contents returns all first-level children of the head tag as a list
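
Besides contents, bs4 also provides generator-based traversal attributes; a short sketch of the common ones (.children for direct children, .descendants for all nodes recursively, .parent for going up):

for child in bs.head.children:           # direct children, as a generator
    print(type(child))                   # Tags and NavigableStrings are both included
print(len(list(bs.head.descendants)))    # every node under head, recursively
print(bs.title.parent.name)              # the parent of <title> is <head>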

Document search: targeted lookup (e.g. a specific div)

1. find_all

String filter: matches every tag whose name exactly equals the given string and returns the results as a list, for example:

t_list = bs.find_all("a")
print(t_list)

Find all a tags and return a list of results
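
Relatedly, if only the first match is needed, find() returns a single tag instead of a list:

first_a = bs.find("a")  # same as bs.a: the first <a> tag, or None if there is none
print(first_a)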

2. Regular expression search: matches content using re's search method

Using the same example as above: find everything that contains the character "a"

import re

t_list = bs.find_all(re.compile("a"))  # tags whose name contains "a"
print(t_list)
t_list = bs.find_all(text=re.compile(r"\d"))  # strings containing digits
for it in t_list:
    print(it)

re.compile("a") creates a pattern object that is matched against tag names, so this finds every tag whose name contains "a". A regular-expression pattern object can also be passed as text: text=re.compile(r"\d") searches for strings containing digits (the text content only, not the tags themselves)
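
Worth noting: since bs4 4.4 the text argument is also available under the name string (text remains an alias), so the digit search can equivalently be written as:

t_list = bs.find_all(string=re.compile(r"\d"))  # same search with the newer argument name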

3. Function: pass in a function, and the search keeps whatever tags the function returns True for

First define a function that returns a Boolean value: whether the tag has a name attribute

def name_is_exists(tag):
    return tag.has_attr("name")

t_list = bs.find_all(name_is_exists)  # find all tags that have a name attribute
print(t_list)

bs.find_all(name_is_exists) finds all tags with the name attribute
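
The function can encode any condition over the tag; for example, the classic example from the bs4 documentation keeps tags that define class but not id:

def has_class_but_no_id(tag):
    # True for tags like <a class="...">, False once an id is also present
    return tag.has_attr("class") and not tag.has_attr("id")

print(bs.find_all(has_class_but_no_id))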

4. kwargs parameters
t_list = bs.find_all(id="head")  # the tag whose id is "head"
print(t_list)
t_list = bs.find_all(class_=True)  # tags that have a class attribute
print(t_list)
t_list = bs.find_all(href="http://news.baidu.com")  # tags whose href is "http://news.baidu.com"
print(t_list)
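
For attribute names that clash with Python keywords or contain dashes, find_all also accepts an attrs dictionary; the data-foo attribute below is made up purely for illustration:

t_list = bs.find_all(attrs={"data-foo": "value"})  # "data-foo"/"value" are illustrative only
print(t_list)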
5. text parameter: match specific text

Find content whose text is "hao123":

t_list = bs.find_all(text="hao123")
# the parameter can also be a list; the result is the union of the matches
t_list = bs.find_all(text=["hao123", "地图", "贴吧"])
for it in t_list:
    print(it)

Several pieces of text can be matched at the same time

6. limit parameter: get only the first limit matching elements
t_list = bs.find_all("a", limit=3)
for it in t_list:
    print(it)

The above example gets the first three a tags

7. CSS selectors: quickly locate elements nested in the page by id or class; parameters can be tag names, ids, and classes, with usage similar to jQuery
# search by tag name, returns all matching tags
print(bs.select('title'))
# search by class name, returns all matching tags
print(bs.select('.mnav'))
# search by id, returns all matching tags
print(bs.select('#u1'))
# filter on a tag's attribute value, returns all matching tags
print(bs.select("a[class='s-set-hotsearch set-show']"))  # <a> tags whose class is 's-set-hotsearch set-show'
# search child tags through the hierarchy
res = bs.select("head > link")  # all direct <link> children of <head>, grandchildren excluded
print(res)
print(len(res))
for it in res:
    print(it)
# search sibling tags; note this can only look forward
t_list = bs.select(".toindex ~ .pf")  # find the tag with class toindex, then later siblings with class pf
print(len(t_list))
print(t_list[0].get_text())

This example involves the following content in the html source:
"""
<a class="toindex" href="/">百度首页</a>
<a href="javascript:;" name="tj_settingicon" class="pf">设置<i class="c-icon c-icon-triangle-down"></i></a>
<a href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5" name="tj_login" class="lb" onclick="return false;">登录</a>
<div class="bdpfmenu"></div>
"""

Task-driven development: organize the data crawled from Douban's Top 250 movies into tree-structured data

# import bs4  # only BeautifulSoup from bs4 is needed, so the import can be written as:
from bs4 import BeautifulSoup  # web page parsing, data extraction
import urllib.request, urllib.error  # specify the url, fetch the page data


def main():
    # the pages to crawl
    baseurl = "https://movie.douban.com/top250?start="
    # the save paths
    savepath = ".\\豆瓣电影Top250.xls"  # use \\ in the path, or prefix the whole string with r, e.g. r".\豆瓣电影Top250"
    savepath2Db = "movies.db"
    # 1. crawl the pages
    # print(askURL(baseurl))
    datalist = getData(baseurl)
    print(datalist)



# crawl the pages
def getData(baseurl):
    datalist = []
    for i in range(0, 10):  # 25 movies per page
        url = baseurl + str(i * 25)
        html = askURL(url)  # the fetched page source
        # print(html)
        # 2. parse the data (page by page)
        soup = BeautifulSoup(html, "html.parser")  # parse the html document into a tree structure with html.parser
        # parsing ...
    return datalist


# fetch the page content of the given url
def askURL(url):
    # header information; the user agent disguises the request as a browser visit
    head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/87.0.4280.88 Safari/537.36"}
    req = urllib.request.Request(url, headers=head)
    html = ""  # the fetched page source
    try:
        response = urllib.request.urlopen(req)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):  # has attribute
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


if __name__ == '__main__':
    main()
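
The parsing step is deliberately left open in getData above. As a hedged sketch only: at the time of writing, each movie on the Douban Top 250 list appears to sit in a <div class="item"> block with the title in a <span class="title"> (both class names are assumptions about Douban's markup and may change), so the loop body could continue like this:

# sketch only: "div.item" and "span.title" are assumptions about Douban's markup
for item in soup.find_all("div", class_="item"):
    title_tag = item.find("span", class_="title")  # the movie's first title span
    if title_tag is not None:
        datalist.append(title_tag.string)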

