Six: Crawler-BeautifulSoup4 for data analysis

Six: Introduction to bs4

basic concept:

Simply put, Beautiful Soup is a Python library whose main job is to extract data from web pages. The official description reads:

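'''
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.
It is a toolkit that parses a document and hands you the data you want to scrape; because it is simple,
a complete application takes very little code to write.
'''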

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document, and it can save hours or even days of work. Beautiful Soup 3 is no longer developed; the official website recommends Beautiful Soup 4 for current projects.

bs4 installation

Since Beautiful Soup is a third-party library, it has to be installed separately. Installation is simple; run the following command:
pip install bs4
(On PyPI, bs4 is a thin wrapper package that pulls in beautifulsoup4, so pip install beautifulsoup4 is equivalent.)
BS4 relies on a document parser when it parses a page. lxml is a common choice and is installed as follows:
pip install lxml
Python also ships with its own parsing library, html.parser, but it is somewhat slower than lxml. A third option is the html5lib parser, installed as follows:
pip install html5lib
Note: bs4 does not require lxml at install time, and the two can be installed in either order. lxml only has to be present if you pass 'lxml' as the parser; otherwise the built-in html.parser is enough.



Document Parser Pros and Cons

The following table lists the main parsers, as well as their advantages and disadvantages:
[Table image not preserved: comparison of the html.parser, lxml, and html5lib parsers]
lxml is the recommended parser because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, lxml or html5lib is required, because the HTML parser built into the standard library of those versions is not reliable enough.
Tip: If an HTML or XML document is not well-formed, different parsers may return different results, so choose the parser that suits the documents you are actually dealing with.
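To see how parsers disagree on malformed markup, the sketch below feeds one broken fragment to each parser that happens to be installed and prints what each one builds. The helper name parse_with_each and the sample fragment are illustrative assumptions; parsers that are missing are simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN = "<a></p>"  # a stray closing tag with no matching opening tag


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized_tree} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The parser library is not installed, so skip it.
            continue
    return results


for name, tree in parse_with_each(BROKEN).items():
    print(f"{name:12} {tree}")
```

html.parser drops the stray tag and keeps the fragment as-is, while lxml and html5lib typically wrap the result in a full html/body skeleton, so the same find() call can behave differently depending on the parser.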

Use of bs4

quick start

The first step is to create a BS4 parse object, which is simple. The syntax is as follows:
1. Import the parsing package
from bs4 import BeautifulSoup
2. Create a BeautifulSoup parse object
soup = BeautifulSoup(html_doc, 'html.parser')
In the code above, html_doc is the document to parse and 'html.parser' is the parser used to parse it. The parser can also be 'lxml' or 'html5lib'.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# Create a soup object
soup = BeautifulSoup(html_doc, 'lxml')
print(soup, type(soup))
# Pretty-print the document
print(soup.prettify())
# Get the title tag: <title>The Dormouse's story</title>
print(soup.title)
# Get the name of the title tag: title
print(soup.title.name)
# Get the text inside the title tag: The Dormouse's story
print(soup.title.string)
# Get the first p tag
print(soup.p)

Object types in bs4

  • Tag: a tag in the HTML.

BeautifulSoup exposes each tag as a Tag object. The access pattern is soup.name, where name is the name of a tag in the document (for example soup.title); it returns the first tag with that name.

  • NavigableString: the text inside a tag.
  • BeautifulSoup: the object that represents the whole HTML document.

It can be used like a Tag object.

  • Comment: a special kind of NavigableString. If a tag contains an HTML comment, the comment markers are stripped and the comment text is kept.
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

'''
Tag : a tag
NavigableString : a navigable string
BeautifulSoup : the bs object
Comment : a comment
'''
soup = BeautifulSoup(html_doc, "html.parser")
# print(soup)
'''Tag: a tag'''
print(type(soup.title))
print(type(soup.p))
print(type(soup.a))

'''NavigableString: a navigable string'''
from bs4.element import NavigableString
print(type(soup.title.string))

'''BeautifulSoup: the bs object'''
soup = BeautifulSoup(html_doc, "html.parser")
print(type(soup))

'''Comment: a comment'''
html = "<b><!--Hello everyone, keep studying hard--></b>"
soup2 = BeautifulSoup(html, "html.parser")
print(soup2.b.string, type(soup2.b.string))
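Because Comment is a subclass of NavigableString, type checks need a little care when the two are mixed. The sketch below is one possible way to tell the node types apart and to read tag attributes; the fragment and the helper name kind are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment: one tag holding a comment, plain text, and a child tag.
snippet = '<p class="note" id="n1"><!--internal remark-->Visible text<b>bold</b></p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p


def kind(node):
    """Classify a node; Comment is tested first because it subclasses NavigableString."""
    if isinstance(node, Comment):
        return "Comment"
    if isinstance(node, NavigableString):
        return "NavigableString"
    if isinstance(node, Tag):
        return "Tag"
    return type(node).__name__


kinds = [kind(child) for child in p.contents]
print(kinds)  # ['Comment', 'NavigableString', 'Tag']

# Tag attributes behave like a dictionary; class is multi-valued, so it comes back as a list.
print(p.name, p["id"], p["class"], p.get("title"))

# Keep only the text a reader would see by skipping Comment nodes.
visible = [s for s in p.find_all(string=True) if not isinstance(s, Comment)]
print(visible)  # ['Visible text', 'bold']
```

Using p.get("title") instead of p["title"] returns None for a missing attribute rather than raising KeyError.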

Traverse the document tree

Traverse child nodes
  • contents returns a list of all direct child nodes (good to know)
  • children returns an iterator over the direct child nodes (good to know)
  • descendants returns a generator over all descendants (good to know)
  • string gets the text inside a tag; it is None if the tag has more than one child (must know)
  • strings returns a generator that yields the text of every nested tag (must know)
  • stripped_strings works like strings but strips surrounding whitespace and skips whitespace-only strings (must know)
Traverse parent nodes (good to know)
  • parent gets the direct parent node
  • parents iterates over all ancestor nodes
Traverse sibling nodes (good to know)
  • next_sibling the next sibling node
  • previous_sibling the previous sibling node
  • next_siblings all following sibling nodes
  • previous_siblings all preceding sibling nodes
from bs4 import BeautifulSoup

html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""
'''
How generators, iterators, and iterables relate to one another
'''
# Get the text of a single tag
soup = BeautifulSoup(html_doc, "lxml")
r1 = soup.title.string  # the text inside the tag
print(r1)

# Get the text of every tag in the html
r2 = soup.html.strings  # returns a generator that yields the text of each tag
print(r2)
for i in r2:
    print(i)

r3 = soup.html.stripped_strings  # much like strings, but strips the extra whitespace
print(r3)  # prints a generator object, not the text itself
for i in r3:
    print(i)
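The example above only exercises string, strings, and stripped_strings. As a rough sketch of the remaining attributes in the list (children, descendants, parents, and siblings), the snippet below walks a small made-up fragment. It is written without whitespace between tags on purpose, since whitespace between tags would otherwise appear as extra text nodes among the children and siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with no whitespace between tags.
doc = '<div id="box"><p id="a">one</p><p id="b">two</p><p id="c">three</p></div>'
soup = BeautifulSoup(doc, "html.parser")
div = soup.div
first, second, third = div.contents  # contents is a plain list

child_ids = [child["id"] for child in div.children]    # direct children only
descendant_count = len(list(div.descendants))          # tags and their strings
parent_name = second.parent.name
ancestor_names = [tag.name for tag in second.parents]
next_ids = [sib["id"] for sib in first.next_siblings]
previous_ids = [sib["id"] for sib in third.previous_siblings]

print(child_ids)         # ['a', 'b', 'c']
print(descendant_count)  # 6: three <p> tags plus three strings
print(parent_name)       # div
print(ancestor_names)    # ['div', '[document]']
print(first.next_sibling["id"], third.previous_sibling["id"])  # b b
print(next_ids)          # ['b', 'c']
print(previous_ids)      # ['b', 'a']
```

Note that previous_siblings walks backwards from the starting node, so the nearest sibling comes first.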

Search document tree

find()
  • The find() method returns the first match, or None if nothing matches
find_all()
  • The find_all() method returns every matching tag as a list
Example application
html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">职位名称</td>
            <td>职位类别</td>
            <td>人数</td>
            <td>地点</td>
            <td>发布时间</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>4</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>
"""
The examples below assume the table above has been parsed into a soup object:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
  1. Get all tr tags;
# 1 Get all tr tags
trs = soup.find_all("tr")  # find_all returns a list of matches
for tr in trs:
    print(tr)
    print("*" * 150)
  2. Get the second tr tag;
# 2 Get the second tr tag
tr = soup.find_all("tr")[1]
print(tr)
  3. Get all tr tags with class=even;
# 3 Get all tr tags with class=even
trs = soup.find_all("tr", class_="even")  # plain class cannot be used here because it is a Python keyword
# trs = soup.find_all("tr", attrs={"class": "even"})  # both forms work
for tr in trs:
    print(tr)
    print("*" * 150)
  4. Get the href attribute value of every a tag;
# 4 Get the href attribute of every a tag
a_li = soup.find_all("a")
for a in a_li:
    href = a.get("href")
    print(href)
  5. Get all job information.
# 5 Get the job title from every row (skip the header row)
trs = soup.find_all("tr")[1:]
for tr in trs:
    tds = tr.find_all("td")
    # print(tds)
    job_name = tds[0].string
    print(job_name)
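The examples above all use find_all(). For completeness, here is a small sketch of find() on a shortened, made-up version of the same table; the English cell values are placeholders, not data from the original page. It also shows the None result that find() gives when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the job table used above.
rows = """
<table class="tablelist">
  <tr class="h"><td>Title</td><td>City</td></tr>
  <tr class="even"><td><a href="detail.php?id=1">Backend engineer</a></td><td>Shenzhen</td></tr>
  <tr class="odd"><td><a id="test" href="detail.php?id=2">Ops engineer</a></td><td>Shenzhen</td></tr>
</table>
"""
soup = BeautifulSoup(rows, "html.parser")

first_row = soup.find("tr")                         # first match only
first_link = soup.find("a")
by_id = soup.find("a", id="test")                   # keyword arguments filter on attributes
by_attrs = soup.find("tr", attrs={"class": "odd"})
missing = soup.find("img")                          # there is no <img> in the table

print(first_row["class"])      # ['h']
print(first_link.get("href"))  # detail.php?id=1
print(by_id.string)            # Ops engineer
print(by_attrs.a["href"])      # detail.php?id=2
print(missing)                 # None

# limit= makes find_all stop early once enough matches are found.
limited = soup.find_all("tr", limit=2)
print(len(limited))            # 2

# Check for None before chaining, otherwise an AttributeError is raised.
src = missing.get("src") if missing is not None else None
```

Since find() can return None, it is usually worth checking the result before accessing attributes on it when scraping pages whose layout may change.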

select() method

We can also extract data with CSS selectors. Note that this requires familiarity with CSS selector syntax: https://www.w3school.com.cn/cssref/css_selectors.asp

from bs4 import BeautifulSoup

html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">职位名称</td>
            <td>职位类别</td>
            <td>人数</td>
            <td>地点</td>
            <td>发布时间</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>4</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>
"""
soup = BeautifulSoup(html, "lxml")

# Get all tr tags
# trs = soup.select("tr")
# for i in trs:
#     print(i)

# Get the second tr tag
# tr = soup.select("tr")[1]
# print(tr)

# Get all tr tags whose class is even
# trs = soup.select(".even")


# Get the href attribute of every a tag
# a_tags = soup.select("a")
# print(a_tags)
# for a in a_tags:
#     href = a.get("href")
#     print(href)


# Get all job information
trs = soup.select("tr")[1:]
print(trs)
for tr in trs:
    print(tr)
    print(list(tr.strings))
    info = list(tr.stripped_strings)[0]
    print(info)
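Beyond plain tag and class selectors, select() accepts richer CSS syntax, and select_one() returns only the first match. The sketch below tries a few selector forms on a shortened, made-up version of the table; the cell values are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the job table used above.
rows = """
<table class="tablelist">
  <tr class="h"><td class="l">Title</td><td>City</td></tr>
  <tr class="even"><td class="l square"><a href="detail.php?id=1">Backend engineer</a></td><td>Shenzhen</td></tr>
  <tr class="odd"><td class="l square"><a id="test" class="test" href="detail.php?id=2">Ops engineer</a></td><td>Shenzhen</td></tr>
</table>
"""
soup = BeautifulSoup(rows, "html.parser")

even_rows = soup.select("tr.even")                                   # tag + class
link_by_id = soup.select_one("#test")                                # id selector, first match only
titles = [a.get_text() for a in soup.select("td.square > a")]        # child combinator
hrefs = [a["href"] for a in soup.select('a[href^="detail.php"]')]    # attribute prefix match
cities = [td.get_text() for td in soup.select("tr:not(.h) > td:nth-of-type(2)")]
nothing = soup.select_one("img")                                     # None when nothing matches

print(len(even_rows))
print(link_by_id.get_text())
print(titles)
print(hrefs)
print(cities)
print(nothing)
```

select() always returns a list (possibly empty), whereas select_one() returns a single tag or None, mirroring the difference between find_all() and find().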

Modify document tree

  • Modify a tag's name and attributes
  • Assigning to the string attribute replaces the tag's existing content with the new content.
  • append() adds content to a tag, just like the .append() method of a Python list
  • decompose() removes a tag and everything inside it from the tree, which is handy for deleting unneeded paragraphs of an article.
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
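soup = BeautifulSoup(html_doc, "html.parser")
"""
● Modify a tag's name and attributes
● Assigning to string replaces the existing content with the new content
● append() adds content to a tag, just like the .append() method of a Python list
● decompose() removes a tag from the tree; use it to delete paragraphs that are not needed
"""
# Modify a tag's name and attributes
tag_p = soup.p
print(tag_p)
tag_p.name = "w"
tag_p["class"] = "content"
print(tag_p)


# Assigning to string replaces the existing content with the new content
tag_p = soup.p  # the first <p> was renamed to <w> above, so this is now the first "story" paragraph
print(tag_p.text)
tag_p.string = "you need python"
print(tag_p.text)

# append() adds content to a tag, just like the .append() method of a Python list
tag_p = soup.p
print(tag_p)
tag_p.append("真的C!")
print(tag_p)

# decompose() removes a tag from the tree; use it to delete content that is not needed
r = soup.title
print(r)
r.decompose()
print(soup)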
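As a sketch of how decompose() might be used to clean an article before extracting its text, the snippet below removes every script tag and every paragraph marked as an advertisement, then appends a new paragraph. The markup, the class name "ad", and the closing note are all invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical article: the class name "ad" and the <script> tag are made up.
article = (
    '<div id="post">'
    '<p>First paragraph.</p>'
    '<p class="ad">Buy now!</p>'
    '<script>track()</script>'
    '<p>Second paragraph.</p>'
    '</div>'
)
soup = BeautifulSoup(article, "html.parser")

# find_all returns a list that is separate from the tree,
# so tags can be removed while looping over it.
for unwanted in soup.find_all("script") + soup.find_all("p", class_="ad"):
    unwanted.decompose()

# Build a new tag with new_tag() and attach it with append().
note = soup.new_tag("p")
note.string = "End of article."
soup.div.append(note)

paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)  # ['First paragraph.', 'Second paragraph.', 'End of article.']
print(soup)
```

decompose() destroys the tag, so it suits content that is never needed again.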

csv module

What is csv?

CSV (Comma-Separated Values, sometimes called character-separated values because the delimiter does not have to be a comma) is a common plain-text format for storing tabular data such as numbers and text. Many programs run into CSV files when processing data. Python ships with the csv module, which is dedicated to reading and writing CSV files.

Use of csv module
Write to csv file

1. Create a writer object. It has two main methods: writerow writes a single row, and writerows writes several rows at once.
2. Use DictWriter to write rows supplied as dictionaries.

Read csv file

1. Each row read through reader() is a list; an individual value is obtained by index.
2. Each row read through DictReader() is a dictionary; a value is obtained by key (the column name).

csv file operation application
"""Writing a csv file"""
import csv

persons = [('岳岳', 20, 175), ('月月', 22, 178), ('张三', 20, 175)]
headers = ('name', 'age', 'height')
with open('persons.csv', mode='w', encoding='utf-8', newline="") as f:
    writer = csv.writer(f)  # create the writer object
    writer.writerow(headers)  # write the header row
    for i in persons:
        writer.writerow(i)  # write each tuple as one row


# DictWriter: write data supplied as dictionaries
import csv

persons = [
    {'name': '岳岳', 'age': 18, 'gender': 'male'},
    {'name': '岳岳2', 'age': 18, 'gender': 'male'},
    {'name': '岳岳3', 'age': 18, 'gender': 'male'}
]

headers = ('name', 'age', 'gender')
with open('person2.csv', mode='w', encoding='utf-8', newline="") as f:
    writer = csv.DictWriter(f, headers)
    writer.writeheader()  # write the header row
    writer.writerows(persons)

"""Reading a csv file"""
# Method 1
import csv
with open('persons.csv', mode='r', encoding='utf-8', newline="") as f:
    reader = csv.reader(f)
    print(reader)   # <_csv.reader object at 0x0000021D7424D5F8>
    for i in reader:
        print(i)

# Method 2
import csv
with open('person2.csv', mode='r', encoding='utf-8', newline="") as f:
    reader = csv.DictReader(f)
    print(reader)  # a csv.DictReader object
    for i in reader:
        # print(i)
        for j, k in i.items():
            print(j, k)
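One detail that is easy to miss in the reading examples is that csv always returns strings, even for values that were written as numbers. The sketch below round-trips a couple of made-up rows through an in-memory buffer (io.StringIO stands in for a file on disk) to show this, along with the automatic quoting of a field that contains a comma.

```python
import csv
import io

rows = [{"name": "Ann", "age": 20}, {"name": "Bo, Jr.", "age": 22}]  # made-up sample data

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "age"])
writer.writeheader()
writer.writerows(rows)
text = buffer.getvalue()
print(repr(text))  # the field containing a comma is wrapped in quotes

# Read the same text back: every value is now a string.
read_back = list(csv.DictReader(io.StringIO(text, newline="")))
print(read_back)

# Convert explicitly when numbers are needed.
ages = [int(row["age"]) for row in read_back]
print(ages)  # [20, 22]
```

Passing newline="" mirrors the open(..., newline="") calls above, which lets the csv module control line endings itself.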

bs4 example application

from bs4 import BeautifulSoup
import requests
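import csv

"""
Target url = "http://www.weather.com.cn/textFC/hb.shtml"
Goal: scrape the temperature (daily low) of every city in the country and save it to a csv file
Storage format: [{"city":"北京","temp":"5℃"},{"xxx":"xxx","xxx":"xxx"},.....]
Tools involved: requests, csv, bs4

Approach and page analysis:
1 Fetch the page source and create a soup object
2 Parse the data to extract the target values
    2.1 First find the div tag with class = 'conMidtab' that wraps the page content
    2.2 Then find the table tag of each province or municipality under it
    2.3 Filter the tables: find all tr tags under each table, taking care to skip the first 2 tr tags
    2.4 Then find all td tags in each tr (index 0 is the city, the second-to-last is the temperature)
3 Store the extracted data
"""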


# Fetch the page source and parse the data
def getscroce(every_url):
    # Target url
    # url = "http://www.weather.com.cn/textFC/hb.shtml"
    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    response = requests.get(every_url, headers=headers)
    response.encoding = 'utf-8'
    # The page source
    html = response.text

    # Parse the page source
    # 1 Create a soup object
    soup = BeautifulSoup(html, 'html5lib')
    # print(soup)

    # 2 First find the div tag with class = 'conMidtab'
    conMidtab = soup.find('div', class_='conMidtab')
    # print(conMidtab)

    # 3 Then find the table tag of each province or municipality under it
    tables = conMidtab.find_all('table')
    # print(tables)

    # 4 Filter the tables: find all tr tags under each table (skip the first 2 tr tags)

    # List that collects the dictionaries to be written to csv
    templist = []

    for table in tables:
        trs = table.find_all('tr')[2:]
        # print(trs)
        for index, tr in enumerate(trs):
            # print(index,tr)
            # Find all td tags in the tr (index 0 is the city, the second-to-last is the temperature)
            tds = tr.find_all('td')
            # print(tds)
            # The td tag that holds the city
            city_td = tds[0]
            if index == 0:
                city_td = tds[1]
            # print(city_td)

            # Dictionary that stores the city and its temperature
            tempdict = {}

            # City text
            city = list(city_td.stripped_strings)[0]
            # print(city)
            # Lowest temperature
            temp_td = tds[-2]
            temp = list(temp_td.stripped_strings)[0]
            # print(temp)

            tempdict['city'] = city
            tempdict['temp'] = temp
            # Append the dictionary to the list
            templist.append(tempdict)
    # print(templist)  # printing revealed entries such as {'city': '河北', 'temp': '20'}, which is not a real city
    '''
        For a municipality, taking td 0 is fine and all the data comes out correctly.
        For a province, td 0 cannot be used (it holds the province name), so td 1 is needed. But taking td 1 for every row would break the other cities, because for them the city is in td 0.
        So all that is needed is a check that decides when to take td 0 and when to take td 1.
    '''
    # Return the collected data so the next step can store it
    return templist

# Save the parsed data
def writeData(alltemplist):
    header = ('city', 'temp')
    with open('weather.csv', mode='w', encoding='utf-8', newline='') as f:
        # Create the writer object
        writer = csv.DictWriter(f, header)
        # Write the header row
        writer.writeheader()
        # Write the data
        writer.writerows(alltemplist)

# Main function that runs the steps in order
def main():
    # List that holds the temperatures of cities nationwide
    alltemplist = []
    model_url = "http://www.weather.com.cn/textFC/{}.shtml"
    # Keys used to build the urls of the eight regions

    urlkey_list = ["hb", "db", "hd", "hz", "hn", "xb", "xn", "gat"]
    for i in urlkey_list:
        every_url = model_url.format(i)
        print(every_url)
        # templist = getscroce()  # no longer used
        alltemplist += getscroce(every_url)
    # print(templist)
    # Pass the collected data on to be saved as csv
    writeData(alltemplist)

    # enumerate() turns an iterable (such as a list, tuple, or string) into a sequence of (index, item) pairs; it is typically used in a for loop.
    # for i,j in enumerate(range(10)):
    #     print(i,j)


if __name__ == '__main__':
    main()
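The row-index check in the parsing function is easier to follow on a small offline fixture. The sketch below applies the same logic to a made-up, heavily simplified table, so it can run without a network request; the function name parse_low_temps, the fixture, and the use of html.parser are assumptions for illustration, and the real page's markup may differ. It uses get_text(strip=True) in place of list(stripped_strings)[0] and returns an empty list if the container div is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one province table on the page.
# Two header rows come first; the first data row carries an extra leading
# cell for the province (spanning the rows below), so its city is at index 1.
SAMPLE = """
<div class="conMidtab"><table>
<tr><td>Province</td><td>City</td><td>High</td><td>Low</td><td>Detail</td></tr>
<tr><td>-</td><td>-</td><td>C</td><td>C</td><td>-</td></tr>
<tr><td rowspan="2">ProvinceA</td><td>CityOne</td><td>20</td><td>11</td><td>link</td></tr>
<tr><td>CityTwo</td><td>18</td><td>9</td><td>link</td></tr>
</table></div>
"""


def parse_low_temps(html, parser="html.parser"):
    """Return [{'city': ..., 'temp': ...}] for every data row in the fixture layout."""
    soup = BeautifulSoup(html, parser)
    container = soup.find("div", class_="conMidtab")
    if container is None:  # layout changed or the request was blocked
        return []
    results = []
    for table in container.find_all("table"):
        # Skip the two header rows, as in the original function.
        for index, tr in enumerate(table.find_all("tr")[2:]):
            tds = tr.find_all("td")
            # Only the first data row has the extra province cell.
            city_td = tds[1] if index == 0 else tds[0]
            results.append({
                "city": city_td.get_text(strip=True),
                "temp": tds[-2].get_text(strip=True),
            })
    return results


parsed = parse_low_temps(SAMPLE)
print(parsed)
```

Keeping the parsing step separate from the HTTP request in this way makes it possible to check the indexing logic against saved HTML before running the crawler against the live site.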

Origin blog.csdn.net/qiao_yue/article/details/135051491