Web Scraping: Parsing Page Data with bs4

Definition

  • Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise parsing and extracting data from HTML/XML documents
  • Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object; all objects fall into four types (Tag, NavigableString, BeautifulSoup, and Comment)
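
A minimal sketch of the four object types, assuming bs4 and the lxml parser are installed (the one-line documents here are purely illustrative):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="x"><!--hi--></b>', 'lxml')
print(type(soup))            # <class 'bs4.BeautifulSoup'>
print(type(soup.b))          # <class 'bs4.element.Tag'>
print(type(soup.b.string))   # <class 'bs4.element.Comment'>
soup2 = BeautifulSoup('<i>text</i>', 'lxml')
print(type(soup2.i.string))  # <class 'bs4.element.NavigableString'>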

Sample data

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>soup测试</title>
    <title class="warm">你那温情的一笑,搞得我瑟瑟发抖</title>
</head>
<body>
<div class="tang">
    <ul>
        <li class="hello" id="world"><a href="http://www.baidu.com" title="出塞"><!--秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山--></a></li>
        <list><a href="https://www.baidu.com" title="出塞" style="font-weight: bold"><!--秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山--></a></list>
        <li><a href="http://www.163.com" class="taohua" title="huahua">人面不知何处去,桃花依旧笑春风</a></li>
        <lists class="hello"><a href="http://mi.com" id="hong" title="huahua">去年今日此门中,人面桃花相映红</a></lists>
        <li id="wo"><a href="http://qq.com" name="he" id="gu">故人西辞黄鹤楼,烟花三月下扬州</a></li>
    </ul>
    <ul>
        <li class="hello" id="sf"><a href="http://www.baidu.com" title="出塞"><!--秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山--></a></li>
        <list><a href="https://www.baidu.com" title="出塞"><!--秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山--></a></list>
        <li><a href="http://www.163.com" class="taohua">人面不知何处去,桃花依旧笑春风</a></li>
        <lists class="hello"><a href="http://mi.com" id="fhsf">去年今日此门中,人面桃花相映红,不知桃花何处去,出门依旧笑楚风</a></lists>
        <li id="fs"><a href="http://qq.com" name="he" id="gufds">故人西辞黄鹤楼,烟花三月下扬州</a></li>
    </ul>
</div>
<div id="meng">
    <p class="jiang">
        <span>三国猛将</span>
    <ol>
        <li>关羽</li>
        <li>张飞</li>
        <li>赵云</li>
        <li>马超</li>
        <li>黄忠</li>
    </ol>
    <div class="cao">
        <ul>
            <li>典韦</li>
            <li>许褚</li>
            <li>张辽</li>
            <li>张郃</li>
            <li>于禁</li>
            <li>夏侯惇</li>
        </ul>
    </div>
    </p>
</div>
</body>
</html>

Traversing the document tree

from bs4 import BeautifulSoup

# html_doc is assumed to hold the sample HTML above
soup = BeautifulSoup(html_doc, 'lxml')

print(len(soup.body.contents), soup.body.contents)  # list of direct children
print(soup.body.children)     # iterator over direct children
print(soup.body.descendants)  # generator over all descendants
for node in soup.body.descendants:
    print(node)
  • Direct children: the .contents and .children attributes (see the sketch after this list)
    • Strings count as nodes too, e.g. the '\n' between tags
  • All descendants: the .descendants attribute
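
The practical difference, in a short sketch against the soup built above: .contents is a list, while .children and .descendants are lazy iterators, and all of them include the bare '\n' string nodes between tags, which you often want to filter out:

from bs4.element import Tag

ul = soup.find('ul')
print(type(ul.contents))  # <class 'list'>: direct children, '\n' strings included
# keep only element nodes, dropping the whitespace-only string nodes
for child in ul.children:
    if isinstance(child, Tag):
        print(child.name)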

Searching the document tree

  • find

    print(soup.find('a'))                             # first <a> in the document

    print(soup.find('a', title='出塞'))               # filter by the title attribute
    print(soup.find('a', id='hong'))                  # filter by id
    print(soup.find('a', class_='taohua'))            # class is a reserved word, so class_
    print(soup.find('a', href='http://www.163.com'))  # filter by href
    
    • keyword arguments

      find() returns a single object: the first match, or None if nothing matches
      find('a')                # only the first <a> tag
      find('a', title='出塞')  # filter by an attribute value
      soup.find(name='div', attrs={'class': 'tang'})

      div = soup.find(name='div', class_='tang')
      Note: the HTML name attribute cannot be used as a keyword filter, because
      name is already find()'s first parameter (the tag name); pass it through
      attrs instead, as in the sketch below.
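
      A minimal sketch of the attrs workaround, run against the sample HTML above (one <a> there carries name="he"):

      # find('a', name='he') raises a TypeError, since 'a' already fills the
      # name parameter; the HTML name attribute has to go through attrs
      print(soup.find('a', attrs={'name': 'he'}))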
      
  • find_all

    import re

    print(soup.find_all('a', title=re.compile(r'.*')))  # title matching a regex
    print(soup.find_all('a', id=['hong', 'gu']))        # id equal to any listed value

    print(soup.find_all('a'))            # every <a> tag
    print(soup.find_all(['a', 'span']))  # several tag names at once
    print(soup.find_all('a', limit=2))   # stop after the first two matches
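
    find_all also accepts a function as its filter; a small sketch, against the same soup, that keeps only <a> tags carrying both an href and an id:

    def has_href_and_id(tag):
        # called once per tag in the tree; return True to keep the tag
        return tag.name == 'a' and tag.has_attr('href') and tag.has_attr('id')

    print(soup.find_all(has_href_and_id))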
    

CSS selectors

  • This is another lookup method, similar in effect to find_all

    • When writing CSS, tag names get no prefix, class names are prefixed with a dot (.), and id names with a hash (#)
    • Here we can filter elements the same way: the method is soup.select(), and the return type is a list
  • Lookup styles

    • Data

      html = """
      <html><head><title>The Dormouse's story</title></head>
      <body>
      <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
      <p class="story">Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
      and they lived at the bottom of a well.</p>
      <p class="story">...</p>
      """
      
    • Find by tag name

      # find by tag name
      print(soup.select('title'))
      print(soup.select('a'))
      print(soup.select('b'))
      
    • Find by class name

      # find by class name
      print(soup.select('.sister'))
      
    • Find by id

      # find by id
      print(soup.select('#link1'))
      
    • Combined lookup

      # combined lookup: tag and id joined with no space in between
      print(soup.select('a#link2'))

      # direct-child lookup uses the > separator (spaces around > are optional)
      print(soup.select('head > title'))

      # a comma separates several independent selectors, querying them all at once
      print(soup.select('div > ul > li, list'))
      
    • Attribute lookup

      # attributes can also be part of a lookup; wrap them in square brackets. Note that an attribute and its tag belong to the same node, so no space may separate them, otherwise nothing will match.

      # attribute lookup
      print(soup.select('a[class="sister"]'))

      print(soup.select('a[href="http://example.com/elsie"]'))


      # attributes combine with the lookup styles above in the same way: parts on different nodes are separated by a space, parts on the same node are not

      print(soup.select('p a[href="http://example.com/elsie"]'))

      # the :contains pseudo-class is not supported (at least not by older versions of select())
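
      Standard CSS attribute operators do work, though; a small sketch against the Dormouse soup above, matching href values by prefix and by substring:

      # ^= matches an attribute-value prefix, *= matches a substring
      print(soup.select('a[href^="http://example.com/"]'))  # all three sister links
      print(soup.select('a[href*="lacie"]'))                # only Lacie's link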
      
    • Getting content with get_text()

      # the select calls above all return lists; iterate over the results and call get_text() to extract each element's text content.
      
      print(type(soup.select('title')))
      print(soup.select('title')[0].get_text())
      
      for title in soup.select('title'):
          print(title.get_text())
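
      For comparison, a short sketch of the related text accessors on the same soup: .string returns a tag's single child string, or None when the tag has mixed children, while .stripped_strings yields each text fragment with surrounding whitespace trimmed:

      story = soup.select('p.story')[0]
      print(story.string)       # None: this <p> mixes text and <a> children
      print(story.get_text())   # all of the text, concatenated
      for fragment in story.stripped_strings:
          print(fragment)       # one stripped text node per line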
      

A bs4 example

import requests
from bs4 import BeautifulSoup

url = 'http://quanben.kanshu.com/fulllist_0_C_0_B_4.html'

if __name__ == '__main__':
    # a minimal header set; a User-Agent is usually enough for this request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
    }
    response = requests.get(url=url, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    # each book sits in an <li> under <ul id="qb_ul">
    books = soup.select('ul[id="qb_ul"] > li')
    print(books)
    print(len(books))
    # mode='w' so the header line is not duplicated on repeated runs
    fp = open('./books.txt', mode='w', encoding='utf-8')
    fp.write('%s,%s,%s,%s,%s\n' % ('rank', 'title', 'category', 'author', 'published'))
    for book in books[1:]:  # the first <li> is skipped (presumably a header row)
        rank = book.find('span', class_='sp_01').get_text()
        book_name = book.select('span[class="sp_02"] > a')[0].string
        category = book.find('span', class_='sp_03').find('a').string
        author = book.select('span[class="sp_04"] a')[0].get_text()
        time = book.select('span[class="sp_08"]')[0].get_text()
        print(rank, book_name, category, author, time)
        fp.write('%s,%s,%s,%s,%s\n' % (rank, book_name, category, author, time))

    fp.close()
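
Because book titles can themselves contain commas, writing real CSV with the standard csv module is safer than joining fields with '%s,%s,...'; a sketch under the same assumptions about the page's span classes:

import csv

with open('./books.csv', mode='w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)  # handles quoting of embedded commas automatically
    writer.writerow(['rank', 'title', 'category', 'author', 'published'])
    for book in books[1:]:
        writer.writerow([
            book.find('span', class_='sp_01').get_text(),
            book.select('span[class="sp_02"] > a')[0].get_text(),
            book.find('span', class_='sp_03').find('a').get_text(),
            book.select('span[class="sp_04"] a')[0].get_text(),
            book.select('span[class="sp_08"]')[0].get_text(),
        ])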

Reposted from blog.csdn.net/qq_42546127/article/details/106401802