Web Scraping: Parsing Page Data with bs4

Definition

  • Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise parsing and extracting data from HTML/XML documents
  • Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object; all objects fall into four types (Tag, NavigableString, BeautifulSoup, and Comment)
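
A minimal sketch of the four object types, assuming bs4 and the lxml parser are installed (the one-line documents here are purely illustrative):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="x"><!--hi--></b>', 'lxml')
print(type(soup))            # <class 'bs4.BeautifulSoup'>
print(type(soup.b))          # <class 'bs4.element.Tag'>
print(type(soup.b.string))   # <class 'bs4.element.Comment'>
soup2 = BeautifulSoup('<i>text</i>', 'lxml')
print(type(soup2.i.string))  # <class 'bs4.element.NavigableString'>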

Sample data

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>soup测试</title>
    <title class="warm">你那温情的一笑,搞得我瑟瑟发抖</title>
</head>
<body>
<div class="tang">
    <ul>
        <li class="hello" id="world"><a href="http://www.baidu.com" title="出塞"><!--秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山--></a></li>
        <list><a href="https://www.baidu.com" title="出塞" style="font-weight: bold"><!--秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山--></a></list>
        <li><a href="http://www.163.com" class="taohua" title="huahua">人面不知何处去,桃花依旧笑春风</a></li>
        <lists class="hello"><a href="http://mi.com" id="hong" title="huahua">去年今日此门中,人面桃花相映红</a></lists>
        <li id="wo"><a href="http://qq.com" name="he" id="gu">故人西辞黄鹤楼,烟花三月下扬州</a></li>
    </ul>
    <ul>
        <li class="hello" id="sf"><a href="http://www.baidu.com" title="出塞"><!--秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山--></a></li>
        <list><a href="https://www.baidu.com" title="出塞"><!--秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山--></a></list>
        <li><a href="http://www.163.com" class="taohua">人面不知何处去,桃花依旧笑春风</a></li>
        <lists class="hello"><a href="http://mi.com" id="fhsf">去年今日此门中,人面桃花相映红,不知桃花何处去,出门依旧笑楚风</a></lists>
        <li id="fs"><a href="http://qq.com" name="he" id="gufds">故人西辞黄鹤楼,烟花三月下扬州</a></li>
    </ul>
</div>
<div id="meng">
    <p class="jiang">
        <span>三国猛将</span>
    <ol>
        <li>关羽</li>
        <li>张飞</li>
        <li>赵云</li>
        <li>马超</li>
        <li>黄忠</li>
    </ol>
    <div class="cao">
        <ul>
            <li>典韦</li>
            <li>许褚</li>
            <li>张辽</li>
            <li>张郃</li>
            <li>于禁</li>
            <li>夏侯惇</li>
        </ul>
    </div>
    </p>
</div>
</body>
</html>

Traversing the document tree

from bs4 import BeautifulSoup

# html_doc is assumed to hold the sample HTML above
soup = BeautifulSoup(html_doc, 'lxml')

print(len(soup.body.contents), soup.body.contents)  # list of direct children
print(soup.body.children)     # iterator over direct children
print(soup.body.descendants)  # generator over all descendants
for node in soup.body.descendants:
    print(node)
  • Direct children: the .contents and .children attributes (see the sketch after this list)
    • Strings count as nodes too, e.g. the '\n' between tags
  • All descendants: the .descendants attribute
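
The practical difference, in a short sketch against the soup built above: .contents is a list, while .children and .descendants are lazy iterators, and all of them include the bare '\n' string nodes between tags, which you often want to filter out:

from bs4.element import Tag

ul = soup.find('ul')
print(type(ul.contents))  # <class 'list'>: direct children, '\n' strings included
# keep only element nodes, dropping the whitespace-only string nodes
for child in ul.children:
    if isinstance(child, Tag):
        print(child.name)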

Searching the document tree

  • find

    print(soup.find('a'))                             # first <a> in the document

    print(soup.find('a', title='出塞'))               # filter by the title attribute
    print(soup.find('a', id='hong'))                  # filter by id
    print(soup.find('a', class_='taohua'))            # class is a reserved word, so class_
    print(soup.find('a', href='http://www.163.com'))  # filter by href
    
    • keyword arguments

      find() returns a single object: the first match, or None if nothing matches
      find('a')                # only the first <a> tag
      find('a', title='出塞')  # filter by an attribute value
      soup.find(name='div', attrs={'class': 'tang'})

      div = soup.find(name='div', class_='tang')
      Note: the HTML name attribute cannot be used as a keyword filter, because
      name is already find()'s first parameter (the tag name); pass it through
      attrs instead, as in the sketch below.
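
      A minimal sketch of the attrs workaround, run against the sample HTML above (one <a> there carries name="he"):

      # find('a', name='he') raises a TypeError, since 'a' already fills the
      # name parameter; the HTML name attribute has to go through attrs
      print(soup.find('a', attrs={'name': 'he'}))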
      
  • find_all

    import re

    print(soup.find_all('a', title=re.compile(r'.*')))  # title matching a regex
    print(soup.find_all('a', id=['hong', 'gu']))        # id equal to any listed value

    print(soup.find_all('a'))            # every <a> tag
    print(soup.find_all(['a', 'span']))  # several tag names at once
    print(soup.find_all('a', limit=2))   # stop after the first two matches
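
    find_all also accepts a function as its filter; a small sketch, against the same soup, that keeps only <a> tags carrying both an href and an id:

    def has_href_and_id(tag):
        # called once per tag in the tree; return True to keep the tag
        return tag.name == 'a' and tag.has_attr('href') and tag.has_attr('id')

    print(soup.find_all(has_href_and_id))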
    

CSS selectors

  • This is another lookup method, similar in effect to find_all

    • When writing CSS, tag names get no prefix, class names are prefixed with a dot (.), and id names with a hash (#)
    • Here we can filter elements the same way: the method is soup.select(), and the return type is a list
  • Lookup styles

    • Data

      html = """
      <html><head><title>The Dormouse's story</title></head>
      <body>
      <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
      <p class="story">Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
      and they lived at the bottom of a well.</p>
      <p class="story">...</p>
      """
      
    • Find by tag name

      # find by tag name
      print(soup.select('title'))
      print(soup.select('a'))
      print(soup.select('b'))
      
    • Find by class name

      # find by class name
      print(soup.select('.sister'))
      
    • Find by id

      # find by id
      print(soup.select('#link1'))
      
    • Combined lookup

      # combined lookup: tag and id joined with no space in between
      print(soup.select('a#link2'))

      # direct-child lookup uses the > separator (spaces around > are optional)
      print(soup.select('head > title'))

      # a comma separates several independent selectors, querying them all at once
      print(soup.select('div > ul > li, list'))
      
    • Attribute lookup

      # attributes can also be part of a lookup; wrap them in square brackets. Note that an attribute and its tag belong to the same node, so no space may separate them, otherwise nothing will match.

      # attribute lookup
      print(soup.select('a[class="sister"]'))

      print(soup.select('a[href="http://example.com/elsie"]'))


      # attributes combine with the lookup styles above in the same way: parts on different nodes are separated by a space, parts on the same node are not

      print(soup.select('p a[href="http://example.com/elsie"]'))

      # the :contains pseudo-class is not supported (at least not by older versions of select())
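
      Standard CSS attribute operators do work, though; a small sketch against the Dormouse soup above, matching href values by prefix and by substring:

      # ^= matches an attribute-value prefix, *= matches a substring
      print(soup.select('a[href^="http://example.com/"]'))  # all three sister links
      print(soup.select('a[href*="lacie"]'))                # only Lacie's link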
      
    • Getting content with get_text()

      # the select calls above all return lists; iterate over the results and call get_text() to extract each element's text content.
      
      print(type(soup.select('title')))
      print(soup.select('title')[0].get_text())
      
      for title in soup.select('title'):
          print(title.get_text())
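
      For comparison, a short sketch of the related text accessors on the same soup: .string returns a tag's single child string, or None when the tag has mixed children, while .stripped_strings yields each text fragment with surrounding whitespace trimmed:

      story = soup.select('p.story')[0]
      print(story.string)       # None: this <p> mixes text and <a> children
      print(story.get_text())   # all of the text, concatenated
      for fragment in story.stripped_strings:
          print(fragment)       # one stripped text node per line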
      

A bs4 example

import requests
from bs4 import BeautifulSoup

url = 'http://quanben.kanshu.com/fulllist_0_C_0_B_4.html'

if __name__ == '__main__':
    # a minimal header set; a User-Agent is usually enough for this request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
    }
    response = requests.get(url=url, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    # each book sits in an <li> under <ul id="qb_ul">
    books = soup.select('ul[id="qb_ul"] > li')
    print(books)
    print(len(books))
    # mode='w' so the header line is not duplicated on repeated runs
    fp = open('./books.txt', mode='w', encoding='utf-8')
    fp.write('%s,%s,%s,%s,%s\n' % ('rank', 'title', 'category', 'author', 'published'))
    for book in books[1:]:  # the first <li> is skipped (presumably a header row)
        rank = book.find('span', class_='sp_01').get_text()
        book_name = book.select('span[class="sp_02"] > a')[0].string
        category = book.find('span', class_='sp_03').find('a').string
        author = book.select('span[class="sp_04"] a')[0].get_text()
        time = book.select('span[class="sp_08"]')[0].get_text()
        print(rank, book_name, category, author, time)
        fp.write('%s,%s,%s,%s,%s\n' % (rank, book_name, category, author, time))

    fp.close()
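
Because book titles can themselves contain commas, writing real CSV with the standard csv module is safer than joining fields with '%s,%s,...'; a sketch under the same assumptions about the page's span classes:

import csv

with open('./books.csv', mode='w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)  # handles quoting of embedded commas automatically
    writer.writerow(['rank', 'title', 'category', 'author', 'published'])
    for book in books[1:]:
        writer.writerow([
            book.find('span', class_='sp_01').get_text(),
            book.select('span[class="sp_02"] > a')[0].get_text(),
            book.find('span', class_='sp_03').find('a').get_text(),
            book.select('span[class="sp_04"] a')[0].get_text(),
            book.select('span[class="sp_08"]')[0].get_text(),
        ])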

Reposted from blog.csdn.net/qq_42546127/article/details/106401802