[python] Crawler notes (3): Data parsing with regular expressions

Focused crawler

A focused crawler extracts only the specified content from a page.
Coding workflow:
Specify the URL -> send the request -> get the response data -> parse the data -> persist to storage
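
The five steps above can be sketched as one small function per stage. This is a minimal sketch, not the post's own code: `fetch_page`, `parse_links`, `save_links`, and the sample HTML are hypothetical names. The request step is stubbed with a canned string so the example runs without network access; in a real crawler it would be something like `requests.get(url, headers=headers).text`.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical page content standing in for a real HTTP response.
SAMPLE_HTML = '<a href="/page/1">one</a><a href="/page/2">two</a>'


def fetch_page(url):
    """Steps 1-3: specify the URL, send the request, get the response text.

    Stubbed for illustration; a real crawler would call requests.get here.
    """
    return SAMPLE_HTML


def parse_links(page_text):
    """Step 4: parse the data, keeping only the href values."""
    return re.findall(r'href="(.*?)"', page_text)


def save_links(links, path):
    """Step 5: persist the parsed data to disk, one link per line."""
    Path(path).write_text("\n".join(links), encoding="utf-8")


out_file = Path(tempfile.mkdtemp()) / "links.txt"
links = parse_links(fetch_page("https://example.com/"))
save_links(links, out_file)
print(links)
```

Keeping the stages separate means the parsing step can be swapped for bs4 or xpath without touching the request or storage code.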

Types of data parsing

  • Regular expressions (re)
  • bs4
  • xpath

Principles of data parsing
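
The figure for this section is not available, so the following is an assumption about the idea it illustrated. The data a focused crawler wants usually sits either between a pair of tags or inside a tag attribute, so parsing comes down to two steps: locate the tag, then extract its text or attribute value. A minimal sketch with the standard `re` module, using a made-up HTML fragment:

```python
import re

# Hypothetical fragment; the tag names and values are for illustration only.
html = '<div class="item"><a href="/detail/7" title="Example">Read more</a></div>'

# Step 1: locate the tag of interest (here, the <a> tag).
# Step 2: extract the values stored in its attributes and between its tags.
match = re.search(r'<a href="(.*?)" title="(.*?)">(.*?)</a>', html)
href, title, text = match.groups()
print(href, title, text)
```

bs4 and xpath follow the same two steps; they differ only in how the tag is located.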

import re
import requests
import os
# Goal: download every image in the 糗图 (funny pictures) section of Qiushibaike (糗事百科)
# Fetch the whole page first, then parse it
ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4209.400"
url = "https://www.qiushibaike.com/imgrank/"
if __name__ == "__main__":
    # Create the folder that will hold the images, if it does not exist yet
    if not os.path.exists('./糗图'):
        os.mkdir('./糗图')

    headers = {
        'User-Agent': ua
    }
    # General-purpose crawl: fetch the whole page
    page_text = requests.get(url=url, headers=headers).text

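    # Focused crawl: extract the image links
    # Regular expression that captures the src of the image inside each "thumb" div
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    # re.S lets "." match newlines, so the pattern can match across line breaks
    img_src_list = re.findall(ex, page_text, re.S)
    print(img_src_list)
    for src in img_src_list:
        # The extracted src has no scheme, so prepend one to get a full image URL
        src = 'https:' + src
        img = requests.get(url=src, headers=headers).content
        # Use the last path segment as the file name
        name = src.split('/')[-1]
        img_path = './糗图/' + name  # final path of the file
        with open(img_path, 'wb') as f:
            f.write(img)
        print(name, 'downloaded successfully!')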
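
The regular expression in the script can be tried offline before it is pointed at the live site. The sketch below runs the same pattern against a hypothetical fragment modelled on what the pattern expects (the host `pic.example.com` and the file names are made up). It shows why the `re.S` flag matters when tags span several lines. As an alternative to concatenating `'https:'` by hand, it uses `urllib.parse.urljoin` from the standard library, which resolves a scheme-less `//host/path` link against the page URL.

```python
import re
from urllib.parse import urljoin

# Hypothetical markup; each <div> spans several lines.
page_text = '''
<div class="thumb">
<a href="/article/1" target="_blank">
<img src="//pic.example.com/system/pictures/1/medium/a.jpg" alt="first">
</a>
</div>
<div class="thumb">
<a href="/article/2" target="_blank">
<img src="//pic.example.com/system/pictures/2/medium/b.jpg" alt="second">
</a>
</div>
'''

ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'

# Without re.S, "." stops at newlines, so nothing matches this multi-line markup.
without_flag = re.findall(ex, page_text)
# With re.S, "." also matches newlines and each div yields one src.
with_flag = re.findall(ex, page_text, re.S)

# Resolve the scheme-less links against the page URL.
base = "https://www.qiushibaike.com/imgrank/"
full_urls = [urljoin(base, src) for src in with_flag]
# The last path segment serves as the file name, as in the script.
names = [u.split('/')[-1] for u in full_urls]

print(without_flag)
print(full_urls)
print(names)
```

The non-greedy `.*?` is what keeps each match inside a single div; a greedy `.*` with `re.S` would run from the first div to the last `</div>` on the page.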


Origin blog.csdn.net/Sgmple/article/details/112059102