Data parsing study notes (regex parsing, bs4 parsing, xpath parsing)

Focused crawler: crawls only the specified content within a page.
- Coding workflow (a minimal sketch follows this list):
    - Specify the URL
    - Initiate the request
    - Get the response data
    - Parse the data
    - Persist the data
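
A minimal sketch of these five steps (the URL, regex, and output file are placeholders chosen purely for illustration):

import re
import requests

# 1. Specify the URL (placeholder)
url = 'https://example.com/'
# 2. Initiate the request (with UA spoofing, as in all the projects below)
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url=url, headers=headers)
# 3. Get the response data
page_text = response.text
# 4. Parse the data (here: pull the page title out with a regex)
titles = re.findall(r'<title>(.*?)</title>', page_text, re.S)
# 5. Persist the result
with open('./result.txt', 'w', encoding='utf-8') as fp:
    fp.write(titles[0] if titles else '')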

Data parsing methods:

  1. Regex
  2. bs4
  3. xpath (***)

Summary of the data parsing principle:
- The text content to be parsed is stored either between tags or in the attributes of those tags
- 1. Locate the specified tags
- 2. Extract the data stored between the tags or in their attributes (this is the parsing step)

One, regex parsing

Review of commonly used regular expressions (the ones used below: '.' matches any character, '*' means zero or more, a trailing '?' makes the match non-greedy, '(...)' captures a group, and the re.S flag lets '.' match newlines):

<div class="thumb">

<a href="/article/121721100" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12172/121721100/medium/DNXDX9TZ8SDU6OK2.jpg" alt="指引我有前进的方向">
</a>

</div>

ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
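
A quick, self-contained check of this regex against the snippet above (the HTML is just the fragment quoted in these notes, embedded as a string):

import re

html = '''
<div class="thumb">
<a href="/article/121721100" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12172/121721100/medium/DNXDX9TZ8SDU6OK2.jpg" alt="指引我有前进的方向">
</a>
</div>
'''

ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
# re.S lets '.' match newlines, so the pattern can span the whole <div> block
print(re.findall(ex, html, re.S))
# ['//pic.qiushibaike.com/system/pictures/12172/121721100/medium/DNXDX9TZ8SDU6OK2.jpg']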

Project requirement: crawl the trending images from the specified pages of Qiushibaike (糗事百科) and save them to a designated folder


import requests
import re
import os

if __name__ == '__main__':
    # Create a folder to store all the images
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')

    # UA spoofing: wrap the corresponding User-Agent in a dict
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    }

    # Set up a generic URL template
    url = 'https://www.qiushibaike.com/imgrank/page/%d/'
    for pageNum in range(2, 3):
        # URL for the corresponding page number
        new_url = url % pageNum

        # Use a generic crawler to fetch the entire page for this URL
        page_text = requests.get(url=new_url, headers=headers).text

        # Use the focused-crawler step (regex) to parse/extract all image links from the page
        ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
        img_src_list = re.findall(ex, page_text, re.S)
        print(img_src_list)
        for src in img_src_list:
            # Build the full image URL
            src = 'https:' + src
            response = requests.get(url=src, headers=headers)
            # Binary data of the image
            img_data = response.content

            # Generate the image file name
            img_name = src.split('/')[-1]
            # Final storage path for the image
            imgPath = './qiutuLibs/' + img_name

            with open(imgPath, 'wb') as fp:
                fp.write(img_data)
                print(img_name, 'downloaded successfully!')


Two, bs4 parsing

Data parsing with bs4
    - The general principle of data parsing:
        - 1. Locate the tag
        - 2. Extract the data stored in the tag or in its attributes
    - The principle of bs4 parsing:
        - 1. Instantiate a BeautifulSoup object and load the page source data into it
        - 2. Call the relevant properties or methods of the BeautifulSoup object to locate tags and extract data
    - Environment setup:
        - pip install bs4
        - pip install lxml

    - How to instantiate a BeautifulSoup object:
        - from bs4 import BeautifulSoup
        - Instantiating the object:
            - 1. Load the data of a local html document into the object
                    fp = open('./test.html','r',encoding='utf-8')
                    soup = BeautifulSoup(fp,'lxml')
            - 2. Load page source fetched from the internet into the object
                    page_text = response.text
                    soup = BeautifulSoup(page_text,'lxml')
        - Methods and properties provided for data parsing (a self-contained sketch follows this list):
            - soup.tagName: returns the first tag named tagName in the document
            - soup.find():
                - find('tagName'): equivalent to soup.tagName, e.g. soup.div
                - Locating by attribute:
                    - soup.find('div', class_/id/attr='song')
            - soup.find_all('tagName'): returns all matching tags (a list)
        - select:
            - select('some selector (id, class, tag ... selector)'): returns a list.
            - Hierarchical selectors:
                - soup.select('.tang > ul > li > a'): '>' denotes one level of the hierarchy
                - soup.select('.tang > ul a'): a space denotes multiple levels
        - Getting the text between tags:
            - soup.a.text / soup.a.string / soup.a.get_text()
            - text / get_text(): gets all text content inside a tag
            - string: gets only the text directly under that tag
        - Getting the value of a tag attribute:
            - soup.a['href']
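
The project demo below loads a local test.html that is not included in these notes; here is a self-contained sketch of the same calls on an inline HTML string (the fragment is invented purely for illustration):

from bs4 import BeautifulSoup

html = '''
<div class="tang">
    <ul>
        <li><a href="http://www.example.com/qingping" title="qing">清平调</a></li>
        <li><a href="http://www.example.com/dufu">杜甫</a></li>
    </ul>
</div>
'''

soup = BeautifulSoup(html, 'lxml')

print(soup.a)                                  # first <a> tag in the document
print(soup.find('a', title='qing'))            # locate by attribute
print(soup.find_all('a'))                      # all <a> tags (a list)
print(soup.select('.tang > ul > li > a'))      # hierarchical selector, '>' is one level
print(soup.select('.tang > ul a')[0].text)     # text inside the first matched tag
print(soup.select('.tang > ul a')[0]['href'])  # attribute value of that tag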


from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Load the data of a local html document into the object
    fp = open('./test.html','r',encoding='utf-8')
    soup = BeautifulSoup(fp,'lxml')
    # print(soup)
    # print(soup.a) # soup.tagName returns the first tagName tag in the html

    # find('tagName'): equivalent to soup.tagName
    # print(soup.find('div'))

    # Locating by attribute
    # print(soup.find('div',class_='song'))

    # print(soup.find_all('a')) # returns all matching tags (a list)

    # print(soup.select('.tang')) # returns a list
    # print(soup.select('.tang > ul > li > a')[0]) # first element of the returned list
    # print(soup.select('.tang > ul a')[0]) # first element of the returned list

    # Getting the text between tags
    # text/get_text(): gets all text content inside a tag
    print(soup.a.text)
    print(soup.a.get_text())
    # string: gets only the text directly under the tag
    print(soup.a.string)

    # Getting the value of a tag attribute
    print(soup.a['href'])


Project requirement: use bs4 to crawl every chapter of the novel Romance of the Three Kingdoms from the shicimingju.com site and save it to local disk


import requests
from bs4 import BeautifulSoup


if __name__ == '__main__':
    # Crawl the data of the home (table of contents) page
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'

    # UA spoofing: wrap the corresponding User-Agent in a dict
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    }

    # Re-encode to bytes so that lxml can detect the page's real encoding (works around garbled Chinese)
    page_text = requests.get(url=url,headers=headers).text.encode('ISO-8859-1')
    # 1. Instantiate a BeautifulSoup object and load the page source data into it
    soup = BeautifulSoup(page_text,'lxml')

    # 2. Parse out the chapter titles and the URLs of the detail pages
    li_list = soup.select('.book-mulu > ul > li')
    print(li_list)

    fp = open('./sanguo.txt','w',encoding='utf-8')

    for li in li_list:
        title = li.a.string
        detail_url = 'https://www.shicimingju.com/' + li.a['href']

        # Request the detail page and parse out the chapter content
        detail_page_text = requests.get(url=detail_url,headers=headers).text.encode('ISO-8859-1')
        # Parse the chapter content from the detail page
        detail_soup = BeautifulSoup(detail_page_text,'lxml')
        div_tag = detail_soup.find('div',class_='chapter_content')
        # The chapter content has now been parsed out
        content = div_tag.text
        # print(div_tag.text)

        fp.write(title+':'+content+'\n')
        print(title, 'crawled successfully!')


Three, xpath parsing

xpath parsing: the most commonly used, most convenient, and most efficient parsing method, and also the most general-purpose one.

    - The principle of xpath parsing:
        - 1. Instantiate an etree object and load the page source data to be parsed into it.
        - 2. Call the xpath method of the etree object with an xpath expression to locate tags and capture content.
    - Environment setup:
        - pip install lxml
    - How to instantiate an etree object: from lxml import etree
        - 1. Load the source data of a local html document into the etree object:
            etree.parse(filePath)
        - 2. Load source data fetched from the internet into the object:
            etree.HTML(page_text)
        - xpath('xpath expression')
    - xpath expressions (a self-contained sketch follows this list):
        - /: locates from the root node; denotes one level of the hierarchy.
        - //: denotes multiple levels; can locate starting from any position.
        - Locating by attribute: //div[@class='song']    tag[@attrName="attrValue"]
        - Locating by index: //div[@class="song"]/p[3]    indexes start at 1.
        - Getting text:
            - /text() gets the text directly inside the tag
            - //text() gets all text inside the tag, including nested tags
        - Getting attributes:
            - /@attrName     ==> img/@src
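
A self-contained sketch of these expressions on an inline HTML string (the fragment is invented purely for illustration):

from lxml import etree

html = '''
<html><body>
<div class="song">
    <p>p1</p><p>p2</p><p>p3</p>
    <a href="http://www.example.com">li <span>bai</span></a>
    <img src="http://www.example.com/photo.jpg" alt="photo"/>
</div>
</body></html>
'''

tree = etree.HTML(html)

print(tree.xpath('/html/body/div'))                 # '/' : one level at a time, starting from the root
print(tree.xpath('//div[@class="song"]'))           # '//' plus attribute locating
print(tree.xpath('//div[@class="song"]/p[3]'))      # index locating (indexes start at 1)
print(tree.xpath('//div[@class="song"]/a/text()'))  # direct text only  -> ['li ']
print(tree.xpath('//div[@class="song"]/a//text()')) # all nested text   -> ['li ', 'bai']
print(tree.xpath('//div[@class="song"]/img/@src'))  # attribute value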

1. Project requirement: xpath parsing case - crawling and saving 4K images


import requests
from lxml import etree
import os

if __name__ == '__main__':
    url = 'http://pic.netbian.com/4kmeishi/'
    # UA spoofing: wrap the corresponding User-Agent in a dict
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    }
    response = requests.get(url=url,headers=headers)
    # Manually set the encoding of the response data (left commented out here)
    # response.encoding = 'utf-8'
    page_text = response.text

    # Data parsing: the src attribute value and the alt attribute value
    tree = etree.HTML(page_text)

    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    print(li_list)

    # Create a folder
    if not os.path.exists('./picLibs'):
        os.mkdir('./picLibs')

    for li in li_list:
        img_src = 'http://pic.netbian.com/' + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        # General workaround for garbled Chinese: re-encode the mis-decoded string to its original bytes, then decode as GBK
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        # print(img_src + img_name)

        # Request the image and persist it
        img_data = requests.get(url=img_src,headers=headers).content
        img_path = 'picLibs/' + img_name
        with open(img_path,'wb') as fp:
            fp.write(img_data)
            print(img_name, 'downloaded successfully!')
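
A note on the encoding workarounds used in these projects: requests falls back to ISO-8859-1 when the response headers do not declare a charset, so Chinese text comes out garbled; re-encoding the mis-decoded string back to its original bytes and decoding with the site's real encoding (GBK here, as the decode('gbk') above implies) recovers the correct characters. An equivalent fix is to tell requests the real encoding up front, as in this sketch:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url='http://pic.netbian.com/4kmeishi/', headers=headers)
response.encoding = 'gbk'  # declare the page's real encoding before reading .text
page_text = response.text  # the Chinese alt attributes now decode correctly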


2. Project requirement: xpath parsing case - crawling the names of cities nationwide


import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://www.aqistudy.cn/historydata/'
    # UA spoofing: wrap the corresponding User-Agent in a dict
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    }
    page_text = requests.get(url=url,headers=headers).text

    tree = etree.HTML(page_text)
    # Parse the <a> tags of the hot cities and of all cities in one pass (the | operator unions two xpath expressions)
    a_list = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a | //div[@class="bottom"]/ul/li/a')
    all_city_names = []
    for a in a_list:
        city_name = a.xpath('./text()')[0]
        all_city_names.append(city_name)
    print(all_city_names,len(all_city_names))


3. Project requirement: batch-crawl the free PPT templates from the sc.chinaz.com webmaster-resources site and save them locally


import time
import requests
from lxml import etree
import os



if __name__ == '__main__':
    start = time.perf_counter()
    if not os.path.exists('./ppt'):
        os.mkdir('./ppt')

    # UA spoofing: wrap the corresponding User-Agent in a dict
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    }

    url = 'https://sc.chinaz.com/ppt/free_1.html'
    response = requests.get(url=url,headers=headers)
    page_text = response.text

    tree = etree.HTML(page_text)

    num = 0
    # Parse the detail-page links of all templates on the list page
    urls = tree.xpath('//div[@id="vueWaterfall"]//a/@href')
    for url in urls:
        url = 'https://sc.chinaz.com/' + url
        page_text = requests.get(url=url,headers=headers).text
        tree = etree.HTML(page_text)
        # Parse the first download link on the detail page
        download_url = tree.xpath('//div[@class="download-url"]/a[1]/@href')[0]
        print(download_url)
        # Binary data of the ppt archive
        ppt_data = requests.get(url=download_url,headers=headers).content

        with open('./ppt/' + download_url.split('/')[-1],'wb') as fp:
            fp.write(ppt_data)
        num = num + 1
        print('Downloaded ' + str(num) + ' template(s)!')

    print("Crawling finished!")
    end = time.perf_counter()
    print('Total time:', end - start, 'seconds')

