Python Crawler Series, From Beginner to Buried (2): Data Parsing

Data parsing methods

Regular expressions

Example: using a focused crawler with regular expressions to scrape images

Target site: https://818ps.com/search/0-0-0-0-0-null-0_0_0_67-0-0-0-0.html

Key point: work out which block of the page source contains the image address; .*? can be used to skip over the irrelevant parts in between

ex = '<div class="min-img" has-ajax="0" style="width:216px;height:384px">.*?<img.*?src="(.*?)".*?</div>'
img_src_list = re.findall(ex, page_text, re.S)
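
Before the full script, here is a minimal, self-contained sketch of how the capture group and the re.S flag behave; the HTML fragment and image URL below are invented for illustration:

import re

# Invented HTML fragment mimicking the structure above
html = '''<div class="min-img" has-ajax="0" style="width:216px;height:384px">
    <a href="/detail/1"><img src="//img.example.com/1.jpg"></a>
</div>'''

# re.S lets . match newlines, so .*? can skip across lines;
# findall returns only what the (.*?) group captures
ex = '<div class="min-img".*?<img.*?src="(.*?)".*?</div>'
print(re.findall(ex, html, re.S))  # ['//img.example.com/1.jpg']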

Code:

import requests
import random
import re
import os

user_agent_list=[
            'Mozilla/5.0(compatible;MSIE9.0;WindowsNT6.1;Trident/5.0)',
            'Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)',
            'Mozilla/4.0(compatible;MSIE7.0;WindowsNT6.0)',
            'Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11',
            'Mozilla/5.0(WindowsNT6.1;rv:2.0.1)Gecko/20100101Firefox/4.0.1',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
            'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
            'Opera/8.0 (Windows NT 5.1; U; en)',
            'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
            'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'
        ]

if __name__ == '__main__':
    # Create a folder to hold all the images
    if not os.path.exists('./img'):
        os.mkdir('./img')
    # Fetch the page the images live on
    url = 'https://818ps.com/search/0-0-0-0-0-null-0_0_0_67-0-0-0-0.html'
    header = {'User-Agent': random.choice(user_agent_list)}
    # Use a general-purpose crawler to fetch the entire page at the url
    page_text = requests.get(url=url, headers=header).text
    # Use a focused crawler to parse every image out of the page; the regex
    # returns only what the (.*?) group captures
    ex = '<div class="min-img" has-ajax="0" style="width:216px;height:384px">.*?<img.*?src="(.*?)".*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    # print(img_src_list)
    for index, src in enumerate(img_src_list):
        # Build the full URL: the scraped srcs are protocol-relative ('//...'),
        # so prepend a scheme unless one is already present
        if not src.startswith('http'):
            src = 'http:' + src
        print(src)
        # Fetch the image's binary data
        img_data = requests.get(url=src, headers=header).content
        with open('./img/' + str(index) + '.jpg', 'wb') as fp:
            fp.write(img_data)

Effect: the scraped images are saved into the ./img folder, numbered by index.

bs4

How bs4 parses data:

  • 1. Instantiate a BeautifulSoup object and load the page source data into it
  • 2. Locate tags and extract data by calling the relevant attributes and methods on the BeautifulSoup object

Environment setup:
pip install bs4
pip install lxml

How to instantiate a BeautifulSoup object:

  • from bs4 import BeautifulSoup
  • Two ways to instantiate the object:
    • 1. Load a local HTML document's data into the object:
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Load the local HTML document's data into a BeautifulSoup object
    with open('./colorPicker.html', 'r', encoding='utf-8') as fp:
        soup = BeautifulSoup(fp, 'lxml')
    print(soup)
    • 2. Load page source fetched from the Internet into the object (a minimal sketch follows)
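
A minimal sketch of the second case, assuming the page source is fetched with requests (the URL below is a placeholder):

import requests
from bs4 import BeautifulSoup

# Fetch the page source from the Internet, then load it into the object
page_text = requests.get('https://example.com').text
soup = BeautifulSoup(page_text, 'lxml')
print(soup.title)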

Attributes and methods bs4 provides for data parsing (a short demo follows the list)

  • soup.tagName: returns the first occurrence of that tag in the HTML, e.g. print(soup.li)
  • find(): soup.find('div') is equivalent to soup.div; soup.find('div', class_='aa') finds the div whose class is aa
  • find_all(): soup.find_all('div') returns a list of every match
  • select(): soup.select('.aa') accepts any CSS selector, including hierarchies such as soup.select('.aa > ul > li > a'), and returns a list
  • get_text(): gets the text between tags; soup.a.text / soup.a.string / soup.a.get_text()
  • ['attrName']: gets a tag's attribute value, e.g. soup.a['href']
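
A short demo of these accessors against an inline HTML string (the markup is made up for illustration):

from bs4 import BeautifulSoup

html = '''<div class="aa">
  <ul>
    <li><a href="/ch1">Chapter 1</a></li>
    <li><a href="/ch2">Chapter 2</a></li>
  </ul>
</div>'''
soup = BeautifulSoup(html, 'lxml')

print(soup.li)                           # first <li> in the document
print(soup.find('div', class_='aa'))     # the div whose class is aa
print(len(soup.find_all('li')))          # 2 -- find_all returns a list
print(soup.select('.aa > ul > li > a'))  # hierarchy selector, returns a list
print(soup.a.get_text())                 # Chapter 1
print(soup.a['href'])                    # /ch1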

Practical exercise: Crawl all chapter titles and contents of Romance of the Three Kingdoms

Target page: https://www.shicimingju.com
Page code:

from bs4 import BeautifulSoup
import requests

if __name__ == '__main__':
    header = {'User-Agent': 'Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)'}
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    page_text = requests.get(url=url, headers=header)
    page_text.encoding = 'utf-8'
    # Parse each chapter title and its detail-page url out of the index page
    # 1. Instantiate a BeautifulSoup object and load the page source data into it
    soup = BeautifulSoup(page_text.text, 'lxml')
    li_list = soup.select('.book-mulu > ul > li')
    with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
        for li in li_list:
            title = li.a.text
            detail_url = 'https://www.shicimingju.com' + li.a['href']
            # Request the detail page and parse out the chapter content
            detail_content_text = requests.get(url=detail_url, headers=header)
            detail_content_text.encoding = 'utf-8'
            # Parse the chapter content out of the detail page
            detail_soup = BeautifulSoup(detail_content_text.text, 'lxml')
            div_tag = detail_soup.find('div', class_='chapter_content')
            # The chapter text itself
            content = div_tag.text
            # Persist it to disk
            fp.write(title + ':' + content + '\n')
            print(title + ' --- scraped successfully!')

Effect: sanguo.txt fills with each chapter's title followed by its text.

xpath

XPath parsing is the most commonly used parsing method, both convenient and efficient, and it generalizes well: the same expressions work across many languages and tools.

How XPath parsing works

  • 1. Instantiate an etree object and load the page source data to be parsed into it.
  • 2. Call the etree object's xpath method with an XPath expression to locate tags and capture content.

Environment setup

pip install lxml

How to instantiate an etree object

  • 1. Load a local HTML document's source data into the etree object: etree.parse()
  • 2. Load page source fetched from the Internet into the object: etree.HTML(page_text)

xpath('xpath expression')
Example:

tree = etree.parse('test.html')
tree.xpath('/html/head/title')  # returns a list of Element objects (printed as memory addresses)

XPath expressions (a short demo follows the list)

  • /: start positioning from the root node; each / denotes one level of the hierarchy
  • //: denotes any number of levels, and can start positioning from anywhere in the document
  • [@class="value"]: attribute positioning, e.g. //div[@class="aa"] or //div[@class="aa"]/ul/li
  • [1]: index positioning; indexes start at 1, e.g. //div[3]
  • text(): gets text, e.g. //div[3]/a/text(); /text() gets the text directly inside a tag, while //text() gets all text under the tag, direct or not
  • /@attrName: gets an attribute value, e.g. //img/@src
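
A compact demo of these expressions against an inline HTML string (the markup is invented for illustration):

from lxml import etree

html = '''<html><body>
  <div class="aa">
    <ul>
      <li><a href="/p/1">one</a></li>
      <li><a href="/p/2">two</a></li>
    </ul>
  </div>
</body></html>'''
tree = etree.HTML(html)

print(tree.xpath('/html/body/div'))            # level by level from the root
print(tree.xpath('//li'))                      # every <li>, wherever it sits
print(tree.xpath('//div[@class="aa"]/ul/li'))  # attribute positioning
print(tree.xpath('//li[1]/a/text()'))          # ['one'] -- indexes start at 1
print(tree.xpath('//a/@href'))                 # ['/p/1', '/p/2']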

XPath in practice

Parsing and downloading images

Image source: pic.netbian.com

Code:

import requests
from lxml import etree
import os

if __name__ == '__main__':
    # Create a folder to hold all the images
    if not os.path.exists('./img'):
        os.mkdir('./img')
    url = 'http://pic.netbian.com/4kmeinv/'
    header = {'User-Agent': 'Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)'}
    page = requests.get(url=url, headers=header)
    page.encoding = 'gbk'  # the site is GBK-encoded; set this to avoid mojibake
    page_text = page.text
    tree = etree.HTML(page_text)
    # Parse out each image's src value
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    for li in li_list:
        imgsrc = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        imgname = li.xpath('./a/img/@alt')[0] + '.jpg'
        # The generic fix for garbled Chinese, had the encoding not been set above:
        # imgname = imgname.encode('iso-8859-1').decode('gbk')

        # Persist the image to disk
        img_data = requests.get(url=imgsrc, headers=header).content
        with open('./img/' + imgname, 'wb') as fp:
            fp.write(img_data)
            print(imgname, imgsrc, 'saved successfully')
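
The commented-out line in the script is the generic fix for garbled Chinese when a response was decoded with the wrong charset. A tiny round-trip illustration (the sample string is arbitrary):

# Simulate a GBK page wrongly decoded as ISO-8859-1, then repair it
garbled = '美女'.encode('gbk').decode('iso-8859-1')
print(garbled)                                     # mojibake
print(garbled.encode('iso-8859-1').decode('gbk'))  # 美女 again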

Effect: the wallpapers are saved into ./img, named after their alt text.

Principle overview

  • The text to be parsed out is stored either between tags or in the tags' attributes
  • 1. Locate the target tag
  • 2. Extract (parse) the data stored in the tag's text or in its attributes

Origin: blog.csdn.net/qq_36171287/article/details/113786052