Python crawler in practice: Kuai Keji (mydrivers.com) tech news section (regex matching, page decoding, dynamic loading)

Foreword

        I ran into an interesting website while crawling for news:

url = 'https://news.mydrivers.com/'

The first thing about this site is that the data is loaded dynamically (scroll down and more content loads automatically); the second is that the interface I found returns escaped/encoded content (but that isn't hard to deal with).

Dependencies

import openpyxl
import requests
import re
from tqdm import tqdm

 tqdm is a package for generating progress bars
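A minimal example of how tqdm is used here: wrap any iterable with tqdm() and it prints a live progress bar as the loop runs.

from tqdm import tqdm
import time

# wrap an iterable with tqdm() to get a progress bar
for _ in tqdm(range(100)):
    time.sleep(0.01)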

Crawl preparation

1. Check the site's robots.txt, which you can view by appending /robots.txt to the root URL.

According to it, as long as you don't crawl the live-broadcast content, you're fine.
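For reference, a quick sketch of checking the rules before crawling:

import requests

# print the site's robots.txt
print(requests.get('https://news.mydrivers.com/robots.txt').text)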

2. Observe the data the page returns (F12)

 

 Here I found one request that actually returns data (the others all return images), so the data should be hidden behind this interface. Delete the part of the URL after the page parameter and keep only the part before it.

 You can see that accessing the link does return something, but it has been escaped/encoded; copy it into PyCharm to take a look.

 So the page source is just escaped/encoded, which is easy to handle.

# the page is wrapped in a layer of escaping; decode it
req = requests.get(url).text.encode('utf8').decode("unicode_escape")

Appending .encode('utf8').decode("unicode_escape") after the GET request solves the problem.
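A minimal demonstration of what this does (the sample string below is made up, not real response data): the interface returns HTML with non-ASCII characters written as \uXXXX escape sequences, and decoding with unicode_escape turns them back into readable characters.

# sample escaped string for illustration, not actual response data
escaped = r'<a href="#">\u5feb\u79d1\u6280</a>'
print(escaped.encode('utf8').decode('unicode_escape'))  # -> <a href="#">快科技</a>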

 I looked at the request URL again and found that

 

 page is the page number and ac is the category index. With these two parameters we can basically get all of the site's data from this one interface.
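A small sketch of hitting this interface directly, assuming ac and page behave as described above (category index and page number) and the other two parameters can stay empty:

import requests

base = 'https://blog.mydrivers.com/news/getdatelist20200820.aspx'
params = {'ac': 1, 'timeks': '', 'timeend': '', 'page': 1}
resp = requests.get(base, params=params)
# undo the escaping, as above
html = resp.text.encode('utf8').decode('unicode_escape')
print(html[:300])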

Data crawling

1. Preliminary preparation

header = {
    'User-Agent': '',
    'cookie': ''
}

# categories to fetch
cate = ['最新', '热文', '热评', '一图', '好文', '数据', '人物', '手机', '动态', '电脑', '汽车', '影音', '软件',
        '游戏', '科学', 'IT圈', 'CPU', '显卡', '硬盘', '显示器', '内存', '安卓', 'iPhone', 'Windows', '微信', '华为',
        '小米', '苹果', '英特尔', '英伟达', 'AMD', '特斯拉', '索尼', 'OPPO', 'VIvo', '荣耀', '京东', '美团',
        '字节跳动', '阿里巴巴', '支付宝', '腾讯', '微软', '百度', '谷歌', '三星', '比亚迪', '蔚来', '理想', '小鹏',
        '埃安']

 Set up the request headers. cate holds the categories I want (I didn't feel like writing a crawler just for these strings, so I typed them into the list by hand).

2. Crawl the article links

The links are matched with re and the matches are then cleaned, which is why there are so many empty strings in the list; we'll deal with those later, as sketched below.
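A short sketch of that matching and cleaning step (the sample HTML and article URL are made up for illustration): re.sub blanks out the javascript: links, which is why empty strings end up in the list until they are filtered out.

import re

sample = '<a href="https://news.mydrivers.com/1/898/898001.htm">a</a><a href="javascript:void(0)">b</a>'
raw = re.findall(r'<a href="(.*?)"', sample, re.S)
cleaned = [re.sub(r'javascript.*', '', u) for u in raw]
print(cleaned)                          # ['https://news.mydrivers.com/1/898/898001.htm', '']
links = [u for u in set(cleaned) if u]  # deduplicate and drop the empty strings
print(links)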

Code for crawling links:

def mydrivwes_url(cate):
    x = 1
    for a in tqdm(range(1, 52)):
        for b in range(1, 91):
            try:
                # build the list-page URL
                url = 'https://blog.mydrivers.com/news/getdatelist20200820.aspx?ac=' + str(
                    a) + '&timeks=&timeend=&page=' + str(b)
                # the page is wrapped in a layer of escaping; decode it
                req = requests.get(url).text.encode('utf8').decode("unicode_escape")
                # extract the news links
                r = re.findall(r'<a href="(.*?)"', req, re.S)
                urll = []
                # clean the links matched by the regex
                for i in r:
                    urll.append(re.sub(r'javascript.*', '', i))
                # parse and save the data
                x = mydrivwes_data(urll, a, cate, x)  # (news links, category index, category list, Excel row to write to)
            except Exception as arr:
                print(arr)
                continue

 3. Visit each obtained link, extract the data we need, and save it to Excel

 Here you need to set the response encoding; you can't take .text directly, otherwise the Chinese text will be garbled.
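In code, that just means setting the encoding on the response before reading .text; a minimal sketch (with a placeholder User-Agent):

import requests

header = {'User-Agent': 'Mozilla/5.0'}  # placeholder UA for illustration
req = requests.get('https://news.mydrivers.com/', headers=header)
req.encoding = 'utf8'   # force UTF-8 so the Chinese text is not garbled
print(req.text[:200])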

Then run regex matching on the fetched page source, do some simple cleaning, and finally write the results into Excel.

Some requests will throw errors, but I've added exception handling (with this much data, losing a few pages isn't a big deal owo); if a page can't be crawled, just skip it.

Parse and write Excel code:

def mydrivwes_data(url, a, cate, x):
    # open an existing Excel file
    f = openpyxl.load_workbook(r'D:\工作文件\科技新闻.xlsx')
    sheet = f.active
    for j in set(url):
        data = []
        dody = []
        if j != '':
            # fetch the page and set the encoding used for parsing
            req = requests.get(j, headers=header)
            req.encoding = 'utf8'
            # extract the data (regex matching)
            data.append([cate[a - 1]])  # category
            data.append(list(set(re.findall(r'<i>#</i>(.*?)</a>', req.text, re.S))))  # tags
            data.append(re.findall(r'<div class="news_bt" id="thread_subject">(.*?)</div>', req.text, re.S))  # title
            # extract the body text and clean it up a little
            for k in re.findall(r'<p>(.*?)</p>', re.findall(r'<div class="news_info">(.*?)</div>', req.text, re.S)[0],
                                re.S):
                dody.append(re.sub('&.*?;', '', re.sub(r'<.*?>', '', k)))  # strip tags and HTML entities
            data.append(dody)  # body text
            data.append([j])  # crawled link
            # write the data into Excel
            for y in range(len(data)):
                sheet.cell(x, y+1).value = '\n'.join(data[y])
            x += 1
    # save the workbook
    f.save(r'D:\工作文件\科技新闻.xlsx')
    return x  # return the row index for the next write

Summary

        Crawling this site was quite fun; I ran into several situations I hadn't seen before. That said, the code still has a few problems. First, the collected links are not deduplicated across pages, so some news items get crawled more than once (this can be cleaned up in Excel afterwards). Second, the amount of data is huge and I didn't write a multi-threaded version (honestly, I haven't learned how yet qwq), so the single-threaded crawl took about two days (it is, after all, 51*90 = 4590 pages of data). A rough multi-threaded sketch follows.
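As a rough idea (not part of the original script), the list pages could be fetched with a thread pool; fetch_page below is a hypothetical helper that only downloads and decodes one list page.

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_page(args):
    ac, page = args
    url = ('https://blog.mydrivers.com/news/getdatelist20200820.aspx'
           f'?ac={ac}&timeks=&timeend=&page={page}')
    # same escaping trick as above
    return requests.get(url).text.encode('utf8').decode('unicode_escape')

tasks = [(ac, page) for ac in range(1, 52) for page in range(1, 91)]
with ThreadPoolExecutor(max_workers=8) as pool:
    # fetch a small slice as a demo; writing to Excel would still need to stay single-threaded
    for html in pool.map(fetch_page, tasks[:20]):
        print(len(html))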

Full code

# -*- coding: utf-8 -*-
# @Time : 2023/3/1 16:08
# @File : mydrivers_spider.py
# @Software: PyCharm


import time
import json
import openpyxl
import requests
import re
from lxml import etree
from tqdm import tqdm

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
    'cookie': 'Hm_lvt_fa993fdd33f32c39cbb6e7d66096c422=1677656055; _ga=GA1.2.85175217.1677656056; _gid=GA1.2.1459179838.1677656056; __gads=ID=a49feffc530281e8-221e4d5f46da00d5:T=1677656343:RT=1677656343:S=ALNI_MZgiorrxc7wFkMAC5KPy1igUUYhkg; __gpi=UID=00000bcf6731797b:T=1677656343:RT=1677656343:S=ALNI_MaFn13jlgmEIsjJSwUPOvVtsbQXeg; CSRFToken=858ff98f521e41d3ae128e65f0574db6; Hm_lpvt_fa993fdd33f32c39cbb6e7d66096c422=1677662913'
}

# categories to fetch
cate = ['最新', '热文', '热评', '一图', '好文', '数据', '人物', '手机', '动态', '电脑', '汽车', '影音', '软件',
        '游戏', '科学', 'IT圈', 'CPU', '显卡', '硬盘', '显示器', '内存', '安卓', 'iPhone', 'Windows', '微信', '华为',
        '小米', '苹果', '英特尔', '英伟达', 'AMD', '特斯拉', '索尼', 'OPPO', 'VIvo', '荣耀', '京东', '美团',
        '字节跳动', '阿里巴巴', '支付宝', '腾讯', '微软', '百度', '谷歌', '三星', '比亚迪', '蔚来', '理想', '小鹏',
        '埃安']


def mydrivwes_url(cate):
    x = 1
    for a in tqdm(range(1, 52)):
        for b in range(1, 91):
            try:
                # build the list-page URL
                url = 'https://blog.mydrivers.com/news/getdatelist20200820.aspx?ac=' + str(
                    a) + '&timeks=&timeend=&page=' + str(b)
                # the page is wrapped in a layer of escaping; decode it
                req = requests.get(url).text.encode('utf8').decode("unicode_escape")
                # extract the news links
                r = re.findall(r'<a href="(.*?)"', req, re.S)
                urll = []
                # clean the links matched by the regex
                for i in r:
                    urll.append(re.sub(r'javascript.*', '', i))
                # parse and save the data
                x = mydrivwes_data(urll, a, cate, x)  # (news links, category index, category list, Excel row to write to)
            except Exception as arr:
                print(arr)
                continue


def mydrivwes_data(url, a, cate, x):
    # open an existing Excel file
    f = openpyxl.load_workbook(r'D:\工作文件\科技新闻.xlsx')
    sheet = f.active
    for j in set(url):
        data = []
        dody = []
        if j != '':
            # fetch the page and set the encoding used for parsing
            req = requests.get(j, headers=header)
            req.encoding = 'utf8'
            # extract the data (regex matching)
            data.append([cate[a - 1]])  # category
            data.append(list(set(re.findall(r'<i>#</i>(.*?)</a>', req.text, re.S))))  # tags
            data.append(re.findall(r'<div class="news_bt" id="thread_subject">(.*?)</div>', req.text, re.S))  # title
            # extract the body text and clean it up a little
            for k in re.findall(r'<p>(.*?)</p>', re.findall(r'<div class="news_info">(.*?)</div>', req.text, re.S)[0],
                                re.S):
                dody.append(re.sub('&.*?;', '', re.sub(r'<.*?>', '', k)))  # strip tags and HTML entities
            data.append(dody)  # body text
            data.append([j])  # crawled link
            # write the data into Excel
            for y in range(len(data)):
                sheet.cell(x, y+1).value = '\n'.join(data[y])
            x += 1
    # save the workbook
    f.save(r'D:\工作文件\科技新闻.xlsx')
    return x  # return the row index for the next write


if __name__ == '__main__':
    mydrivwes_url(cate)


Origin blog.csdn.net/weixin_54243306/article/details/129408658