Python crawler, crawl 500+ WeChat articles in 30 seconds

introduction

Due to work needs, I made a small tool for the company's front end, using the python language, to crawl the WeChat articles of Sogou WeChat, and the official website of Sogou WeChat is attached.

Sogou WeChat: https://weixin.sogou.com/

Insert picture description here
From popular to fashion circles, and include more content options under each column, a
Insert picture description here
total of 500+ articles

demand

Crawl these articles to get the title of each article and the picture on the right, output the crawled pictures to the specified folder with the specified naming method, and output the article title and image name to Excel and txt correspondingly

effect

Insert picture description here
Insert picture description here
Insert picture description here
Insert picture description here

The complete code is as follows

Package                   Version
------------------------- ---------
altgraph                  0.17
certifi                   2020.6.20
chardet                   3.0.4
future                    0.18.2
idna                      2.10
lxml                      4.5.2
pefile                    2019.4.18
pip                       19.0.3
pyinstaller               4.0
pyinstaller-hooks-contrib 2020.8
pywin32-ctypes            0.2.0
requests                  2.24.0
setuptools                40.8.0
urllib3                   1.25.10
XlsxWriter                1.3.3
xlwt                      1.3.0
# !/usr/bin/python
# -*- coding: UTF-8 -*-
import os

import requests
import xlsxwriter
from lxml import etree

# 请求微信文章的头部信息
headers = {
    
    
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Host': 'weixin.sogou.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
# 下载图片的头部信息
headers_images = {
    
    
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Host': 'img01.sogoucdn.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
a = 0
all = []

# 创建根目录
save_path = './微信文章'
folder = os.path.exists(save_path)
if not folder:
    os.makedirs(save_path)
# 创建图片文件夹
images_path = '%s/图片' % save_path
folder = os.path.exists(images_path)
if not folder:
    os.makedirs(images_path)
for i in range(1, 9):
    for j in range(1, 5):
        url = "https://weixin.sogou.com/pcindex/pc/pc_%d/%d.html" % (i, j)
        # 请求搜狗文章的url地址
        response = requests.get(url=url, headers=headers).text.encode('iso-8859-1').decode('utf-8')
        # 构造了一个XPath解析对象并对HTML文本进行自动修正
        html = etree.HTML(response)
        # XPath使用路径表达式来选取用户名
        xpath = html.xpath('/html/body/li')
        for content in xpath:
            # 计数
            a = a + 1
            # 文章标题
            title = content.xpath('./div[@class="txt-box"]/h3//text()')[0]
            article = {
    
    }
            article['title'] = title
            article['id'] = '%d.jpg' % a
            all.append(article)
            # 图片路径
            path = 'http:' + content.xpath('./div[@class="img-box"]//img/@src')[0]
            # 下载文章图片
            images = requests.get(url=path, headers=headers_images).content
            try:
                with open('%s/%d.jpg' % (images_path, a), "wb") as f:
                    print('正在下载第%d篇文章图片' % a)
                    f.write(images)
            except Exception as e:
                print('下载文章图片失败%s' % e)
# 信息存储在excel中
# 创建一个workbookx
workbook = xlsxwriter.Workbook('%s/Excel格式.xlsx' % save_path)
# 创建一个worksheet
worksheet = workbook.add_worksheet()
print('正在生成Excel...')
try:
    for i in range(0, len(all) + 1):
        # 第一行用于写入表头
        if i == 0:
            worksheet.write(i, 0, 'title')
            worksheet.write(i, 1, 'id')
            continue
        worksheet.write(i, 0, all[i - 1]['title'])
        worksheet.write(i, 1, all[i - 1]['id'])
    workbook.close()
except Exception as e:
    print('生成Excel失败%s' % e)
print("生成Excel成功")
print('正在生成txt...')
try:
    with open('%s/数组格式.txt' % save_path, "w") as f:
        f.write(str(all))
except Exception as e:
    print('生成txt失败%s' % e)
print('生成txt成功')
print('共爬取%d篇文章' % a)

Finally, the program is packaged into an exe file, and the program can be run directly under the windows system.
Insert picture description here
Like, collect, and follow. Your support is my biggest motivation!

Guess you like

Origin blog.csdn.net/y1534414425/article/details/108718569