Python Crawler in Practice (1): Crawling Jokes from Qiushibaike

1. Website Analysis

This article uses the requests library to fetch jokes from Qiushibaike (qiushibaike.com). Readers can click here to visit the Qiushibaike text page, which is shown below:

(Screenshot: the Qiushibaike text page)
At the bottom of the page is a navigation bar with numbered links for switching between pages, and each page displays 25 jokes. To crawl jokes from multiple pages, the crawler must not only parse the HTML code of the current page but also fetch the HTML code of several pages.

Now switch between pages and look at the pattern in the URL. The URLs of pages 1, 2, and 3 are the following:

https://www.qiushibaike.com/text/page/1/
https://www.qiushibaike.com/text/page/2/
https://www.qiushibaike.com/text/page/3/

From these URLs the rule is clear: the page is selected by the last number in the URL. Page 1 ends in the number 1 and page 12 ends in the number 12, so the URL of any page is easy to construct according to this rule, as in the sketch below.
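
Following this rule, the URLs of all pages can be generated with a simple list comprehension. A minimal sketch, assuming we want the first 13 pages, as in the example later in this article:

page_urls = ["https://www.qiushibaike.com/text/page/{}/".format(page) for page in range(1, 14)]
print(page_urls[0])   # https://www.qiushibaike.com/text/page/1/
print(page_urls[-1])  # https://www.qiushibaike.com/text/page/13/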

The main task now is to analyze the HTML code of each page. Readers can track the relevant parts of the HTML with the F12 developer tools, as shown below:

(Screenshots: locating a joke's HTML code in the developer tools)
This article uses regular expressions to parse the data. Readers who are not familiar with regular expressions can read the blogger's article Python Crawler Data Extraction (III): Regular Expressions. Regular expressions can also be validated in PyCharm.
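
A regular expression can also be tried out quickly in a plain Python session before it goes into the crawler. A minimal sketch, using a made-up HTML fragment shaped like the page's user heading (the real page HTML may differ):

import re

snippet = '<h2>\nsome_user\n</h2>'  # made-up fragment in the shape of a user heading
# re.S lets . also match the newline characters inside the <h2> tag
ids = re.findall(r'<h2>(.*?)</h2>', snippet, re.S)
print([i.strip() for i in ids])  # ['some_user']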

The HTML code of Qiushibaike is fairly standard, so the relevant HTML elements are easy to locate. For example, to identify a user's gender, you can locate the following HTML code:

<div class="articleGender manIcon">34</div>

The manIcon class tells us that the user who posted this joke is male; for female users the class is womenIcon.
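
As a quick check of this structure, here is a minimal regular-expression sketch (using only the single fragment above as input) that pulls out both the icon class and the level number:

import re

sample = '<div class="articleGender manIcon">34</div>'
# The first group captures the icon class (manIcon / womenIcon),
# the second group captures the user's level number.
match = re.search(r'<div class="articleGender (.*?)">(\d+)</div>', sample)
if match:
    icon, level = match.groups()
    sex = '女' if icon == "womenIcon" else '男'  # womenIcon -> female, otherwise male
    print(sex, level)  # 男 34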

2. Sample code

Based on the analysis above, we now write a crawler that grabs the jokes from the first 13 pages of Qiushibaike and saves the crawled results to a file named jokes.txt. The sample code is as follows:

# -*- coding: UTF-8 -*-
"""
@author:AmoXiang
@file:3.抓取糗事百科网的段子.py
@time:2020/09/11
"""
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'
}
joke_list = []  # stores all of the crawled jokes


# Determine the gender from the icon class name
def verify_sex(class_name):
    if class_name == "womenIcon":
        return '女'  # female
    else:
        return '男'  # male


def get_joke(url):
    res = requests.get(url=url, headers=headers)
    text = res.text  # get the HTML code of the page
    print(text)
    # Get the user IDs
    # \s matches any whitespace character, including spaces, tabs, form feeds, etc.
    # id_list = re.findall(r'<h2>\s(.*?)\s</h2>', text)  # first way of writing the regular expression
    # re.S: let (.) match any character, including newlines;  ?: non-greedy mode
    id_list = re.findall(r'<h2>(.*?)</h2>', text, re.S)  # second way of writing the regular expression
    # Get the user levels
    level_list = re.findall(r'<div class="articleGender .*?">(\d+)</div>', text)
    # Get the genders
    sex_list = re.findall(r'<div class="articleGender (.*?)">\d+</div>', text)
    # Get the joke contents
    content_list = re.findall(r'<div class="content">.*?<span>(.*?)</span>', text, re.S)
    # Get the "funny" vote counts
    laugh_list = re.findall(r'<span class="stats-vote"><i class="number">(\d+)</i>', text, re.S)
    # Get the comment counts
    comment_list = re.findall(r'<i class="number">(\d+)</i> 评论', text)
    # Use zip to group the elements with the same index from the lists above
    # Example: [1, 2] and [a, b]  ==>  [(1, a), (2, b)]
    for id, level, sex, content, laugh, comment in zip(id_list, level_list, sex_list, content_list, laugh_list,
                                                       comment_list):
        id = id.strip()
        sex = verify_sex(sex)
        content = content.strip().replace('<br/>', '')
        # Collect the data related to each joke
        info = {
            'id': id,
            'level': level,
            'sex': sex,
            'content': content,
            'laugh': laugh,
            'comment': comment
        }
        joke_list.append(info)


if __name__ == '__main__':
    # 1. Generate the URLs of pages 1 to 13
    url_list = ["https://www.qiushibaike.com/text/page/{}/".format(i) for i in range(1, 14)]
    # 2. Request each of the 13 URLs in a loop to get the jokes on those 13 pages
    for url in url_list:
        get_joke(url)

    # Save the crawled results to the jokes.txt file in the current directory
    with open("./jokes.txt", 'a', encoding="utf8") as f:
        for joke in joke_list:
            f.write(joke['id'] + "\n")
            f.write(joke['level'] + "\n")
            f.write(joke['sex'] + "\n")
            f.write(joke['content'] + "\n")
            f.write(joke['laugh'] + "\n")
            f.write(joke['comment'] + "\n\n")
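
The sample requests the 13 pages back to back. As a small variation (an assumption on my part, not part of the original article), a short pause between requests keeps the crawl gentler on the site:

import time

for url in url_list:
    get_joke(url)
    time.sleep(1)  # wait one second between pages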

Run the program and a jokes.txt file appears in the current directory; its contents look like this:

(Screenshot: contents of jokes.txt)
The above content is for technical learning and exchange only. Please do not use the collected data for commercial purposes; otherwise, the consequences are at your own risk and have nothing to do with the blogger. If there is any infringement, contact the blogger to have it deleted.

Origin blog.csdn.net/xw1680/article/details/108535351