Still using Scrapy? This framework is fast and lightweight, and it turns out writing crawlers can be this simple

In the past, when we wrote crawlers, the most used framework was Scrapy. Today let's develop a crawler with the newly released framework feapder and see what the experience is like.

Target website: aHR0cHM6Ly93d3cubGFnb3UuY29tLw==
Requirements: collect the job list and job details; the details need to be refreshed every 7 days.
For demonstration, the examples below only search for crawler-related positions.

1. Research

1.1 List page

First, we need to check whether the page is dynamically rendered and whether the API has anti-scraping measures.

To check for dynamic rendering, right-click to view the page source and search it for content that appears on the page. For example, searching for Zhiyi Technology, the first company in the result list, returns 0 matches, so the preliminary judgment is that the page is dynamically rendered.

Alternatively, you can use the feapder command line to download the page source and inspect it.

The page that opens is still stuck loading.

Calling response.open() produces a temp.html file in the working directory whose content is the source returned by the current request. Opening it, we see a snippet of JavaScript doing a security check. So we can infer that the site has anti-scraping in place, and the difficulty just went up a notch.
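
As a quick sketch, the shell session looks roughly like this (check feapder shell --help for the exact options in your version; a response object is available in the interactive session once the request returns):

> feapder shell --url "https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput="

response.open()  # writes temp.html to the working directory and opens it in the browser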

feapder also supports building a request from a copied curl command, like this:

Press F12 (or right-click and choose Inspect) to open the developer tools, refresh the page, click the request for the current page and choose Copy as cURL, then go back to the command line, type feapder shell --curl and paste what you just copied.
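
For illustration only (the actual curl text is whatever your browser copied, headers and cookies included; the "..." parts are placeholders):

> feapder shell --curl curl 'https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput=' -H 'user-agent: ...' -H 'cookie: ...'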

 


It turns out that even carrying the headers and cookies doesn't work; presumably some parameters are single-use.

Research conclusion: the list page has anti-scraping and is dynamically rendered.

PS: a real expert would keep digging into what the list API is and how to beat the anti-scraping, but since I'm a beginner, I won't bother.

1.2 Details page

The investigation is similar to the list page. The conclusion: there is anti-scraping, but the page is not dynamically rendered.

2. Create the project

Open the command line tool and enter:

> feapder create -p lagou-spider                                                                                   

lagou-spider 项目生成成功 (project created successfully)

The generated project is as follows:
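
The layout looks roughly like this (spiders and items are packages, setting.py holds the configuration, and main.py is the entry file we will fill in later; check your own generated directory for the exact contents):

lagou-spider
├── items
│   └── __init__.py
├── main.py
├── setting.py
└── spiders
    └── __init__.py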

I use PyCharm, so I first add the project to the workspace and mark it as a source root
(right-click the project name, Mark Directory as -> Sources Root).

3. Write the list page crawler

3.1 Create a crawler

> cd lagou-spider/spiders 
> feapder create -s list_spider 

ListSpider 生成成功 (ListSpider created successfully)

The generated code is as follows:

import feapder


class ListSpider(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://www.baidu.com")

    def parse(self, request, response):
        print(response)


if __name__ == "__main__":
    ListSpider().start()

This is a sample that requests Baidu and can be run directly.

3.2 Write the crawler

Dispatch the tasks:

def start_requests(self):
    yield feapder.Request("https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput=", render=True)

Note that we pass the render parameter in the request, which tells feapder whether to render the page with a browser. Since this list page is dynamically rendered and protected by anti-scraping that I'd rather not fight, I take the lazy route and use render mode to save what's left of my hairline.
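
Rendering is driven by the browser settings in setting.py. Below is a trimmed sketch of the WEBDRIVER block; the key names follow the setting.py generated by the feapder version current when this was written and may differ in yours, so check your own file:

WEBDRIVER = dict(
    # trimmed sketch; see your generated setting.py for the full set of options
    pool_size=1,           # number of browser instances
    headless=False,        # whether to run the browser headless
    driver_type="CHROME",  # which browser driver to use
    render_time=0,         # seconds to wait after the page opens before grabbing the source
)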

Write the parse function

Observing the page structure, we write the following parse function:

def parse(self, request, response):
    job_list = response.xpath("//li[contains(@class, 'con_list_item')]")
    for job in job_list:
        job_name = job.xpath("./@data-positionname").extract_first()
        company = job.xpath("./@data-company").extract_first()
        salary = job.xpath("./@data-salary").extract_first()
        job_url = job.xpath(".//a[@class='position_link']/@href").extract_first()

        print(job_name, company, salary, job_url)

We parse out the job name, company, salary, and job-detail URL. The normal logic would be to dispatch the detail URL as a new task to fetch the details:

def parse(self, request, response):
    job_list = response.xpath("//li[contains(@class, 'con_list_item')]")
    for job in job_list:
        job_name = job.xpath("./@data-positionname").extract_first()
        company = job.xpath("./@data-company").extract_first()
        salary = job.xpath("./@data-salary").extract_first()
        job_url = job.xpath(".//a[@class='position_link']/@href").extract_first()

        print(job_name, company, salary, job_url)

        yield feapder.Request(
            job_url, callback=self.parse_detail, cookies=response.cookies.get_dict()
        )  # carry the cookies returned by the list page; the callback points to the detail parser

def parse_detail(self, request, response):
    print(response.text)
    # TODO: parse the details

But the requirement is that only the details need to be refreshed every 7 days; nothing says the list does. So, as an optimization, we write a separate crawler for the details. The list crawler is only responsible for the list data and for producing detail tasks.


3.3 Data storage

Create table

Job list table lagou_job_list

CREATE TABLE `lagou_job_list` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'auto-increment id',
  `job_name` varchar(255) DEFAULT NULL COMMENT 'job title',
  `company` varchar(255) DEFAULT NULL COMMENT 'company',
  `salary` varchar(255) DEFAULT NULL COMMENT 'salary',
  `job_url` varchar(255) DEFAULT NULL COMMENT 'job detail URL',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Detail task table lagou_job_detail_task

CREATE TABLE `lagou_job_detail_task` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `url` varchar(255) DEFAULT NULL,
  `state` int(11) DEFAULT '0' COMMENT 'task state (0 pending, 1 done, 2 in progress, -1 failed)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
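
The detail data itself will go into a lagou_job_detail table, used by the detail crawler in section 4. A plausible schema matching the fields we store there (title, detail, batch_date) is sketched below; adjust the column types to taste:

CREATE TABLE `lagou_job_detail` (
  -- sketch only: columns mirror the fields set on the detail item later
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'auto-increment id',
  `title` varchar(255) DEFAULT NULL COMMENT 'job title',
  `detail` text COMMENT 'job description',
  `batch_date` date DEFAULT NULL COMMENT 'batch date',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;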

Data storage method

There are many ways to write the data to the database: import pymysql directly and splice SQL statements yourself, or use the MysqlDB helper that ships with the framework. But feapder offers an even more convenient option: automatic storage.
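
For reference, a minimal sketch of the manual pymysql route (connection values are placeholders, error handling kept to a bare minimum):

import pymysql


def save_job(job_name, company, salary, job_url):
    # placeholder connection values; replace with your own
    conn = pymysql.connect(
        host="localhost", user="root", password="******",
        db="lagou", charset="utf8mb4",
    )
    try:
        with conn.cursor() as cursor:
            sql = (
                "INSERT INTO lagou_job_list (job_name, company, salary, job_url) "
                "VALUES (%s, %s, %s, %s)"
            )
            cursor.execute(sql, (job_name, company, salary, job_url))
        conn.commit()
    finally:
        conn.close()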

AirSpider does not support automatic storage: it is meant to stay lightweight, so the author has not added this feature to it for now. The distributed Spider does support it, so we just change the base class to Spider.

class ListSpider(feapder.AirSpider):

Change to

class ListSpider(feapder.Spider):

Generate item

Items map one-to-one to tables and are what the automatic storage mechanism works with; they can be generated with a feapder command.

First configure the database connection information in setting.py.
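
A sketch of the relevant part of setting.py (the field names are the ones feapder generated for me and may differ by version; the values are placeholders):

# MYSQL
MYSQL_IP = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "lagou"
MYSQL_USER_NAME = "root"
MYSQL_USER_PASS = "******"

# REDIS (needed once we use Spider / BatchSpider)
REDISDB_IP_PORTS = "localhost:6379"
REDISDB_USER_PASS = ""
REDISDB_DB = 0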

Generate item:

> cd items 
> feapder create -i lagou_job_list 
> feapder create -i lagou_job_detail_task
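
The generated item is a plain class whose attributes mirror the table columns, roughly as below (the exact scaffold depends on the feapder version; the target table name is derived from the class name):

from feapder import Item


class LagouJobListItem(Item):
    """
    Sketch of the scaffold produced by: feapder create -i lagou_job_list
    """

    def __init__(self, *args, **kwargs):
        super().__init__(**kwargs)
        self.job_name = None
        self.company = None
        self.salary = None
        self.job_url = None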

Data storage

def parse(self, request, response):
    job_list = response.xpath("//li[contains(@class, 'con_list_item')]")
    for job in job_list:
        job_name = job.xpath("./@data-positionname").extract_first()
        company = job.xpath("./@data-company").extract_first()
        salary = job.xpath("./@data-salary").extract_first()
        job_url = job.xpath(".//a[@class='position_link']/@href").extract_first()

        # list data
        list_item = lagou_job_list_item.LagouJobListItem()
        list_item.job_name = job_name
        list_item.company = company
        list_item.salary = salary
        list_item.job_url = job_url
        yield list_item  # just yield it; the framework batches the inserts

        # detail task
        detail_task_item = lagou_job_detail_task_item.LagouJobDetailTaskItem()
        detail_task_item.url = job_url
        yield detail_task_item  # just yield it; the framework batches the inserts

The data is handed back to the framework by yielding items, and the framework stores them in batches automatically.

3.4 Overall code

import feapder
from items import *


class ListSpider(feapder.Spider):
    def start_requests(self):
        yield feapder.Request(
            "https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?labelWords=&fromSearch=true&suginput=",
            render=True,
        )

    def parse(self, request, response):
        job_list = response.xpath("//li[contains(@class, 'con_list_item')]")
        for job in job_list:
            job_name = job.xpath("./@data-positionname").extract_first()
            company = job.xpath("./@data-company").extract_first()
            salary = job.xpath("./@data-salary").extract_first()
            job_url = job.xpath(".//a[@class='position_link']/@href").extract_first()

            # list data
            list_item = lagou_job_list_item.LagouJobListItem()
            list_item.job_name = job_name
            list_item.company = company
            list_item.salary = salary
            list_item.job_url = job_url
            yield list_item  # just yield it; the framework batches the inserts

            # detail task
            detail_task_item = lagou_job_detail_task_item.LagouJobDetailTaskItem()
            detail_task_item.url = job_url
            yield detail_task_item  # just yield it; the framework batches the inserts


if __name__ == "__main__":
    spider = ListSpider(redis_key="feapder:lagou_list")
    spider.start()

redis_key is the root key under which the task queue is stored in Redis.

Run it directly and you can see the data has been stored in the database automatically.

4. Write the detail crawler

Unlike the list crawler, the detail data needs to be refreshed every 7 days.

To show the data as a time series, we collect it every 7 days; the data must carry batch information so it can be partitioned along the 7-day dimension.

Before feapder, we would have had to think about pulling tasks from the detail task table in batches and feeding them to the crawler, maintaining the task states and the batch information mentioned above, and monitoring collection progress to keep the data fresh. Writing such a crawler is quite tedious.

So how does feapder handle it? To save space, here is the complete code:

import feapder
from items import *


class DetailSpider(feapder.BatchSpider):
    def start_requests(self, task):
        task_id, url = task
        yield feapder.Request(url, task_id=task_id, render=True)

    def parse(self, request, response):
        job_name = response.xpath('//div[@class="job-name"]/@title').extract_first().strip()
        detail = response.xpath('string(//div[@class="job-detail"])').extract_first().strip()

        item = lagou_job_detail_item.LagouJobDetailItem()
        item.title = job_name
        item.detail = detail
        item.batch_date = self.batch_date  # batch info; the framework maintains it itself
        yield item  # automatic batched storage
        yield self.update_task_batch(request.task_id, 1)  # update the task state


if __name__ == "__main__":
    spider = DetailSpider(
        redis_key="feapder:lagou_detail",  # redis中存放任务等信息的根key
        task_table="lagou_job_detail_task",  # mysql中的任务表
        task_keys=["id", "url"],  # 需要获取任务表里的字段名,可添加多个
        task_state="state",  # mysql中任务状态字段
        batch_record_table="lagou_detail_batch_record",  # mysql中的批次记录表
        batch_name="详情爬虫(周全)",  # 批次名字
        batch_interval=7,  # 批次周期 天为单位 若为小时 可写 1 / 24
    )

    # the two start functions below act as master and worker; run them separately
    # spider.start_monitor_task()  # dispatch and monitor tasks
    spider.start()  # collect

We run spider.start_monitor_task() and spider.start() separately; after the crawler finishes, let's look at the database.

Task table: lagou_job_detail_task

All tasks have been completed. The framework also has a mechanism to re-dispatch lost tasks until every task is done.

Data table: lagou_job_detail

The data carries the batch date, so we can partition it by that field. The current batch is March 19; with a 7-day interval, the next batch will be March 26.

If you start the crawler again within the same batch and there are no new tasks, neither spider.start_monitor_task() nor spider.start() will grab anything.

Batch table: lagou_detail_batch_record

The batch table is the one named in the startup parameters and is generated automatically. It records the crawl status of each batch in detail: total number of tasks, number done, number failed, whether the batch has finished, and so on.

5. Integration

At this point both the list crawler and the detail crawler are written, but their entry points are spread across two files, which is messy to manage. feapder recommends putting them in main.py.


from feapder import ArgumentParser

from spiders import *


def crawl_list():
    """
    List crawler
    """
    spider = list_spider.ListSpider(redis_key="feapder:lagou_list")
    spider.start()


def crawl_detail(args):
    """
    Detail crawler
    @param args: 1 / 2 / init
    """
    spider = detail_spider.DetailSpider(
        redis_key="feapder:lagou_detail",  # redis中存放任务等信息的根key
        task_table="lagou_job_detail_task",  # mysql中的任务表
        task_keys=["id", "url"],  # 需要获取任务表里的字段名,可添加多个
        task_state="state",  # mysql中任务状态字段
        batch_record_table="lagou_detail_batch_record",  # mysql中的批次记录表
        batch_name="详情爬虫(周全)",  # 批次名字
        batch_interval=7,  # 批次周期 天为单位 若为小时 可写 1 / 24
    )

    if args == 1:
        spider.start_monitor_task()
    elif args == 2:
        spider.start()


if __name__ == "__main__":
    parser = ArgumentParser(description="xxx爬虫")

    parser.add_argument(
        "--crawl_list", action="store_true", help="列表爬虫", function=crawl_list
    )
    parser.add_argument(
        "--crawl_detail", type=int, nargs=1, help="详情爬虫(1|2)", function=crawl_detail
    )

    parser.start()

View the start command:

> python3 main.py --help                                 
usage: main.py [-h] [--crawl_list] [--crawl_detail CRAWL_DETAIL]

xxx爬虫

optional arguments:
  -h, --help            show this help message and exit
  --crawl_list          列表爬虫
  --crawl_detail CRAWL_DETAIL
                        详情爬虫(1|2)

Start the list crawler:

 python3 main.py --crawl_list

Start the detail crawler master:

python3 main.py --crawl_detail 1 

Start the detail crawler worker:

python3 main.py --crawl_detail 2

Summary

This article used a recruitment site as an example to walk through the whole process of collecting data with feapder, covering the three crawler classes AirSpider, Spider, and BatchSpider.

  • AirSpider is relatively lightweight with a low learning curve. It is a good fit for small data volumes that need neither resumable (breakpoint) crawling nor distributed collection.
  • Spider is a Redis-based distributed crawler suited to large-scale collection, with support for resumable crawling, crawler alarms, automatic data storage, and more.
  • BatchSpider is a distributed batch crawler; for data that needs to be collected periodically, it is the first choice. The sketch below recaps how each one is started.
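
To recap with the code from this article, the way each class is started already shows the difference in weight:

# AirSpider: standalone, nothing external required
ListSpider().start()

# Spider: distributed, task queue lives in Redis
ListSpider(redis_key="feapder:lagou_list").start()

# BatchSpider: distributed batch collection, driven by a MySQL task table
DetailSpider(
    redis_key="feapder:lagou_detail",
    task_table="lagou_job_detail_task",
    task_keys=["id", "url"],
    task_state="state",
    batch_record_table="lagou_detail_batch_record",
    batch_name="详情爬虫(周全)",
    batch_interval=7,
).start_monitor_task()  # master; run .start() separately as the worker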

Besides browser rendering and downloading, feapder also supports custom pipelines, which users can implement themselves to write data to other storage backends.
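
As a rough illustration (the base class and method signature below follow the feapder docs at the time of writing and may differ between versions), a custom pipeline that just prints items instead of writing them to MySQL could look like this:

from feapder.pipelines import BasePipeline


class ConsolePipeline(BasePipeline):
    # NOTE: interface as documented by feapder at the time of writing; verify against your version
    def save_items(self, table, items) -> bool:
        # items is a list of dicts destined for `table`;
        # return True on success, otherwise the framework retries the batch
        print(f"would save {len(items)} items to {table}")
        return True

To enable it, point the ITEM_PIPELINES list in setting.py at the class, e.g. ITEM_PIPELINES = ["pipelines.ConsolePipeline"] (the module path here is just an example).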

The framework has rich alarms built in, so we are notified promptly when something goes wrong with a crawler, which helps keep the data fresh:

  1. Computes the crawl speed in real time, estimates the remaining time, and predicts whether the crawl will overrun the specified cycle
  2. Alarm when the crawler stalls
  3. Alarm when too many tasks fail, which may indicate a site template change or a ban
  4. Download status monitoring


Source: blog.csdn.net/pyjishu/article/details/115264716