Introducing feapder, a crawler framework that can replace Scrapy

1. Introduction

Hello everyone, I'm Ango!

As we all know, the most popular crawler framework for Python is Scrapy, which is mainly used to crawl structured data from websites.

Today I recommend a simpler, lightweight, yet powerful crawler framework: feapder.

Project address:

github.com/Boris-code/…

2. Introduction and Installation

Like Scrapy, feapder supports lightweight crawlers, distributed crawlers, batch crawlers, a crawler alerting mechanism, and other features.

The three built-in crawlers are as follows:

AirSpider

A lightweight crawler, suitable for simple scenarios and small amounts of data.

Spider

A distributed crawler based on Redis, suitable for massive amounts of data. It supports resuming a crawl from a breakpoint, automatic data storage, and other features.

BatchSpider

A distributed batch crawler, mainly used for crawls that need to collect data periodically.

Before the hands-on part, install the dependency in a virtual environment.

# Install the dependency
pip3 install feapder

3. Hands-on practice

We use the simplest crawler, AirSpider, to crawl some simple data.

Target website (Base64-encoded): aHR0cHM6Ly90b3BodWIudG9kYXkvIA==
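
The target address above looks like a Base64-encoded string. As a minimal sketch, it can be decoded with Python's standard library; the variable names here are only illustrative.

```python
import base64

# The encoded target address exactly as given in the article
ENCODED_TARGET = "aHR0cHM6Ly90b3BodWIudG9kYXkvIA=="

# Decode to text and strip surrounding whitespace
target_url = base64.b64decode(ENCODED_TARGET).decode("utf-8").strip()
print(target_url)
```

The decoded value should match the URL passed to feapder.Request in the start_requests method later in this article.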

The implementation takes the following five steps.

3-1 Create a crawler project

First, use the "feapder create -p" command to create a crawler project.

# Create a crawler project
feapder create -p tophub_demo

3-2 Create a crawler AirSpider

From the command line, go to the spiders folder and use the "feapder create -s" command to create a crawler.

cd spiders

# Create a lightweight crawler
feapder create -s tophub_spider 1

1 is the default and creates a lightweight crawler (AirSpider)
2 creates a distributed crawler (Spider)
3 creates a distributed batch crawler (BatchSpider)

3-3 Configure the database, create the data table, and create the mapping Item

Taking MySQL as an example, first create a data table in the database.

# Create the data table
create table topic(
    id         int auto_increment primary key,
    title      varchar(100)  null comment 'article title',
    auth       varchar(20)   null comment 'author',
    like_count int default 0 null comment 'number of likes',
    collection int default 0 null comment 'number of favorites',
    comment    int default 0 null comment 'number of comments'
);

Then open settings.py in the project root directory and configure the database connection information.

# settings.py

MYSQL_IP = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "xag"
MYSQL_USER_NAME = "root"
MYSQL_USER_PASS = "root"

Finally, create a mapping Item (optional).

Go to the items folder and use the "feapder create -i" command to create a file that maps to the data table.

PS: AirSpider does not support automatic data storage, so this step is optional.
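
To make the idea of a mapping Item more concrete, here is a rough stand-in written as a plain dataclass whose fields mirror the columns of the topic table. This is not the file that "feapder create -i" generates; the class name TopicRecord and the to_row helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, astuple


@dataclass
class TopicRecord:
    """Hypothetical stand-in: one field per column of the topic table (id is auto-generated)."""
    title: str
    auth: str
    like_count: int = 0
    collection: int = 0
    comment: int = 0

    def to_row(self):
        # Column values in the same order as the table definition
        return astuple(self)


record = TopicRecord(title="Sample title", auth="Sample author", like_count=3)
print(record.to_row())
```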

3-4 Write the crawler and parse the data

The first step is to initialize the database client with "MysqlDB".

import feapder
from feapder.db.mysqldb import MysqlDB


class TophubSpider(feapder.AirSpider):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Database client used later to store the crawled data
        self.db = MysqlDB()

In the second step, specify the main URL to crawl in the start_requests method, and use the "download_midware" keyword argument to set a random User-Agent.

import feapder
from fake_useragent import UserAgent

# The two functions below are methods of the TophubSpider class

def start_requests(self):
    yield feapder.Request("https://tophub.today/", download_midware=self.download_midware)

def download_midware(self, request):
    # Random User-Agent
    # Requires: pip3 install fake_useragent
    ua = UserAgent().random
    request.headers = {'User-Agent': ua}
    return request
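
If installing fake_useragent is not desirable, the same idea can be sketched with the standard library alone. The User-Agent strings below are illustrative placeholders rather than a maintained list of real browser values, and random_headers is a hypothetical helper name.

```python
import random

# Illustrative placeholder strings, not a maintained list of real browser UAs
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]


def random_headers(rng=random):
    """Return a headers dict with one User-Agent picked at random."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


headers = random_headers()
print(headers["User-Agent"])
```

Inside download_midware, the returned dict could be assigned to request.headers in place of the fake_useragent call.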

The third step is to crawl the titles and link addresses on the home page.

Use the xpath method built into feapder's response object to parse the data.

def parse(self, request, response):
    # print(response.text)
    card_elements = response.xpath('//div[@class="cc-cd"]')

    # Pick out the card element for the site "什么值得买" (raises IndexError if the card is missing)
    buy_good_element = [card_element for card_element in card_elements if
                        card_element.xpath('.//div[@class="cc-cd-is"]//span/text()').extract_first() == '什么值得买'][0]

    # Get the titles and addresses of the articles inside the card
    a_elements = buy_good_element.xpath('.//div[@class="cc-cd-cb nano"]//a')

    for a_element in a_elements:
        # Title and link
        title = a_element.xpath('.//span[@class="t"]/text()').extract_first()
        href = a_element.xpath('.//@href').extract_first()

        # Issue a new task and carry the article title along with it
        yield feapder.Request(href, download_midware=self.download_midware, callback=self.parser_detail_page,
                              title=title)
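
The filtering logic in parse (find the card for one site, then collect its links) can be tried offline. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for feapder's response.xpath, on made-up markup that only reuses the class names from the XPath expressions above; find_card and extract_links are hypothetical helpers. Unlike the list comprehension with [0], it returns None when no card matches instead of raising IndexError.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup that reuses the class names from the XPath above
SAMPLE = """
<body>
  <div class="cc-cd">
    <div class="cc-cd-is"><span>Site A</span></div>
    <div class="cc-cd-cb nano">
      <a href="https://example.com/a1"><span class="t">A one</span></a>
    </div>
  </div>
  <div class="cc-cd">
    <div class="cc-cd-is"><span>Site B</span></div>
    <div class="cc-cd-cb nano">
      <a href="https://example.com/b1"><span class="t">B one</span></a>
      <a href="https://example.com/b2"><span class="t">B two</span></a>
    </div>
  </div>
</body>
"""


def find_card(root, site_name):
    """Return the first card whose label equals site_name, or None."""
    for card in root.findall(".//div[@class='cc-cd']"):
        label = card.find("./div[@class='cc-cd-is']/span")
        if label is not None and label.text == site_name:
            return card
    return None


def extract_links(card):
    """Return (title, href) pairs for every link inside a card."""
    links = []
    for a in card.findall("./div[@class='cc-cd-cb nano']/a"):
        title = a.find("./span[@class='t']")
        if title is not None and a.get("href"):
            links.append((title.text, a.get("href")))
    return links


root = ET.fromstring(SAMPLE.strip())
card = find_card(root, "Site B")
links = extract_links(card) if card is not None else []
print(links)
```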

The fourth step is to crawl the data from the details page.

The previous step issued a new task and specified its callback function through the "callback" keyword argument. The details page is then parsed in parser_detail_page.

import re

def parser_detail_page(self, request, response):
    """
    Parse the article detail data
    :param request:
    :param response:
    :return:
    """
    title = request.title

    url = request.url

    # Parse the detail page to get the like, favorite, and comment counts and the author name
    author = response.xpath('//a[@class="author-title"]/text()').extract_first().strip()

    print("Author:", author, "Title:", title, "URL:", url)

    desc_elements = response.xpath('//span[@class="xilie"]/span')

    print("Number of desc elements:", len(desc_elements))

    # Likes
    like_count = int(re.findall(r'\d+', desc_elements[1].xpath('./text()').extract_first())[0])
    # Favorites
    collection_count = int(re.findall(r'\d+', desc_elements[2].xpath('./text()').extract_first())[0])
    # Comments
    comment_count = int(re.findall(r'\d+', desc_elements[3].xpath('./text()').extract_first())[0])

    print("Likes:", like_count, "Favorites:", collection_count, "Comments:", comment_count)
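
The three count extractions above repeat the same regular-expression pattern and fail if a span contains no digits or no text. As a small sketch, a hypothetical helper named first_int could centralize that step and fall back to a default; the sample strings are made up.

```python
import re


def first_int(text, default=0):
    """Return the first run of digits in text as an int, or default if there is none."""
    if not text:
        return default
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default


# Made-up label texts standing in for the extracted span contents
print(first_int("Likes 12"), first_int("Favorites"), first_int(None))
```

In parser_detail_page, each count could then be written as first_int(desc_elements[i].xpath('./text()').extract_first()).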

3-5 Data storage

Use the database object instantiated above to execute SQL and insert data into the database.

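# Insert into the database
# The three counts are ints, so they use unquoted %d placeholders
# Note: a title or author containing a single quote would break this statement
sql = "INSERT INTO topic(title,auth,like_count,collection,comment) values('%s','%s',%d,%d,%d)" % (
    title, author, like_count, collection_count, comment_count)

# Execute
self.db.execute(sql)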
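
Building SQL through string formatting is fragile when the text contains quotes. The sketch below shows the usual alternative, bound parameters, using the standard library's sqlite3 with an in-memory database purely as a stand-in for MySQL. The "?" placeholder is sqlite3's style, and whether feapder's MysqlDB offers an equivalent is not covered here; insert_topic is a hypothetical helper.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for MySQL
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topic ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, auth TEXT, "
    "like_count INTEGER DEFAULT 0, collection INTEGER DEFAULT 0, comment INTEGER DEFAULT 0)"
)


def insert_topic(conn, title, author, like_count, collection_count, comment_count):
    """Insert one row; the driver binds the values, so quotes in text are safe."""
    conn.execute(
        "INSERT INTO topic(title, auth, like_count, collection, comment) VALUES (?, ?, ?, ?, ?)",
        (title, author, like_count, collection_count, comment_count),
    )
    conn.commit()


insert_topic(conn, "It's a sample title", "Sample author", 12, 3, 4)
row = conn.execute("SELECT title, auth, like_count, collection, comment FROM topic").fetchone()
print(row)
```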

4. Finally

This article introduced AirSpider, the simplest crawler in feapder, through a simple example.

I will cover feapder's advanced features in detail through a series of examples in later articles.

I have uploaded all the code from this article to the official account. Reply with the keyword "airspider" there to get the complete source code.

If you found this article useful, please like, share, and leave a comment. That is my strongest motivation to keep writing more high-quality articles!