How to build a general vertical crawler platform?

Search for and follow the "Water Drop and Silver Bullet" WeChat public account to get high-quality technical articles as soon as they are published. The author has 7 years of back-end development experience and explains technology in a clear, simple way.

When I was working on crawlers, I designed and developed a general vertical crawler platform for my company, and later gave an internal tech-sharing talk about it. This article organizes the design ideas behind that crawler platform and shares them with you.

Writing a crawler is simple, and writing one that runs continuously and stably is not that hard either. But how do you build a generalized vertical crawler platform?

In this article, I will share my approach to building a general vertical crawler platform.

Introduction to crawlers

First of all, what is a crawler?

Search engines define it like this:

Web crawlers (also known as web spiders or web robots) are programs or scripts that automatically crawl information from the web according to certain rules.

Put simply, a crawler is a script that collects data automatically according to specified rules, with the goal of getting the data you want.

Crawlers mainly fall into two categories:

  • General crawler (search engine)
  • Vertical crawler (specific field)

Because the development cost of the first category is relatively high, only search engine companies such as Google and Baidu build them.

Most companies build the second category: the cost is low and the data value is high.

For example, an e-commerce company only needs valuable data in the e-commerce domain, so a crawler platform that collects only e-commerce data is far more meaningful to it.

What I want to share here focuses mainly on the second category: design ideas for a vertical crawler platform.

How to write a crawler

Let's start with the simplest case: how do you write a crawler?

Simple crawler

Python is generally the fastest language for developing crawlers, since it requires very little code. Let's write a simple program that scrapes Douban book pages as an example.

# coding: utf8

"""A simple crawler"""

import requests
from lxml import etree

def main():
    # 1. Define page URLs and the parsing rule
    crawl_urls = [
        'https://book.douban.com/subject/25862578/',
        'https://book.douban.com/subject/26698660/',
        'https://book.douban.com/subject/2230208/'
    ]
    parse_rule = "//div[@id='wrapper']/h1/span/text()"

    for url in crawl_urls:
        # 2. Send the HTTP request
        response = requests.get(url)

        # 3. Parse the HTML
        result = etree.HTML(response.text).xpath(parse_rule)[0]

        # 4. Save the result
        print(result)

if __name__ == '__main__':
    main()

This crawler is fairly simple, and the general process is:

  1. Define the page URLs and parsing rules
  2. Send the HTTP requests
  3. Parse the HTML and extract the data
  4. Save the data

Every crawler goes through these steps to obtain data from a web page.

Of course, this simple crawler is not very efficient. It crawls synchronously: it must finish one page before it can start the next. Is there a way to improve efficiency?

Asynchronous crawler

Let's optimize. Since crawler requests block on network I/O, we can fetch pages concurrently with asynchronous techniques such as multiple threads or coroutines. Here we use Python coroutines (gevent) to implement it.

# coding: utf8

"""Coroutine-based crawler for higher crawl throughput"""

from gevent import monkey
monkey.patch_all()

import requests
from lxml import etree
from gevent.pool import Pool

def main():
    # 1. Define page URLs and the parsing rule
    crawl_urls = [
        'https://book.douban.com/subject/25862578/',
        'https://book.douban.com/subject/26698660/',
        'https://book.douban.com/subject/2230208/'
    ]
    rule = "//div[@id='wrapper']/h1/span/text()"

    # 2. Crawl
    pool = Pool(size=10)
    for url in crawl_urls:
        pool.spawn(crawl, url, rule)

    pool.join()

def crawl(url, rule):
    # 3. Send the HTTP request
    response = requests.get(url)

    # 4. Parse the HTML
    result = etree.HTML(response.text).xpath(rule)[0]

    # 5. Save the result
    print(result)

if __name__ == '__main__':
    main()

With this optimization, we have an asynchronous version of the crawler.

With these basics in place, let's look at a complete example: how do we crawl an entire site?

Whole-site crawler

# coding: utf8

"""Whole-site crawler"""

from gevent import monkey
monkey.patch_all()

from urllib.parse import urljoin

import requests
from lxml import etree
from gevent.pool import Pool
from gevent.queue import Queue

base_url = 'https://book.douban.com'

# Seed URL
start_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'

# Parsing rules
rules = {
    # Tag page links
    'list_urls': "//table[@class='tagCol']/tbody/tr/td/a/@href",
    # Detail page links
    'detail_urls': "//li[@class='subject-item']/div[@class='info']/h2/a/@href",
    # Pagination links
    'page_urls': "//div[@id='subject_list']/div[@class='paginator']/a/@href",
    # Book title
    'title': "//div[@id='wrapper']/h1/span/text()",
}

# Queues
list_queue = Queue()
detail_queue = Queue()

# Coroutine pool
pool = Pool(size=10)

def crawl(url):
    """Entry page: extract all tag URLs"""
    response = requests.get(url)
    list_urls = etree.HTML(response.text).xpath(rules['list_urls'])
    for list_url in list_urls:
        list_queue.put(urljoin(base_url, list_url))

def list_loop():
    """Dispatch list-page crawling"""
    while True:
        list_url = list_queue.get()
        pool.spawn(crawl_list_page, list_url)

def detail_loop():
    """Dispatch detail-page crawling"""
    while True:
        detail_url = detail_queue.get()
        pool.spawn(crawl_detail_page, detail_url)

def crawl_list_page(list_url):
    """Crawl a list page"""
    html = requests.get(list_url).text
    detail_urls = etree.HTML(html).xpath(rules['detail_urls'])
    # Detail pages
    for detail_url in detail_urls:
        detail_queue.put(urljoin(base_url, detail_url))

    # Next pages
    list_urls = etree.HTML(html).xpath(rules['page_urls'])
    for list_url in list_urls:
        list_queue.put(urljoin(base_url, list_url))

def crawl_detail_page(detail_url):
    """Crawl a detail page"""
    html = requests.get(detail_url).text
    title = etree.HTML(html).xpath(rules['title'])[0]
    print(title)

def main():
    # 1. Tag pages
    crawl(start_url)
    # 2. List pages
    pool.spawn(list_loop)
    # 3. Detail pages
    pool.spawn(detail_loop)
    # Start crawling
    pool.join()

if __name__ == '__main__':
    main()

To crawl all of the data on Douban Books, the process is:

  1. Find the entrance, i.e. start from the book tag page and extract all tag URLs
  2. Enter each tag page and extract all list-page URLs
  3. Enter each list page and extract the detail-page URLs and the next list-page URLs
  4. Enter each detail page and extract the book information
  5. Repeat this cycle until the whole site has been crawled

This is the idea behind crawling a whole site. It is very simple: analyze how a person would browse the website, then use a program to automate those requests and collect the data.

Ideally, we could collect the entire site's data this way. In reality, however, the target website usually takes anti-crawling measures, and after crawling for a while our IP gets blocked.

How do we get around these anti-crawling measures and still get the data? Let's keep optimizing the code.

Whole-site crawler with anti-crawling countermeasures

# coding: utf8

"""Whole-site crawler with anti-crawling countermeasures"""

from gevent import monkey
monkey.patch_all()

import random
from urllib.parse import urljoin

import requests
from lxml import etree
from gevent.pool import Pool
from gevent.queue import Queue

base_url = 'https://book.douban.com'

# Seed URL
start_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'

# Parsing rules
rules = {
    # Tag page links
    'list_urls': "//table[@class='tagCol']/tbody/tr/td/a/@href",
    # Detail page links
    'detail_urls': "//li[@class='subject-item']/div[@class='info']/h2/a/@href",
    # Pagination links
    'page_urls': "//div[@id='subject_list']/div[@class='paginator']/a/@href",
    # Book title
    'title': "//div[@id='wrapper']/h1/span/text()",
}

# Queues
list_queue = Queue()
detail_queue = Queue()

# Coroutine pool
pool = Pool(size=10)

# Proxy pool
proxy_list = [
    '118.190.147.92:15524',
    '47.92.134.176:17141',
    '119.23.32.38:20189',
]

# UserAgent pool
user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; AS; rv:11.0) like Gecko',
]

def fetch(url):
    """Send an HTTP request with a random proxy and UserAgent"""
    proxy = random.choice(proxy_list)
    proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
    user_agent = random.choice(user_agent_list)
    headers = {'User-Agent': user_agent}
    html = requests.get(url, headers=headers, proxies=proxies).text
    return html

def parse(html, rule):
    """Parse a page with the given rule"""
    return etree.HTML(html).xpath(rule)

def crawl(url):
    """Entry page: extract all tag URLs"""
    html = fetch(url)
    list_urls = parse(html, rules['list_urls'])
    for list_url in list_urls:
        list_queue.put(urljoin(base_url, list_url))

def list_loop():
    """Dispatch list-page crawling"""
    while True:
        list_url = list_queue.get()
        pool.spawn(crawl_list_page, list_url)

def detail_loop():
    """Dispatch detail-page crawling"""
    while True:
        detail_url = detail_queue.get()
        pool.spawn(crawl_detail_page, detail_url)

def crawl_list_page(list_url):
    """Crawl a list page"""
    html = fetch(list_url)
    detail_urls = parse(html, rules['detail_urls'])

    # Detail pages
    for detail_url in detail_urls:
        detail_queue.put(urljoin(base_url, detail_url))

    # Next pages
    list_urls = parse(html, rules['page_urls'])
    for list_url in list_urls:
        list_queue.put(urljoin(base_url, list_url))

def crawl_detail_page(detail_url):
    """Crawl a detail page"""
    html = fetch(detail_url)
    title = parse(html, rules['title'])[0]
    print(title)

def main():
    # 1. Entry page
    crawl(start_url)
    # 2. List pages
    pool.spawn(list_loop)
    # 3. Detail pages
    pool.spawn(detail_loop)
    # Start crawling
    pool.join()

if __name__ == '__main__':
    main()

The difference from the previous version is that every HTTP request now carries a random proxy IP and a random User-Agent header, which is a common way to get around anti-crawling measures. With these techniques, plus some high-quality proxy IPs, collecting data from smaller websites is not a problem.

Of course, this only demonstrates the idea of writing a crawler and optimizing it step by step until you get the data you want. Real-world crawling and anti-crawling are more complicated than you might imagine and have to be analyzed scenario by scenario.

Existing problems

After the steps above, analyzing the page structure of whichever website we care about and writing the crawler code should not be a problem.

However, while you can write crawlers this way for a few websites, can you keep doing it for dozens or hundreds of them?

As we collect from more and more websites, we write more and more crawler scripts, and they become hard to maintain. The problems this exposes include:

  • Many scattered crawler scripts make management and maintenance difficult
  • Crawler rules are fragmented and may be developed redundantly
  • Crawlers run as background scripts with no monitoring
  • The output data format is not uniform; it may be a file or a database
  • Business teams find it hard to use the crawled data because there is no unified access interface

These are problems we inevitably run into as we write more and more crawlers.

At this point we urgently need a better way to develop crawlers, and that is how the crawler platform came about.

So how do we design a generalized vertical crawler platform?

Platform architecture

Analyzing what all crawlers have in common, we find that writing a crawler boils down to a few steps: rules, crawling, parsing, and storage. Can we split each of these pieces out?

According to this idea, we can design the crawler platform as shown below:
[Figure: crawler platform architecture diagram]

Our crawler platform includes the following modules:

  • Configuration service: configuration of crawl pages, parsing rules, and data cleaning rules
  • Collection service: focuses only on downloading web pages, with configurable anti-crawling strategies
  • Proxy service: continuously provides stable, usable proxy IPs
  • Cleaning service: further cleans and normalizes the data collected by the crawlers
  • Data service: displays the crawled data and connects it to business systems

We break each stage of a crawler into a separate service module, each with its own responsibility, and the modules communicate through APIs or message queues.

The advantage is that each module maintains only the functionality of its own domain and can be upgraded and optimized independently without affecting the others.
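
For example, the hand-off between the collection service and the cleaning service can be decoupled with a message queue. Below is a minimal sketch assuming Redis is used as the queue; the queue name crawl:results and the record fields are made up for illustration.

# A minimal sketch of decoupling two modules with a message queue (assuming Redis).
# The queue name 'crawl:results' and the record fields are illustrative only.
import json

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def publish_result(item):
    """Called by the collection service after a page has been crawled."""
    r.lpush('crawl:results', json.dumps(item))

def consume_results():
    """Runs inside a cleaning-service worker."""
    while True:
        _, raw = r.brpop('crawl:results')   # blocks until a result is available
        item = json.loads(raw)
        # ... apply cleaning rules to item, then hand it to the data service
        print(item.get('url'))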

Let's take a look at how each module is specifically designed.

Detailed design

Configuration service

The configuration service module mainly covers the configuration of collection URLs, page parsing rules, and data cleaning rules.

We pull the crawler rules out of the crawler scripts and configure and maintain them separately. The advantage is that they are easy to reuse and manage.
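
For instance, the configuration for one site might look like the sketch below. This is only an illustration of the idea; the field names (site, start_url, rules, clean) are invented for this example rather than taken from any particular framework.

# An illustrative site configuration; the field names are made up for this sketch.
douban_book_config = {
    'site': 'douban_book',
    'start_url': 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all',
    'rules': {
        # each rule declares its parsing mode and the expression to apply
        'list_urls': {'mode': 'xpath', 'expr': "//table[@class='tagCol']/tbody/tr/td/a/@href"},
        'title': {'mode': 'xpath', 'expr': "//div[@id='wrapper']/h1/span/text()"},
    },
    'clean': {
        # cleaning rules applied later by the cleaning service
        'title': ['strip', 'remove_special_chars'],
    },
}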

Since this module focuses only on configuration management, we can break the rule configuration down further and support several parsing modes, mainly the following:

  • Regular expression parsing rules
  • CSS selector parsing rules
  • XPath parsing rules

For each parsing mode, only the corresponding expression needs to be configured.

The collection service can then implement a configuration parser that interfaces with the configuration service; the configuration parser implements the concrete parsing logic for each mode.
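
A minimal sketch of such a configuration parser is shown below, dispatching on the rule's mode. It assumes lxml (plus the cssselect package) is available, and the rule layout follows the illustrative configuration above.

# A minimal configuration-parser sketch; assumes lxml and cssselect are installed.
import re

from lxml import etree
from lxml.cssselect import CSSSelector

def parse_with_rule(html, rule):
    """Dispatch on the rule's mode: 'regex', 'css' or 'xpath'."""
    mode, expr = rule['mode'], rule['expr']
    if mode == 'regex':
        return re.findall(expr, html)
    tree = etree.HTML(html)
    if mode == 'xpath':
        return tree.xpath(expr)
    if mode == 'css':
        # return the text of every element matched by the CSS selector
        return [el.text for el in CSSSelector(expr)(tree)]
    raise ValueError('unknown parsing mode: %s' % mode)

# Usage:
# titles = parse_with_rule(html, {'mode': 'xpath', 'expr': "//div[@id='wrapper']/h1/span/text()"})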

The data cleaning rule configuration mainly defines how each page field is further cleaned and normalized after the data has been collected. For example, the data captured by the collection service may contain special characters; it is not processed further inside the collection service, but in the cleaning service instead. The cleaning rules can be customized; common ones include removing certain special characters and converting field types.

Collection service

This service module is quite focused: it is where the crawler logic is written. We can develop, debug, and run crawler scripts as before, and this module is dedicated to developing and debugging that crawler logic.

But the previous approach only lets us write a crawler as a command-line script and then debug and run it. Is there a good way to make this visual?

We surveyed the mature crawler frameworks available in Python and found that pyspider meets our needs. Its main features are:

  • Distributed support
  • Visual configuration
  • Periodic collection
  • Priority support
  • Task monitoring

The pyspider architecture diagram is as follows:
[Figure: pyspider architecture diagram]

As the saying goes, stand on the shoulders of giants. This framework basically meets our needs, but to better implement our crawler platform we decided to do secondary development on it, enhancing some of its components to lower crawler development costs and better fit our business rules.

The secondary development mainly covers:

  • Develop a configuration parser that connects to the configuration service and handles its multiple rule modes
  • spider handler module: customize crawler templates, categorizing crawler tasks and defining them as templates to reduce development costs (a template sketch follows below)
  • fetcher module: add a proxy IP scheduling mechanism that connects to the proxy service, plus proxy IP scheduling strategies
  • result_worker module: customize the output format so results can be handed to the cleaning service

By building on this open-source framework and enhancing its components, we get a collection service module that is distributed and visual, supports task monitoring, and can generate crawler templates.
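
As a rough illustration of what a crawler template looks like in pyspider, here is a minimal handler sketch. The class name, URLs, and selectors are placeholders; on our platform a real template would read them from the configuration service.

# A minimal pyspider handler sketch; URLs and selectors are placeholders.
from pyspider.libs.base_handler import *


class DoubanBookTemplate(BaseHandler):
    crawl_config = {
        # per-project settings (default headers, proxy, etc.) go here
    }

    @every(minutes=24 * 60)
    def on_start(self):
        # seed URL; on the platform this would come from the configuration service
        self.crawl('https://book.douban.com/tag/', callback=self.list_page)

    @config(age=10 * 24 * 60 * 60)
    def list_page(self, response):
        # follow every detail link found on the list page
        for each in response.doc('li.subject-item h2 a').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # the returned dict is what result_worker receives and forwards for cleaning
        return {
            'url': response.url,
            'title': response.doc('#wrapper h1 span').text(),
        }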

This module focuses solely on collecting web data.

Proxy service

Anyone who writes crawlers knows that proxies are a common way to get around anti-crawling measures. But how do you obtain a stable, continuous supply of proxies?

That is what the proxy service module is for.

This module internally maintains both the quality and the quantity of the proxy IPs and exposes them to the collection service for use during crawling.

This module mainly consists of two parts:

  • Free proxies
  • Paid proxies

Free proxies

Free proxy IPs are mainly gathered by our own proxy collection program. The general idea is:

  • Collect proxy sources
  • Fetch proxies on a schedule
  • Test the proxies
  • Output the usable proxies

For the specific implementation, you can refer to an earlier article of mine: How to build a crawler proxy collection service?
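
As a rough illustration of the "test the proxies" step above, here is a minimal validation sketch; the test URL and timeout are arbitrary choices for this example.

# A minimal proxy-validation sketch; the test URL and timeout are arbitrary.
import requests

def check_proxy(proxy, test_url='https://httpbin.org/ip', timeout=5):
    """Return True if the proxy can fetch the test URL within the timeout."""
    proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
    try:
        resp = requests.get(test_url, proxies=proxies, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

# Usage: keep only the proxies that pass the check
# available = [p for p in proxy_list if check_proxy(p)]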

Paid proxies

Free proxies are relatively poor in quality and stability, and they are not good enough for sites with stronger anti-crawling measures.

For those sites we purchase paid proxies dedicated to crawling them. These are generally high-anonymity proxies and are refreshed regularly.

Free proxy IPs plus paid proxy IPs are then provided to the collection service through an API.

Cleaning service

The cleaning service module is relatively simple. It receives the data output by the collection service and then runs the cleaning logic according to the corresponding rules.

For example, normalizing and converting page fields into database fields, cleaning special fields, custom processing, and so on.

This module runs many workers and finally sends the cleaned results to the data service.
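
A minimal sketch of such a cleaning worker is shown below. The rule names (strip, remove_special_chars, to_int) and the record layout are invented for this example, and the queue hand-off is reduced to plain function calls.

# A minimal cleaning-worker sketch; rule names and record layout are illustrative.
import re

# each rule name maps to a cleaning function
CLEANERS = {
    'strip': lambda v: v.strip(),
    'remove_special_chars': lambda v: re.sub(r'[\r\n\t\xa0]', '', v),
    'to_int': lambda v: int(v),
}

def clean_record(record, clean_rules):
    """Apply the configured cleaning rules to each field of a crawled record."""
    cleaned = {}
    for field, value in record.items():
        for rule_name in clean_rules.get(field, []):
            value = CLEANERS[rule_name](value)
        cleaned[field] = value
    return cleaned

# Usage:
# clean_record({'title': ' Web Crawling \n'}, {'title': ['strip', 'remove_special_chars']})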

Data service

The data service module receives the final cleaned, structured data and stores it in a unified way, then pushes it or exposes it to the business systems that need it.

The main functions include:

  • Data platform display
  • Data push
  • Data API
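
As a rough sketch of the data API, a minimal read-only endpoint might look like this. It assumes Flask; the route, parameters, and the query_books helper are made up for illustration and would be backed by the platform's real storage.

# A minimal data-API sketch (assumes Flask; route and helper names are illustrative).
from flask import Flask, jsonify, request

app = Flask(__name__)

def query_books(keyword, limit):
    """Placeholder for the real storage query used by the data service."""
    return []  # e.g. rows from MySQL or Elasticsearch

@app.route('/api/books')
def books():
    keyword = request.args.get('keyword', '')
    limit = int(request.args.get('limit', 20))
    return jsonify(query_books(keyword, limit))

if __name__ == '__main__':
    app.run(port=8000)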

Problems solved

With the crawler platform built as described above, we have basically solved the problems that troubled us at the beginning. The platform now provides:

  • Unified management of crawler scripts and visual configuration
  • Crawler templates that quickly generate crawler code and reduce development costs
  • Collection progress that can be monitored and easily tracked
  • Unified output of the collected data
  • Easier consumption of crawled data by business systems

Crawler skills

Finally, let me share some practical tips for writing crawlers. Overall, there is really just one core idea: simulate human behavior as closely as possible.

The main aspects include:

  • Random User-Agent to simulate different clients (there are comprehensive User-Agent libraries on GitHub)
  • Random proxy IPs (high-anonymity proxies plus a proxy scheduling strategy)
  • Cookie pool (for collection that requires login)
  • JavaScript-rendered pages (use a headless browser to load the page and get the data; see the sketch after this list)
  • CAPTCHA recognition (OCR, machine learning)
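
For the JavaScript-rendering point above, a minimal sketch with Selenium and headless Chrome might look like this. It assumes selenium and a matching ChromeDriver are installed; this is just one of several headless options.

# A minimal headless-browser sketch (assumes selenium + ChromeDriver are installed).
from selenium import webdriver

def render_page(url):
    """Load a JavaScript-rendered page in headless Chrome and return its HTML."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

# html = render_page('https://book.douban.com/subject/25862578/')
# The returned HTML can then be parsed with the same XPath/CSS rules as before.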

Of course, crawling is a game of cat and mouse, and sometimes there is no need to fight head-on; changing your angle when you hit a problem is also a solution. For example, if the other side's mobile app is heavily protected against scraping, look at their PC site instead: can you get the data there? What about the WAP version? Getting the data at a reasonable cost is the whole point of a crawler.

The more crawling you do, the more you will find that this is a field where strategy and technique matter equally.

That is the design thinking behind building a vertical crawler platform: from the simplest crawler script, to writing more and more crawlers, to maintenance pain, and finally to building an entire crawler platform, each step is the product of a problem and its solution. Once we truly identify the core problem, the solution is not hard to find.

