Principles and Practice of Python Web Crawlers

Author: JD Logistics Tian Yu

1 Web crawler

A web crawler is a program or script that automatically collects information from the World Wide Web according to certain rules.

There are many crawler technologies and frameworks, and different ones can be chosen for different scenarios.

2 Scrapy framework (Python)

2.1. Scrapy Architecture

2.1.1. System Architecture

 

2.1.2. Execution process

The crawler development process can be summarized, and the execution flow simplified, as shown in the following figure:

 

The main process of crawler operation is as follows:

(1) After Scrapy starts the Spider, it loads the Spider's start_urls and generates request objects;

(2) The request objects are refined through middleware (adding an IP proxy, User-Agent, and so on);

(3) The Downloader object downloads the page according to the request object;

(4) The response is passed to the spider's parse method for analysis;

(5) The spider extracts the data, encapsulates it as Item objects, and passes them to the pipeline; any request objects produced during parsing are returned to the scheduler for a new round of crawling.
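As a minimal illustration of this cycle (a generic sketch based on the public Scrapy tutorial site, not this project's actual spider), a spider that yields both items and follow-up requests looks like this:

import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal spider illustrating the request -> download -> parse -> item/request cycle."""
    name = "quotes_example"                       # used when starting the crawler
    start_urls = ["http://quotes.toscrape.com/"]  # step (1): the initial requests

    def parse(self, response):
        # steps (4)/(5): parse the downloaded page into items ...
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # ... and hand any new requests back to the scheduler
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)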

2.2. Introduction to the Framework's Core Files

2.2.1. scrapy.cfg

scrapy.cfg is the entry file of the Scrapy project: the settings node points to the crawler's configuration module, and the deploy node specifies the deployment target of the scrapyd service.

 

[settings]
default = sfCrawler.settings

[deploy]
url = http://localhost:6800/
project = jdCrawler

2.2.2. settings.py

settings.py is mainly used to configure crawler startup information, such as the number of concurrent requests, the middlewares and items in use, and so on; it can also serve as a global configuration file for the system.

Note: at present it mainly holds related configuration such as the Redis and database connection information.
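For illustration, a settings.py along these lines covers concurrency, middlewares, pipelines, and the global Redis/MySQL settings mentioned above (the values and the middleware class paths here are assumptions, not the project's real configuration):

# settings.py (illustrative values only)
BOT_NAME = "sfCrawler"

# concurrency
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5

# downloader middlewares (class paths are hypothetical)
DOWNLOADER_MIDDLEWARES = {
    "sfCrawler.middlewares.user_agents_middleware.UserAgentsMiddleware": 543,
    "sfCrawler.middlewares.proxy_server.ProxyServerMiddleware": 544,
}

# item pipelines
ITEM_PIPELINES = {
    "sfCrawler.pipelines_manage.shunfeng_pipelines.ShunfengPipeline": 403,
}

# global connection settings reused elsewhere in the system (placeholders)
REDIS_URL = "redis://localhost:6379/0"
MYSQL_HOST = "localhost"
MYSQL_DB = "crawler"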

2.2.3. middlewares.py

middlewares.py defines a variety of hook interfaces that are invoked at different points: when the spider is loaded, on input and output, on requests, on request exceptions, and so on.

Note: at present it is mainly used to add User-Agent and IP proxy information to crawler requests.

2.2.4. pipelines.py

Used to define the Pipeline objects that process data. The Scrapy framework can configure multiple pipelines in the settings.py file; they process the data sequentially, in the priority order configured there.

Note: every Item object generated in the system passes through all of the pipelines configured in settings.py.

2.2.5. items.py

Defines data dictionaries for the different data types; every attribute is of type Field.

2.2.6. spider directory

Stores the Spider subclass definitions. When Scrapy starts a crawler, the spider is loaded and invoked according to the name attribute of the spider class.

2.3. Crawler function extensions

2.3.1. user_agents_middleware.py

Through the process_request method, header information is added to the request object, randomly simulating the User-Agent of different browsers when making network requests.
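A minimal sketch of such a middleware (the class name and the User-Agent list are illustrative, not the project's actual file):

import random


class UserAgentsMiddleware:
    """Downloader middleware sketch: attach a random browser User-Agent to each request."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
        "Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    ]

    def process_request(self, request, spider):
        # pick a User-Agent at random so successive requests look like different browsers
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # continue through the rest of the middleware chain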

2.3.2. proxy_server.py

Through the process_request method, network proxy information is added to the request object, randomly rotating among multiple proxy IPs.
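A comparable sketch for the proxy middleware (the proxy addresses are placeholders; in practice they would come from a proxy pool or service):

import random


class ProxyServerMiddleware:
    """Downloader middleware sketch: route each request through a randomly chosen proxy."""

    PROXIES = [
        "http://10.0.0.1:8888",  # placeholder proxy addresses
        "http://10.0.0.2:8888",
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours the proxy set in request.meta
        request.meta["proxy"] = random.choice(self.PROXIES)
        return None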

2.3.3. db_connection_pool.py

The file is located at db_manager/db_connection_pool.py. It defines a basic database connection pool, making it convenient for every part of the system to operate on the database.
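A sketch of such a pool, assuming the DBUtils and PyMySQL packages and placeholder connection details:

import pymysql
from dbutils.pooled_db import PooledDB  # DBUtils 2.x


class DbConnectionPool:
    """Shared MySQL connection pool sketch (connection details are placeholders)."""

    def __init__(self):
        self.pool = PooledDB(
            creator=pymysql,      # driver used to create the underlying connections
            maxconnections=10,    # upper bound on simultaneously open connections
            host="localhost",
            port=3306,
            user="crawler",
            password="secret",
            database="crawler",
            charset="utf8mb4",
        )

    def get_conn(self):
        # borrow a connection from the pool; calling close() on it returns it to the pool
        return self.pool.connection()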

2.3.4. redis_connention_pool.py

The file is located at db_manager/redis_connention_pool.py. It defines a basic Redis connection pool, making it convenient for every part of the system to operate on the Redis cache.
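A sketch of the Redis pool, using the connection pool built into redis-py (host and port are placeholders):

import redis


class RedisConnectionPool:
    """Shared Redis connection pool sketch (host/port are placeholders)."""

    pool = redis.ConnectionPool(host="localhost", port=6379, db=0)

    @classmethod
    def get_client(cls):
        # clients created from the same pool share its underlying connections
        return redis.Redis(connection_pool=cls.pool)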

2.3.5. scrapy_redis package

The scrapy_redis package is an extension of the Scrapy framework that uses Redis as the request queue for storing crawler task information.

spiders.py: defines the distributed RedisSpider class, which obtains the initial request list from the Redis cache by overriding the Spider class's start_requests() method. A RedisSpider subclass must assign a value to redis_key.

pipelines.py: defines a simple storage method that serializes Item objects directly and saves them to the Redis cache.

dupefilter.py: defines the deduplication class, backed by Redis; saved data is added to the filter queue.

queue.py: defines several queues with different enqueue and dequeue orders; the queues are stored in Redis.

2.4. Weibo crawler development example

2.4.1. Find crawler entry

2.4.1.1. Site Analysis

Websites generally come in two forms, a Web (PC) side and an M (mobile) side, and the design and architecture of the two can differ considerably. The Web side is usually more mature, with restrictions such as User-Agent checks, mandatory cookies, and login redirects, which makes it relatively difficult to crawl, and its responses are mainly HTML content; the M side is usually simpler and often exposes independent data interfaces. Therefore, during site analysis, the M-side entrance is looked for first. The Weibo Web side and M side are shown in the figures below:

Weibo Web-side address: https://weibo.com/ ; the page looks as shown below:

Note: the image is a screenshot of the Weibo PC site.

Weibo M-side address: https://m.weibo.cn/?jumpfrom=weibocom ; the page looks as shown below:

Note: the image is a screenshot of the Weibo M-side site.

2.4.1.2. HTML source code analysis

Both the Web-side and M-side sites return HTML. To improve page rendering speed or to make code analysis harder, some sites generate HTML dynamically through JavaScript execution. Since web crawlers lack a JS execution and rendering step, it is difficult for them to obtain the real data. An HTML snippet from the Weibo Web site is shown below:

 

The body content is embedded in a script tag:

 

HTML content of the M-side site:

 

The key information on the page does not appear in the M-side HTML, from which we can conclude that the site uses a front-end/back-end separation design. Using Chrome's developer tools, all request information can be inspected, and the interface address can largely be determined from the request type and the returned result. The search process is shown in the figure below:

Note: the image is a screenshot of the Weibo M-side site.

(1) Open the Chrome developer tools and refresh the current page;

(2) Set the request type filter to XHR to show only Ajax requests;

(3) View all request information, ignoring interfaces that do not return results;

(4) Find relevant content on the page in the results returned by the interface.

2.4.1.3. Interface analysis

Interface analysis mainly covers the request address, the request method, the parameter list, and the returned result.

The request address, request method, and parameter list can be obtained from the network request Header information in the Chrome developer tools. The request information is shown in the figure below:

 

The interface shown above is requested with the GET method, and the request address is Unicode-encoded. The parameters can be inspected in the Query String Parameters list, as shown in the following figure:

 

Analysis of the returned result focuses on the characteristics of the data structure: find the structure that corresponds to the post content and check whether all results follow it, so that unusual responses do not break the data parsing process.

2.4.1.4. Interface Validation

Interface verification generally requires two steps:

(1) Use a browser (preferably a clean session, such as Chrome's incognito mode) to simulate the request: enter the request address with its parameters in the address bar and inspect the returned result.

(2) Use a tool such as Postman to simulate the browser request, mainly for non-GET requests; this also verifies whether the site enforces Cookie and User-Agent checks.
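The same check can also be scripted. A minimal sketch with the requests library (the interface address follows the comments API that also appears in the spider in section 2.4.3; the id and User-Agent values here are placeholders):

import requests

url = "https://m.weibo.cn/api/comments/show"
params = {"id": "4300000000000000"}  # placeholder Weibo post id (mid)
headers = {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2 like Mac OS X) AppleWebKit/605.1.15"}

resp = requests.get(url, params=params, headers=headers, timeout=10)
print(resp.status_code)
data = resp.json()
# check that the returned structure matches what the browser showed
print(list(data.keys()))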

2.4.2. Defining data structures

The crawler's data structures are designed around the business requirements and the captured data. Weibo data is mainly consumed by a domestic public-opinion system, so during development the data of the related sites is uniformly defined as the OpinionItem type, and each site's storage logic assembles the data according to the OpinionItem structure. The public-opinion data structure is defined in items.py as follows:

from scrapy import Field, Item


class OpinionItem(Item):
    rid = Field()
    pid = Field()
    response_content = Field()  # full content returned by the interface
    published_at = Field()  # publication time
    title = Field()  # title
    description = Field()  # description
    thumbnail_url = Field()  # thumbnail URL
    channel_title = Field()  # channel name
    viewCount = Field()  # view count
    repostsCount = Field()  # repost count
    likeCount = Field()  # like count
    dislikeCount = Field()  # dislike count
    commentCount = Field()  # comment count
    linked_url = Field()  # link
    updateTime = Field()  # update time
    author = Field()  # author
    channelId = Field()  # channel ID
    mediaType = Field()  # media type
    crawl_time = Field()  # crawl time
    type = Field()  # content type: 1 = post, 2 = comment on a post

2.4.3. Crawler development

The Weibo crawler uses the distributed RedisSpider as its parent class. The crawler is defined as follows:

class weibo_list(RedisSpider):
    name = 'weibo'
    allowed_domains = ['weibo.cn']
    redis_key = 'spider:weibo:list:start_urls'
 
    def parse(self, response):
        a = json.loads(response.body)
        b = a['data']['cards']
        for j in range(len(b)):
            bb = b[j]
            try:
                for c in bb['card_group']:
                    try:
                        d = c['mblog']
                        link = 'https://m.weibo.cn/api/comments/show?id={}'.format(d['mid'])
                        # yield scrapy.Request(url=link, callback=self.parse_detail)
                        # content-parsing snippet omitted: it builds the OpinionItem `opinion`
                        opinion['mediaType'] = 'weibo'
                        opinion['type'] = '1'
                        opinion['crawl_time'] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
                        yield opinion
                    except Exception as e:
                        print(e)
                        continue
            except Exception as e:
                print(e)
                continue

 

Code analysis:

Code line 1: defines the weibo_list class, which inherits from RedisSpider;

Code line 2: defines the crawler name, used when the crawler is started;

Code line 3: adds the list of domain names the crawler is allowed to access;

Code line 4: defines the Redis key that holds the Weibo start request addresses;

Code line 6: defines the crawler's parse method, which is called by default after a page is downloaded;

Code lines 7~20: parse the downloaded content and assemble it into an Item object;

Code line 21: the yield keyword turns the method into a Python generator, so the caller can iterate over all the results.

2.4.4. Data Storage

Data storage is implemented by defining a Pipeline class that saves the data parsed by the crawler. Weibo data also needs to go through sentiment analysis, so during development the Weibo data is first saved to Redis and later fetched, processed, and written to the database by the downstream sentiment-analysis service. The data-saving code looks like this:

class ShunfengPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, OpinionItem):
            try:
                print('=========== Weibo query result ============')
                key = 'spider:opinion:data'
                dupe = 'spider:opinion:dupefilter'
 
                attr_list = []
                for k, v in item.items():
                    if isinstance(v, str):
                        v = v.replace('\'', '\\\'')
 
                    attr_list.append("%s:'%s'" % (k, v))
                data = ",".join(attr_list)
                data = "{%s}" % data
 
                # use the data source, type, and unique id as the deduplication key
                single_key = ''.join([item['mediaType'], item['type'], item['rid']])
                if ReidsPool().rconn.execute_command('SADD', dupe, single_key) != 0:
                    ReidsPool().rconn.execute_command('RPUSH', key, data)
            except Exception as e:
                print(e)
                pass
 
        return item

 

Key code description:

Code line 1: defines the Pipeline class;

Code line 2: defines the process_item method that receives the data;

Code line 3: handles the item according to its type;

Code lines 4~17: assemble the Item object into a JSON-style string;

Code lines 18~21: deduplicate the data and push it onto the Redis queue;

Code line 26: returns the item object so that other pipelines can process it.

After the Pipeline is defined, it needs to be registered in the project's settings.py file. The configuration is as follows:

 

# configure the project pipelines
ITEM_PIPELINES = {
    "sfCrawler.pipelines.JdcrawlerPipeline": 401,
    "sfCrawler.pipelines_manage.mysql_pipelines.MySqlPipeline": 402,
    "sfCrawler.pipelines_manage.shunfeng_pipelines.ShunfengPipeline": 403,
}

Scrapy getting-started tutorial:
https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html

3 WebMagic framework (Java)

3.1 Preface

Summarizing the problems encountered while using Scrapy, and considering the requirements for later bringing the crawler system online, the crawler was redesigned and developed in Java. The specific reasons are as follows:

(1) Dependence on the online infrastructure: the crawler needs to use in-house infrastructure such as Clover, JimDB, and MySQL;

(2) Strong extensibility: on top of the existing framework, the Request object is wrapped a second time to support a general-purpose web crawler, with easy-to-extend interfaces for address generation and web-page parsing.

(3) Centralized deployment: a single deployment of the general-purpose crawler handles all supported sites, solving the one-site-one-deployment problem of the Scrapy setup.

(4) Countering anti-crawler measures: some sites (for example, Tmall) detect the characteristics of Scrapy framework requests and reject all such crawler requests; WebMagic simulates browser requests and is not blocked in this way.

3.2 Overview of WebMagic

(Source: https://webmagic.io/docs/zh/posts/ch1-overview/architecture.html)

3.2.1 Overall Architecture

WebMagic is structured around four components: Downloader, PageProcessor, Scheduler, and Pipeline, which the Spider organizes together. These four components correspond to the download, processing, management, and persistence functions of the crawler life cycle. WebMagic's design is inspired by Scrapy, but the implementation is more Java-like.

The Spider organizes these components so that they can interact with one another and execute as a pipeline. The Spider can be thought of as a large container; it is also the core of WebMagic's logic.

The overall architecture of WebMagic is as follows:

 

3.2.2 Four components of WebMagic

3.2.2.1 Downloader

Downloader is responsible for downloading pages from the Internet for subsequent processing. WebMagic uses Apache HttpClient as the download tool by default.

3.2.2.2 PageProcessor

PageProcessor is responsible for parsing pages, extracting useful information, and discovering new links. WebMagic uses Jsoup as its HTML parsing tool and has developed Xsoup, an XPath parsing tool, on top of it.

Among these four components, PageProcessor is different for each page of each site, and is a part that needs to be customized by users.

3.2.2.3 Scheduler

Scheduler is responsible for managing the URLs to be crawled and for some deduplication work. By default, WebMagic provides a JDK in-memory queue to manage URLs and uses a set for deduplication. Distributed management with Redis is also supported.

Unless the project has special distributed requirements, there is no need to customize the Scheduler.

3.2.2.4 Pipeline

Pipeline is responsible for the processing of the extracted results, including calculation, persistence to files, databases, etc. By default, WebMagic provides two result processing schemes, "output to console" and "save to file".

Pipeline defines the way to save the results. If you want to save to a specified database, you need to write the corresponding Pipeline. For a class of requirements, generally only one Pipeline needs to be written.

3.2.3 Objects for data flow

3.2.3.1 Request

Request is a layer of encapsulation of URL address, and a Request corresponds to a URL address.

It is the carrier for the interaction between PageProcessor and Downloader, and it is also the only way for PageProcessor to control Downloader.

In addition to the URL itself, it contains a Key-Value field named extra. Special attributes can be stored in extra and read elsewhere to implement different functions, for example carrying over some information from the previous page.

3.2.3.2 Page

Page represents a page downloaded from Downloader—it may be HTML, JSON or other text format content.

Page is the core object of WebMagic's extraction process, and it provides some methods for extraction and result preservation. In the examples in Chapter 4, we will introduce its use in detail.

3.2.3.3 ResultItems

ResultItems is equivalent to a Map: it holds the results produced by the PageProcessor for the Pipeline to use, and its API is very similar to Map's. It is worth noting that it has a field skip; if set to true, the item will not be processed by the Pipeline.

3.2.4 The engine that controls the crawler's operation: Spider

Spider is the core of WebMagic's internal flow. Downloader, PageProcessor, Scheduler, and Pipeline are all attributes of the Spider; these attributes can be set freely, and different functionality is obtained by configuring them. Spider is also the entry point of a WebMagic run: it encapsulates crawler creation, start, stop, multi-threading, and other functions. Below is an example that sets the various components, configures multi-threading, and starts the crawler. For detailed Spider settings, see the WebMagic documentation's chapter on Spider configuration, startup, and termination.

public static void main(String[] args) {
    Spider.create(new GithubRepoPageProcessor())
            // start crawling from https://github.com/code4craft
            .addUrl("https://github.com/code4craft")
            // set the Scheduler, using Redis to manage the URL queue
            .setScheduler(new RedisScheduler("localhost"))
            // set the Pipeline, saving results to files in JSON format
            .addPipeline(new JsonFilePipeline("D:\\data\\webmagic"))
            // run 5 threads concurrently
            .thread(5)
            // start the crawler
            .run();
}

 

3.3 General crawler analysis and design

3.3.1 Analysis of general crawler functions

(1) A single application supports data capture from multiple sites at the same time;

(2) Support cluster deployment;

(3) Be easy to extend;

(4) Support repeated crawling;

(5) Support scheduled crawling;

(6) Provide the ability to extend toward big data analysis;

(7) Reduce the complexity of integrating big data analysis and improve code reusability;

(8) Support online deployment.

3.3.2 General crawler design

The general crawler design idea is to customize the Scheduler, Processor, and Pipeline on the basis of WebMagic, as shown in the following figure:

 

In the design process, based on the characteristics of crawler development, the crawler implementation is divided into two stages: request generation and content parsing.

(1) Request generation (UrlParser): according to each site's request address and parameter characteristics (e.g., GET/POST method, URL parameter concatenation) and business needs (e.g., whether to use a domestic or foreign proxy), site parameters are assembled into a general Request object that guides the Downloader in fetching web pages.

(2) Content parsing (HtmlParser): page content is extracted via XPath, JSON, and so on. Each content parser handles only one kind of page content; when the content calls for deeper page crawling, a new request object is generated through UrlParser and returned to the scheduler.

3.3.3 Task scheduling design

To make the crawler distributed, the Scheduler's role is reduced, and Clover, Worker, and Redis are added to the pipeline. Clover periodically schedules Workers to generate the default request objects (typically for search functions, recurring home-page crawls, and the like) and pushes them onto a Redis queue; the Scheduler is only responsible for fetching request addresses from that queue.

3.3.4 Processor design

The Processor parses the web content downloaded by the Downloader. To make full use of server network and computing resources, the design allows page downloading and content parsing to be split into separate services, so that crawler nodes that spend too much CPU time do not waste network bandwidth. Two modes were therefore designed: parsing inside the crawler, and integration with an external platform.

(1) In-crawler parsing: the downloaded content is parsed directly by the Processor to produce Items and deeper Request objects. To simplify parsing content from multiple sites, the Processor is mainly responsible for organizing the data structures and invoking HtmlParser, and the HtmlParsers of the various sites are integrated through Spring IoC.

(2) External platform integration: the content captured by the crawler can be handed to other platforms or services through MQ or similar mechanisms. In this implementation, the captured page content is organized as text and sent to JMQ via a Pipeline; other services and platforms then integrate through JMQ and can reuse HtmlParser and UrlParser to complete the content parsing.

3.3.5 Pipeline design

Pipeline is mainly used for data dumping. To support the two Processor modes, two implementations are designed: MySQLPipeline and JMQPipeline.

3.4 General crawler implementation

3.4.1 Request

The Request class provided by WebMagic covers the basic needs of a network request: the URL, request method, cookies, headers, and so on. To support general-purpose requests, the existing request object is extended with fields such as whether to deduplicate, a deduplication token, the request-header type (PC/APP/WAP), the proxy-IP country, and the number of failed retries. The extension is as follows:

/**
 * site
 */
private String site;
/**
 * type
 */
private String type;
 
/**
 * whether to filter (deduplicate); default: TRUE
 */
private Boolean filter = Boolean.TRUE;
/**
 * unique token, used for URL deduplication
 */
private String token;
/**
 * parser name
 */
private String htmlParserName;
 
/**
 * whether to fill in header information
 */
private Integer headerType = HeaderTypeEnums.NONE.getValue();
 
/**
 * country type, used to decide which kind of proxy to use
 * defaults to domestic (CN)
 */
private Integer nationalType = NationalityEnums.CN.getValue();
 
/**
 * maximum crawl depth, used to limit how deep list pages are drilled into; decreases with each visit
 * <p>Default: 1</p>
 * <p> depth = depth - 1</p>
 */
private Integer depth = 1;
 
/**
 * number of retries after failure
 */
private Integer failedRetryTimes;

3.4.2 UrlParser & HtmlParser

3.4.2.1 UrlParser implementation

UrlParser is mainly used to generate well-formed request objects from a parameter list. To simplify Worker development, a method that generates the initial requests is also added to the interface.

/**
 * URL address conversion
 * @author liwanfeng1
 */
public interface UrlParser {
    /**
     * Get the list of initial request objects for the scheduled task
     * @return list of request objects
     */
    List<SeparateRequest> getStartRequest();
 
    /**
     * Generate a Request object from the given parameters
     * @param params request parameters
     * @return request object
     */
    SeparateRequest parse(Map<String, Object> params);
}

3.4.2.2 HtmlParser implementation

HtmlParser parses the content downloaded by the Downloader and returns the data list together with Request objects for deeper crawling. It is implemented as follows:

/**
 * HTML content conversion
 * @author liwanfeng1
 */
public interface HtmlParser {
    /**
     * Parse the HTML
     * @param html the fetched page content
     * @param request the Request object of the network request
     * @return the parsed data result
     */
    HtmlDataEntity parse(String html, SeparateRequest request);
}
  
/**
 * @author liwanfeng1
 * @param <T> data type
 */
@Data
@AllArgsConstructor
@NoArgsConstructor
public class HtmlDataEntity<T extends Serializable> {
    private List<T> data;
    private List<SeparateRequest> requests;
 
    /**
     * Add a data object
     * @param obj data object
     */
    public void addData(T obj){
        if(data == null) {
            data = new ArrayList<>();
        }
        data.add(obj);
    }
 
    /**
     * Add a Request object
     * @param request request object
     */
    public void addRequest(SeparateRequest request) {
        if(requests == null) {
            requests = new ArrayList<>();
        }
        requests.add(request);
    }
}

3.4.3 Worker

The Worker's role is to generate request objects on a schedule. Combined with the UrlParser interface, a unified worker-task implementation class is designed; the code is as follows:

/**
 * Common scheduled task
 * @author liwanfeng1
 */
@Slf4j
@Data
public class CommonTask extends AbstractScheduleTaskProcess<SeparateRequest> {
 
    private UrlParser urlParser;
 
    private SpiderQueue spiderQueue;
    /**
     * Get the task list
     * @param taskServerParam parameter list
     * @param i index
     * @return task list
     */
    @Override
    protected List<SeparateRequest> selectTasks(TaskServerParam taskServerParam, int i) {
        return urlParser.getStartRequest();
    }
 
    /**
     * Execute the task list: assemble the Google API request addresses and add them to the YouTube list crawler's queue
     * @param list task list
     */
    @Override
    protected void executeTasks(List<SeparateRequest> list) {
        spiderQueue.push(list);
    }
}
 

Add Worker configuration as follows:


<!-- Facebook start -->
<bean id="facebookTask" class="com.jd.npoms.worker.task.CommonTask">
    <property name="urlParser">
        <bean class="com.jd.npoms.spider.urlparser.FacebookUrlParser"/>
    </property>
    <property name="spiderQueue" ref="jimDbQueue"/>
</bean>
 
<jsf:provider id="facebookTaskProcess"
              interface="com.jd.clover.schedule.IScheduleTaskProcess" ref="facebookTask"
              server="jsf" alias="woker:facebookTask">
</jsf:provider>
<!-- Facebook end -->

3.4.4 Scheduler

The scheduler is mainly used to push and pop the latest tasks, along with auxiliary methods such as duplicate checking on push and fetching the queue length. The interface is designed as follows:

/**
 * Crawler task queue
 * @author liwanfeng
 */
public interface SpiderQueue {
 
    /**
     * Add a batch to the queue
     * @param params list of crawler addresses
     */
    default void push(List<SeparateRequest> params) {
        if (params == null || params.isEmpty()) {
            return;
        }
        params.forEach(this::push);
    }
 
 
    /**
     * Push a SeparateRequest address onto the Redis queue
     * @param separateRequest SeparateRequest address
     */
    void push(SeparateRequest separateRequest);
 
    /**
     * Pop from the queue
     * @return SeparateRequest object
     */
    Request poll();
 
 
    /**
     * Check whether the separateRequest is a duplicate
     * @param separateRequest wrapped crawler URL address
     * @return whether it is a duplicate
     */
    boolean isDuplicate(SeparateRequest separateRequest);
 
    /**
     * Default token generation for a URL address<br/>
     * Different UrlParsers are advised to generate shorter tokens based on the characteristics of the site's addresses;
     * by default, the site, type, and URL are joined with underscores
     * @param separateRequest wrapped crawler URL address
     */
    default String generalToken(SeparateRequest separateRequest) {
        return separateRequest.getSite() + "_" + separateRequest.getType() + "_" + separateRequest.getUrl();
    }
 
    /**
     * Get the total length of the queue
     * @return queue length
     */
    Long getQueueLength();
}

4 Browser-driven crawling (Python)

Browser-driven crawling relies mainly on Selenium and ChromeDriver: a local browser is invoked to load and render the page, and the data is then extracted from it. This approach mainly targets complex sites. Some sites raise the difficulty of code analysis through process splitting, logic encapsulation, code splitting, and code obfuscation, combined with anti-crawling measures such as request splitting, data encryption, and client behavior analysis, so that a crawler program cannot simply simulate the client's requests to the server.

This method is mainly used for the SF Express tracking-number query. The order query uses the Tencent sliding-CAPTCHA plug-in for human-machine verification. The basic process is shown in the figure below:

 

First install the ChromeDriver component on the operating system. Download it from
https://chromedriver.storage.googleapis.com/index.html?path=2.44/ and save the file to any directory listed in the system PATH environment variable (suggested: C:\Windows\system32).

To verify the installation, open a command-line window and run "ChromeDriver.exe" from any path; the program should start and run as a service.

The crawler implementation process is as follows:

(1) Start the browser: to support concurrent crawling, the browser needs to be tuned through startup parameters;

from selenium import webdriver


def getChromeDriver(index):
    options = webdriver.ChromeOptions()
    options.add_argument("--incognito")  # incognito mode
    options.add_argument("--disable-infobars")  # hide the info bar
    options.add_argument("--reset-variation-state")  # reset variation state
    options.add_argument("--process-per-tab")  # run each tab in its own process
    options.add_argument("--disable-plugins")  # disable all plugins
    options.add_argument("--headless")  # run without a visible window
    proxy = getProxy()  # getProxy() is a project helper that returns a proxy dict or None
    if proxy is not None:
        options.add_argument("--proxy-server=http://%s:%s" % (proxy["host"], proxy["port"]))  # add a proxy
 
    return webdriver.Chrome(chrome_options=options)
 
 

(2) Load the page: direct the browser to the specified address and wait for the page to finish loading;

driver.get('http://www.sf-express.com/cn/sc/dynamic_function/waybill/#search/bill-number/' + bill_number)
driver.implicitly_wait(20)  # wait up to 20 seconds for the page to load
 

(3) Switch frames: the CAPTCHA is loaded into the current page as an iframe, so before operating on its elements the driver must be switched to that iframe;

driver.switch_to.frame("tcaptcha_popup")
driver.implicitly_wait(10)  # wait for the switch to complete; the iframe may load with a delay

(4) Slide the block: drag the slider on the page to the required position to complete verification;

The sliding verification code operation process is shown in the figure below:

 

 

The slider needs to travel about 240 pixels. The sliding process is sampled 14 times, simulating a parabolic speed profile to control the sliding speed, and is divided into 20 moves (so that the sampling result is not identical each time). The analysis is shown in the figure below:

 

 

The code is as follows:

# generate the slider drag trajectory (currently a fixed sample)
def randomMouseTrace():
    trace = MouseTrace()  # MouseTrace is a simple project helper holding x/y offsets and pause times
    trace.x = [20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 9, 9, 8, 8, 7, 7, 6, 6, 5, 5]
    trace.y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    trace.time = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
    return trace
 
 
# drag the slider
def dragBar(driver, action):
    dragger = driver.find_element_by_id("tcaptcha_drag_button")
    action.click_and_hold(dragger).perform()  # press and hold the left mouse button
 
    action.reset_actions()
    trace = randomMouseTrace()
    for index in range(trace.length()):
        action.move_by_offset(trace.x[index], trace.y[index]).perform()  # move by one offset
        action.reset_actions()
        time.sleep(trace.time[index])  # pause between moves
 
    action.release().perform()  # release the left mouse button
    action.reset_actions()
 
    return driver.find_element_by_id("tcaptcha_note").text == ""

(5) Parse and store the data: the parsing step mainly locates elements by id or class to obtain their text content; the results are then inserted into the database, completing the data capture.
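A minimal sketch of this step (the element class name, table name, and connection details are all placeholders; the real selectors must be taken from the SF Express results page):

import pymysql


def parse_and_save(driver, bill_number):
    # locate the route entries by class name (placeholder selector)
    rows = driver.find_elements_by_class_name("route-item")
    records = [(bill_number, row.text) for row in rows]

    conn = pymysql.connect(host="localhost", user="crawler", password="secret", database="crawler")
    try:
        with conn.cursor() as cursor:
            cursor.executemany(
                "INSERT INTO sf_waybill_route (bill_number, route_text) VALUES (%s, %s)",
                records,
            )
        conn.commit()
    finally:
        conn.close()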

(6) Others: Python's threading module is used for multi-threaded execution, and the tracking numbers are stored in Redis to implement distributed task acquisition, with one tracking number popped per run.
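A minimal sketch of that multi-threaded, Redis-driven setup (the queue key and the crawl_one helper are hypothetical; getChromeDriver is the startup function shown above):

import threading

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
QUEUE_KEY = "spider:sf:bill_numbers"  # hypothetical Redis list holding tracking numbers


def worker(index):
    driver = getChromeDriver(index)  # reuse the browser-startup helper above
    while True:
        bill_number = r.lpop(QUEUE_KEY)  # pop one tracking number per iteration
        if bill_number is None:
            break  # queue drained, stop this thread
        crawl_one(driver, bill_number.decode())  # page load + CAPTCHA + parsing steps above
    driver.quit()


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()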

To be improved:

(1) The slider's speed and timing are fixed and could be randomized;

(2) The target release position of the slider is not detected; currently the CAPTCHA is refreshed and the slide retried, so there is a certain error rate;

(3) When the proxy IP does not need to be switched, browser startup should be optimized to reduce the number of launches and increase crawling speed.

5 gocolly framework (Go)

Concurrent multi-threaded execution is one of the strengths of the Go language, which implements concurrency through goroutines ("coroutines"). When a running goroutine blocks on I/O, the blocked task is handed off to a dedicated goroutine, so fewer server resources are required and crawling efficiency also improves.

(The following content is reproduced from: https://www.jianshu.com/p/23d4ecb8428f )

5.1 Overview

gocolly is a web crawler framework implemented in Go. It is fast and elegant, able to issue more than 1K requests per second on a single core; it provides a set of callback-style interfaces that can implement any type of crawler; and, relying on the goquery library, it can select web elements the way jQuery does.

The official website of gocolly is http://go-colly.org/ , which provides detailed documentation and sample code.

5.2 Installation configuration

Install

go get -u github.com/gocolly/colly/

import package

import "github.com/gocolly/colly"

5.3 Process description

5.3.1 Use process

The usage workflow mainly covers the preparation needed before using colly to crawl data:

  • Initialize the Collector object, which is colly's global handle
  • Apply the global settings, mainly the proxy and other options on the colly handle
  • Register the crawl callbacks, which extract data and trigger other operations at each stage of processing once data has been fetched
  • Set up auxiliary tools, such as the queue that stores links to be crawled, a data cleaning queue, and so on
  • Register the links to crawl
  • Start the program and begin crawling

5.3.2 Capture process

At each stage of the crawl, the user-registered callbacks are triggered to complete data extraction and other work. The crawl proceeds as follows.

  • Before each fetch, the registered OnRequest callback is called for per-request preprocessing
  • If the fetch fails, OnError is called for error handling
  • After the data is fetched, OnResponse is called to handle the freshly downloaded response
  • The fetched data is then parsed, and OnHTML callbacks are triggered for the matching DOM nodes on the page
  • After parsing is complete, OnScraped is called to do the finishing work for each crawl

5.4. Auxiliary interface

Colly also provides some auxiliary interfaces to assist in completing the data capture and analysis process. Some of the main supports are listed below.

  • queue is used to store links waiting to be crawled
  • proxy is used to crawl through proxy sources
  • thread supports concurrent processing with multiple goroutines
  • filter supports filtering out special links
  • depth can be set to control the crawl depth

5.5. Examples

For more examples, please refer to the source link ( https://github.com/gocolly/colly/tree/master/_examples )
