Case Development || Scheduler Component Analysis

We are now ready to use WebMagic to implement the data-crawling functionality. Here is a more complete implementation.

What we build here is a focused web crawler: it crawls only recruitment-related data.



Business Analysis

Today's goal is to crawl job postings from https://www.51job.com/, limited to two industries: "Computer Software" and "Internet/E-commerce".

First, open the page and search within these two industries. The results are as follows:

Clicking through to a job details page, we find the following data to be extracted:

Job title, company name, work location, salary, publish date, job description, company address, and company information.
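These fields map naturally onto an entity class. A minimal sketch; public fields keep it short, and the field names are our own naming mirroring the database columns below, not taken from any existing code:

```java
// Sketch of an entity holding one scraped job posting. A real entity class
// would use private fields with getters/setters; public fields are used here
// only to keep the illustration compact.
public class JobInfo {
    public Long id;
    public String companyName;  // company name
    public String companyAddr;  // company contact information
    public String companyInfo;  // company information
    public String jobName;      // job title
    public String jobAddr;      // work location
    public String jobInfo;      // job description
    public Integer salaryMin;   // salary range, minimum
    public Integer salaryMax;   // salary range, maximum
    public String url;          // job posting details page
    public String time;         // most recent publish date
}
```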





Database Table

CREATE TABLE `job_info` (
  `id` BIGINT(20) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `company_name` VARCHAR(100) DEFAULT NULL COMMENT 'company name',
  `company_addr` VARCHAR(200) DEFAULT NULL COMMENT 'company contact information',
  `company_info` TEXT COMMENT 'company information',
  `job_name` VARCHAR(100) DEFAULT NULL COMMENT 'job title',
  `job_addr` VARCHAR(50) DEFAULT NULL COMMENT 'work location',
  `job_info` TEXT COMMENT 'job description',
  `salary_min` INT(10) DEFAULT NULL COMMENT 'salary range, minimum',
  `salary_max` INT(10) DEFAULT NULL COMMENT 'salary range, maximum',
  `url` VARCHAR(150) DEFAULT NULL COMMENT 'job posting details page',
  `time` VARCHAR(10) DEFAULT NULL COMMENT 'most recent publish date of the job',
  PRIMARY KEY (`id`)
) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='job postings';
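The `salary_min` and `salary_max` columns imply that the salary string scraped from the page must be split into numeric bounds. A hedged sketch; the input formats below (e.g. "0.8-1.2万/月") are assumptions about what 51job displays, and real pages use more variants than these:

```java
// Sketch: parse a 51job-style salary string such as "0.8-1.2万/月" into
// minimum and maximum monthly salary in yuan. The recognized formats are
// assumptions; unrecognized text (e.g. "面议", negotiable) yields null.
public class SalaryParser {

    /** Returns {min, max} in yuan per month, or null if the format is unknown. */
    public static int[] parse(String text) {
        if (text == null) return null;
        int scale;                // multiplier for the numeric part
        int monthsPerPeriod = 1;  // 12 when the figure is per year
        if (text.endsWith("万/月"))      { scale = 10000; }
        else if (text.endsWith("千/月")) { scale = 1000; }
        else if (text.endsWith("万/年")) { scale = 10000; monthsPerPeriod = 12; }
        else return null;

        // Strip the 3-character unit suffix, then split the numeric range.
        String range = text.substring(0, text.length() - 3);
        String[] parts = range.split("-");
        double lo = Double.parseDouble(parts[0]);
        double hi = parts.length > 1 ? Double.parseDouble(parts[1]) : lo;
        return new int[] {
            (int) Math.round(lo * scale / monthsPerPeriod),
            (int) Math.round(hi * scale / monthsPerPeriod)
        };
    }
}
```

The result is normalized to yuan per month so that both columns store comparable values regardless of how the posting expressed the range.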


Implementation Process

We need to parse the job listings page to obtain the URLs of the job details pages, then parse each details page to extract the data.
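In the crawler's page-processing logic, this means deciding for each crawled URL whether it is a listings page or a details page, typically with regular expressions. A minimal sketch; the URL patterns below are assumptions about 51job's URL structure, not taken from the site's documentation:

```java
import java.util.regex.Pattern;

// Sketch: classify a crawled URL as a search-result (listings) page or a
// job-details page. Both patterns are assumptions for illustration only.
public class UrlClassifier {
    private static final Pattern LIST_PAGE =
            Pattern.compile("https?://search\\.51job\\.com/list/.*");
    private static final Pattern DETAIL_PAGE =
            Pattern.compile("https?://jobs\\.51job\\.com/.+/\\d+\\.html.*");

    public static boolean isListPage(String url)   { return LIST_PAGE.matcher(url).matches(); }
    public static boolean isDetailPage(String url) { return DETAIL_PAGE.matcher(url).matches(); }
}
```

A listings page would then have its detail links extracted and scheduled, while a details page would have its fields parsed and saved.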

The process for obtaining the URL addresses is as follows:

There is a problem here, though: when parsing a page, we are likely to extract the same URL more than once (for example, the job title and the job image may both link to the same URL). If left untreated, the same URL would be processed many times, wasting resources. We therefore need URL deduplication.



Scheduler Component

WebMagic provides the Scheduler component to solve exactly this problem.

Scheduler is the WebMagic component that manages URLs. In general, a Scheduler has two responsibilities:

  • Managing the queue of URLs to be crawled.

  • Deduplicating the URLs to be crawled.

WebMagic ships with several commonly used Scheduler implementations. If you are just running a relatively small crawler locally, there is basically no need for a custom Scheduler, but it is still worth looking at the ones already provided.

Deduplication itself is abstracted into a separate interface, DuplicateRemover, so a Scheduler can be combined with different deduplication strategies to suit different needs. Two deduplication strategies are currently provided.

RedisScheduler uses a Redis set for deduplication; the other Schedulers use HashSetDuplicateRemover for deduplication by default.
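To illustrate the two responsibilities and how HashSet-style deduplication plugs into a URL queue, here is a minimal pure-Java sketch. This mirrors the idea behind WebMagic's Scheduler/DuplicateRemover split but is not WebMagic's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Minimal sketch of a scheduler: a FIFO queue of URLs to crawl, plus a
// HashSet that rejects URLs that were already scheduled (deduplication).
public class SimpleScheduler {
    private final Queue<String> queue = new ArrayDeque<>(); // URLs waiting to be crawled
    private final Set<String> seen = new HashSet<>();       // HashSet-based dedup

    /** Enqueue a URL unless it has been seen before. Returns true if accepted. */
    public boolean push(String url) {
        if (!seen.add(url)) return false; // duplicate: already scheduled or crawled
        queue.add(url);
        return true;
    }

    /** Next URL to crawl, or null when the queue is empty. */
    public String poll() {
        return queue.poll();
    }
}
```

A HashSet is simple and exact, but it keeps every URL string in memory, which is why very large crawls prefer a Bloom filter instead.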

If you use a Bloom filter for deduplication, you must add the following dependency:

<!-- WebMagic's Bloom filter support depends on Guava -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>16.0</version>
</dependency>
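To see why a Bloom filter is preferred over a HashSet for very large URL sets, here is a from-scratch toy version. WebMagic's Bloom filter deduplication is backed by Guava's implementation; this sketch is only for illustrating the trade-off (fixed memory, no false negatives, but a small false-positive rate):

```java
import java.util.BitSet;

// Toy Bloom filter: k bit positions are derived from each string; add() sets
// them, mightContain() checks them. A cleared bit proves absence; all bits
// set means "probably present" (false positives are possible).
public class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public ToyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th index from two base hashes (double hashing).
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String s) {
        for (int i = 0; i < hashes; i++) bits.set(index(s, i));
    }

    /** false = definitely not present; true = probably present. */
    public boolean mightContain(String s) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(s, i))) return false;
        }
        return true;
    }
}
```

Unlike the HashSet approach, memory use is fixed by the bit array size no matter how many URLs are added, at the cost of occasionally treating a new URL as already seen.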

 


Origin blog.csdn.net/qq_39368007/article/details/105047966