Business Analysis
Crawl job postings from www.51job.com, limited to two industries: "Computer Software" and "Internet/e-commerce".
1. Query the search page and extract the detail-page URLs from the result list
2. Visit each detail page and extract the required data
Storing the data
Create a database and a table to store the extracted data.
Implementation process
Start -> list page -> extract URLs -> add URLs to the task queue -> crawl detail pages and save the data -> End
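The flow above can be sketched in plain Java without WebMagic. The tiny site model below (`list-1`, `detail-1`, `detail-2`) is made up purely for illustration: list pages feed detail URLs into the task queue, and the crawl ends when the queue drains.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class CrawlFlowSketch {

    /** Breadth-first crawl: list pages enqueue their links; any other url is a detail page. */
    public static List<String> crawl(Map<String, List<String>> listPages, String startUrl) {
        Queue<String> taskQueue = new ArrayDeque<>();
        taskQueue.add(startUrl);                      // Start -> list page
        List<String> saved = new ArrayList<>();
        while (!taskQueue.isEmpty()) {
            String url = taskQueue.poll();
            if (listPages.containsKey(url)) {
                taskQueue.addAll(listPages.get(url)); // extract urls -> add to task queue
            } else {
                saved.add(url);                       // detail page -> extract and store
            }
        }
        return saved;                                 // queue empty -> End
    }

    public static void main(String[] args) {
        // hypothetical site: one list page linking to two detail pages
        Map<String, List<String>> site =
                Collections.singletonMap("list-1", Arrays.asList("detail-1", "detail-2"));
        System.out.println(crawl(site, "list-1"));    // prints [detail-1, detail-2]
    }
}
```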
Scheduler Component
When parsing pages, the same URL address is often extracted more than once. Without deduplication, the same URL would be parsed and processed repeatedly, wasting resources, so a URL deduplication mechanism is needed.
Scheduler is the WebMagic component that manages URLs. It has two functions:
1. managing the queue of URLs to be crawled;
2. deduplicating URLs that have already been crawled.
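These two responsibilities can be sketched in plain Java, independent of WebMagic's API: a FIFO queue for the URLs still to crawl plus a HashSet of URLs already accepted, which is the same idea as QueueScheduler combined with the default HashSetDuplicateRemover. `SimpleScheduler` and its method names are illustrative, not part of WebMagic.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleScheduler {
    private final Queue<String> queue = new ArrayDeque<>(); // urls waiting to be crawled
    private final Set<String> seen = new HashSet<>();       // urls already accepted

    /** Add a url; duplicates are silently dropped. Returns true if the url was new. */
    public boolean push(String url) {
        if (!seen.add(url)) {
            return false; // already seen: skip, so the same url is never parsed twice
        }
        return queue.add(url);
    }

    /** Next url to crawl, or null when the queue is empty. */
    public String poll() {
        return queue.poll();
    }

    /** Number of urls still waiting in the queue. */
    public int size() {
        return queue.size();
    }
}
```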
- WebMagic ships with several common Schedulers. For a small crawl that runs on a single machine, a custom Scheduler is usually unnecessary:
  - DuplicateRemovedScheduler: abstract base class that provides a template method
  - QueueScheduler: keeps the URLs to be crawled in an in-memory queue (limited by memory; can easily cause a memory overflow)
  - FileCacheQueueScheduler: saves the URLs to files, so the program can be stopped and, on the next start, resume crawling from the previously saved URLs (a path must be specified; two files, .urls.txt and .cursor.txt, are created)
  - PriorityScheduler: uses an in-memory priority queue to hold the URLs to be crawled
  - RedisScheduler: uses Redis to hold the crawl queue, so multiple machines can crawl cooperatively at the same time (Redis must be installed and running)
- Deduplication is abstracted into a separate interface, DuplicateRemover, so the same Scheduler can be combined with different deduplication strategies to suit different needs. Two implementations are currently provided:
  - HashSetDuplicateRemover (default): deduplicates with a HashSet; memory usage is relatively high
  - BloomFilterDuplicateRemover: deduplicates with a BloomFilter; memory usage is relatively low, but some pages may be skipped
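To see why a Bloom filter saves memory but may skip pages, here is a toy version in plain Java (this is not WebMagic's BloomFilterDuplicateRemover, just an illustration): a fixed-size bit set and two hash functions. If every bit for a brand-new URL happens to be set already by earlier URLs, the new URL is wrongly reported as seen, and that page would be skipped.

```java
import java.util.BitSet;

public class ToyBloomFilter {
    private final BitSet bits; // fixed memory footprint, regardless of url count
    private final int size;

    public ToyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // two cheap hash functions derived from String.hashCode
    private int h1(String s) { return Math.floorMod(s.hashCode(), size); }
    private int h2(String s) { return Math.floorMod(s.hashCode() * 31 + 7, size); }

    /** Returns true if the url was possibly seen before (may be a false positive). */
    public boolean mightContain(String url) {
        return bits.get(h1(url)) && bits.get(h2(url));
    }

    /** Mark the url as seen by setting its bits. */
    public void add(String url) {
        bits.set(h1(url));
        bits.set(h2(url));
    }
}
```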
If you use the BloomFilter, the following dependency must be added:

<!-- Bloom filter support for WebMagic -->
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>18.0</version>
</dependency>
Code
1. Import the dependencies (pom.xml)
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.0.2.RELEASE</version>
    </parent>
    <groupId>com.xiaojian</groupId>
    <artifactId>crawler-jobinfo</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <!-- Spring MVC -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!-- Spring Data JPA -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <!-- MySQL connector -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.17</version>
        </dependency>
        <!-- WebMagic core -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- WebMagic extension -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>
        <!-- Bloom filter support (Guava) -->
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>18.0</version>
        </dependency>
        <!-- Utilities -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
        </dependency>
    </dependencies>
</project>
2. application.properties configuration file
# DB Configuration:
spring.datasource.driverClassName=com.mysql.cj.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/db_crawler?serverTimezone=GMT%2B8&useUnicode=true&characterEncoding=UTF-8&zeroDateTimeBehavior=convertToNull
spring.datasource.username=root
spring.datasource.password=243600
# JPA Configuration:
spring.jpa.database=mysql
spring.jpa.show-sql=true
3. Write the classes: pojo, dao, service, and the bootstrap class
pojo (entity)
@Entity
@Table(name = "t_jobinfo")
public class JobInfo {

    // primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // company name
    private String companyName;

    // ... getters, setters, toString ...
}
dao
public interface JobInfoDao extends JpaRepository<JobInfo,Long> { }
service
public interface JobInfoService {

    /**
     * Save a job posting
     * @param jobInfo
     */
    void save(JobInfo jobInfo);

    /**
     * Query job postings matching the example conditions
     * @param jobInfo
     * @return matching records
     */
    List<JobInfo> findJobInfo(JobInfo jobInfo);
}
serviceImpl
@Service
public class JobInfoServiceImpl implements JobInfoService {

    @Resource
    private JobInfoDao jobInfoDao;

    @Override
    @Transactional
    public void save(JobInfo jobInfo) {
        // look up an existing record by url and publication time
        JobInfo param = new JobInfo();
        param.setUrl(jobInfo.getUrl());
        param.setTime(jobInfo.getTime());
        List<JobInfo> list = this.findJobInfo(param);
        // insert only when no matching record exists, i.e. the posting is new or was updated
        if (list.size() == 0) {
            this.jobInfoDao.saveAndFlush(jobInfo);
        }
    }

    @Override
    public List<JobInfo> findJobInfo(JobInfo jobInfo) {
        // build the query-by-example conditions
        Example<JobInfo> example = Example.of(jobInfo);
        return jobInfoDao.findAll(example);
    }
}
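The save logic (insert only when no record with the same url and publication time exists) can be exercised without a database. The sketch below uses an in-memory map keyed on url plus time as a hypothetical stand-in for the query-by-example lookup; `DedupSaveSketch` is an illustrative name, not part of the project above.

```java
import java.util.HashMap;
import java.util.Map;

public class DedupSaveSketch {
    // key: url + "|" + time; value: company name (stand-in for the full record)
    private final Map<String, String> store = new HashMap<>();

    /** Save only when no record with the same url and time exists. Returns true if inserted. */
    public boolean save(String url, String time, String companyName) {
        String key = url + "|" + time;
        if (store.containsKey(key)) {
            return false; // same posting already stored: skip the insert
        }
        store.put(key, companyName);
        return true;
    }

    /** Number of stored records. */
    public int count() {
        return store.size();
    }
}
```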
Bootstrap class
@SpringBootApplication
// @EnableScheduling is required because the crawl is triggered by a scheduled task
@EnableScheduling
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}
JobProcessor.java
@Component
public class JobProcessor implements PageProcessor {

    private String url = "https://search.51job.com/list/030200%252C110200,000000,0000,01%252C32,9,99,java,2,1.html"
            + "?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary"
            + "=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";

    @Resource
    private SpringDataPipeline springDataPipeline;

    @Override
    public void process(Page page) {
        // parse the page and select the job entries from the result list
        List<Selectable> list = page.getHtml().css("div#resultList div.el").nodes();
        if (list.size() == 0) {
            // the list is empty: this is a detail page, so save the job information
            this.saveJobInfo(page);
        } else {
            // not empty: this is a list page; parse the detail-page urls and enqueue them
            for (Selectable s : list) {
                // get the url and push it onto the task queue
                String link = s.links().toString();
                page.addTargetRequest(link);
            }
            // get the url of the next list page and enqueue it as well
            String nextUrl = page.getHtml().css("div.p_in li.bk").nodes().get(1).links().toString();
            page.addTargetRequest(nextUrl);
        }
    }

    private Site site = Site.me()
            .setTimeOut(10 * 1000)    // timeout
            .setCharset("GBK")        // page encoding
            .setRetryTimes(3)         // retry count
            .setRetrySleepTime(3000); // retry interval

    @Override
    public Site getSite() {
        return site;
    }

    /**
     * Save the job detail information
     * @param page
     */
    private void saveJobInfo(Page page) {
        Html html = page.getHtml();
        // encapsulate the extracted data in a JobInfo object
        JobInfo jobInfo = new JobInfo();
        // company name
        jobInfo.setCompanyName(html.css("div.cn p.cname a", "text").toString());
        // ... extract the remaining fields the same way, as needed
        // put the result into the page so the pipeline can store it
        page.putField("JobInfo", jobInfo);
    }

    // initialDelay: wait this long after startup before the first run
    // fixedDelay: interval between consecutive runs
    @Scheduled(initialDelay = 1000, fixedDelay = 1000 * 600)
    public void processor() {
        Spider.create(new JobProcessor())
                .addUrl(url)
                .setScheduler(new QueueScheduler()
                        .setDuplicateRemover(new BloomFilterDuplicateRemover(10000)))
                .addPipeline(springDataPipeline) // store the results in the database
                .thread(10)
                .run();
    }
}
SpringDataPipeline.java
@Component
public class SpringDataPipeline implements Pipeline {

    @Resource
    private JobInfoService jobInfoService;

    @Override
    public void process(ResultItems resultItems, Task task) {
        // get the JobInfo object that the processor put into the page
        JobInfo jobInfo = resultItems.get("JobInfo");
        // list pages put nothing, so skip null results
        if (jobInfo != null) {
            this.jobInfoService.save(jobInfo);
        }
    }
}
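The processor and the pipeline are connected only by a string key: `page.putField("JobInfo", ...)` on one side must match `resultItems.get("JobInfo")` on the other, and a mismatched key silently yields null (which the null check above then skips). A minimal plain-Java sketch of that contract, with a hypothetical `Results` class standing in for WebMagic's ResultItems:

```java
import java.util.HashMap;
import java.util.Map;

public class PipelineKeySketch {

    /** Hypothetical stand-in for WebMagic's ResultItems: a string-keyed map of results. */
    public static class Results {
        private final Map<String, Object> fields = new HashMap<>();

        public void put(String key, Object value) { fields.put(key, value); }

        @SuppressWarnings("unchecked")
        public <T> T get(String key) { return (T) fields.get(key); }
    }

    public static void main(String[] args) {
        Results results = new Results();
        results.put("JobInfo", "a job record");  // processor side: page.putField(...)
        Object hit = results.get("JobInfo");     // pipeline side: matching key
        Object miss = results.get("jobInfo");    // wrong case: silently null
        System.out.println(hit + " / " + miss);  // prints a job record / null
    }
}
```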
4. Start Application.java
Done!