Web Crawlers: Case Implementation

Business Analysis

    Crawl the job postings on www.51job.com, limited to the "Computer Software" and "Internet/E-commerce" industries.

    1. From the query page, get the url of the result-list page.


 

    2. Open the corresponding detail page and locate the data to be extracted.


Storing data

Create a database and a table in it to store the crawled data (a minimal sketch of the table follows).

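The table screenshot from the original post is not preserved. As a minimal sketch, assuming columns that mirror the JobInfo entity defined later (extend it with whatever fields are actually crawled):

CREATE DATABASE IF NOT EXISTS db_crawler DEFAULT CHARACTER SET utf8mb4;
USE db_crawler;

CREATE TABLE t_jobinfo (
    id           BIGINT AUTO_INCREMENT PRIMARY KEY, -- primary key
    company_name VARCHAR(100),                      -- company name
    url          VARCHAR(500),                      -- posting url, used for dedup
    time         VARCHAR(50)                        -- publish time, used for dedup
    -- ... further columns as needed
);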

Implementation process

    Start -> fetch list page -> extract detail urls -> add urls to the task queue -> crawl detail pages and save -> End

    

Scheduler Component

    When parsing pages, the same url address is often extracted more than once. Left untreated, the same url would be processed again and again, wasting resources, so a url-deduplication facility is needed.

    Scheduler is the WebMagic component that manages URLs. It performs two functions:

        1. managing the queue of URLs waiting to be crawled;

        2. deduplicating URLs that have already been crawled.

    

    - WebMagic ships with several common Schedulers. For small crawlers that run on a single machine they are usually sufficient, so a custom Scheduler is rarely needed:

    • DuplicateRemovedScheduler: abstract base class that provides a template method;

    • QueueScheduler: keeps the URLs to crawl in an in-memory queue (the queue lives in memory, so a large crawl can easily overflow it).

    • FileCacheQueueScheduler: saves the crawl URLs to files, so the program can be stopped and, on the next start, resume crawling from the previously fetched URLs (a path must be specified; two files, .urls.txt and .cursor.txt, are created).

    • PriorityScheduler: keeps the URLs to crawl in an in-memory priority queue.

    • RedisScheduler: keeps the crawl queue in Redis, so several machines can crawl cooperatively at the same time (Redis must be installed and running).

 

    - The deduplication part is abstracted into its own interface, DuplicateRemover, so a Scheduler can be paired with whichever deduplication strategy suits the need (see the sketch after this list). Two implementations are currently provided:

    • HashSetDuplicateRemover (default): deduplicates with a HashSet; memory usage is comparatively high.

    • BloomFilterDuplicateRemover: deduplicates with a BloomFilter; memory usage is comparatively low, but false positives may cause some pages to be skipped and never crawled.

      • To use the BloomFilter, the following dependency must be added:

        <!-- Guava, for WebMagic's Bloom filter support -->
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>18.0</version>
        </dependency>
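    A Scheduler and a DuplicateRemover are combined when the Spider is built. A minimal sketch (JobProcessor is the PageProcessor defined in the Code section below; 100000 is just a placeholder estimate of the number of distinct urls):

    Spider.create(new JobProcessor())
            .addUrl(url)
            // in-memory queue, Bloom-filter dedup sized for ~100,000 urls
            .setScheduler(new QueueScheduler()
                    .setDuplicateRemover(new BloomFilterDuplicateRemover(100000)))
            // to resume after a restart, a file-backed queue could be used instead:
            // .setScheduler(new FileCacheQueueScheduler("D:/webmagic/"))
            .run();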

Code        

    1. Import the required dependencies (pom.xml)

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.0.2.RELEASE</version>
    </parent>

    <groupId>com.xiaojian</groupId>
    <artifactId>crawler-jobinfo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!--SpringMVC-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <!--SpringData Jpa-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>

        <!-- MySQL connector -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.17</version>
        </dependency>

        <!-- WebMagic core -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- WebMagic extension -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>

        <!-- Guava, for WebMagic's Bloom filter support -->
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>18.0</version>
        </dependency>

        <!-- Apache Commons utility package -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
        </dependency>

    </dependencies>
</project>

  2. application.properties configuration file

#DB Configuration:
spring.datasource.driverClassName=com.mysql.cj.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/db_crawler?serverTimezone=GMT%2B8&useUnicode=true&characterEncoding=UTF-8&zeroDateTimeBehavior=convertToNull
spring.datasource.username=root
spring.datasource.password=243600

#JPA Configuration:
spring.jpa.database=mysql
spring.jpa.show-sql=true
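
Optionally (an addition, not in the original configuration), JPA can create or update the t_jobinfo table from the JobInfo entity instead of the DDL being written by hand:

#let Hibernate create/update the schema from the entities
spring.jpa.hibernate.ddl-auto=update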

 

    3. Write the classes: pojo, dao, service, and the bootstrap class

pojo (entity class)

@Entity
@Table(name = "t_jobinfo")
public class JobInfo {
    // primary key
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    // company name
    private String companyName;
    // ... remaining fields (e.g. url, time) as needed

    // getters, setters, and toString omitted
}

 

dao

public interface JobInfoDao extends JpaRepository<JobInfo,Long> {
}

 

service

public interface JobInfoService {
    /**
     * Save a job posting.
     * @param jobInfo
     */
    void save(JobInfo jobInfo);

    /**
     * Query job postings by the given conditions.
     * @param jobInfo
     */
    List<JobInfo> findJobInfo(JobInfo jobInfo);
}

 

serviceImpl

@Service
public class JobInfoServiceImpl implements JobInfoService {
    @Resource
    private JobInfoDao jobInfoDao;

    @Override
    @Transactional
    public void save(JobInfo jobInfo) {
        // look up existing records by posting url and publish time
        JobInfo param = new JobInfo();
        param.setUrl(jobInfo.getUrl());
        param.setTime(jobInfo.getTime());
        // run the query
        List<JobInfo> list = this.findJobInfo(param);
        // check whether the posting already exists
        if (list.size() == 0) {
            // nothing found: the posting is new or was updated, so save it
            this.jobInfoDao.saveAndFlush(jobInfo);
        }
    }

    @Override
    public List<JobInfo> findJobInfo(JobInfo jobInfo) {
        // build the query-by-example conditions
        Example<JobInfo> example = Example.of(jobInfo);

        return jobInfoDao.findAll(example);
    }
}

 

Bootstrap class

@SpringBootApplication
// the crawl runs as a scheduled task, so scheduling must be enabled
@EnableScheduling
public class Application {

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

}

 

JobProcessor.java

@Component
public class JobProcessor implements PageProcessor {
    private String url = "https://search.51job.com/list/030200%252C110200,000000,0000,01%252C32,9,99,java,2,1.html" +
            "?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary" +
            "=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";

    @Resource
    private SpringDataPipeline springDataPipeline;

    @Override
    public void process(Page page) {
        // parse the page and select the job entries from the result list
        List<Selectable> list = page.getHtml().css("div#resultList div.el").nodes();
        // check whether the selection is empty
        if (list.size() == 0) {
            // empty: this is a detail page, so save the job information
            this.saveJobInfo(page);
        } else {
            // not empty: this is a list page, so parse the detail-page urls
            // and put them into the task queue
            for (Selectable s : list) {
                // get the url
                String link = s.links().toString();
                // push the url onto the task queue
                page.addTargetRequest(link);
            }
            // get the url of the next list page
            String nextUrl = page.getHtml().css("div.p_in li.bk").nodes().get(1).links().toString();
            // put the next-page url into the task queue as well
            page.addTargetRequest(nextUrl);
        }

        // String html = page.getHtml().toString(); // raw page html (debug aid)
    }

    private Site site = Site.me()
            .setTimeOut(10 * 1000)    // connection timeout
            .setCharset("gbk")        // page encoding
            .setRetryTimes(3)         // retry count
            .setRetrySleepTime(3000); // retry interval

    @Override
    public Site getSite() {
        return site;
    }

    /**
     * Save the job detail information.
     * @param page
     */
    private void saveJobInfo(Page page) {
        Html html = page.getHtml();

        JobInfo jobInfo = new JobInfo();
        // populate the entity
        // company name
        jobInfo.setCompanyName(html.css("div.cn p.cname a", "text").toString());
        // ... extract the remaining fields the same way, as needed

        // hand the result to the pipeline
        page.putField("jobInfo", jobInfo);
    }

    // initialDelay: wait this long after startup before the first run
    // fixedDelay: interval between consecutive runs
    @Scheduled(initialDelay = 1000, fixedDelay = 1000 * 600)
    public void processor() {
        Spider.create(new JobProcessor())
                .addUrl(url)
                .setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(10000)))
                .addPipeline(springDataPipeline) // persist the results to the database
                .thread(10)
                .run();
    }
}
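Because JobInfoServiceImpl deduplicates on url and time, saveJobInfo must populate both fields. A hedged sketch of the two extra lines; the time selector is an assumed placeholder, not taken from the original post:

        // the posting url comes straight from the page; save() dedups on it
        jobInfo.setUrl(page.getUrl().toString());
        // publish time: this css selector is an assumed placeholder
        jobInfo.setTime(html.css("p.msg.ltype", "text").toString());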

 

SpringDataPipeline.java

@Component
public class SpringDataPipeline implements Pipeline {

    @Resource
    private JobInfoService jobInfoService;

    @Override
    public void process(ResultItems resultItems, Task task) {
        // fetch the JobInfo object that the processor stored
        JobInfo jobInfo = resultItems.get("jobInfo");
        // only persist when the page actually produced data
        if (jobInfo != null) {
            this.jobInfoService.save(jobInfo);
        }
    }
}
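During development, WebMagic's built-in ConsolePipeline can be chained in alongside this one, so every extracted ResultItems is also printed to the console. A usage sketch, not part of the original code:

Spider.create(new JobProcessor())
        .addUrl(url)
        .addPipeline(new ConsolePipeline()) // print extracted fields while debugging
        .addPipeline(springDataPipeline)    // and still persist them to the database
        .thread(10)
        .run();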

 

4. Run Application.java; the scheduled task starts the crawl.


 

Done!


Source: www.cnblogs.com/jr-xiaojian/p/12310480.html