Getting Started with Crawlers Part 2 --- The Crawler Framework WebMagic

2 The Crawler Framework WebMagic

      2.1 Architecture overview

        WebMagic is a simple and flexible Java crawler framework. Based on WebMagic, you can quickly develop an efficient, easy-to-maintain crawler.

        WebMagic's architecture is divided into four components: Downloader, PageProcessor, Scheduler, and Pipeline, which are organized by a Spider. These four components correspond to the downloading, processing, scheduling (URL management), and persistence stages of the crawler lifecycle. The Spider organizes these components so that they can interact with one another and run as a processing flow; the Spider can be regarded as a large container, and it is also the core of WebMagic's logic. A short wiring sketch follows the component list below.

      The four components

  • Downloader    The Downloader is responsible for downloading pages from the Internet for subsequent processing. WebMagic uses Apache HttpClient as the default download tool.
  • PageProcessor      The PageProcessor is responsible for parsing pages, extracting useful information, and discovering new links. WebMagic uses Jsoup as its HTML parsing tool and, on top of it, has built Xsoup, an XPath-based extraction tool. Among the four components, the PageProcessor is different for every page of every site; it is the part the user needs to customize.
  • Scheduler        The Scheduler manages the URLs to be crawled and handles deduplication. By default, WebMagic manages the URLs with an in-memory JDK queue and uses a set for deduplication. It also supports Redis for distributed management.
  • Pipeline      The Pipeline is responsible for handling the extraction results, including computation and persistence to files, databases, and so on. By default, WebMagic provides two result handlers: "output to console" and "save to file".
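
      To make the division of labor concrete, here is a minimal wiring sketch (written for this article, not taken from the original; it simply assembles WebMagic's built-in implementations together with the MyProcessor class developed below):

// Assembling the four components on a Spider (a sketch; the Downloader and
// Scheduler shown here are the defaults and would be used even if omitted).
Spider.create(new MyProcessor())                      // PageProcessor: parse pages and extract data
        .setDownloader(new HttpClientDownloader())    // Downloader: fetch pages over HTTP
        .setScheduler(new QueueScheduler())           // Scheduler: manage and deduplicate URLs
        .addPipeline(new ConsolePipeline())           // Pipeline: handle the extracted results
        .addUrl("https://blog.csdn.net/")             // seed URL
        .run();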

      2.2 PageProcessor

            2.2.1 Crawling the entire page content

        Let's build a crawler together, step by step, to crawl blog content from CSDN: https://blog.csdn.net/

( 1 ) Create a project and add the dependencies

<dependency>
     <groupId>us.codecraft</groupId>
     <artifactId>webmagic-core</artifactId>
     <version>0.7.3</version>
</dependency>
<dependency>
     <groupId>us.codecraft</groupId>
     <artifactId>webmagic-extension</artifactId>
     <version>0.7.3</version>
</dependency>

 ( 2 ) Write a class that crawls the page content


import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * Crawler class
 */
public class MyProcessor implements PageProcessor {


    public void process(Page page) {
        // print the page content
        System.out.println(page.getHtml().toString());
    }

    public Site getSite() {
        return Site.me().setSleepTime(100).setRetryTimes(3);
    }

    public static void main(String[] args) {
        Spider.create(new MyProcessor()).addUrl("https://blog.csdn.net").run();
    }

}

        Spider is the entry point for starting a crawler. Before starting the crawler, we need to create a Spider object with a PageProcessor and then call run() to start it.

       The Spider's other components (Downloader, Scheduler, Pipeline) can all be configured through its corresponding set methods.

       Page represents a page downloaded by the Downloader; it may be HTML, JSON, or some other text format. Page is the core object of WebMagic's extraction process, and it provides a number of methods for extracting content, saving results, and so on.
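
       As a rough illustration (a sketch written for this article, not code from the original; the XPath expression is just an example), these are some commonly used Page methods inside process():

public void process(Page page) {
    // extract content with XPath and record it for the Pipelines
    String title = page.getHtml().xpath("//title/text()").toString();
    page.putField("title", title);
    page.putField("url", page.getUrl().toString());
    // add the links discovered on this page to the crawl queue
    page.addTargetRequests(page.getHtml().links().all());
}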

       Site defines configuration information about the site itself, such as the encoding, HTTP headers, timeout, retry policy, proxies, and so on; all of these can be configured through the Site object.
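
       For example, a Site might be configured roughly as follows (a sketch; the specific values and the header are placeholders chosen for illustration):

public Site getSite() {
    return Site.me()
            .setCharset("utf-8")                       // page encoding
            .setTimeOut(10000)                         // connection timeout in milliseconds
            .setRetryTimes(3)                          // number of retries on failure
            .setSleepTime(100)                         // pause between requests in milliseconds
            .addHeader("User-Agent", "Mozilla/5.0");   // an example HTTP header
}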

            2.2.2 Crawling specified content (XPath)

       If we only want to crawl part of the page content, we need to specify an XPath. XPath, the XML Path Language, is a language for locating parts of an XML document. XPath uses path expressions to select a node or a set of nodes in an XML document; these path expressions look very similar to the path expressions we use in a conventional computer file system.

       We specify an XPath to grab part of the page content:

System.out.println(page.getHtml().xpath("//*[@id=\"nav\"]/div/div/ul/li[5]/a").toString());

      Readers who know CSS will recognize what the code above means: under the node whose id is nav, take the div node, then the div node under it, then the ul, then the 5th li node under it, and finally the a node. The output is:

<a href="/nav/ai">人工智能</a>

            2.2.3 Adding target addresses

      We can add target addresses so that the crawl expands from the seed page to more pages:

 public void process(Page page) {
        page.addTargetRequests(page.getHtml().links().regex("https://blog.csdn.net/[a-z0-9-]+/article/details/[0-9]{8}").all());  // add all matching links on the current page to the crawl targets
        System.out.println(page.getHtml().xpath("//*[@id=\"nav\"]/div/div/ul/li[5]/a").toString());
    }

     After running it, you will see many addresses printed to the console.

            2.2.4 Regex matching of target addresses

    How do we extract content only from the blog article detail pages, and how do we extract the title?


/**
 * Crawler class
 */
public class MyProcessor implements PageProcessor {


    public void process(Page page) {

        //page.addTargetRequests(page.getHtml().links().all());  // add all links on the current page to the crawl targets
        page.addTargetRequests(page.getHtml().links().regex("https://blog.csdn.net/[a-z0-9-]+/article/details/[0-9]{8}").all());
        System.out.println(page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div[1]/h1/text()").toString());
    }

    public Site getSite() {
        return Site.me().setSleepTime(100).setRetryTimes(3);
    }

    public static void main(String[] args) {
        Spider.create(new MyProcessor())
                .addUrl("https://blog.csdn.net/")
                .run();
    }

}

      2.3 Pipeline  

            2.3.1 ConsolePipeline: output to the console

/**
 * Crawler class
 */
public class MyProcessor implements PageProcessor {


    public void process(Page page) {
        // print the page content
        //System.out.println(page.getHtml().toString());
        //page.addTargetRequests(page.getHtml().links().all());

        page.addTargetRequests(page.getHtml().links().regex("https://blog.csdn.net/[a-z0-9-]+/article/details/[0-9]{8}").all());
        //System.out.println(page.getHtml().xpath("//*[@id=\"nav\"]/div/div/ul/li[5]/a").toString());
        //System.out.println(page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div/div/div[1]/h1"));
        page.putField("title", page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div/div/div[1]/h1").toString());
    }

    public Site getSite() {
        return Site.me().setSleepTime(100).setRetryTimes(3);
    }

    public static void main(String[] args) {
        Spider.create(new MyProcessor())
                .addUrl("https://blog.csdn.net/")
                .addPipeline(new ConsolePipeline())
                .run();
    }

}

            2.3.2 FilePipeline: save to a file

public static void main(String[] args) {
        Spider.create( new MyProcessor() )
                .addUrl("https://blog.csdn.net/")
                .addPipeline(new ConsolePipeline())
                .addPipeline(new FilePipeline("e:/data"))
                .run();
    }

            2.3.3 JsonFilePipeline: save as JSON

public static void main(String[] args) {
        Spider.create( new MyProcessor() )
                .addUrl("https://blog.csdn.net/")
                .addPipeline(new ConsolePipeline())
                .addPipeline(new FilePipeline("e:/data"))
                .addPipeline(new JsonFilePipeline("e:/json"))
                .run();
    }

            2.3.4 Custom Pipeline

       If the pipelines above cannot meet your needs, you can write a custom Pipeline.

     ( 1 ) Create a class MyPipeline that implements the Pipeline interface


import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class MyPipeline implements Pipeline {
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        System.out.println("Custom title: " + title);
    }
}

    ( 2 ) Modify the main method

public static void main(String[] args) {
        Spider.create( new MyProcessor() )
                .addUrl("https://blog.csdn.net/")
                .addPipeline(new ConsolePipeline())
                .addPipeline(new FilePipeline("e:/data"))
                .addPipeline(new JsonFilePipeline("e:/json"))
                .addPipeline(new MyPipeline())
                .run();
    }

        2.4 Scheduler  

       The crawler we have just built may crawl the same pages repeatedly on every run, which is pointless. The most basic function of the Scheduler (URL management) is to mark the URLs that have already been crawled, which makes incremental URL deduplication possible.

       Currently there are three main Scheduler implementations:

  1. Memory queue QueueScheduler
  2. File queue FileCacheQueueScheduler
  3.  Redis queue RedisScheduler

            2.4.1 Memory queue

       Use setScheduler to set the Scheduler:

public static void main(String[] args) {
        Spider.create( new MyProcessor() )
                .addUrl("https://blog.csdn.net/")            
                .setScheduler(new QueueScheduler())
                .run();
    }

            2.4.2 File queue

      Use a file to save the crawled URLs. You can close the program and, the next time it starts, continue crawling from the URLs crawled before.

   ( 1 ) Create the folder D:\scheduler

   ( 2 ) Modify the code

public static void main(String[] args) {
        Spider.create(new MyProcessor())
                .addUrl("https://blog.csdn.net")
                //.setScheduler(new QueueScheduler())  // set a memory queue
                .setScheduler(new FileCacheQueueScheduler("D:\\scheduler"))  // set a file queue
                .run();
    }

     After running, the folder D:\scheduler will contain two files: blog.csdn.net.urls.txt and blog.csdn.net.cursor.txt.

            2.4.3 Redis queue

     Use Redis to save the crawl queue so that multiple machines can crawl at the same time; this is what is known as distributed crawling.

  ( 1 ) Start the Redis server

  ( 2 ) Modify the code

public static void main(String[] args) {
        Spider.create(new MyProcessor())
                .addUrl("https://blog.csdn.net")
                //.setScheduler(new QueueScheduler())  // set a memory queue
                //.setScheduler(new FileCacheQueueScheduler("D:\\scheduler"))  // set a file queue
                .setScheduler(new RedisScheduler("127.0.0.1"))  // set a Redis queue
                .run();
    }

 

 
