Java crawler WebMagic (Part 1)

Most crawlers today are written in Python, but Java can do the job as well. This post introduces WebMagic, a lightweight crawler framework developed in China.

Official site: http://webmagic.io/

 

 

 

As I understand it, crawlers fall into two kinds: the first crawls pages (static data), and the second crawls interfaces (dynamically loaded data).

 

For static page data, the key is to work out the document structure of the page.

For interface data, the key is to find the interface URL and the parameters it expects.
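
The difference between the two kinds can be sketched in plain Java, without WebMagic. This is only an illustration under assumptions: the class name CrawlKinds, the sample HTML, the endpoint path /api/list and the parameters page and size are all hypothetical, and a regular expression stands in for a real HTML parser.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CrawlKinds {

    // Static page: the data is already in the HTML, so the work is knowing
    // where it sits in the document structure (here, the <title> element).
    public static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Data interface: the data comes from a separate endpoint, so the work is
    // knowing the endpoint URL and the query parameters it expects.
    // URLEncoder.encode(String, Charset) requires Java 10 or later.
    public static String buildInterfaceUrl(String endpoint, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return query.isEmpty() ? endpoint : endpoint + "?" + query;
    }

    public static void main(String[] args) {
        // Hypothetical page content.
        String html = "<html><head><title> Demo page </title></head><body></body></html>";
        System.out.println(extractTitle(html));

        // Hypothetical interface path and parameters.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("page", "1");
        params.put("size", "20");
        System.out.println(buildInterfaceUrl("http://www.xxxx.com/api/list", params));
    }
}
```

In the first case the effort goes into locating the element; in the second it goes into discovering the URL and parameters, usually by watching the browser's network requests.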

 

WebMagic offers simple, easy-to-understand ways of handling both.

Three core components: PageProcessor, Pipeline, Spider

PageProcessor: implements the crawling rules.

Pipeline: handles data persistence.

Spider: starts the crawler and specifies which rules to use.

 

For example:

Spider.create(new MyProcessor())
.addPipeline(new MyPipeline())
.addUrl("http://www.xxxx.com").thread(3).run();

This starts a crawler whose crawling rules are defined in MyProcessor, whose crawled data is handed to MyPipeline for processing, whose target site is http://www.xxxx.com, and which runs with three threads. It is that simple.
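
The example uses MyProcessor and MyPipeline without showing them. Below is a minimal sketch of what the two classes might look like, assuming the WebMagic core library (us.codecraft:webmagic-core) is on the classpath. The XPath expression, the field name "title", and the retry and sleep settings are illustrative assumptions, not taken from the original example.

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import us.codecraft.webmagic.processor.PageProcessor;

// PageProcessor: the crawling rules, i.e. what to extract from each page.
public class MyProcessor implements PageProcessor {

    // Per-site settings; the values here are assumptions for illustration.
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Store the page title under the (assumed) field name "title".
        page.putField("title", page.getHtml().xpath("//title/text()").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // Spider: wires the processor and pipeline together and starts crawling.
        Spider.create(new MyProcessor())
                .addPipeline(new MyPipeline())
                .addUrl("http://www.xxxx.com")
                .thread(3)
                .run();
    }
}

// Pipeline: receives the extracted fields; here it prints them, but this is
// where data would be written to a file or database.
class MyPipeline implements Pipeline {
    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        System.out.println(resultItems.getRequest().getUrl() + " -> " + title);
    }
}
```

Keeping extraction in the processor and storage in the pipeline means either side can be swapped without touching the other.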

The official diagram is attached.

 


Origin www.cnblogs.com/yhood/p/11597081.html