Most crawlers today are written in Python, but Java works just as well. This post introduces WebMagic, a lightweight crawler framework for Java that was developed in China.
Official site: http://webmagic.io/
As I see it, crawling falls into two kinds: crawling pages (static data) and crawling interfaces (dynamically loaded data).
For static page data, the key is to work out the page's document structure.
For interface data, the key is to find the interface URL and its parameters.
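The difference between the two kinds can be sketched in plain JDK code, with no WebMagic involved. This is only an illustration under stated assumptions: the HTML snippet, the endpoint http://www.xxxx.com/api/list, and the parameter names page and keyword are all made up, and the class CrawlKinds is a hypothetical helper. A real crawler would also download the page or call the interface, which is left out here so the example runs offline.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class CrawlKinds {

    // Static page: the data is already in the HTML, so what matters is
    // knowing where it sits in the document (here, inside <title>).
    static String extractTitle(String html) {
        Matcher m = Pattern
                .compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                .matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Data interface: the data comes from an endpoint, so what matters is
    // the interface link and its parameters.
    // URLEncoder.encode(String, Charset) requires Java 10 or later.
    static String buildInterfaceUrl(String endpoint, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return endpoint + "?" + query;
    }

    public static void main(String[] args) {
        // Made-up page content standing in for a downloaded static page.
        String html = "<html><head><title> Demo page </title></head><body></body></html>";
        System.out.println(extractTitle(html));

        // Made-up endpoint and parameters standing in for a real data interface.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("page", "1");
        params.put("keyword", "web magic");
        System.out.println(buildInterfaceUrl("http://www.xxxx.com/api/list", params));
    }
}
```

A regular expression is enough for this tiny snippet; for real pages a proper HTML parser or XPath selector is usually the more robust choice.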
WebMagic offers a simple, easy-to-understand way to handle both.
There are three core components: PageProcessor, Pipeline, and Spider.
PageProcessor implements the crawling rules.
Pipeline persists the data.
Spider starts the crawler and specifies which rules to use.
For example:
Spider.create(new MyProcessor())
        .addPipeline(new MyPipeline())
        .addUrl("http://www.xxxx.com")
        .thread(3)
        .run();
This starts a crawler whose crawling rules are defined in MyProcessor, whose crawled data is handled by MyPipeline, whose target site is http://www.xxxx.com, and which runs on three threads. It is that simple.
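To show how the three roles divide the work, here is a toy model written with the JDK only. It is not WebMagic's code: ToyPageProcessor, ToyPipeline, ToySpider, and ToyCrawlerDemo are hypothetical stand-ins whose method signatures are invented for this sketch, and the "download" step is faked with a fixed string so the example runs without network access or threads.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the "crawling rules" role (not WebMagic's interface).
interface ToyPageProcessor {
    Map<String, String> process(String url, String body);
}

// Hypothetical stand-in for the "data persistence" role (not WebMagic's interface).
interface ToyPipeline {
    void persist(Map<String, String> fields);
}

// Hypothetical stand-in for the "start the crawler, wire the rules" role.
class ToySpider {
    private final ToyPageProcessor processor;
    private final List<ToyPipeline> pipelines = new ArrayList<>();
    private final List<String> urls = new ArrayList<>();

    private ToySpider(ToyPageProcessor processor) {
        this.processor = processor;
    }

    static ToySpider create(ToyPageProcessor processor) {
        return new ToySpider(processor);
    }

    ToySpider addPipeline(ToyPipeline pipeline) {
        pipelines.add(pipeline);
        return this;
    }

    ToySpider addUrl(String url) {
        urls.add(url);
        return this;
    }

    // Single-threaded and offline: the page body is faked instead of downloaded.
    void run() {
        for (String url : urls) {
            String body = "<title>page at " + url + "</title>";
            Map<String, String> fields = processor.process(url, body);
            for (ToyPipeline pipeline : pipelines) {
                pipeline.persist(fields);
            }
        }
    }
}

class ToyCrawlerDemo {
    // Runs the toy crawler on one URL and returns what the pipeline stored.
    static List<Map<String, String>> crawl(String startUrl) {
        List<Map<String, String>> store = new ArrayList<>();
        ToySpider.create((url, body) -> {
                    // Crawling rule: keep the URL and the text inside <title>.
                    Map<String, String> fields = new LinkedHashMap<>();
                    fields.put("url", url);
                    fields.put("title", body.replaceAll("</?title>", ""));
                    return fields;
                })
                .addPipeline(store::add) // "Persistence": keep results in memory.
                .addUrl(startUrl)
                .run();
        return store;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://www.xxxx.com"));
    }
}
```

The point of the split is that extraction rules and storage can be swapped independently: changing where the data goes means replacing only the pipeline, while the processor and the spider wiring stay the same.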
The official architecture diagram is attached.