Web Crawlers for Beginners, Part 1 --- An introduction to web crawlers
Web Crawlers for Beginners, Part 2 --- The crawler framework WebMagic
Web Crawlers for Beginners, Part 3 --- Crawlers in practice
2 The crawler framework WebMagic
2.1 Architecture overview
WebMagic is a simple and flexible Java crawler framework. Based on WebMagic, you can quickly develop an efficient, easy-to-maintain crawler.
WebMagic's architecture is divided into four components: Downloader, PageProcessor, Scheduler, and Pipeline, which are organized by the Spider. These four components correspond to the downloading, processing, scheduling, and persistence stages of a crawler's life cycle. The Spider wires these components together so that they can interact with one another and run as a flow; you can think of the Spider as a large container, and it is also the core of WebMagic's logic.
The four components
- Downloader: responsible for downloading pages from the Internet for subsequent processing. WebMagic uses Apache HttpClient as the default download tool.
- PageProcessor: responsible for parsing pages, extracting useful information, and discovering new links. WebMagic uses Jsoup as its HTML parsing tool and, on top of it, has developed Xsoup, an XPath-based extraction tool. Among the four components, the PageProcessor differs for every page of every site; it is the part the user needs to customize.
- Scheduler: responsible for managing the URLs to be crawled, as well as deduplication. By default WebMagic provides a JDK in-memory queue to manage URLs, with a set used to remove duplicates. It also supports Redis for distributed management.
- Pipeline: responsible for processing the extraction results, including computation and persistence to files, databases, and so on. By default WebMagic provides two result handlers: "output to console" and "save to file".
2.2 PageProcessor
2.2.1 Crawling the entire content of a page
Let's build a crawler step by step that crawls blog content from CSDN: https://blog.csdn.net/
(1) Create a project and add the dependencies
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
(2) Write a class that implements crawling a page's content
/**
 * Crawler class
 */
public class MyProcessor implements PageProcessor {

    public void process(Page page) {
        // Print the page content
        System.out.println(page.getHtml().toString());
    }

    public Site getSite() {
        return Site.me().setSleepTime(100).setRetryTimes(3);
    }

    public static void main(String[] args) {
        Spider.create(new MyProcessor()).addUrl("https://blog.csdn.net").run();
    }
}
Spider is the entry point for starting a crawler. Before starting the crawler, we need to create a Spider object with a PageProcessor and then call run() to start it.
The Spider's other components (Downloader, Scheduler, Pipeline) can all be configured through setter methods.
Page represents a page downloaded by the Downloader. It may be HTML, JSON, or some other text format. Page is the core object of WebMagic's extraction process; it provides a number of methods for extracting content, saving results, and so on.
Site defines configuration for the site itself, such as encoding, HTTP headers, timeouts, retry strategy, proxies, and so on, all of which can be configured through a Site object.
2.2.2 Crawling specified content (XPath)
If we only want to crawl part of a page's content, we need to specify an XPath expression. XPath, the XML Path Language, is a language for locating parts of an XML document. XPath uses path expressions to select a node or set of nodes in an XML document. These path expressions look very similar to the ones we use in a conventional computer file system.
We specify an XPath expression to grab part of the page's content:
System.out.println(page.getHtml().xpath("//*[@id=\"nav\"]/div/div/ul/li[5]/a").toString());
Front-end developers who have learned CSS will recognize the meaning of the code above: under the node with id nav, take the div node, then the div below it, then the ul, then the 5th li, and finally the a node. Take a look at the output:
<a href="/nav/ai">人工智能</a>
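To get a feel for how such a path expression behaves independently of WebMagic, here is a minimal sketch using the JDK's built-in javax.xml.xpath on a tiny well-formed fragment modeled on the nav menu above. The fragment and the class name XPathDemo are made up for illustration; WebMagic itself uses Xsoup rather than the JDK XPath engine.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathDemo {

    // Parse the fragment and evaluate the XPath expression against it
    static String extract(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        // A tiny well-formed fragment mimicking the nav structure above
        String xml = "<div id='nav'><div><div><ul>"
                + "<li><a href='/nav/java'>Java</a></li>"
                + "<li><a href='/nav/ai'>AI</a></li>"
                + "</ul></div></div></div>";
        // Under the node with id 'nav': div -> div -> ul -> 2nd li -> a
        System.out.println(extract(xml, "//*[@id='nav']/div/div/ul/li[2]/a")); // prints: AI
    }
}
```

The same path syntax (attribute predicates like [@id='nav'], positional predicates like li[2]) is what the WebMagic example above relies on.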
2.2.3 Adding target addresses
We can add target addresses in order to crawl more pages starting from a seed page:
public void process(Page page) {
    // Add all links on the current page that match the pattern as target pages
    page.addTargetRequests(page.getHtml().links().regex("https://blog.csdn.net/[a-z0-9-]+/article/details/[0-9]{8}").all());
    System.out.println(page.getHtml().xpath("//*[@id=\"nav\"]/div/div/ul/li[5]/a").toString());
}
After running it, you will see a large number of addresses appear in the console.
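To see what this regular expression accepts and rejects, here is a small standalone sketch using the JDK's java.util.regex with the same pattern. The sample URLs and the class name UrlFilterDemo are made up for illustration; note that the dots are escaped here, while the unescaped form used above also matches, since an unescaped '.' matches any character.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class UrlFilterDemo {

    // Same pattern passed to links().regex(...) above, with '.' escaped
    static final Pattern ARTICLE = Pattern.compile(
            "https://blog\\.csdn\\.net/[a-z0-9-]+/article/details/[0-9]{8}");

    static boolean isArticle(String url) {
        return ARTICLE.matcher(url).matches();
    }

    public static void main(String[] args) {
        for (String link : Arrays.asList(
                "https://blog.csdn.net/someuser/article/details/12345678", // article detail page
                "https://blog.csdn.net/nav/ai",                            // navigation link
                "https://blog.csdn.net/someuser/article/details/123")) {   // id too short
            System.out.println(isArticle(link) + "  " + link);
        }
    }
}
```

Only the first URL matches: it has a username segment followed by /article/details/ and exactly eight digits, which is what lets the crawler keep article detail pages and discard navigation links.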
2.2.4 Matching target addresses with a regular expression
How do we extract only the content of blog article detail pages, and extract the title?
/**
 * Crawler class
 */
public class MyProcessor implements PageProcessor {

    public void process(Page page) {
        // page.addTargetRequests(page.getHtml().links().all()); // add every link on the current page as a target page
        page.addTargetRequests(page.getHtml().links().regex("https://blog.csdn.net/[a-z0-9-]+/article/details/[0-9]{8}").all());
        System.out.println(page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div[1]/h1/text()").toString());
    }

    public Site getSite() {
        return Site.me().setSleepTime(100).setRetryTimes(3);
    }

    public static void main(String[] args) {
        Spider.create(new MyProcessor())
                .addUrl("https://blog.csdn.net/")
                .run();
    }
}
2.3 Pipeline
2.3.1 ConsolePipeline: output to the console
/**
 * Crawler class
 */
public class MyProcessor implements PageProcessor {

    public void process(Page page) {
        // Print the page content
        // System.out.println(page.getHtml().toString());
        // page.addTargetRequests(page.getHtml().links().all());
        page.addTargetRequests(page.getHtml().links().regex("https://blog.csdn.net/[a-z0-9-]+/article/details/[0-9]{8}").all());
        // System.out.println(page.getHtml().xpath("//*[@id=\"nav\"]/div/div/ul/li[5]/a").toString());
        // System.out.println(page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div/div/div[1]/h1"));
        page.putField("title", page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div/div/div[1]/h1").toString());
    }

    public Site getSite() {
        return Site.me().setSleepTime(100).setRetryTimes(3);
    }

    public static void main(String[] args) {
        Spider.create(new MyProcessor())
                .addUrl("https://blog.csdn.net/")
                .addPipeline(new ConsolePipeline())
                .run();
    }
}
2.3.2 FilePipeline: saving to a file
public static void main(String[] args) {
    Spider.create(new MyProcessor())
            .addUrl("https://blog.csdn.net/")
            .addPipeline(new ConsolePipeline())
            .addPipeline(new FilePipeline("e:/data"))
            .run();
}
2.3.3 JsonFilePipeline: saving results as JSON
public static void main(String[] args) {
    Spider.create(new MyProcessor())
            .addUrl("https://blog.csdn.net/")
            .addPipeline(new ConsolePipeline())
            .addPipeline(new FilePipeline("e:/data"))
            .addPipeline(new JsonFilePipeline("e:/json"))
            .run();
}
2.3.4 Custom Pipeline
If the Pipelines above cannot meet your needs, you can write a custom Pipeline.
(1) Create a class MyPipeline that implements the Pipeline interface
public class MyPipeline implements Pipeline {

    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        System.out.println("custom title: " + title);
    }
}
(2) Modify the main method
public static void main(String[] args) {
    Spider.create(new MyProcessor())
            .addUrl("https://blog.csdn.net/")
            .addPipeline(new ConsolePipeline())
            .addPipeline(new FilePipeline("e:/data"))
            .addPipeline(new JsonFilePipeline("e:/json"))
            .addPipeline(new MyPipeline())
            .run();
}
2.4 Scheduler
The crawler we have just written may crawl duplicate pages on each run, which is pointless. The most basic function of the Scheduler (URL management) is to mark URLs that have already been crawled; with it, incremental deduplication of URLs can be achieved.
There are currently three main kinds of Scheduler:
- Memory queue QueueScheduler
- File queue FileCacheQueueScheduler
- Redis queue RedisScheduler
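The deduplication behavior shared by all three can be sketched, purely as an illustration of the idea and not WebMagic's actual implementation, as a queue guarded by a set of already-seen URLs (the class name DedupQueueDemo is made up):

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class DedupQueueDemo {

    private final Queue<String> queue = new ArrayDeque<>(); // URLs waiting to be crawled
    private final Set<String> seen = new HashSet<>();       // every URL ever pushed

    // Enqueue only URLs that have never been seen before;
    // returns false for a duplicate
    public boolean push(String url) {
        if (seen.add(url)) {
            queue.offer(url);
            return true;
        }
        return false;
    }

    // Hand the next URL to the downloader, or null if the queue is empty
    public String poll() {
        return queue.poll();
    }

    public static void main(String[] args) {
        DedupQueueDemo scheduler = new DedupQueueDemo();
        System.out.println(scheduler.push("https://blog.csdn.net/a")); // true, new URL
        System.out.println(scheduler.push("https://blog.csdn.net/a")); // false, duplicate
        System.out.println(scheduler.poll()); // https://blog.csdn.net/a
    }
}
```

The three built-in Schedulers differ mainly in where the queue and the seen-set live: JVM memory, local files, or Redis (which is what makes multi-machine crawling possible).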
2.4.1 In-memory queue
Use setScheduler to set the Scheduler:
public static void main(String[] args) {
    Spider.create(new MyProcessor())
            .addUrl("https://blog.csdn.net/")
            .setScheduler(new QueueScheduler())
            .run();
}
2.4.2 File queue
Using a file to save the crawl URLs, you can close the program and, the next time it starts, continue crawling from the previously fetched URLs.
(1) Create a folder D:\scheduler
(2) Modify the code
public static void main(String[] args) {
    Spider.create(new MyProcessor())
            .addUrl("https://blog.csdn.net")
            // .setScheduler(new QueueScheduler()) // use the in-memory queue
            .setScheduler(new FileCacheQueueScheduler("D:\\scheduler")) // use the file queue
            .run();
}
After running, the folder D:\scheduler will contain two files: blog.csdn.net.urls.txt and blog.csdn.net.cursor.txt.
2.4.3 Redis queue
Using Redis to save the crawl queue, multiple machines can crawl at the same time. This is so-called distributed crawling.
(1) Start a Redis server
(2) Modify the code
public static void main(String[] args) {
    Spider.create(new MyProcessor())
            .addUrl("https://blog.csdn.net")
            // .setScheduler(new QueueScheduler()) // use the in-memory queue
            // .setScheduler(new FileCacheQueueScheduler("D:\\scheduler")) // use the file queue
            .setScheduler(new RedisScheduler("127.0.0.1")) // use the Redis queue
            .run();
}