A small example of getting the text inside a tag with WebMagic

Today I had a little trouble extracting one specific piece of content from a web page.

Source:

package com.ms.test;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class TestWebmagic implements PageProcessor {

    // Default crawl settings.
    private final Site site = Site.me();

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        // Select the <strong> element itself; the result still includes the tag.
        page.putField("test", page.getHtml().xpath("//div[@class='p-2']/div[@class='o-border-bottom2']/div[@class='my-2']/strong"));
    }

    public static void main(String[] args) {
        Spider.create(new TestWebmagic())
                .addUrl("http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId=AH20011700001")
                .run();
    }
}

The result I got looks like this:

 

But I did not want the content to come with the tag. I searched Baidu for a while and found no good examples, but I did find the answer in a comment: append "/text()" to the end of the XPath expression.

We all know that in jsoup it is very easy to get the content inside a tag, because it has a text() method, so as soon as I saw that example I understood.

The updated code:
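To see what the trailing "/text()" step changes, here is a minimal sketch that uses the XPath engine bundled with the JDK instead of WebMagic, so it runs without any dependency. It is only a stand-in: WebMagic evaluates XPath with its own library, and the SAMPLE markup, the class name TextStepDemo and the helper selectFirst are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextStepDemo {

    // Hypothetical markup that imitates the structure targeted in this post.
    static final String SAMPLE =
        "<div class='p-2'><div class='o-border-bottom2'><div class='my-2'>"
        + "<strong>Sample title</strong></div></div></div>";

    // Returns the first node matched by expr, or null when nothing matches.
    static Node selectFirst(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expr, e);
        }
    }

    public static void main(String[] args) {
        String base = "//div[@class='p-2']/div[@class='o-border-bottom2']"
            + "/div[@class='my-2']/strong";

        // Without text(): the match is the <strong> element itself.
        Node element = selectFirst(SAMPLE, base);
        System.out.println(element.getNodeName());   // prints "strong"

        // With text(): the match is the text node inside <strong>.
        Node text = selectFirst(SAMPLE, base + "/text()");
        System.out.println(text.getNodeValue());     // prints "Sample title"
    }
}
```

The first expression stops at the element, which is why the tag shows up in the output; the second goes one step further to the text node under it.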

package com.ms.test;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class TestWebmagic implements PageProcessor {

    // Default crawl settings.
    private final Site site = Site.me();

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        // The trailing /text() selects the text inside <strong> instead of the element.
        page.putField("test", page.getHtml().xpath("//div[@class='p-2']/div[@class='o-border-bottom2']/div[@class='my-2']/strong/text()"));
    }

    public static void main(String[] args) {
        Spider.create(new TestWebmagic())
                .addUrl("http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId=AH20011700001")
                .run();
    }
}

The results are as follows:

 

This reminds me that I ran into a similar problem before. In that case, before the change, the result was this:

 

The result obtained after the change is this:

 

Nothing is output at all...

So for that case I will stick with my own approach for now.
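One possible reason for an empty result, and this is only a guess since the page from that earlier case is not shown here, is that text() matches only text nodes that are direct children of the selected element. If the text sits one level deeper, inside a child tag, there is nothing for it to match. The sketch below shows this with the JDK's built-in XPath engine rather than WebMagic; the NESTED markup, the class name NestedTextDemo and the helper eval are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class NestedTextDemo {

    // Hypothetical markup: the text lives in a child <span>, not directly in <strong>.
    static final String NESTED =
        "<div class='my-2'><strong><span>Nested title</span></strong></div>";

    // Evaluates expr against xml and returns the result as a string.
    // An expression that matches no node yields the empty string.
    static String eval(String xml, String expr) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expr, e);
        }
    }

    public static void main(String[] args) {
        // <strong> has no direct text child, so text() matches nothing.
        System.out.println("[" + eval(NESTED, "//strong/text()") + "]");   // prints "[]"

        // string() concatenates the text of all descendants.
        System.out.println("[" + eval(NESTED, "string(//strong)") + "]");  // prints "[Nested title]"
    }
}
```

If this is what happened, selecting one level deeper (for example ending the path with the child tag before "/text()") might help. WebMagic's XPath library, Xsoup, also documents allText() and tidyText() functions that include text from child elements, which may be worth trying, though I have not checked them against that page.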


Origin www.cnblogs.com/msdog/p/12212731.html