Today I ran into a small problem while extracting specific content from a web page.
Source:
package com.ms.test;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class TestWebmagic implements PageProcessor {

    // Default crawl settings
    private final Site site = Site.me();

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        // Selects the <strong> element itself, so the result still includes the tag.
        // Attribute values are quoted, as standard XPath requires.
        page.putField("test", page.getHtml().xpath("//div[@class='p-2']/div[@class='o-border-bottom2']/div[@class='my-2']/strong"));
    }

    public static void main(String[] args) {
        Spider.create(new TestWebmagic())
                .addUrl("http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId=AH20011700001")
                .run();
    }
}
The result I got looked like this:
But I only wanted the content, without the surrounding tag. I searched Baidu and found no good examples, but I did find the answer in a comment: append "/text()" to the end of the XPath expression.
As we all know, getting the content inside a tag is very easy in jsoup because it has a "text()" function, so as soon as I saw that example I understood.
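The difference between selecting the tag and selecting its text can be sketched outside WebMagic. The example below is only an illustration under stated assumptions: it uses the JDK's built-in javax.xml.xpath engine rather than the XPath engine WebMagic ships with, and the SAMPLE fragment and the TextNodeDemo class are hypothetical stand-ins, not taken from the real page.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class TextNodeDemo {

    // Hypothetical, well-formed stand-in for the fragment of the page (assumption).
    static final String SAMPLE =
            "<div class=\"my-2\"><strong>Hello</strong></div>";

    // Evaluates the expression against SAMPLE and returns the first matching node.
    static Node select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            InputSource source = new InputSource(new StringReader(SAMPLE));
            return (Node) xpath.evaluate(expression, source, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Without /text(): the match is the <strong> element (the tag itself).
        Node element = select("//div[@class='my-2']/strong");
        // With /text(): the match is the text node inside <strong>.
        Node text = select("//div[@class='my-2']/strong/text()");

        System.out.println(element.getNodeName() + " / " + element.getNodeValue());
        System.out.println(text.getNodeName() + " / " + text.getNodeValue());
    }
}
```

The first expression matches an element node, which is why a result printed from it carries the tag; the second matches only the text node beneath it.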
The updated code:
package com.ms.test;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class TestWebmagic implements PageProcessor {

    // Default crawl settings
    private final Site site = Site.me();

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        // The trailing /text() selects only the text inside <strong>, not the tag.
        // Attribute values are quoted, as standard XPath requires.
        page.putField("test", page.getHtml().xpath("//div[@class='p-2']/div[@class='o-border-bottom2']/div[@class='my-2']/strong/text()"));
    }

    public static void main(String[] args) {
        Spider.create(new TestWebmagic())
                .addUrl("http://www.beijing.gov.cn/hudong/hdjl/com.web.consult.consultDetail.flow?originalId=AH20011700001")
                .run();
    }
}
The results are as follows:
This reminded me that I had run into a similar problem before. In that case, before the change, the result was this:
After the change, the result was this:
Nothing was output at all...
So for that case I am sticking with my own approach for now.
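One possible cause of an empty result, which the post itself does not confirm, is that the text sits inside a nested tag. In standard XPath, "/text()" matches only text nodes that are direct children of the selected element. The sketch below illustrates this with the JDK's javax.xml.xpath engine, not the one WebMagic uses; the SAMPLE fragment and the NestedTextDemo class are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class NestedTextDemo {

    // Hypothetical fragment: the text is wrapped in an extra <span> (assumption).
    static final String SAMPLE =
            "<div class=\"my-2\"><strong><span>Hello</span></strong></div>";

    // Evaluates the expression against SAMPLE and returns the result as a string.
    // An expression that matches no node yields the empty string.
    static String evaluate(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(SAMPLE)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // <strong> has no direct text child here, so nothing matches.
        System.out.println("[" + evaluate("//strong/text()") + "]");
        // Descendant text nodes are reached with //text() ...
        System.out.println("[" + evaluate("//strong//text()") + "]");
        // ... or by taking the string value of the whole element.
        System.out.println("[" + evaluate("string(//strong)") + "]");
    }
}
```

If the page in question has this shape, extending the path down to the innermost tag before "/text()" would be one way to get the content; whether the XPath engine behind WebMagic offers a shortcut for descendant text is something to check in its own documentation.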