Basic use of WebMagic, a Java-based crawler framework

This is a brief record of the basic steps for crawling web data from a Java project.

Dependencies that need to be imported:

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>

If the dependencies cannot be resolved, you can download the latest source code of the master branch from GitHub, import the project into IntelliJ IDEA, and run install on the parent project webmagic-parent. Then take these jars from the target directories:

webmagic-core-0.8.0.jar
webmagic-extension-0.8.0.jar

You can then install them into your local Maven repository, for example with the following commands:

mvn install:install-file -Dfile="D:\storage\maven_repository\com\us\codecraft\webmagic-core-0.8.0.jar" -DgroupId=us.codecraft -DartifactId=webmagic-core -Dversion=0.8.0 -Dpackaging=jar
mvn install:install-file -Dfile="D:\storage\maven_repository\com\us\codecraft\webmagic-extension-0.8.0.jar" -DgroupId=us.codecraft -DartifactId=webmagic-extension -Dversion=0.8.0 -Dpackaging=jar

The jar storage directory is D:\storage\maven_repository\us\codecraft\0.8.0; execute the commands above from this directory.
D:\storage\maven_repository\com\us\codecraft is the jar installation directory.

WebMagic documentation: http://webmagic.io/docs/zh/posts/ch1-overview/

Common XPath expressions for getting tag content and attributes:

html

<a href='www.some.com'><span>hello </span>world</a>

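# Get the text directly under the a tag

xpath("//a/text()") # world

# Get the text of the a tag and its child tags

xpath("//a//text()") # hello world

# Get the link in the a tag

xpath("//a/@href") # www.some.com

In other words, to get a tag's attribute value, use location/@attribute.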
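
The three expressions above can also be tried without a crawler. The following is only a minimal sketch: it uses the JDK's built-in javax.xml.xpath API instead of WebMagic's own selector, so it only works on well-formed markup, and the class name XPathDemo and the helper select are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathDemo {

    // Hypothetical helper: evaluates an XPath expression against a well-formed
    // markup string and returns the value of every matched text or attribute node.
    public static List<String> select(String markup, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String markup = "<a href='www.some.com'><span>hello </span>world</a>";
        System.out.println(select(markup, "//a/text()"));   // text directly under <a>
        System.out.println(select(markup, "//a//text()"));  // text under <a> and its child tags
        System.out.println(select(markup, "//a/@href"));    // value of the href attribute
    }
}
```

Here each matched text node is returned as its own list element, so //a//text() yields "hello " and "world" as two entries rather than one joined string.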

More XPath examples: https://www.runoob.com/xpath/xpath-examples.html

The following demo collects the nicknames from a web page:

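import java.util.ArrayList;
import java.util.List;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class MyPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    // Stores the nicknames collected from the page
    private static List<String> nickNameList = new ArrayList<>();

    @Override
    public void process(Page page) {
        // Extract the text of every p tag (including its child tags) from the fetched page
        List<String> values = page.getHtml().xpath("//p//text()").all();
        page.putField("value", values);
        // Collect the extracted data
        nickNameList.addAll(values);
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // Create the crawler
        Spider.create(new MyPageProcessor())            // hand the PageProcessor to the Spider
                .addUrl("https://www.qiwangming.com/wm/haoting/4583.html")   // URL to crawl
                //.addPipeline(new FilePipeline())          // save the results to a file
                //.addPipeline(new ConsolePipeline())       // print the results to the console
                //.addPipeline(new JsonFilePipeline())      // save the results in JSON format
                // when no pipeline is set, the results are printed to the console by default
                //.thread(5)                                // run with 5 threads
                .run();                                     // start the crawler
        // "输出内容" means "output content"
        System.out.println("输出内容:" + MyPageProcessor.nickNameList);
    }
}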

Output content:

输出内容:[ 漫步云中月,  关于你,  触摸的星光,  踏雪无痕,  但愿,  藏在云里的喜欢,  美梦收藏家,  趁月色还在,  倾听寂寞,  独往归途,  扬花落满肩,  人间烟火,  ,  微云淡月,  山月记,  追逐我的明天。,  染指流年,  且听且行,  簡單陪絆,  你如温阳,  梦里七彩虹,  闻风丧破胆,  初雪未霁,  偏于谁,  暖光的惆怅,  你眼里的雾,  恰上心头,  长得帅会喊麦,  初夏的雨,  望断归来路,  终于说出口,  故事讲完,  云淡风轻,  怀抱清风,  落梅香带雪,  泪染裳,  佯装执着,  深爱不腻,  月亮魔法,  笑眼迷人,  顾北清歌寒,  难能心动,  世俗眼光,  满是欢喜,  月亮遮住脸,  莫洛曾过往,  难得一生,  往复随安,  笑弄清风,  枯守一座城,  南风向北,  草莓仙,  一池喜欢,  起舞弄影,  寒橘,  沧桑为饮,  雨下的芭蕉,  绝世的容颜,  从心动到古稀,  时光旅行者,  风中的歌声,  凡尘一梦,  繁星画作泥尘,  眉黛浅,  旧城的伤,  掌握梦想,  云深不知处,  证明给你看,  笔尖微凉,  一纸水与青,  望一片星辰,  北巷长歌悠,  饮惯烈酒,  泪水中成长,  遙遙無歸期,  灵感集市,  等风醒来,  山水几相逢,  如初不遇,  心动甜甜圈,  山后别相逢,  北葵向暖,  与我何干,  你不好看,  笑中带伤,  清风徐来,  光辉时刻,  稳做枕边人,  你是柔风,  随风远走,  煙雨霓裳,  孤魂伴野鬼,  澄澈的眼,  绿杨堤黄鸟,  迷上书甜,  恬淡春风,  许你春夏,  余情已逝,  人间惊鸿宴,  碎了星光一地,  椰果味的牛奶,  湛蓝星空,  躺在你的梨涡里,  一曲墨白,  光辉终结,  追尾的猫,   |  |  ,  copyright © 2018-2020   ]
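
The raw list above still contains leading spaces, an empty entry, and footer text such as the copyright line. As a minimal sketch of one way to tidy it, the following uses only the JDK; the class name NicknameCleaner and the rule that entries containing "|" or "copyright" are footer text are assumptions made for illustration, not part of WebMagic.

```java
import java.util.ArrayList;
import java.util.List;

public class NicknameCleaner {

    // Trims each raw entry and drops entries that are empty or look like footer text.
    public static List<String> clean(List<String> raw) {
        List<String> cleaned = new ArrayList<>();
        for (String entry : raw) {
            String name = entry.trim();
            // Assumption: entries containing '|' or "copyright" are page footer, not nicknames
            boolean footer = name.contains("|") || name.toLowerCase().contains("copyright");
            if (!name.isEmpty() && !footer) {
                cleaned.add(name);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // A few entries copied from the output above, including the noisy ones
        List<String> raw = List.of(" 漫步云中月", " 关于你", " ", " 微云淡月",
                "  |  |  ", " copyright © 2018-2020   ");
        System.out.println(clean(raw));
    }
}
```

A narrower XPath expression that targets only the element holding the nicknames would avoid most of this noise in the first place, but that depends on the structure of the page being crawled.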

If you are not sure where a tag sits on the page, view the page source and format it with an online HTML formatter: http://www.wetools.com/html-formatter

Origin blog.csdn.net/weixin_43401380/article/details/129122759