本文承接《WebCollector 简介与快速入门》

正文提取简介

网页正文提取项目 ContentExtractor 已并入 WebCollector 维护。
WebCollector 的正文抽取 API 都被封装为 ContentExtractor（内容提取）类的静态方法。
ContentExtractor 可以抽取结构化新闻，也可以只抽取网页的正文（或正文所在Element)。
正文抽取效果指标 :

比赛数据集 CleanEval P=93.79% R=86.02% F=86.72%

常见新闻网站数据集 P=97.87% R=94.26% F=95.33%

算法无视语种，适用于各种语种的网页。
标题抽取和日期抽取使用简单启发式算法，并没有像正文抽取算法一样在标准数据集上测试，算法仍在更新中。经过实测标题与时间确实错误率略高一些

正文提取API

# 直接将新闻类网页爬取结果封装为 News，默认包含标题、时间、正文、网址、以及正文元素
News news = ContentExtractor.getNewsByHtml(html, url);
News news = ContentExtractor.getNewsByHtml(html);
News news = ContentExtractor.getNewsByUrl(url);

# 可以将任意类型的网址（不限于新闻类）的正文爬取下来
String content = ContentExtractor.getContentByHtml(html, url);
String content = ContentExtractor.getContentByHtml(html);
String content = ContentExtractor.getContentByUrl(url);

# 第一个中的 News 中也包含了 contentElement 这个属性，这是网页正文元素，可以更加灵活的获取正文中的任意内容
Element contentElement = ContentExtractor.getContentElementByHtml(html, url);
Element contentElement = ContentExtractor.getContentElementByHtml(html);
Element contentElement = ContentExtractor.getContentElementByUrl(url);

ContentExtractor

ContentExtractor. getNewsByUrl( url)

本例爬取凤凰新闻，包括新闻的标题、时间、正文。

package com.lct.webCollector;

import cn.edu.hfut.dmic.contentextractor.ContentExtractor;
import cn.edu.hfut.dmic.contentextractor.News;
import org.jsoup.nodes.Element;

/**
 * Created by Administrator on 2018/8/14 0014.
 */
public class NewsContentCrawler {
    public static void main(String[] args) throws Exception {

        /**凤凰网新闻*/
        String url = "http://news.ifeng.com/a/20180814/59808481_0.shtml";

        /**
         * getNewsByHtml 方法内部在获取 标题、时间等内容时如果错误则内部抛异常
         * 对于新闻内容是 js 动态生成时，即页面右击 查看源码 不能看到爬取的内容时，则 ContentExtractor 方法也无能为力
         *
         * ContentExtractor 内容提取器重载了4个方法 获取 News 对象
         * getNewsByUrl(String url)：输入URL，获取结构化新闻信息-------常用的方式
         * getNewsByDoc(Document doc)：输入Jsoup的Document，获取结构化新闻信息
         * getNewsByHtml(String html)：输入HTML，获取结构化新闻信息
         * getNewsByHtml(String html, String url)：输入HTML和URL，获取结构化新闻信息
         *
         */
        News news = ContentExtractor.getNewsByUrl(url);
        System.out.println("爬取网址：" + news.getUrl());
        System.out.println("发布时间：" + news.getTime());
        System.out.println("文章标题：" + news.getTitle());
        System.out.println("文章内容：" + news.getContent());
    }
}

运行结果

爬取网址：http://news.ifeng.com/a/20180814/59808481_0.shtml
发布时间：2018-08-14 13:17:20 :: 	
文章标题：云南通海地震已造成24人受伤 6.96万人受灾	
文章内容：原标题：云南通海地震已造成24人受伤6.96万人受灾 记者从云南省民政厅了解到，截至8月14日11时30分，地震造成通海县、江川区、华宁县3县6.96万人受灾，24人受伤（其中通海18人，江川6人），紧急转移安置33148人；房屋不同程度受损8000余户。灾情仍在进一步核查中。
    截至目前，云南省民政厅共向灾区调拨帐篷2053顶，棉被6000床，折叠床3600张，床垫3600张，床上用品3000件，彩条布3000件，灾区共搭建帐篷590顶，发放折叠床100张，棉被80床，大米4.0吨，用于保障受灾群众基本生活。14日上午7时，省减灾委办公室、省民政厅决定将四级响应提升至三级，目前，灾区社会秩序稳定，各项抗震救灾工作正有序开展。（央视记者 陈坚）

Process finished with exit code 0

News

News 是新闻的实体类，如下所示4个属性，用于封装新闻内容
其中默认会封装新闻的标题、时间、正文内容、以及当前爬取的网址，如果想要继续抓取正文中的图片、视频或者其它内容，则可以使用其提供的 contentElement 属性
注意：org.jsoup.nodes.Element 类型的 contentElement 是正文内容所在的标签元素，而不是整个页面(html)标签

/*
 * Copyright (C) 2015 hu
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
 */
package cn.edu.hfut.dmic.contentextractor;

import org.jsoup.nodes.Element;

/**
 *
 * @author hu
 */
public class News {

    protected String url = null;
    protected String title = null;
    protected String content = null;
    protected String time = null;

    protected Element contentElement = null;

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        if (content == null) {
            if (contentElement != null) {
                content = contentElement.text();
            }
        }
        return content;
    }
    
    

    public void setContent(String content) {
        this.content = content;
    }

    public String getTime() {
        return time;
    }

    public void setTime(String time) {
        this.time = time;
    }

    @Override
    public String toString() {
        return "URL:\n" + url + "\nTITLE:\n" + title + "\nTIME:\n" + time + "\nCONTENT:\n" + getContent() + "\nCONTENT(SOURCE):\n" + contentElement;
    }

    public Element getContentElement() {
        return contentElement;
    }

    public void setContentElement(Element contentElement) {
        this.contentElement = contentElement;
    }

   
}

新闻图片爬取

package com.lct.webCollector;

import cn.edu.hfut.dmic.contentextractor.ContentExtractor;
import cn.edu.hfut.dmic.contentextractor.News;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Created by Administrator on 2018/8/14 0014.
 */
public class NewsContentCrawler {
    public static void main(String[] args) throws Exception {

        /**凤凰网新闻*/
        String url = "http://news.ifeng.com/a/20180814/59807597_0.shtml";

        /**
         * getNewsByHtml 方法内部在获取 标题、时间等内容时如果错误则内部抛异常
         * 对于新闻内容是 js 动态生成时，即页面右击 查看源码 不能看到爬取的内容时，则 ContentExtractor 方法也无能为力
         *
         * ContentExtractor 内容提取器重载了4个方法 获取 News 对象
         * getNewsByUrl(String url)：输入URL，获取结构化新闻信息-------常用的方式
         * getNewsByDoc(Document doc)：输入Jsoup的Document，获取结构化新闻信息
         * getNewsByHtml(String html)：输入HTML，获取结构化新闻信息
         * getNewsByHtml(String html, String url)：输入HTML和URL，获取结构化新闻信息
         *
         */
        News news = ContentExtractor.getNewsByUrl(url);
        System.out.println("爬取网址：" + news.getUrl());
        System.out.println("发布时间：" + news.getTime());
        System.out.println("文章标题：" + news.getTitle());
        System.out.println("文章内容：" + news.getContent());

        /**
         * news.getContentElement() 返回的是正文所在的标签元素
         */
        Element contentElement = news.getContentElement();
        System.out.println("正文内容标签：" + contentElement.tagName());
        System.out.println("正文内容标签样式：" + contentElement.className());

        /** 根据标签名递归查询正文下的图片标签
         * 同理可以获取正文标签下其它任意想要获取的内容*/
        Elements elements = contentElement.getElementsByTag("img");
        if (elements != null && elements.size() > 0) {
            Element loopElement = null;
            for (int i = 0; i < elements.size(); i++) {
                loopElement = elements.get(i);
                System.out.println("图片地址：" + loopElement.attr("src"));
            }
        }
    }
}

运行结果：

爬取网址：http://news.ifeng.com/a/20180814/59807597_0.shtml
发布时间：2018-08-14 12:14:52 :: 	文章标题：重庆市公安局政治部主任蔡聘被查	
文章内容：重庆市公安局党委委员、政治部主任蔡聘涉嫌严重违纪违法，目前正接受重庆市纪委监委纪律审查和监察调查。 蔡聘简历 蔡聘，男，汉族，1963年9月出生，重庆潼南人，重庆市委党校在职研究生，1986年7月参加工作，1987年10月加入中国共产党。 ........省略300字
正文内容标签：div
正文内容标签样式：js_selection_area
图片地址：http://p0.ifengimg.com/a/2018_33/e5d5dc9179cc46c_size82_w580_h385.jpg
图片地址：http://p2.ifengimg.com/a/2016/0810/204c433878d5cf9size1_w16_h16.png

Process finished with exit code 0

ContentExtractor. getContentByUrl( url)

package com.lct.webCollector;

import cn.edu.hfut.dmic.contentextractor.ContentExtractor;
import cn.edu.hfut.dmic.contentextractor.News;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Created by Administrator on 2018/8/14 0014.
 */
public class NewsContentCrawler {
    public static void main(String[] args) throws Exception {

        /**
         * 获取 News 要求是新闻类的结构性强的网站
         * 然而获取网页正文，则适用于各种网页
         */
        /*String url = "http://news.ifeng.com/a/20180814/59807593_0.shtml";*/
        String url = "https://baike.baidu.com/item/spring%20cloud/20269825";

        /**
         * 除了获取 News（新闻），也提供了获取网页正文的方式，正文提取
         *
         * ContentExtractor 重载了4个方法 获取网页正文
         * getContentByUrl(String url)：输入URL，获取正文文本 ----常用方式
         * getContentByHtml(String html, String url)：输入HTML和URL，获取正文文本
         * getContentByHtml(String html)：输入HTML，获取正文文本
         * getContentByDoc(Document doc)：输入Jsoup的Document，获取正文文本
         */
        String content = ContentExtractor.getContentByUrl(url);
        System.out.println("网页正文：\n"+content);

    }
}

运行结果：

网页正文：
收藏 查看我的收藏 0 有用+1 已投票 0 spring cloud 编辑 锁定 本词条缺少名片图，补充相关内容使词条更完整，还能快速升级，赶紧来编辑吧！ Spring Cloud是一系列框架的有序集合。
它利用Spring Boot的开发便利性巧妙地简化了分布式系统基础设施的开发，如服务发现注册、配置中心、消息总线、负载均衡、断路器、数据监控等，都可以用Spring Boot的开发风格做到一键启动和部署。.......省略400字

Process finished with exit code 0

ContentExtractor. getContentElementByUrl( url)

与其它方式一样，默认情况下，如果网页右击->查看源码不能看到爬取的元素时，则也是爬取不到的

package com.lct.webCollector;

import cn.edu.hfut.dmic.contentextractor.ContentExtractor;
import cn.edu.hfut.dmic.contentextractor.News;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Created by Administrator on 2018/8/14 0014.
 */
public class NewsContentCrawler {
    public static void main(String[] args) throws Exception {

        /**
         * News 实体类也封装了正文所在的 Element
         * 为了更加灵活，ContentExtractor 重载了直接获取正文元素的4个方法
         * 拿到了正文元素，则可以自己操作其中的任意元素
         * 暂时 像 http://www.58pic.com/ 千图网这种通过js动态生成的网页，想要爬取它们的图片这样还是不行的
         *
         * getContentElementByUrl(String url)：输入URL，获取正文所在Element----常用方式
         * getContentElementByHtml(String html, String url)：输入HTML和URL，获取正文所在Element
         * getContentElementByHtml(String html)：输入HTML，获取正文所在Element
         * getContentElementByDoc(Document doc)：输入Jsoup的Document，获取正文所在Element
         */
        String url = "http://news.ifeng.com/a/20180814/59801493_0.shtml";
        Element contentElement = ContentExtractor.getContentElementByUrl(url);
        Elements elements = contentElement.getElementsByTag("img");
        for (int i = 0; i < elements.size(); i++) {
            System.out.println("图片地址：" + elements.get(i).attr("src"));
        }
    }
}

运行结果：

图片地址：http://p3.ifengimg.com/a/2018_33/3a2bfaa2d922782_size26_w550_h367.jpg
图片地址：http://p2.ifengimg.com/a/2016/0810/204c433878d5cf9size1_w16_h16.png

Process finished with exit code 0

WebCollector 网页正文快速提取

正文提取简介

正文提取API

ContentExtractor

ContentExtractor. getNewsByUrl( url)

运行结果

News

新闻图片爬取

ContentExtractor. getContentByUrl( url)

ContentExtractor. getContentElementByUrl( url)

猜你喜欢