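[HTML Parser] Parsing HTML with the third-party library Jsoup

Jsoup official site: http://jsoup.org

Apache HttpComponents official site: http://hc.apache.org/index.html

1. Fetching the HTML content

Here we use the HttpClient library to request the remote HTML for a given URL.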

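import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// Returns the response body as a String, or null if the request fails
// or the server answers with anything other than 200 OK.
public static String getHTMLFromURL(String url) {
    String html = null;
    // Note: DefaultHttpClient is deprecated as of HttpClient 4.3.
    HttpClient httpClient = new DefaultHttpClient();
    HttpGet httpGet = new HttpGet(url);
    try {
        HttpResponse httpResponse = httpClient.execute(httpGet);
        int statusCode = httpResponse.getStatusLine().getStatusCode();
        if (statusCode == HttpStatus.SC_OK) {
            HttpEntity entity = httpResponse.getEntity();
            if (entity != null) {
                // Decodes with the charset from the response headers, if one is declared
                html = EntityUtils.toString(entity);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Release the connections held by this client
        httpClient.getConnectionManager().shutdown();
    }
    return html;
}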
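For readers who would rather not depend on the deprecated DefaultHttpClient, the same fetch can be sketched with the HTTP client built into the JDK. This is only a minimal sketch, not the approach this article uses: it assumes Java 11 or later, the class name JdkFetchSketch and the 10-second timeouts are choices made for illustration, and the method keeps the same contract as above (the body on 200, otherwise null).

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper class; requires Java 11+ for java.net.http.
class JdkFetchSketch {

    // Returns the response body for a 200 response, or null otherwise.
    static String getHTMLFromURL(String url) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))      // assumed timeout
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))             // assumed timeout
                .GET()
                .build();
        try {
            // ofString() decodes with the charset from Content-Type, defaulting to UTF-8
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 ? response.body() : null;
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        String html = getHTMLFromURL("http://www.baidu.com");
        System.out.println(html == null
                ? "request failed"
                : html.length() + " characters fetched");
    }
}
```

The returned string can be handed to Jsoup.parse in exactly the same way as the result of the HttpClient version.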

2. Parsing the HTML

Example: print the title of the Baidu home page.

> Parse the HTML to obtain a Document object

Document doc = Jsoup.parse(html);

> Select elements using CSS or jQuery-like selectors

Elements elements = doc.select("title");

> Print the text content of each element

for (Element element : elements) {
    System.out.println(element.text());
}

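// Putting the steps together (getHTMLFromURL is the method from section 1)
String html = WebCrawler.getHTMLFromURL("http://www.baidu.com");
if (html != null) {
    Document doc = Jsoup.parse(html);
    Elements elements = doc.select("title");
    // Iterate over the selected elements and print each one's text
    for (Element element : elements) {
        System.out.println(element.text());
    }
}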
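The parse, select, and print steps can also be tried without any network access by feeding Jsoup a literal HTML string. The following is a minimal sketch under stated assumptions: the jsoup jar is on the classpath, and the class name SelectorSketch, the helper selectText, and the sample markup are all made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for experimenting with selectors offline.
class SelectorSketch {

    // Parses the HTML and returns the text of every element matching cssQuery.
    static List<String> selectText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Elements elements = doc.select(cssQuery);
        List<String> texts = new ArrayList<>();
        for (Element element : elements) {
            texts.add(element.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><p class=\"intro\">Hello</p><p>World</p></body></html>";
        System.out.println(selectText(html, "title"));   // tag selector
        System.out.println(selectText(html, "p.intro")); // tag plus class selector
    }
}
```

With the sample markup above, the "title" query matches a single element and the "p.intro" query matches only the first paragraph, which makes it easy to check a selector before pointing it at a live page.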

Output:

Reposted from: https://www.cnblogs.com/dyingbleed/archive/2013/03/20/2970841.html

Reposted from: blog.csdn.net/weixin_34310785/article/details/93301857