Parsing HTML with the third-party library Jsoup

Jsoup official site: http://jsoup.org

Apache HttpComponents official site: http://hc.apache.org/index.html

 

1. Fetching the HTML content

Here we use the Apache HttpClient library to fetch the remote HTML for a given URL:

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public static String getHTMLFromURL(String url) {
    String html = null;
    HttpClient httpClient = new DefaultHttpClient();
    HttpGet httpGet = new HttpGet(url);
    try {
        // Execute the GET request and check the HTTP status code
        HttpResponse httpResponse = httpClient.execute(httpGet);
        int resStatus = httpResponse.getStatusLine().getStatusCode();
        if (resStatus == HttpStatus.SC_OK) {
            // Read the response body as a string
            HttpEntity entity = httpResponse.getEntity();
            if (entity != null) {
                html = EntityUtils.toString(entity);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Release the underlying connection resources
        httpClient.getConnectionManager().shutdown();
    }
    return html;
}
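
The DefaultHttpClient used above comes from HttpClient 4.0–4.2 and was deprecated in later releases. As a rough sketch, assuming HttpClient 4.3 or newer is on the classpath, the same fetch can be written with the CloseableHttpClient API and try-with-resources (the method name getHTMLFromURL2 is just illustrative):

import java.io.IOException;

import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public static String getHTMLFromURL2(String url) {
    // Sketch only: assumes HttpClient 4.3+; both resources are closed automatically
    try (CloseableHttpClient httpClient = HttpClients.createDefault();
         CloseableHttpResponse response = httpClient.execute(new HttpGet(url))) {
        if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
            return EntityUtils.toString(response.getEntity(), "UTF-8");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}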

 

2. Parsing the HTML

Example: print Baidu's page title.

> Parse the HTML string into a Document object

Document doc = Jsoup.parse(html);

> Select elements with jQuery-like CSS selectors (a richer selector example is sketched after the complete program below)

Elements elements = doc.select("title");

> Print the text of the selected elements

System.out.println(elements.text());

Putting the steps together:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = WebCrawler.getHTMLFromURL("http://www.baidu.com");
if (html != null) {
    Document doc = Jsoup.parse(html);
    // Select every <title> element and print its text
    Elements elements = doc.select("title");
    for (Element element : elements) {
        System.out.println(element.text());
    }
}
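
Beyond the title tag, select() accepts richer jQuery-style CSS selectors. A minimal sketch, reusing the getHTMLFromURL helper from step 1 (the selector and printed format below are just illustrative):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = WebCrawler.getHTMLFromURL("http://www.baidu.com");
if (html != null) {
    // Pass the base URI so that abs:href can resolve relative links
    Document doc = Jsoup.parse(html, "http://www.baidu.com");
    // "a[href]" selects every <a> element that has an href attribute
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        System.out.println(link.text() + " -> " + link.attr("abs:href"));
    }
}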

 

Running result: the program prints Baidu's page title to the console.

Reprinted from: https://www.cnblogs.com/dyingbleed/archive/2013/03/20/2970841.html
