JSoup official site: http://jsoup.org
Apache HttpComponents official site: http://hc.apache.org/index.html
1. Crawl the HTML content
Here we use the HttpClient library to fetch the remote HTML for a given URL:
public static String getHTMLFromURL(String url) {
    String html = null;
    HttpClient httpClient = new DefaultHttpClient();
    HttpGet httpGet = new HttpGet(url);
    try {
        HttpResponse httpResponse = httpClient.execute(httpGet);
        int resStatus = httpResponse.getStatusLine().getStatusCode();
        if (resStatus == HttpStatus.SC_OK) { // 200 OK
            HttpEntity entity = httpResponse.getEntity();
            if (entity != null) {
                html = EntityUtils.toString(entity);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        httpClient.getConnectionManager().shutdown();
    }
    return html;
}
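In the method above, EntityUtils.toString(entity) does the work of collecting the response stream into a String. As a rough sketch of the same idea using only the standard library (the class name StreamToString and the helper readStream are illustrative, not part of HttpClient):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class StreamToString {

    // Reads an entire InputStream into a String, the same job
    // EntityUtils.toString does for the HTTP response entity.
    // A Reader is used so multi-byte UTF-8 characters are never
    // split across buffer boundaries.
    public static String readStream(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "<html>...</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readStream(in)); // prints "<html>...</html>"
    }
}
```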
2. Parse the HTML
Example: print the title of the Baidu homepage.
> Parse the HTML into a Document object
Document doc = Jsoup.parse(html);
> Select elements with a jQuery-like CSS selector
Elements elements = doc.select("title");
> Print the text of each selected element

for (Element element : elements) {
    System.out.println(element.text());
}
String html = WebCrawler.getHTMLFromURL("http://www.baidu.com");
if (html != null) {
    Document doc = Jsoup.parse(html);
    Elements elements = doc.select("title");
    for (Element element : elements) {
        System.out.println(element.text());
    }
}
Running this prints the title of the Baidu homepage.
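JSoup is the right tool for real-world HTML, but to see what `doc.select("title")` is extracting, a crude standard-library sketch works for this one well-formed tag (the class TitleExtractor, the method extractTitle, and the sample HTML below are made up for illustration, not part of JSoup):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    // Crude fallback: pull the <title> text with a regex instead of a
    // real HTML parser. Adequate for this single tag; for general HTML
    // a parser like JSoup remains the right tool.
    public static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title[^>]*>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example Page</title></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints "Example Page"
    }
}
```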
Reproduced from: https://www.cnblogs.com/dyingbleed/archive/2013/03/20/2970841.html