This post covers crawling static pages, but I also wrote an interface that supports dynamic pages, so the crawler can easily be extended to other news sites later.
First, the interface. It declares an abstract method pullNews for pulling the news, plus a default method for fetching the HTML of a URL:

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public interface NewsPuller {

    void pullNews();

    // url: the URL of the news page
    // useHtmlUnit: whether to use HtmlUnit (needed for pages rendered by JavaScript)
    default Document getHtmlFromUrl(String url, boolean useHtmlUnit) throws Exception {
        if (!useHtmlUnit) {
            return Jsoup.connect(url)
                    // spoof a desktop browser user agent
                    .userAgent("Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)")
                    .get();
        } else {
            WebClient webClient = new WebClient(BrowserVersion.CHROME);
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setActiveXNative(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setTimeout(10000);
            try {
                HtmlPage htmlPage = webClient.getPage(url);
                // give background JavaScript up to 10 seconds to finish
                webClient.waitForBackgroundJavaScript(10000);
                return Jsoup.parse(htmlPage.asXml());
            } finally {
                webClient.close();
            }
        }
    }
}
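To show how the interface supports other sites, here is a minimal sketch of a second puller. The class name, URL, and CSS selector below are hypothetical placeholders for illustration, not part of the original project; passing true as the second argument routes the fetch through HtmlUnit, which a JavaScript-rendered page would need.

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example: a puller for some other news site.
// The URL and selector are placeholders and would have to be
// adapted to the real page structure of the target site.
public class ExampleNewsPuller implements NewsPuller {

    private String url = "https://news.example.com/";

    @Override
    public void pullNews() {
        try {
            // true = fetch via HtmlUnit, for pages rendered by JavaScript
            Document html = getHtmlFromUrl(url, true);
            for (Element a : html.select("ul.headlines li a")) {
                System.out.println(a.text() + " -> " + a.attr("href"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}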
Next, the crawler itself:
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SohuNewsPuller implements NewsPuller {

    private String url = "http://news.sohu.com/";

    public static void main(String[] args) {
        SohuNewsPuller puller = new SohuNewsPuller();
        puller.pullNews();
    }

    @Override
    public void pullNews() {
        // 1. Fetch the home page (a static page, so HtmlUnit is not needed)
        Document html;
        try {
            html = getHtmlFromUrl(url, false);
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        // 2. Use jsoup to select the news <a> tags
        Elements newsATags = html.select("div.focus-news")
                .select("div.list16")
                .select("li")
                .select("a");
        // 3. Fetch each linked article and extract its title and body
        for (Element a : newsATags) {
            String newsUrl = a.attr("href");
            System.out.println("Content: " + a.text());
            try {
                Document newsHtml = getHtmlFromUrl(newsUrl, false);
                Element newsContent = newsHtml.select("div#article-container")
                        .select("div.main")
                        .select("div.text")
                        .first();
                String title = newsContent.select("div.text-title").select("h1").text();
                String content = newsContent.select("article.article").first().text();
                System.out.println(newsUrl + "\n" + title + "\n" + content);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
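Because every site-specific puller implements the same NewsPuller interface, a driver class can run them all uniformly. Here is a minimal sketch, reusing the hypothetical ExampleNewsPuller from the earlier sketch:

import java.util.Arrays;
import java.util.List;

public class CrawlerMain {
    public static void main(String[] args) {
        // Supporting a new site only means adding one class to this list.
        List<NewsPuller> pullers = Arrays.asList(
                new SohuNewsPuller(),
                new ExampleNewsPuller() // hypothetical puller from the sketch above
        );
        for (NewsPuller puller : pullers) {
            puller.pullNews();
        }
    }
}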
Result:
Of course, the content has not been cleaned yet; data cleaning and crawling dynamic websites will be covered in a follow-up post.
Reference blog: https://blog.csdn.net/gx304419380/article/details/80619043
The code has been uploaded to GitHub: https://github.com/mmmjh/GetSouhuNews
Feedback and criticism are welcome!
Most of the code is based on the referenced blog post; I just reproduced the project.