Crawling Sohu News with Jsoup + HtmlUnit

How should I put it: the pages here are static, but I also wrote an interface that supports dynamic pages, so it will be easy to crawl other news sites later.

First, an interface. It has one abstract method, pullNews, for pulling news, plus a default method for fetching a page as a Jsoup Document:

public interface NewsPuller {

    void pullNews();

    // url: the news URL
    // useHtmlUnit: whether to use HtmlUnit
    default Document getHtmlFromUrl(String url, boolean useHtmlUnit) throws Exception {
        if (!useHtmlUnit) {
            return Jsoup.connect(url)
                    // pretend to be a desktop browser
                    .userAgent("Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)")
                    .get();
        } else {
            WebClient webClient = new WebClient(BrowserVersion.CHROME);
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setActiveXNative(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setTimeout(10000);
            HtmlPage htmlPage = null;
            try {
                htmlPage = webClient.getPage(url);
                webClient.waitForBackgroundJavaScript(10000);
                String htmlString = htmlPage.asXml();
                return Jsoup.parse(htmlString);
            } finally {
                webClient.close();
            }
        }
    }

}
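To compile the examples you need jsoup and HtmlUnit on the classpath. A sketch of the Maven dependencies (the version numbers are my assumption; substitute whatever current versions you use):

```xml
<!-- jsoup: HTML parsing and CSS-style selection -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<!-- HtmlUnit: headless browser for JavaScript-rendered pages -->
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.32</version>
</dependency>
```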

With that in place, the crawler itself:

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SohuNewsPuller implements NewsPuller {

    public static void main(String[] args) {
        SohuNewsPuller puller = new SohuNewsPuller();
        puller.pullNews();
    }

    private String url="http://news.sohu.com/";
    @Override
    public void pullNews() {
        Document html= null;
        try {
            html = getHtmlFromUrl(url, false);
        } catch (Exception e) {
            e.printStackTrace();
            return;
        }
        // 2. Use jsoup to grab the news <a> tags
        Elements newsATags = html.select("div.focus-news")
                .select("div.list16")
                .select("li")
                .select("a");

        for (Element a : newsATags) {
            String url = a.attr("href");
            System.out.println("content: " + a.text());
            Document newsHtml = null;
            try {
                newsHtml = getHtmlFromUrl(url, false);
                Element newsContent = newsHtml.select("div#article-container")
                        .select("div.main")
                        .select("div.text")
                        .first();
                String title1 = newsContent.select("div.text-title").select("h1").text();
                String content = newsContent.select("article.article").first().text();
                System.out.println(url + "\n" + title1 + "\n" + content);
               
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

}
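One pitfall in the crawler above: `a.attr("href")` returns the href exactly as written in the page, which may be a relative or protocol-relative URL, and `getHtmlFromUrl` would then fail on it. When a Document is parsed with a base URI (Jsoup.connect sets it automatically), jsoup can resolve links via the `abs:` attribute prefix. A small self-contained sketch (the sample link is made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AbsHrefDemo {
    public static void main(String[] args) {
        String html = "<a href=\"/a/123.shtml\">story</a>";
        // Parsing with a base URI lets jsoup resolve relative links
        Document doc = Jsoup.parse(html, "http://news.sohu.com/");
        System.out.println(doc.select("a").first().attr("href"));      // as written: /a/123.shtml
        System.out.println(doc.select("a").first().attr("abs:href"));  // resolved absolute URL
    }
}
```

So using `a.attr("abs:href")` in pullNews would make the crawler robust to relative links.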

Result:

Of course, the content has not been cleaned yet; cleaning, and crawling dynamic sites, will be covered in a follow-up post.
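As a minimal illustration of what a first cleaning pass might look like (purely a sketch, not necessarily the approach the follow-up will take): jsoup's `text()` already collapses runs of whitespace, so normalizing the extracted article text can start as simply as:

```java
import org.jsoup.Jsoup;

public class CleanDemo {
    public static void main(String[] args) {
        String raw = "  Some   article text\n\n with messy   spacing. ";
        // Element.text() collapses internal whitespace; trim() handles the edges
        String cleaned = Jsoup.parse(raw).text().trim();
        System.out.println(cleaned);
    }
}
```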

Reference blog: https://blog.csdn.net/gx304419380/article/details/80619043#commentsedit

The code has been uploaded to GitHub: https://github.com/mmmjh/GetSouhuNews

Comments and criticism welcome!

Most of the code is based on someone else's blog post; I just rebuilt the project.

Origin www.cnblogs.com/mm20/p/11328941.html