HtmlUnit + Jsoup solves the problem that crawlers cannot parse and execute JavaScript

I've been working on crawlers recently, as a newcomer to the field. After researching a number of crawler frameworks, I found plenty of open-source ones with complete feature sets, but unfortunately none that properly interprets and executes JS. After reading "A brief talk on web crawlers crawling JS dynamically loaded web pages (2)", I was quite moved. First of all, I admire the blogger's research spirit. Although the second and third schemes in that article are not very reliable, coming up with them at all shows strong divergent thinking rather than being stuck on a single line of attack. Still, I felt the blogger did not know HtmlUnit well enough (perhaps that is my own misunderstanding), so I dug into the details myself. After reading HtmlUnit's introduction, I had a hunch that there was no reason it could not automatically interpret and execute JS, and the facts proved me right. Enough talk; on to the code.

Here we take the address http://cq.qq.com/baoliao/detail.htm?294064 as a test case. Looking at the page source, we can see that the title, content, and pageview count are all placeholders that are filled in by JavaScript when the page loads. The following code retrieves the pageview field of the article.


import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.junit.Assert;
import org.junit.Test;

@Test
public void testCrawler() throws Exception {
    /* HtmlUnit requests the web page */
    WebClient wc = new WebClient();
    wc.getOptions().setJavaScriptEnabled(true);            // enable the JS interpreter (true by default)
    wc.getOptions().setCssEnabled(false);                   // disable CSS support
    wc.getOptions().setThrowExceptionOnScriptError(false);  // do not throw when a script error occurs
    wc.getOptions().setTimeout(10000);                      // connection timeout in ms, here 10 s; 0 means wait indefinitely
    HtmlPage page = wc.getPage("http://cq.qq.com/baoliao/detail.htm?294064");
    String pageXml = page.asXml();                          // get the rendered response as XML

    /* Jsoup parses the document */
    Document doc = Jsoup.parse(pageXml, "http://cq.qq.com");
    Element pv = doc.select("#feed_content span").get(1);
    System.out.println(pv.text());
    Assert.assertTrue(pv.text().contains("浏览"));

    System.out.println("Thank God!");
}
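
One thing worth noting: if the target page fills its placeholders through AJAX calls that run after the initial document load, asXml() can capture the DOM before those calls finish. Below is a minimal sketch (not part of the original test) of how I would tell HtmlUnit to re-synchronize AJAX calls and wait for background JavaScript before handing the markup to Jsoup; the 10-second wait is an arbitrary placeholder, and it assumes an HtmlUnit 2.x version in which WebClient is AutoCloseable.

import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

/* Sketch: fetch a page and let background JS finish before returning the markup. */
public String fetchRenderedXml(String url) throws Exception {
    try (WebClient wc = new WebClient()) {                  // WebClient is AutoCloseable in HtmlUnit 2.18+
        wc.getOptions().setJavaScriptEnabled(true);
        wc.getOptions().setCssEnabled(false);
        wc.getOptions().setThrowExceptionOnScriptError(false);
        // Replay XMLHttpRequest calls synchronously so their results end up in the DOM
        wc.setAjaxController(new NicelyResynchronizingAjaxController());
        HtmlPage page = wc.getPage(url);
        // Block until pending background JS jobs finish, or give up after 10 seconds
        wc.waitForBackgroundJavaScript(10_000);
        return page.asXml();
    }
}

The returned string can then be passed to Jsoup.parse() and queried with the same #feed_content selector as in the test above.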

