Crawling front-end rendered (AJAX) pages with WebMagic

Reprinted from http://blog.csdn.net/u013510614/article/details/50313931

Crawling front-end rendered pages

With the growing popularity of AJAX and the emergence of single-page application frameworks such as AngularJS, more and more pages are rendered by JavaScript. For a crawler such pages are rather annoying: extracting the HTML alone often yields no useful information. So how do we deal with them? In general there are two approaches:

  1. At crawl time, embed a browser engine in the crawler and execute the JavaScript to render the page before extracting from it. Tools for this include Selenium, HtmlUnit, and PhantomJS. These tools all have efficiency problems and are also not very stable. The advantage is that the extraction rules are written the same way as for static pages.
  2. Because the data of a JavaScript-rendered page also comes from the back end, and almost always via AJAX, it is also feasible to analyze the AJAX requests and find the one that returns the data. Compared with the page's markup and styling, this interface is less likely to change. The disadvantage is that finding the request and simulating it is relatively difficult and takes a fair amount of analysis experience.

Comparing the two, my view is that for one-off or small-scale needs the first approach saves time and effort, but for long-term, large-scale needs the second is more reliable. Some sites even use JavaScript obfuscation. In that case the first approach still works almost everywhere, while the second becomes very complicated.

For the first approach, webmagic-selenium is one such attempt: it defines a Downloader that uses a browser kernel to render the page when it is downloaded. Configuring Selenium is fairly complicated and depends on the platform and version, and there is no stable solution. If you are interested, you can read my blog post: Using Selenium to crawl dynamically loaded pages

Here I mainly introduce the second approach. I hope you will find that parsing a front-end rendered page is not that complicated. We take the AngularJS Chinese community, http://angularjs.cn/, as the example.

How to tell whether a page is front-end rendered

Judging whether a page is rendered by JavaScript is fairly simple. View the page source directly in the browser (Ctrl+U on Windows, Command+Alt+U on Mac). If the information you are after is not there, the page is almost certainly rendered by JavaScript.

[Image: angular-view]

[Image: angular-source]

In this example, the title "Youfu Computer Network - Front-end Siege Master" shown on the page cannot be found in the source, so we can conclude that the page is rendered by JavaScript and that this data is obtained via AJAX.
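The same check can be scripted. The following is a minimal sketch using only the JDK, not WebMagic. It assumes you already have the raw HTML as a string and a few text snippets that you saw in the browser. The class name RenderCheck, the method name, and the sample HTML are made up for illustration.

```java
import java.util.List;

/** Heuristic: does text that is visible in the browser also appear in the raw HTML? */
final class RenderCheck {

    /**
     * Returns true when none of the snippets occur in the raw HTML, which
     * suggests the visible content is filled in later by JavaScript.
     */
    static boolean looksFrontEndRendered(String rawHtml, List<String> visibleSnippets) {
        if (visibleSnippets.isEmpty()) {
            throw new IllegalArgumentException("need at least one snippet to look for");
        }
        for (String snippet : visibleSnippets) {
            if (rawHtml.contains(snippet)) {
                return false; // at least one snippet is already in the source
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical raw source of an AngularJS-style page: only a template shell.
        String rawHtml = "<html><body><div ng-view></div>"
                + "<script src=\"app.js\"></script></body></html>";
        List<String> seenInBrowser = List.of("Front-end Siege Master");
        System.out.println(looksFrontEndRendered(rawHtml, seenInBrowser)); // prints true
    }
}
```

This is only a hint, not proof: text can also be missing from the source because it sits in an iframe or is HTML-escaped differently, so it is worth checking more than one snippet.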

Analyzing the requests

Now we get to the hardest part: finding the request that returns this data. The tool that helps us here is mainly the browser's developer tools, which let us view the network requests.

Taking Chrome as an example, open the Developer Tools (F12 on Windows, Command+Alt+I on Mac) and refresh the page (or pull down to load more; in short, do whatever you think may trigger new data). Then keep that state as it is and go through the requests one by one!

This step takes a little patience, but it is not guesswork. The first thing that helps is the type filter at the top (All, Document, and so on). An ordinary AJAX request shows up under the XHR tab, and a JSONP request shows up under the Scripts tab. These are the two most common types of data request.

Next you can judge by the size of the response: generally, a larger response is more likely to be the interface that returns the data. The rest mostly comes down to experience. For example, "latest?p=1&s=20" here looks very suspicious at first glance...

[Image: angular-ajax-list]
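The two filters just described (request type first, then response size) can be written down as a small triage step. This is only a sketch with the JDK: the Captured class, the type strings "xhr" and "script", and all sizes in the sample are hypothetical values typed in by hand, not something the browser or WebMagic hands you in this form.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Orders captured requests so the most likely data interface comes first. */
final class RequestTriage {

    /** One row copied from the Network panel (hypothetical shape). */
    static final class Captured {
        final String url;
        final String type;  // e.g. "xhr", "script", "stylesheet", "document"
        final long bytes;   // response size in bytes

        Captured(String url, String type, long bytes) {
            this.url = url;
            this.type = type;
            this.bytes = bytes;
        }
    }

    /** Keeps XHR and script (possible JSONP) requests, largest response first. */
    static List<Captured> candidates(List<Captured> all) {
        List<Captured> result = new ArrayList<>();
        for (Captured c : all) {
            if (c.type.equals("xhr") || c.type.equals("script")) {
                result.add(c);
            }
        }
        result.sort(Comparator.comparingLong((Captured c) -> c.bytes).reversed());
        return result;
    }

    public static void main(String[] args) {
        // Sizes below are invented for the example.
        List<Captured> rows = List.of(
                new Captured("http://angularjs.cn/css/site.css", "stylesheet", 9000),
                new Captured("http://angularjs.cn/api/index", "xhr", 300),
                new Captured("http://angularjs.cn/api/article/latest?p=1&s=20", "xhr", 5400));
        for (Captured c : candidates(rows)) {
            System.out.println(c.bytes + "\t" + c.url);
        }
    }
}
```

The ordering only narrows down where to look first. Script requests also include ordinary JavaScript files, which can be large, so the response body still has to be inspected by hand.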

For a suspicious address, look at the response body. It is not easy to read in the developer tools, so we copy the URL http://angularjs.cn/api/article/latest?p=1&s=20 into the address bar and request it again (if you use Chrome, installing a JSON viewer extension is recommended; it makes AJAX results very convenient to read). Looking at the result, it seems we have found what we were looking for.

[Image: json]
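Requesting the suspicious URL again does not have to happen in the address bar. Here is a minimal sketch with the JDK's java.net.http client (Java 11 or later). The two request headers are assumptions: some back ends answer differently without them, but nothing in this example confirms that angularjs.cn needs them. The class name AjaxProbe and the --send switch are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Replays a suspicious AJAX URL outside the browser. */
final class AjaxProbe {

    /** Builds a GET request that looks roughly like a browser AJAX call. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")          // assumption
                .header("X-Requested-With", "XMLHttpRequest")  // assumption
                .GET()
                .build();
    }

    /** Sends the request and returns the response body as text. */
    static String fetchBody(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String url = "http://angularjs.cn/api/article/latest?p=1&s=20";
        HttpRequest request = buildRequest(url);
        System.out.println(request.method() + " " + request.uri());
        // Only touch the network when asked to; the endpoint may no longer exist.
        if (args.length > 0 && args[0].equals("--send")) {
            System.out.println(fetchBody(url));
        }
    }
}
```

If the body printed this way matches what the page shows, the request can be simulated by the crawler without a browser.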

In the same way, we go to a post's detail page and find the request for the actual content: http://angularjs.cn/api/article/A0y2

Writing the program

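Recall the earlier list + target page example. This time the requirement is similar, except that everything goes through AJAX: the list comes via AJAX, the data comes via AJAX, and the returned data is JSON. So we can still use the same approach as before and write the processing for two kinds of pages: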

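  1. Data list

    On this list page we need to find the information that helps us build the URL of the target AJAX request. Here we can see that this _id should be the id of the post we want, and that the post detail request is made up of a fixed URL plus this id. So in this step we construct the URL by hand and add it to the queue of pages to crawl. Here we use JsonPath, a selection language, to select the data (the webmagic-extension package provides JsonPathSelector to support it).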

     if (page.getUrl().regex(LIST_URL).match()) {
         // Here we use JsonPath, a selection language, to select the data
         List<String> ids = new JsonPathSelector("$.data[*]._id").selectList(page.getRawText());
         if (CollectionUtils.isNotEmpty(ids)) {
             for (String id : ids) {
                 page.addTargetRequest("http://angularjs.cn/api/article/"+id);
             }
         }
     }
    
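  2. Target data

    With the URL in hand, parsing the target data is actually very simple: because JSON data is fully structured, we skip the process of analyzing the page and writing XPath. Here we again use JsonPath to get the title and content.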

     page.putField("title", new JsonPathSelector("$.data.title").select(page.getRawText()));
     page.putField("content", new JsonPathSelector("$.data.content").select(page.getRawText()));
    

For the complete code of this example, see AngularJSProcessor.java

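Summary

In this example we analyzed the crawling process of a fairly typical dynamic page. In fact, the biggest difference in crawling dynamic pages is that it raises the difficulty of link discovery. Let's compare the two development modes: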

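  1. Back-end rendered pages

    Download auxiliary page => discover links => download and parse target HTML

  2. Front-end rendered pages

    Discover auxiliary data => construct links => download and parse target AJAX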

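For different sites, this auxiliary data may already be output in the page HTML, may be requested via AJAX, or may even take several data requests, but the pattern is basically fixed.

However, analyzing these data requests is still much more complicated than analyzing a page, so this is really the hard part of crawling dynamic pages.

What this section's example hopes to do is, once the requests have been analyzed, to provide a pattern to follow when writing this kind of crawler: discover auxiliary data => construct links => download and parse target AJAX.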
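To make the pattern concrete without depending on WebMagic, here is a minimal sketch using only the JDK. It is not the AngularJSProcessor code. The download step is passed in as a function so the loop can run against canned responses, and the regular expressions are a crude stand-in for JsonPath that assume ids and titles contain no escaped quotes. The class name, the second id B1z3, and the sample titles are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** discover auxiliary data => construct links => download and parse target AJAX */
final class AjaxCrawlPattern {

    private static final String DETAIL_PREFIX = "http://angularjs.cn/api/article/";
    private static final Pattern LIST_URL =
            Pattern.compile("http://angularjs\\.cn/api/article/latest.*");
    // Crude substitutes for JsonPath; they match at any nesting depth.
    private static final Pattern ID_FIELD =
            Pattern.compile("\"_id\"\\s*:\\s*\"([^\"]+)\"");
    private static final Pattern TITLE_FIELD =
            Pattern.compile("\"title\"\\s*:\\s*\"([^\"]*)\"");

    /** Crawls from the seed list URL and returns the titles of the target pages. */
    static List<String> crawlTitles(String seedUrl, Function<String, String> fetch) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> titles = new ArrayList<>();
        queue.add(seedUrl);
        seen.add(seedUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String body = fetch.apply(url);
            if (body == null) {
                continue; // download failed or nothing to parse
            }
            if (LIST_URL.matcher(url).matches()) {
                // Auxiliary data: pull out ids and construct the target links by hand.
                Matcher ids = ID_FIELD.matcher(body);
                while (ids.find()) {
                    String target = DETAIL_PREFIX + ids.group(1);
                    if (seen.add(target)) {
                        queue.add(target);
                    }
                }
            } else {
                // Target AJAX response: extract the field we care about.
                Matcher title = TITLE_FIELD.matcher(body);
                if (title.find()) {
                    titles.add(title.group(1));
                }
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // Canned responses instead of real downloads (contents are invented).
        Map<String, String> pages = new HashMap<>();
        pages.put("http://angularjs.cn/api/article/latest?p=1&s=20",
                "{\"data\":[{\"_id\":\"A0y2\"},{\"_id\":\"B1z3\"}]}");
        pages.put(DETAIL_PREFIX + "A0y2", "{\"data\":{\"title\":\"First\",\"content\":\"x\"}}");
        pages.put(DETAIL_PREFIX + "B1z3", "{\"data\":{\"title\":\"Second\",\"content\":\"y\"}}");
        System.out.println(
                crawlTitles("http://angularjs.cn/api/article/latest?p=1&s=20", pages::get));
    }
}
```

Passing the download step in as a function keeps the pattern itself visible: the branch on the URL decides whether a response is auxiliary data or target data, which is the same decision the WebMagic processor makes with LIST_URL.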

PS:

From WebMagic 0.5.0 on, JSON support is added to the chained API, so you can use:

page.getJson().jsonPath("$.name").get();

to parse the result of an AJAX request.

It also supports

page.getJson().removePadding("callback").jsonPath("$.name").get();

to parse the result of a JSONP request.
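For readers who have not met JSONP before: the response is ordinary JSON wrapped in a function call, and removing the padding means stripping that wrapper. The following JDK-only sketch shows the idea. It is an illustration, not WebMagic's implementation, and the class and method names are made up.

```java
/** Illustrates what "removing the padding" of a JSONP response means. */
final class JsonpPadding {

    /**
     * Turns callback({...}); into {...}. Text that is not wrapped by the
     * given callback name is returned unchanged (apart from trimming).
     */
    static String stripPadding(String jsonp, String callbackName) {
        String text = jsonp.trim();
        String prefix = callbackName + "(";
        if (!text.startsWith(prefix)) {
            return text;
        }
        int end = text.lastIndexOf(')');
        if (end < prefix.length()) {
            return text; // no closing parenthesis after the prefix
        }
        return text.substring(prefix.length(), end).trim();
    }

    public static void main(String[] args) {
        String jsonp = "callback({\"name\":\"webmagic\"});";
        System.out.println(stripPadding(jsonp, "callback")); // prints {"name":"webmagic"}
    }
}
```

After this step the remaining text is plain JSON and can be queried the same way as a normal AJAX result.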
