Java crawlers: when you run into asynchronously loaded data, try these two approaches!

This is the third post in the Java crawler series. The previous post, "What should a Java crawler do when a site requires login?", briefly covered ways to handle login. In this post we talk about data that is loaded asynchronously, which is another common problem for crawlers.

Many projects today separate the front end from the back end, which makes asynchronously loaded data even more common, so do not be surprised or panic when a crawler runs into it. Broadly, there are two ways to solve this kind of problem:

1. An embedded browser engine

With this approach the crawler launches a browser engine, so the page we get back has already been rendered by JavaScript and can be scraped just like a static page. Three tools are commonly used for this:

  • Selenium
  • HtmlUnit
  • PhantomJs

These tools solve the asynchronous loading problem, but they have drawbacks: they are inefficient and can be unstable.

2. Reverse analysis

What is reverse analysis? The data on a JavaScript-rendered page is fetched from the back end via Ajax, so we only need to find the corresponding Ajax request URL to get the data we need. One benefit is that the data obtained this way is in JSON format, which is easy to parse. Another is that an interface is less likely to change than a page. The approach also has two shortcomings. First, tracking down the Ajax request takes patience and skill, because you have to pick the one request you want out of a large pile of requests. Second, it is helpless against pages whose content is rendered by JavaScript itself.

Those are the two solutions for asynchronously loaded data. To deepen understanding and show how to use them in a project, I will use NetEase News as an example; its address is https://news.163.com/. We will use both of the methods above to fetch the headline news list. The NetEase News page looks like this:

Selenium: the embedded-browser approach

Selenium is a browser automation tool built for automated testing; it provides a set of APIs that interact with a real browser engine. It is used mostly in automated testing, and crawlers often use it to handle asynchronous loading. To use Selenium in a project, you need to do two things:

  • 1. Add the Selenium dependency to pom.xml:
<!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-java -->
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
</dependency>
  • 2. Download the matching driver. For example, I downloaded chromedriver from https://npm.taobao.org/mirrors/chromedriver/. After downloading, the location of the driver must be set in a Java system property. I placed the driver directly under the project directory, so my code is:
    System.getProperties().setProperty("webdriver.chrome.driver", "chromedriver.exe");

With these two steps done, we can write the Selenium code that collects the NetEase news. The code is as follows:

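import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

/**
 * Use Selenium to handle asynchronously loaded data.
 * chromedriver download: https://npm.taobao.org/mirrors/chromedriver/
 *
 * @param url the page to crawl
 */
public void selenium(String url) {
    // Tell Selenium where the chromedriver executable is located
    System.getProperties().setProperty("webdriver.chrome.driver", "chromedriver.exe");
    // Run Chrome headless so that no browser window pops up
    ChromeOptions chromeOptions = new ChromeOptions();
    chromeOptions.addArguments("--headless");

    WebDriver webDriver = new ChromeDriver(chromeOptions);
    try {
        webDriver.get(url);
        // Locate the headline news list
        List<WebElement> webElements = webDriver.findElements(By.xpath("//div[@class='news_title']/h3/a"));
        for (WebElement webElement : webElements) {
            // Extract the article link
            String articleUrl = webElement.getAttribute("href");
            // Extract the article title
            String title = webElement.getText();
            // getAttribute returns null when the attribute is missing
            if (articleUrl != null && articleUrl.contains("https://news.163.com/")) {
                System.out.println("Article title: " + title + ", article link: " + articleUrl);
            }
        }
    } finally {
        // quit() closes every window and ends the driver process, even on failure
        webDriver.quit();
    }
}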

Running the selenium method gives the following result:

Selenium correctly extracted the NetEase headline news list.
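
One caveat about the selenium method above: findElements runs as soon as get(url) returns, and a list that is filled in by a later Ajax call may not be in the page yet. Selenium ships WebDriverWait and ExpectedConditions (in org.openqa.selenium.support.ui) for this situation. The idea behind them is polling with a timeout, which the library-free sketch below illustrates. The class PollingWait and the method pollUntilNotEmpty are hypothetical names made up for this example; they are not part of Selenium.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class PollingWait {

    /**
     * Calls the lookup repeatedly until it returns a non-empty list
     * or the timeout elapses. Returns the last result, which may be empty.
     */
    public static <T> List<T> pollUntilNotEmpty(Supplier<List<T>> lookup,
                                                long timeoutMillis,
                                                long intervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<T> result = lookup.get();
        while (result.isEmpty() && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up early
                Thread.currentThread().interrupt();
                break;
            }
            result = lookup.get();
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for a page whose list only shows up on the third lookup
        int[] calls = {0};
        Supplier<List<String>> lookup = () -> ++calls[0] < 3
                ? Collections.<String>emptyList()
                : Collections.singletonList("headline");

        List<String> titles = pollUntilNotEmpty(lookup, 2000, 50);
        // prints "[headline] after 3 attempts"
        System.out.println(titles + " after " + calls[0] + " attempts");
    }
}
```

In the crawler, the lookup would be the findElements call on the news list XPath, so the query is retried until the list appears or the timeout expires.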

Reverse analysis

Reverse analysis means finding the Ajax URL that loads the data asynchronously and fetching the news data from it directly. Without a trick, finding the Ajax request is painful, because a single page loads far too many URLs. Take a look at the Network panel for NetEase News:

There are hundreds of requests. How do we find which one returns the headline data? If you do not mind the tedium, you can click through them one by one and you will certainly find it. A quicker way is the Network panel's search function; if you do not know where the search button is, I have circled it in the screenshot. Copy any headline from the headline news list and search for it to get the result, as shown below:

That quickly gives us the request URL for the headline data: https://temp.163.com/special/00804KVA/cm_yaowen.js?callback=data_callback. Open the link to view the data it returns, as shown below:

The returned data contains everything we need, so we only have to parse it. There are two ways to extract the news titles and links from it: one is regular expressions, and the other is converting the data into JSON or a list. I chose the second and use fastjson to convert the returned data into a JSONArray. That means adding the fastjson dependency to pom.xml:

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.59</version>
</dependency>

Besides adding the fastjson dependency, we need to preprocess the data before converting it, because as returned it is not a valid list: we need to remove the leading data_callback( and the trailing ). The complete code for fetching the headlines with reverse analysis is as follows:

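import java.io.IOException;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

/**
 * Use reverse analysis to handle asynchronously loaded data.
 *
 * @param url the Ajax URL that returns the news data
 */
public void httpclientMethod(String url) throws IOException {
    HttpGet httpGet = new HttpGet(url);
    // try-with-resources closes both the client and the response
    try (CloseableHttpClient httpclient = HttpClients.createDefault();
         CloseableHttpResponse response = httpclient.execute(httpGet)) {
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity entity = response.getEntity();
            String body = EntityUtils.toString(entity, "GBK");
            // Strip the leading data_callback(
            body = body.replace("data_callback(", "");
            // Strip the last closing parenthesis
            body = body.substring(0, body.lastIndexOf(")"));
            // Convert body into a JSONArray
            JSONArray jsonArray = JSON.parseArray(body);
            for (int i = 0; i < jsonArray.size(); i++) {
                JSONObject data = jsonArray.getJSONObject(i);
                System.out.println("Article title: " + data.getString("title") + ", article link: " + data.getString("docurl"));
            }
        } else {
            System.out.println("Request failed! Status code: " + response.getStatusLine().getStatusCode());
        }
    }
}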
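
The regular-expression route mentioned earlier can also be sketched without any JSON library. This is only a minimal sketch: the class name JsonpRegexSketch, its methods, and the sample data are made up for illustration, and the patterns assume that title comes before docurl inside each flat object and that neither value contains an escaped quote. For real responses a JSON parser such as fastjson is the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonpRegexSketch {

    // callbackName( payload ) with an optional trailing semicolon; group 1 is the payload
    private static final Pattern WRAPPER =
            Pattern.compile("^\\s*[\\w$.]+\\s*\\((.*)\\)\\s*;?\\s*$", Pattern.DOTALL);

    // "title":"..." followed, inside the same flat object, by "docurl":"..."
    private static final Pattern ITEM =
            Pattern.compile("\"title\"\\s*:\\s*\"([^\"]*)\"[^}]*?\"docurl\"\\s*:\\s*\"([^\"]*)\"");

    /** Removes the JSONP wrapper and returns the text between the outer parentheses. */
    public static String unwrap(String jsonp) {
        Matcher m = WRAPPER.matcher(jsonp);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a JSONP response");
        }
        return m.group(1);
    }

    /** Returns one {title, docurl} pair per match found in the payload. */
    public static List<String[]> extract(String payload) {
        List<String[]> items = new ArrayList<>();
        Matcher m = ITEM.matcher(payload);
        while (m.find()) {
            items.add(new String[] {m.group(1), m.group(2)});
        }
        return items;
    }

    public static void main(String[] args) {
        // Made-up sample in the same shape as the response described above
        String sample = "data_callback([\n"
                + "{\"title\":\"First headline\",\"docurl\":\"https://news.163.com/a.html\"},\n"
                + "{\"title\":\"Second headline\",\"docurl\":\"https://news.163.com/b.html\"}\n"
                + "])";
        for (String[] item : extract(unwrap(sample))) {
            System.out.println("Article title: " + item[0] + ", article link: " + item[1]);
        }
    }
}
```

Matching the wrapper as a whole, instead of replacing a fixed string, keeps the sketch independent of the callback name passed in the URL.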

Write a main method that calls httpclientMethod. One thing to note: the URL to pass in this time is https://temp.163.com/special/00804KVA/cm_yaowen.js?callback=data_callback, not https://news.163.com/. The result is as follows:

Both methods successfully fetched NetEase's asynchronously loaded headline news list. As for choosing between them, I personally lean toward reverse analysis, because its performance and stability are better than those of an embedded browser engine. For pages whose content is rendered by JavaScript fragments, however, the embedded browser is more reliable. So decide case by case.

I hope this article helps. The next post will be about crawlers getting their IP blocked. If you are interested in crawlers, feel free to follow along so we can learn from each other and improve together.

Source: source code

If anything in this article falls short, please point it out so we can learn and improve together.


Origin juejin.im/post/5d9d81fbf265da5bbe2a3116