A workaround for an HTTP page-crawling timeout setting that does not take effect

Today I found that when superword fetches word definitions, the page for uncommon words loads very slowly, taking more than 10 seconds. On inspection it turned out that when Jsoup is used to grab the definition, the 3-second timeout that is configured does not take effect, and a single call to the _getContent method can run for more than 10 seconds. A likely explanation is that, in the Jsoup version used, timeout() governs the connection and individual socket reads rather than the whole request, so a server that keeps trickling data can stretch one call well beyond the configured limit. The code is as follows:

    public static String getContent(String url) {
        String html = _getContent(url);
        int times = 0;
        while(StringUtils.isNotBlank(html) && html.contains("Sorry, requests from your ip are unusually frequent")){
            // use the new IP address
            ProxyIp.toNewIp();
            html = _getContent(url);
            if(++times > 2){
                break;
            }
        }
        return html;
    }

    private static String _getContent(String url) {
        Connection conn = Jsoup.connect(url)
                .header("Accept", ACCEPT)
                .header("Accept-Encoding", ENCODING)
                .header("Accept-Language", LANGUAGE)
                .header("Connection", CONNECTION)
                .header("Referer", REFERER)
                .header("Host", HOST)
                .header("User-Agent", USER_AGENT)
                .timeout(3000)
                .ignoreContentType(true);
        String html = "";
        try {
            // in practice this call can block far longer than the 3-second timeout set above
            html = conn.post().html();
            html = html.replaceAll("[\n\r]", "");
        } catch (Exception e){
            LOGGER.error("Get URL: " + url + " page error", e);
        }
        return html;
    }

So I came up with a workaround. The core idea is that the main thread starts a sub-thread to fetch the word definition and then sleeps for the specified timeout; when the timeout elapses, it collects the result from the sub-thread, and if the fetch has not finished by that point, the main thread simply returns an empty definition. The code is as follows:

    public static String getContent(String url) {
        long start = System.currentTimeMillis();
        String html = _getContent(url, 1000);
        LOGGER.info("Time taken to fetch the page: {}", TimeUtils.getTimeDes(System.currentTimeMillis()-start));
        int times = 0;
        while(StringUtils.isNotBlank(html) && html.contains("Sorry, requests from your ip are unusually frequent")){
            // use the new IP address
            ProxyIp.toNewIp();
            html = _getContent(url);
            if(++times > 2){
                break;
            }
        }
        return html;
    }

    private static String _getContent(String url, int timeout) {
        Future<String> future = ThreadPool.EXECUTOR_SERVICE.submit(()->_getContent(url));
        try {
            // sleep for the full timeout, then poll the future almost immediately:
            // if the fetch has not finished yet, get() throws a TimeoutException
            // and the empty definition below is returned instead
            Thread.sleep(timeout);
            return future.get(1, TimeUnit.NANOSECONDS);
        } catch (Throwable e) {
            LOGGER.error("Get web page exception", e);
        }
        return "";
    }

    private static String _getContent(String url) {
        Connection conn = Jsoup.connect(url)
                .header("Accept", ACCEPT)
                .header("Accept-Encoding", ENCODING)
                .header("Accept-Language", LANGUAGE)
                .header("Connection", CONNECTION)
                .header("Referer", REFERER)
                .header("Host", HOST)
                .header("User-Agent", USER_AGENT)
                .timeout(1000)
                .ignoreContentType(true);
        String html = "";
        try {
            html = conn.post().html();
            html = html.replaceAll("[\n\r]", "");
        } catch (Exception e){
            LOGGER.error("Get URL: " + url + " page error", e);
        }
        return html;
    }

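The wrapper above submits the fetch to a ThreadPool.EXECUTOR_SERVICE that the post itself does not show. For readers who want to reproduce the snippet, a minimal sketch of such a helper might look like the following; the cached pool and the daemon-thread factory are assumptions of mine, not necessarily how superword defines it:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ThreadPool {

        // shared executor that runs page fetches off the calling thread;
        // daemon threads let the JVM exit even if a hung fetch never returns
        public static final ExecutorService EXECUTOR_SERVICE =
                Executors.newCachedThreadPool(runnable -> {
                    Thread thread = new Thread(runnable, "page-fetcher");
                    thread.setDaemon(true);
                    return thread;
                });

        private ThreadPool() {}
    }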

The complete change is in this commit:

https://github.com/ysc/superword/commit/e4bc3c4197af95a8d7519856c89d592515a1c18f

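One design note on the wrapper: sleeping for the full timeout and then polling the future with a 1-nanosecond get() always costs the whole timeout, even when the fetch finishes early. The same idea can be written as a single blocking get(timeout, TimeUnit.MILLISECONDS), which returns as soon as the result is ready and lets the still-running task be cancelled on timeout. This is only an alternative sketch, not what the commit above does; it also assumes java.util.concurrent.TimeoutException is imported alongside Future and TimeUnit:

    private static String _getContent(String url, int timeout) {
        Future<String> future = ThreadPool.EXECUTOR_SERVICE.submit(() -> _getContent(url));
        try {
            // wait at most `timeout` milliseconds, but return immediately once the fetch completes
            return future.get(timeout, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // best-effort cancellation of the still-running fetch
            future.cancel(true);
            LOGGER.error("Get URL: " + url + " timed out after " + timeout + " ms", e);
        } catch (Throwable e) {
            LOGGER.error("Get web page exception", e);
        }
        return "";
    }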