Java HttpClient 4.x: unable to determine file size before fetching

HttpClient 4.x page-fetching code:


InputStream is = null;
HttpGet httpGet = null;
try {
    URL url = new URL(URL);
    URI uri = new URI(url.getProtocol(), url.getHost(), url.getPath(), url.getQuery(), url.getRef());
    httpGet = new HttpGet(uri);
    // set connect/socket timeouts and allow redirects
    RequestConfig requestConfig = RequestConfig.custom()
            .setSocketTimeout(5000)
            .setConnectTimeout(2000)
            .setRedirectsEnabled(true)
            .build();
    httpGet.setConfig(requestConfig);

    HttpClientContext context = HttpClientContext.create();

    HttpResponse response = httpClient.execute(httpGet, context);

    // refuse files larger than 1 MB (only works when the server sends Content-Length)
    long len = response.getEntity().getContentLength();
    if (len > 1024 * 1024) {
        return false;
    }

    // collect all redirect locations
    List<URI> redirectLocations = context.getRedirectLocations();
    int responseCode = response.getStatusLine().getStatusCode();

    if (responseCode == 200) {
        if (redirectLocations != null && !redirectLocations.isEmpty()) {
            URL = redirectLocations.get(redirectLocations.size() - 1).toString();
        }

        // filter out non-HTML responses, e.g. JSON or XML
        if (!response.getEntity().getContentType().toString().contains("text/html")) {
            return false;
        }

        is = response.getEntity().getContent();
        BufferedReader br = new BufferedReader(new InputStreamReader(is, charset));

        String line = null;
        StringBuilder content = new StringBuilder();
        while ((line = br.readLine()) != null) {
            content.append(line).append("\n");
        }

        // filter out non-HTML content, e.g. JSON or XML
        if (!content.toString().contains("html")) {
            return false;
        }

        ......

        return true;
    }
    return false;

} catch (Exception e) {
    e.printStackTrace();
} finally {
    try {
        if (httpGet != null) {
            httpGet.releaseConnection();
        }
        if (is != null) {
            is.close();
        }
    } catch (Exception ignore) {
    }
}
return null;


----------------------------------------------------------------------------


The response headers are set by the server, so relying on Content-Length cannot fully solve
the problem of refusing to fetch overly large files.

What is needed is an approach along these lines:
1. If the header cannot be read, download the page and judge its size while downloading; if the read times out, assume the file is too large and stop the download (a sketch of this idea follows below).
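
A minimal in-process sketch of that "judge while downloading" idea, using the same org.apache.http classes as above. The CloseableHttpClient parameter, the method name and the 1 MB budget are illustrative assumptions, not part of the original code:

// Sketch only: fetchWithSizeLimit and the 1 MB budget are assumptions for illustration.
static boolean fetchWithSizeLimit(CloseableHttpClient httpClient, String url) throws IOException {
    final long maxBytes = 1024 * 1024; // byte budget: 1 MB
    HttpGet get = new HttpGet(url);
    try (CloseableHttpResponse response = httpClient.execute(get)) {
        InputStream in = response.getEntity().getContent();
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
            if (total > maxBytes) {
                get.abort();      // stop reading; discard the connection instead of draining it
                return false;     // treat as "file too large"
            }
            // ... otherwise accumulate buf[0..n) as page content ...
        }
        return true;
    }
}

This works even when the server sends no Content-Length, but it still ties up a thread for the duration of the download, which is where the external-tool approach below comes in.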

So then:

The idea was to have Java invoke the wget or curl command:


String cmd = "wget -v --output-document="+CrawlerConstants.Tmp_Dir+"/wget.txt --no-check-certificate --tries=3 "
+ url;*
//cmd = "Wget_SingleDown_run.sh"
// 执行命令
p = Runtime.getRuntime().exec(cmd);

InputStream stderr = p.getErrorStream();
InputStreamReader isr = new InputStreamReader(stderr);
BufferedReader br1 = new BufferedReader(isr);
String line = null;
String link = url;
boolean is_text_html = false;
//此处必须读取流,不然会阻塞
while ( (line = br1.readLine()) != null){
                                        //这里会输出wget下载进度、Location跳转等header信息
System.out.println(line);

}

p.waitFor();
p.destroy();
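
The fragment declares link and is_text_html but the code that fills them in is elided; presumably the stderr lines are scanned for the redirect target and the content type. A rough sketch of such scanning inside the read loop; wget's exact output format depends on its version and locale, so the "Location:" and "text/html" markers below are assumptions:

// Hypothetical parsing of wget's stderr; the "Location:" prefix and the
// "text/html" marker vary with wget version and locale, so treat them as assumptions.
while ((line = br1.readLine()) != null) {
    System.out.println(line);
    if (line.startsWith("Location:")) {
        // e.g. "Location: http://example.com/page [following]"
        link = line.substring("Location:".length()).trim().split("\\s+")[0];
    }
    if (line.contains("text/html")) {
        // e.g. "Length: 6509 (6.4K) [text/html]"
        is_text_html = true;
    }
}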


---------------------------------------------------------------------

Continuing with the code:

Wrap the wget command in a shell script and do the timeout handling inside the script;
there are two script files:

Wget_SingleDown_run.sh:


#!/bin/bash

## usage: ./singleDown killapp dirfile url
if [ $# -lt 3 ]
then
    echo 'usage: ./singleDown killapp dirfile url'
    exit 1
fi
## number of retries
retryTime=2
## idle/read timeout (s)
idleTimeOut=3
## overall download time limit (s)
downTimeOut=5
## download rate limit
##limitRate=128k
## interval between retries (s)
##waitRetry=1

url=`echo "$3" | sed "s/ /%20/g"`

## run wget in the background so $! below captures its PID
wget --no-check-certificate -t $retryTime -T $idleTimeOut -O $2 $url &

downPid=$!
echo $downPid
## start the watchdog: kill the download if it runs longer than $downTimeOut seconds
$1 $downTimeOut $downPid $2 >>/dev/null  2>&1 &

clockPid=$!

wait $downPid

## if the watchdog is still alive, the download finished in time; stop the watchdog
ps $clockPid
if [ $? -eq 0 ]
then
    kill -9 $clockPid
fi
exit 1


Wget_SleepAndKill_run.sh:

#!/bin/bash

## usage: timeOut(s) pidToKill fileDir

if [ $# -lt 3 ]
then
        echo 'usage: timeOut(s) pidToKill fileDir'
        exit 1
fi

## sleep for the timeout; if the download is still running, kill it
## and truncate the partially downloaded file
sleep $1
ps $2
if [ $? -eq 0 ]
then
    kill -9 $2
    cat /dev/null > $3
fi


----------------------------------------------------------------------------

Without Wget_SleepAndKill_run the program does not finish promptly; it keeps waiting and only exits after wget's own timeout.

Note the >>/dev/null  2>&1 & usage: the watchdog's stdout is appended to /dev/null, 2>&1 sends its stderr to the same place, and the trailing & runs it in the background so the script can go on to wait for the download.
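
For completeness, here is roughly how the wrapper script can be invoked from the Java side, following the "usage: ./singleDown killapp dirfile url" convention above; the script and output-file paths are illustrative assumptions, and exception handling is omitted:

// Paths are illustrative assumptions; the argument order matches
// "usage: ./singleDown killapp dirfile url" from Wget_SingleDown_run.sh.
ProcessBuilder pb = new ProcessBuilder(
        "/opt/crawler/Wget_SingleDown_run.sh",
        "/opt/crawler/Wget_SleepAndKill_run.sh",
        "/tmp/wget.txt",
        url);
pb.redirectErrorStream(true);            // merge wget's stderr into stdout
Process p = pb.start();

// consume the output so the child cannot block on a full pipe
try (BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);        // PID echoed by the script plus wget progress
    }
}
p.waitFor();

With this arrangement the Java side no longer needs its own size or timeout logic: the watchdog script kills wget and truncates the partial file, and the Java process simply waits for the wrapper script to exit.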



Reposted from zlr.iteye.com/blog/2296051