Java HttpClient 4.x: unable to determine file size before fetching

HttpClient 4.x page-fetching code:


InputStream is = null;
HttpGet httpGet = null;
try {
    URL url = new URL(URL);
    URI uri = new URI(url.getProtocol(), url.getHost(), url.getPath(), url.getQuery(), url.getRef());
    httpGet = new HttpGet(uri);
    // set connect/socket timeouts and allow redirects
    RequestConfig requestConfig = RequestConfig.custom()
            .setSocketTimeout(5000)
            .setConnectTimeout(2000)
            .setRedirectsEnabled(true)
            .build();
    httpGet.setConfig(requestConfig);

    HttpClientContext context = HttpClientContext.create();

    HttpResponse response = httpClient.execute(httpGet, context);

    // refuse files larger than 1 MB (only works when the server sends Content-Length)
    long len = response.getEntity().getContentLength();
    if (len > 1024 * 1024) {
        return false;
    }

    // collect all redirect locations
    List<URI> redirectLocations = context.getRedirectLocations();
    int responseCode = response.getStatusLine().getStatusCode();

    if (responseCode == 200) {
        if (redirectLocations != null && !redirectLocations.isEmpty()) {
            URL = redirectLocations.get(redirectLocations.size() - 1).toString();
        }

        // filter out non-HTML responses, e.g. JSON or XML
        if (!response.getEntity().getContentType().toString().contains("text/html")) {
            return false;
        }

        is = response.getEntity().getContent();
        BufferedReader br = new BufferedReader(new InputStreamReader(is, charset));

        String line = null;
        StringBuilder content = new StringBuilder();
        while ((line = br.readLine()) != null) {
            content.append(line).append("\n");
        }

        // filter out non-HTML content, e.g. JSON or XML
        if (!content.toString().contains("html")) {
            return false;
        }

        ......

        return true;
    }
    return false;

} catch (Exception e) {
    e.printStackTrace();
} finally {
    try {
        if (httpGet != null) {
            httpGet.releaseConnection();
        }
        if (is != null) {
            is.close();
        }
    } catch (Exception ignore) {
    }
}
return null;


----------------------------------------------------------------------------


The response headers are set by the server, so relying on Content-Length cannot fully solve
the problem of refusing to fetch overly large files.

What is needed is an approach along these lines:
1. If the header cannot be read, download the page and judge its size while downloading; if the read times out, assume the file is too large and stop the download (a sketch of this idea follows below).
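
A minimal in-process sketch of that "judge while downloading" idea, using the same org.apache.http classes as above. The CloseableHttpClient parameter, the method name and the 1 MB budget are illustrative assumptions, not part of the original code:

// Sketch only: fetchWithSizeLimit and the 1 MB budget are assumptions for illustration.
static boolean fetchWithSizeLimit(CloseableHttpClient httpClient, String url) throws IOException {
    final long maxBytes = 1024 * 1024; // byte budget: 1 MB
    HttpGet get = new HttpGet(url);
    try (CloseableHttpResponse response = httpClient.execute(get)) {
        InputStream in = response.getEntity().getContent();
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
            if (total > maxBytes) {
                get.abort();      // stop reading; discard the connection instead of draining it
                return false;     // treat as "file too large"
            }
            // ... otherwise accumulate buf[0..n) as page content ...
        }
        return true;
    }
}

This works even when the server sends no Content-Length, but it still ties up a thread for the duration of the download, which is where the external-tool approach below comes in.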

So then:

The idea was to have Java invoke the wget or curl command:


String cmd = "wget -v --output-document="+CrawlerConstants.Tmp_Dir+"/wget.txt --no-check-certificate --tries=3 "
+ url;*
//cmd = "Wget_SingleDown_run.sh"
// 执行命令
p = Runtime.getRuntime().exec(cmd);

InputStream stderr = p.getErrorStream();
InputStreamReader isr = new InputStreamReader(stderr);
BufferedReader br1 = new BufferedReader(isr);
String line = null;
String link = url;
boolean is_text_html = false;
//此处必须读取流,不然会阻塞
while ( (line = br1.readLine()) != null){
                                        //这里会输出wget下载进度、Location跳转等header信息
System.out.println(line);

}

p.waitFor();
p.destroy();
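
The fragment declares link and is_text_html but the code that fills them in is elided; presumably the stderr lines are scanned for the redirect target and the content type. A rough sketch of such scanning inside the read loop; wget's exact output format depends on its version and locale, so the "Location:" and "text/html" markers below are assumptions:

// Hypothetical parsing of wget's stderr; the "Location:" prefix and the
// "text/html" marker vary with wget version and locale, so treat them as assumptions.
while ((line = br1.readLine()) != null) {
    System.out.println(line);
    if (line.startsWith("Location:")) {
        // e.g. "Location: http://example.com/page [following]"
        link = line.substring("Location:".length()).trim().split("\\s+")[0];
    }
    if (line.contains("text/html")) {
        // e.g. "Length: 6509 (6.4K) [text/html]"
        is_text_html = true;
    }
}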


---------------------------------------------------------------------

Continuing with the code:

Wrap the wget command in a shell script and do the timeout handling inside the script;
there are two script files:

Wget_SingleDown_run.sh:


#!/bin/bash

## usage: ./singleDown killapp dirfile url
if [ $# -lt 3 ]
then
    echo 'usage: ./singleDown killapp dirfile url'
    exit 1
fi
## number of retries
retryTime=2
## idle/read timeout (s)
idleTimeOut=3
## overall download time limit (s)
downTimeOut=5
## download rate limit
##limitRate=128k
## interval between retries (s)
##waitRetry=1

url=`echo "$3" | sed "s/ /%20/g"`

## run wget in the background so $! below captures its PID
wget --no-check-certificate -t $retryTime -T $idleTimeOut -O $2 $url &

downPid=$!
echo $downPid
## start the watchdog: kill the download if it runs longer than $downTimeOut seconds
$1 $downTimeOut $downPid $2 >>/dev/null  2>&1 &

clockPid=$!

wait $downPid

## if the watchdog is still alive, the download finished in time; stop the watchdog
ps $clockPid
if [ $? -eq 0 ]
then
    kill -9 $clockPid
fi
exit 1


Wget_SleepAndKill_run.sh:

#!/bin/bash

## usage: timeOut(s) pidToKill fileDir

if [ $# -lt 3 ]
then
        echo 'usage: timeOut(s) pidToKill fileDir'
        exit 1
fi

## sleep for the timeout; if the download is still running, kill it
## and truncate the partially downloaded file
sleep $1
ps $2
if [ $? -eq 0 ]
then
    kill -9 $2
    cat /dev/null > $3
fi


----------------------------------------------------------------------------

Without Wget_SleepAndKill_run the program does not finish promptly; it keeps waiting and only exits after wget's own timeout.

Note the >>/dev/null  2>&1 & usage: the watchdog's stdout is appended to /dev/null, 2>&1 sends its stderr to the same place, and the trailing & runs it in the background so the script can go on to wait for the download.
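
For completeness, here is roughly how the wrapper script can be invoked from the Java side, following the "usage: ./singleDown killapp dirfile url" convention above; the script and output-file paths are illustrative assumptions, and exception handling is omitted:

// Paths are illustrative assumptions; the argument order matches
// "usage: ./singleDown killapp dirfile url" from Wget_SingleDown_run.sh.
ProcessBuilder pb = new ProcessBuilder(
        "/opt/crawler/Wget_SingleDown_run.sh",
        "/opt/crawler/Wget_SleepAndKill_run.sh",
        "/tmp/wget.txt",
        url);
pb.redirectErrorStream(true);            // merge wget's stderr into stdout
Process p = pb.start();

// consume the output so the child cannot block on a full pipe
try (BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);        // PID echoed by the script plus wget progress
    }
}
p.waitFor();

With this arrangement the Java side no longer needs its own size or timeout logic: the watchdog script kills wget and truncates the partial file, and the Java process simply waits for the wrapper script to exit.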



Reposted from zlr.iteye.com/blog/2296051