Crawling web page content with Java

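Recently, for a research project, I wrote a simple crawler in Java. Along the way I used the HttpClient package (Apache Commons HttpClient 3.x) to issue a GET request and retrieve the response body returned by the page. This part of the code came from the internet; I summarize it briefly here and paste the code below.
Downloading a given web page with HttpClient takes the following steps: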

Create the HttpClient object.
Set the required parameters (timeouts).
Create a GetMethod for the URL and execute it.
Read the response body and save it.

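import java.io.IOException;
import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public static String downloadFile(String fileName, String URL) {
        // Download the page that URL points to.
        // Returns the path of the saved file, or null if the download failed.
        String filePath = null;
        /* 1. Create the HttpClient object and set its parameters */
        HttpClient httpClient = new HttpClient();
        // HTTP connection timeout: 5 s
        httpClient.getHttpConnectionManager().getParams().
                setConnectionTimeout(5000);
        /* 2. Create the GetMethod object and set its parameters */
        GetMethod getMethod = new GetMethod(URL);
        // Socket read timeout for the GET request: 5 s
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Retry handler for the request
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());

        /* 3. Execute the HTTP GET request */
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code; do not save anything on a non-200 response
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
                return null; // the finally block still releases the connection
            }
            /* 4. Process the HTTP response body */
            byte[] responseBody = getMethod.getResponseBody(); // read as a byte array
            // Build the local path from the caller-supplied file name
            // (Windows-style separators, relative to the working directory)
            String localPath = System.getProperty("user.dir");
            String target = localPath + "\\src\\peculiar\\temp\\" + fileName + ".html";
            // saveToLocal is a helper that is not included in this post
            saveToLocal(responseBody, target);
            // Only report the path once the file has actually been written
            filePath = target;
        } catch (HttpException e) {
            // Fatal error: the protocol may be wrong or the response may be malformed
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // Network (or file I/O) error
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
        return filePath;
}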
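
The code above calls saveToLocal, but that helper is not shown in this post. Below is a minimal sketch of what it might look like, assuming it only needs to write the bytes to the given path and create any missing parent directories. The class name LocalSaver and the choice to throw IOException are my own assumptions, not the original author's code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical holder class; the original post does not say where saveToLocal lives.
final class LocalSaver {

    /**
     * Assumed behavior of the saveToLocal helper: write the response bytes
     * to filePath, creating missing parent directories first.
     * An existing file at that path is overwritten.
     */
    static void saveToLocal(byte[] data, String filePath) throws IOException {
        Path target = Paths.get(filePath);
        Path parent = target.getParent();
        if (parent != null) {
            // No-op if the directory already exists
            Files.createDirectories(parent);
        }
        // Default options: create the file if needed, truncate it otherwise
        Files.write(target, data);
    }
}
```

Because this version throws IOException, a failure to write the file would land in the existing catch (IOException e) branch of downloadFile instead of being silently ignored.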

Thanks to everyone who shared these resources.
