Java web crawler 01: fetching web content with HttpClient


A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Browsers access web pages over the HTTP protocol, and a crawler does the same thing from code. This article uses HttpClient, an HTTP client library for Java, to fetch web page data.

Environment setup

Add the Maven dependencies

    <!-- HttpClient -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.3</version>
    </dependency>

    <!-- Logging -->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>1.7.25</version>
    </dependency>

Add the log configuration file (log4j.properties, placed on the classpath, for example under src/main/resources)

log4j.rootLogger=DEBUG,A1
log4j.logger.cn.itcast = DEBUG

log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n

HTTP GET request

The code is as follows:

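import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public static void getTest() throws Exception {
    HttpGet httpGet = new HttpGet("http://www.itcast.cn?pava=zhangxm");
    // try-with-resources closes the response and the client, even on failure
    try (CloseableHttpClient httpClient = HttpClients.createDefault();
         CloseableHttpResponse response = httpClient.execute(httpGet)) {
        // 200 means the request succeeded, so read the body as UTF-8 text
        if (response.getStatusLine().getStatusCode() == 200) {
            String content = EntityUtils.toString(response.getEntity(), "UTF-8");
            System.out.println(content);
        }
    }
}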
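
The GET example passes its parameter (pava=zhangxm) directly in the URL. When a value contains spaces or characters such as &, it has to be encoded first. The following is a minimal sketch using only the JDK; the class QueryStrings and its methods are hypothetical helpers written for this illustration, not part of HttpClient, and the second parameter (q) is made up for the demo. HttpClient also ships a URIBuilder class that can be used for the same purpose.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of HttpClient.
class QueryStrings {

    // Encodes one parameter name or value as UTF-8 form data.
    public static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is supported by every JVM, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Appends the parameters as "?name=value&name2=value2".
    // Assumes baseUrl does not already contain a query string.
    public static String withQuery(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            url.append(separator)
               .append(encode(entry.getKey()))
               .append('=')
               .append(encode(entry.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("pava", "zhangxm");
        params.put("q", "web crawler & java");
        // prints http://www.itcast.cn/?pava=zhangxm&q=web+crawler+%26+java
        System.out.println(withQuery("http://www.itcast.cn/", params));
    }
}
```

The resulting string can be passed to the HttpGet constructor in place of a hand-written URL.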

HTTP POST request

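import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.http.NameValuePair;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

/**
 * Sends a POST request with form parameters from Java code.
 * @throws Exception
 */
public static void postTest() throws Exception {
    // Create the HttpClient object
    CloseableHttpClient httpClient = HttpClients.createDefault();

    RequestConfig requestConfig = RequestConfig.custom()
            .setConnectTimeout(1000)          // max time to establish a connection (ms)
            .setConnectionRequestTimeout(500) // max time to obtain a connection from the connection manager (ms)
            .setSocketTimeout(10 * 1000)      // max time to wait for data during transfer (ms)
            .build();

    // Create the HttpPost request
    HttpPost httpPost = new HttpPost("http://www.itcast.cn/");
    httpPost.setConfig(requestConfig);
    CloseableHttpResponse response = null;
    try {
        // List that holds the form parameters
        List<NameValuePair> params = new ArrayList<NameValuePair>();
        params.add(new BasicNameValuePair("pava", "zhangxm"));

        // Create the form data entity
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "UTF-8");

        // Attach the form entity to the httpPost request
        httpPost.setEntity(formEntity);
        // Send the request with HttpClient
        response = httpClient.execute(httpPost);
        // Check whether the response status code is 200
        if (response.getStatusLine().getStatusCode() == 200) {
            // 200 means the request succeeded, so read the returned data
            String content = EntityUtils.toString(response.getEntity(), "UTF-8");
            // Print the response body
            System.out.println(content);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Release the connection; the response is null if execute() failed
        if (response != null) {
            try {
                response.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // Always close the client, whether or not a response was received
        httpClient.close();
    }
}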
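
CloseableHttpClient and CloseableHttpResponse both implement Closeable, so the cleanup done by hand in a finally block can also be left to try-with-resources, which removes the need for a null check. The sketch below shows the closing behavior with a stand-in resource; the class CloseOrderDemo and its Tracked type are hypothetical and exist only for this illustration, and no real HTTP request is made.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; Tracked stands in for the client and the response.
class CloseOrderDemo {

    // A closeable resource that records when it is opened and closed.
    static class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    // Opens a "client" and a "response", optionally fails mid-request,
    // and returns the sequence of events that took place.
    public static List<String> run(boolean fail) {
        List<String> log = new ArrayList<>();
        try (Tracked client = new Tracked("client", log);
             Tracked response = new Tracked("response", log)) {
            if (fail) {
                throw new IllegalStateException("simulated I/O failure");
            }
            log.add("read body");
        } catch (IllegalStateException e) {
            // Both resources are already closed when this block runs.
            log.add("caught " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        // prints [open client, open response, read body, close response, close client]
        System.out.println(run(false));
        // prints [open client, open response, close response, close client, caught simulated I/O failure]
        System.out.println(run(true));
    }
}
```

Resources are closed in the reverse order of their declaration, so the response is released before the client, on both the success path and the failure path.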

HttpClient connection pool

Creating a new HttpClient for every request means that connections are repeatedly created and destroyed. A connection pool avoids this overhead by letting requests reuse existing connections.

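import java.io.IOException;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

/**
 * Uses an HttpClient connection pool so that connections are reused
 * instead of being created and destroyed for every request.
 * @throws IOException
 */
public static void connPoolTest() throws IOException {
    PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
    // Maximum number of connections in the whole pool
    cm.setMaxTotal(200);
    // Maximum number of concurrent connections per host (route)
    cm.setDefaultMaxPerRoute(20);
    for (int i = 0; i < 10; i++) {
        // Mark the manager as shared; otherwise closing a client also shuts down the pool
        CloseableHttpClient httpClient = HttpClients.custom()
                .setConnectionManager(cm)
                .setConnectionManagerShared(true)
                .build();
        System.out.println("httpClient:" + httpClient);
        httpClient.close();
    }
    // Shut the pool down once no client needs it any more
    cm.close();
}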
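
The two limits configured on the pool interact: a connection can be leased only while both the total limit and the per-host limit have room. The following is a simplified model of that rule, not the library's implementation; the class PoolLimitModel and its methods are hypothetical, and the small limits (3 and 2) are chosen only to keep the demo short. One deliberate simplification: the model refuses a lease immediately, whereas the real connection manager makes the caller wait, up to the connection request timeout, for a connection to be released.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the pool limits; not HttpClient code.
class PoolLimitModel {
    private final int maxTotal;
    private final int maxPerRoute;
    private final Map<String, Integer> leasedPerRoute = new HashMap<>();
    private int leasedTotal = 0;

    public PoolLimitModel(int maxTotal, int maxPerRoute) {
        this.maxTotal = maxTotal;
        this.maxPerRoute = maxPerRoute;
    }

    // Leases one connection for the route (host) if both limits allow it.
    public synchronized boolean tryLease(String route) {
        int current = leasedPerRoute.getOrDefault(route, 0);
        if (leasedTotal >= maxTotal || current >= maxPerRoute) {
            return false;
        }
        leasedPerRoute.put(route, current + 1);
        leasedTotal++;
        return true;
    }

    // Returns one connection of the route to the pool, if any is leased.
    public synchronized void release(String route) {
        int current = leasedPerRoute.getOrDefault(route, 0);
        if (current == 0) {
            return;
        }
        leasedPerRoute.put(route, current - 1);
        leasedTotal--;
    }

    // Leases connections for the route until the pool refuses; returns the count.
    public static int leaseUntilRefused(PoolLimitModel pool, String route) {
        int count = 0;
        while (pool.tryLease(route)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        PoolLimitModel pool = new PoolLimitModel(3, 2);
        // Per-route limit stops the first host at 2 connections.
        System.out.println(leaseUntilRefused(pool, "www.itcast.cn")); // prints 2
        // Total limit leaves only 1 connection for the second host.
        System.out.println(leaseUntilRefused(pool, "blog.csdn.net")); // prints 1
        // Releasing a connection frees room under the total limit.
        pool.release("www.itcast.cn");
        System.out.println(pool.tryLease("blog.csdn.net")); // prints true
    }
}
```

With the values used above (200 total, 20 per host), a crawler that targets a single site can hold at most 20 connections at a time, however large the total is.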

Origin blog.csdn.net/zhangxm_qz/article/details/109443783