HttpClient (1): Crawling Basic Web Page Information with HttpClient

https://www.cnblogs.com/zhangyinhua/p/8038377.html
1. Introduction to HttpClient
  HttpClient is a subproject of Apache Jakarta Commons. It is an efficient, up-to-date, feature-rich client-side programming toolkit for the HTTP protocol, and it supports the latest versions and recommendations of the HTTP protocol.

Official site: http://hc.apache.org/

Latest version (4.5): http://hc.apache.org/httpcomponents-client-4.5.x/

Official documentation: http://hc.apache.org/httpcomponents-client-4.5.x/tutorial/html/index.html

Maven address:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

HTTP is probably the most widely used and most important protocol on the Internet today, and a growing number of Java applications need to access network resources directly through the HTTP protocol. Although the JDK's java.net package

already provides basic functionality for accessing the HTTP protocol, for most applications the functionality provided by the JDK library itself is not rich and flexible enough. HttpClient is a subproject under Apache Jakarta Commons

that provides an efficient, up-to-date, feature-rich client-side toolkit for HTTP programming, and it supports the latest versions and recommendations of the HTTP protocol. HttpClient has been used in many projects;

for example, two other very well-known Apache Jakarta open source projects, Cactus and HtmlUnit, both use HttpClient. The latest release is HttpClient 4.5 (GA) (2015-09-11).

Summary: When writing crawlers, we mostly use HttpClient to simulate a browser and request a third-party site's URL, take the response, and obtain the page data, then use Jsoup to extract the information we need.
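For a concrete picture of that workflow, here is a minimal sketch combining the two libraries. It assumes the jsoup dependency (org.jsoup:jsoup) is also on the classpath; the class name and the link-extraction selector are illustrative assumptions, not part of the original example:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class CrawlAndExtract {
    public static void main(String[] args) throws IOException {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("http://www.cnblogs.com");
        CloseableHttpResponse response = httpClient.execute(httpGet);
        if (response != null) {
            // step 1: fetch the raw HTML with HttpClient
            String html = EntityUtils.toString(response.getEntity(), "UTF-8");
            // step 2: hand the HTML to Jsoup and extract what we need
            Document doc = Jsoup.parse(html);
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.text() + " -> " + link.attr("href"));
            }
            response.close();
        }
        httpClient.close();
    }
}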

2. Using HttpClient to get web page content
  Here we crawl the source of the cnblogs home page.

package com.jxlg.study.httpclient;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class GetWebPageContent {
    /**
     * Fetch web page information with a GET request
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // create an HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // create an HttpGet instance
        HttpGet httpGet = new HttpGet("http://www.cnblogs.com");
        CloseableHttpResponse response = httpClient.execute(httpGet);
        if (response != null) {
            HttpEntity entity = response.getEntity(); // get the page content
            String result = EntityUtils.toString(entity, "UTF-8");
            System.out.println("Web content: " + result);
        }
        if (response != null) {
            response.close();
        }
        if (httpClient != null) {
            httpClient.close();
        }
    }
}
  The code above fetches the page content directly. Some pages come back with garbled Chinese characters; in that case the charset must be chosen according to the encoding the page declares, such as gb2312.
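One way to handle this (a sketch, not the article's code) is to read the charset the server declares in the Content-Type header and fall back to a fixed encoding such as gb2312 only when none is declared; the helper class name and the fallback choice are assumptions:

import org.apache.http.HttpEntity;
import org.apache.http.entity.ContentType;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.nio.charset.Charset;

public class CharsetAwareReader {
    // Prefer the charset declared in the response's Content-Type header;
    // fall back to gb2312 when the page declares none (the fallback is an assumption)
    public static String entityToString(HttpEntity entity) throws IOException {
        Charset charset = ContentType.getOrDefault(entity).getCharset();
        if (charset == null) {
            charset = Charset.forName("gb2312");
        }
        return EntityUtils.toString(entity, charset);
    }
}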

3. Simulating a browser to crawl pages
3.1 Setting the User-Agent request header to simulate a browser
  When we use the code written above to fetch the page source of Tuicool ( http://www.tuicool.com ), it returns the following:

Web content:

The system has detected that you are not a real person; due to system resource limitations, we can only refuse your request. If you have questions, you can contact us via Weibo at http://weibo.com/tuicool2012/.

This happens because the site restricts crawling by others. The workaround is to set a User-Agent request header to simulate a browser. The code is as follows:

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class GetWebPageContent {
    /**
     * Fetch web page information with a GET request
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // create an HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // create an HttpGet instance
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");
        CloseableHttpResponse response = httpClient.execute(httpGet);
        if (response != null) {
            HttpEntity entity = response.getEntity(); // get the page content
            String result = EntityUtils.toString(entity, "UTF-8");
            System.out.println("Web content: " + result);
        }
        if (response != null) {
            response.close();
        }
        if (httpClient != null) {
            httpClient.close();
        }
    }
}
  Setting this header on the HttpGet instance is enough to simulate a browser.

3.2 Getting the response Content-Type
  Use entity.getContentType().getValue() to get the Content-Type. The code is as follows:

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.IOException;

public class GetWebPageContent {
    /**
     * Fetch web page information with a GET request
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // create an HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // create an HttpGet instance
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");
        CloseableHttpResponse response = httpClient.execute(httpGet);
        if (response != null) {
            HttpEntity entity = response.getEntity(); // get the page content
            System.out.println("Content-Type: " + entity.getContentType().getValue()); // get the Content-Type
        }
        if (response != null) {
            response.close();
        }
        if (httpClient != null) {
            httpClient.close();
        }
    }
}
  Result:

A normal page is text/html; some also carry the encoding. For example, requesting www.tuicool.com outputs:

Content-Type:text/html; charset=utf-8

If we request a js file, such as http://www.open1111.com/static/js/jQuery.js , the output is:

Content-Type:application/javascript

If we request a binary file, such as http://central.maven.org/maven2/HTTPClient/HTTPClient/0.3-3/HTTPClient-0.3-3.jar , the output is:

Content-Type:application/java-archive

Of course there are many more Content-Type values. What does this mean for our crawler? When we crawl pages, we can use the Content-Type to pick out the pages we need to crawl and filter the rest out.
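As a minimal sketch of that filtering idea (the helper class and the text/html whitelist are illustrative assumptions):

import org.apache.http.HttpEntity;
import org.apache.http.entity.ContentType;

public class ContentTypeFilter {
    // keep only responses whose MIME type is HTML; skip js, images, jars, etc.
    public static boolean isCrawlable(HttpEntity entity) {
        String mimeType = ContentType.getOrDefault(entity).getMimeType();
        return "text/html".equalsIgnoreCase(mimeType);
    }
}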

3.3 Getting the response status code
  Use response.getStatusLine().getStatusCode() to get the response status code. The code is as follows:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.IOException;

public class GetWebPageContent {
    /**
     * Fetch web page information with a GET request
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // create an HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // create an HttpGet instance
        HttpGet httpGet = new HttpGet("http://www.tuicool.com");
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");
        CloseableHttpResponse response = httpClient.execute(httpGet);
        if (response != null) {
            int statusCode = response.getStatusLine().getStatusCode();
            System.out.println("Response status: " + statusCode);
        }
        if (response != null) {
            response.close();
        }
        if (httpClient != null) {
            httpClient.close();
        }
    }
}
  Result:

When HttpClient sends a request to the server, a successful request normally returns a 200 status code, but not every request succeeds.

For example, if the requested address does not exist, 404 is returned; an internal server error returns 500; and some anti-scraping servers return 403 and reject the request when you collect data too frequently.

Of course, we will deal with that later using proxy IPs.
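A small sketch of acting on the status code before parsing; the handling for 403 and the other branches is an illustrative assumption, not the article's code:

import org.apache.http.HttpStatus;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class StatusAwareFetch {
    public static void main(String[] args) throws IOException {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        CloseableHttpResponse response = httpClient.execute(new HttpGet("http://www.cnblogs.com"));
        int statusCode = response.getStatusLine().getStatusCode();
        if (statusCode == HttpStatus.SC_OK) {               // 200: parse the page
            System.out.println(EntityUtils.toString(response.getEntity(), "UTF-8"));
        } else if (statusCode == HttpStatus.SC_FORBIDDEN) { // 403: likely anti-scraping
            System.out.println("Blocked; slow down or switch to a proxy IP");
        } else {                                            // 404, 500, ...
            System.out.println("Skipping page, status: " + statusCode);
        }
        response.close();
        httpClient.close();
    }
}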

4. Crawling images
  To crawl an image with HttpClient, obtain the input stream via entity.getContent(), then use the file-copy method from Commons IO to save the image locally, as follows:

4.1 Add the dependency

<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.5</version>
</dependency>
4.2 Core code
package com.jxlg.study.httpclient;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

public class GetPictureByUrl {
    public static void main(String[] args) throws IOException {
        // image URL
        String url = "https://wx2.sinaimg.cn/mw690/006RYJvjly1fmfk7c049vj30zk0qogq6.jpg";
        // create an HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // create an HttpGet instance
        HttpGet httpGet = new HttpGet(url);
        // set the request header
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36");
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // get the file extension
        String fileName = url.substring(url.lastIndexOf("."), url.length());

        if (response != null) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                System.out.println("Content-Type: " + entity.getContentType().getValue());
                InputStream inputStream = entity.getContent();
                // copy the stream to a local file
                FileUtils.copyToFile(inputStream, new File("D:/love" + fileName));
            }
        }
        if (response != null) {
            response.close();
        }
        if (httpClient != null) {
            httpClient.close();
        }
    }
}
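One closing note: the null checks and manual close() calls in these examples can be tightened with try-with-resources, since CloseableHttpClient and CloseableHttpResponse both implement Closeable. A minimal sketch of the same fetch in that style:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class TryWithResourcesFetch {
    public static void main(String[] args) throws IOException {
        // both resources are closed automatically, even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(new HttpGet("http://www.cnblogs.com"))) {
            System.out.println(EntityUtils.toString(response.getEntity(), "UTF-8"));
        }
    }
}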


Source: blog.csdn.net/qq_35577329/article/details/88887917