Today's topic is HttpClient, an Apache project that I currently use to crawl websites synchronously.
Before reading the code, you should first understand the HTTP request/response cycle; the code will make much more sense with that background. HttpClient is a wrapper around plain network programming (the java.net package). The built-in URLConnection class can also fetch pages, but it is fairly low-level. To keep the example code short, I simply declare throws Exception on the methods instead of handling each exception.
I won't give a long introduction here; our teacher has PPTs and videos that cover it.
I will put the required jar packages on a Baidu Netdisk later; of course, you can also just add the Maven dependency:
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.2</version>
</dependency>
//@Test
public void fun() throws Exception {
    CloseableHttpClient httpClient = HttpClients.createDefault(); // create the client instance
    HttpGet httpGet = new HttpGet("https://www.tinaya.com/"); // put the URL you want to fetch here
    // Why set a User-Agent header? Some sites have anti-crawler measures; this header
    // simulates a normal browser visit, otherwise your IP address may be blocked.
    httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36");
    /*
     * After getting the response, we need to analyze it.
     */
    CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
    HttpEntity entity = response.getEntity(); // get the response entity
    System.out.println("The page source is as follows:");
    // Which charset should we decode with? Most pages declare a content-type in a
    // meta tag (or in the Content-Type response header); match the charset to that.
    String src = EntityUtils.toString(entity, "utf-8");
    System.out.println(src);
    // close
    response.close();
    httpClient.close();
}
/*
 * Why get the content-type? So we know the page's specific MIME type.
 * Why get the status code? Because 200 means success and 404 means not found;
 * if it is not 200, there is no point in running the crawler code that follows.
 */
// Get the MIME type of the page
String value = entity.getContentType().getValue();
System.out.println(value); // prints the content-type value
// Get the status line of the page
StatusLine line = response.getStatusLine();
System.out.println(line); // prints something like: HTTP/1.1 200 OK
// But we usually only need the numeric status code, to use in the check
int code = line.getStatusCode();
System.out.println(code);
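The two checks described above can be put together in a small guard method before running any parsing code. This is only a sketch: the class and method names are mine, not part of HttpClient.

```java
// Sketch of the "check before crawling" idea above: only proceed when the
// status code is 200 and the MIME type is HTML. Names are illustrative.
public class ResponseCheck {
    static boolean shouldCrawl(int statusCode, String contentType) {
        if (statusCode != 200) {
            return false; // e.g. 404 Not Found: skip the crawler code
        }
        // content-type may carry a charset suffix, e.g. "text/html; charset=utf-8"
        return contentType != null && contentType.startsWith("text/html");
    }

    public static void main(String[] args) {
        System.out.println(shouldCrawl(200, "text/html; charset=utf-8")); // true
        System.out.println(shouldCrawl(404, "text/html"));                // false
        System.out.println(shouldCrawl(200, "image/gif"));                // false
    }
}
```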
2. Downloading images with HttpClient:
public void fun() throws Exception {
    CloseableHttpClient httpClient = HttpClients.createDefault();
    HttpGet httpGet = new HttpGet("http://xxxx/gg/sxt2.gif"); // just change this to the image URL
    httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36");
    CloseableHttpResponse response = httpClient.execute(httpGet);
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        System.out.println(entity.getContentType().getValue());
        // Read the body as a stream: an image is binary data, so of course it is
        // transmitted and written as bytes.
        InputStream input = entity.getContent();
        // copyInputStreamToFile is a helper from the Commons IO package.
        // How do we know which file extension to use when saving? Keep the request
        // URL in a String variable (e.g. imageHref): the path already contains the
        // file name, so just cut it out of the path.
        FileUtils.copyInputStreamToFile(input, new File("C:\\sxt.gif"));
    }
}
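As the comment above suggests, the file name and extension can be cut out of the image URL itself. A minimal sketch using plain string operations (the helper names are mine):

```java
// Sketch of deriving the saved file's name and extension from the image URL,
// as described above. Class and method names are illustrative.
public class ImageName {
    static String fileNameFromUrl(String imageHref) {
        // take everything after the last '/', e.g. "sxt2.gif"
        return imageHref.substring(imageHref.lastIndexOf('/') + 1);
    }

    static String suffixFromUrl(String imageHref) {
        String name = fileNameFromUrl(imageHref);
        // take everything after the last '.', e.g. "gif"
        return name.substring(name.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) {
        String imageHref = "http://xxxx/gg/sxt2.gif";
        System.out.println(fileNameFromUrl(imageHref)); // sxt2.gif
        System.out.println(suffixFromUrl(imageHref));   // gif
    }
}
```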
Setting a proxy IP
When crawling, some target sites have anti-crawler mechanisms: frequent or suspiciously regular visits will get your IP blocked.
That is where proxy IPs come in handy.
Proxies come in several kinds: transparent, anonymous, distorting, and high-anonymity proxies.
Transparent proxy: it forwards your traffic, but your real IP still leaks through, so the site can find out who you are.
Anonymous proxy: a bit better than transparent; the site can tell you are using a proxy, but cannot tell who you are.
Distorting proxy: the site can still tell you are using a proxy, but it receives a fake IP address, so the disguise is more convincing.
High-anonymity proxy: the site cannot even tell that you are using a proxy, so it is the best choice, and it is what we use here.
Where do proxy IPs come from? Simple: search Baidu and you will find plenty of proxy IP sites, many of them free.
CloseableHttpClient httpClient=HttpClients.createDefault(); // create httpClient instance
HttpGet httpGet=new HttpGet("https://www.taobao.com/"); // create httpget instance
HttpHost proxy=new HttpHost("116.226.217.54", 9999); //used to set proxy ip
/*
 * RequestConfig lives in the org.apache.http.client.config package and is used to
 * configure request-level settings such as the network environment. It has a nested
 * builder class, RequestConfig.Builder: call the static RequestConfig.custom() method
 * to obtain the builder, then chain its methods to configure the request, and finish
 * with build().
 */
RequestConfig requestConfig=RequestConfig.custom().setProxy(proxy).build();
httpGet.setConfig(requestConfig);
httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
CloseableHttpResponse response=httpClient.execute(httpGet); // execute http get request
HttpEntity entity=response.getEntity(); // Get the returned entity
System.out.println("Webpage content: "+EntityUtils.toString(entity, "utf-8")); // Get webpage content
response.close(); // response close
httpClient.close(); // httpClient is closed
Generally, if the site you are collecting returns 403, your IP has probably been blocked; at that point you should switch to another proxy IP.
Idea:
First write a small crawler that visits those free proxy IP sites and collects IPs and ports.
Put each IP/port pair into a queue (or a map). Then add a check: whenever a request returns 403, discard the current proxy, remove its IP, and take the next pair, so the pool keeps rotating. Add one more check: when the pool is running low, run the proxy crawler again to fetch more IPs.
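The rotation idea above can be sketched with a simple queue-backed pool. This is a pure-Java sketch: the class and method names are mine, and the 403 handling is shown as method calls rather than a real HTTP request.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the proxy-rotation idea above: keep "ip:port" pairs in a queue,
// drop the current proxy when a request returns 403, and flag when the pool
// runs low so the proxy crawler can be run again. Not a real API.
public class ProxyPool {
    private final Deque<String> proxies = new ArrayDeque<>(); // "ip:port" entries
    private final int refillThreshold;

    ProxyPool(int refillThreshold) { this.refillThreshold = refillThreshold; }

    void add(String hostPort) { proxies.offer(hostPort); }

    String current() { return proxies.peek(); }

    // Called when a request through the current proxy returned 403:
    // remove (ban) it and move on to the next one.
    String discardCurrent() {
        proxies.poll();
        return proxies.peek();
    }

    // When the pool is nearly empty, go crawl more free-proxy sites.
    boolean needsRefill() { return proxies.size() <= refillThreshold; }

    public static void main(String[] args) {
        ProxyPool pool = new ProxyPool(1);
        pool.add("116.226.217.54:9999");
        pool.add("110.73.0.1:8123");
        System.out.println(pool.current());        // 116.226.217.54:9999
        System.out.println(pool.discardCurrent()); // got a 403 -> next proxy
        System.out.println(pool.needsRefill());    // true: time to crawl more IPs
    }
}
```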
When httpClient executes a request there are two times involved: the time to connect and the time to read the content.
The so-called connect time is the time HttpClient takes to establish a connection from where the request is sent to the host of the target URL. The smoother the route, the faster it is, but because routing is complex and intertwined, the connect time is rarely fixed, and with bad luck you may not be able to connect at all. HttpClient's default connect time is 1 minute; retrying past that point is a problem, and a URL that can never be connected will hold up other threads. So we make an explicit setting: for example, if no connection is established within 10 seconds, raise an error so we can handle it in business logic.
For example, after handling it we can retry the connection later, and write the problematic URL to the log4j log so administrators can review it.
HttpClient read time
The so-called read time starts after HttpClient has connected to the target server and is fetching the content data. Reading is usually fast, but if the payload is large, or the target server itself has problems (slow database reads, heavy concurrency, etc.), the read time is affected too.
As above, we make an explicit setting, for example 10 seconds: if reading has not finished within 10 seconds, raise an error and, as above, handle it in business logic.
If the connection times out, the error reported is "connect timed out";
if the read times out, it is "read timed out".
CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet("https://www.tuicool.com/");
RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(10000) // connect timeout in milliseconds
        .setSocketTimeout(10000)  // read (socket) timeout in milliseconds
        .build();
httpGet.setConfig(config);
httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36");
CloseableHttpResponse response = httpClient.execute(httpGet);
HttpEntity entity = response.getEntity();
System.out.println("The source code of the webpage is as follows:");
String src = EntityUtils.toString(entity, "utf-8");
System.out.println(src);
// close the streams
response.close();
httpClient.close();
=====================================================
HttpClient can actually do a lot more, but this is all we need from it for now; for some tasks other libraries do even better,
and you will meet them later.
So what do we do after fetching the source? Can we parse the content out of it? Of course.
But consider this: the front-end developers of some websites write sloppy HTML, so you may run into tags that do not come in matched pairs, or short (self-closing) tags.
This is where HtmlCleaner or HtmlParser makes its debut.
I personally recommend HtmlCleaner; it works better.
HtmlCleaner clean = new HtmlCleaner();
TagNode tagNode = clean.clean(src); // that is all it takes
It supports XPath, so we can use it to extract the content.
// Title
Object[] titles = tagNode.evaluateXPath("xpath expression");
Then just read the results out of the array.
HtmlCleaner can also convert the source into a DOM tree, which we could parse with dom4j, but that is more trouble than it is worth: HtmlCleaner already comes with XPath support, so why bother creating a new XPathFactory?