Using HttpClient, and a quick start with HtmlCleaner

Today's topic is HttpClient. It is an Apache project, and right now we are using it to fetch websites.

Before looking at the code, you should first understand the request/response cycle so everything makes more sense. HttpClient is a wrapper around ordinary network programming, i.e. the java.net package. The built-in URLConnection class can also fetch pages, but it is fairly low-level. In the examples I simply declare throws on the methods so the code stays short.
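
For comparison, here is a minimal sketch of fetching a page with the plain java.net URLConnection API; the URL and class name are just placeholders, not part of the original examples:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    public class UrlConnectionDemo {
        public static void main(String[] args) throws Exception {
            // plain java.net fetch; the URL is just a placeholder
            URLConnection conn = new URL("https://www.example.com/").openConnection();
            conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // same trick as the header set below
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "utf-8"))) {
                StringBuilder src = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    src.append(line).append('\n');
                }
                System.out.println(src);
            }
        }
    }

It works, but you have to manage headers, streams and charsets by hand; HttpClient wraps all of that up for you.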

I won't do a long introduction here; there are PPTs and videos from our teacher for that.

 

In a while I will put the required jar packages on a Baidu network disk; of course you can also just add the Maven dependency:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

 

 

//@Test
    public void fun() throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault(); // create the client instance
        HttpGet httpGet = new HttpGet("https://www.tinaya.com/"); // just put the address you want to fetch here
        // Why set a request header here? (see below)
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36");

Because some websites have anti-crawler measures. This header simply simulates a normal browser visit; otherwise your IP address may get blocked.

/*
         After getting the response, we need to analyze it
         */
        CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
        HttpEntity entity = response.getEntity(); // get the response entity
        System.out.println("The source code of the webpage is as follows:");
        String src = EntityUtils.toString(entity, "utf-8"); // which charset should the source be read with? Most pages declare a content-type in a meta tag; match that
        System.out.println(src);
        //close
        response.close();
        httpClient.close();
    }

/*
     Why do we need the content-type?
     So we can handle the page according to its specific MIME type.

     Why do we need the status code?
     Because 200 means success and 404 means not found;
     if it is not 200, the crawler code that follows does not need to run.
     */

//Get the MIME type of this page
        String value = entity.getContentType().getValue();
        System.out.println(value); // prints the value of content-type

        //Get the status code of this page
        StatusLine line = response.getStatusLine();
        System.out.println(line); // prints something like: HTTP/1.1 200 OK
        //But we generally only need the numeric status code, so we can judge with it
        int code = line.getStatusCode();
        System.out.println(code);
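
If you would rather not hard-code "utf-8", a minimal sketch of taking the charset from the response's Content-Type header instead, falling back to UTF-8 when none is declared. ContentType is org.apache.http.entity.ContentType, Charset/StandardCharsets come from java.nio.charset, and this would replace the earlier EntityUtils.toString call (an entity can usually only be read once):

        ContentType contentType = ContentType.getOrDefault(entity); // parses the Content-Type header
        Charset charset = contentType.getCharset() != null
                ? contentType.getCharset()
                : StandardCharsets.UTF_8;                            // fall back when no charset is declared
        String src = EntityUtils.toString(entity, charset);          // read the source with that charset
        System.out.println(src);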

 

 

2 Grabbing images with HttpClient:

public void fun() throws Exception {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("http://xxxx/gg/sxt2.gif"); // just swap in the image address
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36");
        CloseableHttpResponse response = httpClient.execute(httpGet);
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            System.out.println(entity.getContentType().getValue());
            InputStream input = entity.getContent(); // read it as a stream (an image is bytes, so it is transferred and written as a stream)
            // How do I know which suffix to save the file with? Keep the image URL in a String
            // variable (say imageHref); the path already contains the file name, so the suffix can
            // simply be cut out of it.
            FileUtils.copyInputStreamToFile(input, new File("C:\\sxt.gif"));
        }
    }

copyInputStreamToFile is just a helper from the Commons IO package (org.apache.commons.io.FileUtils).
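
For the suffix question in the comment above, a minimal sketch of cutting the extension out of the image URL; suffixOf and imageHref are hypothetical names, not part of the original code:

    // a tiny helper: cut the extension out of the image URL so the saved file keeps the right suffix
    public static String suffixOf(String imageHref) {
        int dot = imageHref.lastIndexOf('.');
        return dot >= 0 ? imageHref.substring(dot) : ""; // "http://xxxx/gg/sxt2.gif" -> ".gif"
    }

    // usage inside the method above:
    // FileUtils.copyInputStreamToFile(input, new File("C:\\sxt" + suffixOf(imageHref)));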

 

 

Setting a proxy IP

When crawling, some target sites have anti-crawler mechanisms and will block your IP if you visit too frequently or at too regular an interval.

 

At this time, the proxy IP comes in handy.

 

Proxy IPs come in several types: transparent proxies, anonymous proxies, distorting proxies, and high-anonymity proxies.

Transparent proxy: even though it nominally "hides" your IP address, the target site can still find out who you are.

An anonymous proxy is a bit better than a transparent one: others can only tell that you are using a proxy, but cannot tell who you are.

A distorting proxy: others can still tell that you are using a proxy, but they see a fake IP address, so the disguise is more convincing.

A high-anonymity proxy makes it impossible for others to even notice that you are using a proxy, so it is the best choice. That is what we use here.

 

Where do proxy IPs come from? Simple: search Baidu and you will find plenty of proxy IP sites, many of them free.

 

        CloseableHttpClient httpClient = HttpClients.createDefault(); // create httpClient instance

        HttpGet httpGet=new HttpGet("https://www.taobao.com/"); // create httpget instance

        HttpHost proxy=new HttpHost("116.226.217.54", 9999); //used to set proxy ip

/*
       RequestConfig lives in the org.apache.http.client.config package and is mainly used to
       configure the request's network settings. It has a nested builder class, RequestConfig.Builder.

       Usage: first call the static RequestConfig.custom() method to obtain a RequestConfig.Builder
       "configurator", then call its various methods to configure things, and finally call build().
*/

        RequestConfig requestConfig=RequestConfig.custom().setProxy(proxy).build();

        httpGet.setConfig(requestConfig);

        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");

        CloseableHttpResponse response=httpClient.execute(httpGet); // execute http get request

        HttpEntity entity=response.getEntity(); // Get the returned entity

        System.out.println("Webpage content: "+EntityUtils.toString(entity, "utf-8")); // Get webpage content

        response.close(); // response close

        httpClient.close(); // httpClient is closed

 

 

  Generally, if the site you are collecting starts returning 403, your IP has probably been blocked; at that point you should switch to another proxy IP.
     Idea:
     First write a small crawler that visits those free proxy-IP websites and grabs the IPs and ports.
     Put each IP/port pair into a map (or a queue). Then add a check:
     if a request comes back 403, take the next entry and remove the dead IP,
     so the crawler keeps fetching and rotating. Add one more check: if the pool is running low, run the IP crawler again to collect more.
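
A minimal sketch of that idea, assuming a simple in-memory queue of host/port pairs; the class and method names are just placeholders:

    import java.util.concurrent.ConcurrentLinkedQueue;

    import org.apache.http.HttpHost;

    // a tiny proxy pool: hand out proxy hosts and drop the ones that start returning 403
    public class ProxyPool {
        private final ConcurrentLinkedQueue<HttpHost> proxies = new ConcurrentLinkedQueue<>();

        // filled by the small "IP crawler" mentioned above
        public void add(String ip, int port) {
            proxies.add(new HttpHost(ip, port));
        }

        // the proxy to use right now
        public HttpHost current() {
            return proxies.peek();
        }

        // call this when a request through the proxy came back 403: discard it and move to the next one
        public HttpHost discardAndNext(HttpHost dead) {
            proxies.remove(dead);
            if (proxies.size() < 2) {
                // pool is running low: this is where you would run the IP crawler again (not shown)
            }
            return proxies.peek();
        }
    }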

 

When executing a specific HTTP request, httpClient has both a connection time and a time spent reading the content.

 

The so-called connection time is the time it takes HttpClient, from sending the request, to reach the host of the target URL.

The smoother and faster the line, the shorter this is; but because routing is complex and intertwined, the connection time is not fixed, and if you are unlucky you may not connect at all.

HttpClient's default connection time is 1 minute. Retrying past 1 minute is a problem: if you hit a URL that can never be connected to, it holds up the other threads. So we make a special setting, for example: if no connection is established within 10 seconds, report an error so that we can handle it in the business logic.

For example, after the business handling we can try the connection again, and write this special URL to the log4j log so it is convenient for administrators to review.

HttpClient read time

The so-called read time is the time HttpClient spends fetching the content after it has already connected to the target server. Reading data is usually very fast.

However, if there is a lot of data to read, or the target server itself has problems (slow database reads, heavy concurrency, and so on), the read time is affected as well.

As above, we still need to make a special setting, for example 10 seconds: if reading has not finished within 10 seconds, report an error, and as above handle it in the business logic.

 

   If the connection times out, the error reported is connect timed out;
   if the read times out, the error reported is read timed out.

 

CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("https://www.tuicool.com/");
        
        
        RequestConfig config = RequestConfig.custom()
                            .setConnectTimeout(10000) // set the connection timeout (ms)
                            .setSocketTimeout(10000)  // set the read timeout (ms)
                            .build();
        httpGet.setConfig(config);
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36");
        CloseableHttpResponse response = httpClient.execute(httpGet);
        HttpEntity entity = response.getEntity();
        System.out.println("The source code of the webpage is as follows:");
        String src = EntityUtils.toString(entity, "utf-8");
        System.out.println(src);
        //close the streams
        response.close();
        httpClient.close();
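
If you want to react to the two timeouts separately (retry, write the URL to the log4j log, and so on), a minimal sketch of catching them, assuming the same httpClient and httpGet as above and a method that declares throws Exception:

        // ConnectTimeoutException maps to "connect timed out"; SocketTimeoutException maps to "read timed out"
        try {
            CloseableHttpResponse response = httpClient.execute(httpGet);
            System.out.println(EntityUtils.toString(response.getEntity(), "utf-8"));
            response.close();
        } catch (org.apache.http.conn.ConnectTimeoutException e) {
            System.out.println("connect timed out: " + httpGet.getURI()); // e.g. log it and switch proxy/URL
        } catch (java.net.SocketTimeoutException e) {
            System.out.println("read timed out: " + httpGet.getURI());    // e.g. log it and retry later
        } finally {
            httpClient.close();
        }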

 

=====================================================

In fact, HttpClient can do a lot more; what we have covered is all we need from it for now, and there are other tools that do some of this better.

You will get to know them later.

What about after we have the source code? Can we parse content out of it? Of course.

But consider this: the front-end code of some websites is not written carefully, so you may run into tags that are not properly paired, or unclosed short tags.

This is where HtmlCleaner (or HtmlParser) makes its debut.

I personally recommend HtmlCleaner; it works better.

HtmlCleaner clean = new HtmlCleaner();
        TagNode tagNode = clean.clean(src); //That's how it works

 

It supports XPath, which we can use to pull the content out.

//Title
           Object[] titles = tagNode.evaluateXPath("xpath expression");

 

Then just take the results out of the array and use them.
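
A minimal sketch of the whole HtmlCleaner + XPath step; the HTML string and the //title expression are just placeholders, so adjust the XPath to the page you are actually scraping:

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    public class CleanDemo {
        public static void main(String[] args) throws Exception {
            // pretend this string came back from HttpClient
            String src = "<html><head><title>demo</title></head><body><p>hello</body></html>";
            HtmlCleaner cleaner = new HtmlCleaner();
            TagNode root = cleaner.clean(src);                // HtmlCleaner repairs the unpaired tags for us
            Object[] titles = root.evaluateXPath("//title");  // hypothetical XPath; use whatever fits your page
            if (titles.length > 0) {
                TagNode title = (TagNode) titles[0];
                System.out.println(title.getText());          // -> demo
            }
        }
    }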

 

HtmlCleaner can also convert the source into a DOM tree so we could parse it with dom4j, but that is a bit more trouble; since it already has XPath built in, why bother creating a new XPathFactory?
 

Origin blog.csdn.net/qq_40077806/article/details/83591911