[Java-Crawler] HttpClient + Jsoup implement a simple crawler

Web Crawler

A web crawler is a program or script that automatically collects information from the World Wide Web according to certain rules.

1. Crawler entry program

Import the dependency (the programs below use version 5, which I simply copied as the latest version from Maven without thinking much about it; 4.x is what most people still use and has far more material online, so 4.x is generally the safer recommendation. The code differs mainly in how the status code is read: 4.x uses getStatusLine().getStatusCode(), while 5 reads it directly with getCode(). Also, 5 does not seem to offer a socket timeout option among the request parameters, while 4 has one, or at least I have not found it in 5):

<dependency>
      <groupId>org.apache.httpcomponents.client5</groupId>
      <artifactId>httpclient5</artifactId>
      <version>5.2.1</version>
</dependency>

Although the programs below use version 5, version 4 is still recommended: more people use it, so when something goes wrong it is easier to find the answer online.

<dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.14</version>
</dependency>
import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.CloseableHttpResponse;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.HttpEntity;
import org.apache.hc.core5.http.ParseException;
import org.apache.hc.core5.http.io.entity.EntityUtils;

import java.io.IOException;

public class CrawlerFirst {

    public static void main(String[] args) throws IOException, ParseException {
        // 1. "Open the browser": create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. "Type the address": create the HttpGet object for the GET request
        HttpGet httpGet = new HttpGet("http://www.itcast.cn");
        // 3. "Press Enter": send the request with the HttpClient object and get the response
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 4. Parse the response and get the data
        // status code (in HttpClient 4.x this would be response.getStatusLine().getStatusCode())
        System.out.println(response.getCode());
        if (response.getCode() == 200) {
            HttpEntity entity = response.getEntity();
            // tell it the encoding is utf-8, otherwise Chinese text in the HTML will be garbled
            String content = EntityUtils.toString(entity, "utf-8");
            System.out.println(content); // the page's HTML as a string
        }
    }

}

Web Crawler

1. Introduction to web crawlers

In the era of big data, information collection is a very important task, and the data on the Internet is massive. If information collection relied on manpower alone, it would not only be inefficient and cumbersome but would also increase the cost of collection. How to automatically and efficiently obtain the information we are interested in from the Internet and put it to use is an important problem, and crawler technology was born to solve it.

​ Web Crawler, also known as a web robot, can automatically collect and organize data information on the Internet instead of people. It is a program or script that automatically captures information on the World Wide Web according to certain rules, and can automatically collect the content of all pages that it can access to obtain relevant data.

In terms of function, a crawler generally has three parts: data collection, processing, and storage. The crawler starts from the URLs of one or several initial web pages and obtains the URLs on those pages; while crawling, it continuously extracts new URLs from the current page and puts them into a queue, until some stop condition of the system is met.
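To make the URL-queue idea concrete, here is a minimal sketch of such a loop. It is illustrative only: the seed URL and the ten-page stop condition are made up, and it uses jsoup (introduced later in this article) to pull the links out of each page.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class UrlQueueSketch {

    public static void main(String[] args) throws Exception {
        Queue<String> queue = new ArrayDeque<>(); // URLs waiting to be crawled
        Set<String> visited = new HashSet<>();    // URLs already crawled
        queue.add("https://www.51cto.com/");      // seed URL (illustrative)

        int limit = 10;                           // stop condition: crawl at most 10 pages
        while (!queue.isEmpty() && visited.size() < limit) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue;                         // skip URLs we have already seen
            }
            Document doc = Jsoup.connect(url).timeout(10_000).get();
            System.out.println(url + " -> " + doc.title());
            // extract the links on the current page and put them into the queue
            doc.select("a[href]").forEach(a -> queue.add(a.absUrl("href")));
        }
    }

}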

2. Why learn web crawlers

  • search engine
    • After we learn how to write crawlers, we can use them to automatically collect information from the Internet, store or process the collected information accordingly, and then, when we need to look something up, search only within what we have collected. In effect, we have implemented a private search engine.
  • In the era of big data, we can obtain more data sources
    • When doing big data analysis or data mining, we need data sources to analyze. We can obtain data from websites that publish statistics, or from literature and other materials, but these sources sometimes cannot meet our needs, and manually hunting for the data on the Internet costs too much effort. At this point we can use crawler technology to automatically fetch the content we are interested in, crawl it back as our data source, and then perform deeper analysis to extract more valuable information.
  • Better search engine optimization (SEO)
  • Good for employment
    • In terms of employment, crawler engineer is a good career direction: demand for crawler engineers keeps growing while relatively few people are qualified for the position, so it is a comparatively scarce specialty. As big data and artificial intelligence develop, crawler technology will be applied more and more widely, with good room for growth in the future.

HttpClient

A web crawler uses a program to access resources on the network for us. We have always accessed web pages on the Internet through the HTTP protocol, and a crawler program accesses them through the same HTTP protocol.

​ Here we use the technology of java's HTTP protocol client HttpClient to capture web page data. HttpClient is a sub-project under Apache Jakarta Common, which is used to provide an efficient, up-to-date, feature-rich client programming toolkit supporting the HTTP protocol, and it supports the latest version and recommendations of the HTTP protocol.

HttpClient function introduction:

  • Implements all HTTP methods
  • Supports automatic redirection
  • Supports the HTTPS protocol
  • Supports proxy servers

Implementation steps:

  1. Open the browser => create an HttpClient object, which can be obtained through HttpClients.createDefault().
  2. Create an HttpUriRequestBase subclass object according to the request method, that is, HttpGet, HttpPost, HttpPut, etc., passing in the url when constructing it. The url can be either a string or a URI object.
  3. Send the request => use the httpClient object to execute the request; execution returns a response object.
  4. Use the response object to parse the response and get the data.
  5. Close the resources.

1. Get request

public class HttpGetTest {

    public static void main(String[] args) {
        CloseableHttpResponse response = null;
        // 1. "Open the browser": create the httpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create the httpGet object and set the url to access
        HttpGet httpGet = new HttpGet("https://www.51cto.com/");
        // 3. Send the request with httpClient and get the response
        try {
            response = httpClient.execute(httpGet);
            // 4. Parse the response
            if (response.getCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        } catch (ParseException e) {
            throw new RuntimeException(e);
        } finally {
            // 5. Close the resources
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
            if (httpClient != null) {
                try {
                    httpClient.close();
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        }

    }

}

2. GET request with parameters

  • First create an httpClient instance
  • Then build the uri object with a URIBuilder: create the URIBuilder object
  • Add the parameters (the way parameters are set differs between HttpClient versions: lower versions use setParameter, higher versions use addParameter as shown below). Adding a parameter returns the URIBuilder object, so parameters can be added in a chain.
  • Pass the uri (uriBuilder.build()) to the corresponding request object as the address to access.
  • Execute the corresponding request and get the result
  • Process the data
  • Close the resources
public class HttpGetWithParam {

    public static void main(String[] args) throws IOException, ParseException, URISyntaxException {
        // 1. Create the HttpClient instance
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Build the URI object for the request
        // First create the URIBuilder
        URIBuilder uriBuilder = new URIBuilder("https://so.51cto.com/");
        // Set the parameters (chained, since addParameter returns the builder)
        uriBuilder.addParameter("keywords", "java爬虫").addParameter("sort", "time");
        // Create the httpGet object with the built uri as the address to access
        HttpGet httpGet = new HttpGet(uriBuilder.build());
        // 3. Execute the request with httpClient and get the response
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 4. With the response object we can get the data
        if (response.getCode() == 200) {
            HttpEntity entity = response.getEntity();
            System.out.println(EntityUtils.toString(entity, "utf-8"));
        }
        // 5. Close the resources
        response.close();
        httpClient.close();
    }

}

3. Post request

The only difference from the GET request is the step that creates the request object: HttpGet becomes HttpPost.

public class HttpPostTest {

    public static void main(String[] args) throws IOException, ParseException {
        // 1. Create the httpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. Create the HttpPost object and set the uri to access
        HttpPost httpPost = new HttpPost("https://www.51cto.com/");
        // 3. Execute the request and get the response object
        CloseableHttpResponse response = httpClient.execute(httpPost);
        // 4. Handle the response data
        if (response.getCode() == 200) {
            HttpEntity entity = response.getEntity();
            System.out.println(EntityUtils.toString(entity, "utf-8"));
        }
        // 5. Close the resources (the response first, then the client)
        response.close();
        httpClient.close();
    }

}

4. Post request with parameters

There are no parameters in the uri address; the parameter keys=java is submitted in the form body instead.

A POST request with parameters differs from a GET request with parameters: instead of putting the query information directly on the uri, we put the parameters into the request body.

  • First declare a List<NameValuePair> and add BasicNameValuePair objects (the implementation class of NameValuePair) to the collection.
  • Then wrap the list in a UrlEncodedFormEntity instance (an implementation class of HttpEntity), because its constructor public UrlEncodedFormEntity(Iterable<? extends NameValuePair> parameters, Charset charset) accepts a collection of parameters, which other implementation classes such as StringEntity do not.
    • One point to note: the second parameter is a Charset instance, not the name of the encoding as a string. Lower versions have an overload that converts the string for us; in the newer version we have to do it ourselves with Charset.forName("the encoding").
  • Then set the entity object on the httpPost request object.
  • The rest is the usual routine: execute the request, get the result, process it, and close the resources.
public class HttpPostWithParam {

    public static void main(String[] args) throws IOException, ParseException {
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // Declare the List that holds the form parameters
        List<NameValuePair> paramList = new ArrayList<NameValuePair>();
        paramList.add(new BasicNameValuePair("keys", "java"));
        // paramList.add(new BasicNameValuePair("sort", "time"));
        // Create the form Entity object
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(paramList, Charset.forName("utf-8"));
        // Set the form Entity object on the Post request
        HttpPost httpPost = new HttpPost("https://www.itcast.cn/");
        httpPost.setEntity(formEntity);

        CloseableHttpResponse response = httpClient.execute(httpPost);
        if (response.getCode() == 200) {
            HttpEntity entity = response.getEntity();
            System.out.println(EntityUtils.toString(entity, "utf-8"));
        }

        // Close the resources (the response first, then the client)
        response.close();
        httpClient.close();

    }

}

5. Connection pool

We know that an HttpClient instance is effectively our browser. If we create a new HttpClient object for every request, we run into frequent creation and destruction => it is like opening a browser to visit one page, closing it, then opening the browser again and closing it again after the next visit. A connection pool avoids this.

public class HttpClientPoolTest {

    public static void main(String[] args) {
        // 1. Create the connection pool manager
        PoolingHttpClientConnectionManager httpClientPool = new PoolingHttpClientConnectionManager();

        // Set the maximum total number of connections
        httpClientPool.setMaxTotal(100);
        // Set the maximum number of connections per host (route)
        httpClientPool.setDefaultMaxPerRoute(10);

        // 2. Send requests using the connection pool manager
        doGet(httpClientPool);
        doGet(httpClientPool);
        doGet(httpClientPool);

        httpClientPool.close();
    }

    private static void doGet(PoolingHttpClientConnectionManager httpClientPool) {
        // Instead of creating a brand-new httpClient each time, get one backed by the connection pool.
        // What the first line below does: HttpClients.custom() returns an HttpClientBuilder,
        // we hand the PoolingHttpClientConnectionManager to that builder,
        // and finally build() an httpClient instance (even without setting the pool,
        // building via HttpClientBuilder would still create an instance).
        try {
            CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(httpClientPool).build();

            HttpGet httpGet = new HttpGet("https://www.51cto.com/");
            CloseableHttpResponse response = httpClient.execute(httpGet);

            if (response.getCode() == 200) {
                HttpEntity entity = response.getEntity();
                System.out.println(EntityUtils.toString(entity, "utf-8"));
            }

            response.close();
            // httpClient.close();  the httpClient is managed by the connection pool, so we do not
            // close it here; closing the pool itself is enough
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

Below are the addresses of the httpClient instance objects printed across the repeated doGet calls; they are all different => a new object is built on each call, and its connections are allocated by the connection pool.

(Screenshots omitted: each doGet call printed a different httpClient object address.)

Note: Some people do not understand why, after setting the maximum total number of connections, we also need a maximum number of connections per host. The reason is simple: once the pool is set up, we no longer create and destroy httpClient instances in every class, which would waste resources; their lifecycle is handed over to the connection pool. At the same time, when crawling we will hit many different hosts ("Host: www.csdn.net[\r][\n]" and so on), such as Sohu, Sina, Tencent... My personal understanding is that the per-host limit keeps the crawl balanced and complete: like grocery shopping with a fixed amount of money, you split it into several equal parts and buy good produce from several different stalls.

If you do not set the total number of connections and the per-host maximum yourself, the connection pool falls back to its default values (the defaults shown here are version 5's; in version 4 the defaults differ, so it is worth checking or setting them explicitly):

(Screenshot of the default values omitted.)
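A quick way to see the defaults for whatever version is on your classpath is simply to print them. A small sketch; the getters come from the ConnPoolControl interface that PoolingHttpClientConnectionManager implements:

import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;

public class PoolDefaultsCheck {

    public static void main(String[] args) {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        // Print whatever defaults ship with the HttpClient version on the classpath
        System.out.println("maxTotal = " + pool.getMaxTotal());
        System.out.println("defaultMaxPerRoute = " + pool.getDefaultMaxPerRoute());
        pool.close();
    }

}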

6. Request parameters

Sometimes, because of the network or the target server, a request takes longer to complete, and we need to customize the relevant timeouts.

public class HttpConfigTest {

    public static void main(String[] args) throws IOException, ParseException {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("https://www.51cto.com/");
        // Configure the request
        RequestConfig requestConfig = RequestConfig.custom()
                .setConnectTimeout(Timeout.ofMilliseconds(1000L))          // max time to establish the connection, in ms
                .setConnectionRequestTimeout(Timeout.ofMilliseconds(500L)) // max time to obtain a connection from the pool, in ms
                .build();
        // Attach the configuration to the request
        httpGet.setConfig(requestConfig);
        CloseableHttpResponse response = httpClient.execute(httpGet);
        HttpEntity entity = response.getEntity();
        System.out.println(EntityUtils.toString(entity, "utf-8"));

        response.close();
        httpClient.close();

    }

}
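Regarding the remark at the top of the article that version 5 does not seem to expose a socket timeout among the request parameters: in the 5.x API the closest equivalents appear to be setResponseTimeout on RequestConfig and setSocketTimeout on a ConnectionConfig attached to the connection manager. A hedged sketch; the class and method names are from the 5.2 API, so verify them against your exact version:

import org.apache.hc.client5.http.config.ConnectionConfig;
import org.apache.hc.client5.http.config.RequestConfig;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
import org.apache.hc.core5.util.Timeout;

public class HttpTimeoutSketch {

    public static void main(String[] args) {
        // Per-connection socket timeout, set on the connection manager (5.x style)
        PoolingHttpClientConnectionManager manager = new PoolingHttpClientConnectionManager();
        manager.setDefaultConnectionConfig(ConnectionConfig.custom()
                .setSocketTimeout(Timeout.ofMilliseconds(3000L))
                .build());

        // Per-request timeout for waiting on response data
        RequestConfig requestConfig = RequestConfig.custom()
                .setResponseTimeout(Timeout.ofMilliseconds(3000L))
                .build();

        CloseableHttpClient httpClient = HttpClients.custom()
                .setConnectionManager(manager)
                .setDefaultRequestConfig(requestConfig)
                .build();
        // ... use httpClient as in the examples above, then close it and the manager
    }

}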

Jsoup

After we grab the page data with HttpClient, we need to parse the page. We could parse pages with string-processing tools or with regular expressions, but those approaches bring a lot of development cost, so we use a technology dedicated to parsing html pages => jsoup.

1. Introduction to jsoup

jsoup is a Java HTML parser (full name: Java HTML Parser) that can directly parse a URL address or HTML text content. It provides a very convenient API for fetching and manipulating data using DOM, CSS and jQuery-like methods.

The main functions of jsoup:

  • Parse HTML from a URL, a file, or a string;
  • Find and extract data using DOM traversal or CSS selectors;
  • Manipulate HTML elements, attributes, and text;

Add jsoup dependency:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
</dependency>

Other utility dependencies (commons-io => needed to work with files, commons-lang3 => needed for StringUtils to work with strings):

<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.11.0</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.12.0</version>
</dependency>

<dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.13</version>
      <scope>test</scope>
</dependency>

2.1 Function 1.1 - Parse a URL

    @Test
    public void testUrl() throws Exception {
        // Parse the url; the first parameter is the URL object, the second is the access timeout in ms
        Document doc = Jsoup.parse(new URL("https://www.51cto.com/"), 10000);

        // Use the tag selector to get the content of the title tag
        String title = doc.getElementsByTag("title").first().text();

        // Print it
        System.out.println(title); // 技术成就梦想51CTO-中国领先的IT技术网站
    }

Note: Although Jsoup can be used instead of HttpClient to initiate requests and grab data directly, it is usually not used that way, because real development needs multithreading, connection pools, proxies and so on, which jsoup does not support well. So jsoup is generally used only as an HTML parsing tool.
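A minimal sketch of that division of labor, reusing the same 51CTO homepage as above: HttpClient does the fetching and jsoup only parses the returned HTML string.

import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.CloseableHttpResponse;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchThenParse {

    public static void main(String[] args) throws Exception {
        // HttpClient does the fetching (threads, pooling, proxies all live on this side)...
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet("https://www.51cto.com/");
        CloseableHttpResponse response = httpClient.execute(httpGet);
        String html = EntityUtils.toString(response.getEntity(), "utf-8");
        response.close();
        httpClient.close();

        // ...and jsoup only parses the HTML string that came back
        Document doc = Jsoup.parse(html);
        System.out.println(doc.getElementsByTag("title").first().text());
    }

}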

2.2 Function 1.2 - Parse a string

Use the org.apache.commons.io.FileUtils utility class to read the html file into a string, then parse the string with jsoup.

    @Test
    public void testString() throws Exception {
        // Use the utility class to read the file into a string
        String index = FileUtils.readFileToString(new File("C:\\Users\\myz03\\Desktop\\index.html"), "utf-8");
        // Parse the string
        Document doc = Jsoup.parse(index);
        // Use the tag selector to get the content of the td tags
        Elements tds = doc.getElementsByTag("td");
        tds.forEach(td -> {
            System.out.println(td.text());
        });
    }

2.3 Function 1.3 - Parse a file

It has the same effect as parsing the string above.

    @Test
    public void testFile() throws Exception {
        // Parse the file
        Document doc = Jsoup.parse(new File("C:\\Users\\myz03\\Desktop\\index.html"));
        // Use the tag selector to get the content of the td tags
        Elements tds = doc.getElementsByTag("td");
        tds.forEach(td -> {
            System.out.println(td.text());
        });
    }

3.1 Function 2.1 - Use DOM to traverse the document

Element acquisition:

  1. Query an element by id: getElementById;
  2. Get elements by tag: getElementsByTag;
  3. Get elements by class: getElementsByClass;
  4. Get elements by attribute: getElementsByAttribute;
    @Test
    public void testDom() throws Exception {
        /*
        1. Query an element by id: getElementById;
        2. Get elements by tag: getElementsByTag;
        3. Get elements by class: getElementsByClass;
        4. Get elements by attribute: getElementsByAttribute;
        */
        // Parse the file and get the Document object
        Document doc = Jsoup.parse(new File("C:\\Users\\myz03\\Desktop\\index.html"));
        // Get the element by id
        Element div1Element = doc.getElementById("div1");
        System.out.println(div1Element.text());

        System.out.println("=====================");
        // Get elements by tag
        Elements tds = doc.getElementsByTag("td");
        tds.forEach(td -> {
            System.out.println(td.text());
        });

        System.out.println("===================");
        // Get elements by class
        Elements a1s = doc.getElementsByClass("a1");
        a1s.forEach(a1 -> System.out.println(a1.text()));

        System.out.println("======================");
        // Get elements by attribute
        Elements colspans = doc.getElementsByAttribute("colspan");
        colspans.forEach(colspan -> System.out.println(colspan.text()));
    }

Get data from an element:

  1. Get the id from the element;
  2. Get the className from the element;
  3. Get the value of the attribute attr from the element;
  4. Get all attributes from the element;
  5. Get the text content text from the element;
    @Test
    public void testData() throws Exception {
        Document doc = Jsoup.parse(new File("C:\\Users\\myz03\\Desktop\\index.html"), "utf-8");
        // Get the element by id
        Element div1 = doc.getElementById("div1");

        StringJoiner content = new StringJoiner("、");
        // 1. Get the id from the element;
        content.add(div1.id());
        // 2. Get the className from the element;
        content.add(div1.className());
        // 3. Get an attribute value from the element: attr;
        content.add(div1.attr("id"));
        // 4. Get all attributes from the element: attributes;
        Attributes attributes = div1.attributes();
        attributes.forEach(attribute -> content.add(attribute.getKey() + ":" + attribute.getValue()));
        // 5. Get the text content from the element: text;
        content.add(div1.text());

        System.out.println(content.toString());

    }

3.2 Function 2.2 - Selector overview

tagname : Find elements by tags, such as: span

#id : Find elements by ID, for example: #div1

.class : Find elements by class name, for example: .a1

[attribute] : Use attributes to find elements, such as: [target]

[attr=value] : Use the attribute value to find the element, for example: [target=_blank]

    @Test
    public void testSelector() throws Exception {
        Document doc = Jsoup.parse(new File("C:\\Users\\myz03\\Desktop\\index.html"), "utf-8");

        // Find elements by tag
        Elements ths = doc.select("th");
        ths.forEach(th -> System.out.println(th.text()));
        System.out.println("=============");

        // Find an element by ID: #id
        Element div1s = doc.select("#div1").first();
        System.out.println(div1s.text());
        System.out.println("=============");

        // Find elements by class name: .class
        Elements a1s = doc.select(".a1");
        a1s.forEach(a1 -> System.out.println(a1.text()));
        System.out.println("================");

        // Find elements by attribute: [attribute]
        Elements colspans = doc.select("[colspan]");
        colspans.forEach(colspan -> System.out.println(colspan.text()));
        System.out.println("====================");

        // Find elements by attribute=value: [attribute=value]
        Elements rowspans = doc.select("[rowspan=3]");
        rowspans.forEach(rowspan -> System.out.println(rowspan.text()));

    }

3.3 Function 2.3 - Combined use of Selector selectors

  1. el#id: element + ID, for example: div#div1
  2. el.class: element + class, for example: a.a1
  3. el[attr]: element + attribute name, such as: td[colspan]
  4. Any combination: for example: td[colspan].xxx
  5. ancestor child: Find a child element of an element, such as: div strong
  6. parent > child: Find direct child elements under a parent element: tr > td > a
  7. parent > *: Find all direct child elements under a parent element
    @Test
    public void testSelectorPlus() throws Exception {
        Document doc = Jsoup.parse(new File("C:\\Users\\myz03\\Desktop\\index.html"), "utf-8");

        // el#id: element + ID
        Elements div1s = doc.select("div#div1");
        div1s.forEach(div1 -> System.out.println(div1.text()));
        System.out.println("============");

        // el.class: element + class
        Elements a1s = doc.select("a.a1");
        a1s.forEach(a1 -> System.out.println(a1.text()));
        System.out.println("=============");

        // el[attr]: element + attribute name
        Element colspan = doc.select("td[colspan]").first();
        System.out.println(colspan.text());
        System.out.println("===============");

        // Any combination
        Elements tdA1s = doc.select("td[colspan].xxx");
        tdA1s.forEach(tdA1 -> System.out.println(tdA1.text()));
        System.out.println("=================");

        // ancestor child: find child elements under an element
        Elements divStrongs = doc.select("div strong");
        System.out.println(divStrongs.first().text());
        System.out.println("===============");

        // parent > child: find the direct children of a parent element
        Elements selects = doc.select("tr > th");
        selects.forEach(select -> System.out.println(select.text()));
        System.out.println("==========");

        // parent > *: find all direct children of a parent element
        Elements mytrs = doc.select(".mytr > *");
        mytrs.forEach(mytr -> System.out.println(mytr.text()));
    }

Getting Started Case

Since there is a certain amount of code, I put it in the Gitee remote library, and you can clone it if you are interested.
Java-Crawler
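For a feel of how the pieces in this article fit together before cloning the repo, here is a minimal, hedged end-to-end sketch. The target URL and the title selector are placeholders; a real case would use the selectors for whatever site you crawl.

import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.CloseableHttpResponse;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MiniCrawler {

    public static void main(String[] args) throws Exception {
        // Connection pool shared by all requests
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        pool.setMaxTotal(100);
        pool.setDefaultMaxPerRoute(10);

        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(pool).build();

        // 1. Fetch the page with HttpClient
        HttpGet httpGet = new HttpGet("https://www.51cto.com/"); // placeholder target
        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            if (response.getCode() == 200) {
                String html = EntityUtils.toString(response.getEntity(), "utf-8");
                // 2. Parse the page with jsoup and extract the data we care about
                Document doc = Jsoup.parse(html);
                System.out.println(doc.select("title").first().text()); // placeholder selector
            }
        }

        pool.close();
    }

}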

Origin blog.csdn.net/qq_63691275/article/details/130781777