Crawling the 2023 GDP Data of China's Top 100 Counties with Java, Jsoup, and EasyExcel in the Post-Pandemic Era

Background

As we enter the post-pandemic era, the state of the national economy over the past year, 2022, is naturally of great interest to many scholars and researchers. These data are published on the website of the National Bureau of Statistics; analyzing them offers one perspective from which the current economic situation can be observed and verified.

A total of 1,279 county-level units across the country have disclosed their 2022 GDP and general public budget revenue figures. Based on these data, Enterprise Alert compiled the GDP ranking and the general public budget revenue ranking of China's top 100 counties. Kunshan City again tops the list with a GDP of 500.666 billion yuan, Jiangyin City and Jinjiang City rank second and third, and Changsha County is the only county in Hunan Province to enter the national top ten (ranked 7th).


The first ranking is published as an image, while the second is rendered as an HTML table. Either way, analyzing the data offline is inconvenient. As a programmer, this should not be difficult: we can use web scraping to extract and organize the data.

This article uses Java as the programming language and explains how to crawl web pages with Jsoup. Detailed sample code is provided throughout; I hope it is helpful.

1. A first look at Jsoup crawling

1. Web page structure analysis

Before crawling a page with Jsoup, analyze its structure so you can design an appropriate extraction strategy. Open the target website in a browser, press F12 to open the developer tools, and locate the elements of the target page.

Expand the table element under the div that holds the GDP top-100 list, and you will find the following data:


The general public budget revenue data is handled in exactly the same way, so it is not described again here.

2. Developing the Jsoup crawler in Java

1. Adding the Jsoup and EasyExcel dependencies

We use Maven for dependency management, so we first define pom.xml. The key configuration is as follows:

<project xmlns="http://maven.apache.org/POM/4.0.0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
 <modelVersion>4.0.0</modelVersion>
 <groupId>com.yelang</groupId>
 <artifactId>jsoupdemo</artifactId>
 <version>0.0.1-SNAPSHOT</version>
 
 <dependencies>
  <dependency>
   <groupId>org.jsoup</groupId>
   <artifactId>jsoup</artifactId>
   <version>1.11.3</version>
  </dependency>
 
  <dependency>
   <groupId>com.alibaba</groupId>
   <artifactId>easyexcel</artifactId>
   <version>3.0.5</version>
  </dependency>
 </dependencies>
 
</project>

2. Designing the entity classes

Comparing the two tables shows that they differ only in their final indicator; the ranking, county name, and province columns are identical. We therefore apply an object-oriented design and extract the shared fields into a parent class. The corresponding class diagram is shown below.

3. Data collection entity

import java.io.Serializable;

import com.alibaba.excel.annotation.ExcelProperty;

public class CountyBase implements Serializable {

    private static final long serialVersionUID = -1760099890427975758L;

    @ExcelProperty(value = {"序号"}, index = 1)
    private Integer index;

    @ExcelProperty(value = {"县级地区"}, index = 2)
    private String name;

    @ExcelProperty(value = {"所属省"}, index = 3)
    private String province;

    public Integer getIndex() {
        return index;
    }

    public void setIndex(Integer index) {
        this.index = index;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getProvince() {
        return province;
    }

    public void setProvince(String province) {
        this.province = province;
    }

    public CountyBase(Integer index, String name, String province) {
        super();
        this.index = index;
        this.name = name;
        this.province = province;
    }

    public CountyBase() {
        super();
    }

}

In the code above, the ranking, county-level region, and province fields are abstracted into a parent class, from which two subclasses are derived: a GDP class and a general public budget revenue class. Note that because the collected data must be saved to a local Excel file, we use EasyExcel as the generation component. The @ExcelProperty annotation defines the Excel column header to write and the column's position.
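Before looking at the subclasses, the effect of @ExcelProperty can be checked with a minimal standalone write. This sketch is not from the original article: the ExcelWriteDemo class, the Row entity, its header names, and the output path demo.xlsx are all illustrative; only the EasyExcel.write(...).sheet(...).doWrite(...) call chain matches the one used later.

```java
import java.util.Arrays;
import java.util.List;

import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.annotation.ExcelProperty;

public class ExcelWriteDemo {

    // A throwaway entity: @ExcelProperty sets the header text (value)
    // and the zero-based column position (index) of each field.
    public static class Row {
        @ExcelProperty(value = {"序号"}, index = 0)
        private Integer index;

        @ExcelProperty(value = {"县级地区"}, index = 1)
        private String name;

        public Row() {
        }

        public Row(Integer index, String name) {
            this.index = index;
            this.name = name;
        }

        public Integer getIndex() { return index; }
        public void setIndex(Integer index) { this.index = index; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(new Row(1, "昆山市"), new Row(2, "江阴市"));
        // One call writes the header row plus the data rows into a new .xlsx file.
        EasyExcel.write("demo.xlsx", Row.class).sheet("demo").doWrite(rows);
    }
}
```

One detail worth noting: the article's entities start at index = 1, which leaves column 0 of the generated sheet empty; starting at index = 0, as here, fills the sheet from the first column.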

import java.io.Serializable;

import com.alibaba.excel.annotation.ExcelProperty;

public class Gdp extends CountyBase implements Serializable {

    private static final long serialVersionUID = 5265057372502768147L;

    @ExcelProperty(value = {"GDP(亿元)"}, index = 4)
    private String gdp;

    public String getGdp() {
        return gdp;
    }

    public void setGdp(String gdp) {
        this.gdp = gdp;
    }

    public Gdp(Integer index, String name, String province, String gdp) {
        super(index, name, province);
        this.gdp = gdp;
    }

    public Gdp(Integer index, String name, String province) {
        super(index, name, province);
    }

}

import java.io.Serializable;

import com.alibaba.excel.annotation.ExcelProperty;

public class Gpbr extends CountyBase implements Serializable {

    private static final long serialVersionUID = 8612514686737317620L;

    @ExcelProperty(value = {"一般公共预算收入(亿元)"}, index = 4)
    private String gpbr; // general public budget revenue

    public String getGpbr() {
        return gpbr;
    }

    public void setGpbr(String gpbr) {
        this.gpbr = gpbr;
    }

    public Gpbr(Integer index, String name, String province, String gpbr) {
        super(index, name, province);
        this.gpbr = gpbr;
    }

    public Gpbr(Integer index, String name, String province) {
        super(index, name, province);
    }
}

4. Actual crawling

The following code grabs and converts the GDP data. If you are not familiar with Jsoup, it is worth reviewing its selector syntax first; if you have jQuery experience, you will pick up Jsoup very quickly.


static void grabGdp() {
    String target = "https://www.maigoo.com/news/665462.html";
    try {
        Document doc = Jsoup.connect(target)
                .ignoreContentType(true)
                .userAgent(FetchCsdnCookie.ua[1]) // user-agent string taken from the author's helper class
                .timeout(300000)
                .header("referer", "https://www.maigoo.com")
                .get();
        Elements elements = doc.select("#t_container > div:eq(3) table tr");
        List<Gdp> list = new ArrayList<Gdp>();
        // Start at 1 to skip the header row (row 0).
        for (int i = 1; i < elements.size(); i++) {
            Element tr = elements.get(i);
            Elements tds = tr.select("td");
            Integer index = Integer.valueOf(tds.get(0).text());
            String name = tds.get(1).text();
            String province = tds.get(2).text();
            String gdp = tds.get(3).text();
            list.add(new Gdp(index, name, province, gdp));
        }
        String fileName = "E:/gdptest/2023全国百强县GDP排行榜.xlsx";
        EasyExcel.write(fileName, Gdp.class).sheet("GDP百强榜").doWrite(list);
        System.out.println("完成..."); // done
    } catch (Exception e) {
        System.out.println(e.getMessage());
        System.out.println("发生异常,继续下一轮循环"); // exception occurred, continue with the next round
    }
}

What deserves attention here is how web page elements are located and captured in Jsoup. The code above uses a jQuery-like DOM selector:

Elements elements = doc.select("#t_container > div:eq(3) table tr");

This line selects every tr under the table; the loop then reads each td of a row to extract the corresponding fields.
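The selector semantics can be verified offline against an inline HTML snippet instead of the live site. In this sketch the markup is made up (the real page's structure differs), but it shows that :eq(n) picks the n-th sibling, zero-based, just like jQuery, and that tr/td iteration works as in grabGdp:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorDemo {

    public static void main(String[] args) {
        // Made-up markup mimicking the target page: several divs under a
        // container, with the table sitting in the second div (index 1).
        String html = "<div id='t_container'>"
                + "<div>intro text</div>"
                + "<div><table>"
                + "<tr><th>序号</th><th>县级地区</th></tr>"
                + "<tr><td>1</td><td>昆山市</td></tr>"
                + "<tr><td>2</td><td>江阴市</td></tr>"
                + "</table></div>"
                + "</div>";
        Document doc = Jsoup.parse(html);

        // :eq(1) selects the second child div (zero-based sibling index).
        Elements rows = doc.select("#t_container > div:eq(1) table tr");
        System.out.println(rows.size()); // header row + 2 data rows = 3

        // Skip row 0 (the header) and read the td cells of each data row.
        for (int i = 1; i < rows.size(); i++) {
            Element tr = rows.get(i);
            Elements tds = tr.select("td");
            System.out.println(tds.get(0).text() + " " + tds.get(1).text());
        }
    }
}
```

Testing the selector against a fixed string like this is also a cheap way to catch a wrong :eq index before pointing the crawler at the live page.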

3. Process analysis and results

1. Collection process analysis

Here we analyze the page by stepping through the source program in a debugger, using Jsoup to simulate web page access.


The select(...) method is used to obtain the page elements.


Then the td cell data under each tr is read.


2. Operation results

After running the code above, the following two files appear on the target disk:


Opening the two Excel files shows that the target data has been collected, with rows in exactly the same order as on the web page.


Summary

That covers the main content of this article. Using Java as the programming language, it explained in detail how to crawl web pages with Jsoup and, combined with EasyExcel, how to convert HTML tables into Excel files, with complete sample code throughout. Given the hasty writing, errors are inevitable; criticism, corrections, and exchanges are welcome.


Original text: blog.csdn.net/yelangkingwuzuhu/article/details/130901172


Origin blog.csdn.net/weixin_44030143/article/details/131181369