Hands-On: Building a Java Crawler by Hand - Based on the JDK 11 Native HttpClient (1)

Table of contents

Required environment and components

Development environment configuration 

pom.xml

application.properties

Relational database script

Program structure planning


Most Java crawlers used to rely on Apache HttpClient. HttpClient 5, released in 2020, differs substantially from HttpClient 4: the jar package structure changed, so upgrading older programs is not straightforward. Meanwhile, the JDK has shipped its own HttpClient (the java.net.http package) as a standard API since JDK 11. We might as well wrap the JDK 11 HttpClient and use it in place of the third-party library; it should be enough for everyday use.
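
To make the idea concrete, here is a minimal sketch of fetching a page with the JDK's java.net.http API. The class name PageFetcher, the timeouts, the User-Agent value, and the example URL are all assumptions for illustration; the wrapper actually used in this series is not shown in this part.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Hypothetical wrapper name, used only for this sketch.
class PageFetcher {

    // One shared client: once built, an HttpClient can be reused for many requests.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10)) // assumed value
            .build();

    // Builds a GET request with a per-request timeout and an assumed User-Agent.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30)) // assumed value
                .header("User-Agent", "Mozilla/5.0 (compatible; demo-crawler)")
                .GET()
                .build();
    }

    // Sends the request synchronously and returns the body decoded as UTF-8.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpResponse<String> response = CLIENT.send(
                buildRequest(url),
                HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status "
                    + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: PageFetcher <url>");
            return;
        }
        String html = fetch(args[0]);
        System.out.println(html.length() + " characters fetched");
    }
}
```

Keeping request construction separate from sending makes it easy to check headers and timeouts without touching the network.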

Required environment and components

Basic environment:

JDK11

SpringBoot 2.6.4

Other (not required):

Jsoup 1.14.3 (HTML content parsing tool)

Netty 4.1.74 (WebSocket messaging)

MySQL 8.0.28 (persistent database)

SqlToy 5.1.32 (data persistence middleware)

Thymeleaf (via spring-boot-starter-thymeleaf 2.6.4; HTML page template engine)

Development environment configuration 

Let's take a look at the pom.xml configuration first.

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.6.4</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.vtarj</groupId>
    <artifactId>Pythagoras</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>Pythagoras</name>
    <description>Pythagoras</description>
    <properties>
        <java.version>11</java.version>
        <sqltoy.version>5.1.32</sqltoy.version>
        <jsoup.version>1.14.3</jsoup.version>
        <mysql.version>8.0.28</mysql.version>
    </properties>
    <dependencies>
        <!-- SpringBoot Web -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!-- Thymeleaf template engine -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-thymeleaf</artifactId>
        </dependency>

        <!-- Netty: a solid framework for providing WebSocket services -->
        <dependency>
            <groupId>io.netty</groupId>
            <artifactId>netty-all</artifactId>
        </dependency>
        <!-- Jsoup: a Java library for parsing HTML -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>${jsoup.version}</version>
        </dependency>

        <!-- Hot reload for the development environment -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>

        <!-- MySQL driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.version}</version>
        </dependency>

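        <!-- JDBC starter; Spring Boot's built-in Hikari connection pool depends on it.
             No version is given here because spring-boot-starter-parent manages it
             (the original referenced an undefined ${jdbc.version} property). -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-jdbc</artifactId>
        </dependency>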
        <!-- Database access tool: sagacity-sqltoy-starter -->
        <dependency>
            <groupId>com.sagframe</groupId>
            <artifactId>sagacity-sqltoy-starter</artifactId>
            <version>${sqltoy.version}</version>
        </dependency>

        <!-- Generates getter/setter methods automatically -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
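        <!-- SqlToy recommends keeping sql.xml files next to the business code for modular management and product extraction, so all packages must be scanned at build time -->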
        <resources>
            <resource>
                <directory>src/main/java</directory>
                <excludes>
                    <exclude>**/*.java</exclude>
                </excludes>
                <includes>
                    <include>**/*.xml</include>
                </includes>
            </resource>
            <resource>
                <directory>src/main/resources</directory>
                <includes>
                    <include>**/*.xml</include>
                    <include>**/*.properties</include>
                    <include>**/*.yml</include>
                    <include>**/*.sql</include>
                    <include>**/*.jpg</include>
                    <include>**/*.html</include>
                    <include>**/*.js</include>
                    <include>**/*.png</include>
                </includes>
            </resource>
        </resources>
        <testResources>
            <testResource>
                <directory>src/test/java</directory>
                <excludes>
                    <exclude>**/*.java</exclude>
                </excludes>
                <includes>
                    <include>**/*.xml</include>
                </includes>
            </testResource>
            <testResource>
                <directory>src/test/resources</directory>
                <includes>
                    <include>**/*.xml</include>
                    <include>**/*.properties</include>
                    <include>**/*.yml</include>
                    <include>**/*.sql</include>
                    <include>**/*.jpg</include>
                    <include>**/*.html</include>
                    <include>**/*.js</include>
                    <include>**/*.png</include>
                </includes>
            </testResource>
        </testResources>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <version>2.6.4</version>
                <configuration>
                    <excludes>
                        <exclude>
                            <groupId>org.projectlombok</groupId>
                            <artifactId>lombok</artifactId>
                        </exclude>
                    </excludes>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

Next, the application.properties configuration.

application.properties

### Server encoding configuration
server.servlet.encoding.enabled=true
server.servlet.encoding.charset=UTF-8

### Server port configuration
server.port=8000

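# Static resource URL pattern; by default static resources are looked up under /public, /static and /resources, so those need no configuration
spring.mvc.static-path-pattern=/**
# Static resource locations; separate multiple locations with commas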
spring.web.resources.static-locations=classpath:/META-INF/resources/, classpath:/resources/, classpath:/static/, classpath:/public/, classpath:/views/

# Thymeleaf template engine configuration
spring.thymeleaf.prefix=classpath:/views/page/
spring.thymeleaf.suffix=.html

# Custom Netty configuration
netty.websocket.ip=0.0.0.0
netty.websocket.port=7251
netty.websocket.max-size=10240
netty.websocket.path=/channel

### Hikari connection pool configuration (Spring Boot uses Hikari as its default pool)
## Basic database settings
spring.datasource.type=com.zaxxer.hikari.HikariDataSource
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/pythagoras?serverTimezone=Asia/Shanghai&useUnicode=true&character_set_server=utf8mb4&useSSL=false&allowPublicKeyRetrieval=true
spring.datasource.username=root
spring.datasource.password=root
## Hikari tuning
# Connection pool name
spring.datasource.hikari.pool-name=DefaultDataSource
# Minimum number of idle connections
spring.datasource.hikari.minimum-idle=5
# Idle connection timeout (ms)
spring.datasource.hikari.idle-timeout=180000
# Maximum connection lifetime (ms)
spring.datasource.hikari.max-lifetime=600000
# Maximum number of connections
spring.datasource.hikari.maximum-pool-size=10
# Auto-commit
spring.datasource.hikari.auto-commit=true
# Connection timeout (ms)
spring.datasource.hikari.connection-timeout=30000
# Connection test query
spring.datasource.hikari.connection-test-query=SELECT 1 FROM DUAL

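### SqlToy database access tool configuration
# Enable debug mode, which prints the executed SQL statements
spring.sqltoy.debug=true
# Automatic cross-database function conversion, enabled by default
spring.sqltoy.function-converts=default
# SQL print timeout (ms)
spring.sqltoy.print-sql-timeout-millis=30000
# Root path for scanning SQL files; all packages under it are scanned
spring.sqltoy.sql-resources-dir=classpath:com/vtarj/pythagoras
# SqlToy cache-translation config; can be omitted when the default path (classpath:sqltoy-translate.xml) is used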
spring.sqltoy.translate-config=classpath:sqltoy-translate.xml
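
The four netty.websocket.* keys above are custom, so the application has to read them itself. In a Spring Boot project this would normally go through Spring's own property binding; the sketch below uses only java.util.Properties to show the idea without extra dependencies. The class name NettyWebSocketSettings and the fallback defaults (copied from the values above) are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for the custom netty.websocket.* keys.
class NettyWebSocketSettings {
    final String ip;
    final int port;
    final int maxSize;
    final String path;

    NettyWebSocketSettings(String ip, int port, int maxSize, String path) {
        this.ip = ip;
        this.port = port;
        this.maxSize = maxSize;
        this.path = path;
    }

    // Reads the four keys, falling back to the values used in this article.
    static NettyWebSocketSettings from(Properties props) {
        return new NettyWebSocketSettings(
                props.getProperty("netty.websocket.ip", "0.0.0.0"),
                Integer.parseInt(props.getProperty("netty.websocket.port", "7251")),
                Integer.parseInt(props.getProperty("netty.websocket.max-size", "10240")),
                props.getProperty("netty.websocket.path", "/channel"));
    }

    // Parses properties-file text; handy for trying the mapping without a file.
    static NettyWebSocketSettings parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return from(props);
    }

    public static void main(String[] args) {
        NettyWebSocketSettings s = parse(
                "netty.websocket.port=7251\nnetty.websocket.path=/channel\n");
        System.out.println("ws://" + s.ip + ":" + s.port + s.path);
    }
}
```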

We will use MySQL 8.0 as the data store and take the administrative division data published by the National Bureau of Statistics as the crawl target. Let's start our crawler journey.

For installing MySQL 8.0, see "MySQL Automatic Installation Batch Script"; it is not covered here.

The latest administrative divisions from the National Bureau of Statistics can be found in "National Statistical Division Codes and Urban-Rural Division Codes (2021)".

Relational database script

/*
 Navicat Premium Data Transfer

 Source Server         : MySQL
 Source Server Type    : MySQL
 Source Server Version : 80028
 Source Host           : localhost:3306
 Source Schema         : pythagoras

 Target Server Type    : MySQL
 Target Server Version : 80028
 File Encoding         : 65001

 Date: 15/04/2022 10:10:41
*/

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for sc_region
-- ----------------------------
DROP TABLE IF EXISTS `sc_region`;
CREATE TABLE `sc_region`  (
  `R_ID` varchar(22) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL COMMENT 'Primary key ID',
  `R_CODE` varchar(35) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'Administrative division code',
  `R_TYPE` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'Administrative division type',
  `R_NAME` varchar(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'Administrative division name',
  `R_LEVEL` int(0) NULL DEFAULT NULL COMMENT 'Administrative division level',
  `R_URL` varchar(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'Administrative division URL',
  `P_CODE` varchar(35) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'Parent administrative division code',
  `P_URL` varchar(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'Parent administrative division URL',
  PRIMARY KEY (`R_ID`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci COMMENT = 'System Center Region Table: system administrative division information table' ROW_FORMAT = Dynamic;

SET FOREIGN_KEY_CHECKS = 1;
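
As a rough illustration of how one sc_region row might look on the Java side, here is a plain class whose fields mirror the columns. It is only a sketch: the class name Region, the child() helper, the sample values, and the assumptions that child links are relative to the parent page and that the level grows by one per hop are not taken from the original project, which uses SqlToy for persistence.

```java
import java.net.URI;

// Hypothetical plain-Java view of one sc_region row.
class Region {
    final String id;         // R_ID
    final String code;       // R_CODE
    final String type;       // R_TYPE
    final String name;       // R_NAME
    final int level;         // R_LEVEL
    final String url;        // R_URL
    final String parentCode; // P_CODE
    final String parentUrl;  // P_URL

    Region(String id, String code, String type, String name, int level,
           String url, String parentCode, String parentUrl) {
        this.id = id;
        this.code = code;
        this.type = type;
        this.name = name;
        this.level = level;
        this.url = url;
        this.parentCode = parentCode;
        this.parentUrl = parentUrl;
    }

    // Builds a child row from a link found on this region's page.
    // Assumption: href is relative to this region's URL, and level = parent level + 1.
    Region child(String id, String code, String type, String name, String href) {
        String childUrl = (href == null || url == null)
                ? null
                : URI.create(url).resolve(href).toString();
        return new Region(id, code, type, name, level + 1, childUrl, this.code, this.url);
    }

    public static void main(String[] args) {
        // Placeholder values, not real crawl data.
        Region root = new Region("r0", "0", null, "root", 0,
                "https://example.com/2021/index.html", null, null);
        Region child = root.child("r1", "11", null, "sample", "11.html");
        System.out.println(child.level + " " + child.url);
    }
}
```

Filling P_CODE and P_URL from the parent at creation time keeps each row self-describing, so the hierarchy can be rebuilt later without re-crawling.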

Program structure planning

Please ignore the masked-out content; it has nothing to do with this article.

To be continued~~~ 

Next: Hands-On: Building a Java Crawler by Hand - Based on the JDK 11 Native HttpClient (2)

Origin blog.csdn.net/Asgard_Hu/article/details/124592738