Incremental database data synchronization with NiFi

Notes

NiFi version: 1.23.2 (Docker image)

Background

Synchronize data from a source database to a target database, incrementally picking up both newly inserted rows and previously existing rows that have been modified.

Simulated data

Create table statement

The source and target tables should share the same structure, so that no separate conversion step is needed later.

-- Create the test table
CREATE TABLE `sys_user` (
  `id` bigint NOT NULL AUTO_INCREMENT COMMENT 'User ID',
  `name` varchar(50) NOT NULL DEFAULT '' COMMENT 'Name',
  `age`  int NOT NULL DEFAULT 0 COMMENT 'Age',
  `gender` tinyint NOT NULL COMMENT 'Gender, 1: male, 0: female',
  `create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation time',
  `modify_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Modification time',
  `is_deleted` tinyint NOT NULL DEFAULT '0' COMMENT 'Whether deleted (soft-delete flag)',
  PRIMARY KEY (`id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC COMMENT='User table';

Test Data

-- Simulated data
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 1', 20, 1);
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 2', 21, 1);
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 3', 21, 0);
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 4', 18, 0);
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 5', 22, 1);

Complete test data

Configure database connection pool

Right-click on an empty space on the canvas and select Configure

Add a new configuration

Click the + sign in the pop-up interface to add a new database connection pool configuration. If you already have one, you can skip this step.

Filter for the desired connection pool type in the pop-up interface. DBCPConnectionPool is selected here; then click ADD.

Click the small gear to the right of the newly added entry to configure the connection pool.

Configure connection pool related properties 

The main properties to configure are listed below; adjust the others as needed. The password is masked once entered.
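For a MySQL source like the one above, the key DBCPConnectionPool properties would look roughly as follows. The host, database name, credentials, and driver path are placeholders for your own environment, and the driver class assumes MySQL Connector/J 8.x:

```
Database Connection URL     : jdbc:mysql://<host>:3306/<source_database>
Database Driver Class Name  : com.mysql.cj.jdbc.Driver
Database Driver Location(s) : /path/to/mysql-connector-j-<version>.jar
Database User               : <username>
Password                    : <password>
```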

Check the properties

To verify that the configuration is correct, click the check mark in the upper right corner, then click VERIFY in the pop-up interface.

If the verification passes, every item is shown in green; any item that fails shows a prompt. Finally, click APPLY.

(Optional) Give the configuration a name

To make later use easier, give the connection pool a name; otherwise it becomes hard to tell configurations apart once there are many of them.

Activate the connection pool configuration

Click the lightning icon on the right to activate the configuration, click ENABLE on the new page, and finally click CLOSE.

Activated configuration

Add the connection pool configuration for the target database in the same way; the steps are identical and are not repeated here. Once both are configured there are two connection pool configurations, as follows:

Get database table data

Add processor: QueryDatabaseTable

Click Processor on the toolbar, drag it to the canvas, filter for the QueryDatabaseTable processor, and click ADD to add it to the canvas.

Configure the processor: QueryDatabaseTable

Double-click the processor, switch to the PROPERTIES tab, and configure the following content

Maximum-value Columns: the official documentation describes this property as a comma-separated list of column names. Since it started running, the processor keeps track of the maximum value returned for each of these columns. When multiple columns are used, the list is ordered: each column's value is expected to increase more slowly than that of the columns before it, so multiple columns imply a column hierarchy, as is often the case for partitioned tables. This allows the processor to retrieve only those rows that have been added or updated since the last retrieval. Note that some JDBC types (such as bit/boolean) are not well suited to maintaining maximum values; columns of these types should not be listed in this property and will cause errors during processing. If no columns are provided, all rows in the table are considered, which may impact performance. NOTE: it is important to use consistent maximum-value column names for a given table so that incremental fetches work properly.
Supports expression language: true
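The mechanism can be sketched in a few lines of Python. This is a simplified model, not NiFi's actual implementation: it uses SQLite instead of MySQL, a plain dict in place of NiFi's state store, and assumes `modify_time` is the maximum-value column for the `sys_user` table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sys_user (id INTEGER PRIMARY KEY, name TEXT, modify_time TEXT)")
conn.executemany(
    "INSERT INTO sys_user (name, modify_time) VALUES (?, ?)",
    [("Test data 1", "2023-09-01 10:00:00"), ("Test data 2", "2023-09-01 10:05:00")],
)

state = {}  # stands in for NiFi's per-table processor state

def query_table(conn, state, max_value_column="modify_time"):
    """Fetch only rows whose max-value column exceeds the stored maximum."""
    last = state.get(max_value_column)
    if last is None:
        # First run: no stored maximum yet, so fetch everything.
        rows = conn.execute("SELECT * FROM sys_user").fetchall()
    else:
        # Subsequent runs: fetch only rows beyond the remembered maximum.
        rows = conn.execute(
            f"SELECT * FROM sys_user WHERE {max_value_column} > ?", (last,)
        ).fetchall()
    if rows:
        # Remember the largest value seen for the next run.
        state[max_value_column] = max(r[2] for r in rows)
    return rows

first = query_table(conn, state)   # initial run returns all rows
conn.execute("INSERT INTO sys_user (name, modify_time) VALUES (?, ?)",
             ("Test data 3", "2023-09-01 10:10:00"))
second = query_table(conn, state)  # next run returns only the new row
```

Because `modify_time` is updated on every row change (via `ON UPDATE CURRENT_TIMESTAMP`), both inserts and updates push rows past the stored maximum and get picked up on the next run.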

Check the properties

Give the processor a name that indicates its role in the overall workflow

Split data

Add Processor: SplitAvro

Configure the processor: SplitAvro

Double-click the processor and switch to the PROPERTIES tab. All properties can be left at their defaults.

Data storage

Add processor: PutDatabaseRecord

Configure the processor: PutDatabaseRecord

Double-click the processor and switch to the PROPERTIES tab
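As a sketch, the key properties for this flow would look roughly like the following. The values are assumptions based on the setup above, not taken from the article's screenshots; in particular, Statement Type is assumed to be UPSERT so that re-synced modified rows update existing primary keys instead of failing as duplicate-key inserts:

```
Record Reader                       : AvroReader
Database Type                       : MySQL
Statement Type                      : UPSERT
Database Connection Pooling Service : <target database connection pool>
Table Name                          : sys_user
```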

Add a Record Reader

Configure AvroReader

Click the arrow on the right, select the Reader you just configured in the pop-up interface, and then click the small gear on the right

In the pop-up interface, configure it according to your own needs; the defaults are kept here.

Activate the Reader

Click the lightning icon on the right to activate it

Once activated, the status changes to Enabled

Other configuration

Check the properties

Connect all processors

Connect processor

Connect the QueryDatabaseTable and SplitAvro processors, and check success under For Relationships.

Connect the SplitAvro and PutDatabaseRecord processors, and check split under For Relationships.

Handle the SplitAvro processor warnings

Double-click the SplitAvro processor, switch to RELATIONSHIPS, check the two options below, and click APPLY

Handle the PutDatabaseRecord processor warnings

Double-click the PutDatabaseRecord processor, switch to RELATIONSHIPS, check the options below, and click APPLY

Full configuration

Start all processors

By default, the QueryDatabaseTable processor runs once per minute; this can be changed on the SCHEDULING tab. The default schedule is used here.
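On the SCHEDULING tab, this corresponds to settings along these lines, matching the once-a-minute execution used in this article:

```
Scheduling Strategy : Timer driven
Run Schedule        : 1 min
```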

Right-click on a blank spot on the canvas and select Start to start all processors.

 

View target database data

After waiting one minute, check the target database: the 5 rows from the source database have been synchronized to the target database.

Modify data in the source database

UPDATE sys_user SET is_deleted = 1 WHERE id = 1;
UPDATE sys_user SET is_deleted = 1 WHERE id = 4;
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 6', 22, 1);

View the target database data again

After waiting for the processor to run again, check the target database: the new and modified data has been synchronized.

You can see that a total of 8 records flowed into the last processor.

Conclusion

The above is the whole process of using NiFi to synchronize database data incrementally. If you have any questions, please leave a comment in the comments section.

Origin: blog.csdn.net/LSW_JAVADP/article/details/132691708