Overview
NiFi version: 1.23.2 (Docker image)
Background
Synchronize data from a source database to a target database, with incremental synchronization of both newly added rows and historically modified rows.
Sample data
Table definition
The source and target tables use the same structure, which avoids a separate conversion step later.
-- Create the test table
CREATE TABLE `sys_user` (
`id` bigint NOT NULL AUTO_INCREMENT COMMENT 'User ID',
`name` varchar(50) NOT NULL DEFAULT '' COMMENT 'Name',
`age` int NOT NULL DEFAULT 0 COMMENT 'Age',
`gender` tinyint NOT NULL COMMENT 'Gender, 1: male, 0: female',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Creation time',
`modify_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'Modification time',
`is_deleted` tinyint NOT NULL DEFAULT '0' COMMENT 'Deleted flag',
PRIMARY KEY (`id`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci ROW_FORMAT=DYNAMIC COMMENT='User table';
Test Data
-- Sample data
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 1', 20, 1);
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 2', 21, 1);
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 3', 21, 0);
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 4', 18, 0);
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 5', 22, 1);
Complete test data
Configure database connection pool
Right-click an empty area of the canvas and select Configure.
New configuration
In the pop-up dialog, click the + sign to add a new database connection pool configuration. If you already have one, skip this step.
Filter for the connection pool type you need in the pop-up dialog. Here I select DBCPConnectionPool, then click ADD.
Click the small gear icon to the right of the newly added entry to configure the connection pool.
Configure the connection pool properties
The main properties to configure are listed below; adjust the others to your environment. The password is masked once entered.
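As a concrete sketch (the host, port, database name, and driver path below are placeholders for this example, not values from the original setup), the key DBCPConnectionPool properties for a MySQL source might look like:

```
Database Connection URL             : jdbc:mysql://192.168.1.10:3306/source_db
Database Driver Class Name          : com.mysql.cj.jdbc.Driver
Database Driver Location(s)         : /opt/nifi/drivers/mysql-connector-j-8.0.33.jar
Database User                       : root
Password                            : ******
```

The driver JAR must be readable from inside the NiFi container, for example by mounting it as a Docker volume.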
Check the properties
To verify the configuration, click the check mark in the upper-right corner, then click VERIFY in the pop-up dialog.
If verification passes, every item is shown in green; any item that fails shows a message explaining why. Finally, click APPLY.
(Optional) Give the configuration a name
To make later use easier, give the connection pool a name; with many configurations, unnamed ones are hard to tell apart.
Enable the connection pool configuration
Click the lightning icon on the right, click ENABLE on the page that opens, and finally click CLOSE.
Enabled configuration
Add the connection pool configuration for the target database in the same way; the steps are identical and are not repeated here. When done, there are two connection pool configurations, as follows:
Get database table data
Add processor: QueryDatabaseTable
Click Processor in the toolbar, drag it onto the canvas, filter for QueryDatabaseTable, then click ADD to place it on the canvas.
Configuration Processor: QueryDatabaseTable
Double-click the processor, switch to the PROPERTIES tab, and configure the following:
Maximum-value Columns: the official documentation describes this as a comma-separated list of column names. The processor tracks the largest value returned for each listed column since it started running. When several columns are given, the list is ordered: each column's values are expected to increase more slowly than those of the columns before it, so multiple columns imply a hierarchy, as is common with partitioned tables. This allows the processor to fetch only rows that have been added or updated since the last run. Note that some JDBC types (such as bit/boolean) are unsuitable for maximum-value tracking; columns of those types should not be listed here and will cause errors during processing. If no columns are provided, every row in the table is considered on each run, which can hurt performance. NOTE: use consistent maximum-value column names for a given table so that incremental fetches work correctly.
Supports expression language: true
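With Maximum-value Columns set to modify_time for this table, the processor behaves as if it ran queries like the following (a sketch: the actual SQL is generated internally, the stored maximum lives in NiFi's processor state, and the timestamp below is made up for illustration):

```sql
-- First run: no state yet, so every row is returned
SELECT * FROM sys_user;

-- Later runs: only rows past the stored maximum are returned
SELECT * FROM sys_user
WHERE modify_time > '2023-09-01 10:00:00';
```

Because modify_time is declared ON UPDATE CURRENT_TIMESTAMP, an UPDATE pushes a row past the stored maximum again, which is how historically modified rows are re-fetched along with new ones.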
Check the properties
Give the processor a name that indicates its role in the overall workflow.
Split data
Add Processor: SplitAvro
Configuration Processor: SplitAvro
Double-click the processor and switch to the PROPERTIES tab. All properties are left at their defaults.
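For reference, the two SplitAvro properties that matter here, at their default values:

```
Split Strategy : Record
Output Size    : 1
```

With Output Size 1, each Avro record produced by QueryDatabaseTable becomes its own FlowFile, which PutDatabaseRecord then writes individually.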
Data storage
Add processor: PutDatabaseRecord
Configuration Processor: PutDatabaseRecord
Double-click the processor and switch to the PROPERTIES tab
Add a Record Reader
Configure AvroReader
Click the arrow on the right, select the Reader just created in the pop-up dialog, then click the small gear icon on the right.
In the pop-up dialog, configure it to suit your needs; the defaults are used here.
Enable the Reader
Click the lightning icon on the right to enable it.
Once enabled, the status changes to Enabled.
Other configuration
Check the properties
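A sketch of the PutDatabaseRecord properties used in this kind of flow (the labels are the processor's own property names; the Statement Type value is an assumption on my part, chosen because re-fetched modified rows would otherwise collide with existing primary keys under plain INSERT):

```
Record Reader                       : AvroReader
Database Type                       : MySQL
Statement Type                      : UPSERT
Database Connection Pooling Service : <target database pool>
Table Name                          : sys_user
```

With Database Type MySQL, UPSERT is rendered as INSERT ... ON DUPLICATE KEY UPDATE, so updated source rows overwrite their counterparts in the target table.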
Connect all processors
Connect the processors
Connect QueryDatabaseTable to SplitAvro and check success under For Relationships.
Connect SplitAvro to PutDatabaseRecord and check split under For Relationships.
Resolve the SplitAvro processor warning
Double-click the SplitAvro processor, switch to RELATIONSHIPS, check the two remaining relationships shown below, and click APPLY.
Resolve the PutDatabaseRecord processor warning
Double-click the PutDatabaseRecord processor, switch to RELATIONSHIPS, check the relationships shown below, and click APPLY.
Full configuration
Start all processors
By default, the QueryDatabaseTable processor runs once a minute; this can be changed on the SCHEDULING tab. The default interval is used here.
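For reference, the relevant settings on the SCHEDULING tab of QueryDatabaseTable (shown with the one-minute interval used in this walkthrough):

```
Scheduling Strategy : Timer driven
Run Schedule        : 1 min
```

A CRON driven strategy is also available if synchronization should run at fixed times instead of at a fixed interval.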
Right-click on a blank spot on the canvas and select Start to start all processors.
View target database data
After waiting about a minute, check the target database: the 5 rows from the source database have been synchronized.
Modify data in source database
UPDATE sys_user SET is_deleted = 1 WHERE id = 1;
UPDATE sys_user SET is_deleted = 1 WHERE id = 4;
INSERT INTO sys_user (name, age, gender) VALUES ('Test data 6', 22, 1);
View the target database data again
After the processor runs again, check the target database: the new and modified rows have been synchronized.
In total, 8 records have now flowed into the last processor (the original 5 rows, the 2 updated rows, and the 1 new row).
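A quick check on the target side (assuming the new row received id 6 from AUTO_INCREMENT; adjust the ids if your data differs):

```sql
-- Rows 1 and 4 should now have is_deleted = 1,
-- and the newly inserted row should be present
SELECT id, name, is_deleted, modify_time
FROM sys_user
WHERE id IN (1, 4, 6);
```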
Conclusion
That covers the complete process of using NiFi to incrementally synchronize database data. If you have any questions, please leave a comment below.