RDBMS change data design, capture, and access to a big data platform

Change data processing and capture

In an era of explosive data growth, recording how data changes and evolves, uncovering the patterns inside it, and applying them in production to drive business growth have become a central theme. This article shares my understanding of how to record and process data changes.

1. Storage of changed data

1.1. Overwrite

The changed attribute is simply updated in place, so the row always holds the latest value. This approach destroys the history, which then has to be recovered by other means; we discuss those later in this article.
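A minimal sketch of the overwrite approach, using hypothetical table and column names:

-- Overwrite in place: the previous value of update_col is lost
UPDATE source_table
SET update_col = 'new_value'
WHERE pk_col = 1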

1.2. Adding a new row

When an attribute changes, the original row is left untouched and a new row is added instead. The table's primary key must therefore be a surrogate or composite key, because multiple rows will together describe the same object, and even the same attribute of that object. This approach requires at least three extra columns: a row-effective timestamp, a row-expiry timestamp, and a current-row flag.
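A minimal DDL sketch of this row-versioning design (all table, column names, and types are assumptions):

-- Each change inserts a new version row; row_id is a surrogate key
CREATE TABLE customer_history (
    row_id       BIGINT PRIMARY KEY,      -- surrogate key, unique per version
    customer_id  BIGINT NOT NULL,         -- business key, repeats across versions
    address      VARCHAR(200),            -- the tracked attribute
    valid_from   TIMESTAMP NOT NULL,      -- row-effective timestamp
    valid_to     TIMESTAMP,               -- row-expiry timestamp, NULL while current
    is_current   CHAR(1) DEFAULT 'Y'      -- current-row flag
)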

1.3. Adding a new column

The original value is left unchanged; each changed value is recorded in a new column. This has limited applicability, e.g. schema-free databases, and only when changes are infrequent.

1.4. Adding a new table

Add a new table to record changes. This is generally used when the source table is large and its attributes change rapidly. The new table must maintain a mapping between the changed attributes and the source table. The advantage is that the source table is not invasively modified and writes stay friendly; queries, however, must join the two tables, which has some impact.

1.5. Adding a new table and rewriting the source table

Add a new table to record changes, and also rewrite the modified records in the original table, so the new table purely records the change history. The advantage is that queries against the source table only touch the source table; writes take a certain performance hit.
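A sketch of the separate change table used by methods 1.4 and 1.5, keyed back to the source table (all names are hypothetical):

-- Holds one row per attribute change; the source table's schema stays untouched
CREATE TABLE customer_change (
    change_id    BIGINT PRIMARY KEY,
    customer_id  BIGINT NOT NULL,        -- maps back to the source table's key
    attr_name    VARCHAR(64) NOT NULL,   -- which attribute changed
    old_value    VARCHAR(200),
    new_value    VARCHAR(200),
    changed_at   TIMESTAMP NOT NULL
)
-- In method 1.4, reading current state plus history requires a join with the source table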

2. Capture of change data

In the Storage of changed data section we discussed how to store changes: methods 1.2 through 1.5 all preserve history that can later be captured. If a system has requirements to process the raw change data, these methods should be considered at the start of system design; designing for capture at the source greatly simplifies downstream data processing. For an existing system whose original design never considered change data, the following methods can be used.

2.1 Adding a flag column

Building on 1.1, add a flag column that marks a row as changed, so that downstream systems can capture it. The advantage is that the required modifications are simple: 1. adjust the physical design of the database; 2. make a small adjustment to the business logic of the existing application.

UPDATE source_table
SET update_col = col_value, valid = 1
WHERE pk_col = pk_col_value
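A sketch of how a downstream collector might drain the flag (hypothetical names; note the race this creates, see consideration 1 below):

-- Read all rows currently flagged as changed
SELECT pk_col, update_col FROM source_table WHERE valid = 1

-- Reset the flag; rows updated between the two statements can be missed
UPDATE source_table SET valid = 0 WHERE valid = 1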

Things to consider:

  1. If the source system updates the same row twice within one capture interval but the downstream system does not sense and capture it in time, how should the update be handled? Following the design principle of affecting the original business system as little as possible, the update proceeds normally, but the collector may lose some data.
  2. Write permission on the business database is opened to a system unrelated to the business (to avoid architectural complexity and to keep later scaling options open, the data-acquisition system is generally designed to be business-agnostic). This brings the risk of intrusive writes, i.e. modifying columns other than the flag column.
  3. Considering system performance: the downstream system must scan the flag column, and existing RDBMSs offer no design that makes this scan free. The feasible options are: 1. build a B+Tree index, which is an unfriendly design when the flag column contains massively repeated values; 2. build a bitmap index, which suits highly repetitive values best but greatly hurts write performance, so it only fits tables that are modified infrequently; 3. decide case by case, combining the technology with the actual business. There is no universal rule, and each business system needs its own analysis, but that violates the principle that the acquisition system should stay business-agnostic and minimize access cost. If access is hard at the point where data is generated, the system has a high chance of dying prematurely.
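A sketch of the two index options (Oracle syntax for the bitmap case; table and column names are assumptions):

-- Option 1: B+Tree index; a poor fit when valid has only a few distinct values
CREATE INDEX idx_source_valid ON source_table (valid)

-- Option 2: bitmap index; suits low-cardinality columns,
-- but expensive to maintain under frequent DML
CREATE BITMAP INDEX bix_source_valid ON source_table (valid)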

2.2 Using off-the-shelf database technology

2.2.1 ORACLE

Method 1: As a commercial database, ORACLE ships metadata that completely describes the system. All change operations can be found by querying the metadata views, for example with the following statement:

-- SQL_FULLTEXT: the SQL text of the statement
-- COMMAND_TYPE: command type, 2 = INSERT, 6 = UPDATE, 7 = DELETE
-- ROWS_PROCESSED: number of rows affected
-- LAST_LOAD_TIME: time of the most recent execution
-- FIRST_LOAD_TIME: time of the first execution
SELECT SQL_FULLTEXT, DISK_READS, BUFFER_GETS, ROWS_PROCESSED, COMMAND_TYPE, CPU_TIME, USER_IO_WAIT_TIME,
       PHYSICAL_READ_REQUESTS, PHYSICAL_READ_BYTES, LAST_LOAD_TIME
FROM V$SQL
WHERE SQL_FULLTEXT LIKE '%TBL_TEST%' AND COMMAND_TYPE IN (2, 6, 7) AND ROWS_PROCESSED > 0

REF: ORACLE docs
Method 2: Use table triggers: every write fires the trigger, which identifies and records the update action. The existing open-source framework databus parses Oracle changes in this way.
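A minimal sketch of such a trigger (the table and column names are hypothetical), copying each change into a log table:

-- Fires once per modified row and appends an entry to a change-log table
CREATE OR REPLACE TRIGGER trg_tbl_test_capture
AFTER INSERT OR UPDATE OR DELETE ON TBL_TEST
FOR EACH ROW
BEGIN
    INSERT INTO tbl_test_changes (pk_col, op_type, changed_at)
    VALUES (COALESCE(:NEW.pk_col, :OLD.pk_col),
            CASE WHEN INSERTING THEN 'I' WHEN UPDATING THEN 'U' ELSE 'D' END,
            SYSDATE);
END;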

2.2.2 SQL SERVER

SQL Server has a similar facility: the dynamic management function sys.dm_exec_sql_text, which returns the SQL text for a given handle and is typically joined with sys.dm_exec_query_stats.
REF: SQL SERVER docs
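A hedged T-SQL sketch of the equivalent query (the table name TBL_TEST is carried over from the Oracle example):

-- Find recent DML statements touching TBL_TEST from the plan cache
SELECT st.text, qs.execution_count, qs.last_execution_time
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
WHERE st.text LIKE '%TBL_TEST%'
  AND st.text NOT LIKE '%dm_exec_query_stats%'  -- exclude this query itself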

Oracle method 1 and the SQL Server method share these advantages: 1. they completely reuse existing technology: with JDBC and plain SELECT queries you can find all modifications; 2. they scale within the database without affecting the system's existing design. Because every table's update operations surface in v$sql, there is no need to design per-table processing when onboarding data; one set of SQL covers all update queries. Disadvantages: 1. v$sql must be polled continuously, and the latency is on the order of seconds to minutes, depending on system settings; 2. reading v$sql requires elevated, usually administrator, privileges. The drawback of Oracle method 2 is that triggers add overhead and reduce system throughput, especially under frequent updates (UPDATE, INSERT, DELETE); trigger use requires careful per-table evaluation.
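For disadvantage 2, the privilege can be narrowed instead of handing out full administrator rights; a sketch, run from a DBA account (the account name capture_user is an assumption):

-- V$SQL is a public synonym for the underlying view V_$SQL;
-- grant read access on it to a dedicated capture account
GRANT SELECT ON sys.v_$sql TO capture_user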

2.3 Using database logs

2.3.1 Simple parsing - MySQL

To get plaintext SQL into the binlog, set the following two options:

set binlog_rows_query_log_events=1
set binlog_format=ROW
In my.cnf, configure:
log-bin=<binlog directory and binlog file prefix>

All update operations are then recorded, together with their original statement text, in the files specified by log-bin. The binlog plaintext SQL can be shipped to Kafka with the Kafka Connect FileStream source connector.
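A hedged sketch of the connector configuration (the connector name, file path, and topic are assumptions; the FileStreamSource connector ships with Kafka Connect):

name=mysql-binlog-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
# the decoded plaintext binlog file to tail
file=/var/log/mysql/binlog-plaintext.sql
topic=mysql-binlog-sql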

2.3.2 Complex parsing - MySQL

Here the change stream is obtained by mining the database log itself. Existing open-source frameworks such as OpenReplicator and databus do this; databus parses the MySQL binlog with the help of OpenReplicator.

The two methods above share an advantage: only binlog writing has to be enabled, which burdens the system little, and the downstream program has no impact on the existing system. The simple method additionally parses plaintext SQL; since SQL follows a broadly shared standard, the parser is highly reusable and cheap to maintain later. The complex method parses the binlog's binary format, so the parser must be upgraded alongside MySQL version upgrades, and follow-up maintenance costs are high.

3. Implementation design

In the Capture of change data section we shared how to capture change data when storing historical changes was not considered up front. Weighing the advantages and disadvantages of the methods above, the plan is:

  1. For SQL Server and Oracle, build a stored procedure (it needs read permission on v$sql; if a separate schema or machine is required, it can be exposed as a service and accessed over JDBC), schedule it inside the database as a periodic job, and have it write the data into a history_log table. Grant read permission on history_log; the downstream system connects with the Kafka Connect JDBC source connector, ingests into Kafka, and records the last read offset. The history_log table is designed as follows (a DDL sketch follows this list):
table schema: fino_id, sql_fulltext, exec_time, command_type
fino_id: auto-increment
sql_fulltext: the SQL script that performed the update
exec_time: execution time
command_type: SQL statement type
  2. For MySQL it is simpler: deploy the Kafka Connect FileStream source connector on the host holding the binlog files and ship the binlog log into Kafka.
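A minimal Oracle DDL sketch of history_log under the schema above (column types are assumptions; ORGANIZATION INDEX makes it the index-organized table mentioned in point 5 below):

CREATE TABLE history_log (
    fino_id      NUMBER GENERATED ALWAYS AS IDENTITY,
    sql_fulltext CLOB,            -- the SQL script that performed the update
    exec_time    TIMESTAMP,       -- execution time
    command_type NUMBER,          -- e.g. 2 = INSERT, 6 = UPDATE, 7 = DELETE
    CONSTRAINT pk_history_log PRIMARY KEY (fino_id)
) ORGANIZATION INDEX OVERFLOW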

Summary: The adoption of this scheme is mainly based on the following considerations:

  1. It respects the client's strict database-permission requirements, and leverages the client's accumulated relational-database operations expertise to keep the data source stable.
  2. Upstream and downstream systems are only weakly coupled: even if the downstream system fails, the data still exists at the source and keeps being produced, giving strong fault tolerance for the source data.
  3. It is highly scalable: no separate design is needed per table within a database, nor per database product (specifically SQL Server and Oracle), which reduces onboarding costs.
  4. The ETL of the data can be placed on the data platform for unified cleaning and mining.
  5. history_log uses an index-organized table (IOT), so read and write requests become sequential reads and writes, yielding high read/write performance.
