Data Warehouse Series (5): Data Synchronization

1. Technology roadmap

At a high level, the roadmap covered in this article runs from the simplest approach to the most involved: direct database synchronization, database file synchronization, log parsing synchronization, handling of sharded databases and tables, bulk and real-time synchronization, and finally the supporting tooling (Flume and Kafka) plus the data drift problem.

2. Direct database synchronization

The upstream business systems feeding a data warehouse vary widely, but because of technical inertia most business data is stored in a structured form in MySQL or PostgreSQL. By defining an interface specification, in the form of a synchronization API, and connecting directly to the source database with standard connection methods, synchronization can be implemented against the database itself. This approach is simple to configure and easy to implement, but it puts load on large business systems. Business systems are usually deployed with a primary/standby architecture, so data is typically extracted from the standby database to avoid affecting the online system. Because the standby is itself continuously applying replicated changes, the extraction should be scheduled during the standby's relatively idle periods.
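
As a concrete illustration, here is a minimal sketch of pulling one day of data from a standby/read replica, assuming a MySQL replica and the PyMySQL client; the host, credentials, and table and column names are all made up.

```python
# Minimal sketch: pull one day of rows from a MySQL read replica using
# PyMySQL. Host, credentials, and table/column names are illustrative.
import csv
import pymysql

conn = pymysql.connect(
    host="replica-db.internal",            # standby/read replica, not the primary
    user="etl_reader",
    password="******",
    database="orders_db",
    cursorclass=pymysql.cursors.SSCursor,  # server-side cursor: stream rows, keep memory flat
)

try:
    with conn.cursor() as cur, open("orders_20200101.csv", "w", newline="") as out:
        writer = csv.writer(out)
        cur.execute(
            "SELECT order_id, user_id, amount, modified_time "
            "FROM orders WHERE modified_time >= %s AND modified_time < %s",
            ("2020-01-01", "2020-01-02"),
        )
        for row in cur:                    # rows are streamed, not loaded all at once
            writer.writerow(row)
finally:
    conn.close()
```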

3. Database file synchronization

Because direct database synchronization, no matter how it is handled, has some impact on the business system, in many cases we optimize by synchronizing data files instead. The file encoding, size, format and so on are agreed in advance, the source system generates plain-text data files directly, and a dedicated transfer tool synchronizes them to the processing server, which improves transfer efficiency. Two problems come up with this approach: network fluctuations can cause packet loss, and large source files need to be compressed for transmission. For this reason a verification file is usually uploaded alongside the data file, recording the record count, file size and so on, so that the accuracy of the synchronized data can be checked.
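
A minimal sketch of the file side of this, assuming the extracted file from the previous example: compress it and write a small verification manifest (row count, byte size, checksum) that the receiving end can check; the file names and manifest layout are illustrative, not a standard.

```python
# Sketch: gzip the extracted data file and write a companion manifest
# (row count, compressed size, MD5) that the receiving side can verify.
# File names and the manifest layout are illustrative.
import gzip
import hashlib
import json
import os
import shutil

src = "orders_20200101.csv"
dst = src + ".gz"

# count rows before compressing
with open(src, "rb") as f:
    row_count = sum(1 for _ in f)

# compress for transfer
with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# checksum of the compressed file
md5 = hashlib.md5()
with open(dst, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)

manifest = {
    "file": dst,
    "rows": row_count,
    "bytes": os.path.getsize(dst),
    "md5": md5.hexdigest(),
}
with open(dst + ".manifest.json", "w") as f:
    json.dump(manifest, f)
```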

4. Database log parsing synchronization

As database technology has matured, most mainstream databases can be recovered from their log files; because the log contains complete information and its format is very stable, the changed data can be obtained simply by parsing the database log files and then applied to the offline system, which maximizes synchronization efficiency.

Take Alibaba's Canal component as an example. Alibaba deployed dual data centers in Hangzhou and the United States early on, which created a need for cross-datacenter synchronization. Starting in 2010, its teams gradually moved to parsing database logs to obtain incremental changes for synchronization, and a large number of database subscription and incremental consumption services grew out of this. The principle of MySQL master-slave replication is: the MySQL master writes data changes to its binary log; the MySQL slave copies the master's binary log into its own relay log; the slave then replays the events in the relay log, applying the changes to its own data. Canal builds on this mechanism: it emulates the MySQL slave interaction protocol, disguises itself as a MySQL slave, and sends a dump request to the MySQL master; when the master receives the dump request, it starts pushing the binary log to Canal, and Canal parses the binary log to obtain the change data.
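
Canal itself is a Java service, but the same idea (pretend to be a replica, ask for the binlog, parse the row events) can be illustrated in Python with the python-mysql-replication package; the connection settings and server_id below are assumptions.

```python
# Illustration of binlog-based change capture using python-mysql-replication
# (Canal is a Java service that implements the same replica protocol).
# Connection settings and server_id are made up.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={
        "host": "primary-db.internal",
        "port": 3306,
        "user": "repl_user",
        "passwd": "******",
    },
    server_id=1001,          # must be unique among the primary's replicas
    blocking=True,           # keep waiting for new binlog events
    resume_stream=True,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
)

for event in stream:         # each event is one committed row-level change
    for row in event.rows:
        print(event.table, type(event).__name__, row)

stream.close()
```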

Because log parsing can achieve near-real-time synchronization with very little impact on the business system, it is widely used to synchronize incremental data from operational systems into the data warehouse. It is worth noting that, because data warehouses have relatively poor support for update operations, an update is usually simulated by deleting the old row and then inserting the new one.

A fairly common situation is that the source operational system changes frequently and the same record is updated multiple times, so the change records for a key need to be sorted in reverse order by the database id (or change timestamp) to obtain the last change state. In addition, business data is sometimes not deleted in the true sense: the business system only performs a logical delete as needed and the data remains, so how these records are handled depends on the specific business logic and the statistics the product requires.
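
A sketch of the "keep only the last change per key" step, written as HiveQL submitted through the hive CLI; the table and column names (ods_orders_inc, id, modified_time, is_deleted) are assumptions, and whether logically deleted rows are filtered out is a business decision.

```python
# Sketch: keep only the last change per primary key for the day, and
# optionally drop logically deleted rows. Table and column names are
# illustrative.
import subprocess

dedup_sql = """
INSERT OVERWRITE TABLE ods_orders_latest PARTITION (dt = '2020-01-01')
SELECT id, user_id, amount, modified_time, is_deleted
FROM (
    SELECT id, user_id, amount, modified_time, is_deleted,
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY modified_time DESC) AS rn
    FROM ods_orders_inc
    WHERE dt = '2020-01-01'
) t
WHERE rn = 1
  -- dropping logically deleted rows here is a business decision
  AND is_deleted = 0;
"""

subprocess.run(["hive", "-e", dedup_sql], check=True)
```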

This approach has its own problems, mainly data latency, large processing volumes, and data drift; building the intermediate system also requires some development work to eliminate data inconsistencies.

5. Handling sharded databases and tables

As the business keeps growing, the data in the business systems keeps growing too, and sharding into multiple databases and tables becomes an almost inevitable design choice. When synchronizing data from sharded sources, a middleware layer has to be introduced that presents a unified, intermediate view for accessing the sharded databases and tables.
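
As a toy illustration of what such middleware (TDDL, Cobar, MyCAT and the like) does internally, the sketch below routes a shard key to a physical database and table; the 4-database by 8-table layout and the naming scheme are made up.

```python
# Toy illustration of shard routing, the job sharding middleware performs:
# map a shard key to a physical database and table. The 4-database x 8-table
# layout and naming scheme are made up.
N_DATABASES = 4
N_TABLES_PER_DB = 8

def route(user_id: int) -> tuple[str, str]:
    """Return (database_name, table_name) for a given shard key."""
    slot = user_id % (N_DATABASES * N_TABLES_PER_DB)
    db_index = slot // N_TABLES_PER_DB
    table_index = slot % N_TABLES_PER_DB
    return f"orders_db_{db_index}", f"orders_{table_index}"

# Example: where does user 1234567 live?
print(route(1234567))   # ('orders_db_0', 'orders_7')
```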

Alibaba's TDDL middleware implements exactly this intermediate layer, and it has to be said that the components Alibaba has open-sourced are quite good. Similar middleware includes Amoeba, Cobar, and MyCAT; the specific choice depends on your business needs.

6. Bulk data synchronization

Bulk data synchronization has two characteristics. First, the data sources are diverse: besides the web server logs and client logs mentioned earlier, images and videos of all kinds are common sources. Second, the data volume is very large: for a moderately large company, more than 10 TB per day is perfectly normal, and a company like Alibaba can reach the PB level.

Taking the Hadoop ecosystem as an example: because data is stored in both row-oriented and column-oriented formats, the data format needs to be highly uniform, so the synchronization tools are required to turn the data into a structured or semi-structured shape that supports standard SQL queries. All data types are usually unified as strings, so that the data can then be processed further by tools such as Hive.
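
A sketch of the kind of landing table this implies: an external Hive table in which every column is STRING, so heterogeneous sources can be loaded first and cast later; the table name, columns, and HDFS location are assumptions.

```python
# Sketch of a landing table where every column is STRING, so raw data from
# heterogeneous sources can be loaded first and cast by later Hive jobs.
# Table name, columns, and HDFS location are illustrative.
import subprocess

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS stg_web_log (
    log_time    STRING,
    user_id     STRING,
    url         STRING,
    referer     STRING,
    user_agent  STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
STORED AS TEXTFILE
LOCATION '/warehouse/stg/web_log';
"""

subprocess.run(["hive", "-e", ddl], check=True)
```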

Because today's big data platforms basically do not support the update operation, data consolidation is usually done as a merge, most often in the "full outer join + reload" form, that is, full outer join + insert overwrite: the day's incremental data is full-outer-joined with the previous day's full data, and the result is written back as the new full snapshot.
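
A sketch of that merge as HiveQL: join the previous day's full snapshot with the day's increment, prefer the incremental row where both exist, and overwrite the new full partition; the table, column, and partition names are assumptions.

```python
# Sketch of the "full outer join + insert overwrite" merge: yesterday's full
# snapshot joined with today's increment, incremental values winning, then
# rewritten as today's full partition. Names are illustrative.
import subprocess

merge_sql = """
INSERT OVERWRITE TABLE dwd_orders_full PARTITION (dt = '2020-01-02')
SELECT
    COALESCE(inc.order_id, prev.order_id)           AS order_id,
    COALESCE(inc.user_id, prev.user_id)             AS user_id,
    COALESCE(inc.amount, prev.amount)               AS amount,
    COALESCE(inc.modified_time, prev.modified_time) AS modified_time
FROM (SELECT * FROM dwd_orders_full WHERE dt = '2020-01-01') prev
FULL OUTER JOIN
     (SELECT * FROM ods_orders_inc  WHERE dt = '2020-01-02') inc
ON prev.order_id = inc.order_id;
"""

subprocess.run(["hive", "-e", merge_sql], check=True)
```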

7. Real-time data synchronization

Some business data requires real-time metrics, for example the running total of transaction volume during a large e-commerce promotion. For this kind of data, the source data bypasses the offline system: it flows through a message queue straight into a streaming platform such as Storm, which outputs the results directly. At the same time, some configuration information has to be read from MySQL, so changes to that configuration also need to be synchronized in real time through the binlog. For real-time data, metrics such as PV and running totals are relatively easy at this stage, but UV becomes difficult. In that case an approximate method can be used to produce a rough real-time figure, which is later corrected with the accurate number computed by the offline system. Concretely, this can be implemented with a Bloom filter or a similar structure, keeping the counting state in memory.
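
As an illustration of the Bloom filter idea, here is a toy in-memory filter used to count approximate UV over a stream of user ids; the sizes and hash choices are arbitrary, and the offline system would later replace the approximate figure with an exact one.

```python
# Toy Bloom filter for approximate real-time UV counting: an incoming
# user_id is counted only if the filter has (probably) not seen it before.
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1 << 24, n_hashes: int = 4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, key: str):
        for i in range(self.n_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add_if_new(self, key: str) -> bool:
        """Record the key; return True if it was (probably) unseen."""
        new = False
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                new = True
                self.bits[byte] |= 1 << bit
        return new

seen = BloomFilter()
uv = 0
for user_id in ["u1", "u2", "u1", "u3"]:   # stands in for the real-time stream
    if seen.add_if_new(user_id):
        uv += 1
print(uv)   # 3
```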

8. Introduction to Flume

Most companies use a framework such as Flume for bulk data synchronization: first, it is relatively simple to configure; second, its reliability is good; third, it integrates well with Hadoop. Flume's data flow model is as follows: an Event is the unit of data flowing through a Flume agent. An Event flows from a Source, through a Channel, to a Sink, and is represented by an implementation of the Event interface. An Event carries a payload (a byte array) and an optional set of accompanying headers (string attributes). A Flume agent is a process (a JVM) that hosts the components through which Events flow from an external source to an external destination.
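
The sketch below is a toy Python imitation of that Source → Channel → Sink flow, not Flume's actual Java API: a source thread puts events (body plus optional headers) onto a bounded channel and a sink drains them; the file names are made up.

```python
# Toy imitation of Flume's Source -> Channel -> Sink model (not Flume's API):
# a source emits events onto a bounded channel, a sink drains them.
import queue
import threading

channel: "queue.Queue[dict]" = queue.Queue(maxsize=1000)   # the buffering Channel

def source():
    """Reads lines from a log file and emits them as events."""
    with open("access.log", "rb") as f:
        for line in f:
            channel.put({"headers": {"source": "access.log"}, "body": line})
    channel.put(None)   # sentinel: tell the sink to stop

def sink():
    """Writes event bodies to the downstream store (a local file here)."""
    with open("hdfs_staging.out", "wb") as out:
        while (event := channel.get()) is not None:
            out.write(event["body"])

threading.Thread(target=source).start()
sink()
```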

9. Kafka and message queues

Message queues have two main advantages: first, they improve system performance through asynchronous processing (peak shaving, shorter response times, and so on); second, they decouple systems. When no message queue is used, the server writes the user's request data directly to the database, and under a surge of concurrent requests the database pressure spikes and responses slow down. With a message queue, the request returns to the user as soon as the message has been sent to the queue; a consumer process then fetches messages from the queue and writes the data to the database asynchronously. Because the message queue processes faster than the database (and scales better than the database), the response time improves greatly.
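
A sketch of that pattern with the kafka-python client: the web tier publishes the request and returns immediately, while a separate consumer drains the topic and writes to the database at its own pace; the broker address, topic name, payload, and the save_to_database helper are all assumptions.

```python
# Sketch of the async-write pattern with kafka-python: publish the request,
# return immediately, and let a consumer write to the database later.
# Broker address, topic, payload, and save_to_database are illustrative.
import json
from kafka import KafkaConsumer, KafkaProducer

def save_to_database(record: dict) -> None:
    # hypothetical write to the business database
    print("writing", record)

# --- web tier: respond as soon as the message is handed to Kafka ---
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("user_signups", {"user_id": 42, "name": "alice"})
producer.flush()

# --- background consumer: write to the database asynchronously ---
consumer = KafkaConsumer(
    "user_signups",
    bootstrap_servers="kafka:9092",
    group_id="signup_writers",
    value_deserializer=lambda v: json.loads(v),
)
for message in consumer:
    save_to_database(message.value)
```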

It is worth mentioning how eventual consistency is achieved here: through a "record and compensate" approach. The business change and the notification message are persisted together in a local transaction, and the message is then sent to the broker over RPC; once the broker has persisted it successfully, the RPC returns success and the local message can be deleted. Otherwise, a scheduled task keeps polling the local message table and resending, which guarantees the message is reliably persisted on the broker. The broker's delivery to the consumer works the same way: it keeps sending the message until the consumer acknowledges successful consumption. Leaving duplicate messages aside, the two persistences plus compensation guarantee that the downstream will receive the message. The downstream then deduplicates using a version number or a state machine, updates its own business state, and so reaches eventual consistency. If the consumer processes too slowly, it is allowed to proactively acknowledge with an error and agree with the broker on the time of the next delivery. For a message the broker has already delivered, it is uncertain whether a loss happened during business processing or elsewhere, so the IP address it was delivered to has to be recorded; before deciding to redeliver, that address is asked whether the message was processed successfully, and only if the query gets no answer is the message re-sent.
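
The consumer-side deduplication this relies on can be sketched as follows: before applying a business update, compare the message's version with what has already been applied, so a redelivered message is acknowledged but not applied twice; the in-memory store, message shape, and apply_business_update helper are illustrative.

```python
# Sketch of consumer-side idempotency: apply a message only if its version
# is newer than what is already recorded, but always acknowledge it so the
# broker stops redelivering. The in-memory "store" is illustrative.
applied_versions: dict[str, int] = {}   # order_id -> last applied version

def apply_business_update(message: dict) -> None:
    # hypothetical: the real state change in the business system
    print("applying", message)

def handle(message: dict) -> bool:
    """Apply the update if it is newer than what we have; always ack."""
    order_id = message["order_id"]
    version = message["version"]
    if applied_versions.get(order_id, -1) >= version:
        return True                      # duplicate or stale: ack, do nothing
    apply_business_update(message)
    applied_versions[order_id] = version
    return True                          # ack so the broker stops redelivering

# a redelivered duplicate is acknowledged but changes nothing
handle({"order_id": "A1", "version": 3, "status": "paid"})
handle({"order_id": "A1", "version": 3, "status": "paid"})
```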

A comparison chart of different message queues appeared here in the original post.

10. Data drift

Data drift is usually caused by processing delays, for example the previous day's data falling into today's partition. There are several ways to handle it:

The first is to take a little extra data after the day boundary: if the data is partitioned by hour, then when computing the intermediate table, read one extra hour of partitions after the boundary, which basically guarantees that the full previous day's data is captured.

The second is to determine the correct day precisely from a timestamp field in the log itself, such as a log_time field, and write each record into the corresponding partition.
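
A sketch combining both ideas as HiveQL: read one extra hour of partitions past the day boundary, then keep only the rows whose log_time actually falls on the target day; the table, partition, and column names are assumptions.

```python
# Sketch: read one extra hour of partitions around the day boundary, then
# keep only rows whose business timestamp (log_time) falls on the target
# day. Partition layout and column names are illustrative.
import subprocess

drift_sql = """
INSERT OVERWRITE TABLE dwd_events PARTITION (dt = '2020-01-01')
SELECT event_id, user_id, log_time
FROM ods_events_hourly
WHERE (dt = '2020-01-01'
       OR (dt = '2020-01-02' AND hr = '00'))   -- one extra hour after midnight
  AND log_time >= '2020-01-01 00:00:00'
  AND log_time <  '2020-01-02 00:00:00';
"""

subprocess.run(["hive", "-e", drift_sql], check=True)
```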

Origin blog.csdn.net/gaixiaoyang123/article/details/103801871