Big Data Project Implementation Case

Disclaimer: This is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/qq_34401027/article/details/85236023

1. Project objectives

Implement synchronous replication of more than 30 core data systems onto a large, centralized data platform.

1) Data synchronization method and data volume: the data involved is complex.

2) Data replication must be real-time and accurate.

3) Replicated data needs additional tags (operation time, operation type, operator, etc.) so that the back end can identify each record.
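A minimal sketch of how such tags might be attached to each replicated change record; the field names (op_time, op_type, operator) are illustrative assumptions, not from the source.

```python
from datetime import datetime, timezone

def tag_change(record: dict, op_type: str, operator: str) -> dict:
    """Attach audit tags to a replicated change record.

    The tag field names are illustrative; the actual column names
    would come from the project's own data standard.
    """
    tagged = dict(record)
    tagged["op_time"] = datetime.now(timezone.utc).isoformat()  # operation time
    tagged["op_type"] = op_type      # e.g. "INSERT", "UPDATE", "DELETE"
    tagged["operator"] = operator    # user or process that made the change
    return tagged

# Example: tag an update captured from a source system
row = {"acct_id": 1001, "balance": 2500.00}
print(tag_change(row, "UPDATE", "etl_replicator"))
```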

4) How to extract data while minimizing the impact on the production database, for example via views, temporary tables, or a Data Guard (DG) standby database.
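As an illustration of the view-based approach, here is a minimal sketch using SQLite as a stand-in for the production database; the table and view names are hypothetical.

```python
import sqlite3

# Stand-in for the production database; in practice this would be the
# source system (or better, a Data Guard standby) rather than SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, internal_note TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 99.5, 'do not expose')")

# Expose only the columns the extraction job needs through a view,
# so extraction never touches the base tables directly.
conn.execute("CREATE VIEW v_orders_extract AS SELECT id, amount FROM orders")

print(conn.execute("SELECT * FROM v_orders_extract").fetchall())
```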

5) How to better serve back-end applications: keep the data specification flexible and allow ample spare fields.

Consider building a code-table (reference data) management system and a metadata repository.

6) Provide operational control-flow management covering data extraction, data cleansing, and data comparison, to make tracking and traceability easier.
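One possible shape for such a control-flow record, sketched as a Python dataclass; the stage names and fields are assumptions, not from the source.

```python
from dataclasses import dataclass
from typing import Optional

STAGES = ("extract", "cleanse", "compare")  # illustrative pipeline stages

@dataclass
class ControlFlowRecord:
    """Tracks one table's batch through the pipeline for traceability."""
    batch_id: str
    source_table: str
    stage: str = "extract"           # current stage
    row_count: Optional[int] = None  # rows processed in the last finished stage
    status: str = "pending"          # pending / running / done / failed

    def advance(self, row_count: int) -> None:
        """Record the finished stage's row count and move to the next stage."""
        self.row_count = row_count
        idx = STAGES.index(self.stage)
        if idx + 1 < len(STAGES):
            self.stage = STAGES[idx + 1]
        else:
            self.status = "done"

rec = ControlFlowRecord(batch_id="20181223-001", source_table="gl_voucher")
rec.advance(row_count=125_000)  # extraction finished; now in the cleansing stage
print(rec)
```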

2. Project difficulties and how to address them

1) Many business systems, with complex and heterogeneous data sources: SQL Server, MySQL, Essbase, Oracle. Data rules also vary between systems.

Recommendation: establish a data processing center and a metadata repository that handle data format conversion and scale well.
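A minimal sketch of the format-conversion idea: map each source dialect's types onto one canonical type system. The mappings shown are simplified assumptions; real mappings would live in the metadata repository and cover far more types.

```python
# Simplified type mappings from each source dialect to a canonical type.
TYPE_MAP = {
    "sqlserver": {"datetime2": "timestamp", "nvarchar": "string", "bit": "boolean"},
    "mysql":     {"datetime": "timestamp", "varchar": "string", "tinyint(1)": "boolean"},
    "oracle":    {"date": "timestamp", "varchar2": "string", "number(1)": "boolean"},
}

def to_canonical(dialect: str, source_type: str) -> str:
    """Translate a source column type into the platform's canonical type."""
    try:
        return TYPE_MAP[dialect][source_type.lower()]
    except KeyError:
        raise ValueError(f"no mapping for {dialect}.{source_type}") from None

print(to_canonical("oracle", "VARCHAR2"))  # -> string
```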

2) Large data volumes: the starting point chosen for data initialization is very important. Financial accounting data is generally retained for two years, but because of the special nature of some businesses, certain data must go back 10-30 years or even longer.

3) Data storage capacity and machine-room location: whether a dedicated line is required, and whether the replication would contend with other systems for resources.

4) Replication of core business system data must not lag by more than 10-20 seconds. The timeliness requirement is strict, and accuracy is mandatory; otherwise the correctness of the data cannot be guaranteed.
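A sketch of how the 10-20 second bound might be monitored, by comparing each change's source commit time against its arrival time; the threshold constant and field names are assumptions.

```python
from datetime import datetime, timedelta, timezone

MAX_LAG_SECONDS = 20  # upper bound taken from the requirement above

def check_lag(source_commit_time: datetime) -> float:
    """Return replication lag in seconds, warning when it exceeds the bound."""
    lag = (datetime.now(timezone.utc) - source_commit_time).total_seconds()
    if lag > MAX_LAG_SECONDS:
        print(f"WARNING: replication lag {lag:.1f}s exceeds {MAX_LAG_SECONDS}s")
    return lag

# Example: a change that was committed on the source 5 seconds ago
check_lag(datetime.now(timezone.utc) - timedelta(seconds=5))
```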

5) Data cleansing, sharing, and backfilling: provide a unified interface for manual data backfill.
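A sketch of what a single backfill entry point might look like, so manual corrections are validated and audit-tagged in one place; the function and field names are hypothetical.

```python
from datetime import datetime, timezone

def backfill(table: str, record: dict, operator: str) -> dict:
    """Single entry point for manual data backfill.

    Every manual correction passes through here so it is validated
    and audit-tagged consistently, instead of being patched ad hoc.
    """
    if not record:
        raise ValueError("backfill record must not be empty")
    entry = dict(record)
    entry["_backfilled_by"] = operator
    entry["_backfilled_at"] = datetime.now(timezone.utc).isoformat()
    entry["_target_table"] = table
    # In a real system this would be written to the platform, not printed.
    print(f"backfill accepted for {table}: {entry}")
    return entry

backfill("gl_balance", {"acct_id": 1001, "balance": 2500.0}, operator="fin_ops")
```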

3. Solution

1) The data platform must support replicating data across heterogeneous databases and meet the requirements of large data volumes, real-time delivery, and a modular design.

Consider synchronizing the initial full load to HDFS and pushing incremental changes to Kafka.
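A minimal sketch of the incremental leg using the kafka-python client; the broker address, topic name, and event shape are all assumptions. The initial full load, by contrast, would typically be bulk-copied into HDFS with a batch tool rather than streamed through Kafka.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Broker address and topic are placeholders for the platform's actual setup.
producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One captured change event; in practice these would come from a CDC tool.
change_event = {
    "table": "gl_voucher",
    "op_type": "INSERT",
    "op_time": "2018-12-23T10:15:00Z",
    "data": {"voucher_id": 42, "amount": 1200.0},
}
producer.send("cdc.gl_voucher", value=change_event)
producer.flush()
```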

2) Replicate data from a backup (standby) database, so as to reduce the load on the production database.

3) To conserve network resources, the backup database should be located in the same machine room.

4) Establish a data control flow to facilitate verification: record check items such as field-level check values and total record counts.

The purpose is to support downstream business operations in back-checking and verifying the data. Financial data in particular may need to be drilled back down to the document-level extraction stage.
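A sketch of the count-and-checksum comparison between source and target extracts; the XOR-of-hashes scheme shown is one simple choice, not the project's actual method.

```python
import hashlib
from typing import Iterable, Tuple

def table_fingerprint(rows: Iterable[tuple]) -> Tuple[int, str]:
    """Return (row count, order-independent checksum) for a set of rows."""
    count = 0
    digest = 0
    for row in rows:
        count += 1
        h = hashlib.md5(repr(row).encode("utf-8")).hexdigest()
        digest ^= int(h, 16)  # XOR makes the checksum order-independent
    return count, f"{digest:032x}"

source_rows = [(1, "a", 10.0), (2, "b", 20.0)]
target_rows = [(2, "b", 20.0), (1, "a", 10.0)]  # same data, different order

assert table_fingerprint(source_rows) == table_fingerprint(target_rows)
print("counts and checksums match")
```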

5) Multiple synchronization channels are needed, so that data can be synchronized to the target database quickly and incremental synchronization is supported.

Select the most efficient replication product available: one that supports multithreading and high concurrency, purpose-built data formats, data compression, and fast data extraction and loading.
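A sketch of the multithreaded, compressed extraction idea using only the standard library; the partitioning scheme and generated data are illustrative.

```python
import gzip
import json
from concurrent.futures import ThreadPoolExecutor

def extract_partition(part_id: int) -> bytes:
    """Extract one partition and gzip-compress it before transfer.

    Here the 'extraction' is faked with generated rows; a real job would
    read a key-range or date partition from the source database.
    """
    rows = [{"part": part_id, "row": i} for i in range(1000)]
    return gzip.compress(json.dumps(rows).encode("utf-8"))

# Extract several partitions concurrently, as a parallel replicator would.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(extract_partition, range(8)))

print(f"extracted {len(chunks)} compressed chunks, "
      f"total {sum(len(c) for c in chunks)} bytes")
```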
