Data warehouse construction ideas and ETL design ideas

Original: https://www.cnblogs.com/MR-zhang-01/p/9180787.html

First, approaches to building a data warehouse

There are two ways to build a data warehouse: top-down and bottom-up.

Bill Inmon advocates the "top-down" approach: establish a single enterprise-wide data center, that is, a data warehouse in which the data has been integrated, cleaned, stripped of dirty data, and standardized, and which provides a unified view. Such a data warehouse is not driven by the needs of any one application. Instead, the design starts from the business environment of the whole enterprise, analyzes the concepts in it, and determines what data is needed to cover them completely. (The design takes the whole picture into account.)

Ralph Kimball advocates the "bottom-up" approach: he holds that a data warehouse should be built from actual application requirements, loading only the data that is needed and leaving unneeded data out. This approach has a short construction period, so customers see results quickly. (It does what the customer's requirements call for.)

Both approaches aim at the same goal: an enterprise data warehouse. In practice, the two methods are usually combined when building a data warehouse; neither is mandatory.


Second, ETL (Extract/Transform/Load)

ETL extracts the data the user needs from the data sources, cleans and transforms it, and finally loads it into the data warehouse according to the predefined data warehouse model.

ETL is one of the most important concepts in a data warehouse system; ETL takes up more than half of the time in a data warehouse project.
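
The extract, clean/transform, load flow described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production design: the source rows, the `fact_sales` table, and the cleaning rule (drop rows whose amount is not numeric) are all assumptions made for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source rows; in practice these would come from a source system.
SOURCE_ROWS = [
    {"id": "1", "name": " Alice ", "amount": "10.5"},
    {"id": "2", "name": "Bob", "amount": "not-a-number"},  # dirty record
    {"id": "3", "name": "carol", "amount": "7"},
]


def extract(rows):
    """Extract: yield raw records from the source."""
    yield from rows


def transform(records):
    """Clean and convert: drop records with a bad amount, normalize names."""
    for rec in records:
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # discard dirty data
        yield int(rec["id"]), rec["name"].strip().title(), amount


def load(conn, rows):
    """Load: write cleaned rows into the predefined target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
load(conn, transform(extract(SOURCE_ROWS)))
loaded = conn.execute(
    "SELECT id, name, amount FROM fact_sales ORDER BY id"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes it easy to swap the source or the target without touching the cleaning rules.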


1) ETL scheduling targets

Data sources: databases, database files, text files, program-generated data (derived columns)

Number of systems: a single system or multiple systems (when there are too many systems, an interface may be considered)

Database types: homogeneous databases / heterogeneous databases


2) ETL scheduling parameter design

Scheduling priority / scheduling order / interrupt flag / rollback flag / success flag / scheduled start and end times, etc.
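
As a rough sketch, the parameters listed above could be kept in one record per job. The `JobSchedule` class, its field names, the default time window, and the rule "lower priority number runs first" are all assumptions made for illustration; a real scheduler or ETL tool will have its own parameter model.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class JobSchedule:
    """Hypothetical container for the scheduling parameters listed above."""
    name: str
    priority: int                   # lower number runs first (an assumption)
    order: int                      # position among jobs of equal priority
    interrupt: bool = False         # interrupt flag: skip the job when set
    rollback_on_error: bool = True  # rollback flag
    succeeded: bool = False         # success flag, set after a successful run
    start: time = time(1, 0)        # scheduled window start
    end: time = time(5, 0)          # scheduled window end


def runnable_jobs(jobs, now):
    """Return jobs allowed to run at `now`, sorted by priority, then order."""
    eligible = [
        j for j in jobs
        if not j.interrupt and not j.succeeded and j.start <= now <= j.end
    ]
    return sorted(eligible, key=lambda j: (j.priority, j.order))


jobs = [
    JobSchedule("load_orders", priority=2, order=1),
    JobSchedule("load_customers", priority=1, order=1),
    JobSchedule("load_archive", priority=1, order=2, interrupt=True),
]
plan = [j.name for j in runnable_jobs(jobs, time(2, 30))]
print(plan)
```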


3) ETL scheduling log management

Recorded in documents (files) / recorded in a database

Job name / job execution start and end times / job execution result / captured exception information / job number, etc.
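
A database-backed job log with the fields above might look like the following sketch. The table name `etl_job_log`, its columns, and the `run_logged` helper are hypothetical names chosen for this example; SQLite is used only because it ships with Python.

```python
import sqlite3
import traceback
from datetime import datetime, timezone


def init_log(conn):
    """Create the (hypothetical) job log table if it does not exist."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS etl_job_log (
               job_no     INTEGER PRIMARY KEY AUTOINCREMENT,
               job_name   TEXT NOT NULL,
               started_at TEXT NOT NULL,
               ended_at   TEXT NOT NULL,
               result     TEXT NOT NULL,
               error      TEXT
           )"""
    )


def run_logged(conn, job_name, func):
    """Run `func` and record one log row whether it succeeds or fails."""
    started = datetime.now(timezone.utc).isoformat()
    result, error = "SUCCESS", None
    try:
        func()
    except Exception:
        # Capture the exception information instead of losing it.
        result, error = "FAILED", traceback.format_exc()
    ended = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO etl_job_log "
        "(job_name, started_at, ended_at, result, error) VALUES (?, ?, ?, ?, ?)",
        (job_name, started, ended, result, error),
    )
    conn.commit()
    return result


def failing_job():
    raise ValueError("source file missing")


conn = sqlite3.connect(":memory:")
init_log(conn)
run_logged(conn, "load_customers", lambda: None)
run_logged(conn, "load_orders", failing_job)
log_rows = conn.execute(
    "SELECT job_no, job_name, result FROM etl_job_log ORDER BY job_no"
).fetchall()
print(log_rows)
```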


4) ETL scheduling JOB design

Text data file loading / SQL calls / stored procedures / WORKFLOW programs in an ETL tool


5) ETL scheduling policy design

Full data load: user-information data and data whose status changes through updates

Incremental data load: transaction (journal-type) data. For batch scheduling design, data extraction generally runs during idle periods, most often in the early morning, and depending on the data analysis cycle it is also divided into daily and monthly processing. Because business systems involve large volumes of data, the data needs to be extracted in batches, and a re-extraction process for the data is also needed.
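
One common way to implement incremental, batched extraction is a "high-water mark": remember the largest key already loaded and fetch only rows above it. The sketch below assumes the source table has a monotonically increasing `id`; the table names (`txn`, `txn_copy`, `watermark`) and the tiny batch size are made up for the example, and the original post does not prescribe this particular technique.

```python
import sqlite3

BATCH_SIZE = 2  # deliberately tiny so the batching is visible

# Hypothetical source system with five transaction rows.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, amount REAL)")
src.executemany(
    "INSERT INTO txn VALUES (?, ?)", [(i, i * 1.0) for i in range(1, 6)]
)

# Hypothetical warehouse with a copy table and a one-row watermark table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE txn_copy (id INTEGER PRIMARY KEY, amount REAL)")
dw.execute("CREATE TABLE watermark (last_id INTEGER NOT NULL)")
dw.execute("INSERT INTO watermark VALUES (0)")


def incremental_load(src, dw):
    """Copy rows with id above the stored watermark, one batch at a time."""
    batches = 0
    while True:
        (last_id,) = dw.execute("SELECT last_id FROM watermark").fetchone()
        rows = src.execute(
            "SELECT id, amount FROM txn WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH_SIZE),
        ).fetchall()
        if not rows:
            return batches
        with dw:  # one transaction per batch: rows and watermark move together
            dw.executemany("INSERT INTO txn_copy VALUES (?, ?)", rows)
            dw.execute("UPDATE watermark SET last_id = ?", (rows[-1][0],))
        batches += 1


first_run = incremental_load(src, dw)   # picks up ids 1 to 5
src.execute("INSERT INTO txn VALUES (6, 6.0)")
second_run = incremental_load(src, dw)  # only the new row is extracted
total = dw.execute("SELECT COUNT(*) FROM txn_copy").fetchone()[0]
print(first_run, second_run, total)
```

Because the watermark is updated in the same transaction as the batch, a failed batch leaves the watermark unchanged and can simply be extracted again on the next run.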

Concurrent scheduling design: covers concurrent JOBs, concurrency conflict design, exception handling design, and success / error exit strategies
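
A small sketch of concurrent job execution with per-job exception handling and an overall exit code follows. It assumes the jobs are independent of each other (for example, they write to different tables), so no conflict resolution is shown; the job names and the "non-zero exit on any failure" rule are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_concurrently(jobs, max_workers=2):
    """Run independent jobs in parallel; return (succeeded names, failures)."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(func): name for name, func in jobs.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                fut.result()  # re-raises any exception from the job
                succeeded.append(name)
            except Exception as exc:  # handle each job's failure separately
                failed[name] = str(exc)
    return sorted(succeeded), failed


def bad_job():
    raise RuntimeError("lock timeout")


jobs = {
    "load_customers": lambda: None,
    "load_orders": lambda: None,
    "load_stock": bad_job,
}
ok, errors = run_concurrently(jobs)
exit_code = 0 if not errors else 1  # error exit strategy: non-zero on any failure
print(ok, errors, exit_code)
```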


Third, storage management and model design

The real key to a data warehouse is storing and managing the data. A data warehouse generally has to address several problems:

1) Storing and managing large amounts of data

Database design, installation, and integration; detailed design of the database applications for data extraction according to the design requirements.


2) Optimization for decision-support queries

Partitioned tables, indexes, clustered indexes, MQTs (materialized query tables), SQL optimization, and similar methods.
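
Two of these techniques can be illustrated with SQLite from Python: an index on a filter column, and a precomputed summary table that plays a role similar to an MQT. SQLite has no MQT feature, so the summary table here is an ordinary table that the ETL job would have to refresh itself; the table and index names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [
        ("2024-01-01", "north", 10.0),
        ("2024-01-01", "south", 5.0),
        ("2024-01-02", "north", 7.5),
    ],
)

# Index on the column that decision-support queries filter by.
conn.execute("CREATE INDEX idx_sales_region ON fact_sales (region)")

# Precomputed aggregate, refreshed by the ETL job rather than per query.
conn.execute(
    "CREATE TABLE agg_sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM fact_sales GROUP BY region"
)

totals = conn.execute(
    "SELECT region, total FROM agg_sales_by_region ORDER BY region"
).fetchall()

# Ask SQLite how it would execute a filtered query on the fact table.
query_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT amount FROM fact_sales WHERE region = ?",
    ("north",),
).fetchall()
print(totals)
print(query_plan)
```

Reports that only need regional totals can read the small summary table instead of scanning the fact table, while the query plan output shows whether the index is used for filtered lookups.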


3) Support for multidimensional analysis queries

Whether suitable reporting software is available, and how its queries are optimized.

Origin www.cnblogs.com/weiyiming007/p/12356894.html