The four layers of data warehouse design that Hive developers need to know

Data warehouse: a complete data warehouse system receives data from source systems, standardizes, validates, and cleans it through ETL processes, and finally loads it into the data marts; queries against the data marts then support analysis. The entire data warehouse is organized into four layers.

1. The four operations of a data warehouse
       ETL (extraction, transformation, loading) is responsible for extracting data from dispersed, heterogeneous data sources into a temporary staging layer, where it is cleansed, transformed, and integrated before finally being loaded into the data warehouse or data mart. ETL is the core and soul of implementing a data warehouse; designing and implementing the ETL rules accounts for roughly 60% to 80% of the effort of building the entire warehouse.
      1. Data extraction (extraction) covers initial data loading and data refresh. Initial loading is mainly concerned with how to create the dimension tables and fact tables and load the corresponding data into them; data refresh is concerned with how to append to and update the corresponding data in the warehouse as the source data changes (for example, by creating scheduled tasks, or by using triggers to refresh the data periodically). All four steps are illustrated in the sketch after this list.

      2. Data cleansing applies unified rules to data in the source database that is ambiguous, duplicated, incomplete, or violates business logic; that is, it washes out data that does not meet business requirements or is useless, for example by using HiveQL or MapReduce jobs to filter out rows whose field lengths do not meet requirements.

      3. Data transformation (transformation) mainly converts the cleansed data into the form the data warehouse requires: on the one hand, the field names and data formats of the same data may differ across source systems (for example, table A calls a column id while table B calls it ids), and the warehouse must normalize them against a unified data dictionary and data format; on the other hand, the content of some warehouse fields may not exist in any single source system and must instead be derived by combining fields from several source systems.

    4. Data loading (loading) writes the data processed in the steps above into the target storage (MySQL, etc.), making it convenient for the data marts and for further visualization.
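To make the four steps concrete, here is a minimal HiveQL sketch of a daily refresh; all database, table, and column names (src_a, src_b, dw.user_info, ids, and so on) are hypothetical, not from the original article:

```sql
-- Refresh (step 1): load one day's rows from two source systems into
-- the warehouse table's daily partition.
INSERT OVERWRITE TABLE dw.user_info PARTITION (dt = '2019-09-23')
SELECT u.user_id, u.name, u.phone
FROM (
    SELECT a.id  AS user_id, a.name, a.phone
    FROM   src_a.user_info a
    WHERE  a.dt = '2019-09-23'
      AND  a.id IS NOT NULL         -- cleansing (step 2): drop incomplete rows
      AND  length(a.phone) = 11     -- cleansing: field-length rule
    UNION ALL
    SELECT b.ids AS user_id,        -- transformation (step 3): ids -> user_id
           b.name, b.phone
    FROM   src_b.user_info b
    WHERE  b.dt = '2019-09-23'
      AND  b.ids IS NOT NULL
      AND  length(b.phone) = 11
) u;
```

The cleaned, normalized partition can then be exported to MySQL for the marts (step 4), for example with Sqoop, and the whole script wrapped in a scheduled task as described above.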

     For safe and convenient data operation, large companies generally encapsulate their own data platform and task-scheduling platform: the bottom layer wraps the big-data clusters (a Hadoop cluster, a Spark cluster, Sqoop, Hive, ZooKeeper, HBase, and so on) and exposes only a web interface, giving different employees different permissions to perform different operations and calls against the clusters. Taking the data warehouse as an example, it is divided into several logical layers, so tasks are created at different levels for data operations on different layers and placed into task flows at the corresponding level for execution (a large company's cluster usually has thousands or even tens of thousands of scheduled tasks awaiting execution each day, so splitting them into task flows of different levels, with each level's tasks in the corresponding flow, makes execution, management, and maintenance much more convenient).

2. The four-layer logical architecture of a data warehouse
       A standard data warehouse can be divided into four layers. Note, however, that this division and naming are not the only ones: there are usually four layers, but different companies may name them differently. For example, the temporary layer called the replication layer (SSA) here is called BDM at JD.com (Jingdong). Alibaba likewise uses a similar but five-layer warehouse structure, which is more detailed, but its core idea still comes from the four-layer model. The figure below showed the layer naming and structure of the JD.com and Alibaba warehouses.

[Figure: layer naming and structure of the JD.com and Alibaba data warehouses (image not reproduced)]
1. Replication layer (SSA, system-of-record staging area)
      The SSA copies data directly from the source systems (for example, reading all the data from MySQL and importing it, unprocessed, into Hive tables with the same structure) and keeps the original business data as intact as possible. The only difference from the source systems is that the SSA adds timestamp information to the source data, so that the warehouse holds multiple historical versions of the data.
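As a minimal sketch of this idea (table and column names are hypothetical), an SSA table can mirror a source table one-to-one and stamp every load into its own partition:

```sql
-- Hypothetical SSA table mirroring a source-system table, partitioned by
-- load date so every load preserves a historical version of the data.
CREATE TABLE IF NOT EXISTS ssa.employee (
    id        BIGINT,
    name      STRING,
    dept_id   INT,
    update_ts TIMESTAMP          -- carried over unchanged from the source
)
PARTITIONED BY (load_date STRING);

-- Each scheduled run lands the raw extract in its own partition,
-- leaving earlier versions untouched.
INSERT OVERWRITE TABLE ssa.employee PARTITION (load_date = '2019-09-23')
SELECT id, name, dept_id, update_ts
FROM ssa.employee_ext;           -- e.g. an external table over Sqoop-imported files
```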

2. Atomic layer (SOR, system of record)
     The SOR's table structure is modeled according to the 3NF normalization rules. It stores the finest-grained data in the warehouse, classified and stored by subject area; for example, a university statistics service platform might, based on the school's current needs, store the SOR layer's data under four subjects: personnel, students, teaching, and research. The SOR is the core and foundation of the entire data warehouse; it should be designed with enough flexibility to accommodate additional data sources and support more analysis needs, as well as further upgrades and updates.
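A minimal sketch of what 3NF modeling can look like in the SOR, using hypothetical tables from a "students" subject area (Hive does not enforce foreign keys; the references are by convention):

```sql
-- Each fact is stored exactly once and referenced by key.
CREATE TABLE IF NOT EXISTS sor.student (
    student_id BIGINT,
    name       STRING,
    major_id   INT               -- refers to sor.major; no redundant major name
);

CREATE TABLE IF NOT EXISTS sor.major (
    major_id   INT,
    major_name STRING,
    dept_id    INT               -- refers to sor.department
);

CREATE TABLE IF NOT EXISTS sor.department (
    dept_id    INT,
    dept_name  STRING
);
```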

3. Summary layer (SMA, summary area)
    The SMA is a transition layer between the SOR and the DM (mart layer). Because SOR data is highly normalized, completing a query against it requires a lot of join work, and because the DM's data granularity tends to be much coarser than the SOR's, producing the DM's aggregated data requires a lot of summarization work. For these reasons the SMA moderately denormalizes the SOR data (for example, designing a wide table that merges the personnel information table, the cadre information table, and other tables as needed) and aggregates it (for example, some commonly used pre-computed summaries), thereby improving the warehouse's query performance.
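A minimal sketch of such moderate denormalization, reusing the hypothetical SOR tables above: a wide table that pre-joins the normalized student tables so downstream queries avoid repeating the joins:

```sql
-- Hypothetical SMA wide table built from the 3NF student tables.
CREATE TABLE IF NOT EXISTS sma.student_wide AS
SELECT s.student_id,
       s.name,
       m.major_name,
       d.dept_name
FROM sor.student s
JOIN sor.major m      ON s.major_id = m.major_id
JOIN sor.department d ON m.dept_id  = d.dept_id;
```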

4. Mart layer / presentation layer (DM, data mart)
    The DM stores data for users to access directly: it can be understood as what end users ultimately want to see. The DM mainly holds data at various granularities; on a data services platform, statistics are computed against the DM, and providing data at different granularities adapts to different access needs.
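A minimal sketch of serving different granularities (names hypothetical), deriving two DM tables from the SMA wide table above:

```sql
-- Coarse granularity: student counts per department.
CREATE TABLE IF NOT EXISTS dm.student_count_by_dept AS
SELECT dept_name, COUNT(*) AS student_cnt
FROM sma.student_wide
GROUP BY dept_name;

-- Finer granularity: student counts per major within each department.
CREATE TABLE IF NOT EXISTS dm.student_count_by_major AS
SELECT dept_name, major_name, COUNT(*) AS student_cnt
FROM sma.student_wide
GROUP BY dept_name, major_name;
```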

 
