In-depth explanation of data warehouse layered architecture

Preface

         Data warehouse projects almost always layer their data, but do you really understand why we layer and what the benefits are? That is today's topic. If you are not familiar with data warehouse modeling, you can read this first (detailed explanation of the design of the warehouse model). If you find the article useful, a like, favorite, and share would be appreciated.

1. Why layer the data

         The main reason for layering is to keep clearer control over the data we manage. In detail, the reasons are as follows:

Clear data structure:

         Each data layer has its own scope, so tables are easier to locate and understand when we use them.

Easier data lineage tracking:

         Put simply, what we finally present to the business is a table that can be used directly, but that table is built from many sources. If one of the source tables has a problem, we want to locate the problem quickly and accurately and know how far its impact reaches.
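         As a minimal sketch of what lineage tracking buys us, the snippet below walks a small dependency graph to list every table affected by a broken source. The table names and the DOWNSTREAM mapping are hypothetical and only serve as an illustration.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables built directly from it.
DOWNSTREAM = {
    "ods_orders": ["dwd_orders"],
    "ods_users": ["dwd_users"],
    "dwd_orders": ["dws_trade_summary"],
    "dwd_users": ["dws_trade_summary"],
    "dws_trade_summary": ["ads_sales_report"],
    "ads_sales_report": [],
}


def impacted_tables(source):
    """Return every table downstream of `source`, in breadth-first order."""
    seen = []
    queue = deque(DOWNSTREAM.get(source, []))
    while queue:
        table = queue.popleft()
        if table not in seen:
            seen.append(table)
            queue.extend(DOWNSTREAM.get(table, []))
    return seen


print(impacted_tables("ods_users"))
# ['dwd_users', 'dws_trade_summary', 'ads_sales_report']
```

         With clear layers, the graph stays shallow and one-directional (ODS to DWD to DWS to ADS), which is what makes this kind of impact analysis cheap.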

Less repeated development:

         Standardizing the layers and building some shared middle-layer data avoids a great deal of repeated computation.

Simplify complex problems:

         Break a complex task into multiple steps, with each layer handling only one step. Each step is then relatively simple and easy to understand, and the accuracy of the data is easier to maintain. When something is wrong with the data, you do not need to repair all of it; you only need to start the repair from the step where the problem occurred.
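         The idea of repairing from the problematic step onward can be sketched as a small pipeline that caches each step's output. The step names (clean, enrich, summarize) and the sample records are assumptions made up for this illustration.

```python
def clean(rows):
    # Step 1: drop records without an amount.
    return [r for r in rows if r.get("amount") is not None]


def enrich(rows):
    # Step 2: add a derived field.
    return [{**r, "is_large": r["amount"] >= 100} for r in rows]


def summarize(rows):
    # Step 3: aggregate to a single result.
    return {
        "total": sum(r["amount"] for r in rows),
        "large_orders": sum(r["is_large"] for r in rows),
    }


STEPS = [("clean", clean), ("enrich", enrich), ("summarize", summarize)]


def run(raw, cache, start_from="clean"):
    """Run the steps in order, reusing cached outputs of steps before `start_from`."""
    names = [name for name, _ in STEPS]
    start = names.index(start_from)
    data = raw if start == 0 else cache[names[start - 1]]
    for name, step in STEPS[start:]:
        data = step(data)
        cache[name] = data
    return data


cache = {}
raw = [{"amount": 50}, {"amount": None}, {"amount": 150}]
full = run(raw, cache)
# Suppose only "enrich" had a bug: rerun from there, reusing the cached "clean" output.
rerun = run(raw, cache, start_from="enrich")
print(full, rerun)
```

         In a real warehouse the cache would be the persisted tables of each layer, so a fix in one layer only requires recomputing that layer and the ones above it.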

Shield the anomalies of the original data:

         Shield downstream users from changes in the business systems, so that the data does not have to be re-ingested every time the business changes.

2. The idea behind data warehouse layering

         Each enterprise can divide its data into different layers according to its own business needs, but in the most basic scheme the data is, in theory, divided into three layers: the data operation layer, the data warehouse layer, and the data service layer. New layers are added on top of this basic scheme to meet different business needs.

Data Operation Layer (ODS)

         Operational Data Store (ODS). This is the layer closest to the data in the data sources. Data from the sources is loaded into the ODS layer after extraction, cleaning, and transfer, that is, after the familiar ETL process. The data at this layer is generally organized the same way the source business systems classify it.
For example, a table in MySQL can be extracted directly into the ODS layer with Sqoop.
Sources of ODS-layer data:

  • Business databases
             Sqoop is often used for extraction, for example on a schedule once a day. For real-time needs, consider using Canal to listen to the MySQL binlog and ingest changes in real time.
  • Tracking (event) logs
             Online systems write all kinds of logs, which are generally stored as files. We can extract them on a schedule with Flume, or ingest them in real time with Spark Streaming or Flink. Kafka also plays a key role here.
  • Message queues
             Data coming from ActiveMQ, Kafka, and so on.

Data warehouse layer (DW)

         Data Warehouse (DW). Here the data obtained from the ODS layer is used to build various data models organized by subject. For example, for a data set on the subject of people's travel spending, you can combine airline boarding and travel information with UnionPay card payment records and analyze them together to produce one data set. To work at this layer we need to understand four concepts: dimension, fact, metric (indicator), and granularity.
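         To make the four concepts concrete, here is a small sketch built on the travel-spending example. The city dimension, the payment records, and the column names are invented for illustration; the point is only to show one metric computed at two different grains.

```python
from collections import defaultdict

# Dimension: descriptive context, keyed by a hypothetical city_id.
dim_city = {1: "Beijing", 2: "Shanghai"}

# Fact: one row per card payment (the finest grain), carrying the measure `amount`.
fact_payment = [
    {"traveler": "a", "city_id": 1, "day": "2021-01-01", "amount": 300.0},
    {"traveler": "a", "city_id": 1, "day": "2021-01-01", "amount": 200.0},
    {"traveler": "b", "city_id": 2, "day": "2021-01-02", "amount": 120.0},
]


def total_spend(facts, grain):
    """Metric: total spend, aggregated to the given grain (a tuple of column names)."""
    totals = defaultdict(float)
    for row in facts:
        totals[tuple(row[col] for col in grain)] += row["amount"]
    return dict(totals)


# Same metric, two granularities.
by_city = {dim_city[key[0]]: value
           for key, value in total_spend(fact_payment, ("city_id",)).items()}
by_traveler_day = total_spend(fact_payment, ("traveler", "day"))

print(by_city)          # {'Beijing': 500.0, 'Shanghai': 120.0}
print(by_traveler_day)  # {('a', '2021-01-01'): 500.0, ('b', '2021-01-02'): 120.0}
```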

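The DW layer is further divided, from bottom to top, into DWD, DWB, and DWS.

DWD: Data Warehouse Detail, the detail data layer. It is the isolation layer between the business layer and the data warehouse.

DWB: Data Warehouse Base, the base data layer. It stores objective data and is generally used as an intermediate layer; it can be thought of as the data layer that holds a large number of metrics.

DWS: Data Warehouse Service, the service data layer. Based on the base data in DWB, it integrates and summarizes data into service data for analyzing a particular subject area, generally as wide tables.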

Data service layer/application layer (ADS)

         Application Data Service (ADS). This layer mainly serves data products and data analysis. The data is generally stored in systems such as ES and MySQL for online systems to use, and may also be stored in Hive or Druid for data analysis and data mining.
For example, the report data we often talk about, or those large wide tables, generally live here.

3. Alibaba data warehouse layered architecture


ODS data preparation layer
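Function:
         The ODS layer is the staging area of the data warehouse. It provides the base raw data for the DWD layer and reduces the impact on the business systems.
Modeling methods and principles:
         Extract data incrementally from the business systems. Retention time is determined by business requirements, and tables can be split for periodic storage. The data is not cleaned or transformed and stays consistent with the data model of the business systems. Tables are divided logically by subject.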
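         Incremental extraction can be sketched with a simple watermark: each run copies only the rows changed since the previous run, unmodified. This is an illustration under assumptions, not how Sqoop or Alibaba's tooling works internally. An in-memory SQLite database stands in for the business system, and the orders table and updated_at column are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the business system.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
src.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2021-01-01"), (2, 20.0, "2021-01-02"), (3, 30.0, "2021-01-03")],
)


def extract_increment(conn, last_watermark):
    """Copy rows changed after `last_watermark` as-is, with no cleaning or transformation."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY id",
        (last_watermark,),
    ).fetchall()
    # The newest change time becomes the watermark for the next run.
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_watermark


ods_rows, watermark = extract_increment(src, "2021-01-01")
print(ods_rows)   # [(2, 20.0, '2021-01-02'), (3, 30.0, '2021-01-03')]
print(watermark)  # 2021-01-03
```

         Because only the changed rows are read, the load on the business system stays small, which matches the stated goal of the ODS layer.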

DWD data detail layer
Function:
         Provides source detail data for the DW layer and a long-term store of detail data from the business systems, so that historical data is available when analysis requirements expand in the future.

Modeling methods and principles:
         The data model is consistent with the ODS layer, and no cleaning or transformation is done. A business-date field can be added to support data reruns, and tables can be split by year, month, or day. Each day's incremental ODS data is merged with the previous day's corresponding DWD tables.
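         The daily merge can be sketched as follows, assuming each record has a primary key id and that a changed row fully replaces the previous version. The field names (id, status, biz_date) are hypothetical.

```python
def merge_into_dwd(previous_dwd, ods_increment, biz_date):
    """Build today's DWD snapshot from yesterday's snapshot plus today's changes."""
    merged = {row["id"]: row for row in previous_dwd}
    for row in ods_increment:
        # New or changed rows replace the old version and are stamped with the
        # business date, so that a single day can be re-run later.
        merged[row["id"]] = {**row, "biz_date": biz_date}
    return [merged[key] for key in sorted(merged)]


dwd_yesterday = [
    {"id": 1, "status": "paid", "biz_date": "2021-01-01"},
    {"id": 2, "status": "created", "biz_date": "2021-01-01"},
]
ods_today = [{"id": 2, "status": "paid"}, {"id": 3, "status": "created"}]

dwd_today = merge_into_dwd(dwd_yesterday, ods_today, "2021-01-02")
for row in dwd_today:
    print(row)
```

         Since the function does not modify its inputs, rerunning a given business date with the same inputs produces the same snapshot, which is the property the business-date field is meant to support.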

DW (B/S) data summary layer
Function:
         Provides fine-grained data for the DW and ST layers, and is subdivided into DWB and DWS;
         DWB transforms the DWD detail data, for example converting dimensions to surrogate keys, cleaning ID card numbers, cleaning member registration sources, merging fields, handling null values, handling dirty data, cleaning and converting IP addresses, cleaning account balances, and cleaning fund sources;
         DWS summarizes and aggregates DWB data at a coarse grain by each dimension ID, for example summarizing by transaction source and transaction type.

Modeling methods and principles:

         Aggregate and summarize the data, and add derived facts;

         Join fact tables with those of other subjects; the DW layer may cross subject domains;

         DWB keeps fine-grained, lightly summarized data, and DWS keeps coarse-grained, highly summarized data;

         The data model may use a denormalized design, merging information where useful.
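         A rough sketch of the DWD to DWB to DWS flow described above: DWB keeps the detail grain but swaps natural values for surrogate keys and handles nulls, and DWS then summarizes by transaction source and transaction type. The sample trades are made up, and treating a null amount as 0.0 is an assumption chosen for the example, not a rule from the original design.

```python
from collections import defaultdict

dwd_trades = [
    {"source": "app", "type": "pay", "amount": 10.0},
    {"source": "app", "type": "pay", "amount": None},  # null amount
    {"source": "web", "type": "refund", "amount": 4.0},
    {"source": "app", "type": "pay", "amount": 6.0},
]


def assign_surrogate_keys(values):
    """Map each distinct natural value to a small integer surrogate key."""
    return {value: key for key, value in enumerate(sorted(set(values)), start=1)}


source_keys = assign_surrogate_keys(r["source"] for r in dwd_trades)
type_keys = assign_surrogate_keys(r["type"] for r in dwd_trades)

# DWB: same grain as DWD, with surrogate keys and nulls handled.
dwb_trades = [
    {
        "source_id": source_keys[r["source"]],
        "type_id": type_keys[r["type"]],
        "amount": r["amount"] if r["amount"] is not None else 0.0,
    }
    for r in dwd_trades
]

# DWS: summarized by (source_id, type_id).
dws_summary = defaultdict(lambda: {"trade_count": 0, "total_amount": 0.0})
for r in dwb_trades:
    cell = dws_summary[(r["source_id"], r["type_id"])]
    cell["trade_count"] += 1
    cell["total_amount"] += r["amount"]

print(source_keys, type_keys)
print(dict(dws_summary))
```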

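Data mart layer
Function:
         Can be a set of wide tables: based on DW-layer data, the fact fields that need to be queried are summarized by individual dimensions or combinations of dimensions and stored as separate columns;
         Serves specific query and data mining needs;
         Stores the data for application marts.

Modeling methods and principles:
         Minimize computation at data access time (optimize for retrieval);
         Dimensional modeling, star schema;
         Store data in split tables.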
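         The wide-table idea can be sketched like this: facts are pre-aggregated per dimension value and stored as separate columns, so a query reads a value instead of computing it. The user_id and channel fields and the CHANNELS list are assumptions for the illustration.

```python
from collections import defaultdict

dws_rows = [
    {"user_id": 1, "channel": "app", "amount": 30.0},
    {"user_id": 1, "channel": "web", "amount": 12.0},
    {"user_id": 2, "channel": "app", "amount": 5.0},
]
CHANNELS = ("app", "web")


def build_wide_table(rows):
    """One row per user, one pre-computed column per channel plus a total."""
    wide = defaultdict(lambda: {f"amount_{c}": 0.0 for c in CHANNELS})
    for r in rows:
        wide[r["user_id"]][f"amount_{r['channel']}"] += r["amount"]
    return [
        {"user_id": uid, **cols, "amount_total": sum(cols.values())}
        for uid, cols in sorted(wide.items())
    ]


dm_user_wide = build_wide_table(dws_rows)
for row in dm_user_wide:
    print(row)
```

         The trade-off is extra storage and a rebuild whenever a new channel appears, in exchange for little or no computation at access time.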

ST data application layer (ADS layer)
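Function:
         The ST layer serves user applications and analysis needs, including front-end reports, analysis charts, KPIs, dashboards, OLAP, and thematic analysis. It is aimed at the end users of the results;
         Suitable for OLAP and reporting models such as ROLAP and MOLAP;
         Built from coarse-grained fact tables produced by aggregating and summarizing DW-layer data.

Online transaction processing (OLTP) and online analytical processing (OLAP):
OLTP is the main application of traditional relational databases. It covers basic, day-to-day transaction processing, such as bank transactions.
OLAP is the main application of data warehouse systems. It supports complex analytical operations, focuses on decision support, and provides query results that are intuitive and easy to understand.

The users of OLAP are the professional analysts and management decision makers in an enterprise. When they analyze business operation data, looking at business metrics from different angles is a natural way of thinking. For example, an analysis of sales data may weigh time period, product category, distribution channel, geographic distribution, customer segment, and other factors together.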

Modeling methods and principles:
         Keep the data volume small;
         Dimensional modeling, star schema;
         Dimension surrogate keys plus measures;
         Add a business-date field to support data reruns;
         Do not split tables for storage.
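         A minimal star schema along these lines might look as follows, using SQLite from the Python standard library as a stand-in for the real storage engine. All table and column names (dim_product, dim_channel, st_sales, and so on) are hypothetical. The fact table holds only dimension surrogate keys, a business date, and a measure, and the same fact table is then viewed along two different dimensions, as an OLAP user would.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_channel (channel_key INTEGER PRIMARY KEY, channel TEXT);
-- Fact table: dimension surrogate keys + business date + measure.
CREATE TABLE st_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    channel_key INTEGER REFERENCES dim_channel(channel_key),
    biz_date TEXT,
    sales_amount REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'toys');
INSERT INTO dim_channel VALUES (1, 'online'), (2, 'store');
INSERT INTO st_sales VALUES
    (1, 1, '2021-01-01', 100.0),
    (1, 2, '2021-01-01', 40.0),
    (2, 1, '2021-01-01', 60.0);
""")


def sales_by(dim_table, key_column, label_column):
    """Aggregate the fact table along one dimension of the star."""
    # Identifiers come from the fixed calls below, not from user input.
    sql = (
        f"SELECT d.{label_column}, SUM(f.sales_amount) "
        f"FROM st_sales f JOIN {dim_table} d ON f.{key_column} = d.{key_column} "
        f"GROUP BY d.{label_column} ORDER BY d.{label_column}"
    )
    return db.execute(sql).fetchall()


by_category = sales_by("dim_product", "product_key", "category")
by_channel = sales_by("dim_channel", "channel_key", "channel")
print(by_category)  # [('books', 140.0), ('toys', 60.0)]
print(by_channel)   # [('online', 160.0), ('store', 40.0)]
```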

Summary

         This article mainly explained why a data warehouse project is layered. For a given requirement, a single complex SQL statement may be all we need. But is a complex SQL statement easy to maintain later? Is it easy to trace when something goes wrong? This is where the benefits of layering show. Along the way, I also shared what Alibaba's data warehouse model looks like. Believe in yourself: hard work and sweat will always be rewarded. I am Big Data Brother, see you next time~

For Flink, Spark, Hive, Hadoop, and Docker interview questions, essential software for programmers, resume templates, and other resources, download them from GitHub: https://github.com/lhh2002/Framework-Of-BigData


Origin blog.csdn.net/qq_43791724/article/details/112143401