Building an Offline Data Warehouse Around the Company's Business


Foreword

Technology serves the business, and the business creates value for the company. Technology divorced from the business is meaningless.

Business introduction

The company is a consumer-facing (ToC) fintech enterprise that builds different products for users with different needs, so it has many business lines. For the data department, however, the data from every business line is simply a data source, and data is classified not only by business line but also by the attributes of the data itself.

Early planning

In the early days, each business line had its own data team, and the teams did not interfere with one another. This model is relatively simple: each team only builds the data warehouse and develops the reports for its own business line.

As the business grew, however, more and more vertical business units iterated frequently and worked across departments, and the business lines became coupled. At that point this kind of siloed ("chimney") development ran into problems:

One is permissions. The company manages data strictly, so different data development groups cannot share data freely, and access to another business line's data has to go through an approval process, which is time-consuming.

Another is duplicated development. Different business lines often need the same reports, and having each one build its own wastes resources.

Data development therefore needs to manage the data of all business lines in a unified way, which is how the data middle platform came about.

Data middle platform

In my view, a data middle platform is built around each company's specific business needs, and different businesses understand the term differently.

The agile data middle platform the company developed in-house covers everything from reusing data technology and computing power to reusing data assets and data services. It lets data empower the business directly, with greater bandwidth, speed, accuracy, and precision. It provides unified management, breaks down data silos, traces data lineage, and enables self-service and a high degree of reuse.

As shown below:

(Figure: data middle platform)

The description above is fairly abstract, so let's look at what the data middle platform makes easier in actual project work.

Take our old report development process as an example. First we had to ingest data: different data sources were loaded into the big data platform with Sqoop and similar tools. Then we built the data warehouse and produced the report data, which was displayed in a visualization system. Finally, the whole process had to be written as scripts and placed on a scheduling platform for automated execution.

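With the data middle platform, none of that overhead is needed: we build the data warehouse and produce the reports directly, without spending much effort on data sources, visualization, or scheduling. We can also inspect data lineage directly, because the platform computes the lineage between tables. As in the figure below, the dependencies between tables are explicit:

(Figure: lineage between tables)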

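Another point is that the data middle platform makes join queries across heterogeneous data systems very simple, for example joining a Hive table with a MySQL table. It transparently hides the different ways of interacting with heterogeneous data systems, so mixed computation across them is easy.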

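Cross-system access works because the data middle platform maintains a mapping from virtual tables to physical tables. End users do not need to care where the data is physically stored or what the underlying data source looks like; they operate on the data directly, much as if they were working with a single virtual database.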
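The virtual-to-physical mapping can be pictured with a small lookup table. This is only a sketch of the idea, not the platform's actual implementation: the catalog contents, the virtual table names (`v_orders`, `v_city`), and the `resolve` and `sources_needed` helpers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalTable:
    source: str  # kind of backing system, e.g. "hive" or "mysql"
    db: str
    table: str


# Hypothetical catalog: virtual table name -> physical location.
CATALOG = {
    "v_orders": PhysicalTable("hive", "dw", "dw_xxsh_fact_orders"),
    "v_city": PhysicalTable("mysql", "biz", "city"),
}


def resolve(virtual_name: str) -> PhysicalTable:
    """Return the physical table behind a virtual table name."""
    try:
        return CATALOG[virtual_name]
    except KeyError:
        raise KeyError(f"unknown virtual table: {virtual_name}") from None


def sources_needed(virtual_names):
    """List the distinct backing systems a query over these tables touches."""
    return sorted({resolve(name).source for name in virtual_names})


print(sources_needed(["v_orders", "v_city"]))
```

A query that names only virtual tables can then be planned against whatever systems the catalog points to, which is why the user never has to know that one side of a join lives in Hive and the other in MySQL.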

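The data middle platform also integrates visualization, providing a one-stop data visualization solution: it supports JDBC data sources and CSV file uploads, generates visualization components by drag and drop on top of a data model, and large-screen dashboards adapt to screens of different sizes.

The scheduling system was written in-house and integrated into the data middle platform, so a SQL statement can be scheduled as soon as it is written.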

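Data warehouse construction

Only now do we get to the data warehouse itself. So much space went to the company's business and the data middle platform because the data warehouse described below is built around the company's business development and the existing data middle platform. Data warehouse construction cannot be separated from the company's business.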

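Intelligent data warehouse planning

The core idea of data warehouse construction: at the design, development, deployment, and usage levels, avoid duplicated construction and redundant metrics, so that data definitions stay standardized and consistent. The end goals are end-to-end linkage of data assets, standard data output, and a unified common data layer.

With the core idea in place, how do we start? There is a saying that a data warehouse builder is both a technical expert and more than half a business expert, so the approach is to let requirements drive data construction. Because of the data middle platform, the knowledge of each business is concentrated and business data is no longer scattered, which speeds up data warehouse construction.

Construction proceeds along two lines, models and conventions, both unified across all business lines.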

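  • Model

All business lines use one unified model system, which lowers development cost, increases metric reuse, and keeps data definitions consistent.

  • Model layering

Given the company's business, many new requirements are expected later, so there should not be too many layers, and the responsibility of each layer must be clear. The data layers must stay stable while shielding downstream consumers from changes, so we use the following layered structure:

(Figure: data layering architecture)

  • Data flow

Following the layered structure used in model development, data flows forward as ods -> dw -> dm -> app. This prevents the tangled data pipelines and hard-to-guarantee SLA timeliness that irregular references cause, and it keeps lineage simple so the data flow is easy to trace.
During development, avoid the following: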

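  1. Incorrect reference chains. A chain such as ods -> dm -> app means the detail layer does not fully cover the data; a chain such as ods -> dw -> app means the subject areas of the lightly aggregated layer are incomplete. Reducing cross-layer references raises the reuse of intermediate tables. An ideal data warehouse model is reusable, complete, and standardized.

  2. As far as possible, avoid building a table from another table in the same layer, for example a dw table from a dw table, because this hurts ETL efficiency.

  3. Reverse dependencies are forbidden, for example a dw table depending on a dm table.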
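The three rules above can be checked mechanically once every table's layer is known. The sketch below assumes the layer can be read from the table-name prefix (as in names like dm_xxsh_user); the function names and the result labels ("ok", "cross-layer", "same-layer", "reverse") are illustrative choices, not part of the company's tooling.

```python
# Layers in the order data is supposed to flow.
LAYERS = ["ods", "dw", "dm", "app"]


def layer_of(table_name: str) -> str:
    """Read the layer from the table-name prefix, e.g. 'dm_xxsh_user' -> 'dm'."""
    prefix = table_name.split("_", 1)[0]
    if prefix not in LAYERS:
        raise ValueError(f"cannot determine layer of table: {table_name}")
    return prefix


def check_dependency(upstream: str, downstream: str) -> str:
    """Classify one edge 'downstream is built from upstream'."""
    step = LAYERS.index(layer_of(downstream)) - LAYERS.index(layer_of(upstream))
    if step == 1:
        return "ok"           # adjacent layers, forward direction
    if step == 0:
        return "same-layer"   # rule 2: e.g. dw table built from dw table
    if step < 0:
        return "reverse"      # rule 3: e.g. dw table depending on dm table
    return "cross-layer"      # rule 1: e.g. ods -> dm skips dw


print(check_dependency("ods_xxsh_user", "dm_xxsh_user"))
```

Running such a check over every lineage edge before a task is published is one way to keep the ods -> dw -> dm -> app flow from eroding over time.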

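  • Conventions

  • Table naming convention

    1. Table names in the ods, dm, and app layers: type_subject_meaning, for example dm_xxsh_user

    2. Table names in the dw layer: type_subject_dimension_meaning, for example dw_xxsh_fact_users (fact table) and dw_xxsh_dim_city (dimension table)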
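A naming convention like this is easy to enforce with a pair of regular expressions. The sketch below makes a few assumptions the article does not state: names are lowercase letters, digits, and underscores; the subject is a single token without underscores; and fact and dim are the only markers used in the dw layer.

```python
import re

# ods/dm/app tables: <layer>_<subject>_<meaning>
PLAIN = re.compile(r"^(ods|dm|app)_[a-z0-9]+_[a-z0-9_]+$")
# dw tables: dw_<subject>_<fact|dim>_<meaning>
DW = re.compile(r"^dw_[a-z0-9]+_(fact|dim)_[a-z0-9_]+$")


def is_valid_table_name(name: str) -> bool:
    """Return True if the name matches one of the two patterns above."""
    return bool(PLAIN.match(name) or DW.match(name))


for candidate in ["dm_xxsh_user", "dw_xxsh_fact_users", "dw_xxsh_users"]:
    print(candidate, is_valid_table_name(candidate))
```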

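  • Field naming convention
    Build root words. Root words are the basis of dimension and metric management and are divided into common roots and proprietary roots.

    1. Common root: the smallest unit that describes something, for example sex (gender).

    2. Proprietary root: a term specific to the industry or defined inside the company, for example xxsh, the company's internal name for a product.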

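  • Script naming convention
    Script name: script type.script purpose.[database name].script name, for example hive.hive.dm.dm_xxsh_users
    Script types fall into three categories:

    1. Regular Hive SQL: hive

    2. Custom shell script: sh

    3. Custom Python script: python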
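To make the four-part script name concrete, here is a small parser for it. Two points are assumptions rather than facts from the article: the square brackets around the database name are read as "optional", and none of the parts may contain a dot. The `ScriptName` type and `parse_script_name` function are hypothetical names.

```python
from typing import NamedTuple, Optional

# The three script types listed in the convention.
SCRIPT_TYPES = {"hive", "sh", "python"}


class ScriptName(NamedTuple):
    script_type: str
    purpose: str
    db: Optional[str]
    name: str


def parse_script_name(full_name: str) -> ScriptName:
    """Split 'type.purpose.[db].name' into its parts and validate the type."""
    parts = full_name.split(".")
    if len(parts) == 4:
        script_type, purpose, db, name = parts
    elif len(parts) == 3:  # database name omitted
        script_type, purpose, name = parts
        db = None
    else:
        raise ValueError(f"expected 3 or 4 dot-separated parts: {full_name}")
    if script_type not in SCRIPT_TYPES:
        raise ValueError(f"unknown script type: {script_type}")
    return ScriptName(script_type, purpose, db, name)


print(parse_script_name("hive.hive.dm.dm_xxsh_users"))
```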

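  • Script content convention

# Variable definitions must follow Python syntax
# Task owner
owner = "[email protected]"
# Script directory: /opt/xxx
# Script name: hive.hive.dm.dm_xxsh_users
# source identifies the upstream tables; if a task has several upstream tables, list all of them
# (only the xxx_name values need to be changed; leave everything else as is)
source = {
    "table_name": {
        "db": "db_name",
        "table": "table_name"
    }
}
# Same shape as source, but each task has exactly one target table
target = {
    "db_table": {
        "host": "hive",
        "db": "db_name",
        "table": "table_name"
    }
}
# Variable list
# $now
# $now.date  commonly used, example format: 2020-12-11


task = '''
Write the SQL code here
'''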
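Because the header is plain Python, a scheduler or lineage tool could read `source` and `target` without running the task. The sketch below is one hedged way to do that with the standard `ast` module; it is not the company's scheduler. The sample script, its owner address, its table names, and the `read_header` and `lineage_edges` helpers are all made up for illustration, and the sample uses double-quoted triple quotes for `task` only so it can be embedded in a string here.

```python
import ast

# Hypothetical script following the content convention above.
SCRIPT = '''
owner = "someone@example.com"
source = {
    "dw_xxsh_fact_users": {"db": "dw", "table": "dw_xxsh_fact_users"},
    "dw_xxsh_dim_city": {"db": "dw", "table": "dw_xxsh_dim_city"},
}
target = {
    "db_table": {"host": "hive", "db": "dm", "table": "dm_xxsh_users"},
}
task = """
select 1
"""
'''


def read_header(script_text: str) -> dict:
    """Collect top-level 'name = literal' assignments without executing the script."""
    header = {}
    for node in ast.parse(script_text).body:
        if isinstance(node, ast.Assign) and len(node.targets) == 1:
            name = node.targets[0]
            if isinstance(name, ast.Name):
                # literal_eval raises ValueError if the value is not a literal.
                header[name.id] = ast.literal_eval(node.value)
    return header


def lineage_edges(header: dict) -> list:
    """Return (upstream, downstream) pairs as 'db.table' strings."""
    tgt = next(iter(header["target"].values()))  # one target per task
    downstream = f'{tgt["db"]}.{tgt["table"]}'
    return [(f'{src["db"]}.{src["table"]}', downstream)
            for src in header["source"].values()]


for edge in lineage_edges(read_header(SCRIPT)):
    print(edge)
```

Parsing instead of executing means a malformed or malicious script cannot run code inside the tool that merely wants to know its dependencies.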

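Layer-by-layer implementation

Four figures illustrate how each layer is implemented.

  • Data source layer (ODS)

(Figure: data source layer)

The data source layer mainly imports the data of each business line into the big data platform, where it is stored as a snapshot of the business data.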

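  • Data detail layer (DW)

(Figure: data detail layer)

Each row in a fact table corresponds to one measurement, and the data in each row is detail at a specific level, called the grain. One of the core principles of dimensional modeling is that all measures in the same fact table must have the same grain, which ensures that measures are not double counted.

Dimension tables usually have a single-column primary key, and a few have a composite key. Make sure a dimension table contains no duplicate rows; otherwise joining it to a fact table fans out the data, multiplying fact rows.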
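The fan-out problem is easy to reproduce. The sketch below uses an in-memory SQLite database purely for illustration (the article's warehouse runs on Hive); the fact_orders and dim_city tables and their rows are invented. The dimension table deliberately contains city_id 10 twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table fact_orders (order_id integer, city_id integer, amount real);
create table dim_city (city_id integer, city_name text);
insert into fact_orders values (1, 10, 100.0), (2, 10, 50.0), (3, 20, 30.0);
-- city_id 10 appears twice: the duplicate the text warns about
insert into dim_city values (10, 'Beijing'), (10, 'Beijing'), (20, 'Shanghai');
""")


def total_amount(dim_sql: str) -> float:
    """Sum order amounts after joining the fact table to the given dimension query."""
    query = (
        "select sum(f.amount) from fact_orders f "
        f"join ({dim_sql}) d on f.city_id = d.city_id"
    )
    return conn.execute(query).fetchone()[0]


# Joining against the raw dimension table counts orders 1 and 2 twice.
inflated = total_amount("select * from dim_city")
# Deduplicating the dimension first restores the true total.
correct = total_amount("select distinct city_id, city_name from dim_city")

# A simple guard: list dimension keys that occur more than once.
duplicates = conn.execute(
    "select city_id, count(*) from dim_city group by city_id having count(*) > 1"
).fetchall()

print(inflated, correct, duplicates)
```

The true total is 180.0, but the join against the duplicated dimension reports 330.0. A duplicate-key check like the last query is a cheap test to run whenever a dimension table is refreshed.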

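Sometimes it is hard to tell whether a column is a fact or a dimension attribute. Remember that the most useful facts are numeric and additive. So ask whether the column is a measure that takes many values and participates in calculations; if so, it is usually a fact. If the column describes a specific value, is text or a constant, and takes part in constraints and row labels, it is usually a dimension attribute. The final decision still has to be made in the business context.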

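  • Lightly aggregated data layer (DM)

(Figure: lightly aggregated data layer)

As the name says, this layer starts aggregating data, but not completely: it joins and aggregates data of the same grain. Related data of different grains can also be aggregated here, in which case the grains must first be unified through aggregation or similar operations.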
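Unifying grain before combining can be shown with two small datasets. In this sketch, which uses invented data and field names, orders are at order grain while logins are already at user-per-day grain, so the orders are first rolled up to user-per-day and only then merged.

```python
from collections import defaultdict

# Order grain: one row per order.
orders = [
    {"user_id": 1, "dt": "2020-12-11", "amount": 100.0},
    {"user_id": 1, "dt": "2020-12-11", "amount": 50.0},
    {"user_id": 2, "dt": "2020-12-11", "amount": 30.0},
]
# User-per-day grain: one row per user and date.
logins = [
    {"user_id": 1, "dt": "2020-12-11", "login_cnt": 3},
    {"user_id": 2, "dt": "2020-12-11", "login_cnt": 1},
]


def summarize(orders, logins):
    """Roll orders up to (user_id, dt), then merge with the login rows."""
    order_amt = defaultdict(float)
    order_cnt = defaultdict(int)
    for order in orders:
        key = (order["user_id"], order["dt"])
        order_amt[key] += order["amount"]
        order_cnt[key] += 1
    summary = []
    for login in logins:
        key = (login["user_id"], login["dt"])
        summary.append({
            **login,
            "order_cnt": order_cnt.get(key, 0),
            "order_amt": order_amt.get(key, 0.0),
        })
    return summary


for row in summarize(orders, logins):
    print(row)
```

Merging the order rows directly with the login rows would repeat each user's login count once per order; aggregating first keeps one row per user and date.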

  • Data application layer (APP)

(Figure: data application layer)

Tables in the data application layer are what users consume. At this point the data warehouse construction is nearly complete. What follows is serving data according to different needs: displaying reports directly, handing data to data analysis colleagues, or supporting other business needs.

Summary

One figure summarizes the overall process of building the data warehouse:

(Figure: data warehouse)

Things to watch in production

Operations in production cannot be as casual as in our own testing; a moment of carelessness can cause a production incident. Every step must therefore be taken with great care: stay fully focused, think before you act, and keep your hands in check.

The following considerations are a non-exhaustive list:

  • Do not touch database tables other than those you manage or are authorized for;

  • Do not touch other people's scripts and files in production without authorization;

  • Before modifying a production script, be sure to back it up locally;

  • Confirm that your changes can be rolled back quickly;

  • Follow the naming conventions for all table and field names in production.




This article is shared from the WeChat public account "Five Minutes to Learn Big Data" (gh_d4a7af3ecd50).