The most complete data warehouse construction guide so far: bookmark it now!

Before we begin, let's look at the definition of a data warehouse.

A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical change and is used to support management decisions. The concept was first proposed in 1990 by Bill Inmon, known as the father of the data warehouse, in his book "Building the Data Warehouse", and it has been mentioned and applied more and more widely in recent years. If you don't believe it, take a look:

[Figure: rising attention to data warehouses in recent years]


What has made a concept first put forward in the 1990s grow so popular in recent years? With this question in mind, let's look at the real changes in the industry.

According to figures from the Bureau of Statistics, the digital economy has accounted for a growing share of GDP in recent years, reaching nearly 35% as of 2018. The gap between digital-economy growth and GDP growth has also gradually widened, with the digital economy growing far faster than GDP.

In 2014, the term "new normal" was first put forward. It called for starting from the stage-specific characteristics of China's current economic development, adapting to the new normal, and keeping a calm strategic mindset. Under the new normal, the information behind data is creating enormous value for the economy, and it will continue to do so in the future.

Against this background, industry keywords such as "data", "data analysis", "artificial intelligence", and "IoT" have all trended upward in the Baidu search index. As the transformation deepens and artificial intelligence and IoT technology are more widely accepted and applied, the volume of data generated behind them grows massively, and reliance on data keeps increasing.

So, back to the question at the beginning of the article: why has the data warehouse, a concept from the 1990s, become so popular in recent years? The answer is that, as times have changed, the value of data is being demanded, mined, and amplified without limit. Realizing that value requires a complete set of mechanisms for collecting, storing, exchanging, managing, and using data.

So the next question is: how do you properly build an enterprise data warehouse?

Don't panic, the practical content is coming! What follows is a detailed, applied methodology for building a data warehouse. Read it to the end, and if you find it useful, remember to like, save, and share it!

First, look at the system diagram:

[Figure: data warehouse construction system diagram]


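The data warehouse discussed here is built on a big data system and includes tag categories, which distinguishes it from a traditional data warehouse. Let's break the diagram down and briefly analyze each part.

First, preliminary research

Research is the foundation of building a data warehouse. According to the goals of the build, we divide research into three categories: business research, business system research, and data research.

Business research covers:

  • What business the project supports, and the characteristics and nature of that business

  • The current business process; real process forms and reports are best, using a concrete example to walk through the whole process

  • Business terminology, product documentation, rules and algorithms, logical conditions, and similar materials

  • Users' descriptions of the problems and pain points in the process, and their expectations

Business system research covers:

  • A clear picture of which systems the project has and the contact person for each, with a detailed introduction to the functions and interactions of the key systems

  • Overall system architecture, call volume, how subsystems interact, and concurrency and throughput targets

  • The systems' technology choices and current technical difficulties

Data research covers:

  • The data that can be provided

  • Data source types, environments, and data volume

  • Data interface methods: file interfaces, database interfaces, web service interfaces, and so on

  • The data catalog, field data types, dictionaries, field meanings, and usage scenarios

  • How data flows through the business systems, and so on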

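Second, data modeling

Data modeling is the soul of data warehouse construction; it is the blueprint for how data is stored and how its relationships are organized.

A layered architecture organizes data logically, distinguishing it by source, purpose of use, granularity, and so on. This makes data easier and more intuitive for data users to work with, and lets data managers manage it more efficiently and systematically. The layered architecture we recommend is: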

[Figure: recommended layered architecture]


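Dimensional modeling is the data modeling approach advocated by Kimball in "The Data Warehouse Toolkit", and it is the approach we currently recommend for big data scenarios. Because dimensional modeling starts from the needs of analysis and decision-making, the resulting data model serves analytical requirements. It therefore focuses on letting users complete their analysis faster, while also delivering good response performance for large-scale, complex queries.

The core steps of dimensional modeling are:

  • Select the business process: analyze the activities in the business life cycle

  • Declare the grain: choose the data granularity of the fact table

  • Design the dimensions: determine the dimension fields and the information held in the dimension tables

  • Design the facts: based on the grain and the dimensions, define the measures of the business process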
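The four steps above can be illustrated with a minimal sketch. It assumes a hypothetical retail "order placement" business process. The table and column names (`dim_date`, `dim_product`, `fact_order_line`) are illustrative assumptions rather than anything from the original methodology, and an in-memory SQLite database stands in for a real warehouse engine.

```python
import sqlite3

# Step 1: business process (assumed): customers placing orders.
# Step 2: grain (assumed): one fact row per order line.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Step 3: dimension tables describe the context of each fact row.
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- Step 4: the fact table holds the measures at the declared grain.
    CREATE TABLE fact_order_line (
        order_id INTEGER,
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        amount REAL
    );
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20190101, 2019, 1), (20190201, 2019, 2)])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "notebook", "stationery"), (2, "lamp", "furniture")])
conn.executemany("INSERT INTO fact_order_line VALUES (?, ?, ?, ?, ?)",
                 [(100, 20190101, 1, 2, 10.0),
                  (100, 20190101, 2, 1, 30.0),
                  (101, 20190201, 1, 5, 25.0)])


def sales_by_category(connection):
    """Aggregate the amount measure by a dimension attribute."""
    rows = connection.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_order_line AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
    return dict(rows)


print(sales_by_category(conn))
```

The analytical query only joins the fact table to the dimensions it needs, which is the property that makes this style of model convenient for analysis.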

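Design principles:

  • Ease of use: trade redundant storage for performance, push common computation down to lower layers, and keep detail and summary data side by side

  • High cohesion and low coupling: separate core content from extensions, merge related business processes, and take output time into account

  • Data isolation: isolate business systems from data systems, and isolate construction from use

  • Consistency: consistent business definitions, consistent key entities, and consistent naming conventions

  • Neutrality: weak business attributes, data-driven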

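Third, tag categories

Tags are the logical carriers of data assets. Data assets are data that can bring economic benefit to the business. Tag category construction therefore plays a central role in building the whole data center.

Tag design has to combine the state of the data with business needs: tag values are data field values, and tags also exist to serve the business, so they must carry business meaning. If tags are designed only from the business side's past experience, the tag values that are eventually developed may lose their usefulness, for example because values are unevenly distributed across tiers or because few records have a value at all.

By development method, we divide tags into three types:

  • Basic tags: map directly to business table fields, such as gender or city

  • Statistical tags: the definition contains routine statistical logic and is developed by applying simple rules, such as annual growth rate or average monthly rate of return

  • Algorithmic tags: the definition contains complex statistical logic and is developed with algorithmic models, such as enterprise credit score or forecast annual sales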
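As a rough sketch of the difference between the first two types, the example below derives one basic tag and two statistical tags from a single record. The record layout and every field name are hypothetical. Algorithmic tags are left out because they depend on a trained model.

```python
from statistics import mean

# Hypothetical source record for one enterprise; all field names are assumptions.
record = {
    "city": "Hangzhou",
    "annual_revenue": {2017: 100.0, 2018: 120.0},
    "monthly_returns": [0.01, 0.03, 0.02],
}


def basic_tag_city(rec):
    """Basic tag: copied directly from a business table field."""
    return rec["city"]


def stat_tag_annual_growth(rec, year):
    """Statistical tag: year-over-year growth rate computed by a simple rule."""
    current = rec["annual_revenue"][year]
    previous = rec["annual_revenue"][year - 1]
    return (current - previous) / previous


def stat_tag_avg_monthly_return(rec):
    """Statistical tag: mean of the monthly rates of return."""
    return mean(rec["monthly_returns"])


tags = {
    "city": basic_tag_city(record),
    "annual_growth_rate": round(stat_tag_annual_growth(record, 2018), 4),
    "avg_monthly_return": round(stat_tag_avg_monthly_return(record), 4),
}
print(tags)
```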

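By application scenario, we divide tags into two types:

  • Back-end tags: used in development scenarios, aimed at developers, not tied to business scenarios, and focused on tag design, development, and management.

  • Front-end tags: used in application scenarios, aimed at business staff, tied to business scenarios, and focused on using back-end tags directly or in combination.

As large numbers of tags are produced, we need to classify them so that they can be managed and used more effectively. Everything can be classified as one of three kinds of object: people, things, and relationships. We can therefore use people, things, and relationships as the first-level categories, then split each first-level category into second and third levels according to business characteristics. We generally recommend dividing tag categories into three levels.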
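One possible way to picture a three-level tag catalog is as a nested mapping. In the sketch below only the first level (people, things, relationships) comes from the text; the second-level and third-level category names and the tags themselves are invented for illustration.

```python
# Hypothetical three-level tag catalog. Level 1 follows the text (people,
# things, relationships); levels 2 and 3 are invented business categories.
catalog = {
    "people": {"customer": {"demographics": ["gender", "city"]}},
    "things": {"product": {"sales": ["forecast_annual_sales"]}},
    "relationships": {"purchase": {"behavior": ["monthly_order_count"]}},
}


def tag_paths(tree):
    """Yield 'level1/level2/level3/tag' for every tag in the catalog."""
    for level1, second in tree.items():
        for level2, third in second.items():
            for level3, tag_names in third.items():
                for tag in tag_names:
                    yield f"{level1}/{level2}/{level3}/{tag}"


paths = list(tag_paths(catalog))
for path in paths:
    print(path)
```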

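Fourth, development and implementation

After preliminary research, data modeling, and tag design, the project enters the development stage. Development and implementation consist of the following key steps:

  • Synchronization and aggregation

  • Cleansing and processing

  • Testing and validation

  • Scheduling configuration

  • Release and go-live

A craftsman who wants to do good work must first sharpen his tools. A good development tool has a decisive effect on development progress, cost, and quality. Many open-source tools on the market today, such as Kettle, Azkaban, and Hue, each provide part of the required functionality. But building end-to-end automated data production means combining several open-source tools and connecting them in complicated or even manual ways, which makes the whole process complex, inefficient, and unreliable. The 数栖云 one-stop offline development platform was created to solve these problems.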

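When development gets under way, standards come first: following a set of standards safeguards overall development quality and efficiency. Such a set of data development standards should cover the following core content:

  • Common standards

    • Layer call conventions

    • Data type standards

    • Data redundancy and splitting

    • Null-value handling principles

    • Refresh-cycle identifiers

    • Incremental/full-load identifiers

    • Life cycle management

    • ......

  • ODS layer model development standards

    • Table naming standards

    • Task naming standards

    • Data synchronization methods

    • Data cleansing standards

    • ODS layer architecture

    • Data synchronization and processing standards

    • Naming standards

  • DW layer model development standards

  • ......

With tools plus standards, development and implementation can be completed quickly and well.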

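Fifth, governance and maintenance

As scheduled jobs and data volume grow, management and maintenance become an important task.

Data management has a broad scope and runs through the entire life cycle of data, from collection to application to value realization. Data management means managing the data life cycle in order to improve the quality of data assets and to promote the value of data in two respects: "adding value internally and improving efficiency externally". Its core content is:

  • Data standards management

  • Data model management

  • Metadata management

  • Master data management

  • Data quality management

  • Data security management

Data monitoring safeguards data quality. Monitoring policies are defined from data quality rules, and the relevant people are notified automatically when a rule is triggered. The basic dimensions of data quality monitoring are:

  • Completeness

    • Specific completeness: fields that must have a value may not be empty

    • Conditional completeness: under a given condition, the field value must always be present

  • Uniqueness

    • Specific uniqueness: the field must be unique

    • Conditional uniqueness: under a given business condition, the field value must be unique

  • Validity

    • Range validity: field values must fall within a specified range

    • Date validity: when the field is a date, its value must be a valid date

    • Format validity: field values must match a specified format

  • Consistency

    • Referential consistency: where data or business entities have a reference relationship, it must be kept consistent

    • Data consistency: data must remain consistent before and after collection, processing, or migration

  • Accuracy

    • Logical correctness: the business logic relating different pieces of data must hold

    • Computational correctness: the result of a composite metric should agree with the source data and the calculation logic

    • State correctness: the generation, collection, and update cycles of data must be properly maintained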
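A few of these dimensions can be sketched as simple rule checks. The rows, field names, and thresholds below (an age between 0 and 120, an 11-digit phone number starting with 1) are assumptions chosen for illustration, and the final `print` stands in for the automatic notification a real monitoring system would send.

```python
import re
from datetime import datetime

# Hypothetical rows; field names and rules are assumptions for illustration.
rows = [
    {"id": 1, "age": 30, "signup": "2019-01-15", "phone": "13800000000"},
    {"id": 2, "age": 150, "signup": "2019-02-30", "phone": "12345"},
    {"id": 2, "age": None, "signup": "2019-03-01", "phone": "13900000000"},
]


def is_valid_date(text):
    """Date validity: the value must parse as a real calendar date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def check_quality(data):
    """Return a list of (rule, row index) pairs for every violated rule."""
    issues = []
    seen_ids = set()
    for i, row in enumerate(data):
        if row["age"] is None:                            # completeness
            issues.append(("completeness:age", i))
        elif not 0 <= row["age"] <= 120:                  # range validity
            issues.append(("range:age", i))
        if row["id"] in seen_ids:                         # uniqueness
            issues.append(("uniqueness:id", i))
        seen_ids.add(row["id"])
        if not is_valid_date(row["signup"]):              # date validity
            issues.append(("date:signup", i))
        if not re.fullmatch(r"1\d{10}", row["phone"]):    # format validity
            issues.append(("format:phone", i))
    return issues


issues = check_quality(rows)
for rule, index in issues:
    print(f"row {index}: {rule}")
```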

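When a data anomaly occurs, the data must be recovered quickly. Depending on the anomaly and the repair scenario, there are several data operations approaches:

  • Anomalies caused by platform environment problems

    • Rerun: once the environment problem is resolved, reschedule the job to repair that day's data

    • Rerun downstream: once the environment problem is resolved, reschedule the job at a given workflow node together with its downstream jobs, repairing that day's data for the job and everything downstream of it

  • Anomalies caused by business logic changes or code bugs

    • Backfill: after the job's code is updated and republished to production, regenerate that job's data for the affected time period

    • Backfill downstream: after the job's code is updated and republished to production, regenerate the data of that job and its downstream jobs for the affected time period

  • Other

    • Terminate: stop a job that is currently executing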
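"Rerun downstream" and "backfill downstream" both need the set of jobs that depend on a given node, in an order that respects those dependencies. The sketch below shows one way to compute it, assuming the workflow is an acyclic graph held in a plain dictionary. The job names are hypothetical, and a real scheduler would supply its own dependency metadata.

```python
from collections import deque

# Hypothetical workflow: each job maps to the jobs that consume its output.
downstream = {
    "ods_orders": ["dw_order_detail"],
    "dw_order_detail": ["dw_order_daily", "tag_customer_value"],
    "dw_order_daily": ["report_sales"],
    "tag_customer_value": [],
    "report_sales": [],
}


def rerun_plan(graph, start):
    """Return `start` plus all transitive downstream jobs in a valid run order."""
    # Collect every job reachable from the starting node.
    affected, queue = {start}, deque([start])
    while queue:
        for child in graph[queue.popleft()]:
            if child not in affected:
                affected.add(child)
                queue.append(child)
    # Topological sort (Kahn's algorithm) restricted to the affected jobs.
    indegree = {job: 0 for job in affected}
    for job in affected:
        for child in graph[job]:
            indegree[child] += 1
    ready = deque(sorted(job for job, degree in indegree.items() if degree == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in graph[job]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order


plan = rerun_plan(downstream, "dw_order_detail")
print(plan)
```

For a backfill, the same plan would be executed once for each date in the affected time period.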

Data security is mainly about protecting data, including core data and private data, from theft, tampering, and misuse, while ensuring that data systems run safely and reliably. It requires a data security framework spanning the system level, the data level, and the service level, protecting big data applications and data from several dimensions: technical safeguards, management safeguards, process safeguards, and operations safeguards.

  • System level

    • Technical architecture

    • Network transmission

    • Tenant isolation

    • Permission management

  • Data level

    • Data assessment: assess the legality of data sources, uses, and so on

    • Data desensitization: mask private data

    • Data permissions: grant different permissions according to users' roles and needs

    • Lineage tracing: establish data lineage so that the origin and flow of data can be traced

    • Download limits: limit how much of a result data set can be downloaded, to prevent data leakage

  • Service level

    • Application monitoring: monitor which endpoints use the data, how often, the traffic volume, and so on

    • Interface management: creation and management of data output interfaces

    • Data desensitization
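Desensitization and role-based data permissions can be combined in a small sketch like the one below. The masking rule (keep the first three and last four digits of a phone number), the role names, and the field names are all assumptions for illustration; a production system would follow its own security policy and manage the salt as a secret.

```python
import hashlib


def mask_phone(phone):
    """Keep the first 3 and last 4 characters and hide the rest."""
    if len(phone) < 8:
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def pseudonymize(value, salt):
    """Replace an identifier with a salted SHA-256 digest (one-way)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Hypothetical role-to-field permissions; role and field names are assumptions.
VISIBLE_FIELDS = {
    "analyst": {"city", "phone"},
    "admin": {"city", "phone", "id_number"},
}


def view_for_role(row, role, salt="demo-salt"):
    """Return only the fields a role may see, masking the sensitive ones."""
    out = {}
    for field in VISIBLE_FIELDS[role]:
        value = row[field]
        if field == "phone":
            value = mask_phone(value)
        elif field == "id_number":
            value = pseudonymize(value, salt)
        out[field] = value
    return out


row = {"city": "Hangzhou", "phone": "13812345678", "id_number": "ID-0001"}
print(view_for_role(row, "analyst"))
```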

Sixth, data applications

Enabling the business is the ultimate expression of data's value; this is what we mean by putting data to work in the business. There are two directions for doing so: business optimization and business innovation. To make it easier to serve upper-layer applications, we first expose data as data service interfaces and then let business applications call those interfaces directly. In other words, data is turned into services, and services are turned into business.

How do you achieve business optimization and business innovation with existing products, methodology, and best practices? Here is a complete map that helps you understand the whole process faster.

[Figure: complete map of the process from data to business optimization and innovation]


That is the experience we have accumulated and summarized from our data warehouse construction practice. You are welcome to discuss and debate it with us, and to send in your own contributions. If you found this article helpful, don't forget to share it so that more people can see it.

If you are interested in 数栖云, you are also welcome to visit: dtcloud.dtwave.com


Origin blog.51cto.com/14463231/2458413