Data Governance and Data Middle Platform Architecture

With the advent of the Industry 4.0 era, the digital transformation of traditional industries is the general trend; elevating data to the level of a data element, and letting traditional technologies play new roles in new scenarios, is the focus of recent research and discussion. Datablau (Shuyu Technology) has supported and served traditional industries for many years, focusing on data modeling and data architecture design. This article centers on data asset modeling, introduces Datablau's techniques for data governance and data middle platform architecture, and shares enterprise practice cases.

Today's article revolves around the following three points:

  1. Data Architecture and Data Model Overview
  2. Data Architecture and Model Solutions
  3. Large-enterprise practice cases

01. Overview of data architecture and data models

  1. DAMA DMBOK Data Architecture and Data Governance

Data architecture and data model management are important components of a data governance system. Much as PMI and the PMP shaped project management, DAMA (the Data Management Association) was founded internationally in 1980. DAMA distilled the experience of hundreds of experts into an industry-wide data management framework, the DMBOK. The DAMA-DMBOK framework (also known as the DAMA wheel) is organized into 11 knowledge areas, among which data architecture and data modeling are two of the most important dimensions of the methodology.

Data architecture is mainly used to identify the data requirements of the enterprise and design the blueprint; its final outputs are the data architecture design and an implementation roadmap, as shown in the figure below.

[Figure: data architecture design and implementation roadmap]

  2. The process of building a data model

[Figure: the process of building a data model]

The industry's general methodology for building a data model is as follows:

① The early design stage focuses on the business: the conceptual model and logical model are completed based on customer needs;

② Next, based on the enterprise's existing technical environment and performance requirements, the conceptual and logical models are converted into a practical physical model;

③ Then the physical model is transformed into concrete database table structures, with the corresponding DDL scripts created, finally yielding the database table fields (a minimal sketch of this step follows the list);

④ For important nodes in the model design and implementation process, a set of corresponding enterprise standards is often formed to achieve standardization.
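
As an illustration of steps ② and ③, the sketch below "compiles" a small logical entity into a DDL script. It is a minimal sketch only; the entity layout, the type mapping, and the `customer` example are assumptions for illustration, not the format of any particular modeling tool.

```python
# Minimal sketch: "compiling" a logical entity into a physical DDL script.
# The entity structure and type mapping below are illustrative assumptions.

LOGICAL_TO_PHYSICAL_TYPES = {
    "Text": "VARCHAR(255)",
    "Integer": "INT",
    "Decimal": "DECIMAL(18,2)",
    "Date": "DATE",
}

def generate_ddl(entity: dict) -> str:
    """Render one logical entity as a CREATE TABLE statement."""
    columns = []
    for attr in entity["attributes"]:
        col_type = LOGICAL_TO_PHYSICAL_TYPES[attr["type"]]
        nullable = "" if attr.get("nullable", True) else " NOT NULL"
        columns.append(f'    {attr["name"]} {col_type}{nullable}')
    body = ",\n".join(columns + [f'    PRIMARY KEY ({entity["primary_key"]})'])
    return f'CREATE TABLE {entity["table"]} (\n{body}\n);'

customer = {
    "table": "customer",
    "primary_key": "customer_id",
    "attributes": [
        {"name": "customer_id", "type": "Integer", "nullable": False},
        {"name": "customer_name", "type": "Text", "nullable": False},
        {"name": "created_date", "type": "Date"},
    ],
}
print(generate_ddl(customer))
```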

Whether or not the source system was explicitly designed with a model, a data schema exists and can be extracted and refined into a model through reverse engineering. Such models describe the scope of data covered by the business system and the relationships among the data; a high-quality model helps the enterprise better understand the value of its data assets. It can therefore be said that every system has a data model, but some models are easier to understand and more likely to generate value for the enterprise.

  3. All models serve business development, differing only in perspective and stage

[Figure: models at different stages and from different perspectives]

In today's popular big data discourse, attention generally falls on the analytics side (the AP, or analytical processing, side). In fact, big data models cover not only the AP side but also the TP (transactional processing) side, that is, the enterprise's source business systems. In the course of informatization or digitization, various data products (or systems) are also built on top of them and ultimately serve the enterprise's internal or external customers.

For the underlying database design, most enterprises at this stage still follow the traditional construction paradigms:

① On the TP side, an Inmon-style model such as the third normal form (3NF) model is usually used;

② In the data marts on the AP side, Kimball-style dimensional models (such as the star schema and the snowflake schema) are usually used.
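
To make the contrast concrete, here is a hypothetical order example sketched in both paradigms; all table and column names are assumptions for illustration.

```python
# TP side, Inmon-style 3NF: each fact is stored once and linked by foreign keys.
ddl_3nf = """
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES customer (customer_id),
    order_date  DATE
);
"""

# AP side, Kimball-style star schema: one fact table surrounded by
# denormalized dimensions, optimized for roll-up and drill-down.
ddl_star = """
CREATE TABLE fact_sales (
    date_key     INT REFERENCES dim_date (date_key),
    customer_key INT REFERENCES dim_customer (customer_key),
    product_key  INT REFERENCES dim_product (product_key),
    sales_amount DECIMAL(18, 2)
);
"""
```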

In addition, newer data modeling paradigms have been iterated recently, such as the Data Vault model and the Unified Star Schema, which cover a wider range of scenarios and can be applied to both the TP side and the AP side.

  4. Data models classified by stage

[Figure: data models classified by stage]

① The business system model usually adopts the 3NF model;

② The ODS model usually lands data directly from the business systems, so the 3NF model is also chosen;

③ The DWD and DWS models, as the enterprise-level data warehouse, can be built with either the traditional 3NF model or the more recent Data Vault model, both of which support many-to-many relationships;

④ The data mart layer generally uses a dimensional model, which is convenient for analytical operations such as roll-up and drill-down.

  5. Data Model Introduction

[Figure: data model levels]

Data relationships are intricate: thousands of tables are interconnected through various relationships and constraints, forming a complex structure. Just as house floor plans and maps use agreed symbols to present information clearly to their users, a data model does the same for data.

Through the data model, users can clearly see the structure of the existing database and grasp key concepts more intuitively. Data models are divided into three main levels: the conceptual model, the logical model, and the physical model.

① Conceptual model: describes the conceptual structure of the world; it is a high-level data model consisting of core data entities (or collections of entities) and the relationships between them;

② Logical model: further decomposes and refines the conceptual model, describing entities, attributes, and entity relationships;

③ Physical model: a model oriented to a specific database, designed around that database's characteristics so that it can be implemented by a computer.

In model design, developers usually spend most of their time and energy on designing and iteratively optimizing the conceptual and logical models; producing the physical model is akin to a "compilation" of them: DDL scripts are generated and executed to create the database and its corresponding schema.

02. Data architecture and model solutions

  1. Solution 1: integration of model design and the development platform

The design of a logical or physical model can be carried out visually through ER diagrams. Take the figure below as an example: the model involves the three core Data Vault constructs of Hub, Link, and Satellite. With the Data Vault model, data warehouse automation becomes more flexible, models are decoupled more conveniently, and complex, business-oriented, in-depth industry models can be built.

[Figure: Data Vault model (Hub, Link, Satellite) in the ER designer]

After the model design is complete, the corresponding DDL script is generated, and the model is then managed and iterated through CREATE and ALTER statements; a hypothetical sketch of such DDL follows the figure.

[Figure: generated DDL script]
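
To make the three constructs concrete, here is a minimal, hypothetical Data Vault sketch around a supplier/purchase-order relationship; the table and key layouts are illustrative assumptions, not tool output.

```python
# Hub: the business key only.
hub_supplier = """
CREATE TABLE hub_supplier (
    supplier_hk   CHAR(32) PRIMARY KEY,  -- hash of the business key
    supplier_code VARCHAR(50) NOT NULL,  -- the business key itself
    load_date     TIMESTAMP,
    record_source VARCHAR(50)
);
"""

# Link: a many-to-many relationship between hubs.
link_supplier_order = """
CREATE TABLE link_supplier_order (
    link_hk       CHAR(32) PRIMARY KEY,
    supplier_hk   CHAR(32) REFERENCES hub_supplier (supplier_hk),
    order_hk      CHAR(32),              -- would reference hub_order
    load_date     TIMESTAMP,
    record_source VARCHAR(50)
);
"""

# Satellite: the descriptive, history-tracked attributes.
sat_supplier_detail = """
CREATE TABLE sat_supplier_detail (
    supplier_hk   CHAR(32) REFERENCES hub_supplier (supplier_hk),
    load_date     TIMESTAMP,
    supplier_name VARCHAR(200),
    credit_rating VARCHAR(10),
    PRIMARY KEY (supplier_hk, load_date)
);
"""
```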

  2. Solution 2: data standard control and data standard inspection

(1) Data standard control

In the model design stage, the model fields involved should be standardized. By specifying or referencing the relevant enterprise-level data standards, and with the help of intelligent recommendations, the selection of table fields becomes more convenient.

Data modeling tools generally provide data standard functionality. During model design, developers can reference data standards directly by drag-and-drop, or adopt intelligently recommended standards in the entity designer, optimizing how data standards are applied and improving model design efficiency.

As shown in the figure below, taking a power system model as an example: during table structure design, keywords (such as "transformer") can be directly associated with the corresponding data standards. The standard field name, physical type, length and precision, business definition, and other information can then be queried and introduced into the entity attributes, simultaneously standardizing field names, data types, and data precision, and thereby achieving quality control over the data models of source business systems. A toy version of this lookup is sketched after the figure.

[Figure: associating fields with data standards in a power system model]
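
A toy version of the association step might look like the following; the standard entry and the substring-matching rule are assumptions for illustration, not the tool's actual recommendation logic.

```python
# Hypothetical enterprise data standards, keyed by business term.
DATA_STANDARDS = {
    "transformer": {
        "standard_name": "transformer_code",
        "physical_type": "VARCHAR",
        "length": 20,
        "business_definition": "Unique code identifying a power transformer",
    },
}

def recommend_standard(field_name: str) -> dict | None:
    """Return the first standard whose term appears in the proposed field name."""
    for term, standard in DATA_STANDARDS.items():
        if term in field_name.lower():
            return standard
    return None

print(recommend_standard("main_transformer_id"))  # matches the "transformer" standard
```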

(2) Naming dictionary construction

If the enterprise or department has not formulated strict enterprise data standards, it can instead build a unified terminology dictionary (a naming dictionary) from business terms. Such a dictionary addresses the ambiguity-prone problem of "one indicator, many names" that commonly arises when developers model. During model construction, the names of model entities and attributes are automatically translated against the dictionary, enforcing naming conventions for the data model and raising the design quality of the physical model. A minimal sketch of this translation follows the figure.

[Figure: naming dictionary]
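
The sketch below assumes physical names are assembled from standardized abbreviations of business terms; the dictionary entries and greedy matching are hypothetical simplifications.

```python
# Hypothetical naming dictionary: business term -> standard abbreviation.
NAMING_DICTIONARY = {
    "客户": "cust",   # customer
    "订单": "order",  # order
    "编号": "no",     # number
}

def to_physical_name(logical_name: str) -> str:
    """Greedily match dictionary terms and join their abbreviations."""
    segments, rest = [], logical_name
    while rest:
        for term, abbr in NAMING_DICTIONARY.items():
            if rest.startswith(term):
                segments.append(abbr)
                rest = rest[len(term):]
                break
        else:
            raise ValueError(f"term not in naming dictionary: {rest!r}")
    return "_".join(segments)

print(to_physical_name("客户订单编号"))  # -> cust_order_no
```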

(3) Central model library

Multi-person collaborative modeling involves complex version management issues such as version iteration and version comparison. A Git-like central model repository can therefore be established: based on a data model server, it manages model design specifications, data standards, and model design deliverables online; it supplies the design tools with those specifications and standards, providing the means to implement data standards; and it supports matching and monitoring between design-state and runtime-state models, achieving online management of data models from standardized design through to application.

(4) Data specification tool

Build development rules into the modeling process, and provide a data specification tool and a data standard consistency check tool, to address pain points such as undisciplined design by developers and missing data standard constraints, minimizing the cost of data governance (a toy checker is sketched after the checks below):

[Figure: data specification tool]

① The data specification tool can check rules such as: the Chinese names of tables and fields must not be empty; the physical names of tables and fields must not be empty; and so on.

② The data standard consistency check tool can check whether data types, Chinese names, and English abbreviations are consistent with the referenced standard, and so on.

[Figure: data standard consistency check]
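
A toy checker covering the two rule families might look like this; the metadata shapes are assumptions for illustration.

```python
def check_specification(table: dict) -> list[str]:
    """Rule family ①: Chinese names of tables and fields must not be empty."""
    issues = []
    if not table.get("chinese_name"):
        issues.append(f"{table['physical_name']}: Chinese table name is empty")
    for field in table["fields"]:
        if not field.get("chinese_name"):
            issues.append(f"{field['physical_name']}: Chinese field name is empty")
    return issues

def check_standard_consistency(field: dict, standard: dict) -> list[str]:
    """Rule family ②: field properties must match the referenced data standard."""
    return [
        f"{field['physical_name']}: {key} differs from standard"
        for key in ("data_type", "chinese_name", "english_abbr")
        if field.get(key) != standard.get(key)
    ]

demo_table = {
    "physical_name": "dim_supplier",
    "chinese_name": "",  # violates rule family ①
    "fields": [{"physical_name": "supplier_no", "chinese_name": "供应商编号"}],
}
print(check_specification(demo_table))
```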

  3. Solution 3: automated and intelligent model changes

A data model library is built on the data model server; this repository carries data standards, naming dictionaries, standards reports, and other information. Iteratively optimized models are published through a unified release process (tracked in systems such as Jira or Confluence), achieving storage and version-change management for data models, along with online model viewing and editing and multi-person collaboration.

[Figure: unified model release and management]

Its core functions are:

① Unified model storage, with web-based model sharing and query;

② Model version management, with a complete history of model changes;

③ Automatic model compliance checks and standards-implementation reports;

④ Multi-person collaboration, with simultaneous editing and modification of models;

⑤ Automatic generation of database creation scripts and management of data dictionaries.

[Figure: model, branch, and version management]

Using a Git-like code management approach, the model design tool manages models at three levels: model, branch, and version. This effectively solves model version management for developers and enables collaborative sharing.
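
A rough sketch of that three-level hierarchy, by analogy with Git; the structures are illustrative assumptions, not the tool's internals.

```python
from dataclasses import dataclass, field

@dataclass
class Version:
    number: int
    ddl_snapshot: str   # DDL exported at this version
    change_note: str

@dataclass
class Branch:
    name: str           # e.g. "develop", "SIT", "UAT", "release"
    versions: list[Version] = field(default_factory=list)

    def commit(self, ddl: str, note: str) -> Version:
        version = Version(len(self.versions) + 1, ddl, note)
        self.versions.append(version)
        return version

@dataclass
class Model:
    name: str
    branches: dict[str, Branch] = field(default_factory=dict)

model = Model("procurement_domain")
model.branches["develop"] = Branch("develop")
model.branches["develop"].commit("CREATE TABLE supplier (...);", "initial model")
```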

  4. Solution 4: mapping data model entities to business objects in business scenarios

In addition to data model design, large enterprises also need to integrate a large number of business scenarios. The business architecture includes business processes, business activities, and so on, involving a large number of business forms and their corresponding business objects. On the data entity page of the data model, each entity is bound to the corresponding business object in the business scenario, after which lineage can be traced and analyzed through Datablau's self-developed model management and control system.

[Figure: binding data entities to business objects]

  5. Introduction to the Datablau model management and control system

[Figure: Datablau model management and control system]

The Datablau model management and control system provides control at three stages: pre-event, in-process, and post-event:

① Pre-event: models are designed through a unified modeling tool;

② In-process: a model review step is added, with domain architects and enterprise architects responsible for model review and for completeness checks via the asset platform;

③ Post-event: after deployment to production, the consistency and integrity of the model are checked and monitored through the data asset platform, and reports are issued.

  6. The Datablau model control system and data development

After the Datablau DDM tool is incorporated into the development and production process, each business module needs to migrate its models accordingly and use the platform's capabilities for model design, development, testing, and production.

(1) Model import

① Model import: import models from tools such as PowerDesigner (PD) and ERwin into DDM through the import tool.

② Reverse engineering: generate models by connecting directly to the database.

③ Information completion: supplement missing field information in the model, such as Chinese field names.

(2) Design stage

① Model design: use the client designer for model design and maintenance.

② Impact analysis: during the design phase, show the impact of model modifications on downstream systems.

③ Field standard mapping: data standards can be referenced in the design tool.

(3) Evaluation stage

① Task management: model submissions must be associated with a task.

② Branch management: manage branches according to recommended best practices, and merge content between branches by task.

③ Model review: model changes must be reviewed online.

(4) Production stage

① DDL verification: compare the production DDL with the DDL exported by the modeling tool. Mismatches are confirmed manually in the short term, and by the system in the long term. A minimal comparison sketch follows the figure.

[Figure: model control in the development and production process]
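
A minimal take on the DDL verification step; statement-level set comparison is a simplifying assumption, and real tools compare at finer granularity.

```python
def split_statements(ddl: str) -> set[str]:
    """Normalize a DDL script into a set of whitespace-collapsed statements."""
    return {" ".join(stmt.split()) for stmt in ddl.split(";") if stmt.strip()}

def verify_ddl(production_ddl: str, model_ddl: str) -> dict:
    prod, model = split_statements(production_ddl), split_statements(model_ddl)
    return {
        "only_in_production": prod - model,  # to be confirmed manually
        "only_in_model": model - prod,       # designed but not yet deployed
    }

print(verify_ddl(
    "CREATE TABLE a (id INT); CREATE TABLE b (id INT);",
    "CREATE TABLE a (id INT);",
))
```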

  7. Datablau model branch management strategy

Version branch management covers two parts: the design state and the running state. Data models are managed according to versions corresponding to the development and test environments, and branches are managed according to their release status, such as development, SIT, UAT, and release, finally forming a unified branch management strategy.

[Figure: branch management strategy]

  8. Integration of model design and the development platform

An integrated management process between model design and the development platform is constructed: the model designer completes the design, the data architect approves the model, the model's script is then applied to the business system database, and generated code with embedded data standards is delivered to the development platform.

This data modeling management process can effectively turn data models into enterprise data assets. Compared with directly extracting technical metadata, model-based data assetization greatly improves data quality on the one hand, and on the other supplements the relationships among data and the business definitions behind them, making the data information more comprehensive and systematic.

[Figure: integrated model design and development process]

03. Practical cases of large enterprises

  1. Enterprise data architecture: a conceptual model for the manufacturing industry

Taking the manufacturing industry as an example, the figure below presents a high-level conceptual model of the industry, involving business domains such as management, operations, and support.

[Figure: high-level conceptual model of the manufacturing industry]

  2. Building the enterprise data architecture: development roadmap and subject domain model

[Figure: subject domain model]

The business domains above are transformed into a high-level subject domain model. Taking an automobile factory as an example: product R&D comes first, producing the product's parts list, i.e., the bill of materials (BOM); assembly and production are based on the BOM, which is linked to the sales list; the BOM is also associated with sales project management, and ultimately with customer management, order management, sales management, financial management, and other data through multiple associations, building up the high-level subject domain model.

  3. Business status

(1) Business status review, result 1: L1-L3 high-level process architecture

[Figure: L1-L3 high-level process architecture]

The subject domain model above is further refined. Taking the procurement department as an example: based on its organizational function positioning, and with business interviews as input, the high-level business structure contained in the procurement domain is comprehensively sorted out.

① L1 Category domain: the highest level of enterprise business, which can be defined based on business capabilities or end-to-end scenarios.

② L2 Process group: the set of lower-level capabilities or processes within an enterprise-level domain.

③ L3 Process: a series of interrelated activities that transform inputs into outputs. Processes consume resources and require repeatable standards; they must comply with a control system oriented toward quality, speed, and cost performance requirements.

(2) Business status review, result 2: L1-L3 business-side data catalog

Based on the functions of the procurement department, the standardized business information and forms contained in the different information domains of the procurement domain are sorted out and converted into a business-side data asset catalog, supporting data accountability.

[Figure: L1-L3 business-side data catalog]

(3) Business status review, result 3: L1-L3 business panorama

Based on the procurement business value chain, a business information flow diagram is drawn: it examines the overall picture of the procurement business from an end-to-end perspective and identifies where business information comes from and where it goes.

[Figure: L1-L3 business panorama]

  4. Data assets

(1) Data asset sorting: results – data catalog (L1-L5 asset list)

[Figure: L1-L5 data asset catalog]

The data asset catalog shown above is an example. It is divided into five levels: subject domain group, subject domain, business object, data entity, and attribute; each additional level can be understood as adding leaf nodes to the tree.

  5. Data standards

(1) Data standard formulation: Outcome – Data standard (L5 attribute standard)

The L5-layer attributes in the data catalog are given standardized definitions: data standards are formed by completing the business attributes (name, business rules, etc.), technical attributes (data type, length, etc.), and management attributes (data maintenance owner, data steward, etc.). One possible record shape is sketched after the figure.

[Figure: L5 attribute data standards]
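
One possible shape for such an L5 attribute standard, covering the three property groups named above; all field names and example values are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class AttributeStandard:
    # Business attributes
    name: str
    business_rule: str
    # Technical attributes
    data_type: str
    length: int
    # Management attributes
    data_owner: str
    data_steward: str

supplier_code = AttributeStandard(
    name="supplier_code",
    business_rule="Assigned once at supplier onboarding; never reused",
    data_type="VARCHAR",
    length=20,
    data_owner="Procurement Dept.",
    data_steward="(hypothetical steward)",
)
```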

  6. Data model

[Figure: procurement domain data model]

Data models are built on top of the data standards. The figure above shows the data model of the procurement domain; each field in the model is mapped to a data standard.

(1) Data model design: ONE ID logic design

[Figure: One ID logical design]

Data applications are built on the data model above in combination with the actual business. Taking the procurement domain as an example, a comprehensive portrait of each supplier, covering financial information, operating status, business information, and other dimensions, constitutes a supply chain finance service model.

(2) The data model is the core of the data middle platform

The data model is the core data asset of the data middle platform, bearing on basic data integration, development efficiency, and data quality. The data middle platform mainly comprises the ODS layer, the DWD/DWS layers, the data mart layer, and so on; the standardization and flexibility of these layers' model designs determine how efficiently data assets can be managed and applied. How well the data models are integrated is therefore a marker of the middle platform's success.

[Figure: the data model at the core of the data middle platform]

(3) Comprehensively manage and upgrade model data assets

In traditional data model construction, developers often implement functionality through SQL scripts written against the business logic, convert them into stored procedures, and then realize data transformation through task scheduling. This approach is flexible and easy to implement, but it creates trouble for subsequent work such as data asset sorting, data quality inspection, and data repair.

[Figure: traditional SQL-script-based development]

Therefore, with the data model at the core, managing models on the data middle platform shifts development from silo-style hand-written code to model-driven code development. This brings the assetization of data models, a reviewable development process, and improved code quality and reliability, making the middle platform the place where enterprise data assets accumulate and are released, and in turn shaping influential industry models.

(4) Integrated modeling architecture

[Figure: integrated modeling architecture]

From the data strategy perspective, modules such as business processes, business architecture, data accountability, data security, and lake-entry standards are carried on the business model. The business model is then realized through the implementation of the data model; model reviews are conducted against the corresponding enterprise standards, and models that pass review are published into the data asset catalog and finally enter the data lake.

[Figure: from model design to lake entry]

Because data models are updated iteratively and periodically, the maintenance of data standards is critical during model design: all models are assembled from data standards. Model review and model release are the key intermediate control nodes; in the end, data enters the lake in self-service fashion and is periodically compared against production metadata.

(5) Four components of enterprise-level information architecture

[Figure: four components of enterprise-level information architecture]

Enterprise-level information architecture is essentially a single core information architecture presented in four different forms: the data asset catalog, data standards, the data model, and data distribution:

① Data asset catalog

1) Expresses data through a layered architecture;

2) Classifies and defines data;

3) Clarifies data assets;

4) Establishes the input to the data model.

② Data standard

1) Standardizes business definitions;

2) Unifies language and eliminates ambiguity;

3) Provides standard business meanings and rules for data asset sorting.

③ Data model

1) Describes data and its relationships through ER modeling;

2) Guides IT development and serves as the basis for application system implementation.

④ Data distribution

1) Provides a panoramic view of how data flows across business processes and IT systems;

2) Identifies where data comes from and where it goes;

3) Serves as navigation for locating data issues.

This core information architecture essentially interprets enterprise data asset information from four perspectives:

As the initial design prototype, the data model, once reviewed and released, forms the data asset catalog, which is finally opened to business departments; the finest-grained specifications within the model form the data standards; and data distribution reflects where a specific table or field sits within the overall business process system, locating the corresponding business object and intuitively reflecting its upstream and downstream relationships.

(6) Six lake-entry standards

The evaluation criteria for data entering the lake roughly cover the following six aspects:

① Clarify the data owner

The owner of the process that generates the data is responsible for end-to-end management of the data under their jurisdiction: defining data standards and confidentiality levels for the data entering the lake, taking responsibility for data quality problems encountered in data consumption, and formulating a data management roadmap to continuously improve data quality.

② Publish data standards

Data entering the lake must have corresponding business data standards. A business data standard describes the company-level meaning and business rules of the data at the "attribute layer"; it is the company-wide common understanding of a given piece of data. Once these understandings are clarified and published, they must be complied with as standards within the enterprise.

③ Certify the data source

Certifying the data source ensures that data enters the lake from the correct source, following the company's data source management requirements. In general, the data source is the application system that first officially releases a given piece of data in the business, certified as such by a professional data management organization; the certified data source is used by the data lake as the sole source of that data. When the application system carrying a data source is merged, split, or decommissioned, the data source certification should be revoked promptly and the certification process restarted for the new source.

④ Define data confidentiality level

Defining the data's confidentiality level is a necessary condition for entering the lake. To ensure that data in the lake can be fully shared without creating information security issues, data entering the lake must be assigned a confidentiality level. The data owner is responsible for this classification, while the data steward is responsible for reviewing the completeness of confidentiality levels for inbound data and for promoting and coordinating the classification work. Confidentiality is graded at the attribute level, with different levels defined according to the importance of the asset, and data at different levels carries corresponding consumption requirements. To promote the consumption of company data, the data lake also has a declassification mechanism: data that reaches its declassification period or meets the declassification conditions should be declassified in time and its classification information refreshed.
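
A small sketch of attribute-level classification with a declassification period; the levels, columns, and dates are illustrative assumptions.

```python
from datetime import date

# Hypothetical attribute-level classification registry.
CLASSIFICATION = {
    ("supplier", "bank_account"): {"level": "confidential", "declassify_on": None},
    ("supplier", "bid_price"): {"level": "secret", "declassify_on": date(2024, 1, 1)},
}

def effective_level(table: str, column: str, today: date) -> str:
    """Return the current level, refreshing it once the declassification date passes."""
    entry = CLASSIFICATION[(table, column)]
    declassify_on = entry["declassify_on"]
    if declassify_on and today >= declassify_on:
        return "internal"  # the refreshed level after declassification
    return entry["level"]

print(effective_level("supplier", "bid_price", date.today()))
```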

⑤ Formulate a data quality plan

Data quality guarantees the results of data consumption. Data does not have to be cleansed when it enters the lake, but its quality must be evaluated, so that data consumers understand the quality of the data and the quality risks of consuming it. Meanwhile, data owners and data stewards can use the data quality assessments to drive improvement of source data quality and meet the quality requirements of consumption.

⑥ Register metadata

Metadata registration means associating the business metadata and technical metadata of inbound data, including the correspondence between logical entities and physical tables, and between business attributes and table fields. By connecting business and technical metadata, data consumers can quickly search for data in the lake through business semantics, lowering the threshold for data consumption and allowing more business analysts to understand and consume data.
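
A toy registry illustrating the two correspondences; the class shape and example names are assumptions for illustration.

```python
class MetadataRegistry:
    """Links business metadata (entities, attributes) to technical metadata."""

    def __init__(self) -> None:
        self.entity_to_table: dict[str, str] = {}
        self.attribute_to_column: dict[tuple[str, str], str] = {}

    def register_entity(self, logical_entity: str, physical_table: str) -> None:
        self.entity_to_table[logical_entity] = physical_table

    def register_attribute(self, logical_entity: str, business_attr: str,
                           column: str) -> None:
        self.attribute_to_column[(logical_entity, business_attr)] = column

    def find(self, logical_entity: str, business_attr: str) -> str:
        """Resolve a business term to its physical table.column location."""
        table = self.entity_to_table[logical_entity]
        column = self.attribute_to_column[(logical_entity, business_attr)]
        return f"{table}.{column}"

registry = MetadataRegistry()
registry.register_entity("Supplier", "dwd_supplier")
registry.register_attribute("Supplier", "supplier name", "supplier_name")
print(registry.find("Supplier", "supplier name"))  # dwd_supplier.supplier_name
```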

(7) Data model management and control organization

From the perspective of the company's organizational structure, advancing data model management and control needs the supervision and support of a corresponding organization. On the one hand, based on the DAMA methodology, the enterprise builds out the dimensions of its data governance system, such as data standards, data quality, data models, and the data asset catalog. On the other hand, within the project teams under traditional IT departments, it is recommended that some developers take on data governance roles part-time, making the governance structure more three-dimensional. In addition, an enterprise architecture office (generally covering the four architecture domains of data, application, technology, and business) can be established to cooperate with the project teams and deliver more comprehensive and in-depth data model governance services.

Therefore, establishing a data organization that combines virtual (part-time) and full-time roles is key to ensuring that data work is fully integrated into the business and effectively implemented in application systems.

[Figure: data model governance organizational structure]

Taking Bank of Communications as an example: the bank has more than 500 business systems in total, all of which achieve model management and control through the collaboration of the organizational structure described above.

04. Q&A session

Q1: Implementing enterprise-level data governance with a complete, combined architecture often carries high time costs. How should data governance be balanced against development efficiency?

A1: ① Rolling out a data governance framework requires the right opportunity. A newly built system can serve as a pilot; financial systems in particular are typically replaced roughly every five years, so an appropriate system refresh can be chosen as the moment to introduce the data governance architecture.

② If the enterprise's demand for data assets is strong and urgent, then source-side control is necessary work. On that basis, promotion can start with small pilots in selected departments or project teams, followed by gradual large-scale rollout; more efficient tools can also be adopted to improve development efficiency.

Q2: How is master data reflected in the data model?

A2: This question has been widely discussed in the industry. For the financial industry, the customer management system holds the customer master data. For enterprises with long business chains, such as manufacturers, the common approach is to model the master data. The traditional way is to build a dedicated MDM (master data management) system; a more lightweight approach is to reserve a small area in the database of each system to store the corresponding master data models (such as organization, customer, material, and product), thereby connecting the master data model with each system. In short, the core lies in building the master data model, and lightweight approaches are the trend.

Q3: How are data quality and data standards addressed?

A3: If the enterprise's model designs already implement the data standards, quality management becomes much easier. Since the standard for each physical field has been determined, basic data quality inspection rules can often be generated automatically. Complex data quality inspection rules are tied to the accountability section of the data standard: the responsible departments provide their own business rules for data quality testing, and these business rules are finally converted into technical rules embedded in the system for periodic execution. (Edited by Wang Jidong; produced by the DataFun community.)
