【DataOps】NetEase Shufan Data Governance 2.0: Practice Sharing on the Integration of Data Development and Governance

Foreword

I finally came across a practical sharing on the integration of data development and governance. The cost of retroactive ("reverse") data governance is indeed very high: most companies have data first, then build data development, data asset, data scheduling, data monitoring, data integration and other management and development platforms, and only afterwards start data governance.

To do data governance well, I think there are two directions worth trying: 1. Like NetEase's integrated solution, start controlling directly from data development and data modeling, at the source; 2. Define the standards first, and make the data governance platform a data abstraction layer (standard layer) with a data registration mechanism, abstracting the original development process + design + requirements into a standardized data governance platform.

The construction cost of the first direction is relatively high; at present it is the direction all ToB vendors are converging on. The construction cost of the second direction is somewhat lower; it can decouple and manage data from various sources, and its modularity is flexible.

Whichever direction is taken, the ultimate goal of data governance is to reduce the cost of implementation and the complexity of use, while managing and standardizing the full life cycle of enterprise-wide data, helping enterprises better realize a data-driven future (and market value growth~~~).


Next, let's read the sharing from NetEase Shufan's Guo Yi:

  • This article is produced on the platform: DataFunTalk
  • Sharing guest: Guo Yi, NetEase Shufan

Introduction

With the further development of big data, the NetEase Shufan big data team put forward the concept of data productivity. Adhering to the vision of "everyone uses data, data is used all the time", it built the NetEase Shufan data middle platform support technology system, which supports the construction of data middle platform projects such as NetEase Cloud Music, Yanxuan, NetEase Media, Youdao, and Mailbox. There is a very close relationship between the data middle platform and data governance: if data governance is not done well, the data middle platform is like a castle in the air and all kinds of problems will occur. Data governance is therefore critical to building a data middle platform. This article shares NetEase Shufan's practical experience in data governance, covering both the data middle platform and data analysis, and focuses on the following points:

  • NetEase Shufan Big Data

  • Why data governance projects often fail

  • NetEase Shufan Data Governance 2.0

  • Practical case of NetEase Shufan data governance

0X01 NetEase Shufan Big Data

First, let me introduce the background of NetEase Shufan big data.

1. The development history of NetEase Shufan big data

NetEase Shufan is the commercial ToB brand incubated by NetEase Hangzhou Research Institute, mainly providing enterprises with the technologies and services required for digital transformation. NetEase Hangzhou Research Institute was established in 2006 and is positioned as the shared technology department for NetEase's Internet business. At the beginning, we mainly built three distributed systems: a distributed database, a distributed file system, and a distributed search engine. This "troika" supported a series of NetEase products in the Web 2.0 era, including the well-known NetEase blogs and photo albums.

In 2009, NetEase took the lead in doing data analysis and operations based on Hadoop. NetEase's technical system is very open, and we were optimistic about the momentum that an open source community brings to the sustainable development of basic software. In 2014, the NetEase big data platform (known internally as NetEase Mammoth) and NetEase Youshu BI were launched, which drove the large-scale application of data analysis within NetEase: NetEase Kaola, Yanxuan, Music, News, Youdao and others all built their own data analysis systems on this platform.

In 2017, NetEase big data began to be formally commercialized. In 2018, with the rapid growth of NetEase's internal data analysis scale, we encountered many problems and challenges in data analysis, mainly around data utilization efficiency, quality, cost and security. Facing huge pressure from the business, we began to reshape the entire data architecture with the data middle platform approach, and proposed and released the "full-link data middle platform" solution. In 2020, NetEase Shufan proposed the concept of "data productivity", emphasizing building a matrix of data products for business scenarios on top of the data middle platform, and further refined the methodology of "data productization", one of the three core methodologies of data productivity. The large-scale use of data made the need for data governance solutions ever more urgent, and in 2022 the concept of "integration of data development and data governance" was proposed, which is also the core connotation of NetEase Shufan's "Data Governance 2.0".

2. NetEase Shufan Big Data Product Matrix


The figure above shows the technical system of the Shufan big data products, a four-layer architecture:

(1) Infrastructure layer

The big data computing and storage engines include some of the hottest current technologies, such as storage-compute separation, real-time data lakes, and hybrid scheduling of offline and online workloads. At NetEase News, offline data analysis tasks and online transaction processing services are scheduled uniformly on Kubernetes; during off-peak periods some offline jobs are scheduled onto the servers of online services, and resource utilization has improved significantly. In the overseas business of NetEase Cloud Music, we cooperated with AWS and took the lead in adopting storage-compute separation, replacing HDFS with cloud object storage to build a cloud-native data platform architecture. In Cloud Music we also use Arctic, NetEase's open-source real-time data lake solution, which gives the data lake minute-level real-time update capability.

(2) Data development platform based on the full life cycle of DataOps

A complete DataOps tool chain covering data integration, development, testing, release, and operation and maintenance, enabling efficient testing and seamless release across DEV/SIT/UAT/PRD environments.
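
As a minimal illustration of what promoting a data task through such environments might look like (a sketch only; the task name, configuration values, and checks below are invented and are not NetEase Shufan's actual tooling):

```python
# Hypothetical sketch of promoting one data task through DEV/SIT/UAT/PRD.
# Environment names follow the article; config values and checks are invented.

ENVIRONMENTS = ["DEV", "SIT", "UAT", "PRD"]

task = {
    "name": "dwd_order_detail_di",
    "config": {                       # per-environment settings (illustrative)
        "DEV": {"db": "dev_dw", "queue": "dev"},
        "SIT": {"db": "sit_dw", "queue": "sit"},
        "UAT": {"db": "uat_dw", "queue": "uat"},
        "PRD": {"db": "dw",     "queue": "prod"},
    },
}

def run_data_tests(task_name: str, env: str) -> bool:
    """Placeholder for data tests (schema diff, row counts, null/uniqueness checks)."""
    print(f"running data tests for {task_name} in {env}")
    return True

def promote(task: dict) -> None:
    """Deploy the task to each environment in order; stop at the first failed test."""
    for env in ENVIRONMENTS:
        cfg = task["config"][env]
        print(f"deploying {task['name']} to {env} (db={cfg['db']}, queue={cfg['queue']})")
        if not run_data_tests(task["name"], env):
            print(f"promotion stopped: tests failed in {env}")
            return
    print("released to PRD")

promote(task)
```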

(3) Data governance technology platform

NetEase Shufan's data governance system includes not only the three traditional elements of data governance we often see (data quality, metadata management, and data standards), but also related systems from the data middle platform, such as the indicator system, the model design center, and data services; all of these have been integrated into the NetEase Shufan Data Governance 2.0 system.

(4) Data product layer

BI is the most important window for data analysis, including a one-stop data portal, self-service data retrieval, large data screens, and some general-purpose data products such as CDP. We also put the machine learning platform into the data product layer, mainly so that intelligent algorithms can be applied on top of the data to improve the accuracy of decision-making.

3. NetEase Shufan big data commercialization positioning

Long-term practice in NetEase Group's internal businesses has produced a leading methodology and accumulated many industry implementation cases, and it has also clarified the commercial positioning of NetEase Shufan big data:

  • We are a basic software provider, not a cloud vendor;

  • We must support a cross-cloud strategy;

  • We believe that a healthy big data software market must be stratified.

4. User Story Wall


0X02 Why data governance projects often fail

Next, let's focus on why data governance is needed.

1. Why do we need data governance

We divide the digital transformation of an enterprise into two stages. The first stage is going online, which mainly uses information systems to replace offline processes; in this stage a large number of business systems are formed. The second stage, which we define as digital intelligence, uses data and algorithms to replace gut-feel decisions. To achieve digital intelligence, data productivity must be achieved, and we define data productivity as improving organizational productivity through the use of data. After observing many companies, we found that all companies that truly achieve data productivity share the same feature: everyone uses data, and data is used all the time, so we take that as the vision of data productivity. To realize this vision, NetEase Shufan proposed three major methodologies to rely on:

  • Data R&D (DataOps): full data life cycle R&D system

  • Data Governance (DataFusion): Data Governance 2.0

  • Data Product: Data is productized, making it easy for users to use the data

2. NetEase Shufan Data Productivity Architecture


In the entire data productivity architecture there are three roles: business systems, the data middle platform, and data products. Business systems are mainly responsible for managing processes. Different business systems create data islands, so when we want to analyze data along the entire business process, we must aggregate the data of these different business systems into a unified data middle platform, forming a shared data foundation for the enterprise. The most important responsibility of the data middle platform is to build the enterprise's common data layer, produce high-quality, consistent indicators, and present them in data products. Data products are mainly responsible for converting data into business decisions, making the operation of business processes more intelligent. Therefore, in the whole architecture, data comes from the business, is finally transformed into decisions, and then returns to the business; this cycle is what we call the digital-intelligence cycle.

So what does this have to do with the data governance we are going to talk about today? What role does data governance play here? It starts with the problems we encountered. We said earlier that the goal is for business people to really use data effectively, but can they? What goes wrong when they try to use the data?


We boiled the problems down to: data that can't be found, can't be understood, can't be trusted, and can't be controlled. Behind these lies the low efficiency and poor quality of the entire data production process.

3. Traditional Data Governance 1.0

Traditional data governance, which we refer to as the "big three", includes metadata management, data quality, and data standards. The typical process starts with data standards: the process of formulating data standards is called standard setting. Once the standards are set, they must be implemented; this requires metadata collection, metadata registration, and metadata approval and release. After the standards are implemented, the data models are linked to the data standards. Next, the data element constraints defined in the data standards can be used to audit data quality, catch data quality problems that do not meet the standards, and drive rectification. This is a fairly standard data governance process. It noticeably improves existing data, but it ignores the long-term governance of incremental data, so enterprises have to keep running data governance projects just to maintain the effect.
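
To make the chain of "data element constraint, standard implementation, quality audit" concrete, here is a minimal sketch with invented field names, patterns, and thresholds: a data element in the standard carries a format or range constraint, and a quality audit simply checks records against it.

```python
import re

# Minimal Governance 1.0 style audit: data elements in the standard carry
# constraints, and a quality audit catches records that violate them.
# Field names, patterns and thresholds are invented for illustration.

data_standard = {
    "mobile_phone": {"pattern": r"^1\d{10}$"},           # value-format constraint
    "order_amount": {"min": 0, "max": 1_000_000},         # value-range constraint
}

sample_rows = [
    {"mobile_phone": "13800000000", "order_amount": 99.5},
    {"mobile_phone": "abc",         "order_amount": -3},  # violates both constraints
]

def audit(rows, standard):
    """Return (row_index, field, reason) tuples for values violating the standard."""
    problems = []
    for i, row in enumerate(rows):
        for fld, rule in standard.items():
            value = row[fld]
            if "pattern" in rule and not re.match(rule["pattern"], str(value)):
                problems.append((i, fld, "format violation"))
            if "min" in rule and not (rule["min"] <= value <= rule["max"]):
                problems.append((i, fld, "out of range"))
    return problems

print(audit(sample_rows, data_standard))
# -> [(1, 'mobile_phone', 'format violation'), (1, 'order_amount', 'out of range')]
```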

Therefore, NetEase Shufan believes that to achieve long-term effective data governance, the problems must be solved in the data production link, ensuring that the data produced meets the standards in the first place.

The problems existing in traditional data governance 1.0 are summarized as follows:

(1) Data development and data governance are disconnected

Specifically in:

  • Disconnection between data quality and data development: people often ask how to ensure the completeness of data quality audit rules. We found that only 10% of the core tables among the produced data had audit rules, and the audit rules set by different developers for the same data item were inconsistent.

  • Data standards and data modeling are disconnected: to share one set of figures, within NetEase 37% of tables had non-standard naming problems, and the same field appeared under more than 8 different field names.

  • Data standards and data security are disconnected: data security policies are inconsistent with data standards.

  • Data development is disconnected from data standards: dictionary mappings are inconsistent with the ETL implementation.

  • Metadata is disconnected from task operation, maintenance and development: tasks cannot be effectively managed according to asset registration

The cost of retroactive data governance is very high, because the tables have already been built and the tasks are already online, and pushing owners to change them is expensive. If we design the model before a table or analysis task goes online, standardizing the data first and then modeling, the resulting tables are guaranteed to meet the standards, and the cost is lowest. That is why we emphasize the integration of data development and governance.

(2) Lack of unified management of different platforms

Different computing and storage engines increase the cost for users to find, understand, and use data.

(3) Efficiency and quality issues in the data development process are ignored

The figures above show two real cases. They make it clear that data governance should be integrated into the data production process, rather than done only after data goes online.

(4) Chimney-style data development is not solved

Chimney-style data development causes inconsistent indicator calibers, efficiency problems from repeated data development, and resource waste from repeated data computation.

(5) Insufficient assessment of data value and cost


(6) The process of data governance lacks quantitative means
There should be some quantitative means to monitor the entire governance process.

(7) The process of data governance lacks a closed loop of continuous feedback

  • Metadata lacks a closed loop of continuous improvement

  • Data quality lacks a closed loop of continuous improvement

  • Refined management of resources lacks a closed loop of continuous feedback

0X03 NetEase Shufan Data Governance 2.0

1. What exactly is data governance?


DAMA, the industry authority, defines 11 functional areas of data governance, but it lacks concrete implementation methods and experience.

DCMM is China's first national standard in the field of data governance. It provides objective evaluation methods, but still lacks concrete methods for action.

2. NetEase Shufan's understanding of data governance


NetEase divides it into two parts according to the purpose of data governance:

  • Business-system-oriented data governance: solves the consistency, authority and correctness issues of enterprise core data across businesses, systems and processes in business systems.

  • Data governance for data analysis: It solves the problems of efficiency, quality, security, cost, standard and value in the process of data analysis.

3. NetEase Shufan data governance methodology DataFusion

NetEase's data governance methodology integrates traditional data governance methods into the full life cycle of data development: it is built on the DataOps full-life-cycle data development foundation, adopts the data architecture of the data middle platform, and combines NetEase's characteristic ROI-based data assetization practice. We call this Data Governance 2.0.

Core highlights:

  • Development and Governance Integration

  • Logical data lake based on DataFabric

  • Data development base using DataOps

  • Data middle platform architecture to solve chimney-style data development

  • ROI-based data asset precipitation

(1) Integration of data development and governance


  • Generate range constraints through data exploration

  • Data standards bind audit rules to data elements and meta-models

  • Data modeling references the data elements and metamodels in the data standards

  • Audit monitoring is automatically added to a table according to the audit rules associated with the data standards bound to that table
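
A minimal sketch of how these four steps could chain together (the structures and names below are hypothetical and not the actual NetEase Shufan product model): the data standard binds audit rules to data elements, table models reference those elements, and audit monitors are derived automatically when a table is registered.

```python
from dataclasses import dataclass

@dataclass
class DataElement:
    name: str
    audit_rules: list          # audit rules bound to the element by the data standard

@dataclass
class Column:
    name: str
    element: DataElement       # data modeling references the data element

@dataclass
class TableModel:
    name: str
    columns: list

def generate_audit_monitors(table: TableModel) -> list:
    """Derive audit monitors for a table from the standards bound to its columns."""
    return [
        f"{table.name}.{col.name}: {rule}"
        for col in table.columns
        for rule in col.element.audit_rules
    ]

# Constraints like these could come from data profiling ("data exploration") of source data.
age = DataElement("age", audit_rules=["not null", "between 0 and 150"])
gender = DataElement("gender_code", audit_rules=["value in code dictionary {M, F, U}"])

dim_user = TableModel("dim_user", columns=[Column("age", age), Column("gender", gender)])

for monitor in generate_audit_monitors(dim_user):
    print("register audit monitor ->", monitor)
```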

(2) Based on DataFabric logical data lake


The core idea of the DataFabric-based logical data lake is to build a cross-platform unified data mart: Hive, MySQL, and Greenplum are built into a unified aggregation layer, BI is served directly on top of it, and an out-of-the-box experience for the business is delivered through circled selection of data sets and materialized views. For users, it shields the differences among the underlying data sources and the data materialization processes between them.
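
NetEase's logical data lake is its own DataFabric implementation; purely as an analogy, query federation engines such as Trino can expose Hive and MySQL tables behind one SQL interface without moving the data first. A hedged sketch using the Trino Python client, with placeholder connection details and table names:

```python
# Illustrative only: one federated query across Hive and MySQL catalogs via Trino,
# as a rough analogy for a cross-source "aggregation layer" served to BI.
# This is not NetEase Shufan's implementation; hosts and tables are placeholders.
import trino  # pip install trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # placeholder host
    port=8080,
    user="bi_user",
)
cur = conn.cursor()

# One logical query spanning two physical sources; the engine shields BI from
# where each table actually lives.
cur.execute("""
    SELECT c.region, SUM(o.amount) AS gmv
    FROM hive.dw.dwd_orders o
    JOIN mysql.crm.customers c ON o.customer_id = c.id
    GROUP BY c.region
""")
print(cur.fetchall())
```

The actual Shufan product adds circled data sets and materialized views on top of such a unified layer, which this analogy does not cover.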

(3) Data development base based on DataOps

The DataOps-based data development foundation applies the CI/CD methodology of software engineering to the field of data development, covering continuous integration, continuous delivery, and continuous deployment. Specifically, it includes six stages: coding, orchestration, testing, code review, release review, and deployment.
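
An illustrative skeleton of the six stages as sequential gates (stage names from the article; the checks themselves are invented placeholders), where a change reaches deployment only after every earlier gate passes:

```python
# Illustrative skeleton of the six DataOps stages as sequential gates.
# Stage names follow the article; the checks are invented placeholders.

change = {
    "name": "dws_user_active_1d v2",
    "reviewed": True,                        # peer code review completed
    "release_approved": True,                # release review / approval completed
    "sample_rows": [{"id": 1}, {"id": 2}],   # sample output used by data testing
}

def coding(c):
    return bool(c["name"])                   # code written and committed

def orchestration(c):
    return True                              # task placed into the scheduling DAG

def testing(c):
    keys = [r["id"] for r in c["sample_rows"]]
    return len(keys) > 0 and len(keys) == len(set(keys))  # non-empty, unique keys

def code_review(c):
    return c["reviewed"]

def release_review(c):
    return c["release_approved"]

def deployment(c):
    print(f"deploying {c['name']} to production")
    return True

for stage in (coding, orchestration, testing, code_review, release_review, deployment):
    if not stage(change):
        print(f"blocked at stage: {stage.__name__}")
        break
```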

(4) The architecture of the data middle platform

The data middle platform includes three cores: a unified indicator management system, a highly reusable and standardized common-layer model, and data services.
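
As an illustration of "define an indicator once, serve it consistently" (the indicator, table, and formula below are hypothetical), a registry can hold the caliber centrally and the data service layer can render it into SQL, so different reports cannot drift apart:

```python
# Hypothetical sketch: a single indicator definition reused by every consumer,
# so all reports and APIs share one caliber. Names and formulas are invented.

indicator_registry = {
    "gmv_1d": {
        "description": "daily gross merchandise volume, paid orders only",
        "table": "dws_trade_order_1d",       # common-layer model (illustrative name)
        "expression": "SUM(pay_amount)",
        "filters": "order_status = 'PAID'",
        "dimensions": ["dt", "channel"],
    }
}

def indicator_sql(name: str, group_by: list) -> str:
    """Render the registered caliber into SQL for the data service layer."""
    ind = indicator_registry[name]
    dims = ", ".join(group_by)
    return (
        f"SELECT {dims}, {ind['expression']} AS {name}\n"
        f"FROM {ind['table']}\n"
        f"WHERE {ind['filters']}\n"
        f"GROUP BY {dims}"
    )

print(indicator_sql("gmv_1d", ["dt", "channel"]))
```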

(5) ROI-based data asset precipitation


With ROI-based data asset precipitation, a visual analysis page shows refined, per-scenario management of each task, which enables business personnel to continuously manage useless data and take it offline (a sketch of the idea follows this list):

  • Calculate the compute and storage resource consumption of each task, query, and table, convert it into money, and allocate it at the application level to each data report and data service API;

  • "Peeling the onion" data offlining: starting from data applications that are no longer used downstream, archive and take offline the upstream tasks and data layer by layer to release resources;

  • Estimate the cost of tasks and queries, and apply approval controls to high-consumption tasks and queries.
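
A minimal sketch of the "peeling the onion" idea over an invented lineage graph: starting from data applications with no downstream consumers and no recent access, walk the lineage upward and retire every node whose downstream has already been retired.

```python
# Illustrative "peel the onion" retirement walk over a lineage graph.
# Nodes are tables/tasks; edges point from upstream to downstream. All data is invented.

lineage = {                        # upstream -> list of downstream consumers
    "ods_click_log":    ["dwd_click_detail"],
    "dwd_click_detail": ["ads_funnel_report"],
    "ads_funnel_report": [],       # a data application no one opens anymore
}
recently_used = set()              # e.g. from access logs; empty here for simplicity

def peel(lineage, recently_used):
    """Return nodes that can be archived, layer by layer from the leaves upward."""
    alive = dict(lineage)
    retired = []
    changed = True
    while changed:
        changed = False
        for node, downstream in list(alive.items()):
            still_needed = node in recently_used or any(d in alive for d in downstream)
            if not still_needed:
                retired.append(node)
                del alive[node]
                changed = True
    return retired

print(peel(lineage, recently_used))
# -> ['ads_funnel_report', 'dwd_click_detail', 'ods_click_log'] (order of retirement)
```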

4. Quantitative indicator monitoring and analysis

By monitoring a data governance health score on the dashboard, points can be deducted along different dimensions. Based on this health score, we can publish red and black lists across different businesses, which is also a means of performance management.
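
A toy version of such a deduction-based health score (the dimensions, weights, and issue counts are made up): each issue found deducts points from a full score, giving a number that can be compared across business lines.

```python
# Toy deduction-based governance health score; dimensions and weights are illustrative.

deduction_rules = {                # points deducted per issue found
    "standard": 2,   # e.g. non-standard table or field naming
    "quality":  3,   # e.g. core table without audit rules, failed audits
    "cost":     1,   # e.g. tables not accessed for 90+ days
    "security": 5,   # e.g. sensitive column without a masking policy
}

issues_found = {"standard": 12, "quality": 4, "cost": 30, "security": 1}

def health_score(issues, rules, full_score=100):
    score = full_score - sum(rules[d] * n for d, n in issues.items())
    return max(score, 0)

print(health_score(issues_found, deduction_rules))  # 100 - (24 + 12 + 30 + 5) = 29
```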

5. Ongoing Operations - Metadata Quality Discovery and Feedback


In the course of continuous operation, when data asset consumers find data quality problems, they can apply for data governance. The data management department can then assign a work order requiring the business department to fix the corresponding data problem within a specified time.

6. Enterprise data culture construction

Data culture:

  • Data Analysis Competition, Data Governance Competition, Data Visualization Competition

  • Data development engineer, data visualization analysis engineer qualification certification

Organization building:

  • A data governance department serves as the data governance operations department

  • Business departments are equipped with data governance specialists

  • Develop a data governance score and publish red and black lists to draw the attention of business departments

  • Combine with the company's internal process engine to realize the tool-based flow of data governance processes

7. Data Productivity Organizational Structure


8. Governance-Oriented System Construction

Technology is the foundation of data governance, but technology alone is not enough. The organization, processes, assessments, and policies above are also needed to complete the whole system, so as to finally realize the vision of everyone using data, all the time.

9. Data Strategy


10. Enterprise data asset portal - one-stop data consumption platform


Through a one-stop data consumption platform and portal, business personnel can see on the portal what data, core reports, and core data governance applications the enterprise has.

0X04 Practical case of NetEase Shufan data governance

1. A large operator

Problems faced before the introduction of the NetEase Shufan one-stop tool platform:

  • Data standards, data quality and data development were seriously disconnected. Specifications could only stay at the data dictionary level; they could not be integrated into the data production process, and could not be effectively implemented or supervised.

  • Tools from different vendors were seriously fragmented: data quality audit rules could not be connected to the value-range constraints of data elements in the data standards, the data elements in the data standards could not be linked with the data modeling tools, and the data security levels in metadata management could not be linked with data desensitization in the security center.

In the end, this led to governing the same data repeatedly without fundamentally solving the problem.

2. Data development and governance integration


After NetEase Shufan was introduced, the data middle platform provided integrated capabilities (data collection, modeling, development, scheduling, and governance) for the data warehouse, business analysis, and network cluster domains. In the production process, procedures such as taking jobs online and offline and creating tables were moved online and streamlined, which reduces manual work and improves efficiency on the one hand, and improves data management and control on the other.

The key point is to integrate the whole data governance process into the full data development link: standardize the data before design, then do data modeling, and build in data quality, data security, and data assets, which is how the development-governance-integrated data governance scenario is implemented.

3. List of achievements

The image above shows the results of our data governance; problems can also be surfaced along the dimensions of quality, value, security, cost, standards, and efficiency.

0X05 QA

Q1: How to coordinate business-oriented data governance and analysis-oriented data governance?

A1: This is a very good question. Should we do business-oriented data governance or analysis-oriented data governance, and which should come first? In fact the two are strongly connected, because data comes from the business systems and eventually returns to the business systems. When we do business-oriented data governance, there are corresponding data standards on the business system side, and those standards also carry corresponding data quality rules and data asset levels.

Of course, if I do business-side data governance, does that mean I don't need analysis-side governance? No! As I mentioned just now, the modeling methods of analysis systems and business systems are different: business systems use entity-relationship modeling, analysis systems use dimensional modeling, and the two can be connected through business entities. If data governance has been done in the business system, it can be applied directly to the governance of the analysis system: we can synchronize the standards, and the data quality rules attached to those standards, into the analysis system, where they form the corresponding data quality audit tasks. A process like standard setting can then be greatly reduced in complexity and difficulty. So the synergy between the two is that the data quality rules, data standards, and data models governed on the business side can be synchronized into analysis-oriented data governance, managed on the same platform, where analytical data and business data can be associated through business entities. This synergy is realized through one set of tools, technologies, and products.

Q2: How did you carry out the data testing process, and what approach did you use to implement it?

A2: Data testing is a very important part of our entire CI/CD process and an important means of quality control. How do we make sure it is actually carried out? We set up a number of checkpoints that force execution. What are these checkpoints based on? Gating all data is unrealistic, so data must be designed before it is developed. During design, data assets are classified and graded and the security level of the data is defined, and we formulate the corresponding approval process according to the scope of data impact and the data level. For example, for the launch of core data there must be a corresponding data test report, covering the business rules and technical rules of data testing, such as whether the primary key is unique and whether there are null values. The platform automatically attaches these data quality reports when the task is submitted for release, mixing them into the business's submission and release process. The release process then automatically triggers an approval flow according to the scope of downstream impact and the corresponding data asset level, and the approver checks whether the data test report matches the code and whether the test results meet expectations. Only then can the task go online. In this way we can enforce that all our core data has been tested.
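
As a concrete reading of these checkpoints (the table, rules, and data below are invented), a release gate can run the technical rules mentioned above, primary-key uniqueness and null checks, and refuse to submit a core-data task for approval unless the test report is clean:

```python
# Illustrative release gate: run basic data tests (primary-key uniqueness, null
# checks) and only allow submission to the approval flow if the report is clean.
# Table data, rules, and asset levels are invented for the sketch.

rows = [
    {"order_id": "A1", "user_id": "u1", "amount": 10.0},
    {"order_id": "A2", "user_id": None, "amount": 5.0},   # null user_id
]

def test_primary_key_unique(rows, key):
    values = [r[key] for r in rows]
    return len(values) == len(set(values))

def test_not_null(rows, column):
    return all(r[column] is not None for r in rows)

def build_test_report(rows):
    return {
        "pk_unique(order_id)": test_primary_key_unique(rows, "order_id"),
        "not_null(user_id)":   test_not_null(rows, "user_id"),
    }

def submit_for_release(task, rows, asset_level):
    report = build_test_report(rows)
    print(f"data test report for {task}: {report}")
    if asset_level == "core" and not all(report.values()):
        print("release blocked: core data must have a passing test report")
        return False
    print("submitted to approval flow (approver reviews report against the code)")
    return True

submit_for_release("dwd_order_detail_di", rows, asset_level="core")
```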

Q3: What do you think is the most successful application of data governance 2.0 in financial scenarios?

A3: To be honest, I have seen many cases, including in the securities, banking wealth management, and asset management industries, and the integration of data development and governance is still at an early, exploratory stage. Some time ago we communicated with the CIOs and data governance leads of several securities firms, and they all hope this kind of data governance can be implemented. Of course, there will be many problems during implementation: for example, the tool or development platform may have existed for many years while the data governance platform is a separate system, so connecting the different platforms brings very high costs and can ultimately prevent implementation, just like the operator case I shared. But overall I think this is a trend and direction everyone recognizes: complete the governance process within the data production link, rather than repeatedly doing after-the-fact governance. One experience I would share: new data is likely to have greater value to the business, while old data may have limited value, so we should pay more attention to the generation of new data and the governance process for new data.


Appendix

  • This article is produced on the platform: DataFunTalk
  • Sharing guest: Guo Yi, NetEase Shufan


Origin: blog.csdn.net/qq_31557939/article/details/126902873