Data quality: the core of data governance

Background

With the advent of the big data era, data in motion has become a carrier that connects the world and a driving force behind economic and social development and people's daily lives. As data flows, and especially as a series of problems arising from that flow must be solved, "data governance" has become a popular topic. To understand data governance and data quality, we must start from the basic concepts of data, data governance, and data quality.


What is data

Data is a rather vague concept; to date, information scientists have no unified definition for it. Wikipedia defines data as raw records that have not yet been processed. Informatics, however, favors a view derived from T. S. Eliot's poem "The Rock": "Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?" Eliot seems to propose a hierarchy, ordered from high to low as wisdom, knowledge, and information. We commonly use the same hierarchy with a data layer added at the bottom: data, information, knowledge, and wisdom. In this view, data is the raw material collected by tools or machines. To be precise, data is raw, unprocessed information that no one has yet touched, viewed, or thought about; for example, the bit stream the Chang'e-4 probe transmits from the moon back to Earth is data. Information is data that has been processed and made usable by people, such as a bit stream converted into an image. Knowledge is information that has been internalized, what you actually know, and wisdom is understanding how to apply that knowledge.
The hierarchy of data, information, knowledge, and wisdom corresponds to the successive stages of human cognition.

As our understanding of data deepens, data can also be classified from different perspectives: by structure, by reproducibility, by confidentiality level, by storage tier, and so on.
Similarly, from the perspective of value, data can be divided into data resources, data assets, and data capital. Throughout the development of the digital economy, data has played a central role, and people's understanding of its value has moved from shallow to deep, from simple to complex. In general, the development of data value falls into three stages. In the data resource stage, data is a resource that records and reflects the real world. In the data asset stage, data is not only a resource but also an asset, an important part of personal or corporate wealth and a foundation for creating more of it. In the data capital stage, the resource and asset characteristics of data are exploited further, and through transactions and other forms of circulation, data ultimately becomes capital.



Data governance

  1. There are many definitions of data governance. Here is the one from DAMA International (the Data Management Association): data governance is the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets.
  2. Data governance is a high-level, planned data management activity. Its key activities include formulating data strategy, refining data policies, and establishing data architecture, with a focus on defining, in a compliant way, who uses data, how, and with what rights. It emphasizes the groundwork that precedes full life-cycle management of data assets and attends to the safeguards surrounding data asset management.
  3. In 2015, DAMA expanded the DAMA-DMBOK2 body of knowledge into eleven management functions: data governance, data architecture, data modeling and design, data storage and operations, data security, data integration and interoperability, document and content management, reference and master data, data warehousing and business intelligence, metadata, and data quality. Because data governance covers so much ground, this article focuses first on the particularly important function of data quality management.
  4. In fact, the key to big data processing is solving the data quality problem. Tony Fisher, author of The Data Asset: How Smart Companies Govern Their Data for Business Success, once noted that if the underlying data is unreliable, most companies' big data initiatives either fail or deliver less than expected, and that the root cause is inconsistent, inaccurate, unreliable data flowing into the data life cycle. Poor data quality usually means poor business decisions: it directly leads to inaccurate statistics and analysis, businesses that are hard to supervise, and difficult decisions for senior leadership.
  5. Data quality management is the core of data governance; the ultimate goal of data governance is to ensure that an organization produces, supplies, and uses high-quality data.

Data quality management

  • Concern over data quality dates back to 1957, when computers were new and people already recognized the impact of data on computer-aided decisions, coining the warning "garbage in, garbage out". In 2001 the United States enacted the Data Quality Act, which set out guidelines for improving data quality. In 2018 the China Banking and Insurance Regulatory Commission issued the "Data Governance Guidelines for Banking Financial Institutions", emphasizing the importance of high-quality data for realizing data's value. Data quality is the foundation of effective data applications and an index of data's value: just as ore is judged by its gold content, the quality of data determines its worth.

  • A lack of data quality management leads to problems such as dirty data, duplicate data, redundant data, data loss, inconsistency, failure to integrate, unclear responsibility, and poor user experience. For enterprises, the need to improve data quality has therefore become increasingly pressing.

  • Ideally, data quality management develops and implements improvement plans and processes that cover the entire data life cycle, from initial creation and collection through storage, system integration, archiving, and destruction. In practice, everything cannot be done at once; instead, prioritize which processes to improve and work through them in order.


Data quality control methodology

Improving data quality requires management's attention; with it, more resources can be secured for establishing a data quality management mechanism, building a data quality inspection system, and fostering a data quality culture.


1. Get management's attention to data quality

Does the data support the company's vision and mission?

  • A vision embodies the position and beliefs of a company's founders and top managers and their picture of its future; a mission is the task an enterprise takes on out of social responsibility, obligation, or its own development. Alibaba's mission, for example, is "to make it easy to do business anywhere"; its vision is to become a good company that lives for 102 years. As Jack Ma put it: "Alibaba is not a retail company, but a data company." Without a data quality strategy, such a large data company could not carry out the iterative planning and process improvement needed to realize that mission and vision.
  • Data assetization is inseparable from high-quality data
  • Data assetization is the process of making data controllable, quantifiable, and monetizable, thereby reflecting its value. The quality of data determines its value and affects the effectiveness of data assets. Moreover, data has now penetrated every industry and is increasingly an indispensable strategic asset for enterprises.
  • Data-driven relies on high-quality data
  • In today's data-driven organizations, whether accurate, usable, high-quality data is available directly affects whether leadership can make correct decisions and achieve strategic goals.
  • Obtaining management's commitment to data quality means not only securing the resources a data quality program requires, but also that management recognizes the value of high-quality data and is willing to invest in improvements and reward those who contribute to them.

2. Establish a data pipeline management mechanism

Data producer

Source system:

  • Impose stronger constraints at the source system's data entry interface to eliminate quality problems at the point of entry. For example, when an app asks users for monthly income, offer a set of predefined brackets to choose from rather than a free-text input box, where amounts might be entered in inconsistent formats or languages.
  • For data exchange between systems, establish interface calling specifications that meet data quality standards; when invoking a third-party system, both parties should agree on a data interface specification.
  • The source system should take data consumers' needs into account during design and production. Storing large amounts of raw JSON without regard to consumers, for instance, not only raises data security concerns but also makes the data harder to use well.
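As an illustration of such entry-interface constraints, here is a minimal sketch of server-side validation that accepts only predefined income brackets; the bracket values and function name are hypothetical:

```python
# Hypothetical sketch: restrict a monthly-income field to predefined
# brackets instead of accepting free-form text.
INCOME_BRACKETS = ["0-3000", "3000-8000", "8000-20000", "20000+"]

def validate_income_bracket(choice: str) -> str:
    """Accept only one of the predefined brackets; reject anything else."""
    if choice not in INCOME_BRACKETS:
        raise ValueError(f"invalid income bracket: {choice!r}")
    return choice
```

Rejecting free-form input at the system boundary is far cheaper than cleansing it downstream.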

ETL development:

  • Establish a complete, reasonably comprehensive desensitized (masked) copy of the production data environment.
  • Raise data developers' awareness of data quality.
  • Column-level checks validate the data within a single column, for example testing for null values, numeric ranges, or membership in an enumeration.
  • Structural checks validate relationships across columns, for example verifying the hierarchical (one-to-many) relationship between two or more columns.
  • Business-rule checks implement more complex validation, for example testing intricate logical relationships such as that between a bank loan's disbursement date and its value date.
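The three levels of checks above can be sketched as follows; the function names, columns, and rules are illustrative rather than a real ETL framework's API:

```python
def column_check(rows, col, allowed=None, lo=None, hi=None):
    """Column-level: flag null values, out-of-enumeration values, out-of-range values."""
    errors = []
    for i, row in enumerate(rows):
        v = row.get(col)
        if v is None:
            errors.append((i, "null value"))
        elif allowed is not None and v not in allowed:
            errors.append((i, f"not in enumeration: {v!r}"))
        elif lo is not None and not (lo <= v <= hi):
            errors.append((i, f"out of range: {v!r}"))
    return errors

def structural_check(rows, parent_col, child_col):
    """Cross-column: each child value must map to exactly one parent (one-to-many)."""
    parent_of = {}
    errors = []
    for i, row in enumerate(rows):
        child, parent = row[child_col], row[parent_col]
        if parent_of.setdefault(child, parent) != parent:
            errors.append((i, f"{child!r} maps to multiple parents"))
    return errors

def business_rule_check(rows):
    """Business rule: a loan's value date must not precede its disbursement date."""
    return [(i, "value date before loan date")
            for i, row in enumerate(rows)
            if row["value_date"] < row["loan_date"]]
```

In practice, such rules would be configured declaratively and run by the ETL platform over each batch, but the three tiers of validation are the same.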

Data knowledge management:

  • System construction and data development are integrated into the project development life cycle, and data knowledge is managed and maintained as the system evolves. For example, metadata, system documentation, and training materials, including data quality test results, are shared among stakeholders.
  • Data managers: formulate data quality standards and control assessments, analyze data quality problems, draft and drive iterative remediation plans, and manage data usage.
  • Data consumers: consumers are still responsible for using data correctly, which requires understanding it; they must know what the data they use represents and how to use it properly.
  • Data managers, data producers, and downstream data consumers must all be connected so that better data can be produced along the whole pipeline.

3. Detection and quantification of data quality

Data quality inspection system

"Workers must first sharpen their tools if they want to do their jobs well." To measure data quality, a data quality inspection system is needed. The author has posted two technical implementation instructions on batch and stream data quality inspection systems before, and links are given here, so I won’t repeat them.
Data Governance Series: Self-cultivation of a Data Quality Monitoring System
The key indicators for measuring and monitoring data quality are as follows:

Validity

  • Field length is valid
  • Field content is valid
  • Field value range is valid
  • The number of enumeration values is valid
  • The enumeration value set is valid

Uniqueness

  • Monitoring whether duplicate primary-key values exist.

Completeness

  • Whether fields are empty or NULL
  • Whether records have been lost
  • Record-count fluctuation
  • Record-count fluctuation range
  • Record-count variance test
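As one example, the record-count fluctuation check above can be sketched like this; the baseline method and tolerance are illustrative:

```python
# Illustrative record-count fluctuation check: compare today's row count
# against the mean of recent daily counts and flag large deviations.
def count_fluctuation_alert(history, today_count, tolerance=0.2):
    """Return True (alert) if today's count deviates from the
    historical mean by more than the given tolerance ratio."""
    baseline = sum(history) / len(history)
    return abs(today_count - baseline) / baseline > tolerance
```

A sudden drop usually signals lost records upstream, while a sudden spike often signals duplicated loads.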

Accuracy

  • Year-on-year value comparison
  • Month-on-month (chain) value comparison
  • Value variance test
  • Table logic check

Consistency

  • Table-level consistency check

Timeliness

  • Table-level quality monitoring indicator: whether the data is produced on time

Data profiling

  • Maximum check
  • Minimum check
  • Average check
  • Summary value check

Custom rule check

  • User-defined monitoring rules implemented by SQL
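A user-defined SQL rule can be sketched against an in-memory SQLite database; the table, columns, and rule here are hypothetical, and a rule passes when its query returns no violating rows:

```python
import sqlite3

def run_sql_rule(conn, rule_sql):
    """A rule query selects violating rows; the rule passes when none are returned."""
    violations = conn.execute(rule_sql).fetchall()
    return len(violations) == 0, violations

# Hypothetical table and rule: loan amounts must be positive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO loans VALUES (?, ?)", [(1, 5000.0), (2, -30.0)])
ok, bad = run_sql_rule(conn, "SELECT id FROM loans WHERE amount <= 0")
```

Phrasing every custom rule as "select the violations" gives all rules a uniform pass/fail interface that a monitoring system can schedule and report on.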

Data quality is measured along dimensions such as validity, uniqueness, completeness, accuracy, consistency, timeliness, data profiling, and custom rule checks. At today's very large data volumes, however, monitoring all data is not cost-effective. Knowing which data is most critical, and applying full-pipeline quality checks to that critical data, both prevents errors and reveals opportunities for improvement.

Quantification of data quality issues

  • Analyze and quantify the results of data quality tests, identify the pipeline stages where quality problems arise, locate the problematic data, and enforce an accountability mechanism.

4. Hold data producers accountable for their own data quality

Data quality accountability

  • The data producer owns the data creation process. To produce high-quality data, producers must understand consumers' needs and expectations; once those are defined, management must hold producers accountable for their links in the data pipeline. Producers must also provide system-related data knowledge, system documentation, metadata, and training materials, which, like the data itself, must be shared and managed, along with a mechanism for communicating any system changes that may affect downstream users. Data consumers, in turn, bear the responsibility of understanding the data and using it correctly. Incorporate data quality goals into performance evaluations.

5. Build a culture that focuses on data quality

  • Data is key to corporate success because it underpins corporate decision-making. Successfully implementing data quality processes requires a governance structure (data management, data quality accountability, sponsorship of improvement projects), a supporting structure for using data effectively (data knowledge management, metadata management, employee training, master data management), and a process for managing and resolving problems (escalation and prioritization, an effective communication mechanism between data producers and consumers). The supporting structures and processes must be mature for an enterprise to obtain value from its data assets. Producing high-quality data thus requires management's attention and the commitment of the whole enterprise.

  • Data quality should be improved across the entire data life cycle: source system producers know the downstream uses of the data they produce; data storage teams have methods or systems for measuring and monitoring the quality of the data they are responsible for; and data consumers have the data knowledge and metadata they need to use the data effectively. Other supporting structures provide input and feedback on the data and its usage, so that those responsible for storage and access can keep improving quality as the business evolves. Data quality goes beyond the data itself; it also depends on management's commitment to a culture of quality.


Summary

  • This article began with the basic concepts of data, data governance, and data quality management, then traced the origins of data quality, the negative impacts of neglecting it, and its importance. It linked to the author's earlier implementations of streaming and batch data quality inspection systems, and closed with a data quality control methodology. Together with the design and implementation of a data quality system, this article offers a reasonably complete treatment of data quality.

Origin blog.csdn.net/qq_43081842/article/details/112441844