Data quality monitoring notes

Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reproducing it, please include the original source link and this statement.
Original link: https://blog.csdn.net/fxbin123/article/details/89819451

Foreword

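What factors affect data quality, what types of data quality problems exist, and how to design a data quality monitoring process

Goals

Address common data quality monitoring requirements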

First, concepts related to data quality

1. What is data quality:

(1) Data quality, as the name suggests, is the quality of the data
(2) Data quality is the most important prerequisite and guarantee for the validity and accuracy of conclusions drawn from data analysis
(3) Data quality is the foundation of data analysis applications. To obtain reliable data, companies must pay close attention to data quality, which directly affects whether system applications succeed

2. What is data quality management:

(1) Data quality management refers to the series of management activities that identify, measure, monitor, and give early warning of the data quality problems that may arise at each stage of the data life cycle: planning, acquisition, storage, sharing, maintenance, application, and retirement
(2) Data quality management covers not only improving the quality of the data but also improving the organization. Improving and managing the data includes data analysis, data assessment, data cleansing, data monitoring, and error early warning. Improving and managing the organization includes establishing the organization's data quality improvement goals, assessing organizational processes, drawing up a process improvement plan, establishing oversight and audit mechanisms, implementing improvements, and evaluating their effect.

3. Why manage data quality:

(1) Enterprise data quality is directly linked to business performance, and high-quality data helps a company stay competitive
(2) In the era of big data, poor data quality makes big data misleading for decision-making and can even produce consequences that are hard to estimate
(3) Data transfer and processing chains in today's analysis systems keep getting longer and data management keeps getting more complex, so the number of points where data quality errors can occur has grown significantly
(4) To provide correct and strong support for the company's business strategy, the data must be accurate, which requires strict data quality control to ensure its reliability

Second, factors that affect data quality:

1. Caused by the requirements process

(1) Problems caused by the requirements process mainly refer to data quality problems introduced during requirement design, development, testing, release, and similar stages
(2) Such problems are mainly caused by imperfect mechanisms and procedures for managing the requirements process

2. Caused by the data source

(1) Problems caused by the data source arise when upstream data source specifications are incomplete or missing, which affects downstream systems and produces data quality problems
(2) Data quality problems caused by the data source are the main source of data quality problems in analysis systems, and mainly show up as:

  • Incorrect information
  • Incomplete information
  • Inconsistent information

3. Caused by statistical definitions

(1) Problems caused by statistical definitions mainly refer to data quality problems in KPI metric values, reports, and similar outputs, including problems with the accuracy, consistency, and completeness of metrics.
(2) These problems mainly arise in the following situations:

  • Different legacy systems or different business departments define a metric of the same name differently and use different calculation scopes, so the final metric statistics do not match
  • When a business or a custom metric is described, many points are left unclear or incomplete, which makes the metric definition ambiguous
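
As a small illustration of the first situation, the sketch below computes a metric with the same name under two different definitions. The event data, the metric name "active users", and the two team definitions are all hypothetical; they only show how the same name can yield different numbers.

```python
# Hypothetical event log; field names are assumptions for illustration.
events = [
    {"user": "u1", "action": "login"},
    {"user": "u2", "action": "login"},
    {"user": "u2", "action": "purchase"},
    {"user": "u3", "action": "view"},
]

def active_users_team_a(events):
    """Hypothetical definition A: any user with at least one event."""
    return len({e["user"] for e in events})

def active_users_team_b(events):
    """Hypothetical definition B: only users who logged in."""
    return len({e["user"] for e in events if e["action"] == "login"})

count_a = active_users_team_a(events)
count_b = active_users_team_b(events)
print("active users (A):", count_a)
print("active users (B):", count_b)
```

Both functions report "active users" for the same data, yet they disagree, which is why a shared, written metric definition is usually agreed on before the numbers are compared.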

4. Caused by the system itself

(1) Problems with the system itself mainly refer to data quality problems caused during system development and construction, operation, and routine maintenance, such as poor data model quality, data loss during system upgrades, insufficiently thorough ETL data cleansing, and data warehouse job scheduling problems. The reasons are:

  • While the data warehouse is being built, a lack of standardized, systematic design thinking leaves the system architecture, data models, and processing flows unreasonable or less than optimal, which leads to data quality problems

  • During system operation, sound day-to-day management and maintenance processes are generally lacking, so there is no basis or standardized procedure to follow when handling monitoring data, and operational errors or omissions lead to data quality problems

Third, types of data quality problems

1. Error values:

Errors caused by a mismatch between the field type and the data actually stored, or by mistakes made when the information was entered
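
One possible way to detect error values is to try parsing each raw value with the type its field is supposed to hold. This is a minimal sketch using only the Python standard library; the schema, field names, and sample rows are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical schema: field name -> parser that raises ValueError on bad input.
SCHEMA = {
    "user_id": int,
    "amount": float,
    "signup_date": lambda s: datetime.strptime(s, "%Y-%m-%d"),
}

def find_error_values(rows):
    """Return (row_index, field, raw_value) for every value that fails to parse."""
    errors = []
    for i, row in enumerate(rows):
        for field, parse in SCHEMA.items():
            raw = row.get(field)
            if raw is None or raw == "":
                continue  # missing values are a separate problem type
            try:
                parse(raw)
            except (ValueError, TypeError):
                errors.append((i, field, raw))
    return errors

rows = [
    {"user_id": "1001", "amount": "19.90", "signup_date": "2019-05-04"},
    {"user_id": "10O2", "amount": "abc", "signup_date": "2019-13-01"},
]
errors = find_error_values(rows)
for row_index, field, raw in errors:
    print(f"row {row_index}: field {field!r} has invalid value {raw!r}")
```

The second row fails on all three fields: a letter O typed in place of a zero, a non-numeric amount, and a month that does not exist.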

2. Duplicate values:

Records that are exact duplicates of each other, or records that the business can recognize as repeats based on their key information
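
The two kinds of duplicates can be checked separately. The sketch below counts full-row matches and key-only matches with `collections.Counter`; the order records and the choice of `order_id` as the business key are hypothetical.

```python
from collections import Counter

def exact_duplicates(rows):
    """Return each distinct row that appears more than once with identical fields."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(key) for key, n in counts.items() if n > 1]

def key_duplicates(rows, key_fields):
    """Return each business-key value that appears in more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical order records.
orders = [
    {"order_id": "A1", "user": "u1", "amount": 10},
    {"order_id": "A1", "user": "u1", "amount": 10},  # exact duplicate
    {"order_id": "A2", "user": "u2", "amount": 25},
    {"order_id": "A2", "user": "u2", "amount": 30},  # same key, different amount
]

same_rows = exact_duplicates(orders)
same_keys = key_duplicates(orders, ["order_id"])
print("exact duplicates:", same_rows)
print("duplicate keys:", same_keys)
```

Order A2 is not an exact duplicate, but it still repeats the business key, so only the key-based check catches it.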

3. Data inconsistency:

Whether data records follow the specifications, and whether they stay consistent with earlier and later records and with other data sets.
Data consistency covers both conformance of records to the data specifications and the logical consistency of the data
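
Both aspects can be expressed as simple rules applied to each record. The following is a sketch under assumed rules: the allowed status values, the field names, and the three checks are hypothetical examples, not rules taken from any particular system.

```python
from datetime import date

# Hypothetical specification of allowed status values.
ALLOWED_STATUS = {"created", "paid", "shipped"}

def consistency_issues(order):
    """Return a list of rule violations found in a single order record."""
    issues = []
    # Conformance to the specification: status must be a known value.
    if order["status"] not in ALLOWED_STATUS:
        issues.append("status not in specification")
    # Logical consistency: payment cannot precede creation.
    if order["paid_on"] is not None and order["paid_on"] < order["created_on"]:
        issues.append("paid before created")
    # Logical consistency: total must equal quantity times unit price (in cents).
    if order["total_cents"] != order["quantity"] * order["unit_price_cents"]:
        issues.append("total does not match quantity * unit price")
    return issues

good = {"status": "paid", "created_on": date(2019, 5, 1), "paid_on": date(2019, 5, 2),
        "quantity": 2, "unit_price_cents": 500, "total_cents": 1000}
bad = {"status": "PAID", "created_on": date(2019, 5, 3), "paid_on": date(2019, 5, 1),
       "quantity": 2, "unit_price_cents": 500, "total_cents": 900}

good_issues = consistency_issues(good)
bad_issues = consistency_issues(bad)
print(good_issues)
print(bad_issues)
```

Amounts are kept as integer cents here so that the equality check is not affected by floating-point rounding.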

4. Data completeness:

Whether data records and the information in them are complete, and whether anything is missing

5. Missing values:

Records that exist but have some information missing
Missing data mainly takes two forms: entire records that are missing, and records in which a field is missing. Both can make statistics inaccurate. Completeness is the most basic guarantee of data quality
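
The two forms of missing data call for different checks. As a rough sketch, assuming a daily table that should contain one row per calendar day, the code below lists the days with no record and computes the share of rows in which a required field is empty. The table layout and field names are hypothetical.

```python
from datetime import date, timedelta

def missing_days(rows, start, end):
    """Return every day in [start, end] that has no record at all."""
    present = {row["day"] for row in rows}
    n_days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(n_days))
    return [day for day in expected if day not in present]

def missing_field_rate(rows, required):
    """Return, per required field, the share of rows where it is None or empty."""
    if not rows:
        return {field: 0.0 for field in required}
    return {
        field: sum(1 for row in rows if row.get(field) in (None, "")) / len(rows)
        for field in required
    }

# Hypothetical daily revenue table: May 3 is absent, May 2 has no revenue value.
daily = [
    {"day": date(2019, 5, 1), "revenue": 120},
    {"day": date(2019, 5, 2), "revenue": None},
    {"day": date(2019, 5, 4), "revenue": 95},
]

gaps = missing_days(daily, date(2019, 5, 1), date(2019, 5, 4))
rates = missing_field_rate(daily, ["day", "revenue"])
print("days without a record:", gaps)
print("missing-field rates:", rates)
```

In a monitoring job, either result could be compared against a threshold chosen by the team to decide whether to raise a warning.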

6. Outliers:

Records that deviate significantly from the rest of the data, or records that contain erroneous data
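
One common way to flag significant deviations is the interquartile range (IQR) rule, shown here as an illustration rather than as the method the notes prescribe. The sketch uses `statistics.quantiles` from the standard library (Python 3.8 or later); the sample values and the conventional multiplier of 1.5 are assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical daily order counts with one suspicious reading.
daily_orders = [10, 12, 11, 13, 12, 11, 10, 95]

outliers = iqr_outliers(daily_orders)
print("outliers:", outliers)
```

A flagged value is only a candidate: whether it is a genuine business event or an erroneous record still has to be confirmed against the source.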
