"Low Cost Data Quality Management" WhaleDI Data Governance Tool

In the digital age, data has become a key element of enterprise management. As data accumulates, data quality has become central to data governance. Data quality directly determines whether data analysis can improve enterprise production, operations, and service efficiency and drive innovation. High-quality data plays an extremely important role in management decision-making and business support. Only continuous data quality improvement can strengthen the data governance system, maintain data quality levels, and provide a solid guarantee for enterprise data strategy.

First, what is data?

The word "data" is familiar to everyone. A relatively precise definition is: data are the symbols that record and identify objective events; they are physical symbols, or combinations of physical symbols, that record the nature, state, and relationships of objective things. In the telecom carrier industry, the IT data involved typically include asset data, order data, archived business information, customer data, and so on in the support systems. These data are generated throughout the entire business support process and carry a great deal of information.

Because data records information, it plays a role at different times in querying, statistics, analysis, and more. It not only provides information to its owner; by agreement it usually covers the business production of the entire enterprise, and sometimes external parties as well.

Common problems in data application development

Whether the goal is informatization or digitalization, the result is ultimately many data-based applications. Whether these applications achieve the expected business results not only tests the product experts who plan and design the application scenarios, but also depends on whether the data problems that have long troubled enterprises and their supporting vendors can be solved.

Incomplete: the information needed to describe an object is insufficient. The pieces often originate from different systems and different periods, and for various reasons can never be brought together.

Incorrect: this is closely related to how the data is generated. Errors in process data are usually caused by business logic or business rules, while errors in the data itself are usually caused by mistakes during input, and such mistakes are hard to control. If a customer's birth date is mistakenly recorded as a date in the last century, the system instantly gains a hundred-year-old customer.

Unintelligible: business systems grow haphazardly in their early days, which leads to chaotic model management. Years later, data developers and business staff can only watch helplessly as contradictory data accumulates.

Factors Affecting Data Quality

Another definition to learn: data quality is the degree to which a set of inherent attributes of data meets the requirements of data consumers. By this definition, business staff count as data consumers, since they represent the business requirements. Data quality directly affects how well data supports business requirements, so what factors affect data quality?

1 Management aspects

Lack of Effective Management Strategies

The status quo in many enterprises is to build first and manage later. In the early days there was no overall data planning, no unified data standards, and no clear data quality goals, so data conflicts and contradictions easily arise when different business departments process business.

Without an effective data accountability mechanism and without clearly designated data management departments and job responsibilities, responsibility cannot be assigned when data quality problems arise, and business departments shift the blame to one another.

Lack of unified data standards

A major challenge in data quality management is getting every department to build its business systems on data standards agreed by all. Without unified data standards it is difficult to understand the data consistently, and collaboration and communication across the business become a dialogue of the deaf.

2 Business aspects

Data entry is not standardized

The business department is not only the source of data requirements but also the producer of data. Human factors in business units are a major cause of low enterprise data quality. Common examples include spelling mistakes, data entered into the wrong fields, inconsistent capitalization, stray special characters, and so on, all of which lead to irregular data.

3 Technical aspects

Data design is not standardized

Insufficient attention to the quality of the data model at the design stage, an inadequate understanding of requirements, and poorly designed table structures, database constraints, and data validation rules mean that data entry either cannot be validated or is validated incorrectly, producing duplicate, incomplete, or inaccurate data.
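
As a generic illustration (not part of the WhaleDI tool, and using hypothetical table and column names), the sketch below shows how primary key, NOT NULL, and CHECK constraints designed into a table catch bad rows at entry time instead of letting them become downstream quality problems:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A table designed with explicit constraints: primary key, NOT NULL,
# and a CHECK rule on the phone-number length (illustrative only).
conn.execute("""
    CREATE TABLE customer (
        cust_id   TEXT PRIMARY KEY,
        cust_name TEXT NOT NULL,
        phone     TEXT CHECK (length(phone) = 11)
    )
""")

conn.execute("INSERT INTO customer VALUES ('C001', 'Alice', '13800000000')")

# Each of these violates a constraint and is rejected at entry time.
for bad_row in [
    ("C001", "Bob",   "13900000000"),  # duplicate primary key
    ("C002", None,    "13700000000"),  # NULL in a NOT NULL column
    ("C003", "Carol", "123"),          # phone fails the CHECK rule
]:
    try:
        conn.execute("INSERT INTO customer VALUES (?, ?, ?)", bad_row)
    except sqlite3.IntegrityError as e:
        print("rejected:", bad_row, "->", e)
```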

Data transmission is not standardized

Data transmission includes data acquisition, conversion, loading, storage, and other links. Inconsistent collection sources, inefficient collection interfaces, misconfigured conversion rules, and poorly designed loading and storage mechanisms lead to problems such as slow data delivery, inaccurate data, and incomplete data.

"Low-cost data quality management" as a tool for data governance

As the saying goes, "If a worker wants to do his job well, he must first sharpen his tools." A good set of data governance tools lets an enterprise's data governance work achieve twice the result with half the effort. How to control, audit, and monitor data quality at low cost, with high efficiency, across the full link, and in a closed loop has therefore become an important question.

WhaleDI "data quality management tool" is one of the important tools for the implementation of the enterprise data governance system. Through the quality standard management of data warehouse/source data, it is based on the whole process coverage, easy to use, complete rules, intelligence, closed-loop management, etc. The goal is to establish an operation mechanism of pre-event standard definition, mid-event full-link monitoring, and post-event analysis to form a PDCA quality management cycle and promote continuous improvement of data quality.

1 Data standard management: giving quality governance rules and evidence to follow

Unified definition of data standards clarifies the data management departments and the responsible parties, providing a basic guarantee for enterprise data quality governance. By defining unified standards, data mapping relationships, and data quality rules for data entities, data relationships, and data processing, data quality governance has rules and evidence to follow, safeguarding the improvement and optimization of enterprise data quality.

The core capabilities of data standard management include business term management, field library management, and term-to-field standard mapping.

(1) Business term management defines an enterprise-level public business glossary, establishes consensus on public business terms across departments, and manages business terms from a business perspective, including customer name, customer code, ID number, mobile phone number, email address, and so on, while cataloging, standardizing, and applying workflow-based management to these terms.

(2) The field library manages the logical fields of the data model, standardizing and unifying field names, field codes, field classifications, field types, associated business terms, associated data dictionaries, and associated synonyms, so that subsequent model development can reference them directly and stay consistent. Field standards in the field library, including field naming rules, field coding rules, value ranges, whether the field is a primary key, whether it must be unique, and whether it must be non-empty, can be mapped onto physical field data standards to enable standards checks on instance data.

(3) Associating business terms with field library fields establishes a 1:1 correspondence between business terms and logical fields, i.e. the mapping between business terms and logical fields. Script parsing (for example, a.cust_id=b.customer_id in a script, indicating that field a is populated from field b), scheduling task field mappings, synonyms, and so on establish a 1:N relationship between logical fields and physical fields. Through the 1:N relationship between a business term's logical field and the corresponding physical table fields, the data standards of the business term field, such as field naming rules, field coding rules, value ranges, primary key, uniqueness, and non-emptiness, can be automatically mapped to the corresponding physical table fields, achieving low-cost configuration and efficient application of data standards and providing an effective basis for subsequent data quality governance. A simple sketch of this mapping follows below.
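
The following is a minimal sketch of how such a term-to-field mapping could be represented. The names (cust_id, customer_id, the FieldStandard attributes) are hypothetical and only illustrate the 1:1 and 1:N relationships described above, not the tool's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class FieldStandard:
    """Data standard attached to a logical field (hypothetical attributes)."""
    naming_rule: str
    value_range: str
    is_primary_key: bool = False
    is_unique: bool = False
    not_null: bool = False

@dataclass
class LogicalField:
    code: str                     # logical field code in the field library
    business_term: str            # 1:1 associated business term
    standard: FieldStandard
    physical_columns: list[str] = field(default_factory=list)  # 1:N mapping

cust_id = LogicalField(
    code="cust_id",
    business_term="Customer Code",
    standard=FieldStandard(
        naming_rule="cust_id", value_range="C[0-9]{8}",
        is_primary_key=True, is_unique=True, not_null=True,
    ),
    physical_columns=["dwd.order_detail.cust_id", "ods.crm_customer.customer_id"],
)

def propagate_standard(lf: LogicalField) -> dict[str, FieldStandard]:
    """Map the logical field's standard onto every bound physical column."""
    return {col: lf.standard for col in lf.physical_columns}

print(propagate_standard(cust_id))
```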

2 Quality rule management: a reusable rule base with low-cost configuration

A rich quality rule base covering all scenarios

The data quality management tool provides a variety of visual rule template configuration capabilities covering the main audit scenarios in data quality management, including 20+ audit rules for data timeliness, integrity, consistency, accuracy, and logic. For auditing complex scenarios, it supports custom rule templates that can be flexibly configured through SQL, Shell, and stored procedures.

  • Timeliness rules: mainly used to check whether the data at the interface layer arrives on time, including table data, table field data, file data, etc.

  • Integrity rules: mainly used to check whether the records collected from business system tables or files into the interface layer tables or files are complete; based on the access conditions, the system judges whether the record counts, or the file names and sizes, on both sides are consistent.

  • Consistency rules: mainly used to check whether the table data collected from the business system is consistent with the data synchronized into the data warehouse interface layer, including detail-level consistency, indicator value consistency, and so on.

  • Accuracy rules: mainly used to check whether the fields of a table conform to the data standard definitions, including primary key uniqueness, non-duplication, non-nullness, foreign key accuracy, value ranges, coding rules, and so on.

  • Logic rules: compare the data of the target table's current accounting period with data from historical periods and check whether they satisfy given volatility, threshold, or balance-formula requirements, to judge data fluctuation.

  • Custom rules: for auditing complex scenarios, rules can be customized and flexibly configured through SQL, Shell, stored procedures, and so on; a sketch of one such rule follows this list.
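
As a hedged illustration of what a custom SQL rule might look like (the table, column names, and threshold below are hypothetical, not WhaleDI's built-in templates), an accuracy-style check can be expressed as a SQL statement whose result is compared against a threshold:

```python
import sqlite3

# Hypothetical custom rule: the null rate of order_detail.cust_id
# must stay below 1% for the current accounting period.
RULE_SQL = """
    SELECT 1.0 * SUM(CASE WHEN cust_id IS NULL THEN 1 ELSE 0 END) / COUNT(*)
    FROM order_detail
"""
THRESHOLD = 0.01

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (order_id TEXT, cust_id TEXT)")
conn.executemany(
    "INSERT INTO order_detail VALUES (?, ?)",
    [("O1", "C001"), ("O2", "C002"), ("O3", None)],
)

null_rate = conn.execute(RULE_SQL).fetchone()[0]
print("null rate:", null_rate, "->", "pass" if null_rate < THRESHOLD else "fail")
```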

Diverse configuration methods and low-cost configuration capabilities

Depending on the configuration scenario, the tool supports multi-view quality rule configuration, including a rule view, a table view, a task view, and so on, which can be selected on demand and applied flexibly. Beyond multi-view configuration, the product keeps improving its configuration-free, low-configuration, and batch configuration capabilities, reducing configuration cost and improving configuration efficiency.

  • Configuration-free rules: for example, comparing the total record counts of key source and target tables for consistency; a switch controls whether the script logs are parsed to obtain the table record counts.

  • Low-configuration rules: for example, non-null checks on a table's primary key, key dimensions, and key measures; the corresponding data quality rules are recommended and configured based on the data standards.

  • Batch configuration of rules: including batch configuration by data warehouse directory (tables under the directory automatically inherit the configuration), batch configuration from the table view, batch configuration via EXCEL import, and so on; a sketch of directory-level inheritance follows this list.
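
A minimal sketch of how directory-level batch configuration with inheritance might work; the directory names and rule template are hypothetical and only illustrate the idea of tables inheriting rules configured on their parent directory:

```python
# Hypothetical warehouse catalog: directory -> tables underneath it.
CATALOG = {
    "dwd/order": ["dwd.order_detail", "dwd.order_item"],
    "dwd/customer": ["dwd.customer_info"],
}

# A rule template configured once at the directory level.
DIRECTORY_RULES = {
    "dwd/order": [{"type": "not_null", "column": "cust_id"}],
}

def expand_batch_rules(catalog, directory_rules):
    """Every table inherits the quality rules configured on its directory."""
    expanded = {}
    for directory, tables in catalog.items():
        for table in tables:
            expanded[table] = list(directory_rules.get(directory, []))
    return expanded

print(expand_batch_rules(CATALOG, DIRECTORY_RULES))
```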

3 Quality audit engine: automatically generated audit results

Data quality auditing means establishing a data quality management organization, formulating quality management specifications, determining the corresponding workflows and methods, and implementing inspection, correction, and assessment functions in the system to form a closed-loop mechanism for correcting data quality. It is the process of checking data for legality and other properties: through the configuration of quality rules and the execution of the audit engine, checks are performed on data attributes, relationships between data attributes, and relationships between data tables.

The quality audit engine automatically parses the configured quality rules and policies, generates executable audit tasks, and outputs audit results automatically.

  • Audit object data source types: including MySQL, Oracle, GP, GBase, Hive, ES, HBase, FTP, etc., basically covering the data source types of business systems.

  • Quality audit task frequency configuration: the calendar supports both the Gregorian and lunar calendars, and the audit frequency can be set to monthly, daily, hourly, by minute, or aperiodic, chosen flexibly according to actual needs.

  • Efficient execution of quality audit tasks: task sharding can be configured on a field of the audited object, enabling partitioned sharding, multi-task multi-threading, and distributed execution to improve audit efficiency.

  • Automatic output of quality audit results: based on the policies configured for each quality rule, such as the threshold ranges for audit pass, audit warning, and audit failure, the audit engine automatically outputs audit results and detailed difference data, which business staff can view and track; a sketch of this threshold classification follows this list.
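
A minimal sketch of how an engine might classify a measured value against configured pass/warning/failure thresholds; the rule structure and numbers are hypothetical and only illustrate the classification step described above:

```python
# Hypothetical rule policy: a measured value below `warn` passes, between
# `warn` and `fail` raises a warning, and at or above `fail` fails the audit.
RULES = [
    {"name": "cust_id null rate", "warn": 0.01, "fail": 0.05,
     "measure": lambda: 0.002},   # stand-in for the real SQL/Shell check
    {"name": "order amount volatility", "warn": 0.10, "fail": 0.30,
     "measure": lambda: 0.42},
]

def classify(value: float, warn: float, fail: float) -> str:
    if value >= fail:
        return "audit failure"
    if value >= warn:
        return "audit warning"
    return "audit pass"

for rule in RULES:
    value = rule["measure"]()     # the engine would run the generated task here
    print(rule["name"], "->", classify(value, rule["warn"], rule["fail"]))
```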

4 Quality audit reports: reusable templates for quick reuse

The tool provides visual analysis of data quality audit results and supports drag-and-drop, component-based custom quality reports, including an overview of audit results, an overall data quality score, quality rule type scores by layer and by domain, quality trend charts of rule types by layer and by domain, and other multi-dimensional analyses, making the data quality situation visible at a glance.
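
One way such an overall score could be rolled up (a hypothetical weighting scheme, not the tool's actual formula) is a weighted average of the pass rate per quality dimension:

```python
# Hypothetical audit summary: pass rate per quality dimension and the
# weight each dimension contributes to the overall score.
DIMENSIONS = {
    "timeliness":  {"pass_rate": 0.99, "weight": 0.15},
    "integrity":   {"pass_rate": 0.97, "weight": 0.25},
    "consistency": {"pass_rate": 0.95, "weight": 0.25},
    "accuracy":    {"pass_rate": 0.92, "weight": 0.25},
    "logic":       {"pass_rate": 0.98, "weight": 0.10},
}

def overall_score(dimensions: dict) -> float:
    """Weighted average of pass rates, scaled to a 0-100 score."""
    total_weight = sum(d["weight"] for d in dimensions.values())
    weighted = sum(d["pass_rate"] * d["weight"] for d in dimensions.values())
    return round(100 * weighted / total_weight, 1)

print("overall data quality score:", overall_score(DIMENSIONS))
```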

Generated data quality reports can be shared with organizations, users, roles, and so on; reports can be exported as images, PDF, xlsx, html, txt, and more, and pushed by SMS, email, FTP, and other channels on a configured schedule, so that the data quality situation gets the attention it needs.

Defined data quality reports can be saved as report templates for quick reuse, reducing the cost of report configuration.

5 Quality problem management: process-based, closed-loop problem handling

For data quality rules, template-based dispatch configuration is supported, including the work order recipients, the work order handling process, and handling time limits. For problems that fail a data quality audit, the system automatically issues early-warning work orders according to the configuration, notifying the source system or the person responsible for the problem to carry out quality rectification. The configured dispatch process can be saved as a template and referenced directly in later configurations, reducing the configuration workload.

When the work order handler completes the rectification and returns the order, the system automatically starts the quality audit task again to re-audit the corrected data; the work order can only be archived after this second audit passes. The quality work order process builds a closed loop for data quality governance, empowering platform operations to cut costs and improve efficiency.
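
The closed loop can be pictured as a small state machine; the states and transitions below are a hypothetical simplification of the work order flow described above, not the tool's actual workflow model:

```python
# Hypothetical work order states for the closed-loop quality process.
TRANSITIONS = {
    ("dispatched", "rectified"): "returned",     # handler fixes data, returns order
    ("returned", "reaudit_pass"): "archived",    # second audit passes -> close loop
    ("returned", "reaudit_fail"): "dispatched",  # second audit fails -> rework
}

def next_state(state: str, event: str) -> str:
    return TRANSITIONS.get((state, event), state)

state = "dispatched"
for event in ["rectified", "reaudit_fail", "rectified", "reaudit_pass"]:
    state = next_state(state, event)
    print(event, "->", state)
```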

At present, data quality management tools have been deployed in many projects for telecom operators at home and abroad and in government and enterprise industries. For example, Chongqing Telecom's big data platform audits metadata and instance data every day for timeliness, completeness, consistency, accuracy, and logic, and has accumulated 570+ quality audit rules. Automated, process-driven, closed-loop data quality management means less configuration effort, faster discovery of data problems, and lower data quality management costs, helping enterprises improve data quality across the board and laying a core foundation for enterprise data governance.

6 Quality monitoring and management: visual, comprehensive quality monitoring

Enterprise data processing is often cross-system: from collection to application it involves multiple systems, links, and processes. The data link is long, the processing is complicated, and full-link lineage is often never formed. On top of that there are tens of thousands of collection and scheduling tasks, scattered across the estate, making it impossible to assess upstream and downstream quality impact. Visual, comprehensive quality monitoring centered on business applications is therefore particularly important.

Hierarchical business application

The ultimate goal of enterprise data quality governance is to increase the value of data and better serve the business. Viewing full-link data quality with business applications as the final focus is therefore more in line with actual business demands.

Full Link Data Lineage

Data lineage refers to the relationships formed between data along the entire data link over the data's life cycle, mainly including table-level lineage and field-level lineage. Through lineage analysis, the information generated and recorded as data flows is automatically collected, processed, and analyzed; the lineage relationships between data are systematically sorted out, correlated, and analyzed; the resulting information is stored; and finally the full link is visualized, which helps locate quality problems efficiently and quickly assess the scope of impact.
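
A minimal sketch of field-level lineage as a directed graph with a downstream-impact traversal; the edges below (for example ods.crm_customer.customer_id feeding dwd.order_detail.cust_id) are hypothetical and stand in for what would normally be extracted by parsing scripts and scheduling tasks:

```python
from collections import defaultdict, deque

# Field-level lineage edges: upstream physical field -> downstream physical fields.
LINEAGE = defaultdict(list, {
    "ods.crm_customer.customer_id": ["dwd.order_detail.cust_id"],
    "dwd.order_detail.cust_id": ["dws.cust_order_sum.cust_id",
                                 "ads.churn_report.cust_id"],
})

def downstream_impact(start_field: str) -> list[str]:
    """Breadth-first traversal: every field affected if start_field changes."""
    seen, queue, impacted = {start_field}, deque([start_field]), []
    while queue:
        for nxt in LINEAGE[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                impacted.append(nxt)
                queue.append(nxt)
    return impacted

print(downstream_impact("ods.crm_customer.customer_id"))
```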

Quality Impact Visual Analysis

In data production and processing, data changes can affect downstream links. Visually monitoring changes such as table structure changes, script changes, and task changes, and then analyzing the affected upstream or downstream data links based on lineage analysis, helps predict problems in advance and avoid or reduce the impact on business applications.

Visual monitoring of application quality

By monitoring and analyzing the entire data link, the timeliness of data delivery can be predicted in advance, and quality information such as data fluctuation and data distribution can be monitored, which helps find and locate problems quickly, intervene in time, reduce the occurrence of quality problems, and lower the impact of problems on the business and on operations and maintenance costs.

Source: blog.csdn.net/whalecloud/article/details/128375690