Data Quality Construction in the Data Warehouse (In-Depth)


The real difficulty in data warehouse construction is not the warehouse design itself, but the data governance that follows as the business grows and the business lines multiply. The scope of data governance is very wide, covering data management, data security, data quality, data cost, and more. Among all of these, which is the most important? Data quality governance, because data quality is the basis for the validity and accuracy of data analysis conclusions and the premise for everything else. Therefore, ensuring data quality and data availability is a step that cannot be ignored in data warehouse construction.

The scope of data quality is also very wide. It runs through the entire life cycle of the data warehouse: data generation -> data access -> data storage -> data processing -> data output -> data display. Quality governance is required at every stage.

At every stage of system construction, data quality should be checked against agreed standards and governed promptly, to avoid costly cleanup later.

This article was first published on the WeChat official account [Five Minutes to Learn Big Data], where the complete series on data governance and data warehouse construction is available.

1. Why Data Quality Assessment Is Needed

Many newcomers to data work dive straight into exploration and statistical analysis as soon as they get the data, trying to immediately uncover the information and knowledge hidden behind it. After a flurry of activity, however, they find that they cannot extract much valuable information and have wasted a great deal of time and energy. For example, the following scenarios may occur when working with data:

Scenario 1: As a data analyst, you need to count users' purchases over the past 7 days. After pulling statistics from the data warehouse, you find that many records are duplicated and that some figures do not even use consistent statistical units.

Scenario 2: While reviewing a business report, you find that the transaction GMV of a certain day has plummeted. After investigation, it turns out that the data for that day is missing.

An important cause of such situations is that the objective assessment of data quality is ignored and no reasonable measurement standards are established, so data problems go undetected. It is therefore both necessary and important to establish scientific and objective data quality measurement standards.

2. Data Quality Metrics

How should data quality be evaluated? The industry has no single standard. I summarize six dimensions for evaluation: completeness, normativity, consistency, accuracy, uniqueness, and timeliness.

  1. Data completeness

Completeness refers to whether data information is missing. The missing data may be an entire record, or the value of a particular field within a record.

  2. Data normativity

Normativity describes the degree to which data follows predetermined syntax rules and conforms to its definition, such as data type, format, and value range.

  3. Data consistency

Consistency refers to whether data follows a unified specification and whether a data collection maintains a unified format. Consistency in data quality is mainly reflected in whether data records follow the specification and whether the data is logically coherent. Consistency does not mean absolute sameness in value, but consistency in data collection and processing methods and standards. Common consistency indicators include: ID overlap, consistent attributes, consistent values, consistent collection methods, and consistent transformation steps.

  4. Data accuracy

Accuracy refers to whether the information recorded in the data contains anomalies or errors. Unlike consistency, data with accuracy problems is not merely inconsistent with rules: common accuracy errors include garbled values, and abnormally large or small values also count as unqualified data. Common accuracy indicators include: percentage of missing values, percentage of wrong values, percentage of outliers, sampling bias, and data noise.

  5. Data uniqueness

Uniqueness means that data in the database is not duplicated. For example, if 10,000 transactions actually occurred but 3,000 records in the data table are duplicated, the table ends up with 13,000 transaction records; such data does not satisfy uniqueness.

  6. Data timeliness

Timeliness refers to the time interval between when data is generated and when it can be viewed, also known as the data's delay. For example, if a metric is computed offline today but the result is not available until the next day or even the day after, the data does not satisfy timeliness.
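To make these dimensions concrete, below is a minimal sketch of completeness, uniqueness, and timeliness checks on a table of order records. It assumes a pandas DataFrame and hypothetical column names such as order_id and create_time, not any particular platform:

```python
import pandas as pd

def check_quality(df: pd.DataFrame, key: str, required_cols: list,
                  ts_col: str, max_delay_hours: float = 24.0) -> dict:
    """Minimal completeness / uniqueness / timeliness checks on one table."""
    report = {}
    # Completeness: share of missing values in each required field
    report["null_ratio"] = df[required_cols].isna().mean().to_dict()
    # Uniqueness: number of duplicated primary keys
    report["duplicate_keys"] = int(df[key].duplicated().sum())
    # Timeliness: delay between the newest record and now
    latest = pd.to_datetime(df[ts_col]).max()
    delay_hours = (pd.Timestamp.now() - latest).total_seconds() / 3600
    report["delayed"] = delay_hours > max_delay_hours
    return report

# Hypothetical usage
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": [10.0, None, 20.0, 30.0],
    "create_time": pd.to_datetime(["2024-01-01"] * 4),
})
print(check_quality(orders, key="order_id",
                    required_cols=["order_id", "amount"],
                    ts_col="create_time"))
```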

A few other dimensions and their metrics, briefly listed here:

  • Referential integrity: whether a data item is defined in its parent table
  • Dependency consistency: whether the value of a data item satisfies its dependencies on other data items
  • Correctness: whether the data content is consistent with its definition
  • Precision: whether the data precision reaches the number of digits required by business rules
  • Technical validity: whether data items are organized according to the defined format standard
  • Business validity: whether data items conform to the defined business rules
  • Credibility: derived from customer surveys or customer feedback
  • Availability: the ratio of the time the data is available to the time the data needs to be accessed
  • Accessibility: whether the data can easily be read automatically

3. Data Quality Management Process

The process of this section is shown in the following figure:

1. Data Asset Level

1) Level definition

Data asset levels are divided according to how much the business is impacted when data quality fails to meet the requirements of completeness, normativity, consistency, accuracy, uniqueness, or timeliness.

  1. Destructive: once the data is wrong, it causes huge asset losses or major revenue losses. Marked as L1.

  2. Global: the data is used for group-level business, enterprise-level performance evaluation, and important decision-making. Marked as L2.

  3. Local: the data is used for the daily operations and analysis reports of a single business line; problems have a certain impact on that business line or reduce its efficiency. Marked as L3.

  4. General: the data is used for everyday data analysis, and problems have little impact. Marked as L4.

  5. Unknown: the data's application scenarios cannot be traced. Marked as Lx.

Importance: L1 > L2 > L3 > L4 > Lx. If a piece of data appears in multiple application scenarios, it is marked according to the most important one.

2) Level tagging

After the data asset levels are defined, we can start from the data processing links to tag each data asset with its level, completing the asset-level confirmation and assigning different degrees of importance to different data.

1. Analyze the data link:

Data is generated in the business systems, enters the data warehouse through synchronization tools, goes through cleaning, processing, integration, algorithms, models, and other operations in the data warehouse in the broad sense, and is then output through synchronization tools to data products for consumption. From business system to data warehouse to data product, data flows in the form of tables, as shown in the figure below:

2. Tag the data asset level:

Across all data links, sort out the application services that consume each table. By assigning data asset levels to these application services and combining the upstream and downstream dependencies of the data, the entire link is tagged with the corresponding asset level.

Example:

Suppose the company has a unified order service center, and an application at the application layer counts the company's order quantity and order amount by business line, commodity type, and region, named order_num_amount.

Assuming this application influences important business decisions of the entire enterprise, we classify it as L2. The tables along its entire data link can then be tagged L2-order_num_amount, all the way back to the source business system, as shown in the figure below:
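As an illustration of this upstream tagging, here is a minimal sketch under an assumed, hypothetical lineage map from each table to the tables it is derived from; each table keeps the most important level among all applications that consume it:

```python
from collections import deque

# Asset levels ordered by importance: L1 > L2 > L3 > L4 > Lx
LEVEL_RANK = {"L1": 4, "L2": 3, "L3": 2, "L4": 1, "Lx": 0}

def tag_lineage(upstreams, app_tables, app_level, labels):
    """Propagate an application's asset level to every upstream table."""
    queue = deque(app_tables)
    while queue:
        table = queue.popleft()
        current = labels.get(table, "Lx")
        if LEVEL_RANK[app_level] > LEVEL_RANK[current]:
            labels[table] = app_level
            queue.extend(upstreams.get(table, []))  # keep walking upstream
        # a table that already carries a higher level is left unchanged
    return labels

# Hypothetical lineage for the order_num_amount application (classified L2)
lineage = {
    "ads_order_num_amount": ["dws_order_agg"],
    "dws_order_agg": ["dwd_order_detail"],
    "dwd_order_detail": ["ods_order"],
}
print(tag_lineage(lineage, ["ads_order_num_amount"], "L2", {}))
# every table on the link ends up tagged L2
```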

2. Checkpoint verification during data processing

1) Online system data verification

Online business is complex and constantly changing, and every change brings changes to the data. The data warehouse needs to adapt to this ever-changing business and ensure data accuracy in a timely manner.

Based on this, how to efficiently notify the offline data warehouse of online business changes is also a problem that must be considered. To keep online and offline data consistent, we can address it as far as possible through a combination of tools and personnel management: tools should automatically capture every business change, and at the same time developers should make a habit of proactively announcing business changes.

1. Business release platform:

Monitor major business changes on the business release platform and notify the data department of changes in time by subscribing to the release process.

Because business systems are complex and change frequently, notifying the data department of every daily release would waste resources unnecessarily. Here we can use the data asset level tags created earlier: for data assets involved in high-level data applications, sort out which types of business changes affect data processing or require adjustment of statistical calibers, and make sure those changes are reported to the data department promptly (a simple notification filter is sketched below).
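A minimal sketch of such a filter, assuming a hypothetical release-event structure and a notify() placeholder for the actual channel; only changes touching L1/L2 assets trigger a notification:

```python
# Levels whose changes must always reach the data team
NOTIFY_LEVELS = {"L1", "L2"}

def on_release(event: dict, table_levels: dict) -> None:
    """Called for every release event subscribed from the business release platform."""
    affected = event.get("affected_tables", [])
    important = [t for t in affected if table_levels.get(t) in NOTIFY_LEVELS]
    if important:
        notify("data-team",
               f"Business change {event['change_id']} touches high-level assets",
               f"Tables affected: {', '.join(important)}")

def notify(to: str, subject: str, body: str) -> None:
    # Placeholder: hook this up to email / IM in a real system
    print(f"[{to}] {subject}\n{body}")

# Hypothetical usage
on_release({"change_id": "CR-1024", "affected_tables": ["ods_order", "ods_log"]},
           {"ods_order": "L2", "ods_log": "L4"})
```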

If the company does not have its own business release platform, it needs to reach an agreement with the business departments: business changes involving high-level data assets must be reported to the data department promptly by email or other written notice.

2. Personnel management:

Tools are only a means of assistance; the people who use them are the core. The upstream and downstream meaning of data asset levels needs to be communicated to online business system developers, so that they know which data are important core assets and which are only used temporarily for internal analysis, thereby raising their awareness of data risk.

Through training, online developers can be informed of the demands of data quality management, the end-to-end data processing flow, and the usage methods and application scenarios of the data products, so that they understand the importance, value, and risks of data. This ensures that online developers take data goals into account while pursuing business goals, keeping the business side and the data side aligned.

2) Offline system data verification

As data moves from the online business systems to the data warehouse and then to data products, cleaning and processing must be completed at the data warehouse layer; it is this processing that gives rise to the data warehouse models and code. Ensuring the quality of this processing is an important part of guaranteeing data quality in the offline data warehouse.

In these links, we can use the following methods to ensure data quality:

  1. Code submission check:

Develop a rule engine to assist code submission and review (a naming-convention check is sketched after this list). The rules fall roughly into:

  • Code specification rules: table naming conventions, field naming conventions, life cycle settings, table comments, etc.;

  • Code quality rules: reminders when a denominator may be 0, reminders when NULL values participate in calculations, etc.;

  • Code performance rules: large-table reminders, repeated-calculation monitoring, reminders for joins between large and small tables, etc.
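Below is a minimal sketch of one such code-specification rule, assuming a hypothetical layered naming convention (ods_/dwd_/dws_/ads_ plus lowercase words) and that table comments and life cycles are mandatory:

```python
import re

# Hypothetical naming convention: layer prefix + lowercase words joined by "_"
TABLE_NAME_PATTERN = re.compile(r"^(ods|dwd|dws|ads)_[a-z0-9]+(_[a-z0-9]+)*$")

def check_table_spec(table_name: str, comment: str, lifecycle_days) -> list:
    """Return the list of rule violations for a submitted table definition."""
    violations = []
    if not TABLE_NAME_PATTERN.match(table_name):
        violations.append(f"table name '{table_name}' violates the naming convention")
    if not comment:
        violations.append("table comment is missing")
    if lifecycle_days is None:
        violations.append("life cycle is not set")
    return violations

print(check_table_spec("dwd_order_detail", "order detail fact table", 365))  # []
print(check_table_spec("tmpOrders", "", None))  # three violations
```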

  2. Code release check:

Strengthen the testing process: test in the test environment first, then release to the production environment, and consider the release successful only after verification in production also passes.

  3. Task changes and data reruns:

Before performing any data update, notify downstream consumers of the reason for the change, the logic of the change, and when it will happen. Carry out the change or release only at the agreed time, after downstream parties have raised no objection.

3. Data processing risk monitoring

Risk point monitoring mainly watches for the risks data is prone to during daily operation and sets up an alarm mechanism for them. It covers risk point monitoring for both online data and offline data.

1) Data quality monitoring

The data production process of the online business system needs to guarantee data quality, and data is monitored mainly according to business rules.

For example, a trading system configures verification rules for fields such as order-taking time, order completion time, order payment amount, and order status flow. The order-taking time can certainly not be later than the current time, nor earlier than the time the business went online. Once an abnormal order creation time appears, an alarm is raised immediately and sent to several people at once, so that problems are discovered and resolved promptly.
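A minimal sketch of such a rule, assuming a hypothetical business launch date and an alert() placeholder for the actual notification channel:

```python
from datetime import datetime

# Hypothetical date the business went online; earlier timestamps are invalid
BUSINESS_ONLINE_TIME = datetime(2020, 1, 1)

def check_order_create_time(order_id: str, create_time: datetime) -> None:
    """Business rule: creation time must fall between business launch and now."""
    now = datetime.now()
    if create_time > now or create_time < BUSINESS_ONLINE_TIME:
        alert(f"order {order_id} has an abnormal create_time: {create_time}")

def alert(message: str) -> None:
    # Placeholder: send via SMS / email / IM groups in a real system
    print(f"[ALERT] {message}")

check_order_create_time("o-1001", datetime(2019, 12, 31))  # triggers an alert
```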

As the business scope covered grows, rules multiply and the cost of configuring and operating them rises. At this point, monitoring can be targeted according to the data asset levels defined earlier.

Offline data risk point monitoring mainly covers data accuracy and the timeliness of data output, and all data processing jobs are monitored on the data scheduling platform.

Take Alibaba's DataWorks data scheduling tool as an example. DataWorks is a one-stop development workbench based on the MaxCompute computing engine that helps enterprises quickly complete a full set of data work: integration, development, governance, quality, and security.

DQC in DataWorks implements data quality monitoring and alarming for offline data processing by configuring data quality verification rules.

The following figure is the workflow of DQC:

DQC monitoring rules come in strong rules and weak rules:

  • Strong rule: once an alarm is triggered, task execution is blocked (the task is set to a failed state so that downstream tasks are not triggered).

  • Weak rule: only an alarm is raised; task execution is not blocked.

DQC provides common rule templates, including the fluctuation of a table's row count compared with N days ago, the fluctuation of the table's storage size compared with N days ago, the fluctuation of a field's maximum/minimum/average value compared with N days ago, and the number of null or unique values in a field, among others.

A DQC check is itself a SQL task, nested inside the main task; once there are too many checkpoints, overall performance naturally suffers. Therefore rule configuration should still depend on the data asset level: for example, L1 and L2 data must have monitoring coverage above 90% and at least three types of rules, while unimportant data assets have no mandatory requirements.
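As an illustration of the idea behind such a rule (not the DQC API itself), here is a minimal sketch of a row-count volatility check that either blocks the task (strong rule) or only alarms (weak rule):

```python
def check_row_count_volatility(today_count: int, baseline_count: int,
                               threshold: float = 0.5, strong: bool = True) -> None:
    """Compare today's row count against a baseline from N days ago."""
    if baseline_count == 0:
        change = float("inf") if today_count else 0.0
    else:
        change = abs(today_count - baseline_count) / baseline_count
    if change > threshold:
        message = f"row count changed by {change:.0%} versus the baseline"
        if strong:
            # strong rule: fail the task so downstream tasks are not triggered
            raise RuntimeError(message)
        # weak rule: alarm only, execution continues
        print(f"[WARN] {message}")

check_row_count_volatility(today_count=9_800, baseline_count=10_000)  # within threshold
```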

2) Data timeliness monitoring

On the premise of ensuring data accuracy, data must also be served in a timely manner; otherwise its value is greatly reduced or even lost. Ensuring data timeliness is therefore also an important part of ensuring data quality.

  1. Task priority:

For scheduling tasks on the DataWorks platform, priority can be set through the intelligent monitoring tool. DataWorks scheduling forms a tree structure: when a priority is configured on a leaf node, it is propagated to all of its upstream nodes, and the leaf node is usually the consuming node on the service side.

Therefore, when setting priorities, first determine the asset level of the business: the higher the level, the higher the priority of its consuming node, which is then scheduled first and occupies computing resources first, guaranteeing on-time output for high-level businesses.

In short, scheduling tasks are prioritized by data asset level, and the data demands of high-level businesses are satisfied first (a sketch of priority propagation and ordering follows).
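A minimal sketch of the two steps involved, assuming a hypothetical task graph where upstreams maps each task to the tasks it depends on; the leaf node's priority is pushed upstream, and ready tasks grab resources in priority order:

```python
import heapq

def propagate_priority(upstreams, leaf, priority, priorities):
    """Push a leaf (consumer) node's priority to every task upstream of it."""
    stack = [leaf]
    while stack:
        task = stack.pop()
        if priority > priorities.get(task, 0):
            priorities[task] = priority
            stack.extend(upstreams.get(task, []))
    return priorities

def next_to_run(ready_tasks, priorities):
    """Order ready tasks so higher-priority (higher-level asset) tasks run first."""
    heap = [(-priorities.get(t, 0), t) for t in ready_tasks]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# Hypothetical usage: an L2 consumer node gets priority 8, which flows upstream
upstreams = {"ads_report": ["dws_agg"], "dws_agg": ["dwd_detail"], "dwd_detail": []}
prio = propagate_priority(upstreams, "ads_report", 8, {})
print(next_to_run(["dwd_detail", "misc_job"], prio))  # dwd_detail runs first
```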

  2. Task alarm:

Task alarms are configured similarly to priorities, through the DataWorks intelligent monitoring tool: only the leaf node needs to be configured, and the alarm configuration is propagated upstream. During task execution, errors or delays may occur; to guarantee the output of the most important data (that is, data with a high asset level), errors must be handled immediately and delays must be intervened in.

  3. DataWorks intelligent monitoring:

When DataWorks schedules offline tasks, it provides an intelligent monitoring tool to monitor and alert on scheduled tasks. Based on monitoring rules and how tasks are running, intelligent monitoring decides whether to alarm, when to alarm, how to alarm, and whom to alarm, automatically choosing the most reasonable alarm time, method, and recipient.

4. Finally

To truly solve data quality problems, we must clarify business requirements, control data quality starting from those requirements, and establish a data quality management mechanism. Define problems from the business perspective, let tools find them automatically and in time, identify the owner of each problem, and notify that person promptly through email, text messages, and other channels. Track the rectification of each problem to ensure end-to-end management of data quality issues.

This article was first published on the WeChat official account [Five Minutes to Learn Big Data], where the complete series on data governance and data warehouse construction is available!
