Systematic thinking about data quality

|0x00 Quality Standard System

When we talk about the quality of a product, we usually think of ISO standards such as ISO9000. An ISO certification mark is among the most convincing evidence of a product's quality.

Is there a comparable standard in the data field? There is: ISO8000, ISO9126, and GB/T36344-2018, for example. But these standards are "heavy": they are hard to understand, reference material is hard to find, and implementing them to the letter is impractical. A more workable approach for data practitioners is to distill their essence into a few major principles, then fill in the details according to the company's actual situation.

Take the ISO9126 software quality model as an example: it defines 6 major characteristics and 27 sub-characteristics, most of which transplant well to the data field.

The ISO9126 quality system is shown in the figure below:

[Figure: the ISO9126 quality model, with its 6 characteristics and their sub-characteristics]

|0x01 Thinking through the 6 first-level characteristics from a data perspective

Let's take the first-level characteristics of ISO9126 and go through them one by one from the perspective of the data field.

【Functionality】

Functionality is the ability to provide the functions required of a software/data product, covering suitability, accuracy, interoperability, and confidentiality. In plain language: the data must be available, accurate, complete, and secure. The data can be delivered according to the agreed standard, the figures are correct, user demands are met, and security is guaranteed.

These sub-characteristics are mentioned in many articles, but they are rarely elevated into a data quality system. Take data consistency: the concept shows up in many places, such as CAP theory, database foreign keys, and reverse interface development. Yet consistency is ultimately governed by data accuracy, so it should be read as an interpretation of accuracy rather than as a separate sub-item.
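
To make "available, accurate, complete" concrete, here is a minimal sketch of rule-based checks on a delivered table, using pandas; the table and the column names (`order_id`, `amount`) are hypothetical, not from any particular system:

```python
import pandas as pd

# Hypothetical delivered table; column names are illustrative only.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [10.0, None, 25.0, -3.0],
})

def check_completeness(df: pd.DataFrame, col: str) -> float:
    """Share of non-null values in a column (1.0 = fully populated)."""
    return df[col].notna().mean()

def check_uniqueness(df: pd.DataFrame, col: str) -> bool:
    """Primary-key style check: no duplicate values allowed."""
    return not df[col].duplicated().any()

def check_accuracy(df: pd.DataFrame, col: str) -> bool:
    """Domain rule: amounts must be non-negative."""
    return bool((df[col].dropna() >= 0).all())

print("completeness(amount):", check_completeness(df, "amount"))  # 0.75
print("uniqueness(order_id):", check_uniqueness(df, "order_id"))  # False
print("accuracy(amount):", check_accuracy(df, "amount"))          # False
```

In practice such rules would be attached to each delivery table and run after every load, so a functionality problem is caught before the user sees it.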

【Reliability】

Reliability is the ability of a product to complete a specified function under specified conditions within a specified time. Mapped to the data system, it means that when users need the data, it is output on time. The corresponding requirement is the reliability of the data pipeline: whether upstream data is produced on time, whether job scheduling is correct, and so on.
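
As a minimal sketch of the "output on time" requirement, the check below compares a daily partition's landing time against an agreed SLA cut-off; the times and the 07:00 deadline are invented for illustration:

```python
from datetime import datetime, time

def partition_on_time(landed_at: datetime, sla: time) -> bool:
    """True if the daily partition landed before the agreed SLA cut-off."""
    return landed_at.time() <= sla

# Hypothetical example: the table is expected by 07:00 each morning.
landed = datetime(2021, 1, 5, 7, 42)          # actual landing time
print(partition_on_time(landed, time(7, 0)))  # False -> alert the on-call
```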

【Usability】

Usability is the product's ability, under specified conditions of use, to be understood, learned, used, and found attractive by users. Most data practitioners neglect this: can the reports we build actually be read and understood by users, or are we merely producing reports and dumping numbers? For example, do users accept our definitions of data indicators? Do the figures we provide truly reflect changes in the business, or do they exist only for the sake of statistics? Whether users can understand the results and whether the results meet their demands is usually a matter of product requirements and project planning.

【Efficiency】

Efficiency is the ability of a software product to provide appropriate performance relative to the resources used, under specified conditions. The recently popular real-time data warehouse, or the everyday problem of data skew, are both interpretations of efficiency. These capabilities ultimately affect the value the software can deliver: the more real-time the data, the harder the computation problem, and, in theory, the greater the value increment it can bring.
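
As a concrete illustration, data skew can be spotted by comparing the hottest key against the average key frequency. Below is a minimal Python sketch with made-up data; in practice the check would run over a sample of join or group-by keys:

```python
from collections import Counter

def skew_ratio(keys) -> float:
    """Ratio of the hottest key's count to the mean count per key.
    A high ratio means one task/reducer will carry most of the load."""
    counts = Counter(keys)
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean

# Hypothetical join keys: user 42 dominates the distribution.
keys = [42] * 9_000 + list(range(1_000))
print(f"skew ratio: {skew_ratio(keys):.1f}")  # very high -> consider salting the key
```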

【Maintainability】

Maintainability is the ability to repair specified functions with specified tools or methods, under specified conditions and within a specified time. For fast-growing businesses such as internet companies, chimney-style (siloed) development adopted just to meet immediate demand makes maintenance a disaster, so the focus of maintainability is enabling data reuse. Generally there are two angles for improving the maintainability of data: the software itself, which can provide Cube-style pre-calculation, and the development process, which improves the quality of the data model and lowers the cost of understanding and maintenance.
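
The Cube-style pre-calculation mentioned above can be sketched in a few lines of pandas: aggregate once at the finest shared grain, then serve every coarser report from that pre-computed table instead of re-reading raw data. The table and dimension names here are invented for illustration:

```python
import pandas as pd

# Hypothetical fact table at (date, city, category) grain.
facts = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "city": ["Beijing", "Shanghai", "Beijing"],
    "category": ["books", "books", "toys"],
    "gmv": [100.0, 80.0, 50.0],
})

# Pre-compute once at the finest shared grain ...
cube = facts.groupby(["date", "city", "category"], as_index=False)["gmv"].sum()

# ... then any coarser report is a cheap rollup of the cube, not of the raw logs.
by_city = cube.groupby("city", as_index=False)["gmv"].sum()
by_date = cube.groupby("date", as_index=False)["gmv"].sum()
print(by_city)
print(by_date)
```

The design point is reuse: one well-modeled intermediate table replaces many one-off "chimney" pipelines that would each need separate maintenance.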

【Portability】

Portability is the ability to migrate from one environment to another, and it tests the data architecture and tooling. For example, when requirements shift from offline to real-time, can previously written SQL be migrated smoothly? Or when the stack moves from A to B (Hadoop -> Spark, Storm -> Flink), how high is the cost?
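
One way to keep that migration cost down is to keep business rules in engine-neutral functions and hide engine specifics behind a thin seam. The sketch below is a toy illustration of the design; the engine classes are hypothetical stand-ins, not real Hadoop/Spark/Flink APIs:

```python
from typing import Iterable, Protocol

class Engine(Protocol):
    """Thin seam between business logic and the execution engine."""
    def run(self, rows: Iterable[dict]) -> list: ...

def clean_orders(rows: Iterable[dict]) -> Iterable[dict]:
    """Engine-neutral business rule: keep paid orders only."""
    return (r for r in rows if r.get("status") == "paid")

class BatchEngine:
    def run(self, rows: Iterable[dict]) -> list:
        # Stands in for an offline batch job.
        return list(clean_orders(rows))

class StreamEngine:
    def run(self, rows: Iterable[dict]) -> list:
        # Stands in for a streaming job: same rule, different runtime.
        return [r for r in clean_orders(rows)]

rows = [{"status": "paid"}, {"status": "refunded"}]
assert BatchEngine().run(rows) == StreamEngine().run(rows)
```

When the logic lives in `clean_orders` rather than inside engine-specific jobs, swapping the engine means rewriting the seam, not the rules.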

|0x02 Splitting the definition of data quality

Can these characteristics be used directly? Conceptually, yes, but on their own they do not extend into our daily work; they stay at the conceptual stage. How to connect these standards with day-to-day work is the next question to think about.

Looking at it from the result side: if the user experience is bad, i.e. there is a functionality problem, it gets attributed to data quality, because what is ultimately delivered is the result. To solve this, we have to think about the entire data development pipeline, that is, the reliability of the development process. Many different accidents can occur during development, producing inaccurate data, late output, or fluctuating results, but from the user's point of view the outcome is the same: the "data quality" is bad.

Therefore, data quality should be further divided into "user-visible" quality and "R&D-visible" quality. Addressing the user-visible side is the "temporary cure": quickly restore the results to resolve the user's immediate predicament. Addressing the R&D-visible side is the "permanent cure": tackle fundamental issues in the development process, such as platform reliability, modeling clarity, and data security, to solve the long-term dilemma.
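
As a toy illustration of this split, quality incidents can be tagged by which audience perceives them first; the incidents below are invented examples:

```python
from dataclasses import dataclass
from enum import Enum

class Visibility(Enum):
    USER = "user-visible"  # wrong or late numbers the consumer can see
    RND = "rnd-visible"    # platform, modeling, or security issues behind them

@dataclass
class Incident:
    description: str
    visibility: Visibility

backlog = [
    Incident("Dashboard GMV dropped 30% overnight", Visibility.USER),
    Incident("Upstream partition landed 3 hours late", Visibility.RND),
    Incident("Two marts define 'active user' differently", Visibility.RND),
]

# "Temporary cure": restore USER items fast; "permanent cure": burn down RND items.
for i in backlog:
    print(f"[{i.visibility.value}] {i.description}")
```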

|0xFF "Cure the root cause", we must change the concept

Larger companies have a data testing team, but most data developers do not have a strong awareness of testing, because, unlike back-end teams, they are not gated by testing processes at every step. Generally speaking, the field is still at an early stage: stress efficiency, focus on output, neglect standardization.

This comes down to the positioning of the data team: we tend to position ourselves as business people, solving business problems from the same perspective as product and operations, instead of treating stability as the first priority the way an engineering team does. To thoroughly solve the data quality problem, we must first shift our own positioning and take the standardization of the R&D process as the starting point for building the whole quality system.

Many people will ask: shouldn't our assessment criterion be the value we deliver? From the business team's perspective, yes; from the engineering team's perspective, no. The reason is simple: for an engineering team, data quality and delivery efficiency come first, and business value should carry comparatively less weight.

If we focus only on output, we tend to reach for "temporary measures" to solve data problems: downgrading at every turn, letting things slide when we don't understand them, fixing problems only after they occur. Only by shifting the focus to control of the R&D process, i.e. whether platform performance meets demand, how to fix unreasonable scheduling design, how to make the data model more sound, and how to recover effectively from data accidents, can we truly "cure the root cause".

Of course, front-line engineers cannot make this decision on their own. But either way, we should be aware that as society moves toward digitalization, data practitioners will also specialize: some will focus more on business issues, while others will think more about quality issues.

For now, though, we need to collaborate with the testing colleagues to solve the immediate problem first: standardizing the data testing process. We will cover this in the next issue.
