Interview question: How do you verify the accuracy of indicator results?

Question

An interesting, open-ended question came up in the group yesterday: in daily work, after you have developed an indicator, how do you verify that its results are accurate?

Here I share the group members' thoughts and add a summary of my own. The author's ability and experience are limited, so please point out any mistakes; if you have better ideas, you are welcome to join the discussion.

Answers from the group

The answers given above probably reflect the daily practice of many data developers, and there is nothing wrong with them.

Summary

First, this problem needs to be distinguished from the problem of ensuring data consistency: this article discusses data accuracy, which falls under DQC (data quality checks).

Combining the discussion above with some additional research, the author draws a few conclusions for reference, organized into two parts:
1. Several approaches for verifying the accuracy of indicator results
2. Several measures for ensuring the accuracy of indicator results

Verification methods

Align the metric definition and unify the data source

Anyone who has done data development has probably had to rework an indicator because the metric definition (caliber) was understood differently or a different data source was used. When this happens, the result deviates from what the requester expects. The author therefore treats aligning the definition and unifying the data source as prerequisites for accurate indicator results; otherwise, no matter how many of the following verification methods the result passes, it cannot be delivered.

Direct check

This is the verification method mentioned in the answers above. For simple indicators, such as the number of new users on the statistics day, the result can be compared directly against the detail records. It is also the simplest and crudest method.
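As an illustration, here is a minimal sketch of a direct check in Python, assuming a hypothetical detail dataset with a first_visit_date field; the records, field names, and the reported value are made up for the example.

```python
# A minimal sketch of a direct check: recompute "new users on the day"
# from detail records and compare it with the reported indicator value.
# The detail records and field names below are hypothetical.
from datetime import date

user_detail = [
    {"user_id": 1, "first_visit_date": date(2022, 5, 30)},
    {"user_id": 2, "first_visit_date": date(2022, 5, 30)},
    {"user_id": 3, "first_visit_date": date(2022, 5, 29)},
]

reported_new_users = 2          # value produced by the indicator job
stat_date = date(2022, 5, 30)

# Recount directly from the detail data.
recounted = sum(1 for row in user_detail if row["first_visit_date"] == stat_date)

assert recounted == reported_new_users, (
    f"direct check failed: detail recount {recounted} != reported {reported_new_users}"
)
print("direct check passed:", recounted)
```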

Reference comparison

For cases that cannot be compared directly against detail records, you can compare the result with historical data or with similar data and judge its reasonableness by watching for large fluctuations. A large increase or decrease is caused either by a business anomaly, by some business operation, or by a statistics error. Of course, reference comparison can only bring the result within the requester's tolerance; it cannot fully guarantee accuracy.
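A minimal sketch of a reference comparison follows, assuming daily values and a 30% fluctuation tolerance; both the history and the threshold are invented for illustration and would in practice come from the requester's actual tolerance.

```python
# A minimal sketch of a reference comparison: flag the latest value when it
# deviates from the recent historical average by more than an agreed tolerance.
def fluctuation_check(history, latest, tolerance=0.30):
    """Return (is_reasonable, relative_change) against the historical mean."""
    baseline = sum(history) / len(history)
    change = (latest - baseline) / baseline
    return abs(change) <= tolerance, change

history = [10_200, 9_800, 10_500, 10_100, 9_900]   # e.g. daily order counts
latest = 13_600

ok, change = fluctuation_check(history, latest)
if not ok:
    print(f"large fluctuation ({change:+.1%}): check for business events or statistics errors")
```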

Cross-check verification

First, the definition of a cross-check (reconciliation) relationship: it refers to related figures in account books and financial statements that can be checked against one another. A simple example: if the company pays A a salary of 1,000, then A has an income of 1,000, and the two figures can be verified against each other; that is a cross-check relationship. In general, this method can be used whenever there is a logical or computational relationship between indicators, for example, indicator B = indicator A + indicator C.
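Here is a minimal sketch of such a cross-check, using the B = A + C relationship from the example above; the indicator values and the rounding tolerance are assumptions for illustration.

```python
# A minimal sketch of a cross-check: when B = A + C holds by definition,
# compute both sides independently and reconcile them.
indicator_a = 1_000   # e.g. salary paid by the company
indicator_c = 250     # e.g. bonus paid by the company
indicator_b = 1_250   # e.g. total income recorded on the employee side

diff = abs(indicator_b - (indicator_a + indicator_c))
if diff > 1:   # allow a small tolerance for rounding
    raise ValueError(f"cross-check failed: B differs from A + C by {diff}")
print("cross-check passed")
```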

Walk-through test

A walk-through test originally refers to tracing a transaction through a financial reporting information system. Here it means following one actual record through the entire business logic and checking that every link behaves as the business logic expects. This method is usually carried out by testers, and it is also the best way to verify the development logic.
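A minimal sketch of a walk-through test: one sample record is pushed through each step and the intermediate result is asserted at every link. The clean/enrich/aggregate functions below are hypothetical stand-ins for the real processing logic.

```python
# Push one concrete record through each processing step and assert the
# intermediate result at every link.
def clean(record):
    return {**record, "amount": round(record["amount"], 2)}

def enrich(record):
    return {**record, "channel": "app" if record["source"] == 1 else "web"}

def aggregate(records):
    return sum(r["amount"] for r in records)

sample = {"order_id": "o-001", "amount": 99.999, "source": 1}

cleaned = clean(sample)
assert cleaned["amount"] == 100.0          # cleaning rule applied as expected
enriched = enrich(cleaned)
assert enriched["channel"] == "app"        # mapping matches the business rule
total = aggregate([enriched])
assert total == 100.0                      # final indicator for this single record
print("walk-through test passed for", sample["order_id"])
```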

Reasonableness judgment

The reasonableness-judgment method requires the developer to have a deep understanding of the business, and it is somewhat similar to the reference comparison introduced above. Reasonableness here must be evaluated against the company's current business situation, for example, what range of values counts as reasonable.

The above briefly introduces the main verification methods. Beyond these, the accuracy of a result is usually also checked through metrics such as the null rate, the presence of invalid values, the format and type, and the value distribution (a small sketch of such checks follows below). The author also found a classification of methods for evaluating the accuracy of statistical data, attached here for reference.

Note: this classification comes from Wang Hua, Jin Yongjin. Evaluation of Statistical Data Accuracy: Method Classification and Applicability Analysis [J]. Statistical Research, 2009, 26(1): 32-39.
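For the result-level checks mentioned above (null rate, invalid values, format and type, value distribution), here is a minimal Python sketch; the column names, sample rows, and thresholds are all assumptions for illustration.

```python
# Result-level DQC checks on an indicator table: null rate, invalid values,
# format/type, and a rough value-range check.
import re

rows = [
    {"dt": "2022-05-30", "province": "zhejiang", "gmv": 1200.5},
    {"dt": "2022-05-30", "province": "jiangsu",  "gmv": 980.0},
    {"dt": "2022-05-30", "province": "anhui",    "gmv": None},   # deliberate null to show a failing check
]

checks = {}

# Null rate on the indicator column (threshold assumed to be 5%).
null_rate = sum(r["gmv"] is None for r in rows) / len(rows)
checks["null_rate_ok"] = null_rate <= 0.05

# Invalid values: GMV is expected to be non-negative.
checks["no_negative_values"] = all(r["gmv"] is None or r["gmv"] >= 0 for r in rows)

# Format/type: the partition date must look like YYYY-MM-DD.
checks["date_format_ok"] = all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", r["dt"]) for r in rows)

# Rough distribution check: non-null values should fall in an expected range.
values = [r["gmv"] for r in rows if r["gmv"] is not None]
checks["value_range_ok"] = all(0 <= v <= 1_000_000 for v in values)

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```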

If you have other good verification methods, you are welcome to share them in the discussion.

Assurance measures

To ensure data accuracy, we need to know at which links the data can go wrong, and to know that, we need to understand how the data flows. As most people know, the full life cycle of data includes production, storage, cleaning, processing, and serving.

To ensure the accuracy of the final result, the accuracy of every link must be ensured. Here is a simple arithmetic problem: suppose a job requires 100 process steps to complete; if the pass rate of each step is 99%, the end-to-end pass rate of the product after 100 steps is only about 36.6%. You can imagine how many factors are involved in ensuring data accuracy and how easily errors accumulate, yet we still need various measures to improve accuracy.
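The arithmetic can be checked quickly, and it also shows roughly what each link would need to achieve for a 99% end-to-end pass rate:

```python
# With 100 links at 99% accuracy each, the end-to-end pass rate is
# 0.99 ** 100, i.e. about 36.6%. To keep the final deliverable near 99%,
# each individual link would need roughly 99.99% accuracy.
links = 100
per_link = 0.99
print(f"end-to-end pass rate: {per_link ** links:.1%}")                  # ~36.6%
print(f"per-link rate needed for 99% overall: {0.99 ** (1 / links):.4%}")
```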


The author takes the data warehouse as an example and abstracts the data flow into three parts: ingestion into the warehouse, processing inside the warehouse, and consumption out of the warehouse.
1. For ingestion, pay attention to issues such as data loss, duplication, and backup delays. Audits must be configured for core data to ensure that what lands in the warehouse is consistent with the source (see the row-count reconciliation sketch after this list).

2. For cleaning and processing inside the warehouse, confirm the cleaning rules and follow the specifications: unify types where they should be unified and fill in values where they should be filled. At the same time, pay attention to configurations such as scheduling cycles, dependencies, and audits.

3. The last mile is to use the data correctly. Confirm the intended usage and applicable scenarios with the data owner, and avoid errors caused by the wrong reference table, the wrong filter conditions, or the wrong joins. Alignment of definitions is also required here, which touches on building a OneData-style system; there is a lot of content in that topic, and I will share it when I get the chance.
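As a concrete example of the audit mentioned in step 1, here is a minimal sketch of a row-count reconciliation between a source table and the warehouse partition it lands in; the count functions are hypothetical stand-ins for queries against the real systems.

```python
# Reconcile the row count of a source table with the row count landed in the
# warehouse for the same partition; any drift beyond the tolerance is flagged.
def count_source_rows(table: str, dt: str) -> int:
    # stand-in for: SELECT COUNT(*) FROM <source>.<table> WHERE dt = <dt>
    return 1_204_331

def count_warehouse_rows(table: str, dt: str) -> int:
    # stand-in for: SELECT COUNT(*) FROM <warehouse>.<table> WHERE dt = <dt>
    return 1_204_331

def ingestion_audit(table: str, dt: str, tolerance: float = 0.0) -> bool:
    src = count_source_rows(table, dt)
    dwh = count_warehouse_rows(table, dt)
    drift = abs(src - dwh) / max(src, 1)
    ok = drift <= tolerance
    print(f"{table} {dt}: source={src}, warehouse={dwh}, drift={drift:.4%}, ok={ok}")
    return ok

ingestion_audit("order_detail", "2022-05-30")
```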

Only when upstream data sources are unified, midstream processing is configured correctly, and downstream usage is proper can we finally obtain reliable and useful data value.

Of course, as the earlier arithmetic shows, every link must achieve an accuracy rate well above 99% (roughly 99.99% per link over 100 steps) for the final deliverable to reach 99%. Data errors are therefore not terrible in themselves: as long as problems are detected in time, communication stays smooth, and the scope of impact and the resulting losses are reduced, data reliability will become higher and higher through this repeated cycle.


Source: blog.csdn.net/qq_28680977/article/details/125035206