Talking about data testing

|0x00 "Big Data Test"

Some people may wonder: "Big data is so big, how are you supposed to test it?" In the past, big data development involved very little testing. The fundamental reason is that data serves the business and the data volumes are huge, so it is often impossible to build a complete test environment the way an engineering team would; the data can only really be verified once it is in production.

Therefore, beyond ensuring that data is produced on time, that is, standardizing task priorities, the most common form of data testing is to configure monitoring tasks on data tables. The usual rules are listed below (a minimal sketch of how they might be implemented follows the list):

  • Data is not empty: the partition that is supposed to be produced must actually contain data;
  • Fields are not empty: if a field is entirely null within a given partition, something is probably wrong;
  • Unique primary key: dimensional modeling is the norm, so primary-key uniqueness matters a great deal, and duplicate keys surface bad data immediately;
  • Fluctuation monitoring: if the row count of a partition, or the value of some statistic, swings sharply, the data is very likely problematic;
  • Basic logic monitoring: for example, amounts must not be negative, the sum of child orders must equal the parent order, and so on; this kind of logic is strongly tied to the business;
  • Custom rules: checks configured by individual engineers for specific characteristics, such as requiring a value to fall within a certain range.
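A minimal sketch of how these rules might be wired up, assuming a Hive-style partitioned table and a hypothetical run_query() helper that executes SQL and returns the first result row as a tuple; the table, column, and partition names are placeholders:

```python
# Hedged sketch: run_query is a hypothetical helper, not a real library call.

def check_partition_not_empty(run_query, table, dt):
    """Rule 1: the partition that should be produced must contain rows."""
    (cnt,) = run_query(f"SELECT COUNT(*) FROM {table} WHERE dt = '{dt}'")
    assert cnt > 0, f"{table} partition dt={dt} is empty"

def check_field_not_all_null(run_query, table, dt, column):
    """Rule 2: a column must not be entirely NULL within the partition."""
    (non_null,) = run_query(f"SELECT COUNT({column}) FROM {table} WHERE dt = '{dt}'")
    assert non_null > 0, f"{table}.{column} is entirely NULL for dt={dt}"

def check_primary_key_unique(run_query, table, dt, key):
    """Rule 3: the primary key must be unique within the partition."""
    (dups,) = run_query(
        f"SELECT COUNT(*) - COUNT(DISTINCT {key}) FROM {table} WHERE dt = '{dt}'"
    )
    assert dups == 0, f"{table} has {dups} duplicate {key} values for dt={dt}"

def check_fluctuation(run_query, table, dt, prev_dt, threshold=0.3):
    """Rule 4: the partition row count must not swing too far day over day."""
    (today,) = run_query(f"SELECT COUNT(*) FROM {table} WHERE dt = '{dt}'")
    (prev,) = run_query(f"SELECT COUNT(*) FROM {table} WHERE dt = '{prev_dt}'")
    if prev > 0:
        change = abs(today - prev) / prev
        assert change <= threshold, f"{table} row count moved {change:.0%} in one day"
```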

With this basic monitoring in place, we can make a preliminary judgment about whether a given data script or data table is running correctly in a big data environment.

|0x01 Data accuracy

However, the measures above are not enough on their own, because they look at the data from a global perspective, while each user only sees the slice that belongs to them, which is a very small subset. Problems at that level are hard to catch with big data testing.

For example, an important advertiser once found that yesterday's data was missing and complained. We went through the data checks and they all looked correct. Then we traced upstream and found a machine whose data reporting had been delayed. The gap was small, so fluctuation monitoring did not catch it, but the lost data happened to belong to that important advertiser, so the problem was suddenly magnified.

Because such "happens to be" coincidences can have serious consequences, we need to be more careful when testing. It is therefore necessary to design a more sophisticated test approach: it cannot prevent 100% of incidents, but it should cover at least 95% of the scenarios.

Rethinking data accuracy in light of cases like this, we find that it actually has three aspects: the meaning of the data must be accurate, the values of the data must be accurate, and the results of the data must be accurate.

  • The meaning of the data must be accurate: before starting development, you must at least have a clear understanding of what each indicator means, and pay particular attention during requirements review. For example, suppose we count CTR and split it into ctr1, ctr2, ctr3, ctr4, ctr5 for different scenarios. We read the requirements document, built everything, and called it done, but then someone comes over and asks what the difference between these CTRs is, or why one of them is 0, and we are stuck digging back through the code to reconstruct the logic. That is a problem. So when testing data, invite users in different roles to challenge the ambiguous or confusing parts and check whether the indicators can be explained correctly, or at least that the logic is self-consistent.
  • The values of the data must be accurate: treat the whole pipeline from data collection to computation as one stage, call its correctness "value accuracy", and cover it with test procedures. Here you can use the monitoring tasks configured above, static checks such as code review (CR), or checks on common data logic such as maximum, minimum, average, nulls, mutual exclusion, and so on (a sketch follows this list).
  • The results of the data must be accurate: this sounds vaguer, but in practice it means sampling the data to at least make sure that the data of important users is correct. Although the first two steps cover most situations, there are always surprises that only show up when you look at concrete cases, so this step can be regarded as a "gray release" stage.
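A minimal sketch of such value checks, assuming one day's partition has been loaded into a pandas DataFrame; the column names and the baseline constant are placeholders, not taken from a real system:

```python
import pandas as pd

EXPECTED_AVG_AMOUNT = 100.0  # placeholder baseline for the drift check below

def check_value_accuracy(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems found in the partition."""
    problems = []
    # Range check: CTR is a ratio, so it must stay within [0, 1].
    if not df["ctr"].between(0, 1).all():
        problems.append("ctr outside [0, 1]")
    # Basic business logic: amounts must not be negative.
    if (df["amount"] < 0).any():
        problems.append("negative amount")
    # Null check: key dimensions must not be missing.
    if df["advertiser_id"].isna().any():
        problems.append("null advertiser_id")
    # Mutual exclusion: a record cannot be both a refund and a charge.
    if ((df["is_refund"] == 1) & (df["is_charge"] == 1)).any():
        problems.append("refund and charge flags both set")
    # Aggregate sanity: the average amount should stay near a known baseline.
    if abs(df["amount"].mean() - EXPECTED_AVG_AMOUNT) > 0.3 * EXPECTED_AVG_AMOUNT:
        problems.append("average amount drifted more than 30% from baseline")
    return problems
```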

|0x02 Test automation

Now that automated testing is becoming increasingly mature, testing data by hand is very expensive. On the one hand, the number of test engineers is limited, and few of them are allocated to data; on the other hand, business logic today is usually very complicated, and doing data testing requires understanding it, otherwise code that reads as a clear 1, 2, 3 to its author will leave everyone else baffled.

Therefore, it is worth refining some common practices, such as "test cases", to improve the efficiency of testing.

Of course, before automating data testing, we first need a reliable way to sort out the data.

A simple way to pin down an indicator's caliber is to describe objectively what the indicator measures, for example:

  • Time: creation time, modification time, or statistical time;
  • Location: the log system, the database, latitude and longitude, or a position within the product;
  • People: a user, an application, or a system state;
  • Event: a click, an impression, billing, or some kind of operational process.

From a testing perspective, the description does not have to be perfect, but you do need to understand what the indicator actually counts; a minimal sketch of such a description is shown below.
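For illustration only, here is one way such a caliber description could be recorded; the field values are invented examples, not definitions from the article:

```python
from dataclasses import dataclass

@dataclass
class IndicatorCaliber:
    """A hypothetical record describing what an indicator measures."""
    name: str      # indicator identifier, e.g. "ctr1"
    time: str      # which timestamp the statistic is keyed on
    location: str  # where the raw data comes from
    who: str       # the subject being counted
    event: str     # the action being counted

ctr1 = IndicatorCaliber(
    name="ctr1",
    time="statistical time, partitioned by day",
    location="client click and impression logs",
    who="logged-in users on the ad landing page",
    event="clicks divided by impressions",
)
```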

Secondly, we need to consider the accuracy of the global data. A lot of business logic monitoring is used here, building on the rules listed in the first part of this article. Collect as much business logic as possible and organize it systematically, so that when other engineers need the same functional tests they can reuse the test cases.
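One way to make such checks reusable, sketched here under the assumption that each table has already been loaded as a pandas DataFrame and passed in as a dict; the registry and the rule names are illustrative, not an existing framework:

```python
RULES = {}

def rule(name):
    """Register a business-logic check under a reusable name."""
    def decorator(fn):
        RULES[name] = fn
        return fn
    return decorator

@rule("amount_non_negative")
def amount_non_negative(tables):
    # tables is a dict mapping table name -> pandas DataFrame.
    return not (tables["orders"]["amount"] < 0).any()

@rule("child_orders_sum_to_parent")
def child_orders_sum_to_parent(tables):
    # The summed amount of child order items must equal the parent order amount.
    child_sum = tables["order_items"].groupby("order_id")["amount"].sum()
    parents = tables["orders"].set_index("order_id")["amount"]
    merged = parents.to_frame("parent_amount").join(child_sum.rename("child_sum"))
    return (merged["parent_amount"] == merged["child_sum"]).all()

def run_rules(tables, names):
    """Run a chosen subset of registered rules; True means the rule passed."""
    return {name: RULES[name](tables) for name in names}
```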

Next, we need to consider the accuracy of sampled data. The approach is a bit like A/B testing: group the data, randomly sample statistical results, and then compare them one by one against the business system to check whether they are consistent.

For example, for detailed comparison, when the indicator counts the number of users under a certain label, you can pull the list of users involved from the business system and compare it with the statistics table, and for each filtered user check whether their attributes match the filter conditions. Automating these comparisons is usually not very difficult.
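A minimal sketch of that comparison; the three fetch functions are hypothetical stand-ins for access to the business system, the statistics table, and a user-profile lookup:

```python
import random

def compare_label_count(label, fetch_business_users, fetch_stats_users,
                        fetch_user_labels, sample_size=100):
    business = set(fetch_business_users(label))  # source of truth
    stats = set(fetch_stats_users(label))        # users counted by the warehouse

    report = {
        "business_count": len(business),
        "stats_count": len(stats),
        "missing_from_stats": sorted(business - stats)[:20],
        "unexpected_in_stats": sorted(stats - business)[:20],
    }

    # Spot-check a random sample of counted users against the filter condition.
    sampled = random.sample(sorted(stats), min(sample_size, len(stats)))
    report["sampled_mismatches"] = [
        uid for uid in sampled if label not in fetch_user_labels(uid)
    ]
    return report
```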

Finally, there is product-level testing. This is not only a test of the data results but also of page functionality: for example, whether a click interaction produces results, since a lot of empty data will cause interactions to fail; or whether a field's description actually corresponds to the indicator being shown, and so on. At the product level, the tester can behave like a first-time user and walk through the complete product workflow as a kind of full health check, ensuring the accuracy and reliability of the data results at the final delivery level.
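Part of that walkthrough can be automated as a smoke check. The sketch below assumes, purely for illustration, that the report page is backed by an HTTP endpoint; the URL, parameters, advertiser IDs, and field names are placeholders rather than a real API:

```python
import requests

KEY_ADVERTISERS = ["adv_1001", "adv_1002"]        # placeholder IDs
REPORT_URL = "https://example.com/api/ad_report"  # placeholder endpoint

def smoke_check_report_page(dt):
    """Return a list of failures found for the key advertisers on date dt."""
    failures = []
    for adv in KEY_ADVERTISERS:
        resp = requests.get(REPORT_URL,
                            params={"advertiser_id": adv, "dt": dt},
                            timeout=10)
        if resp.status_code != 200:
            failures.append(f"{adv}: HTTP {resp.status_code}")
            continue
        rows = resp.json().get("rows", [])
        # Empty data breaks page interactions, so it counts as a failure here.
        if not rows:
            failures.append(f"{adv}: empty report for dt={dt}")
        # The displayed field should match the indicator it claims to show.
        elif "ctr" in rows[0] and not 0 <= rows[0]["ctr"] <= 1:
            failures.append(f"{adv}: ctr out of range on the page")
    return failures
```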

Once we have improved our test cases and automation and accumulated more monitoring items, we can consider turning the test platform into a product. Although such a platform serves a fairly narrow audience, it helps a great deal in ultimately guaranteeing data quality. From a testing point of view, the testing process is universal regardless of the type of test, and the methods carry over; when we have built up enough testing foundations and methods, we can comfortably handle many different kinds of tests.

|0xFF Embodiment of value

Data testing is work that requires patience and care, a bit like finance; a meticulous, Virgo-like personality is probably the best fit.

At the same time, testing is a role that depends heavily on collaboration. Many business details are buried in the code, and without reading it, or without having sat in on the original requirements discussions, it is hard to understand why things are the way they are. So when data testers raise a lot of questions, the data engineers are likely to feel worn down, because they have to explain the business over and over again, which breeds a lot of internal resistance.

Therefore, guaranteeing data quality takes more than data testing, more than data specifications, and even more than architectural capability; it takes a kind of shared advocacy.

Looking at the bigger picture, software development, delivery, efficiency, and quality have all accumulated fairly mature practices, so their problems are manageable, but the data field has not yet built up that kind of accumulated experience. There is an old joke: a corporate spy sneaks into a company to steal data, but after stealing it a few times he finds that the numbers do not add up and is forced to do data governance first. In that kind of future, perhaps data warehouse and data development roles will exist to serve data quality rather than business value.

Origin blog.csdn.net/gaixiaoyang123/article/details/112257957