[Edit, Test, and Learn] How to do a big data test

"Big data" is a term we hear more and more often, so how do we test big data?
1. Start from the data quality model and analyze the patterns the data should exhibit.
What is the data quality model?
The traditional quality model is ISO9126, a typical software quality model. Whether for development or testing, and whether for client-side quality or server-side quality, the general direction rarely falls outside the 9126 model. Abroad, ISO8000 is also a popular standard for data quality, but its shortcomings are obvious: 1. it is not freely available; 2. the standard is too heavy; 3. there is pitifully little material on it domestically.
Can we then try to transplant the 9126 standard to big data quality?
First, an introduction to 9126:
[figure: the ISO9126 software quality model]
Using the quality model above as a reference, let's derive a data quality model.
[figure: the data quality model derived from ISO9126]
From the analysis above, it is not hard to derive the corresponding data test cases.
[figure: data quality test cases derived from the model]
Timeliness: relatively speaking, the test method is simple. Whether the data is produced within the specified time can be judged by checking the row count of the whole table or checking whether the expected partition exists.
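For example, a minimal sketch of such a timeliness check, assuming a hypothetical Hive-style table dws_sales partitioned by ds (names are illustrative, not from the original article):

```sql
-- Run at the agreed deadline: if the expected partition is missing or empty, the data is late.
SHOW PARTITIONS dws_sales;          -- does the ds='20200324' partition exist at all?

SELECT COUNT(*) AS row_cnt
FROM dws_sales
WHERE ds = '20200324';              -- row_cnt = 0 at the deadline means the data was not produced on time
```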
Completeness: the two key points of completeness assessment are that the data should not contain extra records, and it should not be missing records.
Not too much: generally check whether the whole table or important enumeration values contain duplicate data, or whether the primary key is unique.
Not too little: generally check the whole table or important business-related fields (such as date, enumeration values, brand, category, etc.). If the expected size of the data is known, for example you know there should be x brands in the table, you can check that x every time. If the size itself fluctuates a lot and cannot be known in advance (for example, brands in the table may be opened or closed), the common method is to compare against historical data and see whether the overall fluctuation is normal.
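As an illustration only, the two completeness checks could be sketched roughly as follows (dws_sales, ds, and order_id are assumed names):

```sql
-- "Not too much": the primary key should be unique; any row returned is a duplicate.
SELECT order_id, COUNT(*) AS cnt
FROM dws_sales
WHERE ds = '20200324'
GROUP BY order_id
HAVING COUNT(*) > 1;

-- "Not too little": compare today's row count with yesterday's and watch the fluctuation.
SELECT t.row_cnt,
       y.row_cnt                                    AS prev_cnt,
       (t.row_cnt - y.row_cnt) * 1.0 / y.row_cnt    AS fluctuation    -- alarm if beyond an agreed threshold
FROM (SELECT COUNT(*) AS row_cnt FROM dws_sales WHERE ds = '20200324') t
CROSS JOIN (SELECT COUNT(*) AS row_cnt FROM dws_sales WHERE ds = '20200323') y;
```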
Accuracy: in comparison, accuracy is not an easy characteristic to test. Why? Because it is hard to find an authoritative reference that says how accurate the data should be; most of that reference exists only in people's cognition. For exactly this reason, accuracy testing is also the part of the data quality model where thinking diverges the most. So I tried to summarize accuracy test methods along three dimensions: the data itself, time, and space. Although some directional ideas can be summarized, each of them still depends on personal divergent thinking and business understanding to be used well.
a. Self-inspection: the most basic type of check, which uses only the data itself to examine accuracy without comparing it with other data. Self-inspection can only increase the probability that the data is correct; limited by the amount of information available, the improvement it brings is limited. There are three typical methods (a SQL sketch follows these three methods).

The first method is checking whether a value is within its conventional range. For example, the proportion of male (or female) users should theoretically fall within [0, 1]; this is the most basic check on a value.
The second method is checking whether a value is within its business range, which is usually a judgment made after understanding the business attributes behind the value. For example, when testing the number of people who searched for a certain product, if search is the only trigger channel, then theoretically the number of people who searched for the product >= the number of people who bought it. This is a judgment made after understanding the business behind the value.
The third method is looking at the distribution of a value. If you have a clear understanding of the business characteristics of a value, you can also test its accuracy by observing its distribution. For example, when testing the value "proportion of members among buyers", you can observe its distribution in the data, knowing that the ratio should be around 0.3. If the test shows that roughly 80% of the values fall between 0.2 and 0.4, that matches the cognition; if it shows 80% falling between 0.8 and 0.9, it does not.
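A rough sketch of the three self-checks, using assumed tables dws_user_stats and dws_item_stats with assumed columns:

```sql
-- (1) Conventional range: a proportion column should fall within [0, 1].
SELECT COUNT(*) AS out_of_range_cnt
FROM dws_user_stats
WHERE ds = '20200324'
  AND (female_ratio < 0 OR female_ratio > 1);

-- (2) Business range: if search is the only trigger channel,
--     searchers of an item should be >= buyers of that item.
SELECT COUNT(*) AS violation_cnt
FROM dws_item_stats
WHERE ds = '20200324'
  AND search_user_cnt < buy_user_cnt;

-- (3) Distribution: bucket the member ratio and compare the shape with the expected "around 0.3".
SELECT FLOOR(member_ratio * 10) / 10 AS bucket,
       COUNT(*)                      AS cnt
FROM dws_item_stats
WHERE ds = '20200324'
GROUP BY FLOOR(member_ratio * 10) / 10
ORDER BY bucket;
```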
b. Comparison in the time dimension: comparing data along the time dimension can further increase the probability that the data is accurate. There are two methods: one is to observe the same data at different time points within the same data batch, and the other is to observe the same data across different data batches.
Same batch: for example, when a batch of data is tested offline, that is one data batch. Within this batch, you can compare the fluctuation of the data across different dates such as date=20200324, date=20200323, and date=20200322.

Different batches: this is relatively hard to arrange, because several versions of the same data are rarely kept. There is, however, one typical case: the diff between online and offline data. If the offline version is considered version N, the online version can be considered version N-1. Through an online/offline diff, the data that is supposed to be unchanged can be checked for differences. Both comparisons are sketched below.
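A possible sketch of the two time-dimension comparisons, reusing the assumed dws_sales table (metric gmv, key order_id) plus an assumed online copy dws_sales_online holding the currently serving version:

```sql
-- Same batch: compare the same metric across date partitions and eyeball the fluctuation.
SELECT ds, SUM(gmv) AS total_gmv
FROM dws_sales
WHERE ds IN ('20200322', '20200323', '20200324')
GROUP BY ds
ORDER BY ds;

-- Different batches: diff the rows that are supposed to be unchanged between versions N and N-1.
SELECT COUNT(*) AS diff_cnt
FROM dws_sales o
FULL OUTER JOIN dws_sales_online l
  ON o.ds = l.ds AND o.order_id = l.order_id
WHERE o.order_id IS NULL      -- row only exists online
   OR l.order_id IS NULL      -- row only exists offline
   OR o.gmv <> l.gmv;         -- same row, different value
```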
c. Comparison in the spatial dimension: comparison in the spatial dimension means fixing the time dimension and comparing the current data with other data to further support its correctness. There are three basic ideas:
One is the comparison between upstream and downstream, especially checking whether important fields lose information during upstream-to-downstream processing.

Another is comparison with data other than the direct upstream and downstream, such as comparing with sibling tables under the same data source, or with tables under a different data source. An example under the same data source: Table A holds the sales of a primary category and Table B holds the sales of the secondary categories under that primary category, so the values of Table B summed up should equal the value in Table A (see the sketch after this list). An example across data sources: for computing performance reasons, part of the data may be synchronized from a row-oriented database to a column-oriented database for computation, and the values in the row store should equal the values in the column store.
The last one is comparison with data outside the system, such as BI systems or other business back-end systems. This method is quite restricted, because from a security perspective it is unlikely that a regular BI system or another business back-end system will open up its data, so it is only a possible idea.
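For the sibling-table example above, a hedged sketch with assumed table and column names could be:

```sql
-- Any row returned means the secondary categories do not sum up to their primary category.
SELECT a.primary_cat_id,
       a.sales_amt  AS primary_sales,
       b.sum_sales  AS secondary_sum
FROM dws_primary_cat_sales a
JOIN (SELECT primary_cat_id, SUM(sales_amt) AS sum_sales
      FROM dws_secondary_cat_sales
      WHERE ds = '20200324'
      GROUP BY primary_cat_id) b
  ON a.primary_cat_id = b.primary_cat_id
WHERE a.ds = '20200324'
  AND a.sales_amt <> b.sum_sales;
```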
3) Timing of test execution
Regarding the timing of test execution, it is the same as for traditional testing. There are several test occasions: during the developer's self-test, during the formal test after handover to testing, when online data is modified, and when online data is added.
Whether in the self-test or the formal test, the focus is offline, and offline data testing has certain limitations. If constructed input data is not used, development generally only provides part of the data, such as one day's data; precisely because of the one-sidedness of that data, even if it looks fine, it does not mean the data processing rules are problem-free. If constructed input data is used, although more output anomalies caused by abnormal input can be found offline, root causes such as the stability of the online production environment itself mean that problems can still appear after going online.
It is precisely because of the limitations of offline testing that continuous testing is still required when online data is modified or when new online data is added. The online test cases may reuse the offline cases as they are, or after simple modification.
One advantage of discussing test timing separately is that if a series of test cases is combined into tasks, then whether offline or online, as long as there is a suitable trigger condition, these test cases can be run with a suitable trigger method. Trigger methods include conditional triggering and timed triggering. Generally speaking, conditional triggering is used offline: when development finishes and needs a self-test, or when the tester needs to run the formal test, execution is triggered through an API or an interface.
For online data modification, this is not a routine operation; it only happens when there is a problem or a bug is being fixed, so it is usually triggered by a condition. For online data addition, new data is usually produced on a daily schedule. This can use either conditional triggering (the test is triggered after the new data is generated) or timed triggering (periodically polling whether new data has been generated and then testing it). The benefit of conditional triggering is similar to the idea of continuous testing in continuous integration: as long as a data change is involved, the test is triggered, but timeliness is not monitored. The advantage of timed triggering is that it keeps an eye on timeliness, but for data without strict timeliness requirements (for example, output sometimes at 6 o'clock and sometimes at 9 o'clock), timed triggering will produce many false alarms.
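A tiny sketch of the check such a timed trigger could poll, reusing the assumed dws_sales table:

```sql
-- Poll the latest landed partition on a schedule.
SELECT MAX(ds) AS latest_partition
FROM dws_sales;
-- Conditional trigger: the scheduler fires the test suite as soon as latest_partition advances.
-- Timed trigger: poll this query periodically; alarm only if timeliness actually matters.
```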
On different test occasions, although most of the test cases used are reusable, there are some differences.
In the self-test, it is mainly the development team that runs the tests. The test cases focus more on the basic quality of the data, such as the completeness checks and the accuracy self-inspections; this part of the cases does not require much divergent thinking.
In the formal test, it is mainly the test team that runs the tests. Besides the basic data quality checks, the test cases pay more attention to the "snapshot", that is, the comparisons in the spatial dimension and the different-batch comparisons in the time dimension under accuracy, trying as far as possible to expose accuracy problems through auxiliary evidence. For same-batch comparisons in the time dimension, development often does not provide data for many time points, so in general this kind of corroboration is harder.
When online data is modified, the cases used in the formal test can basically be reused.
When online data is added, besides the basic data quality checks, most of the cases from the formal test can be reused, but cases with exploratory ideas or cases that take too long to run will be weakened. For example, the value distribution case is not suitable for running on every daily addition, because the daily distribution may differ and is not in a steady state, so keeping this case would raise the false alarm rate. Another example is cases that scan too much data, especially upstream/downstream comparison tests, where the downstream data volume is often very large: running them every day consumes too much time and resources, and it is also unnecessary, because the root cause of such problems is usually the data processing logic, which can be caught in the next formal test. The online tests do add a comparison of fluctuations across different times within the same data batch in the time dimension.
Therefore, the impact of test timing on testing can be summarized as a table.

[table: summary of how test timing affects the test cases]
4) CR

Although the test methods section introduced ways of finding problems through the output data, the most direct and effective way to find problems is CR (code review), especially for accuracy problems in SQL-like data processing. The following are some rules frequently used in SQL CR (a sketch applying them follows the list).
Are the order and types of the fields in the projection (SELECT list) consistent with the table declaration?
Is the business meaning of each field in the table the meaning required by the business? Does the business require data deduplication?
Are abnormal values handled, such as a divisor of 0, Null, or empty values?
Are there clear requirements for data precision?
Does the choice between an outer join and an inner join for the associated tables match whether the data should be filtered?
Are the left and right value types the same in the ON clause of the join?
★ When a filter condition involves string equality, pay attention to capitalization and correctness.
Has the filtering of outliers such as Null, 0, and empty values been considered?
Is the data source restricted as required?
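As a hedged illustration, a query reviewed against these rules might look roughly like this (all table and column names, such as dwd_orders and dim_brand, are assumptions):

```sql
SELECT o.brand_id,
       b.brand_name,
       SUM(o.gmv)                                AS gmv,
       SUM(o.gmv) / NULLIF(SUM(o.order_cnt), 0)  AS avg_order_value   -- guard against a 0 divisor
FROM dwd_orders o
LEFT OUTER JOIN dim_brand b              -- outer join: orders with an unknown brand must not be silently dropped
  ON o.brand_id = b.brand_id             -- left and right sides have the same type, no implicit cast
WHERE o.ds = '20200324'
  AND o.gmv IS NOT NULL AND o.gmv > 0    -- outlier filtering: Null / 0
  AND LOWER(o.channel) = 'search'        -- string equality: normalize case explicitly
GROUP BY o.brand_id, b.brand_name;
```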
The disaster tolerance evaluation method in data testing is mainly about how to stop losses quickly when the data is not produced or has large-scale problems. A typical approach is to quickly switch to available data, such as switching to the previous day's data or the previous version of the data. This generally requires cooperation from the server side to complete.
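One possible sketch of such a switch, assuming downstream consumers read through a view (serving_sales is an assumed name):

```sql
-- Quick stop-loss: point the serving view back at the previous day's partition until the problem is fixed.
CREATE OR REPLACE VIEW serving_sales AS
SELECT *
FROM dws_sales
WHERE ds = '20200323';   -- roll back from the problematic 20200324 partition to the previous day
```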
1) Efficiency evaluation method: efficiency evaluation mainly asks whether the computing resources for the current data meet the time requirements of the current product. Three situations need to be distinguished: first, whether the computation requests triggered directly by users are too heavy; second, whether there is too much user data, which makes the computation too large; third, whether the program itself is inefficient and its poor performance causes excessive resource consumption. In the first case, load testing is usually done by constructing request traffic. For the latter two, the common approach is to find out which table (or step) runs the longest and affects efficiency the most, and then optimize step by step, including SQL query optimization.
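A hedged sketch of one such SQL optimization, reusing the assumed names from above (whether it helps depends on the engine and its optimizer):

```sql
-- Pre-aggregate the large fact table before the join so far fewer rows are shuffled,
-- instead of joining the full tables and aggregating afterwards.
SELECT b.brand_name,
       t.gmv
FROM (SELECT brand_id, SUM(gmv) AS gmv
      FROM dwd_orders
      WHERE ds = '20200324'            -- partition pruning keeps the scan small
      GROUP BY brand_id) t             -- aggregation happens before the join
JOIN dim_brand b
  ON t.brand_id = b.brand_id;
```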
2) Reliability & maintainability evaluation method: evaluating reliability and maintainability requires more involvement from development and relatively less from testing. A few typical ideas are:
In terms of reliability, check the robustness of the tasks.
In terms of maintainability, systematic development work should be consolidated or platformized as much as possible. For example, changing the data access mode from a chimney (point-to-point) pattern to a star-shaped data mart pattern, so that the team is only responsible for the ETL that lands the data and development work is minimized, is one idea for consolidation. The idea of platformization is to complete process-oriented development work through platform configuration, which on the one hand improves development efficiency and on the other hand reduces the cost of errors and improves development quality.

Origin blog.51cto.com/14972695/2555033