Big Data Testing

What Is Big Data?

Big data refers to data sets so massive that traditional computing technology cannot process them within an acceptable time frame.

Testing big data therefore requires different tools, techniques, and frameworks to process it.

The large volume, wide variety, and high velocity of big data span data generation, storage, retrieval, and analysis, so big data engineers need a very high level of technical skill.

You need to master big data technologies such as Hadoop and MapReduce.

Big Data Testing Strategy

Testing big data applications is more about validating their data processing than about validating individual functional features.

Of course, functional testing and performance testing remain just as critical when testing big data.

For big data test engineers, efficiently and correctly verifying terabytes (or more) of data processed by big data tools and frameworks is a huge challenge.

Because big data must be processed, and therefore tested, at high speed, test engineers need advanced skills to cope with big data testing.

Let’s take a look at three characteristics of big data processing:

  • Batch

  • Real-time

  • Interactive

In addition, data quality is also an important dimension of big data testing.

Therefore, data quality must be verified before application testing begins, and it should be treated as part of database testing. This involves inspecting various characteristics of the data, such as consistency, accuracy, duplication, coherence, validity, and completeness.
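As an illustration, here is a minimal sketch of a record-level quality gate. The field layout (id, amount, date) and the rules are hypothetical stand-ins for the completeness and validity checks described above:

```java
// A minimal data-quality gate for one record type; fields and rules are
// hypothetical examples of completeness/validity checks.
import java.util.Arrays;
import java.util.List;

public class QualityGate {
    /** Returns true when a CSV line of the form "id,amount,yyyy-mm-dd" passes. */
    static boolean isValid(String line) {
        String[] f = line.split(",", -1);
        if (f.length != 3) return false;                  // completeness: all fields present
        if (f[0].isEmpty()) return false;                 // completeness: id must exist
        try { Double.parseDouble(f[1]); }                 // validity: amount is numeric
        catch (NumberFormatException e) { return false; }
        return f[2].matches("\\d{4}-\\d{2}-\\d{2}");      // validity: ISO-style date
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("1,9.99,2017-07-01", "2,,2017-07-01");
        sample.forEach(l -> System.out.println(l + " -> " + isValid(l)));
    }
}
```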

Big Data Application Test Steps

Let's take a look at the testing process of big data applications.

Overall, big data testing can be roughly divided into three steps:

  • Step 1: Data staging validation
    The first step of big data testing, also called the pre-Hadoop stage, is to validate the accuracy of the data before it enters the big data system. (A consistency-check sketch follows this step's checklist.)

  1. Our data sources may be relational databases, log systems, social media, etc., so we should ensure that data is loaded into the system correctly

  2. We should verify that the loaded data is consistent with the source data

  3. We should ensure that the data is extracted and loaded into HDFS correctly
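A minimal sketch of such a consistency check, assuming newline-delimited records and hypothetical paths: compare the record count reported by the source system with what actually landed in HDFS.

```java
// Pre-Hadoop consistency check: source record count vs. records in HDFS.
// The expected count and HDFS path are passed in and are hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoadCheck {
    public static void main(String[] args) throws Exception {
        long expected = Long.parseLong(args[0]);   // count taken from the source system
        Path loaded = new Path(args[1]);           // e.g. /data/ingest/orders.csv in HDFS
        FileSystem fs = FileSystem.get(new Configuration());
        long actual = 0;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(fs.open(loaded)))) {
            while (r.readLine() != null) actual++; // one record per line assumed
        }
        if (actual != expected) {
            throw new AssertionError("HDFS load mismatch: expected "
                    + expected + " records, found " + actual);
        }
        System.out.println("Load verified: " + actual + " records");
    }
}
```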

  • Step 2: MapReduce validation
    The second key step of big data testing is "MapReduce" validation. At this stage, we mainly verify that the business logic of each processing node is correct, and, across multiple runs, confirm that (a unit-test sketch follows this checklist):

  1. The MapReduce process works correctly

  2. Data aggregation and segregation rules are implemented

  3. Key-value pairs are generated correctly

  4. The accuracy and other characteristics of the data after MapReduce are validated
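One common way to verify node-level MapReduce logic is a local unit test. Below is a minimal sketch using Apache MRUnit; the WordCountMapper is a hypothetical stand-in for a real job's mapper:

```java
// Node-level MapReduce validation with Apache MRUnit: assert the exact
// key-value pairs a mapper emits for a known input line.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    // Hypothetical mapper under test: emits (word, 1) for each token in a line.
    static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                ctx.write(new Text(token), ONE);
            }
        }
    }

    @Test
    public void emitsExpectedKeyValuePairs() throws Exception {
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("big data big"))
                .withOutput(new Text("big"), new IntWritable(1))
                .withOutput(new Text("data"), new IntWritable(1))
                .withOutput(new Text("big"), new IntWritable(1))
                .runTest();
    }
}
```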

  • Step 3: Result validation
    In this stage, we mainly verify the final result data produced by the big data tool/framework.

The main checks are:

  1. Verify that the data transformation rules are applied correctly

  2. Verify data integrity and successful persistence to the target system

  3. Verify that the data is not corrupted
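As a sketch of check 2, one simple persistence test compares the record count the job reports with the count in the target store. The JDBC URL, credentials, and table below are hypothetical:

```java
// Stage-3 result validation: confirm every record the job emitted was
// persisted to the target store. Connection details are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ResultPersistenceCheck {
    public static void main(String[] args) throws Exception {
        long expected = Long.parseLong(args[0]);   // record count from the job's counters
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://warehouse:5432/dw", "tester", "secret");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT COUNT(*) FROM fact_orders")) {
            rs.next();
            long persisted = rs.getLong(1);
            if (persisted != expected) {
                throw new AssertionError("Persistence gap: job emitted " + expected
                        + " records but target holds " + persisted);
            }
            System.out.println("Persistence verified: " + persisted + " records");
        }
    }
}
```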

Architecture Testing

Hadoop processing of massive data is very resource-intensive, and a good architecture is the foundation of a successful big data project. A poorly designed architecture can cause a dramatic drop in performance and make the system unable to meet our needs, so we should perform at least performance testing and failover testing in the Hadoop environment, both to improve efficiency and to cope with worst-case conditions.

Performance testing is a complex task that runs through the entire test cycle and requires attention to metrics such as memory, CPU, and network usage.

Failover testing verifies that when failures occur during data processing, the system takes appropriate countermeasures and recovers from the unexpected.

Performance Testing

Big data performance testing mainly covers the following parts:

  • Data extraction and storage efficiency

At this stage, we mainly verify how efficiently the big data application extracts data from the source and loads it.

The first aspect is the rate of data extraction and loading per unit of time. (A throughput sketch follows.)

The second is the efficiency of persisting the data to stores such as MongoDB.
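A minimal sketch of the first aspect, measuring ingest throughput against HDFS; the target path and the synthetic record generator are assumptions for illustration:

```java
// Measure ingest throughput (records per second) while writing to HDFS.
// The path and synthetic records are stand-ins for a real extract.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestThroughput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long records = 1_000_000;
        long start = System.nanoTime();
        try (FSDataOutputStream out = fs.create(new Path("/perf/ingest.txt"), true)) {
            for (long i = 0; i < records; i++) {
                out.writeBytes("record-" + i + "\n");   // stand-in for real extract data
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Loaded %d records in %.1f s (%.0f rec/s)%n",
                records, seconds, records / seconds);
    }
}
```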

  • Data processing

In this stage, we verify the execution efficiency of MapReduce tasks, focusing on data processing efficiency. This process may also involve persistence-related metrics, such as HDFS read/write efficiency, as well as in-memory processing efficiency, that is, the efficiency of the processing algorithms themselves. A job-timing sketch follows.
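As a starting point, the sketch below times an identity MapReduce job end to end, which isolates framework, shuffle, and HDFS read/write overhead from any business logic; the input and output paths are assumptions:

```java
// Time a MapReduce job end to end. With no mapper/reducer set, Hadoop runs
// identity map and reduce, so the measurement reflects framework overhead.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobTiming {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "perf-identity");
        job.setJarByClass(JobTiming.class);
        job.setOutputKeyClass(LongWritable.class);  // TextInputFormat's key type
        job.setOutputValueClass(Text.class);        // TextInputFormat's value type
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet

        long start = System.nanoTime();
        boolean ok = job.waitForCompletion(true);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Job %s in %.1f s%n", ok ? "succeeded" : "failed", seconds);
    }
}
```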

  • Sub-component performance

Big data processing generally combines a variety of components, so we also need to pay attention to the performance of these auxiliary components.

Performance Testing Strategy

Big data application performance testing involves massive amounts of structured and unstructured data, which differs from the business systems we usually face, so we need to design test strategies specifically for big data applications to cope with such data volumes.

The performance test execution process generally runs as follows:

  1. Initialize the big data cluster environment before the performance test

  2. Identify and design the big data performance test scenarios

  3. Prepare the big data performance test scripts

  4. Execute the tests and analyze the results (if metrics are abnormal, tune the corresponding components and retest)

  5. Optimize

Basic Preparation for Performance Testing

Before a big data performance test, the following groundwork needs to be done:

  • Data preparation: what magnitude of data do we need to prepare at each node?

  • Log estimation: how large might the logs generated during testing be, and how fast might they grow?

  • Concurrency: how many threads might read and write concurrently during the test?

  • Timeout settings: what connection, query, and write timeouts should be set?

  • JVM parameters: what are the optimal JVM settings, such as heap size and GC mechanism?

  • MapReduce: which sort, merge, and other algorithms should we choose?

  • Message queues: how long should the message queues be? And so on. (A configuration sketch follows this list.)
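Several of these knobs end up as Hadoop configuration properties. Below is a minimal sketch that pins some of them down in code; the property values are illustrative assumptions, not recommendations:

```java
// Pin down test-run knobs in a Hadoop Configuration. Values are
// illustrative assumptions for a hypothetical test scenario.
import org.apache.hadoop.conf.Configuration;

public class PerfTestConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Timeout settings: fail tasks that hang instead of blocking the run.
        conf.setLong("mapreduce.task.timeout", 600_000);          // 10 min, in ms
        // JVM parameters: heap size and GC choice for map/reduce task JVMs.
        conf.set("mapreduce.map.java.opts", "-Xmx2g -XX:+UseG1GC");
        conf.set("mapreduce.reduce.java.opts", "-Xmx4g -XX:+UseG1GC");
        // Concurrency: reducer parallelism for this scenario.
        conf.setInt("mapreduce.job.reduces", 8);
        return conf;
    }
}
```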

Required Test Environment

Big data testing is different from regular application testing. You should have at least the following basic environment:

  • Enough storage to hold and process big data

  • A cluster of distributed nodes for data processing

  • At least enough CPU and memory to sustain high-performance processing

Challenges of Big Data Testing

For software test engineers engaged in big data testing, compared with traditional testing work, we may face the following challenges:

  • Automation
    Automated testing is essential for big data testing, but automated testing tools may not be able to handle the exceptions that arise during testing, which means existing tools may not be applicable, and programming ability becomes the more valuable skill.

  • Virtualization
    Industry currently uses virtualization on a large scale, but virtual machine latency may disrupt the real-time processing that big data tests depend on.

For big data, managing virtual machine images will also be a huge problem.

  • Massive data sets

  1. The volume of data to verify is huge and demands faster processing

  2. Effective test automation methods are needed

  3. Testing needs to be as cross-platform as possible

Challenges of Big Data Performance Testing

Compared with traditional performance testing, big data performance testing may confront us with challenges in the following areas:

  • Diverse and complicated technologies: different big data solutions may require us to master different technologies and to customize different test solutions.

  • No universal tool: there is currently no standard, general-purpose big data performance testing tool in the industry, which means we need to develop or integrate a variety of tools according to the technology of the big data solution.

  • Complicated test environment: because of the huge data volume, the required test environment is more complicated and the base cost is higher.

  • Monitoring solutions: monitoring options are currently limited, but by integrating different monitoring tools it is possible to assemble a reasonably workable monitoring solution.

  • Diagnostic solutions: given the complexity of the technologies and environments involved in big data applications, problem diagnosis and tuning need to be developed and customized case by case.

Taken together, the problems facing big data performance testing are fairly complex, and for test engineers, especially in China today, there is still a long and difficult road ahead.

Summary

  • As big data engineering and data analysis enter a new stage, big data testing will become inevitable and a popular career direction.

  • Big data processing is batch, real-time, and interactive

  • Three stages of big data application testing:

  1. Data staging validation

  2. MapReduce validation

  3. Result validation

  • Architecture testing is also a very important type of testing; a poor architecture may directly lead to the failure of your big data project

  • Performance testing has three focal points:

  1. Data extraction and storage efficiency

  2. Data processing efficiency

  3. Sub-component efficiency

  • Big data testing differs from traditional testing not only in types and strategies, but also in specific technologies such as tools.

  • Because of the complexity of big data, its testing challenges differ from those of traditional testing.

  • Big data performance testing will be one of the harder goals for software test engineers to conquer.

Reproduced from: https://www.cnblogs.com/crstyl/articles/7277550.html
