The Career Development of a High-Paying Big Data Test Engineer

Introduction

Today I will share the career directions of the data test engineer role.

Table of Contents

  1. Overall function introduction
  2. Business-based testing
  3. Testing based on ETL layer
  4. AI-based testing
  5. Skill stack introduction

1 Overall function introduction


As we all know, data processing work revolves around data collection, data cleaning, data modeling, and data application. Like a pyramid, the roles divide into big data development/operation-and-maintenance engineers, data mining engineers, and data analysts.


The work of a data test engineer likewise revolves around data collection, cleaning, modeling, and application. The corresponding testing can be divided into the following categories: business-level data testing, ETL-layer data testing, and AI-layer testing. Let's look at how these roles differ.


2 Business-based testing

Business-level data testing exists to ensure data accuracy, consistency, and security.

Business data testing emphasizes ensuring that the final output data meets the rules the business defines.

For example: the business side needs an indicator A to see which of a merchant's stores has the highest net gross margin, where: net gross margin = gross profit ÷ operating income


As a business tester, you need to consider:

1. Assess the components of indicator A. From a business perspective, evaluate which sub-indicators it consists of, and break each one down until it cannot be decomposed further. Here, net gross margin is gross profit ÷ operating income, and gross profit is main business income minus main business cost.

2. Evaluate the acquisition path of each component: whether each sub-indicator can be obtained directly from a system, from which subsystem, and if it cannot be obtained directly, what processing is required to derive it.

3. Evaluate whether a unified data entry point is required. In point 2, a sub-indicator may have multiple data sources. For example, main business data may be available in both systems S1 and S2; you must evaluate whether S1 or S2 is more accurate.

4. Evaluate whether each sub-indicator contains invalid data. For example, whether the main business data includes records from stores in trial operation or data generated by test stores.

5. Unify indicator definitions and titles. After the points above are evaluated, you also need to check whether the indicator definitions and titles match the business side's usage habits.

There are more considerations in real business data testing than these points. What I want to convey is that business testers pay close attention to business definitions: the definition, acquisition, and output of data must all follow them. Often the business side has not defined an indicator clearly; in that case the tester must keep refining these indicators with the business side until their definitions are clear.
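The decomposition in point 1 can be sketched as a simple cross-check: rebuild indicator A from its sub-indicators and compare it against the figure the reporting layer produces. This is a minimal sketch with hypothetical numbers; the function names and field values are illustrative, not part of any real system here.

```python
def gross_profit(main_income, main_cost):
    """Gross profit = main business income - main business cost."""
    return main_income - main_cost

def net_gross_margin(main_income, main_cost, operating_income):
    """Net gross margin = gross profit / operating income."""
    if operating_income == 0:
        raise ValueError("operating income must be non-zero")
    return gross_profit(main_income, main_cost) / operating_income

# Cross-check indicator A for one store: the top-level figure reported
# by the BI layer should equal the value rebuilt from its sub-indicators.
reported = 0.25  # hypothetical value shown on the dashboard
rebuilt = net_gross_margin(main_income=1000, main_cost=750, operating_income=1000)
assert abs(reported - rebuilt) < 1e-9
```

If the reported and rebuilt values diverge, the discrepancy points at exactly one sub-indicator's source or transformation, which is where points 2–4 above come in.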

3 Testing based on the ETL layer

The full name of ETL is Extract-Transform-Load: extracting, transforming (cleaning), and loading. Here the tester needs to ensure that the data produced by R&D is not distorted in any of these three stages:

For extraction, consider:

1. What are the data sources? For example: MySQL, MongoDB, Elasticsearch, CSV files, ERP systems. Are the current extraction tools or systems compatible with them?

2. Consider the extraction method. A system with a large data volume should be extracted incrementally; has R&D accounted for this? Should extraction run in real time or offline, and which better suits the current business?
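Incremental extraction is commonly done with a watermark: remember the newest timestamp already pulled and only fetch rows after it on the next run. This is a minimal sketch using an in-memory SQLite table; the `orders` table and its `updated_at` column are hypothetical stand-ins for a real source system.

```python
import sqlite3

# Hypothetical source table with an update timestamp per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-05"), (3, "2024-01-09")])

def extract_incremental(conn, watermark):
    """Pull only rows newer than the last recorded watermark."""
    rows = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,)).fetchall()
    # Advance the watermark to the newest row seen, so the next run skips it.
    new_watermark = rows[-1][1] if rows else watermark
    return rows, new_watermark

rows, wm = extract_incremental(conn, "2024-01-02")
print(rows, wm)  # two new rows; watermark advances to 2024-01-09
```

A tester can verify both directions: that rows at or before the watermark are never re-extracted, and that the watermark advances correctly after each run.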

For cleaning/loading, consider:

The purpose of cleaning is to filter out data that does not meet requirements. The test is independent of R&D, and the result data needs to be checked against the ETL cleaning rules:

1. Evaluate handling of incomplete data. This data is mainly missing information: a missing supplier name or branch name, a customer missing region information, master and detail tables in the business system failing to match, and so on. This data is filtered out, and the missing content is written into separate Excel files and submitted to the customer, who must complete it within a specified time before it is written into the data warehouse.

2. Evaluate handling of erroneous data. These errors arise because the business system is not robust enough and writes input directly into the back-end database without validation: full-width characters in numeric fields, a trailing carriage return after string data, incorrect date formats, dates out of range, and so on. This data should also be classified. Problems such as full-width characters and invisible characters around the data can only be found by writing SQL statements; the customer is then asked to re-extract after the business system is fixed.

3. Evaluate handling of duplicate data: for this type of data, especially in dimension tables, export all fields of the duplicated records.

4. Evaluate handling of inconsistent data. This is an integration process that unifies the same kinds of data across different business systems. For example, the same supplier is coded XX0001 in the settlement system and YY0001 in the CRM; after extraction, both are converted into one unified code.
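Rules 1, 2, and 4 above can be sketched as small standalone checks. The records, field names, and code mapping below are hypothetical; the point is only to show how a tester might detect missing fields, full-width characters (via NFKC normalization), and inconsistent supplier codes in cleaned output.

```python
import unicodedata

# Hypothetical cleaned records to validate against the ETL rules above.
records = [
    {"supplier": "XX0001", "region": "North", "amount": "１２３"},  # full-width digits
    {"supplier": "YY0001", "region": "",      "amount": "456"},    # missing region
    {"supplier": "XX0001", "region": "North", "amount": "789"},    # clean
]

# Rule 4: map per-system supplier codes onto one unified code.
code_map = {"XX0001": "SUP-001", "YY0001": "SUP-001"}

def check_record(rec):
    issues = []
    if not rec["region"]:
        issues.append("incomplete: missing region")         # rule 1
    normalized = unicodedata.normalize("NFKC", rec["amount"])
    if normalized != rec["amount"]:
        issues.append("error: full-width characters")       # rule 2
    return issues

for rec in records:
    rec["supplier"] = code_map.get(rec["supplier"], rec["supplier"])

print([check_record(r) for r in records])  # third record passes cleanly
```

In practice the same checks would be expressed as SQL against the warehouse, as the article notes, but the logic is identical.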

4 Testing based on AI layer

The full name of AI is Artificial Intelligence: quality assurance at the artificial-intelligence level. At this level, the job is to evaluate whether the algorithm models are actually "smart".

There may not be a single clear metric that proves a model is or is not smart enough. You can only judge from the business purpose, whether the data set is qualified, the design of the A/B test plan, and finally the observed effect:

For example, a girlfriend may be angry for many reasons. Asking her directly will definitely get no answer; the correct approach is:

1. Try to analyze what made her angry. There may be many reasons: perhaps you mentioned your ex-girlfriend, perhaps you forgot her birthday, perhaps you complained about her cooking... (the ex-girlfriend factor probably carries a 98% weight).

2. For each cause, take remedial measures: buy a gift if you forgot the birthday, be more careful the next time an ex-girlfriend comes up, and if the cooking is the problem, cook it yourself and say: "Honey, let me do it" (subtext: your cooking isn't great).

Jokes aside, let's get back to serious AI testing:

Example: how do you determine whether a recommendation algorithm is intelligent? The AI tester's evaluation process is as follows:

1. Understand what the business purpose is and participate in the evaluation of impact factors and weights.

Example: when the current product is out of stock, recommend another similar product to improve the user's shopping experience and, at the same time, drive overall sales GMV. These are the core problems the algorithm model solves. Impact factors to evaluate include:

  - the number of times product A and product B appear in the same order,
  - the hierarchical structure of the product categories of product A and product B,
  - whether users with the same type of label who purchase product A also purchase product B,
  - ...

The percentage weight of each of the above impact factors is then evaluated.
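The first impact factor, pairwise co-occurrence within orders, is easy to compute directly. A minimal sketch, with hypothetical order histories standing in for real transaction data:

```python
from collections import Counter
from itertools import combinations

# Hypothetical order histories: each order is the set of product IDs bought together.
orders = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "D"},
]

# Impact factor: how often each product pair appears in the same order.
co_occurrence = Counter()
for order in orders:
    for pair in combinations(sorted(order), 2):
        co_occurrence[pair] += 1

print(co_occurrence[("A", "B")])  # A and B co-occur in 3 orders
```

A tester can compute such factors independently on a sample of the training data and compare against the values the pipeline feeds the model.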

2. Understand and evaluate the algorithm model, and evaluate whether the data in the training data set is "qualified".

At present, most Internet companies use common, mature, industry-standard algorithm models. AI algorithm engineers tune model parameters to fit their own company's business, so AI test engineers do not participate much at the algorithm-model level. The focus remains on whether the training data set is "qualified", and on participating in the A/B test plan to evaluate the algorithm's effect.

Evaluating whether the data is qualified includes the ETL checks above (completeness, consistency, and duplication), plus evaluating whether data noise reduction is reasonable.

Example: a product's historical average daily sales are in the tens of thousands, then suddenly reach several million on one day. You need to evaluate whether the spike comes from a large promotion or from an upstream calculation error; the algorithm layer must monitor and handle such data.
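One simple way to flag such spikes is to score each day against the historical distribution, for example with a z-score. This is a minimal sketch on hypothetical sales figures; real pipelines would use more robust statistics, but the shape of the check is the same.

```python
from statistics import mean, stdev

# Hypothetical daily sales for one product; the last day spikes suspiciously.
daily_sales = [21000, 19500, 22300, 20800, 21500, 20100, 3_200_000]

history, today = daily_sales[:-1], daily_sales[-1]
mu, sigma = mean(history), stdev(history)
z = (today - mu) / sigma

# Flag days far outside the historical distribution for manual review:
# was it a promotion, or an upstream calculation error?
is_anomaly = abs(z) > 5
print(is_anomaly)  # True
```

Flagged days are then routed to a human (or a promotions calendar) before the training set decides whether to keep, cap, or drop them.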

3. Participate in formulating the A/B test plan. Drafting the A/B test plan is not only the job of product, operations, and R&D; the tester, as the person most familiar with the system, should participate in the A/B test throughout:

3.1. Evaluate and set the core factors affecting the algorithm and the weights between them.

3.2. Evaluate and set the effect indicators for the final A/B test plan.

3.3. Split a small number of online users of the same type into group A and group B at the same time, changing the factor weights in the model for each test.

3.4. Use the effect indicators to evaluate the quality of each A/B test, thereby determining the optimal core factors and the weights between them.

For example:

The impact factors of the smart recommendation algorithm may be: the number of times product A and product B appear in the same order, the hierarchical structure of the product categories of products A and B...

The weight ratios may be: 25% each, or 4:3:2:1...

The effect indicators may be: the total GMV of users routed to group A is higher than that of group B, the average order value of group A users is higher than that of group B...

The model is split into A and B variants, users of the same type are routed to the different groups, and finally the effect indicators evaluate the models' quality, determining the optimal model.
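The routing-and-measurement loop above can be sketched in a few lines: hash each user ID into a stable group, then accumulate the effect indicator (total GMV) per group. The user IDs and order values are hypothetical; real experiments would also add significance testing before declaring a winner.

```python
import hashlib

def assign_group(user_id: str) -> str:
    """Deterministic bucketing: the same user always lands in the same group."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Hypothetical order events: (user_id, order_value).
events = [("u1001", 35.0), ("u1002", 80.0), ("u1003", 12.5), ("u1004", 55.0)]

# Effect indicator: total GMV routed to each variant.
gmv = {"A": 0.0, "B": 0.0}
for user_id, order_value in events:
    gmv[assign_group(user_id)] += order_value

print(gmv)
```

Deterministic hashing matters here: if a user could flip between groups mid-experiment, both variants' GMV figures would be contaminated and the comparison would be meaningless.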

5 Introduction to Skill Stack

Just like learning martial arts, the real power lies in internal skill; the moves are only built on that base. In the same way, 80% of data testing at the business, ETL, and AI layers is actually deep business understanding (the internal skill), followed by working out a reasonable, feasible data test plan, and then executing it with languages and tools such as Hadoop, Hive, Jupyter, pandas, Spark... (merely the moves), finally achieving the goal of data quality assurance.

If you liked this article, scan the WeChat QR code below for more big data testing material.


Origin: blog.51cto.com/14974545/2543134