Metadata database in a big data platform - resolving an abnormal MySQL fault

The main goal of this article is to resolve an abnormal failure of MySQL, the metadata database of a big data platform. Starting from slow application response, the problem was traced to the cluster's HIVE component and the MySQL metadata database. Through log analysis, tool-based detection, and expert guidance, the root cause was finally determined to be non-standard usage by a tenant of the big data cluster, and the problem was resolved step by step. This article describes the fault-location and resolution approach in detail, in the hope that this case analysis can serve as a reference for peers who encounter similar problems.

This article comes from research by the twt Community Professional Committee.

1. Fault background

When marketers built target customer groups on the application side, they found a significant delay. After feedback, preliminary verification, and positioning, it was found that calls from the backend to the big data cluster service returned no results. This caused delays in customer portrait generation, group uploads, number reporting, and several other applications that needed to reach target customers, and even led to complaints from some professional branches.

2. Troubleshooting ideas

1. Fault location:

Problems with HIVE components fall into two categories:

1. hivemetastore

Use the cluster monitoring page or hivemetastore log analysis to check whether parameters such as the hivemetastore concurrency limit are constraining the service.

2. hiveserver2

1) Check whether any tasks have been added recently, and analyze whether they contain abnormal SQL statements or other problematic programs;

2) Check for parameter problems through the cluster monitoring page or hiveserver2 log analysis;

3) Audit and analyze Hive's metadata database tables to see whether there are tables with a large number of partitions, or large tables subject to full table scans, that need attention, such as audit tables; a metastore query sketch follows below.
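A quick way to carry out this audit is to query the metastore database directly. Below is a minimal sketch, assuming the standard Hive metastore schema (DBS, TBLS, PARTITIONS); table and column names can differ between Hive versions, so verify them against your deployment first.

    -- Rank Hive tables by partition count in the metastore (MySQL).
    SELECT d.NAME           AS db_name,
           t.TBL_NAME       AS table_name,
           COUNT(p.PART_ID) AS partition_count
    FROM DBS d
    JOIN TBLS t            ON t.DB_ID  = d.DB_ID
    LEFT JOIN PARTITIONS p ON p.TBL_ID = t.TBL_ID
    GROUP BY d.NAME, t.TBL_NAME
    ORDER BY partition_count DESC
    LIMIT 20;

Tables that appear at the top of this ranking, such as heavily partitioned audit tables, are the ones worth reviewing first.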

2. Troubleshooting:

Now that we know the MySQL metadata database problem is caused by the Hive component, it is recommended to start from the following aspects:

1. Start with the Hive components

a. Check whether any tasks have been added recently; tasks that have not passed code review, or whose SQL is not written in a standardized way, can occupy too many resources and slow down the cluster's response;

b. Check the parameters of hiveserver2 and hivemetastore and analyze their logs to see whether parameter problems are slowing down the cluster components;

2. Start with the MySQL database

a. Check the hardware resources of the MySQL server, including CPU, memory, IO, and network interfaces, to see whether any are over-utilized;

b. Conduct an inventory analysis of Hive's metadata database to see whether there are long connections or resource-intensive SQL statements slowing the database down (see the diagnostic sketch after this list);

3. Start with YARN components

a) Check whether the allocation of tenant queue resources is reasonable;

b) Check whether there are a large number of tasks with abnormal status.
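For step 2b above, a minimal diagnostic sketch against the MySQL metadata database is shown below. It uses standard MySQL views; on MySQL 8.0 the InnoDB lock views live in performance_schema (data_locks, data_lock_waits) rather than information_schema, and the 300-second threshold is only an illustrative assumption.

    -- Sessions that have been running, or sitting idle in a transaction, for a long time
    SELECT id, user, host, db, command, time, state, info
    FROM information_schema.PROCESSLIST
    WHERE time > 300              -- illustrative threshold: longer than 5 minutes
    ORDER BY time DESC;

    -- Transactions that are still open (candidates for unreleased InnoDB locks)
    SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_query
    FROM information_schema.INNODB_TRX
    ORDER BY trx_started;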

3. Case description:

1. How the abnormal failure of the MySQL metadata database was discovered

1) At 18:30 on May 6, operation and maintenance personnel found that the task of creating the target customer group was delayed; after verification, the cluster's slow response was confirmed to be the cause of the task delay;

2) From 19:00 to 23:40 on May 6, the spark logs, hiveserver logs, NameNode logs, and hivemetastore logs were analyzed and no exceptions were found; on the CM monitoring page, the cluster inspection indicators also showed no abnormality;

3) At 23:55 on May 6, operation and maintenance personnel discovered many long-connection sessions in the MySQL metadata database, and that the number of InnoDB locks kept increasing without being released;

4) At 0:30 on May 7, operation and maintenance personnel asked colleagues from the infrastructure department to help locate the cause. They found multiple long connections from big data tenants in the metadata database (MySQL), which degraded database performance and in turn the response efficiency of cluster task submission; after verification, the long-connection sessions and the unreleased InnoDB locks had been initiated by tasks of the tenant user_yddsj (a big data tenant);

5) At 0:12 on May 7, operation and maintenance personnel phoned the big data tenant's vendor to request a cleanup, and also notified the responsible bureau by email for assistance, asking the vendor to clean up the long-connection sessions;

6) At 0:30 on May 7, experts from Company H's big data product line were invited to assist. After remote analysis, the preliminary cause was judged to be insufficient hivemetastore concurrency. The concurrency was increased at the source-code level, then deployed, debugged, and verified several times in the test environment, and released to the production environment at 20:30 on May 7; the hivemetastore service restart was completed at 21:30. After the restart, cluster capability returned to normal. However, follow-up monitoring showed cluster service performance degrading again around 23:45, which ruled out the hivemetastore concurrency as the root cause, so the experts were invited on site that night to provide support the next day;

7) At 8:10 on May 8, several experts from Company H arrived at the Hunan Telecom site and worked together to locate the cause of the fault. The integration experts found that IO usage on the MySQL database host was continuously reaching 99%;

8) At 8:30 on May 8, the MySQL expert confirmed that the long-connection sessions and unreleased InnoDB locks discovered on May 7 had still not been released. The target table these sessions pointed to was user_yddsj.volte_mw. Querying the metadata showed that this table has more than 20,000 partitions and that the tenant's execution program performs a full table scan on it, which kept the IO usage of the MySQL database host persistently high (a verification query is sketched after this timeline);

9) At 11:19 on May 8, operation and maintenance personnel, together with the person in charge at the bureau, notified the big data tenant to clean up the partitions of the table user_yddsj.volte_mw. With the agreement of the bureau and the big data tenant, and in order to restore normal cluster service as soon as possible, it was decided to suspend the tenant's cluster service and stop its applications;

10) At 11:40 on May 8, the big data tenant began cleaning up the partitions of the user_yddsj.volte_mw table; notification that the cleanup was complete was received at 12:30;

11) At 13:30 on May 8, after more than an hour of observation by operation and maintenance personnel, the cluster's service response and performance had returned to normal, and the efficiency of access to the metadata database was back to normal as well.
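For reference, the metadata check described in item 8 can be reproduced with a query like the one below, a minimal sketch that again assumes the standard Hive metastore schema (DBS, TBLS, PARTITIONS); verify the schema for your Hive version before relying on it.

    -- Confirm the partition count of the suspect table in the metastore.
    SELECT COUNT(*) AS partition_count
    FROM PARTITIONS p
    JOIN TBLS t ON t.TBL_ID = p.TBL_ID
    JOIN DBS  d ON d.DB_ID  = t.DB_ID
    WHERE d.NAME = 'user_yddsj'
      AND t.TBL_NAME = 'volte_mw';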

Figure 1: Colleagues from the Basic Support Department assist in locating the long connection problem

Figures 2-1 to 2-3: Statements related to the long connections, whose users correspond to the big data tenant

Figure 3: High IO water level on the MySQL database host on May 8

Figures 4-1 to 4-8: MySQL database long-connection statements on May 8, locating the big data tenant table user_yddsj.volte_mw with more than 20,000 partitions

Figure 5: Locating the big data tenant execution program's full table scan problem on May 8

Figure 6: After more than an hour of observation, cluster services returned to normal at 13:30 on May 8

3. Fault summary

1. Problem solving

Temporary measures:

1) Clean up the table partitions to relieve the pressure on the MySQL metadata database (see the cleanup sketch after this list);

Permanent measures:

1) Re-evaluate the existing tables and redesign them, especially the partition scheme;

2) Establish table-cleanup rules to prevent similar situations from recurring.
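As a concrete form of such a cleanup rule, the sketch below drops expired partitions through Hive rather than deleting HDFS files directly, so that the metastore stays consistent with the data; the partition key day_id and the cutoff value are illustrative assumptions, since the article does not show the real table definition.

    -- Drop expired partitions through Hive so metastore entries and HDFS data stay in sync.
    ALTER TABLE user_yddsj.volte_mw DROP IF EXISTS PARTITION (day_id < '20230101');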

2. Summary

1) The big data tenant cleaned only the HDFS files and not the HIVE table partition information (see the metadata repair sketch after this list);

2) The big data tenant's execution program causes full table scans against MySQL;

3) The launch of big data platform tenant applications was not covered by the tenant management specifications;

4) The big data platform lacked monitoring of cluster table partition metadata.
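For point 1, when the HDFS files have already been removed but the partition entries remain in the metastore, newer Hive versions (3.x) can reconcile the metadata with MSCK, as sketched below; on older versions the dangling partitions have to be dropped explicitly with ALTER TABLE ... DROP PARTITION.

    -- Drop partition entries whose HDFS directories no longer exist (Hive 3.x).
    MSCK REPAIR TABLE user_yddsj.volte_mw DROP PARTITIONS;
    -- Or add missing and drop dangling partitions in one pass:
    MSCK REPAIR TABLE user_yddsj.volte_mw SYNC PARTITIONS;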

4. Optimization to avoid problems

How to design a rectification plan for the abnormal failure of the MySQL metadata database (completion deadline: omitted)

1) Big data tenants clean up HIVE table partition information in a timely manner and configure automatic cleanup scripts;

2) The big data tenant adjusts its execution program and completes the redesign of the volte_mw table partitions into a "large partition + small partition" scheme (a sketch follows this list), together with the corresponding program changes;

3) The big data platform incorporates tenant application launches into the tenant management specifications;

4) The big data platform adds monitoring of cluster table partition metadata.
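For measure 2, one possible shape of the "large partition + small partition" redesign is sketched below; the column names and partition keys are illustrative assumptions, since the real volte_mw schema is not given in the article.

    -- Two-level partition design: a coarse "large" partition (month) plus a fine
    -- "small" partition (day), so no single partition level grows without bound.
    CREATE TABLE user_yddsj.volte_mw_new (
        msisdn      STRING,
        record_time STRING
        -- ... other business columns ...
    )
    PARTITIONED BY (
        month_id STRING,   -- large partition, e.g. 202305
        day_id   STRING    -- small partition, e.g. 20230508
    )
    STORED AS ORC;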

Source: blog.csdn.net/LinkSLA/article/details/132269718