Causes and Handling of an Online MySQL Database Crash

Foreword

This article follows the previous one, "A Record of an Online Mysql Database Crash Accident", which recorded the crash of an online database. I recommend reading the two together to avoid confusion.

Due to time constraints, that article only covered some of what I went through and what was going through my mind at the time. The causes and the follow-up handling steps were not clearly written up, and many readers said they found it unclear. I apologize to them here. I really did not have enough time last week to prepare two articles at once; otherwise the first would not have ended so hastily. Subjective factors also took up a large share of the previous article, because recalling the incident brought back a lot of thoughts, so it read as somewhat personal and diary-like.

This article does not describe the incident again. It organizes the causes of the incident and the follow-up handling steps.

Memories of the past 1

My boss later sent me a screenshot that shows the state of the database at the time. It shows the instance information after the database went down, when the instance was basically paralyzed. The screenshots of the transaction locks were not saved at the time, so they cannot be included in this article. A reader left me a message asking why the locked process was not simply killed. I replied that it was because I did not know about this at the time. I searched my diary from those days and there was indeed a record of it: the boss had sent me a few commands:

show processlist;
-- find the process holding the lock
kill <id>;

These operations were most likely carried out at the time, but from the screenshot described above you can tell why they did not help. The transaction lock did exist, and it did prevent some tables from being operated on normally, but the main problem was that the database's resources were exhausted. Killing the related processes could not solve that.
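As a rough sketch of the two commands above, the snippet below picks long-running, non-idle sessions out of rows shaped like the output of show processlist and builds the matching kill statements. The Id, Command, Time, and Info keys mirror real processlist columns; the function name, the 30-second threshold, and the sample rows are assumptions made for illustration.

```python
from typing import Iterable, List, Mapping


def kill_statements(processlist: Iterable[Mapping], min_seconds: int = 30) -> List[str]:
    """Build KILL statements for sessions that have been busy for a long time."""
    statements = []
    for row in processlist:
        # "Sleep" means an idle connection, not a running statement.
        if row["Command"] == "Sleep":
            continue
        # Time is the number of seconds the session has spent in its current state.
        if row["Time"] >= min_seconds:
            statements.append(f"KILL {row['Id']};")
    return statements


# Hypothetical rows; a real list would come from running SHOW PROCESSLIST.
sample = [
    {"Id": 11, "Command": "Sleep", "Time": 120, "Info": None},
    {"Id": 12, "Command": "Query", "Time": 95, "Info": "UPDATE tb_grid_product ..."},
    {"Id": 13, "Command": "Query", "Time": 2, "Info": "SELECT ..."},
]
print(kill_statements(sample))  # ['KILL 12;']
```

Killing a session this way only releases what that one session holds, which is consistent with the point above: it does not help once the whole instance is out of resources.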

Introduction to the warehousing function

The warehousing function has been identified as the main cause of this accident. So why did operating it frequently bring the database down?

First, let's introduce the functional design changes at that time and the SQL statements involved.

Table structure design and function design

Tables and entities involved in the warehousing function:

  • Warehouse information ( tb_storehouse )
  • Shelf information ( tb_shelf )
  • Grid information ( tb_shelf_grid )
  • Product information ( tb_product )
  • Product location information ( tb_grid_product )
  • Inbound information ( tb_store_in_record )

Definition:

  • Since there are multiple warehouses, the warehouse has its own table;
  • A warehouse has multiple shelves, so tb_storehouse and tb_shelf have a one-to-many relationship;
  • A shelf has multiple grids (shelf specifications differ: some have 8 grids and some have 4), so tb_shelf and tb_shelf_grid also have a one-to-many relationship;
  • Product information uses the product code as the primary key. It has other attributes, but they are unrelated to warehousing and are not listed;
  • Product location information records which grid a product is on. The table has four fields: id, product code, grid id, and number (quantity). It stores the primary keys of the two entities plus the quantity;
  • Warehousing record information records which warehousing clerk stored which product at which time. The fields involved are product_id, grid_id, operator_id, and create_time. There are other fields, but they are unrelated to the warehousing operation and are not listed.
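The definitions above can be sketched as a schema. This is a minimal reconstruction, not the project's real DDL: the table names and the tb_store_in_record fields come from the text, while the other column names (storehouse_id, shelf_id, product_code, name) and all column types are assumptions. SQLite from the Python standard library stands in for MySQL so the sketch can run anywhere.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tb_storehouse (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tb_shelf (
    id INTEGER PRIMARY KEY,
    storehouse_id INTEGER NOT NULL REFERENCES tb_storehouse(id)  -- one warehouse, many shelves
);
CREATE TABLE tb_shelf_grid (
    id INTEGER PRIMARY KEY,
    shelf_id INTEGER NOT NULL REFERENCES tb_shelf(id)            -- one shelf, many grids
);
CREATE TABLE tb_product (product_code TEXT PRIMARY KEY, name TEXT);
CREATE TABLE tb_grid_product (
    id INTEGER PRIMARY KEY,
    product_code TEXT NOT NULL REFERENCES tb_product(product_code),
    grid_id INTEGER NOT NULL REFERENCES tb_shelf_grid(id),
    number INTEGER NOT NULL DEFAULT 0                            -- quantity on this grid
);
CREATE TABLE tb_store_in_record (
    id INTEGER PRIMARY KEY,
    product_id TEXT NOT NULL,
    grid_id INTEGER NOT NULL,
    operator_id INTEGER NOT NULL,
    create_time TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```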

Warehousing function V1.0

In the initial version, the warehousing function was simple and clear. A warehousing operation did only one thing, warehousing, and the page logic was equally simple: enter the back office -> warehouse management -> shelf management -> grid management -> click the warehousing button -> warehousing. The clerk opens the page of the target grid and clicks the warehousing button. A pop-up with an input box appears, and the clerk scans the product code with a scanner gun to complete the operation. The input box is disabled while an operation is in progress, and the next one can start as soon as the previous one succeeds. Warehousing this way was very fast.

The SQL statements that need to be executed in the initial version are:

  • Query the product by product code; if there is no result, report an error and remind the user that the product SKU needs to be completed first;
  • Query the grid information; if there is no result, report an error;
  • Query the location information; if a row already exists, add one to its number field, otherwise insert a new row;
  • Add the warehousing record, complete the warehousing, and return.

These 4 SQL steps complete one warehousing operation. I designed and developed the first version myself. The function is very clear, and it performs no other operations or queries.
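The four steps above might look like the following. This is a sketch under assumptions, not the project's code: the function name store_in_v1, the column names, and the error types are hypothetical, and SQLite stands in for MySQL.

```python
import sqlite3


def store_in_v1(conn, product_code, grid_id, operator_id):
    """One scan of the V1.0 flow: validate, update the location, write the record."""
    cur = conn.cursor()
    # Step 1: the product must exist, otherwise the SKU has to be completed first.
    if cur.execute("SELECT 1 FROM tb_product WHERE product_code = ?",
                   (product_code,)).fetchone() is None:
        raise LookupError("unknown product: complete the SKU first")
    # Step 2: the grid must exist.
    if cur.execute("SELECT 1 FROM tb_shelf_grid WHERE id = ?",
                   (grid_id,)).fetchone() is None:
        raise LookupError("unknown grid")
    # Step 3: add one to the existing location row, or create it.
    row = cur.execute("SELECT id FROM tb_grid_product "
                      "WHERE product_code = ? AND grid_id = ?",
                      (product_code, grid_id)).fetchone()
    if row is not None:
        cur.execute("UPDATE tb_grid_product SET number = number + 1 WHERE id = ?",
                    (row[0],))
    else:
        cur.execute("INSERT INTO tb_grid_product (product_code, grid_id, number) "
                    "VALUES (?, ?, 1)", (product_code, grid_id))
    # Step 4: write the warehousing record.
    cur.execute("INSERT INTO tb_store_in_record "
                "(product_id, grid_id, operator_id, create_time) "
                "VALUES (?, ?, ?, datetime('now'))",
                (product_code, grid_id, operator_id))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tb_product (product_code TEXT PRIMARY KEY);
CREATE TABLE tb_shelf_grid (id INTEGER PRIMARY KEY);
CREATE TABLE tb_grid_product (id INTEGER PRIMARY KEY, product_code TEXT,
                              grid_id INTEGER, number INTEGER);
CREATE TABLE tb_store_in_record (id INTEGER PRIMARY KEY, product_id TEXT,
                                 grid_id INTEGER, operator_id INTEGER, create_time TEXT);
INSERT INTO tb_product VALUES ('P01');
INSERT INTO tb_shelf_grid VALUES (1);
""")
store_in_v1(conn, "P01", 1, 7)
store_in_v1(conn, "P01", 1, 7)
quantity = conn.execute("SELECT number FROM tb_grid_product").fetchone()[0]
records = conn.execute("SELECT COUNT(*) FROM tb_store_in_record").fetchone()[0]
print(quantity, records)  # 2 2
```

Every statement here touches a single table through an equality filter, which is why this version stayed cheap even when the clerks scanned quickly.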

Warehousing function V2.0

This is the version that caused the database crash. The reason for the change was simple: the new "product manager" thought the warehousing page was too ugly and needed to be redone. The new flow is: enter the back office -> warehouse management -> shelf management -> grid management -> click the warehousing button -> warehousing -> refresh the page -> click the warehousing button -> warehousing. The page changed as well. The original grid list page displayed only grid information, but the new page shows, as a paginated list, which products are on the grid, how many of each are on it, and how many of each are in the whole warehouse. The new function also requires the page to refresh after every warehousing operation so that the clerk can see the change.

At this point you may already find the design inappropriate or unreasonable. Hold that thought for a moment and look at the SQL statements the new version executes:

  • Query the product by product code; if there is no result, report an error and remind the user that the product SKU needs to be completed first;
  • Query the grid information; if there is no result, report an error;
  • Query the location information; if a row already exists, add one to its number field, otherwise insert a new row;
  • Add the warehousing record;
  • Query a page of location information (the "product manager" requires 20 rows per page);
  • Query the real inventory for the product code of each of the 20 rows, complete the warehousing, and return.

Querying the real inventory takes one additional SQL statement for each of the 20 rows, so one warehousing operation in the new function executes 25 SQL statements in total: the 4 original steps, 1 list query, and 20 inventory queries. Apart from the few statements carried over from the first version, all the statements added by this change are complex SQL.
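The jump from 4 to 25 statements comes from the one-query-per-row pattern in the last step. The sketch below counts statements for that pattern and for a single grouped query that returns the same totals. It is an illustration only: the real inventory query in the project was a multi-table join with functions, and here it is simplified to a SUM over tb_grid_product. The CountingDb wrapper and the function names are hypothetical, and SQLite stands in for MySQL.

```python
import sqlite3


class CountingDb:
    """Thin wrapper that counts how many statements are sent."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def execute(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params)


def page_codes(db):
    # The list query: one page of 20 location rows.
    return [r[0] for r in db.execute(
        "SELECT product_code FROM tb_grid_product WHERE grid_id = 1 "
        "ORDER BY id LIMIT 20")]


def inventory_per_row(db, codes):
    # V2.0 shape: one inventory query for every listed row.
    return {c: db.execute("SELECT SUM(number) FROM tb_grid_product "
                          "WHERE product_code = ?", (c,)).fetchone()[0]
            for c in codes}


def inventory_batched(db, codes):
    # Alternative shape: one grouped query for the whole page.
    marks = ",".join("?" * len(codes))
    return dict(db.execute(
        "SELECT product_code, SUM(number) FROM tb_grid_product "
        f"WHERE product_code IN ({marks}) GROUP BY product_code", codes))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_grid_product (id INTEGER PRIMARY KEY, "
             "product_code TEXT, grid_id INTEGER, number INTEGER)")
conn.executemany(
    "INSERT INTO tb_grid_product (product_code, grid_id, number) VALUES (?, ?, ?)",
    [(f"P{i:02d}", grid, 1 + i % 3) for i in range(20) for grid in (1, 2)])

db = CountingDb(conn)
slow = inventory_per_row(db, page_codes(db))
per_row_count = db.count

db.count = 0
fast = inventory_batched(db, page_codes(db))
batched_count = db.count

print(per_row_count, batched_count)  # 21 2
```

Added to the 4 write steps, the per-row shape gives the 25 statements described above, while the grouped shape would give 6.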

Q&A

Note: because some company business is involved, the warehousing operation is simplified in this article. The real operation was more complicated than the process described here, and the real change was worse: the original operation executed about 6 SQL statements, while the modified one executed more than 60 at a time.

  • Q: Why is it designed this way?
  • A: "Product Manager" thinks it looks good.
  • Q: Does the warehouse need this design?
  • A: "Product Managers" think warehouse managers are fools and don't care about their thoughts.
  • Q: Why do you need all the real inventory data of the product?
  • A: The "product manager" feels that overall planning is needed.
  • Q: The page looks good, but the function is troublesome. Why do you still do this?
  • A: The "product manager" thinks a bit more work for the warehouse staff doesn't matter.
  • Q: Did the developers push back on any of this?
  • A: The "product manager" said to just do it, so objecting was useless.
  • Q: Who is the "Product Manager"?
  • A: The boss's girlfriend.

Answered!


In fact, this design is obviously superfluous. When I later chatted privately with the warehouse staff, they were angry and cursing too. When they received the flow chart, they had explicitly questioned the need for all the real inventory data, but there was nothing they could do about it. Originally one warehousing operation took less than a second. With the new function it sometimes took 3 to 4 seconds or even longer, and it also caused this accident.

Crash cause

From the description above, we can roughly tell what caused the database crash: there is a female hacker in our company! Hahaha, just kidding.

The warehousing operation grew from about 6 SQL statements to about 60, and the time per operation grew with it. The real-inventory queries among those 60 statements are also fairly complex: they use multi-table joins and function calls, so they perform poorly. Warehousing operations were also intensive that day, so the database carried a load many times heavier than before. It is no surprise that its resources were gradually exhausted.


The reasons for the crash are summarized as follows:

  • A business function executes too many SQL statements, and this function is called many times in a short period of time.
  • Some of the SQL statements are complex, for example statements that use functions or multi-table joins. A large number of statements combined with complex statements makes things worse.
  • The choice and configuration of the database connection pool led to a very large number of database connections.
  • The service layer code is not standardized: even select statements are wrapped in transactions, which opens and closes transactions unnecessarily and adds overhead on the MySQL database.
  • Some tables have no index, or incomplete indexes, which leads to slow SQL.
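To illustrate the last point in the list, the sketch below asks the database how it would run the location lookup before and after adding an index. SQLite's EXPLAIN QUERY PLAN is used as a stand-in so the example can run locally; on MySQL the equivalent check would use EXPLAIN. The index name and the choice of indexed columns are assumptions, not the indexes the project actually added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_grid_product (id INTEGER PRIMARY KEY, "
             "product_code TEXT, grid_id INTEGER, number INTEGER)")

QUERY = ("SELECT number FROM tb_grid_product "
         "WHERE product_code = ? AND grid_id = ?")


def plan(sql, params):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " | ".join(row[-1] for row in rows)


before = plan(QUERY, ("P01", 1))   # no usable index: the whole table is scanned
conn.execute("CREATE INDEX idx_grid_product_code_grid "
             "ON tb_grid_product (product_code, grid_id)")
after = plan(QUERY, ("P01", 1))    # the lookup now goes through the new index

print(before)
print(after)
```

The exact wording of the plan differs between SQLite versions, but the first plan reports a scan of the table and the second reports a search using idx_grid_product_code_grid.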

Many causes are listed here: transaction problems, poor indexing that slows queries, slow SQL, and an explosion of database connections. Each is linked to the next, and one problem drags in another. But these are not the main problem. The main problem is the first item: the function design was extremely unreasonable, so a huge number of SQL statements were executed in a short period, which exposed all the other deficiencies and finally set off the crash. Under normal circumstances, slow or complex SQL statements alone will not drag down a database. Even without an index, a query just takes longer to return, and that by itself does not normally bring down the whole application.

Some people may say, wouldn't it be good to ask the boss to change his girlfriend?

Shh~ be quiet.

Others may say, isn't this easy to fix? Just roll back to the previous version.

In fact, the problem has many sides. This function change was the main cause, but the non-standard code, the unoptimized table structure, and the unhandled slow SQL would all still be there after a rollback. Even if the increase in warehouse traffic had not crashed the database this time, the next increase in mall traffic or in traffic to some other page could have. So the fix had to cover everything, not just roll back the version.

Follow-up

Asking the boss to change girlfriends is a joke.

There are many steps in subsequent processing, which are summarized as follows:

  • Rework the warehousing function: keep the page design but change the behavior;
  • Database connection pool changes;
  • Table structure optimization;
  • Clean up slow SQL;
  • Standardize the business code to reduce transaction overhead;
  • Modify MySQL parameters; the one I remember best is the wait_timeout parameter;
  • Integrate a cache.
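The article does not say which connection pool was used or how it was reconfigured, so the following is only a generic sketch of the idea behind a bounded pool: the number of connections is fixed up front, and a burst of requests waits (or fails fast) instead of opening more connections against the database. The BoundedPool class and its method names are hypothetical, and SQLite connections stand in for MySQL ones.

```python
import queue
import sqlite3
from contextlib import contextmanager


class BoundedPool:
    """Hands out at most `size` connections; extra callers wait for a free one."""

    def __init__(self, size, factory):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    def idle_count(self):
        return self._idle.qsize()

    @contextmanager
    def connection(self, timeout=1.0):
        # Blocks up to `timeout` seconds, then raises queue.Empty.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller failed.
            self._idle.put(conn)


pool = BoundedPool(2, lambda: sqlite3.connect(":memory:", check_same_thread=False))

with pool.connection() as conn:
    print(conn.execute("SELECT 1").fetchone()[0], pool.idle_count())  # 1 1
print(pool.idle_count())  # 2
```

With a cap like this, the number of connections the database sees is bounded by the pool size rather than by the request rate.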

The accident was frustrating and left me feeling helpless, but looking at the results afterwards, I realize that without it we would never have thought about optimizing the code, optimizing the database, or integrating a cache. These changes made the system more robust, and more importantly, they gave us experience. So don't be afraid of problems; experience is gained through bumps and stumbles.

Several years of work have gradually taught me that technical growth is inseparable from one mistake after another. Failure brings sadness and frustration, but it undeniably brings growth too. A stronger mindset, better efficiency, and more experience all accumulate as accidents happen and get resolved, one after another. It will be the same in the future: there will still be one difficulty after another.

Memories of the past 2

This incident was my first encounter with MySQL downtime, and the first time I learned that a database could be crashed by requests. Before that, I had only seen the Tomcat server brought down by requests, or the server's bandwidth saturated. So my memory of it is fairly deep: the details may be blurry, but it had a great impact on me. It was also after this incident that I used a cache in a project for the first time, which is why I wrote these two articles before writing the cache integration article. Of course, the initial integration code was written by my boss, and after studying it for a long time I had only just gotten started. Beyond the warehousing operation, integrating the cache into other functions also helped the system a lot. Here, the cache relieves pressure on the database by absorbing part of the requests so that they do not all hit the database directly.

To use a rough analogy: a function runs well when it executes 6 SQL statements, but when it executes 60 and the operations become intensive, it may crash. A cache can avoid this by sharing as much of the database's load as possible, so that not every request has to reach the database. Take the 60 SQL statements in this incident: if the results returned by the last 54 had been put into the cache, the crash would not have occurred.

About two years have passed since the accident, so I may not recall it very accurately. I can only reconstruct the course of events from a few diary entries written at the time, and only approximately. In fact, the accuracy of the account is not especially important. What the record shows is that my handling of things back then was immature. From today's perspective, I could definitely locate and resolve the problem quickly, but for me at the time it was very troublesome, and technically it even felt impossible. When I first heard the words "table lock", I was completely lost. What is this? What is a database lock? What should I do if a table is locked?
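The idea in the analogy above, serving repeated reads from a cache instead of the database, can be sketched as follows. The article does not show the cache that was actually integrated, so this in-process dictionary with a time-to-live is an assumption chosen to keep the example self-contained. TtlCache, load_inventory, and the returned value 42 are all hypothetical.

```python
import time


class TtlCache:
    """Cache-aside helper: load on a miss, serve from memory until the entry expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                      # fresh entry: no database access
        value = loader(key)                    # miss or expired: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        # Call after a write (for example a warehousing operation) changes the value.
        self._entries.pop(key, None)


db_calls = []


def load_inventory(product_code):
    db_calls.append(product_code)  # stands in for the expensive inventory query
    return 42


cache = TtlCache(ttl_seconds=30)
for _ in range(5):
    cache.get_or_load("P01", load_inventory)
print(len(db_calls))  # 1
```

Five reads reach the stand-in database once. The cost is staleness: an entry has to be invalidated, or allowed to expire, whenever a warehousing operation changes the quantity.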

Epilogue

That concludes the articles about the online MySQL database crash. If you have any questions or good ideas, please leave me a message, and thank you for pointing out any problems in the project.

First published on my personal blog. New project demo address: perfect-ssm, login account: admin, password: 123456

If you want to continue to learn about the project, you can view the entire series of articles on Spring+SpringMVC+MyBatis+easyUI integration, or you can go to my GitHub repository or open source Chinese code repository to view the source code and project documentation.

