A Record of a Production MySQL Database Crash

Introduction to the Article

Over the past few years of work, my technology stack has kept changing, my project management experience has grown, and I write code faster than I used to. I find that gratifying. I have been making progress all along, though with plenty of twists and turns on the way. After countless bumps and pitfalls, it was not easy to go from a newbie who did not even know how to use Baidu to a relaxed old hand. Many people probably have similar experiences and feelings, so this blog will also revisit some of the accidents and problems I ran into, as a reminder to myself.

The cache module will be introduced into the perfect-ssm project next, and I happened to come across the record of this accident while looking through my diary, so I put this article together and reviewed the incident based on what I wrote at the time. This real database accident and its follow-up handling serve as an introduction to caching. Why do it this way? There are already plenty of articles on the Internet about why caching is important and necessary, and they spell out its uses and benefits clearly, so writing all that again would be redundant. Walking through a personal experience seems the better approach.

I already knew that caching mattered back then and wanted to use it in the project, but I was too lazy and did not know how to integrate a cache, so it kept getting postponed. This incident and its follow-up handling were the first time I used a cache in project development, which is why I chose this diary entry as the introduction to adding a cache. I actually have many diary entries, but this one is rather special, and it happened to be useful for perfect-ssm, so I turned it into an article.

Project Introduction

I have written three paragraphs already and cannot keep stalling. It is time for the protagonist to appear. First, a brief introduction to the project where the accident occurred. The company I worked for at the time was a small e-commerce company, and its product was an online mall. It had been running for a while and was still being continuously developed and optimized. The user-facing side had the usual e-commerce features: products, shopping carts, orders, and payments. The back office covered operational data, the warehouse system (goods in and out), order management, and so on. The numbers of users and orders were nothing remarkable, but they were enough to keep the company running. OK, that covers the basic information.

First Crash

The accident happened on the morning of an ordinary day (well... in the morning again). It came on suddenly: customer service received calls from several users, one after another. At first we put it down to user error or products being out of stock in the warehouse. Those come up often, so nobody took it to heart (my biggest advantage as a newcomer was probably being carefree and not worrying about anything, hahaha). But the complaints kept coming, and the boss told me to hurry up and see what was going on. Only then did I open the website and start checking (was that calm, or just slow?).

Then something strange happened. The website opened normally, the homepage was fine, search worked, and product detail pages loaded. I got nervous. What was going on? Was the server stuck? Had one of the Tomcat nodes in the cluster gone down? I hurried to check the servers, and every instance was running normally. That made me even more nervous. This was not the usual plot. Wasn't this the part where a server hangs and I restart it? The answer was no...

I went back to customer service to find out more. They said, basically, that users could not place orders. I quickly walked through the ordering process myself. Sure enough, the page got stuck loading and the order could not be generated. I clicked through other pages and they were all normal. The order list and order deletion worked; only order creation did not. I hurried to check the logs and found no errors (my memory is hazy here about whether anything was reported and what it was; I only remember checking the logs and finding no problem). I ran to the boss and reported: the server is OK! The logs are fine too! The database can be queried normally! The order table is normal as well! Only order creation fails, so there may be a problem with the order interface! (You can picture how dumbfounded I looked.) I was stunned. Everything was normal, so how could this happen? At my level back then, all I could do was stand by and cheer "666", and then I pushed the matter to the boss (nice...). There was no way around it: I really could not solve it and did not even understand what the problem was.

The boss first checked along the same lines I had (maybe I had just been lazy and missed something). Of course he found nothing either, and then he started looking at the database, which took a long time. I remember that stretch as agonizing: I could not help much and had to keep fielding customer service. At one point I hoped for luck, thinking things might recover on their own after a while, so I tried placing a few more orders, and every one failed. In the end I could only sit next to the boss and watch him type. The most frustrating part was that I could not understand many of the SQL commands and Linux scripts...

The code was fine, the logs were fine, and the cluster was fine. By elimination, the problem was most likely in the database, but I had no permission to view or operate it, so I could not confirm that.

The boss kept checking and I could only wait. After a while he said that two tables were locked (my inner monologue: the table is locked? What does that mean? How can it be locked? Wait... what is a database lock?). He tried to resolve it, but nothing worked, and for the moment he could not think of a good solution. The customer service colleagues were pressing hard and the atmosphere grew tense. Then I offered my "brilliant idea": boss, let's restart the database. The boss glanced at me. I like to think his eyes were full of praise, and he smiled, perhaps thinking it was a good plan and wondering why he had not thought of it himself. (PS: he may also have been cursing me silently as a clueless rookie who only knows how to restart things whenever there is a problem: restart Tomcat, restart nginx, restart the server... and now restart MySQL.) Still, since he was out of ideas, he gave in after I suggested it. He killed some processes and restarted the MySQL service, which took about two or three minutes. This was the production environment, so there was bound to be some impact, but it would not be large: the business had only just started and the system was still being developed and improved.

After the restart it really did work. We could place orders normally again and everyone was relieved. I watched for a while and everything looked normal. It happened to be noon, so I went out to eat. When I came back I kept checking the database, found everything normal, and wrote the whole thing off as a one-off without paying it special attention. The company was short-staffed (it was a start-up's technical department, what do you expect...), everyone had plenty to do, and the boss moved on to other work. His suggestion was that I review the SQL to see whether too many join queries or a badly chosen index had caused a deadlock. Since the impact had not been particularly large, we treated it as an ordinary accidental event. The boss also said that the next version would add caching, otherwise the database might not hold up as the business grew.

Crashed Again

Some readers may be wondering: is that the end of the story? So soon?

Certainly not! It could not possibly end like that, or I would not be writing a whole article about it. After four o'clock in the afternoon the same problem appeared again, and this time it was much more serious.

It was much like the morning. I was coding at my desk when messages from customer service started arriving one after another, with the same user feedback as before. This time there was none of the morning's scrambling. After checking the website I found the same symptoms as in the morning, so the boss went straight to the database and restarted it, and the website ran normally again. By now even a fool could see that the problem lay in the database service, but we could not restart it every time. We had to find and fix the cause. All I knew, though, was that several tables in the database were locked. The worst part was that I did not know what to do and did not even know what to search for on Baidu. The problem clearly had to be in the database, and the boss's suggestion was that I look for problems in the order logic and its related SQL... (I certainly did not find anything, because the source of this incident was not there.)

While everyone was checking code, checking SQL, and searching the Internet for solutions, the same problem struck again, and this time it was worse than the previous two. Not only ordering but other functions stopped working, most of the interfaces hung as well, and the database was in worse shape. The gap between the two afternoon incidents was short. In other words, not long after the restart the database collapsed again (does this feel like we were being attacked? Hehe). It really blew up, and we still had not located the problem. Just as everyone started to believe we were under attack and even felt a whiff of despair, the warehouse manager happened to come over and report that goods could not be entered into the system for listing, adding that something similar had happened in the morning. That is when a few of us developers caught on: the warehouse back office had been updated recently, so maybe the problem was related. We quickly asked the warehouse manager for details. How many goods were being listed, and how many people were doing it? The answers fit nicely. They had also been stocking in during the morning, and shortly before he came to us a large batch had arrived, so he had asked several part-timers to enter it together. After a moment we thought through the updated features and the new warehousing code and gradually understood the root cause. The problematic SQL statement was right there! We then talked with the warehouse supervisor: the warehouse management system had just been updated and was still unstable, which had caused some problems, so stock-in should be paused while we fixed it quickly.

Good grief, it turned out that it was our own people who had "attacked" our website!

I have forgotten the exact sequence, but I remember that the website had three incidents that day, and in hindsight the picture is much clearer. Essentially, every incident coincided with the warehouse staff doing stock-in operations with a fairly large volume of goods, and this roughly matches what the person in charge of the warehouse described. The first and second times were comparatively mild: tables were locked, and that was all. The third time was more serious: the database's resources were exhausted, connections could hardly be obtained at all, and only a few requests went through. Going over the warehouse manager's account again makes it fairly clear. In the first two incidents the goods being stocked in were neither numerous nor entered in quick succession, so although tables were locked and orders failed, some requests still executed normally and most interfaces kept running. The third time, the goods were numerous and entered quickly, which not only locked the tables but also exhausted the database connections, and as a result most of the interfaces hung too.
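The gap between the first two incidents and the third can be pictured with a toy model. The sketch below is not the project's code. It assumes a fixed-size connection pool, and it assumes that any statement touching a locked table keeps its connection checked out until the lock clears. The class name PoolExhaustionSketch, the Kind enum, the pool size of 5, and the request mixes are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class PoolExhaustionSketch {

    // Hypothetical request types: the first two touch the locked tables.
    enum Kind { WAREHOUSING, CREATE_ORDER, BROWSE }

    /**
     * Replays requests in order against a pool of poolSize connections
     * while the tables used by WAREHOUSING and CREATE_ORDER stay locked.
     * Returns {served, stuck, rejected}.
     */
    static int[] replay(int poolSize, List<Kind> requests) {
        int free = poolSize;
        int served = 0, stuck = 0, rejected = 0;
        for (Kind kind : requests) {
            if (free == 0) {
                rejected++;   // no connection left: the request fails outright
            } else if (kind == Kind.BROWSE) {
                served++;     // borrows a connection and returns it at once
            } else {
                free--;       // waits on the table lock and keeps its connection
                stuck++;
            }
        }
        return new int[] { served, stuck, rejected };
    }

    public static void main(String[] args) {
        // Light stock-in: orders hang, but browsing still works.
        List<Kind> light = Arrays.asList(
                Kind.WAREHOUSING, Kind.WAREHOUSING, Kind.CREATE_ORDER,
                Kind.BROWSE, Kind.BROWSE, Kind.BROWSE);
        System.out.println(Arrays.toString(replay(5, light))); // [3, 3, 0]

        // Heavy stock-in: the pool is drained before anything else gets a turn.
        List<Kind> heavy = Arrays.asList(
                Kind.WAREHOUSING, Kind.WAREHOUSING, Kind.WAREHOUSING,
                Kind.WAREHOUSING, Kind.WAREHOUSING, Kind.WAREHOUSING,
                Kind.CREATE_ORDER, Kind.BROWSE, Kind.BROWSE, Kind.BROWSE);
        System.out.println(Arrays.toString(replay(5, heavy))); // [0, 5, 5]
    }
}
```

In the light case the stuck statements leave spare connections, so pages that do not touch the locked tables keep working. In the heavy case the pool is empty before any browse request arrives. Raising poolSize only moves the tipping point: a larger stock-in batch drains the bigger pool just the same.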

Seeing this, many readers will surely ask: what kind of website crashes as soon as goods are stocked in? What were you playing at? Don't get worked up. We were caught off guard. We had done stock-in many times before and everything had been normal, so we were as surprised as anyone. Because nothing like this had ever happened, we took the warehouse side for granted and never considered it, and the two afternoon incidents came so close together that there was no time to troubleshoot. The accident was a coincidence of sorts: the warehouse back office had just been revised, with some features added and some previously stable features reworked on request. The main causes were the page design introduced in that update and the corresponding SQL statements. The database configuration had little to do with it. Of course, with a more powerful database configuration the accident might have happened later, but it would have happened sooner or later.

The Crashing Part Is Over

Before I knew it I had written this much. I estimate roughly 6,000 words. It seems I still have a lot of feelings about who I was and what I went through back then. Some scenes are still vivid, I still remember some of my thoughts and reactions, and I really do feel that I was rather foolish at the time.

Originally I wanted to cover the cause analysis and the handling process here as well, but that would make the article far too long. This one has already recalled a lot, so the specific causes and the follow-up fix will go into a later article. Written together, it would be so long that many people would just scroll to the end, hahaha.

Epilogue

The record of this crash ends here. The poor system design and the SQL statement that dragged the system down will be covered in detail in the next article.

First published on my personal blog. New project demo address: perfect-ssm; login account: admin, password: 123456

If you have any questions or good ideas, please leave me a message, and thank you to the friends who have pointed out problems in the project. This article mainly recounts a MySQL crash.

If you want to keep following the project, you can read the whole series of articles on Spring+SpringMVC+MyBatis+easyUI integration, or visit my GitHub repository or my OSChina (Open Source China) code repository for the source code and project documentation.


Origin http://43.154.161.224:23101/article/api/json?id=325782690&siteId=291194637