Even veteran ops hands may not know the best way to handle these common production failures

This article is reproduced from Erma Reading

Many people get asked questions like these in interviews: What system failures have you encountered? How did you solve them? The following are real production failures I have summarized from 15 years of Internet development experience.

There are not many pictures in this article, but it is packed with practical material. Understand it first, then put it to use!

Fault 1: Quickly troubleshooting frequent JVM Full GC

Before sharing this case, let’s talk about which scenarios will cause frequent Full GC:

  1. Memory leaks (a problem in the code keeps object references from being released, so the objects cannot be reclaimed in time).

  2. Infinite loops.

  3. Large objects.

Large objects in particular account for more than 80% of cases.

So where do these large objects come from?

  1. Databases (relational databases such as MySQL as well as NoSQL databases such as MongoDB): the result set is too large.

  2. Large objects returned by third-party interfaces.

  3. Message queues: individual messages are too large.

In my years of front-line Internet experience, the vast majority of cases are caused by oversized database result sets.

Okay, now let's get into this production incident:

Without any release having gone out, the POP service (which connects to third-party merchants' services) suddenly started doing frantic Full GCs. Heap memory monitoring showed no memory leak, and rolling back to the previous version did not make the problem go away. Awkward!

The conventional approach is to dump a heap snapshot with jmap (jmap -dump:format=b,file=<filename> <pid>), analyze it with a tool such as MAT to see which objects occupy the most space, and then trace their references back to the offending code. Locating the problem this way takes a relatively long time; for a critical service, not being able to locate and fix the problem quickly has far too large an impact.

Let's take a look at our approach:

First, analyze the heap snapshot as usual; at the same time, have other colleagues check the database server's network IO monitoring. If the database server's network IO has risen sharply and the timing matches, you can basically conclude that a large database result set caused the Full GCs. Then have a DBA quickly locate the big SQL (trivial for a DBA, it takes a minute; if a DBA cannot locate it, he should be fired, haha). Once the SQL is found, locating the code is easy.

This way we located the problem quickly. It turned out that a required parameter of one interface was not being passed in and no validation was in place, so two conditions were missing from the SQL WHERE clause and tens of thousands of records were fetched in a single query. What a pit! Isn't this approach much faster? Haha, it took 5 minutes.

The DAO layer at that time was implemented with MyBatis, and the SQL that caused the problem was as follows:

<select id="selectOrders" resultType="com.***.Order" > 
select * from user where 1=1 <if test=" orderID != null "> and order_id = #{orderID} </if > 
 
<if test="userID !=null"> and user_id=#{userID} </if > <if test="startTime !=null"> and create_time >= #{createTime} </if > <if test="endTime !=null"> and create_time <= #{userID} </if > </select> 

The intent of this SQL was to query a single order by orderID, or all of a user's orders by userID; at least one of the two parameters must be supplied. But neither was passed, only startTime and endTime, so a single SELECT fetched tens of thousands of records.

So be careful with <if test> when using MyBatis; a moment of carelessness can spell disaster. Later we split the SQL above into two:

Query an order by order ID:

<select id="selectOrderByID" resultType="com.***.Order" > 
select * from user where 
 
order_id = #{orderID} </select> 

Query orders by user ID:

<select id="selectOrdersByUserID" resultType="com.***.Order" > 
select * from user where user_id=#{userID} 
 
<if test="startTime !=null"> and create_time >= #{createTime} </if > <if test="endTime !=null"> and create_time <= #{userID} </if > </select> 

Fault 2: Memory leak

Before introducing the case, let's first clarify the difference between a memory leak and a memory overflow.

Memory overflow: the program no longer has enough memory to allocate. Once memory overflows, the program basically cannot keep running normally.

Memory leak: the program fails to release memory it no longer needs, so the memory it occupies keeps growing. A memory leak generally does not stop the program from running, but if it keeps accumulating until the memory limit is reached, a memory overflow follows. In Java, a memory leak means the GC cannot fully reclaim objects, so heap usage ends up a little higher after every GC.
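
As an illustration of the leak pattern (not the actual faulty code), the classic Java culprit is a long-lived collection that keeps accumulating references:

import java.util.HashMap;
import java.util.Map;

public class LeakyProductHolder {

    // A static map lives as long as the class loader, so everything put into it
    // stays strongly reachable and can never be garbage collected.
    private static final Map<Long, byte[]> CACHE = new HashMap<>();

    public static void remember(long productId) {
        // Every distinct productId pins roughly 1 MB that is never released,
        // so heap usage after each GC keeps creeping upward.
        CACHE.put(productId, new byte[1024 * 1024]);
    }
}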

The JVM monitoring chart for this incident showed exactly that: heap memory usage crept a little higher after each GC.

The scenario at the time: a local cache (a framework built by our infrastructure team) was used to store product data. The number of products was not huge, a few hundred thousand. Caching only the hot products keeps memory usage modest, but caching every product does not fit in memory.

At the beginning we gave each cache record a 7-day expiration time, which ensured that most cached products were hot ones. However, during a later refactor of the local cache framework, the expiration time was dropped. With no expiration, the local cache grew larger every day, and plenty of cold data was loaded into it as well.
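
The in-house cache framework is not shown here, but as a rough sketch of the kind of expiration we depended on, a Guava Cache configured with expireAfterWrite behaves similarly (Guava and the Product type are only stand-ins for the internal framework and its cached records):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

public class ProductLocalCache {

    // Entries expire 7 days after being written, so cold products are dropped
    // instead of piling up in the heap forever.
    private final Cache<Long, Product> cache = CacheBuilder.newBuilder()
            .expireAfterWrite(7, TimeUnit.DAYS)
            .maximumSize(200_000)          // extra safety cap on the number of entries
            .build();

    public void put(long productId, Product product) {
        cache.put(productId, product);
    }

    public Product get(long productId) {
        return cache.getIfPresent(productId);
    }
}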

Until one day I received an alert that heap memory usage was too high. I quickly dumped a heap snapshot with jmap (jmap -dump:format=b,file=<filename> <pid>), analyzed it with the Eclipse MAT tool, and found a huge number of product records sitting in the local cache. Once the problem was located, we had the architecture team add the expiration time back and then restarted the service node by node.

Thanks to the server memory and JVM heap monitoring we had added, we caught the leak in time. Otherwise, as the leak kept accumulating, a real OOM one day would have been miserable.

So, in addition to the usual ops monitoring of CPU and memory, the technical team should also monitor the JVM itself.

Fault 3: Idempotency

Many years ago, I worked as a Java programmer at a large e-commerce company and was developing a points service. The business logic was: after a user's order is completed, the order system sends a message to the message queue; when the points service receives it, it grants the user points, adding the newly earned points to the user's existing balance.

Due to network issues and the like, a message may be sent more than once, which leads to repeated consumption. Being a rookie fresh to the workplace, I did not consider this case. So after going live, duplicated points occasionally appeared: a user was granted points two or more times for a single completed order.

Later we added a points record table. Before granting points for a consumed message, we first check the points record table by order number; only if no record exists do we grant the points. This is so-called "idempotence": performing the same operation repeatedly does not change the final result.
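
A minimal sketch of that check-before-grant pattern with plain JDBC is shown below. The table and column names are illustrative; the important parts are a unique constraint on order_id in the points record table and treating the duplicate-key error as "already processed", so a redelivered message becomes a no-op:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

public class PointsConsumer {

    // Assumes t_points_record has a UNIQUE KEY on order_id.
    public void onOrderCompleted(Connection conn, long orderId, long userId, int points)
            throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement insert = conn.prepareStatement(
                     "insert into t_points_record (order_id, user_id, points) values (?, ?, ?)");
             PreparedStatement update = conn.prepareStatement(
                     "update t_user_points set total = total + ? where user_id = ?")) {
            insert.setLong(1, orderId);
            insert.setLong(2, userId);
            insert.setInt(3, points);
            insert.executeUpdate();          // fails on a duplicate order_id

            update.setInt(1, points);
            update.setLong(2, userId);
            update.executeUpdate();
            conn.commit();
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            // This order has already been credited: a repeated message is simply ignored.
            conn.rollback();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}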

In real development, any scenario involving retries or repeated consumption must be made idempotent to guarantee a correct result. For example, to avoid double charging, a payment interface must also be idempotent.

Fault 4: Cache avalanche

We often run into situations where a cache needs to be initialized. For example, we once rebuilt the user system: the user table structure changed, and so did the cached data. After the rebuild, before going live, the cache had to be initialized by loading user records into Redis in batches.

Each cached user record had an expiration time of 1 day; after a record expires, the latest data is read from the database and written back to Redis. The grayscale (canary) launch went fine, so we soon rolled out to full traffic. The whole release went smoothly, and we coders were quite happy.

But the next day, disaster struck. At a certain point, alarms of every kind started firing. The user system suddenly responded very slowly, at times not at all. Monitoring showed the user service's CPU spiking (with very high IO wait), MySQL traffic and server load soaring, and the Redis cache hit rate dropping through the floor.

Thanks to our solid monitoring (ops monitoring, database monitoring, and full-link APM), the problem was located quickly: a huge number of user records in Redis expired at the same time, so requests for user information missed the cache and went straight to the database, which put enormous pressure on MySQL in an instant. The user service and other dependent services were dragged down with it.

This kind of mass, simultaneous cache expiration that lets a flood of requests hit the database directly is the so-called "cache avalanche". If a performance test never runs long enough to reach the expiration time, it will not catch the problem, so this deserves everyone's attention.

Therefore, when initializing cache data, make sure the expiration times of individual records are spread out. For these user records, for example, we can use a large fixed value plus a small random one: an expiration time of 24 hours plus a random 0 to 3600 seconds.
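
A minimal sketch of adding that jitter when writing the cache, here using the Jedis client (the key format and value serialization are illustrative):

import java.util.concurrent.ThreadLocalRandom;
import redis.clients.jedis.Jedis;

public class UserCacheLoader {

    private static final int BASE_TTL_SECONDS = 24 * 3600;   // fixed 24 hours
    private static final int JITTER_SECONDS = 3600;          // random 0-3600 seconds

    public void cacheUser(Jedis jedis, long userId, String userJson) {
        // Spreading expirations over an hour prevents all records loaded in the
        // same batch from expiring at the same moment.
        int ttl = BASE_TTL_SECONDS + ThreadLocalRandom.current().nextInt(JITTER_SECONDS + 1);
        jedis.setex("user:" + userId, ttl, userJson);
    }
}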

Fault 5: Disk IO causing thread blocking

The problem occurred in the second half of 2017. For a while, our geographic grid service would occasionally respond slowly; each episode lasted from a few seconds to tens of seconds and then recovered on its own.

If the slowdown had persisted it would have been easy to handle: grab a thread dump with jstack and you can usually locate the problem quickly. But each episode lasted tens of seconds at most and was sporadic, happening once or twice a day, sometimes only once in several days, at unpredictable times. Having someone watch the service and run jstack by hand was obviously unrealistic.

Well, since the manual approach was unrealistic, we automated it: a shell script runs jstack periodically, once every 5 seconds, and writes the output to rotated log files so that only a limited number of files are kept.

The shell script is as follows:

#!/bin/bash
num=0
log="/tmp/jstack_thread_log/thread_info"

cd /tmp
# Create the log directory on first run.
if [ ! -d "jstack_thread_log" ]; then
  mkdir jstack_thread_log
fi

while ((num <= 10000)); do
  # Find the PID of the target Java process (service name "gaea").
  ID=`ps -ef | grep java | grep gaea | grep -v "grep" | awk '{print $2}'`
  if [ -n "$ID" ]; then
    jstack $ID >> ${log}
  fi
  num=$(( $num + 1 ))
  # Rotate the log file every 100 runs so no single file grows too large.
  mod=$(( $num % 100 ))
  if [ $mod -eq 0 ]; then
    back=$log$num
    mv $log $back
  fi
  sleep 5
done

The next time the slowdown occurred, we pulled up the jstack log file for that time window and saw many threads blocked on Logback log output. We then trimmed the logging and switched the log output to asynchronous, and the problem was solved. The script really comes in handy; keep it around for when you hit similar problems!

Fault 6: Database deadlock

Before analyzing the case, let's first review MySQL InnoDB. In the InnoDB engine, the primary key is a clustered index: both the index values and the data records are stored in the leaf nodes of the B+ tree, i.e. the data rows live together with the primary key index.

The leaf nodes of an ordinary (secondary) index store only the primary key value. After a query reaches a secondary index leaf node, it must use that primary key to find the corresponding leaf node of the clustered index and read the actual row there. This process is called "going back to the table".

The failure happened in our mall's order system. A scheduled task runs every hour and cancels unpaid orders created more than an hour earlier. The customer-service back office can also cancel orders in batches.

The (simplified) structure of the order table t_order is as follows:

Column           Meaning
id               Order ID, primary key
status           Order status
created_time     Order creation time

id is the table's primary key, and there is an ordinary (secondary) index on the created_time field.

Clustered index (primary key id):

id (primary key)    status    created_time
1                   UNPAID    2020-01-01 07:30:00
2                   UNPAID    2020-01-01 08:33:00
3                   UNPAID    2020-01-01 09:30:00
4                   UNPAID    2020-01-01 09:39:00
5                   UNPAID    2020-01-01 09:50:00

Ordinary index (created_time field):

created_time (index key)    id (primary key)
2020-01-01 09:50:00         5
2020-01-01 09:39:00         4
2020-01-01 09:30:00         3
2020-01-01 08:33:00         2
2020-01-01 07:30:00         1

The scheduled task runs every hour and cancels the unpaid orders created in the two-hour window that ended one hour earlier. For example, at 11 am it cancels unpaid orders created between 8 and 10. The SQL statement is as follows:

update t_order set status = 'CANCELLED' where created_time > '2020-01-01 08:00:00' and created_time < '2020-01-01 10:00:00' and status = 'UNPAID' 

The customer service batch cancel order SQL is as follows:

update t_order set status = 'CANCELLED' where id in (2, 3, 5) and status = 'UNPAID' 

Deadlock may occur when the above two statements are executed simultaneously. Let's analyze the reasons.

The first SQL, from the scheduled task, first walks the created_time ordinary index and locks the matching entries, then locates the corresponding primary key index entries and locks those.

The first step is to lock the created_time ordinary index.

The second step is to lock the primary key index.

The second SQL, the customer-service batch cancel, goes straight to the primary key index and locks it directly.

Notice that the scheduled task's SQL locks the primary keys in the order 5, 4, 3, 2, while the batch-cancel SQL locks them in the order 2, 3, 5. When the first SQL has locked 3 and is about to lock 2, it finds that 2 is already locked by the second SQL, so it must wait for the lock on 2 to be released.

Meanwhile, the second SQL tries to lock 3, finds that 3 is already held by the first SQL, and has to wait for it. The two statements wait on each other's locks: a "deadlock".

The solution is to make the two statements acquire locks in the same order. Alternatively, change the customer-service batch cancel so that each SQL statement cancels only one order, and execute the statements one by one in application code, as sketched below. If the batches are not large, this clumsy approach is perfectly workable.
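
A rough sketch of the "one order per statement" approach with JDBC (table and column names as in the example above); sorting the IDs first also keeps the lock acquisition order consistent:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Comparator;
import java.util.List;

public class OrderCancelService {

    // Cancels orders one id at a time instead of with a single multi-row UPDATE,
    // so each statement only locks a single primary key entry.
    public void cancelOrders(Connection conn, List<Long> orderIds) throws SQLException {
        // Use one fixed ordering (descending id here, matching the direction in which
        // the scheduled task's index scan acquires locks in this example).
        orderIds.sort(Comparator.reverseOrder());
        try (PreparedStatement ps = conn.prepareStatement(
                "update t_order set status = 'CANCELLED' where id = ? and status = 'UNPAID'")) {
            for (Long id : orderIds) {
                ps.setLong(1, id);
                ps.executeUpdate();   // with autocommit on, each update commits independently
            }
        }
    }
}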

Fault 7: Domain name hijacking

Let's first review DNS resolution. When we visit www.baidu.com, we first ask a DNS server for the IP address of the Baidu server corresponding to www.baidu.com, and then access the website at that IP address over HTTP.
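
For illustration, the resolution step can be observed in Java with InetAddress (a simple demonstration, not part of the incident below):

import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsLookupDemo {
    public static void main(String[] args) throws UnknownHostException {
        // Ask the configured DNS resolver which IP addresses the name maps to.
        InetAddress[] addresses = InetAddress.getAllByName("www.baidu.com");
        for (InetAddress address : addresses) {
            System.out.println(address.getHostAddress());
        }
        // If DNS is hijacked, the addresses printed here point at the attacker's
        // server instead of the real site.
    }
}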

DNS hijacking is a type of Internet attack: by attacking or spoofing the domain name resolution server (DNS), the attacker resolves the target website's domain name to some other IP, so requests either cannot reach the target website or are redirected to a different site.

We experienced a DNS hijacking case ourselves: on a product page, where a product picture should have been displayed, an advertising picture appeared instead. Was the wrong image simply referenced? No, the domain name (DNS) had been hijacked.

The page was supposed to display product images hosted on our CDN, but after the hijacking it showed advertising images from another website. Because the CDN image links used the insecure HTTP protocol at the time, they were easy to hijack. After switching to HTTPS, the problem was solved.

Of course, there are many forms of domain hijacking, and HTTPS cannot prevent them all. So besides security hardening, many companies keep backup domain names of their own and can switch to them whenever hijacking occurs.

Fault 8: Bandwidth exhaustion

Bandwidth exhaustion that makes a system inaccessible is rare, but it deserves attention all the same. Let's look at an incident we ran into.

The scenario was this: every product image shared through our social e-commerce platform carries a unique QR code that identifies both the product and the sharer, so the QR codes have to be generated programmatically. Initially, we generated them in Java on the server side.
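
Our original generation code is not shown; as a sketch, server-side QR generation in Java typically looks something like this with the ZXing library (the share URL format here is made up):

import com.google.zxing.BarcodeFormat;
import com.google.zxing.WriterException;
import com.google.zxing.client.j2se.MatrixToImageWriter;
import com.google.zxing.common.BitMatrix;
import com.google.zxing.qrcode.QRCodeWriter;
import java.io.IOException;
import java.io.OutputStream;

public class QrCodeGenerator {

    // Encodes the share link into a 300x300 QR code and writes it as a PNG.
    // Every image generated here costs server CPU and, once downloaded, bandwidth.
    public void writeQrCode(String productId, String sharerId, OutputStream out)
            throws WriterException, IOException {
        String shareUrl = "https://example.com/share?p=" + productId + "&u=" + sharerId;
        BitMatrix matrix = new QRCodeWriter().encode(shareUrl, BarcodeFormat.QR_CODE, 300, 300);
        MatrixToImageWriter.writeToStream(matrix, "PNG", out);
    }
}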

Because traffic was low in the early days, this never caused a problem. But one day operations launched an unprecedented promotion, and traffic instantly jumped by dozens of times. Problems followed immediately: the network bandwidth was completely used up, and with bandwidth exhausted, many page requests responded very slowly or not at all.

The reason was that the number of QR codes generated per unit time also jumped by dozens of times, and since every QR code is an image, this put tremendous pressure on bandwidth.

How to solve it? If the server cannot cope, look to the client: generate the QR code in the client app instead, making full use of users' phones. Android, iOS and React (Native) all have SDKs or libraries for generating QR codes.

This not only solved the bandwidth problem but also freed the server CPU that had been spent generating QR codes (generation involves a fair amount of computation, so the CPU cost is noticeable).

Public network bandwidth is expensive, so it pays to be frugal!

The cases in this article are all from my personal experience; I hope they are helpful to readers.


Origin: blog.csdn.net/m0_46163918/article/details/113123287