Detailed Explanation of Interview Questions from First-Tier Companies (2)

1. What did you learn from the troubleshooting and recovery of the Station B (Bilibili) crash?

At 22:52 on July 13, 2021, Station B went down. The culprit behind the whole incident turned out to be just a few lines of code.

On the day Station B collapsed, the news hit the top of Weibo's trending topics in less than half an hour.

The flood of displaced Station B users even brought down other sites such as Station A (AcFun), Douban, and Zhihu.

Some netizens joked that while Station B's servers were down, not only were Station B's programmers nervously working overtime, but the programmers at Station A, Zhihu, Douban, and Weibo also silently turned on their computers.

So what was the root cause behind Station B's collapse?

1. Understanding Station B's public network architecture

A year later, Station B published the underlying cause of the incident. The problem itself is not complicated, but to fully understand it, let's first look at Station B's public network architecture (as shown in the figure).

    1. CDN is short for Content Delivery Network. It serves content from nodes near the user, accelerating access to server resources.

    2. LVS is a layer-4 load balancer. It balances based on IP + port and provides a high-availability cluster in front of OpenResty.

    3. OpenResty is a high-performance web platform based on Nginx + Lua. Simply put, Lua scripts can be used to extend Nginx's capabilities and build dynamic web applications.

    4. Finally, services are deployed across multiple data centers in a geo-distributed multi-active architecture to ensure high availability.

When a user initiates a request, the CDN routes it to the business data center, LVS layer-4 load balancing sends it on to the OpenResty servers, and it is finally forwarded to the appropriate application server instance, which fetches the data and returns it.

2. Analysis of the fault-resolution process

Given this architecture (as shown in the figure), when Station B crashed, the operations staff took some conventional steps to locate and resolve the problem.

    1. Trace the entire request link and check for abnormal server nodes. The SLB operations staff found that the CPUs of the layer-7 load balancers had hit 100%, so they tried reloading and cold-restarting the SLB, but that did not solve the problem.

    2. They then discovered that the SLB in the multi-active data center had a large number of timed-out requests even though its CPU usage was normal, so they restarted that SLB as well, which restored some services.

Notice that when a production incident occurs and the specific cause is still unknown, the most efficient response is usually to restart the relevant service nodes first to stop the bleeding. That is why Station B's operations staff restarted the SLB and the multi-active nodes.

This allowed some services to be accessed normally once the multi-active data center recovered.

However, the main business data center was still down, so a slower, more careful investigation was needed.

The operations staff then used perf, the system profiling tool provided by Linux, and found that the CPU hotspot on the data center's SLB servers was concentrated in a single Lua function.

So a conventional remedy was applied: a version rollback.

Generally, if a problem appears and little else has changed, it was most likely introduced by a recently released version.

Judging from Station B's postmortem, however, the rollback did not fix the problem. It was finally resolved by building a brand-new SLB cluster.

The entire resolution took 3 hours, which is a very serious failure for an Internet company.

Although the problem had been solved, the root cause (and who should take the blame) had not yet been identified, so the analysis had to continue.

After further investigation, the cause of the maxed-out CPU was traced to a Lua function in OpenResty (as shown in the figure).

This function synchronizes service registration addresses and the access weights of service nodes from the registry into Nginx's shared memory.

The lua-resty-balancer module then uses this data to route requests to service addresses dynamically.

When dynamically selecting a target server, a weighted round-robin algorithm is used, and the following method (as shown in the figure) computes the greatest common divisor.

The method itself is fine, but when a node's weight is 0, the parameter b received by the _gcd function may be the string "0".

Lua is a weakly typed language, so the string "0" can be passed in without error. The string "0" is not equal to the number 0, so the termination condition fails and a recursive _gcd call executes. When a % b runs, Lua coerces the string to a number, and taking any number modulo 0 yields NaN.

On the next call this becomes _gcd(NaN, NaN), and since NaN is never equal to 0, the recursion never terminates: an infinite loop.

You may be wondering why rolling back the code did not fix this.

According to the official statement, the weight function had been launched 2 months earlier, meaning the latent risk had existed for 2 months.

Under a certain release mode, an application instance's weight is temporarily set to 0, which causes the registry to return the weight to the SLB as the string "0".

A production incident that took down the production environment for 3 hours and caused enormous impact and losses was, in the end, caused by just a few lines of code. This answer may be hard to accept, but it is a true portrayal of the saying that a thousand-mile dike can be destroyed by a single ant nest.

After an accident like this, several things typically need to be done:

    1. Issue a detailed incident report, clarifying who is responsible and the severity level of the accident.

    2. Conduct a review (retrospective) of the accident.

    3. Propose optimization and improvement measures at both the technical and management levels to avoid similar problems in the future. In general, the lower-level the development you do, the higher the demands on your technical ability and the rigor of your work.

Through this Station B case, you in front of the screen can also absorb some experience and calmly handle challenging work.

2. LIMIT 1000000,10 loads very slowly. How can it be optimized?

There are many solutions to this problem; try to cover it as comprehensively as possible when answering.

1. If the ids are consecutive, you can use this method directly:

select * from `order` where id > 1000000 limit 10

This approach filters the data first and then applies the limit, which effectively improves query efficiency.

2. Solve it with ORDER BY plus an index:

select * from `order` order by id limit 1000000,10

Note that id must be an indexed column. Sorting via the index and then applying the limit also reduces the amount of computation.

3. From a business perspective, limit the number of pages. No real user flips through 1,000,000 pages to find data; if your boss were asked to page through 1,000,000 records, he would probably fire you the next day. In practice, search is usually used to optimize this kind of lookup instead.

That is how to answer this question. In an interview you don't have to stay entirely within the interviewer's framing; you can think outside the box.

3. How to implement batch membership expiration?

"There is a membership table with 2 million rows. Each member has a different expiration time. We want to send an email reminder to renew before expiration." How would you implement this?

Problem analysis

For scenario questions like this, I suggest you don't rush to answer. First calm down and think carefully about what abilities the interviewer hopes to probe with this question, and what pitfalls it contains.

Pay special attention to one point: make sure you understand the interviewer's question. If you don't, ask for clarification.

Clearly, there are several keywords in this question:

    1. 2 million rows means a relatively large amount of data.

    2. Each member has an expiration time, so we need to be able to filter out members who are about to expire.

Clearly, if you filter directly with a SELECT statement you will fall into a trap, because there will be performance problems. Let's look at some relatively reasonable answers.

Candidate answers

The first approach

The system does not actively poll; instead, a check is triggered when the user logs in. If the member's remaining time is below a set threshold, a pop-up window and an email reminder are triggered. This approach avoids polling entirely and puts no pressure on the database or backend applications. The drawback is that if a user never logs in, the expiration is never detected, and renewal reminders cannot be sent ahead of time according to the operations strategy.
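The login-triggered check above can be sketched as follows. This is a minimal illustration: the 7-day threshold, the member record shape, and the notify callback are all assumptions, not anything prescribed by the question.

```python
import time

REMIND_THRESHOLD_SECONDS = 7 * 24 * 3600  # assumed: remind within the last 7 days

def on_login(member, notify, now=None):
    # Called from the login flow; fires the reminder only when the
    # membership is still valid but close to expiring.
    now = time.time() if now is None else now
    remaining = member["expires_at"] - now
    if 0 < remaining <= REMIND_THRESHOLD_SECONDS:
        notify(member["id"])      # pop-up / email reminder
        return True
    return False

sent = []
member = {"id": 42, "expires_at": 1000.0 + 3 * 24 * 3600}  # expires in 3 days
print(on_login(member, sent.append, now=1000.0))  # True – reminder sent
```

Note that the check costs one comparison per login, which is why this option puts no load on the database beyond the login itself.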

The second approach

We can use a search engine such as Solr or Elasticsearch: store a copy of each member's ID and expiration time from the membership table in the search engine. Search engines excel at fast retrieval over large data sets and offer high scalability and reliability, which makes them well suited to processing data at this scale.

The third approach

This can be implemented with Redis. When a user becomes a member, store the member ID in Redis and set an expiration time on that key. Then use Redis's expiration notification feature: change the configuration item notify-keyspace-events to "Ex". When a key in Redis expires, a key-expiration event is published, and we can subscribe to this event to handle it.
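A sketch of the subscription side, assuming redis-py with decode_responses=True and a "member:&lt;id&gt;" key-naming scheme (both assumptions, not part of the question). Only the pure helpers are shown runnable; the wiring to a live server is left as comments.

```python
def expired_channel(db=0):
    # Keyspace-event channel Redis publishes expired-key events on.
    return f"__keyevent@{db}__:expired"

def member_id_from_key(key):
    # Extract the member ID from keys shaped like "member:<id>" (assumed scheme).
    prefix = "member:"
    return key[len(prefix):] if key.startswith(prefix) else None

def handle_event(message, remind):
    # redis-py delivers pub/sub messages as dicts; "data" holds the expired key
    # (a str here because decode_responses=True is assumed).
    if message.get("type") not in ("message", "pmessage"):
        return
    member_id = member_id_from_key(message["data"])
    if member_id is not None:
        remind(member_id)

# Wiring it up against a live Redis server (illustrative, not executed here):
# import redis
# r = redis.Redis(decode_responses=True)
# r.config_set("notify-keyspace-events", "Ex")
# r.set("member:42", 1, ex=30 * 24 * 3600)   # key expires when membership does
# p = r.pubsub()
# p.subscribe(expired_channel())
# for msg in p.listen():
#     handle_event(msg, remind=lambda mid: print("remind member", mid))
```

One caveat worth raising in an interview: Redis expiration events are fire-and-forget pub/sub messages, so a subscriber that is down when a key expires misses the event; production use needs a compensating scan.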

The fourth approach

You can use the delay queue provided by MQ directly. When a user activates a membership, compute the member's expiration time and send a delayed message to MQ. Once the message's delay elapses, a consumer consumes it and triggers the expiration reminder.
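The delayed-message idea can be modeled in-process with a heap, as a sketch only: a real system would use an MQ's delay feature (for example RocketMQ delayed messages, or RabbitMQ TTL plus dead-letter exchange). The message shapes here are illustrative.

```python
import heapq
import itertools
import time

class DelayQueue:
    # In-process stand-in for an MQ delay queue: messages become visible
    # to consumers only after their delay elapses.
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: never compare payloads

    def send(self, message, delay_seconds, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now + delay_seconds, next(self._seq), message))

    def poll_due(self, now=None):
        # Pop and return every message whose delay has elapsed.
        now = time.time() if now is None else now
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due

q = DelayQueue()
q.send({"member_id": 42, "action": "renewal_reminder"}, delay_seconds=0, now=0.0)
q.send({"member_id": 7, "action": "renewal_reminder"}, delay_seconds=3600, now=0.0)
print(q.poll_due(now=1.0))  # only the already-due reminder is delivered
```

The MQ-based version inherits the broker's durability, so unlike the Redis notification approach, a reminder is not lost if the consumer is briefly down.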

4. What is idempotence? How to solve the idempotence problem?

Idempotence is originally a mathematical concept. In computer programming, a method is idempotent if executing it repeatedly has the same effect as executing it once.

We need to consider idempotence because, in network communication, two behaviors can cause an interface to be executed repeatedly:

1. A user's repeated submission, or a malicious attack, causes the same request to be executed multiple times.

2. In a distributed architecture, to avoid data loss caused by network failures, service-to-service communication is usually designed with a timeout-retry mechanism, and that mechanism can cause the server's interface to be called repeatedly.

Therefore, any interface that performs data-changing operations needs to guarantee idempotence.

The core idea of idempotence is to ensure the interface's effect happens only once: even if it is called again later, it has no further impact on the data. Based on this requirement, there are several common solutions.

1. Use a database unique constraint. For data-insertion scenarios such as creating an order, the order number must be unique, so duplicate calls trigger the database's unique-constraint violation, preventing one request from creating multiple orders.

2. Use the SETNX command provided by Redis. For example, in an MQ consumption scenario, to prevent repeated consumption from modifying data multiple times, write the message's ID into Redis via SETNX when the message is received; once a message has been consumed, it will not be consumed again.
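A sketch of SETNX-based idempotent consumption. A dict stands in for Redis here purely for illustration; with redis-py you would call r.set(key, 1, nx=True, ex=ttl) and process only when it returns True. The key scheme and message shape are assumptions.

```python
class FakeRedis:
    # In-memory stand-in for Redis, supporting only what the sketch needs.
    def __init__(self):
        self._store = {}

    def setnx(self, key, value):
        # Returns True only for the first writer, like Redis SETNX.
        if key in self._store:
            return False
        self._store[key] = value
        return True

def consume(r, message, process):
    # Only the first delivery of a given message id gets processed.
    if r.setnx(f"mq:consumed:{message['id']}", 1):
        process(message)
        return True
    return False  # duplicate delivery: skipped

r = FakeRedis()
handled = []
msg = {"id": "order-1001", "body": "pay-success"}
consume(r, msg, handled.append)
consume(r, msg, handled.append)   # redelivery of the same message
print(len(handled))  # processed exactly once
```

In production you would also set a TTL on the dedup key so the marker set does not grow without bound.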

3. Use a state machine. A state machine describes allowed state transitions, such as an order's status. Because the status only moves forward, even if the same record receives the same update multiple times, the transition takes effect only once; replays of that transition have no further impact on the data.
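The state-machine idea can be sketched in a few lines. The status names and the strictly-forward rule are illustrative assumptions; real order flows may allow branches such as cancellation.

```python
# Assumed linear order flow; real systems may branch (e.g. "cancelled").
ORDER_FLOW = ["created", "paid", "shipped", "completed"]

def advance(order, target):
    # Apply the transition only if it moves the order strictly one step forward.
    cur = ORDER_FLOW.index(order["status"])
    tgt = ORDER_FLOW.index(target)
    if tgt == cur + 1:
        order["status"] = target
        return True
    return False  # replay or out-of-order update: ignored

order = {"id": 1, "status": "created"}
advance(order, "paid")             # first call takes effect
repeated = advance(order, "paid")  # replay is a no-op
print(order["status"], repeated)
```

In SQL the same guard is expressed as a conditional update, e.g. UPDATE with "WHERE status = 'created'", checking the affected-row count.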

Of course, besides these, idempotence can also be implemented with a token mechanism, a deduplication table, and other methods. But whatever the method, the ideas fall into two categories:

    • Allow the interface to be called only once, as with unique constraints and Redis-based locking.

    • Ensure the effect on the data is triggered only once, as with state machines and optimistic locking.

5. What are the common rate-limiting algorithms?

Rate-limiting algorithms are a system-protection strategy: their main purpose is to prevent traffic peaks from overwhelming the system and making it unavailable.

1. (As shown in the figure) Counter-based rate limiting is generally used to limit access frequency along a single dimension, for example allowing an SMS verification code to be sent only once every 60s, or capping the number of interface calls. The implementation is very simple: increment a counter on each call and decrement it when processing completes.
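A minimal fixed-window counter limiter, sketched under assumed parameters (1 call per 60s, matching the SMS example above); the explicit `now` parameter just keeps the sketch deterministic.

```python
class CounterLimiter:
    # Fixed-window counter: at most `limit` calls per `window_seconds`.
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.count = 0
        self.window_start = None

    def allow(self, now):
        # When the window has elapsed, start a fresh one and reset the counter.
        if self.window_start is None or now - self.window_start >= self.window:
            self.count, self.window_start = 0, now
        if self.count < self.limit:
            self.count += 1
            return True
        return False

lim = CounterLimiter(limit=1, window_seconds=60)
print(lim.allow(now=0))    # True  – first SMS send allowed
print(lim.allow(now=30))   # False – still inside the 60s window
print(lim.allow(now=61))   # True  – window rolled over
```

The known weakness, addressed by the sliding window below in the article, is the boundary burst: 1 call at t=59 and another at t=61 are both allowed even though they are 2s apart.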

2. (As shown in the figure) Sliding-window rate limiting is also counter-based, but by sliding a time-based window it reduces the boundary problem where bursts at window edges exceed the threshold.

Each time statistics are taken, only the visits within the time slices of the current window need to be counted. The circuit-breaker framework Hystrix in Spring Cloud and Sentinel in Spring Cloud Alibaba both use sliding windows for their statistics.
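A sliding-window limiter sketch (parameters are illustrative). Unlike the fixed counter, it counts only requests inside the last `window` seconds, so the window-boundary burst disappears.

```python
from collections import deque

class SlidingWindowLimiter:
    # At most `limit` requests within any trailing `window_seconds` interval.
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()  # arrival times still inside the window

    def allow(self, now):
        # Drop timestamps that slid out of the window, then check capacity.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

lim = SlidingWindowLimiter(limit=2, window_seconds=1.0)
print(lim.allow(0.0), lim.allow(0.5), lim.allow(0.9))  # True True False
print(lim.allow(1.1))  # True – the request at t=0.0 slid out of the window
```

Hystrix and Sentinel use coarser bucketed windows rather than per-request timestamps, trading a little precision for constant memory.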

3. (As shown in the figure) The leaky bucket algorithm is a constant-rate limiting algorithm: no matter how many requests arrive, the server processes them at a constant rate. The producer-consumer model built on MQ is effectively a leaky-bucket rate limiter.
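A leaky-bucket sketch under illustrative parameters: requests queue in a fixed-size bucket and leak out at a constant rate, so downstream load stays smooth however bursty the arrivals are.

```python
class LeakyBucket:
    # `capacity` requests may queue; they drain at `leak_rate_per_sec`.
    def __init__(self, capacity, leak_rate_per_sec):
        self.capacity = capacity
        self.rate = leak_rate_per_sec
        self.water = 0.0   # currently queued work
        self.last = 0.0    # time of the last update

    def allow(self, now):
        # Leak at a constant rate since the last request, then try to queue.
        self.water = max(0.0, self.water - (now - self.last) * self.rate)
        self.last = now
        if self.water + 1 <= self.capacity:
            self.water += 1
            return True
        return False  # bucket full: request rejected (or made to wait)

bucket = LeakyBucket(capacity=2, leak_rate_per_sec=1.0)
print(bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0))  # True True False
print(bucket.allow(1.0))  # True – one unit leaked out during that second
```

The MQ analogy in the paragraph above maps directly: the queue is the bucket, and the consumer's fixed consumption speed is the leak rate.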

4. (As shown in the figure) The token bucket algorithm, compared with the leaky bucket, can handle bursts of traffic.

Its core idea: tokens are generated at a constant rate and saved in a bucket of fixed size; when the bucket is full, newly generated tokens are discarded. Each incoming client request must obtain a token from the bucket before proceeding; otherwise it queues and waits. During traffic lulls, tokens accumulate in the bucket, so when an instantaneous peak arrives there are enough saved tokens to serve it. This is how the token bucket absorbs bursts of traffic.

The token bucket algorithm can be used for rate limiting at the gateway level or on interface calls. Google's Guava and Redisson both implement their rate limiters with token buckets.
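A token-bucket sketch of the idea behind Guava's RateLimiter and Redisson's RRateLimiter (parameters illustrative, and simplified to whole tokens): tokens accrue at a fixed rate up to the bucket size, so banked tokens absorb short bursts.

```python
class TokenBucket:
    # Tokens refill at `refill_rate_per_sec`, capped at `capacity`.
    def __init__(self, capacity, refill_rate_per_sec):
        self.capacity = capacity
        self.rate = refill_rate_per_sec
        self.tokens = float(capacity)  # start with a full bucket
        self.last = 0.0

    def allow(self, now):
        # Credit tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

tb = TokenBucket(capacity=3, refill_rate_per_sec=1.0)
# A burst of 4 at t=0: the 3 banked tokens absorb 3, the 4th is rejected.
print([tb.allow(0.0) for _ in range(4)])  # [True, True, True, False]
print(tb.allow(2.0))  # True – 2 tokens refilled over the 2 idle seconds
```

Compare with the leaky bucket: both cap the long-run rate at the refill/leak rate, but only the token bucket lets a burst of up to `capacity` requests through at once.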

The essence of rate limiting is system protection. Its final effect depends on the accuracy of the statistics on one hand, and on the other on choosing a limiting dimension that matches the needs of the scenario.

   


Origin blog.csdn.net/gnwu1111/article/details/132690986