Didi 2023.11.27 P0-level failure: a technical review (is k8s to blame?)

This article reviews the incident from a technical perspective, based on Didi's official announcements and posts from its technical WeChat public account.

Table of contents

1. Background

2. Didi official news

3. Problem analysis and localization

4. K8s and analysis of information circulating online

5. Reflections prompted by k8s: drawing broader lessons and how to avoid a recurrence

6. Review of recent outages on other platforms


1. Background

At about 10 pm on November 27, Didi Chuxing suffered a large-scale technical failure. Users hit many problems in Didi's apps and mini-programs, including slow responses in the ride-hailing function, Qingju Bicycle QR codes that would not scan, and ride-hailing coupons that could not be claimed. By the next morning, Didi's services had largely returned to normal.

According to feedback on Weibo, the following issues appeared:

  • The network loaded abnormally and orders could not be placed;

  • Data was chaotic: a single order was dispatched to four drivers at once;

  • Data display and order status were incorrect, with problems canceling and paying for orders;

  • The order-scheduling logic went wrong, with drivers receiving orders from 2,000 kilometers away;

  • The order billing went wrong, with an 8-kilometer trip showing a charge of 1,540 yuan;

  • The problems spanned Didi Chuxing, Xiaoju Charging, Didi Refueling, and Qingju Bicycle;

  • Didi's intranet also had problems, and employees could not use internal services normally.

By this point, the Didi outage had shot to the top of Weibo's trending searches.

2. Didi official news

3. Problem analysis and localization

On November 29, Didi Chuxing apologized again for the system failure on the night of the 27th, proposed remedial measures and a compensation plan, and announced the preliminary findings of its investigation: the cause was a failure in the underlying system software, not an "attack" coming from the Internet.

In this incident, the platform's functions were almost completely paralyzed, and even the core ride-hailing function took nearly 12 hours to recover. One can infer that this was not a bug in a single application feature; otherwise the impact would not have been so broad and recovery would not have been so slow. Didi also stated that the underlying system software was at fault, which rules out a server hardware problem. It is therefore reasonable to guess that the issue lay in the underlying software of its cloud servers.

Didi runs a huge set of business lines, and its underlying systems are composed of complex software and hardware, including servers, network equipment, databases, and other critical components. A failure in any one link can bring down the whole system and leave users unable to use the service.

Security experts at 360 suggested six possible technical causes behind Didi's outage:

  1. Programming errors, logic errors, or unhandled exceptions introduced during a system update or upgrade: Internet companies normally release updates at night, and a routine business upgrade is usually a contained change. The fact that Didi's entire platform and all of its businesses went down at once suggests the problem was in its own "home" systems.
  2. Server failure: for example, problems with the constant-temperature and constant-humidity environment in Didi's core data center could cause servers to overheat and CPUs to burn out, or a natural disaster such as an earthquake, flood, or tsunami could strike the data center's location. In that case hardware would have to be replaced and the software on it reconfigured, so recovery would take a long time; however, this possibility is relatively small.
  3. Third-party service failure: Didi's backend architecture may use third-party services or components, and a problem with a third party could also affect Didi's normal operation. However, for security reasons Didi is unlikely to entrust its core business to a third party, so this possibility is also small.
  4. Other network security issues: since Didi has officially stated that it was not attacked, the remaining three attack-related speculations can be ruled out.

Personal analysis:

  • Since the official statement says the underlying system software failed, a server (hardware) failure can be ruled out. A company of Didi's size would also source any third-party components from major vendors, and if such a component had failed, other companies using it would have had problems too; nothing of the sort has appeared so far, so that can be ruled out as well. What remains is that something went wrong while upgrading the underlying system software.

4. K8s and analysis of information circulating online

The analysis above points to a problem during an upgrade of the underlying system software. Judging from a screenshot circulating online, k8s fits this inference.

Browsing the Didi Technology official WeChat account, I found an article published on 2023-10-17, "Scheduling Practice of Didi Elastic Cloud Based on K8S", which shows that Didi's engineers were upgrading their k8s clusters: a cross-version upgrade from k8s 1.12 to 1.20, with a single cluster already exceeding 5,000 nodes. If something goes wrong, the blast radius is not small.

Two options were given:

1. Replacement-upgrade solution:

Run two k8s clusters, 1.12 and 1.20, building a brand-new set of 1.20 masters and peripheral components;

Create the same business workloads in the 1.20 cluster as exist in 1.12, i.e. the StatefulSets (sts) and pods;

Control traffic distribution through an upstream traffic-control application: at first all traffic stays on the pods of the 1.12 sts, then it is gradually switched to the pods of the 1.20 sts (a minimal sketch of this ramp follows below);

When a problem appears, traffic can be switched back quickly.
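To make the traffic-shifting step concrete, here is a minimal sketch of a weighted ramp between the two clusters, assuming a hypothetical upstream router that picks a cluster per request. The cluster names, weights, and ramp schedule are illustrative assumptions, not Didi's actual traffic-control system.

```python
import random

# Hypothetical sketch of the gradual traffic switch in the replacement-upgrade
# plan: an upstream router sends a configurable fraction of requests to the new
# 1.20 cluster and the rest to the old 1.12 cluster. Names are illustrative.

OLD_CLUSTER = "k8s-1.12"   # existing cluster serving all traffic initially
NEW_CLUSTER = "k8s-1.20"   # newly built cluster with identical sts/pods

def pick_cluster(new_weight: float) -> str:
    """Route one request: new_weight is the fraction sent to the new cluster."""
    return NEW_CLUSTER if random.random() < new_weight else OLD_CLUSTER

def ramp_schedule():
    """Ramp traffic to the new cluster in small steps; any step can be reverted."""
    yield from (0.01, 0.05, 0.20, 0.50, 1.00)

if __name__ == "__main__":
    for weight in ramp_schedule():
        sample = [pick_cluster(weight) for _ in range(10_000)]
        share = sample.count(NEW_CLUSTER) / len(sample)
        print(f"target={weight:.0%} observed={share:.1%}")
        # In a real rollout, health checks on the 1.20 pods would gate the next
        # step, and setting the weight back to 0.0 is the instant rollback path.
```

The attraction of this plan is visible in the last comment: rollback is just a routing change, with no modification to either cluster.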

2. In-place upgrade solution:

There is only one k8s cluster; the master and peripheral components are upgraded in place from 1.12 to 1.20;

The nodes in the cluster, i.e. the kubelets, are upgraded gradually from 1.12 to 1.20;

No operations are performed on the business workloads: the sts and pods are not rebuilt and traffic is not explicitly shifted; as nodes are upgraded, traffic naturally drifts from 1.12 to 1.20;

When a problem appears, the affected nodes' kubelets must be partially rolled back; if there is a global risk, the master and peripheral components must be fully rolled back (a batch-by-batch sketch of this flow follows below).
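Below is a minimal sketch of what the batch-by-batch in-place flow and its partial rollback might look like. The helpers upgrade_kubelet, health_check, and rollback_kubelet are hypothetical placeholders for real node-management tooling; none of this is Didi's actual upgrade code.

```python
from typing import List

# Hypothetical sketch of the in-place upgrade flow: kubelets are upgraded batch
# by batch from 1.12 to 1.20, and a failed batch triggers a partial rollback of
# just those nodes. The helper functions below are placeholders, not real tools.

def upgrade_kubelet(node: str) -> None:
    print(f"[upgrade] {node}: kubelet 1.12 -> 1.20")

def rollback_kubelet(node: str) -> None:
    print(f"[rollback] {node}: kubelet 1.20 -> 1.12")

def health_check(node: str) -> bool:
    # Placeholder: in reality this would confirm the node is Ready and its
    # pods are still serving traffic after the kubelet restart.
    return True

def upgrade_in_place(nodes: List[str], batch_size: int = 50) -> None:
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        for node in batch:
            upgrade_kubelet(node)
        if not all(health_check(n) for n in batch):
            # Partial rollback: only the failing batch is reverted. A global
            # risk would require rolling back the master and peripheral
            # components as well, which is the hard part of this plan.
            for node in batch:
                rollback_kubelet(node)
            raise RuntimeError(f"batch starting at node index {i} failed; halted")

if __name__ == "__main__":
    upgrade_in_place([f"node-{i:04d}" for i in range(200)], batch_size=50)
```

The risk the article hints at is visible here: a failed batch can be rolled back node by node, but a problem in the shared master or peripheral components has no equally cheap escape hatch.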

From the perspective of implementation effort and cost, Didi ultimately chose the in-place upgrade. But what if it spirals out of control? What if the rollback fails?

We still do not know many of the details behind why the in-place upgrade was chosen, but the approach is indeed somewhat aggressive, and it is hard to handle once problems arise.

5. Reflections prompted by k8s: drawing broader lessons and how to avoid a recurrence

  1. There is no need to adopt the newest technology just to show off. k8s container orchestration is powerful and solves many microservice deployment problems, but today's k8s is heavy and demands strong operations expertise; once problems arise, they are very hard to resolve. The right-sized technology choice is the best one: for scenarios where QPS and availability requirements are not especially high, running Docker on a high-performance server is enough, and in some scenarios even Docker is unnecessary and a plain Tomcat deployment will do.
  2. Don't put all your eggs in one basket. Deploy multiple clusters in the production environment. Even for an in-place upgrade, the rollout can be grayscaled cluster by cluster, so that if the grayscale cluster has a problem, another cluster can still carry the traffic (see the sketch after this list).
  3. Fault drills are usually partial, shutting down individual modules; cluster-level and data-center-level fault drills should also be arranged. After all, this is a national-scale application.
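As a rough illustration of point 2, the sketch below routes a configurable share of traffic to a grayscale cluster and falls back entirely to the stable cluster when the grayscale cluster is unhealthy. The cluster names and the health flag are assumptions for illustration only, not Didi's real setup.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    healthy: bool = True

def choose_cluster(stable: Cluster, gray: Cluster,
                   gray_ratio: float, request_id: int) -> Cluster:
    """Send gray_ratio of requests to the grayscale cluster, never if unhealthy."""
    if not gray.healthy:
        return stable
    return gray if (request_id % 100) < gray_ratio * 100 else stable

if __name__ == "__main__":
    stable = Cluster("cluster-a-1.12")
    gray = Cluster("cluster-b-1.20")
    # Normal operation: 10% of traffic exercises the upgraded cluster.
    print(choose_cluster(stable, gray, 0.10, request_id=7).name)
    # Simulated failure in the grayscale cluster: everything falls back.
    gray.healthy = False
    print(choose_cluster(stable, gray, 0.10, request_id=7).name)
```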

6. Review of recent outages on other platforms

  1. On October 23, Yuque (an online document editing and collaboration tool) suffered a server failure, and neither its online documents nor its official website could be opened. At 15:00 that day, Yuque issued an official statement: "Due to a network failure, access is currently unavailable. This failure does not affect the data users have stored in Yuque and will not cause data loss. We are carrying out emergency recovery and apologize again for the losses caused to you."
  2. At about 5 pm on November 12, Alibaba Cloud experienced an anomaly, and topics such as "Taobao crashed again", "Xianyu crashed", "Alibaba Cloud crashed", and "DingTalk crashed" successively hit Weibo's trending searches. The cause: starting at 17:44 on November 12, 2023, access to Alibaba Cloud product consoles and API calls became abnormal, and Alibaba Cloud engineers intervened urgently to troubleshoot. Service returned to normal at around 7:20 that evening.
  3. Alibaba Cloud's second incident occurred on November 27. Alibaba Cloud stated that starting at 09:16 on November 27, its monitoring detected abnormal access to the consoles and OpenAPI of database products (RDS, PolarDB, Redis, etc.) in the Beijing, Shanghai, Hangzhou, Shenzhen, Qingdao, Hong Kong, US East, and US West regions, though running instances were not affected. After emergency handling by engineers, access was restored at 10:58 that day.

These frequent outages are a reminder to everyone that technical risk management and high-availability architecture design are critical: ensure data backups and system fault tolerance, for example by adding off-site disaster recovery for storage systems so that services can be restored quickly, holding regular disaster-recovery drills, and shrinking the grayscale scope of operations changes. The quality assurance and testing of operations tooling should also be strengthened to prevent this kind of operations bug from happening again.


Reprinted from: blog.csdn.net/caoyuan666/article/details/134721748