The cause of Alibaba Cloud's large-scale outage, and popular comments from netizens

Compiled by: Things About Programmers (ID: iProgrammer). Sources: Yuntoutiao, Zhihu

On November 12, 2023, Alibaba Cloud suffered an epic outage with wide-ranging impact.

Recently, a "fault analysis report" that Alibaba Cloud sent to customers was leaked online.

Scope of impact

1. Some services of OSS, OTS, SLS, MNS and other products were affected, while most products such as ECS, RDS and networking continued to run normally.

2. The cloud product console, management APIs and related functions were affected.

Time

From 17:39 to 19:20 on November 12, 2023. The outage lasted 1 hour and 41 minutes.

Problem overview

Starting at 17:39 on November 12, 2023, access to the Alibaba Cloud product console and management API calls became abnormal, as did access to some cloud product services. Engineers determined that the failure was related to an exception in the Access Key Service (AK). After correcting the whitelist version, they restarted the AK service in batches. Recovery began at 18:35, and the product consoles and management APIs in most Regions were restored by 19:20.

Handling process

17:39: Access to the Alibaba Cloud product console and management API calls became abnormal.

17:50: Engineers confirmed that the fault was caused by an abnormality in the AK service, which disrupted the cloud product console and management API calls, as well as the cloud product services that depend on the AK service.

18:01: Engineers located the root cause.

18:07: Recovery measures began, including correcting the whitelist version and restarting the AK service.

18:35: Hangzhou and other regions began to return to normal.

19:20: Cloud product consoles and management API calls in most Regions had returned to normal.
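
As a quick check on the timeline, the following sketch computes how many minutes had elapsed at each milestone. The timestamps come from the report; the short labels and the helper name `minutes_since_start` are only illustrative.

```python
from datetime import datetime

# Timestamps are taken from the report; the labels are paraphrased for brevity.
timeline = [
    ("17:39", "fault begins"),
    ("17:50", "AK service identified as the cause"),
    ("18:01", "root cause located"),
    ("18:07", "recovery measures start"),
    ("18:35", "first regions recover"),
    ("19:20", "most regions restored"),
]


def minutes_since_start(clock: str, start: str = "17:39") -> int:
    """Return whole minutes between two same-day HH:MM clock times."""
    fmt = "%H:%M"
    delta = datetime.strptime(clock, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


elapsed = {label: minutes_since_start(clock) for clock, label in timeline}

for clock, label in timeline:
    print(f"{clock}  +{elapsed[label]:>3} min  {label}")
```

By this count, about 28 minutes passed before recovery measures started, and the last milestone falls 101 minutes in, which matches the stated 1 hour and 41 minutes.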

Cause

The Access Key Service (AK) encountered a read exception while reading whitelist data. Because of a logic flaw in the code that handles this read exception, an incomplete whitelist was generated, so valid requests that were not in the incomplete whitelist failed. This affected the cloud product console and management API services. At the same time, some products that depend on the AK service saw some of their services behave abnormally because of the incomplete whitelist.
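
The report does not publish the actual code, so the following is only a minimal sketch of the class of bug it describes, under stated assumptions: a loader that swallows a read error and publishes whatever it had read so far, next to a safer variant that falls back to the last complete whitelist. All names here (`WhitelistReadError`, `flaky_source`, `load_flawed`, `load_safe`, the `ak-*` entries) are hypothetical.

```python
from typing import Iterable, Iterator, Set


class WhitelistReadError(Exception):
    """Raised when the (hypothetical) whitelist source cannot be read."""


def flaky_source(entries: Iterable[str], fail_after: int) -> Iterator[str]:
    """Yield entries, then fail part-way through to simulate a read exception."""
    for i, entry in enumerate(entries):
        if i == fail_after:
            raise WhitelistReadError(f"read failed at entry {i}")
        yield entry


def load_flawed(source: Iterator[str]) -> Set[str]:
    """Flawed handling: swallow the error and return the partial result."""
    whitelist: Set[str] = set()
    try:
        for entry in source:
            whitelist.add(entry)
    except WhitelistReadError:
        pass  # the partial set is published as if it were complete
    return whitelist


def load_safe(source: Iterator[str], last_known_good: Set[str]) -> Set[str]:
    """Safer handling: on a read error, keep the previous complete whitelist."""
    whitelist: Set[str] = set()
    try:
        for entry in source:
            whitelist.add(entry)
    except WhitelistReadError:
        return last_known_good
    return whitelist


full = ["ak-console", "ak-oss", "ak-sls", "ak-mns"]  # hypothetical entries
flawed = load_flawed(flaky_source(full, fail_after=2))
safe = load_safe(flaky_source(full, fail_after=2), last_known_good=set(full))

print(sorted(flawed))  # only the entries read before the failure
print(sorted(safe))    # the previous complete whitelist
```

In the flawed variant, any valid key that happened to come after the failure point is silently missing, which is how legitimate requests end up rejected.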

Improvement measures

1. Add verification, alerting and interception for the results of AK service whitelist generation.

2. Add grayscale (gray-release) verification logic for AK service whitelist updates, so that abnormalities are detected in advance.

3. Improve the ability to quickly recover the AK service whitelist.

4. Strengthen coordinated recovery capabilities on the cloud product side.
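
The first three measures can be sketched roughly as follows. This is an illustration under assumptions, not Alibaba Cloud's implementation: the function names, the 10% shrink threshold, the node names and the `probe` check are all hypothetical. The idea is to reject a generated whitelist that looks implausible, apply an accepted one batch by batch, and restore a snapshot as soon as a batch fails its check.

```python
from typing import Callable, Dict, List, Set


def validate_whitelist(candidate: Set[str], current: Set[str],
                       max_shrink: float = 0.1) -> bool:
    """Reject an empty candidate, or one that drops too many current entries.

    max_shrink is an assumed threshold: the allowed fraction of removed entries.
    """
    if not candidate:
        return False
    if not current:
        return True
    removed = len(current - candidate)
    return removed / len(current) <= max_shrink


def gray_rollout(candidate: Set[str], current: Set[str],
                 nodes: Dict[str, Set[str]], batches: List[List[str]],
                 probe: Callable[[Set[str]], bool]) -> bool:
    """Apply candidate batch by batch; restore the snapshot on the first failure."""
    if not validate_whitelist(candidate, current):
        return False  # intercepted before any node is touched (alert here)
    snapshot = dict(nodes)  # kept for quick recovery
    for batch in batches:
        for name in batch:
            nodes[name] = candidate
        if not all(probe(nodes[name]) for name in batch):
            nodes.clear()
            nodes.update(snapshot)  # roll every node back
            return False
    return True


def probe(whitelist: Set[str]) -> bool:
    """Hypothetical health check: a known-valid key must still be accepted."""
    return "ak-0" in whitelist


current = {f"ak-{i}" for i in range(10)}
nodes = {"node-a": current, "node-b": current, "node-c": current}
batches = [["node-a"], ["node-b", "node-c"]]

# A badly truncated whitelist is stopped by validation.
ok_truncated = gray_rollout({"ak-0", "ak-1"}, current, nodes, batches, probe)

# A plausible update passes validation and every batch probe.
good = current | {"ak-10"}
ok_good = gray_rollout(good, current, nodes, batches, probe)

# A small but harmful change passes validation, fails the first batch probe,
# and is rolled back.
harmful = (good - {"ak-0"}) | {"ak-11"}
ok_harmful = gray_rollout(harmful, good, nodes, batches, probe)

print(ok_truncated, ok_good, ok_harmful)
```

The size check alone would not catch the third case, which is why the sketch pairs it with a per-batch probe and a snapshot to fall back on.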

Comments

@XYC

Bad news: cost cutting and efficiency drives have reached the deep end.

Good news: Alibaba is sending real talent out into society.

@伊西

One piece of bad news and three pieces of good news.

The bad news is that a failure of epic proportions occurred.

Good news 1: It has the ability to handle epic failures, which other clouds do not have.

Good news 2: A failure with a probability of 3 in a million has now happened, so the next one will be a thousand years away. Everyone can now use it with confidence.

Good news 3: This time it really did hit users' pain points.

@王万德

The aftermath of layoffs.

When layoffs come, front-line workers are always the first to go, leaving behind those who are good at writing PPTs and dare to brag.

Of these, "those who dare to brag" are the best hidden and the most harmful. They often pass themselves off as experts, fool the laymen (in Internet companies, the laymen are the executives), win promotions and raises, and gain "immunity" from layoffs, so they can never be weeded out.

@乐场box

I am also reminded of what the boss of 360 said: when a company gets big, a Dead Sea effect sets in. Those who do the real work are often the first to leave, evaporating away. In the end, everyone left behind, management included, is an old hand who just coasts along.

The original words of Alibaba Cloud's previous CEO were: I don't care about technology. Technology is not valuable. What I care about is cost.

This accident may have filled in Alibaba's last moat.

Origin blog.csdn.net/Ed7zgeE9X/article/details/134472892