Taking stock of the top ten "strange scenes" of downtime accidents in 2023

Famous scene? Hell scene!

Here are the top ten "strange scenes" of downtime accidents in 2023:


Bilibili (Station B) crashed twice

At around 20:20 on the evening of March 5, 2023, many netizens reported that Bilibili's video details page could not be opened on either mobile phones or computers, and that favorites and viewing history could not be viewed on mobile. Some netizens said the home page loaded normally but was displayed entirely in traditional Chinese characters.

On the evening of August 4, five months after the previous accident, many netizens reported that images (video covers) on Station B could not be loaded, videos could not be opened, and playback kept buffering.

 

Tencent’s “3.29” first-level accident

In the early morning of March 29, 2023, Tencent's WeChat and QQ services crashed, and many functions were unavailable, including WeChat voice calls, Moments, WeChat Pay, QQ file transfer, Qzone and QQ Mail.

It was not until the morning of the 29th that Tencent's WeChat team responded that the system was gradually recovering after engineers made emergency repairs.

This accident was caused by a failure of the cooling system in the computer room of Guangzhou Telecom. Tencent defined it as a company-level accident and punished a large number of relevant leaders.

On April 12, the Information and Communications Administration Bureau of the Ministry of Industry and Information Technology heard Tencent's report on the "3.29" WeChat service abnormality and required Tencent to further improve its production safety management system, implement network operation guarantee measures, resolutely avoid major production safety accidents, and effectively improve the safe and stable operation of public services.

 

Vipshop 329 Incident Punishment Result: The person in charge of the Basic Platform Department was dismissed

On March 29 this year, "Vipshop collapsed" became a trending search topic. Because the outage lasted so long, many consumers were unable to place orders normally. Vipshop's official response stated that, due to a short-term system failure, the main site's "add to cart" and other functions might behave abnormally.

On June 5, Vipshop issued an "Announcement on Troubleshooting of 329 Computer Room Downtime". The announcement stated that on March 29 (00:14-12:01), the refrigeration system of the Nansha IDC failed, causing the temperature of the equipment in the computer room to rise rapidly until the equipment went down and the online mall stopped serving customers. The accident lasted about 12 hours, caused Vipshop performance losses of more than 100 million yuan, and affected 8 million customers. Vipshop classified the failure as a P0-level failure. It is understood that P0 is the highest accident level, covering cases such as crashes, inaccessible pages, failure of the main process, unimplemented main functions, or a very large impact (even if the bug itself is not serious).

The announcement pointed out that Vipshop decided to deal with this incident seriously. The direct managers of the corresponding departments will bear the responsibility for the accident, and the person in charge of the basic platform department will be dismissed and handled accordingly.
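
As a quick check on the "about 12 hours" figure, the downtime window quoted in the announcement (00:14-12:01 on March 29) can be computed directly. This is only an illustrative sketch: the variable names are hypothetical, and the times are assumed to be on the same day in the same time zone, as the announcement implies.

```python
from datetime import datetime

# Start and end of the outage window as quoted in the announcement.
# Assumption: both times fall on March 29, 2023, in the same time zone.
start = datetime(2023, 3, 29, 0, 14)
end = datetime(2023, 3, 29, 12, 1)

downtime = end - start  # a timedelta

# Split the total duration into whole hours and remaining minutes.
hours, remainder = divmod(int(downtime.total_seconds()), 3600)
minutes = remainder // 60

print(f"Downtime: {hours} h {minutes} min")
```

The window works out to 11 hours 47 minutes, which is consistent with the roughly 12 hours reported.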

 

Microsoft Azure outage, 17 production-level databases deleted

On May 24, Microsoft Azure DevOps experienced a failure in a scale unit in southern Brazil, resulting in approximately 10.5 hours of downtime. Microsoft principal software engineering manager Eric Mattingly subsequently apologized for the failure and revealed the cause of the outage: a simple typo had caused 17 production-level databases to be deleted.


 

China Telecom encounters massive service outage problem

On the afternoon of June 8, 2023, China Telecom's network and communication services experienced failures such as no signal. Most of the users who reported feedback were in the Guangdong area, and it was suspected that the fault was in Guangdong Province.

Afterwards, China Telecom's customer service responded that telecom base stations across the province (Guangdong Telecom) had failed and calls could not be made for the time being. It asked users to wait patiently, said the fault was being handled urgently, and apologized for the inconvenience.

It took about 4 hours to fully restore the telecommunications network in Guangdong Province.

 

Yuque 10.23 major service failure, lasting 7 hours

On October 23, 2023, Yuque experienced a major service failure, which took more than 7 hours to fully recover. The Yuque team later announced the cause of the failure and its handling process:

On the afternoon of October 23, while Yuque's data storage operation and maintenance team was performing an upgrade, a bug in the new operation and maintenance upgrade tool caused the production storage servers in East China to be taken offline by mistake. As a result, Yuque's data service suffered a serious failure, causing widespread service interruption.

 

Alibaba Cloud 11.12 has a major service failure, affecting all products

On the afternoon of November 12, 2023, Alibaba Cloud suffered a serious failure, affecting all products.

Later, Alibaba Cloud officially confirmed that the cause of the failure was related to an underlying service component. After about 5 hours, Alibaba Cloud announced that all affected cloud products had been restored. It added that, because of the failure, the data of some cloud products (such as monitoring and billing) might be delayed, but business operations would not be affected.

 

Didi 11.27 system service failure, technical team repaired it overnight

On the evening of November 27, 2023, Didi's App service became abnormal due to a system failure: locations were not displayed and rides could not be hailed. That evening, Didi Chuxing responded: "We are very sorry. Due to a system failure, the Didi App service experienced an abnormality tonight. After emergency repairs by our technical colleagues, it is currently being restored."

On the morning of November 28, 2023, Didi Chuxing announced that online ride-hailing and other services had been restored, and that bike-sharing and other services were being gradually repaired. When Didi issued the announcement, however, reporters who tried to hail rides through Didi in Shanghai, Shenzhen and other cities found that the ride-hailing function had not recovered: the network loaded abnormally and cars still could not be booked. On November 28, Didi told reporters that the online ride-hailing service had resumed and that the rights and benefits of drivers and passengers were being gradually restored.

On November 29, Didi issued another apology, saying the cause of the accident had initially been determined to be a malfunction of the underlying system software.

 

Twitter is seriously down, Musk is furious

In February 2023, Musk urgently summoned about 80 people late at night to fix algorithm problems because his tweets about the Super Bowl were not as popular as those of US President Biden.

In March, a configuration change made by an engineer caused Twitter to experience a serious outage, and Musk threatened to refactor the entire codebase.

In July, users reported that the platform was having problems again: they could not publish new tweets and received a "limit exceeded" error message. Musk responded that Twitter was working hard to deal with "extreme levels of data scraping" and "system manipulation," and that the new restrictions were important measures to curb these pressing issues.

 

ChatGPT service was interrupted for nearly 2 hours, CEO Altman apologized: the traffic far exceeded expectations

At around 22:00 on the evening of November 8th, Beijing time, OpenAI's ChatGPT and related APIs experienced an outage, causing services for users and developers to be unavailable for nearly 2 hours.

Subsequently, OpenAI updated its incident report, stating that it had identified an issue causing a high error rate in the API and ChatGPT and was working hard to fix it.

At the same time, OpenAI CEO Sam Altman publicly apologized, saying that usage of the new features released that week had far exceeded expectations. The company had originally planned to enable the GPTs service for all subscribers on Monday, but that was not yet possible. He said the service might be unstable in the short term because of the load, and apologized to users for the situation.

 

Further reading: The Cyberspace Administration of China released the "Measures for the Management of Cybersecurity Incident Reporting (Draft for Comments)"


For more reviews of the year's major events, check out the "2023 China Open Source Developer Report".


Origin www.oschina.net/news/273501