Downtime exceeded 12 hours and losses exceeded 100 million yuan: the head of Vipshop's basic platform department was "sacrificed to heaven"


Compiled by | Kexin Zhu

Produced | CSDN Program Life (ID: coder_life)

For back-end programmers, "high concurrency" is nothing new. Only after living through a server outage can a career be called "complete".

But if the outage lasts more than 12 hours, it may bring the career itself "down"!

On March 29, the topic #唯品会崩了# ("Vipshop has crashed") made the hot search list.

Yesterday, the follow-up to the incident arrived.

Vipshop issued an announcement on its handling of the March 29 ("329") data center outage: the major failure at the Nansha data center affected more than 8 million customers and was classified as a P0-level failure, and the person in charge was dismissed.


"Crash" hits the hot search list: losses exceed 100 million yuan, outage lasts 12 hours

The story goes back to the end of March.

On March 29, many netizens reported that Vipshop had "crashed": logging in with a verification code returned a network error, and the login failed.

Vipshop's official Weibo account then stated that, owing to a short-term system failure, functions such as "add to cart" on the main site might behave abnormally.

[Image omitted. Source: screenshot of Weibo]

More than two months later, Vipshop officially responded to the failure.

According to the announcement, the main cause of the major failure was that the cooling system of the Nansha IDC failed. The temperature of the equipment in the data center rose rapidly and the equipment shut down, taking the online mall out of service.

The failure lasted 12 hours, cost the company more than 100 million yuan in business performance, and affected more than 8 million customers. The company classified it as a P0 failure. (P0 is the highest incident level, covering cases such as crashes, inaccessible pages, a broken main flow, unimplemented core functions, or a very wide impact.)

Vipshop also concluded that the incident exposed inadequate disaster recovery emergency plans and risk prevention measures, and decided to handle it seriously: the direct managers of the departments involved bear responsibility for the incident, and the head of the basic platform department was dismissed.

In fact, an incident like the March 29 data center outage is not the first of its kind.

But there is no doubt that, for an e-commerce platform with a large user base, the normal operation of servers and network equipment is critical. Any outage leaves the platform unable to provide normal service, so the causes and effects behind each failure deserve careful study and preventive action.


Tencent's social software was also "implicated"

It is also worth noting that Tencent's social software, including WeChat and QQ, was affected by a data center failure as well: WeChat voice calls, Moments, WeChat Pay, QQ file transfer, Qzone, and QQ Mail were all unavailable.

[Image omitted. Source: screenshot of Weibo]

In response, the Tencent WeChat team posted a message on the morning of March 29: "Early this morning, some users experienced abnormalities when using WeChat and WeChat Pay functions. After repairs by our engineers, the system is gradually recovering. We apologize for the inconvenience."

Tencent also rated the incident internally as a "level-one incident", and several executives were reprimanded, demoted, or dismissed to varying degrees.


Server outages caused by high concurrency occur frequently

With the growth of live-streaming e-commerce platforms, rising user numbers make high-concurrency situations more likely.

In recent years, the servers of major platforms and popular apps have seemed unable to escape freezing, crashing, and even going down.

In the early hours of Double Eleven in 2017, as the enthusiasm of millions of consumers poured into Tmall, a large number of Mobile Taobao and Mobile Tmall users could not pay or change addresses, and saw abnormalities in orders, footprints, favorites, red envelopes, coupons, and more. Tmall's servers did not return to normal until 12:30.

On the evening of October 20, 2021, as Taobao's "Double Eleven" promotion began, many users found that Taobao could not send messages in the customer service chat window or confirm receipt with a click. The topic #淘宝崩了# ("Taobao has crashed") then quickly reached the Weibo hot search list and took first place.

Today, even though Internet technologies have gone through many rounds of iteration, large-scale, long-lasting outages still occur.

On this kind of problem, an earlier CSDN article, "Three outages in one day: why is high concurrency so difficult?", suggested analyzing it from two angles:

  • On the one hand, failures are inevitable. There are human failures (people are error-prone: Human Error) and non-human failures (machine Failure). These are unplanned outages, but there are also planned outages such as releasing new systems, upgrades and maintenance, hardware replacement, and so on. This is the main reason why even leading companies in the industry only state how many 9s they can achieve, rather than 100%.

    At present, all we can do is achieve as many 9s as possible, which requires strong technical support.

    | Level | Availability | Common name | Annual downtime | Supporting measures |
    |---|---|---|---|---|
    | Basic availability | 99% | Two 9s | 3d-15h-39m-29s | Services have redundancy within one data center; simple, basic automated operations |
    | Relatively high availability | 99.9% | Three 9s | 8h-45m-56s | Extensive automated fault-handling tooling, plus solid infrastructure such as control and scheduling systems |
    | Availability with automatic fault recovery | 99.99% | Four 9s | 52m-35s | Multiple data centers in the same region (as with AWS, where each location has three availability zones) |
    | Extremely high availability | 99.999% | Five 9s | 5m-15s | Multiple data centers in different regions, active-active across sites |

  • On the other hand, from the perspective of distributed architecture design, all software in the world has faults. When a fault occurs, everyone hopes first that it will not spread and can be contained, and second that it will be as short as possible.
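One common way to keep a fault from spreading is a circuit breaker: after repeated failures of a dependency, callers stop contacting it for a while and fail fast instead. The article does not name this pattern, so the sketch below is only an illustration under stated assumptions. `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_after` are hypothetical names chosen for this example, not part of any library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustration only)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable clock, handy for tests
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast so the fault does not tie up callers.
                raise CircuitOpenError("dependency temporarily disabled")
            # Cool-down elapsed (half-open): let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each call to a fragile dependency, for example `breaker.call(fetch_inventory, item_id)` with a hypothetical `fetch_inventory`, and fall back to a degraded response when `CircuitOpenError` is raised. This limits how far the fault spreads, but it does not shorten the underlying outage.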

However, an architecture also has many dependencies, such as DNS infrastructure, CDNs, network operators, and data centers. Achieving stability requires all of them to work together.
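The annual downtime figures for the availability levels listed above follow from simple arithmetic, and the dependency point can be quantified the same way: if a service needs several dependencies at once, its best-case availability is the product of theirs. The following is a minimal sketch, assuming a 365.2425-day year (which reproduces the listed figures) and independent failures. The function names and the 99.99% figure used for each dependency are illustrative assumptions, not data from the article.

```python
from math import prod

# Assumption: mean Gregorian year; this reproduces the downtime figures listed above.
SECONDS_PER_YEAR = 365.2425 * 24 * 3600


def annual_downtime_seconds(availability: float) -> float:
    """Expected unavailable seconds per year for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * SECONDS_PER_YEAR


def format_duration(seconds: float) -> str:
    """Format as d-h-m-s, truncating to whole seconds and dropping leading zero units."""
    total = int(seconds)
    days, rem = divmod(total, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, secs = divmod(rem, 60)
    parts = [(days, "d"), (hours, "h"), (minutes, "m"), (secs, "s")]
    while len(parts) > 1 and parts[0][0] == 0:
        parts.pop(0)
    return "-".join(f"{value}{unit}" for value, unit in parts)


def composite_availability(dependencies: list[float]) -> float:
    """Availability of a service that needs every dependency up at the same time.

    Assumes failures are independent, which real systems only approximate.
    """
    return prod(dependencies)


if __name__ == "__main__":
    for level in (0.99, 0.999, 0.9999, 0.99999):
        print(f"{level:.3%} -> {format_duration(annual_downtime_seconds(level))}")

    # Hypothetical chain: DNS, CDN, network operator, data center at 99.99% each.
    chain = [0.9999] * 4
    combined = composite_availability(chain)
    print(f"combined: {combined:.4%} -> {format_duration(annual_downtime_seconds(combined))}")
```

Under these assumptions, four serial dependencies at 99.99% each give only about 99.96% overall, roughly three and a half hours of downtime a year instead of under an hour. This is one way to see why stability depends on every layer.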


Netizen: Please give programmers a raise!

Indeed, once servers go down, consumers cannot visit the website and customers cannot place orders. This directly hurts the company's profits and may even affect the website's indexing and ranking on search engines.

So every year, when the platforms' promotions kick off, developers and operations staff face huge challenges.

As topics such as #唯品会崩了相应负责人被免职# ("Vipshop crashed, the person in charge was dismissed") drew public attention once again, many people left comments such as:

  • "I hope that in the future, large companies will have a complete set of procedures for avoiding and handling downtime accidents";

  • "Outages happen at every company, but taking this long to fix one is really sloppy";

  • "It is still necessary to strengthen infrastructure construction and technical management";

  • "It must not go down at critical moments; just look at the number of customers affected."

Many netizens also voiced concern for programmers' careers:

  • "Server outages are normal, and the developers work hard on maintenance";

  • "I used to be a programmer. I know that maintenance is not easy. Please give programmers a raise."

So, have you experienced server downtime? You can leave a message and discuss in the comment area.

