A decade of downtime and failures at AWS, the big brother of the global public cloud

  

Every public cloud provider has encountered various kinds of downtime and failure over its long history of development.

Sometimes the cause is human error, sometimes a lightning storm, sometimes a power outage in the data center, sometimes a fiber optic cable dug up by construction, sometimes a mistyped command...

The emergence and resolution of these problems is precisely the process by which public cloud services are continuously optimized and improved.

Yet even for the big brother of the global public cloud, judging from the downtime incidents that can be found in public records, failures have occurred almost every year.

Here, Amin has reviewed, organized, and edited this collection: a complete record of the downtime and failures of the world's leading public cloud over ten years, from 2010 to today in 2019.

【2010】

AWS cloud service outages of unspecified duration due to a UPS failure and human error.

On May 4, 2010, Amazon's cloud computing service experienced two failures, caused by a UPS unit failure and human error respectively.

An AWS cloud service outage lasting up to 7 hours due to a data center electrical fault.

On May 8, 2010, Amazon's cloud computing service failed. A ground fault and short circuit in a data center power distribution panel left some users without service for up to 7 hours and caused a very small number of users to lose data.

An Amazon data center power outage interrupted AWS cloud services for about 1 hour.

On May 11, 2010, Amazon's cloud computing service failed due to a power outage, causing a small number of users in the eastern United States to lose service for up to an hour.

The cause was a car knocking down a high-voltage power pole near the Amazon data center; the data center's transfer switch then failed to cut over from the utility grid to the internal backup generators (the distribution automation system mistakenly interpreted the cause of the outage as an electrical ground fault).

It is worth noting that this was the fourth time in a week that Amazon's cloud computing service had failed due to a power issue.

Amazon's European websites were down for more than an hour and a half.

On December 13, 2010, Amazon's British, French, German and Spanish websites were down for more than an hour and a half on Sunday night, but there was no indication that this was related to a cyber attack. Amazon had been one of the first companies to cut ties with WikiLeaks after it began releasing classified U.S. diplomatic cables, and a group of hackers supporting WikiLeaks subsequently launched attacks against Amazon's website.

Amazon's UK, German, French and Spanish websites were down for more than 30 minutes until they gradually returned to normal at 21:45 GMT on Sunday. Amazon's US site was not affected this time.

【2011】

A large-scale, long-lasting outage of AWS cloud data center servers.

On April 22, 2011, a large portion of Amazon AWS cloud data center servers went down, in what is considered the most serious incident in the history of Amazon's cloud computing.

Sites affected by the outage at Amazon's cloud computing center in Northern Virginia included the Q&A service Quora, the news aggregator Reddit, Hootsuite, and the location service Foursquare.

Amazon's official report attributed the incident to flaws in the design of its EC2 system, and said it was continuously fixing these known defects to improve the competitiveness of EC2, Amazon's Elastic Compute Cloud service.

A lightning strike disconnected AWS from the network; the failure lasted about two days.

In August 2011, Amazon's EC2 (Elastic Compute Cloud) service suffered a network outage that temporarily interrupted many websites and services using Amazon Web Services' cloud computing infrastructure. The affected Dublin data center was Amazon's only data storage location in Europe, which meant that EC2 customers there had no other data center to fall back on during the incident.

Amazon said that lightning in Dublin, Ireland, caused a power outage at Amazon's data center there, taking its cloud services in the region offline. Amazon also confirmed a problem with the connection between US East 1 and the Internet, although that connection was fully restored soon afterward, and found another connectivity issue affecting the Relational Database Service in Northern Virginia, which was fixed in 11 minutes. Even so, the outage caused a roughly two-day interruption for several websites using Amazon's EC2 cloud service platform.

【2012】

An Amazon AWS EC2 service failure lasted more than 29 hours.

On June 14, 2012, Amazon's data center in the eastern United States failed, affecting multiple AWS cloud services and well-known websites built on them, such as Heroku and Quora. On the 16th, Amazon published its analysis of the incident: a failure of the public power grid had set off a chain of failures.

Thunderstorms knocked out utility power at Amazon's facilities in the area; the backup generators failed, the uninterruptible power supply (UPS) systems exhausted their emergency power, and roughly a thousand MySQL databases running on Amazon RDS went down.

At the same time, the loss of EBS-related EC2 API availability was concentrated between 20:57 and 22:40. During this period, mutating API calls (such as create and delete) failed, which directly prevented customers from launching new EBS-backed EC2 instances. The EC2 and EBS APIs are implemented on top of multiple replicated data stores; the EBS data store holds metadata for resources such as volumes and snapshots.

Normally, to protect the data store, the system automatically switches it to read-only mode until power is restored and the availability zone can be brought back up; it then restores the store to a consistent state as quickly as possible and returns it to read-write mode, at which point mutating EBS calls succeed again. In this case, however, the protection scheme did not work.
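To make that protection scheme concrete, here is a minimal Python sketch of the general idea described above, using a simplified in-memory store. It is an illustration only, not AWS's actual implementation; the class and method names are invented for this example.

class ReadOnlyError(Exception):
    """Raised when a mutating call arrives while the store is read-only."""

class MetadataStore:
    """Simplified stand-in for a replicated metadata store (volumes, snapshots)."""

    def __init__(self):
        self.read_only = False
        self.records = {}

    def on_power_loss(self):
        # Protect the data: stop accepting writes immediately.
        self.read_only = True

    def on_power_restored(self, consistent_snapshot):
        # Reconcile to a known-consistent state before re-enabling writes.
        self.records = dict(consistent_snapshot)
        self.read_only = False

    def create(self, key, value):
        # Mutating calls (create/delete) are rejected while read-only,
        # which is why customers could not launch new EBS-backed instances.
        if self.read_only:
            raise ReadOnlyError("store is read-only; mutating calls are rejected")
        self.records[key] = value

    def get(self, key):
        # Reads stay available even while the store is protected.
        return self.records.get(key)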

AWS network services were interrupted for an unknown duration.

On October 22, 2012, Amazon Web Services was interrupted in Northern Virginia.

The incident affected well-known large websites including Reddit and Pinterest. The outage hit the Elastic Beanstalk service first, followed by its console, the Relational Database Service, ElastiCache, Elastic Compute Cloud (EC2), and CloudSearch. The incident led many to believe it was time for Amazon to upgrade the infrastructure of its Northern Virginia data center.

The Amazon AWS Elastic Load Balancing service failed for an unknown duration.

On December 24, 2012, on Christmas Eve, Amazon did not give its customers a peaceful holiday: the AWS US East (Region 1) data center failed and its Elastic Load Balancing service was interrupted, affecting websites such as Netflix and Heroku.

Heroku had also been affected by the earlier service failure in the AWS US East region.

Coincidentally, Netflix's competitor, Amazon's own Amazon Prime Instant Video, was not affected by this glitch.

【2013】

Amazon AWS load balancing failure lasted for 2 hours.

The failure, which occurred on Friday, September 13, 2013, was caused by a load balancing problem and affected customers in some regions.

Amazon resolved the load balancing access problem and adjusted its configuration procedures to prevent such problems in the future.

While the outage lasted only about 2 hours and affected only one availability zone in Virginia, it was an important reminder to have a backup plan.

【2014】

The AWS CloudFront DNS service was interrupted for nearly 2 hours.

On November 26, 2014, Amazon Web Services' CloudFront DNS servers were down for nearly 2 hours starting at 7:15 PM ET. After 9 PM the DNS servers began to be restored.

Some websites and cloud services went down while the content delivery network was unable to fulfill DNS requests. Nothing major happened, but the incident deserves a place on this list because it involved the world's largest and longest-running cloud.

【2015】

AWS went down on a large scale, and the downtime lasted more than 40 seconds.

On July 1, 2015, Amazon Web Services (AWS) experienced a large-scale outage lasting more than 40 seconds. Many apps such as Slack, Asana, Netflix, and Pinterest, as well as many websites using AWS services, became unresponsive.

 

Many netizens joked, "It's all the fault of the leap second!" Others suspected it was caused by the Apple Music service, and some users on the Hacker News website wrote that it was caused by an Amazon EC2 server.

DynamoDB timeouts on the Amazon AWS platform caused an outage lasting 5 hours.

In September 2015, an interruption in Amazon's automated infrastructure processes brought down parts of the AWS platform. Cascading from a simple network disruption into a widespread service outage, Amazon experienced the kind of failure a traditional in-house data center would, despite having a very advanced and integrated cloud platform.

The network disruption affected storage servers for some DynamoDB cloud databases at a moment when those servers were requesting their membership data. The disconnection caused retrieval and transmission timeouts, and servers that could not obtain their membership data automatically removed themselves from service.

The DynamoDB timeouts then caused wider outages as servers whose requests had failed began retrying them, creating a vicious circle that left Amazon customers unable to use the affected AWS services for 5 hours.
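This kind of retry storm is exactly what capped exponential backoff with jitter is meant to prevent: each client waits a randomized, growing delay before retrying, so thousands of servers do not hammer the metadata service at the same instant. Below is a minimal Python sketch of the general pattern; fetch_membership is a hypothetical stand-in for the failing request (assumed here to raise TimeoutError on failure), and this is an illustration rather than Amazon's internal code.

import random
import time

def fetch_with_backoff(fetch_membership, max_attempts=5,
                       base_delay=0.1, max_delay=5.0):
    """Retry a flaky call with capped exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch_membership()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up and surface the error instead of retrying forever
            # Exponential growth capped at max_delay, randomized ("full jitter")
            # so that many clients do not synchronize their retries.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))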

The AWS service was interrupted due to a power outage in the data center, and the outage lasted more than 5 hours.

On September 20, 2015, an Amazon AWS data center suffered an outage that affected the online services of applications such as Netflix, Tinder, and Airbnb, and interrupted Reddit and IMDb.

The service interruption was attributed to a software problem in the us-east-1 data center in Northern Virginia, and most of the affected customers were in that region. Shortly after the outage began around 3 a.m. on the 20th, a total of 24 applications and services reported problems, 10 of them in full "service outage" mode.

【2016】

Amazon AWS was down for 20 minutes.

On March 11, 2016, at around 2:20 local time in the United States, e-commerce giant Amazon's official website went down for 20 minutes. The incident not only made Amazon's main e-commerce site inaccessible, but also affected other Amazon services, including its cloud computing service and some digital content services.

This was a major incident for Amazon and caused significant economic losses. For Amazon, which ranks first in the world in both strength and number of users, a cloud service accident is not just an economic loss; it also gives competitors hope of catching up.

AWS service in Australia was interrupted by a power outage lasting nearly 10 hours.

In June 2016, a storm hit Sydney and AWS facilities in the region lost power; many EC2 instances and EBS volumes hosting critical workloads for well-known companies failed one after another.

Websites and online services in Australia's AWS availability zone were down for nearly 10 hours that weekend, causing problems with everything from banking services to pizza deliveries.

【2017】

The Amazon AWS S3 outage, with 4 hours of downtime.

On February 28, 2017, S3, known as Amazon AWS's most stable cloud storage service, experienced an "ultra-high error rate" outage.

In the end, AWS gave a precise explanation: while debugging the system, an engineer ran a script intended to remove a small number of servers, but a mistyped character caused a much larger set of servers to be removed. The servers removed in error supported two S3 subsystems, which left S3 unable to work normally and made the S3 API unavailable.

Since S3 stores files and is a core component of the AWS system, other services in the Northern Virginia (US-East-1) region that rely on S3 storage were affected as well: the S3 console, new instance launches for Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS) volumes (where data needed to be read from S3 snapshots), and AWS Lambda.

To fix the error, Amazon had to restart the entire system, which had not been restarted for several years, ultimately leading to the 4-hour Amazon S3 outage that shocked the world.
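As an illustration of the kind of guardrail that can stop a single mistyped parameter from removing far too much capacity, here is a short hypothetical Python sketch: a dry-run default and a hard cap on removals per invocation. This is not the script AWS actually ran, and the names and the limit are invented for this example.

import sys

MAX_REMOVALS_PER_RUN = 5   # assumed safety limit, purely for illustration

def decommission(inventory, requested, dry_run=True):
    """Remove the requested servers, refusing obviously oversized requests."""
    targets = [s for s in requested if s in inventory]
    if len(targets) > MAX_REMOVALS_PER_RUN:
        sys.exit(f"refusing to remove {len(targets)} servers "
                 f"(limit {MAX_REMOVALS_PER_RUN}); double-check the input")
    for server in targets:
        if dry_run:
            print(f"[dry-run] would remove {server}")
        else:
            print(f"removing {server}")  # the real removal call would go here
    return targets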

【2018】

An AWS network failure of unknown duration.

In March 2018, Amazon's Alexa smart home service suffered a regional failure. When users woke up their Amazon Echo devices at home, Alexa would ask them to try again and report that the server could not be found.

Alexa's failure stemmed from a problem with AWS's network service. Not only Alexa but other applications relying on AWS as their backbone were affected that day, including software development company Atlassian and cloud communications company Twilio.

An Amazon spokesperson said it might be related to a power outage at a redundant Internet connection point in AWS's Virginia facilities. In its subsequent confirmation of the failure, AWS stated that the issue had caused failures in multiple data centers in the US East (Region 1). Some AWS Direct Connect customers in the US-East region experienced packet loss, and Direct Connect connections from Equinix DC1-DC6 & DC10-DC12 in Ashburn, VA and CoreSite VA1 & VA2 in Reston, VA were also affected.

A hardware failure in an AWS data center affected cloud services for 30 minutes.

On May 31, 2018, AWS experienced connectivity issues again due to a hardware failure in a data center in the Northern Virginia region. In this incident, AWS's core EC2 service, the WorkSpaces virtual desktop service, and the Redshift data warehouse service were all affected.

 

The AWS Management Console failed, and the failure lasted nearly 6 hours.

On July 18, 2018, Prime Day, Amazon's blockbuster shopping promotion, ran into its biggest embarrassment ever: major technical failures on Amazon's website and applications threatened its 36-hour sales feast.

At the same time, Amazon's core AWS cloud services also experienced disruptions. When customers logged into the AWS Management Console, they received an error message with a picture of a dog, similar to the dog error pages consumers saw on Amazon.com during Prime Day.

AWS said in a statement about the glitch: "Customers are experiencing intermittent errors logging in with their accounts and are unable to access the AWS Management Console." Provisioning AWS resources was also not possible during that time.

The outage lasted nearly six hours. An AWS spokesperson said the intermittent AWS Management Console issues had not had any meaningful impact on Amazon's consumer business, and that the AWS and Prime Day issues were not linked.

AWS servers in South Korea were interrupted, with downtime lasting more than an hour.

On November 23, 2018, core Amazon Web Services (AWS) servers were disrupted across South Korea, causing two major cryptocurrency trading platforms to cease operations. AWS, one of the most widely used cloud services in the world, was hit by an internal core server failure that brought the major digital asset trading platforms Upbit and Coinone to an abrupt halt. Several major e-commerce sites were also inaccessible for about an hour, media reported.

 

AWS said that "between 3:19pm and 4:43pm PST, the Asia-Pacific server error rate increased, but the issue has been resolved and the server is operating normally." The details of Amazon's statement also confirmed that the Seoul network was the most affected by the outage. The Upbit platform issued several statements after the outage and apologized for not being able to warn users of the sudden interruption in advance. The Coinone platform also announced that it had entered maintenance mode.

【2019】

In the AWS Beijing region, new instance launches were interrupted after fiber optic cables were cut; the duration of the failure is unknown.

On the evening of June 1, 2019, several optical cables serving one availability zone of the AWS Beijing region (CN-North-1) were cut during overnight road construction. The availability zone lost its Internet connectivity, which in turn caused new instance launches to fail in all availability zones, including failures of the EC2 API itself; as a result, the EC2 API was unavailable across the whole CN-North-1 region. At the time, the maintenance team had located the specific break points and was doing its best to restore the connection.
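For client applications, one way to ride out a regional EC2 API outage like this is to fall back to launching capacity in another region. The Python sketch below, using boto3, assumes credentials for the AWS China partition are already configured and uses placeholder AMI IDs; it is a hedged illustration of the pattern rather than a recommendation, since data residency and replication constraints may rule out cross-region failover.

import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Regions to try in order; the AMI IDs below are placeholders for illustration.
LAUNCH_TARGETS = [
    ("cn-north-1", "ami-11111111"),       # Beijing (primary)
    ("cn-northwest-1", "ami-22222222"),   # Ningxia (fallback)
]

def launch_with_fallback(instance_type="t2.micro"):
    """Try the primary region first; if its EC2 API is unreachable or erroring,
    fall back to the next region in the list."""
    for region, ami in LAUNCH_TARGETS:
        try:
            ec2 = boto3.client("ec2", region_name=region)
            resp = ec2.run_instances(ImageId=ami, InstanceType=instance_type,
                                     MinCount=1, MaxCount=1)
            return region, resp["Instances"][0]["InstanceId"]
        except (BotoCoreError, ClientError) as err:
            print(f"launch failed in {region}: {err}")
    raise RuntimeError("could not launch an instance in any configured region")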

Industry insiders pointed out that the optical fiber serving one availability zone in the Beijing region was cut by municipal construction, and cut in more than one place. The EC2 API endpoint happened to sit in the availability zone that was cut off, which is why new instances could not be started. Running into something like this also shows how municipal construction can always catch you off guard.

The above content information is compiled from: Sina, Sohu, Tencent News and other related websites, information platforms, and public news reports.

If there are any omissions in this compilation, I hope friends in the industry will help fill them in.

Still, as the saying goes, one who knows his shame grows brave. Big brother of the public cloud, we remain optimistic about you!

As the saying goes:

Ten years of downtime.

Not dwelling on it, yet impossible to forget.

Users around the globe, with nowhere to voice their desolation.

Even if we met, you would not know me: face full of shock, heart gone cold.

Who dares think of the failures that come in the night.

No drink goes down well. No food tastes right.

We look at each other without a word; only the code farmers stay busy.

Year after year, one can expect the cloud to break: on a moonlit night, sleeping in the server room.

Source: Amin's "Jiangchengzi: Ten Years of Downtime"

Editor's comment: Amin

What do you think of it?

Welcome to leave a comment at the end of the article.

Source of this article: Amin Independent We-Media. All rights reserved; infringement will be pursued. Please obtain authorization before reprinting.

This article and the author's reply only represent the author's personal views and do not constitute any investment advice.

 
