A 2019 inventory of IT incidents

Yun @ Veteran's Notes

The 2020 Lunar New Year got off to a difficult start, and the fight against the novel coronavirus pneumonia is still in its toughest phase. Looking back on 2019, some incidents are worth recording. Let's take stock of the "external incidents" of 2019.

 

I. IT infrastructure service incidents we ran into

2019 was a relatively dark year for IT infrastructure services. Serious incidents were frequent, and several key services our company relies on were unavailable for long enough to break a four-nines SLA (i.e., about 52 minutes of downtime a year). Let's recap:

  1. Alibaba Cloud: Starting at 0:00 on March 3, 2019, some ECS servers in Availability Zone C of Alibaba Cloud's North China 2 region experienced IO hangs. Resource utilization on every server hosting our business went to 100% and logins failed; the business interruption lasted up to 3 hours.

  2. China Mobile: On March 15, 2019, China Mobile's centralized IoT system (CMIOT) went live before dawn with the second batch of its deployment architecture optimization cutover. Starting at 7:11, IoT SIM cards across large parts of the country were affected. Service began to recover gradually at 7:30, but with tens of millions of IoT cards queuing to reconnect, the situation did not ease until 11:00 in the morning.

  3. UnionPay, NetsUnion, and WeChat Pay: Around 14:42 on March 23, 2019, the physical lines connecting UnionPay and NetsUnion (the two clearing organizations) to WeChat Pay's Shanghai data center were dug up and severed. Although they quickly switched to backup lines, downstream businesses were affected, and all transactions (Alipay, WeChat Pay, UnionPay QuickPass) became unstable.

  4. 114DNS: On April 4, 2019, from 10:30 to 12:30, both 114DNS and Google DNS 8.8.8.8 were down.

  5. Shanghai Mobile: On May 29, 2019, from 11:10 to 11:20, Shanghai Mobile's network malfunctioned, affecting both data and calls. For a while we did not receive a single order from Shanghai.

  6. Alipay: On December 5, 2019, from 16:23 to 16:45, Alipay suffered a network-wide failure. The system error "010002: System exception" spiked, and the outage lasted 22 minutes. The whole internet noticed, and "Alipay collapse" became news.

  7. Alibaba Cloud: On December 6, 2019, some users reported network anomalies in parts of Alibaba Cloud's North China and East China regions, which affected access to some cloud resources.
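
To put the four-nines figure in perspective, here is a minimal sketch of the arithmetic. The function name and the choice of a 365-day year are assumptions made for illustration; the outage durations are the ones reported in the list above.

```python
# Downtime budget implied by an availability target, compared with the
# outage durations reported in the list above.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability)


# Four nines = 99.99% availability, roughly 52.56 minutes per year.
four_nines = downtime_budget_minutes(0.9999)

# Durations in minutes, taken from the incidents listed above.
outages = {
    "Alibaba Cloud IO hang (March 3)": 3 * 60,
    "Shanghai Mobile (May 29)": 10,
    "Alipay (December 5)": 22,
}

for name, minutes in outages.items():
    share = minutes / four_nines
    print(f"{name}: {minutes} min = {share:.0%} of the yearly four-nines budget")
```

By this arithmetic, the 3-hour Alibaba Cloud interruption alone uses more than three times a full year's four-nines budget.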

 

II. Incidents at the IT giants

Every year the IT giants get into all sorts of trouble. Let's take stock of the major IT incidents from 2019 up to the 2020 Lunar New Year:

  1. MongoDB: In January 2019, security researcher Bob Diachenko disclosed on Twitter that a MongoDB database belonging to SenseNets (深网视界) was "streaking" on the public internet with full access allowed, and the global press followed up heavily on this large-scale data breach. The database contained more than 2.56 million personal records, including ID card number, issue and expiration dates, sex, nationality, address, date of birth, passport photo, employer, and about 6.68 million records of locations captured by cameras over the previous 24 hours.

  2. Pinduoduo: On January 20, 2019, a 100-yuan no-threshold coupon (belonging to an already-expired promotion) went live again in the early morning because of an operational error. From 1:00 to 10:00, a full 9 hours, coupon hunters (the "wool party") had a field day. Pinduoduo's losses are believed to run to tens of millions of yuan.

  3. Facebook: On March 22, 2019, an anonymous insider revealed that from 2012 to date, the account passwords of some 200 to 600 million Facebook users may have been stored in plain text and could be searched by more than 20,000 Facebook employees.

  4. AWS: On March 22, 2019, former AWS network engineer Thompson exploited an AWS "configuration vulnerability" or "firewall misconfiguration" to break into the S3 buckets of AWS customer Capital One Financial and download their contents. She also tried to hide the traces of the intrusion using IPredator's VPN and Tor. She published the data in her GitHub account and boasted on Twitter that she held this private customer data.

  5. AWS: On October 22, 2019, Amazon's AWS DNS servers suffered a severe and sustained DDoS attack that lasted 15 hours!

  6. ES: In December 2019, Bob Diachenko again announced the discovery of a huge, publicly accessible, password-free ElasticSearch database containing more than 2.7 billion email addresses, 1 billion of which came with simple passwords stored in plain text. It looks very much as if someone who had bought this database tried to enable its search function but misconfigured it, leaving it publicly accessible.

  7. Jingdong (JD.com): On January 8, 2020, reportedly because of an operational error, a 200-yuan no-threshold coupon was made applicable to Jingdong's self-operated small-appliance category for as long as fifty minutes. The incident is believed to involve 240,000 low-price orders, with the goods initially estimated at 70 million yuan.
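
Two of the items above (Facebook and ES) involve passwords kept in plain text. As a contrast, here is a minimal sketch of storing only a salted, slow hash using Python's standard library. The function names and the iteration count are assumptions for illustration, not a description of what any of the companies above do.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune for your own hardware


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); only these are stored, never the password."""
    salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return hmac.compare_digest(digest, expected)


salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))
print(verify_password("wrong guess", salt, digest))
```

With this scheme, an employee or attacker who can search the stored records sees only salts and digests, not the passwords themselves.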

 

III. External software defects we ran into

Walk by the river often enough and your shoes will get wet. Use third-party software for long enough and sooner or later you will run into its fatal flaws:

  1. Docker resource awareness problem: In January 2019, we found that in our container cluster built on Alibaba Cloud, instances of various Java applications running in Docker containers were intermittently killed with SIGKILL (signal = 9). The cause: Docker uses cgroups to limit the resources a container's processes may use, while the JVM inside the container still derives its default settings from the host's memory size and CPU core count, so the JVM miscalculates its heap. We took two steps to avoid the Docker OOM killer: 1) enable cgroup resource awareness, so that the JVM sizes its heap according to the container's memory limit; 2) raise the container memory limit from 2.5G to 3G (note: 2G (heap space) + 1G (additional) = 3G). We also recommend SpringBoot, which is relatively lightweight and has a relatively small footprint; some of our company's projects that currently use SpringCloud are set to 1G (heap space) + 1G (additional) = 2G.

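  2. Alibaba Cloud image problem: On April 15 and again on April 25, 2019, several of our hosts in Alibaba Cloud's North China and East China data centers suddenly lost their IP addresses, so every service running on them became unreachable. This was a bug in the Alibaba Cloud image, covered by its "KB:94181: Checking and fixing missing IP addresses on CentOS 7 and Windows instances". We subsequently checked all data centers and fixed every affected host with a script.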

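  3. Consul defect: On June 18, 2019, a key business suddenly raised an alert that its registration at the nginx address 1.1.1.1 had failed. The cause: in the internal container registration flow, the consul-template program reads data from consul and writes it into the configuration of nginx's upstream module. If consul-template cannot read any data, however, it writes the default address 1.1.1.1 into the upstream configuration, which planted the seed of the incident. Consul has officially fixed this problem in versions above 0.9.0, and we subsequently upgraded the consul services in all data centers one by one.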

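  4. OKHTTP defect: In July 2019, field reports said that even when the network was healthy, individual devices would at certain times experience slow payments and multiple consecutive failed transactions. The cause: once OKHTTP throws a SocketTimeoutException for the first time, subsequent requests keep returning SocketTimeoutException even after the network has recovered. Things return to normal only after a multi-active domain switch, a WiFi reconnect, or an application restart. The problem is especially common on 4G networks and has not been resolved upstream. All we can do is clear the connection pool in the global ResponseError listener whenever a SocketTimeOut occurs, keep watching the status of this issue, and update promptly once it is fixed.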

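  5. MySQL defect: On August 12, 2019, the data center's primary and replica databases went down. The cause: when InnoDB truncates a table, it must flush all pages of that table's tablespace file to disk and remove them from the buffer pool. A check in that code path was presumably faulty and triggered an instance crash. We have since upgraded MySQL to the newer version 5.7.27, and split the data center's reporting database and other key business databases onto separate instances to reduce the blast radius.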
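
For the Consul item above, one extra safeguard is to validate the rendered upstream configuration before nginx is reloaded. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be the text of a rendered upstream snippet only (not a full nginx.conf), and where this check would hook into the deployment is left open.

```python
import re

# The fallback address named in the Consul item above.
PLACEHOLDER_ADDRESSES = {"1.1.1.1"}

# Matches lines such as "    server 10.0.0.5:8080;" and captures the host part.
SERVER_LINE = re.compile(r"^\s*server\s+([^\s;:]+)(?::\d+)?", re.MULTILINE)


def upstream_servers(rendered_config: str) -> list[str]:
    """Return the host of every `server` entry in a rendered upstream snippet."""
    return SERVER_LINE.findall(rendered_config)


def is_safe_to_reload(rendered_config: str) -> bool:
    """Reject configs with no upstream servers or with a placeholder address."""
    servers = upstream_servers(rendered_config)
    if not servers:
        return False
    return not any(host in PLACEHOLDER_ADDRESSES for host in servers)


good = "upstream app {\n    server 10.0.0.5:8080;\n    server 10.0.0.6:8080;\n}\n"
bad = "upstream app {\n    server 1.1.1.1:80;\n}\n"

print(is_safe_to_reload(good))
print(is_safe_to_reload(bad))
```

If the check fails, keeping the previous configuration and raising an alert is generally safer than reloading nginx with an upstream that points nowhere useful.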

 

-END-

Thanks for reading Veteran's Notes. Wishing you immunity from every illness and good health in the Year of the Rat!

 

Origin www.cnblogs.com/zhengyun_ustc/p/12296241.html