A business life-or-death alarm caused by a test route

About the Author:

Zhou Xiaojun
A senior operations expert at Tencent with more than ten years of experience in Internet IT operations. He specializes in Internet architecture, cloud computing platforms and operations automation, and has a deep interest in cross-industry DevOps, cloud computing, technical architecture and team management. He is a lecturer at Tencent Academy and a gold-medal lecturer in the Efficient Operations community. He is responsible for operations management of social services at the Tencent SNG social network operations center, and currently focuses on putting operations AI, operations big data and large-scale operations automation into practice.

An alarm out of nowhere

A sudden downpour lashed the wide glass window beside DBA Xiao Wei. Outside, Shennan Avenue, which had been under blue sky and white clouds a moment earlier, went dark in the blink of an eye.

Xiao Wei raised his wrist and checked his watch. It was almost 11:50. He stood up and called over to fellow DBA Xiao Wang: "Xiao Wang, are you done? Let's go eat." After 12 o'clock the cafeteria on the 12th floor at Tencent is usually packed, with the queue stretching all the way to the elevators, so the DBAs eat early to save time.

Xiao Wang stood up and said, "Just wrapped up a work order. Let's go."

Just as the two were about to leave their cubicles, Xiao Wei's phone burst into the classic iPhone ringtone, and "Tencent automated voice alert" flashed on the screen. He tapped to answer, and a stiff synthesized female voice came out of the speaker: "You have a voice alert. The module-call success rate of business ID 394521 has dropped to 0%. Please handle it promptly. Press 1 to confirm, or press 2 to transfer to the backup owner..."

Almost at the same moment, Xiao Wu from business operations on the same floor shouted over: "Xiao Wang, the open-interface business has a DLP alarm, and ROOT says the problem is in the data layer. Take a look."

DLP is the business life-or-death alarm. Each business connects its core indicators to DLP, such as image upload success rate or user login success rate. When a core indicator drops to its threshold, a DLP alarm fires. DLP alarms are the ones operations staff watch most closely.
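The threshold rule described above can be sketched in a few lines. This is only an illustration, not Tencent's DLP implementation: the `CoreMetric` class, the `should_alarm` function and the 95% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CoreMetric:
    """One core indicator reported by a business (hypothetical shape)."""
    business_id: int
    name: str
    successes: int
    total: int
    threshold: float  # assumed rule: alarm when the rate is at or below this value

    def success_rate(self) -> float:
        # Design choice for this sketch: zero traffic counts as a 0.0 rate,
        # so a total loss of traffic is not reported as healthy.
        return self.successes / self.total if self.total else 0.0


def should_alarm(metric: CoreMetric) -> bool:
    """Return True when the indicator has dropped to its threshold."""
    return metric.success_rate() <= metric.threshold


# The outage in the story: every call for the business ID fails.
outage = CoreMetric(394521, "module-call success rate", 0, 1000, threshold=0.95)
print(should_alarm(outage))
```

Treating "no traffic" as a failure is a deliberate choice here, because in the incident the most visible symptom was traffic collapsing rather than servers reporting errors.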

ROOT is the intelligent root cause analysis system. Based on the business architecture and the data-flow relationships between modules, it uses algorithms such as time correlation and area weighting to filter and classify monitoring alarms, surface the alarms that matter to the business, and point directly at their root cause.

Xiao Wei rushed back to his computer. With a pianist's fingers he deftly typed in a lock-screen password more than a dozen characters long, opened the monitoring view, and saw that the module-call success rate of the open-interface business ID had dropped from 99.98% to 0%, which had triggered the module-call failure-rate alarm.

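Xiao Wei clicked into the business view, entered the business ID from the alarm, and jumped to the view of the data warehouse that the business ID belongs to. Warehouse monitoring showed everything normal: no fault alarms on the access servers or data servers, and the network was fine. The only anomaly was that both access traffic and data traffic had dropped sharply.

Xiao Wei immediately said to Xiao Wang, "Check whether the link from warehouse A to the business logic is OK." A moment later Xiao Wang replied, "Ping is normal, and the network on-call engineer says the switches show no anomalies."

Meanwhile, several RTX groups had started flashing nonstop in the lower-right corner of the screen. Needless to say, product, development, QA and operations were already pulling people into group chats to ask what had caused the failure.

The business architecture was clear in Xiao Wei's mind. The warehouse was normal, the business logic module was normal, traffic had dropped to zero, and the network was normal, so the route was most likely the problem. He switched to the routing system and, sure enough, the warehouse's route traffic had dropped to zero, and none of the IPs in the route's access server list was a correct access server IP.

Xiao Wei immediately contacted routing system operations to roll back the route version, and asked Xiao Wang to find out who had changed the warehouse route at that point in time.

Ten minutes later the route had been switched back to the correct version, data warehouse traffic recovered, and the module-call success rate was back at three nines.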

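Who changed the route?

By the time the business was back to normal it was already 12:30 p.m. Fortunately this was a data warehouse for long-tail businesses, so few users were affected.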

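Skipping lunch, Xiao Wei and Xiao Wang kept digging for the root cause. Xiao Wang reported that the system showed no route change record at the time of the failure. It looked as if the route had been changed through an irregular path, so the only option was to check the records of the route change interface.

In the interface change records, routing system operations found that at 11:46 a single IP had made a change that replaced all access servers of the business ID with one test access server. The CMDB (configuration center) showed that this IP belonged to a development test machine.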
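The lookup described here, filtering the interface change records by business ID and time and then resolving the source IP in the CMDB, might look roughly like the sketch below. The record layout, the function names, the 15-minute window, the date and the IP addresses (taken from the 192.0.2.0/24 documentation range) are all assumptions for illustration, not the real routing system or CMDB interfaces.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RouteChange:
    """One entry in the route change interface log (hypothetical layout)."""
    timestamp: datetime
    source_ip: str
    business_id: int
    new_access_ips: tuple


def changes_before(records, business_id, incident_time, window_minutes=15):
    """Return changes to business_id made in the window leading up to the incident."""
    start = incident_time - timedelta(minutes=window_minutes)
    hits = [
        r for r in records
        if r.business_id == business_id and start <= r.timestamp <= incident_time
    ]
    return sorted(hits, key=lambda r: r.timestamp)


def describe_sources(changes, cmdb):
    """Pair each change's source IP with its role in the CMDB, if known."""
    return [(c.source_ip, cmdb.get(c.source_ip, "unknown")) for c in changes]


# Hypothetical data: the date and all IPs are made up for the example.
records = [
    RouteChange(datetime(2020, 1, 1, 9, 0), "192.0.2.10", 394521, ("192.0.2.1",)),
    RouteChange(datetime(2020, 1, 1, 11, 46), "192.0.2.77", 394521, ("192.0.2.200",)),
]
cmdb = {"192.0.2.10": "routing operations host", "192.0.2.77": "development test machine"}

suspects = changes_before(records, 394521, datetime(2020, 1, 1, 11, 50))
print(describe_sources(suspects, cmdb))
```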

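Xiao Wei and Xiao Wang phoned the owner of the development test machine, Xiao Li from operations development.

It turned out that Xiao Li was responsible for developing a cross-warehouse data migration tool, and that morning he had just finished an iteration of it. After completing development in the test environment and doing some simple functional debugging, he went on to verify the tool in production.

One step of the data migration tool changes the access server IPs in the new warehouse's route in bulk. Xiao Li had created test business ID 394521 in a test warehouse in production. Unfortunately, in his debugging code he had mistakenly written a production warehouse ID in place of the test warehouse ID. Without checking carefully, he hurriedly ran the process by hand a little after 11 a.m. Seeing the process run normally with no error logs, and since it was lunchtime, he got up and went to the cafeteria.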
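One way to catch this kind of slip is for the tool itself to refuse any target that is not on an explicit test allowlist, and to produce a plan for review before anything is applied. The sketch below is an assumption about how such a guard could look, not the migration tool's actual code; the warehouse IDs, the `plan_route_change` function and the exception name are hypothetical.

```python
class ProductionTargetError(Exception):
    """Raised when a test run points at a warehouse outside the test allowlist."""


# Hypothetical allowlist; in practice it would come from configuration, not code.
TEST_WAREHOUSE_IDS = frozenset({"test-warehouse-01"})


def plan_route_change(warehouse_id, business_id, new_access_ips,
                      test_ids=TEST_WAREHOUSE_IDS):
    """Build a reviewable change plan, rejecting non-test warehouses up front."""
    if warehouse_id not in test_ids:
        raise ProductionTargetError(
            f"warehouse {warehouse_id!r} is not an approved test target"
        )
    # Nothing is applied here; the plan is only a description of the change.
    return {
        "warehouse_id": warehouse_id,
        "business_id": business_id,
        "new_access_ips": list(new_access_ips),
        "applied": False,
    }


try:
    plan_route_change("prod-warehouse-A", 394521, ["192.0.2.200"])
except ProductionTargetError as err:
    print(f"refused: {err}")
```

With a guard like this, a mistyped warehouse ID fails loudly at the first step, whereas in the story the process ran "normally" and left no error logs.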

Once the production routes of the live warehouse had been wrongly overwritten, the module-call alarm, the DLP alarm and others fired one after another. But Xiao Li, scrolling on his phone in the cafeteria queue, had no idea that a bug in his tool had caused a serious production failure.

Improvements after the failure

After the failure was resolved, QA Xiao Huang called everyone together to review the whole incident. Alarm triggering, response speed, root cause tracing and recovery all met expectations. But the failure exposed how undisciplined the development of operations tools was: the operations developer did not understand the operations environment, tools were tested casually in production, operations staff were not informed during testing, and core code was not reviewed.

Based on everyone's input, Xiao Huang drew up a development specification for operations tools. It includes the following core principles:

1. Strictly follow the team's coding standards

2. Technical architecture and core product plans must both be reviewed by the team

4. Modules are loosely coupled and interact through APIs. Each API must have a standard protocol, authentication, logging and rate limiting

5. Websites must enforce strict permission checks

6. Core logic code requires two-person review. Code for large-scale and cluster-wide change tools in particular must be reviewed by a T3 engineer. Core architecture code must pass rigorous unit tests and be rolled out gradually through grayscale release

7. All operations must be recorded in the operation log

8. Large-scale change tools must be able to record the diff between change versions and to roll a change back

9. Test and production environments are strictly separated, and running test cases of any kind in production is prohibited

10. Change tools must submit a change record to the central change interface

11、……
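Several of the principles above (operation logging, recording the diff between versions, and rollback) can be illustrated together with a small versioned route table. This is a minimal sketch under assumed names, not the routing system from the story: `VersionedRoutes`, its methods, the operator names and the IP addresses are all hypothetical.

```python
class VersionedRoutes:
    """Route table that keeps every version, logs each operation, and can roll back."""

    def __init__(self, initial):
        self._versions = [dict(initial)]
        self.operation_log = []

    @property
    def current(self):
        # Return a copy so callers cannot edit a stored version in place.
        return dict(self._versions[-1])

    def apply(self, operator, changes):
        """Apply changes as a new version and return the old/new diff per key."""
        before = self._versions[-1]
        diff = {
            key: (before.get(key), value)
            for key, value in changes.items()
            if before.get(key) != value
        }
        self._versions.append({**before, **changes})
        self.operation_log.append({
            "operator": operator,
            "action": "apply",
            "version": len(self._versions) - 1,
            "diff": diff,
        })
        return diff

    def rollback(self, operator):
        """Discard the latest version and return to the previous one."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._versions.pop()
        self.operation_log.append({
            "operator": operator,
            "action": "rollback",
            "version": len(self._versions) - 1,
        })
        return self.current


routes = VersionedRoutes({394521: ("192.0.2.1", "192.0.2.2")})
routes.apply("xiao_li", {394521: ("192.0.2.200",)})  # the bad change
routes.rollback("routing_ops")                        # the recovery
print(routes.current)
```

Because every `apply` and `rollback` lands in `operation_log` with its operator and diff, the question "who changed the route?" can be answered from one place. A real tool would additionally forward each entry to the central change interface.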

Operations staff hold the power of life and death over the production environment. Compared with a product feature bug, a bug in an operations script or a missed step in an operation can do far greater damage, and can even take down the entire network.

Therefore, "strict testing and grayscale verification before a tool goes live," so that bugs are not introduced into production, is a principle that not only DBAs but all operations staff must hold to.

DevOps Thirty-Six Strategies for Database Operation and Maintenance


Origin blog.51cto.com/15127563/2665773