Say "NO" to misoperation, the daily operation and maintenance of DevOps 36 strategies


About the author:


Liang Ding'an, head of Tencent Zhiyun product and director of operation and maintenance technology. He has more than ten years of experience in the Internet operation and maintenance industry and is a gold medal lecturer in the Efficient Operation and Maintenance community, a guest lecturer at Fudan University, a Tencent Cloud evangelist, and a DevOps expert. He has personally experienced the growth of an enterprise's server fleet from dozens of machines to tens of thousands, and has rich theoretical and practical experience in building automated operation and maintenance systems and monitoring quality systems. He currently focuses on Tencent Zhiyun operation and maintenance and the productization of DevOps solutions.

Summary
Although the challenges and difficulties faced by each enterprise's operation and maintenance team are different, the daily work of every operation and maintenance team looks similar: we are on standby 24 hours a day to ensure stable and reliable business quality; we prepare resources and release changes for every business operation activity; and we rush to the front line of the business to put out the flames of failure.

The mission of the operation and maintenance team is to provide the enterprise with technical support in terms of quality, efficiency, cost, and security, and to escort the healthy development of the business.

In news reports I often see other companies' operation and maintenance teams taking the blame for human error. For example, a code-hosting website mistakenly deleted its production database because of an operation and maintenance engineer's mistake, and an error in a travel website's release system caused business downtime and loss of revenue.

These cases hurt both the operation and maintenance industry and its professionalism. While feeling lucky that the failure did not happen to us, we should reflect more on why, even as operation and maintenance technology develops rapidly, different operation and maintenance teams keep committing the same mistakes, and why these mistakes keep tormenting all kinds of enterprises.

"The Thirty-Six Strategies of Daily Operation and Maintenance" is the first set of strategies in the "Thirty-Six Strategies of DevOps". The original intention behind writing it is very simple: to establish correct and appropriate rules of daily operation and maintenance for our peers, so that operation and maintenance colleagues can create greater value in their work instead of wasting time repeatedly stepping into and filling the same pits.

We divide daily operation and maintenance work into planned tasks and unplanned tasks. These strategies are distilled from my own operation and maintenance work of more than ten years. The main purpose is to set down appropriate practical and working experience, guide each operation and maintenance team in building tools or designing processes, and effectively identify and avoid risks.

I believe some ideas in these strategies will feel familiar, but I implore everyone to read them patiently, because the repeated operation and maintenance accidents keep admonishing us: contempt for and negligence of repetitive, simple operations is a major taboo of daily operation and maintenance.

Do one line of work and love it. Since we have chosen the operation and maintenance industry, we should use our ingenuity to optimize and improve it. Become familiar with the Thirty-Six Strategies of daily operation and maintenance and say "NO" to misoperation, so that we can better handle planned operation and maintenance tasks, have more time and energy for unplanned tasks, and let operation and maintenance work enter a virtuous circle.

The DevOps Thirty-Six Strategies of Daily Operation and Maintenance:

[Images: the full list of the Thirty-Six Strategies of Daily Operation and Maintenance]


Strategy 7: Disaster emergency plans must have a drill mechanism; train an army for a thousand days to use it for one moment

In 2009 Tencent moved into the Tencent Building, and the "Goose Factory" finally owned its own office building in Shenzhen. While everyone was celebrating the housewarming, the colleagues in charge of the social networking business were anxious over a happy worry.

It was the height of the Web 2.0 wave. Qzone's farm and ranch games had detonated the nationwide "vegetable stealing" social craze, and the enthusiasm of netizens showed up fully in the number of service requests. The entire social network BU was busy, in full swing, around this hit product.

The product team was intensively launching new functions and features to attract more users, while the development and operation and maintenance teams provided technical support for the business around the clock. But the construction of machine rooms could not keep up with the speed of business growth, and the scale of the Shenzhen IDCs could no longer accommodate the rapidly growing social and gaming business.

Against this background, to meet the long-term planning and development of the business, Tencent's social network business officially launched the project of moving out of Shenzhen, and the platform-level businesses (QQ, Qzone) began to optimize and upgrade their architecture, transforming from the earlier wild growth model with no clear distribution plan into a striped, SET-based, multi-region, multi-center, remote multi-active architecture.

SET-ization is a very important concept put forward by the operation and maintenance team, and it represents a more abstract, higher-level idea of operation and maintenance management.

A SET is a collection of business modules that provide a specific functional scenario (a business module is a CMDB management concept that refers to a cluster providing a single-function service and corresponds to a batch of IPs). Take Qzone's access SET as an example: all business modules involved in getting a user to Qzone's first screen, including the access layer, logic layer, and storage layer, are classified into Qzone's access SET plan. Within the same SET, business modules must be deployed in a relatively concentrated way, so that access between modules does not generate traffic that traverses IDC floors.

SETs of the same type keep their services independent of one another, and the business architecture adopts a single-write, multi-read model, so that during disaster-tolerance and fault-tolerance scheduling a SET can serve as the smallest scheduling unit for operation and maintenance.

Each SET generally contains no more than 50 business modules, and its scale is kept within 500 devices. In planning and disaster-recovery scenarios, the SET concept reduces the number of objects the operation and maintenance team has to reason about and improves both the speed of decisions and the quality of scheduling; it is a management weapon the Tencent SNG operation and maintenance team has used again and again to cope with sudden failures.
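To make the constraints above concrete, here is a minimal sketch in Python of a SET modeled as a scheduling unit with the 50-module and 500-device limits enforced. The class and field names are hypothetical illustrations, not Tencent Zhiyun's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_MODULES_PER_SET = 50    # a SET normally holds no more than 50 business modules
MAX_DEVICES_PER_SET = 500   # and its scale is kept within 500 devices

@dataclass
class SET:
    """A SET: the smallest scheduling unit for disaster-tolerance operations."""
    name: str                   # e.g. "qzone-access-set-01" (hypothetical naming)
    region: str                 # the IDC region the SET is deployed in
    modules: Dict[str, List[str]] = field(default_factory=dict)  # module -> device IPs

    def device_count(self) -> int:
        return sum(len(ips) for ips in self.modules.values())

    def add_module(self, module: str, ips: List[str]) -> None:
        """Add a business module, refusing additions that would break the SET limits."""
        if len(self.modules) + 1 > MAX_MODULES_PER_SET:
            raise ValueError(f"{self.name}: would exceed {MAX_MODULES_PER_SET} modules")
        if self.device_count() + len(ips) > MAX_DEVICES_PER_SET:
            raise ValueError(f"{self.name}: would exceed {MAX_DEVICES_PER_SET} devices")
        self.modules[module] = ips

# Example: register an access-layer module of Qzone's access SET.
qzone_access = SET(name="qzone-access-set-01", region="shenzhen")
qzone_access.add_module("qzone-web-access", [f"10.0.0.{i}" for i in range(1, 11)])
```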

But when we first tried to introduce the SET concept, it took a long time for the SET management model to mesh with the operation and maintenance team and its systems, and Tencent SNG's SET management scheme went through several iterations before it became a magic weapon that serves the business well. All of this relied on the operation and maintenance principle we practice: "Disaster emergency plans must have a drill mechanism; train an army for a thousand days to use it for one moment."

The purpose of the SET management concept proposed by operation and maintenance is clear: distribution and disaster tolerance. Combined with the SET management of the Zhiyun operation and maintenance platform, we achieved the efficiency of "deploying one SET in one day" and "one-click migration to the cloud", winning two company-level Operational Excellence awards. But SET-based disaster-tolerance scheduling did not go nearly as smoothly at the time, because several problems still lay in front of operation and maintenance, waiting to be solved (a code sketch of how these decisions can be settled in advance follows the list):

  • Schedule or not?

There are too many core metrics for deciding whether to schedule. Which metric should prevail? When an emergency failure occurs, this is a hard decision. It must be prepared in advance, ideally with one and only one decision metric.

  • Which SET to schedule away?

When a business peak runs into insufficient dedicated-line capacity and scheduling is used to resolve it, which SET's traffic should be moved away, and how much dedicated-line capacity should be released? In an emergency this is a hard decision.

  • How to schedule?

SETs depend on each other. Should traffic be moved away from the source SET or from the failing SET in the middle? In an emergency this is yet another hard decision. It is best to have a unified decision rule: whatever form the failure takes, always move traffic away from the source SET.

  • How much to schedule?

During a scheduling operation, should a certain percentage of users be switched, or all users of the whole SET? Is the redundant capacity of the target SET sufficient? In an emergency this is a hard decision, and a wrong call on capacity can trigger a SET avalanche.

  • Who does the scheduling?

This is a blame-taking topic. The success rate of scheduling depends on many interlocking factors and tools, and any oversight can easily cause exceptions or failures during execution. Without sound safeguards, the pressure on the scheduling operator is enormous, to the point of affecting the speed of decision-making and execution.
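As a rough illustration of how these five questions can be settled in advance rather than argued over during a failure, the sketch below encodes them as one pre-agreed rule. The metric, field names, and threshold are hypothetical; this is not the actual Zhiyun scheduling logic.

```python
def decide_failover(source_set: str, target_spare_ratio: float,
                    success_rate: float, threshold: float = 0.95) -> str:
    """Turn the five scheduling questions into a single pre-agreed rule.

    success_rate is the one and only decision metric (request success rate of
    the source SET); target_spare_ratio is the target SET's spare capacity as
    a fraction of the source SET's current load. Both names are illustrative.
    """
    # 1. Schedule or not? Decide on a single metric agreed before the incident.
    if success_rate >= threshold:
        return "no-op: success rate is above the threshold"

    # 2./3. Which SET, and how? Always move traffic away from the source SET,
    #       no matter where in the dependency chain the fault sits.
    # 4. How much? Never move more than the target SET can absorb, otherwise
    #    the scheduling operation itself can trigger a capacity avalanche.
    share = min(1.0, max(0.0, target_spare_ratio))
    if share == 0.0:
        return "abort: target SET has no spare capacity"

    # 5. Who does the scheduling? The tool executes the decision, so the
    #    on-call operator is not improvising under pressure.
    return f"switch {share:.0%} of {source_set} traffic to the target SET"

print(decide_failover("qzone-access-set-01", target_spare_ratio=0.6, success_rate=0.90))
```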

However far-sighted a plan is, it always has a shelf life and limitations, and Tencent's social business SET planning was no exception. By insisting on repeated scheduling drills during business peak hours, we overcame the weaknesses in the SET disaster-tolerance scheduling plan.

After each drill at peak hours we found optimization points in planning, systems, and personnel. After continuous improvement and polishing, we arrived at the SET architecture and scheduling strategy that best meet the business requirements for quality and efficiency.

Summing up afterwards, all of this stems from one strategy: "Disaster emergency plans must have a drill mechanism; train an army for a thousand days to use it for one moment." I sincerely suggest to fellow operation and maintenance colleagues: knowledge gained on paper is always shallow; to truly understand something you must practice it. Any plan must be accompanied by continuous simulated drills before it can deliver its full effect at the critical moment.

Strategy 23: Behind every accidental failure lies an inevitable connection; find the root cause of the problem and optimize it away

Daily operation and maintenance work is sometimes messy and sometimes tedious, but if we can peel back the layers and see its essence, we will find much in common in seemingly chaotic work. With planning and proper handling, we can certainly achieve twice the effect with half the effort in both efficiency and quality.

Among the eight principles of continuous delivery in DevOps, one is "if it hurts, do it more often and bring the pain forward". We can divide operation and maintenance work into two categories: planned tasks and unplanned tasks. Planned tasks are foreseeable work that can be guarded against in advance or handled efficiently with the help of tools; unplanned tasks are firefighting work that cannot be foreseen and requires an emergency response from operation and maintenance staff every time it appears.

Seven years ago, when I was responsible for system operation and maintenance, I was given a special optimization task: reduce the number of phone calls received by each on-call operation and maintenance engineer. The task was very meaningful for both the team and individuals; if the goal were achieved, it would mean fewer phone alarms and a higher happiness index for operation and maintenance staff, so I took it on without hesitation.

At that time, Tencent SNG's on-call operation and maintenance staff were mainly responsible for responding to basic alarms: device unreachable by ping, agent report timeout, missing process/port, disk full, disk read-only, and large-scale network failures.

Alarms unrelated to business-logic anomalies were handled uniformly by the on-call staff, so that a single converged entry point could be optimized centrally and basic problems would disturb fewer people.

The optimization of on-call alarms went through three stages; let me walk you through the whole process:

  • Stage 1: configuration standardization and self-healing

At the operating scale of Tencent's tens of thousands of servers, machine outages happen all the time, and a large proportion of on-call alarms were caused by them. The common remedy for this class of problem is to safely restart the device or replace it with a new one; as long as the automated operation does not affect the business, the failure can heal itself.

Once we had sorted out the threads of the problem, we set about implementing operation and maintenance standardization and configuration management, storing key business configuration information in the CMDB, such as the architecture layer, the response level, and whether the data is stateful. Tool-driven workflows ensure that the configuration information is updated promptly as modules and devices move through their states.

The direct effect of configuration standardization is that the three types of basic alarms that can be resolved by a restart, namely ping unreachable, agent report timeout, and disk read-only, can now self-heal through automated tool workflows.
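A minimal sketch of what such a self-healing rule can look like is shown below. The alarm types match the three listed above, but the CMDB fields and the restart call are illustrative placeholders, not the actual Zhiyun interfaces.

```python
# Basic alarm types that a safe restart is known to resolve.
RESTART_RESOLVABLE = {"ping_unreachable", "agent_timeout", "disk_readonly"}

def restart_device(ip: str) -> None:
    print(f"restarting {ip} ...")   # placeholder for the real restart workflow

def try_self_heal(alarm: dict, cmdb: dict) -> str:
    """Decide whether a basic alarm can be healed automatically.

    `alarm` carries at least {"type", "ip"}; `cmdb` maps an IP to the device's
    configuration record, with hypothetical fields such as "stateful".
    """
    if alarm["type"] not in RESTART_RESOLVABLE:
        return "escalate: not a restart-resolvable alarm"

    device = cmdb.get(alarm["ip"], {})
    # Only restart automatically when the CMDB marks the device as stateless,
    # so the automated action cannot touch business data.
    if device.get("stateful", True):
        return "escalate: stateful device, manual handling required"

    restart_device(alarm["ip"])
    return "self-healed: device restarted"

print(try_self_heal({"type": "disk_readonly", "ip": "10.0.0.8"},
                    {"10.0.0.8": {"stateful": False, "layer": "access"}}))
```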

  • Stage 2: extracting common rules

For the basic alarms about missing processes/ports and full disks, our operational data showed that these problems tend to recur across a cluster. For example, if the cluster behind a module has 100 devices and one of them fills its disk with logs, there is a high probability that the other 99 devices will also fill their disks with logs.

To deal with such common basic alarms, we took the module as the management node and applied disk cleanup policies to all devices in the module's cluster. As long as any one device under a module has raised a disk-full alarm and the operation and maintenance engineer has configured the cleanup policy in that module's disk cleanup tool, the other devices under the module will be spared from disk-full alarms.

The same method of extracting common rules also applies to the basic alarms for missing processes/ports.
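A rough sketch of how a module-level cleanup rule can be expressed follows, so that one disk-full incident configures the whole cluster. The rule fields and the dispatch step are illustrative, not the actual tool.

```python
from typing import Dict, List

# Cleanup policies are keyed by module, not by individual device.
module_cleanup_rules: Dict[str, dict] = {}

def register_rule(module: str, path: str, keep_days: int, usage_threshold: int) -> None:
    """Record a cleanup rule once, at the module level."""
    module_cleanup_rules[module] = {
        "path": path, "keep_days": keep_days, "usage_threshold": usage_threshold,
    }

def apply_rules(module: str, devices: List[str]) -> None:
    """Push the module's rule to every device in its cluster."""
    rule = module_cleanup_rules.get(module)
    if rule is None:
        return
    for ip in devices:
        # In real tooling this would dispatch a cleanup job to the agent on `ip`.
        print(f"{ip}: clean {rule['path']} older than {rule['keep_days']}d "
              f"when disk usage exceeds {rule['usage_threshold']}%")

# One device under the hypothetical module "qzone-logic" filled its disk with
# logs, so the rule is registered once and now covers all 100 devices.
register_rule("qzone-logic", "/data/log", keep_days=3, usage_threshold=85)
apply_rules("qzone-logic", [f"10.0.1.{i}" for i in range(1, 101)])
```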

  • Stage 3: correlation analysis and root tracing

Large-scale network failures usually mean failures of core switches or power loss in certain machine rooms. Because a network-level failure affects many downstream devices, the symptom is an unending flood of alarms, and the on-call staff suffer badly.

Using the CMDB's records of each device's associated network devices and uplink switch, we added a dedicated second-level alarm convergence module when designing the alarm architecture. Its main logic is to cluster alarm content by machine room and network device; if the alarms share a common cause, they are converged and escalated as one. This reduces both the number of alarms sent and the overall alarm volume.
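A simplified sketch of that second-level convergence logic: group incoming basic alarms by the machine room and uplink switch recorded in the CMDB, and replace any large group with one escalated network alarm. The field names and threshold are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CONVERGE_THRESHOLD = 20   # hypothetical: this many devices behind one switch alarming together

def converge(alarms: List[dict], cmdb: Dict[str, dict]) -> List[dict]:
    """Collapse per-device alarms that share a machine room and uplink switch."""
    groups: Dict[Tuple, List[dict]] = defaultdict(list)
    for alarm in alarms:
        device = cmdb.get(alarm["ip"], {})
        key = (device.get("idc"), device.get("uplink_switch"))
        groups[key].append(alarm)

    converged: List[dict] = []
    for (idc, switch), group in groups.items():
        if idc and switch and len(group) >= CONVERGE_THRESHOLD:
            # Escalate a single network-level alarm instead of paging per device.
            converged.append({"type": "suspected_network_failure", "idc": idc,
                              "uplink_switch": switch, "device_count": len(group)})
        else:
            converged.extend(group)   # small groups pass through unchanged
    return converged
```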

The on-call phone alarm optimization project ended many years ago, and today more than 90% of basic alarms are handled by self-healing. Looking back on the optimization approach and execution of that project, the strategy "behind every accidental failure lies an inevitable connection; find the root cause and optimize it away" provided excellent guidance for its smooth progress.

In daily operation and maintenance work, I advise everyone not to neglect a task because it seems small. Draw fully on the knowledge and techniques accumulated in operation and maintenance scenarios, think along the threads, follow the vine to the melon, find the common or similar points behind failures, and then raise the solution to a higher level; you may well gain twice the result with half the effort.

Strategy 11: For irreversible deletion or modification operations, delay or slow down execution as much as possible

"The three axes of operation and maintenance: restart, reinstall, roll back" is a self-deprecating joke that circulates in operation and maintenance circles. Joke or not, it captures the emergency measures much of the industry takes to fix problems quickly. Among them, rollback is often the effective means of recovery when an anomaly appears after a change.

Internet companies at home and abroad have all had operational accidents triggered by improper operations by technical staff, and the reason these accidents could not be fixed quickly is that the change could not be rolled back. These painful cases keep replaying. For example, in early 2017 a GitLab operation and maintenance engineer, switching between multiple terminals, mistook the production environment for the test environment and deleted the production database by human error.

Although GitLab handled the public relations of this major incident calmly and gracefully, apologizing publicly and live-streaming the entire data recovery process, and won the forgiveness of most people, the incident still cost more than 700 GitLab users their data.

In my nearly ten years of operation and maintenance work at the Goose Factory, similar low-level human errors have not been rare. Reflecting on these failures, is the root cause that our technology is too poor? Obviously not. For operation and maintenance, our first priority is to guarantee business quality, and only then do we pursue higher efficiency, lower cost, and so on.

Operation and maintenance is a technical team, but technology alone cannot solve every problem; we also need strict discipline and rules. As the strategy in "The Thirty-Six Strategies of Daily Operation and Maintenance" says, "for irreversible deletion or modification operations, delay or slow down execution as much as possible". This is precisely a strategy that emphasizes discipline.

Not only GitLab: in Tencent's massive operation and maintenance work we also paid the price of many painful failures before we "grasped" this truth.

Over an application's life cycle, as businesses go online and offline, devices (physical machines and virtual machines) have applications and data installed or removed according to the life-cycle stage they are in, so that resources can circulate effectively.

Many failure-inducing "pits" lie buried in this process of bringing devices online and taking them offline; a moment of carelessness can set off a mine, and the operator ends up taking the blame. Taking Tencent's experience as an example, the human errors that easily occur when isolating and decommissioning a device include:

Deleting the wrong program or the wrong data. The tool runs with too much confidence and executes too fast, but a momentary slip picks the wrong target and an operational accident results.

Copying the wrong IP and decommissioning the wrong device. Copy and paste improves efficiency, but is the pasted content accurate every time? Is what is on the clipboard really the IP to be taken offline? This is another frequent source of failures.

Taking a service offline before traffic is fully drained. Operating on wishful thinking, not watching long enough after switching traffic before hastily taking the service offline or deleting data, is an outright human fault.

To thoroughly reduce the risks that irreversible operations such as deletion and decommissioning can cause, we built operation and maintenance discipline and rules into our tools. Taking the tool workflow for taking a service offline as an example (see the figure below), the standardized execution workflow consists of seven steps; a condensed code sketch of the whole flow follows the list:

[Figure: the seven-step service offline workflow]

  • Verify the module and IP. To eliminate copy-and-paste mistakes, every operation must check the mapping between module and IP, and taking a device IP offline follows the module's permission management, which essentially shields against the risk of misoperation.

  • Remove from the name service. A large share of Tencent's internal services use a name service for load balancing and high availability. Before any service or device is deleted, the offline workflow goes through a unified interface layer to remove it from the name service in batch, switching traffic away.

  • Stop the service. Stop the processes and ports to verify that the traffic has been fully drained; if access requests remain, business monitoring alarms will detect them and the offline workflow will be aborted.

  • Isolate with iptables. An iptables-based firewall policy blocks all access except the operation and maintenance management ports, ensuring the device no longer provides any external service. If anything abnormal occurs, business monitoring alarms will detect it and the offline workflow will be aborted.

  • Capture and observe automatically. Start tcpdump to capture packets automatically, find access sources other than the operation and maintenance management source IPs, and analyze them. If normal business access requests are found, the offline workflow is suspended or aborted.

  • Isolation period. Based on CMDB configuration management, devices in different modules, architecture layers, and service levels get different isolation periods: 2 days for fast offline, 7 days for ordinary offline, 1 month for database offline, and so on.

  • Reinstall/destroy. The last step of going offline is to automatically reinstall the operating system or destroy the virtual machine using tools.
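To show how the seven steps above can be enforced by tooling rather than by an operator's memory, here is a condensed sketch of the flow as a guarded pipeline. The step functions are placeholders for the real Zhiyun tools; only the ordering, the abort conditions, and the isolation periods follow the description above.

```python
ISOLATION_DAYS = {"fast": 2, "ordinary": 7, "database": 30}   # derived from the CMDB service class

def decommission(ip: str, module: str, offline_class: str = "ordinary") -> None:
    """Run the seven-step offline workflow, aborting at the first sign of live traffic."""
    verify_module_and_ip(module, ip)             # 1. guard against copy-paste mistakes
    remove_from_name_service(module, ip)         # 2. drain traffic via the name service first
    if not stop_service_and_check_quiet(ip):     # 3. stop processes/ports, watch business alarms
        raise RuntimeError("abort: traffic still arriving after the service stop")
    isolate_with_iptables(ip)                    # 4. block everything except management ports
    if capture_shows_business_traffic(ip):       # 5. tcpdump for unexpected access sources
        raise RuntimeError("abort: packet capture shows business requests")
    wait_days = ISOLATION_DAYS[offline_class]    # 6. cool-off period before anything irreversible
    print(f"{ip}: isolated, reinstall/destroy deferred for {wait_days} days")
    reinstall_or_destroy(ip)                     # 7. the only irreversible step, done last
                                                 #    (real tooling runs it after the wait)

# Placeholder step implementations so the sketch runs end to end.
def verify_module_and_ip(module: str, ip: str) -> None:
    print(f"verified that {ip} belongs to module {module}")

def remove_from_name_service(module: str, ip: str) -> None:
    print(f"{ip} removed from the name service, traffic switched away")

def stop_service_and_check_quiet(ip: str) -> bool:
    print(f"{ip}: processes and ports stopped")
    return True    # True means monitoring saw no remaining requests

def isolate_with_iptables(ip: str) -> None:
    print(f"{ip}: isolated by iptables, only management ports open")

def capture_shows_business_traffic(ip: str) -> bool:
    return False   # False means tcpdump found no business access sources

def reinstall_or_destroy(ip: str) -> None:
    print(f"{ip}: operating system reinstalled / virtual machine destroyed")

decommission("10.0.0.8", "qzone-web-access", offline_class="fast")
```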

What this Tencent operation and maintenance case hopes to convey is one truth: there are many high-risk and irreversible operations in operation and maintenance work, and they require both technology and discipline to ensure that our work enters a virtuous circle and keeps getting better.




Origin blog.51cto.com/15127563/2665771