A 20-year operations veteran at Bank of China | Ride the wave of the times and become a leading AIOps practitioner

The changes of the times, the dreams of people, these are unstoppable.

 

In the fast-changing IT industry, new technologies and new ideas keep emerging. For operations, stability and innovation are perennial themes. Automated operations, DevOps, and AIOps are the directions in which operations is heading. So how do you follow the trend and put AIOps into practice? How do you keep your enthusiasm and energy for technology?

 

Recently, the community interviewed Mr. Yuan Chunliang, a system analyst in the Maintenance Department of the Software Center of Bank of China and a veteran with 20 years of experience. He shared his journey from development to operations, and from automated operations to AIOps practice.


image

Yuan Chunliang (left) is interviewed by Xiao Tianguo (right)


Twelve years at Bank of China

Development and operations are closely connected


Yuan Chunliang completed his postgraduate degree in 1999 and has worked in software development ever since. In 2007 he joined the Software Center of Bank of China (hereinafter “Bank of China”) to work on the development and maintenance of core banking systems. Over more than 20 years he has witnessed first-hand the spread and rapid development of computer technology. In recent years especially, he has felt the unprecedented influence of technology on the development of the financial industry.

 

At the beginning of 2014, Bank of China carried out an organizational restructuring and moved the maintenance department, originally part of the data center, into the software center, which brought operations and development closer together. At the same time, against the backdrop of new technology waves such as artificial intelligence and big data, the move further promoted the application of new technologies in application maintenance.

 

At the beginning of 2016, Mr. Yuan moved from the development department to maintenance, a key turning point in his career. In the bank, development and operations are separate departments, but the two are closely related in technology and workflow. In the maintenance department, Mr. Yuan could focus more directly on front-line operations of the production systems and think more deeply about the relationship between production safety and operational efficiency. Mr. Yuan is gratified that with the rise of DevOps and AIOps, the Bank of China Software Center, coordinated by the Quality Management Department, has brought maintenance and development closer together. At the same time, maintenance work has become more efficient and less risky, and now faces a rare opportunity for technological transformation. Mr. Yuan's many years of development experience have found even more use in maintenance, and he can keep enriching his career as new technologies arrive.


A natural progression

Banking is an ideal scenario for AIOps practice


The transformation of the core banking system architecture is the beginning of the practice of AIOps


In 2017, Bank of China began migrating its mainframe core banking system down to x86. The bank's core system now runs a centralized architecture and a distributed architecture side by side. While the cost of system resources such as MIPS fell sharply, system stability and security improved significantly. For maintainers familiar with the centralized architecture, the newly launched distributed core banking system was something entirely new. Faced with hundreds of virtual machine nodes and a large amount of open-source software, tasks such as batch production, daily changes, and emergency handling all needed efficient and safe operations tools.


Mr. Yuan recalled, “We started studying automation tools as soon as work on the distributed core system began, such as Dubbo-based tools for automatic transaction statistics and monitoring and QREP-based tools for automatically detecting data synchronization delay, to solve problems in production operations.”


The migration of the mainframe core banking system opened the way for a large-scale architectural transformation at Bank of China, and a wave of innovation in distributed-system operations followed. For example, the maintenance department independently developed many tools, such as transaction statistics and monitoring adapters and automatic collection of node application information, and worked with the development department to design the automatic transaction switching mechanism and the distributed disaster recovery architecture. In these innovations, maintainers used a variety of new technologies, including machine learning, neural networks, mobile computing, and big data, and achieved good results. Reflecting on this work, Mr. Yuan said, “The transformation of the core banking system architecture was the real beginning of our AIOps practice.”




Mr. Yuan explained that the goal of Bank of China's AIOps work is to improve efficiency and reduce risk.

 

In the daily operation of large-scale x86 distributed systems, maintenance staff proposed a new mechanism, thresholdless intelligent monitoring, to better solve the problems they faced in monitoring. After the move to x86, a single system grew from a few LPARs to hundreds or even thousands of virtual machine nodes; multiply that by dozens of systems and the surge in daily operations workload is easy to imagine. The maintenance department therefore made many innovations in daily operations, achieving centralized, one-click operation and greatly reducing the operations workload.

 

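In monitoring, however, Mr. Yuan saw a brand-new challenge that could not be met by incrementally improving traditional methods. It is not only that distributed systems are huge: each node plays a different role, so monitoring requirements vary widely. More importantly, a distributed system is dynamic. Node versions iterate frequently, and virtual machine resources may be adjusted in response to load. Clearly, traditional fixed-threshold monitoring cannot cope with this situation. In response, operations staff designed a completely new mechanism: thresholdless intelligent monitoring.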

 

Thresholdless intelligent monitoring in practice


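“Banking is a very good scenario,” Mr. Yuan said when discussing thresholdless intelligent monitoring. Banking business has very pronounced time-series characteristics, including periodicity, trends, and special points in time, and it is very stable. As a result, the back-end systems that serve it also show pronounced time-series characteristics.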


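According to Mr. Yuan, the thresholdless intelligent monitoring system has three properties. First, it is intelligent: it automatically judges from historical operating data whether the system is currently running normally, and it computes a risk probability for each monitored item. Second, it is adaptive: it adjusts automatically as the system changes, quickly learning the new operating characteristics after a change and monitoring effectively on that basis. Third, it is “non-invasive”: it introduces no risk to the bank's production systems.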

 

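The core techniques of the thresholdless intelligent monitoring system are three kinds of models: time-series prediction models, risk identification models, and adaptive alerting models. At this year's GOPS conference in Shanghai, Mr. Yuan shared the modeling details, including how the models work together and the problems encountered in large-scale application and their solutions.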
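To make the idea concrete, here is a minimal sketch of how prediction, risk scoring, and adaptation can fit together. It is not Bank of China's implementation, whose models were not published in this interview. It assumes a known business cycle length, roughly normal deviations around a per-slot baseline, and exponential smoothing as the adaptive step; the class name `ThresholdlessMonitor` and all parameters are hypothetical.

```python
import math


class ThresholdlessMonitor:
    """Toy seasonal baseline that learns from history and adapts over time."""

    def __init__(self, period, alpha=0.2, alert_risk=0.99, min_history=5):
        self.period = period            # samples per business cycle (assumed known)
        self.alpha = alpha              # how fast the baseline follows new behavior
        self.alert_risk = alert_risk    # risk score at or above which we alert
        self.min_history = min_history  # samples per slot needed before scoring
        self.mean = [0.0] * period      # expected value per slot in the cycle
        self.var = [0.0] * period       # smoothed squared deviation per slot
        self.count = [0] * period       # samples seen per slot

    def risk(self, t, value):
        """Return a score in [0, 1]: how unusual `value` is for time index `t`."""
        slot = t % self.period
        if self.count[slot] < self.min_history:
            return 0.0  # not enough history to judge
        std = max(math.sqrt(self.var[slot]), 1e-9)
        z = abs(value - self.mean[slot]) / std
        # Probability that a standard normal deviate is smaller in magnitude
        # than z (normality is an assumption of this sketch).
        return math.erf(z / math.sqrt(2))

    def observe(self, t, value):
        """Score one sample, then fold it into the baseline (the adaptive step)."""
        score = self.risk(t, value)
        slot = t % self.period
        if self.count[slot] == 0:
            self.mean[slot] = value
        else:
            diff = value - self.mean[slot]
            self.mean[slot] += self.alpha * diff
            self.var[slot] = (1 - self.alpha) * (self.var[slot] + self.alpha * diff * diff)
        self.count[slot] += 1
        return score, score >= self.alert_risk
```

No fixed threshold on the metric itself appears anywhere: the only tunable is the risk level at which to alert. Because every sample is folded into the baseline, a lasting change in behavior triggers alerts at first and is then absorbed as the new normal.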

 

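Bank of China's thresholdless monitoring system covers three kinds of metrics:

    1. System level, such as CPU, memory, number of database connections, MQ depth, and disk space.

    2. Application level, including TPS, transaction response time, and transaction success rate.

    3. Business level, such as customer growth and foreign exchange rate fluctuations.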

 

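Mr. Yuan gave an example. On one occasion, thresholdless monitoring detected an abnormal rise in disk space consumption and raised an alert. Maintenance staff checked promptly and found that the open-source software ZooKeeper on one node was writing logs nonstop and consuming a large amount of file system space. Further analysis showed that a dropped connection had triggered a ZooKeeper bug. As an emergency measure, the affected node was restarted, which resolved the problem temporarily and eliminated a potential safety hazard. The problem was later fixed for good by upgrading ZooKeeper.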


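Most open-source software originates from internet companies, is not tailored to banks, and lacks support from specialist vendors. That makes it a risk point for banking systems, and maintenance staff need to pay close attention to how it is running. “Thresholdless intelligent monitoring helps us a great deal here. As soon as there is the slightest sign of trouble in open-source software, we know about it right away.”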

 

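The Bank of China Software Center has been using thresholdless intelligent monitoring for more than a year and has rolled it out to 8 important production systems with good results. It detects risks early, so that production incidents are eliminated before they develop, and it has also greatly reduced the number of alerts and the false-alarm rate, easing the pressure on on-duty colleagues.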

 

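Thresholdless intelligent monitoring is an important part of the Software Center's work to put AIOps innovation into practice. It has drawn attention from leaders at the head office and the Software Center, and the Software Center has made it a key topic in its engineering activity management area and set up a dedicated project to refine and promote it further.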

 

Multi-indicator intelligent monitoring tools for online systems


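At the same time, Bank of China's maintenance department carried out foundational research on multi-indicator intelligent monitoring, such as the aggregation of multidimensional time-series data and the distribution of anomaly features in distributed scenarios. On this basis it has developed a series of intelligent monitoring tools for specific scenarios, such as multidimensional joint monitoring of online systems and the use of feedforward neural networks to monitor and predict MQ queue depth in complex scenarios, all of which are already in production use. Unlike typical single-indicator monitoring, these are highly flexible multi-indicator techniques that predict well in certain complex scenarios. Mr. Yuan sees this as another highlight of the maintenance department's independent innovation in intelligent monitoring.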
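As a small illustration of why joint monitoring can catch what single-indicator monitoring misses, the sketch below scores a pair of metrics together using the Mahalanobis distance. This is a technique chosen here for illustration, not the method the maintenance department used (the interview mentions feedforward neural networks and does not give details). The function names and the TPS/CPU pairing are assumptions.

```python
import math


def fit_joint_baseline(samples):
    """Estimate means and covariance from (x, y) pairs seen in normal operation."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return mx, my, sxx, syy, sxy


def joint_distance(baseline, x, y):
    """Mahalanobis distance of (x, y) from the baseline.

    Uses the closed-form inverse of the 2x2 covariance matrix, so it assumes
    the two metrics are not perfectly correlated (non-zero determinant).
    """
    mx, my, sxx, syy, sxy = baseline
    det = sxx * syy - sxy * sxy
    dx, dy = x - mx, y - my
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)
```

If CPU normally rises with TPS, a reading of low TPS with high CPU can sit inside the usual range of each metric taken alone and still be far from the joint baseline, which is the kind of case a single-indicator check would miss.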

 

End-to-end intelligent ticket handling system


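The rise of new technologies in recent years has fired the technical enthusiasm of Bank of China's maintenance staff, who have researched a range of applications of new technology in maintenance and achieved some results. In ticket handling, for example, the maintenance department responded to growing ticket volumes and staff shortages by developing an “end-to-end intelligent assistance system for ticket handling.” The system uses artificial intelligence algorithms to build a ticket classification model that helps handlers quickly determine which system and module a ticket belongs to, identify the problem type, and get a recommended resolution. Throughout the ticket workflow, the system also helps operations managers classify, dispatch, and send reminders for tickets automatically, greatly improving handling efficiency. Mr. Yuan took part in this work from start to finish, and the results have been published in 《中国金融电脑》 (Financial Computer of China).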
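The interview does not say which algorithm the ticket classification model uses. As a rough illustration of the idea, the following sketch trains a multinomial naive Bayes classifier on ticket text and predicts the owning system. The class name, the sample tickets, and the labels are all made up for the example.

```python
import math
from collections import Counter, defaultdict


class TicketClassifier:
    """Multinomial naive Bayes over whitespace tokens (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()            # tickets seen per label
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.label_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in words:
                score += math.log((counts[w] + 1) / denom)  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training tickets and system labels.
clf = TicketClassifier()
clf.train("payment transaction timeout on settlement batch", "core-banking")
clf.train("batch job failed during end of day settlement", "core-banking")
clf.train("customer cannot log in to mobile app", "mobile-banking")
clf.train("mobile app crashes after login", "mobile-banking")

print(clf.classify("settlement batch timeout"))  # core-banking
print(clf.classify("mobile app login error"))    # mobile-banking
```

The predicted label could then drive automatic dispatch, for example by looking up the team responsible for that system. A production system would need real tokenization (especially for Chinese text) and far more training data.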

 

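In addition, Bank of China has carried out research on big-data algorithms, applied in areas such as predicting customer risk levels and rapid mining of core transaction data.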

 

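Looking ahead in the cloud era: meeting change with constancy


According to Mr. Yuan, Bank of China will continue to push AIOps into practice, for example by building a unified data source for application maintenance and a microservice platform for common operations scenarios, to improve the functionality and efficiency of its existing AIOps capabilities. The bank is now in a period of digital strategic transformation, and how to carry out maintenance work for cloud scenarios is the next question it must actively consider.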


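Whether existing systems migrate to the cloud or new systems are built on it, the way products run will change greatly, and operations processes will be adjusted accordingly. Here the maintenance department hopes to draw on the industry's SRE practices, build a “proactive operations” mindset, and strengthen its operations development capability, so that it can meet constant change from a stable footing.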

 

Steady accumulation, never slacking

Showing true value


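Having started out in 1999, Yuan Chunliang has now worked in the industry for 20 years. Looking back, he feels that operations is an extremely important job for a bank, and that maintenance staff carry the heavy responsibility of keeping systems secure and stable. Operations work is a real test of physical stamina and energy, and overtime and all-nighters are the norm. At the same time, maintenance work is very good training: year after year it builds technical skill while also cultivating a sense of responsibility, the composure to stay calm in a crisis, and a sharp eye for detail. Solving problems improves oneself and brings a sense of personal fulfillment.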

 

Memorable difficulties and challenges


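No job is all smooth sailing, and Mr. Yuan has met many difficulties and challenges in his work, each time actively looking for a solution and getting through. He recalled that during the work on thresholdless intelligent monitoring, the hardest stage was feature engineering. In mid-2018, to determine the relationship among TPS, time, and CPU on virtual machine nodes in the distributed system, and with no reference material to draw on, Mr. Yuan tried many models, including linear, polynomial, and exponential models. In the end, analysis of a large amount of real data established which model suited nodes in each role, along with an algorithm for selecting the model automatically. The time-series modeling that followed was also very difficult.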


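Existing open-source tools could not be used as they were; they needed a lot of optimization and changes to interface source code. With documentation lacking, he had to work everything out on his own, putting in overtime day after day, which was a severe test. Mr. Yuan told us with a smile, “I saw its value, so I was able to commit to cracking the technical problems, even if I was the only one, even if I am no longer young.”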
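The model-selection step described above (trying linear, polynomial, and exponential fits and picking one automatically) could look roughly like the following. The interview does not describe the actual algorithm, so this is only a sketch under assumptions: candidates are fitted by least squares, compared on held-out points, and the simplest model within a tolerance of the best wins. All names are hypothetical, and inputs are assumed to be scaled to a modest range (for example TPS in thousands), since the normal equations are poorly conditioned for large values.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    a = [[sum(x ** (i + j) for x in xs) for j in range(degree + 1)]
         for i in range(degree + 1)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(degree + 1)]
    return solve(a, b)


def poly_model(degree):
    def fit(xs, ys):
        c = polyfit(xs, ys, degree)
        return lambda x: sum(ck * x ** k for k, ck in enumerate(c))
    return fit


def exp_fit(xs, ys):
    """Fit y = a * exp(b * x) via a straight-line fit to log(y); needs y > 0."""
    c0, c1 = polyfit(xs, [math.log(y) for y in ys], 1)
    return lambda x: math.exp(c0 + c1 * x)


# Ordered simplest first (two parameters before three).
CANDIDATES = [("linear", poly_model(1)),
              ("exponential", exp_fit),
              ("polynomial", poly_model(2))]


def rmse(model, xs, ys):
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def select_model(xs, ys, tolerance=0.05):
    """Fit on even-indexed points, validate on odd-indexed ones."""
    train_x, train_y = xs[0::2], ys[0::2]
    valid_x, valid_y = xs[1::2], ys[1::2]
    errors = {name: rmse(fit(train_x, train_y), valid_x, valid_y)
              for name, fit in CANDIDATES}
    limit = min(errors.values()) * (1 + tolerance) + 1e-9
    # The first candidate close enough to the best validation error wins.
    return next(name for name, _ in CANDIDATES if errors[name] <= limit)


tps = [0.1 * k for k in range(1, 11)]  # hypothetical TPS, in thousands
print(select_model(tps, [5 + 40 * x for x in tps]))             # linear
print(select_model(tps, [2 + 3 * x + 40 * x * x for x in tps]))  # polynomial
print(select_model(tps, [2 * math.exp(3 * x) for x in tps]))     # exponential
```

Validating on held-out points matters here: on the training data alone, a quadratic can never fit worse than a straight line, so comparing training error would always favor the more complex model.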




How young operations engineers can grow


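When Yuan Chunliang talks about his AIOps work he is as animated as ever, as if it all happened yesterday, his enthusiasm undimmed. For young people wondering how to do operations work well, he shared his hard-won experience.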

 

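Doing operations work really well certainly requires solid technical skills and long experience. Young people just entering maintenance can quickly master the basic operations techniques of their area and build up experience, as long as they are conscientious and steady. But the most important thing is to keep a strong sense of responsibility at all times. Always staying mindful of security and compliance is a bottom line that every operations engineer must hold.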


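Second, keep building your own technical depth. Do not think of yourself as an operator. Work to understand the architecture inside and outside the system: the inside of a system is not a black box, and systems are not isolated from one another. Find the connections between them, understand each system's internal structure and external relationships and what a failure might do to other systems, and build your own big-picture view and mental map.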


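In this era of new technology, look for manual operations that can be automated, while also paying attention to the risks that automation or intelligent tooling may bring and how to respond to them.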


Operations people take center stage


At the recent GOPS Global Operation and Maintenance Conference 2019 in Shanghai, Mr. Yuan gave an excellent talk sharing Bank of China's practical AIOps experience, and he was very happy to exchange ideas with so many experts and peers in the operations industry. Mr. Yuan said that GOPS is a good event for operations people to step out from behind the scenes and stand at center stage, which helps them realize their personal value and ignites everyone's enthusiasm!



   

Throughout the interview, we could feel Mr. Yuan's enthusiasm for his work and the enjoyment he takes in it. Whether in development or operations, he has always followed the trends of the times, worked hard to learn and practice, done things of value to the company, and realized his own value along the way!

 

We hope Mr. Yuan Chunliang's story inspires you. We live in the best of times: the vigorous growth of the IT industry offers each of us a limitless stage. We also live in the most challenging of times: the road ahead is still unclear and the capital winter has not yet passed. Are you ready?




Origin blog.51cto.com/15127563/2664959