Don't Be the Scapegoat: An Essential Guide to Operations and Maintenance Work

As data centers keep growing in scale and new technologies are updated iteratively, the services carried by data center networks have become very complex. To keep up with these services, data center networks are constantly updated and changed, which makes operation and maintenance work much harder.

Data center downtime incidents are bound to happen. They not only increase the workload of operation and maintenance personnel but, more importantly, cause huge losses for the data center. Even the world's leading Internet giants often receive this "treatment."

Internet giants keep going down, and operations work runs into trouble

On the morning of March 3, Alibaba Cloud suffered an outage, and the websites and apps of businesses and Internet companies that had purchased Alibaba Cloud services stopped working properly. A large wave of programmers, operators, and operations engineers had to get out of bed. Commenting on the outage, 58 senior architect Shen Jian said the incident lasted about three hours, followed by two hours of observation.

At 3:43 on May 3, Microsoft Azure suffered a large-scale outage worldwide. The incident lasted nearly two hours, and service was not fully restored until 5:30. Microsoft's main services, including Microsoft 365, Dynamics, and DevOps, were all affected.

According to news on June 25, Amazon's official website confirmed a cloud computing service outage that affected network connectivity for users and for multiple AWS regions. The failure was located in AWS US-East-1. A total of 33 services were affected, nine of which were completely interrupted.

Downtime incidents push operations difficulty "to a higher level"

Outage after outage has demonstrated the importance of data center operation and maintenance work, yet outages seem impossible to avoid. Today, with technological progress and the arrival of the Internet of Everything era, data centers play an important role as critical infrastructure. Although data centers have been developing in China for only about ten years, they have moved from the ordinary machine-room era, with nothing but UPS units, air conditioning, and IT equipment, to a new era that spans the Internet, big data, AI, cloud services, and full-service offerings. In this new era, facilities routinely hold tens of thousands of cabinets, and new technologies such as free cooling, fan walls, underwater data centers, and liquid-cooled servers keep being created and applied. As a result, operations management faces greater challenges, and its difficulty has gone "to a higher level."

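First, hyperscale data centers bring changes in staffing, organization, and efficiency. In the past, a manual inspection of a data center of under 10,000 square meters took 2 to 4 hours. Today's facilities span hundreds of thousands of square meters and need more operations staff spread across different areas of responsibility, which raises both the difficulty and the cost of management. Second, voltage levels are higher and safety risks have grown. Operations staff used to work with low voltage, whereas power supply equipment, generators, and chillers now all run on high voltage, so maintenance safety requirements are stricter. In addition, concentrated scale means concentrated risk, and the impact of an incident is greater. The data center outages described above, for example, interrupted services and applications across large parts of the world and caused heavy losses, so the pressure on operations management is unprecedented.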

Reduce human error and raise the professional skills of operations management

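According to survey data, 70% of data center outages are caused by human error. As data centers keep growing, operations staff therefore need to improve their own skills and professional level to cope with unexpected incidents: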

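  • Build a complete skills evaluation system that assesses the skills of operations staff from multiple angles. This effectively helps staff improve their operations skills and encourages them to learn on their own initiative and keep improving.

  • Online learning of operations experience: build an operations knowledge base and an online platform for sharing and exchanging operations experience, providing a channel for online practice and study of operations knowledge.

  • Online simulation of hands-on environments: provide simulated environments for practicing operations tasks, effectively isolating operational risk and helping staff quickly raise their practical ability.

  • Online assessment of theoretical skills: drawing on a large question bank covering IT cloud platform components, hold regular exams with randomly selected questions to achieve real-time, automated online assessment of theoretical operations knowledge.

  • Online assessment of hands-on skills: build lightweight online environments for operations tasks and programming to achieve real-time, automated online assessment of operations and development skills.

  • Improve efficiency through automated assessment: assess theoretical and hands-on operations skills online, scientifically and automatically, to raise assessment efficiency and ensure that abilities are reflected objectively and fairly.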
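
The theory assessment described in the list above (a question bank, regular exams, randomly selected questions, automatic scoring) could look roughly like the following. This is only a minimal sketch: the question bank, the function names draw_exam and score_exam, and the answer format are all hypothetical and are not taken from any real assessment platform.

```python
import random

# Hypothetical question bank: question id -> (topic, correct answer).
QUESTION_BANK = {
    "q1": ("network", "B"),
    "q2": ("storage", "A"),
    "q3": ("power", "D"),
    "q4": ("cooling", "C"),
    "q5": ("monitoring", "A"),
}


def draw_exam(bank, size, seed=None):
    """Randomly pick `size` distinct question ids from the bank."""
    if size < 1 or size > len(bank):
        raise ValueError("size must be between 1 and the bank size")
    rng = random.Random(seed)  # a seed makes a draw reproducible
    return rng.sample(sorted(bank), size)


def score_exam(bank, question_ids, answers):
    """Return the fraction of drawn questions answered correctly."""
    if not question_ids:
        raise ValueError("an exam needs at least one question")
    correct = sum(
        1 for qid in question_ids if answers.get(qid) == bank[qid][1]
    )
    return correct / len(question_ids)


if __name__ == "__main__":
    exam = draw_exam(QUESTION_BANK, size=3)
    # Simulate a candidate who answers "A" to every question.
    submitted = {qid: "A" for qid in exam}
    print(exam, score_exam(QUESTION_BANK, exam, submitted))
```

Drawing without replacement keeps each exam free of repeated questions, and scoring against the stored answers removes the human grader from the loop, which is the point of the "objective and fair" requirement in the last item.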

Intelligent operations emerges to fill the gaps in manual operations

Today, the digital age has arrived, and data center scale and capacity are multiplying. Operations management has become correspondingly more complex and difficult. Having evolved from script-based operations through tool-based operations to platform-based operations, human effort is close to its limit, and intelligent operations has emerged in response. More and more data center companies, such as Tencent, Huawei, and Jingdong, have begun to invest more in R&D and join the wave of intelligent operations. Intelligent operations combines artificial intelligence with operations work: building on existing operations data (logs, monitoring information, application information, and so on), it uses machine learning and similar methods to improve operations efficiency and gradually replace manual operations. I believe data centers will become more and more intelligent.
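
As a rough illustration of what working from existing monitoring data can mean in practice, here is a minimal sketch that flags unusual readings in a metric series. It uses a simple rolling z-score, a basic statistical check standing in for the machine learning methods the article mentions. The function name find_anomalies, the window size, the threshold, and the sample CPU readings are all assumptions made for this example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the mean of the
    preceding `window` readings by more than `threshold` standard
    deviations of those readings."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical CPU utilisation readings (percent), one per minute.
    cpu = [41, 43, 42, 40, 44, 42, 43, 95, 42, 41]
    print(find_anomalies(cpu))  # prints [7], the 95% spike
```

A production system would more likely learn seasonal patterns and correlate several signals, but even this check shows how monitoring data can raise an alert without a person watching a dashboard.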


Origin blog.51cto.com/14535459/2440052