How to manage operations and maintenance as projects

About the author

Chen Yitai, Senior Manager, System Operations and Maintenance Department, Vancl Eslite

He currently works in operations and maintenance (O&M) at Vancl, where he is responsible for the technical operation of the IDC computer room and the website business, as well as O&M of the company's internal IT systems and network.
He has worked in IT infrastructure for more than ten years. He previously did IT planning and construction for small and medium-sized enterprises in various industries at the Wuhan Microsoft Technology Center. He later joined Vancl Eslite and took a deep part in building the company's system and network infrastructure.

Introduction

As the main technical lead for enterprise IT, he gradually built an enterprise IT system that supports tens of thousands of employees across the country. He has been responsible for Internet O&M and for both the external and internal networks, has dealt with the relationship between Party A and Party B, and has deep insight into technology application and management practice.

Let me share some of my rough ideas about O&M management, or enterprise IT management, and how I have applied them.

How did I think about operation and maintenance work in the past?

Earlier, I heard Liu Qitong of Tencent say that O&M is technical operations. I thought that was a great way to put it: it sounds lofty, and the vision is appealing. I used to think O&M meant looking after a pile of network equipment, servers, operating systems, and application software so that they run efficiently, securely, and stably.

In O&M, it is common to work up a sweat racking a server. When he said that, I thought of website operations, business operations, the operation of large projects... In any case, the word "operations" feels grander. It makes the sweat we are forced to shed in O&M seem like something to be proud of.

My simple classification of O&M work

Repetitive, monotonous work has to be faced with a detached mindset: step outside yourself in order to do it better. O&M is relatively monotonous work. Classified by technology, it includes computer room management, network management, system management, database management, and the management of various application systems.

From a management perspective, regardless of the technology involved, I divide O&M into two broad categories according to the characteristics of the work: how long it takes, how big it is, whether it repeats daily, and how urgent it is. One is daily O&M; the other is project O&M.

My classification stems from what I learned during PMP training five or six years ago. There is a passage in the PMPBOOK: “Projects originate from the organized activities of humans. With the development of human society, these organized activities gradually differentiated into two types: one is continuous and recurring activities, which people call "operations", such as the mass production of products on a company's production line; the other is temporary, one-off activities, which people call "projects".”

Daily O&M belongs to the first category. Seen as a whole, our O&M work is not a project. But how can work that is inherently repetitive be turned into one-off jobs? I remembered the concept of differentiation from school: how do you find a derivative? The two questions sparked off each other. Through differentiation, we can treat a curve as a series of short straight segments, and from those the derivative can be obtained.

What is project O&M?

In actual work, can continuous, fluctuating work be turned into projects by "differentiating" it in the same way?

I think so. By cutting out and coordinating the tasks of different stages, or the periodic tasks, a recurring piece of O&M can be divided into several small projects. The whole O&M system is then built up through the management of these small projects.

Managing small projects this way is also called task management. Task-based management helps relieve the long-term fatigue of O&M work. It also forms a rapid, iterative system that makes our methods more flexible and keeps the focus on both delivering results and the process.

Let me illustrate with a few examples. In our O&M team of several dozen people, we arrived at a shared understanding of this classification through meetings and daily work:

Further explanation of the O&M classification

Daily O&M is the work our O&M staff deal with day to day. For example:

  • System staff deal with insufficient disk space in a directory on a server;

  • Add or modify a DNS A record;

  • Computer room staff replace a faulty hard disk;

  • Network staff investigate abnormal traffic on the bandwidth of an outbound line;

  • Desktop support staff install Office for colleagues;

  • ……

In handling this kind of work, what matters is being "short, simple, and fast".

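Project O&M covers everything that is not daily O&M. It ranges from building the systems and network for an entire IDC computer room or office building down to something as small as upgrading a system kernel. A kernel upgrade counts because it touches important, critical business, or because the upgrade procedure is technically tedious and has many aspects to consider.

One point deserves emphasis: when the team hits a fault in daily O&M and resolves it quickly, the statistics sometimes show that similar faults keep recurring, and these are then raised as problems to be solved. Whether it is a real project in the textbook sense, a problem-type project, or anything else with project characteristics, we consider handling it as a project whenever it cannot be closed quickly within daily O&M.

"Project characteristics" here means that the matter is a collection of many tasks, covers a wide scope, has a far-reaching effect once completed (creating something from nothing), is planned like a project, runs for a relatively long time, and may involve many resources and people.

For other characteristics, refer to a book on project management, but do not apply them mechanically. I believe project management methods suit this kind of work well for implementation and follow-through, which is why I call it project O&M.

In short, by looking at what the different kinds of O&M work have in common, I arrived at a two-way split: daily O&M and project O&M. Every item is one or the other, which makes classification easy.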
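The two-way split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `WorkItem` fields, the `classify` function, and the one-day threshold are hypothetical, and in the article the real decision is made by people in a meeting rather than by a rule.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    title: str
    estimated_days: float            # rough effort needed to close the item
    teams_involved: int = 1          # number of teams that must take part
    critical_business: bool = False  # touches important or critical business
    recurring: bool = False          # similar faults keep showing up in the statistics


def classify(item: WorkItem, quick_limit_days: float = 1.0) -> str:
    """Return "daily" for quick, self-contained items and "project" otherwise.

    The threshold is an assumed placeholder, not a value from the article.
    """
    needs_project = (
        item.estimated_days > quick_limit_days
        or item.teams_involved > 1
        or item.critical_business
        or item.recurring
    )
    return "project" if needs_project else "daily"


dns_change = WorkItem("Add a DNS A record", estimated_days=0.1)
kernel_upgrade = WorkItem("Upgrade system kernel", estimated_days=0.5,
                          critical_business=True)
print(classify(dns_change), classify(kernel_upgrade))
```

A rule like this is best treated as a first filter that prepares the discussion, since the characteristics should not be applied mechanically.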

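How is a project initiated?

In practice there is no very precise definition, so colleagues generally find it hard to judge. But a project is a project, and there is a threshold for initiating one: whether something becomes a project is decided only after a few people discuss it. How are those few people chosen?

Certainly not as a standing project committee with lifetime membership. In principle, they are chosen according to who has a stake in the matter and what is simple and practical.

In practice, the team's regular meeting is enough. The leads of each technical area all come from technical backgrounds and can judge the direction well. An example:

We found in daily O&M that a router's CPU usage was constantly high and triggered alarms many times in a row. Diverting traffic relieved it, but at the time of the alarms the traffic load had not reached the device's design limit. Even a preliminary assessment showed that a deeper investigation was needed. Who initiates the project at this point?

  • Usually the network administrator reports the problem in a periodic work report and asks for it to be promoted to a project in order to find the root cause;

  • It may also be his manager, who notices in the daily O&M reports that the issue keeps recurring and wants it promoted to a project;

  • It may also be a server system administrator, who notices that some servers or applications have recently had high network latency, finds that the problem is serious, and raises it for initiation at a higher-level regular meeting of the O&M department.

In every case, the situation is analyzed at the regular internal technical meeting or at the O&M management meeting. The meeting roughly assesses the business impact and the main type of technology needed to solve the problem, then decides on initiation, the project owner, the approximate goals, and the start and end dates.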
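As a rough illustration of how recurring faults could be surfaced from daily O&M records before such a meeting, here is a minimal Python sketch. The record layout (`device`, `symptom`), the function name `project_candidates`, and the threshold of three occurrences are all assumptions made for the example, not details of the author's system.

```python
from collections import Counter


def project_candidates(incidents, min_occurrences=3):
    """Return (device, symptom) pairs seen at least min_occurrences times."""
    counts = Counter((i["device"], i["symptom"]) for i in incidents)
    return [key for key, n in counts.items() if n >= min_occurrences]


# Hypothetical daily O&M records
incidents = [
    {"device": "router-01", "symptom": "high CPU"},
    {"device": "web-07", "symptom": "disk full"},
    {"device": "router-01", "symptom": "high CPU"},
    {"device": "router-01", "symptom": "high CPU"},
]

candidates = project_candidates(incidents)
print(candidates)
```

The output is only a list of candidates to bring to the meeting; whether to initiate a project, and who owns it, remains a decision for the people in the room.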

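How does project work flow?

Suppose the network team decided in its internal meeting to initiate the project. The project is then staffed and handled within the network team. If it later turns out that the work affects the normal running of online business, the computer room and system teams may need to help. The root cause may even lie on a server the system team is responsible for, in which case the project is escalated to the level of the larger team.

Escalation aside, the established project owner usually stays in place. Barring special circumstances, we do not change commanders mid-battle.

During the project, management, especially the owner's direct manager, can put in extra effort to support the project owner. I think this does a great deal to develop team members' all-round technical ability and to improve collaboration across the whole team.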

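How is O&M work put into practice?

Since O&M work is divided into daily O&M and project O&M, each can be put into practice separately. The basic principle is that people must clearly understand the meaning of each piece of work, and the rules must be enforced. The best way to enforce them is to turn ideas and rules into technology.

Put plainly, "turning into technology" means managing O&M work through software systems. A vivid analogy:

When we drive, we need a strong awareness of safety (the level of ideas). We also need traffic laws (the level of rules) to guide how we drive, and roads are marked with lane lines.
Take solid and dashed lines: a solid line in the middle of the road must not be driven on or crossed. On highways, a tall, thick concrete barrier stands where the solid line would be. That concrete barrier is the extreme embodiment of turning ideas and rules into technology. A solid line cannot stop a driver who ignores the rules, but a concrete barrier can!

So ideas need to be fixed in documents, and the documents are in turn best fixed in concrete software systems that help us work more correctly.

With rules and software systems that embody the ideas, the key is to use them, every day. Also, not every cultural idea can be fixed in this way; training and communication are still needed. Both the tangible and the intangible have to be talked about, in varied ways, day after day.

Of course, the ideas and culture, the documents and rules, and the software systems cannot be perfected in a day, and once perfected they do not let us rest easy. They need everyone's collective wisdom, must keep up with the times, and must keep evolving. Openness, ambition, and exploration should themselves be part of the cultural core of a good O&M team.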

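How to do daily O&M well?

Daily O&M is the bulk of O&M work. It is trivial and usually not technically demanding, but it strongly affects the experience of customers (external users and company colleagues) and the quality of service the O&M team provides. The incident management system described in ITIL can help us manage daily O&M work.

Building on ITIL's approach to IT service management and our own business situation, the company developed its own incident management system. I think two aspects of this system are the most valuable:

1. Each team's or department's services are exposed as an interface.

The system assigns a request to the most suitable team based on the category the user selects. It works because each team has turned its responsibilities into a menu in advance, and users simply "order" what they need.

For example, if a user in the Shanghai office has a problem with Outlook, they can type "outlook" into the incident management system, find the Outlook-related service items, select one, and submit it. The system then assigns the request to the Shanghai IT desktop support team based on the attributes of the user's account.

The system sometimes assigns a request to the wrong team, and the assignee can then forward it on the user's behalf to the team they believe is correct... I even think this system should be rolled out to every department in the company rather than being limited to the technology center.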
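The "menu" idea can be sketched roughly as follows. This is not the company's in-house system: the catalog entries, team names, and the `search`, `submit`, and `reassign` functions are hypothetical and only illustrate routing by service item plus a user attribute (here, the office).

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    user: str
    service: str
    team: str
    history: list = field(default_factory=list)  # every team the ticket has been with


# Hypothetical catalog: service item -> (owning function, routed per office?)
CATALOG = {
    "Outlook problem": ("desktop-support", True),
    "DNS A record change": ("system", False),
    "Hard disk replacement": ("computer-room", False),
}


def search(keyword):
    """Find service items whose name contains the keyword (case-insensitive)."""
    return [name for name in CATALOG if keyword.lower() in name.lower()]


def submit(user, office, service):
    """Create a ticket and route it using the catalog and the user's office."""
    function, per_office = CATALOG[service]
    team = f"{function}-{office}" if per_office else function
    return Ticket(user=user, service=service, team=team, history=[team])


def reassign(ticket, new_team):
    """Let the current assignee forward a mis-routed ticket to another team."""
    ticket.team = new_team
    ticket.history.append(new_team)
    return ticket


ticket = submit("alice", "shanghai", search("outlook")[0])
print(ticket.team)
```

Keeping the routing history on the ticket makes it possible to see later which menu items are often mis-assigned and need a clearer description.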

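2. Service quality control is built into the technology.

User issues are graded by importance, and each grade has its own initial response time. If the response, or the follow-up handling, is not timely, the issue is escalated.

Escalation does not mean that an unimportant issue becomes important. It means that, whatever the issue, a late response is reported level by level to the handler's manager, and even to that manager's manager.

There are also statistical reports that track the number and quality of incidents handled by each person and team. So individuals and teams alike feel as if a whip were cracking behind them.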
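The escalation rule described above can be sketched like this. The response targets, the level cap, and the function name `escalation_level` are assumptions for illustration; the article does not give the actual values used in the system. Note that the priority of the issue never changes, only who gets notified.

```python
from datetime import datetime, timedelta

# Hypothetical initial-response targets per priority grade
RESPONSE_TARGET = {
    "high": timedelta(minutes=15),
    "medium": timedelta(hours=1),
    "low": timedelta(hours=4),
}


def escalation_level(priority, created_at, now, responded=False, max_level=3):
    """Return how far up the management chain an overdue issue is reported.

    0 = handler only, 1 = handler's manager, 2 = that manager's manager, ...
    One level is added for each full target interval that has passed
    without a response, capped at max_level.
    """
    if responded:
        return 0
    waited = now - created_at
    intervals = waited // RESPONSE_TARGET[priority]  # timedelta // timedelta -> int
    return max(0, min(max_level, intervals))


created = datetime(2024, 1, 1, 9, 0)
print(escalation_level("high", created, created + timedelta(minutes=35)))
```

A low-priority issue is escalated by the same mechanism as a high-priority one; it simply has a longer interval before each step.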

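How to do project O&M well?

Project O&M generally has wide and far-reaching effects, so it matters even more. In practice I use this category to track longer-running matters, such as deploying a particular system, or for problem management.

These are mostly internal to the O&M department, or have already been turned into internal matters. Because the users are few and all within O&M, we simply use the open-source Redmine as the management software.

Redmine is flexible. The first thing to understand is that it is built around issues (tasks). How exactly to use it depends on combining issues with labels. I will not go into detail here; those interested can explore it at their own pace.

This software fills the gaps in the incident management system. What are those gaps?

The main one is that incident management is best suited (in fact only suited) to individual, scattered items that are short, simple, and fast. A project has to be split into N subtasks, and those subtasks have dependencies between them. Project O&M can also run for a long time.

If something left open that long were recorded in the incident management system, your KPI would be ruined. -_-|||

With project management software we achieve flat management: we can see every task in progress, down to each individual subtask. That way we are not left scrambling when reporting to leadership, and discussions with team members can stick to the facts.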
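To make the point about subtasks and dependencies concrete, here is a small sketch using Python's standard `graphlib` module (Python 3.9+). It does not use Redmine or its API; the subtask names, loosely based on the router example, and the `ready_tasks` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks of a project: task -> set of tasks it depends on
subtasks = {
    "collect CPU and traffic data": set(),
    "reproduce in lab": {"collect CPU and traffic data"},
    "identify root cause": {"reproduce in lab"},
    "schedule change window": {"identify root cause"},
    "apply fix": {"identify root cause", "schedule change window"},
}

# A valid execution order that respects every dependency
order = list(TopologicalSorter(subtasks).static_order())


def ready_tasks(subtasks, done):
    """Flat view of the subtasks that can start now: not done, all dependencies done."""
    return [t for t, deps in subtasks.items() if t not in done and deps <= done]


print(order)
print(ready_tasks(subtasks, done={"collect CPU and traffic data"}))
```

A single-ticket incident record has no place for this structure, which is one way to see why a separate project tool is needed for this category of work.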

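In general, subtasks are agreed between the project owner and the person assigned to the task, and the person who actually does the work has a great deal of autonomy.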

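Best practice

As long as the goal of the parent task is not affected, give the person carrying out a subtask considerable autonomy, such as setting the detailed goals of the task themselves. This helps motivate them, because they are working toward a goal of their own.

O&M speaks with data

Because O&M work is divided into daily O&M and project O&M, each overseen by its own system (incident management and project management), we have a good O&M management platform, and the work of the whole O&M team is now largely captured as data.

For ordinary O&M staff, the two systems are also a very good platform for knowledge and communication. Whether someone's work is good is not decided by what the manager says, but by their own performance in daily O&M and project O&M. O&M managers, in turn, gain a powerful management tool, and the team's performance is also judged by the data.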
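As a rough sketch of what "speaking with data" might look like, the snippet below aggregates handled incidents per team into a count and an on-time rate. The record fields (`team`, `on_time`) and the `summarize` function are assumptions for illustration; the article does not describe the actual report format.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate incident records into per-team counts and on-time rates."""
    totals = defaultdict(lambda: {"handled": 0, "on_time": 0})
    for r in records:
        entry = totals[r["team"]]
        entry["handled"] += 1
        if r["on_time"]:
            entry["on_time"] += 1
    return {
        team: {**entry, "on_time_rate": entry["on_time"] / entry["handled"]}
        for team, entry in totals.items()
    }


# Hypothetical records exported from the incident management system
records = [
    {"team": "network", "on_time": True},
    {"team": "network", "on_time": False},
    {"team": "desktop-support-shanghai", "on_time": True},
]

report = summarize(records)
print(report)
```

The same aggregation could be keyed by individual handler instead of team if per-person figures are wanted.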

Final words

The views and practices above are purely my own, shared for the sake of discussion. Everyone has their own management experience. I personally feel that any approach is a good one as long as it fits the company's actual situation and runs smoothly.

Good approaches and methods always carry some universal wisdom, and I hope this provides something of value for your reference.


Origin blog.51cto.com/chenyitai/2668432