A New Proposition for Operations and Maintenance Value: Fine-Grained Technical Operations

About the Author:

Xiong Pujiang

Tencent architect

A 2016 "O&M craftsman", he is responsible for the company's business resource planning and technical architecture review. Since entering the Internet industry in 1997, he has worked at technology and Internet companies such as Supreme, Pacific Networks, and PPTV in the United States, serving as director of network operations and director of O&M. He joined Tencent in 2012. With more than 17 years in the Internet industry, he has extensive experience in planning and building large-scale network architectures, planning and supporting platforms with massive user bases, and planning business resources and optimizing technical architecture at very large scale.

Preface

Beyond keeping business systems stable, efficient, and secure, what else can O&M do to show its value? Here is what Tencent architect Xiong Pujiang has to say.

1. Thinking about the value of O&M

The theme I am sharing today is a new proposition for the value of O&M. Before getting into it, let me briefly introduce myself; the conference's promotional materials already include some of this. I have worked in O&M for more than ten years, starting as an engineer and working my way up to O&M director.

I joined Tencent in 2012. I currently work in the company's operations management department, where I am responsible for reviewing the technical architecture of the company's businesses and for business resource planning. That job description may not mean much to you. Put plainly, business resource planning is capacity management, which should be a more familiar term for those of us in O&M.

Resource planning, or capacity management, is closely tied to today's topic: fine-grained technical operations. How can we use technical means to make operations management fine-grained and bring out the value of O&M?

I will cover five areas today. In particular, I will use the changes in today's networks and WeChat case studies to show how fine-grained technical operations can bring value to the company and the business.

Let us first think about the value of O&M: what is it?


The traditional view is that O&M exists to keep the business running stably, efficiently, and securely. But is that view of O&M value still sufficient today? On the right side of this slide I have put a picture of the US Marine Corps.

We all know they are very capable, and the teams are small. They train with great attention to detail and prepare thoroughly, and when they act, they act decisively and efficiently, truly protecting the security of American citizens and territory.

I hope our O&M can deliver that kind of value and become one of the company's core competitive strengths. So we need to ask: where does the value of O&M lie? I think it has two sides. The first is value to users. What is user value? The product is comfortable to use, meets a real need, and saves the user time and money. All of that is user value.

The second, by the same logic, is value to the company. That means helping the business grow, improving our products, and, from the O&M side, helping push the product toward excellence. This too is the value of O&M. So at this stage we should think more deeply along these lines. Simply keeping the system running stably is far from enough.

2. O&M challenges in the mobile Internet era

2.1 User experience comes first

This is the mobile Internet era, and everyone uses WeChat a lot. Our phones have practically become one of our organs: they help us communicate, perceive, and get things done. I am sure you have all had this experience: going a few days without seeing family or friends is fine, but being away from your phone for a few hours makes you anxious. That is a very visible change.

The mobile Internet has changed the way we live. Given that we have objectively entered the mobile Internet era, what challenges do we face? Let us look at several aspects. As O&M engineers, the first thing we should pay attention to is the user experience on mobile devices. We can no longer consider only the PC experience or other business services; the focus should be on mobile.

What is distinctive about mobile devices? For one, the boundary of Wi-Fi is fuzzy and open, which raises security issues, so I will touch on security later. There is also the cost of data traffic: how do we save users' traffic? In the PC era this was not a consideration, or at least not a major one. There is also managing the user's storage capacity. Another especially important factor is power consumption. None of these mattered much in the traditional PC era, so they are new challenges for us.


2.2 Complex networks and devices

The second challenge is the network, which is also very different from the network of the PC era. First, there are interconnection problems "with Chinese characteristics". These already existed in the PC era and are very complicated. The original 2016 plan for interconnection among the three major operators called for more than 6 Tbps, but only a little over 800 Gbps was completed in a year. Interconnection therefore remains a major issue for China's networks.

Second, the complexity of the mobile network lies between the phone and the base station. Cellular and Wi-Fi signals fluctuate between strong and weak. There is also protocol conversion: mobile network protocols differ from the Internet's TCP/IP, so a conversion step is involved. Then there is the variety of devices. Besides Apple phones, China has a huge range of handsets, not only different brands but different models and configurations, which is far more complicated than in the PC era.

One more especially important point. As I said, we cannot do without our phones; they are always at our side. We use them lying in bed at night, and the first thing we do on waking is check them. Unlike before, users are now always online. Doing O&M well under these conditions is a different kind of challenge.

2.3 Challenges of mass information dissemination

Because mobile is so convenient, it brings another challenge: information spreads in real time. News, good or bad, spreads so fast that the whole world knows immediately, so incidents feel far more sudden. This raises the bar for gray (staged) releases and for disaster recovery plans.

Look at the picture on this slide. In the past, incidents might not have spread so visibly or had such a large impact. For example, Amazon's cloud was down for four hours because of a missing letter in a script, and GitLab deleted its entire database because someone typed the wrong command. Problems like these, and their impact, spread very quickly.

Mobile traffic is also bursty. The picture on the right shows Spring Festival red envelopes. At the moment the New Year begins, the peak is ten times the normal level. How do we handle that peak load? That is a challenge for us.


So how do we guarantee O&M service under these new conditions? In the mobile Internet era, one thing deserves special emphasis: services should be lossy, that is, able to degrade gracefully. Mobile users are numerous and widely spread, and with that many users we have to support lossy service. The slide shows several common patterns of business requests, including peak patterns such as flash sales and Double Eleven promotions, and event patterns such as equipment failures.


In the mobile Internet era we must be fully prepared at every moment. But if system capacity has to cover every one of these peaks and bursts, the cost is very high and we will wear ourselves out.

We therefore need all kinds of contingency plans, all kinds of disaster tolerance, and all kinds of lossy-service switches. Everyone uses WeChat, so you probably remember the feature in Moments during the 2016 Spring Festival where you sent a red envelope to see a photo. WeChat handled that very well.

All kinds of flexible switches were in place before launch, so the feature could go live at any time and be taken down at any time. That is what switches are for. Being able to run in a lossy mode helps business stability enormously, and that stability is itself an expression of business value.
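The switch idea can be sketched in a few lines of Python. This is only an illustration, not WeChat's implementation: `FeatureSwitches`, `render_photo`, and the switch name `red_envelope_photo` are hypothetical names, and a real system would presumably read switch state from a configuration service rather than from memory.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSwitches:
    """In-memory switch registry (hypothetical; a real system would load
    switch state from a configuration service)."""
    state: dict = field(default_factory=dict)

    def turn_on(self, name: str) -> None:
        self.state[name] = True

    def turn_off(self, name: str) -> None:
        self.state[name] = False

    def is_on(self, name: str) -> bool:
        # Unknown switches default to off, so a new feature ships dark.
        return self.state.get(name, False)


def render_photo(switches: FeatureSwitches, photo_id: str) -> str:
    """Serve the extra feature when its switch is on, else the plain path."""
    if switches.is_on("red_envelope_photo"):
        return f"blurred:{photo_id}"
    # Lossy path: the core function still works, only the extra is dropped.
    return f"plain:{photo_id}"


switches = FeatureSwitches()
print(render_photo(switches, "p1"))      # switch off: plain path
switches.turn_on("red_envelope_photo")
print(render_photo(switches, "p1"))      # switch on: feature path
```

The point of the design is that turning the feature off is a configuration change rather than a release, so it can be done at any moment during a traffic peak.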

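3. Fine-grained technical operations for equipment in practice

I have just described the new O&M challenges on the mobile Internet. Next, using WeChat as the example, let us look at how fine-grained technical operations can show the new value of O&M. Business resource planning, or capacity management, is usually something O&M has to manage. Capacity management generally covers resources such as equipment, servers, bandwidth, and dedicated lines. I will illustrate fine-grained operations for equipment with several WeChat examples.

WeChat is a service at massive scale: more than 2 billion registered users worldwide, 900 million monthly active users, and more than 500 billion messages sent and received every day. Supporting that volume of messaging obviously takes a lot of equipment. How do we support business growth and demonstrate our own value at the same time?

Let me give a few numbers. In 2014, fine-grained technical operations saved 9,000 machines on WeChat alone. What does 9,000 servers mean? The purchase cost alone is more than 170 million yuan in cash flow. Many of our businesses may not operate at this scale or use this many resources, but look at the number: saving 170 million yuan in cash flow is equivalent to operations directly creating 170 million yuan of profit for the company. That is very valuable.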


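3.1 The WeChat messaging scenario

Take messaging specifically. On the surface, we send a message, it is pushed out, and the other party receives it. In reality it is not that simple. Looked at closely, the system has to maintain connections, and the flow for sending one message involves access handling; account and status processing (verifying that the account is valid, checking its attributes and state, whether it is logged in, whether it has permission to send, whether the message is spam or harmful, and whether the recipient is a friend); and obtaining a sequence number for the message (the sequence number service guarantees that the message is unique and will not be lost).

Only once it has a sequence number is the message assembled, sent, and stored under the recipient's index. Only after it is stored is a new-message notification pushed to the recipient's phone.

In other words, even a simple action like sending one message goes through a lot of processing. With hundreds of billions of messages, resource usage is obviously tied directly to message volume. Suppose sending 100 billion messages takes 20,000 servers; sending 500 billion would in principle take 100,000. That is clearly unacceptable, so we have to work out exactly which server resources a message consumes.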
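The linear estimate above is simple arithmetic, written out here in Python. The 100 billion / 20,000 figures come from the talk's own hypothetical; the 1.5x per-server improvement is an invented number, included only to show how per-server efficiency changes the result.

```python
import math

BASE_MESSAGES = 100_000_000_000   # 100 billion messages (the talk's example)
BASE_SERVERS = 20_000             # servers assumed for that volume


def servers_needed(messages: int, messages_per_server: float) -> int:
    """Servers required if each one handles messages_per_server messages."""
    return math.ceil(messages / messages_per_server)


per_server = BASE_MESSAGES / BASE_SERVERS            # 5,000,000 per server
linear = servers_needed(500_000_000_000, per_server)

# Hypothetical: optimizations let each server handle 1.5x as many messages.
optimized = servers_needed(500_000_000_000, per_server * 1.5)

print(linear, optimized)   # 100000 66667
```

Every gain in messages handled per server reduces the fleet proportionally, which is why the following section looks at what each message actually costs on a server.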

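3.2 Fine-grained technical operations for WeChat messaging

First, can we raise each machine's network packet processing capacity at the operating-system level, for example through multi-core parallel processing? Can we reduce the number of call layers so that fewer calls are made? How? There are some tricks. For example, when two people are already in a conversation, some call flows or steps can be skipped. What I list here is only part of the picture; the real call graph is very complex, and in certain cases further layers can be eliminated.

Messaging load is also very uneven: the volume handled by each server varies greatly, and at the beginning the imbalance was severe. Servers are scaled out according to peak load, so when call volume is unevenly distributed, a lot of capacity added during scale-out goes unused. So we optimize to spread call requests evenly across all servers, or to balance them according to each server's capacity.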
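To see why imbalance wastes capacity under peak-based scaling, here is a minimal sketch. The load figures are made up, and `fleet_utilization` and `assign_by_capacity` are illustrative helpers rather than anything described in the talk.

```python
def fleet_utilization(loads: list[float]) -> float:
    """Average load as a fraction of the hottest server's load.

    If every server is sized for the hottest one, this is the share of
    provisioned capacity that is actually used.
    """
    return sum(loads) / (len(loads) * max(loads))


def assign_by_capacity(total_requests: float, capacities: list[float]) -> list[float]:
    """Split requests in proportion to each server's capacity."""
    total_capacity = sum(capacities)
    return [total_requests * c / total_capacity for c in capacities]


uneven = [900, 300, 200, 200]      # made-up requests per second per server
even = [400, 400, 400, 400]        # same total, spread evenly

print(round(fleet_utilization(uneven), 2))   # 0.44
print(fleet_utilization(even))               # 1.0
print(assign_by_capacity(1600, [2, 1, 1]))   # [800.0, 400.0, 400.0]
```

With the uneven distribution, more than half of the capacity bought for the peak sits idle; balancing by capacity lets the same hardware absorb the same total load with headroom to spare.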

In addition, some requests can be merged. Merging removes many network calls and improves performance.
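As a rough illustration of request merging (the talk does not say which requests WeChat merges, so the backend names and keys here are invented), pending lookups can be grouped so that each backend receives one batched call instead of one call per key:

```python
from collections import defaultdict


def merge_requests(pending: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (backend, key) lookups into one batch per backend.

    Duplicate keys for the same backend are sent only once.
    """
    batches: dict[str, list[str]] = defaultdict(list)
    for backend, key in pending:
        if key not in batches[backend]:
            batches[backend].append(key)
    return dict(batches)


pending = [
    ("profile", "alice"),
    ("profile", "bob"),
    ("status", "alice"),
    ("profile", "alice"),   # duplicate, dropped by the merge
]
batches = merge_requests(pending)
print(len(pending), "calls ->", len(batches), "batched calls")
print(batches)
```

Four network round trips become two, at the cost of a small wait while requests accumulate, which is the usual trade-off with batching.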

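Some modules peak at different times. During the Spring Festival Gala, for example, many people send red envelopes while other feature modules see relatively little use, so the red envelope module can be deployed on the machines serving those other features. This is off-peak (staggered) scheduling.

The same goes for the new-service zone. After a new feature launches, it is hard to know in advance how much traffic it will carry. The feature may never take off, or far fewer people may use it than expected. If we had requested a batch of machines for it, they could go to waste. So for new features and new businesses we set up a dedicated zone used specifically for launching new services.

That is, a new service is not deployed on its own machines until its request or call volume reaches a certain level; only then do we split it out. New features and new products have a very low success rate, so only those that really grow are deployed separately. This is how we manage servers for new services.

3.3 Fine-grained technical operations for WeChat favorited videos

Fine-grained operations on equipment require us to go deep into the business. Consider the favorite-video feature in WeChat. Many messages or important items can be saved to Favorites. In the original product design, a favorited video offered two actions in Favorites: play directly, or download. The design let users play a video straight from Favorites, which is a reasonable product design.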


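But we found that direct playback required a lot of investment. The messages we send are all encrypted, so to play a video on its own it has to be decrypted and put somewhere, which creates an extra stored copy. Favorites are kept permanently, unlike messages, which can be deleted once received. Favorites and Moments content are both stored forever, so the volume keeps growing. After just three years this storage had reached 7 PB, which is enormous.

The data showed that direct playback from Favorites happened only tens of thousands of times a day, yet we were paying a very high price for it. We were giving users a feature that looked perfectly reasonable, but data analysis showed it was not reasonable at all.

We fed this back to the product team, direct playback was removed, and a large amount of storage was saved. The example shows that product design can also contain unreasonable choices. We need data to analyze and make the case; some things can be optimized and improved so that both user experience and user value go up.

3.4 Fine-grained technical operations for WeChat Moments

Now an example from Moments. Moments album content is stored permanently. Moments actually has two stores:

  • One is the timeline, where we view friends' updates. The timeline keeps only the most recent 2,000 posts from friends (to see beyond those 2,000, you have to open a specific user's personal album).

  • The other is the personal album, which holds only the user's own posts. It keeps everything since the user's first Moments post.

Posting to Moments therefore involves two writes: one stores the post in the author's personal album, and the other inserts a record into the timeline of every friend who is not blocked, so that those friends can see the post. Clearly, timeline storage is essentially fixed and does not grow over time, while the personal album grows cumulatively. So the personal album is where our analysis should focus.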
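The two writes described above can be modeled with a toy in-memory structure. This is a sketch of the described behavior, not WeChat's storage design; the `Moments` class and its fields are hypothetical, and only the 2,000-post timeline limit comes from the talk.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 2000  # the talk says a timeline keeps the latest 2,000 posts


class Moments:
    def __init__(self, timeline_limit: int = TIMELINE_LIMIT):
        # author -> every post the author ever made (grows without bound)
        self.albums = defaultdict(list)
        # reader -> bounded queue of friends' posts (oldest entries fall off)
        self.timelines = defaultdict(lambda: deque(maxlen=timeline_limit))
        self.friends = defaultdict(set)   # author -> friends
        self.blocked = defaultdict(set)   # author -> friends hidden from

    def post(self, author: str, item: str) -> None:
        # Write 1: the author's personal album.
        self.albums[author].append(item)
        # Write 2: the timeline of each friend who is not blocked.
        for friend in self.friends[author] - self.blocked[author]:
            self.timelines[friend].append((author, item))


m = Moments(timeline_limit=3)     # tiny limit so the cap is visible
m.friends["a"] = {"b", "c"}
m.blocked["a"] = {"c"}
for i in range(5):
    m.post("a", f"post{i}")

print(len(m.albums["a"]))         # 5: the album keeps everything
print(list(m.timelines["b"]))     # only the latest 3 entries survive
```

The bounded queue is what keeps timeline storage roughly constant per user, while the album list is the part that grows with every post.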


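In Moments personal albums, billions of photos and more than a hundred million short videos are uploaded every day, and there are tens of billions of downloads a day. The storage added is clearly very large, and because it must be kept permanently for users, the growth is rigid. How do we make this fine-grained?

Again we analyze the data, which is also a key point I will come back to when discussing the future of O&M. The analysis showed that about 70% of Moments requests are for data posted that same day.

The data behind that 70% of requests makes up only 0.3% of total storage. Another figure: 90% of requests fall within one month, that is, 90% of Moments requests are for data from the past month, and that data is only 6.5% of the total. With these numbers we could push for a change to the whole data architecture.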
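The talk does not describe the resulting architecture, but the numbers suggest tiering data by age. The sketch below assumes a three-tier split (same day, within a month, older); the tier names and `storage_tier` are illustrative, and `access_density` simply divides a tier's share of requests by its share of data to show how lopsided access is.

```python
from datetime import date


def storage_tier(created: date, today: date) -> str:
    """Pick a tier by age: same day, within a month, or older (assumed split)."""
    age_days = (today - created).days
    if age_days < 1:
        return "hot"
    if age_days <= 30:
        return "warm"
    return "cold"


def access_density(request_share: float, data_share: float) -> float:
    """Requests served per unit of data, relative to the overall average."""
    return request_share / data_share


# Shares quoted in the talk: 70% of requests hit 0.3% of the data,
# 90% of requests hit 6.5% of the data.
hot = access_density(0.70, 0.003)
warm = access_density(0.90 - 0.70, 0.065 - 0.003)
cold = access_density(1 - 0.90, 1 - 0.065)

print(f"hot {hot:.0f}x, warm {warm:.1f}x, cold {cold:.2f}x")
print(storage_tier(date(2017, 4, 20), date(2017, 5, 1)))   # warm
```

Under these figures the same-day data is accessed a few hundred times more intensively than average, while the oldest 93.5% of the data is accessed at about a tenth of the average rate, which is what makes cheaper storage for old data attractive.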

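4. Fine-grained technical operations for WeChat bandwidth in practice

Bandwidth can be given the same fine-grained treatment, again using WeChat as the example. First look at the figures on the right of the slide: in 2015, fine-grained operations on WeChat bandwidth saved up to 3.5 Tbps in the peak month. How much cost is that? Even the cheapest bandwidth runs 10,000 to 20,000 yuan per Gbps per month, so 3.5 Tbps is tens of millions of yuan a month. The year before last we saved 800 million yuan.

In 2016 the same work saved 6 Tbps of bandwidth and 1.4 billion yuan for the company. In other words, fine-grained technical operations can drive architecture upgrades in the business, improve the experience, and increase user value.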
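The monthly figure for 2015 can be checked with a quick calculation. The price range is the one quoted in the talk; treating 1 T as 1,000 G is an assumption made here.

```python
def monthly_savings_yuan(saved_tbps: float, yuan_per_gbps_month: float) -> float:
    """Monthly cost of the saved bandwidth, assuming 1 Tbps = 1,000 Gbps."""
    return saved_tbps * 1000 * yuan_per_gbps_month


low = monthly_savings_yuan(3.5, 10_000)
high = monthly_savings_yuan(3.5, 20_000)
print(f"{low / 1e6:.0f} to {high / 1e6:.0f} million yuan per month")
```

That gives 35 to 70 million yuan for the peak month, consistent with "tens of millions a month". The yearly totals quoted in the talk cannot be reproduced from this alone, since the savings varied month to month.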


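How is fine-grained technical operation of bandwidth done? There are generally several steps. First break bandwidth down to the finest-grained business module, then build a business resource model that matches the bandwidth, ideally expressed as a formula or function.

Take short videos, the rich media in Moments. The main factors in its bandwidth model are the number of downloads and the average size, which gives a basic model: peak downloads × average size.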
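Written as a function, the model might look like the sketch below. The units (downloads per second at peak, bytes per file, result in Gbps) and the example numbers are assumptions for illustration; the talk gives only the two factors.

```python
def peak_bandwidth_gbps(peak_downloads_per_second: float,
                        average_size_bytes: float) -> float:
    """Bandwidth model: peak downloads x average size, converted to Gbps."""
    bits_per_second = peak_downloads_per_second * average_size_bytes * 8
    return bits_per_second / 1e9


# Hypothetical inputs: 100,000 downloads/s at peak, 500 KB average file.
before = peak_bandwidth_gbps(100_000, 500_000)
# Same demand after shrinking the average file to 350 KB.
after = peak_bandwidth_gbps(100_000, 350_000)
print(before, after)   # 400.0 280.0
```

Once bandwidth is expressed this way, each optimization maps onto a factor: better compression lowers the average size, while product changes such as tap-to-play lower the number of downloads.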

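Once we know the factors behind bandwidth, we can propose optimizations. After optimizing, we go back to the data to check whether the optimization was sound, find whatever problems remain, and fix them. Because the product keeps being upgraded, some optimizations stop working, so this work has to continue without pause. This year, too, we are pushing fine-grained bandwidth operations on WeChat.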


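Now some real cases. The first is bandwidth optimization for the Official Accounts platform. When we receive a subscribed message in an official account, it looks like the conversation box on the right of the slide: that is a message pushed by the official account. Tapping it opens the full article.

By scenario, Official Accounts bandwidth goes mainly to viewing messages in the conversation box and reading the article body (the detail page). We also tap large images inside articles and browse historical messages. Images in articles come in several sizes, such as 640 and 300.

Early on we even found that the product had no 640 or 300 variants at all; everything used the original image. Most of the bandwidth for Official Accounts articles goes to images. Images come in many formats; the mainstream ones are JPEG, WebP, PNG, and GIF.

JPEG dominates by count, and the files are not small, but is it the best format? WebP is 30% to 40% smaller than JPEG, so we asked whether JPEG could be converted to WebP wherever possible. We also found that GIF accounts for only 7% of requests but more than 60% of bandwidth, so GIF is clearly where optimization should focus. This is the kind of optimization that fine-grained data makes possible.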
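A breakdown like the GIF figure above comes from comparing each format's share of requests with its share of bytes. The sketch below uses invented file sizes (40 KB per JPEG, 800 KB per GIF) chosen only so that the output resembles the proportions in the talk; `format_shares` is an illustrative helper.

```python
from collections import Counter


def format_shares(records: list[tuple[str, int]]) -> dict[str, tuple[float, float]]:
    """Map each format to (share of requests, share of bytes).

    records holds one (format, bytes_transferred) pair per request.
    """
    requests: Counter = Counter()
    volume: Counter = Counter()
    for fmt, size in records:
        requests[fmt] += 1
        volume[fmt] += size
    total_requests = sum(requests.values())
    total_bytes = sum(volume.values())
    return {
        fmt: (requests[fmt] / total_requests, volume[fmt] / total_bytes)
        for fmt in requests
    }


# Made-up log: 93 JPEG requests of 40 KB and 7 GIF requests of 800 KB.
records = [("JPEG", 40_000)] * 93 + [("GIF", 800_000)] * 7
result = format_shares(records)
for fmt, (req_share, byte_share) in result.items():
    print(f"{fmt}: {req_share:.0%} of requests, {byte_share:.0%} of bandwidth")
```

A format whose bandwidth share far exceeds its request share is the obvious place to start optimizing.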

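4.1 Fine-grained technical operations for GIF images

GIFs are animated images. Many of them we have already seen and will not necessarily watch again. Normally a GIF has to be fetched in full and then plays automatically. So we asked whether it could be made not to autoplay: fetch only the first frame, and if the user wants to see the animation, one tap plays it.

We ran an experiment and found that only 8% of users wanted to tap a GIF. Handling GIFs this way saves 85% of the bandwidth, which is a huge optimization.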
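One way to relate the 8% tap rate to the 85% saving is a simple cost model. The model is an assumption made here, not something stated in the talk: every view fetches the first frame, and only users who tap fetch the whole file.

```python
def gif_bandwidth_saving(tap_rate: float, first_frame_fraction: float) -> float:
    """Fraction of GIF bandwidth saved by tap-to-play.

    Assumed cost model: every view downloads the first frame (a fraction of
    the full file size), and a tap then downloads the complete file.
    """
    cost = first_frame_fraction + tap_rate
    return 1.0 - cost


# With an 8% tap rate, an 85% saving corresponds to a first frame that
# averages about 7% of the full GIF size under this model.
print(round(gif_bandwidth_saving(0.08, 0.07), 2))   # 0.85
```

Under this model the saving is capped at 92% by the tap rate, and the rest of the gap is the cost of the preview frames.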


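There are many more optimizations. User-made GIFs, for example, often contain many frames, some of which are unnecessary. Then there is color: the image on the left uses 128 colors and the one on the right 64. For undemanding content such as diagrams, a high color depth is not needed.

In the future GIFs could also be converted to HEVC, which would cut image size by a further 30%. So we should keep an eye on new technology. Google, for instance, recently proposed a technique called RAPID that uses AI prediction to restore images, even recovering a mosaic-like image. If that works, high-fidelity images could be delivered with very little storage and very few pixels.

4.2 Fine-grained bandwidth optimization for C2C video

Now an example from WeChat video. C2C means user to user. WeChat carries a great deal of rich media messages, and fetching them takes the lion's share of bandwidth. In fact 80% of WeChat's bandwidth comes from videos and images, so fine-grained work has to concentrate there.

Optimizations such as a higher compression ratio and a sensible quality factor are easy to understand: a newer format gives higher compression at the same quality, or the quality factor can be turned down a little.

Here I will focus on play-while-downloading. In early versions of WeChat, a video message had to finish downloading before it could be played. That is no longer the case: tap it and it plays immediately. This is play-while-downloading.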


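In the early days, receiving a video often meant waiting several seconds or more than ten seconds. Once it had downloaded and we opened it, we might find we had already seen it and close it immediately. That wasted a lot of bandwidth and, more importantly, the user's time. With play-while-downloading the experience is much better: playback starts on tap, and the parts the user does not want to watch are never downloaded. This was one of the larger fine-grained optimizations we pushed through.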
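The bandwidth effect can be illustrated with a small simulation. It assumes the video is delivered in fixed-size chunks and that the transfer stops as soon as the viewer closes the player; the chunk size and `watch` helper are invented for the example, and real players also buffer ahead, which this sketch ignores.

```python
from typing import Iterator


def stream_chunks(video: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield the video in order, one chunk at a time."""
    for offset in range(0, len(video), chunk_size):
        yield video[offset:offset + chunk_size]


def watch(video: bytes, chunk_size: int, chunks_watched: int) -> int:
    """Bytes transferred if the viewer closes after chunks_watched chunks."""
    transferred = 0
    for index, chunk in enumerate(stream_chunks(video, chunk_size)):
        if index >= chunks_watched:
            break                      # viewer closed: stop fetching
        transferred += len(chunk)
    return transferred


video = bytes(1_000_000)               # stand-in for a 1 MB video
full_download = len(video)             # download-then-play always pays this
early_close = watch(video, 100_000, 2) # viewer recognizes it and closes
print(full_download, early_close)      # 1000000 200000
```

The viewer who closes early costs a fifth of the bandwidth in this example, and playback can begin after the first chunk rather than after the whole file.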


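Finally, the "reduce variants" optimization. Video variants arise easily, for example when we forward or save a video we have seen. A pixel or two may be lost on upload, which changes the whole file. Some people edit the video's description. Different handsets also compress differently. The result is many variants of the same video. This is where technology shows its value: reducing variants greatly reduces storage and raises the CDN hit rate. In the end we used innovative technical means to reduce variants.

The traditional way to decide whether two videos are the same is MD5, which has too many limitations. We built an in-house algorithm to determine whether two videos are the same; its basic principle is to extract certain key information and use that to judge uniqueness.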
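The in-house algorithm is not described, so the following is only a toy illustration of the general idea, not Tencent's method. It shows why MD5 fails on near-identical files and how a fingerprint built from coarse content features can still match. `coarse_fingerprint` is hypothetical: it quantizes each frame's average brightness into a few levels, which is far cruder than anything usable in production.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Exact-match digest: any changed byte gives a different value."""
    return hashlib.md5(data).hexdigest()


def coarse_fingerprint(frames: list[bytes], levels: int = 8) -> str:
    """Toy content fingerprint (hypothetical, for illustration only).

    Each frame is reduced to its average pixel value, quantized into
    `levels` buckets, so tiny pixel changes leave the signature unchanged.
    """
    step = 256 // levels
    signature = bytes(sum(frame) // len(frame) // step for frame in frames)
    return hashlib.sha1(signature).hexdigest()


original = [bytes([10] * 16), bytes([200] * 16)]          # two tiny "frames"
variant = [bytes([10] * 15 + [12]), bytes([200] * 16)]    # one pixel changed
different = [bytes([120] * 16), bytes([30] * 16)]         # another video

print(md5_hex(b"".join(original)) == md5_hex(b"".join(variant)))    # False
print(coarse_fingerprint(original) == coarse_fingerprint(variant))  # True
print(coarse_fingerprint(original) == coarse_fingerprint(different))  # False
```

A byte-level hash treats the one-pixel variant as a new file, so it is stored and cached separately, whereas a content-level signature lets both map to the same stored object.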

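Finally, let us summarize a methodology for fine-grained operations. Whether it is equipment or bandwidth, break it down to the finest granularity you can and pull out the biggest resource consumers. Our energy is limited and we cannot look at everything; tackling the top 5 or top 3 is enough. So focus on the big items and let the small ones go. Then try to build a model and ask whether the technical architecture can be optimized, whether the algorithms can be improved, and even whether new technology can be applied.

Beyond technical means, product operations strategy is also part of fine-grained operations. In earlier versions of WeChat, videos posted to Moments were short videos at a very small size. As handset hardware and network conditions improved, we wanted to open up the user experience further, so short videos were enlarged to full screen. Full screen is a much better experience, but with more pixels and a larger picture, bandwidth and storage come under more pressure.

Here we start from product operations strategy. Moments short videos used to autoplay as soon as they came into view. Autoplay has its benefits: it raises user activity, because users can tell at a glance whether a video is good and may comment on or forward it.

But autoplay clearly wastes bandwidth, especially once the resolution goes up and the videos get large. We later changed the product strategy to play only when the user taps. Everything keeps changing, so we have to keep upgrading.

Putting this together, fine-grained technical operations can help us deliver more value: improving the product experience for the company, saving more cost, and even, you could say, creating revenue.

5. The road ahead for O&M value

Fine-grained technical operations is the new proposition for O&M value, but we should also ask where the value of O&M will lie in the future.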


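  • Mobile. We need to embrace mobile fully.

  • Cloud. This trend is also obvious. Cloud technology is very mature, and monitoring and scaling are extremely convenient in the cloud, so any business that can run in the cloud should be migrated there. Games today basically all run in the cloud, where they can scale out quickly to handle bursts. IDG even predicts that in the future servers will all sit with cloud vendors, so we need cloud capabilities.

  • DevOps. It hardly needs explaining; we hear about it every day. The fine-grained technical operations I described depend on a lot of data. Where does that data come from? We have to do some development and build tools for deployment, release, monitoring, data processing, and so on, all of which relate to DevOps.

  • Data operations. What I have been talking about throughout is how to persuade the product team to make changes. An O&M engineer who does not understand the product is not a good O&M engineer. Data operations lets us optimize the product experience, help reduce cost, and even bring improvement and innovation.

  • Security. Security also matters a great deal in the mobile Internet era. Everything is connected, our phones are online at all times, and there are more and more apps on them, so almost any link can involve a security issue. If you plan to stay in O&M and your team has no security role yet, create one now.

O&M can also become a core competitive strength of the enterprise. Through fine-grained technical operations we can improve the product experience, bring value to users, and lower our operating costs. The new proposition I wanted to present today is to make operations a core competitive strength of the company.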



Origin blog.51cto.com/14996608/2548420