Alibaba Search Middle Platform: The Road to Integrated Development and Operations

Editor's note (Ali-mei): At the end of 2015, Alibaba Group announced the launch of its middle platform strategy. The strategy is defined as building a more innovative and flexible "large middle platform, small front end" organizational and business mechanism suited to the DT era. The front end, as the front-line business, adapts to the market more nimbly and quickly. The middle platform gathers the operational data capabilities and the product and technology capabilities of the whole group and gives strong support to the front-end businesses. Search is a very important part of the group's middle platform layout, and the complexity of search technology itself, combined with the scale of the business, means the search middle platform faces world-class challenges in both technology and product.

 

Facing these challenges, Alibaba chose to take the road of integrated development and operations. How should this road be travelled? Below, Liu Ming, a senior technical expert in Alibaba's search division, walks us through it.

 

Background

 

The original intention of the Alibaba Search Middle Platform is to help front-end businesses adapt to market changes more nimbly and quickly, and its vision is to make search easy to use everywhere. Guided by that intention and vision, we built the search middle platform from 0 to 1 over three years, and in that time we accumulated a number of industry-leading platforms in DevOps, AIOps, and offline processing. As a search veteran at Alibaba, I was fortunate to witness the whole technical evolution of the search middle platform, so here I will share, from my own limited experience, how a back-end service moves toward a platform product that addresses scale, cost, efficiency, quality, and experience.

 

Technical evolution of the search middle platform

 

The figure below shows our judgment of the search middle platform's technical trend. It is also the path we actually followed over more than three years of practice.

 
The first stage in the figure is roughly when I joined Alibaba. At that time both the search division and open-source search technology relied on people for system operations and maintenance. The required headcount grew in direct proportion to the business, and a great deal of manpower went into inefficient, repetitive work. This was the manual control stage.

 

As experience accumulated, PE gradually found that common, repetitive operations work could be done with automation scripts. This reduced labor cost to some extent, improved operational efficiency, and produced a first trace of accumulated expertise and domain knowledge. This was the scripted automation stage, which is where the vast majority of open-source technology stacks stand today. However, this approach naturally split the work into two roles: development and operations.

 

Because the two roles have different missions, they naturally stood in opposition. Development wanted fast iteration, while operations wanted as few changes as possible to keep production stable, since, as we all know, most online failures are caused by configuration changes and software upgrades. This division of labor bred mutual distrust, and the two sides ended up with a compromise: fixed release windows every Tuesday and Thursday. That compromise came at the expense of business iteration efficiency.
 
In fact, there was a large gap between the system's iteration capability and what the business needed. To resolve this contradiction, a new generation of control systems built on the DevOps idea of integrated development and operations came into being. This was our first phase of DevOps construction, and it did solve some of the release and iteration problems of individual ops systems.

 

But our business scenarios are naturally managed across a whole technology stack rather than a single system, so our understanding of DevOps should not stop at a development and operations methodology for one system. We define our DevOps as an "ops" on top of the individual systems' ops, which makes it essentially very different from the group's other so-called DevOps platforms.

 

A representative DevOps platform in the group is the Tianji ("space-based") platform. It mainly covers the whole flow of a service from source code management to deployment to upgrade, and its users are essentially operations staff. On that basis, building the product around IaC (Infrastructure as Code) plus Git-based deployment configuration management is already sufficient, and this is a typical DevOps platform design. For us, however, such a design is probably not enough, because our customers are end users who have no expertise in operating online systems. Shown only configuration or code, they would be lost.

 

So fundamentally we need to take our understanding of DevOps one step further, toward the perspective of a platform product. First, shield users from the complexity of code, configuration, and domain knowledge. Second, turn collaboration across systems into an end-to-end control experience. Only by simplifying complexity and providing an end-to-end experience across the full control chain can the iteration efficiency of complex search businesses really improve. To achieve these goals, we spent two years gradually landing systems such as sophon, bahamut, and Maat, and achieved good gains in business iteration efficiency.

 

But is DevOps alone enough for a platform of Alibaba's size? Obviously not. Full-chain DevOps only effectively addresses collaboration efficiency and user experience among R&D, PE, and users. For the platform side, with the rapid growth of business scale and the complexity and variety of search services, the main contradiction between the platform and the business has shifted. How to provide better stability guarantees, more reasonable resource utilization, and higher iteration efficiency for every business at massive scale has become our platform's new goal.

 

Based on three years of operational data, our current AIOps practice has produced Hawkeye (an online service optimization platform), Torch (a capacity management platform), Heracles (a daily stress-testing service platform), and CostMan (a cost service system). These middle-platform services have achieved initial results in capacity management, routine inspection, and one-click diagnosis and optimization. They have also given us firm confidence that, once the group's search operations are under unified control, the platform can cope smoothly even when the number of businesses exceeds 10,000.

 

Although we have accumulated three years of operational data, we are still some distance from real AIOps. So far our performance bottleneck analysis, problem diagnosis, self-healing, and complex operations decisions mainly rely on codifying expert experience, or to put it plainly, on putting human experience into the system to solve online operations problems. What AIOps aims for is using data and algorithms to discover patterns and solve problems automatically. From that angle our AIOps platform still has a lot of untapped potential, and we hope that AI will genuinely help the platform with efficiency, quality assurance, and cost optimization so that it can better adapt to future growth.

 

Search middle platform integrated development and operations practice: Sophon

 

Integrated development and operations: DevOps

 

Before introducing sophon, our integrated development and operations system, let us look at which systems a slightly complex search service involves when it is onboarded and how those systems work together.

From the figure, the whole system is roughly divided into three modules: OPS, Online, and Offline. The Ops layer in the figure is clearly split into ops for online stateful services, ops for online stateless services, and ops for offline services.

 

In other words, each service used to be controlled separately by its own ops. But as the figure above shows, a complex business is the result of multiple service systems combined. As I remember it, before tisplus went online, the first thing we did when onboarding a complex business was to call a meeting with the online stateful service team, the online stateless service team, the offline DUMP team, the business side, and PE, and then arrange how to push the launch forward together. Online changes and online problems were likewise handled by shouting to one another in a support group: "I have done this step, you can do the next step," or "Hold off for a moment, I need to release again." You can imagine how inefficient collaboration was under that model, and from this description you can probably also see why supporting around ten businesses was already our limit.

 

With these pain points and needs in mind, we concluded that building DevOps for a complex search system must satisfy the following:

 

  1. An end-to-end experience across the full ops chain is what meets our definition of DevOps for our scenario.

  2. As we understand it, complex operations control must be upgraded from a procedural approach to goal-driven operations control.

  3. Better operations abstraction and product abstraction, to better empower users.

  4. Improving business iteration efficiency must be built on guaranteeing business stability.

 

With these pain points and requirements, we laid out the technical platform Sophon in this area. The sections below describe the system in detail.

 

Search middle platform DevOps practice: Sophon

 

Goal-driven operations

 

What is goal-driven operations? At first glance it sounds abstract, but it is actually simple. A real search operations scenario makes it easier to see why we push for goal-based operations control.

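Suppose our search system is currently serving index version A and is asked to switch to index version B. While version B is rolling out, I change my mind and want to roll out version C instead. In earlier years this situation was maddening in production. If a PE had to handle it, the only option was to kill the switch process, check which step every node had reached, clean up the intermediate state, and start the operations flow again. You can imagine how inefficient procedural operations control is in a complex operations system.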

 

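With goal-driven scheduling, we only need to give the system a new target: roll to version C. The system compares the latest target with the target it is currently converging on, finds that the target state has changed, immediately terminates the current execution path, automatically cleans up any inconsistent state, and starts issuing critical-path execution notifications for the latest target state. Each node receives the latest command and begins to converge step by step on the new target. Because this progressive, eventually consistent approach looks only at the final state, it naturally hides the complexity of intermediate operational states and makes complex operations control simpler and more flexible. This is why every operation on our platform, from top to bottom, has been upgraded to be goal-driven.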
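The change of mind from B to C described above can be sketched as a small reconcile loop. This is a minimal illustration, not sophon's implementation: the `Node` and `Controller` classes, the `pending` field, and the two-step load/serve switch are all assumptions made for the example. What it shows is that the user only changes the target, while each step compares the current state with the latest target and discards stale intermediate state on its own.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One search node (hypothetical model)."""
    name: str
    version: str                    # index version currently being served
    pending: Optional[str] = None   # version being loaded, if a switch is in flight


@dataclass
class Controller:
    nodes: List[Node]
    target: str
    log: List[str] = field(default_factory=list)

    def set_target(self, version: str) -> None:
        """Changing the goal is the only thing the user does."""
        self.target = version

    def reconcile_step(self) -> bool:
        """Move one node one step toward the target; return True once converged."""
        for node in self.nodes:
            if node.pending is not None and node.pending != self.target:
                # Intermediate state left over from an abandoned goal: clean it up.
                self.log.append(f"{node.name}: discard pending {node.pending}")
                node.pending = None
                return False
            if node.version != self.target:
                if node.pending is None:
                    node.pending = self.target          # start loading the new index
                    self.log.append(f"{node.name}: load {self.target}")
                else:
                    node.version, node.pending = node.pending, None  # switch over
                    self.log.append(f"{node.name}: serve {node.version}")
                return False
        return True


ctl = Controller([Node("n1", "A"), Node("n2", "A")], target="B")
ctl.reconcile_step()        # n1 starts loading B
ctl.set_target("C")         # change of mind while B is still rolling
while not ctl.reconcile_step():
    pass
print([n.version for n in ctl.nodes])
```

Nobody has to kill the running flow or inspect each node by hand here. The half-loaded B on `n1` is dropped by the loop itself because it no longer matches the target.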

 

Simplifying operations concepts

 

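Our platform has always talked about moving from hosting to empowerment. The implication is that end users should take on the responsibility that is properly theirs in order to enjoy more powerful search capabilities. But empowerment cannot mean exposing the search system's complex domain knowledge and operations concepts directly to end users. That would not be empowering users but tormenting them. So how to simplify the system's operations concepts and keep the complexity and implicit domain knowledge inside the system is one of the core problems sophon has to solve.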

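The lower part of the figure above shows the infrastructure of each data center and the various online services as seen from the PE's perspective. Without a layer of control abstraction, end users would face the same complexity as PE, and I am sure they would be overwhelmed.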

 

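So one thing sophon does is abstract the objects under operations control into a set of data relationship models, that is, the operations control model, shown on the right of the figure above. This layer of operations abstraction is still quite complex, and users should not and need not understand it. What users should see is a business abstraction that maps to their business scenario, so on top of the first layer of operations abstraction sophon adds a business abstraction, the three concepts in the upper left of the figure: business logic (plugins, configuration), service (deployment relationships), and data (data sources and offline data processing). Users can accept these definitions at almost no cost, so sophon's ability to abstract operations concepts and simplify business concepts is what makes it possible for the platform to move from hosting to empowering users.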
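The two layers of abstraction can be pictured with a small sketch. Everything here is an assumption made for illustration: the `BusinessApp` fields, the `to_ops_model` function, and the `OnlineService` and `OfflineJob` object kinds are hypothetical names, not sophon's actual model. The sketch only shows the direction of the mapping: the user edits three business concepts, and the platform expands them into the larger set of control objects that PE would otherwise manage by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BusinessApp:
    """What the end user edits: three business concepts only."""
    name: str
    plugins: List[str]                  # business logic (plugins, configuration)
    replicas_per_unit: Dict[str, int]   # service (deployment per data center unit)
    data_sources: List[str]             # data (sources and offline processing)


def to_ops_model(app: BusinessApp) -> List[dict]:
    """Expand the business view into hypothetical operations control objects."""
    objects = []
    for unit, replicas in sorted(app.replicas_per_unit.items()):
        objects.append({"kind": "OnlineService", "app": app.name, "unit": unit,
                        "replicas": replicas, "plugins": list(app.plugins)})
    for source in app.data_sources:
        objects.append({"kind": "OfflineJob", "app": app.name, "source": source})
    return objects


app = BusinessApp("shop_search", ["rank_plugin"], {"unit_a": 4, "unit_b": 2},
                  ["item_table", "user_table"])
for obj in to_ops_model(app):
    print(obj)
```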

 

Stability guarantees

 

sophon guarantees service stability mainly in two respects:

 

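  • As the platform supports more and more top-tier core businesses, we need to provide SLA guarantees for their search services, let each business flexibly deploy online and offline services according to its own stability requirements, and provide automatic disaster recovery switching. Today sophon supports unitized deployment of online search services, unitized deployment of online and offline services together, cold-standby deployment of offline data, and automatic disaster recovery switching for both the query path and the data return path, as shown in the figure below: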

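  • One of the iteration efficiency gains mentioned earlier is that online releases, once tied to time windows, can now happen at any time, 24 hours a day. But "any time" does not mean we simply provide a release button and ignore the risks that fast releases may bring. Releases that are both efficient and safe are our goal, and the essential foundation behind this is a release and iteration specification that we designed and standardized.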

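    For example, a normal business iteration must be verified in two environments, daily and pre-release. In the release flow from pre-release to production we have added multiple verification mechanisms to keep releases stable. When a plugin or algorithm strategy is upgraded, we require a clone stress-test comparison, and if the performance gap is too large the release flow is rolled back. Capabilities such as gray release by shifting traffic in a single data center and smoke testing can also be defined in the release flow. With sophon's multiple verification mechanisms and fast disaster recovery switching, fast business iteration no longer carries that worry, and the efficiency of business operations and iteration can be pushed to the limit. The figure in the original post shows this flow.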
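The gated release flow described in the second point can be sketched as an ordered list of checks that stops and rolls back at the first failure. This is only a sketch under stated assumptions: `run_release`, `clone_benchmark_gate`, the gate names, and the 10% latency regression threshold are all hypothetical. The source says only that a release is rolled back when the performance gap in the clone stress test is too large.

```python
from typing import Callable, List, Tuple

Gate = Tuple[str, Callable[[], bool]]


def run_release(gates: List[Gate]) -> Tuple[str, List[str]]:
    """Run gates in order; the first failing gate triggers a rollback."""
    passed: List[str] = []
    for name, check in gates:
        if not check():
            return "rolled_back", passed
        passed.append(name)
    return "released", passed


def clone_benchmark_gate(baseline_ms: float, candidate_ms: float,
                         max_regression: float = 0.10) -> Callable[[], bool]:
    """Pass only if candidate latency is within `max_regression` of the baseline."""
    return lambda: candidate_ms <= baseline_ms * (1 + max_regression)


gates = [
    ("daily", lambda: True),
    ("pre-release", lambda: True),
    ("clone benchmark", clone_benchmark_gate(baseline_ms=20.0, candidate_ms=30.0)),
    ("single-room gray release", lambda: True),
    ("smoke test", lambda: True),
]
print(run_release(gates))
```

In this run the candidate is 50% slower than the baseline, so the flow stops at the benchmark gate and the gray release and smoke test stages never start.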

Accumulating expert experience

 

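The search technology stack is powerful, but behind that power lie many unwritten expert rules. If the platform exposed to ordinary users the expertise needed for complex operations control and business iteration, they would simply give up. So at the DevOps layer we must push engine-service domain knowledge down into the platform and let the platform hide this complexity.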

 

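Take a real search scenario. Suppose the business side changes one field. In reality a single field change may require coordinated changes to both online and offline configuration. You cannot ask users, while editing configuration, to judge whether the change affects only the online service, only the offline service, or both. Nor can you ask them whether the configuration push should take effect in the offline service first or in the online service first, or whether it can only take effect after a full build. All of this is expert knowledge about the engine service.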

 

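Today the sophon DevOps layer quietly absorbs all of this domain knowledge behind the scenes, and users do not need to care about it at all. The operations platform decomposes a complex operation internally and then executes it in orderly stages according to the expert operations DAG we have defined. The figure in the original post shows this DAG.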
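A staged execution driven by a predefined DAG can be sketched with the standard library's `graphlib`. The stage names and their dependencies below are hypothetical. They stand in for whatever ordering rules the engine experts actually encode and are not taken from sophon. The point is that the ordering lives in the platform's DAG, so the user who changed a field never has to decide it.

```python
from graphlib import TopologicalSorter

# Hypothetical expert rule for a schema field change:
# each stage maps to the set of stages it must wait for.
FIELD_CHANGE_DAG = {
    "push_offline_config": set(),
    "full_build": {"push_offline_config"},
    "push_online_config": {"full_build"},
    "switch_index": {"full_build", "push_online_config"},
}


def plan(dag):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


for stage in plan(FIELD_CHANGE_DAG):
    print("run", stage)
```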

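As we keep codifying operations experts' experience into the system (the operations DAG execution flow), the cost of using the platform keeps falling and iteration efficiency keeps rising. Of course, as operations become more complex (for example, when the business view we expose to users has to cover more and more services), the operations DAG grows from a simple execution chain into one that may have multiple execution branches. How to find the optimal execution path then becomes an interesting topic (shown on the right of the figure above). We currently call it shortest path selection. It is an interesting experiment in intelligent operations and a direction we will keep working on.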
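The source does not say how the shortest path selection is computed. One textbook way to pick the cheapest branch in a weighted DAG is dynamic programming over a topological order, sketched below. The function `cheapest_path`, the stage names, and the edge costs (read them as estimated minutes) are all assumptions made for the example.

```python
from graphlib import TopologicalSorter


def cheapest_path(edges, start, goal):
    """edges: {(src, dst): cost} describing a DAG.

    Returns (total_cost, [nodes]) for the cheapest start-to-goal path,
    or None if the goal cannot be reached from start.
    """
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())
    best = {start: (0, [start])}
    # Visiting nodes in topological order means every predecessor is final
    # before the node itself is relaxed.
    for node in TopologicalSorter(preds).static_order():
        for src in preds[node]:
            if src in best:
                cost = best[src][0] + edges[(src, node)]
                if node not in best or cost < best[node][0]:
                    best[node] = (cost, best[src][1] + [node])
    return best.get(goal)


edges = {
    ("start", "full_build"): 60,
    ("start", "incremental_patch"): 5,
    ("full_build", "switch"): 2,
    ("incremental_patch", "switch"): 2,
}
print(cheapest_path(edges, "start", "switch"))
```

With these made-up costs the incremental branch wins with a total of 7. A real system would also have to check that a branch is valid for the change in question before comparing costs.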

 

From single systems to the full chain

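As described earlier, every one of our business scenarios is the result of a whole technology stack working together, and the most important and most challenging part is coordinating online and offline efficiently to give users an end-to-end experience.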

 

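As the figure above shows, what end users see when working with offline data is always a visual definition of data relationships and a simple dump -> Build -> switchindex task list. In reality we hide all the complexity: behind the scenes a complex state machine controls online/offline coordination. I will not go into that diagram here. In online/offline coordination the state machine is not the key point. The key is how we turn each online search business's individual requirements for offline data processing into an abstraction and then support it as a platform.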

 

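Before going into the offline platform technology, here is a brief introduction to the universal requirements a search business has for offline processing. Before the offline platform existed, these requirements were discussed again and again in cross-team online/offline collaboration for complex businesses. The business data that reaches the engine is not a simple database table. It may come from multiple homogeneous or heterogeneous data sources, and every search business needs both full and incremental processing. It is therefore very important to turn these business-specific data source relationships into a high-level abstraction that hides the internal processing steps and unifies the incremental and full processing flows. Otherwise we would have to implement full and incremental data processing code for every new business, which is intolerable.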

 

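Looking back, our offline support used to be inefficient because the data sources in the engine schema definition were reduced to pairs of resources for abstraction and management, so we never extracted the basic abstraction that should have been there. On reflection, every data resource we onboard is a Dynamic Table. If we define these resources with a table abstraction, then common APIs such as creating, deleting, and altering tables, inserting, deleting, updating, and querying table data, and defining relationships between tables can all be consolidated, with no duplicated development. That line of thinking led to the overall design of bahamut, our offline component platform.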

 

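After users define their data sources and the relationships between tables on the platform canvas (we support joins between heterogeneous tables, for example odps and mysql), we submit this front-end Graph to Bahamut for translation. bahamut parses, optimizes, splits, and translates the front-end Graph into several graphs that blink can execute, such as the incremental syncBlink task, the full BulkLoad MR task, and the Blink Join task.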
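The translation step can be pictured with a deliberately simplified sketch. It is not Bahamut's algorithm: the `translate` function, the input shapes, and the rule of emitting one full and one incremental job per table are assumptions made for illustration. Only the job names echo the task types mentioned above, and the parse, optimize, and split phases are omitted entirely.

```python
def translate(tables, joins):
    """Turn a canvas definition into a flat list of executable job descriptions.

    tables: {table_name: source_type}, e.g. {"item": "mysql"}
    joins:  [(left_table, right_table)] pairs drawn on the canvas
    """
    jobs = []
    for name, source in tables.items():
        # Every table gets the same two flows, so no per-business code is written.
        jobs.append({"job": "BulkLoad", "mode": "full", "table": name, "source": source})
        jobs.append({"job": "SyncBlink", "mode": "incremental", "table": name, "source": source})
    for left, right in joins:
        jobs.append({"job": "BlinkJoin", "left": left, "right": right})
    return jobs


for job in translate({"item": "mysql", "user": "odps"}, [("item", "user")]):
    print(job)
```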

 

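The two key graph nodes here are merge and left join. merge handles all tables in 1:1 and 1:N relationships by converting rows to columns into an HBASE intermediate table. For N:1 relationships, taking the example in the original post's figure, we currently support only the main table on the N side (the item table) as the driver. That is, after the N side is updated through blink sync, blink Join merges in the 1 side (the user table) to form a complete row record, which is sent to SwiftSink (incremental) and HDFSSink (full) and finally flows back to BuildService to build the index.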
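The semantics of the two nodes can be shown in memory with plain dictionaries. This is a sketch of the data shapes only: the real processing runs on blink and HBASE, and the function signatures, column names, and sample rows below are made up for the example.

```python
def merge(main_rows, child_rows, key, child_col, as_col):
    """Row-to-column: fold 1:N child rows into one list column on the main row."""
    wide = {row[key]: {**row, as_col: []} for row in main_rows}
    for child in child_rows:
        if child[key] in wide:
            wide[child[key]][as_col].append(child[child_col])
    return wide


def left_join(wide, dim_rows, fk, dim_key):
    """N:1 join driven by the main (N) side: every main row is kept."""
    dim = {row[dim_key]: row for row in dim_rows}
    out = []
    for row in wide.values():
        joined = dict(row)
        for col, val in dim.get(row[fk], {}).items():
            if col != dim_key:
                joined[col] = val
        out.append(joined)
    return out


items = [{"item_id": 1, "title": "phone", "seller_id": 7},
         {"item_id": 2, "title": "case", "seller_id": 8}]
skus = [{"item_id": 1, "sku": "black"}, {"item_id": 1, "sku": "white"}]
users = [{"user_id": 7, "nick": "alice"}]

docs = left_join(merge(items, skus, "item_id", "sku", "skus"), users, "seller_id", "user_id")
for doc in docs:
    print(doc)
```

Item 2 has no matching user row and still produces a document, which is what driving the join from the main table means here.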

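By coordinating online and offline control and building the BaHamut component platform, we let users enjoy powerful offline processing of complex data relationships and computing capability through visual tools, which greatly improves business support efficiency. It has also made our platform a milestone product, the first to integrate offline processing and offer an end-to-end online/offline experience. In addition, we are working on turning the offline capability into a general capability for online services. We believe that in the near future the offline component platform will no longer serve only the HA3 search scenario but all online search services.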

 


 


Origin www.cnblogs.com/chenliangcl/p/12091648.html