Full-link observability behind Double 11: Alibaba EagleEye's comprehensive upgrade for the "cloud-native era"

This article is adapted from the book "A Different Double 11 Technology: Cloud-Native Practices of the Alibaba Economy." Click the image above to download the book!
ban.jpg

Author:
Zhou Xiaofan (Chengsi), Senior Technical Expert, Alibaba Cloud Middleware Technology Department
Wang Huafeng (Shuiyu), Technical Expert, Alibaba Cloud Middleware Technology Department
Xu Tong (Shaokuan), Technical Expert, Alibaba Cloud Middleware Technology Department
Xia (Yahai), Technical Expert, Alibaba Cloud Middleware Technology Department

REVIEW: As a team that has worked on distributed tracing (Tracing) and application performance management (APM) services for many years, the engineers of Alibaba's middleware EagleEye team have witnessed several upgrades of Alibaba's infrastructure. Every architecture upgrade poses a huge challenge to the system's observability. So what new challenges has this "cloud-native" upgrade brought us?

Cloud native and observability

In the just-concluded Double 11 of 2019, we once again witnessed a technological marvel: this time, we spent a whole year moving Alibaba's core e-commerce business entirely onto the cloud, and relied on Alibaba Cloud's technical infrastructure to withstand a zero-hour transaction peak of 540,000 orders per second; our development and operations model has also officially entered the cloud-native era.

The new paradigm advocated by cloud native has a significant impact on traditional development and operations models: concepts such as microservices and DevOps make development more efficient, but the trade-off is that troubleshooting and fault localization across a massive number of microservices become much harder; the maturing of containerization and container-scheduling technologies such as Kubernetes makes large-scale software delivery easy, but the challenge it brings is how to assess capacity and schedule resources more accurately, ensuring the best balance between cost and stability.

This year Alibaba also explored new technologies such as Serverless and Service Mesh, which in the future will completely take over the operation and maintenance of the middleware and IaaS layers from users' hands; this places even greater demands on the degree of automation of the infrastructure.

Infrastructure automation is the prerequisite for fully releasing the dividends of cloud native, and observability is the cornerstone of all automated decision-making.

If the latency and success or failure of every interface can be measured precisely, if the full path of every user request can be traced end to end, and if the dependencies between applications, and between applications and the underlying resources, can be mapped automatically, then we can use this information to automatically determine the root cause of a business anomaly, decide whether the underlying resources need to be migrated, scaled out, or taken offline, and automatically calculate whether the resources required for the Double 11 peak are fully prepared without waste for any application.

Observability ≠ monitoring

Many people ask whether "observability" is just another way of saying "monitoring." In fact, the industry's definitions of the two are quite different.

Unlike "monitoring," which places more emphasis on detecting and warning about problems, the ultimate goal of "observability" is to give a reasonable explanation for everything that happens in a complex distributed system. Monitoring focuses more on the period during and after software delivery (Day 1 & Day 2), that is, what we often call "during and after the incident," whereas observability is responsible for the entire development and operations lifecycle.

Back to "observability" itself, is still commonplace "link (Tracing)" , "indicators (Metric)" and "log (Logging)" constituted a separate pull-out look is very mature technology. But these three things and how to integrate cloud infrastructure? How to better link between them, together? And how they can be better and do online business cloud era of combination? This is the direction our team the past two years trying to explore.

What we've done this year

For this year's Double 11, the EagleEye team's engineers explored four new technical directions, providing strong guarantees for the group-wide move of business onto the cloud, automated big-promotion preparation, and overall stability:

Scenario-oriented traffic observability

As Alibaba's e-commerce businesses keep growing more complex and diverse, preparation for big promotions has also been trending toward finer-grained, scenario-oriented work.

In the past, each microservice owner prepared based on their own system and its immediate upstream and downstream systems, each fighting their own battle. This divide-and-conquer approach, while efficient enough, inevitably leaves gaps; the root cause is the mismatch between how applications are partitioned and the actual business scenarios. Take the trading system as an example: a single trading system carries many types of business such as Tmall, Freshippo (Hema), Damai, and Fliggy, and each business differs in expected call volume, downstream dependency paths, and so on. As the owner of the trading system, it is very difficult to sort out clearly how the detailed logic of each business affects the downstream.

This year the EagleEye team launched a scenario-based tracing capability: combined with a business metadata dictionary, it uses non-intrusive automatic tagging to dye traffic and map it onto business scenarios. This shifts the view of services and their downstream middleware data from the old application-centric perspective to one centered on business scenarios, and is therefore much closer to the real traffic model of a big promotion.

cs1.png

As shown above, this is a product-query example in which four systems A, B, C, and D provide "product details," "product type," "price details," and "discount details" query capabilities respectively. The entry application A exposes a product-query interface S1. Through EagleEye, we can quickly find that applications B, C, and D are dependencies of application A and are also downstream of interface S1. For system-level stability governance, this trace data is already sufficient.

But such a view does not actually provide business-level observability, because this dependency structure contains two business scenarios, and the traces corresponding to the two scenarios are completely different: for category-A products the trace is A -> B -> C -> D, while for category-B products it is A -> B -> C. Suppose the ratio of these two product categories is 1:1 on a normal day but 1:9 during a big promotion; then sorting out the traces only from the system's perspective, or from a single business's perspective, cannot yield a reasonable traffic-estimation model. For example, the share of S1 requests that actually reach D is about one half under the daily mix, but could drop to about one tenth under the promotion mix, so capacity and dependency estimates derived from the daily ratio would be off by several times.

So if we can dye the two kinds of traffic by tagging them at the system layer, we can easily sort out the traces corresponding to the two business scenarios. Such a finer-grained view is particularly important for keeping the business stable, and for more reasonable dependency analysis and configuration of rate-limiting and degradation policies.
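
To make the idea of traffic dyeing concrete, here is a minimal sketch that uses OpenTracing baggage as a stand-in. EagleEye's real capability tags traffic automatically and non-intrusively; the key name `biz_scenario` and this helper class are assumptions for illustration only.

```java
import io.opentracing.Span;
import io.opentracing.util.GlobalTracer;

public class ScenarioTagging {

    private static final String SCENARIO_KEY = "biz_scenario"; // illustrative key name

    /** Dye the current request at the entry point, e.g. "category-A" vs "category-B". */
    public static void markScenario(String scenario) {
        Span span = GlobalTracer.get().activeSpan();
        if (span != null) {
            // Baggage travels with the request, so every downstream hop
            // (A -> B -> C -> D) can be grouped and measured per scenario.
            span.setBaggageItem(SCENARIO_KEY, scenario);
        }
    }

    /** Read the scenario on any downstream hop, e.g. to emit per-scenario metrics. */
    public static String currentScenario() {
        Span span = GlobalTracer.get().activeSpan();
        return span == null ? null : span.getBaggageItem(SCENARIO_KEY);
    }
}
```

With a tag like this propagated along the whole trace, per-scenario traces and call volumes can be aggregated without touching the business code of every downstream system.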

This scenario-based capability delivered enormous value in this year's Double 11 preparation. Many business systems used it to map out their own core business links, making preparation more composed and less prone to omissions. At the same time, a whole series of service-governance tools were given a scenario-aware upgrade with EagleEye's support, for example scenario-based traffic recording and replay, scenario-based fault drills, and scenario-based precise regression testing. Together with these tools, which fit the business scenarios far more closely, the observability granularity of the entire Double 11 preparation entered the "high-definition era."

Intelligent root-cause localization based on observability data

In the cloud-native era, with the introduction of technologies such as microservices and the growth of business scale, the number of application instances keeps growing and the dependencies of core businesses become ever more complex. On the one hand we enjoy the dividend of exponentially improved development efficiency; on the other hand we suffer from the stubbornly high cost of fault localization. Especially when a business problem occurs, finding the problem quickly and stopping the bleeding becomes very hard. As the "guardian" of application performance within the group, helping users localize faults quickly became the EagleEye team's new challenge this year.

To localize a fault, you must first answer: what do you consider a fault? This requires operations staff to understand the business deeply. Many operators like an exhaustive approach: wire up every observable metric with every kind of alert, which feels "safe." In practice, when a fault actually arrives, the screen fills with abnormal metrics and the alert SMSes keep piling up; this kind of "observability" looks powerful but its actual effect is counterproductive.

The team carefully reviewed the group's historical faults over the years; core applications within the group typically see four classes of faults (excluding problems in the business's own logic): resource, traffic, latency, and error.

Broken down further:

  1. Resource: e.g. CPU, load, memory, thread count, connection pools;
  2. Traffic: business traffic dropping to zero OR rising or falling abnormally sharply; middleware traffic, such as message-service traffic, dropping to zero, etc.;
  3. Latency: the latency of services the system provides OR services the system depends on suddenly spiking, almost always a precursor of a system problem;
  4. Error: the total number of errors returned by a service, and the success rate of the services the system provides OR depends on.

With these fault categories as a handle, the next step is to "follow the vine to the melon." Unfortunately, as the business grows more complex, this "vine" keeps getting longer. Take a sudden latency spike as an example: many possible root causes hide behind it. It might be a surge of requests caused by an upstream promotion, it might be frequent GC in the application itself slowing everything down, it might be an overloaded downstream database responding slowly, and there are countless other possibilities.

EagleEye used to provide only the metrics themselves. Looking at a single call chain, an operator had to scroll several screens just to read one complete trace, let alone switch back and forth across multiple systems to troubleshoot; efficiency was out of the question.

The essence of fault localization is a process of investigating, ruling out, and investigating again; a process of "eliminating all the impossible so that whatever remains is the truth." Think about it: an enumerable set of possibilities plus an iterable process is exactly what computers are best at. The intelligent fault-localization project was born against this background.
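
To make the "enumerate the possibilities, then iterate and eliminate" idea concrete, below is a minimal, hypothetical Java sketch for the latency-spike example above. The candidate causes, thresholds, and the `ObservabilityClient` facade are all placeholders, not EagleEye's actual diagnosis model.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

public class LatencySpikeDiagnosis {

    /** Hypothetical facade over observability data; the thresholds below are illustrative. */
    interface ObservabilityClient {
        double qpsGrowth();     // request volume vs. the same period yesterday
        double gcPauseRatio();  // fraction of wall-clock time spent in GC
        double dbRt99();        // 99th-percentile downstream DB response time, in ms
    }

    public static String diagnose(ObservabilityClient metrics) {
        // Enumerable candidate root causes, each backed by an observability check.
        Map<String, BooleanSupplier> candidates = new LinkedHashMap<>();
        candidates.put("upstream promotion caused a request surge", () -> metrics.qpsGrowth() > 3.0);
        candidates.put("frequent GC slowed the application down", () -> metrics.gcPauseRatio() > 0.2);
        candidates.put("overloaded downstream database responded slowly", () -> metrics.dbRt99() > 500);

        // Iterate: rule out whatever the data does not support; what remains is the suspect.
        for (Map.Entry<String, BooleanSupplier> candidate : candidates.entrySet()) {
            if (candidate.getValue().getAsBoolean()) {
                return candidate.getKey();
            }
        }
        return "no known cause matched - fall back to manual investigation";
    }
}
```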

When people hear "intelligent," many immediately associate it with algorithms and over-mystify them. Anyone familiar with machine learning knows that data quality comes first, the model second, and the algorithm only last. The reliability and completeness of data collection, together with domain modeling, are the real core competence; only after getting the data right is intelligence even possible.

The evolution of intelligent fault localization followed exactly this line of thinking, but before that we first had to guarantee data quality. Thanks to the EagleEye team's many years of deep work on large-scale data processing, data reliability can already be guaranteed to a very high standard; otherwise, whenever a fault occurred, we would first have to suspect our own metrics.

The next steps are data completeness and building the diagnosis model. These two parts are the cornerstone of intelligent diagnosis and determine how deep fault localization can go; they also reinforce each other: building the diagnosis model reveals gaps in the observability metrics, and filling in those metrics in turn deepens the diagnosis model.

We keep improving both mainly through a combination of the following three approaches:

  • First, replaying historical faults: historical faults are like exam papers with known answers. We build the initial diagnosis model from a subset of historical faults plus human experience, then iterate it against the remaining historical faults; however, the model produced at this step tends to overfit;
  • Second, using chaos engineering to simulate common anomalies and continually correct the model;
  • Third, manual labeling online, to keep filling in observability metrics and correcting the diagnosis model.

After these three stages, the cornerstone is basically in place. The next issue is efficiency: the model iterated out of the steps above is not actually the most efficient, because human experience and thinking are linear. The team therefore did two things on top of the existing model: edge diagnosis and intelligent pruning. Part of the localization process is pushed down to the agent nodes: when a phenomenon that might affect the system is detected, the key on-the-spot information is saved automatically and a key event is reported, and the diagnosis system automatically adjusts the localization path according to the weight of each event.
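
A minimal, hypothetical sketch of the "intelligent pruning" idea: agent nodes report key events with weights, and the diagnosis system investigates the highest-weight branches of the dependency tree first instead of walking it linearly. The event type, branch names, and weights here are placeholders for illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DiagnosisPruning {

    /** A key event reported by an agent node for some branch of the dependency tree. */
    record KeyEvent(String branch, double weight) {}

    /** Order candidate branches by the accumulated weight of their reported events. */
    public static List<String> orderBranches(List<KeyEvent> reportedEvents) {
        Map<String, Double> weightByBranch = reportedEvents.stream()
                .collect(Collectors.groupingBy(KeyEvent::branch,
                        Collectors.summingDouble(KeyEvent::weight)));

        // Branches with no reported events are effectively pruned from the early search.
        return weightByBranch.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```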

Since intelligent root-cause localization went live, it has helped thousands of applications pinpoint the root cause of faults, with very high user satisfaction. With root-cause conclusions as the handle and observability as the cornerstone, the automation capability of the infrastructure will improve greatly. During this year's Double 11 preparation, this fast fault-localization capability gave application-stability owners a much more automated set of tools. We also believe that in the cloud-native era, the dynamic balance of running quality, cost, and efficiency that enterprise applications pursue is no longer out of reach; the future is promising!

Last-mile problem localization capability

What is "last mile" problem localization? What characterizes a "last mile" problem? And why not the "last hundred meters" or the "last meter"?

First, let us align on a concept: what is the "last mile"? In everyday life, it has the following characteristics:

  • A bit too far to walk, too close to take a ride; this in-between distance is awkward;
  • The road conditions of the last mile are very complex: it may be a broad avenue, a rugged path, or even a maze-like indoor route (food-delivery riders probably know this best).

So what does the last mile mean in the field of distributed problem diagnosis, and what are its characteristics?

  • In the diagnosis workflow, you are no longer far from the root cause: the problem has basically been localized to a specific application, service, or node, but the exact abnormal code snippet still cannot be pinned down;
  • The types of data that can pin down the root cause are quite varied: it might be memory-usage analysis, CPU-usage analysis, a specific business log or error code, or even just the symptom itself combined with diagnostic experience leading quickly to a conclusion.

With the analysis above, we now share some common understanding of the last mile. Next, let's go into detail: how do we implement last-mile problem localization?

First, we need a way to arrive accurately at the starting point of the last mile, that is, the application, service, or machine node where the root cause lies. This avoids wasted analysis from the wrong starting point, like a delivery rider picking up the wrong order. So how do we accurately scope the root cause within a tangled call graph? Here we rely on distributed tracing (Tracing), a capability commonly used in the APM field. Tracing lets us accurately identify and analyze the abnormal application, service, or machine, pointing the way for last-mile localization.

Then, by associating more detailed information with the trace data, such as local method stacks, business logs, machine state, and SQL parameters, we achieve last-mile problem localization, as shown in the figure below (a minimal code sketch of these instrumentation points follows the list):

cs2.png

  • Core interface instrumentation: by instrumenting before and after interface execution, record the basic trace information, including TraceId, RpcId (SpanId), timestamp, status, IP, and interface name. This information can reconstruct the most basic trace shape;
  • Automatically associated data: associated information that can be recorded automatically within the lifecycle of a call, including SQL, request input/output parameters, and exception stacks. This kind of information does not affect the trace shape, but in some scenarios it is a necessary condition for pinpointing the problem precisely;
  • Actively associated data: associated data that must be recorded manually within the lifecycle of a call, usually business data such as business logs and business identifiers. Business data is highly individual and cannot be configured uniformly, but once actively associated with the trace data it greatly improves the efficiency of diagnosing business problems;
  • Local method stacks: due to performance and cost constraints, it is impossible to instrument every method with trace points. Here we can use method sampling or on-the-fly instrumentation to achieve precise localization of slow local methods.
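
Below is a minimal sketch of the first three instrumentation points, expressed with the generic OpenTracing API rather than EagleEye's internal SDK. The tag name `biz.order_id` and the wrapping style are assumptions for illustration.

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.tag.Tags;
import io.opentracing.util.GlobalTracer;
import java.util.Map;
import java.util.concurrent.Callable;

public class InterfaceInstrumentation {

    /** Core interface instrumentation: one span per interface execution. */
    public static <T> T traceCall(String interfaceName, Callable<T> body) throws Exception {
        Tracer tracer = GlobalTracer.get();
        // TraceId/SpanId, timing, and status are captured by the tracer itself.
        Span span = tracer.buildSpan(interfaceName).start();
        try {
            return body.call();
        } catch (Exception e) {
            // Automatically associated data: mark the error and keep the exception.
            Tags.ERROR.set(span, true);
            span.log(Map.of("event", "error", "error.object", (Object) e));
            throw e;
        } finally {
            span.finish();
        }
    }

    /** Actively associated data: attach a business identifier to the current span. */
    public static void tagBusinessId(String orderId) {
        Span active = GlobalTracer.get().activeSpan();
        if (active != null) {
            active.setTag("biz.order_id", orderId);
        }
    }
}
```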

Last-mile problem localization allows us to dig deeply into hidden system risks and find root causes quickly, both in daily operation and during big-promotion preparation. Here are two real-world cases:

  • One application saw occasional RPC call timeouts at peak overall traffic. By analyzing automatically captured local method-stack snapshots, we found that the time was actually spent in log-output statements; the cause was that LogBack versions below 1.2.x are prone to "hot locks" under highly concurrent synchronous calls, and the problem was completely solved by upgrading the version or switching to asynchronous log output;

  • A user reported an order anomaly. The business engineer first used the user's UserId to find the business log at the order-placement entry point, then used the trace identifier TraceId associated with that log to line up all the downstream business flows, states, and events in actual call order, and quickly located the cause of the order anomaly (the UID cannot be automatically propagated to all downstream links, but the TraceId can).

Monitoring and alerting often reflect only the surface of a problem; the root cause ultimately has to be found in the source code. This year EagleEye made a major breakthrough in "fine-grained sampling" of diagnostic data, greatly improving the granularity and value of the data needed for last-mile localization without letting costs balloon. Throughout the long Double 11 preparation period, it helped users eliminate one source of system risk after another, ensuring a "silky smooth" experience on the day of the big promotion.

Fully embracing cloud-native open-source technology

Over the past year, the EagleEye team has embraced open-source technology and fully integrated with the industry's mainstream observability frameworks. We launched the Tracing Analysis service on Alibaba Cloud, which is compatible with mainstream open-source tracing frameworks such as Jaeger (OpenTracing), Zipkin, and SkyWalking. Programs already using these frameworks can, without changing a single line of code, simply update the reporting endpoint in their configuration to obtain far more powerful trace-analysis capabilities than self-hosted open-source tracing products, at a much lower cost.
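
As a rough illustration of what "only the reporting endpoint changes" means for a Jaeger user, here is a minimal programmatic sketch; the URL is a placeholder, the real Tracing Analysis endpoint and access token come from the Alibaba Cloud console, and the same switch can usually be made without touching code at all, e.g. via the JAEGER_ENDPOINT environment variable.

```java
import io.jaegertracing.Configuration;
import io.jaegertracing.Configuration.ReporterConfiguration;
import io.jaegertracing.Configuration.SamplerConfiguration;
import io.jaegertracing.Configuration.SenderConfiguration;
import io.opentracing.Tracer;

public class TracingAnalysisSetup {
    public static Tracer buildTracer() {
        // Point the existing Jaeger client at the managed collector endpoint (placeholder URL).
        SenderConfiguration sender = new SenderConfiguration()
                .withEndpoint("https://<tracing-analysis-endpoint>/api/traces");
        return new Configuration("my-service")
                .withSampler(new SamplerConfiguration().withType("const").withParam(1))
                .withReporter(new ReporterConfiguration().withSender(sender))
                .getTracer();
    }
}
```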

The EagleEye team also released a fully managed Prometheus service, optimized to address the open-source version's problems of a large deployment footprint, poor write performance when there are too many monitored nodes, and slow long-range, multi-dimensional queries. The optimized, managed Prometheus cluster fully supported Service Mesh monitoring within Alibaba as well as several heavyweight Alibaba Cloud customers, and we will also contribute a number of these optimizations back to the community. The managed version remains compatible with open-source Prometheus, and workloads on Alibaba Cloud Container Service can be migrated to it with one click.

Observability and stability are inseparable. This year, EagleEye's engineers have also compiled their articles on observability and stability construction from over the years, along with related tools, and published them on GitHub; everyone is welcome to build on them together with us.

ban.jpg

Highlights of the book

  • A detailed account of the ultra-large-scale Kubernetes cluster practice behind Double 11, the problems encountered, and their solutions
  • The best cloud-native combination: Kubernetes + containers + Shenlong (X-Dragon) bare metal, with the technical details of moving 100% of core systems onto the cloud
  • The ultra-large-scale Service Mesh landing solution for Double 11

" Alibaba Cloud native concern micro service, Serverless, container, Service Mesh and other technical fields, focusing cloud native popular technology trends, cloud native large-scale landing practice, most do understand the developer's native cloud technology circles."


Source: www.cnblogs.com/alisystemsoftware/p/12074729.html