Interpretation: The direction of observability development under cloud native

Introduction: I was very fortunate to take part in the cloud native community Meetup in Beijing, where I had the opportunity to discuss cloud native technologies and applications with many industry experts. At this Meetup I gave a talk on observability under cloud native; this article is a written summary of that talk, and you are welcome to leave comments for discussion.

The origin of observability

Observability first arose in the field of electrical engineering. As systems grew more complex, engineers needed a mechanism for understanding the internal operating state of a system in order to monitor it and repair problems. To this end, they designed many sensors and dashboards to represent the system's internal state.

A system is said to be observable if, for any possible evolution of state and control vectors, the current state can be estimated using only the information from outputs.
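For the linear systems studied in control theory, this definition takes a precise form known as the Kalman rank condition. The following is a textbook statement, added here for reference rather than taken from the talk:

```latex
% For a linear time-invariant system
%   \dot{x} = A x + B u, \qquad y = C x, \qquad x \in \mathbb{R}^n,
% the current state x can be reconstructed from the outputs y
% if and only if the observability matrix has full rank:
\operatorname{rank}\begin{pmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{pmatrix} = n
```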

Electrical engineering has developed over hundreds of years, and observability in its various sub-fields keeps being improved and upgraded. Transportation (cars, aircraft, and so on), for example, can be regarded as a master class in observability. Setting aside super-engineered systems such as aircraft, even an ordinary car that is allowed on the road has hundreds of sensors inside it, detecting all kinds of states inside and outside the vehicle so that it can drive stably, comfortably, and safely.


The future of observability

Over hundreds of years of development, observability in electrical engineering has come to do far more than help people check and locate problems. From the perspective of automotive engineering, observability has developed through several stages:

  1. Blindness: On January 29, 1886, the German engineer Karl Benz invented the first car in human history. At that time, the car had only the most basic ability to drive, with nothing that could be called observability.
  2. Sensors: As cars entered the market, people needed a better way to know whether a car was out of fuel or water, so basic sensors and dashboards were invented.
  3. Warning: To better ensure driving safety, cars began to use self-check and real-time warning systems to proactively notify the driver of abnormal conditions, such as a dead battery, high coolant temperature, low tire pressure, or worn brake pads.
  4. Assistance: Although alarms are raised immediately, sometimes drivers still cannot, or do not want to, handle them. This is where assistance systems come in: cruise control, active safety, automatic parking, and so on. These systems combine sensors with automatic control and can partially take over tasks that the driver cannot or does not want to perform.
  5. Autonomous driving: All of the functions above still ultimately require human participation. Autonomous driving requires none: the observability system plus the control system can make the car drive itself.

Core elements of autonomous driving


As the pinnacle of observability in electrical engineering, autonomous driving makes maximum use of the various internal and external data the car obtains. In summary, it has several core elements:

  1. Rich data sources: Multiple lidar and camera units around the car observe surrounding objects and their states in real time, at high frame rates and across 360°; internally, the car knows its current speed, wheel angle, tire pressure, and other information in real time. Know the enemy and know yourself.
  2. Data centralization: Compared with assisted driving, a core breakthrough of autonomous driving is that all data from inside and outside the car is processed centrally, which truly brings out the value of the data, instead of each module operating on its own data as an island.
  3. Powerful computing: Centralized data also means the data volume expands rapidly. Every autonomous driving system is backed by a powerful chip; only sufficient computing power can guarantee that enough computation is performed in the shortest time.
  4. Software iteration: Computing power plus algorithms constitutes intelligence, the ultimate goal. But no algorithm is flawless; algorithms are continuously upgraded based on accumulated autonomous driving data, so that the software system keeps improving and achieves a better driving effect.

Observability of IT systems

After decades of development, monitoring and troubleshooting in IT systems have gradually been abstracted into observability engineering. Currently, the most mainstream approach is to use a combination of Metrics, Logging, and Tracing.

[Figure: Peter Bourgon's Venn diagram of Metrics, Tracing, and Logging]

The picture above should be familiar to everyone: it comes from a blog post Peter Bourgon published after attending the 2017 Distributed Tracing Summit, which concisely introduces the definitions of Metrics, Tracing, and Logging and the relationships among them. Each of the three data types has its own place in observability, and none of them can be completely replaced by the others.

Take the typical troubleshooting process described in Grafana Loki's introduction:

1. At first, we discover an anomaly through various preset alerts (usually based on Metrics or Logging).
2. After discovering the anomaly, we open the monitoring dashboard, look for the abnormal curve, and use various queries and statistics to find the abnormal module (Metrics).
3. We then query and statistically analyze the logs of that module and its dependencies to find the core error message (Logging).
4. Finally, we use detailed trace data to locate the code that caused the problem (Tracing).


The example above shows how Metrics, Tracing, and Logging are used together to troubleshoot a problem. Of course, different scenarios allow different combinations: a simple system can alert directly on error messages in the logs and locate the problem from them, or alert on basic indicators (Latency, ErrorCode) extracted from the call chain. But overall, a system with good observability must have all three types of data.
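As a concrete (if simplified) illustration of how the three data types complement one another, here is a minimal Python sketch. The service name, metric names, and `process` function are hypothetical, and a real system would use a metrics library rather than an in-process dict:

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("order-service")

# Toy in-process metrics store; in practice this would be a Prometheus
# client, StatsD, or an OpenTelemetry meter.
METRICS = {"requests_total": 0, "errors_total": 0, "latency_ms": []}

def process(order_id: str) -> None:
    """Hypothetical business logic; fails for one id to show the flow."""
    if order_id == "bad":
        raise ValueError("inventory lookup failed")

def handle_order(order_id: str) -> None:
    trace_id = uuid.uuid4().hex          # Tracing: one id per request
    start = time.time()
    METRICS["requests_total"] += 1       # Metrics: request counter
    try:
        process(order_id)
    except Exception:
        METRICS["errors_total"] += 1     # Metrics: error counter
        # Logging: the error record carries the trace id, so an alert
        # fired on errors_total can be drilled down, via the logs, to
        # the exact request whose trace explains the failure.
        logger.exception("order failed order_id=%s trace_id=%s",
                         order_id, trace_id)
        raise
    finally:
        # Metrics: latency samples feeding the dashboard curves
        METRICS["latency_ms"].append((time.time() - start) * 1000)
```

An alert on the error counter (step 1), the latency curve (step 2), the error log (step 3), and the trace id it carries (step 4) walk exactly the troubleshooting path described above.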

Observability under cloud native

Cloud native brings much more than the ability to deploy applications on the cloud. Its full definition amounts to a new generation of IT architecture: a complete evolution of development models, system architecture, deployment models, and infrastructure. This upgrade also raises new demands on observability:


  1. Higher efficiency demands: With the spread of the DevOps model, the efficiency requirements for planning, development, testing, and delivery keep rising, and with them comes the need to know in real time whether a release succeeded, what went wrong, where the problem is, and how to fix it quickly.
  2. More complex systems: The architecture has evolved from monolithic to layered to today's microservice model. Each upgrade has brought gains in development efficiency, release efficiency, system flexibility, and robustness, but also higher system complexity and harder problem localization.
  3. More dynamic environments: Both microservice architectures and containerized deployment make the environment more dynamic and each instance's life cycle shorter. After a problem occurs, the scene has often already been destroyed, so logging in to a machine to troubleshoot is no longer an option.
  4. More upstream and downstream dependencies: Locating a problem ultimately means investigating upstream and downstream. In an environment of microservices, cloud, and Kubernetes, there are far more of these dependencies: other business applications, the various cloud products in use, all kinds of middleware, Kubernetes itself, the container runtime, virtual machines, and so on.

The savior: OpenTelemetry

I believe many readers know these problems all too well, and in response the industry has launched a variety of observability products, including many open source and commercial ones. For example:

  1. Metrics: Zabbix, Nagios, Prometheus, InfluxDB, OpenFalcon, OpenCensus
  2. Tracing: Jaeger, Zipkin, SkyWalking, OpenTracing, OpenCensus
  3. Logging: ELK, Splunk, SumoLogic, Loki, Loggly


A combination of these projects can more or less solve specific types of problems, but in real applications you will run into various issues:

  1. Multiple solutions interwoven: at least three separate solutions may be needed for Metrics, Logging, and Tracing, and maintaining them all is expensive.
  2. Data is not interoperable: even for the same business component and the same system, data produced by different solutions is hard to connect, so the value of the data cannot be fully exploited.
  3. Vendor lock-in: data collection, transmission, storage, computation, visualization, alerting, and so on may all be tied to a vendor. Once an observability system is in production, the cost of replacing it is enormous.
  4. Not cloud-native friendly: many of these solutions target traditional systems; their support for cloud native is relatively weak, and they are expensive to deploy and use, falling short of the one-click deployment and out-of-the-box experience expected of cloud native.


Against this background, the OpenTelemetry project was born under the CNCF (Cloud Native Computing Foundation), aiming to unify Logging, Tracing, and Metrics and make the data interoperable.

Create and collect telemetry data from your services and software, then forward them to a variety of analysis tools.

[Figure: OpenTelemetry architecture, in which Libraries generate telemetry data in a uniform format and the Collector receives it and forwards it to various backend systems]

OpenTelemetry's core function is to generate and collect observability data and to support transmitting it to various analysis software. The overall architecture is shown in the figure above: the Library generates observability data in a uniform format, and the Collector receives that data and supports forwarding it to all kinds of backend systems.
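As a minimal sketch of the Library side (Python here; the service name and endpoint are illustrative, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages), an application produces spans in the uniform format and ships them over OTLP to a local Collector, which then forwards them to whichever backend is configured:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Library: generate spans tagged with the service's identity.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
# Export over OTLP/gRPC to a Collector running alongside the app;
# the Collector decides which backend(s) the data ends up in.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("place_order") as span:
    span.set_attribute("order.id", "12345")  # business context on the span
```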

The revolutionary advances OpenTelemetry brings to cloud native include:

  1. Unified protocol: OpenTelemetry gives us a unified standard for Metrics, Tracing, and Logging (the latter still under development, though the LogModel has been defined). All three share the same metadata structure and can easily be correlated with one another (a sketch follows this list).
  2. Unified agent: a single agent can collect and transmit all observability data. There is no need to deploy separate agents for each system, which greatly reduces resource consumption and makes the overall observability architecture simpler.
  3. Cloud-native friendly: OpenTelemetry was born in the CNCF and supports all kinds of cloud-native systems well. In addition, many cloud vendors have announced support for OpenTelemetry, which will make it even more convenient to use in the future.
  4. Vendor-neutral: the project is completely neutral and favors no vendor, so everyone is free to choose or change the service provider that suits them, without being monopolized or locked in by any particular vendor.
  5. Compatibility: OpenTelemetry is backed by the various observability projects under the CNCF, and it will offer very good compatibility with OpenTracing, OpenCensus, Prometheus, Fluentd, and others, so everyone can migrate to OpenTelemetry seamlessly.
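To make point 1 concrete: the key to interconnection is that every signal carries the same Resource metadata. Here is a sketch assuming a recent `opentelemetry-sdk` (the metrics API stabilized after this article was written) and hypothetical service and metric names:

```python
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    PeriodicExportingMetricReader,
    ConsoleMetricExporter,
)

# One Resource describes the emitting entity; attaching it to both the
# tracer and the meter is what lets a backend join traces and metrics.
resource = Resource.create({
    "service.name": "checkout",
    "deployment.environment": "prod",
})

tracer_provider = TracerProvider(resource=resource)
meter_provider = MeterProvider(
    resource=resource,
    metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())],
)

meter = meter_provider.get_meter("checkout")
requests = meter.create_counter("http.server.requests")
requests.add(1, {"http.route": "/orders"})  # shares service.name with spans
```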

OpenTelemetry's limitations

From the analysis above, OpenTelemetry is positioned as observability infrastructure: it solves the problems of data specification and data acquisition, while everything further up the stack is left to each vendor to implement. Ideally there would be a unified engine to store all Metrics, Logging, and Tracing data, and a unified platform to analyze, visualize, and correlate it. At present, no vendor supports a unified OpenTelemetry backend really well, so you still need products from different vendors to implement those parts. This in turn makes correlating the data more complicated, since it must also be correlated across vendors. I believe this problem will certainly be solved within one or two years; many vendors are now working hard on unified solutions for all of OpenTelemetry's data types.


The future direction of observability


Since the Feitian 5K project began in 2009, our team has been responsible for observability-related work such as monitoring, logging, and distributed tracing. We have lived through the architectural shifts from minicomputers to distributed systems to microservices and the cloud, and the corresponding observability solutions have evolved a great deal along the way. We feel that the overall development of observability maps very well onto the levels defined for autonomous driving.

Autonomous driving is divided into six levels. At levels 0 to 2, the human still does the driving; from level 3 onward the driver's hands and eyes can temporarily leave the task; and at level 5 people can give up the boring job of driving entirely and move around freely in the car.

For the observability of IT systems, six analogous levels can be defined:

  • Level 0: Manual analysis. Relying on basic dashboards, alerts, log queries, distributed tracing, and similar means, alerting and analysis are done by hand. This is the stage most companies are at today.
  • Level 1: Intelligent alerting. All observable data is scanned automatically, with machine learning used to identify anomalies and raise alerts, removing the need to manually set and tune baseline alerts (a toy sketch follows this list).
  • Level 2: Anomaly correlation and a unified view. Automatically identified anomalies are correlated with their context to form a unified business view, making it easy to locate problems quickly.
  • Level 3: Root cause analysis and self-healing. The root cause is located automatically from the anomaly and the system's CMDB information, and once it is located accurately the problem can heal itself. This stage is a qualitative leap: in some scenarios problems fix themselves with no human involvement.
  • Level 4: Failure prediction. Failures always cause losses, so the best case is to avoid them altogether. Failure prediction uses accumulated precursor signals of past failures to better guarantee system reliability, achieving foresight before the event.
  • Level 5: Change impact prediction. We know that most failures are caused by changes, so if we can simulate each change's impact on the system and the problems it might cause, we can evaluate in advance whether the change should be allowed.
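As a toy illustration of the Level 1 idea, learning the baseline from the data instead of hand-setting thresholds, here is a rolling z-score detector in Python; real intelligent-alerting systems use far more robust models, and the window and threshold here are arbitrary:

```python
import statistics
from collections import deque

def make_detector(window: int = 60, k: float = 3.0):
    """Flag a point that deviates from the rolling mean by more than
    k standard deviations; the baseline adapts as new data arrives."""
    history = deque(maxlen=window)

    def observe(value: float) -> bool:
        anomalous = False
        if len(history) >= 10:  # wait for some history before judging
            mean = statistics.fmean(history)
            std = statistics.pstdev(history)
            anomalous = std > 0 and abs(value - mean) > k * std
        history.append(value)
        return anomalous

    return observe

detect = make_detector()
for latency_ms in [12, 11, 13, 12, 14, 11, 12, 13, 12, 11, 95]:
    if detect(latency_ms):
        print(f"anomaly: latency {latency_ms}ms deviates from the learned baseline")
```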


Alibaba Cloud SLS's work on observability

Our SLS team is currently working on cloud-native observability. At the bottom of the stack, based on OpenTelemetry (the future standard for cloud-native observability), we implement unified collection of all kinds of observable data, covering various data sources and data types, with multi-language support, multi-device support, and unified typing. Above that, we provide unified storage and computing capable of handling every type of observable data, supporting PB-scale storage, ETL, stream computing, and second-level analysis over tens of billions of records, which gives the algorithm layer strong computing support. Because IT system problems are very complex, especially across different scenarios and architectures, we combine algorithms and experience for anomaly analysis: the algorithms include basic statistics, logic-based algorithms, and AIOps-related algorithms, while the experience includes manually entered expert knowledge, problem solutions accumulated from the Internet, and certain external events. At the top, we provide decision-support functions such as alert notification, data visualization, and Webhooks, plus rich external integration capabilities, such as connecting to third-party visualization, analysis, and alerting systems, and an OpenAPI that makes it easy for different applications to integrate.


Summary

As the most active CNCF project after Kubernetes, OpenTelemetry has attracted the attention of the major cloud vendors and related solution companies, and it is widely believed it will become the standard for observability under cloud native. Although it has not yet reached production readiness, the SDKs for each language and the Collector are basically stable, and a production-ready release is expected in 2021, which is worth looking forward to.

Author: Ethylene

Original link 

This article is the original content of Alibaba Cloud and may not be reproduced without permission

 
