What are the key technical points for implementing an observability platform? [Book Giveaway | "Observability Engineering", Issue 9]

Now that the concept of observability has taken hold, observability platforms are entering the implementation stage, and the direction is beyond doubt. But the other boot has yet to drop: how can such a platform take root in the enterprise as a unified, integrated whole?

Observability is not just a buzzword or keyword hype. It is worth reviewing the evolution of operations management as we know it, setting aside the red tape around processes and people, and focusing only on the changes in infrastructure and application architecture, and on the endless stream of technical tools that accompany them.


Compatibility with the full range of telemetry signals

From the perspective of telemetry methods, every signal type exists for a purpose and a reason, and arbitrarily picking one of them as a synonym for observability is an extreme position. When debugging a production environment, it is hard to rely on any single kind of signal. We need to choose a reasonable combination of SLIs based on the characteristics and service type of each application system, and use the appropriate signals to cover the target system; the goal is to build up the "observability attributes" of the application itself. That means choosing, adding, or changing signal types deliberately, tailored to actual needs. More monitoring data sources are not always better: blind, blanket coverage is an approach that yields half the result for twice the effort. When dealing with high-dimensional, high-cardinality operations data, it is easy to end up with soaring storage costs while noisy, low-value data severely dilutes the information that actually matters.
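
As a back-of-envelope illustration of why high-cardinality labels make storage costs soar (all numbers below are invented for illustration, not measurements), consider how the upper bound on the number of metric time series grows with each label's distinct values:

```python
# Rough upper bound on metric series count: the product of each label's
# distinct values. Illustrative numbers only.
hosts = 500          # label: host
endpoints = 40       # label: http.route
status_classes = 5   # label: http.status_class (1xx..5xx)

bounded_series = hosts * endpoints * status_classes
print(f"bounded labels -> up to {bounded_series:,} series")    # 100,000

# Adding one unbounded label (e.g. a per-user id) multiplies everything.
user_ids = 100_000   # label: user.id -- a high-cardinality label
exploded_series = bounded_series * user_ids
print(f"plus user.id   -> up to {exploded_series:,} series")   # 10,000,000,000
```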


So what exactly are these "global signals"? The main types are listed below; a minimal sketch showing several of them emitted through one SDK follows the list.

Logs: text records of system and application activity, events, and errors, providing detailed context.

Metrics: quantitative performance measurements, such as CPU usage and request rate, that help monitor system state.

Traces (distributed tracing): track the path a request takes through a distributed system and expose performance bottlenecks.

Streaming data: data generated in real time, such as user behavior events, for real-time monitoring and analysis.

RUM (real user monitoring): records user interactions, operations, and responses inside the application to evaluate the quality of experience.

eBPF (extended Berkeley Packet Filter): collects kernel-level data for analysis and monitoring.

NPM (network performance management): monitors network bandwidth, latency, and connection status to optimize network performance.

Profiling: analyzes the runtime performance characteristics of code to help optimize the application.

Cloud: monitoring data obtained from cloud providers to track resource usage and performance.

Uptime/synthetics (dial testing): probes the system periodically from the outside to monitor availability and performance across locations and conditions.

Future technologies: signal types we do not yet know about.
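
Here is a minimal sketch of how several of the signal types above can be produced through one SDK with a shared resource description, using the OpenTelemetry Python SDK. The service name, attribute values, and console exporters are illustrative; a real deployment would export to a collector or backend.

```python
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# One resource description shared by every signal this process emits.
resource = Resource.create({"service.name": "checkout", "deployment.environment": "staging"})

# Traces: record the path and latency of a unit of work.
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")

# Metrics: quantitative measurements, exported periodically.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))
orders = metrics.get_meter("checkout").create_counter("orders_processed")

with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("payment.amount", 42.5)
    orders.add(1, {"payment.method": "card"})
```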

The "observability management platform" should be designed with an inclusive and comprehensive semaphore as its initial design goal. This means that: in the entire process of observation data collection, uploading, storage, display and correlation analysis, all types of data need to be processed correctly, so that cross-type data correlation can be performed more reasonably and effectively; in data drill-down, During the process, you can freely jump and explore between various timelines.
Of course, monitoring the known "unknowns" is a basic management requirement, and you should be able to use some kind of semaphore to achieve this. Observability is more about discussing: managing changes between "unknown" states; this requires an "observability platform" that can handle the high "complexity" of multi-level, high-dependency, multi-cloud environments, and distributed systems. "Comprehensive preparation and on-demand access to semaphores is often only a necessary condition.
There are already many operation and maintenance management platforms on the market that call themselves "observability" management platforms. But most of them start with a specific monitoring type and gradually expand to cover other signal types. Generally, only platforms that can cover more than 3 signal types are likely to have excellent practical effects; for those "observability" products that are already 3 to 5 years old, they are unlikely to achieve gorgeous results in the short term. Even if you turn around, you won't be able to rebuild your product from scratch.
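
One concrete and widely used way to make cross-type correlation possible is to stamp every log record with the trace context of the request that produced it, so a backend can join logs and traces on the trace id. Below is a minimal sketch using the OpenTelemetry Python SDK together with the standard `logging` module; the service name and log field names are illustrative.

```python
import json
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("order-service")
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("order-service")

def log_with_trace(message: str) -> None:
    """Emit a structured log line carrying the current span's trace/span ids."""
    ctx = trace.get_current_span().get_span_context()
    log.info(json.dumps({
        "message": message,
        "trace_id": format(ctx.trace_id, "032x"),  # same id the trace backend stores
        "span_id": format(ctx.span_id, "016x"),
    }))

with tracer.start_as_current_span("create-order"):
    log_with_trace("inventory reserved")   # joinable with the trace on trace_id
```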

Unified collection and upload tools

In the era when physical machines dominated, a single host (virtual or physical) often played multiple roles, and different teams would install a variety of monitoring agents in its operating system: OS metrics, logs, database, middleware, security inspection, and so on. This stacking not only consumed significant system resources but also created endless chores for server administrators; a database monitoring agent, for example, typically needs its own dedicated user account. To address this, many vendors tried to converge on as few collection agents as possible. BMC's Patrol, for instance, offered a range of collection modules (KMs) for databases, middleware, web servers, and more, which users could configure as needed without deploying multiple agents. But BMC went on to acquire many new products, such as dynamic performance baselining and automated configuration management, and from the vendor's perspective rapid product integration proved impossible, making a single collection agent hard to sustain.

Inside the customer enterprise (the "Party A" side), different departments purchase different management tools for their own needs. These differences lead to duplicated tool construction and duplicated data collection, and data is not easily shared across departments. The result is not only stacked collection agents on the same host, but also a large number of isolated operations data stores full of duplicate data. This in turn creates further problems: the same fault on the same host triggers alarm events in multiple tools, and an event storm follows. This chaos is precisely what gives AIOps tools room to survive; even though they can deliver some benefit in event convergence and compression, they clearly treat the symptoms rather than the root cause.

Fast-forward to the era of virtualization and cloud native, and the situation has not fundamentally changed; instead it brings a new dilemma of nested, matryoshka-like dependencies. We no longer run web servers, middleware, databases, and message queues inside a single pod; once they are deployed as independently, horizontally scalable sub-services (container services), the number of managed objects grows explosively. The container era has brought fresh monitoring tools, including Prometheus, Grafana, Fluentd, Graphite, cAdvisor, Loki, and the EFK stack, yet the new tools have not ended the coexistence and stacking of multiple collection agents. Having recognized the problem of deploying many similar agents, Elastic has in recent years consolidated its various Beats programs (several of them acquired projects) into a unified Elastic Agent; for now, however, that agent is largely a wrapper shell around the existing Beats programs.

Multiple collection toolsets not only create a great deal of deployment and configuration work on the endpoints; each toolset's backend also corresponds to its own independent database, and the field definitions for the same managed object differ from one database to the next. This makes correlation analysis across those databases very hard for users: the human brain has to carry the debugging context while jumping between a pile of consoles, and aligning timelines and monitored objects quickly exhausts anyone's cognitive limits.

A CMDB may be a solution, but designing and building a CMDB is no less difficult than building any monitoring system itself, so solving this problem with a CMDB is hard and costly. Data governance is another common approach: running ELT and governance across these operations databases to eventually normalize the heterogeneous information. But that is a last resort, and anyone who has implemented such a project knows the bitterness involved along the way.

Elastic's Common Schema (ECS), an early unified data model, looks like a feasible path toward standardized data definitions. We have since seen the OpenTelemetry project move to absorb ECS into its semantic conventions, and the CNCF ecosystem has promoted similar definitions for observation data, no doubt having watched observability and analysis tools flourish across its technology landscape. For now, though, these standards remain a distant promise, because most vendors and a large number of open source projects have yet to follow through and implement compatibility.
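
To make the idea of a unified data model concrete, here is a tiny sketch of normalizing records from two hypothetical tools onto ECS-style field names. The source field names on the left are invented; `host.name`, `service.name`, and `log.level` are real ECS fields.

```python
# Hypothetical source field names mapped onto ECS-style names.
FIELD_MAP = {
    "hostname": "host.name",
    "computer_name": "host.name",
    "svc": "service.name",
    "app_name": "service.name",
    "lvl": "log.level",
    "severity": "log.level",
}

def normalize(record: dict) -> dict:
    """Rename known fields to the shared schema; pass unknown fields through."""
    return {FIELD_MAP.get(key, key): value for key, value in record.items()}

# Two tools describing the same host now land on the same field names.
print(normalize({"hostname": "web-01", "svc": "checkout", "lvl": "error"}))
print(normalize({"computer_name": "web-01", "app_name": "checkout", "severity": "error"}))
```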

Observation Cloud's DataKit is a multifunctional collection agent designed to solve the problems above, and it is already compatible with a broad technology ecosystem. After any agent collects or ingests its target data, a series of details still has to be handled, otherwise "governance at the source" is impossible and "garbage in, garbage out" is unavoidable. First, when DataKit organizes and encapsulates data, all field definitions follow a data dictionary defined by Observation Cloud (comparable to Elastic ECS). Second, before the collected data is packaged and reported, it can pass through pipeline processing that handles field dropping, quality control, governance, and desensitization. Finally, DataKit can connect to both open and closed source ecosystems, for example receiving Datadog APM probe data or OpenTelemetry data, and it can forward observation data within and across networks.
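
DataKit exposes this kind of pre-upload processing as configurable pipelines; the sketch below only illustrates the general idea in plain Python and is not DataKit's actual pipeline syntax. The key names, hash truncation, and regex are illustrative assumptions.

```python
import hashlib
import re

DROP_KEYS = {"debug_payload"}                 # fields never worth shipping
SENSITIVE_KEYS = {"password", "api_token"}    # fields to pseudonymize
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub(record: dict) -> dict:
    """Drop, pseudonymize, and redact fields before the record leaves the host."""
    out = {}
    for key, value in record.items():
        if key in DROP_KEYS:
            continue                                            # field discarding
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256(str(value).encode()).hexdigest()
            out[key] = digest[:12]                              # desensitization
            continue
        if isinstance(value, str):
            value = EMAIL_RE.sub("[redacted-email]", value)     # PII in free text
        out[key] = value
    return out

print(scrub({"user": "bob@example.com logged in", "password": "hunter2", "debug_payload": "..."}))
```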

Unified storage backend

In building an observability platform, each type of signal deserves its best home:

Elasticsearch: with the blessing of Elastic's ECS it looks like a natural store-everything option, provided you can keep it cost-effective.

Time-series databases: too many to list individually; well suited to metric time-series data.

Columnar databases: real-time analytical stores, ClickHouse being the representative, that can accommodate a variety of signals.

Relational databases: why not.

From a storage perspective, giving each signal its ideal database type sounds like a win for everyone, and it does justice to the current blossoming of open source databases.

Set aside the data silo and governance issues already discussed. From a query perspective, users now have to learn multiple query languages: n flavors of SQL-like syntax lie ahead, unless you develop and maintain a one-to-many query interface yourself (a sketch follows). And we have not even touched on how to implement cross-database correlation analysis of observability data.
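
Here is a sketch of what such a hand-rolled one-to-many query interface tends to look like. The signal names and stub query functions are placeholders standing in for real clients speaking Elasticsearch query DSL, PromQL, ClickHouse SQL, and so on.

```python
from typing import Any, Callable, Dict

class QueryFacade:
    """Routes a query expression to whichever backend owns that signal type."""

    def __init__(self, backends: Dict[str, Callable[[str], Any]]):
        self.backends = backends

    def query(self, signal: str, expression: str) -> Any:
        try:
            run = self.backends[signal]
        except KeyError:
            raise ValueError(f"no backend registered for signal type {signal!r}")
        return run(expression)

# Stub query functions stand in for real client libraries.
facade = QueryFacade({
    "logs":    lambda q: f"[elasticsearch] {q}",
    "metrics": lambda q: f"[prometheus] {q}",
    "traces":  lambda q: f"[clickhouse] {q}",
})

print(facade.query("metrics", "rate(http_requests_total[5m])"))
```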
The question, then: is there a multi-modal, unified database that brings the different signal types together into a single data warehouse?

In fact, today's observability SaaS providers already offer their users such a unified, integrated data backend, at least from the perspective of querying and exploring observability data. Observation Cloud is likewise launching such a database to meet the need for unified, integrated, multi-modal storage, and its users will soon be able to use it both in the SaaS service and in privately deployed products.

Freedom to explore and combine data

The value of observability data is realized only when it is used; only by freely exploring and combining the various data types can that value be amplified. When thinking about usage scenarios for observability data, the author strongly recommends reasoning from first principles: avoid leaning on past experience, and set aside the assumption that new observability technology can simply replace every old technology. Only by thinking from scratch can we return to the conceptual origin of observability.


Summary

This article has discussed the technical points of implementing an observability platform from four angles, with some depth and historical span. The hope is that a unified, integrated observability platform can soon be implemented in your own environment; with both boots on, you can escape the old predicament of fighting fires barefoot. May the observability platform help everyone along the software delivery pipeline: use observability to complement Ops, empower SRE, and embolden Dev.

Recommended reading: "Observability Engineering"


Reason for recommendation: written by core Google SRE experts and leaders of the observability community, and carefully translated by the Observation Cloud team, a domestic unicorn in the observability field. It is a hands-on guide to implementing observability technology that addresses the pain points of operating software systems in the cloud native era, helping IT systems achieve efficient delivery, unified operations, and continuous optimization.

Purchase link https://u.jd.com/nb2cA1B

Live broadcast preview

Live broadcast theme:
Forum on New Trends in Modern Software Engineering and New Book Launch of "Observability Engineering"

Live broadcast time:
September 20 (Wednesday)
19:00 - 20:30

"Observability Engineering" was published in 2021 and has been widely praised overseas. It is a must-read book for every engineer who wants to understand observability technology. At 19:00 on the evening of September 20, 2023, the Huazhang Branch of the Machinery Industry Press will join hands with the "Observation Cloud Team", the Chinese translator of this book, to hold an online new book launch conference to explore the development of observability technology with guests in the circle. New trends and a new future.

Reserve the live broadcast

Video channel: CSDN. Set a live broadcast reservation reminder for "Lecture: New Trends in Modern Software Engineering"; the stream will also be carried simultaneously in the CSDN website's live broadcast room.


Giveaway rules

  • Follow, like, and bookmark the article

  • Leave a comment in the comment section about learning full-stack knowledge; following plus commenting enters you in the prize pool, with at most three comments per person

  • Winners are drawn at random at 8 pm on Sunday

  • 2 to 5 books will be given away this time [the more reads, the more books]
    500-1000 reads: 2 books
    1000-1500 reads: 3 books
    1500-2000 reads: 4 books
    2000+ reads: 5 books


Source: blog.csdn.net/weixin_44816664/article/details/132976671