Seven years, four stages: the evolution and practice of Didi's observability architecture

A quick overview of the highlights in one minute

At present there is no single, unified path for building observability. Each company forms its own set of practices based on its business needs, operating model, and scale. To keep up with business growth and changing requirements, the observability team must continuously optimize and upgrade the architecture, while always ensuring the high availability of the observability system itself.

This article describes in detail the technical challenges Didi encountered across four stages from 2017 to the present, such as resource bottlenecks in the standalone stage, rising operation and maintenance costs, and communication problems between distributed services. By identifying and applying the appropriate technical solutions, Didi gradually overcame these problems so that its observability architecture could keep providing strong support for the business.

About the author

Qian Wei, head of observability architecture at Didi Chuxing

Member of the TakinTalks stability community expert team and head of observability architecture at Didi Chuxing. He has worked in the observability field for many years, focusing on architecture design and optimization, and led the team through Didi's second- to fourth-generation architecture iterations. He is a contributor to several observability open source projects. His current focus is the stability of Didi's observability platform and the implementation of observability in Didi's scenarios.

Note: this article is about 7,500 words and takes roughly 12 minutes to read.

"TakinTalks Stability Community" public account backend reply "Communication" to enter the reader communication group; reply "1026" to obtain courseware information;

Background

Let's start with a story:

"At the beginning of the 20th century, Ford was in a period of rapid development. One day a motor broke down and related production work was forced to stop. Many workers and experts could not find the problem. Until a man named Steinmenz was invited After inspecting the machine, Steinmenz drew a line with chalk on the motor casing and told him to turn on the motor and reduce the coil at the mark by 16 turns. After the repairman did so, the fault was eliminated and production resumed immediately."

In daily work and development we often run into similar scenes: problems that leave you completely at a loss, yet one or two "experts" can see the cause at a glance. So we need to ask ourselves: is that a good thing or a bad thing?

As a travel platform, Didi's business covers express rides, premier cars, carpooling, shared bikes, and other services. Tens of millions of passengers and drivers interact on the platform every day, creating complex dependencies between services. In such a large-scale distributed system, troubleshooting and performance optimization are undoubtedly complex tasks.

Relying on the experience of individual experts every time is clearly uncontrollable, and the results cannot be guaranteed. We therefore prefer to support rapid business iteration and innovation by continuously evolving the observability architecture.

1. What problems does the evolution of the observability architecture solve?

1.1 Overall architecture of Didi's observability system

The overall architecture of Didi's observability system consists of several parts, as shown in the figure below.


We collect metrics from target hosts and other sources. After passing through the transmission link, some metrics may be processed by the computing module and written back, and the data is then stored. On top of the stored data, the query layer serves upper-level applications such as dashboards, data panels, alerts, and events.

Note that each module has its own responsibilities. For example, the query module is typically responsible for data routing, aggregation, and implementing the query DSL, all of which live in the query layer.

There are many ways to implement data storage. InfluxDB, RRDtool, Prometheus, Druid, ClickHouse, and others can all serve as the storage layer of an observability system.

The transmission module connects the other parts of the system, and a message queue is commonly used here. The first option that comes to mind is usually Kafka, though there are also more niche choices such as NSQ.

The computing module turns large volumes of raw metrics into the form we need, for example by dropping some dimensions before aggregation. Flink and Spark are common choices here.
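As a minimal sketch of what such a computation looks like, the snippet below drops one dimension (the illustrative "host" tag) and sums values over the remaining label combinations. It is a toy stand-in for what a Flink or Spark job would do at scale, not Didi's actual pipeline code:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Point is one reported sample with its tag set.
type Point struct {
	Tags  map[string]string
	Value float64
}

// dropAndSum removes one dimension (e.g. "host") from every point and sums
// values over the remaining tag combinations -- the typical shape of a
// pre-aggregation job in the computing module.
func dropAndSum(points []Point, drop string) map[string]float64 {
	out := map[string]float64{}
	for _, p := range points {
		keys := make([]string, 0, len(p.Tags))
		for k := range p.Tags {
			if k != drop {
				keys = append(keys, k)
			}
		}
		sort.Strings(keys)
		parts := make([]string, 0, len(keys))
		for _, k := range keys {
			parts = append(parts, k+"="+p.Tags[k])
		}
		out[strings.Join(parts, ",")] += p.Value
	}
	return out
}

func main() {
	points := []Point{
		{Tags: map[string]string{"service": "order", "host": "a-01"}, Value: 120},
		{Tags: map[string]string{"service": "order", "host": "a-02"}, Value: 80},
	}
	fmt.Println(dropAndSum(points, "host")) // map[service=order:200]
}
```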

For data collection there is a rich set of tools to choose from, such as Telegraf, Node Exporter, and the more recently launched Grafana Agent.

1.2 The four stages of the observability architecture's evolution

1.2.1 Phase 1: Before 2017

When business requirements change, performance problems in the storage module are usually exposed first. Before 2017, Didi mainly used InfluxDB for storage. We split InfluxDB instances by business service, and this design caused several problems.

First, the standalone instance has performance bottlenecks. A heavy query, one that spans a long time range or returns a large amount of data, can easily cause an out-of-memory (OOM) failure. This is also a frequently discussed issue in the community.

Second, the sharding scheme itself is a problem. Because we split by service, 50 services might need up to 50 instances; if the number of services grows to 500, operation and maintenance costs rise sharply. With microservice architectures now widely adopted, this cost becomes very high.

1.2.2 Phase 2: 2017-2018

To solve these problems, we introduced RRDTool in 2017. During this period RRDTool replaced InfluxDB as the main storage for Didi's observability system.

In the RRDTool-based design, we use consistent hashing to shard data across multiple RRDTool instances on both the read and write paths. The hashing works by flattening all of a curve's tags, sorting them, hashing the result, and assigning the curve to an instance.
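A minimal sketch of this sharding idea, assuming a simple hash ring (names such as seriesKey and pickInstance are illustrative, not Didi's actual implementation):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// seriesKey flattens and sorts a curve's tags into a stable string,
// so the same metric+tags always maps to the same key.
func seriesKey(metric string, tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	s := metric
	for _, k := range keys {
		s += "," + k + "=" + tags[k]
	}
	return s
}

// pickInstance hashes the series key onto a ring of virtual nodes and
// returns the storage instance that owns it.
func pickInstance(key string, ring []uint32, owner map[uint32]string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	v := h.Sum32()
	// first virtual node clockwise from the hash value
	i := sort.Search(len(ring), func(i int) bool { return ring[i] >= v })
	if i == len(ring) {
		i = 0
	}
	return owner[ring[i]]
}

func main() {
	// build a toy ring: 3 instances x 100 virtual nodes each
	owner := map[uint32]string{}
	var ring []uint32
	for _, inst := range []string{"rrd-0", "rrd-1", "rrd-2"} {
		for i := 0; i < 100; i++ {
			h := fnv.New32a()
			fmt.Fprintf(h, "%s#%d", inst, i)
			v := h.Sum32()
			owner[v] = inst
			ring = append(ring, v)
		}
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i] < ring[j] })

	key := seriesKey("cpu.idle", map[string]string{"host": "a-01", "service": "order"})
	fmt.Println(key, "->", pickInstance(key, ring, owner))
}
```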


Besides this, we also introduced a service called "Index". Its main job is to serve product needs: for example, the product has to show a service list, and after a user picks a service it has to show which metrics exist under that service and which tags exist under each metric. This requires an efficient index service.
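A minimal in-memory sketch of such an index, mapping service to metrics to tag keys to tag values (the structure and names are illustrative; Didi's real "Index" is a separate service, not an in-process map):

```go
package main

import (
	"fmt"
	"sort"
)

// MetricIndex is a toy index: service -> metric -> tag key -> set of tag values.
type MetricIndex struct {
	data map[string]map[string]map[string]map[string]struct{}
}

func NewMetricIndex() *MetricIndex {
	return &MetricIndex{data: map[string]map[string]map[string]map[string]struct{}{}}
}

// Observe records one reported sample's identity into the index.
func (ix *MetricIndex) Observe(service, metric string, tags map[string]string) {
	metrics, ok := ix.data[service]
	if !ok {
		metrics = map[string]map[string]map[string]struct{}{}
		ix.data[service] = metrics
	}
	tagKeys, ok := metrics[metric]
	if !ok {
		tagKeys = map[string]map[string]struct{}{}
		metrics[metric] = tagKeys
	}
	for k, v := range tags {
		if tagKeys[k] == nil {
			tagKeys[k] = map[string]struct{}{}
		}
		tagKeys[k][v] = struct{}{}
	}
}

// Metrics answers "which metrics exist under this service?".
func (ix *MetricIndex) Metrics(service string) []string {
	var out []string
	for m := range ix.data[service] {
		out = append(out, m)
	}
	sort.Strings(out)
	return out
}

func main() {
	ix := NewMetricIndex()
	ix.Observe("order", "http.latency", map[string]string{"host": "a-01", "api": "/create"})
	ix.Observe("order", "http.qps", map[string]string{"host": "a-01"})
	fmt.Println(ix.Metrics("order")) // [http.latency http.qps]
}
```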

The RRDTool-based architecture brought two major results. First, it eliminated InfluxDB's hotspot problem: instead of splitting instances by service, curves are now spread across all instances. Second, it reduced operation and maintenance costs, because sharding is handled largely automatically.

1.2.3 Phase Three: 2018-2020

After 2018 we faced new challenges. Because RRDTool stores one file per curve, the I/O demand grows with the data volume. Our IOPS exceeded 30,000, which forced us to add machines with high I/O performance; costs rose steadily and the problem only got worse. At the same time, reads and writes in observability workloads are orthogonal, so read and write optimizations conflict: writes mostly append the newest points of all curves, while reads usually fetch long time ranges of one or a few curves.

(Figure: writes along the vertical axis, reads along the horizontal axis)

So how did we solve this? Analysis showed that 80% of queries hit the last two hours of data, so we designed a hot/cold tiering strategy. Its core is keeping the recent data compressed in memory. Compression targets two things: timestamps and values. Timestamps are usually generated at a nearly fixed interval, and values tend to change gently, which is what makes the compression effective.

Based on this principle we built an internal service called "Cacheserver", which serves the last two hours of data entirely from memory. This design cut user query latency from 10 seconds to under 1 second, and reduced storage per data point from 16 bytes to 1.64 bytes.


The diagram above summarizes the design. The first part is hot/cold tiering: RRDTool and Cacheserver together cover the full storage task. Taking the right half of the figure as an example, the original timestamps 350, 360, 370, and 381 need 256 bits to store, but after compression 88 bits are enough. That is with only four timestamps; with more, the compression effect becomes even more significant.
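A minimal sketch of delta-of-delta timestamp compression in the spirit of this scheme (the encoding below only computes the deltas; the exact bit layout is illustrative and not Cacheserver's actual format):

```go
package main

import "fmt"

// encodeTimestamps compresses a series of timestamps with delta-of-delta:
// store the first timestamp in full, then for each subsequent point store how
// much the interval changed. When collection runs at a fixed interval, most
// delta-of-deltas are 0 and can be encoded in very few bits.
func encodeTimestamps(ts []int64) (first int64, dods []int64) {
	if len(ts) == 0 {
		return 0, nil
	}
	first = ts[0]
	prev := ts[0]
	prevDelta := int64(0)
	for _, t := range ts[1:] {
		delta := t - prev
		dods = append(dods, delta-prevDelta)
		prevDelta = delta
		prev = t
	}
	return first, dods
}

func main() {
	// the example from the article: 350, 360, 370, 381
	first, dods := encodeTimestamps([]int64{350, 360, 370, 381})
	fmt.Println(first, dods) // 350 [10 0 1]
}
```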

1.2.4 Phase 4: 2020-present

As the number of components users connect kept growing, query needs became more and more complex. In our setup, once RRDTool downsamples the data, the original raw data can no longer be viewed.

Faced with this, we started thinking about how to design a system that meets both current and future needs. We changed our problem-solving strategy: instead of designing a one-off solution for each new case (previously, a new query form meant coding and launching a new feature), we chose to directly leverage the industry's ecosystem.

Prometheus was very popular at the time, so our goal shifted from "introduce an ecosystem" to "introduce the Prometheus ecosystem". We chose Prometheus because, with the popularity of Kubernetes, it had become the de facto standard for monitoring, and many major vendors continue to contribute code and architecture to it.

However, introducing the Prometheus ecosystem meant we could not keep using RRDTool, which is not compatible with it. We had to find a new storage solution.

Difficulty 1: How to choose a new storage solution?

When choosing a new storage solution we mainly considered Cortex, Thanos, and VictoriaMetrics (VM for short). These projects are designed to make up for Prometheus's own shortcomings: Prometheus was positioned as standalone storage from the beginning, does not support long-term storage, and is not highly available. As a result, Cortex and Thanos were the dominant solutions in the industry at the time.

(Figure: survey of Prometheus-related solutions in the industry)

Comparing these solutions, both Cortex and Thanos effectively address Prometheus's native shortcomings. On cost, both use object storage, so they are relatively cheap. However, they both depend on many third-party services; if a company has no object storage or cloud services, the observability team may end up maintaining those components itself.

(Figure: comparison of the RRDTool and VictoriaMetrics solutions)

VM, unlike RRDTool, is fully compatible with Prometheus. We also mentioned downsampling earlier: RRDTool downsamples data older than two hours, after which the original data can no longer be viewed, whereas VM does not downsample, which gives us more possibilities. VM is also better at reducing storage costs: in our environment its storage cost was only about 1/20 of RRDTool's. As for data reporting, Prometheus is pull-based, while RRDTool only supports push with a private protocol; VM supports both pull and push and also handles the popular reporting protocols well.

Difficulty 2: How to introduce the Prometheus ecosystem?

So can we simply swap the storage for VM? Actually, no. When introducing a new ecosystem, we first have to consider the company's existing solutions. Introducing a new ecosystem does not mean overturning the existing product architecture, and it cannot be a simple drop-in replacement.


To introduce the new ecosystem, Didi made some changes. In the figure, the green part is the work needed for the native Prometheus solution: as long as a monitored object exposes a "/metrics" style endpoint, Prometheus can pull data from it. Didi's original architecture was a push-based pipeline of collection, transmission, and storage, so we added a Prometheus-compatible Adapter in the collection layer. On top of the original system, new services that expose Prometheus endpoints can also be scraped by our own collectors.
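A minimal sketch of such an adapter, scraping a /metrics endpoint and handing samples to a push pipeline. The parsing below only handles simple "name value" lines and pushToPipeline is a placeholder; a real adapter would use the official Prometheus text-format parser and keep labels, and this is not Didi's actual Adapter:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// sample is one parsed data point from a /metrics endpoint.
type sample struct {
	name  string
	value float64
	ts    int64
}

// scrapeOnce pulls a Prometheus-style /metrics endpoint and parses simple
// "name value" lines; it only shows the pull-then-push shape.
func scrapeOnce(target string) ([]sample, error) {
	resp, err := http.Get(target)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out []sample
	now := time.Now().Unix()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		out = append(out, sample{name: fields[0], value: v, ts: now})
	}
	return out, sc.Err()
}

// pushToPipeline stands in for the original push-based transmission link.
func pushToPipeline(samples []sample) {
	for _, s := range samples {
		fmt.Printf("push %s=%g @%d\n", s.name, s.value, s.ts)
	}
}

func main() {
	samples, err := scrapeOnce("http://localhost:9100/metrics")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	pushToPipeline(samples)
}
```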

As for the results of this ecosystem introduction, we now support Prometheus data collection and the two common PromQL scenarios of chart viewing and alerting. We also added new chart features, such as TopK/BottomK and an outlier-detection capability: if a service has many instances, TopK/BottomK-style functions help find the outliers.

In terms of giving back, we have submitted PRs to VictoriaMetrics and the Prometheus community as contributions to the ecosystem.

2. How do we ensure the stability of the observability system itself?

As we all know, the purpose of an observability system is to keep the business stable. So how do we keep the observability system itself stable? First we need to figure out how to monitor it: configure policies on the system itself? Build some dashboards? Take some other approach? Below I share some of our experiments and reflections.

2.1 How do we observe the observability system?

An observability system must not observe itself. For example, if the storage fails and the only way to query data is from that same storage, a circular dependency is formed. So the first principle is that the observability system cannot be allowed to observe itself. The second principle follows from the first: a separate set of collection and alerting services is needed to observe it.

In our practice, two main methods are used.


The first method monitors traffic and is suitable for collection, transmission, and storage. It exposes the pipeline's traffic via an Exporter and watches it with a separate Prometheus and Alertmanager. For example, if storage write traffic changes suddenly, this setup will catch it.

The other method monitors capabilities themselves. Take alerting as an example: the simplest trick is to configure an alert whose threshold always triggers, but which does not actually page anyone by phone or SMS. If that stream of alert events stops, either the alerting system itself has a problem or the storage it queries has a problem. On top of this, we set up probes and run end-to-end checks.
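A minimal sketch of such an end-to-end probe, assuming hypothetical write and query HTTP endpoints (writeURL, queryURL, and the line protocol are placeholders, not Didi's real interfaces): it writes a known heartbeat point and then verifies it can be queried back.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// probeOnce writes a heartbeat point into the pipeline and then queries it
// back, checking the whole write -> store -> query path end to end.
func probeOnce(writeURL, queryURL string) error {
	// 1. write a test sample (the protocol here is a placeholder)
	point := fmt.Sprintf("observability.heartbeat value=1 %d", time.Now().UnixNano())
	resp, err := http.Post(writeURL, "text/plain", strings.NewReader(point))
	if err != nil {
		return fmt.Errorf("write failed: %w", err)
	}
	resp.Body.Close()

	// 2. give the pipeline a moment, then query the point back
	time.Sleep(10 * time.Second)
	resp, err = http.Get(queryURL + "?metric=observability.heartbeat&last=1m")
	if err != nil {
		return fmt.Errorf("query failed: %w", err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	if !strings.Contains(string(body), "observability.heartbeat") {
		return fmt.Errorf("heartbeat not found in query result")
	}
	return nil
}

func main() {
	// a watchdog outside the main system runs this on a schedule and pages
	// on consecutive failures
	if err := probeOnce("http://probe-write.example/insert", "http://probe-query.example/query"); err != nil {
		fmt.Println("end-to-end probe failed:", err)
	}
}
```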

2.2 How do we keep the observability architecture stable?

We can consider it from two aspects: one is through architectural optimization, and the other is using common protection methods.

2.2.1 Architecture optimization

Point 1: Don’t put all your eggs in one basket

For architectural optimization, a simple principle is not to put all your eggs in one basket. We can achieve this through the following design.

(Figure: VictoriaMetrics multi-cluster storage design)

Didi's core business is ride-hailing. The observability data of ride-hailing and non-ride-hailing businesses are stored on different storage clusters; this is the VM multi-cluster design we adopted. If, say, a non-ride-hailing instance has a problem, we do not want it to affect the ride-hailing business, and vice versa. Hence the multi-cluster storage design.

(Figure: transmission multi-cluster design)

The design philosophy for data transmission is similar, with one difference: transmission and storage use different sharding strategies, because their load characteristics differ. For example, a business may have a very large transmission volume but a very small storage query volume. In that case we split its data on the transmission side, while on the storage side we only need to make sure the data gets written, so it can share a storage cluster with other businesses.

Point 2: Throw away bad eggs promptly

There is another principle we call "throw away bad eggs in time". In the transmission module, besides writing to storage, there are other downstream modules, such as streaming alerts.


We do not want a subsystem that slows down for some reason to drag down the whole transmission module. When a subsystem slows down or fails, it should be removed from the path in time; this is the circuit-breaker strategy. In some cases we break the circuit automatically and keep trying to restore the subsystem; once it recovers, we reconnect it.
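A minimal sketch of this circuit-breaker idea for one downstream subsystem (the thresholds, backoff, and downstream interface are illustrative, not Didi's actual transmission code):

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// Breaker wraps one downstream (e.g. a streaming-alert consumer) and opens
// after too many consecutive failures, so a slow subsystem cannot block the
// whole transmission module. After a backoff it lets calls through again;
// renewed failures reopen it, and a success closes it.
type Breaker struct {
	mu           sync.Mutex
	failures     int
	maxFailures  int
	openUntil    time.Time
	retryBackoff time.Duration
}

func NewBreaker(maxFailures int, retryBackoff time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, retryBackoff: retryBackoff}
}

var ErrOpen = errors.New("downstream circuit open, skipping send")

func (b *Breaker) Do(send func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return ErrOpen // still open: skip this downstream, keep others flowing
	}
	b.mu.Unlock()

	err := send()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			// open the circuit and wait for the backoff before retrying
			b.openUntil = time.Now().Add(b.retryBackoff)
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // success keeps the circuit closed
	return nil
}

func main() {
	br := NewBreaker(3, 30*time.Second)
	_ = br.Do(func() error {
		// forward one batch to the streaming-alert subsystem here
		return nil
	})
}
```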

2.2.2 Other common protection methods

Circuit breaking, degradation, and multi-dimensional rate limiting:

Besides circuit breaking and degradation, we have other protection measures such as multi-dimensional rate limiting, which applies flexible strategies to limit requests. For example, for continuous, high-frequency queries that span very long ranges, such as months or even years of data, we apply multi-dimensional rate limiting.
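A minimal sketch of multi-dimensional rate limiting; the two dimensions here (per-caller frequency and query time span) and the thresholds are illustrative assumptions, not Didi's actual policy:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// QueryLimiter throttles queries along more than one dimension: a per-caller
// request budget per minute, plus a hard cap on the time span a single query
// may cover.
type QueryLimiter struct {
	mu          sync.Mutex
	perCaller   map[string]int
	windowStart time.Time

	maxPerMinute int
	maxSpan      time.Duration
}

func NewQueryLimiter(maxPerMinute int, maxSpan time.Duration) *QueryLimiter {
	return &QueryLimiter{
		perCaller:    map[string]int{},
		windowStart:  time.Now(),
		maxPerMinute: maxPerMinute,
		maxSpan:      maxSpan,
	}
}

// Allow decides whether a query from `caller` covering `span` may proceed.
func (l *QueryLimiter) Allow(caller string, span time.Duration) (bool, string) {
	l.mu.Lock()
	defer l.mu.Unlock()

	// dimension 1: the time span of the query
	if span > l.maxSpan {
		return false, "query span too large"
	}
	// dimension 2: per-caller frequency, simple fixed one-minute window
	if time.Since(l.windowStart) > time.Minute {
		l.perCaller = map[string]int{}
		l.windowStart = time.Now()
	}
	l.perCaller[caller]++
	if l.perCaller[caller] > l.maxPerMinute {
		return false, "caller over per-minute budget"
	}
	return true, ""
}

func main() {
	lim := NewQueryLimiter(100, 30*24*time.Hour)
	ok, reason := lim.Allow("dashboard-42", 365*24*time.Hour)
	fmt.Println(ok, reason) // false "query span too large"
}
```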

Slow query management:

Another safeguard is slow query management, targeting queries that touch a huge number of curves. For example, if a query involves millions of curves, we need a way to discover such slow queries and then manage them. During key protection periods we enable these strategies; once an anomaly is identified, we apply multi-dimensional rate limiting and, depending on its characteristics, throttle or directly block it.

Multi-active deployment:

For multi-active deployment of the observability system, our approach is unitization. For example, if the dedicated lines between data center A and data center B are cut, users must still be able to independently access the data in the corresponding data center.

Capacity evaluation system:

We also have a capacity evaluation system. The growth of observability traffic is not necessarily proportional to business traffic or order volume, so we need our own capacity evaluation. Every company's business model is different, so this system has to be built in-house; it is a useful protection measure.

Contingency plans and drills:

We also prepare contingency plans and run drills to make sure these measures actually work.

3. How is observability achieved in Didi?

3.1 Strategy selection

Observability was a very hot topic in 2021 and 2022; it almost felt backward not to talk about it. Let's first look at how the major vendors define observability.

Observability is a tool or technical solution that helps teams effectively debug their systems. Observability is based on the exploration of properties and patterns that are not defined in advance. (Source: Google)

Observability is the ability to monitor, measure, and understand the state of a system or application by examining its output, logs, and performance metrics. (Source: Red Hat)

Observability is how well you understand the internal state or conditions of a complex system based solely on what you know about the external output. (Source: IBM)

These are the definitions from Google, Red Hat, and IBM, and they share two points of consensus. First, observability is the ability to understand a system's internal state from the outside, without needing to know those states in advance. Second, there are many means of observability, including logs, metrics, events, and so on.

So how do you achieve observability? Each vendor has its own answer: Google recommends its cloud platform GCP, Red Hat recommends OpenShift Observability, IBM has its own product Instana Observability, and Grafana recommends the LGTM stack (Loki, Grafana, Tempo, Mimir).

Taken together, there are roughly three ways to achieve observability: first, buy a SaaS vendor's service; second, collect and store observability data in as much detail as possible; third, correlate multiple kinds of observability data.

3.2 Scheme comparison

For Didi, the first implementation method is not suitable, so we exclude it first.

The second approach is "as detailed as possible", so we look at observability data along two dimensions: Dimensionality and Cardinality. Dimensionality is a concept similar to tags, such as timestamp, version, customer ID, and so on. Cardinality is the number of distinct values: taking customer ID as an example, the values might range from 10,000 to 19,999. The advantage of this approach is that a large amount of data is collected; the disadvantages are high implementation cost, heavy resource consumption, and low data utilization.

The third approach is to correlate multiple kinds of observability data. The common kinds are metrics, traces, and logs. Metric data is a high-level abstraction: it can tell you how many errors occurred, but not what the errors were. Trace data is mainly used for cross-service correlation, such as which services a request passed through. Log data is the developer's favorite: the most detailed, human-readable information. The disadvantage of correlating multiple kinds of data is that the architecture is relatively complex to implement.

3.3 Architecture design

At Didi we borrowed from the latter two approaches and split data into two categories: low cardinality and high cardinality. Low-cardinality data is the metric data, and high-cardinality data is the log data. We store the two kinds of data in different databases and establish a correlation between them.

For example, if two error logs are collected within a time window, we report the error count "2" to the time-series database; at the same time we sample one of the error logs and store it in the Exemplar DB. The time-series database and the Exemplar DB are then associated through tags.
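A minimal sketch of this metric-plus-exemplar idea; the two "databases" are just in-memory stand-ins and the label scheme is illustrative, not Didi's actual storage format:

```go
package main

import (
	"fmt"
	"time"
)

// Labels identify a curve; the same label set is written to both stores so a
// chart can later drill down from a count to a sampled raw log.
type Labels struct {
	Service string
	Level   string
}

// ErrorWindow accumulates one reporting window of error logs.
type ErrorWindow struct {
	labels   Labels
	count    int    // goes to the time-series DB (low cardinality)
	exemplar string // one sampled raw log, goes to the Exemplar DB (high cardinality)
	traceID  string
}

// Observe records one error log line, keeping only the first line of the
// window as the sampled exemplar.
func (w *ErrorWindow) Observe(rawLog, traceID string) {
	w.count++
	if w.exemplar == "" {
		w.exemplar = rawLog
		w.traceID = traceID
	}
}

// Flush emulates writing the count to the TSDB and the sampled log to the
// Exemplar DB, associated by the same labels and timestamp.
func (w *ErrorWindow) Flush(ts time.Time) {
	fmt.Printf("TSDB       <- %+v error_count=%d @%s\n", w.labels, w.count, ts.Format(time.RFC3339))
	fmt.Printf("ExemplarDB <- %+v trace=%s log=%q @%s\n", w.labels, w.traceID, w.exemplar, ts.Format(time.RFC3339))
}

func main() {
	w := &ErrorWindow{labels: Labels{Service: "order", Level: "error"}}
	w.Observe("create order failed: timeout calling pricing", "trace-9f3a")
	w.Observe("create order failed: timeout calling pricing", "trace-1c77")
	w.Flush(time.Now()) // count=2 to the TSDB, one sampled log to the Exemplar DB
}
```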

3.4 Practical results

Didi's observability practice has produced very significant results. Before observability was in place, troubleshooting meant logging into machines and searching the logs. If you happened to land on the machine with the problem, you were lucky; if the problem was on another machine, or even another service, you had to repeat the whole process, and even then you might not find the cause.


After observability was established, when we receive an alert we can directly view the raw log text associated with it. If after reading the log we think it is not a big deal, we can leave it for now; if it is urgent, we start the emergency response process.

In addition, when looking at a chart, if a metric suddenly rises and we want to know why, we can use the drill-down function. It lets us view the raw log text, and if the log contains trace information, it extracts that too; from there we can drill down into the dedicated trace products for further analysis.

4. Summary and outlook

The development of Didi's observability architecture has always been about choosing the most appropriate solution for the needs, scenarios, and era at hand.

We have connected to mature ecosystems in the industry and integrated them into our own system, which has helped us get a great deal of work done and improved our efficiency. At the same time, while building the observability platform, we adopted strategies to keep the observability system itself stable.

It is worth noting that there is no single way to implement observability; each company has its own characteristics. Every company therefore needs to tailor a solution to its own situation and keep selecting and adjusting the most appropriate approach as circumstances change. (Full text ends)

Q&A

1. Does Didi have a dedicated technical team maintaining the observability architecture? Prometheus's horizontal scalability is relatively limited; what specific problems did you run into with InfluxDB?

2. How to measure the observability of an architecture? Any suggestions?

3. Is second-level timeliness necessary for metrics?

4. An interface times out occasionally; the call chain only shows the name of the timed-out interface, not the internal methods, so the root cause cannot be located and the problem is hard to reproduce. What should I do?

For the answers to the above questions, please click "Read Full Text" to watch the full version of the answers!

Statement: This article was originally written by the public account "TakinTalks Stability Community" and community experts. To reprint, please reply "Reprint" in the backend to obtain authorization.
