Link Tracing Made Easy: A Guide to Tracing Costs

Broadly speaking, the cost of link tracing includes not only the extra resource overhead of generating, collecting, computing, storing, and querying trace data, but also the human operation and maintenance cost of onboarding, changing, maintaining, and collaborating on the tracing system. For ease of understanding, this section focuses on the machine (resource) cost of link tracing in the narrow sense; the human cost is covered in the next section (efficiency).

How link tracing machine costs break down

Link tracing machine costs fall into two main categories: client and server. The tracing client (SDK/Agent) usually runs inside the business process and shares CPU, memory, network, and disk with the business program. Client overhead mainly comes from instrumentation and interception, trace data generation and context propagation, lightweight processing such as pre-aggregation or compression/encoding, and local buffering and reporting. Client overhead is a hidden cost: it does not directly increase the resource bill in the short term, because it consumes the idle headroom of the business process. However, as the business keeps growing or enters a peak period (such as a big promotion), the resources consumed by tracing will eventually show up on the bill. This overhead should therefore be kept within a reasonable bound, for example no more than 10%, otherwise it will noticeably affect normal business operation.
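
As a rough illustration of keeping client overhead bounded, the sketch below (plain Go, with a hypothetical Span type and queue, not any specific SDK's API) caps the in-process span buffer and drops data instead of blocking the business thread when the buffer is full, so tracing never competes unboundedly with the business workload.

```go
package main

import "fmt"

// Span is a hypothetical, minimal span record; real SDKs carry far more fields.
type Span struct {
	TraceID string
	Name    string
}

// BoundedQueue drops new spans when full, so tracing never blocks the caller
// or balloons memory inside the business process.
type BoundedQueue struct {
	ch      chan Span
	dropped int
}

func NewBoundedQueue(capacity int) *BoundedQueue {
	return &BoundedQueue{ch: make(chan Span, capacity)}
}

// Enqueue returns false (and counts a drop) instead of blocking the caller.
func (q *BoundedQueue) Enqueue(s Span) bool {
	select {
	case q.ch <- s:
		return true
	default:
		q.dropped++
		return false
	}
}

func main() {
	q := NewBoundedQueue(2)
	for i := 0; i < 5; i++ {
		q.Enqueue(Span{TraceID: fmt.Sprint(i), Name: "demo"})
	}
	fmt.Println("dropped:", q.dropped) // 3 spans dropped rather than slowing the app
}
```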

The server-side machine cost of link tracing is an explicit, immediate resource cost, and an important consideration when evaluating tracing options (self-built or hosted). The server side usually consists of a gateway, a message buffer, stream computing, storage, and a query layer. The best-known component is trace storage, and the tail-based sampling discussed in many popular articles mainly reduces the storage cost of the tracing server. However, with short retention periods (such as 3 to 7 days), the resource cost of receiving, buffering, and processing trace data can even exceed the storage cost, so it is an important component that cannot be ignored.

Another cost that is easily overlooked is the network transmission fee between the client and the server. Especially when data crosses the public internet, bandwidth is expensive and transmission is often throttled, so head-based sampling is frequently enabled to reduce the volume of reported trace data.

In recent years, mainstream open-source communities and commercial products have launched edge-cluster solutions: a unified collection and processing cluster for observability data is deployed inside the user's network (VPC), providing features such as normalization of multi-source heterogeneous data, lossless trace statistics, and full collection of error/slow traces. This further reduces the cost of trace reporting and persistent storage. The overall architecture is shown in the figure below.

Optimizing link tracing machine cost

Having clarified how link tracing machine costs break down, we can analyze how to optimize them in a targeted way. Start with three questions: Is every piece of trace data equally valuable? Is it better to process trace data as early as possible, or as late as possible? Should trace data be stored for as long as possible, or as short as possible?

To answer these questions, we first need to understand how trace data is used and where its value lies. There are three main uses. First, filtering and querying the detailed path of a single call chain by specific conditions, to diagnose and locate concrete problems. Second, pre-aggregating traces along fixed dimensions to provide general-purpose monitoring and alerting at, for example, the service-interface granularity. Third, running custom aggregations over trace details to meet personalized analysis needs, such as the distribution of slow (over 3 s) request interfaces for VIP customers.

It follows that trace data with different characteristics is not equally valuable. Error or slow traces, or traces with specific business characteristics (collectively, key traces), are usually more valuable than ordinary traces and more likely to be queried. We should record more key traces and keep them longer, and record fewer ordinary traces with shorter retention. In addition, to keep statistics accurate, pre-aggregation of trace data should be completed as early as possible, so that sampling can happen earlier and the overall cost of reporting and storing detailed data is reduced.

To sum up, we can explore directions such as skewed trace sampling, shifting trace computation left, and hot/cold storage separation, trying to record the most valuable trace data on demand at the lowest cost, and so reach a dynamic balance between cost and experience, as shown in the figure below.

Skewed trace sampling: record more valuable data

The value of trace data is unevenly distributed. According to incomplete statistics, the actual query rate of call chains is usually below one in a million: for every million call chains, only one is ever hit by a real query, and almost all the rest sit in storage unused. Storing every call chain in full not only wastes a great deal of money but also significantly affects the performance and stability of the whole data pipeline.

This is why trace sampling (Trace Sampling) was introduced. As early as the Google Dapper paper, fixed-ratio sampling was proposed and proven in Google's internal production systems: for example, one out of every 1024 traces is sampled and recorded, a ratio of 1/1024. However, fixed-ratio sampling only keeps tracing overhead under control; it ignores the uneven distribution of trace value. Key traces and ordinary traces have the same probability of being sampled, so many queries come back empty, which greatly hurts troubleshooting efficiency.
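
A fixed-ratio, head-based sampler can be sketched as below (plain Go; the FNV hash and the 1/1024 rate are illustrative choices, not Dapper's actual implementation). The decision is derived from the trace ID alone, so every service on the call chain makes the same keep/drop decision without coordination.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sampleOneIn keeps roughly 1 out of every n traces, deciding purely from
// the trace ID so that all spans of the same trace share the decision.
func sampleOneIn(traceID string, n uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%n == 0
}

func main() {
	kept, total := 0, 100000
	for i := 0; i < total; i++ {
		if sampleOneIn(fmt.Sprintf("trace-%d", i), 1024) {
			kept++
		}
	}
	fmt.Printf("kept %d of %d traces (~1/1024)\n", kept, total)
}
```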

So can we predict user behavior in advance and record only the traces that will be queried? Predicting with 100% accuracy is essentially impossible. But based on historical behavior and domain experience, preferentially recording the traces most likely to be queried is quite feasible, and that is skewed sampling.

Skewed sampling usually sets a higher sampling ratio (such as 100%) or a traffic threshold (such as the first N traces per minute) for specific trace characteristics (errors, slow calls, core interfaces, or custom business features), while traces that do not match any key characteristic are sampled at a very low rate (such as 1%) or not at all. On the custom sampling configuration page of Alibaba Cloud's tracing product shown in the figure below, users can freely define feature-based sampling strategies. While keeping the query hit rate high (for example 50%+), the volume of trace data actually stored can drop to roughly 5% of the original, which greatly reduces server-side persistent storage cost. More sampling strategies, such as dynamic sampling, are covered in detail in the hands-on chapters.
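
The decision logic of such a skewed (tail-based) sampler can be sketched roughly as follows, in plain Go with a hypothetical rule set and trace summary type (not Alibaba Cloud's actual configuration model): error, slow, or custom-feature traces are always kept, and everything else falls back to a low base rate.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// TraceSummary is a hypothetical roll-up of a finished trace used for the
// sampling decision (tail-based, i.e. made after the trace has completed).
type TraceSummary struct {
	TraceID   string
	HasError  bool
	Duration  time.Duration
	VIPTenant bool // example of a custom business feature
}

// keep returns true for key traces (error / slow / business feature) and
// samples the remaining ordinary traces at baseRatePercent.
func keep(t TraceSummary, slowThreshold time.Duration, baseRatePercent uint32) bool {
	if t.HasError || t.Duration > slowThreshold || t.VIPTenant {
		return true // key traces: sample at 100%
	}
	h := fnv.New32a()
	h.Write([]byte(t.TraceID))
	return h.Sum32()%100 < baseRatePercent // ordinary traces: e.g. 1%
}

func main() {
	traces := []TraceSummary{
		{TraceID: "a", HasError: true, Duration: 80 * time.Millisecond},
		{TraceID: "b", Duration: 5 * time.Second},
		{TraceID: "c", Duration: 40 * time.Millisecond},
	}
	for _, t := range traces {
		fmt.Println(t.TraceID, "kept:", keep(t, 3*time.Second, 1))
	}
}
```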

Extending the idea: when diagnosing problems, besides the call chain we usually need to combine related information such as logs, exception stacks, time-consuming local methods, and memory snapshots to reach a conclusion. Recording all associated information for every request would very likely overwhelm the system. Therefore, following the same skewed-sampling principle of discarding useless or low-value data and keeping high-value data captured at abnormal sites or matching specific conditions, fine-grained on-demand storage should be one of the key criteria for judging tracing and observability products. As shown in the figure below, Alibaba Cloud's ARMS product can automatically preserve the complete local method stack for slow calls, enabling line-level code localization of slow requests.

Shift trace computation left to extract data value

Besides filtering for and recording the more valuable data, data processing can also be shifted left from the server to the client or edge cluster, extracting data value in advance through, for example, pre-aggregation or compression/encoding. This effectively saves transmission and storage costs while still satisfying users' query requirements:

  • Pre-aggregation statistics: the biggest advantage of aggregating on the client is that the reported data volume drops dramatically without losing accuracy. For example, even after sampling only 1% of call chains, accurate service overview and upstream/downstream monitoring and alerting can still be provided (a minimal sketch follows this list).
  • Data compression: compressing and encoding long, frequently repeated text (such as exception stacks and SQL statements) also effectively reduces network overhead, and works even better combined with masking of non-key fields.
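
To make the left-shift concrete, here is a minimal sketch of client-side pre-aggregation in plain Go (the InterfaceStats fields and flush cycle are assumptions for illustration): request counts, error counts, and total latency are accumulated per interface before any span is sampled away, so monitoring stays accurate even at a 1% trace sampling rate.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// InterfaceStats is a pre-aggregated roll-up reported instead of raw spans.
type InterfaceStats struct {
	Count    int64
	Errors   int64
	TotalDur time.Duration
}

// Aggregator accumulates per-interface statistics inside the client before
// the (sampled) detailed spans are exported.
type Aggregator struct {
	mu    sync.Mutex
	stats map[string]*InterfaceStats
}

func NewAggregator() *Aggregator {
	return &Aggregator{stats: make(map[string]*InterfaceStats)}
}

func (a *Aggregator) Record(iface string, dur time.Duration, isErr bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	s, ok := a.stats[iface]
	if !ok {
		s = &InterfaceStats{}
		a.stats[iface] = s
	}
	s.Count++
	s.TotalDur += dur
	if isErr {
		s.Errors++
	}
}

// Flush would normally ship the roll-up to the server once per interval.
func (a *Aggregator) Flush() map[string]*InterfaceStats {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := a.stats
	a.stats = make(map[string]*InterfaceStats)
	return out
}

func main() {
	agg := NewAggregator()
	agg.Record("/api/order", 120*time.Millisecond, false)
	agg.Record("/api/order", 4*time.Second, true)
	for iface, s := range agg.Flush() {
		fmt.Printf("%s count=%d errors=%d avg=%v\n",
			iface, s.Count, s.Errors, s.TotalDur/time.Duration(s.Count))
	}
}
```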

Hot/cold storage separation: meet personalized analysis needs at low cost

Trace sampling and shifting computation left both aim to reduce the reporting and storage of detailed trace data and thereby cut cost. These two approaches handle single-trace queries and pre-aggregated monitoring/alerting well in common scenarios, but they cannot satisfy diverse post-aggregation analysis needs. For example, a business may need the distribution of interfaces and sources whose requests take more than 3 seconds; such personalized post-aggregation rules cannot be enumerated in advance. When analysis rules cannot be predefined, are we forced back to the extremely expensive option of storing all raw data, with no room for optimization? There is an answer: hot/cold storage separation, a low-cost solution to post-aggregation analysis, introduced next.

Hot/cold storage separation rests on the observation that query behavior exhibits temporal locality: the more recent the data, the higher the probability it will be queried; the older the data, the lower. For example, because problem diagnosis is time-sensitive, more than 50% of trace queries and analyses happen within 30 minutes, and queries after 7 days usually focus on error and slow call chains. With this basis established, let's look at how to implement the separation.

First, hot data is time-sensitive: if only the most recent window needs to be kept, the storage requirement drops sharply. Moreover, in a public cloud environment each user's data is naturally isolated, so computing and storing hot data inside the user's VPC is more cost-effective.

Second, cold-data queries are targeted, so different sampling strategies (error/slow sampling, sampling of specific business scenarios, and so on) can filter out the cold data that actually matters for diagnosis and persist only that. Because cold data is retained for a long time and demands high stability, it can be managed centrally in a shared data center.
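
A minimal routing sketch under these assumptions (hypothetical Trace and Store types, illustrative TTLs of 30 minutes and 30 days): every trace lands in a short-lived hot store inside the user VPC, while only key traces (error or slow) are forwarded to the long-retention cold store.

```go
package main

import (
	"fmt"
	"time"
)

// Trace is a hypothetical finished-trace record.
type Trace struct {
	TraceID  string
	HasError bool
	Duration time.Duration
}

// Store is a minimal write interface; real hot/cold backends differ widely.
type Store interface {
	Write(t Trace, ttl time.Duration)
}

type printStore struct{ name string }

func (s printStore) Write(t Trace, ttl time.Duration) {
	fmt.Printf("%s <- %s (ttl %v)\n", s.name, t.TraceID, ttl)
}

// route writes everything to the hot store with a short TTL, and only key
// traces (error / slow) to the cold store with a long TTL.
func route(t Trace, hot, cold Store, slowThreshold time.Duration) {
	hot.Write(t, 30*time.Minute) // full data, short retention, cheap
	if t.HasError || t.Duration > slowThreshold {
		cold.Write(t, 30*24*time.Hour) // sampled key data, long retention
	}
}

func main() {
	hot, cold := printStore{name: "hot"}, printStore{name: "cold"}
	route(Trace{TraceID: "t1", Duration: 50 * time.Millisecond}, hot, cold, 3*time.Second)
	route(Trace{TraceID: "t2", HasError: true, Duration: 5 * time.Second}, hot, cold, 3*time.Second)
}
```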

In summary, hot data has a short retention period and low cost yet supports real-time, full-volume post-aggregation analysis, while cold data, after precise sampling, shrinks to roughly 1% to 10% of the original volume and still covers the diagnostic needs of most scenarios. Combining the two strikes the best balance between cost and experience. Leading APM products at home and abroad, such as ARMS, Datadog, and Lightstep, all adopt hot/cold-separated storage.

The concrete implementation of hot/cold separation, and a comparison of different storage types, will be covered in detail in the hands-on chapters. The figure below shows the 30-minute full-data analysis of hot data provided by Alibaba Cloud's tracing product.

Summary

With the rise of cloud-native and microservice architectures, the volume of observability data (traces, logs, and metrics) has grown explosively, and more and more companies are paying attention to managing observability costs; FinOps has become a popular new collaborative paradigm. The traditional approach of reporting, storing, and then analyzing all raw data will face growing challenges. Recording more valuable data through skewed sampling, extracting data value by shifting computation left, and exploring data value more cost-effectively through hot/cold storage separation will gradually become the mainstream approach in the cloud-native era. Let's explore and practice together!
