SLS: Thoughts and practice of OTel-based mobile full-link Trace construction

Author: Gao Yulong (Yuanbo)

First, let's look at the background of full-link Trace on the mobile side:

From the mobile perspective, an App evolves from an initial concept to a mature, stable product. Along the way, the number of R&D engineers involved, the lines of code in the project, the scale of the architecture, the release frequency, and the time it takes to fix online business problems all change considerably. These changes make troubleshooting significantly harder, and business problems often become difficult to reproduce and locate. For example, in the early stage of a product, the project is usually small and the business flow is simple, so online problems can often be located quickly. Once the project grows, a business flow typically spans many more modules, and some online problems become much harder to reproduce and locate.

This article is based on the author's talk at the 2022 D2 terminal technology conference, and hopefully it offers some ideas and inspiration.

Why are device-side problems hard to reproduce and locate?

Why are online business problems hard to reproduce and troubleshoot? Our analysis points to four main reasons:

  • Log collection is not unified between the mobile side and the server side, and there is no common standard governing how data is collected and processed.
  • The device side usually involves many modules built on different R&D frameworks. Code bases are isolated from each other, devices are heavily fragmented, and network environments are complex, all of which make on-device data collection difficult.
  • From the device's point of view, it is hard to obtain data across different frameworks and systems when analyzing a problem, and the data lacks contextual information, which makes correlation analysis difficult.
  • A business link usually spans many business domains. Reproducing and troubleshooting a problem from the device side often requires engineers from each of those domains to participate, so the human operations cost is high.

How do we solve these problems? Our approach has four steps:

  • Establish a unified standard and use a standard protocol to govern data collection and processing.
  • Unify data collection capabilities across different platforms and frameworks.
  • Automatically associate and process the context of data generated by multiple systems and modules.
  • Explore automated analysis based on machine learning, where we have also made some initial progress.

Unified Data Collection Standards

How do we unify the standard? There are various solutions in the industry today, but their problems are also obvious:

  • Protocols and data types are not unified across different solutions;
  • Different solutions are also hard to make compatible and interoperable with each other.

For the standard, we chose OTel (short for OpenTelemetry), for two main reasons:

  • OTel is led by the Cloud Native Computing Foundation (CNCF), is the merger of OpenTracing and OpenCensus, and is currently the de facto standard protocol in the observability field;
  • OTel unifies the data model across different languages and is compatible with both OpenTracing and OpenCensus. It also provides a vendor-neutral Collector for receiving, processing, and exporting observable data.

In our solution, the data collection specification on every platform is based on OTel, while data storage, processing, and analysis are built on the LogHub capability provided by SLS.

Difficulties in device-side data collection

Unifying the data protocol alone is not enough; we also have to solve several problems in device-side data collection. In general, device-side collection currently faces three major difficulties:

  • Linking data together is difficult
  • Guaranteeing performance is difficult
  • Guaranteeing that no data is lost is difficult

The device-side development process usually involves many frameworks and modules, and the business logic is complex, with many asynchronous calls through APIs such as threads and coroutines. How do we automatically link data together during collection? Mobile devices are heavily fragmented, system versions are widely scattered, and there are many device models. How do we guarantee consistent collection performance across all of them? App usage scenarios are also highly unpredictable. How do we guarantee that the collected data is not lost?

Difficulties in linking device-side data

Let's first analyze the main problems in automatically linking device-side data together.

  1. During device-side collection, we collect not only business link data but also various performance and stability monitoring data, so there are many observable data sources;
  2. When third-party frameworks such as OkHttp or Fresco are used, their key data may also need to be collected to analyze network requests, image loading, and similar issues. Business developers usually pay little attention to the internals of such third-party frameworks, so troubleshooting problems in them tends to be difficult;
  3. In addition, device-side code runs almost entirely asynchronously, through many asynchronous APIs such as threads and coroutines, which makes linking the trace together challenging.

This leads to a few common questions:

  • How do we collect data from third-party frameworks, and how do we link it into the trace?
  • How do we connect different observable data sources?
  • How do we automatically link data spread across different threads and coroutines?

Solution for automatically linking device-side data

Let's first look at how device-side data is linked together automatically.

In the OTel protocol, the linking relationship between different pieces of data is constrained by the trace protocol. OTel defines the fields that every piece of data in a trace must contain, and we need to keep the data within the same link consistent. For example, within the same trace, the trace_id must be identical; and if two pieces of data have a parent-child relationship, the parent_id of the child must equal the span_id of its parent.
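
As a minimal illustration, the consistency rules can be expressed with a small Kotlin sketch (the field names follow the OTel span model; the class itself is illustrative, not the SDK's actual data type):

```kotlin
// Illustrative only: a minimal span model carrying the fields the OTel trace
// protocol relies on for linking data together.
data class SpanData(
    val traceId: String,   // identical for every span in the same trace
    val spanId: String,    // unique within the trace
    val parentId: String?, // null for the root span, otherwise the parent's spanId
    val name: String,
    val startUnixNano: Long,
    val endUnixNano: Long
)

// The two consistency rules described above, for a child/parent pair.
fun isConsistent(child: SpanData, parent: SpanData): Boolean =
    child.traceId == parent.traceId && child.parentId == parent.spanId
```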

We know that on both Android and iOS, a thread is the smallest unit the operating system can schedule; in other words, all of our code eventually runs on some thread. If, while code is executing, we can associate context information with the current thread, then the current context can be retrieved automatically during execution, which solves the automatic association of trace data within the same thread.

On Android, the context of the current call stack can be stored in a ThreadLocal variable, which guarantees that business data collected on the same thread is associated automatically. In coroutines, however, a purely thread-local approach breaks down: the thread a coroutine actually runs on is not fixed, and the coroutine may switch threads during its lifetime. We therefore rely on the coroutine dispatcher and the coroutine Context to keep the current context correct: when a coroutine resumes, its associated context takes effect on the current thread, and when the coroutine suspends, that context is removed from the thread.
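
The mechanics can be sketched in Kotlin with a ThreadLocal plus a ThreadContextElement from kotlinx.coroutines (the TraceContext type and names are illustrative; kotlinx.coroutines also offers ThreadLocal.asContextElement() as a ready-made shortcut):

```kotlin
import kotlinx.coroutines.*
import kotlin.coroutines.CoroutineContext

// Hypothetical trace context carried alongside the running code.
data class TraceContext(val traceId: String, val spanId: String)

// Thread-bound storage: code running on a thread can always read the current context.
val currentTrace = ThreadLocal<TraceContext?>()

// A ThreadContextElement installs the context on whichever thread resumes the
// coroutine and removes it again when the coroutine suspends, so the context
// survives thread switches inside a coroutine.
class TraceContextElement(private val trace: TraceContext) :
    ThreadContextElement<TraceContext?> {

    companion object Key : CoroutineContext.Key<TraceContextElement>
    override val key: CoroutineContext.Key<*> get() = Key

    override fun updateThreadContext(context: CoroutineContext): TraceContext? {
        val previous = currentTrace.get()
        currentTrace.set(trace)        // coroutine resumed: context takes effect
        return previous
    }

    override fun restoreThreadContext(context: CoroutineContext, oldState: TraceContext?) {
        currentTrace.set(oldState)     // coroutine suspended: context removed
    }
}

suspend fun main() {
    withContext(Dispatchers.IO + TraceContextElement(TraceContext("trace-1", "span-1"))) {
        // Even if this block hops between IO threads, currentTrace.get() returns trace-1.
        println(currentTrace.get())
    }
}
```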

On iOS, we rely mainly on the activity tracing mechanism to keep context information valid. With activity tracing, an activity is created automatically when a business link starts, and we associate the context information with that activity. Within the scope of the current activity, all generated data is automatically associated with the current context.

Based on these two mechanisms, the SDK automatically attaches the context information to the data it generates, in accordance with the OTel protocol. The resulting data is logically organized as a tree whose root node is the starting point of the trace link. This approach supports not only automatic data association within threads and coroutines but also multi-level nesting.

Collecting and linking third-party framework data

For data collection from third-party frameworks, let's first look at common practices in the industry. There are currently two main approaches:

  1. If the third-party library supports configuring interceptors or proxies, instrumentation code is generally added in the corresponding interceptor (a minimal interceptor sketch follows this list);
  2. If the third-party library exposes few interfaces, instrumentation is generally added through hooks or similar techniques, or instrumentation of that framework is simply not supported.
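
For reference, a minimal sketch of approach 1 for OkHttp might look like this, assuming OkHttp 4.x; the reportSpan helper is hypothetical and stands in for the real SDK call:

```kotlin
import okhttp3.Interceptor
import okhttp3.OkHttpClient
import okhttp3.Response

// Approach 1: interceptor-based instrumentation. Only requests that go through
// this particular OkHttpClient instance are covered.
class TracingInterceptor : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val request = chain.request()
        val startNs = System.nanoTime()
        val response = chain.proceed(request)
        val tookMs = (System.nanoTime() - startNs) / 1_000_000
        // reportSpan(...) is a hypothetical stand-in for the real reporting call.
        reportSpan("HTTP ${request.method}", request.url.toString(), response.code, tookMs)
        return response
    }
}

fun reportSpan(name: String, url: String, status: Int, durationMs: Long) {
    println("$name $url -> $status in ${durationMs}ms")
}

val tracedClient: OkHttpClient = OkHttpClient.Builder()
    .addInterceptor(TracingInterceptor())   // must be wired up in business code
    .build()
```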

This approach has two main problems:

  1. The instrumentation is incomplete. Take OkHttp as an example: third-party SDKs may also depend on OkHttp, and the interceptor approach typically only covers the OkHttpClient instances created by our own business code, so network requests made by third-party SDKs are not collected, leaving the instrumentation incomplete;
  2. It may intrude into business code. To instrument a framework there has to be a hook point, and this usually means adding extra configuration code where the framework is initialized.

How do we solve these two problems?

Our solution is to implement a Gradle Plugin and perform bytecode instrumentation inside it. During the packaging of an Android App there is a step that converts .class files into .dex files, and during this step the class files can be processed through the Transform API. We use ASM to instrument the class files: in the bytecode processing we first find a suitable insertion point and then inject the appropriate instructions.

Take OkHttp bytecode instrumentation as an example. The goal is to associate the current thread's context information with the OkHttp Request when OkHttpClient's newCall method is invoked. In the Transform step we first filter the target class file by the class name OkHttpClient, then filter the target method by the method name newCall. Next, the context information needs to be attached to the request's tag object at the beginning of newCall; our analysis shows the target code must be injected right when the newCall invocation starts. To make implementation and debugging easier, we implemented an OkHttp helper in the extension library, inserted bytecode that calls this helper at the target position, and passed in the request object.

The inserted bytecode links against the extension library. In this way, both third-party framework data collection and automatic context association are solved.
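
The core of the injected call can be sketched with an ASM MethodVisitor as below. This is only a sketch of the visitor applied to the OkHttpClient class file, not the complete Transform/Gradle plugin, and the helper class com.example.trace.OkHttpTraceHelper is a hypothetical name for the extension-library tool:

```kotlin
import org.objectweb.asm.ClassVisitor
import org.objectweb.asm.MethodVisitor
import org.objectweb.asm.Opcodes

// Applied only to the OkHttpClient class file (filtered earlier by class name).
// Injects a static call to an extension-library helper at the start of
// OkHttpClient.newCall(Request), passing the Request so the helper can attach
// the current thread's trace context.
class OkHttpClientVisitor(next: ClassVisitor) : ClassVisitor(Opcodes.ASM9, next) {

    override fun visitMethod(
        access: Int, name: String, descriptor: String,
        signature: String?, exceptions: Array<String>?
    ): MethodVisitor {
        val mv = super.visitMethod(access, name, descriptor, signature, exceptions)
        // Only instrument: Call newCall(Request)
        if (name != "newCall" || descriptor != "(Lokhttp3/Request;)Lokhttp3/Call;") return mv

        return object : MethodVisitor(Opcodes.ASM9, mv) {
            override fun visitCode() {
                super.visitCode()
                // Equivalent injected code: OkHttpTraceHelper.attachContext(request);
                visitVarInsn(Opcodes.ALOAD, 1) // slot 0 = this, slot 1 = the Request argument
                visitMethodInsn(
                    Opcodes.INVOKESTATIC,
                    "com/example/trace/OkHttpTraceHelper", // hypothetical helper class
                    "attachContext",
                    "(Lokhttp3/Request;)V",
                    false
                )
            }
        }
    }
}
```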

Compared with the traditional approach, the bytecode instrumentation scheme is less intrusive to business code, the instrumentation takes effect for both business code and third-party frameworks, and automatic context association is completed in combination with the extension library.

How to ensure performance

Collecting observable data produces a large volume of data, which imposes performance requirements on memory, CPU usage, and I/O load.

We implement the core in C to keep performance consistent across platforms, and we optimize performance in three areas:

First, we optimize protocol processing. For the data protocol we use Protocol Buffers, which is not only faster than JSON but also saves memory. For serialization we hand-roll the protocol encoding, which avoids allocating and copying a lot of temporary memory and avoids calls to irrelevant functions during serialization.
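
To illustrate the hand-rolled encoding idea, here is a minimal Kotlin sketch of writing length-delimited Protocol Buffers fields into a preallocated buffer; the field numbers are made up for the example, and the SDK's real implementation lives in its C core:

```kotlin
// Hand-rolled Protocol Buffers wire-format encoding into a reusable buffer,
// avoiding intermediate message objects during serialization.
class PbWriter(private val buf: ByteArray) {
    var pos = 0
        private set

    private fun writeVarint(v: Int) {
        var x = v
        while (x and 0x7F.inv() != 0) {
            buf[pos++] = ((x and 0x7F) or 0x80).toByte()
            x = x ushr 7
        }
        buf[pos++] = x.toByte()
    }

    // Length-delimited field (wire type 2): tag, length, then raw UTF-8 bytes.
    fun writeString(fieldNumber: Int, value: String) {
        writeVarint((fieldNumber shl 3) or 2)
        val bytes = value.encodeToByteArray()
        writeVarint(bytes.size)
        bytes.copyInto(buf, pos)
        pos += bytes.size
    }
}

fun main() {
    val writer = PbWriter(ByteArray(256)) // reused buffer, no per-field object churn
    writer.writeString(1, "key")          // field 1: illustrative log key
    writer.writeString(2, "value")        // field 2: illustrative log value
    println("encoded ${writer.pos} bytes")
}
```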

Second, in terms of memory management, we put a configurable cap on the maximum memory the SDK may use. Memory usage can be configured according to the business, which prevents excessive SDK memory consumption from affecting App stability. We also introduce a dynamic memory management mechanism that grows memory usage on demand instead of occupying the App's memory permanently, avoiding waste. String handling is improved as well: we introduce a dynamic string structure that records its own length, so obtaining the length is cheap, buffer overflows are avoided, and the number of memory reallocations needed when modifying a string is reduced.

Finally, for file cache management, we also cap the total file size to avoid wasting the device's storage space. For cache files we introduce a Ring File mechanism: cached data is stored across multiple files that are assembled into a log file group, and the whole group is written like a circular array, returning to the beginning once the end is reached. Writing this way reduces random seeks, and the Ring File mechanism keeps any single log file from growing too large, minimizing the system I/O load. Beyond the Ring File mechanism, checkpoint saving and cache cleanup are batched and executed together to further reduce random seeks. The checkpoint file size is also capped; once it exceeds the limit, the checkpoint file is cleaned up so that an oversized checkpoint file does not hurt read and write efficiency.
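
A minimal Kotlin sketch of the Ring File idea, assuming an illustrative file naming scheme and size cap (the real SDK implements this in C):

```kotlin
import java.io.File
import java.io.RandomAccessFile

// A fixed group of log files written like a circular array. When the current
// file reaches its size cap, writing moves to the next file (wrapping back to
// the first), which keeps any single file small and bounds total disk usage.
class RingFileGroup(private val dir: File, fileCount: Int = 4, private val maxFileBytes: Long = 1L shl 20) {
    private val files = (0 until fileCount).map { File(dir, "log_$it.bin") }
    private var index = 0

    init { dir.mkdirs() }

    fun append(record: ByteArray) {
        var file = files[index]
        if (file.length() + record.size > maxFileBytes) {
            index = (index + 1) % files.size   // wrap around to the next slot
            file = files[index]
            file.writeBytes(ByteArray(0))      // truncate before reusing the slot
        }
        RandomAccessFile(file, "rw").use { raf ->
            raf.seek(raf.length())             // append only: sequential writes
            raf.write(record)
        }
    }
}
```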

With these optimizations, the throughput of the SDK's data collection increased 2x, memory and CPU usage dropped significantly, and the SDK can collect up to 400+ records per second.

How to ensure that logs are not lost?

Meeting performance requirements is not enough; we also have to guarantee that collected data is not lost. While an App is in use, it may crash, the device may restart unexpectedly, or the network quality may be poor with high latency and jitter. How do we guarantee that collected data is not lost in these abnormal scenarios?

When collecting data, we use a write-ahead log (WAL) mechanism combined with a self-built network acceleration channel to address this.

  1. The write-ahead log ensures that data written to the SDK is not lost due to abnormal conditions before it reaches the server. The core of the mechanism is to cache the data on the device's disk before it is successfully sent, and to remove the cached data only after the send succeeds. If sending fails because the App crashes or the device restarts, the cached data is still on disk, and the SDK resumes sending from the recorded checkpoint. The write-ahead log also lets writing and sending proceed concurrently without blocking each other (see the sketch after this list);
  2. Before sending, multiple records are batched and compressed with the LZ4 algorithm, which reduces both the number of send requests and the network traffic. If a send fails, a retry strategy guarantees the data is delivered at least once;
  3. When sending, the SDK can connect to the nearest acceleration edge node and transmit data through the internal accelerated channel between the edge node and SLS.
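
A minimal Kotlin sketch of the write-ahead idea (batching and LZ4 compression are omitted; the on-disk layout and the send callback are illustrative, not the SDK's real format):

```kotlin
import java.io.File

// Each record is appended to an on-disk WAL before any send is attempted, and a
// checkpoint (the number of acknowledged records) is advanced only after the
// server confirms receipt, so sending can resume after a crash or restart.
class WalSender(dir: File, private val send: (String) -> Boolean) {
    private val wal = File(dir, "data.wal")
    private val checkpoint = File(dir, "data.checkpoint")

    init { dir.mkdirs() }

    // Write path: persist first, so a crash cannot lose the record.
    fun write(record: String) = wal.appendText(record + "\n")

    // Send path: runs independently of the write path; resumes from the checkpoint.
    fun flush() {
        val acked = checkpoint.takeIf { it.exists() }?.readText()?.toIntOrNull() ?: 0
        val lines = wal.takeIf { it.exists() }?.readLines() ?: return
        for (i in acked until lines.size) {
            if (!send(lines[i])) return                 // failure: keep data, retry later
            checkpoint.writeText((i + 1).toString())    // success: advance the checkpoint
        }
    }
}
```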

With these three optimizations, the average data packet size is reduced by a factor of 2.1, overall QPS improves by an average of 13x, the overall success rate of data transmission reaches 99.3%, and the average network latency drops by 50%.

Multi-system data association processing

Having solved the linking and collection performance of device-side data, we still need to handle data storage and correlation analysis across multiple systems.

For data storage, we store all relevant data uniformly on top of the SLS LogHub capability. SLS can carry PB-level traffic per day, and this throughput is enough to support full collection of mobile observable data.

With unified data storage in place, there are two main problems left to deal with.

**The first question:** how do we handle the contextual association between observable data from different systems?

Under the constraints of the OTel protocol, we can work out the mapping among root, parent, and child nodes based on parent_id and span_id. First, when querying a trace link, all trace data within a certain time window is pulled from SLS. Then, following the OTel constraints, the node type is determined for each record. Because data from multiple systems may arrive late, some records may not be available yet at query time, so we also create virtual nodes for parents that are temporarily missing to keep the trace link accurate. Next, the nodes are normalized: nodes sharing the same parent_id are grouped together and each group is sorted by start time, which finally yields the complete trace link. From this link information we can reconstruct the system's call chain.
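
A Kotlin sketch of this reconstruction step, using a trimmed-down span type similar to the earlier sketch (the virtual-node naming and placeholder timestamp are illustrative):

```kotlin
// Rebuild a trace tree from a batch of spans pulled from SLS: group children by
// parent_id, create a virtual placeholder for any parent that has not arrived
// yet, and sort each node's children by start time.
data class Span(
    val traceId: String, val spanId: String, val parentId: String?,
    val name: String, val startUnixNano: Long
)

class TraceNode(val span: Span, val children: MutableList<TraceNode> = mutableListOf())

fun buildTree(spans: List<Span>): List<TraceNode> {
    val nodes = spans.associateBy({ it.spanId }, { TraceNode(it) }).toMutableMap()

    for (span in spans) {
        val parentId = span.parentId ?: continue        // root: nothing to attach to
        val parent = nodes.getOrPut(parentId) {
            // Parent not arrived yet (late data): insert a virtual node so the
            // link stays structurally correct; it borrows the child's start time.
            TraceNode(Span(span.traceId, parentId, null, "<virtual>", span.startUnixNano))
        }
        parent.children += nodes.getValue(span.spanId)
    }

    nodes.values.forEach { it.children.sortBy { c -> c.span.startUnixNano } }
    // Roots: real root spans plus any virtual placeholders.
    return nodes.values.filter { it.span.parentId == null }
}
```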

The second question: when doing trace analysis, we often need to analyze data along different dimensions from a system-wide perspective. For example, what if we want to analyze trace data by dimensions such as device ID, App version, or service call? Let's see how to solve this.

Multi-system data topology generation

When we analyze problems from the perspective of the whole system, the amount of trace data required is usually large, possibly tens of millions of records per minute, and the freshness requirements on the data are also high. Traditional stream processing easily hits performance bottlenecks in this scenario. Our solution is to turn the stream processing problem into a batch processing problem and shift from the traditional per-link perspective to a system-wide perspective. After this shift, the core of the problem becomes how to determine the relationship between two nodes.

Let's look at the concrete process. For batch processing we use the MapReduce model. First, in the data source stage, we aggregate data with the scheduled analysis (ScheduledSQL) capability of SLS, pulling data from the trace source at minute granularity. In the Map stage, the data is first grouped by traceID and then aggregated by spanID and parentID, after which the relevant statistics are computed, such as success rate, failure rate, and latency. In real business use, data tied to specific business attributes is often collected as well, and this data varies greatly from business to business, so grouping the results by additional dimensions is supported during aggregation. At this point two intermediate products are obtained:

  • aggregated data that captures the relationship between pairs of nodes, which we call edge information;
  • and raw data that has not yet been matched.

These two intermediate products are aggregated again in the Combine stage, which finally produces result data containing the basic statistical indicators plus the additional dimensions.
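
A simplified Kotlin sketch of the Map-stage idea of turning spans into edge statistics (field names such as service, durationMs, and isError are illustrative):

```kotlin
// Within each trace, pair every span with its parent to form an edge
// (caller -> callee), then aggregate edges across traces into per-edge
// statistics such as call count, error rate, and average latency.
data class SpanRow(
    val traceId: String, val spanId: String, val parentId: String?,
    val service: String, val durationMs: Long, val isError: Boolean
)

data class EdgeStats(var calls: Long = 0, var errors: Long = 0, var totalMs: Long = 0) {
    val errorRate get() = if (calls == 0L) 0.0 else errors.toDouble() / calls
    val avgLatencyMs get() = if (calls == 0L) 0.0 else totalMs.toDouble() / calls
}

fun aggregateEdges(rows: List<SpanRow>): Map<Pair<String, String>, EdgeStats> {
    val result = mutableMapOf<Pair<String, String>, EdgeStats>()
    // Group by traceId (Map stage), then join child spans to their parents.
    for ((_, spans) in rows.groupBy { it.traceId }) {
        val byId = spans.associateBy { it.spanId }
        for (child in spans) {
            val parent = child.parentId?.let { byId[it] } ?: continue // unmatched raw data kept separately
            val stats = result.getOrPut(parent.service to child.service) { EdgeStats() }
            stats.calls++
            stats.totalMs += child.durationMs
            if (child.isError) stats.errors++
        }
    }
    return result
}
```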

The final product contains several main pieces of information:

  • Edge information, which reflects call relationships.
  • Dependency information, which reflects service dependencies.
  • Plus metric information, resource information, and so on; data tied to business attributes ends up in the resource information.

Based on these products, we can filter by resources, services, and other dimensions to compute the problem distribution and the affected links for each dimension.

Exploring automated root cause location

Next, I would like to share some of our exploration in the direction of automatically locating the root cause of problems.

We know that as an App iterates, each release may involve code changes across multiple businesses. Some of these changes are fully tested, while others are not, or are not covered by conventional testing, which can have a potential impact on online services and even make some services unavailable. The larger the App, the more business modules it has, the larger the corresponding data volume, and the more uncertain its request links become. Once a problem occurs, troubleshooting often requires multiple people across domains, and the human operations cost is high.

How do we troubleshoot and locate problems from the device side and speed up R&D through technical means? We have done some exploration based on machine learning.

Our current approach is to first perform feature processing on the raw trace data; then run cluster analysis on the features to find abnormal traces; and finally use graph algorithms and similar techniques to analyze the abnormal traces and find the starting point of the anomaly.

First, the real-time feature processing stage reads the raw trace data, generates a feature for each trace link by walking five nodes bottom-up, and encodes the feature. The encoded features are then hierarchically clustered with the HDBSCAN algorithm, so that similar anomalies fall into the same group, and a representative abnormal trace is picked from each group. Finally, a graph algorithm is used to find the starting point of that abnormal trace, which gives the probable root cause of the anomaly. With this approach, any data source that follows the OTel standard protocol can be processed.
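
As a rough illustration of the feature-processing step, the bottom-up walk could be expressed as follows in Kotlin; the span type and the joined-path encoding are our own illustrative choices, not the production feature scheme:

```kotlin
// Illustrative feature extraction: starting from a leaf span, walk up to five
// nodes along the parent chain and join their names into a path string that a
// clustering step (e.g. HDBSCAN) can then consume after encoding.
data class SpanNode(val spanId: String, val parentId: String?, val name: String)

fun pathFeature(leaf: SpanNode, byId: Map<String, SpanNode>, depth: Int = 5): String {
    val path = mutableListOf<String>()
    var current: SpanNode? = leaf
    while (current != null && path.size < depth) {
        path += current.name
        current = current.parentId?.let { byId[it] }   // move one level up the trace
    }
    return path.joinToString(" <- ")                   // e.g. "db.query <- api.call <- page.load"
}
```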

Case: multi-terminal link tracing

With the data processed, let's look at the final result.

Here is a scenario that simulates end-to-end link tracing across Android, iOS, and the server.

We use an iOS App as the sender of a command and an Android App as the receiver to simulate remotely turning on a car's air conditioner. As the figure shows, after the "turn on the car air conditioner" operation is triggered on the iOS side, it goes through the "user permission verification", "send instruction", and "invoke network request" steps in turn. After the Android side receives the instruction, it executes "remotely start the air conditioner" and "status check" in sequence. The call graph shows the Android, iOS, and server links connected end to end, and we can analyze the call link from the perspective of any of them. The time spent on each operation, the number of requests per service, the error rate, and the service dependencies are all visible.

Overall architecture

Next, let's look at the architecture of the entire solution:

  1. The bottom layer is the data source layer, which follows the OTel protocol; the SDK on each platform is implemented uniformly according to the protocol specification;
  2. The data storage layer is built directly on SLS LogHub, and the data collected from all systems is stored uniformly;
  3. Above that is the data processing layer, which pre-processes key metrics, trace links, dependencies, topology, and features.

Finally, the application layer on top provides link analysis, topology query, metric query, raw log query, and root cause location.

Follow-up plans

Finally, a summary of our follow-up plans:

  1. At the collection layer, we will continue to improve plugin and annotation support to reduce intrusion into business code and improve onboarding efficiency;
  2. On the data side, we will enrich the observable data sources and support collecting related data such as network quality and performance;
  3. On the application side, we will provide capabilities such as user access monitoring and performance analysis.

Finally, we plan to open source our core technical capabilities and share them with the community.

Source: my.oschina.net/alimobile/blog/6251488