Chabaidao: Full-Link Observability in Practice

Author: Shan Xie

Chabaidao is a tea drink chain brand from Chengdu, Sichuan, founded in 2008. After 15 years of development, it has become a benchmark brand in the catering industry, with more than 7,000 stores covering 31 provinces and cities and every tier of city in mainland China. On March 31, 2021, at the Chengdu-Chongqing Catering and Beverage Summit, Chabaidao won the "2021 Chengdu-Chongqing Catering and Beverage Benchmark Brand Award". In August 2021, it was named in the "Top 15 Chinese New Tea Drink Brands in the First Half of 2021" released by iiMedia Ranking. On June 9, 2023, Chabaidao received a new round of financing led by Lanxin Asia, with participation from a number of well-known investment institutions, bringing its valuation to 18 billion yuan.

In April this year, Chabaidao held a brand upgrade conference at its Chengdu headquarters and announced that its store count had exceeded 7,000. According to data from the China Chain Store and Franchise Association, Chabaidao had 2,240, 5,070, and 6,532 stores as of December 31 of 2020, 2021, and 2022 respectively. The epidemic did not slow its expansion.

As its business scale expanded rapidly, Chabaidao accelerated its digital transformation. However, some of its early business systems were provided by external SaaS vendors and could not meet the requirements that fast-growing online business brings: large scale, high concurrency, elastic scaling, agility, and observability. To serve online and in-store customers and support business growth, Chabaidao chose to combine fully in-house development with Alibaba Cloud's cloud-native capabilities for its core link services, including store management, POS, user transactions, platform integration, and catering production, and to push forward a comprehensive upgrade of containerization, microservices, and observability.

The business value of going cloud native

The tea beverage industry faces both market competition and the need to improve internal operational efficiency. To meet these challenges, Alibaba Cloud and Chabaidao worked together to complete a cloud-native migration to the cloud and begin a new stage of digitalization.

Container and microservice technologies make applications lightweight and highly portable. Enterprises can deploy and scale applications more flexibly, respond quickly to market demand, and achieve high availability and elastic scalability, keeping the business stable through sudden traffic peaks and system failures.

Continuous integration and continuous delivery help enterprises iterate and deploy quickly. With automated processes, companies can launch new features and products faster, keep pace with the market, and seize opportunities.

The cloud-native transformation brings not only higher security, availability, and scalability, but also stronger innovation capability and competitiveness.

Observability challenges brought by cloud native

As an emerging catering brand with a fast-growing business, Chabaidao receives a large number of online orders every day. This volume depends on close integration with Internet technology and a high degree of digitalization. The business systems therefore face very strict continuity and availability requirements so that the core services of the transaction link run stably. Especially during daily peak ordering periods, marketing campaigns, and sudden hot events, every part of the microservice system must maintain service quality under high concurrency and heavy traffic to give users a smooth experience.

A complete full-link observability platform and APM (Application Performance Management) tooling are prerequisites for business continuity and availability. The Chabaidao technical team explored many options while building its observability stack. Before containerization was complete, the team connected open source APM tools to some of its microservices and evaluated them for more than a year. In the end, the tools could not be rolled out to the entire microservice architecture, mainly for the following reasons:

  • Metric accuracy and sampling rate are hard to balance

    A suitable sampling strategy is an important way to control the cost and performance overhead of tracing tools. If an APM tool collects 100% of traces, it stores a large amount of duplicate trace information. At the scale of Chabaidao's microservice system, 100% collection would push the storage cost of the observability platform beyond expectations and would also affect the performance of the microservices themselves during business peaks. However, once a sampling strategy is enabled in the open source tools, metric accuracy suffers, and important metrics such as error rate and P99 response time lose their value for observation and alerting.

  • Lack of advanced alarm capabilities

    Alarm support in open source tools is relatively basic. Users have to build their own alarm processing and distribution platform even for basic functions such as sending alarms to IM groups. Chabaidao's microservices have many modules and complex dependencies, so an abnormal or unavailable component often triggers a large number of redundant alarms along the whole link, creating an alarm storm. The operations team is then overwhelmed by the volume and variety of alarms and can easily miss the information that actually matters for troubleshooting.

  • Limited troubleshooting methods

    Open source APM tools mainly locate faults from trace information. For simple performance issues in a microservice system, users can quickly find the bottleneck or fault source. However, many hard problems in real production environments cannot be solved by simple trace analysis, such as N+1 query problems, out-of-memory (OOM) errors, high CPU usage, and exhausted thread pools. This places very high demands on the technical team, which needs engineers with a deep understanding of the underlying technology and rich SRE experience to locate root causes quickly and accurately.

Adopting Alibaba Cloud Application Real-Time Monitoring Service (ARMS)

While making the Chabaidao system architecture fully cloud native, the Chabaidao technical team and Alibaba Cloud engineers held in-depth discussions on how to better implement full-link observability.

As an important member of Alibaba Cloud's cloud-native observable product family, ARMS application monitoring provides thread analysis, intelligent insights, CPU & memory diagnosis, alarm integration and other capabilities that open source APM products do not have. At Alibaba Cloud's suggestion, the Chabaidao technical team tried to connect a business module to ARMS application monitoring.

ARMS supports automatic onboarding for applications running in the ACK container service: adding two lines to each application's YAML file is enough to inject the probe automatically and complete the whole onboarding process. Over the trial period, Chabaidao engineers kept finding more practical value in ARMS application monitoring. Chabaidao also uses Alibaba Cloud's performance testing product PTS for capacity planning, both for daily operation and for major promotions. With ARMS and PTS in place, Chabaidao's daily operations and stability assurance practices have been upgraded in many ways.

Build an emergency response system around the ARMS alarm platform

Because the alarm platform previously built on open source products often suffered from alarm storms, Chabaidao was very cautious when configuring alarm rules and narrowed the alarm targets to the most serious business failures. This kept alarm storms from constantly disturbing the SRE team, but it also meant that much valuable information was ignored, such as a sudden increase in interface response time.

The industry does have a standard set of solutions to the alarm storm problem, built on key techniques such as deduplication, compression, noise reduction, and silencing. Integrating these techniques with observability products is fairly complex, however, and many open source products do not offer a complete solution in this area.

The ARMS alarm platform implements all of these techniques. Taking event compression as an example, ARMS provides two compression methods: tag-based compression and time-based compression. Multiple events that meet the conditions are automatically compressed into one alarm for notification (as shown in the figures below).

Figure: Tag-based compression

Figure: Time-based compression
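
The two compression methods can be illustrated with a minimal sketch. This is not the ARMS implementation: the event shape (timestamp, tags, message), the function names, and the sample services are all hypothetical, and the sketch only shows the general idea of merging events that share tag values or that arrive close together in time.

```python
from collections import defaultdict

# Hypothetical event shape: (timestamp_seconds, tags_dict, message).

def compress_by_tags(events, keys):
    """Group events that share the same values for the given tag keys."""
    groups = defaultdict(list)
    for ts, tags, msg in events:
        group_key = tuple(tags.get(k) for k in keys)
        groups[group_key].append((ts, tags, msg))
    return dict(groups)

def compress_by_time(events, window_seconds):
    """Merge an event into the current alarm if it follows the previous event within the window."""
    alarms = []
    for event in sorted(events, key=lambda e: e[0]):
        if alarms and event[0] - alarms[-1][-1][0] <= window_seconds:
            alarms[-1].append(event)
        else:
            alarms.append([event])
    return alarms

events = [
    (0, {"service": "order", "alertname": "HighRT"}, "rt > 1s on /create"),
    (20, {"service": "order", "alertname": "HighRT"}, "rt > 1s on /pay"),
    (30, {"service": "pos", "alertname": "HighRT"}, "rt > 1s on /sync"),
    (400, {"service": "order", "alertname": "HighRT"}, "rt > 1s on /create"),
]

by_tags = compress_by_tags(events, keys=("service", "alertname"))
by_time = compress_by_time(events, window_seconds=60)
```

In this sample, four raw events collapse into two notifications under either method: one per (service, alertname) pair for tag-based compression, and one per burst for time-based compression.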

With the techniques provided by the ARMS alarm platform, the alarm storm problem can be solved very effectively. The Chabaidao technical team therefore began to take alarms seriously and gradually added more alarm rules, covering application interfaces, host metrics, JVM parameters, database access, and other levels.

Alarm notifications are delivered to Enterprise WeChat groups and integrated with the ITSM process. When on-duty staff receive an alarm notification, they can close the alarm, escalate the event, and take similar actions directly in the IM tool, so alarms are handled quickly (as shown below).

Figure: Intelligent convergence and notification of monitoring alarm events

Flexible and open alarm event handling strategies meet the needs of different scenarios and response times. On this basis, Chabaidao began to build an enterprise-level emergency response system, drawing on Alibaba's best practices for production safety. Emergency scenarios defined from the business perspective serve as the core model for incident response, and different alarm levels map to the corresponding fault handling processes. Chabaidao built up this experience after going fully cloud native, and it has significantly improved service quality in the production environment.

Introducing a sampling strategy

Extracting metrics from trace data is an essential function of every APM tool. Unlike the simple and crude metric extraction used by open source products, ARMS application monitoring pre-aggregates on the agent side: it captures every real request, aggregates first, then samples, and then reports. This keeps metric data consistent with reality even when a sampling strategy is turned on.

Figure: ARMS end-side pre-aggregation capability
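
The "aggregate, then sample, then report" order can be sketched as follows. This is a simplified illustration under stated assumptions, not the ARMS probe: the class name, its fields, and the simulated traffic are hypothetical. The point is that counters are updated for every request before the sampling decision, so the reported error rate stays exact while only a fraction of traces keep their full detail.

```python
import random

class PreAggregatingAgent:
    """Counts every request locally, but keeps full traces for only a sample."""

    def __init__(self, sample_rate, seed=0):
        self.sample_rate = sample_rate
        self.rng = random.Random(seed)
        self.total = 0
        self.errors = 0
        self.sampled_traces = []

    def record(self, trace_id, is_error):
        # Step 1: aggregate. Every request updates the counters.
        self.total += 1
        self.errors += int(is_error)
        # Step 2: sample. Only a fraction of traces keep their full detail.
        if self.rng.random() < self.sample_rate:
            self.sampled_traces.append((trace_id, is_error))

    def report(self):
        # Step 3: report exact metrics together with the sampled traces.
        return {
            "total": self.total,
            "errors": self.errors,
            "error_rate": self.errors / self.total if self.total else 0.0,
            "traces": list(self.sampled_traces),
        }

agent = PreAggregatingAgent(sample_rate=0.1)
for i in range(10_000):
    agent.record(trace_id=i, is_error=(i % 200 == 0))  # exactly 50 errors

report = agent.report()
```

If the error rate were instead computed from `report["traces"]` alone, it would depend on which of the 50 failing requests happened to be sampled.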

To reduce the performance overhead that APM tools add to applications, Chabaidao uses a 10% sampling rate for most applications and an adaptive sampling strategy for applications with very high TPS, which further reduces overhead during peak periods. In actual measurements during business peaks, the performance overhead of ARMS application monitoring was more than 30% lower than that of the open source products, and the metric data remained trustworthy. Interface-level metrics such as average response time and error count meet production-level business needs.

Figure: Interface level indicator data
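
The text does not describe how the ARMS adaptive strategy works internally. As one simple way to make the effective sampling rate fall as throughput rises, the sketch below caps the number of traces kept per one-second window. The class, the budget value, and the helper function are hypothetical.

```python
class BudgetSampler:
    """Keeps at most `budget` traces per one-second window (hypothetical policy)."""

    def __init__(self, budget):
        self.budget = budget
        self.window = None
        self.kept = 0

    def should_sample(self, now_seconds):
        window = int(now_seconds)
        if window != self.window:
            # A new one-second window starts: reset the budget.
            self.window = window
            self.kept = 0
        if self.kept < self.budget:
            self.kept += 1
            return True
        return False

def effective_rate(tps, budget):
    """Fraction of requests sampled when `tps` requests arrive evenly within one second."""
    sampler = BudgetSampler(budget)
    kept = sum(sampler.should_sample(i / tps) for i in range(tps))
    return kept / tps

low_traffic_rate = effective_rate(tps=100, budget=50)
high_traffic_rate = effective_rate(tps=5000, budget=50)
```

With a budget of 50 traces per second, an application at 100 TPS keeps half of its traces, while one at 5,000 TPS keeps 1%. Keeping the first requests of each window is a simplification; a production sampler would normally spread the budget across the window.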

Automatic instrumentation of asynchronous links

The Java ecosystem includes asynchronous thread pools and many open source asynchronous frameworks, such as RxJava, Reactor Netty, and Vert.x. Compared with synchronous links, automatically instrumenting asynchronous links and propagating their context is technically harder. Open source products do not fully cover the mainstream asynchronous frameworks, and instrumentation fails in certain scenarios. When that happens, trace analysis, the most important capability of an APM tool, can no longer do its job.

In this case, developers have to instrument the code manually through the SDK to propagate the context of the asynchronous link. This creates a huge workload and is hard to roll out quickly and at scale within a team.

ARMS supports all mainstream asynchronous frameworks and propagates the asynchronous link context without any intrusion into business code. Even when a specific version of an asynchronous framework is not yet supported, the ARMS team can add support in a new probe version once users raise the requirement. After adopting ARMS application monitoring, the Chabaidao technical team removed the manual instrumentation code it had previously written for asynchronous frameworks, greatly reducing the maintenance workload.
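
The article concerns Java, but the underlying problem can be shown in a few lines of Python using only the standard library. In this sketch the `trace_id` variable is a hypothetical stand-in for the trace context an APM agent maintains, and `submit_with_context` is a hypothetical helper of the kind developers end up writing by hand: without it, a task handed to a thread pool no longer sees the caller's trace.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the trace context an APM agent maintains.
trace_id = contextvars.ContextVar("trace_id", default=None)

def handle_async_step():
    # Runs on a worker thread and reports which trace it believes it belongs to.
    return trace_id.get()

def submit_with_context(pool, fn, *args):
    # Manual propagation: snapshot the caller's context and run fn inside it.
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)

with ThreadPoolExecutor(max_workers=1) as pool:
    pool.submit(lambda: None).result()  # start the worker before a trace exists
    trace_id.set("trace-123")
    lost = pool.submit(handle_async_step).result()
    propagated = submit_with_context(pool, handle_async_step).result()
```

Here `lost` is `None` because the worker thread has its own context, while `propagated` carries `"trace-123"`. Every call site that hands work to another thread needs this kind of wrapper, which is the workload that automatic instrumentation removes.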

Figure: Link context for asynchronous calls

Using more advanced application diagnostics

When instrumentation coverage is high enough, traditional APM and tracing tools can help users quickly determine which segment of a link (that is, which Span) is the performance bottleneck. When the root cause needs further investigation, however, they offer little additional help.

For example, when system CPU usage rises significantly, is a particular business method consuming excessive CPU? Most APM products struggle to answer this question, because the resource consumption of each segment cannot be read from the trace view alone. Chabaidao engineers ran into similar problems many times with open source tools. At the time they could only guess from experience and then compare repeatedly in the test environment before the problem was fully resolved. They also tried some profiling tools, but the barrier to use was relatively high and the results were not very good.

ARMS application monitoring provides CPU & memory diagnostics, which can effectively find bottlenecks caused by CPU, memory, and I/O in Java programs and break the statistics down by method name, class name, and line number, helping developers optimize programs, reduce latency, increase throughput, and save costs. CPU & memory diagnostics can be turned on temporarily when a specific problem needs troubleshooting, and flame graphs help users go straight to the root cause. In one case where the CPU usage of a production application soared, Chabaidao engineers used CPU & memory diagnostics to locate the problem in a single step and determined that it was caused by a specific business algorithm.

Figure: Analyzing CPU time via flame graph
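
How a flame graph attributes CPU time to methods can be sketched from sampled call stacks. The sketch below is a generic illustration of sampling-based profiling, not the ARMS diagnostic engine, and the method names in the sample data are hypothetical. Each sample is one call stack captured while the CPU was busy, listed from the outermost caller to the method on top of the stack.

```python
from collections import Counter

def self_time_share(stack_samples):
    """Share of samples in which each method was on top of the stack."""
    counts = Counter(stack[-1] for stack in stack_samples)
    total = len(stack_samples)
    return {method: n / total for method, n in counts.items()}

def inclusive_share(stack_samples):
    """Share of samples in which each method appeared anywhere in the stack.

    This corresponds to the width of the method's frame in a flame graph.
    """
    counts = Counter()
    for stack in stack_samples:
        counts.update(set(stack))
    total = len(stack_samples)
    return {method: n / total for method, n in counts.items()}

# Hypothetical samples: 10 stacks captured at a fixed interval.
samples = (
    [("handleOrder", "calcPromotion", "matchRules")] * 7
    + [("handleOrder", "saveOrder")] * 2
    + [("healthCheck",)] * 1
)

self_share = self_time_share(samples)
total_share = inclusive_share(samples)
hottest = max(self_share, key=self_share.get)
```

In this sample, `handleOrder` is present in 90% of the stacks, but the method actually burning CPU is `matchRules` at 70% self time. A trace view would only show that the `handleOrder` span is slow; the per-method breakdown is what points to the specific algorithm.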

In addition, for online business problems, you can also use the Arthas diagnostic capability provided by ARMS to troubleshoot online. As a tool for diagnosing online problems in the Java field, Arthas uses bytecode enhancement technology to view program running status without restarting the JVM process.

Although Arthas has a learning curve and takes real effort to master, Chabaidao engineers are very fond of the tool. For questions such as "What particular data causes this business anomaly?", there is no more convenient troubleshooting tool than Arthas.

Results so far

After two months of research and comparison, Chabaidao decided to move completely from the open source observability platform to ARMS, and from the open source stress testing platform to PTS, and to roll both out within the team. As usage deepened, the technical team gradually applied the advanced observability capabilities of ARMS, such as intelligent insights and thread pool analysis, in daily operations, and online troubleshooting became several times more efficient than before.

In terms of the cost of the observability product itself, ARMS appears more expensive than the open source products. However, that comparison is against an open source deployment that wrote only a single copy of its data and had single points of failure. Chabaidao's technical team was well aware that the previous open source solution carried high availability risks: the failure of one component could make the entire observability solution unavailable. The risk simply received little attention because the team did not rely heavily on the observability capabilities of the open source solution. Taken as a whole, the cost of ARMS is not higher than that of the open source solution.

Using ARMS, Chabaidao's observability metrics now cover 100% of requests, with full-link collection and greatly improved monitoring data accuracy. Business faults can be discovered automatically and quickly, which supports agile business development.

After a fault occurs, the monitoring system needs to notify the relevant people as soon as possible and perform an initial diagnosis. ARMS alarms implement ChatOps on top of IM tools, reaching the relevant people quickly and providing an initial diagnosis, which greatly improves fault response.

Rapid fault recovery is crucial to limiting business impact. ARMS uses full-link tracing to quickly locate the specific application, interface, method, or slow SQL statement involved, which is a key aid to rapid recovery. The head of Chabaidao's technical team said: "At a cost on par with open source solutions, ARMS' rich and comprehensive full-stack observation and alarm capabilities enabled Chabaidao to quickly establish operations observation and response capabilities. Fault recovery efficiency has increased by more than 50%, fault recovery time has been reduced by 50%, and we can truly use observability to protect the rapid development of the business."

Fault prevention and containment have a very high return on investment when building a stable system. PTS can generate load from across the country and monitor at second-level granularity, which makes it possible to verify site capacity and locate performance bottlenecks. Before the business went live, Chabaidao thoroughly stress-tested both individual applications and the full link. More than 800 stress tests were run in total, and performance problems were resolved before launch so that they did not turn into online faults.

Goals for the next stage

In the observability field, Prometheus and Grafana are the de facto standards for storing, computing, querying, and displaying metric data. The ARMS product family provides managed and enhanced Prometheus and Grafana services. Metric data generated by ARMS application monitoring is automatically saved in managed Prometheus, and several Grafana dashboards are preconfigured. Chabaidao engineers are combining application-layer metrics, key business metrics, and cloud service metrics on top of Prometheus and Grafana to build a multi-dimensional observability dashboard.

In the near future, Chabaidao will establish a unified observability system covering the business layer, user experience layer, application service layer, infrastructure layer, and cloud service layer, to guarantee the stability of a large-scale microservice system with tens of millions of users online at the same time.

Origin my.oschina.net/u/3874284/blog/10117932