ARMS helps Jikrypton improve service emergency response and ensure safe travel

Author: Billan

01 Customer introduction and project background

Zhejiang Jikrypton Intelligent Technology Co., Ltd. was established in March 2021. In April 2021, the Jikrypton brand and its first product, the Jikrypton 001, were released. Jikrypton is an intelligent, digital, data-driven smart travel technology company. It follows a user-oriented corporate philosophy, focuses on the research and development of forward-looking smart electric travel technologies, and builds both a technology ecosystem and a user ecosystem. With the mission of "co-creating a travel life with the ultimate experience", and through innovation in products, user experience, and business models, it is committed to bringing users the ultimate travel experience.

As of April 2023, deliveries of Jikrypton mass-produced vehicles had exceeded 100,000 units. Jikrypton took only two years to go from 0 to 100,000 vehicles, compared with at least four years for other new power brands, and it continues to set delivery records among them. This is not only a demonstration of "Jikrypton speed" but also the best interpretation of "China speed".

To support the rapid growth of Jikrypton's automotive business and protect the user experience, the technical team keeps functional iteration efficient while continuously strengthening system stability and emergency response capabilities. Starting in 2023, the big data team has been piloting digital stability governance for the Jishu BI business.

Jishu BI is a visual data analysis system for Jikrypton's entire operation and management system, covering multiple core business scenarios. It is more than a reporting tool: it provides global data interconnection, intelligent data analysis, and panoramic data visualization, giving other businesses the data support and decision-making capabilities to answer "what happened, why it happened, what will happen, and how to deal with it". The development goal of Jishu BI is to bridge the digital divide, create data value, and gradually make business processes observable and business results visible across the entire business domain.

To put the digital stability governance of Jishu BI into practice, Jikrypton built an end-to-end, full-link observability system, an enterprise-level emergency response mechanism, and a cross-departmental collaboration mechanism, all with the goal of ensuring business continuity. For the Jishu BI business, it has achieved the core stability indicators of "fault discovery and reporting within X minutes", "emergency response and fault location within X minutes", and "fault recovery within X minutes".

02 Challenges and needs faced when implementing the project

In the cloud-native trend, Serverless is gradually leading the next generation of application architecture thanks to its fully managed, operations-free model, lower cost, and elastic scalability. The Jishu BI business committed to a serverless direction from the start of the project and implemented it successfully on Alibaba Cloud Serverless Application Engine (SAE). Serverless minimizes operation and maintenance work, but the digital stability governance of the business itself still faces significant challenges:

How to cover and converge full-link alarm events from infrastructure to business application monitoring

Monitoring spans the system resource layer, the cloud service application layer, and the business monitoring layer: from front-end business data and user experience, to back-end application service performance, to cloud services and basic resources. Each service module has its own monitoring, and together these form a relatively complete indicator monitoring system. However, after the move to microservices there are many service modules with complex dependencies, so an abnormal or unavailable component is very likely to generate a large number of redundant alarms across the entire link, forming an alarm storm. The operation and maintenance team is then overwhelmed by the volume of alarm information and can easily miss the information that actually matters for troubleshooting. For massive, continuous alarm streams, merging alarms and reducing the number of alarm messages without losing core alarms has therefore become an important operation and maintenance problem.
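As an illustration of alarm merging, here is a minimal sketch that groups repeated alarms by service, rule name, and a fixed time window. It is not how ARMS implements convergence; the `Alarm` fields, the 300-second window, and the function name are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    service: str      # hypothetical field: component that raised the alarm
    name: str         # hypothetical field: alarm rule name
    severity: str     # "P0" (most severe) through "P3"
    timestamp: float  # seconds since epoch


def merge_alarms(alarms, window_seconds=300):
    """Group alarms by (service, name) within fixed time windows.

    Each group is reduced to its earliest alarm plus a repeat count, so a
    storm of identical alarms becomes a single notification.
    """
    groups = defaultdict(list)
    for alarm in alarms:
        bucket = int(alarm.timestamp // window_seconds)
        groups[(alarm.service, alarm.name, bucket)].append(alarm)

    merged = []
    for members in groups.values():
        first = min(members, key=lambda a: a.timestamp)
        merged.append({"alarm": first, "count": len(members)})
    # Most severe first ("P0" sorts before "P3"), then oldest first.
    merged.sort(key=lambda m: (m["alarm"].severity, m["alarm"].timestamp))
    return merged
```

Sorting by severity keeps core alarms at the top of the merged list, which is the property the paragraph above asks for: fewer messages without losing the important ones.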

How to build a unified alarm system, notification mechanism and cross-team emergency coordination mechanism

These complex IT environments span the system resource layer, the cloud service application layer, and the business monitoring layer. Because the resources at each layer are managed by different teams, a variety of monitoring systems are in use, such as Prometheus, Grafana, SkyWalking, Alibaba Cloud Monitor, and Alibaba Cloud ARMS, to obtain more comprehensive monitoring data and a better understanding of operating status and performance. However, a significant problem with running multiple monitoring systems side by side is that alarm information is scattered. Different monitoring systems generate different alarm information and deliver it to alarm handlers in inconsistent ways. Troubleshooting an alarm usually requires several teams to work together, and this criss-crossing alarm handling increases the complexity and workload of the response, so the resulting fatigue often far exceeds what alarm handlers can sustain in their daily work.

How to standardize the definition of fault levels, emergency response procedures and fault management systems

Business availability is a comprehensive reflection of the reliability, maintainability, and maintenance support of a business system: Availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The industry usually expresses availability in "nines", such as 99.9% (three nines) or 99.999% (five nines). The downtime caused by a failure directly affects business availability. Defining fault levels, emergency response procedures, and a fault management system that suit Jikrypton's own business is therefore an important means of ensuring the availability Jikrypton has promised. By establishing a standard, closed-loop fault management system that covers the whole process, and by improving the supporting technology, the probability of faults can be reduced, the MTTR shortened, and the damage caused by faults ultimately driven toward zero.
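To make the availability formula concrete, the following sketch computes availability from MTBF and MTTR and converts an availability target into a yearly downtime budget. The function names and the 365-day year are choices made for this example.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(availability_ratio, days=365):
    """Downtime budget implied by an availability target."""
    return (1 - availability_ratio) * days * 24 * 60


if __name__ == "__main__":
    for target in (0.999, 0.9999, 0.99999):
        budget = downtime_minutes_per_year(target)
        print(f"{target:.3%} -> {budget:.1f} minutes of downtime per year")
```

Three nines leaves roughly 525.6 minutes (about 8.8 hours) of downtime per year, while five nines leaves only about 5.3 minutes, which is why shortening MTTR matters so much at higher availability targets.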

How to effectively measure business stability indicators and emergency response SLA

Several typical questions arise when measuring alarm handling efficiency and service stability in daily operation and maintenance. Which alarms occurred in the system over a given period, and which types account for the largest share? A duty mechanism is in place, but how can the alarm handling efficiency of duty personnel be measured and the effectiveness of the mechanism ensured? When one service has multiple alarms configured across multiple systems, how can alarm handling efficiency and the SLA be viewed from the service perspective? After targeted system optimization, has the share of alarms dropped and has the share of time spent in an alarm state improved? Answering these questions requires complete data reports and a unified dashboard.

03 Enterprise-level emergency response solution based on ARMS

In response to these problems and challenges, Alibaba Cloud's cloud-native observability team and Jikrypton's big data team, after multiple rounds of discussion, jointly produced best practices for building an enterprise-level emergency response system based on ARMS and implemented them successfully in the Jishu BI business. Full-service, full-scenario monitoring and alarming were achieved, and monitoring coverage and alarm efficiency improved across the board. Under the emergency response mechanism Jikrypton is now promoting, the team's incident response rate has improved significantly, the mean time to acknowledge (MTTA) and the mean time to recover (MTTR) have both been significantly shortened, and cross-team collaboration has become more efficient.

The following focuses on the two parts of the overall plan that deal with alarms and takeover: the "event-centered alarm life cycle management" solution.

Use ARMS intelligent alarm to build a unified alarm event management center

The Jikrypton technical team uses a variety of monitoring systems suited to its business, such as Alibaba Cloud Application Monitoring ARMS, Alibaba Cloud Log Service SLS, Zabbix, Prometheus, Grafana, and custom alarm integrations. To simplify operation and maintenance configuration such as contacts, notification methods, and duty schedules, and to unify the alarm format, alarm level definitions, and alarm event management, Jikrypton adopted ARMS intelligent alarm as the event management center for its multiple alarm sources. The integrations, event processing flows, notification policies, and other features of ARMS intelligent alarm are designed specifically for unified alarm management and solve many of the problems encountered along the way.

1) Access alarms in different formats.

ARMS intelligent alarm follows the open source Prometheus alarm definition and describes alarms with a semi-structured data structure. Alarms are described through highly extensible key-value pairs, so alarm content can be extended flexibly to accept alarms from different data sources. The field mapping capability of alarm integration maps key information in custom alarm content to the ARMS alarm data structure. Quick access is also provided for a variety of monitoring tools, including those used by Jikrypton: Alibaba Cloud Application Monitoring ARMS, Alibaba Cloud Log Service SLS, Zabbix, Prometheus, and Grafana.
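The idea of field mapping can be sketched as follows. This is not the ARMS data structure or API: the target layout with `labels` and `annotations` is only modeled on the key-value style of Prometheus alerts, and the payload fields, mapping keys, and function name are hypothetical.

```python
def map_alarm(raw, field_mapping, default_labels=None):
    """Map a custom alarm payload to a label/annotation structure.

    field_mapping maps a target key such as "labels.alertname" to a dotted
    path inside the raw payload such as "event.title".
    """
    result = {"labels": dict(default_labels or {}), "annotations": {}}
    for target, source_path in field_mapping.items():
        value = raw
        for part in source_path.split("."):
            if not isinstance(value, dict) or part not in value:
                value = None
                break
            value = value[part]
        if value is None:
            continue  # leave unmapped fields out rather than guessing
        section, key = target.split(".", 1)
        result.setdefault(section, {})[key] = str(value)
    return result


# Hypothetical payload from a custom monitoring source.
raw_event = {"event": {"title": "disk full", "level": 2}, "host": "bi-01"}
mapping = {
    "labels.alertname": "event.title",
    "labels.instance": "host",
    "annotations.level": "event.level",
}
unified = map_alarm(raw_event, mapping, default_labels={"source": "custom"})
```

Because every source ends up in the same key-value shape, later steps such as level classification and notification matching only need to understand one format.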

2) Unified definition of alarm levels.

Different alarm levels are generally defined according to the scope of impact and the degree of business damage, and alarm handlers follow a different emergency handling process for each level. Under Jikrypton's fault level specification, alarms are classified at configuration time into four levels, P0, P1, P2, and P3, based on business conditions.

3) Normalized management of events and alarms.

Multiple alarm event sources are integrated into ARMS intelligent alarm, where alarm events are managed in a unified way through a common event processing flow, notification policies, notification objects, escalation policies, and so on. One set of notification objects, one set of notification policies, and a consistent alarm management model meet Jikrypton's needs for a unified alarm event center.

Building convenient and efficient ChatOps mobile operation and maintenance capabilities based on the Enterprise WeChat used by Jikrypton

ChatOps is a collaboration approach and culture that integrates chat and automation tools to make workflows more efficient and visible and to ease collaboration and communication among team members. Jikrypton uses Enterprise WeChat internally as its office collaboration tool, and ARMS intelligent alarm supports integration with Enterprise WeChat. After creating an Enterprise WeChat robot, you can specify the corresponding Enterprise WeChat group in a notification policy to receive alarms, so alarm information circulates only inside Jikrypton's own Enterprise WeChat organization. When the matching rules of the notification policy are triggered, the system automatically sends an alarm notification to the designated group, and the alarm can then be managed from that group anytime and anywhere.

When an alarm is sent to an IM group chat as a card, you can add a set of alarm handling operations by modifying the card style. The alarm card then supports management of the full alarm life cycle:

1) Claim the alarm. A broadcast message announces the claim, so group members know exactly who is handling the current alarm.

2) Mask the alarm. Some alarms are triggered by expected behavior and have no business impact, but cannot simply be closed. In this case, masking the alarm reduces the noise from its notifications.

3) Follow the alarm. After you follow an alarm, its status changes are pushed to you by text message. For major failures, the team leader can follow an alarm to track handling progress in real time and support command decisions with data.

4) Resolve the alarm. This closes the alarm and sends a closure notification to the group chat. The status of the closed alarm changes to restored.
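The four card actions above can be modeled as a small state holder. This is only an illustrative sketch of the life cycle, not ARMS code; the class name, status strings other than "restored", and the message wording are assumptions.

```python
class AlarmCard:
    """Tracks the four card actions: claim, mask, follow, resolve."""

    def __init__(self, alarm_id):
        self.alarm_id = alarm_id
        self.status = "triggered"
        self.owner = None
        self.masked = False
        self.followers = set()
        self.broadcasts = []  # messages that would go to the group chat

    def claim(self, user):
        if self.status == "restored":
            raise ValueError("alarm is already restored")
        self.owner = user
        self.status = "claimed"
        self.broadcasts.append(f"{user} claimed alarm {self.alarm_id}")

    def mask(self):
        # Masking silences notifications without closing the alarm.
        self.masked = True

    def follow(self, user):
        self.followers.add(user)

    def resolve(self, user):
        self.status = "restored"
        self.broadcasts.append(f"{user} closed alarm {self.alarm_id}")
        # Followers are the people who should receive the status change.
        return sorted(self.followers)
```

For example, a team leader calls `follow`, an engineer calls `claim` and later `resolve`, and the return value of `resolve` lists the followers who should be told that the alarm has been restored.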

In addition, so that Jikrypton's on-duty staff notice alarm notifications quickly and do not lose them among too many group messages, alarm notifications can @-mention designated handlers in the alarm notification group. Add the alarm handler as an ARMS contact and make sure the mobile phone number of the notification object configured in the notification policy matches the number bound to the handler; notifications can then @-mention the person on duty according to the schedule.
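ARMS sends these notifications itself, so no code is needed in practice. Purely to illustrate why the mobile number matters, the sketch below builds the kind of text message body that an Enterprise WeChat group robot accepts, where `mentioned_mobile_list` identifies the members to @-mention. The function name, message wording, and phone number are placeholders, and nothing is sent over the network.

```python
import json


def build_wecom_text(alarm_name, level, on_duty_mobiles):
    """Build a group robot text message that @-mentions people by mobile number."""
    content = f"[{level}] {alarm_name}: please claim this alarm"
    return {
        "msgtype": "text",
        "text": {
            "content": content,
            # Members whose bound mobile numbers match are @-mentioned.
            "mentioned_mobile_list": list(on_duty_mobiles),
        },
    }


# Placeholder number for whoever is on duty in the current shift.
payload = build_wecom_text("BI report latency high", "P1", ["13800000000"])
body = json.dumps(payload, ensure_ascii=False)
```

If the number in the notification object does not match the number bound to the member in the group, the mention cannot resolve to a person, which is why the two must be kept consistent.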

An event management system derived from the ITIL concept and suited to the organizational structure and business attributes of Jikrypton

The core of the digital stability governance mechanism that the big data team is promoting in the Jishu BI business is a standardized event management process. The process covers alarm discovery, alarm notification, alarm response and takeover, alarm location, command decision-making, alarm recovery, fault review, and continuous improvement. On the organizational side it involves the operation and maintenance team, alarm duty personnel, alarm handling engineers, and emergency command personnel. When an alarm is triggered and the notification is received, the alarm duty personnel start the fault emergency and quickly set up an efficient collaboration channel for the teams involved. Engineers must sign in as soon as possible, then stop the bleeding quickly, locate and investigate the root cause, and keep information synchronized. If the alarm is not taken over within the expected time, the notification is escalated step by step. The operation and maintenance team and emergency command personnel coordinate the collection of relevant data, synchronize the scope of impact, handling progress, and recovery progress, broadcast updates regularly, and update the alarm handling status.

1) Scheduling management

ARMS intelligent alarm provides a scheduling function. Alarm notifications can be arranged according to the duty hours of operation and maintenance personnel, and notification policies then deliver them to the personnel on duty by phone call, text message, email, or Enterprise WeChat message, without disturbing those who are off duty. The alarm duty personnel then take over and handle the alarm according to the standard event management process.
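The routing idea behind a duty schedule can be sketched as a lookup from the current time to the people on shift. The shift table, names, and hours below are hypothetical, and the sketch ignores dates, holidays, and time zones.

```python
from datetime import datetime, time

# Hypothetical two-shift rota; the second shift wraps past midnight.
SHIFTS = [
    {"person": "alice", "start": time(9, 0), "end": time(18, 0)},
    {"person": "bob", "start": time(18, 0), "end": time(9, 0)},
]


def on_duty(shifts, moment):
    """Return the people whose shift covers the given moment."""
    now = moment.time()
    people = []
    for shift in shifts:
        start, end = shift["start"], shift["end"]
        if start <= end:
            active = start <= now < end
        else:  # shift wraps around midnight
            active = now >= start or now < end
        if active:
            people.append(shift["person"])
    return people
```

With this rota, an alarm at 10:00 goes to alice and an alarm at 03:00 goes to bob, so nobody is paged outside their duty hours.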

2) Notification policy

Notification policies define matching rules for alarm events. When a matching rule is triggered, the system sends an alarm message to the notification objects through the specified notification method, reminding them to take the necessary measures. A notification policy can reference a duty schedule, in which case matching events are sent to whoever is on duty. To make sure alarms are received and not missed by on-duty personnel, you can also configure repeat notifications: while an alarm remains unrecovered, the alarm information is resent at the configured frequency until the alarm recovers.

In addition, Jikrypton's event management process requires that on-duty personnel step in and take over every alarm, even one that has already stopped firing. ARMS intelligent alarm supports manual alarm recovery for this purpose: with it, an alarm does not recover automatically even if the alarm event is not triggered again within the automatic recovery time set in the alarm integration, and someone must change its status manually. This meets Jikrypton's requirement to assess and measure the takeover rate of on-duty personnel.
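The matching rules and repeat notifications described in this section can be sketched in a few lines. The policy layout, label names, and function names are assumptions for illustration and do not reflect the ARMS configuration format.

```python
def matches(policy, alarm_labels):
    """A policy matches when every rule label equals the alarm's label."""
    return all(alarm_labels.get(k) == v for k, v in policy["match"].items())


def notification_times(triggered_at, recovered_at, repeat_seconds):
    """Send once at trigger time, then repeat until the alarm recovers."""
    if repeat_seconds <= 0:
        raise ValueError("repeat_seconds must be positive")
    times = []
    t = triggered_at
    while t < recovered_at:
        times.append(t)
        t += repeat_seconds
    return times


# Hypothetical policy: route all P0 alarms of the BI service.
policy = {"name": "bi-p0", "match": {"service": "bi", "severity": "P0"}}
alarm = {"service": "bi", "severity": "P0", "alertname": "report timeout"}
```

With manual recovery enabled, `recovered_at` is the moment a person closes the alarm, so repeats continue until someone actually takes over.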

3) Escalation policy

For alarms that remain unresolved for a long time, notifications can be escalated to remind on-duty personnel to resolve them promptly. Once an escalation policy is added to a notification policy, the system sends an alarm message to the escalation contacts through the specified notification method, reminding them to take the necessary measures. Jikrypton's incident management process requires two levels of escalation for alarms left unhandled for a long time: first to the supervisor of the business department, then to the emergency command supervisor. This also raises the alarm takeover rate as far as possible and shortens alarm handling and recovery time.
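A two-level escalation like the one described above can be sketched as a list of thresholds. The 10-minute and 30-minute timeouts are invented for the example; the source does not state the actual values.

```python
# Hypothetical timeouts, in seconds, for each escalation level.
ESCALATION_LEVELS = [
    (10 * 60, "business department supervisor"),
    (30 * 60, "emergency command supervisor"),
]


def escalation_targets(unhandled_seconds, levels=ESCALATION_LEVELS):
    """Return everyone who should have been notified by escalation so far."""
    return [target for threshold, target in levels
            if unhandled_seconds >= threshold]
```

An alarm left unhandled for 15 minutes reaches only the business department supervisor, while one left for an hour has reached both levels.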

Flexible and customizable ARMS Grafana emergency response dashboard

Without data reports and continuous operation, any understanding of the current situation is full of ambiguity and uncertainty, and incident management cannot improve the business as a whole. Objective data cannot replace communication and observation, but sharing and visualizing it effectively builds consensus: everyone can see and understand the same data changes and current state, which encourages collaboration. By default, ARMS intelligent alarm provides two dashboards, a historical alarm overview and an alarm handling efficiency view. They present alarm metrics such as alarm statistics, alarm trends, MTTx indicators, and personnel efficiency, and the underlying data is stored in the default Prometheus instance. Based on its own operation and maintenance requirements, Jikrypton used this raw data to configure a customized emergency response dashboard in the ARMS Grafana service, covering duty status, alarm overview, alarm takeover status, MTTx indicators, and more. It helps the operation and maintenance team see business alarm status and emergency response status in real time and greatly improves emergency response efficiency.
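As a rough illustration of the MTTx indicators such a dashboard presents, the sketch below computes MTTA, MTTR, and the claim rate from a list of alarm records. The record fields and timestamps are hypothetical; in ARMS these metrics come from the stored alarm data rather than from code like this.

```python
from statistics import mean


def mttx(alarms):
    """Compute mean time to acknowledge and to recover, in seconds."""
    ack = [a["claimed_at"] - a["triggered_at"]
           for a in alarms if a.get("claimed_at") is not None]
    rec = [a["recovered_at"] - a["triggered_at"]
           for a in alarms if a.get("recovered_at") is not None]
    return {
        "mtta_seconds": mean(ack) if ack else None,
        "mttr_seconds": mean(rec) if rec else None,
        "claim_rate": len(ack) / len(alarms) if alarms else None,
    }


# Hypothetical records; times are seconds from an arbitrary origin.
records = [
    {"triggered_at": 0, "claimed_at": 60, "recovered_at": 600},
    {"triggered_at": 100, "claimed_at": 280, "recovered_at": 1300},
    {"triggered_at": 200, "claimed_at": None, "recovered_at": 500},
]
stats = mttx(records)
```

Tracking the claim rate alongside MTTA matters because an alarm that nobody claims does not lengthen MTTA at all; it simply drops out of the average.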

Figure: Schematic diagram of the test data dashboard

04 Direction and planning of follow-up cooperation

Digital stability governance across all of Jikrypton's business is in full swing. While overall emergency response efficiency has improved greatly, further opportunities to improve efficiency have also been identified. Alibaba Cloud's cloud-native observability team will continue to work closely with the big data team to make alarm rule configuration more efficient and further shorten alarm recovery time.

The newly released static threshold recommendation, alarm count prediction, interval detection, and alarm rule testing capabilities of ARMS intelligent alarm will help Jikrypton configure alarm rules more efficiently with intelligent assistance. ARMS intelligent alarm has also added action integration, supporting Function Compute (FC) and custom webhooks. The executable tasks that action integration provides can serve as contingency plans for stopping the bleeding quickly: for alarms with deterministic characteristics, they offer fast mitigation and recovery and can effectively shorten the actual alarm recovery time.

The stability of business systems and the efficiency of emergency response are the cornerstones of brand reputation and user experience. Alibaba Cloud will continue to provide customers with products and solutions that excel in stability, security, performance, and cost, helping customers' businesses reach new heights.

Currently, Application Real-Time Monitoring Service (ARMS) offers a 15-day free trial of the Expert Edition to help enterprises quickly build observability systems.

Origin blog.csdn.net/alisystemsoftware/article/details/132544370