Case study | How do you implement AIOps (intelligent operation and maintenance, O&M) in the financial industry? This article is all you need

To build a two-state IT system, AIOps is already an inevitable choice. The digital transformation of operation and maintenance (O&M) is the general trend. As offline business gradually moves online, it places higher demands on the stability and security of IT systems. For these key problems facing banks, AIOps (intelligent O&M) has become the main solution.

According to Gartner's latest definition, AIOps (intelligent O&M) refers to extracting and analyzing IT data using big data, machine learning, and related capabilities in order to support IT O&M management products. At present, the main AIOps application scenarios in the banking industry include precise alarming, anomaly detection, root cause location, and capacity analysis, which significantly reduce O&M costs and improve O&M efficiency; in-depth data analysis also improves O&M quality. It is worth noting that, to maximize the value of IT data, the future trend will be to integrate multi-dimensional data on a unified platform and to interpret IT O&M from a global operational perspective.

1. Case background

E-Bank is one of the five largest state-owned banks. In recent years, it has elevated digital transformation to a group-level strategy, leveraged the advantages of financial technology, steadily increased its fintech investment, and achieved positive results in business areas such as wealth management, digital development, and green finance.

  • Multiple problems coexist, and the bank's traditional O&M system urgently needs transformation and upgrading

As E-Bank's digital transformation advances, its business systems and infrastructure are growing more complex and its O&M data is growing daily. O&M capability has become an important focus of the transformation. The continuous growth in business volume has created the following problems for traditional IT O&M:

First, data governance is difficult. As digitalization evolves and bank-wide reform deepens, E-Bank's business volume has increased, its data has expanded rapidly in scale, and its data types and structures have become more complex and diverse. Because data standards are inconsistent, data quality is low. E-Bank's data is also scattered across individual applications rather than centralized, which creates data silos and limits data reuse.

Second, problems are difficult to detect. E-Bank had already built an O&M system, but many problems emerged as the system was used in day-to-day business. First, monitoring was incomplete and did not cover the overall operating status of the business. Second, the original monitoring system relied on fixed-threshold alarms, which produced high rates of both false positives and false negatives. In addition, the original system discovered problems passively and lacked trend prediction, so it could not identify problems before users were affected. It also depended heavily on the experience of O&M staff, which led to high O&M costs and low O&M efficiency.

Third, root causes are difficult to locate. E-Bank's original O&M systems and tools mostly performed after-the-fact statistical analysis. They lacked real-time analysis, root cause analysis driven by business indicators, scenario-based correlation analysis, and cross-analysis of multi-dimensional data such as alarms, indicators, and logs. As a result, they did little to improve troubleshooting, and troubleshooting capability remained low.

Fourth, operational analysis is difficult. E-Bank's traditional O&M system relied mainly on manual experience and on analyzing data through reports, and it lacked intelligent means of dynamic data analysis. Past O&M data analysis was done mainly from an O&M perspective rather than a business perspective, so the analysis was relatively one-sided and hard to act on, the value of the data was not fully mined, and the analysis could not support comprehensive operations.

In addition, E-Bank has some customized requirements. It has deployed a cloud platform whose architecture differs from the traditional technical architecture. The cloud platform places additional requirements on O&M, such as deep integration with situational awareness and visualization tools to identify and resolve security risks in the cloud. There are also requirements for security capabilities: as business volume grows, the likelihood of non-compliant operations by internal staff increases, so E-Bank has set new requirements for detecting and investigating such operations. The ability to integrate security data has therefore become an important focus for E-Bank.

In summary, E-Bank's digital transformation needed a way to manage its IT O&M data centrally, process data in real time, perform intelligent analysis and prediction, locate root causes efficiently across multiple dimensions, and upgrade O&M across the board. For this reason, E-Bank chose to work with Qingchuang Technology on an in-depth exploration of AIOps.

Founded in Shanghai in 2016, Qingchuang Technology is the first provider of production-ready AIOps (intelligent O&M) solutions in China. It focuses on empowering O&M management with AI, unlocking the value of O&M data, and helping customers with digital transformation. Its customer base now covers many industries, including banking, insurance, securities, manufacturing, energy, and transportation.

Drawing on its big data capabilities, integrated stream-batch processing capabilities, and AI algorithm capabilities, Qingchuang Technology provided E-Bank with Sherlock AIOps, a comprehensive intelligent O&M solution that spans multiple levels: the data governance layer (data collection, data processing, and data storage), the O&M application layer, and the operational decision-making layer.

2. Specific Implementation Strategies

1. Build a digital operation and maintenance platform to comprehensively improve the bank's data governance capabilities

Data is the foundation for building scenarios. For data governance, Qingchuang therefore built a digital O&M platform for E-Bank that integrates data collection, data processing, and data storage.

The first step is multi-source data collection. Sherlock AIOps can collect data from multiple sources such as data lakes, APIs, and customer data, covering O&M data types such as indicators, events, and logs. Whether the data comes from a work order system, a monitoring system, or a log platform, it can be integrated into the platform as a configurable data source. Data collection also connects to systems such as container clouds and Kubernetes (K8s).

Second, the collected data must be processed. The platform abstracts over the two major technology stacks, Flink and Spark, and integrates them into a single digital O&M platform, so that users do not need specialized development for stream processing and batch processing separately. Visualization tools are also used to label, systematize, and standardize data, so that basic processing and integrated query and analysis can be done by drag and drop.
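To make the idea of integrated stream-batch processing concrete, here is a minimal sketch in plain Python. It does not use the Flink or Spark APIs, and it is not how Sherlock AIOps is implemented; the record fields (`host`, `metric`, `value`, `source`) and the function names are assumptions chosen for illustration. The point is only that one piece of processing logic can serve both a bounded batch and an unbounded stream.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def normalize(records: Iterable[Record]) -> Iterator[Record]:
    """Standardize and label records lazily, one at a time.

    Because this is a generator over any iterable, the same logic
    works for a finite list (batch) and an endless source (stream).
    Field names here are hypothetical.
    """
    for rec in records:
        yield {
            "host": str(rec.get("host", "unknown")).lower(),
            "metric": rec["metric"],
            "value": float(rec["value"]),
            "tags": {"source": rec.get("source", "unspecified")},
        }


def run_batch(records: List[Record]) -> List[Record]:
    """Batch mode: process a bounded data set and return all results."""
    return list(normalize(records))


def run_stream(source: Iterable[Record], sink: Callable[[Record], None]) -> None:
    """Stream mode: push each result to a sink as soon as it is ready."""
    for out in normalize(source):
        sink(out)
```

A real platform would delegate execution to an engine such as Flink or Spark; the sketch only shows why writing the transformation once removes the need to develop it separately for each mode.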

Third, O&M data storage capacity is improved. Once processed, the data needs to be stored. Qingchuang Technology provided the corresponding technology stack and supporting software for E-Bank's O&M data storage, and used big data tools to help E-Bank improve its storage capabilities.

Overall, the digital O&M platform that Qingchuang Technology built for E-Bank provides three services for building intelligent O&M scenarios: big data processing, integrated stream-batch processing, and an AI algorithm platform. These lay the foundation for E-Bank's intelligent O&M. The platform also continuously improves the quality and governance of E-Bank's O&M data, addressing its data governance difficulties.

2. Diversified intelligent operation and maintenance scenarios to help discover problems and locate root causes

On top of the digital O&M platform, Sherlock's O&M application layer combines dozens of algorithms to help E-Bank flexibly build a variety of intelligent O&M scenarios and produce the analysis results it needs.

These scenarios include automatic alarm suppression, fault scenario discovery, indicator anomaly detection, log anomaly detection, comprehensive root cause location, multi-dimensional business analysis, and capacity analysis and prediction. They are abstracted into four major product applications: the Alarm Identification and Analysis Center, the Index Analysis Center, the Log Analysis Center, and the Rizhi Quick Analysis Expert.

Driven by machine learning algorithms, the Alarm Identification and Analysis Center performs noise reduction and correlation analysis on large volumes of alarm events, helping E-Bank predict and discover problems and locate root causes.
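As a rough illustration of alarm noise reduction, the sketch below merges repeated alarms from the same host and check into a single incident when they arrive within a time window. This is a simple rule-based stand-in, not the machine learning approach the article attributes to the product; the `Alarm` and `Incident` types, the grouping key, and the 300-second window are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Alarm:
    ts: float        # event time in seconds
    host: str
    check: str       # e.g. "cpu", "disk" (hypothetical check names)
    message: str = ""


@dataclass
class Incident:
    key: Tuple[str, str]   # (host, check)
    first_ts: float
    last_ts: float
    count: int = 1


def suppress(alarms: List[Alarm], window: float = 300.0) -> List[Incident]:
    """Collapse repeated alarms with the same (host, check) into incidents.

    An alarm joins the latest incident for its key if it arrives within
    `window` seconds of that incident's last alarm; otherwise it opens
    a new incident.
    """
    incidents: List[Incident] = []
    latest: Dict[Tuple[str, str], Incident] = {}
    for alarm in sorted(alarms, key=lambda a: a.ts):
        key = (alarm.host, alarm.check)
        inc = latest.get(key)
        if inc is not None and alarm.ts - inc.last_ts <= window:
            inc.last_ts = alarm.ts
            inc.count += 1
        else:
            inc = Incident(key=key, first_ts=alarm.ts, last_ts=alarm.ts)
            latest[key] = inc
            incidents.append(inc)
    return incidents
```

With this kind of grouping, an operator sees one incident with a repeat count instead of a flood of identical alarms. A learned model could replace the fixed key and window with similarity and correlation measures.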

Drawing on transaction anomaly detection, indicator correlation, topology integration, and root cause recommendation, the Index Analysis Center helps E-Bank quickly discover and predict abnormal fluctuations in indicators and determine the correlations between indicators to assist root cause location.
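The following sketch illustrates two of the ideas in this paragraph under simplifying assumptions: flagging abnormal fluctuations against a rolling baseline rather than a fixed threshold, and measuring how strongly two indicators move together. The article does not say which algorithms the product uses; the rolling z-score and Pearson correlation here are common textbook choices, and the window size, `k`, and `min_sigma` values are arbitrary.

```python
from statistics import mean, stdev
from typing import List, Sequence


def rolling_zscore_anomalies(
    values: Sequence[float],
    window: int = 10,
    k: float = 3.0,
    min_sigma: float = 1e-9,
) -> List[int]:
    """Return indices whose value deviates more than k standard
    deviations from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        # Guard against a perfectly flat history (sigma == 0).
        sigma = max(stdev(history), min_sigma)
        if abs(values[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long series (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5
```

For a latency series that hovers around 100 ms and then jumps to 180 ms, a fixed threshold of 200 ms would stay silent, while the rolling baseline flags the jump. A high correlation between a transaction indicator and an infrastructure indicator is only a hint about where to look, not proof of a root cause.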

The Log Analysis Center offers a variety of out-of-the-box templates and intelligent analysis capabilities. It helps E-Bank analyze the overall status of its digital business and improves O&M capabilities such as fault root cause location, log auditing, and anomaly detection.

The Rizhi Quick Analysis Expert, for its part, clusters massive volumes of logs into a number of patterns small enough to be read by eye, intelligently identifies log occurrence patterns, analyzes log anomalies, and raises alerts, which lets E-Bank find problems and locate root causes without knowing the log structure in advance. With these four applications, E-Bank can quickly discover anomalies and locate their root causes, improving operational efficiency.
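The log clustering described above can be approximated with a very simple technique: mask the variable parts of each line (IP addresses, numbers, hex values) and count how many lines share the resulting template. The article does not name the clustering algorithm the product uses, so this is only a sketch of the general idea; the masking rules and placeholder tokens are assumptions.

```python
import re
from collections import Counter
from typing import Iterable

# Order matters: mask IP addresses before plain numbers.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable tokens with placeholders to expose the log pattern."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def cluster_logs(lines: Iterable[str]) -> Counter:
    """Group log lines by template and count occurrences per template."""
    return Counter(to_template(line) for line in lines)
```

For example, `Connection from 10.0.0.1 timed out after 30 ms` and `Connection from 10.0.0.2 timed out after 45 ms` collapse into the single pattern `Connection from <IP> timed out after <NUM> ms`. A template that suddenly appears, or whose count changes sharply, is a natural candidate for an alert, and no prior knowledge of the log format is needed.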

In the future, building on intelligent O&M, Qingchuang Technology will also help E-Bank make the leap from intelligent O&M to intelligent operations. Qingchuang Technology interprets IT O&M from a global operational perspective. By connecting E-Bank's data across the organization, it helps the bank customize its own operational decision-making center, which displays system operating status accurately, dynamically, and in real time. By extracting and analyzing the value of the data, the center supports operational decision-making and highlights the impact of O&M on the business.

3. Sherlock's comprehensive AIOps solution helps banks achieve efficient operation and maintenance

By implementing the Sherlock AIOps comprehensive solution, Qingchuang Technology helped E-Bank solve the problems described above and achieve efficient, intelligent O&M:

First, data quality and data governance capabilities improved. Using the digital O&M platform to govern and centrally manage O&M data broke down data barriers, greatly improved data standardization and data quality, and provided support for subsequent data analysis and applications.

Second, the ability to discover problems improved. By deploying the Sherlock AIOps platform and its four major intelligent O&M applications, E-Bank reduced the false alarm rate and the workload of front-line staff, and greatly sped up early anomaly detection and capacity warnings.

Third, root cause location became efficient. E-Bank combines transaction indicator anomaly detection with correlation analysis of anomalies in a variety of infrastructure indicators, along with topology correlation and log anomaly pattern troubleshooting, to locate fault sources efficiently and comprehensively within minutes.

Fourth, operational analysis capabilities improved. Through its intelligent O&M build-out, E-Bank achieved all-round management and intelligent analysis of alarms, logs, and indicators, reduced operational risk by about 70%, increased operational efficiency about sixfold, and greatly improved the data center's overall SLA (service level).


Qingchuang Technology is a benchmark AIOps supplier that has been recommended by Gartner for consecutive years. The company is committed to helping enterprise customers gain insight into O&M data, optimize O&M efficiency, and fully demonstrate the impact of technical O&M on business operations.

Origin blog.csdn.net/qq_37641528/article/details/130087146