Industry Solutions | Intelligent O&M Safeguards Operators' Key Services

Background Analysis

Business Background and Objectives

The 14th Five-Year Plan: The Outline of the 14th Five-Year Plan sets out the development goals, intelligent transformation and application, and safeguard measures for China's artificial intelligence sector during the plan period and the following ten years. Information technology represented by artificial intelligence will be an important driving force during China's 14th Five-Year Plan period.

Changes in the IT environment: The growing adoption of cloud computing, container technology, and microservice architectures has made IT environments unprecedentedly complex, and the probability of failure in such environments has risen sharply. Detecting problems early and quickly locating them in the massive volume of related log data has therefore become increasingly important for business stability and rapid recovery from failures.

Insufficient business assurance: With epidemic prevention and control now routine, technology has become an important means of fighting the epidemic, yet the systems built for this purpose have failed many times. For example, on December 20, 2021, Xi'an's "One-Code Pass" crashed; on January 13, 2022, the communication itinerary code launched by an operator crashed and was inaccessible in some areas. Operator services also bear on people's livelihood, so safeguarding them is particularly important.

Business Challenges and Problem Analysis

  1. Challenges posed by distributed architectures
  • Monitored objects: their number grows geometrically, and manual maintenance cannot keep up

  • Call and bearer relationships are extremely complex, making business problems difficult to locate

  • The O&M model has deficiencies

  2. Decentralized operation and maintenance
  • With a multi-level, multi-department maintenance system, business process support cannot be effectively controlled, and overall problem/fault dispatching does not run smoothly

  • The network-wide monitoring system is built by specialty, so monitoring data is scattered, monitoring methods are outdated, and cross-specialty problems are difficult to locate

  3. System-oriented operation and maintenance
  • Traditional O&M is oriented to single systems and individual specialties, and does not focus on end-to-end customer perception

  • Cross-layer faults within a single system and cross-domain problems/faults are slow and time-consuming to handle, so accurate fault location and rapid recovery cannot be achieved

Solution

Product value

  • Basic operation and maintenance: Provides flexible and powerful real-time log retrieval to make fault location more efficient. In addition, algorithms perform pattern recognition and anomaly detection on log data, which shortens the time to discover problems and improves alarm accuracy.

  • Data analysis: Reports provide statistics and analysis of log data, supplying data support for O&M and business operations and extracting more value from log data.

  • Unified log management: Collects, processes, stores, queries, and analyzes scattered logs in a unified manner to manage logs effectively, reduce log collection costs and search complexity, and help avoid failures.

  • Business intelligence analysis: Log-based call chain tracking helps O&M personnel quickly analyze the causes of performance overhead, locate anomalies, and solve problems. In addition, business indicators such as transaction volume, transaction success rate, and transaction response rate are computed from log data to help users quickly understand key business information.

  • Security event management: Analyzes the security-related information recorded in logs to identify security events such as abnormal user behavior, internal threats, external attacks, and data theft, improving enterprise security monitoring.
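As an illustration of the business indicators mentioned under business intelligence analysis, the following minimal sketch computes transaction volume and success rate from already-parsed log records. The field names `status` and `elapsed_ms` are assumptions made for this example, not DOLA's schema, and the average response time is only one possible reading of the response indicator.

```python
def business_indicators(transactions):
    """Summarize parsed transaction logs.

    Each record is a dict with the hypothetical keys 'status' and 'elapsed_ms'.
    """
    volume = len(transactions)
    if volume == 0:
        # No traffic in the window: rates are undefined rather than zero.
        return {"volume": 0, "success_rate": None, "avg_response_ms": None}
    successes = sum(1 for t in transactions if t["status"] == "SUCCESS")
    total_ms = sum(t["elapsed_ms"] for t in transactions)
    return {
        "volume": volume,
        "success_rate": successes / volume,
        "avg_response_ms": total_ms / volume,
    }


sample = [
    {"status": "SUCCESS", "elapsed_ms": 100},
    {"status": "SUCCESS", "elapsed_ms": 200},
    {"status": "FAILED", "elapsed_ms": 300},
    {"status": "SUCCESS", "elapsed_ms": 400},
]
summary = business_indicators(sample)
```

In practice the same aggregation would typically run per time window and per business system so that the indicators can be charted over time.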

Overall structure

Relying on AI, big data, cloud computing, and other technologies, the solution builds a system that is based on data assets and oriented to business scenarios, with the goals of "guaranteed quality, improved efficiency, and manageable costs". It improves digital decision-making capabilities and the level of management and control, helping to advance intelligent management and control.

Log analysis

Challenges facing logging applications

  • Real-time processing of massive data: Logs generated by large systems can reach tens of terabytes every day, and the log data generated by the system needs to be stored, retrieved, and analyzed in real time.

  • Unstructured log information: Because logs are free-form text, they can carry rich system information, but this also makes them harder to analyze.

  • Instability of log information: As modern IT systems are continuously updated and iterated, the types and number of log patterns change accordingly, so log analysis methods must be highly robust.

  • Correlation with other system information: Events in the O&M domain are reflected not only in logs. Integrating logs with other information about system status, such as metrics and call chains, is another major challenge for log analysis.

Intelligent log management and analysis process

This solution uses big data technology and intelligent algorithms to collect, process, store, query, and analyze scattered log data in a unified way. It supports log retrieval, log pattern recognition, visual log analysis, log monitoring, log desensitization, log correlation queries, and other functions. It can be applied to digital O&M and operations scenarios such as unified log management, log-based O&M monitoring and analysis, call chain monitoring and tracking, security audit and compliance, and various kinds of business analysis.

Unified log collection

This solution can collect log data with CDC (Cloudwise Data Collector), a collector developed by Cloudwise. For logs, monitoring information, and similar data, the collector uses a unified data collection framework and task scheduling mechanism to collect massive multi-source data and manage collection tasks in one place.

Visual data processing flow

DOLA provides a visual data processing pipeline: users drag components from the component library to build a data processing flow, which is flexible and easy to operate. It includes the following features:

  • Rich set of data processing components: more than 30 components are supported, such as grok splitting, JSON conversion, XML parsing, CSV parsing, and data desensitization, to meet users' data parsing needs.

  • Single-step debugging: the data processing flow can be debugged step by step, so its correctness can be verified while it is being defined.
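To make the idea of chained processing components concrete, here is a minimal sketch of such a flow in plain Python. It is not DOLA's implementation: the step functions, the log line layout, and the 11-digit phone number rule are all assumptions chosen for illustration. The optional `trace` list records the event after every step, which loosely mimics single-step debugging.

```python
import json
import re

# Hypothetical grok-like pattern: timestamp, level, then a free-text message.
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)
# Assumed sensitive data: 11-digit mobile numbers.
PHONE_PATTERN = re.compile(r"\b(\d{3})\d{4}(\d{4})\b")


def grok_split(event):
    """Split the raw line into named fields; leave the event as-is on no match."""
    match = LINE_PATTERN.match(event["raw"])
    if match:
        event.update(match.groupdict())
    return event


def json_convert(event):
    """If the message is a JSON object, merge its keys into the event."""
    try:
        payload = json.loads(event.get("message", ""))
    except json.JSONDecodeError:
        return event
    if isinstance(payload, dict):
        event.update(payload)
    return event


def desensitize(event):
    """Mask the middle four digits of phone numbers in every string field."""
    for key, value in list(event.items()):
        if isinstance(value, str):
            event[key] = PHONE_PATTERN.sub(r"\1****\2", value)
    return event


def run_pipeline(raw_line, steps, trace=None):
    """Run the steps in order; if trace is a list, snapshot the event after each step."""
    event = {"raw": raw_line}
    for step in steps:
        event = step(event)
        if trace is not None:
            trace.append((step.__name__, dict(event)))
    return event


trace = []
line = '2022-01-13 10:00:00 ERROR {"user": "alice", "phone": "13812345678"}'
event = run_pipeline(line, [grok_split, json_convert, desensitize], trace)
```

Inspecting `trace` shows the event after each component, so a wrong pattern or a missing field can be spotted at the step that introduced it.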

Highly reliable log data storage

DOLA uses high-performance deep columnar storage clusters, which can meet storage requirements for PB-scale data.

Powerful log search

DOLA provides powerful log search capabilities and supports SPL syntax, so it is easy to use and quick for users to learn. Supported features include input suggestions, search history, word analysis, log context, saved custom searches, result export, and data desensitization.

Flexible log correlation query

Using the fields shared between an enterprise's business systems, build an association model across those systems. The model retrieves the logs that a single business process produces in each system, helping O&M personnel quickly assemble the logs for that process, analyze the problem as a whole, and locate it quickly.
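The association model can be pictured as a mapping from each business system to the field that carries the shared business key. The sketch below is an assumption-laden illustration rather than DOLA's data model: the system names, the field names `order_id` and `biz_no`, and the `ts` timestamp key are all hypothetical.

```python
from collections import defaultdict


def build_association_index(logs, field_by_system):
    """Group logs from different systems by the value of their association field.

    field_by_system maps a system name to the field that holds the shared key,
    since each system may name that field differently.
    """
    index = defaultdict(list)
    for log in logs:
        field = field_by_system.get(log["system"])
        key = log.get(field) if field else None
        if key is not None:
            index[key].append(log)
    # Order each business process chronologically for easier reading.
    for entries in index.values():
        entries.sort(key=lambda entry: entry["ts"])
    return index


# Hypothetical association model: two systems, two names for the same key.
model = {"order": "order_id", "billing": "biz_no"}
logs = [
    {"system": "billing", "ts": 20, "biz_no": "A100", "msg": "charge failed"},
    {"system": "order", "ts": 10, "order_id": "A100", "msg": "order created"},
    {"system": "order", "ts": 15, "order_id": "B200", "msg": "order created"},
]
index = build_association_index(logs, model)
related = index["A100"]
```

Here `related` holds the order log followed by the billing log for the same business process, which is the cross-system view the correlation query is meant to provide.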

Log full link tracking

In a distributed microservice architecture, systems grow as the business develops, and the call relationships between services become increasingly complex. Quickly analyzing the causes of performance overhead, locating anomalies, and solving problems in a non-invasive way is a major challenge for O&M personnel. DOLA provides business-oriented service topology display, service analysis, and full-link tracking, helping O&M personnel quickly analyze the root causes of performance overhead, trace transaction links, and accurately locate abnormal requests.
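One way to picture log-based link tracking is to rebuild the call tree of a single request from log records that carry trace and span identifiers, then look for the span that spends the most time in its own code. The sketch below assumes hypothetical fields `span_id`, `parent_id`, `service`, and `duration_ms`, and assumes child calls run sequentially; it illustrates the idea and is not DOLA's tracing format.

```python
from collections import defaultdict


def build_tree(spans):
    """Return (roots, children) for the spans of one trace."""
    by_id = {span["span_id"]: span for span in spans}
    children = defaultdict(list)
    roots = []
    for span in spans:
        parent = span.get("parent_id")
        if parent in by_id:
            children[parent].append(span)
        else:
            # No known parent: treat the span as an entry point.
            roots.append(span)
    return roots, children


def self_time(span, children):
    """Time spent in the span itself, excluding its direct child calls."""
    child_total = sum(child["duration_ms"] for child in children[span["span_id"]])
    return span["duration_ms"] - child_total


def hotspot(spans):
    """Pick the span with the largest self time as the likely bottleneck."""
    _, children = build_tree(spans)
    return max(spans, key=lambda span: self_time(span, children))


spans = [
    {"span_id": "a", "parent_id": None, "service": "gateway", "duration_ms": 500},
    {"span_id": "b", "parent_id": "a", "service": "order", "duration_ms": 450},
    {"span_id": "c", "parent_id": "b", "service": "database", "duration_ms": 400},
]
roots, children = build_tree(spans)
slowest = hotspot(spans)
```

In this example the gateway and order spans each spend only 50 ms in their own code, so the database span is reported as the hotspot.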

Rule-based log exception alerts

DOLA can identify abnormal logs through rules or algorithms and send alarm notifications. A rule alarm is triggered by keywords and manually defined rules. DOLA supports creating, viewing, and managing log monitoring rules, and enabling or disabling them in batches. Alarm contacts can also be organized into groups.
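A keyword rule of this kind can be sketched in a few lines. The `Rule` fields, the threshold semantics, and the contact group names below are assumptions for illustration only; they are not DOLA's rule format.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    keyword: str
    threshold: int       # alert when this many lines in the batch match
    contact_group: str   # hypothetical alarm contact group to notify
    enabled: bool = True


def evaluate(rules, lines):
    """Return one alert per enabled rule whose keyword count reaches its threshold."""
    alerts = []
    for rule in rules:
        if not rule.enabled:
            continue
        hits = sum(1 for line in lines if rule.keyword in line)
        if hits >= rule.threshold:
            alerts.append({"rule": rule.name, "hits": hits, "notify": rule.contact_group})
    return alerts


def set_enabled(rules, names, enabled):
    """Enable or disable several rules at once (batch operation)."""
    for rule in rules:
        if rule.name in names:
            rule.enabled = enabled


rules = [
    Rule("db-timeout", "timeout", threshold=2, contact_group="dba"),
    Rule("oom", "OutOfMemoryError", threshold=1, contact_group="platform"),
]
lines = [
    "query timeout after 30s",
    "java.lang.OutOfMemoryError: Java heap space",
    "connection timeout",
]
alerts = evaluate(rules, lines)
set_enabled(rules, {"oom"}, False)
alerts_after_disable = evaluate(rules, lines)
```

Both rules fire on the sample batch; after the `oom` rule is disabled, only the `db-timeout` alert remains.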

Log exception alarm based on pattern recognition

Logs of the same pattern usually share common characteristics, such as a similar structure. Log pattern recognition uses clustering algorithms to group highly similar log texts, extract the common log patterns, and help users quickly discover logs with abnormal patterns.
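The source does not say which clustering algorithm DOLA uses. As a rough illustration of the idea, the sketch below applies a simple token-position similarity: a line joins the most similar existing cluster when enough tokens match, and differing positions become a wildcard. The `0.6` threshold and the `<*>` marker are arbitrary choices for this example.

```python
WILDCARD = "<*>"


def similarity(template, tokens):
    """Fraction of positions where the tokens agree with the template."""
    if len(template) != len(tokens):
        return 0.0
    same = sum(1 for a, b in zip(template, tokens) if a == b or a == WILDCARD)
    return same / len(tokens)


def cluster_logs(lines, threshold=0.6):
    """Greedy single-pass clustering of log lines into patterns."""
    clusters = []  # each cluster: {"template": [tokens], "count": int}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        best, best_score = None, 0.0
        for cluster in clusters:
            score = similarity(cluster["template"], tokens)
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            # Generalize the template: positions that differ become wildcards.
            best["template"] = [
                a if a == b else WILDCARD for a, b in zip(best["template"], tokens)
            ]
            best["count"] += 1
        else:
            clusters.append({"template": tokens, "count": 1})
    return clusters


def rare_patterns(clusters, max_count=1):
    """Patterns seen at most max_count times are candidates for abnormal logs."""
    return [" ".join(c["template"]) for c in clusters if c["count"] <= max_count]


lines = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Connection from 10.0.0.3 closed",
    "Disk quota exceeded on /dev/sda1",
]
clusters = cluster_logs(lines)
rare = rare_patterns(clusters)
```

The three connection lines collapse into one pattern with a wildcard in place of the address, while the disk quota line forms its own rarely seen pattern and is surfaced as a candidate anomaly.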

Pattern Recognition: Unary Scenarios for Logs

Pattern Recognition: Log Semantic Classification

  • Abnormal log identification: A machine learning model automatically screens massive volumes of logs and picks out the abnormal ones. On a test set of 12,000 samples, it currently performs at a human level.

  • Exception log classification: A machine learning model further classifies exception logs into five categories: (1) file exceptions, (2) network exceptions, (3) database exceptions, (4) system exceptions, and (5) other exceptions.

Pattern Recognition: Log Conversion Metrics

Convert the patterns recognized in the original logs into metric data, then apply a metric anomaly detection algorithm to detect anomalies in the logs, and perform multi-dimensional analysis on the logs when anomalies are found. The analysis results of anomaly detection support quickly locating the cause of a fault.
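A minimal sketch of this log-to-metric conversion follows. It counts how often one log pattern occurs per time bucket and flags buckets that deviate strongly from the recent average. The sliding-window z-score rule, the window size, and the factor `k` are assumptions chosen for illustration; the source does not specify the detection algorithm.

```python
from statistics import mean, pstdev


def counts_per_bucket(timestamps, start, end, bucket_seconds=60):
    """Turn event timestamps (epoch seconds) into a count series, zeros included."""
    n_buckets = (end - start + bucket_seconds - 1) // bucket_seconds
    counts = [0] * n_buckets
    for ts in timestamps:
        if start <= ts < end:
            counts[(ts - start) // bucket_seconds] += 1
    return counts


def detect_anomalies(series, window=5, k=3.0):
    """Return indexes whose value is more than k deviations from the window mean."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        # Floor the deviation at 1.0 so a nearly flat history does not
        # turn ordinary noise into alerts.
        sigma = max(pstdev(history), 1.0)
        if abs(series[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies


counts = counts_per_bucket([0, 10, 61, 125], start=0, end=180)
series = [10, 11, 9, 10, 10, 11, 50, 10]
anomalous_buckets = detect_anomalies(series)
```

With the sample series, only the bucket containing the spike to 50 is flagged. The flagged bucket then narrows down which raw logs to examine in the multi-dimensional analysis.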

Pattern Recognition: Anomaly Detection for Log Conversion Metrics

Case Studies

System user behavior log analysis

The behavior logs of the business systems are managed and analyzed on the platform in a unified way. The following figure shows the logs for the user login pattern, with the number of logs in each period of the time window, helping O&M personnel narrow the scope and locate the fault point in the business system.

Exceptions caused by emergencies

Model adaptability

Solution summary

Summary of technical features

DOLA adopts a simple, lightweight, efficient, stable, and scalable technical architecture that is compatible with ES and supports dual engines. It uses ClickHouse, a column-oriented database, which offers advantages in data writing, response time, deployment scale, and high availability. This performance meets customers' log storage needs in various business scenarios.

Summary of performance benefits

Value presentation

Open source project recommendation

Cloudwise has open-sourced FlyFish, a data visualization orchestration platform. By configuring data models, users get access to hundreds of visual graphic components and can build an impressive large-screen visualization for their own business needs with zero coding. FlyFish also provides flexible extension capabilities, supporting component development, custom functions, and global event configuration, to ensure efficient development and delivery in complex scenarios.

If you like our project, please visit the code repository addresses below and give it a Star on GitHub / Gitee; we need your encouragement and support. You can also join the FlyFish project and become a FlyFish Contributor, with 10,000 yuan in cash waiting for you.

GitHub address: https://github.com/CloudWise-OpenSource/FlyFish

Gitee address: https://gitee.com/CloudWise/fly-fish



Origin my.oschina.net/yunzhihui/blog/5585749