GOPS2018 | HUAWEI CLOUD O&M's Best Pairing Leads the New Trend in AIOps



At the 9th Global Operation and Maintenance Conference (GOPS), Cai Xiaogang, Chief Architect of the HUAWEI CLOUD Application Operation and Maintenance Domain, gave a speech on the theme "Huawei's Trinity Exploring the Practice of Key AIOps Technologies". He described how HUAWEI CLOUD O&M uses a combined industry-university-research ("trinity") model to explore key technologies for the cloud management platform, covering four areas: management and control of large-scale Kubernetes container clusters, causal order tracking in serverless environments, root cause analysis (RCA) over multi-source data, and clustering algorithms for black-box analysis of network packets.



As a world-leading cloud computing service provider, HUAWEI CLOUD offers two major O&M services, Application Operation and Maintenance Management (AOM) and Application Performance Management (APM), which give end-to-end performance insight into complex cloud applications. The HUAWEI CLOUD application O&M team has invested continuously in research and development and has made innovative progress in intelligent auto scaling, serverless call tracing, AI-based anomaly detection and RCA, and clustering-based black-box analysis, strengthening intelligent operations (AIOps) capabilities for large-scale cloud applications.

 


On-site sharing by Chief Architect of HUAWEI CLOUD Application Operation and Maintenance Domain

Speech on "Huawei Trinity Exploring the Practice of AIOps Key Technologies"


Management and control of large-scale K8S container clusters


Huawei has verified the management and control of millions of containers in a test environment. The complexity comes from two directions. On the platform side, container clusters bring heterogeneous computing resources, network virtualization, diverse cluster types, and scale-out and scale-in within seconds. On the customer side, application technology stacks are diverse (microservices, serverless, basic component services, and so on). Together these place two requirements on management and control: first, the O&M of the platform itself; second, the O&M of the customer application systems deployed on the platform. To address this, the HUAWEI CLOUD application O&M domain models applications and resources in a targeted design called Inventory modeling. It provides the capabilities of a traditional CMDB and OSLC and maps infrastructure to applications, which makes cross-resource and cross-layer management and association practical.


Auto-scaling decisions for container elastic scaling come from the HUAWEI CLOUD O&M service. In addition to predefined scaling policies, the service implements auto scaling driven by machine learning algorithms, giving complex large-scale applications a smarter option and saving customers as much resource cost as possible.
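The contrast between a predefined (reactive) scaling rule and a forecast-driven one can be sketched as follows. The source does not name the machine learning algorithm in use, so a least-squares linear trend stands in for it here purely as an assumption. The function names, the CPU metric, and the 60% target are hypothetical. The proportional rule has the same shape as the one used by the Kubernetes Horizontal Pod Autoscaler.

```python
import math


def desired_replicas(current_replicas, metric, target, min_r=1, max_r=100):
    """Proportional rule: scale so the per-replica metric approaches the target."""
    wanted = math.ceil(current_replicas * metric / target)
    return max(min_r, min(max_r, wanted))


def linear_forecast(samples, steps_ahead=1):
    """Fit a least-squares line to evenly spaced samples and extrapolate it.

    Stand-in for the unspecified ML model (an assumption, not the source's method).
    """
    n = len(samples)
    if n < 2:
        return samples[-1]
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    slope_num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    slope_den = sum((i - x_mean) ** 2 for i in range(n))
    slope = slope_num / slope_den
    return y_mean + slope * (n - 1 + steps_ahead - x_mean)


# Hypothetical average CPU utilisation (%) per replica over the last four intervals.
recent_cpu = [40, 50, 60, 70]

# Predefined policy: react to the latest observed value.
reactive = desired_replicas(4, recent_cpu[-1], target=60)

# Forecast-driven policy: react to where the load is heading.
predictive = desired_replicas(4, linear_forecast(recent_cpu), target=60)
```

With this rising load the reactive rule asks for 5 replicas while the forecast-driven rule asks for 6, which illustrates why a predictive policy can add capacity before the threshold is actually crossed.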


In addition to the above two points, HUAWEI CLOUD Application Performance Management (APM) and Application Operation and Maintenance Management (AOM) provide out-of-the-box performance data collection, online sensing and computation, anomaly alarms, application topology, and call chain analysis. Combined with HUAWEI CLOUD Performance Testing Service (CPTS), intelligent big data analysis, and other ecosystem services, they deliver end-to-end performance insight for application O&M. A sound cloud-native distributed architecture addresses the performance degradation that massive data volumes and large-scale application deployment would otherwise cause.




Large-scale container application management and control - supporting tools and ecosystem


Causal order tracking in a serverless environment


Serverless lets developers focus on business logic and deploy simply, without attending to infrastructure, which speeds up development. It also means that APM for serverless is a brand-new subfield that needs its own mechanism for tracing and evaluating application performance. HUAWEI CLOUD worked with professors from the University of California to study serverless scenarios in depth, and used the Go language to implement and extend the theory behind the distributed logging system Chariots. The result is GoChariots. In essence, it orders records in a queue before logging, so that log records are appended in causal order.


GoChariots provides causal order tracking for serverless and microservice cloud applications and works across clouds (it is not tied to a specific cloud provider). It can run in replicated mode, so applications that span data centers can communicate with the closest replica, which greatly reduces communication overhead and improves availability and scheduling. Because the SDK sends events to the backend with HTTP POST, there is no constraint on the language a function is written in.
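The idea of holding records in a queue and appending them to the log only in causal order can be illustrated with a small sketch. This is not the GoChariots implementation or its API. The class name, the event fields, and the rule that each event lists the ids of the events it depends on are all assumptions made for illustration.

```python
from collections import defaultdict


class CausalLog:
    """Append-only log that holds a record back until its causes are logged."""

    def __init__(self):
        self.records = []                   # the log, in causal order
        self._logged = set()                # ids already appended
        self._waiting = defaultdict(list)   # missing dependency id -> parked events

    def submit(self, event_id, deps=(), payload=None):
        """Offer an event; it is appended once every id in deps is in the log."""
        ready = [{"id": event_id, "deps": tuple(deps), "payload": payload}]
        while ready:
            ev = ready.pop()
            missing = [d for d in ev["deps"] if d not in self._logged]
            if missing:
                # Park under one missing dependency; re-check when it arrives.
                self._waiting[missing[0]].append(ev)
                continue
            self.records.append(ev)
            self._logged.add(ev["id"])
            # Appending this event may unblock events that were waiting on it.
            ready.extend(self._waiting.pop(ev["id"], []))


# Events arrive out of order: the response is seen before the call that caused it.
log = CausalLog()
log.submit("resp", deps=["call"])
log.submit("call", deps=["req"])
log.submit("req")
order = [r["id"] for r in log.records]
```

Here `order` ends up as `req`, `call`, `resp` even though the events were submitted in the reverse order, because each record waits in the queue until the record it depends on has been appended.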


In addition, for the AWS Lambda environment, HUAWEI CLOUD developed GammaRay, which builds on the third-party open source library Fleece (an AWS instrumentation SDK for Python) and validates the Causal Order Tracking (COT) theory. GammaRay is an extension to X-Ray and applies only to the analysis of invocation relationships in AWS Lambda.


(For details, see Huawei's full paper at IC2E: Tracking Causal Order in AWS Lambda Applications.)

 

RCA analysis and exploration of multi-source data


Root cause analysis (RCA) is an old topic. Individual techniques keep improving, accumulating, and breaking through, but each is still a single tree in the forest. To avoid the problem of the blind men and the elephant, the analysis has to be comprehensive.


In a complex system, a fault triggers a chain reaction, which shows up directly as a fault propagation chain. In this scenario, anomaly detection has to be solved first, and then problem delimitation and localization (root cause analysis, RCA). For anomaly detection, in addition to traditional static threshold comparison, HUAWEI CLOUD has developed dynamic thresholds based on time series analysis, typically with the ARIMA algorithm. In most cases, performance bottlenecks or problems can be found with APM's application topology and transaction analysis. For a more comprehensive analysis, HUAWEI CLOUD worked with professors from European and American universities and Huawei's overseas experts to apply machine learning to call chain data. For example, where a single event is predicted from multiple time series variables, a Hidden Markov Model (HMM) is used, and the engineering implementation combines Inventory data, topology data, and APM call chain data to determine event dependencies and thereby find the fault propagation chain. We are also collaborating on research to validate unsupervised machine learning for real-time stream correlation analysis and early warning over logs and metrics.
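The difference between a static threshold and a dynamic one can be shown with a short sketch. It deliberately does not implement ARIMA. As a simpler stand-in (an assumption, not the method described in the talk), the band is the mean plus or minus k standard deviations of a sliding window of recent values. The function name and all parameters are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def dynamic_threshold_anomalies(series, window=5, k=3.0, min_band=1e-9):
    """Return indices of values outside mean +/- k*std of the preceding window."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(series):
        if len(history) == window:
            mu = mean(history)
            band = max(k * stdev(history), min_band)
            if abs(value - mu) > band:
                anomalies.append(i)
                continue  # keep the outlier out of the baseline
        history.append(value)
    return anomalies


# Hypothetical response times in milliseconds with one obvious spike.
latency_ms = [100, 102, 99, 101, 100, 103, 250, 101, 98]
flagged = dynamic_threshold_anomalies(latency_ms)
```

On this series only index 6 (the 250 ms sample) is flagged. A static threshold would need to be hand-tuned per metric, whereas the band here follows the recent behaviour of the series; a model such as ARIMA replaces the window mean with a forecast that can also capture trend and seasonality.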


Clustering Algorithm to Realize Blackbox Analysis of Network Packets


Business systems commonly use two mainstream data collection approaches for distributed tracing: code-intrusive instrumentation and non-intrusive probes. In addition to these, HUAWEI CLOUD has developed a new non-intrusive method that performs analysis at the call topology level. The data collection tool vProbe supports mainstream application protocols and obtains data by monitoring the network in bypass mode. The data covers only basic performance data, not business or private data (data masking is applied where necessary).


There is a good deal of academic research on black-box analysis, but its engineering implementations fall far short of product-level requirements. HUAWEI CLOUD kept trying new methods. After theoretical analysis and prototype verification, it uses hierarchical clustering to infer the causal paths between services. Accuracy generally reaches 90~95%, and the result is largely consistent with the application topology produced by the white-box method. The difference is that the performance of a single transaction cannot be traced; however, performance situational awareness and bottleneck identification for the application as a whole are sufficient for timely alarms and problem delimitation.
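The source does not say which features or linkage criterion the hierarchical clustering uses, so the following is only a sketch of the general technique under stated assumptions. It assumes that the feature is the delay between a packet arriving at one service and the next packet leaving it for another service, and that a tight cluster of short delays hints at a causal call. It uses single-linkage agglomerative clustering on those one-dimensional delays. The function name, the cut distance, and the sample data are hypothetical.

```python
def single_linkage_clusters(points, cut):
    """Agglomerative clustering of 1-D points with single linkage.

    Starts with one cluster per point and repeatedly merges the two closest
    clusters, stopping when the closest pair is farther apart than cut.
    """
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > 1:
        # For sorted 1-D data the closest clusters are always neighbours, and
        # their single-linkage distance is the gap between their edge points.
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > cut:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


# Hypothetical delays (ms) between "packet into service A" and the next
# "packet from A to B" observed on the wire.
delays_ms = [2.1, 2.3, 2.0, 40.5, 41.0, 2.2, 39.8]
groups = single_linkage_clusters(delays_ms, cut=1.0)
```

On this sample the delays fall into two groups, one around 2 ms and one around 40 ms. Under the stated assumption, the tight short-delay group would be read as A calling B as part of handling the inbound request, while the other group would be treated as unrelated traffic.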


We believe that for cloud computing and its application O&M, relying purely on large numbers of people ("human wave tactics") no longer works, and DevOps, AIOps, and NoOps are the inevitable choices. The road ahead is long, and we look forward to exploring it together with our peers...

