HCIP Study Notes-HUAWEI CLOUD O&M Solution-9

1. Overview of cloud operation and maintenance

1.1 Challenges faced by traditional operation and maintenance methods

  • Operation and maintenance personnel need a high level of skill: configuration is complex, and multiple systems must be maintained at the same time.
  • Indicators cannot be analyzed in correlation. Although many indicators are available, they have to be checked one by one based on operation and maintenance experience.
  • Distributed tracing systems are costly to learn and use, and their stability is poor.

1.2 Requirements for operation and maintenance of cloud architecture

  • As IT architecture continues to evolve, system architecture has become increasingly complex. Operation and maintenance on the enterprise cloud differs markedly from traditional IT operation and maintenance, and operation and maintenance personnel face many new challenges.
  • Development and operation and maintenance are often two independent departments within an enterprise, with clear differences in work and technical direction. This leads to poor communication when they deliver an application project together, which in turn delays the application and significantly reduces corporate efficiency. The entire system architecture therefore needs to evolve continuously. Moving from traditional to automated operation and maintenance breaks down the barriers between operation and maintenance engineers, development engineers, and quality assurance engineers, forming an efficient work system.

1.3 Panorama of HUAWEI CLOUD O&M and Business O&M

  • Users can focus on operation and maintenance of the business layer and spend less effort on daily maintenance of the platform. Huawei is responsible for platform operation and maintenance and provides customers with a stable and reliable cloud platform.
  • The Console is a visual portal through which cloud resource users manage and allocate resources on a daily basis.
  • CES/AOM/APM provide users with a three-dimensional monitoring platform that gives a full view of resource usage on the cloud and the running status of the business, and allows abnormal alarms to be handled in time to keep the business running smoothly.
  • Users can complete tenant business support activities by combining tools such as the Console and the monitoring services CES/AOM/APM.

1.4 Challenges brought about by the diversity and flexibility of enterprise cloud applications

  • With the popularity of microservices, the relationships between applications are becoming more and more complex, and relying on maintenance personnel alone is increasingly unrealistic. Professional software tools are needed to monitor scenarios such as calls between applications from all angles, visually reconstruct the execution path and status of the business, and help locate performance problems and faults quickly.
  • After an application is migrated to the cloud, several questions arise: can microservice dependencies be visualized, what is the end-user experience, how can problems be tracked quickly, and how can scattered logs be correlated and analyzed? HUAWEI CLOUD offers a variety of operation and maintenance services that help operation and maintenance personnel simplify the operation and maintenance process and improve efficiency.

1.5 Panorama of Three-dimensional Operation and Maintenance Solutions for Cloud Applications

  • HUAWEI CLOUD has launched a three-dimensional operation and maintenance solution for cloud applications. It integrates HUAWEI CLOUD's Application Operations Management (AOM), Application Performance Management (APM), and other services to monitor the infrastructure layer, application layer, and business layer in real time and across multiple dimensions. Through technologies such as application and resource alarm correlation, log analysis, intelligent thresholds, distributed call tracing, and mobile app exception analysis, problems can be diagnosed and repaired within minutes, ensuring long-term stable operation of applications.
    • For massive resource monitoring scenarios: AOM monitors applications and cloud resources in real time, collects indicators, logs, events, and other data to analyze application health, and provides alarm and data visualization functions.
    • For massive log management scenarios: LTS provides log collection, real-time query, storage, and other functions to help users handle daily operation and maintenance scenarios such as real-time log collection and query analysis.
    • For performance problem location scenarios: APM provides professional distributed application performance analysis to help operation and maintenance personnel quickly locate problems and performance bottlenecks under a distributed architecture.

2. Open source operation and maintenance tools

2.1 Introduction to Prometheus

  • Prometheus is an open source monitoring tool inspired by Google's Borgmon monitoring system. It was created in 2012 by former Google employees working at SoundCloud, developed as a community open source project, and officially released in 2015. In 2016, Prometheus joined the Cloud Native Computing Foundation (CNCF), becoming its second hosted project after Kubernetes.
  • Monitoring is a key part of observability practice (monitoring, logging, and tracing), and it has changed considerably in the cloud-native era compared with earlier system monitoring. First, microservices and containerization have led to an exponential increase in monitoring objects and indicators. Second, the life cycle of monitoring objects is shorter, so the amount and complexity of monitoring data have also grown exponentially.
  • This calls for a tool that unifies monitoring indicators and the data query language, and Prometheus came into being to meet that need. Prometheus integrates easily with many open source projects and helps us understand the operating status of systems and services. Analyzing the large volume of data it collects can also help us optimize systems and make decisions. It can be used not only in the IT field but in any situation where indicator data needs to be collected.
  • PromQL is the query language Prometheus developed for this kind of labeled time series data. It is completely different from the SQL used to query relational databases.
  • Prometheus can be understood as a time series database, but it is more than that: it covers a whole ecosystem of tools that can be combined with it, together with their capabilities.
  • Prometheus is mainly used to monitor infrastructure, including servers (CPU, memory, etc.), databases (MySQL, PostgreSQL, etc.), and web services. Almost everything can be monitored through Prometheus. Its data is obtained by configuring data sources and establishing connections to them.
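
To make the idea of querying labeled time series more concrete, here is a minimal sketch of how a PromQL instant query could be sent to the Prometheus HTTP API (`/api/v1/query`) and how a vector result could be read. The server address, the `up{job="node"}` query, and the sample response are assumptions for illustration; the sample is hand-written in the shape of an instant-query response and is not real data.

```python
import json
from urllib.parse import urlencode

# Hypothetical server address; replace with a real Prometheus endpoint.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(response_text: str) -> dict:
    """Map each series' label set to its sample value for a vector result."""
    payload = json.loads(response_text)
    if payload["status"] != "success":
        raise ValueError("query failed")
    values = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # the value arrives as a string
        values[labels] = float(value)
    return values


# The label matcher {job="node"} selects series by label rather than by table row.
url = build_instant_query_url(PROMETHEUS_URL, 'up{job="node"}')

# Hand-written sample shaped like an instant-query response (illustrative only).
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "up", "job": "node", "instance": "host-a:9100"},
         "value": [1700000000, "1"]},
    ]},
})

print(url)
print(parse_vector(sample))
```

Selecting by label set instead of by column is the main practical difference from SQL that the notes point to.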

2.2 Prometheus architecture diagram

  • Prometheus is designed for reliability, allowing users to quickly diagnose problems. Each Prometheus server is independent and does not depend on network storage or other remote services.
  • The basic principle is that Prometheus pulls data from exporters, either directly or indirectly through the Pushgateway (if deployed in Kubernetes, service discovery can be used). By default it stores all scraped data locally, cleans and organizes the data according to configured rules, and stores the results as new time series. The collected data has two destinations: alerting and visualization.
  • How the Prometheus components work together:
    • The Prometheus server periodically pulls data from statically configured targets or service-discovered targets.
    • When newly pulled data exceeds the configured memory cache, Prometheus persists the data to disk (or to remote storage, if that is configured).
    • Prometheus evaluates rules periodically, and when a condition is met, it pushes the alert to the configured Alertmanager.
    • When Alertmanager receives alerts, it aggregates, deduplicates, and denoises them according to its configuration, and finally sends out notifications.
    • Data can be queried and aggregated using the API, the Prometheus console, or Grafana.
  • Prometheus has two data collection methods, pull and push:
    • pull: the client side installs one of the many existing exporters and runs it as a daemon. The exporter collects data and responds to HTTP requests by returning metrics. Prometheus accesses the exporter on each node through pull (HTTP GET), and the exporter returns the required data.
    • push: the client (or server) side installs the official Pushgateway, organizes its monitoring data into metrics, and sends them to the Pushgateway. Prometheus then scrapes the Pushgateway on its normal schedule. Note that the Pushgateway is only an intermediate relay.
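
The pull method described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not a real exporter: the metric names and values are invented, only the simplest `name value` line form of the text exposition format is produced, and `scrape` merely plays the role of the Prometheus server issuing an HTTP GET.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical values an exporter might have collected from a host.
METRICS = {"demo_cpu_usage_ratio": 0.42, "demo_open_files": 128}


def render_metrics(metrics: dict) -> str:
    """Render metrics as 'name value' lines, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def parse_exposition(text: str) -> dict:
    """Parse 'name value' lines, skipping blank lines and '#' comment lines."""
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


class MetricsHandler(BaseHTTPRequestHandler):
    """Exporter side: answer GET /metrics with the current samples."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape(url: str) -> dict:
    """Server side: pull the samples from the exporter with an HTTP GET."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return parse_exposition(response.read().decode("utf-8"))


def run_demo() -> dict:
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return scrape(f"http://127.0.0.1:{server.server_address[1]}/metrics")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

In the push method the same samples would first be sent to a Pushgateway, which the server would then scrape in exactly this way.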

2.3 Introduction to Grafana

  • Grafana has six main features:
    • Display method: Fast and flexible client-side charts, with rich dashboard plug-ins offering display methods such as heat maps, line graphs, and charts.
    • Data source: Supports multiple data sources, such as Graphite, InfluxDB, OpenTSDB, Prometheus, and Elasticsearch.
    • Notification reminder: Define different alarm rules for different indicators, evaluate whether an alarm should be triggered, and send a notification.
    • Mixed display: Mix different data sources in the same chart. The data source can be specified on a per-query basis, and custom data sources are also possible.
    • Annotations: Annotate graphs with rich events from different data sources. Hovering over an event reveals its full metadata and tags.
    • Filters: Ad-hoc filters allow the dynamic creation of new key/value filters that are automatically applied to all queries using that data source.
  • A TSDB is a database optimized for time-stamped or time series data, built specifically to handle metrics, events, or measurements that carry a timestamp. Time series data can be metrics or events that are tracked, monitored, downsampled, and aggregated over time, such as server metrics, application performance, network data, and many other types of analytical data.
  • Main components of a Grafana-based monitoring stack:
    • filebeat: collect ftds data
    • metricbeat: collect system resource data
    • logstash: log cleaning
    • influxdb: distributed time series database
    • grafana: data display
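
The downsampling and aggregation mentioned in the TSDB description above can be illustrated with a small sketch. It assumes points are `(timestamp_seconds, value)` pairs and uses a plain average per fixed-width bucket; real time series databases offer many more aggregation functions, and the sample data is invented.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Invented one-minute samples, reduced to five-minute averages.
raw = [(0, 10.0), (60, 20.0), (120, 30.0), (300, 40.0), (360, 60.0)]
print(downsample(raw, 300))  # [(0, 20.0), (300, 50.0)]
```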

2.4 Prometheus+Grafana

  • Fluentd_exporter: log collection, processing, and forwarding.
  • Node_exporter: host data collection.

2.5 Open source operation and maintenance solution architecture

  • Monitoring Kubernetes clusters through Prometheus can support:
    • Node: indicators such as CPU, load, disk, and memory.
    • Status of internal components: such as the running status of kube-scheduler, kube-controller-manager, kubedns/coredns and other components
    • Application: Data indicators such as Deployment status, resource requests, scheduling, and API delays.

3. HUAWEI CLOUD O&M Service

3.1 Cloud Monitoring Service CES

  • Cloud monitoring service mainly has the following functions:
    • Automatic monitoring of cloud resources: The cloud monitoring service does not need to be enabled separately. After resources such as elastic cloud servers are created, monitoring starts automatically. You can go directly to the cloud monitoring service to check the running status of a resource and set alarm rules.
    • Host monitoring: By installing the cloud monitoring service Agent on an Elastic Cloud Server (ECS) or Bare Metal Server (BMS), users can collect ECS or BMS monitoring data in real time at minute-level granularity.
    • Flexible configuration of alarm rules: Alarm rules can be added for multiple cloud service resources at the same time. After an alarm rule is created, it can be modified at any time, and it can be enabled, disabled, or deleted.
    • Real-time notification: If the message notification service is enabled in an alarm rule, users are notified when a status change of a cloud service crosses the threshold set in the rule, so they can follow changes in the operating status of cloud resources in real time.
    • Monitoring panel: Provides a monitoring panel for viewing monitoring data across services and dimensions. It centrally presents the key service monitoring indicators that users care about, giving users an overview of the running status of cloud services and letting them view monitoring details when troubleshooting.
    • OBS dump of monitoring data: The cloud monitoring service retains the raw data of each monitoring indicator for two days, after which the raw data is no longer kept. Users can save raw data to OBS synchronously.
    • Resource grouping: Resource groups let users centrally manage, from a business perspective, the resources involved in their business, such as elastic cloud servers, cloud disks, elastic IPs, bandwidth, and databases. Different types of resources, alarm rules, and alarm history can then be managed by business, which improves operation and maintenance efficiency.
    • Site monitoring: Site monitoring simulates real users' access to remote servers in order to detect problems with the availability and connectivity of those servers.
    • Event monitoring: Event monitoring provides reporting, query, and alarm functions for event-type data. Important business events can be collected into the cloud monitoring service, and an alarm can be sent when an event occurs.

3.1.1 CES monitoring solution

  • Host monitoring is divided into basic monitoring, operating system monitoring and process monitoring
  • Basic monitoring: Monitoring indicators are reported automatically by the ECS, with data collected once every 5 minutes. Indicators such as CPU usage can be monitored. For details, see the list of services that support monitoring.
  • Operating system monitoring: Installing the Agent on ECSs or BMSs provides users with system-level, proactive, and fine-grained server monitoring. Data is collected once every minute. In addition to indicators such as CPU usage, indicators such as memory usage (Linux) are supported. For details, see the list of services that support monitoring.
  • Process monitoring: Monitors the active processes on the host and, by default, collects information such as the CPU and memory consumed by active processes and the number of files they have open.

3.1.2 Host Monitoring

  • Host monitoring function introduction:
    • Various monitoring indicators: After the Agent is installed, the cloud monitoring service provides more than 40 monitoring indicators covering CPU, memory, disk, network, and more, meeting the basic monitoring and operation and maintenance needs of servers.
    • Fine-grained monitoring: After the Agent is installed, Agent-related monitoring indicators are reported once every minute.
    • Process monitoring: Collects the CPU, memory, and number of open files used by currently active processes, so users can understand the resource usage of ECSs or BMSs.

3.1.3 Event Monitoring

  • The difference between custom event monitoring and custom monitoring:
    • Custom event monitoring covers reporting, query, and alarms for discontinuous, event-type monitoring data.
    • Custom monitoring covers reporting, query, and alarms for monitoring data that is collected periodically and continuously.

3.1.4 Alarm function

  • Alarm rules can be created for all monitoring items of the cloud monitoring service.
  • Alarm rules can be created for all resources, resource groups, log monitoring, custom monitoring, event monitoring, and site monitoring.
  • The effective time of an alarm rule can be set, with a customizable effective time period.
  • Alarm notifications can be sent by email, SMS, HTTP, HTTPS, and other methods.
  • Alarm rules can trigger service calls, for example invoking another cloud service (such as FunctionGraph) when a certain type of alarm is detected.
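
As a rough illustration of how a threshold alarm rule with an effective time period might be evaluated, consider the sketch below. It is not the CES implementation: the field names, the "N consecutive periods" condition, and the sample values are assumptions, and the effective window is assumed not to cross midnight.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class AlarmRule:
    metric: str
    threshold: float
    consecutive_periods: int  # assumed: samples in a row that must breach
    effective_start: time     # the rule is only evaluated inside this window
    effective_end: time


def in_effective_window(rule: AlarmRule, now: time) -> bool:
    """True when 'now' falls inside the rule's effective time period."""
    return rule.effective_start <= now <= rule.effective_end


def should_alarm(rule: AlarmRule, recent_values: list, now: time) -> bool:
    """Alarm when the last N samples all exceed the threshold within the window."""
    if not in_effective_window(rule, now):
        return False
    window = recent_values[-rule.consecutive_periods:]
    return (len(window) == rule.consecutive_periods
            and all(value > rule.threshold for value in window))


rule = AlarmRule("cpu_util", 80.0, 3, time(8, 0), time(20, 0))
print(should_alarm(rule, [50, 85, 90, 95], time(12, 0)))  # True
print(should_alarm(rule, [50, 85, 90, 95], time(23, 0)))  # False: outside the window
```

Requiring several consecutive breaches is one common way to avoid alarming on a single spike; a notification step (email, SMS, HTTP callback) would follow a `True` result.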

3.1.5 Monitoring Panel

  • The monitoring panel supports comparing and viewing data from different services and dimensions in one monitoring item, which helps users compare performance data across different cloud services. Before adding a monitoring view, you need to create a monitoring panel.

3.1.6 Case: xx e-commerce platform monitoring

  • E-commerce business has high memory requirements, large data volumes, and heavy data access. It requires fast data exchange and processing and has extremely demanding monitoring requirements.
  • ECS is the core service, and a comprehensive, three-dimensional ECS monitoring system plays a vital role in business stability. The host monitoring function provides system-level, proactive, and fine-grained monitoring for servers, helping the business run smoothly.
  • The website is the entrance to the e-commerce platform. During large-scale shopping festivals such as Double 12 and 618, users on different networks may experience slow page loads and high network latency when accessing the website. Site monitoring can continuously probe the website or the elastic IP of an ECS, and monitor the availability and response time of the service entrance.
  • For services such as RDS, ELB, and VPC used by the e-commerce platform, cloud service monitoring shows the running status of cloud services and the usage of each indicator in real time on the cloud service monitoring page. Alarm rules can be set on monitoring indicators to accurately track the running status of cloud services.
  • The e-commerce business mainly involves HUAWEI CLOUD ECS, CDN, AS, security services, RDS, ELB, OBS, and other services. Using resource grouping to view resource usage, alarm status, and health status and to manage alarm rules from a business perspective greatly reduces the complexity of operation and maintenance and improves its efficiency.

3.2 Cloud Audit Service CTS

  • The log audit module is a core and necessary component of the information security audit function, and an important part of information system security risk management and control for enterprises and institutions. The cloud audit service records operations on a user's cloud service resources by connecting with other services on HUAWEI CLOUD. It records users' operations on cloud service resources and their results in real time, and saves the records to an OBS bucket in real time in the form of event files.
  • The functions of the cloud audit service mainly include:
    • Audit log recording: Records operations initiated by users through the management console or API, as well as operations triggered internally by each service.
    • Audit log query: Supports combined queries in the management console of operation records from the last 7 days, across dimensions such as event type, event source, resource type, filter type, operating user, and event level.
    • Audit log dump: Supports periodic dumping of audit logs to OBS buckets under the Object Storage Service (OBS). When dumping, audit logs are compressed into event files by service.
    • Event file encryption: Supports encrypting event files during the dump process using a key from the Data Encryption Workshop (DEW).

3.2.1 Function Introduction of Cloud Audit Service

  • Event file: An event file is a set of events generated automatically by the system. The cloud audit service generates event files along two dimensions, service and dump cycle, and saves them synchronously to the OBS bucket specified by the user. Normally, all events generated by a single service in a single dump cycle are compressed into one event file, but when the number of events is large, the system adjusts the number of events in each file according to the current load. Event files are in JSON format.
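
The grouping by service and dump cycle described above can be sketched as follows. This is an illustration only: the event field names, the event names, and the 300-second cycle are assumptions, and gzip stands in for whatever compression the service actually uses.

```python
import gzip
import json
from collections import defaultdict

DUMP_CYCLE_SECONDS = 300  # assumed cycle length, for illustration only


def build_event_files(events):
    """Group events by (service, dump cycle) and gzip each group as JSON."""
    groups = defaultdict(list)
    for event in events:
        cycle_start = event["time"] - event["time"] % DUMP_CYCLE_SECONDS
        groups[(event["service"], cycle_start)].append(event)
    return {
        key: gzip.compress(json.dumps(group).encode("utf-8"))
        for key, group in groups.items()
    }


# Hypothetical events; 'time' is seconds since an arbitrary start.
events = [
    {"service": "ECS", "time": 10, "name": "createServer"},
    {"service": "ECS", "time": 200, "name": "deleteServer"},
    {"service": "OBS", "time": 50, "name": "createBucket"},
    {"service": "ECS", "time": 400, "name": "rebootServer"},
]
files = build_event_files(events)
for key in sorted(files):
    print(key, len(json.loads(gzip.decompress(files[key]))))
```

Each key corresponds to one event file: the two ECS events in the first cycle share a file, while the ECS event in the second cycle and the OBS event each get their own.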

3.2.2 Cloud Audit Service - Tracker

  • The management event tracker records management events, that is, operation logs for all cloud resources, such as creation, login, and deletion.
  • The data event tracker records data events, that is, operation logs for data, such as uploads and downloads.
  • The cloud audit service keeps only the events of the past 7 days. To keep events longer, add an OBS dump configuration to the tracker so that events are synchronized to an OBS bucket for long-term storage.

3.2.3 Applicable Scenarios of Cloud Audit Service

  • Compliance Audit:
    • Compliance certification for auditing is usually divided into two parts: the compliance of the platform and resources underlying the customer's business system, for which the cloud service provider is responsible, and the compliance of the customer's own business system.
  • Key Operation Notification:
    • Customers can configure HTTP/HTTPS notifications for their own independent audit system, so that audit logs received by CTS are synchronized immediately to that system for independent auditing.
    • In FunctionGraph, customers can select a certain type of audit log as a trigger (such as file upload) to start a preset workflow (such as file format conversion), thereby simplifying business development and operation and maintenance, or avoiding problems and risks.
  • Data value mining:
    • The audit log contains information such as time, operator, IP address of the operating device, operated resources, and operation details. Currently it includes up to 24 fields, which makes it valuable for data mining.
  • Problem location analysis:
    • The retrieval dimensions provided by the cloud audit service include event type, event source, resource type, filter type, operating user, and event level, and the audit log contains detailed information about the request and response of each operation. This makes it one of the fastest and most effective means of locating problems on the cloud.
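
A combined query over several retrieval dimensions, limited to the 7-day window the notes mention, could look like the sketch below. The field names (`source`, `user`, `level`) and the sample events are hypothetical and do not reflect the actual CTS event schema.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # query window stated in the notes


def query_events(events, now, **filters):
    """Return events from the last 7 days whose fields match every filter."""
    cutoff = now - RETENTION
    return [
        event for event in events
        if event["time"] >= cutoff
        and all(event.get(field) == wanted for field, wanted in filters.items())
    ]


now = datetime(2024, 1, 10, 12, 0)
events = [
    {"time": datetime(2024, 1, 9), "source": "ECS", "user": "alice", "level": "normal"},
    {"time": datetime(2024, 1, 9), "source": "OBS", "user": "bob", "level": "warning"},
    {"time": datetime(2024, 1, 1), "source": "ECS", "user": "alice", "level": "normal"},
]
recent_ecs = query_events(events, now, source="ECS", user="alice")
print(len(recent_ecs))  # 1: the older matching event is outside the 7-day window
```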

3.3 Cloud Log Service LTS

  • Real-time log collection: The cloud log service collects logs in real time. Collected log data is displayed in a simple and orderly manner on the cloud log console, can be queried conveniently and quickly, and can be stored for a long time.
  • Log query and real-time analysis: Collected log data can be queried simply and quickly through keyword query, fuzzy query, and similar methods. This suits real-time log data analysis, security diagnosis and analysis, and operation and customer service systems. For example, indicators such as the number of visits and clicks for a cloud service can be derived, and detailed operational data can be output through log data analysis.
  • Log monitoring and alerting: The cloud log service works with Application Operations Management (AOM) to collect keyword statistics on the log data it stores, so the running status of services can be monitored in real time.
  • Log dump: After the log data of hosts and cloud services is reported to the cloud log service, it is stored for 7 days by default, and the storage time can be set to between 1 and 30 days. Log data that exceeds the storage time is deleted automatically. For log data that needs to be kept longer (log persistence), the cloud log service provides a dump function that can dump logs to Object Storage Service (OBS) or Data Ingestion Service (DIS) for long-term storage.
  • Log data is analyzed in real time in LTS. The results obtained by querying and analyzing logs with SQL statements can be presented in various charts, and multiple statistical charts can be saved to a dashboard.
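
The retention rule in the log dump item (7 days by default, configurable from 1 to 30 days, expired data deleted) can be expressed as a short sketch. The function names and the `(day, message)` representation are assumptions made for illustration.

```python
DEFAULT_RETENTION_DAYS = 7
MIN_RETENTION_DAYS, MAX_RETENTION_DAYS = 1, 30


def validate_retention(days: int = DEFAULT_RETENTION_DAYS) -> int:
    """Accept only storage times in the 1 to 30 day range stated in the notes."""
    if not MIN_RETENTION_DAYS <= days <= MAX_RETENTION_DAYS:
        raise ValueError(f"retention must be 1 to 30 days, got {days}")
    return days


def split_by_retention(logs, today, retention_days):
    """Partition (day, message) logs into (kept, expired) by age in days."""
    kept, expired = [], []
    for day, message in logs:
        (expired if today - day > retention_days else kept).append((day, message))
    return kept, expired


logs = [(1, "old entry"), (5, "recent entry"), (10, "today's entry")]
kept, expired = split_by_retention(logs, today=10, retention_days=validate_retention())
print(kept)
print(expired)
```

Anything that ends up in `expired` would be lost unless it had been dumped to longer-term storage beforehand.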

3.3.1 Basic concepts and operations of cloud logs

  • Log groups are created in one of two ways: by the user (active) or by a cloud service (passive). Creation by a cloud service means that after another HUAWEI CLOUD service is connected to the cloud log service, the system automatically creates a log group and log stream in the cloud log service console, and the running logs of that cloud service are sent to the corresponding log stream.
  • Logs are read and written in units of log streams. A log stream can be specified when writing, so that different types of logs are classified and stored separately. Reading and writing by stream minimizes the number of reads and writes and improves business efficiency. For example, users can write different logs (operation logs, access logs, etc.) into different log streams and, when querying, open the corresponding log stream to view logs quickly.
  • If ICAgent was already installed when using other cloud services, it does not need to be installed again. Before installing ICAgent, ensure that the time and time zone of the local browser are consistent with those of the host. ICAgent can be installed from the host management page of the cloud log service management console. After ICAgent is installed, configure in the log stream the paths of the host logs to be collected. ICAgent then packages multiple logs and sends them to the cloud log service in units of log streams.
  • Log structuring works on a per-log-stream basis. Logs in a log stream are structured through different log extraction methods: logs with a fixed format or a high degree of similarity are extracted and irrelevant logs are filtered out, so that the resulting structured logs can be queried and analyzed using SQL syntax.
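
The structuring step described above (extract fixed-format logs, filter out the rest, then analyze with SQL) can be tried out locally with a regular expression and an in-memory SQLite database. SQLite only stands in for the SQL analysis that LTS offers, and the log format, table name, and sample lines are invented for illustration.

```python
import re
import sqlite3

# Hypothetical access-log format: "<method> <path> <status> <latency_ms>"
LINE_PATTERN = re.compile(r"^(GET|POST) (\S+) (\d{3}) (\d+)$")

raw_logs = [
    "GET /index 200 12",
    "POST /order 500 340",
    "heartbeat ok",  # does not match the fixed format, so it is filtered out
    "GET /index 200 18",
]


def structure(lines):
    """Extract fields from lines that match the fixed format; drop the rest."""
    rows = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            method, path, status, latency = match.groups()
            rows.append((method, path, int(status), int(latency)))
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (method TEXT, path TEXT, status INTEGER, latency_ms INTEGER)"
)
conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", structure(raw_logs))

# Once the logs are structured, ordinary SQL aggregation applies.
summary = conn.execute(
    "SELECT path, COUNT(*), AVG(latency_ms) FROM logs GROUP BY path ORDER BY path"
).fetchall()
print(summary)  # [('/index', 2, 15.0), ('/order', 1, 340.0)]
```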

3.3.2 Log collection and analysis

  • Collected log data can be queried simply and quickly through keyword query, fuzzy query, and similar methods. This suits real-time log data analysis, security diagnosis and analysis, and operation and customer service systems. For example, the number of visits and clicks for a cloud service can be derived, and detailed operational data can be output through log data analysis.
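
The difference between a keyword query and a fuzzy query can be shown with a small sketch. It assumes logs are plain strings split on whitespace and uses shell-style wildcards for the fuzzy case; the actual LTS query syntax may differ, and the sample lines are invented.

```python
from fnmatch import fnmatchcase


def keyword_query(logs, keyword):
    """Keyword query: keep lines that contain the keyword as a whole word."""
    return [line for line in logs if keyword in line.split()]


def fuzzy_query(logs, pattern):
    """Fuzzy query: keep lines where any word matches a wildcard pattern."""
    return [line for line in logs
            if any(fnmatchcase(word, pattern) for word in line.split())]


logs = [
    "ERROR timeout connecting to db",
    "INFO request served",
    "ERROR disk full",
    "WARN timeouts rising",
]
print(keyword_query(logs, "timeout"))  # only the first line
print(fuzzy_query(logs, "time*"))      # first and last lines
```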

3.3.3 Log dump and visual report

  • log dump:
    • Logs can only be dumped from LTS to an OBS bucket in the same region.
    • When configuring an OBS dump, data cannot be written to a bucket that has encryption enabled. Encryption must be disabled on the bucket before continuing.

3.3.4 Applicable scenarios of LTS service

  • Log collection and analysis
    • The log data of hosts and cloud services is inconvenient to consult and is cleared regularly. After the cloud log service collects the logs, the log data is displayed in a simple and orderly manner on the cloud log console, can be queried conveniently and quickly, and can be stored for a long time. Collected log data can be queried through keyword query, fuzzy query, and similar methods. This suits real-time log data analysis, security diagnosis and analysis, and operation and customer service systems. For example, the number of visits and clicks for a cloud service can be derived, and detailed operational data can be output through log data analysis.
  • Reasonably optimize business performance:
    • The performance and service quality of website services (database, network, etc.) are key indicators of user satisfaction. Records of user congestion reveal site performance bottlenecks and prompt site managers to improve website caching strategies, network transmission strategies, and so on, so that business performance is optimized sensibly.
  • Quickly locate network faults:
    • Network quality is the cornerstone of business stability. Reporting logs to the cloud log service ensures that problems can be viewed and located promptly when they occur, helping users quickly locate network faults and collect evidence for network forensics. For example, users can quickly find the cloud servers at the root of a problem, such as cloud servers with excessive bandwidth usage. By analyzing access logs, users can judge whether the business has been attacked or is subject to illegal hotlinking or bad requests, and locate and solve the problems in time.
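
The kind of access-log analysis described above, finding the clients that consume the most bandwidth and the share of bad requests, can be sketched as follows. The record layout `(client_ip, status_code, bytes_sent)` and the sample data are assumptions for illustration.

```python
from collections import Counter


def top_bandwidth_clients(records, limit=1):
    """Sum bytes sent per client and return the heaviest consumers."""
    usage = Counter()
    for client_ip, _status, bytes_sent in records:
        usage[client_ip] += bytes_sent
    return usage.most_common(limit)


def bad_request_ratio(records):
    """Share of requests answered with a 4xx status code."""
    bad = sum(1 for _ip, status, _sent in records if 400 <= status < 500)
    return bad / len(records)


# Hypothetical access records: (client_ip, status_code, bytes_sent)
records = [
    ("10.0.0.1", 200, 500),
    ("10.0.0.2", 200, 90000),
    ("10.0.0.2", 200, 80000),
    ("10.0.0.3", 404, 100),
]
print(top_bandwidth_clients(records))  # [('10.0.0.2', 170000)]
print(bad_request_ratio(records))      # 0.25
```

A sudden rise in either figure is the sort of signal that would prompt a closer look for attacks or hotlinking.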

3.4 Application operation and maintenance management service AOM

  • With the spread of container technology, more and more enterprises use microservice frameworks to develop applications, businesses make greater use of cloud services, and operation and maintenance is also shifting to cloud operation and maintenance services. This poses new challenges for the operation and maintenance of cloud applications.
  • Operation and maintenance personnel need a high level of skill, configuration is complicated, and multiple systems must be maintained at the same time. Distributed tracing systems are costly to learn and use, and their stability is poor.
  • The difficulties of analyzing distributed application problems in cloud scenarios lie mainly in how to visualize the dependencies between microservices, how to improve the application performance experience, how to correlate scattered logs, and how to track problems quickly.
  • Advantages of AOM:
    • Massive log management: High-performance search and business analysis, automatic clustering of related logs, and fast filtering by dimensions such as application, host, file name, and instance.
    • Correlation analysis: Automatically correlates applications and resources layer by layer, analyzes related indicators and alarm data from multiple perspectives such as applications, components, instances, hosts, and transactions, and pinpoints abnormalities directly.
    • Ecological openness: Opens up operation data and operation and maintenance data query interfaces and collection standards, and supports independent development.

3.4.1 AOM Service Architecture

  • Data collection access layer:
    • ICAgent collects data: install ICAgent (plug-in data collector) on the host and report relevant operation and maintenance data through ICAgent
    • API access data: Business indicators can be connected to AOM as custom indicators through the OpenAPI interface or Exporter interface provided by AOM.
  • Transport storage layer:
    • Data transmission: AOM Access is a proxy service used to receive operation and maintenance data. After the operation and maintenance data is received, it will put the data into the Kafka queue, and use Kafka's high throughput capability to transmit the data to the business computing layer in real time.
    • Data storage: The operation and maintenance data is processed by the AOM backend service and written into the database. Cassandra is used to store time series metric data, Redis is used as the query cache, ETCD is used to store AOM configuration data, and ElasticSearch is used to store resources, logs, alarms and events.
  • Business computing layer:
    • AOM provides basic operation and maintenance services such as alarms, logs, monitoring, and indicators, as well as AI services such as anomaly detection and analysis.
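
The flow described above (received data is queued, consumed by the backend, then written to a store chosen by data type) can be sketched in a few lines of Python. This is only an illustration: the in-memory deque stands in for Kafka, and the record fields and the run_pipeline function are hypothetical names, not part of AOM.

```python
from collections import deque

# Data type -> backing store, following the storage description above.
# The type names themselves ("metric", "log", ...) are assumptions.
STORE_FOR_TYPE = {
    "metric": "Cassandra",
    "config": "ETCD",
    "resource": "ElasticSearch",
    "log": "ElasticSearch",
    "alarm": "ElasticSearch",
    "event": "ElasticSearch",
}


def run_pipeline(records):
    """Queue the records (stand-in for Kafka), then group payloads by store."""
    queue = deque(records)            # the access proxy enqueues what it receives
    stored = {}
    while queue:
        record = queue.popleft()      # the backend consumes in arrival order
        store = STORE_FOR_TYPE[record["type"]]
        stored.setdefault(store, []).append(record["payload"])
    return stored


sample = [
    {"type": "metric", "payload": "cpu_usage=0.42"},
    {"type": "log", "payload": "GET /index 200"},
    {"type": "alarm", "payload": "disk usage high"},
]
result = run_pipeline(sample)
print(result)
```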

3.4.2 Application resource management

image.png

  • With the development of cloud computing, moving to the cloud has become the norm. However, managing massive numbers of resources of many types, spread across multiple cloud vendors, has become a difficult problem for enterprises. Application Resource Management (CMDB for short) is a resource management platform based on the DevOps concept that covers the entire application life cycle and manages the relationship between applications and cloud resources.
  • CMDB function list:
    • Resource retrieval: Provide resource retrieval functions such as applications and hosts, and support rapid resource retrieval by ID, keyword, name, etc.
    • Application management: Manage the relationship between cloud service objects and applications, mainly used to manage cloud service applications such as ECS, CCE, and RDS.
    • Resource management: Resource management performs unified management of all kinds of cloud services owned by users. You can globally view the relationship between all cloud service resource objects and applications, including cloud resources that are not bound to applications, which is convenient for users to analyze and manage resources.
    • Environment label: According to the actual usage scenario, add labels to the created application environment, so that users can quickly filter and find application environments with the same attributes.
  • Change Management Service (CMS for short), the automated operation and maintenance platform of AOM, provides atomic operations such as batch script execution, file distribution, and cloud service change. It also supports custom orchestration of these atomic operations, assembling them into jobs and standardized operation and maintenance processes.

3.4.3 Application Monitoring

image.png

  • Application monitoring is a drill-down design layer by layer, and the hierarchical relationship is: application list->application details->component details->instance details->container details->process details. That is, in application monitoring, applications, components, instances, containers, and processes are associated layer by layer, and the relationship between each layer can be directly known on the interface.
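
The layer-by-layer association can be pictured as a nested structure that is walked one level at a time. The sketch below is a minimal illustration; the inventory contents and the drill_down function are hypothetical and do not reflect how AOM stores its data.

```python
# Levels in drill-down order, as listed in the text above.
LEVELS = ["application", "component", "instance", "container", "process"]

# Hypothetical inventory: each level maps a name to its children.
inventory = {
    "shop": {
        "cart-service": {
            "cart-service-0": {
                "cart-container": {"java": {}},
            },
        },
    },
}


def drill_down(tree, path):
    """Follow path level by level; return the trail and the next-level names."""
    if len(path) > len(LEVELS):
        raise ValueError("path is deeper than the monitoring hierarchy")
    trail = []
    node = tree
    for level, name in zip(LEVELS, path):
        node = node[name]             # KeyError if the name is unknown here
        trail.append((level, name))
    return trail, sorted(node)


trail, children = drill_down(inventory, ["shop", "cart-service", "cart-service-0"])
print(trail)
print(children)
```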

3.4.4 Log Management

image.png

3.4.5 Alarm management

image.png

  • The alarm list is a management platform for alarms and events. It supports custom notification actions, so alarm information can be received by email, SMS, and so on, and an abnormality and its root cause can be found as early as possible. Prerequisite for alarm management: ICAgent has been installed on the host.
  • Different graphs can be displayed on the same screen through the dashboard, and resource data can be shown in different graph forms, such as line graphs, digital graphs, and TopN graphs, so that the monitoring data can be understood comprehensively and in depth.
  • The log retrieval function quickly finds the required logs among a large number of logs, and log dumps allow logs to be stored for a long time. By creating log statistics rules, keywords can be counted periodically and metric data generated, so that system performance and business information can be understood in real time. By configuring word segmentation, log content is split into multiple words at the configured delimiters, and the resulting words can be used to search for logs.
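
Word segmentation and periodic keyword statistics can be sketched as follows. The delimiter set, the log format, and the function names are assumptions chosen for illustration; AOM's actual delimiters and rule format are not given in these notes.

```python
import re
from collections import Counter, defaultdict

# Assumed delimiter set: whitespace plus a few punctuation characters.
DELIMITERS = r"[\s,;=:/\[\]()]+"


def segment(line):
    """Split one log line into words at the configured delimiters."""
    return [word for word in re.split(DELIMITERS, line) if word]


def build_index(lines):
    """Inverted index: word -> sorted list of line numbers containing it."""
    index = defaultdict(set)
    for number, line in enumerate(lines):
        for word in segment(line):
            index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def count_per_minute(lines, keyword):
    """Count lines containing keyword, grouped by the 'YYYY-MM-DD HH:MM' prefix."""
    counts = Counter()
    for line in lines:
        if keyword in segment(line):
            counts[line[:16]] += 1
    return dict(counts)


logs = [
    "2023-07-01 10:00:01 ERROR order-service timeout=30",
    "2023-07-01 10:00:02 INFO order-service ok",
    "2023-07-01 10:01:05 ERROR pay-service status=500",
]
index = build_index(logs)
print(index["ERROR"])
print(count_per_minute(logs, "ERROR"))
```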

3.4.6 Case: Applying AOM for daily inspection and problem location

image.png

3.5 Application performance management service APM

image.png

  • In the cloud era, applications under the distributed micro-service architecture are becoming more and more abundant, the number of users is growing explosively, and various application exception problems follow one after another. In the traditional operation and maintenance mode, various indicators on multiple operation and maintenance systems cannot be correlated and analyzed. Operation and maintenance personnel need to troubleshoot application exceptions one by one based on operation and maintenance experience. The efficiency of analyzing and locating problems is low, the maintenance cost is high, and the stability is poor.
  • Application O&M under massive business faces the following two challenges:
    • The relationship between large-scale distributed applications is intricate, and it is difficult to analyze and locate application problems. Application operation and maintenance is faced with the challenge of how to ensure normal application, quickly complete problem location, and quickly find performance bottlenecks.
    • Poor application experience leads to loss of users, O&M personnel cannot perceive and track services with poor experience in real time, and fail to diagnose application abnormalities in time, seriously affecting user experience
  • HUAWEI CLOUD Application Performance Management Service (APM) can help O&M personnel quickly discover application performance bottlenecks and quickly locate the root cause of failures, ensuring user experience.

3.5.1 APM Service Architecture

image.png

  • Data collection: APM can collect multiple indicators such as application data, basic resource data, and user experience data provided by Java probes and Istio grids in a non-invasive manner.
  • There are two main types of application topologies:
    • Single-component topology: It is the topology of a single environment under a single component, and can expand the topology relationship of direct or indirect upstream and downstream components.
    • Global application topology: You can view the global topology of all or some components under this application

3.5.2 APM Probe

image.png

  • The APM probe injects the necessary tracking code for distributed transaction and performance information by intervening in the application code during class loading.
  • Transactions in APM refer to HTTP transactions. When a user purchases a mobile phone in Huawei Mall, the user's computer initiates an HTTP request to the service backend of Huawei Mall; that request is one HTTP transaction. Because the URL is unique, it is used as the name of the transaction. When a service with the probe deployed (Pinpoint, for Java applications) receives an HTTP transaction, the APM system captures the transaction's information and presents it on the APM management plane.
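
The idea of adding tracking code around a handler without editing the handler itself can be illustrated with a Python decorator. This is only an analogy: the APM probe works on Java classes at load time, whereas the traced decorator, the captured list, and the create_order handler below are hypothetical names used for the sketch.

```python
import functools
import time

# Stands in for the data a probe would report to the management plane.
captured = []


def traced(url):
    """Wrap a handler so each call is recorded as a transaction named by URL."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return handler(*args, **kwargs)
            except Exception as exc:
                error = type(exc).__name__   # record the failure, then re-raise
                raise
            finally:
                captured.append({
                    "transaction": url,
                    "latency_s": time.perf_counter() - start,
                    "error": error,
                })
        return wrapper
    return decorator


@traced("/order/create")
def create_order(item):
    # The business code contains no tracking logic of its own.
    return {"item": item, "status": "created"}


response = create_order("phone")
print(captured[0]["transaction"], captured[0]["error"])
```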

3.5.3 Application/Resource Association Analysis

image.png

  • Full Link Topology:
    • Visual topology: APM displays the call relationship and dependency relationship between applications through topology visualization. Topology uses application performance indicators to quantify application performance satisfaction, and uses different colors to identify values in different intervals, so that application performance problems can be discovered and located quickly. As shown in Figure 1, the topology clearly shows the relationship between applications, call data (services, instance indicators), health status and other details.
    • Cross-application call: The topology map supports the call relationship between different application services. When there are service calls between different applications, it can realize the collection of cross-application call relations and display the performance data of the application.
    • Abnormal SQL analysis: The topology map can count and display the key indicators of the database or SQL statement. APM provides views of key indicators such as database and SQL statement call times, response time, and error times. Through these indicator views, you can analyze database performance problems caused by abnormal (slow or failed) SQL statements.
    • JVM indicator monitoring: The topology map can count and display the JVM indicator data of the instance. APM monitors the memory and thread indicators of the JVM operating environment in real time, and quickly finds problems such as memory leaks and thread exceptions.
  • Call chain tracking: APM can monitor all aspects of the call based on the call status of the application, visually restore the execution track and status of the business, and assist in the rapid demarcation of performance and faults.
    • In the call chain query list, click the link of a call chain to view its basic information.
    • On the call chain details page, you can view the complete link information of the call chain, including the call relationship of the local method stack and related remote calls.
  • Transaction analysis: APM displays key indicators such as transaction throughput rate, error rate, and delay through real-time analysis of server business flows, and uses the health indicator Apdex to score applications, which intuitively reflects user satisfaction with applications. When the transaction is abnormal, an alarm is reported. For transactions with poor user experience, the transaction problem location is completed through topology and call chain.
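
Apdex is a published scoring method: with a target response time T, requests at or below T count as satisfied, those between T and 4T as tolerating, and the rest as frustrated, and the score is (satisfied + tolerating / 2) / total. The sketch below applies that general formula; the 0.5 s threshold is an assumed value, and how APM configures its own threshold is not stated in these notes.

```python
def apdex(latencies_s, threshold_s=0.5):
    """Return the Apdex score in [0, 1] for a list of response times in seconds."""
    if not latencies_s:
        raise ValueError("at least one sample is required")
    satisfied = sum(1 for t in latencies_s if t <= threshold_s)
    tolerating = sum(1 for t in latencies_s if threshold_s < t <= 4 * threshold_s)
    # Everything slower than 4 * threshold counts as frustrated and adds nothing.
    return (satisfied + tolerating / 2) / len(latencies_s)


# Two satisfied (0.1, 0.3), two tolerating (0.6, 1.9), one frustrated (2.5).
score = apdex([0.1, 0.3, 0.6, 1.9, 2.5])
print(score)
```

A score close to 1 means most users are satisfied, and a score close to 0 means most requests were slower than four times the target.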

3.5.4 Transaction session monitoring

image.png

  • Track each business transaction in real time, quickly analyze the running status of the transaction and provide diagnostic capabilities
    • Custom transactions: users can define a transaction name for each URL, which makes the health of each transaction easier to understand
    • Rule configuration: Health rules can be configured for each transaction, and an exception will be prompted if the threshold is exceeded
    • Performance tracking: Accurately collect abnormal performance data, compare historical baseline data, and find abnormal methods of applications to improve operation and maintenance efficiency.
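
A health rule with a fixed threshold, combined with a comparison against a historical baseline, might look like the sketch below. The rule shape, the baseline_factor parameter, and the use of a simple mean as the baseline are all assumptions made for illustration.

```python
from statistics import mean


def check_transaction(name, current_ms, history_ms, threshold_ms, baseline_factor=2.0):
    """Return a list of findings for one transaction; empty means healthy."""
    findings = []
    # Rule 1: configured health threshold.
    if current_ms > threshold_ms:
        findings.append(f"{name}: {current_ms} ms exceeds threshold {threshold_ms} ms")
    # Rule 2: deviation from the historical baseline (mean of past samples).
    baseline = mean(history_ms)
    if current_ms > baseline_factor * baseline:
        findings.append(f"{name}: {current_ms} ms is above {baseline_factor}x baseline")
    return findings


slow = check_transaction("/cart/add", 900, [100, 120, 110], threshold_ms=500)
normal = check_transaction("/cart/add", 150, [100, 120, 110], threshold_ms=500)
print(slow)
print(normal)
```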

3.5.5 Precise fault location

image.png

  • Application discovery and dependencies: non-invasively collect application KPI data, and automatically generate dependencies through inter-service interfaces.
  • Application KPI aggregation: Microservice instances are aggregated to the application (the number indicates XX instances), and the KPI data collected from them is automatically aggregated to the application as well.
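
Rolling instance-level KPIs up to the application can be sketched as a simple group-and-sum. The field names and sample numbers below are hypothetical; the sketch only shows the aggregation idea, not APM's data model.

```python
from collections import defaultdict

# Hypothetical per-instance KPI samples.
instances = [
    {"app": "cart", "instance": "cart-0", "calls": 120, "errors": 3, "total_latency_ms": 6000},
    {"app": "cart", "instance": "cart-1", "calls": 80, "errors": 1, "total_latency_ms": 2000},
    {"app": "pay", "instance": "pay-0", "calls": 50, "errors": 0, "total_latency_ms": 5000},
]


def aggregate(samples):
    """Sum instance KPIs per application and derive rate and average values."""
    totals = defaultdict(lambda: {"instances": 0, "calls": 0, "errors": 0, "latency": 0})
    for sample in samples:
        app = totals[sample["app"]]
        app["instances"] += 1
        app["calls"] += sample["calls"]
        app["errors"] += sample["errors"]
        app["latency"] += sample["total_latency_ms"]
    summary = {}
    for name, app in totals.items():
        calls = app["calls"]
        summary[name] = {
            "instances": app["instances"],
            "calls": calls,
            "error_rate": app["errors"] / calls if calls else 0.0,
            "avg_latency_ms": app["latency"] / calls if calls else 0.0,
        }
    return summary


summary = aggregate(instances)
print(summary["cart"])
```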

3.5.6 Case: AOM+APM solution (AOM monitoring, APM positioning)

image.png

3.6 Cloud performance testing service CPTS

image.png

3.6.1 Features of CPTS service

image.png

  • Multi-protocol high concurrency performance test:
    • Standard HTTP/HTTPS/TCP/UDP message content can be quickly customized, and the pressure test traffic can be sent to different tested applications with simple adjustments. According to the actual needs of the tested application, any field of an HTTP/HTTPS/TCP/UDP protocol message can be customized, including setting and editing the HTTP GET/POST method, URL, Header, Body and other fields.
    • The behavior definition of virtual users adapts to different test scenarios. Set a think time to control the interval between requests from the same user, or define multiple request messages in one transaction to set the number of requests each user initiates per second.
    • Custom response result verification gives more accurate request success criteria. Users can configure checkpoints for each request. After the response message is obtained, the response code and the content of the header fields are checked, and only a response that matches the conditions is considered normal.
  • Test task model customization, support complex scene testing:
    • Through the flexible combination of multiple transaction elements and test task stages, it can help users test the performance of applications in multi-operation scenarios and concurrent scenarios.
    • Transactions can be reused by multiple test tasks. For each transaction, multiple test stages can be defined, and the duration and number of concurrent users or the number of stress tests can be defined for each stage to simulate complex scenarios of traffic peaks and troughs.
  • The cost of the cloud performance testing service has two parts: the cost of the resources used (Elastic Cloud Server) and the cost of using the cloud performance testing service itself. The service supports both packages and pay-as-you-go billing.
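
Two of the ideas above, staged load (a duration and a number of concurrent users per stage) and checkpoint verification of the response code and header fields, can be sketched as below. The Stage class, the function names, and the sample plan are hypothetical, and header matching is simplified to an exact, case-sensitive comparison.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    duration_s: int         # how long the stage lasts
    concurrent_users: int   # virtual users active during the stage


def users_at(stages, second):
    """Return the number of concurrent users at a given second of the test."""
    elapsed = 0
    for stage in stages:
        elapsed += stage.duration_s
        if second < elapsed:
            return stage.concurrent_users
    return 0                # the test has finished


def passes_checkpoint(status, headers, expected_status=200, expected_headers=None):
    """A response is normal only if the status and all expected headers match."""
    if status != expected_status:
        return False
    for key, value in (expected_headers or {}).items():
        if headers.get(key) != value:
            return False
    return True


# Warm-up, traffic peak, then trough.
plan = [Stage(60, 100), Stage(120, 1000), Stage(60, 200)]
print(users_at(plan, 90))
print(passes_checkpoint(200, {"Content-Type": "application/json"},
                        expected_headers={"Content-Type": "application/json"}))
```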

3.6.2 One-stop Cloud Performance Test

image.png

  • A million-level high-concurrency engine and full-link bottleneck analysis allow the test cycle to be reduced from weeks to hours

3.6.3 Application Scenario: Business Peak Test

image.png

thinking questions

image.png
image.png

end flowering

Origin blog.csdn.net/GoNewWay/article/details/131528217