Huawei Cloud CCE Cluster Health Center: a cloud-native observability platform with built-in expert operation and maintenance experience

This article is shared from the Huawei Cloud Community post "Huawei Cloud CCE Cluster Health Center of the New Generation of Cloud Native Observable Platform", by Cloud Container Future.

"Kubernetes operation and maintenance is indeed complex. It requires not only an in-depth understanding of various concepts, principles and best practices, but also risk assessment of cluster health, resource utilization, container stability and other aspects. When a cluster fails, we usually need to spend a lot of time analyzing logs and monitoring data to find the root cause of the problem," said the operations director of an IT company.

In recent years, more and more companies have turned to cloud-native architectures based on Kubernetes. As microservices and cloud-native architectures grow more complex, many customers have told us that monitoring and troubleshooting in production is becoming increasingly difficult. Although the CCE cloud-native observability platform provides monitoring, alarms, logging and other functions that make it easier to locate problems, it also implicitly raises the technical bar for operation and maintenance personnel. To free operation and maintenance personnel and developers from arduous fault location and troubleshooting, the CCE service provides cluster health diagnosis capabilities.

CCE cluster health diagnosis gathers the experience of container operation and maintenance experts to provide you with cluster-level health diagnosis best practices. It can conduct a comprehensive check on the health status of the cluster, help you discover cluster faults and potential risks in a timely manner, and provide corresponding repair suggestions for your reference.

Ready to use out of the box: No need to activate, zero dependencies, one-click health diagnosis

As the built-in health expert system of CCE, the cluster health diagnosis function runs independently, without relying on any plug-ins or other services. Users can trigger a cluster health diagnosis with one click, without going through cumbersome activation and configuration.

Figure 1 One-click health diagnosis

Regular inspection: unattended, continuously protecting cluster health

In active operation and maintenance scenarios, such as before and after a cluster upgrade or during business recovery, users can trigger a health diagnosis at any time to keep the business running smoothly. In daily operation and maintenance, however, nobody can watch the screen around the clock. To free customers from this low-value work, health diagnosis supports scheduled inspections. You only need to configure a scheduled task, and the health diagnosis then runs in the background, protecting the health of your cluster and archiving the inspection results regularly for review at any time.

Figure 2 Health check results
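The scheduled inspection idea can be illustrated with a minimal sketch. It is not the CCE implementation: `ScheduledInspection`, `InspectionRecord`, and the `run_diagnosis` callable are hypothetical names, and the fixed-interval policy is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class InspectionRecord:
    """One archived inspection run (hypothetical structure)."""
    started_at: datetime
    result: dict


@dataclass
class ScheduledInspection:
    """Hypothetical scheduler that runs a diagnosis at a fixed interval."""
    run_diagnosis: Callable[[], dict]  # stand-in for the real diagnosis
    interval: timedelta
    next_run: datetime
    archive: List[InspectionRecord] = field(default_factory=list)

    def tick(self, now: datetime) -> bool:
        """Run the diagnosis if it is due; return True when a run happened."""
        if now < self.next_run:
            return False
        # Archive the result so it can be reviewed later.
        self.archive.append(InspectionRecord(now, self.run_diagnosis()))
        # Advance from the planned time so that delays do not accumulate.
        while self.next_run <= now:
            self.next_run += self.interval
        return True
```

A caller would invoke `tick` periodically with the current time. Keeping the clock as an argument makes the behaviour easy to test without waiting for real time to pass.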

Multi-dimensional diagnosis: rich diagnostic items, a comprehensive physical examination for the cluster

CCE cluster health diagnosis distills the high-frequency fault cases provided by operation and maintenance experts. It covers health checks in multiple dimensions (cluster, core plug-ins, nodes, workloads, and external dependencies), and every diagnostic item comes with a risk rating, a description of the impact risk, and remediation recommendations.

  • Cluster dimension: includes diagnostic items such as the cluster operation and maintenance capability check, security group configuration check, and cluster resource planning check.

Figure 3 Cluster dimension diagnostic items

  • Core plug-in dimension: covers health checks of core plug-ins such as monitoring, logging, CoreDNS, and storage.

Figure 4 Core plug-in dimension diagnostic items

  • Node dimension: includes node resource load and node status diagnosis.

Figure 5 Node dimension diagnostic items

  • Workload dimension: includes workload configuration checks, Pod resource load checks, Pod status diagnosis, and so on.

Figure 6 Workload dimension diagnostic items

  • External dependency dimension: mainly includes quota checks for resources such as ECS and cloud disks.

Figure 7 External dependency dimension diagnostic items
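One way to picture the five dimensions above is as a small data model in which each diagnostic item belongs to exactly one dimension. The sketch below is illustrative only: the class and function names are hypothetical and do not come from CCE.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Dimension(Enum):
    """The five diagnostic dimensions named in the article."""
    CLUSTER = "cluster"
    CORE_PLUGIN = "core plug-in"
    NODE = "node"
    WORKLOAD = "workload"
    EXTERNAL_DEPENDENCY = "external dependency"


@dataclass(frozen=True)
class DiagnosticItem:
    """A single check and its outcome (hypothetical structure)."""
    name: str
    dimension: Dimension
    passed: bool


def failed_by_dimension(items: List[DiagnosticItem]) -> Dict[Dimension, List[str]]:
    """Group the names of failed items under their dimension."""
    grouped: Dict[Dimension, List[str]] = defaultdict(list)
    for item in items:
        if not item.passed:
            grouped[item.dimension].append(item.name)
    return dict(grouped)
```

Grouping failures by dimension mirrors how the results are presented per dimension, so an operator can see at a glance which layer of the cluster needs attention.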

Intelligent analysis: intelligent health rating, professional repair suggestions

CCE cluster health diagnosis will give risk levels and provide repair suggestions based on faults and potential risks. Risk levels are divided into high risk and low risk according to the degree of urgency:

  • High risk: the diagnostic item endangers the stability of the cluster or application and may cause business losses; it needs to be fixed as soon as possible.
  • Low risk: the diagnostic item does not comply with cloud-native best practices and carries potential risk, but it will not have a major impact on the business immediately; fixing it is recommended.

After each health diagnosis is completed, all diagnosis results will be aggregated and analyzed, and a final cluster health score will be given, which reflects the overall health status of the cluster. Clusters with low health scores often have a greater risk of failure and require the cluster administrator's attention.

Figure 8 Health risk level assessment
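The article does not state how the health score is calculated. As a rough sketch of the aggregation idea only, the example below starts from 100 and deducts an assumed penalty for each failed item according to its risk level. The penalty values and the names `Risk`, `PENALTY`, and `health_score` are hypothetical.

```python
from enum import Enum
from typing import Iterable, Tuple


class Risk(Enum):
    """The two risk levels described in the article."""
    HIGH = "high"
    LOW = "low"


# Assumed penalties for illustration; the real weighting is not published here.
PENALTY = {Risk.HIGH: 10, Risk.LOW: 2}


def health_score(findings: Iterable[Tuple[Risk, bool]]) -> int:
    """Return a 0-100 score from (risk level, passed) pairs."""
    score = 100
    for risk, passed in findings:
        if not passed:
            score -= PENALTY[risk]
    # Never report a negative score.
    return max(score, 0)
```

With these assumed weights, one failed high-risk item and one failed low-risk item give a score of 88, and a cluster with many high-risk failures bottoms out at 0.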

Case study: A business failure caused by a security group misoperation

CCE is a general-purpose container platform, and its default security group rules are designed for common scenarios. When a cluster is created, a security group is automatically created for the Master node and for the Node node. If a user accidentally changes the rules in a default security group, problems such as node network failures may follow. Such problems are often hard to troubleshoot: it takes a long time to trace the cause back to the security group, which slows business recovery. In this case, the inspection function of the health center can be used for fault diagnosis.

For example, suppose the default security group rules of a cluster are modified so that the communication rule between Master and Node changes from allow to deny.

Figure 9 Modify security group rules

The above change causes some functions of the cluster to behave abnormally, such as the network becoming unavailable and kubectl commands failing to execute.

Problems of this kind are often difficult to troubleshoot, and users can spend a lot of time looking for the root cause. If the user runs a health inspection in the CCE Health Center at this point, a high-risk alert appears for the security group inspection item:

Figure 10 Security group exception prompt

The abnormal security group can be directly located through the diagnosis details to facilitate targeted repairs:

Figure 11 Locating abnormal security group
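To make the idea behind such a security group check concrete, here is a simplified sketch. It does not reflect how CCE or Huawei Cloud security groups actually evaluate rules: the `Rule` and `RequiredFlow` types are hypothetical, sources are plain labels rather than real addresses or group IDs, and a flow is treated as allowed only if at least one allow rule matches and no deny rule matches.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    """A simplified inbound rule (hypothetical model)."""
    source: str  # label such as "master" or "node"
    port: int
    allow: bool  # True = allow, False = deny


@dataclass(frozen=True)
class RequiredFlow:
    """Traffic the cluster is assumed to need in order to work."""
    source: str
    port: int
    purpose: str


def blocked_flows(rules: List[Rule], required: List[RequiredFlow]) -> List[RequiredFlow]:
    """Return the required flows that the given rules do not permit."""
    blocked = []
    for flow in required:
        matching = [r for r in rules
                    if r.source == flow.source and r.port == flow.port]
        has_allow = any(r.allow for r in matching)
        has_deny = any(not r.allow for r in matching)
        # Assumption: any matching deny rule overrides allow rules.
        if not has_allow or has_deny:
            blocked.append(flow)
    return blocked
```

For instance, with a deny rule for source `"master"` on port 10250 (the default kubelet port, used here only as an illustrative value), the function reports the Master-to-Node flow as blocked, which corresponds to the kind of finding the diagnosis details point to.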

The entire fault diagnosis process is convenient and fast, which can greatly reduce troubleshooting time and help customers' businesses run more stably on the CCE cluster.

Conclusion

The CCE cluster health diagnosis function integrates and accumulates a large amount of expert operation and maintenance experience, with the goal of providing customers with more intelligent and faster operation and maintenance capabilities. Currently, this capability is still being rapidly iterated. In the future, we will add capabilities such as inspection result notifications, risk assessment threshold adjustments, and richer diagnostic items to bring you a smarter, more reliable and stable cloud native system.

To try the service, visit:

https://www.huaweicloud.com/product/cce.html


Origin my.oschina.net/u/4526289/blog/10456183