Kubernetes troubleshooting, treatment, and prevention

Troubleshooting approach and order

Step one:

The first thing to check is whether the cluster nodes are healthy and whether the K8s API Server is working; the second is whether any node in the cluster has network problems. If this first step shows that a node is down, we can restart it and the applications on that node will recover. If nodes are going down because of insufficient resources, we must add nodes to the cluster promptly; otherwise, even after a restart, the cluster may go down again.
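A minimal sketch of the checks in this step, assuming a working kubeconfig; the node name is a placeholder and the last command needs metrics-server installed:

```bash
kubectl get --raw='/readyz?verbose'   # is the API Server healthy?
kubectl get nodes -o wide             # NotReady nodes, versions, internal IPs
kubectl describe node <node-name>     # Conditions: Ready, MemoryPressure, DiskPressure
kubectl top nodes                     # current CPU/memory usage per node
```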

Step two:

If the first step turns up no problem with the cluster nodes, the issue may lie with the application itself, and we need to look at the application's own logs, that is, the logs of the Pods involved. There are two things to confirm: first, check whether the Pod's underlying network is reachable and whether the application logs show any errors that occurred at runtime; second, resolve the common problems encountered during startup.
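A sketch of the Pod-level checks, with the namespace, Pod name, and peer IP as placeholders:

```bash
kubectl get pods -n my-namespace -o wide
kubectl describe pod my-app-pod -n my-namespace      # events: scheduling, image pulls, restarts
kubectl logs my-app-pod -n my-namespace              # runtime errors in the application log
kubectl logs my-app-pod -n my-namespace --previous   # log of the last crashed container, if it restarted
# Rough connectivity check, if ping exists in the image:
kubectl exec -it my-app-pod -n my-namespace -- ping -c 3 10.244.1.23
```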

Step three:

Suppose we find no errors in the Pods; then we may need to look at the K8s core components, including the kube-apiserver, kube-scheduler, and kube-controller-manager logs. We can also review the configuration of these core components to see whether anything is wrong there.
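As a sketch, assuming a kubeadm-style cluster where the core components run as static Pods in kube-system (the node-name suffix is a placeholder):

```bash
kubectl -n kube-system get pods -o wide
kubectl -n kube-system logs kube-apiserver-master-1
kubectl -n kube-system logs kube-scheduler-master-1
kubectl -n kube-system logs kube-controller-manager-master-1
# Static Pod manifests, and therefore the component configuration, usually live here:
ls /etc/kubernetes/manifests/
```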

Step four:

The fourth step is to check whether the Service can be accessed. Here we need to look at the DNS or kube-proxy logs; a quick look usually shows whether DNS or kube-proxy is reporting errors. Some people habitually capture packets on a particular Pod or on the node, but that is very inefficient.
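A sketch of the Service-level checks; the Service name and namespace are placeholders, and the label selectors assume a kubeadm-style cluster with CoreDNS:

```bash
kubectl get svc my-service -n my-namespace
kubectl get endpoints my-service -n my-namespace           # empty endpoints usually means a label/selector mismatch
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50  # CoreDNS errors
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50
# Resolve the Service from inside the cluster instead of capturing packets:
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup my-service.my-namespace.svc.cluster.local
```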

Step five:

If the problem still has not been located, we can also take a look at the kubelet on each node. If the kubelet on a node is reporting errors, that node may be in a bad state and we may need to rebuild it.
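A sketch of checking the kubelet on a suspect node, run on the node itself and assuming systemd-managed kubelet with default configuration:

```bash
systemctl status kubelet
journalctl -u kubelet --since "30 min ago" --no-pager | tail -n 100
# With default settings the kubelet exposes a local health endpoint:
curl -s http://localhost:10248/healthz
```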

The five steps above can resolve most of the network, storage, and application exceptions in the cluster. At the core, this depends on the cluster having a monitoring and alerting system, so that we can see directly in the monitoring system which node is abnormal. If our application is built from microservices, we can also check whether one link in the microservice chain is broken.

Common application failures

The first is a Pod that stays in the Pending state after being created, which means the Pod has not been scheduled to a node. In our real-world scenarios this is mostly caused by a lack of resources on the nodes.

The second is port conflicts caused by using hostPort, which prevent the service from starting properly.

The third is a Pod stuck in the Waiting state, which means the Pod has been scheduled to a node but has not started running; use the kubectl describe command to view the details. The most common cause is a failure to pull the image, which is very common in private-cloud environments.

The fourth is related to the application's own code or the parameters it is started with; a typical example is MySQL. I believe most users who deploy MySQL on K8s are likely to run into this. When MySQL is running with an insufficient cache, we need to look at MySQL's error output; the error message will clearly tell you which parameter needs to be increased, and all of this can be found through the kubectl command. If the Pod keeps restarting, we may need to view the logs of the previous container.

Fifth, a Service provides load balancing across multiple Pods of a service. If your Service is not accessible, the first thing to check is whether the Service actually has concrete Pods mounted under it, that is, whether its Endpoints are empty.

Common cluster failures

1, node failures
2, underlying network failures
3, K8s component failures
4, container network and storage failures

Our cluster sometimes runs into situations where, for example, a lack of CPU or memory causes nodes to become abnormal or go down, or the whole cluster's network drops out or flaps.

If a node is abnormal, we need to restart it quickly, and add nodes when resources are insufficient. If nodes flap frequently and cause services to drift between nodes, we should check the kube-controller-manager parameters and configure the relevant flags that set the node status detection period.
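The kube-controller-manager flags below govern how node failures are detected and how fast Pods are evicted; the values shown are illustrative, not recommendations, and newer clusters rely on taint-based eviction, where a Pod's tolerationSeconds plays the role of the eviction timeout:

```bash
kube-controller-manager \
  --node-monitor-period=5s \
  --node-monitor-grace-period=40s \
  --pod-eviction-timeout=5m0s
```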

If a K8s component fails, for example kube-controller-manager or cloud-controller-manager goes down, the key is to look at the logs and restart those components promptly. The core requirement is to keep the critical components highly available, which includes running multiple replicas of them.

Storage and networking are the most problem-prone areas in all of K8s. Like general network problems, they need to be diagnosed layer by layer, from top to bottom, to find out which layer the problem is in.

How to avoid failures

1, monitor cluster resources in real time

Avoid Pod drift caused by a lack of resources; add nodes in time.

2, set appropriate resources for Pods

K8s provides the LimitRange API, which can configure default limit and request values for a namespace.
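An illustrative LimitRange that gives default requests and limits to containers in one namespace; the namespace name and the sizes are placeholders, not recommendations:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-namespace
spec:
  limits:
  - type: Container
    defaultRequest:      # filled in as the request when a container sets none
      cpu: 100m
      memory: 128Mi
    default:             # filled in as the limit when a container sets none
      cpu: 500m
      memory: 512Mi
EOF
```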

At the same time, cluster add-ons (monitoring, alerting, log collection, and so on) also consume resources, and their consumption grows as the cluster expands, so resource limits are necessary for them as well. If you do not set resource limits, Pods will keep getting killed: when resources are short, K8s evicts Pods by priority and kills Pods without resource limits first, and since monitoring components consume more resources as the load grows, a component without an appropriate memory limit will keep being killed and restarted.
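As a quick check, the sketch below lists Pods whose QoS class is BestEffort, i.e. Pods with no requests or limits set, which are the first candidates for eviction under node pressure:

```bash
kubectl get pods -A \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass \
  | grep BestEffort
```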

3, follow community developments

Fix bugs in cluster components promptly.

4, application-level optimization

Try to make applications stateless, and combine this with the K8s Deployment resource object, multiple replicas, and auto-scaling.
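A minimal sketch of a stateless, multi-replica Deployment plus auto-scaling; the names, image, and thresholds are placeholders, and the autoscaler needs metrics-server:

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: nginx:1.25
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
EOF
# Scale on CPU between 3 and 10 replicas
kubectl autoscale deployment my-app --min=3 --max=10 --cpu-percent=70
```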

Since the load on an entire K8s cluster can often be more than we can cope with, we have to find ways to avoid problems. I have summarized the following four points to help developers avoid situations we may run into in the cluster.

The first point is that once we have built a cluster and actually put it into production, we are very dependent on the cluster's monitoring system. We need systems such as Prometheus to help us monitor the real-time state of cluster resources. If an exception occurs on some node, we need to be notified in time.

Developers or operators can easily overlook one thing: suppose we deploy components such as Prometheus or log collection in the cluster; we may ignore the fact that these system components themselves consume part of the system's resources.

As we keep adding nodes, the overhead of these system components grows and they consume more resources. If there are more and more applications and more and more Pods, the system components consume even more; if they are not given enough resources, these critical system components can crash. We need to set appropriate resource limits for them and keep adjusting them as the cluster scales, at the very least making sure that our monitoring and alerting system stays up.

The second point is that K8s quite often exposes security issues; we need to keep track of them in real time and promptly fix problems that may arise in the cluster.

The third point is to optimize at the application layer. When an application has a problem and we want to recover it quickly, one precondition is that the application is stateless and does not depend on any storage. Restoring a stateful application is much more troublesome than restoring a stateless one. For stateless applications, we can simply restore the replica count and take advantage of K8s scheduling, which is a common difference from legacy applications. In this way we can avoid many of the problems we would otherwise run into when using a K8s cluster.

Most importantly, we need to back up the data in the cluster in time. We had a production incident before, and because we could restore the data from backup, the business was not affected.
