K8s Series (10): Observability - Monitoring and Logging

1. Background

Monitoring and logging are essential infrastructure for large-scale distributed systems. Monitoring helps developers see the running state of the system, and logging assists in troubleshooting and diagnosing problems.

In Kubernetes, monitoring and logging are part of the ecosystem rather than core components, so most of these capabilities depend on integrations provided by upper-layer cloud vendors. Kubernetes defines the interface standards and specifications, and any component that implements those interfaces can be integrated quickly.

2. Monitoring

Monitoring types

Let's look at monitoring first. In K8s, monitoring can be divided into four different types:

  1. Resource monitoring

This covers common resource indicators such as CPU, memory, and network, usually reported as absolute values or percentages, and it is the most common form of monitoring. It is the kind of monitoring that conventional systems such as Zabbix and Telegraf can already handle well.

  2. Performance monitoring

Performance monitoring refers to APM monitoring, that is, checking common application performance indicators. These are usually obtained through hook mechanisms at the virtual machine or bytecode execution layer via implicit calls, or through explicit injection at the application layer, and they yield deeper monitoring indicators that are generally used for application tuning and diagnosis. Common examples are the JVM and PHP's Zend Engine: through these hook mechanisms you can obtain indicators such as the number of GCs in the JVM, the distribution of memory across the various generations, and the number of network connections, and use them to diagnose and tune application performance.

  3. Security monitoring

Security monitoring is mainly a set of monitoring policies oriented toward security, such as monitoring for unauthorized access, security vulnerability scanning, and so on.

  4. Event monitoring

Event monitoring is a monitoring method particular to K8s. In the previous lesson I introduced one of K8s's design concepts: state transitions based on a state machine. A transition from one normal state to another normal state produces a Normal event, while a transition from a normal state to an abnormal state produces a Warning event. Warning events are usually what we care about most. Event monitoring means that these Normal or Warning events can be sent to a data platform, analyzed and alerted on there, and the corresponding anomalies then surfaced through DingTalk, SMS, or email, making up for some of the shortcomings of conventional monitoring.

Monitoring interface standards in Kubernetes

In K8s, there are three different interface standards for monitoring. They standardize and decouple the consumption of monitoring data and enable integration with the community. They fall into three main categories.

The first type: Resource Metrics

The corresponding interface is metrics.k8s.io, and its main implementation is metrics-server. It provides resource monitoring, most commonly at the node level, pod level, namespace level, and cluster level. These monitoring indicators can all be obtained through the metrics.k8s.io interface.
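As a concrete illustration, Resource Metrics served through metrics.k8s.io are what the HPA consumes for plain CPU- or memory-based scaling. The following is a minimal sketch only: it assumes a Deployment named web already exists, and the names and thresholds are illustrative.

```yaml
# Sketch: HPA consuming Resource Metrics (served by metrics-server via metrics.k8s.io).
# Assumes a Deployment called "web" exists; names and thresholds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out when average CPU utilization exceeds 60%
```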

The second type: Custom Metrics

The corresponding API is custom.metrics.k8s.io, and its main implementation is Prometheus. It provides both resource monitoring and custom monitoring. The resource monitoring here overlaps with the resource monitoring described above, while custom monitoring refers to metrics the application wants to expose itself, for example the number of online users, or the slow-query count of the MySQL database behind the application. These can be defined at the application layer, exposed as metrics through the standard Prometheus client, and then collected by Prometheus.

Once collected this way, these metrics can also be consumed through the custom.metrics.k8s.io interface standard. That is, if you integrate Prometheus like this, you can use the custom.metrics.k8s.io interface for HPA and other data consumption.
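To make that consumption path concrete, here is a minimal sketch of an HPA scaling on a custom metric served through custom.metrics.k8s.io, assuming a metrics adapter (for example the Prometheus adapter) exposes a per-pod metric. The metric name http_requests_per_second and the target Deployment web are assumptions for illustration.

```yaml
# Sketch: HPA scaling on a custom metric exposed via custom.metrics.k8s.io.
# Assumes a metrics adapter serves a per-pod metric named
# "http_requests_per_second"; all names are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"   # target roughly 100 requests/s per pod on average
```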

The third type: External Metrics

External Metrics is a special category. K8s has become a de facto implementation standard for cloud-native interfaces, and in many cases it runs together with cloud services. For example, an application may use a message queue in front and an RDS database behind it. When consuming data, you sometimes also need to consume the monitoring indicators of these cloud products, such as the number of messages in the message queue, the number of connections on the access-layer SLB, or the number of 200 responses at the SLB layer.

How do we consume them? K8s also defines a standard for this: external.metrics.k8s.io. The main implementations are the providers from the various cloud vendors, through which the monitoring indicators of cloud resources can be consumed. Alibaba Cloud, for example, implements this standard with the Alibaba Cloud metrics adapter, which serves the external.metrics.k8s.io interface.
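The consumption side looks much like the custom-metrics case, only with the External type. A minimal sketch, assuming a cloud vendor's adapter exposes a queue-depth metric; the metric name queue_messages_ready, the selector, and the target Deployment are illustrative assumptions.

```yaml
# Sketch: HPA scaling on an external metric served via external.metrics.k8s.io
# by a cloud vendor's metrics adapter. Metric name and selector are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-external-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: queue_messages_ready
          selector:
            matchLabels:
              queue: worker-tasks
        target:
          type: AverageValue
          averageValue: "30"   # keep roughly 30 pending messages per replica
```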

Prometheus - The Monitoring "Standard" of the Open Source Community

Next, let's look at a common monitoring solution in the open source community: Prometheus. Why do we say Prometheus is the monitoring "standard" of the open source community?

  • First, Prometheus is a graduated project of the CNCF, the cloud-native community, and more and more open source projects use Prometheus as their monitoring standard. Common projects such as Spark, TensorFlow, and Flink all provide standard Prometheus collection interfaces.

  • Second, common databases and middleware also have corresponding Prometheus collection clients. etcd, ZooKeeper, MySQL, and PostgreSQL, for example, all have a Prometheus interface, and where one does not exist, the community provides a corresponding exporter to implement it.

Then let's take a look at the general structure of Prometheus.

[Figure: Prometheus architecture - data collection links, service discovery, alerting, and data consumption]

The figure above shows Prometheus's data collection links, which can be divided into three different types.

  • The first is the push method: data is collected through the Pushgateway; the job pushes its data to the Pushgateway, and Prometheus then pulls the data from the Pushgateway. The main scenario for this method is short-lived tasks. Prometheus's most common collection mode is pull, which creates a problem: if the data's lifetime is shorter than the collection cycle, for example the collection interval is 30s but the task only runs for 15s before finishing, some data may be missed. The simplest way around this is to first push the metrics to the Pushgateway and then let Prometheus pull them from there, so that metrics from short-lived jobs are not lost (a sketch of this pattern follows this list).

  • The second is the standard pull mode, in which Prometheus pulls data directly from the corresponding target task.

  • The third is Prometheus on Prometheus, that is, using another Prometheus instance to synchronize data into this one.
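Here is a minimal sketch of the push pattern for a short-lived task: a Kubernetes Job that pushes a metric to the Pushgateway before it exits, after which Prometheus scrapes the Pushgateway as an ordinary pull target. The Pushgateway address (pushgateway.monitoring.svc:9091) and the metric name are assumptions.

```yaml
# Sketch: short-lived Job pushing a metric to the Pushgateway before exiting.
# The Pushgateway service address and the metric name are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: short-batch-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: curlimages/curl
          command:
            - sh
            - -c
            - |
              # ... do the real (short) batch work here ...
              # then push a completion metric; Prometheus later pulls it from the Pushgateway
              echo "batch_job_last_success_timestamp $(date +%s)" \
                | curl --data-binary @- \
                  http://pushgateway.monitoring.svc:9091/metrics/job/short-batch-job
```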

These are the three collection methods in Prometheus. As for data sources, in addition to standard static configuration, Prometheus also supports service discovery, which means collection targets can be discovered dynamically. In K8s, the Kubernetes service discovery mechanism is the most common: you only need to configure a few annotations and collection tasks are set up automatically, which is very convenient.
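As a sketch of that annotation-driven discovery, the scrape job below uses Prometheus's kubernetes_sd_configs to discover pods and only keeps those annotated with prometheus.io/scrape: "true", optionally overriding the metrics path and port from annotations. It follows the widely used community example configuration and should be adjusted to your cluster.

```yaml
# Sketch: Prometheus scrape job using Kubernetes pod service discovery.
# Pods opt in via the prometheus.io/scrape annotation; path/port annotations are optional.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Allow overriding the metrics path via prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Allow overriding the scrape port via prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```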

For alerting, Prometheus provides an external component called Alertmanager, which can deliver the corresponding alert information by email or SMS. For data consumption, data can be displayed and consumed through upper-layer API clients, through the web UI, and through Grafana.
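For completeness, alerts are defined as Prometheus rules and then routed by Alertmanager to email, SMS, DingTalk, and so on. A minimal rule sketch follows, assuming node_exporter metrics are being collected; the threshold and labels are illustrative.

```yaml
# Sketch: a Prometheus alerting rule; firing alerts are sent to Alertmanager,
# which routes them to email/SMS/DingTalk receivers. Assumes node_exporter metrics.
groups:
  - name: node-alerts
    rules:
      - alert: NodeHighCpuUsage
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage on {{ $labels.instance }} has been above 90% for 10 minutes"
```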

To sum up, Prometheus has the following five characteristics:

  • The first is a simple and powerful integration standard: developers only need to implement the Prometheus client interface standard to expose their data for collection;

  • The second is the variety of collection and export methods: data can be collected through push, pull, or Prometheus on Prometheus;

  • The third is compatibility with K8s;

  • The fourth is the rich plug-in mechanism and ecosystem;

  • The fifth is the boost provided by Prometheus Operator. Prometheus Operator may be the most complex of all the Operators we have seen, but it is also the one that makes the most of Prometheus's dynamic capabilities. If you run Prometheus in K8s, it is recommended to use Prometheus Operator for deployment and operations (a ServiceMonitor sketch follows below).
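For reference, with Prometheus Operator you typically do not edit scrape configuration by hand; instead you declare a ServiceMonitor custom resource and the Operator generates the configuration. A minimal sketch, where the app: web label selector and the port name metrics are assumptions about your Service.

```yaml
# Sketch: ServiceMonitor for Prometheus Operator. The Operator translates this
# into scrape configuration; the label selector and port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-app
  labels:
    release: prometheus   # commonly needs to match the Prometheus CR's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: web            # selects the Service exposing the app's /metrics endpoint
  endpoints:
    - port: metrics       # named port on the Service
      interval: 30s
```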

3. Logging

Log scenarios

Next, let me introduce logging in K8s. First, let's look at the log scenarios. Logs in K8s mainly fall into four scenarios:

  1. Host kernel logs

  • The first is the host kernel log, which can help developers diagnose common problems, such as anomalies in the network stack, for example issues around iptables marks and conntrack-table messages;

  • The second is driver exceptions, which are fairly common in some network solutions and in GPU scenarios;

  • The third is file system anomalies. In the early days, when Docker was not yet mature, overlayfs or AUFS would frequently cause problems, and developers had no good way to monitor and diagnose them; these exceptions can in fact be found in the host kernel log;

  • Finally, there are anomalies that affect the whole node, such as kernel panics or OOMs, which are also reflected in the host kernel log.

  2. Runtime logs

The second category is runtime logs, most commonly the Docker logs, which can be used to troubleshoot problems such as Pod deletions that hang.

  3. Core component logs

The third category is the logs of the core components. In K8s, the core components include external middleware such as etcd, as well as built-in components such as the API server, kube-scheduler, controller-manager, and kubelet. The logs of these components let us see the resource usage on the control plane of the whole K8s cluster and whether the current running state shows any anomalies.

There is also core middleware such as Ingress, which lets us see the traffic of the entire access layer. Through the Ingress log we can do a fairly good analysis of the access-layer applications.

  4. Application logs

Finally, there are the logs of the deployed applications. Through the application log you can view the state of the business layer: for example, are there any 500 responses? Are there any panics? Is there any abnormal or erroneous access? All of these can be checked in the application log.

Log collection

First, let's look at log collection. Divided by collection location, the following three types need to be supported:

[Figure: log collection methods, divided by collection location]

  • The first is host files. This scenario is quite common: the container writes its log files onto the host through something like a volume, the logs are rotated by the host's log rotation policy, and then collected by an agent on the host;

  • The second is log files inside the container. How do we handle them? A common approach is to forward them through a sidecar streaming container to stdout, write from stdout to a corresponding log file, rotate it locally, and then collect it with an external agent (a sketch of this sidecar pattern follows this list);

  • The third is writing directly to stdout, which is a fairly common strategy. Either an agent on the node collects the logs and ships them to the remote end, or they are sent to the remote end directly through a standard API, such as SLS.
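Here is the streaming-sidecar pattern from the second bullet as a minimal sketch: the application writes to a file on a shared emptyDir volume, and a sidecar tails that file to its own stdout so the node-level agent can pick it up. Image choices and paths are illustrative.

```yaml
# Sketch: streaming-sidecar logging. The app writes to a shared volume and the
# sidecar tails the file to stdout for the node agent to collect. Names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "while true; do date >> /var/log/app/app.log; sleep 1; done"]
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
    - name: log-streamer            # sidecar: stream the log file to stdout
      image: busybox
      command: ["sh", "-c", "tail -n+1 -F /var/log/app/app.log"]
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
  volumes:
    - name: app-logs
      emptyDir: {}
```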

In the community, the recommended collection solution is Fluentd. Fluentd runs an agent on each node, and these agents aggregate the data to a Fluentd server, which can then ship it to something like Elasticsearch for display through Kibana, or to InfluxDB for display through Grafana. This is the practice recommended in the community.
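A minimal sketch of the node-level Fluentd agent deployed as a DaemonSet and shipping to Elasticsearch. The image tag, the Elasticsearch address, and the environment variable names follow the fluentd-kubernetes-daemonset image's conventions but should be checked against the version you actually use; RBAC and the Kubernetes metadata filter configuration are omitted for brevity.

```yaml
# Sketch: Fluentd as a per-node DaemonSet agent shipping container logs to Elasticsearch.
# Image tag, ES address, and env var names are assumptions; RBAC is omitted for brevity.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: elasticsearch.logging.svc
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: containers
          hostPath:
            path: /var/lib/docker/containers
```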
