An in-depth look at cloud native: improving Kubernetes observability of the container runtime in practice

  • When it comes to cloud-native observability, probably everyone immediately thinks of OpenTelemetry (OTEL), because the community relies on this standard to push the development of all cluster components in the same direction. OpenTelemetry is able to combine logs, metrics, traces and other contextual information into a single resource. Cluster administrators or software engineers can use this resource to get a view of what is happening in the cluster over a defined period of time. But how does Kubernetes itself leverage this technology stack?
  • Kubernetes is made up of multiple components, some of which are independent, while others are stacked on top of each other. Looking at the architecture from the perspective of the container runtime, the stack from top to bottom is (a quick way to inspect each layer on a node is sketched after this list):
    • kube-apiserver: validates and configures data for the API objects;
    • kubelet: an agent running on each node;
    • CRI Runtimes: Container Runtime Interface (CRI) compliant container runtimes, such as CRI-O or containerd;
    • OCI runtimes: lower-level Open Container Initiative (OCI) runtimes, such as runc or crun;
    • Linux kernel or Microsoft Windows: the underlying operating system.
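  • One quick way to see which concrete implementation is running at each layer on a node is to ask every component for its version. This is only a sketch and assumes a CRI-O/runc based node; swap in containerd or crun as appropriate:
kubectl version        # client and kube-apiserver versions
kubelet --version      # node agent
crictl version         # reports the CRI runtime name and version (CRI-O or containerd)
runc --version         # lower-level OCI runtime; or: crun --version
uname -r               # kernel release of the underlying operating system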
  • This means that if you have a problem running a container in Kubernetes, you can start looking at one of these components. With the increasing complexity of today's cluster architectures, finding the root cause of a problem is one of the most time-consuming operations we face. Even when you know which component is likely causing the problem, you still have to take the others into account. So how do you do this?
  • Most people will probably stick to grabbing logs, filtering them, and assembling them at component boundaries. We also have metrics, but correlating metric values with plain logs makes it even harder to track what is going on, and many metrics were not designed for debugging purposes in the first place.
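  • In practice that often means manually correlating logs from the individual components, for example by pulling the kubelet and CRI-O journals for the same time window and lining them up by hand. A rough sketch, assuming both run as systemd services and that my-pod is the (hypothetical) workload in question:
# pull both component logs for roughly the same time window
journalctl -u kubelet --since "10 minutes ago" | grep my-pod
journalctl -u crio --since "10 minutes ago" | grep my-pod
# ...then align the two outputs by timestamp manually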
  • The result is OpenTelemetry, a project that aims to combine signals such as traces, metrics and logs to maintain a unified view of the cluster's state.
  • What is the current state of OpenTelemetry tracing in Kubernetes? From the API server's perspective, we have had alpha support for tracing since Kubernetes v1.22, which will be promoted to beta in one of the upcoming releases. Unfortunately, the beta graduation missed Kubernetes v1.26. The design proposal can be found in the API Server Tracing Kubernetes Enhancement Proposal (KEP), which provides more information about it.
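  • For reference, enabling the alpha API server tracing roughly boils down to a feature gate plus a small tracing configuration file, as documented upstream. The following is only a sketch; the file path is arbitrary and the API group version may differ between releases:
kube-apiserver --feature-gates=APIServerTracing=true --tracing-config-file=/etc/kubernetes/tracing-config.yaml

cat /etc/kubernetes/tracing-config.yaml
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
endpoint: localhost:4317
samplingRatePerMillion: 1000000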
  • Kubelet tracing is tracked in another KEP, which was implemented in an alpha state in Kubernetes v1.25. A beta graduation is not planned at the time of writing, but more may come in the v1.27 release cycle. There are other efforts beyond the two KEPs; for example, klog is considering OTEL support, which would improve observability by linking log messages to existing traces. Within SIG Instrumentation and SIG Node there are discussions about how to link kubelet traces together, because right now they are focused on the gRPC calls between the kubelet and the CRI container runtime.
  • CRI-O has supported OpenTelemetry traces since v1.23.0 and is continuously working on improving them, for example by attaching logs to the traces or extending the spans to logical parts of the application. This helps users of the traces to get the same information as from the logs, but with enhanced scoping and the ability to filter for other OTEL signals. The CRI-O maintainers are also working on a container monitoring replacement for conmon, called conmon-rs and written purely in Rust. One benefit of the Rust implementation is the ability to add features such as OpenTelemetry support, since libraries for this already exist; this allows tight integration with CRI-O and lets consumers see the lowest level of tracing data from their containers.
  • containerd has had tracing support since v1.6.0, available via a plugin. Lower-level OCI runtimes like runc or crun do not support OTEL at all, and there don't seem to be any plans for it. The performance overhead of collecting traces and exporting them to a data sink always has to be taken into account, but I personally still think that extended telemetry collection at the OCI runtime level would be worth evaluating. Maybe the Rust OCI runtime youki will consider something similar in the future.
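  • For completeness, the containerd tracing plugin can be configured roughly as follows; this is a sketch based on the containerd 1.6/1.7 tracing documentation, and the plugin IDs or fields may change between releases:
cat /etc/containerd/config.toml
version = 2

[plugins."io.containerd.internal.v1.tracing"]
  sampling_ratio = 1.0
  service_name = "containerd"

[plugins."io.containerd.tracing.processor.v1.otlp"]
  endpoint = "localhost:4317"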
  • For the demo shown below, we will stick to a single local node stack consisting of runc, conmon-rs, CRI-O and the kubelet. To enable tracing in the kubelet, the following needs to be configured in its KubeletConfiguration:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletTracing: true
tracing:
  samplingRatePerMillion: 1000000
  • A samplingRatePerMillion equal to one million will internally translate to sampling everything. A similar configuration has to be applied to CRI-O: the crio binary can either be started with the arguments --enable-tracing and --tracing-sampling-rate-per-million 1000000, or a drop-in configuration like this can be used:
cat /etc/crio/crio.conf.d/99-tracing.conf
[crio.tracing]
enable_tracing = true
tracing_sampling_rate_per_million = 1000000
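  • After adding the drop-in, restart CRI-O and double check that the values were picked up; crio config should print the merged configuration (a sketch, assuming CRI-O runs under systemd):
systemctl restart crio
crio config | grep -A 2 "\[crio.tracing\]"    # should reflect the drop-in values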
  • To configure CRI-O to use conmon-rs, you need at least CRI-O v1.25.x and conmon-rs v0.4.0. A drop-in like the following then makes CRI-O use conmon-rs:
cat /etc/crio/crio.conf.d/99-runtimes.conf
[crio.runtime]
default_runtime = "runc"

[crio.runtime.runtimes.runc]
runtime_type = "pod"
monitor_path = "/path/to/conmonrs" # or will be looked up in $PATH
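  • Once CRI-O has been restarted with the drop-ins above and any pod is running, it is easy to sanity check that conmon-rs is actually in use, for example by looking for its process. A sketch with a hypothetical throwaway pod:
kubectl run sleeper --image=alpine -- sleep 3600
ps -C conmonrs -o pid,cmd    # one conmonrs monitor process per running pod should show up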
  • The default configuration of both will point to the OpenTelemetry collector gRPC endpoint localhost:4317, which also has to be up and running. There are multiple ways to run the OTLP collector as described in the docs, but it is also possible to kubectl proxy into an existing instance running within Kubernetes (a minimal collector configuration is sketched after the example output below). If everything is set up, the collector should log incoming traces:
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope go.opentelemetry.io/otel/sdk/tracer
Span #0
    Trace ID       : 71896e69f7d337730dfedb6356e74f01
    Parent ID      : a2a7714534c017e6
    ID             : 1d27dbaf38b9da8b
    Name           : github.com/cri-o/cri-o/server.(*Server).filterSandboxList
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2023-07-07 09:50:20.060325562 +0000 UTC
    End time       : 2023-07-07 09:50:20.060326291 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Span #1
    Trace ID       : 71896e69f7d337730dfedb6356e74f01
    Parent ID      : a837a005d4389579
    ID             : a2a7714534c017e6
    Name           : github.com/cri-o/cri-o/server.(*Server).ListPodSandbox
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2023-07-07 09:50:20.060321973 +0000 UTC
    End time       : 2023-07-07 09:50:20.060330602 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Span #2
    Trace ID       : fae6742709d51a9b6606b6cb9f381b96
    Parent ID      : 3755d12b32610516
    ID             : 0492afd26519b4b0
    Name           : github.com/cri-o/cri-o/server.(*Server).filterContainerList
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2023-07-07 09:50:20.0607746 +0000 UTC
    End time       : 2023-07-07 09:50:20.060795505 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Events:
SpanEvent #0
     -> Name: log
     -> Timestamp: 2023-07-07 09:50:20.060778668 +0000 UTC
     -> DroppedAttributesCount: 0
     -> Attributes::
          -> id: Str(adf791e5-2eb8-4425-b092-f217923fef93)
          -> log.message: Str(No filters were applied, returning full container list)
          -> log.severity: Str(DEBUG)
          -> name: Str(/runtime.v1.RuntimeService/ListContainers)
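  • For reference, a minimal OpenTelemetry collector configuration that produces output like the above could look as follows. This is only a sketch; note that the logging exporter has been renamed to debug in newer collector releases:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  logging:                 # prints received spans to stdout
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging]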
  • You can see that the spans have a trace ID and are usually attached to one another, and events such as logs are also part of the output. In the above case, the kubelet's Pod Lifecycle Event Generator (PLEG) periodically triggers ListPodSandbox RPC calls to CRI-O. These traces can be displayed in, for example, Jaeger; when running the tracing stack locally, a Jaeger instance should be exposed on http://localhost:16686 by default.
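  • One easy way to get such a local Jaeger instance is the all-in-one image, which can also receive the OTLP traffic on port 4317 directly instead of (or behind) a dedicated collector. A sketch based on the upstream Jaeger docs:
# Jaeger UI on 16686, OTLP gRPC receiver on 4317
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 -p 4317:4317 \
  jaegertracing/all-in-one:latest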
  • These ListPodSandbox requests are directly visible in the Jaeger UI:

ListPodSandbox RPC in Jaeger UI

  • That periodic polling is not very exciting, so let's run a workload directly through kubectl:
kubectl run -it --rm --restart=Never --image=alpine alpine -- echo hi
hi
pod "alpine" deleted
  • Looking at Jaeger now, you can see traces for the RunPodSandbox and CreateContainer CRI RPCs, with spans from conmon-rs, CRI-O and the kubelet:

Create container traces in Jaeger UI

  • The kubelet and CRI-O spans are connected to each other to make investigation easier. If you now look closely at the spans, you can see that CRI-O's logs are correctly attached to the corresponding functionality. For example, the container user can be extracted from the traces like this:

CRI-O Traces in Jaeger UI

  • Lower-level conmon-rs spans are also part of this trace. For example, conmon-rs maintains an internal read_loop that handles the IO between the container and the end user; the number of read and written bytes is logged as part of those spans. The same applies to the wait_for_exit_code span, which tells us that the container exited successfully with code 0:

conmon-rs Traces in Jaeger UI

  • Putting all this information together with Jaeger's filtering capabilities makes the whole stack a great solution for debugging container issues. Mentioning the "whole stack" also reveals the biggest downside of the overall approach: compared to parsing logs, it adds a noticeable overhead on top of the cluster setup. Users have to maintain a sink like Elasticsearch to persist the data, expose the Jaeger UI, and possibly keep the performance drawbacks in mind. Still, it remains one of the best ways to improve the observability of Kubernetes.
