ByteDance Open-Sources Kelemetry: A Global Tracing System for the Kubernetes Control Plane


Kelemetry is a tracing system for the Kubernetes control plane developed by ByteDance. It connects the behavior of multiple Kubernetes components from a global perspective and traces the complete life cycle of a single Kubernetes object as well as the interactions between different objects. By visualizing the chain of events inside the Kubernetes system, it makes the system easier to observe, understand, and debug.


Background

In traditional distributed tracing, a "trace" usually corresponds to the internal calls made during a single user request. When the request arrives, the trace starts from a root span, and each internal RPC call starts a new child span. Since the duration of a parent span is usually a superset of the durations of its child spans, a trace can be visualized as a tree or flame graph, where the hierarchy represents the dependencies between components.

In contrast to traditional RPC systems, the Kubernetes API is asynchronous and declarative. To perform an action, a component updates the specification (desired state) of an object on the apiserver, and other components then continuously work to drive the actual state toward the desired state. For example, when we scale a ReplicaSet from 3 to 5 replicas, we update the spec.replicas field to 5; the replicaset controller observes this change and keeps creating new pod objects until the total number reaches 5. When the kubelet observes that a pod has been created for the node it manages, it spawns containers on that node to match the specification in the pod.
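As a toy illustration of this declarative loop (the types and names below are invented for the example and are not taken from the real replicaset controller), a reconciler only compares desired state with observed state and issues create requests until they match:

```go
package main

import "fmt"

// replicaSet is a toy stand-in for the real object: only the desired replica
// count and the pods observed so far.
type replicaSet struct {
	DesiredReplicas int
	Pods            []string
}

// reconcile compares desired state with observed state and creates pods until
// they match. It never calls the kubelet or any other component directly.
func reconcile(rs *replicaSet) {
	for len(rs.Pods) < rs.DesiredReplicas {
		name := fmt.Sprintf("pod-%d", len(rs.Pods))
		rs.Pods = append(rs.Pods, name) // in reality: POST a new Pod object to the apiserver
		fmt.Println("created", name)
	}
}

func main() {
	rs := &replicaSet{DesiredReplicas: 5, Pods: []string{"pod-a", "pod-b", "pod-c"}}
	reconcile(rs) // scaling from 3 to 5 creates two new pods
}
```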

During this process, we never call the replicaset controller directly, and the replicaset controller never calls the kubelet directly. This means that we cannot observe direct causal relationships between components. If one of the original 3 pods is deleted during the process, the replicaset controller creates a different pod along with the two new ones, and we cannot correlate this creation with the scale-up of the ReplicaSet or the deletion of the pod. As a result, the traditional span-based distributed tracing model is hardly applicable to Kubernetes, because the definition of a "trace" or "span" becomes ambiguous.

In the past, individual components have implemented their own internal tracing, usually with one trace per "reconcile" (e.g. kubelet traces only cover the synchronous operations that handle a single pod creation or update). However, no single trace explains the entire process, which leads to islands of observability, since many user-facing behaviors can only be understood by observing multiple reconciles; for example, scaling up a ReplicaSet can only be inferred by observing multiple reconciles of the replicaset controller triggered by ReplicaSet updates or pod readiness updates.

To break down these islands of observability data, Kelemetry collects and connects signals from different components in a component-agnostic, non-intrusive manner, and presents the related data in the form of traces.

Design

Objects as spans

To connect observability data from different components, Kelemetry takes a different approach, inspired by the kspan project: instead of trying to make a single operation the root span, it creates a span for the object itself, and each event that occurs on the object becomes a child span. Furthermore, objects are linked together by their ownership references, so that the spans of child objects become child spans of the parent object. This gives us two dimensions: the tree hierarchy represents the object hierarchy and event scope, while the timeline represents the sequence of events, which often aligns with causality.

For example, when we create a single-pod deployment, the interactions between the deployment controller, the replicaset controller, and the kubelet can be shown in a single trace using data from audit logs and events:

[Figure: a single trace showing the interactions between the deployment controller, replicaset controller, and kubelet]

Traces are typically used to track short-lived requests that last a few seconds, so trace storage implementations may not support traces with long lifetimes or with very many spans, and such traces can cause performance issues in some storage backends. We therefore limit the duration of each trace to 30 minutes by assigning each event to the half-hour time slot it belongs to. For example, an event that occurs at 12:56 is grouped into the object span for the 12:30-13:00 slot.
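As a minimal sketch of this bucketing (the helper below is ours, not Kelemetry's API), the slot an event belongs to can be computed by truncating its timestamp to a 30-minute boundary:

```go
package main

import (
	"fmt"
	"time"
)

// traceSlot truncates an event timestamp to the half-hour slot it falls into;
// all spans of an object within one slot are attached to the same trace.
func traceSlot(t time.Time) time.Time {
	return t.Truncate(30 * time.Minute)
}

func main() {
	event := time.Date(2023, 7, 1, 12, 56, 0, 0, time.UTC)
	fmt.Println(traceSlot(event)) // 2023-07-01 12:30:00 +0000 UTC
}
```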

We use a distributed KV store to hold a mapping from (cluster, resource type, namespace, name, field, half-hour timestamp) to the trace/span IDs created for the corresponding object, which ensures that only one trace is created for each object in each time slot.
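The idea can be pictured with the sketch below; the key fields, the getOrCreate helper, and the in-memory map standing in for the distributed KV store are assumptions for illustration rather than Kelemetry's actual schema:

```go
package main

import (
	"fmt"
	"time"
)

// objectSlotKey identifies an object within one half-hour time slot.
type objectSlotKey struct {
	Cluster   string
	Resource  string
	Namespace string
	Name      string
	Slot      time.Time
}

// spanRef points at the object span created for that key.
type spanRef struct {
	TraceID string
	SpanID  string
}

// traceIndex maps keys to existing object spans so that every producer reuses
// the same span instead of creating a new trace for the same object and slot.
type traceIndex struct {
	kv map[objectSlotKey]spanRef // stand-in for the distributed KV store
}

// getOrCreate returns the existing span for the key, or creates one.
func (idx *traceIndex) getOrCreate(key objectSlotKey, create func() spanRef) spanRef {
	if ref, ok := idx.kv[key]; ok {
		return ref
	}
	ref := create()
	idx.kv[key] = ref // a real distributed KV store would use a compare-and-set here
	return ref
}

func main() {
	idx := &traceIndex{kv: map[objectSlotKey]spanRef{}}
	key := objectSlotKey{
		Cluster: "cluster-a", Resource: "pods", Namespace: "default", Name: "nginx-xxxxx",
		Slot: time.Date(2023, 7, 1, 12, 30, 0, 0, time.UTC),
	}
	ref := idx.getOrCreate(key, func() spanRef { return spanRef{TraceID: "trace-1", SpanID: "span-1"} })
	fmt.Println(ref)
}
```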

Audit log collection

One of Kelemetry's main data sources is the apiserver's audit logs. Audit logs provide rich information about each controller operation, including the client that initiated the operation, the objects involved, the exact latency from the receipt of the request to its completion, and more. In the Kubernetes architecture, each object change triggers its associated controllers to reconcile and cause subsequent object changes, so observing the audit logs associated with object changes helps us understand the interactions between controllers in a chain of events.

Audit logs for the Kubernetes apiserver are exposed in two ways: log files and webhooks. Some cloud providers have implemented their own ways of collecting audit logs, and there has been little progress in the community toward a vendor-neutral approach to configuring audit log collection. To simplify deployment for self-provisioned clusters, Kelemetry provides an audit webhook for receiving native audit events, and also exposes a plugin API for consuming audit logs from vendor-specific message queues.
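For intuition, a minimal audit webhook receiver could look like the following sketch, which decodes the audit EventList that the apiserver's webhook backend posts as JSON; the listen address, path, and log output are placeholders, and the real Kelemetry consumer turns each event into span data instead of logging it:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"

	auditv1 "k8s.io/apiserver/pkg/apis/audit/v1"
)

// handleAudit decodes the EventList posted by the apiserver's audit webhook
// backend; a real consumer would forward each event to the trace pipeline.
func handleAudit(w http.ResponseWriter, r *http.Request) {
	var list auditv1.EventList
	if err := json.NewDecoder(r.Body).Decode(&list); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	for _, ev := range list.Items {
		object := "<none>"
		if ev.ObjectRef != nil {
			object = ev.ObjectRef.Resource + "/" + ev.ObjectRef.Namespace + "/" + ev.ObjectRef.Name
		}
		log.Printf("audit: verb=%s object=%s stage=%s", ev.Verb, object, ev.Stage)
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/audit", handleAudit)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```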

Event collection

When Kubernetes controllers process objects, they emit "events" associated with those objects. These events are displayed when a user runs kubectl describe, and often provide a friendlier description of what the controller is doing. For example, when the scheduler fails to schedule a pod, it emits a FailedScheduling event with a detailed message:

0/4022 nodes are available to run pod xxxxx: 1072 Insufficient memory, 1819 Insufficient cpu, 1930 node(s) didn't match node selector, 71 node(s) had taint {xxxxx}, that the pod didn't tolerate.

Since events are optimized for kubectl describe, Kubernetes does not retain every occurrence; instead, each event object stores the timestamp and count of the latest occurrence. Kelemetry, on the other hand, retrieves events by listing and watching event objects from the apiserver, which only exposes the latest version of each event object. To avoid reporting duplicates, Kelemetry uses several heuristics to decide whether an event should be reported as a span (a sketch of these heuristics follows the list below):

  • Persist the timestamp of the last event processed, and after a restart ignore events earlier than that timestamp. While the order in which events are received is not strictly guaranteed (due to client clock skew, inconsistent latency of the controller-apiserver-etcd round trip, etc.), the delay is relatively small, and this removes most of the duplication caused by controller restarts.

  • Verify that the resourceVersion of the event has changed to avoid duplicate events due to relisting.
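A rough sketch of these two heuristics, assuming a simplified in-memory state (the real implementation persists the watermark so it survives restarts):

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// eventFilter holds the state used by the dedup heuristics described above.
type eventFilter struct {
	lastSeen        time.Time         // watermark: timestamp of the last processed event
	resourceVersion map[string]string // event UID -> last seen resourceVersion
}

func (f *eventFilter) shouldReport(ev *corev1.Event) bool {
	ts := ev.LastTimestamp.Time
	if ts.IsZero() {
		ts = ev.EventTime.Time
	}
	// Heuristic 1: ignore events older than the watermark, e.g. events replayed
	// after a restart. Clock skew makes this approximate but it removes most duplicates.
	if ts.Before(f.lastSeen) {
		return false
	}
	// Heuristic 2: skip events whose resourceVersion has not changed, which happens
	// when the informer relists and redelivers the same event object.
	if rv, ok := f.resourceVersion[string(ev.UID)]; ok && rv == ev.ResourceVersion {
		return false
	}
	f.resourceVersion[string(ev.UID)] = ev.ResourceVersion
	if ts.After(f.lastSeen) {
		f.lastSeen = ts
	}
	return true
}

func main() {
	f := &eventFilter{resourceVersion: map[string]string{}}
	ev := &corev1.Event{
		ObjectMeta:    metav1.ObjectMeta{UID: "uid-1", ResourceVersion: "100"},
		LastTimestamp: metav1.Now(),
		Reason:        "FailedScheduling",
	}
	fmt.Println(f.shouldReport(ev)) // true
	fmt.Println(f.shouldReport(ev)) // false: same resourceVersion
}
```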

Correlating object state with audit logs

When studying audit logs for troubleshooting, what we most want to know is "what changed in this request", rather than "who made this request", especially when the semantics of the various components are unclear. Kelemetry runs a controller that watches object create, update, and delete events and associates audit events with audit spans as they are received. When a Kubernetes object is updated, its resourceVersion field is set to a new, unique value. This value can be used to correlate the update with the corresponding audit log. Kelemetry caches the diff and snapshot of each resourceVersion of an object in a distributed KV store, so that audit consumers can later link to them and every audit log span can include the fields changed by the controller.
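Conceptually, this cache behaves like the sketch below, which stores a snapshot per resourceVersion and reports which top-level fields changed between two versions; the key layout, diff granularity, and in-memory map are illustrative assumptions rather than Kelemetry's storage format:

```go
package main

import (
	"fmt"
	"reflect"
)

// objectKey identifies an object across clusters and namespaces.
type objectKey struct {
	Cluster, Resource, Namespace, Name string
}

// diffCache keeps per-resourceVersion snapshots so that an audit log span can
// later be annotated with "what changed in this request".
type diffCache struct {
	snapshots map[objectKey]map[string]map[string]any // key -> resourceVersion -> object fields
}

// record stores the object state observed at newRV and returns the top-level
// fields that differ from the snapshot recorded at prevRV.
func (c *diffCache) record(key objectKey, prevRV, newRV string, obj map[string]any) []string {
	byRV, ok := c.snapshots[key]
	if !ok {
		byRV = map[string]map[string]any{}
		c.snapshots[key] = byRV
	}
	byRV[newRV] = obj

	var changed []string
	prev := byRV[prevRV]
	for field, val := range obj {
		if !reflect.DeepEqual(prev[field], val) {
			changed = append(changed, field)
		}
	}
	return changed
}

func main() {
	c := &diffCache{snapshots: map[objectKey]map[string]map[string]any{}}
	key := objectKey{"cluster-a", "deployments", "default", "nginx"}
	c.record(key, "", "100", map[string]any{"replicas": 3, "minReadySeconds": 10})
	changed := c.record(key, "100", "101", map[string]any{"replicas": 5, "minReadySeconds": 10})
	fmt.Println(changed) // [replicas]
}
```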

Tracking resourceVersion also helps identify 409 conflicts between controllers. A conflicting request occurs when a client submits an UPDATE request carrying a resourceVersion that is too old because other requests have since changed the object. Kelemetry can group multiple audit logs that refer to the same old resourceVersion, showing the audit request together with the conflicts it subsequently caused as related child spans.

To ensure seamless availability, the controller uses a multi-master election mechanism that allows multiple replicas of the controller to simultaneously monitor the same cluster, ensuring that no events are lost when the controller is restarted.


Front-end trace transformations

In traditional tracing, a span always starts and ends in the same process (usually the same function), so tracing protocols such as OTLP do not support modifying a span after it has completed. Unfortunately, this does not fit Kelemetry, since objects are not running functions and there is no process dedicated to starting or stopping their spans. Instead, Kelemetry finalizes the object span immediately after creating it and writes additional data to child spans, so that each audit log and event becomes a log on a child span rather than on the object span itself.

However, since the end time/duration of an audit log span carries little useful information, the resulting trace view is ugly and wastes space:

[Figure: the raw trace view before transformation]

To improve the user experience, Kelemetry intercepts requests between the Jaeger query frontend and the storage backend, and runs a custom transformation pipeline on the storage backend's results before returning them to the query frontend.

Kelemetry currently supports 4 transformation pipelines:

  • tree: the original trace tree, with the service name, operation name, and other fields simplified

  • timeline: trims all nested pseudo-spans and puts all event spans under the root span, effectively providing an audit log view

  • tracing: non-object spans are flattened into span logs of related objects

[Figure: example output of the tracing pipeline]

  • grouping: on top of the tracing pipeline output, creates a new pseudo-span for each data source (audit/event). When multiple components send their spans to Kelemetry, component owners can focus on their own component's logs and easily cross-check the logs of other components.

Users choose the transformation pipeline by setting the "service name" in the trace search. An intermediate storage plugin generates a new "CacheID" for each trace search result and stores it in a cache KV together with the actual TraceID and the transformation pipeline. When the user views a trace, they pass the CacheID, which the intermediate storage plugin converts back to the actual TraceID, and the transformation pipeline associated with that CacheID is executed.
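The CacheID indirection can be pictured with this sketch; the identifier format and field names are assumptions, and the real plugin keeps entries in the cache KV store rather than a process-local map:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// cacheEntry records which real trace(s) and which transformation pipeline a
// search result refers to.
type cacheEntry struct {
	TraceIDs []string
	Pipeline string // e.g. "tree", "timeline", "tracing", "grouping"
}

// traceCache hands out opaque cache IDs at search time and resolves them again
// when the user opens a trace.
type traceCache struct {
	entries map[string]cacheEntry // stand-in for the cache KV store
}

func (c *traceCache) put(entry cacheEntry) string {
	buf := make([]byte, 8)
	rand.Read(buf)
	id := hex.EncodeToString(buf)
	c.entries[id] = entry
	return id
}

func (c *traceCache) resolve(cacheID string) (cacheEntry, bool) {
	e, ok := c.entries[cacheID]
	return e, ok
}

func main() {
	c := &traceCache{entries: map[string]cacheEntry{}}
	id := c.put(cacheEntry{TraceIDs: []string{"trace-12:00", "trace-12:30"}, Pipeline: "timeline"})
	e, _ := c.resolve(id)
	fmt.Println(id, e.Pipeline, e.TraceIDs)
}
```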


Break through the time limit

As mentioned above, a trace cannot grow indefinitely, as that may cause problems for some storage backends. Instead, we start a new trace every 30 minutes. This leads to a confusing user experience: a deployment rollout that starts at 12:28 is cut off abruptly at 12:30, and the user must manually jump to the next trace to keep following it. To avoid this cognitive overhead, the Kelemetry storage plugin identifies spans with the same object tag when searching traces and stores them under the same cache ID, together with the user-specified search time range. When rendering the trace, all related traces are merged together: object spans with the same object tag are deduplicated and their children are merged. The search time range becomes the trace's clipping range, so the complete story of the object group is displayed as a single trace.
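A simplified version of that merging step might look like the sketch below, where object spans carrying the same object tag across adjacent half-hour traces are deduplicated and their children concatenated; the span structure and tag format are invented for illustration:

```go
package main

import "fmt"

// span is a minimal stand-in for a Jaeger span carrying Kelemetry's object tag.
type span struct {
	ObjectKey string // e.g. "cluster-a/deployments/default/nginx"; empty for event spans
	Children  []*span
}

// mergeObjectSpans deduplicates object spans with the same object tag across
// adjacent half-hour traces and concatenates their children, so the merged
// result reads as one continuous trace. The real plugin additionally clips
// spans to the search time range.
func mergeObjectSpans(spans []*span) []*span {
	byKey := map[string]*span{}
	var merged []*span
	for _, s := range spans {
		if s.ObjectKey == "" {
			merged = append(merged, s)
			continue
		}
		if existing, ok := byKey[s.ObjectKey]; ok {
			existing.Children = append(existing.Children, s.Children...)
			continue
		}
		byKey[s.ObjectKey] = s
		merged = append(merged, s)
	}
	return merged
}

func main() {
	first := &span{ObjectKey: "cluster-a/deployments/default/nginx", Children: []*span{{}}}
	second := &span{ObjectKey: "cluster-a/deployments/default/nginx", Children: []*span{{}, {}}}
	out := mergeObjectSpans([]*span{first, second})
	fmt.Println(len(out), len(out[0].Children)) // 1 3
}
```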

Multi-cluster support

Kelemetry can be deployed to monitor multiple clusters. At ByteDance, Kelemetry produces 8 billion spans per day (not counting pseudo-spans), using a multi-raft cache backend instead of etcd. Objects can be linked to parent objects in other clusters, enabling tracing across cross-cluster components.

Future enhancements

Using a custom trace source

Audit logs and events alone are not comprehensive enough to truly connect all the observation points in the Kubernetes ecosystem. Kelemetry will collect traces from existing components and integrate them into the Kelemetry trace system to provide a unified and specialized view of the entire system.

Batch analysis

Answering questions such as “how long did it take to go from a deployment upgrade to the first image pull” becomes easier with Kelemetry’s aggregated traces, but we still lack the ability to aggregate these metrics at scale to provide overall performance insight. By analyzing Kelemetry’s trace output every half hour, we could identify patterns in a series of spans and attribute them to different scenarios.

Use Cases

1. replicaset controller failure

A user reported that a deployment kept creating new Pods. We can quickly find its Kelemetry trace by the deployment name and analyze the relationship between the replicaset and the Pods it created.

[Figure: trace showing the replicaset and the Pods it created]

Several key points can be seen from the trace:

  • The replicaset controller emits SuccessfulCreate events, indicating that the Pod creation requests returned successfully and were acknowledged by the replicaset controller in its replicaset reconcile.

  • There are no replicaset status update events, which means that either the pod reconcile in the replicaset controller failed to update the replicaset status, or these pods were never observed.

Also, look at the trace for one of the Pods:

[Figure: trace of one of the Pods]

  • The Replicaset controller never interacted with the Pod after it was created, not even a failed update request.

Therefore, we can conclude that the Pod cache in the replicaset controller was likely inconsistent with the actual Pods stored on the apiserver, and we should look into performance or consistency issues of the pod informer. Without Kelemetry, locating this issue would require looking through the audit logs of individual Pods across multiple apiserver instances.

2. Floating minReadySeconds

Users found that a rolling update of a deployment was very slow, taking several hours from 14:00 to 18:00. Without Kelemetry, the first step is to inspect the object with kubectl, which shows the minReadySeconds field set to 10, so the hours-long rolling update is unexpected. The kube-controller-manager log shows that it took an hour for a Pod to become Ready:

[Figure: kube-controller-manager log showing a Pod taking an hour to become Ready]

Further inspection of the kube-controller-manager logs reveals that at some point minReadySeconds has a value of 3600.

[Figure: kube-controller-manager log showing minReadySeconds with a value of 3600]

Debugging with Kelemetry instead, we can find the trace directly by the deployment name and see that a federation component increased the value of minReadySeconds:

[Figure: trace showing the federation component increasing minReadySeconds]

Later, the deployment controller restored the value to 10:

[Figure: trace showing the deployment controller restoring minReadySeconds to 10]

Therefore, we can conclude that the problem was caused by the large minReadySeconds value temporarily injected by the user during the rolling update. Problems caused by unexpected intermediate states can be easily identified by inspecting the object diffs in the trace.

Try Kelemetry

Kelemetry has been open sourced on GitHub: https://github.com/kubewharf/kelemetry

Follow the docs/QUICK_START.md quickstart guide to try out how Kelemetry interacts with your components, or, if you don't want to set up a cluster, check out the online preview built by the GitHub CI pipeline: https://kubewharf.io/kelemetry/trace-deployment/

Join us

The Volcano Engine Cloud Native Team is mainly responsible for building the PaaS product portfolio for Volcano Engine's public cloud and private cloud scenarios. Drawing on ByteDance's years of experience with the cloud-native technology stack and its accumulated best practices, the team helps enterprises accelerate digital transformation and innovation. Its products include container services, image registries, distributed cloud-native platforms, function services, service mesh, continuous delivery, observability services, and more.


Origin blog.csdn.net/ByteDanceTech/article/details/131566541