Mainstream Enterprise Full-Link Monitoring Systems: OpenTelemetry (Part 1)


1. Observability

Before formally introducing OpenTelemetry, we first need to understand what observability is.

Management guru Peter Drucker once said: “If you can’t measure it, you can’t manage it.” In an enterprise, whether you are managing people, things, or systems, you first need to measure. The process of measurement is actually the process of collecting information. Only with enough information can we make correct judgments, and with correct judgments can we make effective management and action plans.

Illustration: See the appearance through observation, locate the problem through judgment, and solve the problem through optimization.

Observability describes the continuity and efficiency of the "observe → judge → optimize → observe again" closed loop. If there are only observations, but no judgment can be made from them, we cannot say we have observability. If there is only empirical judgment without data support, it is not observability either; that makes the organization highly dependent on individual capability and introduces management risk. If optimizations cannot be fed back into observation, or if an optimization introduces new technology that cannot be observed, then observability is unsustainable. And if closing the observe-judge-optimize loop requires high cost and high risk, the value of observability becomes negative.

Therefore, when we talk about observability, what we really care about is the experience of the observers and managers: when a problem arises, can we find the answer on the observation platform easily, without friction or confusion? That is observability. As an enterprise grows, its organizational structure (roles, observers) and the objects it manages (systems, the observed) evolve as well. When a pile of traditional observation tools can no longer satisfy the new demands of observers and managers, we cannot help but ask: "Where is the observability?"

"Observable" does not equal "observability"
. Now, let's take a look at the observation methods we are accustomed to.

Illustration: Traditional observation tools are vertical, and observers need to make judgments from multiple tools.

Usually we build observation tools based on the data we want, such as:

  • When we want to understand the health status of our infrastructure, we will naturally think of building a dashboard to monitor various indicators in real time.
  • When we want to understand how business problems occur, we will naturally think of building a log platform to filter and check business logs at any time.
  • When we want to understand why transactions have high latency, we will naturally think of building a link monitoring platform to query topology dependencies and the response time of each node.

This model works well and has helped us solve many problems, so we never doubt our observability and remain full of confidence. Occasionally, when we run into a big problem, we open the dashboard, the log platform, and the tracing platform; all the data is there, and we firmly believe we can find the root cause. Even if it takes a long time, we just tell ourselves that we need to learn more about the system we are responsible for, and that next time we will surely find the root cause faster. After all, when all the data we want is right in front of us, what reason do we have to blame the observation tools?


Illustration: When an anomaly is spotted in the metrics, we often have to construct complex log query conditions in our heads, which is time-consuming and error-prone.

We tirelessly search the various metrics for possible correlations. Once we get a key clue, we construct a pile of complex log queries in our heads to verify our conjecture. Comparing, guessing, and verifying like this while switching between tools is, admittedly, quite "fulfilling".

Illustration: When the system scale is huge, people can no longer locate the problem.

Traditional systems were relatively simple, and the approach above was effective. The keywords of modern IT systems, however, are distributed, pooled, big data, zero trust, elastic, fault-tolerant, cloud native, and so on; systems are becoming larger, more sophisticated, more dynamic, and more complex. Relying on people to correlate all this information and then judge and optimize based on experience is clearly no longer feasible: it is time-consuming, labor-intensive, and often fails to find the root cause.

The most critical thing here is to solve the problem of data association: leave the comparing and filtering that previously required people to the program, because programs are best at this kind of work and the most reliable at it, while human time goes into judgment and decision-making. In a complex system, the time saved is magnified many times over. This is the visible future of observability.

Illustration: Future observation tools need to relate data through time and context

Achieving observability requires several kinds of supporting data, the three most important of which are as follows:

  • Log: Record information about various discrete events that occur at a specific time.
  • Metrics: Measurement data that describes changes in software systems or components over time, such as CPU utilization of microservices, etc.
  • Tracing: Describes the dependencies between parts of a distributed system.

So, how do we do data association? It is easy to say: correlate the data in time and in space. On a unified data platform, the data comes from many different observation tools; even if we unify the format into metrics, logs, and traces, the metadata produced by different tools is completely different. Sorting out and mapping all of that metadata on the unified platform itself would be complex, hard to maintain, and unsustainable. So what do we do? The answer is standardization. Only by feeding standardized, structured data to the observation platform can the platform extract real value from it. A unified data platform only standardizes the data format; to associate traces, metrics, and logs, we also need to standardize the context. Context is the spatial information of the data; superimpose the time dimension on top of that association, and the real value of observation emerges.
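
As a concrete illustration of "context as the glue" (not part of the original article), here is a minimal Go sketch: the tracer and function names are made up, and a real SDK must be installed for the IDs to be non-zero. The trace and span IDs carried in the request context are stamped onto every log line, so the log platform can later join the logs with the corresponding trace.

    package demo
    
    import (
        "context"
        "log"
    
        "go.opentelemetry.io/otel"
    )
    
    // handleOrder is assumed to be called with a context that already carries
    // an active span (for example, injected by HTTP middleware).
    func handleOrder(ctx context.Context) {
        _, span := otel.Tracer("demo").Start(ctx, "handleOrder")
        defer span.End()
    
        sc := span.SpanContext()
        // Stamp the trace/span IDs onto the log line so logs and traces
        // can be joined later on the observation platform.
        log.Printf("processing order trace_id=%s span_id=%s", sc.TraceID(), sc.SpanID())
    }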

And so, here it comes: OpenTelemetry (hereinafter OTel) is precisely the project that solves this data standardization problem.

For the above content, please refer to "Understanding Observability and Opentelemetry in One Article"

2. OpenTelemetry is coming

  • OpenCensus: initiated by Google, it was originally Google's internal tracing platform and was later open sourced.
  • OpenTracing: hosted by the CNCF, it has a relatively complete set of instrumentation libraries.

These two projects each had something special. OpenTracing defined a unified, platform-agnostic tracing standard; many later projects, such as Jaeger, were built on this protocol, so it had considerable influence in the tracing-standard space at the time. OpenCensus, backed by Google, implemented not only tracing but also metrics, and shipped a whole set of components such as the Agent and Collector, making it quite complete.

At the time, the two camps each had a sizable following: on one side OpenCensus, led by Google and Microsoft; on the other OpenTracing, adopted by many open source projects and vendors. Each had its own strengths and weaknesses, and each led the field in its own way. Until one day…

After a period of development, and in order to combine the advantages of OpenCensus and OpenTracing, the famous OpenTelemetry was born!
OpenTelemetry was, so to speak, born with a silver spoon in its mouth: from day one it inherited the experience and rich communities of OpenTracing and OpenCensus, and it is backed by the Internet giants. Since OpenTelemetry was born, OpenTracing and OpenCensus are no longer maintained.

OpenTelemetry's positioning is very clear: unify data collection and data standards. The project itself deliberately stays out of how the data is used, stored, displayed, or alerted on.

What OpenTelemetry wants to solve is the unification of observability. It provides a set of APIs and SDKs that standardize how telemetry data is collected and transmitted. OpenTelemetry does not try to rewrite every component; instead it reuses existing industry standards and the commonly used tools of each major field as much as possible, providing secure, vendor-neutral protocols and components that can be assembled into a pipeline and send data to different backends as needed.

3. Architecture and core concepts


  • receivers: define how the Collector receives data from clients; many data models and protocols are supported.
  • processors: process the data coming from receivers, for example batching or performance analysis.
  • exporters: export the processed data to a specific backend, for example storing metrics data in Prometheus.
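
The three stages above describe the Collector's pipeline, which is configured in the Collector's own configuration file. The application-side SDK mirrors the same processor/exporter idea; as a hedged sketch (package paths from the official Go SDK, with the default endpoint localhost:4317 assumed to point at a Collector's OTLP receiver):

    package telemetry
    
    import (
        "context"
        "log"
    
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )
    
    func initTracerProvider(ctx context.Context) *sdktrace.TracerProvider {
        // Exporter: speaks OTLP over gRPC, by default to localhost:4317,
        // where a Collector's OTLP receiver is assumed to be listening.
        exp, err := otlptracegrpc.New(ctx)
        if err != nil {
            log.Fatalf("create OTLP exporter: %v", err)
        }
    
        // WithBatcher wraps the exporter in a batch span processor, the
        // SDK-side analogue of the Collector's batch processor stage.
        tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
        otel.SetTracerProvider(tp)
        return tp
    }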

OTLP protocol: OTLP is a core part of OpenTelemetry. It defines the encoding, transport, and delivery mechanism for telemetry data between instrumented sources, the Collector, and backends.

  • Specification: It defines the API/SDK and data model. All third-party implementations need to follow the specifications defined by the spec.
  • Semantic Conventions: The OpenTelemetry project ensures that all instrumentation (regardless of language) contains the same semantic information
  • Resource: a set of key-value pairs attached to all traces generated by a process, specified during initialization and passed to the Collector.
  • Baggage: annotation information added to metrics, logs, and traces; its key-value pairs must be unique and cannot be changed.
  • Propagators: in cross-process calls, for example, a propagator carries the SpanContext across service boundaries.
  • Attributes: essentially tags that attach metadata to a span. SetAttributes can be called multiple times: a key that does not exist yet is added, and an existing key is overwritten.
  • Events: similar to log entries, for example embedding the request body and response body in a trace.
  • Collector: It is responsible for the reception, processing and export of telemetry data sources, and provides a vendor-independent implementation.
  • Data sources: Data sources provided by OpenTelemetry, currently including:
    • Traces
    • Metrics
    • Logs
    • Baggage


Let’s talk about these concepts in detail:

API

The API is the part of OpenTelemetry used to define and capture telemetry data. The portion of an OpenTelemetry client that is imported by third-party libraries and application code is considered part of the API.

SDK

The SDK is the implementation of the API provided by the OpenTelemetry project. Within an application, the SDK is installed and managed by the application owner. Note that the SDK includes additional public interfaces that are not considered part of the API package, because they are not cross-cutting concerns. These public interfaces are called constructors and plugin interfaces (a short sketch of this separation follows the list below):

  • Application owners use the SDK constructors.
  • Plugin authors use the SDK plugin interfaces.
  • Instrumentation authors must not directly reference any SDK package of any kind, only the API.
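
A minimal Go sketch of this separation, assuming the official Go packages (the library and span names are made up): library code imports only the API, while the application constructs and installs the SDK.

    package main
    
    import (
        "context"
    
        "go.opentelemetry.io/otel"                    // API: safe for libraries to import
        sdktrace "go.opentelemetry.io/otel/sdk/trace" // SDK: only the application imports this
    )
    
    // libraryWork is instrumentation-author code: it talks to the API only.
    // If the application never installs an SDK, these calls are no-ops.
    func libraryWork(ctx context.Context) {
        _, span := otel.Tracer("my-lib").Start(ctx, "libraryWork")
        defer span.End()
    }
    
    func main() {
        // Application-owner code: construct the SDK and install it globally.
        tp := sdktrace.NewTracerProvider()
        defer func() { _ = tp.Shutdown(context.Background()) }()
        otel.SetTracerProvider(tp)
    
        libraryWork(context.Background())
    }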

Semantic Conventions

Semantic conventions define the keys and values that describe common concepts, protocols, and operations used by applications. The semantic conventions now live in their own repository: https://github.com/open-telemetry/semantic-conventions

Both collectors and client libraries should auto-generate semantic-convention keys and enum values as constants (or the language-idiomatic equivalent). Generated values should not be distributed in stable packages until the semantic conventions themselves are stable. YAML files must be used as the source of truth for generation, and each language implementation should provide language-specific support for the code generator.
Additionally, properties required by the specification will be listed here.

Contrib Packages

The OpenTelemetry project maintains integration with popular OSS projects that have been identified as important for observing modern Web services. Sample API integrations include instrumentation for web frameworks, database clients, and message queues; sample SDK integrations include plug-ins for exporting telemetry to popular analytics tools and telemetry storage systems.

The OpenTelemetry specification requires some plug-ins, such as OTLP Exporters and TraceContext Propagators, and these required plug-ins are included in the SDK.

Optional plugin and instrumentation packages that are not required by the SDK are called Contrib packages. API Contrib refers to packages that depend solely on the API; SDK Contrib refers to packages that also depend on the SDK.

The term Contrib refers specifically to the collection of plug-ins and tools maintained by the OpenTelemetry project; it does not refer to third-party plug-ins hosted elsewhere.

Versioning and stability

OpenTelemetry values stability and backward compatibility. For more information, see the Versioning and Stability guide.

Tracing Signal

Distributed tracing is a set of events that are triggered by a single logical operation and integrated across various components of the application. Distributed tracing includes events that span process, network, and security boundaries. A distributed trace might be initiated when someone presses a button to start an action on a website - in this example, the trace would represent the calls made between downstream services that handle the chain of requests initiated by pressing this button.

Traces

A trace in OpenTelemetry is implicitly defined by its spans. In particular, a trace can be thought of as a directed acyclic graph (DAG) of spans, where the edges between spans are parent/child relationships. For example, here is a sample trace consisting of 6 spans:

    Causal relationships between Spans in a single Trace
     
            [Span A]  ←←←(the root span)
                |
         +------+------+
         |             |
     [Span B]      [Span C] ←←←(Span C is a `child` of Span A)
         |             |
     [Span D]      +---+-------+
                   |           |
               [Span E]    [Span F]

Sometimes it is easier to visualize a trace on a timeline, as shown below:

    Temporal relationships between Spans in a single Trace
     
    ––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|> time
     
     [Span A···················································]
       [Span B··········································]
          [Span D······································]
       [Span C····················································]
             [Span E·······]        [Span F··]
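
As a hedged illustration of how such a parent/child structure arises in code, here is a minimal Go sketch (span names are arbitrary): each child span is started from the context returned when its parent was started.

    package demo
    
    import (
        "context"
    
        "go.opentelemetry.io/otel"
    )
    
    func doWork(ctx context.Context) {
        tracer := otel.Tracer("demo")
    
        // Span A: the root span of this trace.
        ctx, spanA := tracer.Start(ctx, "Span A")
        defer spanA.End()
    
        // Span B and Span C become children of Span A because they are
        // started from the context that carries Span A.
        ctxB, spanB := tracer.Start(ctx, "Span B")
        _, spanD := tracer.Start(ctxB, "Span D") // child of Span B
        spanD.End()
        spanB.End()
    
        _, spanC := tracer.Start(ctx, "Span C") // child of Span A
        spanC.End()
    }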

Spans

A span represents an operation within a transaction. Each Span encapsulates the following states:

  • The operation name
  • Start and end timestamps
  • Attributes: a list of key-value pairs
  • A set of zero or more events, each itself a tuple (timestamp, name, attributes); the name must be a string
  • The parent span's identifier
  • Links to zero or more causally related spans (via the SpanContext of those related spans)
  • The SpanContext information required to reference the span; see below

Span data structure:

    type Span struct {
        TraceID    int64        // identifies one complete request (the trace)
        Name       string
        ID         int64        // span ID of the current call
        ParentID   int64        // span ID of the upstream caller; null/zero for the topmost service
        Annotation []Annotation // timestamped annotations
        Debug      bool
    }
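
To make these states concrete, here is a hedged Go sketch using the official API (the attribute keys and the event name are made up for illustration):

    package billing
    
    import (
        "context"
    
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
    )
    
    func chargeCard(ctx context.Context, orderID string) {
        _, span := otel.Tracer("billing").Start(ctx, "chargeCard")
        defer span.End() // records the end timestamp
    
        // Attributes: key-value pairs attached to the span; setting an
        // existing key again overwrites its value.
        span.SetAttributes(
            attribute.String("order.id", orderID),
            attribute.Int("retry.count", 0),
        )
    
        // Event: a timestamped annotation on the span, similar to a log line.
        span.AddEvent("payment gateway called")
    }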

SpanContext

Represents all the information that identifies the Span in the Trace and must be propagated to the child Span and across process boundaries. The SpanContext contains the trace identifier and options for propagation from the parent Span to the child Span.

  • TraceId is the identifier of a trace. It is made up of 16 randomly generated bytes and is, with near certainty, globally unique. The TraceId is used to group all spans of a specific trace together across all processes.
  • SpanId is the identifier of a span. It is made up of 8 randomly generated bytes and is, with near certainty, globally unique. When passed to a child span, this identifier becomes the child span's parent span ID.
  • TraceFlags represents the tracing options, encoded as a single byte (a bitmap):
    • sampling bit - indicates whether the trace was sampled (mask 0x1)
  • Tracestate carries tracing-system-specific context as a list of key-value pairs. Tracestate allows different vendors to propagate additional information and to interoperate with their legacy ID formats; see the W3C trace-context specification for details.
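
In Go, these fields can be read from the SpanContext carried in the current context; a minimal sketch:

    package demo
    
    import (
        "context"
        "fmt"
    
        "go.opentelemetry.io/otel/trace"
    )
    
    func inspect(ctx context.Context) {
        sc := trace.SpanContextFromContext(ctx)
        fmt.Println("trace id:", sc.TraceID())               // 16-byte trace identifier
        fmt.Println("span id:", sc.SpanID())                 // 8-byte span identifier
        fmt.Println("sampled:", sc.IsSampled())              // TraceFlags sampling bit (0x1)
        fmt.Println("tracestate:", sc.TraceState().String()) // vendor-specific key-value pairs
    }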

Links between spans

A span can be linked to zero or more other spans that are causally related (identified by their SpanContext). Links can point to spans within a single trace or across different traces. Links can be used to represent batch operations, where one span was initiated by multiple initiating spans, each representing a single incoming item being processed in the batch.

Another example of using links is to declare the relationship between an original trace and a subsequent trace. This can be used when a trace enters a service's trusted boundary and the service policy requires generating a new trace rather than trusting the incoming trace context. The new, linked trace may also represent a long-running asynchronous data-processing operation that was initiated by one of many fast incoming requests.

In the scatter/gather (also known as fork/join) pattern, a root operation starts multiple downstream processing operations, and all of them are aggregated back in a single span. That final span is linked to the many operations it aggregates; they are spans from the same trace, similar to the span's parent field. However, it is recommended not to set a parent on the span in this case, because semantically the parent field implies a single parent, and in most cases the parent span fully encloses the child span in time, which is not true in scatter/gather and batch scenarios.
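
A hedged Go sketch of starting a span that links to several other spans (for example, the spans of the batch items it aggregates) instead of picking one of them as its parent:

    package batch
    
    import (
        "context"
    
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/trace"
    )
    
    // newAggregateSpan starts a span that links to the spans of the items it
    // aggregates instead of choosing one of them as its parent.
    func newAggregateSpan(ctx context.Context, items []trace.SpanContext) trace.Span {
        links := make([]trace.Link, 0, len(items))
        for _, sc := range items {
            links = append(links, trace.Link{SpanContext: sc})
        }
        _, span := otel.Tracer("batch").Start(ctx, "aggregate", trace.WithLinks(links...))
        return span
    }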

Metric Signal

OpenTelemetry allows recording either raw measurements or metrics with a predefined aggregation and a set of attributes. Recording raw measurements with the OpenTelemetry API defers to the end user the decision of which aggregation algorithm to apply to the metric, as well as the definition of attributes (dimensions). This mode is used in client libraries such as gRPC to record raw measurements like "server_latency" or "received_bytes"; the end user then decides what kind of aggregated values should be collected from these raw measurements, which may be a simple average or an elaborate histogram calculation.

Recording metrics with a predefined aggregation through the OpenTelemetry API is equally important: it allows collecting values such as CPU and memory usage, or simple metrics such as "queue length".

Recording raw measurements

The main classes used for recording raw measurements are Measure and Measurement. A list of measurements, along with the accompanying context, can be recorded using the OpenTelemetry API, so the user can decide how to aggregate those measurements and use the context passed along to define additional attributes of the resulting metric.

  • Measure: describes the type of individual value recorded by a library. It defines the contract between the library exposing the measurements and the application that aggregates those individual measurements into a Metric. A Measure is identified by a name, a description, and the unit of the value.
  • Measurement: describes a single value to be collected for a Measure. Measurement is an empty interface in the API surface; the interface is defined in the SDK.

Recording metrics with predefined aggregation

The base class for all types of pre-aggregated metrics is called Metric. It defines basic metric properties such as the name and attributes. Classes inheriting from Metric define their aggregation type and the structure of individual measurements or points. The API defines the following types of pre-aggregated metrics:

  • Counter: reports counter measurements. Counter values can increase or stay constant, but can never decrease, and they cannot be negative. There are two types of counter measurements - double and long.
  • Gauge: reports an instantaneous measurement of a value. A gauge can go both up and down, and gauge values can be negative. There are two types of gauge measurements - double and long.

The API allows constructing metrics of the selected types. The SDK defines how to query and export the current value of a metric. Each metric type has its own API for recording the values to be aggregated, and the API supports both push and pull models for setting metric values.
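
Note that the Measure/Gauge terminology above comes from an older revision of the specification; in the current stable Go metrics API, the equivalent of a pre-aggregated counter looks roughly like this (meter, instrument, and attribute names are made up):

    package demo
    
    import (
        "context"
    
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/metric"
    )
    
    func countRequest(ctx context.Context) error {
        meter := otel.Meter("demo")
    
        // Counter: monotonically increasing, never negative.
        requests, err := meter.Int64Counter("http.server.requests")
        if err != nil {
            return err
        }
        requests.Add(ctx, 1, metric.WithAttributes(attribute.String("route", "/checkout")))
        return nil
    }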

Metrics data model and SDK

The metrics data model is specified in the Metrics Data Model specification and is based on metrics.proto. This data model defines three semantics:

  • An Event model used by the API
  • An in-flight data model used by the SDK and OTLP
  • A TimeSeries model that represents how exporters should interpret the in-flight model

Different exporters have different capabilities (such as which data types are supported) and different constraints (such as which characters are allowed in property keys), and metrics are intended to be a superset of what is possible, rather than the lowest common denominator supported everywhere. All exporters consume data from the metric data model through the metric producer interface defined in the OpenTelemetry SDK.

Therefore, metrics impose minimal restrictions on the data (for example, which characters are allowed in keys), and code that handles metrics should avoid validating and sanitizing the metric data. Instead, pass the data on to the backend, rely on the backend to perform validation, and pass any errors from the backend back.

For more information, see Metrics Data Model Specification.

Log Signal

  • Data Model
    The log data model defines how OpenTelemetry understands logs and events.

Baggage Signal

In addition to trace propagation, OpenTelemetry provides a simple mechanism for propagating name/value pairs, called Baggage. Baggage is intended for indexing observability events in one service with attributes provided by a prior service in the same transaction, which helps establish a causal relationship between those events.

While Baggage can be used to prototype other cross-cutting concerns, the mechanism is primarily intended to convey value to the OpenTelemetry observability system.

These values can be consumed from Baggage and used as additional attributes for metrics, or as additional context for logs and traces. Some examples (a code sketch follows the list below):

  • Web services can benefit from including context about the service sending the request
  • The SaaS provider can include context about the API user or token responsible for the request
  • Determining that a specific browser version is associated with a failure in the Image Processing Service
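
A hedged Go sketch of setting and reading baggage with the official baggage package (the member key and value are invented for illustration):

    package demo
    
    import (
        "context"
    
        "go.opentelemetry.io/otel/baggage"
    )
    
    // withTenant attaches a baggage member to the context; it is propagated to
    // downstream services together with the trace context.
    func withTenant(ctx context.Context) context.Context {
        member, _ := baggage.NewMember("tenant.id", "acme") // errors ignored for brevity
        bag, _ := baggage.New(member)
        return baggage.ContextWithBaggage(ctx, bag)
    }
    
    // readTenant reads the value back in a downstream service.
    func readTenant(ctx context.Context) string {
        return baggage.FromContext(ctx).Member("tenant.id").Value()
    }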

For backward compatibility with OpenTracing, Baggage is propagated as Baggage when using the OpenTracing bridge. New concerns with different criteria should consider creating a new cross-cutting concern to cover their use case; they may benefit from the W3C encoding format but should use a new HTTP header to carry the data throughout a distributed trace.

Resources

Resources capture information about the entity for which telemetry is recorded. For example, metrics exposed by a Kubernetes container can be linked to resources that specify cluster, namespace, pod, and container names.

A resource can capture an entire hierarchy of entity identities, which might describe a host in the cloud and a specific container or process running an application.

Note that some process identification information can be automatically associated with telemetry through the OpenTelemetry SDK.
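
A hedged Go sketch of attaching a resource when constructing the SDK (the service name and environment value are placeholders; the semconv import path pins a convention version and is an assumption):

    package telemetry
    
    import (
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/sdk/resource"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
    )
    
    func newTracerProvider(exp sdktrace.SpanExporter) *sdktrace.TracerProvider {
        // The resource identifies the entity producing the telemetry.
        res := resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("checkout-service"),
            attribute.String("deployment.environment", "staging"),
        )
        return sdktrace.NewTracerProvider(
            sdktrace.WithBatcher(exp),
            sdktrace.WithResource(res),
        )
    }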

Context propagation

All OpenTelemetry cross-cutting concerns, such as tracing and metrics, share an underlying context mechanism for storing state and accessing data throughout the lifecycle of a distributed transaction; see the Context specification.

Propagators

OpenTelemetry uses Propagators to serialize and deserialize cross-cutting concern values such as Spans (usually only the SpanContext part) and Baggage. Different propagator types define the restrictions imposed by a specific transport and are bound to a data type. The Propagators API currently defines one Propagator type:

  • TextMapPropagator: injects values into and extracts values from carriers as text.
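
A hedged Go sketch of installing the W3C TraceContext and Baggage propagators and injecting the current context into the headers of an outgoing HTTP request:

    package telemetry
    
    import (
        "context"
        "net/http"
    
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/propagation"
    )
    
    func setupPropagators() {
        otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
            propagation.TraceContext{}, // W3C traceparent/tracestate headers
            propagation.Baggage{},      // W3C baggage header
        ))
    }
    
    // inject serializes the SpanContext and Baggage held in ctx into the
    // outgoing request's headers (the text-map carrier).
    func inject(ctx context.Context, req *http.Request) {
        otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
    }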

Collector

The OpenTelemetry Collector is a set of components that can collect traces, metrics, and other telemetry data (such as logs) from processes instrumented with OpenTelemetry or with other monitoring/tracing libraries (Jaeger, Prometheus, etc.), perform aggregation and smart sampling, and export the traces and metrics to one or more monitoring/tracing backends. The Collector also allows enriching and transforming the collected telemetry data, for example adding extra attributes or scrubbing personal information.

OpenTelemetry collectors have two main modes of operation:

  • Agent (daemon process that runs locally with the application)
  • Collector (a service that runs independently)

Read more in the OpenTelemetry service long-term vision.

Instrumentation library

The original intention of the project was to have every library and application call the OpenTelemetry API directly and thus use it out of the box. However, many libraries will not have such integration, so a separate library will be needed to inject such calls, using mechanisms such as wrapping interfaces, subscribing to callbacks of a specific library, or converting existing telemetry data to an OpenTelemetry model. A library that enables OpenTelemetry observability for another library is called an Instrumentation library.

An instrumentation library should follow the naming conventions of the library it instruments (for example, "middleware" for a web framework). If there is no established name, it is recommended to prefix the package with "opentelemetry-instrumentation", followed by the name of the instrumented library itself. Examples include:

    opentelemetry-instrumentation-flask (Python)
    @opentelemetry/instrumentation-grpc (Javascript)
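
In Go, instrumentation libraries live in the contrib repository. As a hedged example, wrapping an HTTP handler with the otelhttp instrumentation so that each request automatically produces a server span (route and port are arbitrary):

    package main
    
    import (
        "net/http"
    
        "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    )
    
    func main() {
        hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            _, _ = w.Write([]byte("hello"))
        })
    
        // otelhttp.NewHandler wraps the handler so every request produces a
        // server span via the globally registered TracerProvider.
        http.Handle("/hello", otelhttp.NewHandler(hello, "hello"))
        _ = http.ListenAndServe(":8080", nil)
    }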

Readers who have endured this far may be ready to give up. Don't worry: next, we will move on to the actual deployment.


Origin blog.csdn.net/u010230019/article/details/132543339