Cloud-native observability and analytics

We've arrived at the final installment of the Cloud Native Computing Foundation Landscape series. If you missed the previous articles, we covered the introduction, followed by the provisioning, runtime, orchestration and management, and platform layers, each in a separate article. Today, we'll discuss each category of the observability and analysis "column".

Let's start by defining observability and analytics. Observability is a system characteristic that describes the degree to which a system can be understood from its external outputs. Computer systems are observed through signals such as CPU time, memory, disk space, latency, and errors, and they can be more or less observable depending on what they expose. Analytics, on the other hand, is the activity of examining this observable data and making sense of it.

To avoid service interruptions, you need to observe and analyze every aspect of your application so that anomalies are detected and corrected right away. That's what this category is all about. It runs across and observes all layers, which is why it sits on the side of the landscape rather than being embedded in a specific layer.

Tools in this category are divided into logging, monitoring, tracing, and chaos engineering. Note that the category name is somewhat misleading: although listed here, chaos engineering is a reliability tool rather than an observability or analysis tool.

1. Logging

1.1 What it is

Applications emit a steady stream of log messages describing what they are doing at any given time. These log messages capture various events that occur in the system, such as failed or successful operations, audit information, or health events. Logging tools collect, store, and analyze these messages to track error reports and related data. Along with metrics and tracing, logging is one of the pillars of observability.

1.2 The problem it solves

Collecting, storing, and analyzing logs is a critical part of building a modern platform. Logging tools help with one or all of these tasks. Some tools handle every aspect from collection to analysis, while others focus on a single task such as collection. All logging tools are designed to help organizations gain control over their log messages.

1.3 How it helps

When you collect, store, and analyze application log messages, you can understand what your application is communicating at any given time. Note, however, that logs contain only the messages an application was deliberately written to emit, so they do not necessarily pinpoint the root cause of a given problem. That said, collecting and retaining log messages over time is a very powerful capability that helps teams diagnose problems and meet regulatory and compliance requirements.

1.4 Technology 101

While collecting, storing, and processing log messages is by no means a new problem, cloud-native patterns and Kubernetes have significantly changed the way we handle logs. Traditional logging approaches for virtual and physical machines (such as writing logs to files) are a poor fit for containerized applications because the container's file system does not outlive the application. In a cloud-native environment, log collection tools like Fluentd run alongside the application containers and collect messages directly from the applications. The messages are then forwarded to a central log store for aggregation and analysis.
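
As a minimal sketch of the application side of this pattern, a service can write one structured message per line to standard output and leave collection to an agent such as Fluentd. The example uses only Python's standard logging module. The JsonFormatter class, the build_logger helper, the logger name "checkout", and the JSON field names are assumptions made for illustration and are not part of any Fluentd API.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output on our handler only
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")  # hypothetical service name
    log.info("order accepted")
    log.error("payment failed for order %s", "A-1001")
```

Writing to standard output instead of a file inside the container means nothing is lost when the container is removed, provided a collector is reading the stream.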

Fluentd is the only CNCF project in this area.

Buzzwords

  • Logging

Popular projects

  • Fluentd
  • Fluent Bit
  • Elastic Logstash

2. Monitoring

2.1 What it is

Monitoring is the instrumentation of an application to collect, aggregate and analyze logs and metrics to improve our understanding of its behavior. While logs describe specific events, metrics are measurements of the system at a given point in time—they are two different things, but both are necessary for a comprehensive view of the health of the system. Monitoring includes everything from looking at disk space, CPU usage and memory consumption on a single node to performing detailed synthetic transactions to see if a system or application is responding correctly and in a timely manner. There are many different ways to monitor systems and applications.

2.2 The problem it solves

When you run an application or platform, you want it to perform the task it was designed for and to be accessed only by authorized users. Monitoring lets you check whether it is working correctly, securely, and cost-effectively, whether it is accessed only by authorized users, and any other characteristic you may be tracking.

2.3 How it helps

Good monitoring enables operators to respond quickly, and possibly automatically, when an incident occurs. It provides insight into the current health of the system and watches for changes. Tracking everything from application health to user behavior, monitoring is an essential part of running applications effectively.

2.4 Technology 101

Monitoring in cloud-native environments is often similar to monitoring traditional applications. You need to track metrics, logs, and events to understand the health of your application. The main difference is that some managed objects are ephemeral, which means they may not persist, so tying your monitoring to auto-generated resource names is not a good long-term strategy. Many CNCF projects in this area revolve around Prometheus, a CNCF graduated project.
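
To illustrate why stable labels are a better anchor than generated names, here is a minimal sketch of a counter keyed by labels and rendered in the Prometheus text exposition format. It deliberately uses only the Python standard library rather than a real client library. The Counter class, the metric name http_requests_total, the label names, and the pod names are all hypothetical.

```python
from collections import defaultdict


class Counter:
    """A tiny label-aware counter (illustrative, not a real client library)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        """Increase the series identified by this exact set of labels."""
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def render(self) -> str:
        """Return all series as text, one sample per line."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{label_text}}} {value}")
        return "\n".join(lines)


requests = Counter("http_requests_total", "Total HTTP requests handled.")

# Two short-lived pods of the same service report under the same stable label.
# The generated pod names are deliberately not used as the series identity.
for pod_name in ("checkout-7d9f-abcde", "checkout-7d9f-fghij"):
    requests.inc(service="checkout", status="200")
requests.inc(service="checkout", status="500")

print(requests.render())
```

Because both pods increment the series labeled service="checkout", the count survives pod restarts and renames. A dashboard or alert built on the pod name would have to be rewritten every time the pod is replaced.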

Buzzwords

  • Monitoring
  • Time series
  • Alerting
  • Metrics

Popular projects/products

  • Prometheus
  • Cortex
  • Thanos
  • Grafana

3. Tracing

3.1 What it is

In the world of microservices, services are constantly communicating with each other over a network. Tracing is a special use of logging that allows you to trace the path of a request through a distributed system.

3.2 The problem it solves

Understanding how a microservices application behaves at any given point in time is an extremely challenging task. While many tools provide deep insight into service behavior, it can be difficult to connect the operation of a single service to a broader understanding of how the entire application behaves.

3.3 How it helps

Tracing solves this problem by adding a unique identifier to the messages sent by the application. This unique identifier allows you to follow, or trace, individual transactions as they pass through your system. You can use this information to view the health of your application and to debug problematic microservices or activities.
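
The idea can be sketched in a few lines of Python, with two functions standing in for two services and a dictionary standing in for request headers. This is an illustration under stated assumptions, not how Jaeger or any other tracer is implemented: the header names x-trace-id and x-parent-span-id, the start_span helper, and the in-memory SPANS list are all hypothetical, and real tracing libraries define their own propagation formats.

```python
import uuid

SPANS = []  # stand-in for a trace backend; kept in memory for the sketch


def start_span(name: str, headers: dict) -> dict:
    """Record a span and return the headers to pass to the next service."""
    # Reuse the caller's trace ID if there is one, otherwise start a new trace.
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]
    SPANS.append({
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_id": headers.get("x-parent-span-id"),
        "name": name,
    })
    return {"x-trace-id": trace_id, "x-parent-span-id": span_id}


def inventory_service(headers: dict) -> str:
    start_span("inventory.reserve", headers)
    return "reserved"


def checkout_service(headers: dict) -> str:
    outgoing = start_span("checkout.place_order", headers)
    # The same trace ID travels with the downstream call.
    return inventory_service(outgoing)


checkout_service({})  # an incoming request with no trace context yet

for span in SPANS:
    print(span["trace_id"], span["name"], "parent:", span["parent_id"])
```

Both spans share one trace ID, and the second span names the first as its parent. That shared identifier is what lets a tracing backend reassemble the full path of a single request.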

3.4 Technology 101

Tracing is a powerful debugging tool that allows you to troubleshoot and fine-tune the behavior of distributed applications. This power does come at a price. Application code needs to be modified to emit trace data, and any spans need to be propagated by the infrastructure components in the application's data path, specifically the service mesh and its proxies. Jaeger and OpenTracing are CNCF projects in this area.

Buzzwords

  • Span
  • Tracing

Popular projects

  • Jaeger
  • OpenTracing

4. Chaos Engineering

4.1 What it is

Chaos engineering refers to the practice of deliberately introducing failures into systems to create more resilient applications and engineering teams. Chaos engineering tools will provide a controlled way to introduce failures and run specific experiments on specific instances of the application.

4.2 The problem it solves

Complex systems fail. They fail for many reasons, and the consequences in distributed systems are often hard to understand. Organizations that embrace chaos engineering accept that failures will happen and, rather than only trying to prevent them, practice recovering from them. This is referred to as optimizing for mean time to repair, or MTTR.

Side note: The traditional approach to maintaining high application availability is called optimizing for mean time between failures, or MTBF. You can observe this practice in organizations that use "change review boards" and long change freezes to keep application environments stable by limiting changes. The authors of Accelerate suggest that high-performing IT organizations instead achieve high availability by optimizing for mean time to recovery, or MTTR.
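
One common way to relate the two measures, which the article itself does not spell out, is steady-state availability = MTBF / (MTBF + MTTR). The short sketch below applies that formula to made-up numbers to compare the two strategies. The availability function and the hour values are assumptions chosen purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: fails every 1000 hours, takes 1 hour to recover.
baseline = availability(mtbf_hours=1000, mttr_hours=1.0)
fewer_failures = availability(mtbf_hours=2000, mttr_hours=1.0)
faster_recovery = availability(mtbf_hours=1000, mttr_hours=0.5)

print(f"baseline:        {baseline:.6f}")
print(f"double the MTBF: {fewer_failures:.6f}")
print(f"halve the MTTR:  {faster_recovery:.6f}")
```

Under this formula, halving recovery time yields the same availability as doubling the time between failures, which is one way to see why practicing recovery can be as valuable as restricting change.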

4.3 How it helps

In a cloud-native world, applications must dynamically adapt to failures—a relatively new concept. This means that when something fails, the system does not shut down completely, but degrades or recovers gracefully. Chaos engineering tools enable you to experiment with software systems in production to ensure they do so in the event of a real failure.

In short, you experiment with a system because you want to be confident that it can withstand turbulent and unexpected conditions. Rather than waiting for something to go wrong and finding out then, you put the system under stress in controlled conditions to identify weaknesses and fix them before chance exposes them for you.
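
A toy version of such an experiment can be written in plain Python: wrap a dependency so that a fraction of calls fail on purpose, then check that the caller degrades gracefully instead of crashing. This is a sketch of the idea only and does not reflect how Chaos Mesh or Litmus Chaos work. The with_chaos wrapper, the InjectedFailure exception, the 30% failure rate, and the service functions are all hypothetical.

```python
import random


class InjectedFailure(Exception):
    """Raised by the experiment instead of a real outage."""


def with_chaos(func, failure_rate: float, rng: random.Random):
    """Wrap func so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id: str) -> list:
    """Stand-in for a call to a downstream recommendation service."""
    return [f"item-for-{user_id}"]


def homepage(user_id: str, fetch) -> dict:
    """Degrade gracefully: show an empty list rather than fail the page."""
    try:
        items = fetch(user_id)
        degraded = False
    except InjectedFailure:
        items, degraded = [], True
    return {"user": user_id, "items": items, "degraded": degraded}


rng = random.Random(42)  # fixed seed keeps the experiment repeatable
flaky_fetch = with_chaos(fetch_recommendations, failure_rate=0.3, rng=rng)

results = [homepage("u1", flaky_fetch) for _ in range(1000)]
degraded_count = sum(r["degraded"] for r in results)
print(f"{len(results)} requests served, {degraded_count} in degraded mode")
```

Every request gets a response even though a share of the downstream calls fail. If the homepage function lacked the fallback, the same experiment would surface the weakness immediately, under conditions the team controls.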

4.4 Technology 101

Chaos engineering tools and practices are critical to achieving high application availability. Distributed systems are often too complex for any single engineer to fully understand, and no change process can fully predict the impact of a change on the environment. By introducing deliberate chaos engineering practices, teams can practice and automate recovery from failures. Chaos Mesh and Litmus Chaos are CNCF tools in this area, but there are many open source and proprietary options available.

Buzzwords

  • Chaos Engineering

Popular projects

  • Chaos Mesh
  • Litmus Chaos

As we've seen, the observability and analytics column is all about understanding the health of the system and making sure it keeps running even in harsh conditions. Logging tools capture event messages emitted by applications, monitoring tools watch logs and metrics, and tracing tools follow the path of individual requests. When combined, these tools ideally provide a 360-degree view of what's going on within your system. Chaos engineering is a bit different. It provides a safe way to verify that a system can withstand unexpected events, basically ensuring that it remains healthy.



Origin blog.csdn.net/xixihahalelehehe/article/details/123733848