KCNA Exam Chapter 7: Monitoring and Observability


1. Introduction

The term cloud native observability sounds like just another buzzword used to market new tools. It is true that a lot of new tools are emerging to solve the problem of monitoring container infrastructure.

Regular monitoring of a server may include collecting basic metrics of the system, such as CPU and memory resource usage, as well as process and operating system logging. A new challenge for microservices architectures is monitoring requests as they move through a distributed system. This principle, called tracing, is especially useful when a large number of services are involved in responding to a request.

In this chapter, we'll learn how container infrastructure still relies on collecting metrics and logs, but how the requirements have changed considerably. Monitoring now focuses more on network concerns such as latency, throughput, request retries, and application startup time, and the sheer volume of metrics, logs, and traces in distributed systems requires a different approach to managing these systems.

2. Learning Objectives

By the end of this chapter, you should be able to:

  • Explain why observability is a key discipline in cloud computing.
  • Discuss metrics, logging, and tracing.
  • Explain how to display logs of containerized applications.
  • Explain how Prometheus can be used to collect and store metrics.
  • Explain how to optimize cloud costs.

3. Observability

Observability is often used as a synonym for monitoring, but monitoring is only a sub-category of cloud native observability and does not cover the full concept. The term observability comes from control theory, which deals with the behavior of dynamic systems. Essentially, control theory describes how to measure the external outputs of a system in order to manipulate its behavior.
A common example is a car's cruise control system. You set the desired speed, which can be continuously measured with a speedometer. To maintain that speed under changing conditions, such as when driving up a hill, the power of the motor has to be adjusted.
In IT systems, the same principles can be applied to autoscaling. You can set the desired system utilization and trigger scaling events based on system load.
Automating a system to this extent can be very challenging, and it is not the most important use of observability. When dealing with container orchestration and microservices, the biggest challenge is keeping track of the systems, how they interact with each other, and how they behave under load or in error states.
Observability should answer the following questions:

  • Is the system stable or has it changed state while being manipulated?
  • Is the system sensitive to changes, such as high latency for certain services?
  • Are some metrics in the system exceeding their limits?
  • Why did the request to the system fail?
  • Are there bottlenecks in the system?

The higher goal of observability is to allow analysis of the collected data. This helps to better understand the system and react to error states. This technical aspect is closely related to modern agile software development, which also uses feedback loops where you analyze the behavior of the software and continually adjust it based on the results.

4. Telemetry

The term telemetry is derived from Greek roots meaning remote (tele) and measure (metron). Measuring and collecting data points and then transferring them to another system is, of course, not exclusive to cloud native systems, or even to IT. A good example is a weather station with a data logger that measures temperature, humidity, wind speed, and so on at a certain point and then transmits the data to another system that can process and display it.
In a container system, each application should have tools built in to generate informational data, which is then collected and transmitted to a centralized system. This data can be divided into three categories:

  • Logs
    Messages emitted by an application when errors, warnings, or debugging information should be recorded. A simple log entry could mark the start and end of a specific task performed by the application.
  • Metrics
    Quantitative measurements taken over time, such as the number of requests or the error rate.
  • Traces
    Traces follow the progress of a request as it passes through the system. Tracing is used in distributed systems to provide information about when a request was handled by a service and how long it took.

Many traditional systems don't even transfer data such as logs to a centralized system; to view the logs, you have to connect to the system and read them directly from a file.
In a distributed system with hundreds or thousands of services, this would mean a lot of work, and troubleshooting would be very time-consuming.

5. Logging

Today, application frameworks and programming languages come with a plethora of built-in logging facilities, which make it very easy to log messages to a file at different log levels depending on their severity.
The documentation for the Python programming language provides the following example:

import logging
logging.basicConfig(filename='example.log', encoding='utf-8', level=logging.DEBUG)
logging.debug('This message should go to the log file')
logging.info('So should this')
logging.warning('And this, too')
logging.error('And non-ASCII stuff, too, like Øresund and Malmö')

Unix and Linux programs provide three I/O streams, two of which are used to output logs from containers:

  • Standard input (stdin): Input to a program, such as from a keyboard.
  • Standard output (stdout): The output the program writes to the screen.
  • Standard error (stderr): Errors written by the program to the screen.

Command line tools such as docker, kubectl, or podman provide a command to display the logs of containerized processes, as long as those processes log directly to the console, that is, to /dev/stdout and /dev/stderr.
To view the log information for the container nginx:

$ docker logs nginx

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2021/10/20 13:22:44 [notice] 1#1: using the "epoll" event method
2021/10/20 13:22:44 [notice] 1#1: nginx/1.21.3

If you want to stream the logs in real time, you can add the -f (follow) parameter to the command. Kubernetes provides the same functionality through the kubectl command line tool. The documentation for the kubectl logs command provides the following examples:

# Return snapshot logs from pod nginx with only one container
kubectl logs nginx 
# Return snapshot of previous terminated ruby container logs from pod web-1
kubectl logs -p -c ruby web-1 
# Begin streaming the logs of the ruby container in pod web-1
kubectl logs -f -c ruby web-1 
# Display only the most recent 20 lines of output in pod nginx
kubectl logs --tail=20 nginx 
# Show all logs from pod nginx written in the last hour
kubectl logs --since=1h nginx

These methods allow direct interaction with a single container, but to manage large amounts of log data, the logs need to be shipped to a system that stores them centrally. Logs can be shipped in different ways:

  • Node-level logging
    The most efficient way to collect logs. The administrator configures a log shipping facility on each node that collects the logs and forwards them to a central store.
  • Logging via sidecar container
    The application pod has a sidecar container that collects the logs and forwards them to a central store.
  • Application-level logging
    The application pushes its logs directly to the central store. While this may seem very convenient at first, it requires a logging adapter to be configured in every application running in the cluster.

There are several tools that can be used to transfer and store logs. The first two methods can be implemented with tools like fluentd or filebeat.

Common choices for storing logs are OpenSearch or Grafana Loki. To find more datastores, you can visit the fluentd documentation on possible log targets.
To make logs easy to process and search, make sure you log in a structured format such as JSON rather than plain text. The major cloud vendors provide good documentation on the importance of structured logging and how to implement it.
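For illustration, here is a minimal sketch of structured logging that reuses the standard-library setup from the earlier Python example; a custom formatter renders every record as one JSON object per line (the field names are just an assumption, not a fixed standard):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()  # defaults to stderr, which container runtimes capture
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info("order processed")
# -> {"timestamp": "2021-10-20 13:22:44,123", "level": "INFO", "logger": "root", "message": "order processed"}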

6. Prometheus

Prometheus is an open-source monitoring system that was originally developed by SoundCloud and became the second CNCF-hosted project in 2016. It has become a very popular monitoring solution over time and is now a standard tool that integrates particularly well in the Kubernetes and container ecosystem.
Prometheus can collect metrics that applications and servers emit as time series data; these are very simple data sets that include a timestamp, labels, and the metric value itself. The Prometheus data model provides four core metric types:

  • Counter: A value that only increases, such as the number of requests or errors
  • Gauge: A value that can increase or decrease, such as memory usage
  • Histogram: A sample of observations, such as request durations or response sizes
  • Summary: Similar to a histogram, but also provides the total count of observations.

To expose these metrics, applications serve an HTTP endpoint under /metrics. Instead of implementing it yourself, you can use one of the existing client libraries:

  • Go
  • Java or Scala
  • Python
  • Ruby

You can also use one of the many unofficial client libraries listed in the Prometheus documentation.
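For illustration, a minimal sketch using the official Python client (prometheus_client) might look like the following; the metric names match the sample output below, while the simulated workload and the port number are assumptions:

import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# The client registers these metrics in a default registry and serves them over HTTP.
REQUESTS = Counter("http_requests", "The total number of handled HTTP requests.")  # exposed as http_requests_total
QUEUE_LENGTH = Gauge("queue_length", "The number of items in the queue.")
REQUEST_DURATION = Histogram("http_request_duration_seconds",
                             "A histogram of the HTTP request durations in seconds.")

if __name__ == "__main__":
    start_http_server(8000)  # metrics become available at http://localhost:8000/metrics
    while True:
        with REQUEST_DURATION.time():             # observe how long the simulated work takes
            time.sleep(random.random() / 10)
        REQUESTS.inc()                            # a counter only ever goes up
        QUEUE_LENGTH.set(random.randint(0, 100))  # a gauge can go up and down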
The exposed data might look like this:

# HELP queue_length The number of items in the queue.
# TYPE queue_length gauge
queue_length 42
# HELP http_requests_total The total number of handled HTTP requests.
# TYPE http_requests_total counter
http_requests_total 7734
# HELP http_request_duration_seconds A histogram of the HTTP request durations in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 4599
http_request_duration_seconds_sum 88364.234
http_request_duration_seconds_count 227420
# HELP http_request_duration_seconds A summary of the HTTP request durations in seconds.
# TYPE http_request_duration_seconds summary
http_request_duration_seconds{quantile="0.5"} 0.052
http_request_duration_seconds_sum 88364.234
http_request_duration_seconds_count 227420

Prometheus has built-in support for Kubernetes and can be configured to automatically discover all services in a cluster and collect metric data at defined intervals, storing it in its time series database.
To query data stored in the time series database, Prometheus provides a query language called PromQL (Prometheus Query Language). Users can use PromQL to select and aggregate data in real time and view it in the built-in Prometheus user interface, which provides a simple graphical or tabular view.
Here are some examples from the Prometheus documentation:

# Return all time series with the metric http_requests_total and the given job and handler labels:
http_requests_total{job="apiserver", handler="/api/comments"}

Or apply a function, such as the per-second rate over time:

# Return the per-second rate for all time series with the http_requests_total metric name, as measured over the last 5 minutes:
rate(http_requests_total[5m])

You can use these functions to get an indication of how a value increases or decreases over time, which helps with analyzing an application for errors or predicting failures.
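The same expressions can also be evaluated programmatically through Prometheus' HTTP API instead of the built-in UI. A minimal sketch, assuming a Prometheus server at localhost:9090 and the third-party requests package:

import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumption: a locally running Prometheus server

# Evaluate the rate() expression from above via the /api/v1/query endpoint.
response = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": "rate(http_requests_total[5m])"},
)
response.raise_for_status()

for series in response.json()["data"]["result"]:
    # Each result carries the label set and a [timestamp, value] pair.
    print(series["metric"], series["value"])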
Of course, monitoring only makes sense if the collected data is used. The most common companion to Prometheus is Grafana, which builds dashboards from the collected metrics. Grafana can be used with many more data sources than just Prometheus, although Prometheus is the most commonly used one.

[Image: Grafana dashboard, retrieved from the Grafana website]

Another tool in the Prometheus ecosystem is Alertmanager. The Prometheus server itself allows you to configure alerts when certain metrics reach or exceed a threshold. When alerts are triggered, Alertmanager can send notifications to your favorite persistent chat tool, e-mail, or specialized tools for alerting and on-call management.
Here is an example of an alerting rule in Prometheus:

groups:
- name: example
  rules:
  - alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency

7. Tracing

Logging and monitoring with metrics are not particularly new. The same cannot be said for (distributed) tracing. Metrics and logs are important to give a good overview of individual services, but tracing is needed to understand how a request is handled in a microservices architecture.
Tracing describes the tracking of a request as it passes through the services. A trace consists of multiple units of work, called spans, which represent the different events that occur while the request passes through the system. Each application can contribute a span to the trace, which can include information such as start and end times, names, tags, or log messages.
These traces can be stored and analyzed in a tracing system such as Jaeger.
[Image: Trace details, retrieved from the Jaeger website]
Because tracing is a relatively new technique in cloud native environments, standardization has been a problem. In 2019, the OpenTracing and OpenCensus projects merged to form the OpenTelemetry project, which is now also a CNCF project.
OpenTelemetry is a set of application programming interfaces (APIs), software development kits (SDKs), and tools that can be used to integrate telemetry such as metrics, logs, and especially traces into applications and infrastructure. OpenTelemetry clients can export telemetry data in a standardized format to a central platform such as Jaeger. The available tools are listed in the OpenTelemetry documentation.
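For illustration, here is a minimal sketch using the OpenTelemetry Python SDK; the span names and attributes are made up, and the finished spans are printed to the console instead of being exported to Jaeger:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Configure the SDK to print finished spans to the console. In a real deployment,
# an exporter pointing at a collector or Jaeger would be configured instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example-service")

def handle_request():
    # Each "with" block contributes one span (with start and end times) to the trace.
    with tracer.start_as_current_span("handle-request") as span:
        span.set_attribute("http.method", "GET")
        with tracer.start_as_current_span("query-database"):
            pass  # the nested span records the duration of this sub-task

handle_request()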

8. Cost Management

Cloud computing allows us to draw resources from a theoretically infinite pool and pay for them only when they are really needed. Since the services offered by cloud providers do not come for free, the key to cost optimization in cloud computing is to analyze what is really needed and, where possible, to automate the provisioning of only the required resources.

Automatic and manual optimizations can be done in several ways:

  • Identifying wasted and unused resources
    With good monitoring of resource usage, it is easy to find unused resources or servers with very little load. Many cloud providers offer cost explorers that break down the cost of individual services, and autoscaling helps shut down instances that are not needed.
  • Right-sizing
    In the beginning, it may seem like a good idea to choose servers and systems that are much more powerful than you actually need. Over time, good monitoring can show how many resources an application actually requires. This is an ongoing process, and you should always adapt to the load you really have. Don't buy powerful machines if you only need half their capacity.
  • Reserved instances
    On-demand pricing is great if you really need resources on demand. Otherwise, you may be paying a lot of money for the "on-demand" convenience. One way to save a lot of money is to reserve resources and even pay for them upfront, possibly even a few years in advance, if you have a good estimate of the resources you will need.
  • Spot instances
    If you have a batch job or a short period of heavy load, you can use spot instances to save money. The concept of spot instances is that you get unused, over-provisioned capacity from the cloud provider at a very low price. The "catch" is that these resources are not reserved for you and may be terminated on short notice to be given to someone else who pays the "full price".

All of these methods can be combined for cost efficiency. Mixing on-demand, reserved, and spot instances is usually not a problem.

9. Other Resources

Cloud Native Observability

Prometheus

Prometheus at scale

Logging for Containers

Right-Sizing and cost optimization

