Observability Design Patterns for Microservice Architecture

Observability is a superset of monitoring. It provides a high-level overview of system health as well as detailed insight into implicit failure modes. Observable systems also expose rich context about their inner workings, which makes it possible to uncover deeper systemic problems.

Once a service is deployed in production, we want to know how it is performing in terms of requests per second, resource utilization, and so on. We also want to be alerted immediately if something goes wrong, such as a service instance failing or running out of disk space, ideally before it affects the user experience. When something does go wrong, we need to be able to troubleshoot it and perform root cause analysis (RCA).

As service developers, we should implement several patterns to make service management and troubleshooting easier. The following six patterns can help us design observable services:

  • Health check API: Expose an endpoint that returns the health status of the service.
  • Log aggregation: Log service activity and write the logs to a centralized log server that provides search and alerting.
  • Distributed tracing: Assign each external request a unique ID and track the request as it flows between services.
  • Exception tracking: Report exceptions to an exception tracking service, which de-duplicates them, alerts developers, and tracks how each one is resolved.
  • Application metrics: Maintain metrics such as counters and gauges in the service and expose them to a metrics server.
  • Audit log: Record user actions.

1. Health check API pattern

Occasionally, a service may be running but unable to process requests. A newly started service instance may still be initializing and doing some sanity checks before processing requests. It doesn't make sense for the deployment infrastructure to route HTTP requests to service instances until they are ready to handle them.

A service instance can also fail without being terminated, for example when all of its database connections are exhausted and the database can no longer be reached. The deployment infrastructure should not route requests to an instance that has failed but is still running; if the instance cannot recover, it must be terminated and replaced with a new one. A service instance therefore has to be able to tell the deployment infrastructure whether it can handle requests. For Spring Boot services, Spring Boot Actuator provides a ready-made health endpoint.
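
To make the idea concrete, here is a minimal sketch of a health endpoint using only the Python standard library. It is not how Spring Boot Actuator is implemented; the `/health` path, the `database_reachable` check, and the `CHECKS` table are assumptions made for illustration. The endpoint answers 200 when every dependency check passes and 503 otherwise, so the deployment infrastructure can decide whether to route traffic to the instance.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def database_reachable() -> bool:
    """Hypothetical check; a real service would ping its database here."""
    return True


# Hypothetical table of dependency checks for this service.
CHECKS: Dict[str, Callable[[], bool]] = {"database": database_reachable}


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check and return (HTTP status, JSON-serializable body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "UP" if check() else "DOWN"
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = "DOWN"
    healthy = all(state == "UP" for state in results.values())
    status = 200 if healthy else 503
    return status, {"status": "UP" if healthy else "DOWN", "checks": results}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = evaluate_health(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the endpoint; call this from the service's startup code."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Keeping `evaluate_health` separate from the HTTP handler means the health logic can be unit tested without opening a socket.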

2. Log aggregation pattern

Logs are a natural starting point for troubleshooting: if you want to determine what is wrong with your application, the log files are a good place to look. Logging in a microservices architecture is challenging, however, because the relevant entries are scattered across the log files of many different services.

Log aggregation is the solution. The log aggregation service sends the logs of all service instances to a centralized log server. When logs are stored by the log server, they can be viewed, searched and analyzed. It is also possible to set up an alert to be triggered when certain messages appear in the log.

The logging infrastructure is responsible for aggregating the logs, storing them, and making them searchable. Many popular tools support log aggregation, such as Splunk, Fluentd, the ELK stack, and Graylog.
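
On the service side, aggregation works best when every entry is machine-readable and says which service instance produced it. The sketch below is one possible approach using Python's standard `logging` module: it writes one JSON object per line to stdout, where a log shipper could pick it up. The field names and the `order-service` identifiers are assumptions, not a format required by any of the tools above.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def __init__(self, service: str, instance: str):
        super().__init__()
        self.service = service
        self.instance = instance

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,    # which service wrote the entry
            "instance": self.instance,  # which instance of that service
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep a multi-line stack trace inside one JSON field.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def configure_logging(service: str, instance: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service, instance))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A service would call `configure_logging("order-service", "order-1")` once at startup and then log as usual, for example `logger.info("order %s created", order_id)`.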

3. Distributed tracing pattern

Suppose you are troubleshooting a slow API response that involves multiple services. Distributed tracing gives you insight into what the application is doing. A distributed tracer is similar to a profiler in a monolithic application: it records information about the service calls made while processing a request. You can then see how the services interact while handling external requests and how much time is spent in each service.

Each external request is assigned a unique ID and tracked as it flows from one service to another, and the resulting trace data is sent to a centralized server that provides visualization and analytics. Tools in this space include Zipkin, Jaeger, and New Relic; OpenTracing and OpenCensus are instrumentation APIs for producing trace data rather than tracing servers.
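
The core mechanism, propagating one ID across service calls and recording timed spans against it, can be sketched in a few lines. This is a simplified illustration, not the wire format of Zipkin or Jaeger: the `X-Trace-Id` header name, the in-memory `SPANS` list standing in for a tracing server, and the two example services are all assumptions.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Dict, List, Optional

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name
_current_trace: ContextVar[Optional[str]] = ContextVar("current_trace", default=None)
SPANS: List[dict] = []  # stand-in for a centralized tracing server


def start_trace(incoming_headers: Dict[str, str]) -> str:
    """Reuse the caller's trace ID, or create one for a new external request."""
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _current_trace.set(trace_id)
    return trace_id


def outgoing_headers() -> Dict[str, str]:
    """Headers to attach to calls made to downstream services."""
    trace_id = _current_trace.get()
    return {TRACE_HEADER: trace_id} if trace_id else {}


@contextmanager
def span(service: str, operation: str):
    """Record how long one operation took within the current trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({
            "trace_id": _current_trace.get(),
            "service": service,
            "operation": operation,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def inventory_service(headers: Dict[str, str]) -> str:
    start_trace(headers)
    with span("inventory", "reserve_stock"):
        return "reserved"


def order_service(headers: Dict[str, str]) -> str:
    start_trace(headers)
    with span("order", "create_order"):
        # The downstream call carries the same trace ID in its headers.
        return inventory_service(outgoing_headers())
```

Calling `order_service({})` produces two spans that share one trace ID, which is what lets a tracing server stitch them into a single request timeline.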

4. Exception tracking pattern

Services should record exceptions, because an exception indicates a problem or a programming error and is often the key to identifying the root cause. One option is to look for exceptions in the logs, and a log server can even be configured to alert operations staff when exceptions appear. This approach has a few drawbacks:

  • Log files consist of single-line log entries, while exceptions have multiple lines.
  • In log files, there is no mechanism to track exception resolution. The exception needs to be manually copy/pasted into the issue tracker.
  • Log files offer no mechanism for automatically treating repeated occurrences of the same exception as a single exception.

A better approach is to use an exception tracking service. Services report exceptions to a centralized service that de-duplicates them, generates alerts, and manages their resolution. Exception tracking can be implemented with tools such as Honeybadger or Sentry.
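
De-duplication is the part that plain log files lack, so it is worth sketching. The example below groups exceptions by a fingerprint built from the exception type and its stack frames, ignoring the message text. It is an illustrative in-memory stand-in, not the grouping algorithm of Sentry or Honeybadger; the `ExceptionTracker` class and its methods are hypothetical.

```python
import hashlib
import traceback
from collections import Counter
from typing import Dict


class ExceptionTracker:
    """In-memory stand-in for a centralized exception tracking service."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()   # fingerprint -> occurrences
        self.samples: Dict[str, str] = {}  # fingerprint -> first stack trace

    @staticmethod
    def fingerprint(exc: BaseException) -> str:
        """Identify an exception by its type and where it was raised."""
        frames = traceback.extract_tb(exc.__traceback__)
        parts = [type(exc).__name__]
        parts += [f"{f.filename}:{f.name}:{f.lineno}" for f in frames]
        return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:12]

    def report(self, exc: BaseException) -> bool:
        """Record an occurrence; return True if this is a new kind of exception."""
        key = self.fingerprint(exc)
        self.counts[key] += 1
        is_new = self.counts[key] == 1
        if is_new:
            # Keep the full multi-line trace once, and alert only on first sight.
            self.samples[key] = "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            )
        return is_new
```

A service would call `tracker.report(err)` inside its exception handlers and notify developers only when the return value is `True`, so a failure that repeats thousands of times produces one alert and a running count.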

5. Application metrics pattern

Monitoring and alerting are critical components of a production environment. A monitoring system collects metrics from every part of the technology stack, and these metrics provide critical information about the health of an application. They range from infrastructure-level metrics such as CPU, memory, and disk utilization to application-level metrics such as service request latency and the number of requests processed.

Metrics are the responsibility of the service developer in two ways. First, the service must be instrumented to collect metrics about its behavior. Second, these service metrics, together with metrics from the JVM and the application framework, must be delivered to the metrics server. Delivery can follow a push model, in which the service sends metrics to the metrics service (as with AWS CloudWatch), or a pull model, in which the metrics server polls an endpoint on the service (as with Prometheus). Grafana is a data visualization tool that can be used to view the metrics stored in Prometheus.
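
Both responsibilities, instrumenting the service and exposing the result, can be illustrated with a small sketch. The code below keeps a counter and a gauge in memory and renders them in the plain-text `# HELP` / `# TYPE` / `name value` layout used by the Prometheus exposition format, suitable for serving from a metrics endpoint. The `Counter`, `Gauge`, and `Registry` classes are hypothetical and written for this example; a real service would more likely use an existing metrics library.

```python
import threading
from typing import List


class Counter:
    """A value that only goes up, such as the number of requests processed."""

    kind = "counter"

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        with self._lock:
            self._value += amount

    def value(self) -> float:
        return self._value


class Gauge:
    """A value that can go up or down, such as requests currently in flight."""

    kind = "gauge"

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._value = 0.0
        self._lock = threading.Lock()

    def set(self, value: float) -> None:
        with self._lock:
            self._value = float(value)

    def value(self) -> float:
        return self._value


class Registry:
    """Holds the service's metrics and renders them for a metrics server."""

    def __init__(self) -> None:
        self._metrics: List = []

    def register(self, metric):
        self._metrics.append(metric)
        return metric

    def render(self) -> str:
        lines = []
        for m in self._metrics:
            lines.append(f"# HELP {m.name} {m.help_text}")
            lines.append(f"# TYPE {m.name} {m.kind}")
            lines.append(f"{m.name} {m.value()}")
        return "\n".join(lines) + "\n"
```

In the pull model, the service would return `registry.render()` from an HTTP endpoint that the metrics server polls; in the push model, it would send the same values to the metrics service on a timer.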

6. Audit log pattern

An audit log records the actions of each user. Audit logs are typically used to support customers, ensure compliance, and detect suspicious activity. Each audit log entry records the identity of the user, the action they performed, and the business objects involved. Audit logs are usually stored in database tables.

Audit logs can be implemented in several different ways:

  1. Add audit log code to the business logic: each service method creates an audit log entry and saves it to the database.
  2. Use aspect-oriented programming (AOP): an AOP framework such as Spring AOP lets you define advice that intercepts each service method call and persists an audit log entry.
  3. Use event sourcing: event sourcing automatically provides an audit log for create and update operations.
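
The second option can be approximated in Python with a decorator, which plays roughly the role that advice plays in an AOP framework: the audit code wraps the service method instead of living inside it. This is a sketch under stated assumptions, not Spring AOP itself. The `audit_log` table layout, the `audited` decorator, and the `cancel_order` method are hypothetical, and an in-memory SQLite database stands in for the real audit store.

```python
import functools
import sqlite3
from datetime import datetime, timezone

# In-memory database standing in for the service's audit store.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE audit_log (at TEXT, user_id TEXT, action TEXT, object_id TEXT)"
)


def audited(action: str):
    """Wrap a service method so that each successful call is audited.

    Assumes the wrapped method takes the user ID and the business object ID
    as its first two arguments.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user_id, object_id, *args, **kwargs):
            result = func(user_id, object_id, *args, **kwargs)
            # Reached only if the method did not raise: record who did what to which object.
            db.execute(
                "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
                (datetime.now(timezone.utc).isoformat(), user_id, action, object_id),
            )
            db.commit()
            return result
        return wrapper
    return decorator


@audited("CANCEL_ORDER")
def cancel_order(user_id: str, order_id: str) -> str:
    """Hypothetical service method containing only business logic."""
    return f"order {order_id} cancelled"
```

The trade-off mirrors the one between the first two options: the decorator keeps audit code out of the business logic, but it only sees the method name and arguments, so it may not be able to describe the affected business object in as much detail as hand-written audit code.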

Ultimately, observability is not about logs, metrics, or traces in themselves, but about being data-driven when debugging and using that feedback to iterate on and improve the product.

Origin blog.csdn.net/stone1290/article/details/126295860