Observability: The power of effective log management in software development and operations

Authors: Luca Wintergerst, David Hope, Bahubali Shetty

Today's rapid software development processes rely on ever-expanding, complex infrastructure and application components, and the responsibilities of operations and development teams keep growing and becoming more multifaceted. Observability helps teams manage and analyze telemetry data, and it is key to ensuring the performance and reliability of applications and infrastructure. In particular, logging is the primary signal that developers enable by default, providing a wealth of detailed information for debugging, performance analysis, security, and compliance management. So how do you define a strategy that gets the most out of your logs?

In this blog post we will explore:

  • The logging journey: collecting, processing and enriching, and analyzing and rationalizing logs
  • The difference between structured and unstructured logs, and how to manage each
  • Whether tracing should replace logging
  • How to improve log operational efficiency by reducing the time spent transforming data, weighing centralized versus decentralized log storage, and knowing how and when to store less

By gaining a deeper understanding of these topics, you will be better able to effectively manage logs and ensure the reliability, performance, and security of your applications and infrastructure.

The logging journey

The logging process involves three basic steps:

  • Collection and ingestion
  • Parsing and processing
  • Analysis and rationalization

Let's cover how to collect and ingest logs, what proper parsing and processing looks like, and how to analyze and rationalize logs. We'll also discuss how to enhance this journey with application performance monitoring (APM), metrics, and security events.

Log collection

The first step in your logging journey is to collect all the logs you have into one central location. This involves identifying all your applications and systems and collecting their logs.

A key aspect of log collection is maximizing standardization across all applications and infrastructure. Having a common semantic schema is very important. Elastic has contributed the Elastic Common Schema (ECS) to OpenTelemetry (OTel) to help accelerate the adoption of OTel-based observability and security, which moves the industry toward a more prescriptive way of defining and ingesting logs (as well as metrics and traces).
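
To make the idea of a common schema concrete, here is a minimal Python sketch that emits log records as JSON using a handful of ECS-style field names (@timestamp, log.level, message, source.ip). The formatter and field selection are illustrative assumptions; the exact fields depend on the schema your pipeline standardizes on.

    import json
    import logging
    from datetime import datetime, timezone

    class EcsStyleFormatter(logging.Formatter):
        """Render log records as JSON documents with ECS-style field names."""

        def format(self, record: logging.LogRecord) -> str:
            doc = {
                "@timestamp": datetime.now(timezone.utc).isoformat(),
                "log.level": record.levelname.lower(),
                "message": record.getMessage(),
                # Optional structured context attached by the caller via `extra`.
                "source.ip": getattr(record, "source_ip", None),
            }
            return json.dumps({k: v for k, v in doc.items() if v is not None})

    logger = logging.getLogger("checkout")
    handler = logging.StreamHandler()
    handler.setFormatter(EcsStyleFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("payment accepted", extra={"source_ip": "10.42.42.42"})

When every source emits the same field names, downstream parsing, correlation, and dashboarding become far simpler.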

Another important aspect of log collection is ensuring you have adequate ingest capacity at a low cost. When making this assessment, you may want to be cautious about solutions that charge multiples for high-cardinality data. Some observability solutions will charge a premium for this, which is why it's crucial to understand scaling capabilities and costs during this step.

Process and enrich logs

After collecting the logs, you can start processing the log files and placing them into a common schema. This helps standardize your logs and aggregate the most important information more easily. Standardized schemas also make it easier to correlate data across sources. When processing log data into a normalized schema, you need to consider:

  • Most observability tools have out-of-the-box integrations for transforming data from well-known sources into specific known patterns. If no out-of-the-box integration is available, you can use regular expressions or similar methods to parse the information.
  • Log processing is usually done in some kind of ingest pipeline. You can use scalable and complex architectures involving streaming data pipelines with Kafka, Azure Service Bus, or Pulsar, or even processing frameworks like Logstash or Kafka Streams. There's no right or wrong answer, but our advice is to be careful not to add too much complexity here. If you add a streaming data processing step, there is a danger of ending up with a months-long project and a lot of code to maintain.

Elastic provides multiple ways to ingest (e.g., AWS Kinesis Data Firehose) and process data (parse unstructured logs, enrich logs, and even categorize them correctly).
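
As a rough sketch of what such processing involves (not Elastic's actual pipeline API), a pipeline can be modeled as a chain of small steps that parse, enrich, and classify each event before it is indexed. The log format, field names, and classification rule below are assumptions for the example.

    import re
    from typing import Callable, Optional

    Event = dict
    Processor = Callable[[Event], Optional[Event]]

    # Assumed raw line format: "<source_ip> <status> <message>"
    LINE = re.compile(r"(?P<source_ip>\S+) (?P<status>\d{3}) (?P<message>.+)")

    def parse(event: Event) -> Optional[Event]:
        match = LINE.match(event["raw"])
        if not match:
            return None              # unparseable lines could be routed to a dead-letter queue
        event.update(match.groupdict())
        return event

    def enrich(event: Event) -> Event:
        event["service.name"] = "checkout"   # assumed static metadata for this source
        return event

    def classify(event: Event) -> Event:
        event["log.level"] = "error" if event["status"].startswith("5") else "info"
        return event

    def run_pipeline(event: Event, processors: list) -> Optional[Event]:
        for step in processors:
            event = step(event)
            if event is None:
                return None
        return event

    print(run_pipeline({"raw": "10.42.42.42 500 payment gateway timeout"}, [parse, enrich, classify]))

Real ingest pipelines add error handling, batching, and back pressure, but the shape of the work is the same: parse, enrich, classify, index.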

Once the data is ingested, it may need to be extracted and enriched, and metrics may need to be created based on the data in the log files. There are many ways to do this:

  • Some observability tools can perform runtime transformations to dynamically extract fields from unparsed sources at query time. This is often called schema on read. It is useful when dealing with legacy systems or custom applications that may not log data in a standardized format. However, runtime parsing can be time-consuming and resource-intensive, especially for large amounts of data.
  • Schema on write, on the other hand, provides better performance and more control over the data. The schema is predefined, and the data is structured and validated as it is written. This allows faster processing and analysis of the data, which is also beneficial for enrichment (see the sketch after this list).
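
Here is a minimal sketch of that trade-off, using assumed field names: with schema on read, parsing happens on every query; with schema on write, the line is parsed and enriched once at ingest, and queries then run against structured fields.

    import re

    RAW = ["10.42.42.42 - 500 /checkout 1287ms", "10.0.0.7 - 200 /health 3ms"]
    PATTERN = re.compile(r"(?P<source_ip>\S+) - (?P<status>\d{3}) (?P<path>\S+) (?P<latency_ms>\d+)ms")

    # Schema on read: keep raw lines and pay the parsing cost on every query.
    def errors_on_read(lines):
        hits = []
        for line in lines:
            match = PATTERN.match(line)
            if match and match.group("status").startswith("5"):
                hits.append(match.groupdict())
        return hits

    # Schema on write: parse and enrich once at ingest, then query the structured documents.
    INDEXED = []
    for line in RAW:
        match = PATTERN.match(line)
        if match:
            doc = match.groupdict()
            doc["status"] = int(doc["status"])
            doc["environment"] = "production"   # enrichment added at write time (assumed label)
            INDEXED.append(doc)

    def errors_on_write(docs):
        return [d for d in docs if d["status"] >= 500]

    print(errors_on_read(RAW))
    print(errors_on_write(INDEXED))

Both return the failing request; the difference is where the parsing cost is paid and how easy the data is to enrich and aggregate afterwards.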

Log analysis

Search, pattern matching and dashboards

The logical and traditional next step is to build dashboards and set up alerts to help you monitor your logs in real time. Dashboards provide a visual representation of logs, making it easier to identify patterns and anomalies. Alerts notify you when specific events occur so you can take action quickly. Leverage the power of full-text search to easily create metrics from logs. For example, you can search for the term "error" and the search will find all matching entries across all sources; the results can then be displayed on a dashboard or used to trigger an alert. You can also search logs and look for patterns directly from the dashboard.
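
As a simplified, tool-agnostic sketch of turning a full-text search into a metric and an alert, the snippet below counts "error" matches per source and flags a source when a threshold is crossed. The source names and threshold are assumptions.

    from collections import Counter

    # Assumed example log streams, keyed by source name.
    SOURCES = {
        "payments": ["connection ok", "error: upstream timeout", "error: retry budget exhausted"],
        "frontend": ["page rendered", "error: asset missing"],
    }

    ALERT_THRESHOLD = 1  # assumed: alert when a source logs more than this many errors

    def error_counts(sources: dict) -> Counter:
        """Count log lines containing 'error' per source, a simple log-derived metric."""
        counts = Counter()
        for name, lines in sources.items():
            counts[name] = sum("error" in line.lower() for line in lines)
        return counts

    for source, count in error_counts(SOURCES).items():
        print(f"metric logs.errors{{source={source}}} = {count}")
        if count > ALERT_THRESHOLD:
            print(f"ALERT: {source} logged {count} errors")

In a real deployment the search, metric, and alert would live in your observability platform rather than a script, but the flow is the same: search, count, compare against a threshold, notify.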

Machine learning can then be used to analyze the logs and identify patterns that may not be visible through manual analysis. For example, it can detect anomalies in logs, such as unusual traffic or error rates. Additionally, features such as log categorization, which automatically tracks the types of log data being indexed and detects any changes in those patterns, can be useful. This can surface new application behavior or other interesting events that you might otherwise have missed.

Machine Learning - Uncovering the unknown lurking in log files

Organizations should use a range of machine learning algorithms and techniques to improve their odds of effectively discovering the unknowns hiding in log files.

Unsupervised machine learning algorithms such as clustering (x-means, BIRCH), time series decomposition, Bayesian inference, and correlation analysis should be employed for anomaly detection on real-time data, with rate-controlled alerts based on severity.

By automatically identifying influencers, users can gain valuable contextual information for automated root cause analysis (RCA). Log pattern analysis categorizes unstructured logs, while log rate analysis and change point detection help identify the root causes of spikes in log data. Combined, these methods provide a powerful way to uncover hidden insights and issues within log files.
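
As a toy illustration of log rate analysis (not the algorithm any particular product uses), the sketch below flags a spike when the log count in a time bucket exceeds the rolling mean of earlier buckets by several standard deviations. The bucket counts, window size, and threshold are assumptions.

    from statistics import mean, pstdev

    # Assumed per-minute log counts for a single service.
    counts_per_minute = [120, 118, 125, 122, 119, 121, 480, 510, 130]

    def detect_spikes(counts, window=5, sigma=3.0):
        """Flag buckets whose count exceeds the rolling mean by `sigma` standard deviations."""
        spikes = []
        for i in range(window, len(counts)):
            history = counts[i - window:i]
            mu, sd = mean(history), pstdev(history)
            if sd and counts[i] > mu + sigma * sd:
                spikes.append((i, counts[i]))
        return spikes

    print(detect_spikes(counts_per_minute))  # [(6, 480)]: the first minute of the spike stands out

Production-grade anomaly detection models seasonality and trend as well, but even this simple baseline shows how a sudden change in log rate can be surfaced automatically.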

In addition to helping with RCA, uncovering unknown unknowns, and improving troubleshooting, organizations should also look to predictive capabilities that help them forecast future needs and calibrate against business goals.

Enhance the journey

Finally, when collecting logs, you may also want to collect metrics or APM data. Metrics give you insight into system performance, such as CPU usage, memory usage, and network traffic, while APM helps you identify problems in your applications, such as slow response times or errors. While not required, metrics in particular go hand in hand with log data and are very easy to set up: if you're already collecting logs from a system, collecting additional metrics usually takes no more than a few minutes. As an operations person, you're tracking not only the security of your deployments but also who and what is being deployed, so adding security events alongside APM data and metrics can help complete the picture.
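
To show how little extra work basic metrics collection can be, here is a minimal sketch that samples a few host metrics with the psutil library (a third-party package) next to your logs. The field names and sampling interval are assumptions, and a real agent would ship these samples to your observability backend rather than print them.

    import json
    import time
    from datetime import datetime, timezone

    import psutil  # third-party: pip install psutil

    def sample_host_metrics() -> dict:
        """Collect a few basic host metrics to ship alongside log data."""
        return {
            "@timestamp": datetime.now(timezone.utc).isoformat(),
            "system.cpu.percent": psutil.cpu_percent(interval=1),
            "system.memory.percent": psutil.virtual_memory().percent,
            "system.network.bytes_sent": psutil.net_io_counters().bytes_sent,
        }

    if __name__ == "__main__":
        for _ in range(3):              # sample a few times for demonstration
            print(json.dumps(sample_host_metrics()))
            time.sleep(5)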

Structured logs vs. unstructured logs?

A common challenge in the logging process is managing unstructured and structured logs, especially parsing during ingestion.

Manage logs using known patterns

Fortunately, some tools offer extensive built-in integration for popular sources such as Apache, Nginx, Docker, and Kubernetes, among others. These integrations make it easy to collect logs from different sources without spending time building custom parsing rules or dashboards. This saves time and effort and allows teams to focus on analyzing logs rather than parsing or visualizing them.

If your logs come from less common sources, such as custom applications, you may need to define your own parsing rules. Many observability and logging tools provide this functionality.

Overall, structured logs are preferable to unstructured logs because they provide more value and are easier to use. Structured logs have predefined formats and fields, which makes it easier to extract information and perform analysis.

What about unstructured logs?

Full-text search capabilities in observability tools can help reduce concerns about the potential limitations of unstructured logs. Full-text search is a powerful tool that can extract meaningful information from unstructured logs by indexing them. Full-text search allows users to search logs for specific keywords or phrases, even if the logs are not parsed.

One of the main advantages of indexing logs by default is that it makes searching them efficient. When logs are indexed up front, even large volumes of data can be searched quickly to find the specific information you're looking for. This saves time and effort when analyzing logs and helps you identify problems faster. With everything indexed and searchable, you don't have to write regular expressions, learn a complex query language, or wait a long time for searches to complete.
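
To make the idea concrete, here is a toy inverted index in Python (far simpler than what a real search engine builds): each term maps to the set of log lines that contain it, so a keyword lookup is a dictionary access rather than a scan or regex match over every raw line.

    from collections import defaultdict

    logs = [
        "payment accepted for order 1001",
        "error: upstream timeout contacting payment gateway",
        "health check ok",
        "error: retry budget exhausted",
    ]

    # Build the index once, at write time: term -> set of log line ids.
    index = defaultdict(set)
    for doc_id, line in enumerate(logs):
        for term in line.lower().replace(":", "").split():
            index[term].add(doc_id)

    def search(term: str) -> list:
        """Return all log lines containing the term, via a single dictionary lookup."""
        return [logs[doc_id] for doc_id in sorted(index.get(term.lower(), set()))]

    print(search("error"))   # both error lines, without scanning every document

Real engines add tokenization, relevance scoring, and compressed data structures, but the core reason unparsed logs remain searchable is the same: the work is done once at index time.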

Should you replace logs with traces?

Logging has long been a cornerstone of application monitoring and debugging. However, while logs are an integral part of any monitoring strategy, they are not the only option available.

For example, traces are a valuable addition to logs and can provide deeper insight into a specific transaction or request path. In fact, if implemented correctly, tracing can complement logging as an additional form of instrumentation for the application (as opposed to the infrastructure), especially in cloud-native environments. Tracing provides more contextual information and is particularly good at tracking dependencies in an environment. Ripple effects are easier to see in trace data than in log data, because individual interactions are tracked end to end across services.

Although tracing provides a variety of advantages, it is also important to consider its limitations. Trace instrumentation only works for the applications you own, and it does not cover infrastructure, since tracing is application-only. Nor do all developers fully embrace tracing as an alternative to logging: many still default to logging as their primary means of instrumentation, which makes it difficult to adopt tracing as the sole monitoring strategy.

Therefore, it is important to adopt a combined logging and tracing strategy, in which tracing covers newly instrumented applications and logging supports legacy applications and systems whose source code you do not own or cannot change.
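
One practical way to combine the two signals is to stamp the active trace ID onto every log line so logs and traces can be correlated. Below is a minimal sketch using the OpenTelemetry Python API; it assumes the opentelemetry-api and opentelemetry-sdk packages are installed and a tracer provider has been configured, otherwise the IDs will be zero.

    import logging

    from opentelemetry import trace

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("checkout")
    tracer = trace.get_tracer(__name__)

    def handle_request():
        # The span represents this request in the trace; the log line carries its IDs.
        with tracer.start_as_current_span("handle_request"):
            ctx = trace.get_current_span().get_span_context()
            logger.info(
                "request processed trace.id=%s span.id=%s",
                format(ctx.trace_id, "032x"),
                format(ctx.span_id, "016x"),
            )

    handle_request()

With the trace ID in the log line, you can pivot from a suspicious log entry to the full end-to-end trace of that request, and back.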

Improve log operational efficiency

When managing logs, three main issues reduce operational efficiency:

  • The time it takes to transform logs
  • Managing the storage and retrieval of log data
  • Deciding when and what to delete from logs

We discuss strategies for managing these issues below.

Reduce the time spent transforming data

With large volumes of logs across varying schemas, and even unstructured logs, organizations can end up spending more time on unnecessary data transformation than on understanding problems, analyzing root causes, or optimizing operations.

By structuring data according to a common schema, operations teams can focus on identifying, resolving, and preventing issues while reducing mean time to resolution (MTTR). They can also reduce costs by avoiding duplicate data and the extra processing needed to standardize it.

The industry is pursuing this standardization through the OpenTelemetry Semantic Conventions project, which aims to establish a common schema for logs, metrics, traces, and security events. Logging benefits especially from the recent contribution of the Elastic Common Schema (ECS) to OpenTelemetry, which helps strengthen the common-schema journey.

A simple illustrative example is when a client's IP address is sent from multiple sources that monitor or manage telemetry data about the client.

src:10.42.42.42
client_ip:10.42.42.42
apache2.access.remote_ip:10.42.42.42
context.user.ip:10.42.42.42
src_ip:10.42.42.42

Representing IP addresses in multiple ways can complicate analyzing potential problems or even identifying them.

With a common schema, all incoming data arrives in a standardized format. Using the example above, every source identifies the client's IP address in the same way:

source.ip:10.42.42.42

This helps reduce the need to spend time converting data.
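
A minimal sketch of this kind of normalization is shown below; the alias list is an assumption, and in practice a schema-aware integration or ingest pipeline would handle the renaming for you.

    # Field names seen across different sources that all mean the client IP.
    SOURCE_IP_ALIASES = ["src", "client_ip", "apache2.access.remote_ip", "context.user.ip", "src_ip"]

    def normalize_source_ip(event: dict) -> dict:
        """Rename any known client-IP alias to the common field name source.ip."""
        for alias in SOURCE_IP_ALIASES:
            if alias in event and "source.ip" not in event:
                event["source.ip"] = event.pop(alias)
        return event

    print(normalize_source_ip({"src": "10.42.42.42", "message": "login accepted"}))
    # {'message': 'login accepted', 'source.ip': '10.42.42.42'}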

Centralized vs decentralized log storage

Data locality is an important consideration when managing log data. The cost of transferring large amounts of log data in and out can be very high, especially when dealing with cloud providers.

Without zone redundancy requirements, your organization may not have a compelling reason to send all log data to a central location. Your logging solution should provide a way to allow your organization to keep its log data local to the data center where it was generated. This approach helps reduce the ingress and egress costs associated with transmitting large amounts of data over the network.

The cross-cluster (or cross-deployment) search feature enables users to search across multiple logging clusters simultaneously. This helps reduce the amount of data that needs to be transferred over the network.

If your organization needs to maintain business continuity in the event of a disaster, cross-cluster replication is another useful feature you may want to consider. With cross-cluster replication, organizations can ensure that their data is available by automatically replicating it to a second destination even in the event of a failure in one of the data centers. Typically, this approach is used only for the most critical data, while leveraging the aforementioned data locality for the largest data sources.

Another related topic worth considering is vendor lock-in. Some logging solutions lock the data after it is acquired, making it difficult or impossible to extract the data if the organization wants to switch vendors.

This is not the case with every vendor and organizations should check to ensure they have full access to all raw data at all times. This flexibility is critical for organizations that want to be able to switch vendors without being locked into a proprietary logging solution.

Handling large volumes of log data: log everything, discard everything, or reduce?

Handling large volumes of log data can be challenging, especially as the amount of data generated by systems and applications continues to grow. Compliance requirements may mandate logging certain data, and in some cases the logging produced by certain systems may be beyond your control. Reducing the amount of data may seem impractical, but a few optimizations can help:

1) Collect everything, but develop a log deletion policy. Organizations should evaluate what data is collected and how long it needs to be kept. Discarding data at the source can cause problems later if you discover you need it for troubleshooting or analysis, so focus instead on when to delete it. Where possible, set up a policy that automatically deletes old data; this reduces manual cleanup and the risk of accidentally deleting logs that are still needed.

A good practice is to discard DEBUG logs and even INFO logs as early as possible, and delete development and staging environment logs as soon as possible. Depending on the product you use, you have the flexibility to set the retention period for each log source. For example, you can retain staging logs for seven days, development logs for one day, and production logs for one year. You can even go a step further and split it by application or other attributes.
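
Here is a simple sketch of per-source retention; the retention periods and the 30-day default are assumptions, and in practice a product's lifecycle-management features would enforce this rather than a hand-written script.

    from datetime import datetime, timedelta, timezone
    from typing import Optional

    # Assumed retention policy, in days, per log source.
    RETENTION_DAYS = {"production": 365, "staging": 7, "development": 1}

    def expired(indexed_at: datetime, source: str, now: Optional[datetime] = None) -> bool:
        """Return True if a log document from `source` is past its retention period."""
        now = now or datetime.now(timezone.utc)
        return now - indexed_at > timedelta(days=RETENTION_DAYS.get(source, 30))  # assumed 30-day default

    doc_time = datetime.now(timezone.utc) - timedelta(days=10)
    print(expired(doc_time, "staging"))     # True: staging logs are kept for 7 days
    print(expired(doc_time, "production"))  # False: production logs are kept for a year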


2) A further optimization is to aggregate short windows of identical log lines, which is especially useful for TCP security event logging: collapse the identical lines into a single record with a count plus start and end timestamps. This approach can be extended in ways that preserve sufficient fidelity while saving a great deal of storage, and preprocessing tools can do it for you. A minimal sketch follows below.
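
The window size, input format, and output fields in this sketch are assumptions.

    from collections import defaultdict

    # Assumed (timestamp_seconds, message) pairs for identical, repetitive events.
    events = [
        (100, "TCP connection denied from 10.42.42.42"),
        (102, "TCP connection denied from 10.42.42.42"),
        (104, "TCP connection denied from 10.42.42.42"),
        (161, "TCP connection denied from 10.42.42.42"),
    ]

    def aggregate(events, window_seconds=60):
        """Collapse identical messages within a time window into one record with a count."""
        buckets = defaultdict(lambda: {"count": 0, "start": None, "end": None, "message": None})
        for ts, message in events:
            bucket = buckets[(message, ts // window_seconds)]
            bucket["count"] += 1
            bucket["start"] = ts if bucket["start"] is None else min(bucket["start"], ts)
            bucket["end"] = max(bucket["end"] or ts, ts)
            bucket["message"] = message
        return list(buckets.values())

    for record in aggregate(events):
        print(record)   # two summary records instead of four raw lines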


3) For applications and code you control, you might consider moving some logs into traces, which will help reduce the amount of logs. But traces still constitute additional data.

Putting it all together

In this article, we provided a view of the logging process and several challenges to consider. In walking through the logging process, which involves three basic steps - collection and ingestion, parsing and processing, and analysis and rationalization - we highlighted the following points:

  • It is important to ensure there are no blind spots in the infrastructure during the log collection phase and focus on producing structured logs at the source.
  • Extracting data and enriching logs can be achieved through schema-on-read or schema-on-write approaches, and machine learning can be used to analyze logs and identify patterns that may not be visible through manual analysis.
  • Organizations should also collect metrics or APM data to enhance their logging journey.

We discussed several key challenges:

  • Structured logs are easier to use and provide more value, but unstructured logs can be managed through full-text search capabilities.
  • Tracing can provide deeper insights into specific transactions and dependencies, potentially replacing logging as the primary means of instrumentation, but because of its limitations and uneven developer adoption, a combined logging and tracing strategy is important.
  • Organizations can improve log operational efficiency by reducing the time spent transforming data, choosing between centralized and decentralized log storage, and handling large volumes of log data through log deletion policies and aggregation of identical log lines over short windows.

Original article: Effective log management in software development and operations | Elastic Blog
