How to Use Elasticsearch and Time Series Data Streams for Observability Metrics - 8.7

Author: Nicolas Ruflin

Elasticsearch is used for many types of data - one of which is metrics. Metrics use cases became even more popular with the launch of Metricbeat many years ago and, later, our APM agents. Over the years, Elasticsearch has made many improvements in how it handles metrics aggregations and sparse documents, among other things. At the same time, TSVB visualizations were introduced to make it easier to visualize metrics. One concept was still missing, though, which exists in most other metrics solutions: time series with dimensions.

In mid-2021, the Elasticsearch team set out to make Elasticsearch more suitable for metrics. The team created time series data streams (TSDS), which became generally available (GA) in 8.7.

This blog post dives into how TSDS works, how we use it in Elastic Observability, and how you can use it for your own metrics.

A quick introduction to TSDS

Time series data streams (TSDS) build on top of data streams in Elasticsearch and are optimized for time series. To create a data stream for metrics, an additional setting on the data stream is required. Because we are using data streams, we first have to create an index template:

PUT _index_template/metrics-laptop
{
  "index_patterns": [
    "metrics-laptop-*"
  ],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.mode": "time_series"
    },
    "mappings": {
      "properties": {
        "host.name": {
          "type": "keyword",
          "time_series_dimension": true
        },
        "packages.sent": {
          "type": "integer",
          "time_series_metric": "counter"
        },
        "memory.usage": {
          "type": "double",
          "time_series_metric": "gauge"
        }
      }
    }
  }
}

Let's take a closer look at this template. At the top, we set the index pattern to metrics-laptop-*. Any pattern can be chosen, but following the data stream naming scheme is recommended for all your metrics. The next section sets "index.mode": "time_series" and, with "data_stream": {}, makes sure the template creates a data stream.

Dimensions

Every time series data stream requires at least one dimension. In the example above, host.name is set as a dimension field with "time_series_dimension": true. By default there can be up to 16 dimensions. Not every dimension has to appear in every document. Dimensions define time series. As a general rule, choose as dimensions the fields that uniquely identify your time series. Usually this is a unique identifier of the host or container, but for some metrics, such as disk metrics, the disk ID is also required. If you're curious about the dimensions recommended by default, check out this ECS contribution with dimension properties.
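To make the idea that dimensions define a time series more concrete, here is a minimal Python sketch. It only illustrates the grouping concept and is not how Elasticsearch computes its internal time series identifier. The fields disk.id and disk.used and the helper series_key are hypothetical names introduced for this example.

```python
from collections import defaultdict

# Hypothetical dimension fields for a disk metrics data stream.
DIMENSIONS = ("host.name", "disk.id")


def series_key(doc, dimensions=DIMENSIONS):
    """Build a series identity from the dimension values present in a document."""
    return tuple((name, doc[name]) for name in dimensions if name in doc)


docs = [
    {"@timestamp": "2023-03-30T12:00:00Z", "host.name": "a", "disk.id": "sda", "disk.used": 10},
    {"@timestamp": "2023-03-30T12:01:00Z", "host.name": "a", "disk.id": "sda", "disk.used": 11},
    {"@timestamp": "2023-03-30T12:00:00Z", "host.name": "a", "disk.id": "sdb", "disk.used": 70},
    {"@timestamp": "2023-03-30T12:00:00Z", "host.name": "b", "disk.id": "sda", "disk.used": 5},
]

# Documents that share all dimension values belong to the same time series.
series = defaultdict(list)
for doc in docs:
    series[series_key(doc)].append(doc)

for key, points in series.items():
    print(dict(key), len(points))
```

With host.name as the only dimension, both disks of host "a" would collapse into a single series, which is why disk metrics also need the disk ID.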

Reduce storage and improve query speed

At this point, you have a functioning time series data stream. Setting the index mode to time series automatically turns on synthetic source. By default, Elasticsearch typically stores data three times:

  • row-oriented storage (the _source field)
  • column-oriented storage (doc_values: true, used for aggregations)
  • indices (index: true, used for filtering and searching)

With synthetic source, the _source field is not persisted; instead, it is reconstructed from the doc values. Especially in the metrics use case, there is little benefit in keeping the source.
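As a rough mental model of rebuilding _source from doc values, the following Python sketch keeps only per-field columns and reassembles a document on demand. This is a simplified illustration, not the Elasticsearch implementation; the real feature has restrictions that this sketch ignores, and rebuild_source is a hypothetical helper name.

```python
# Column-oriented storage: one list of values per field, aligned by document position.
doc_values = {
    "@timestamp": ["2023-03-30T12:26:23Z", "2023-03-30T12:27:23Z"],
    "host.name": ["ruflin.com", "ruflin.com"],
    "packages.sent": [1000, 1010],
    "memory.usage": [0.8, 0.7],
}


def rebuild_source(columns, position):
    """Reassemble one document from the columns instead of reading a stored copy."""
    return {field: values[position] for field, values in columns.items()}


print(rebuild_source(doc_values, 0))
```

Because every value already lives in a column, keeping a second, row-oriented copy of the same document adds storage without adding information.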

Not storing it means a significant reduction in storage. Time series data streams sort data by dimensions and timestamp. This means that data that is normally queried together is stored together, which speeds up queries. It also means that the data points of a single time series are stored side by side on disk. Since the rate at which a counter increases is usually relatively constant, the data can be compressed further.
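The effect of sorting by dimension and timestamp can be sketched in a few lines of Python. The sort direction and the simple delta encoding below are assumptions chosen for illustration; they are not a description of the exact codec Elasticsearch uses. The sample values and the helper delta_encode are made up for the example.

```python
def delta_encode(values):
    """Keep the first value, then only the difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


# (dimension value, timestamp in seconds, counter value), arriving interleaved.
points = [
    ("host-b", 120, 5030), ("host-a", 60, 1010), ("host-a", 0, 1000),
    ("host-b", 0, 5000), ("host-a", 120, 1020), ("host-b", 60, 5015),
]

# Sort by dimension first, then by timestamp, so each series is contiguous.
points.sort(key=lambda p: (p[0], p[1]))

counters_a = [value for host, _, value in points if host == "host-a"]
print(counters_a)                # counter values of one series, in time order
print(delta_encode(counters_a))  # small, repetitive numbers after the first value
```

Once the values of one series sit next to each other, the differences between neighbors are small and repetitive, which is what makes them cheap to store.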

Metric type

But to benefit from all the advantages of TSDS, the field properties of the metric fields must be extended with time_series_metric: {type}. Multiple types are supported - for example, gauge and counter are used above. Telling Elasticsearch the metric type allows it to offer queries optimized for each type and to reduce storage usage further.

When you create your own templates for data streams that follow the data stream naming scheme, it is important to set "priority": 200 or higher; otherwise the built-in default template will be applied.
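Why the metric type matters for queries can be seen in a small Python sketch: a counter is usually read as an increase over time and has to tolerate resets, while a gauge can simply be averaged. The two helpers and the sample values are hypothetical and only illustrate the difference in semantics; they are not Elasticsearch aggregations.

```python
def counter_increase(values):
    """Total increase of a counter, treating any drop as a reset to zero."""
    total = 0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted, so everything counted since the reset is new.
            total += current
    return total


def gauge_average(values):
    """A gauge is a point-in-time reading, so a plain average is meaningful."""
    return sum(values) / len(values)


packages_sent = [1000, 1040, 1100, 20, 70]  # the process restarted before the 4th sample
memory_usage = [0.5, 0.25, 0.75]

print(counter_increase(packages_sent))
print(gauge_average(memory_usage))
```

Averaging the counter samples, or summing the gauge samples, would produce numbers with no useful meaning, which is why the type has to be declared per field.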
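The role of the priority can be sketched as follows: among all index templates whose pattern matches the data stream name, the one with the highest priority is used. This is a simplified Python model; the built-in template is assumed here to be named metrics, to match metrics-*-*, and to have priority 100, and pick_template is a hypothetical helper.

```python
from fnmatch import fnmatchcase

templates = [
    # Assumed shape of the built-in default template for metrics data streams.
    {"name": "metrics", "index_patterns": ["metrics-*-*"], "priority": 100},
    # The custom template from this post.
    {"name": "metrics-laptop", "index_patterns": ["metrics-laptop-*"], "priority": 200},
]


def pick_template(data_stream, candidates):
    """Return the matching template with the highest priority, or None."""
    matching = [
        t for t in candidates
        if any(fnmatchcase(data_stream, pattern) for pattern in t["index_patterns"])
    ]
    return max(matching, key=lambda t: t["priority"], default=None)


print(pick_template("metrics-laptop-default", templates)["name"])
```

Both patterns match metrics-laptop-default, so the custom template is only applied because its priority is the higher of the two.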

Ingest a document

Ingesting documents into a TSDS is no different from ingesting documents into any other Elasticsearch data stream. You can add a document with the following commands in Dev Tools, then search for it and check the mappings. NOTE: You must adjust the @timestamp field to be close to your current date and time.

# Add a document with `host.name` as the dimension.
# The timestamp needs to be adjusted to be current.
POST metrics-laptop-default/_doc
{
  "@timestamp": "2023-03-30T12:26:23+00:00",
  "host.name": "ruflin.com",
  "packages.sent": 1000,
  "memory.usage": 0.8
}
}

# Search for the added doc, _source will show up but is reconstructed
GET metrics-laptop-default/_search

# Check out the mappings
GET metrics-laptop-default

If you want to add the current time of the machine to your documents at write time, you can refer to the article "Elasticsearch: How to add a now timestamp when writing documents".

If you run the search, it still shows _source, but this is reconstructed from the doc values. The one additional field added above is @timestamp. It is important because it is a required field for any data stream.
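If you build the request body in a script rather than typing it by hand, the current timestamp can be generated on the client side. The following Python sketch only prepares the JSON body and does not send it; build_document is a hypothetical helper, and the field values are the ones from the example above.

```python
import json
from datetime import datetime, timezone


def build_document(host_name, packages_sent, memory_usage):
    """Return a metrics document stamped with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return {
        "@timestamp": now,
        "host.name": host_name,
        "packages.sent": packages_sent,
        "memory.usage": memory_usage,
    }


doc = build_document("ruflin.com", 1000, 0.8)
print(json.dumps(doc, indent=2))
```

The printed body can be pasted into the POST metrics-laptop-default/_doc request in place of the hard-coded example.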

Why is this important for observability?

One of the strengths of the Elastic Observability solution is that all signals live in one place, in a single storage engine. Users can query logs, metrics, and traces together without jumping from one system to another. Because of this, it was critical for us to have a robust storage and query engine not only for logs but also for metrics.

Using TSDS in integrations

Through integrations, we provide users with an out-of-the-box experience for integrating with their infrastructure and services. If you're using our integrations and you're on version 8.7 or higher, you'll automatically get all the benefits of TSDS for your metrics.

We are currently working through our list of integration packages, adding dimensions and metric type fields, and then turning on TSDS for the metrics data streams. This means that once a package has all properties enabled, the only thing you have to do is upgrade the integration; everything else happens automatically in the background.

Learn more

If you want to learn more about how TSDS works behind the scenes and all the configuration options available, check out the TSDS documentation. What Elasticsearch supports in 8.7 is only the first iteration of metrics time series in Elasticsearch. If you switch to using TSDS, you will automatically benefit from all future improvements that Elasticsearch makes to metrics time series, whether it's more efficient storage, better query performance, or new aggregation capabilities.

TSDS has been available since 8.7 and will automatically appear in more and more of our integrations as they are upgraded. What you'll notice is lower storage usage and faster queries. Enjoy!


Origin blog.csdn.net/UbuntuTouch/article/details/130518728