Elasticsearch: use ingest pipeline to manage index names

In my previous article "Elasticsearch: Using pipelines to route documents to the desired Elasticsearch index", I described in detail how to use the built-in date_index_name processor to group documents into the desired index based on the document's date. For example, we may want to write all documents from April 2023 into an index named my-index-2023-04-01. This processor addresses a common need: putting the month or year of the document's timestamp into the index name, which makes future management and searching more convenient.
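
As a quick refresher, a minimal sketch of that earlier approach might look like the following (the pipeline name monthly_index and the field created_at are illustrative, not taken from that article):

PUT _ingest/pipeline/monthly_index
{
  "description": "Route documents into a monthly index based on their date",
  "processors": [
    {
      "date_index_name": {
        "field": "created_at",
        "index_name_prefix": "my-index-",
        "date_rounding": "M",
        "index_name_format": "yyyy-MM-dd"
      }
    }
  ]
}

With date_rounding set to M, a document dated anywhere in April 2023 is routed to the index my-index-2023-04-01.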

In today's article, we will implement the same solution in another way.

Different ways to modify data

When using the Elastic Stack, we have several different options for modifying data:

The first option is a custom application: we ingest the documents from our business applications, modify them with our own logic, and finally write them to Elasticsearch through a client library. The downside of this approach is that you need to write and maintain a suitable application yourself. For large amounts of data we may not have any buffering, and sometimes we cannot even guarantee at-least-once delivery; both of these are built into Logstash and Filebeat.
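
For illustration, a minimal sketch of this approach using the Python client might look like the following (the monthly index naming logic and field names are assumptions for the example):

# A minimal sketch of the custom-application approach: we decide the
# target index ourselves before writing the document to Elasticsearch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "created_at": "2022-11-30T10:23:45Z",
    "content": "This is Xiaoguo, Liu from Elastic",
}

# Derive a monthly index name such as books.2022.11 from the timestamp.
index_name = "books." + doc["created_at"][:7].replace("-", ".")
es.index(index=index_name, document=doc)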

We can also use Logstash to modify the data.

Logstash provides a rich set of filters to help us process data. You can read the article "Logstash: Getting Started with Logstash Part 1" to learn more. The disadvantage of this solution is that, in order to cope with single points of failure and to balance load, you need to manage multiple Logstash instances, which requires extra work.
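
For reference, a minimal Logstash pipeline that produces a similar monthly index name might look like this (the port, host, and field names are illustrative):

input {
  beats {
    port => 5044
  }
}
filter {
  # Parse created_at into @timestamp so the output can format it
  date {
    match => ["created_at", "ISO8601"]
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    # %{+yyyy.MM} formats @timestamp, producing e.g. books.2022.11
    index => "books.%{+yyyy.MM}"
  }
}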

More and more developers now tend to use the ingest pipeline to process data. For more articles about the ingest pipeline, please refer to "Elastic: A Developer's Guide to Getting Started".

Ingest nodes are a type of node in an Elasticsearch cluster. They can run a rich set of processors to transform our data. Since ingest nodes are part of the Elasticsearch cluster, they can easily be scaled out to handle more demand.
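
We can check which nodes in a cluster carry the ingest role with, for example:

GET _cat/nodes?v=true&h=name,node.role

Nodes whose node.role column contains i can run ingest pipelines.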

The benefits of using the Ingest pipeline are:

  • Ability to modify data without changing application logic
  • A lightweight solution compared to Logstash
  • No separate cluster to manage, and therefore less overhead
  • Reduced architectural complexity

Although the ingest pipeline has the advantages listed above, it also has some limitations to be aware of.

Besides the approaches described above, another option is to process the data with Beats processors. You can read "Beats: Beats processors" to learn more.

Write the data to the index we want

Next, we use the ingest pipeline method to write the data into the index we want.
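
Suppose we have a document like the following (a representative example; the exact field values are illustrative):

{
  "created_at": "2022-11-30T10:23:45Z",
  "content": "This is Xiaoguo, Liu from Elastic"
}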

In the document above, we can see a field called created_at whose value falls on 2022-11-30. We want to write this document into an index named books.2022.11. The reason is simple: we want to group all documents from the same month into a single index, such as books.2022.11, which makes archiving and searching easier later on. This is a real need in production environments. So how do we achieve it?

We can use an ingest pipeline to meet this requirement. Enter the following command in Kibana:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Change index name according to created_at",
    "processors": [
      {
        "date": {
          "field": "created_at",
          "target_field": "index_suffix", 
          "formats": ["ISO8601"],
          "output_format": "yyyy.MM"
        }
      },
      {
        "set": {
          "field": "_index",
          "value": "{
   
   { _index }}.{
   
   {index_suffix}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "created_at": "2023-04-13T23:57:11.092808962ZZ",
        "content": "This is Xiaoguo, Liu from Elastic"
      }
    }
  ]
}

Running the above command, the result we see is:

{
  "docs": [
    {
      "doc": {
        "_index": "_index.2023.04",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "created_at": "2023-04-13T23:57:11.092808962ZZ",
          "content": "This is Xiaoguo, Liu from Elastic",
          "index_suffix": "2023.04"
        },
        "_ingest": {
          "timestamp": "2023-04-14T00:01:24.74589526Z"
        }
      }
    }
  ]
}

Above, since we didn't specify _index in the test document, the placeholder metadata value _index was used and the resulting name is _index.2023.04. If we specify a real _index, it is automatically substituted into the name we want.
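
To verify, we can rerun the simulation, this time supplying _index in the test document's metadata (the pipeline definition is unchanged):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Change index name according to created_at",
    "processors": [
      {
        "date": {
          "field": "created_at",
          "target_field": "index_suffix",
          "formats": ["ISO8601"],
          "output_format": "yyyy.MM"
        }
      },
      {
        "set": {
          "field": "_index",
          "value": "{{ _index }}.{{index_suffix}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "books",
      "_source": {
        "created_at": "2023-04-13T23:57:11.092808962Z",
        "content": "This is Xiaoguo, Liu from Elastic"
      }
    }
  ]
}

This time the returned _index is books.2023.04, exactly the name we want. We can now create the pipeline: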

PUT _ingest/pipeline/change_index_according_to_created_at
{
  "description": "Change index name according to created_at",
  "processors": [
    {
      "date": {
        "field": "created_at",
        "target_field": "index_suffix",
        "formats": [
          "ISO8601"
        ],
        "output_format": "yyyy.MM"
      }
    },
    {
      "set": {
        "field": "_index",
        "value": "{
   
   { _index }}.{
   
   {index_suffix}}"
      }
    }
  ]
}
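
As a side note, if we don't want to pass the pipeline parameter on every write request, one option (not part of the original commands; the template name books_template is illustrative) is to make this pipeline the default for the books index via an index template:

PUT _index_template/books_template
{
  "index_patterns": ["books"],
  "template": {
    "settings": {
      "index.default_pipeline": "change_index_according_to_created_at"
    }
  }
}

Note that the pattern matches only the exact name books rather than books*, so that documents written directly to the rewritten monthly indices such as books.2023.04 do not run the pipeline a second time.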

With the pipeline in place, we can test it with the following command:

PUT books/_doc/1?pipeline=change_index_according_to_created_at
{
  "created_at": "2023-04-13T23:57:11.092808962ZZ",
  "content": "This is Xiaoguo Liu from Elastic"
}

The result returned by the above command is:

{
  "_index": "books.2023.04",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

As shown above, the _index name now becomes books.2023.04 instead of books, the index name we specified at write time. We can use the following command to query the document we just wrote:

GET books.2023.04/_search

The above command returns:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "books.2023.04",
        "_id": "1",
        "_score": 1,
        "_source": {
          "created_at": "2023-04-13T23:57:11.092808962ZZ",
          "content": "This is Xiaoguo Liu from Elastic",
          "index_suffix": "2023.04"
        }
      }
    ]
  }
}
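
Because all of the monthly indices share the books. prefix, we can also search across every month at once with an index wildcard:

GET books.*/_search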

In fact, following the same routine, we can compose whatever index names we like by modifying _index. In a production environment, date-stamped names like these make it much easier to identify and manage our indices!
