How can Elasticsearch 8.X elegantly batch-rename fields?

1. A real-world problem

Before being written to Elasticsearch, the data looks like this:

{"json_lbm_01":"test01","json_lbm_02":"test02","tmp_lbm_01":"test03","tmp_lbm_02":"test04"}

Requirement: can this be done simply with a pipeline? Wherever a key contains the json_ prefix it should be stripped, and the tmp prefix should be replaced with core. There are too many such keys to enumerate them one by one. The final result should look like this:

{
  "lbm_01":"test01",
  "lbm_02":"test02",
  "core_lbm_01":"test03",
  "core_lbm_02":"test04"
}

——Source of the question: the 'Die-hard Elasticsearch' Knowledge Planet https://t.zsxq.com/0bzWL3w1X

2. Conceptual premise

Once an Elasticsearch mapping has been created, it cannot be modified! There are a few special cases in which mapping updates are allowed; see: Can Elasticsearch change the Mapping? How to modify?

Beyond those cases, modifying the mapping, and existing field definitions in particular, requires a change of approach.
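As a minimal illustration, assume a hypothetical index mytxindex-demo in which json_lbm_01 was dynamically mapped as text; attempting to change its type in place is rejected:

PUT mytxindex-demo/_mapping
{
  "properties": {
    "json_lbm_01": {
      "type": "keyword"
    }
  }
}

Elasticsearch refuses the request with an illegal_argument_exception along the lines of: mapper [json_lbm_01] cannot be changed from type [text] to [keyword].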

3. What if mapping fields must be modified to meet business needs?

Looking back at the opening question, it is essentially a data modeling problem: handled properly at the modeling stage, it would never arise later. I hope we can all reach consensus on this point.

On the importance of Elasticsearch data modeling, see:

Practical tips | Elasticsearch data modeling guide

Consider the following solutions to the opening question:

3.1 Solution 1: field aliases.

A field alias (field-alias) is not the same thing as an index alias (alias).

Index aliases are familiar to everyone; field aliases are often heard of, but far less often used in practice.

Field aliases were introduced in Elasticsearch 6.4; see:

https://www.elastic.co/cn/blog/introducing-field-aliases-in-elasticsearch

If enough homework is done at the modeling stage, this feature is rarely needed later on.

  • Advantages: the existing mapping stays unchanged; the aliases are simply added on top of it with a mapping update.

  • Disadvantages: with, say, 1,000 fields you have to construct alias entries for all 1,000, although that can be generated by a script. Note also that a field alias only takes effect in queries and aggregations; the stored _source keeps the original field names.

Practical reference:

POST mytxindex-20230303/_bulk
{"index":{"_id":1}}
{"json_lbm_01":"test01","json_lbm_02":"test02","tmp_lbm_01":"test03","tmp_lbm_02":"test04"}


PUT mytxindex-20230303/_mapping
{
    "properties": {
      "lbm_01": {
        "type": "alias",
        "path": "json_lbm_01"
      },
      "lbm_02": {
        "type": "alias",
        "path": "json_lbm_02"
      },
      "core_lbm_01": {
        "type": "alias",
        "path": "tmp_lbm_01"
      },
      "core_lbm_02": {
        "type": "alias",
        "path": "tmp_lbm_02"
      }
    }
}

POST mytxindex-20230303/_search
{
  "query": {
    "match": {
      "lbm_01": "test01"
    }
  }
}

The document is recalled as expected, matched through the lbm_01 alias (screenshot omitted).
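One caveat worth demonstrating: a field alias only applies at query time; the stored _source is untouched. Fetching the document directly still shows the original field names:

GET mytxindex-20230303/_doc/1

The _source in the response still contains json_lbm_01, json_lbm_02, tmp_lbm_01 and tmp_lbm_02, not the alias names.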

3.2 Solution 2: remodel, then reindex.

The core points are as follows:

  1. It is recommended to define an index template with dynamic_templates first; this solves the problem of mapping fields with similar name patterns in one go. 

  2. The preprocessing pipeline is implemented in two parts:

  • First, a script processor copies the old field values to the new field names;

  • Second, a remove processor drops the now-redundant old fields (ignore_missing: true keeps it from failing on documents that lack some of them). 

Advantages: this is the conventional, by-the-book approach.

Disadvantages: a reindex is required to migrate the index data, with preprocessing applied during the migration.

The specific implementation is as follows:

#### Step 1: Create the template; with it, the field mappings are handled in one go
PUT _index_template/mytx_template_20230303
{
  "index_patterns": [
    "mytx_new_*"
  ],
  "template": {
    "settings": {
      "number_of_shards": 1
    },
    "mappings": {
      "dynamic_templates": [
        {
          "lbm_to_keyword": {
            "match_mapping_type": "string",
            "match": "lbm_*",
            "mapping": {
              "type": "keyword"
            }
          }
        },
        {
          "core_lbm_to_keyword": {
            "match_mapping_type": "string",
            "match": "core_lbm_*",
            "mapping": {
              "type": "keyword"
            }
          }
        }
      ]
    }
  }
}

#### Step 2: Create the preprocessing (ingest) pipeline
PUT _ingest/pipeline/mytx_pipeline_20230303
{
  "processors": [
    {
      "script": {
        "source": """
        ctx.lbm_01 = ctx.json_lbm_01;
        ctx.lbm_02 = ctx.json_lbm_02;
        ctx.core_lbm_01 = ctx.tmp_lbm_01;
        ctx.core_lbm_02 = ctx.tmp_lbm_02;
        """,
        "lang": "painless"
      }
    },
    {
      "remove": {
        "field": [
          "json_lbm_01",
          "json_lbm_02",
          "tmp_lbm_01",
          "tmp_lbm_02"
        ],
        "if": "ctx.json_lbm_01 != null && ctx.json_lbm_02 != null && ctx.tmp_lbm_01 != null && ctx.tmp_lbm_02 != null"
      }
    }
  ]
}
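Before migrating, the pipeline can be sanity-checked with the simulate API; here is a minimal dry run using the sample document from above:

POST _ingest/pipeline/mytx_pipeline_20230303/_simulate
{
  "docs": [
    {
      "_source": {
        "json_lbm_01": "test01",
        "json_lbm_02": "test02",
        "tmp_lbm_01": "test03",
        "tmp_lbm_02": "test04"
      }
    }
  ]
}

The returned document should contain only lbm_01, lbm_02, core_lbm_01 and core_lbm_02.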

#### Step 3: Migrate the data
POST _reindex
{
  "source": {
    "index": "mytxindex-20230303"
  },
  "dest": {
    "index": "mytx_new_001",
    "pipeline": "mytx_pipeline_20230303"
  }
}

#### Step 4: Run a search
POST mytx_new_001/_search

The renamed fields are returned as expected (screenshot omitted).
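To confirm that the dynamic templates took effect, check the new index's mapping; the lbm_* and core_lbm_* fields should all come back as keyword:

GET mytx_new_001/_mapping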

3.3 Solution 3: the general solution based on field traversal

Neither Solution 1 nor Solution 2 truly solves the problem for N arbitrary fields.

Suppose there are many such fields: we don't want to copy them one by one, and we don't want to fall back on external shell or Python scripts for preprocessing.

Is there a better way? Solution 3 traverses the fields directly; after all, a document is nothing more than a set of key-value pairs.

For each entry, get the key via entry.getKey(), apply the renaming logic to build the new key, and copy the old value over to it.

Finally, write everything back with putAll.

PUT _ingest/pipeline/rename_fields_pipeline
{
  "processors": [
    {
      "script": {
        "source": """
          def new_fields = [:];
          for (entry in ctx.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith('json_')) {
              key = key.replace('json_', '');
            } else if (key.startsWith('tmp_')) {
              key = 'core_' + key.replace('tmp_', '');
            }
            new_fields[key] = entry.getValue();
          }
          ctx.clear();
          ctx.putAll(new_fields);
        """
      }
    }
  ]
}
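Like Solution 2's pipeline, this one can be dry-run with the simulate API. The point of this version is that it handles field names it has never seen; json_foo and tmp_bar below are made-up examples:

POST _ingest/pipeline/rename_fields_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "json_foo": "a",
        "tmp_bar": "b",
        "untouched": "c"
      }
    }
  ]
}

The simulated result should contain foo, core_bar and untouched.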


POST _reindex
{
  "source": {
    "index": "mytxindex-20230303"
  },
  "dest": {
    "index": "mytx_new_002",
    "pipeline": "rename_fields_pipeline"
  }
}

The final result matches expectations: each document in mytx_new_002 now contains lbm_01, lbm_02, core_lbm_01 and core_lbm_02 with the original values (screenshot omitted).

4. Summary

Three different implementations have been given for this class of problem, and each one meets the stated business requirement.

Still, making such changes in the middle or late stages of a project is not recommended. The better approach is to plan thoroughly during the Elasticsearch modeling stage, so as to avoid the kind of major change, with large-scale data migration, seen in the problem above.

You are welcome to share more practical ideas!

Recommended reading

  1. First release on the whole network! From 0 to 1: Elasticsearch 8.X walkthrough video

  2. Heavyweight | Die-hard Elasticsearch 8.X methodology cognition list (updated National Day 2022)

  3. How to learn Elasticsearch systematically?

  4. 2023, do something


Origin: blog.csdn.net/wojiushiwo987/article/details/129357709