What should you do when a pile of strings like this gets written into Elasticsearch?

1. The question

The following sample data has been imported into Elasticsearch. How can a specific field be retrieved, and how can the length of a specific subfield be determined?


"message": "[策略排序]排序后结果:[{\"intentItems\":[\"200001\"],\"level\":1,\"moduleCode\":\"CENTER_PIT\",\"priority\":100,\"ruleId\":3947,\"sortScore\":9900.0,\"strategyId\":1000,\"strategyItemId\":1003}],deviceId:0aa81c2d-5ec9-3c09-81ba-7857709379ad"

2. Problem analysis

  • Major premise: Elasticsearch documents are all stored as JSON.

  • The data in the question is not well-formed: it is meant to be JSON, but it is actually stored as a single string.

  • Storing it as a string makes subsequent retrieval very inconvenient.

So, we need to consider converting it.

There are many ways to do the conversion. The one everyone thinks of first is to parse the JSON on the client side before writing.

Is there a faster way? Yes: consider the json processor, one of the preprocessing capabilities of the ingest pipeline.
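In a nutshell, the json processor parses a field whose value is a JSON string into a real object under a target field. A simplified before-and-after (illustrative only, not actual Elasticsearch output):

Before: "message": "{\"a\":1}"
After:  "json_msg": { "a": 1 }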


3. Implementation

Step 1: Format the sample data. The raw message in the question carries a log prefix and an unquoted deviceId, so it is not valid JSON; it is first reshaped into a valid JSON string and then indexed.

POST test-009/_bulk
{"index":{"_id":1}}
{"message":"{\"rst\":[{\"intentItems\":[\"200001\", \"200002\"],\"level\":1,\"moduleCode\":\"CENTER_PIT\",\"priority\":100,\"ruleId\":3947,\"sortScore\":9900.0,\"strategyId\":1000,\"strategyItemId\":1003}],\"deviceId\":\"0aa81c2d-5ec9-3c09-81ba-7857709379ad\"}"}

After writing, a quick retrieval in Kibana shows the whole payload still trapped inside the message string, as illustrated below.
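For example (the response shape below is sketched from the sample data, not copied from a live cluster):

POST test-009/_search

The single hit's _source holds everything as one escaped string, roughly:

"message": "{\"rst\":[{\"intentItems\":[\"200001\", \"200002\"], ...}],\"deviceId\":\"0aa81c2d-5ec9-3c09-81ba-7857709379ad\"}"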

Step 2: Convert the string to JSON.

PUT _ingest/pipeline/msg2json_pipeline
{
  "processors": [
    {
      "json": {
        "field": "message",
        "target_field" : "json_msg"
      }
    },
    {
      "remove": {
        "field": "message",
        "if": "ctx.message != null"
      }
    }
  ]
}
  • json processor purpose: parse the message text string into a real JSON object stored under the target field json_msg.

  • remove processor purpose: once parsed, the original message field is no longer useful; deleting it tidies the document and frees up space.

Note: ingest processors have existed since Elasticsearch 5.0. As versions have evolved, the set of available processors has been steadily enriched, extended, and improved.
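Before applying the pipeline to the whole index, it can be dry-run against a sample document with the _simulate API (the message below is a trimmed version of the sample data):

POST _ingest/pipeline/msg2json_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "{\"rst\":[{\"intentItems\":[\"200001\", \"200002\"]}],\"deviceId\":\"0aa81c2d-5ec9-3c09-81ba-7857709379ad\"}"
      }
    }
  ]
}

The response should show json_msg populated as an object and message gone.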

Step 3: Verify that the JSON conversion works.

Running _update_by_query with the pipeline parameter re-processes every existing document through msg2json_pipeline in place; a follow-up search confirms the result.

POST test-009/_update_by_query?pipeline=msg2json_pipeline
{
  "query": {
    "match_all": {}
  }
}

POST test-009/_search
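The message field should be gone and json_msg should be a real object. Each hit's _source now looks roughly like this (some rst fields trimmed for brevity):

{
  "json_msg": {
    "rst": [
      {
        "intentItems": ["200001", "200002"],
        "level": 1,
        "moduleCode": "CENTER_PIT"
      }
    ],
    "deviceId": "0aa81c2d-5ec9-3c09-81ba-7857709379ad"
  }
}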

Step 4: Compute the length of the intentItems array.

PUT _ingest/pipeline/len_pipeline
{
  "processors": [
    {
      "script": {
        "lang": "painless", 
        "source": """
        ctx.array_len = ctx.json_msg.rst[0].intentItems.size();
        """
      }
    }
  ]
}

POST test-009/_update_by_query?pipeline=len_pipeline
{
  "query": {
    "match_all": {}
  }
}

POST test-009/_search
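Each hit should now carry the computed length alongside the parsed object, roughly:

"_source": {
  "json_msg": { ... },
  "array_len": 2
}

array_len is 2 here because intentItems in the sample document holds "200001" and "200002".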

Of course, _update_by_query is not the optimal solution. This article splits the process into fine-grained steps only so that each one is clearly visible.

A more convenient solution: specify a default_pipeline when creating the index, and fold the json, script, and remove processors written above into that default pipeline.

Due to space constraints, only a brief sketch is given below.
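A minimal sketch of that approach (the pipeline and index names below are illustrative):

PUT _ingest/pipeline/msg_default_pipeline
{
  "processors": [
    {
      "json": {
        "field": "message",
        "target_field": "json_msg"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.array_len = ctx.json_msg.rst[0].intentItems.size();"
      }
    },
    {
      "remove": {
        "field": "message",
        "if": "ctx.message != null"
      }
    }
  ]
}

PUT test-010
{
  "settings": {
    "index.default_pipeline": "msg_default_pipeline"
  }
}

With default_pipeline set, every document written to test-010 passes through these processors automatically, and no _update_by_query is needed afterwards.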

4. Summary

Previous articles have repeatedly emphasized that Elasticsearch has fairly powerful preprocessing capabilities, which can satisfy the basic data cleaning and conversion needs of most businesses.

Some readers have asked: can it completely replace Logstash's filter functionality? The official stance, too, is emphatic: it cannot!

Taking the latest (as of 2023-01-12) Elasticsearch 8.6 as an example, the numbers tell the story: Logstash has 48 filter plug-ins, while Elasticsearch has 40 ingest processors. More concretely, Logstash filters such as logstash-integration-jdbc and logstash-filter-uuid have no Elasticsearch ingest processor equivalents.

In one sentence: if the Elasticsearch ingest pipeline can handle the preprocessing, let it. If it cannot and Logstash is part of your technology stack, hand the work to Logstash's filter plug-ins.

Recommended reading

  1. First release on the whole network! From 0 to 1: an Elasticsearch 8.X walkthrough video

  2. Heavyweight | Die-hard Elasticsearch 8.X methodology cognition checklist (2022 National Day update)

  3. How to learn Elasticsearch systematically?

  4. Looking at Elasticsearch data cleaning methods through an online question

  5. Elasticsearch's ETL weapon: the Ingest node

  6. There is no magic to Elasticsearch preprocessing, so use this trick first!

  7. Hands-on | Structured ETL with Logstash Grok

  8. Hands-on | Custom regular-expression ETL with Logstash


Origin: blog.csdn.net/wojiushiwo987/article/details/128681295