Can Elasticsearch 8.X retrieve data based on array subscripts?

1. A problem from a production environment

Hi everyone, has anyone run into this problem? The index has an integer array field, and a runtime field script reads the value at array subscript 1. The value that comes back is unexpected: it is not always the element at subscript 1 of the original array. The steps to reproduce are as follows:

DELETE my_index
PUT my_index
{
  "mappings": {
    "properties": {
      "price": {
        "type": "integer"
      }
    }
  }
}

POST /my_index/_doc
{
  "price": [
    5,
    6,
    99,
    3
  ]
}

POST /my_index/_doc
{
  "price": [
    22,
    11,
    19,
    -1
  ]
}
GET my_index/_search
{
  "runtime_mappings": {
    "price_a": {
      "type": "double",
      "script": {
        "source": """
        double v = doc['price'][1];
        emit(v);
        """
      }
    }
  },
  "fields": [
    {
      "field": "price_a"
    }
  ]
}

Is it because the values stored in doc values are not kept in their original order?

The result is:

[Screenshot of the search response]

(The question comes from the technical exchange group.)

2. Problem analysis

2.1 How are Elasticsearch arrays accessed?

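In Elasticsearch, arrays are not a special data type: any field can hold zero or more values, as long as they all have the same type.

When you index a JSON document that contains an array field, Elasticsearch indexes each element of the array as a separate value of that field. The original array, including its order, is still kept in _source, but the structures used for searching and for doc[...] access in scripts do not record which array subscript each value came from.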

For example, assume you have the following document:

{
  "tags": ["A", "B", "C"]
}

Elasticsearch will treat it as if you had added three tags to the document: "A", "B", and "C".

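Most field types, including the integer field used here, store their values in doc values by default, and doc['price'] in a script reads from doc values.

Doc values are an on-disk, columnar data structure optimized for sorting, aggregations, and script access.

However, doc values do not preserve the order of the original array. For numeric fields, the values of each document are stored sorted in ascending order, so doc['price'][1] returns the second-smallest value: for [5, 6, 99, 3] the doc values are 3, 5, 6, 99, and subscript 1 yields 5 instead of 6.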
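To make the effect concrete, here is a minimal Python sketch applied to the two documents from the question. It is only a simplified model, assuming that numeric doc values behave like an ascending sort of each document's values; the function name numeric_doc_values is made up for illustration and is not an Elasticsearch API.

```python
def numeric_doc_values(source_array):
    """Simplified model (assumption): numeric doc values are the
    document's values sorted in ascending order."""
    return sorted(source_array)


# The two price arrays indexed in the question.
for source in ([5, 6, 99, 3], [22, 11, 19, -1]):
    doc_values = numeric_doc_values(source)
    # source[1] is what the author expected; doc_values[1] models what
    # doc['price'][1] reads in the script.
    print("source[1] =", source[1], "| doc values [1] =", doc_values[1])
```

For the second document the two values happen to coincide (both are 11), which is why the results look inconsistent rather than uniformly wrong.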

2.2 Access array data

When you access a field in a script, such as doc['tags'], what you actually get is a list of values read from doc values, not the original array.

Even if the original document has only one value, you still get a list. Therefore, you usually need to check .size() first and then read either .value (the first value in doc-values order) or a specific index. Calling .value on a document that has no value for the field throws an error.
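The access pattern described above can be sketched in Python. The class DocValuesView and the helper second_value_or_none are hypothetical stand-ins written for this article, not part of Elasticsearch or Painless; they only imitate the size() / value / index behaviour under the assumption of a numeric field.

```python
class DocValuesView:
    """Hypothetical stand-in for what doc['field'] exposes in a script."""

    def __init__(self, values):
        # Assumption: numeric field, so values are held in ascending order.
        self._values = sorted(values)

    def size(self):
        return len(self._values)

    @property
    def value(self):
        # Reading .value without any values is an error, so guard with size().
        if not self._values:
            raise LookupError("no value for this field; check size() first")
        return self._values[0]

    def __getitem__(self, index):
        return self._values[index]


def second_value_or_none(view):
    """Read subscript 1 only when the document has at least two values."""
    return view[1] if view.size() > 1 else None
```

A single-value document still goes through the same list-like interface: DocValuesView([7]).size() is 1 and .value is 7, while second_value_or_none returns None for it.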

2.3 Arrays and the nested data type

Elasticsearch also provides the nested data type. By default, an array of objects is flattened into multi-value fields, which loses the association between the fields of each object; nested indexes each object in the array as a separate hidden document, so the relationships inside each object are maintained.

This is useful for complex arrays of objects, but it also adds complexity: nested fields must be searched with the nested query and aggregated with the nested aggregation.
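To illustrate what goes missing without nested, the following Python sketch models how a plain object field flattens an array of objects. The function flatten_object_array and the sample items data are assumptions made for illustration only.

```python
def flatten_object_array(field, objects):
    """Model of a plain (non-nested) object field: each sub-field becomes
    one multi-value field, so per-object pairings are lost."""
    flattened = {}
    for obj in objects:
        for key, value in obj.items():
            flattened.setdefault(f"{field}.{key}", []).append(value)
    return flattened


# Hypothetical sample data: two objects, each pairing a name with a price.
items = [{"name": "pen", "price": 5}, {"name": "ink", "price": 99}]
flat = flatten_object_array("items", items)
print(flat)
```

After flattening there is nothing left to say that 99 belongs to "ink" rather than "pen", so a search for name pen with price 99 could match this document. That pairing is what the nested type keeps.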

3. How do you get the value at a specified subscript?

3.1 Option 1: minor changes

#### Delete the index
DELETE my_index

#### Create the index
PUT my_index
{
  "mappings": {
    "properties": {
      "price": {
        "type": "integer"
      }
    }
  }
}

#### Index the data
POST /my_index/_doc
{
  "price": [
    5,
    6,
    99,
    3
  ]
}

POST /my_index/_doc
{
  "price": [
    22,
    11,
    19,
    -1
  ]
}
#### Create the ingest pipeline
PUT _ingest/pipeline/split_array_pipeline
{
  "description": "Splits the price array into individual fields",
  "processors": [
    {
      "script": {
        "source": """
        if (ctx.containsKey('price') && ctx.price instanceof List && ctx.price.size() > 0) {
          for (int i = 0; i < ctx.price.size(); i++) {
            ctx['price_' + i] = ctx.price[i];
          }
        }
        """
      }
    }
  ]
}

What the split_array_pipeline ingest (preprocessing) pipeline contains:

  • description:

Describes the purpose of the pipeline. Here it states that the pipeline splits the price array into separate fields.

  • processors:

An array of processors, each of which performs a specific task. Here there is only one, a script processor.

In the script processor, a small Painless script checks whether a field named price exists, whether it is a list, and whether it has at least one element. If all these conditions are met, the script iterates over the array and creates a new field for each element. The new fields are named price_0, price_1, and so on, where the number is the element's subscript. Because the processor works on the document source (ctx), the elements are still in their original order at this point.

An ingest pipeline like this is useful when the raw data format is not suitable for direct indexing into Elasticsearch: the required transformation or cleansing is performed before the document is indexed.
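The logic of the script processor can be mirrored in a few lines of Python. This is a sketch for illustration, not code that Elasticsearch runs: ctx is modelled as a plain dict, and the function name split_price_array is an assumption.

```python
def split_price_array(ctx):
    """Python mirror of the Painless script: copy each element of the
    price list into its own price_<subscript> field."""
    price = ctx.get("price")
    # Same three checks as the script: field exists, is a list, is non-empty.
    if isinstance(price, list) and len(price) > 0:
        for i, element in enumerate(price):
            ctx[f"price_{i}"] = element
    return ctx


print(split_price_array({"price": [5, 6, 99, 3]}))
```

Documents without a price field, or with a single non-array value, pass through unchanged, just as in the pipeline.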

#### Apply the pipeline to the documents already in the index
POST my_index/_update_by_query?pipeline=split_array_pipeline
{
  "query": {
    "match_all": {}
  }
}


GET my_index/_search
{
  "runtime_mappings": {
    "price_a": {
      "type": "double",
      "script": {
        "source": """
        if (doc['price_0'].size() > 0) {
          double v = doc['price_0'].value;
          emit(v);
        }
        """
      }
    }
  },
  "fields": [
    {
      "field": "price_a"
    }
  ]
}

Runtime fields were introduced as a beta feature in version 7.11 and became generally available in 7.12. They let you define fields whose values are calculated by a script at query time rather than stored at index time.

In the above code:

  • We define a new runtime field called price_a. The field type is double.

  • We provide a Painless script that calculates the value of this field.

How the script works:

  • if (doc['price_0'].size() > 0):

This checks whether the current document has a value for price_0. In an Elasticsearch script, doc['field_name'] gives access to the field's doc values, and .size() returns the number of values for the current document (0 when the document has no value for the field).

  • double v = doc['price_0'].value;:

If the condition is true, this line reads the value of price_0 and assigns it to a double variable. price_0 was added by dynamic mapping as a long, so the value is widened to double.

  • emit(v);:

This is the key instruction of the Painless script. It outputs the specified value as the value of the runtime field price_a.

The results match expectations: price_a is 5.0 for the document with price [5, 6, 99, 3] and 22.0 for the document with price [22, 11, 19, -1].
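As a rough model of the runtime script above, the Python sketch below emits price_0 as a float when the document has that field and emits nothing otherwise. The function name price_a_runtime and the dict-of-lists representation of doc values are assumptions for illustration.

```python
def price_a_runtime(doc):
    """Model of the runtime script: return the list of emitted values."""
    emitted = []
    values = doc.get("price_0", [])  # doc values of price_0, possibly empty
    if len(values) > 0:              # same guard as doc['price_0'].size() > 0
        emitted.append(float(values[0]))
    return emitted


print(price_a_runtime({"price_0": [5]}))
print(price_a_runtime({}))
```

A document that was never processed by the pipeline simply has no price_a value in the response, instead of causing a script error.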

3.2 Option 2: Nested implementation

The nested data type has been covered many times in previous articles; readers who are not familiar with it can refer to those articles.

### Define the index
PUT /my_nested_index
{
  "mappings": {
    "properties": {
      "prices": {
        "type": "nested",
        "properties": {
          "value": {
            "type": "integer"
          }
        }
      }
    }
  }
}

### Index the data
POST /my_nested_index/_doc
{
  "prices": [
    {"value": 5},
    {"value": 6},
    {"value": 99},
    {"value": 3}
  ]
}

POST /my_nested_index/_doc
{
  "prices": [
    {"value": 22},
    {"value": 11},
    {"value": 19},
    {"value": -1}
  ]
}
#### Run the search
GET my_nested_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "prices",
            "query": {
              "exists": {
                "field": "prices.value"
              }
            },
            "inner_hits": {
              "size": 1
            }
          }
        }
      ]
    }
  }
}

If you want to return only the first data result under inner_hits, you can use the size parameter. By setting size to 1, you can limit the number of results returned by inner_hits.
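If the request is built programmatically, the same body can be assembled as a plain dictionary. The sketch below uses only the Python standard library and does not call any Elasticsearch client; the function name first_nested_hit_query is an assumption, and the bool/must wrapper from the request above is omitted because it contains a single clause.

```python
import json


def first_nested_hit_query(path, field, size=1):
    """Build a search body that matches nested objects having `field`
    and limits inner_hits to `size` entries per parent document."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"exists": {"field": f"{path}.{field}"}},
                "inner_hits": {"size": size},
            }
        }
    }


body = first_nested_hit_query("prices", "value")
print(json.dumps(body, indent=2))
```

Raising size returns more nested objects per parent document, so the same helper can be reused when more than the first element is needed.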

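Return results:

In the response, each hit contains a single entry under inner_hits.prices. Since every nested object matches the exists query with the same score, that entry is normally the first object of the array (offset 0): {"value": 5} for the first document and {"value": 22} for the second.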

4. Summary

When we use Elasticsearch to process array data, it's easy to misunderstand its actual behavior. This article explores in detail how Elasticsearch processes and stores arrays, and provides several methods for obtaining elements at specific positions in an array.

First, we must understand that Elasticsearch does not store arrays in the traditional way; it treats each element as an independent value. Therefore, a script that reads doc values cannot access a specific element of the original array simply through a subscript.

There are several ways to solve this problem:

Use an ingest pipeline: Split the array with an ingest (preprocessing) pipeline and generate a new field for each element. This approach is intuitive and makes the element at any specific position easy to access.

Use the nested data type: nested is a good fit for complex arrays of objects whose internal relationships must be preserved. It lets us run more complex queries against each object in the array while keeping those relationships intact.

Both methods have their advantages and disadvantages. Which one to choose depends on your specific needs and data structure. The ingest pipeline solution suits scenarios where you want to keep your data simple and access array elements directly. The nested data type suits more complex scenarios where relationships need to be maintained between array objects.

In any case, it's crucial to understand the structure of your data and how Elasticsearch handles it. I hope that through this article, you will have a deeper understanding of Elasticsearch's array processing and be able to solve array-related problems more effectively.

Finally, no matter which method you choose, make sure to frequently test and verify the completeness and accuracy of your data. This way, you can ensure you get the expected results in a production environment and avoid potential problems caused by misunderstandings of data structures.

Recommended reading

  1. First release on the entire network! From 0 to 1 Elasticsearch 8.X clearance video

  2. Breaking news | Obsessed with Elasticsearch 8.X methodology knowledge list

  3. How to learn Elasticsearch systematically?

  4. 2023, do something

  5. Dry information | In-depth explanation of Elasticsearch Nested type

  6. To choose Elasticsearch Nested, read this article first!

  7. Dry information | Elasticsearch Nested array size solution, all in one place!

  8. Does Elasticsearch have an array type? What are the pitfalls?

  9. Dry information | Dismantling a complex query problem of Elasticsearch Nested type


Origin blog.csdn.net/wojiushiwo987/article/details/132644583