ElasticSearch usage summary (10)

ElasticSearch supports regular expression queries, but before running them against large blocks of text, you must understand how these queries interact with analysis. Because the analyzer tokenizes text fields, removes stop words, lowercases tokens, and so on, what is ultimately stored in the inverted index is a lowercase token stream, in which each token is a separate term by default. This cannot satisfy the usual expectation that a regular expression query should match the original text. Note also that the ElasticSearch engine anchors regular expression matching at the first character of the indexed text.

Before enabling regular expression queries in ElasticSearch, two questions need to be considered: should the field be tokenized, and should matching be case-sensitive?
1. To tokenize or not?
Normally, the ElasticSearch engine tokenizes text fields, removes stop words, and converts tokens to lowercase; this is the standard configuration for full-text search. Under this configuration, a regular expression can only match an individual token of the text field, not the original text. To run regular expression queries against the original text, the field must not be tokenized, that is, its index attribute must be set to not_analyzed. In a real production environment it is common to need both regular expression queries and full-text search on the same field, and the mapping parameter fields provided by ElasticSearch meets this requirement.

The fields parameter defines multi-fields: the same field is indexed in different ways to serve different purposes. A multi-field derives new fields from the same data. For example, a field named field can be indexed as an analyzed field for full-text search, and by configuring it as a multi-field, the ElasticSearch engine derives a new field field.raw whose entire text is indexed as a single term, used only for sorting, aggregation, or exact matching.
Note: multi-fields are different from multi-valued fields. Multi-valued fields are an inherent, "out of the box" feature of ElasticSearch that requires no configuration: any field can store multiple values, which effectively makes every field an array type, provided all values in the array share the same data type.
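A minimal sketch of a multi-valued field (the index, type, and field names here are hypothetical): the document simply supplies an array of values of the same type, with no special mapping required.

```json
PUT /myindex/events/1
{
  "tags": ["search", "regexp", "elasticsearch"]
}
```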
1. Multi-field example
In the example mapping below, eventdescription is a multi-field whose index attribute is analyzed, so it is an analyzed field: the ElasticSearch engine analyzes its text into a token stream and indexes the tokens for full-text search, which means the inverted index stores individual tokens rather than the field's original text. Its derived field eventdescription.raw has the index attribute not_analyzed, so it is not tokenized: the entire text value is indexed as a single term, which means the derived field's original text value is stored in the inverted index.

"eventdescription":{  
    "type":"string",
    "index":"analyzed",
    "fields":{  
        "raw":{  
            "type":"string",
             "index":"not_analyzed"
        }
    }
}
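With this mapping in place, a regular expression query can target the derived eventdescription.raw field while full-text queries continue to use eventdescription. A sketch, assuming an index named myindex and an illustrative pattern:

```json
GET /myindex/_search
{
  "query": {
    "regexp": {
      "eventdescription.raw": "System update.*failed"
    }
  }
}
```

Because the field is not_analyzed, the pattern must match the original text starting from its first character.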

2. Performing regular expression queries on the original text
When the ElasticSearch engine processes an analyzed field, it uses the specified analyzer to perform analysis: tokenization, stop-word removal, case conversion, and so on. If a field is not an analyzed field, the engine performs no analysis on it.

The mapping parameter index determines whether the ElasticSearch engine analyzes a text field, where analysis splits the text into tokens, indexes those tokens, and makes them searchable:

  • When the value is analyzed, the field is an analyzed field: the engine analyzes it, splits the text into a token stream, and stores the tokens in the inverted index, supporting full-text search;
  • When the value is not_analyzed, the field is not analyzed: the engine stores the original text as a single term in the inverted index. Full-text search is not supported, but term-level search is; the original text is indexed without analysis, and during search the query must match the entire original text;
  • When the value is no, the field is not indexed at all and cannot be searched;

Therefore, to perform a regular expression query on the original text, the index attribute must be set to not_analyzed, which preserves the original form of the text, including capitalization and whitespace.
3. The maximum length of a single term
If a field's index attribute is set to not_analyzed, the original text is indexed as a single term, and a term's maximum length is limited by its UTF-8 encoding: the default maximum is 32766 bytes. If the field's text exceeds this limit, ElasticSearch skips the document and returns an exception message in the response:

operation[607]: index returned 400 _index: ebrite _type: events _id: 76860 _version: 0 error: Type: illegal_argument_exception Reason: "Document contains at least one immense term in field="event_raw" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[112, 114,... 115]...', original message: bytes can be at most 32766 in length; got 35100" CausedBy:Type: max_bytes_length_exceeded_exception Reason: "bytes can be at most 32766 in length; got 35100"

The ignore_above attribute can be set on the field. Its value is a number of characters, not bytes; since a UTF-8 character occupies at most 3 bytes, you can set

"ignore_above":10000

With this setting, any value longer than 10000 characters is skipped by indexing rather than truncated, so a single term never exceeds 30000 bytes.

The value for ignore_above is the character count, but Lucene counts bytes. If you use UTF-8 text with many non-ASCII characters, you may want to set the limit to 32766 / 3 = 10922 since UTF-8 characters may occupy at most 3 bytes.
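A mapping sketch applying this calculation to the earlier multi-field example; the value 10922 follows from 32766 / 3, and the field names reuse the running example:

```json
"eventdescription":{
    "type":"string",
    "index":"analyzed",
    "fields":{
        "raw":{
            "type":"string",
            "index":"not_analyzed",
            "ignore_above":10922
        }
    }
}
```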

2. Case sensitive?
Regular expression queries are case-sensitive by default. Sometimes we want them to ignore case, and here multi-fields alone cannot meet the requirement, because a multi-field cannot lowercase the text of a not_analyzed derived field. To solve this, create an additional field: when indexing, write the original text into the analyzed field, and write a lowercased copy of the same data into the new field. The analyzed field and its derived field then support full-text search and case-sensitive regular expression queries, while the lowercase field supports case-insensitive regular expression queries.

"eventdescription":{  
    "type":"string",
    "index":"analyzed",
    "fields":{  
        "raw":{  
            "type":"string",
            "index":"not_analyzed"
        }
    }
},
"eventdescription_lowcase":{  
    "type":"string",
    "index":"not_analyzed"
}
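Since a not_analyzed field cannot lowercase its own text, the application must supply the lowercase copy at index time. A hypothetical indexing request against this mapping might look like:

```json
PUT /myindex/events/1
{
  "eventdescription": "Server OUTAGE in Region-1",
  "eventdescription_lowcase": "server outage in region-1"
}
```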

3. Storage control
To implement regular expression queries, the example above creates three fields for one piece of data: eventdescription, eventdescription.raw, and eventdescription_lowcase, all of which are indexed. Does the ElasticSearch engine therefore use three times the storage capacity for this data?

By default, once a field value is indexed, the field can be searched, but the original value of the field is not stored in the inverted index; that is, the field is searchable, but its original value cannot be retrieved from the inverted index. This design saves disk space and usually has no impact on applications, because the ElasticSearch engine stores the original value of each field in the _source meta-field, and _source is stored by default.
1. The store attribute
Whether a field's original value is stored, and thus individually retrievable, is determined by the mapping parameter store. The default value is false: the original value is not stored. In some cases it makes sense to store the original value. For example, a blog document has a title, a date, and a very large body field (content). If you only ever retrieve the title and date, not the body, you can store title and date and leave the content field's store attribute as false.

"title":{  
    "type":"string",
    "store":true,
    "index":"analyzed"
},
 "date":{  
    "type":"date",
    "store":true,
    "index":"not_analyzed"
},
 "content":{  
    "type":"string",
    "store":false,
    "index":"analyzed"
}
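With this mapping, a search can project only the stored fields to reduce network load. A sketch, assuming an index named blog and using the request-level fields parameter of this ElasticSearch generation:

```json
GET /blog/_search
{
  "fields": ["title", "date"],
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  }
}
```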

The difference between the mapping parameters index and store is:

  • store controls retrieval (Retrieve) of a field's original value and does not affect querying. Using the projection parameter fields, you can select only the fields whose store attribute is true, retrieving specific fields to reduce network load;
  • index controls how a field is queried (Search). When index is analyzed, full-text queries run against the field's tokens; when index is not_analyzed, the original value is indexed as a single term, and only term-level queries against the entire original text are possible;
2. The source field (_source)
When the original JSON document is passed to the ElasticSearch engine, the engine stores it in the _source meta-field. The _source field itself is neither indexed nor searchable; it is stored and is used to return documents in query results.
The _source field increases index storage space, so it can be disabled. However, before disabling it, read the official "_source field" documentation carefully:
"mappings": {
    "tweet": {
      "_source": {
        "enabled": false
      }
    }
  }

In general, do not disable the _source field. If the disk space it occupies is a concern, consider raising the compression level instead. The relevant setting is index.codec, whose default is LZ4 compression; setting it to best_compression yields a higher compression ratio at the cost of slower storage performance.
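index.codec is an index-level setting; a sketch of enabling best_compression when creating an index (the index name is an assumption):

```json
PUT /myindex
{
  "settings": {
    "index.codec": "best_compression"
  }
}
```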
4. One field contains all text?
To implement regular expression queries, the example above creates three fields for one piece of data: eventdescription, eventdescription.raw, and eventdescription_lowcase, and the usual practice is to run the regular expression query against all three fields at once. In ElasticSearch, however, this can be simpler. The meta-field _all is a special "catch-all" field that concatenates the values of the other fields into one large string, separated by spaces. The ElasticSearch engine analyzes the _all field and then indexes it, but by default the original value is not stored; that is, the _all field can be searched, but its original value is not returned. The _all field treats the values of all fields as strings and joins them with a space separator.
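When _all is enabled, a single query can search across all included fields at once. A sketch, assuming an index named myindex:

```json
GET /myindex/_search
{
  "query": {
    "match": {
      "_all": "server outage"
    }
  }
}
```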

Note that it is the raw field values that are added to the _all field, not the terms produced by analysis. Whether a field's value is included in _all is controlled by the field's include_in_all attribute, whose default value is true. Enabling the _all field has a cost: it consumes extra CPU cycles and more disk space. If it is not needed, it is recommended to disable it:

"content": { 
    "type": "string",
    "include_in_all": false
},

When the _all field is disabled, you can build a custom "_all"-style field: create a new field of type string, and set the copy_to attribute on every field whose value should be copied into it.

In ElasticSearch, each index has only one built-in _all field, but a custom "_all" can be created through the copy_to attribute. For example, the values of first_name and last_name can both be copied into the full_name field, so that a query for "John Smith" against full_name matches.

{
  "mappings": {
    "mytype": {
      "properties": {
        "first_name": {
          "type":    "string",
          "copy_to": "full_name" 
        },
        "last_name": {
          "type":    "string",
          "copy_to": "full_name" 
        },
        "full_name": {
          "type":    "string"
        }
      }
    }
  }
}
PUT myindex/mytype/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET myindex/_search
{
  "query": {
    "match": {
      "full_name": "John Smith"
    }
  }
}

By default, the _all field is not stored: it is composed from the values of the other fields, and storing it would consume considerable disk space. If the store attribute of the _all field is set to true, the ElasticSearch engine stores its original value, and that value can then be retrieved.
5. Examples
In summary, there are two design approaches for implementing regular expression queries.

Example 1. One field each for the original text and the lowercase text
Use the should clauses of a bool query to run regular expression queries against multiple fields. When there are many fields, or the field text is particularly large, this approach saves disk space, but requires more query code:

"eventdescription":{  
    "type":"string",
    "index":"analyzed",
    "fields":{  
        "raw":{  
            "type":"string",
            "index":"not_analyzed",
            "ignore_above":10000
        }
    }
},
"eventdescription_lowcase":{  
    "type":"string",
    "index":"not_analyzed",
    "ignore_above":10000
}
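A corresponding query runs the pattern against both fields inside the should clauses of a bool query, lowercasing the pattern for the lowercase field; the index name and pattern are illustrative:

```json
GET /myindex/_search
{
  "query": {
    "bool": {
      "should": [
        { "regexp": { "eventdescription.raw": "Server OUT.*" } },
        { "regexp": { "eventdescription_lowcase": "server out.*" } }
      ]
    }
  }
}
```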

Example 2. Adding a redundant field
Copy both the original text and the lowercase text into a single redundant field via copy_to, so that one regular expression query covers both:

"eventdescription":{  
    "type":"string",
    "index":"analyzed",
    "copy_to":"eventdescription_regexp"
},
"eventdescription_lowcase":{  
    "type":"string",
    "index":"not_analyzed",
    "copy_to":"eventdescription_regexp"
},
"eventdescription_regexp":{
    "type":"string",
    "index":"not_analyzed"
}
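Because copy_to routes both the original and the lowercase text into eventdescription_regexp, a single regexp query suffices; the index name and pattern are illustrative:

```json
GET /myindex/_search
{
  "query": {
    "regexp": {
      "eventdescription_regexp": "server out.*"
    }
  }
}
```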
