ElasticSearch usage summary (9)

The medium of document storage in ElasticSearch is divided into memory and hard disk: memory is fast, but its capacity is limited; hard disk is slow, but has a large capacity. At the same time, the operation of the ElasticSearch process itself also requires memory space, and it must be ensured that the ElasticSearch process has sufficient runtime memory. In order to achieve the best performance of the ElasticSearch engine, the limited memory and hard disk resources must be reasonably allocated.
1. Inverted Index
The ElasticSearch engine writes the document data into the data structure of the Inverted Index. The inverted index establishes the mapping relationship between the Term and the Document. In an inverted index, the data is term-oriented rather than document-oriented.

For example, the relationship between documents and terms is as follows:

DocID  field
1      study after school
2      study English

After the field value is analyzed, the resulting terms are stored in the inverted index, which records the relationship between each term (Term) and the documents (Doc) that contain it. A simplified version of the inverted index is as follows:

Term     Counter  DocID
after    1        1
school   1        1
study    2        1,2
English  1        2

As can be seen from the table, the inverted index holds a list of terms; each term appears only once in the list, together with the number of times it occurs and the documents that contain it. In practice, the inverted index created by the ElasticSearch engine is much more complicated than this.
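
As a rough illustration of the table above, the following Python sketch builds a simplified inverted index from the two sample documents. The function name build_inverted_index is hypothetical, and splitting on whitespace is only a stand-in for real analysis; it is not how ElasticSearch itself is implemented.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to (occurrence count, sorted list of doc IDs)."""
    postings = defaultdict(set)
    counts = defaultdict(int)
    for doc_id, text in docs.items():
        # Stand-in for analysis: split on whitespace only.
        for term in text.split():
            postings[term].add(doc_id)
            counts[term] += 1
    return {term: (counts[term], sorted(ids)) for term, ids in postings.items()}

docs = {1: "study after school", 2: "study English"}
index = build_inverted_index(docs)

# Print one row per term: term, counter, doc IDs.
for term, (count, doc_ids) in sorted(index.items()):
    print(term, count, doc_ids)
```

Looking up index["study"] returns the counter and the list of documents directly, which is the term-oriented access pattern the section describes.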
1. Segments are part of an inverted index

The inverted index is composed of segments (Segment), and each segment is stored in files on the hard disk (Disk). Segments are immutable: once a segment has been written to disk, it is never updated. When a document is deleted, the ElasticSearch engine records the deletion in a separate file instead of changing the segment. When searching, the engine first executes the query against the segments and then filters the deleted documents out of the query result. This means that segments still store deleted documents, which reduces the density of "normal documents" in those segments. The Segment Merge operation combines multiple segments, physically removes the "deleted" documents, and writes the undeleted documents into a new segment that contains no deleted documents. Segment merging can therefore improve search speed. However, it is an IO-intensive operation that consumes a lot of hard disk IO.
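
The delete-then-merge behaviour can be sketched with a toy model. The Segment class and merge function below are hypothetical names used only for illustration, and the in-memory dictionaries stand in for files on disk; the real engine is far more involved.

```python
class Segment:
    """Toy immutable segment: documents never change after creation."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> text, written once
        self.deleted = set()     # deletions are tracked outside the segment data

    def delete(self, doc_id):
        if doc_id in self.docs:
            self.deleted.add(doc_id)   # mark only; the data stays in the segment

    def search(self, term):
        hits = [d for d, text in self.docs.items() if term in text.split()]
        # Deleted documents are filtered out after the query has run.
        return [d for d in hits if d not in self.deleted]

def merge(segments):
    """Copy only the live documents into one new segment."""
    live = {}
    for seg in segments:
        for doc_id, text in seg.docs.items():
            if doc_id not in seg.deleted:
                live[doc_id] = text
    return Segment(live)

seg_a = Segment({1: "study after school", 2: "study English"})
seg_b = Segment({3: "study math"})
seg_a.delete(2)
merged = merge([seg_a, seg_b])
```

After seg_a.delete(2), document 2 still occupies space in seg_a and is only hidden at search time; the merged segment no longer contains it.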

In ElasticSearch, most queries need to read data from hard disk files (the segment data of an index is stored in hard disk files). Therefore, pointing the node paths (Path) in the global configuration file elasticsearch.yml at a higher-performance hard disk can improve query performance. By default, ElasticSearch resolves the node paths relative to the installation directory, which is exposed by the property path.home. Under the home path, ElasticSearch automatically creates the config, data, logs and plugins directories, so it is generally not necessary to configure the node paths separately. The file path configuration items of the node are:

  • path.data: Sets the directory where the index data of the ElasticSearch node is saved. Multiple directories are separated by commas, for example, path.data: /path/to/data1,/path/to/data2;
  • path.work: Set the directory where ElasticSearch's temporary files are saved;
2. Word segmentation and storage of original text

The mapping parameter index determines whether the ElasticSearch engine analyzes a text field. Analysis splits the text into terms (a token stream) and indexes those terms so that they can be searched:

  • When index is analyzed, the field is an analyzed field: the ElasticSearch engine analyzes the field, splits the text into a stream of terms, and stores them in the inverted index to support full-text search;
  • When index is not_analyzed, the field is not analyzed: the ElasticSearch engine stores the original text in the inverted index as a single term. Full-text search is not supported, but term-level search is; during a search, the query condition must match the entire original text;
  • When index is no, the field is not stored in the inverted index and cannot be searched.

Whether the original value of the field is stored in the index is determined by the mapping parameter store. Its default value is false, that is, the original value is not stored.

The difference between the mapping parameters index and store is:

  • store is used to retrieve (Retrieve) the original value of a field and does not support queries. You can use the projection parameter fields to select fields whose store attribute is true, retrieving only specific fields to reduce network load;
  • index is used to search (Search) a field. When index is analyzed, full-text queries run against the terms of the field; when index is not_analyzed, the original value of the field is treated as a single term, and only term queries against the full original text are possible.
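
The effect of the three index options on searchability can be sketched as follows. This is a toy model: terms_for_field and matches are hypothetical helpers, and lowercasing plus whitespace splitting is an assumed stand-in for a real analyzer.

```python
def terms_for_field(value, index="analyzed"):
    """Return the terms a toy engine would put in the inverted index."""
    if index == "analyzed":
        # Stand-in analyzer: lowercase and split on whitespace.
        return value.lower().split()
    if index == "not_analyzed":
        return [value]   # the whole original text becomes a single term
    if index == "no":
        return []        # nothing is indexed, so nothing can match
    raise ValueError(f"unknown index option: {index}")

def matches(query_term, value, index):
    """Term-level match against whatever was indexed for the field."""
    return query_term in terms_for_field(value, index)

text = "Study English"
print(matches("english", text, "analyzed"))            # True
print(matches("english", text, "not_analyzed"))        # False
print(matches("Study English", text, "not_analyzed"))  # True
print(matches("Study English", text, "no"))            # False
```

With not_analyzed, only a query equal to the entire original text matches, which is the term-level behaviour described in the list above.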

3. The maximum length of a single term
If the index attribute of the field is set to not_analyzed, the original text is indexed as a single term, whose maximum length is measured in UTF-8 encoded bytes. The default maximum length is 32766 bytes. If the text of the field exceeds this limit, ElasticSearch skips the document and returns an exception message in the response:

operation[607]: index returned 400 _index: ebrite _type: events _id: 76860 _version: 0 error: Type: illegal_argument_exception Reason: "Document contains at least one immense term in field="event_raw" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[112, 114,... 115]...', original message: bytes can be at most 32766 in length; got 35100" CausedBy:Type: max_bytes_length_exceeded_exception Reason: "bytes can be at most 32766 in length; got 35100"

The ignore_above attribute can be set on the field. Its value is a number of characters, not a number of bytes; assuming each character occupies at most 3 bytes in UTF-8, you can set:

"ignore_above": 10000

In this way, values longer than 10000 characters are not indexed at all (they are ignored rather than truncated), so with at most 3 bytes per character a single term (Term) is at most 30000 bytes long.

The value for ignore_above is the character count, but Lucene counts bytes. If you use UTF-8 text with many non-ASCII characters, you may want to set the limit to 32766 / 3 = 10922 since UTF-8 characters may occupy at most 3 bytes.
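
The arithmetic behind these limits can be checked with a few lines of Python. The helper names utf8_len and safe_ignore_above are hypothetical, and the 3-bytes-per-character figure is the assumption stated in the text above.

```python
LUCENE_MAX_TERM_BYTES = 32766  # limit quoted in the error message above

def utf8_len(text):
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))

def safe_ignore_above(bytes_per_char=3):
    """Largest character count that cannot exceed the byte limit."""
    return LUCENE_MAX_TERM_BYTES // bytes_per_char

ascii_value = "a" * 10000   # 1 byte per character
cjk_value = "中" * 10000     # 3 bytes per character

print(utf8_len(ascii_value))   # 10000
print(utf8_len(cjk_value))     # 30000
print(safe_ignore_above())     # 10922
```

Both sample values have the same character count, but the second one is three times as long in bytes, which is why the limit has to be derived from the worst case.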

2. Columnar storage (doc_values)
By default, most fields can be searched after being indexed. The inverted index is composed of an ordered list of terms, and each term exists uniquely in the list. Through this data storage mode, you can quickly find the list of documents that contain a term. However, sorting and aggregation operate in opposite data access patterns, and instead of looking up terms to discover documents, they look up documents to discover terms contained in fields. ElasticSearch uses columnar storage to implement sorting and aggregation queries.

Doc values (doc_values) are data structures stored on disk, built at index time from the original values of documents. All field types support doc values except analyzed string fields. By default, doc values are enabled for every field except analyzed string fields. If you do not need to sort or aggregate on a field, you can disable its doc values to save disk space.

"mappings": {
    "my_type": {
        "properties": {
            "status_code": {
                "type":       "string",
                "index":      "not_analyzed"
            },
            "session_id": {
                "type":       "string",
                "index":      "not_analyzed",
                "doc_values": false
            }
        }
    }
}
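
The document-to-value access pattern used by sorting and aggregation can be sketched with a plain dictionary acting as one column. The sample data and the helper names aggregate and sort_docs are hypothetical, and an in-memory dictionary is only a stand-in for the on-disk structure.

```python
from collections import Counter

# One "column": doc_id -> value of the status_code field (made-up sample data).
status_code_column = {1: "200", 2: "404", 3: "200"}

def aggregate(column):
    """Count documents per value, as a terms-style aggregation would."""
    return Counter(column.values())

def sort_docs(column):
    """Order doc IDs by field value, breaking ties by doc ID."""
    return sorted(column, key=lambda doc_id: (column[doc_id], doc_id))

print(aggregate(status_code_column))
print(sort_docs(status_code_column))   # [1, 3, 2]
```

Both operations start from a document and read its value, which is the opposite direction from an inverted-index lookup.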

3. Sequential index (fielddata)
Analyzed string fields do not support doc values (doc_values), but they do support the fielddata data structure, which is stored in the JVM heap memory. Fielddata (held in memory) offers higher query performance than doc values (held on disk). By default, the ElasticSearch engine builds the fielddata structure the first time an aggregation or sorting query is run on a field (at query time); subsequent query requests reuse the fielddata structure to speed up aggregation and sorting.

In ElasticSearch, the data of each segment (segment) of the inverted index is stored in files on the hard disk. After reading the field data from all segments of the inverted index, the ElasticSearch engine reverses the relationship between terms and documents to obtain the relationship between documents and terms, that is, it creates a sequential index, and then stores the sequential index in the JVM heap memory. Loading the inverted index into the fielddata structure consumes a lot of hard disk IO. Therefore, once the data has been loaded into memory, it is best to keep it there until the end of the life cycle of the index segment (segment). By default, a fielddata structure is created for each segment (segment) of the inverted index to hold the values of analyzed string fields. Note, however, that the JVM heap is limited: because fielddata is kept in memory, it can occupy too much heap and even exhaust the memory the JVM needs to run normally, which reduces the query performance of the ElasticSearch engine.
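
The reversal step described above, from term-to-documents to document-to-terms, can be sketched as follows. The function name uninvert is hypothetical, and the input reuses the simplified inverted index from the first section; it is a toy model rather than the engine's actual loading code.

```python
def uninvert(inverted):
    """Turn term -> doc IDs into doc ID -> terms."""
    forward = {}
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            forward.setdefault(doc_id, []).append(term)
    return forward

inverted = {"after": [1], "school": [1], "study": [1, 2], "English": [2]}
fielddata = uninvert(inverted)
print(fielddata)
# {1: ['after', 'school', 'study'], 2: ['study', 'English']}
```

Every term and every posting has to be visited once to build the result, which hints at why the real loading step is expensive and why the result is worth keeping in memory.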
1. The format attribute
Fielddata consumes a lot of JVM memory. Therefore, try to give the JVM a large heap, and do not enable fielddata for fields that do not need it. The format parameter controls whether fielddata is enabled for a field. For analyzed string fields, the default value of format is paged_bytes, which means that fielddata is enabled for them by default. Once fielddata is disabled, sorting and aggregation queries are no longer supported on analyzed string fields.

"mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "fielddata": {
            "format": "disabled" 
          }
        }
      }
    }
  }

2. Loading attribute (loading)
The loading attribute controls when fielddata is loaded into memory. The possible values are lazy, eager and eager_global_ordinals. The default value is lazy.

  • lazy: fielddata is loaded into memory only when needed. By default, fielddata is loaded on the first search; however, when a very large index segment (Segment) is queried, lazy loading results in a longer delay.
  • eager: fielddata is loaded into memory before the segment of the inverted index becomes available for search. Eager loading reduces query latency. However, some data may be so cold that it is never queried, yet it is still eagerly loaded into memory and occupies scarce memory resources.
  • eager_global_ordinals: fielddata is eagerly loaded into memory together with its global ordinals.
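
The difference between lazy and eager loading can be sketched with a small wrapper class. SegmentFielddata and load_from_disk are hypothetical names, and the loader simply returns a fixed dictionary in place of reading a segment; global ordinals are not modelled.

```python
class SegmentFielddata:
    """Toy per-segment fielddata holder with lazy or eager loading."""

    def __init__(self, loader, loading="lazy"):
        self._loader = loader
        self._data = None
        self.load_count = 0
        if loading == "eager":
            self._load()   # pay the cost before any query arrives

    def _load(self):
        self._data = self._loader()
        self.load_count += 1

    def get(self):
        if self._data is None:
            self._load()   # lazy: the first query pays the loading cost
        return self._data

def load_from_disk():
    # Stand-in for the expensive read of a segment from disk.
    return {1: ["after", "school", "study"], 2: ["study", "English"]}

lazy = SegmentFielddata(load_from_disk, loading="lazy")
eager = SegmentFielddata(load_from_disk, loading="eager")
```

Immediately after construction, lazy.load_count is 0 and eager.load_count is 1; the lazy instance loads on its first get() call and reuses the data afterwards.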
4. Memory and heap memory used by the JVM process

1. Configure the memory used by ElasticSearch
ElasticSearch uses the JAVA_OPTS environment variable (Environment Variable) to start the JVM process. The most important settings in JAVA_OPTS are the -Xmx parameter, which controls the maximum memory allocated to the JVM process, and the -Xms parameter, which controls the minimum memory allocated to it. Usually, the default configuration meets project needs.

The ES_HEAP_SIZE environment variable controls the size of the heap memory (Heap Memory) allocated to the JVM process, and the data of the sequential index (fielddata) is stored in the heap memory (Heap Memory).
2. Memory locking
Most operating systems try to use as much memory as possible for file system caches and eagerly swap out unused application memory. However, memory swapping hurts the query performance of the ElasticSearch engine. It is recommended to enable memory locking so that ElasticSearch memory is not swapped in and out.

In the global configuration file elasticsearch.yml, set bootstrap.memory_lock to true. This locks the memory address space of the ElasticSearch process and prevents the OS from swapping ElasticSearch memory out.
