Elasticsearch Core Technology (2) --- Basic Concepts (Index, Type, Document, Cluster, Node, Shards and Replicas, Inverted Index)

This post covers the following basic concepts: Index, Type, and Document; clusters, nodes, shards, and replicas; and the inverted index.

I. Index, Type, Document

1. Index

Index: an index is a container for documents, that is, a collection of documents of one kind.

The word "index" has three meanings in Elasticsearch:

1) Index (noun)

By analogy with a traditional relational database, an index is roughly equivalent to a database (Database) in SQL. An index is identified by its name (which must be all lowercase).

2) Index (verb)

The process of saving a document into an index (noun). This is very similar to the INSERT keyword in SQL, except that if the document already exists it behaves like an UPDATE.

3) Inverted index

A relational database speeds up data retrieval by adding an index, such as a B+ tree index, to specified columns. Elasticsearch uses a structure called an inverted index to achieve the same purpose.

2. Type

A type can be understood as a table (Table) in a relational database.

Previously, there was a type layer between the index and the document: multiple types could be created under each index, and both the index and the type had to be specified when storing a document. Starting with 6.0.0, an index can contain only a single type; in 7.0.0 types are deprecated, and from 8.0.0 they are no longer supported.

Why the concept was abandoned:

Although it is common to compare an Index to a SQL Database and a Type to a SQL Table, the comparison is not accurate. In SQL, tables are independent of each other, and fields with the same name in two tables have nothing to do with each other.

In ES, however, fields with the same name in different types under the same index are treated as the same field in Lucene, and they must have the same definition. So I think an Index is now more like a table, and the Type no longer carries much meaning. Type is now deprecated: starting from 7.0, an index can only have one type, named _doc.
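
The shared-field constraint described above can be sketched with a toy model. This is not Elasticsearch code: the class ToyIndexMapping and its put_field method are hypothetical names, and the only assumption is the behavior stated in the text, namely that all types in one index share a single field namespace.

```python
class ToyIndexMapping:
    """Toy model of one index whose types all share the same field definitions."""

    def __init__(self):
        # field name -> data type, shared by every type in the index
        self.fields = {}

    def put_field(self, type_name, field, data_type):
        existing = self.fields.get(field)
        if existing is not None and existing != data_type:
            # Same field name under another type must have the same definition.
            raise ValueError(
                f"field '{field}' in type '{type_name}' is already defined as "
                f"'{existing}', cannot redefine it as '{data_type}'"
            )
        self.fields[field] = data_type


mapping = ToyIndexMapping()
mapping.put_field("user", "name", "text")
mapping.put_field("tweet", "name", "text")          # same definition: accepted
try:
    mapping.put_field("tweet", "name", "integer")   # conflicting definition
except ValueError as err:
    print("rejected:", err)
```

In SQL the two "name" columns would live in separate tables and could differ freely, which is the gap in the Type/Table analogy.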

3. Document

Document: a single record in an index is called a Document. It is equivalent to a row in a relational database table.

Let's look at the metadata fields of a document:

_index: the name of the index the document belongs to.

_type: the name of the type the document belongs to.

_id: the document's primary key. When writing, you can specify the ID of the document; if it is not specified, the system automatically generates a unique ID.

_version: the version of the document. Elasticsearch uses the version to ensure that changes to a document are applied in the correct order, avoiding data loss caused by out-of-order updates.

_seq_no: a strictly increasing sequence number, assigned at the shard level. A document written later is guaranteed to have a larger _seq_no than a document written earlier.

_primary_term: like _seq_no, _primary_term is an integer. It is incremented by 1 whenever the primary shard is reassigned, for example after a restart or a primary election.

found: true if a document with the queried ID exists; if the ID is wrong and no data is found, the found field is false.

_source: the original JSON data of the document.
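
To make these fields concrete, here is a minimal sketch of a toy in-memory shard in Python. It is not the Elasticsearch API: ToyShard and its methods are hypothetical names invented for illustration, the generated ID is a plain UUID rather than Elasticsearch's own ID format, and real versioning is more involved. It also shows the verb sense of "index": writing to an existing ID behaves like an update.

```python
import uuid


class ToyShard:
    """A toy stand-in for one primary shard; not the Elasticsearch API."""

    def __init__(self, index_name):
        self.index_name = index_name
        self.docs = {}          # _id -> metadata plus _source
        self.next_seq_no = 0    # shard-level counter, strictly increasing
        self.primary_term = 1   # would be bumped when the primary is reassigned

    def index(self, source, doc_id=None):
        # No ID supplied: generate a unique one.
        if doc_id is None:
            doc_id = uuid.uuid4().hex
        previous = self.docs.get(doc_id)
        version = previous["_version"] + 1 if previous else 1
        self.docs[doc_id] = {
            "_index": self.index_name,
            "_type": "_doc",
            "_id": doc_id,
            "_version": version,
            "_seq_no": self.next_seq_no,
            "_primary_term": self.primary_term,
            "_source": source,
        }
        self.next_seq_no += 1
        return self.docs[doc_id]

    def get(self, doc_id):
        doc = self.docs.get(doc_id)
        if doc is None:
            return {"_index": self.index_name, "_type": "_doc",
                    "_id": doc_id, "found": False}
        return {**doc, "found": True}


shard = ToyShard("users")
first = shard.index({"name": "alice"}, doc_id="1")
second = shard.index({"name": "alice", "age": 30}, doc_id="1")  # same ID: acts like an update
print(first["_version"], second["_version"])
print(first["_seq_no"], second["_seq_no"])
print(shard.get("1")["found"], shard.get("missing")["found"])
```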


II. Clusters, Nodes, Shards and Replicas

1. Cluster

An Elasticsearch cluster is really a distributed system, and it needs to have two characteristics:

  1) High availability

    a) Service availability: some nodes are allowed to stop serving;

    b) Data availability: if some nodes are lost, no data is lost;

  2) Scalability

    As the request volume and the amount of data grow, the system can distribute data to other nodes to achieve horizontal scaling.

A cluster can have one or more nodes.

Cluster health status

  1. green: all primary shards and replica shards are available
  2. yellow: all primary shards are available, but not all replica shards are available
  3. red: not all primary shards are available
When the cluster status is red, it still serves requests, executing them against the surviving shards. We need to repair the failed shards as soon as possible to prevent data from going missing in query results.
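
The three health rules above can be written as a small function. This is only a sketch of the rules as stated: the function name cluster_health and the dictionary layout for shard state are assumptions made for this example, not the format returned by Elasticsearch.

```python
def cluster_health(shards):
    """Derive the health color from a list of shard states.

    Each shard is a dict with two hypothetical keys:
    'primary' (True for a primary shard, False for a replica)
    and 'available' (True if the shard is currently assigned and usable).
    """
    primaries_ok = all(s["available"] for s in shards if s["primary"])
    replicas_ok = all(s["available"] for s in shards if not s["primary"])
    if not primaries_ok:
        return "red"
    if not replicas_ok:
        return "yellow"
    return "green"


shards = [
    {"primary": True, "available": True},
    {"primary": False, "available": False},  # a replica that is not assigned
]
print(cluster_health(shards))
```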

2. Node

 1) What is a node?

    a) A node is an Elasticsearch instance, which is essentially a Java process;

    b) Multiple Elasticsearch instances can run on one machine, but in a production environment it is recommended to run only one Elasticsearch instance per machine.

A node is a single server in the cluster that stores data and provides the cluster's search and indexing capabilities. Like a cluster, a node has a unique name; by default a node generates a UUID as its name at startup, and the name can also be specified manually. A cluster may consist of any number of nodes. If only one node is started, it forms a single-node cluster.

3. Shards

Primary Shard

Shards solve the problem of the data size limit of a single node: through primary shards, data can be distributed across all nodes in the cluster.

The relationships between them:

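One node corresponds to one ES instance;
One node can hold multiple indices;
One index can have multiple shards;
One shard is one Lucene index (here "index" is Lucene's own concept, which is not the same thing as an ES index);

The number of primary shards is specified when the index is created and cannot be changed afterwards, except by reindexing.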

An index's data is stored across multiple shards (by default), which is comparable to horizontal table partitioning. A shard is a Lucene instance and is itself a complete search engine. Our documents are stored and indexed in shards, but applications interact directly with the index rather than with shards.
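
The rule that the primary shard count cannot change after index creation follows from how documents are routed. Elasticsearch picks the target shard with a formula of roughly the form shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document _id. The sketch below is an illustration under assumptions: route is a hypothetical helper, and zlib.crc32 stands in for the hash Elasticsearch actually uses (Murmur3), so the concrete shard numbers mean nothing outside this example.

```python
import zlib


def route(doc_id, num_primary_shards):
    """Pick a shard for a document ID using a stand-in hash (crc32)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# Count how many documents would be looked up on a different shard
# if the number of primary shards changed from 5 to 6.
moved = sum(1 for d in doc_ids if route(d, 5) != route(d, 6))
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because most documents would map to a different shard, existing data could no longer be found where the formula points, which is why changing the count requires a reindex.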

Replica Shard

Replicas play two important roles:

1. Service availability: with only one copy of the data, if a node goes down, all the data on it is lost. With replicas, as long as the failed node is not the only one storing the data, the data is not lost. For this reason, a replica shard is never allocated to the same node as its primary shard;

2. Scalability: search performance improves by searching across all replicas in parallel. Because the data on replicas is near real time, every replica can serve searches, so setting a reasonable number of replicas increases search throughput.
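
The placement rule from point 1 (a replica never shares a node with its primary) can be illustrated with a simplified round-robin placement. This is not Elasticsearch's actual allocator: allocate is a hypothetical function that assumes exactly one replica per primary and ignores disk usage, balancing, and every other real-world factor.

```python
def allocate(num_primaries, num_nodes):
    """Place each primary and its single replica on different nodes.

    Returns ({node_number: [shard labels]}, [unassigned replica labels]).
    """
    nodes = {n: [] for n in range(num_nodes)}
    unassigned = []
    for shard in range(num_primaries):
        primary_node = shard % num_nodes
        nodes[primary_node].append(f"P{shard}")
        if num_nodes > 1:
            # The next node is always a different one when there are 2+ nodes.
            replica_node = (primary_node + 1) % num_nodes
            nodes[replica_node].append(f"R{shard}")
        else:
            # Only one node: the replica cannot be placed next to its primary.
            unassigned.append(f"R{shard}")
    return nodes, unassigned


print(allocate(3, 3))
print(allocate(3, 1))
```

With a single node the replicas stay unassigned, which matches the yellow health status described earlier.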

Shard settings

  In a production environment, shard settings require capacity planning in advance, because the number of primary shards is fixed when the index is created and cannot be modified afterwards.

If the number of shards is set too small:

      Adding nodes later does not allow the index to scale horizontally.

      Each shard holds too much data, so data reallocation is time-consuming;

If the number of shards is set too large:

      It affects the relevance scoring of search results and the accuracy of statistical results;

      Too many shards on a single node waste resources and also hurt performance;


III. The Inverted Index

The search functionality of ES is based on Lucene, and the fundamental principle of Lucene search is the inverted index. What the inverted index contains depends on the type of tokenization used.

Example

1. Assume a collection of five documents. The content of each document is shown in the figure; the leftmost column is the document ID of each text block.

[Figure: the five example documents and their IDs]

2. First, a tokenizer automatically splits each document into a sequence of words, and for each word the system records which documents contain it. At the end of this process we get the simplest form of inverted index.

3. The indexing system may also record additional information, for example the term frequency shown in the figure. Each document is split into terms (a term is a word or a phrase, depending on the tokenization method used). The inverted index stores each term, its frequency of occurrence (tf, term frequency), and the positions where it occurs (the terms in an inverted index are stored in sorted order, which the figure does not show). Note that the document content here is one field of a document: each indexed field has its own inverted index.
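
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene stores its index: build_inverted_index is a hypothetical helper, the three sample documents are made up for this example, and tokenization is a plain lowercase split on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: [positions]}}, terms sorted."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Tokenize: here simply lowercase and split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    # Keep the terms in sorted order, as in a term dictionary.
    return dict(sorted(index.items()))


docs = {
    1: "the father of google maps joins facebook",
    2: "google maps founder leaves google",
    3: "facebook hires a maps engineer",
}
index = build_inverted_index(docs)
for term, postings in index.items():
    # Term frequency per document is the number of recorded positions.
    tf = {doc_id: len(positions) for doc_id, positions in postings.items()}
    print(term, "tf:", tf, "positions:", postings)
```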

A simple search process

Suppose we search for 谷歌地图之父 ("father of Google Maps"). The search proceeds as follows:

  1. Tokenization: the tokenizer plugin splits the sentence into three terms: 谷歌 (Google), 地图 (maps), and 之父 (father of)
  2. Each of the three terms is looked up in the inverted index (this is very efficient, for example using binary search). If a term matches, the corresponding document IDs are retrieved and used to fetch the document content

But how is the order of the results determined?

This is where the concept of _score comes in: for each matching term, Lucene computes a score, and the higher the score, the higher the ranking. Several related concepts are:
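
The two steps can be sketched as follows. The term list and postings here are a tiny hand-written index invented for this example, the query terms are English stand-ins for 谷歌, 地图, and 之父, and lookup and search are hypothetical helper names. Binary search is done with the standard library bisect module, which works because the terms are kept sorted.

```python
from bisect import bisect_left

# A tiny hand-written inverted index: sorted terms, each with the IDs
# of the documents that contain it (same position in both lists).
terms = ["facebook", "father", "google", "maps"]
postings = [[1, 3], [1], [1, 2], [1, 2, 3]]


def lookup(term):
    """Binary-search the sorted term list and return the matching doc IDs."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []


def search(query_terms):
    """Collect the IDs of all documents matching at least one query term."""
    matched = set()
    for term in query_terms:
        matched.update(lookup(term))
    return sorted(matched)


print(search(["google", "maps", "father"]))
```

The returned IDs are unordered with respect to relevance; ranking them is a separate scoring step.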

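- TF (term frequency): how often the term appears in the current document. A term that appears 5 times in the current document is more relevant than one that appears once, and it scores higher.
- IDF (inverse document frequency): how often the term appears across all documents. The higher this frequency, the lower the score contributed by the term.
- Field-length norm: simply put, the shorter the field, the higher its weight. For example, when the term `我` matches `我123` and `我123456`, `我123` scores higher.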
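
As a rough illustration of how these three factors combine, the sketch below uses the formulas of Lucene's classic TF-IDF similarity as commonly documented: tf = sqrt(frequency), idf = 1 + ln(numDocs / (docFreq + 1)), and norm = 1 / sqrt(number of terms in the field). Treat it as a simplified sketch: the real scoring function has additional factors, recent Elasticsearch versions score with BM25 by default, and the function names and sample numbers here are assumptions made for this example.

```python
import math


def tf(freq):
    """Term frequency factor: more occurrences in the document, higher value."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: terms found in many documents weigh less."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: shorter fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


def score(freq, num_docs, doc_freq, num_terms):
    """Simplified score for one term in one field: product of the three factors."""
    return tf(freq) * idf(num_docs, doc_freq) * field_norm(num_terms)


# Same term, same corpus; only the field length differs (4 vs 7 tokens,
# assuming one token per character for `我123` and `我123456`).
short_field = score(freq=1, num_docs=100, doc_freq=10, num_terms=4)
long_field = score(freq=1, num_docs=100, doc_freq=10, num_terms=7)
print(short_field > long_field)
```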







Origin www.cnblogs.com/qdhxhz/p/11448451.html