Elasticsearch Learning (1): ES Introduction and Principle Analysis

Recommended reading (ES tutorial summary): https://blog.csdn.net/gwd1154978352/article/details/82781731

1. Elasticsearch overview

1.1 What is ES

ES is an open-source, distributed full-text search engine built on Lucene that exposes a RESTful interface. It is also a distributed document database in which every field is indexed and can be searched.

1.2, ES advantages

(1) Horizontal scalability: adding capacity only takes a new server and a little configuration; once ES starts on it, the node joins the cluster;

(2) Sharding for better distribution: one index is split into multiple shards, similar to the block mechanism of HDFS; this divide-and-conquer approach improves processing efficiency;

(3) High availability: a replication mechanism lets each shard have multiple replicas, so the cluster keeps running as usual when a server goes down, and the data lost with that server is recovered by copying it to other available nodes;

(4) Ease of use: download the package with a single command, and a site search engine can be up and running quickly.

1.3, ES use cases

(1) Large distributed log analysis systems (ELK): ES (log storage) + Logstash (log collection) + Kibana (data display);

(2) Large e-commerce search systems and network-disk search engines.

1.4, ES storage structure

ES is a document-oriented store: each record is one document, and a document is serialized as JSON:

{
    "name" :     "XXX",
    "sex" :      0,
    "age" :      25
}

Relational database structure: database ⇒ table ⇒ row ⇒ column (Columns)

Elasticsearch structure: index (Index) ⇒ type (Type, similar to a table) ⇒ document (Documents) ⇒ field (Fields). Note that types are deprecated in ES 7.x and removed in 8.x.

1.5, ES version control

(1) Why version control is needed

To keep data correct when multiple threads operate on it concurrently.

(2) What are optimistic and pessimistic locking

Pessimistic locking: assumes concurrency conflicts will occur, so it blocks every operation that might violate data integrity;

Optimistic locking: assumes concurrency conflicts will not occur, and only checks for a data integrity violation when the operation is committed.

(3) Internal and external version control

Internal version control: _version is auto-incremented; each modification adds 1 to _version;

External version control: _version follows a version value maintained outside ES; with version_type=external, a write is accepted only if the supplied version is greater than the current _version.

 

ES version control:

ES uses optimistic locking, a lock-free mechanism similar to CAS (compare-and-set); each modification automatically adds 1 to _version.
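
The version check described above can be sketched with a small in-memory store. This is only an illustration of the compare-and-set idea, not the ES API: `VersionedStore`, `VersionConflict`, `index` and `index_external` are hypothetical names chosen for this example.

```python
class VersionConflict(Exception):
    """Raised when a write carries a stale version."""


class VersionedStore:
    """Toy in-memory store that mimics _version checks (hypothetical, not the ES API)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, expected_version=None):
        """Internal versioning: compare-and-set, then _version + 1."""
        current = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"expected {expected_version}, found {current}")
        self._docs[doc_id] = (current + 1, source)
        return current + 1

    def index_external(self, doc_id, source, version):
        """External versioning: accept only a strictly greater version."""
        current = self._docs.get(doc_id, (0, None))[0]
        if version <= current:
            raise VersionConflict(f"external {version} <= stored {current}")
        self._docs[doc_id] = (version, source)
        return version


store = VersionedStore()
v1 = store.index("1", {"age": 25})                       # first write
v2 = store.index("1", {"age": 26}, expected_version=v1)  # succeeds, version bumps
try:
    store.index("1", {"age": 27}, expected_version=v1)   # stale version
    stale_write_rejected = False
except VersionConflict:
    stale_write_rejected = True
```

A writer that loses the race is expected to re-read the document and retry with the new version, which is what makes the scheme lock-free.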

2. Principle analysis

2.1, Distributed architecture of ES (how ES achieves distribution)


Underlying layer: built on Lucene.
Core idea: start multiple ES process instances on multiple machines to form an ES cluster.
Basic unit: the index.

2.2, How ES writes data

  • The client picks a node and sends the request to it; this node becomes the coordinating node.
  • The coordinating node routes the document and forwards the request to the node that holds the corresponding primary shard.
  • The primary shard on that node processes the request, then synchronizes the data to the replica nodes.
  • Once the coordinating node finds that the primary node and all replica nodes are done, it returns the response to the client.
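
The four steps above can be sketched as a toy model. The assumptions here are: routing is "hash of the document id modulo the number of primary shards", `zlib.crc32` is only a stand-in for the hash function ES really uses, and `Cluster` and `route_to_shard` are hypothetical names for this example.

```python
import zlib


def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Pick a primary shard from the doc id (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


class Cluster:
    """Toy cluster: each primary shard and each replica is a plain dict."""

    def __init__(self, num_primary_shards: int, num_replicas: int):
        self.num_primary_shards = num_primary_shards
        self.primaries = [{} for _ in range(num_primary_shards)]
        self.replicas = [
            [{} for _ in range(num_replicas)] for _ in range(num_primary_shards)
        ]

    def write(self, doc_id: str, source: dict) -> dict:
        # Coordinating node: route the document to its primary shard.
        shard = route_to_shard(doc_id, self.num_primary_shards)
        # Primary shard handles the request first.
        self.primaries[shard][doc_id] = source
        acks = 1
        # Then the data is synchronized to every replica of that shard.
        for replica in self.replicas[shard]:
            replica[doc_id] = source
            acks += 1
        # Only after primary and replicas are done is the response returned.
        return {"shard": shard, "acks": acks}


cluster = Cluster(num_primary_shards=3, num_replicas=2)
result = cluster.write("doc-1", {"name": "XXX", "age": 25})
```

Because the shard is derived from the number of primary shards, changing that number would send existing ids to different shards, which is one reason to treat it as fixed in this sketch.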

2.3, Underlying principle of ES data writes


Data is first written to an in-memory buffer. Every 1s the buffer is refreshed into the OS cache, and once data is in the OS cache it can be searched (this is why there is a delay of about 1s between writing a document and being able to search it). Every 5s the translog is written to a file on disk (so if a machine goes down and the data in memory is gone, at most 5s of data is lost). When the translog grows to a certain size, or every 30 minutes by default, a commit is triggered: the data in the buffer is flushed to segment files on disk, and the inverted index is built as the segment files are written.
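
The write path above can be sketched as a small simulation. It is a deliberately simplified model, not how ES is implemented: timers are replaced by explicit method calls, and `ShardWriter` with its attribute names is hypothetical.

```python
class ShardWriter:
    """Toy model of buffer -> refresh -> translog sync -> flush (commit)."""

    def __init__(self):
        self.buffer = []          # in-memory buffer: not searchable yet
        self.translog_mem = []    # translog entries not yet written to disk
        self.translog_disk = []   # translog entries persisted to disk
        self.segments_cache = []  # segments in the OS cache: searchable
        self.segments_disk = []   # committed segments on disk: searchable

    def write(self, doc):
        self.buffer.append(doc)
        self.translog_mem.append(doc)

    def refresh(self):
        """Would run every 1s: buffer becomes a searchable segment."""
        if self.buffer:
            self.segments_cache.append(list(self.buffer))
            self.buffer.clear()

    def sync_translog(self):
        """Would run every 5s: persist pending translog entries."""
        self.translog_disk.extend(self.translog_mem)
        self.translog_mem.clear()

    def flush(self):
        """Commit: persist segments, then the translog can be emptied."""
        self.refresh()
        self.segments_disk.extend(self.segments_cache)
        self.segments_cache.clear()
        self.translog_mem.clear()
        self.translog_disk.clear()

    def search(self, predicate):
        segments = self.segments_disk + self.segments_cache
        return [doc for seg in segments for doc in seg if predicate(doc)]


writer = ShardWriter()
writer.write({"id": 1})
before_refresh = writer.search(lambda d: True)  # buffer is not searchable
writer.refresh()
after_refresh = writer.search(lambda d: True)   # now visible to search
```

In this model a crash before `sync_translog()` loses whatever sits in `translog_mem`, which corresponds to the "at most 5s of data" window described above.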
 

2.4, How ES reads data


A document can be queried by doc id: the doc id is hashed to determine which shard the document was assigned to, and the query is served from that shard.

  • The client sends a request to any node, which becomes the coordinating node.
  • The coordinating node hashes the doc id to route the request and forwards it to the corresponding node, using round-robin polling to pick one copy among the primary shard and all of its replicas, so that read requests are load balanced.
  • The node that receives the request returns the document to the coordinating node.
  • The coordinating node returns the document to the client.
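
The read routing above can be sketched as follows. As before, `zlib.crc32` is only a stand-in hash, and `ReadRouter`, `pick_copy` and the copy labels are hypothetical names used for illustration.

```python
import itertools
import zlib


class ReadRouter:
    """Toy coordinating node: hash to a shard, then round-robin over its copies."""

    def __init__(self, num_primary_shards: int, num_replicas: int):
        self.num_primary_shards = num_primary_shards
        copies = ["primary"] + [f"replica-{i}" for i in range(num_replicas)]
        # One independent cycling iterator per shard.
        self._cycles = [itertools.cycle(copies) for _ in range(num_primary_shards)]

    def pick_copy(self, doc_id: str):
        shard = zlib.crc32(doc_id.encode("utf-8")) % self.num_primary_shards
        return shard, next(self._cycles[shard])


router = ReadRouter(num_primary_shards=2, num_replicas=2)
picks = [router.pick_copy("doc-1") for _ in range(4)]
copies_used = [copy for _, copy in picks]
```

Repeated reads of the same document always land on the same shard, but rotate across the primary and its replicas, which spreads the read load.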

2.5, Underlying principle of ES deletes / updates


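(1) Delete: for a delete operation, a .del file is generated at commit time in which the doc is marked as deleted; at search time, the .del file tells whether the doc has been deleted.
(2) Update: for an update operation, the original doc is marked as deleted and a new document is written.
Each refresh of the buffer produces one segment file, so by default there is one new segment file per second. The number of segment files keeps growing, so a merge runs periodically. Each merge combines multiple segment files into one and physically removes the docs marked as deleted. The new segment file is then written to disk together with a commit point that identifies all the new segment files, the new segment file is opened for search, and the old segment files are deleted.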
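
The delete, update and merge behaviour above can be sketched like this. It is a simplified model under stated assumptions: a segment is an immutable list of documents, the `deleted` set plays the role of the .del file, and `Segment`, `delete`, `update` and `merge` are hypothetical names.

```python
class Segment:
    """Toy segment: documents never change after the segment is written."""

    def __init__(self, docs):
        self.docs = list(docs)  # list of (doc_id, source)
        self.deleted = set()    # positions marked as deleted (.del role)


def delete(segments, doc_id):
    """Mark every stored copy of doc_id as deleted; nothing is removed yet."""
    for segment in segments:
        for position, (stored_id, _) in enumerate(segment.docs):
            if stored_id == doc_id:
                segment.deleted.add(position)


def update(segments, buffer, doc_id, source):
    """Update = mark the old doc as deleted, then write a new doc."""
    delete(segments, doc_id)
    buffer.append((doc_id, source))


def merge(segments):
    """Combine segments into one, physically dropping deleted docs."""
    live = [
        doc
        for segment in segments
        for position, doc in enumerate(segment.docs)
        if position not in segment.deleted
    ]
    return Segment(live)


seg1 = Segment([("1", {"v": 1}), ("2", {"v": 1})])
seg2 = Segment([("3", {"v": 1})])
delete([seg1, seg2], "2")
buffer = []
update([seg1, seg2], buffer, "1", {"v": 2})
seg3 = Segment(buffer)  # the next refresh turns the buffer into a segment
merged = merge([seg1, seg2, seg3])
```

After the merge, the old segments can be discarded: the merged segment holds only live documents and carries no delete markers.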


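2.6, Underlying Lucene

Lucene is simply a jar package that contains packaged algorithm code for building inverted indexes. When developing in Java, we import the Lucene jar and develop against the Lucene API. With Lucene we can build an index over existing data, and Lucene organizes the index data structures on the local disk for us.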


2.7, Inverted index


In a search engine, each document has a corresponding document ID, and the document content is represented as a set of keywords.

For example, document 1 is tokenized and 20 keywords are extracted; for each keyword, the number of times it appears in the document and the positions where it appears are recorded. An inverted index is then a mapping from keywords to document IDs: each keyword corresponds to the list of documents in which that keyword appears.

  • Every term in the inverted index corresponds to one or more documents
  • The terms in the inverted index are sorted in ascending dictionary order

A forward index is a mapping from documents to keywords (given a document, find its keywords); an inverted index is a mapping from keywords to documents (given a keyword, find the documents).

 

Example:

Document content:

No.  Document content
1    Xiaojun is the founder of a technology company; the car is an Audi A8L, and the acceleration is cool.
2    Wei is the receptionist at a technology company; the car is a Porsche 911.
3    Wei bought a red Porsche 911; the acceleration is cool.
4    Xiao Ming is the development director of a technology company; the car is an Audi A6L, and the acceleration is cool.
5    Xiao Jun is a developer at a technology company; the car is a BYD Surui, and the acceleration is a bit slow.

The inverted index tokenizes the document content above into keywords, so documents can be located directly by keyword:

Word ID  Word                  Inverted list (docId)
1        small                 1,2,3,4,5
2        One                   1,2,4,5
3        Technology companies  1,2,4,5
4        Develop               4,5
5        car                   1,2,4,5
6        Audi                  1,4
7        Accelerate cool       1,3,4
8        Porsche               2,3
9        Porsche 911           2
10       BYD                   5
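
As a minimal sketch of how such a table can be built, the following code maps each term to the documents and positions where it appears. The assumptions are loud ones: tokenization is plain whitespace splitting rather than a real analyzer, the sample documents are shortened English stand-ins, and `build_inverted_index` and `search` are hypothetical names.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    # Keep the terms in ascending dictionary order.
    return dict(sorted(index.items()))


def search(index, term):
    """Return the sorted doc ids that contain the term."""
    return sorted(index.get(term.lower(), {}))


sample_docs = {
    1: "Audi accelerates fast",
    2: "Porsche 911",
    3: "red Porsche accelerates",
}
inverted = build_inverted_index(sample_docs)
porsche_docs = search(inverted, "Porsche")
```

The term frequency in a document is simply the length of its position list, for example `len(inverted["porsche"][2])`.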

 

Some common questions about the inverted index:

(1) Why is an inverted index faster than the B-Tree index of an ordinary database?

 

 


Origin blog.csdn.net/RuiKe1400360107/article/details/103864216