An analysis of the ES distributed architecture and its underlying principles

One About ES

The design concept of Elasticsearch is a distributed search engine whose underlying implementation is based on Lucene. The core idea is to start multiple ES process instances on multiple machines, which together form an ES cluster.

Two ES concepts

A Near real-time


ES is a near real-time search platform, which means there is a slight delay between the moment a document is indexed and the moment it becomes searchable.

B Cluster


A cluster is composed of one or more nodes (servers). Together, the nodes hold all of the data and provide federated indexing and search across the whole collection of nodes. Each cluster is identified by a unique name.

C Node


A node is a single server that is part of a cluster; it stores data and participates in the cluster's indexing and search capabilities. A node joins a particular cluster by being configured with that cluster's name, and you can start as many nodes in a cluster as you want.

D Index


An index is a collection of documents that share some common characteristics. An index is identified by a unique name, and this name is used to refer to the index when performing search, update, and delete operations on its documents.

E Type


Types have been deprecated since 6.0.0.

F Document


A document is the basic unit of indexing and search.

Three Index structure, shards, and replicas


    In ES, the basic unit for storing data is the index. For example, to store the order data of a sales system in ES, you would create an index, say order-index, and write all of the order data into it; an index is analogous to a database. A type is analogous to a table within that database: one index can contain multiple types. The mapping is the structure definition of the table: it defines which fields exist and what type each field has. A row of data you add to a type in an index is called a document, and each document has multiple fields, each field holding one field value of the document.
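
As a concrete illustration of index, mapping, document, and field, here is a minimal sketch using the official elasticsearch-py client (v8-style API, assumed) against a local node. The index name order-index comes from the text above; the field names are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local single-node cluster

# Create the index with a mapping: the mapping plays the role of a table
# schema, defining each field and its type.
es.indices.create(
    index="order-index",
    mappings={
        "properties": {
            "order_id":   {"type": "keyword"},
            "amount":     {"type": "double"},
            "created_at": {"type": "date"},
        }
    },
)

# Each "row" written into the index is a document made up of fields.
es.index(index="order-index", id="1", document={
    "order_id": "A-1001",
    "amount": 99.5,
    "created_at": "2019-09-30T12:00:00",
})
```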

1 Shards

      The data stored in an index may exceed the storage limit of a single node's hardware. To solve this problem, Elasticsearch provides sharding: an index can be subdivided into multiple shards. When you create an index, you simply define the number of shards you want; each shard is itself a fully functional, completely independent index that can be placed on any node in the cluster. Sharding is important for two reasons:
(1) It allows you to horizontally split the content volume.
(2) It allows you to distribute and parallelize operations across shards, to cope with a growing volume of work.

2 Replicas
      In a network environment, failures can occur at any time, so a recovery mechanism is necessary. To this end, ES allows you to make one or more copies of each shard; these copies are called replica shards, or simply replicas. Replicas are important for two main reasons:
(1) High availability. A replica keeps the data available when a node goes down or a shard fails. For this reason, one very important point is that a replica shard is never placed on the same machine as the primary shard it copies.
(2) High concurrency. Replicas let a shard serve search traffic beyond its own throughput, since searches can be executed in parallel on all replicas of the shard.
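
Shard and replica counts are set per index at creation time. Below is a minimal sketch (elasticsearch-py, v8-style API, assumed local node; the counts and the index name are illustrative assumptions). Note that the number of primary shards is fixed once the index is created, while the number of replicas can be changed later.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="order-index-sharded",  # hypothetical index name
    settings={
        "number_of_shards": 3,    # primary shards: fixed at creation time
        "number_of_replicas": 1,  # replicas per primary: adjustable later
    },
)

# The replica count can be changed on a live index:
es.indices.put_settings(
    index="order-index-sharded",
    settings={"index": {"number_of_replicas": 2}},
)
```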


  In summary, the complete flow is as follows. When an ES client writes a document, the write goes to a primary shard, and the primary shard replicates the data to its replica shards; when the client reads, it can fetch the data from either the primary shard or a replica shard. Among the multiple nodes of an ES cluster, one node is automatically elected as the master node. The master actually performs management-type operations, such as maintaining cluster metadata and switching the identities of primary and replica shards. If the master node goes down, another node is re-elected as master. If a non-master node goes down, the master transfers the primary-shard identity from the downed node to the corresponding replica shard. If the downed machine is then repaired and restarted, the master allocates the missing replica shards back to it and synchronizes the modifications made in the meantime, so the cluster returns to normal.
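
To observe this from a client, a minimal sketch (elasticsearch-py, v8-style API, assumed local node) can query the cluster health and the per-shard allocation:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

health = es.cluster.health()
print(health["status"])           # green / yellow / red
print(health["number_of_nodes"])  # how many nodes have joined the cluster

# Per-shard view: which node holds each primary (p) and replica (r) shard.
print(es.cat.shards(index="order-index", v=True))
```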

Four The process of writing data in ES

A The client selects a node and sends the request to it; this node becomes the coordinating node (coordinator node).

B The coordinating node routes the document and forwards the request to the corresponding node (the one holding the primary shard).

C The node holding the primary shard actually processes the request, then synchronizes the data to the replica nodes.

D When the coordinating node finds that the primary node and all the replica nodes have finished, it returns the response to the client.
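
The routing in step B hashes the document's routing value (by default its ID) to pick a primary shard: shard_num = hash(_routing) % number_of_primary_shards. ES uses a murmur3 hash internally; the sketch below substitutes Python's built-in hash() just to illustrate the idea.

```python
# Illustrative stand-in for the coordinating node's routing rule.
def route_to_shard(doc_id: str, number_of_primary_shards: int) -> int:
    # ES itself: shard_num = murmur3_hash(_routing) % number_of_primary_shards
    return hash(doc_id) % number_of_primary_shards

for doc_id in ["A-1001", "A-1002", "A-1003"]:
    print(doc_id, "-> shard", route_to_shard(doc_id, 3))
```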

Five The process of reading data in ES

A read is a GET of a single piece of data. When you write a document, ES automatically assigns it a globally unique ID and routes it to the corresponding primary shard by hashing this ID (you can also set the ID manually).

A The client sends a request to any node, which becomes the coordinating node.

B The coordinating node routes the document and forwards the request to the corresponding node. A round-robin algorithm is used here, rotating randomly among the primary shard and all of its replicas, so that read requests are load balanced.

C The node that receives the request returns the document to the coordinating node.

D The coordinating node returns the document to the client.
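
A minimal sketch of such a GET by ID (elasticsearch-py, v8-style API, assumed local node; the ID matches the document written in the earlier sketch):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The node we connect to acts as the coordinating node; the document itself
# may be served by the primary shard or by any of its replicas.
doc = es.get(index="order-index", id="1")
print(doc["_source"])
```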

Six The process of searching data in ES

A The client sends a request to a node, which becomes the coordinating node.

B The coordinating node forwards the search request to the primary shard or a replica shard of every shard of the index.

C Query phase: each shard executes the search locally and returns its own results (in fact, just unique document identifiers plus sort values) to the coordinating node; the coordinating node merges, sorts, and paginates the data to produce the final result.

D Fetch phase: the coordinating node then pulls the actual data from each node based on the unique identifiers, and finally returns it to the client.
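
A minimal search sketch (elasticsearch-py, v8-style API, assumed local node; the field name is an illustrative assumption). The query and fetch phases happen inside ES; the client receives the merged, globally sorted page of hits:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="order-index",
    query={"match": {"order_id": "A-1001"}},
    from_=0,   # pagination offset
    size=10,   # page size
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"], hit["_source"])
```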

Seven The underlying principles of writing data

A Data is first written into an in-memory buffer; data sitting in the buffer cannot be searched yet. At the same time, the data is written into the translog log file.

B When the buffer is nearly full, or after a fixed interval, the data in the buffer is refreshed into a new segment file in the OS cache; by default this refresh runs every 1 second. If no new data has entered the buffer during that second, no new segment file is created. As soon as the data in the buffer has been refreshed into the OS cache, it can be searched. You can also trigger a refresh manually through the RESTful API or the Java API, flushing the buffered data into the OS cache so that it becomes searchable immediately (a manual refresh appears in the sketch after this list). Whenever data enters the OS cache, the buffer is cleared. Meanwhile, the data arriving at the shard is also written into the translog, and every 5 seconds the translog data is persisted to disk.

C The above steps repeat: each new piece of data is written into the buffer and, at the same time, appended to the translog log file. The translog file keeps growing, and when it reaches a certain size, a commit operation is triggered.

D A commit point is written to a disk file; it identifies all the segment files that correspond to this commit point.

E All data in the OS cache is forcibly fsync'ed to disk files.
Explanation of the translog's role: before a commit is performed, all the data sits either in the buffer or in the OS cache, and both are memory; once the machine dies, the in-memory data is lost. The operations therefore also need to be written to a dedicated log file, the translog. When the machine is restarted after a crash, ES automatically reads the data in the translog log file and restores it into the in-memory buffer and the OS cache.

F The existing translog file is emptied and a new translog is started; at this point the commit has succeeded. By default a commit is executed every 30 minutes, but it is also triggered when the translog file grows too large. The whole commit process is called a flush operation. We can also call the ES API to flush manually (see the sketch after this list): fsync the OS cache data to disk, record a commit point, and empty the translog file.
Addendum: translog data is itself first written into the OS cache and by default is flushed to disk every 5 seconds. So up to 5 seconds of data may exist only in the buffer or in the translog's OS cache, and if the machine goes down at that moment, those 5 seconds of data are lost; this trade-off gives better performance. Alternatively, you can fsync each operation directly to disk, but performance will be considerably worse.

G For a delete operation, a .del file is produced at commit time, in which the doc is marked with a deleted state. When searching, the .del file reveals that a document has been deleted, so it is excluded from the results.

H An update operation marks the original doc as deleted and then writes a new piece of data.

I Every refresh of the buffer produces a new segment file, so by default segment files accumulate quickly; a merge operation is therefore performed periodically.

J Each merge combines multiple segment files into one, physically removing the documents that were marked as deleted. The new segment file is written to disk, a commit point is written that identifies all the new segment files, and the new segment file is opened for searching.

   In short, the core concepts are segment, refresh, flush, translog, and merge.
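
Steps B, F, and J each have a manual counterpart in the API. A minimal sketch (elasticsearch-py, v8-style API, assumed local node and the order-index name used earlier):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Step B by hand: refresh moves buffered data into a new segment in the OS
# cache, making it searchable immediately instead of after ~1 second.
es.indices.refresh(index="order-index")

# Step F by hand: flush performs the commit, fsyncing segments to disk,
# writing a commit point, and emptying the translog.
es.indices.flush(index="order-index")

# Step J by hand: force a merge down to a single segment, physically dropping
# documents marked as deleted (best done on indices no longer being written).
es.indices.forcemerge(index="order-index", max_num_segments=1)
```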

 

Eight The underlying principles of search

  The query-and-retrieval process is roughly divided into two phases: the query is broadcast to a copy of every shard, and then the shards' responses are integrated into a globally sorted result set, which is returned to the client.

 Query phase

  A When a node receives a search request, that node becomes the coordinating node. The first step is to broadcast the request to a shard copy of every shard: the query request can be processed by either a primary shard or one of its replica shards. The coordinating node rotates through all the shard copies on subsequent requests to share the load.

  B Each shard builds a local priority queue. If the client asks for a result set of size size starting from sort position from, each shard must produce a local result set of size from + size; thus the priority queue size is from + size. A shard returns only a lightweight result to the coordinating node, containing the ID of each document and the information required for sorting.

  C The coordinating node aggregates all the results, sorts them globally, and produces the final ranking.
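
A quick worked example of the from + size arithmetic above (the values are illustrative):

```python
# With from=90, size=10 on an index with 3 shards, each shard returns
# from + size = 100 lightweight entries (doc ID + sort value), and the
# coordinating node sorts 3 * 100 = 300 entries to pick the final 10.
from_, size, shards = 90, 10, 3
per_shard = from_ + size
print(per_shard)           # 100 entries per shard
print(per_shard * shards)  # 300 entries merged on the coordinating node
```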

 Fetch phase

  A The query phase yields a sorted list that marks which documents match the request; these documents still need to be retrieved and returned to the client.

  B The coordinating node determines which documents actually need to be returned and sends GET requests to the shards containing those documents; the shards return the documents to the coordinating node, and the coordinating node returns the results to the client.

Nine Inverted index

An inverted index maps the relationship between words and documents: in an inverted index, the data is organized around words rather than around documents.
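
A tiny sketch of the idea in plain Python (the sample documents are illustrative): instead of storing document -> words, we store word -> the set of documents containing it, which is what makes term lookups fast.

```python
from collections import defaultdict

docs = {
    1: "elasticsearch is a distributed search engine",
    2: "lucene is the search library under elasticsearch",
}

# Build word -> {doc ids} instead of doc id -> words.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted[word].add(doc_id)

print(sorted(inverted["search"]))  # [1, 2]
print(sorted(inverted["lucene"]))  # [2]
```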

Ten How to improve efficiency with massive data

A Filesystem cache


ES search performance depends heavily on the underlying filesystem cache. If you give more memory to the filesystem cache, ideally enough for it to hold all of the index segment files, most searches will be served from memory rather than disk.

B Data preheating


For data you consider hot, that is, data that will be accessed frequently, it is best to build a dedicated cache-preheating subsystem: every so often, access the hot data in advance so that it is pulled into the filesystem cache. The next real access will then perform much better.
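
A minimal sketch of such a preheating subsystem (the hot queries, index name, and interval are illustrative assumptions to be tuned to your access pattern):

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

HOT_QUERIES = [
    {"match": {"order_id": "A-1001"}},  # illustrative hot query
]

while True:
    for q in HOT_QUERIES:
        # size=0: we only want to warm the filesystem cache, not read hits.
        es.search(index="order-index", query=q, size=0)
    time.sleep(300)  # re-warm every 5 minutes
```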

C Hot and cold separation

ES performance can also be optimized by splitting the data: store the large number of fields that are rarely searched in a separate place, keeping the frequently searched data by itself. This is similar to the vertical splitting of MySQL's database-and-table sharding.

D Document model design

Do not perform all kinds of complex operations at search time. As far as possible, handle them during document model design, completing them when the data is written; apart from a few unavoidable complex operations, try to avoid them entirely.

E Paging performance optimization

The deeper you page, the more data each shard returns and the longer the coordinating node takes to process it. One approach is to use scroll: scroll takes a one-time snapshot of all the data and then pages through it by moving a cursor. The API only supports turning pages forward, one page at a time.
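
A minimal scroll sketch (elasticsearch-py, v8-style API, assumed local node and index name):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Open a scroll: a point-in-time snapshot plus a cursor over it.
resp = es.search(index="order-index", scroll="2m", size=100,
                 query={"match_all": {}})
scroll_id = resp["_scroll_id"]

while resp["hits"]["hits"]:
    for hit in resp["hits"]["hits"]:
        print(hit["_id"])
    # Move the cursor forward; scroll only supports next-page traversal.
    resp = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = resp["_scroll_id"]

es.clear_scroll(scroll_id=scroll_id)  # release the snapshot
```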

 

