Elasticsearch in depth: a basic understanding of ES

A basic understanding of Elasticsearch

Elasticsearch is a real-time, distributed search and analytics engine that uses Lucene internally for indexing and searching.
Real-time: newly written data becomes searchable quickly.
Distributed: the cluster size can be adjusted dynamically, allowing elastic scaling.
Lucene: a full-text search library written in Java for processing plain-text data. It is only a library that provides interfaces for building indexes and running searches; it does not provide distributed services.

Basic concepts of ES

  • cluster: a cluster consists of multiple nodes, one of which is the master node. The master node is chosen by election and acts as the central coordinator.
  • shards: index shards. ES can split a complete index into multiple shards, so that one index is spread across several nodes and searches can be distributed.
  • replicas: copies of an index. ES can keep multiple replicas of an index, which both improves fault tolerance and lets search requests be load-balanced.
  • recovery: data recovery, also called data redistribution. When nodes join or leave, ES redistributes index shards according to machine load; data recovery also takes place when a failed node restarts.
  • river: a data source for ES, and also a way to synchronize data from other storage (such as a database) into ES. (Rivers include CouchDB, RabbitMQ, etc.)
  • gateway: how ES index snapshots are stored. By default ES keeps the index in memory first and persists it to the local disk when memory fills up. When ES is shut down and restarted, it reads the index backup data from disk. ES supports several gateway types: the local file system (the default), a distributed file system, Hadoop's HDFS, and cloud storage.
  • discovery.zen: the automatic node discovery mechanism. ES first looks for existing nodes by broadcast, and the nodes then communicate with each other via a multicast protocol.
  • Transport: how ES nodes interact internally and how the cluster interacts with clients. By default TCP is used for internal interaction, while JSON over HTTP, Thrift, and other protocols are also supported.

Some features of ES

Scalability: data is split into shards that are assigned to different machines, which gives the system horizontal scalability.
Parallel reads and writes: the shard is the basic unit of reading and writing. Splitting a huge index into shards lets reads and writes run in parallel across multiple machines.
Resilience: data is copied into multiple replicas that are placed on different machines.
Concurrent updates: ES divides the copies of the data into two kinds, primary shards and replica shards. The primary shard holds the authoritative data: a write goes to the primary shard first and then, once that succeeds, to the replica shards. During recovery the primary shard is the reference.
Real-time: to make newly indexed data searchable as soon as possible, ES first writes data to the file system cache, where it is already readable, before it has been written to disk.
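
The scalability point above depends on every document mapping to exactly one primary shard. Here is a minimal sketch of that idea, assuming a plain hash-modulo rule. The helper name route_to_shard is hypothetical, and zlib.crc32 is only a stand-in for the hash function that ES actually applies to the routing value.

```python
import zlib


def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Pick a primary shard number for a document by hashing its id.

    Assumption: crc32 stands in for the real hash function used by ES.
    """
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    # The same id always lands on the same shard.
    for doc_id in ["1", "2", "3", "4"]:
        print(doc_id, "->", route_to_shard(doc_id, 3))
```

Under a rule like this, changing the number of primary shards would change where existing documents are expected to live, which is one way to see why that number is normally chosen when the index is created.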

ES index structure

ES is document-oriented. It stores all kinds of text as documents; a document can be a message, a log entry, or the content of a web page.
ES generally supports only JSON, meaning that it uses JSON as the serialization format for documents.
In the storage structure, a document is uniquely identified by three parameters, _index, _type, and _id:

  • _index: a logical namespace that points to one or more physical shards
  • _type: a set of data whose overall structure is identical or similar
  • _id: the document identifier, generated automatically by the system or provided by the user
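
To make the three-part identity concrete, here is a small sketch that models it as a hashable key. The class name DocumentKey and its field names are hypothetical, and the /index/type/id path layout assumes an ES version that still uses _type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentKey:
    """Hypothetical model of the (_index, _type, _id) triple."""
    index: str
    doc_type: str
    doc_id: str

    def rest_path(self) -> str:
        # Assumed path layout for ES versions that still have _type.
        return f"/{self.index}/{self.doc_type}/{self.doc_id}"


# Two keys with the same three values identify the same document.
store = {DocumentKey("blog", "article", "1"): {"title": "Hello ES"}}
same_key = DocumentKey("blog", "article", "1")
```

Because the dataclass is frozen, equal triples compare and hash equal, so looking up store[same_key] returns the stored document.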

An ES index contains many shards. Each shard is a Lucene index, which is itself a complete search engine that can perform indexing and search tasks independently.
A Lucene index in turn consists of many segments, and each segment is an inverted index.

Once the files of an inverted index have been written they are immutable, which brings benefits: accessing the files needs no locking, the index can be served from the file system cache, and so on.
To keep the index consistent, when a document is modified the old version is marked as deleted (but not physically removed) and the new version is used instead.

By default, ES empties the write buffer once per second and writes the data to a file. This process is called refresh, and each refresh creates a new Lucene segment.
Too many Lucene segments hurt performance, so ES uses a strategy of merging small segments into larger ones. During a merge, data marked as deleted is not written to the new segment; when the merge finishes the old segments are deleted, and the data marked as deleted is removed from disk along with them.
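
The refresh and merge behavior described above can be sketched with a toy model. This is a simplified illustration, not how Lucene is implemented: ToyShard and its methods are hypothetical names, segments are plain dictionaries, and the only thing that changes on an existing segment is its set of delete marks.

```python
class ToyShard:
    """Toy model of one shard: a write buffer plus immutable segments."""

    def __init__(self):
        self.buffer = {}    # doc_id -> doc, not yet searchable
        self.segments = []  # each: {"docs": {...}, "deleted": set()}

    def index(self, doc_id, doc):
        # New and updated documents go to the buffer first.
        self.buffer[doc_id] = doc

    def refresh(self):
        """Turn the buffer into a new segment and mark old versions deleted."""
        if not self.buffer:
            return
        for segment in self.segments:
            for doc_id in self.buffer:
                if doc_id in segment["docs"]:
                    segment["deleted"].add(doc_id)  # mark only, no rewrite
        self.segments.append({"docs": dict(self.buffer), "deleted": set()})
        self.buffer = {}

    def get(self, doc_id):
        """Search segments only; the buffer is invisible until refresh."""
        for segment in reversed(self.segments):
            if doc_id in segment["docs"] and doc_id not in segment["deleted"]:
                return segment["docs"][doc_id]
        return None

    def merge(self):
        """Merge all segments into one, dropping documents marked deleted."""
        live = {}
        for segment in self.segments:
            for doc_id, doc in segment["docs"].items():
                if doc_id not in segment["deleted"]:
                    live[doc_id] = doc
        self.segments = [{"docs": live, "deleted": set()}] if live else []
```

Indexing the same id twice with a refresh in between leaves two segments, the older one carrying a delete mark; calling merge() afterwards leaves a single segment that contains only the newer version.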

How to understand the inverted index?
Forward index: can be thought of as the word-count result for one document, for example hello: 2 times, world: 1 time.
Inverted index: a concrete storage format that implements the "word-document matrix". With an inverted index, you can quickly get the list of documents that contain a given word.
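
The difference between the two index types can be shown with a short sketch. It assumes whitespace tokenization and lowercasing, which is far simpler than a real analyzer; the function names and the two sample documents are made up for illustration.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Very simple analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_forward_index(docs):
    """Map each document id to the word counts of that document."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}


def build_inverted_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


docs = {1: "hello world hello", 2: "hello elasticsearch"}
forward = build_forward_index(docs)
inverted = build_inverted_index(docs)
```

Answering "which documents contain hello" is a single dictionary lookup in inverted, whereas the forward index would require scanning every document.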

ES cluster

An ES cluster uses a master-slave model:

  • This model simplifies system design: the master is the authoritative node, certain operations are performed only by the master, and the master maintains the cluster metadata.
  • The drawbacks are that the master node is a single point of failure and needs a disaster-recovery plan, and that cluster size is limited by the master node's management capacity.

ES cluster node roles
Master node

  • Responsible for cluster-level operations and for managing changes to the cluster.
  • The cluster state is maintained by the master node. When the master node receives an update from a data node, it broadcasts the update to the other nodes so that the cluster state on every node stays up to date.
node.master: true
node.data: false

Data node

  • Responsible for storing data and performing data-related operations: CRUD, search, aggregation, and so on.
  • Data nodes have relatively high CPU, memory, and I/O requirements.
  • In general, the data write path interacts only with data nodes and does not involve the master node.
node.master: false
node.data: true
node.ingest: false

Ingest (preprocessing) node

  • Allows documents to be preprocessed before indexing, that is, before the data is written. A Pipeline is defined in advance from a number of Processors, which transform and enrich the data.
  • By default, ingest is enabled on all nodes.
node.master: false
node.data: false
node.ingest: true
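
The pipeline idea can be sketched as an ordered list of processor functions applied to a document before it is indexed. This is only an illustration of the concept; the Python helpers below are hypothetical and are not the ES ingest API, although their names loosely mirror the kind of lowercase and set processors a pipeline might chain.

```python
def lowercase(field):
    """Build a processor that lowercases one string field if present."""
    def processor(doc):
        doc = dict(doc)  # work on a copy, leave the input untouched
        if field in doc:
            doc[field] = doc[field].lower()
        return doc
    return processor


def set_field(field, value):
    """Build a processor that adds or overwrites a field."""
    def processor(doc):
        doc = dict(doc)
        doc[field] = value
        return doc
    return processor


def run_pipeline(processors, doc):
    """Apply each processor in order and return the transformed document."""
    for processor in processors:
        doc = processor(doc)
    return doc


pipeline = [lowercase("level"), set_field("env", "test")]
raw = {"level": "ERROR", "msg": "disk full"}
enriched = run_pipeline(pipeline, raw)
```

The order of the processors matters, since each one sees the output of the previous one.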

Coordinating node

  • The coordinating node forwards a request to the data nodes that hold the data.
  • Each data node executes the request locally and returns its result to the coordinating node. After collecting them, the coordinating node merges the per-node results into a single global result.
  • Collecting and sorting results can take a lot of CPU and memory, so dedicated coordinating nodes relieve the pressure on other nodes.
node.master: false
node.data: false
node.ingest: false
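
The scatter-gather step performed by the coordinating node can be sketched as follows. It assumes each data node returns its hits as (score, doc_id) pairs and that the global result is simply the k highest scores; the function names and sample scores are made up for illustration.

```python
import heapq
import itertools


def local_top_k(hits, k):
    """What one data node returns: its k best (score, doc_id) pairs,
    sorted from highest to lowest score."""
    return heapq.nlargest(k, hits)


def coordinate(per_node_hits, k):
    """Merge the per-node results into one global top-k list."""
    partial_results = [local_top_k(hits, k) for hits in per_node_hits]
    # Each partial list is already sorted descending, so a k-way merge
    # yields hits in global order without re-sorting everything.
    merged = heapq.merge(*partial_results, reverse=True)
    return list(itertools.islice(merged, k))


node1 = [(0.9, "a"), (0.2, "b")]
node2 = [(0.7, "c"), (0.5, "d")]
node3 = [(0.8, "e")]
top3 = coordinate([node1, node2, node3], 3)
```

Each node only needs to send k hits, but the coordinating node still has to hold and merge k hits per node, which is where its CPU and memory cost comes from.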

Tribe node

  • A tribe node can act as a federated client across multiple clusters.
  • In essence it is a smart load balancer that provides request routing.
  • It has since been deprecated; its role is now covered by cross-cluster search, in which a coordinating node handles the request.
node.master: false
node.data: false
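
The role settings shown in this section can be summarized with a small sketch that derives a node's roles from its settings. It assumes the legacy node.master, node.data, and node.ingest settings used above and assumes that each of them defaults to true when omitted; node_roles is a hypothetical helper, not part of ES.

```python
def node_roles(settings):
    """Derive role names from legacy-style node settings.

    Assumption: a setting that is not given defaults to True.
    """
    flags = (
        ("master-eligible", settings.get("node.master", True)),
        ("data", settings.get("node.data", True)),
        ("ingest", settings.get("node.ingest", True)),
    )
    roles = [name for name, enabled in flags if enabled]
    # A node with every role switched off only coordinates requests.
    return roles or ["coordinating-only"]


coordinating = node_roles(
    {"node.master": False, "node.data": False, "node.ingest": False}
)
default_node = node_roles({})
```

Under that default, a node configured with only node.master: true and node.data: false would still have ingest enabled unless node.ingest is set to false explicitly.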

Cluster health status

From the perspective of data integrity, cluster health is divided into three states (which also apply to a single index):

  • Green: all primary shards and replica shards are operating normally.
  • Yellow: all primary shards are operating normally, but not all replica shards are. This means there is a risk of a single point of failure.
  • Red: at least one primary shard is not operating normally.
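
The three states follow directly from the shard states, which can be sketched as a small function. It assumes each shard copy is described by an (is_primary, is_active) pair; this representation and the function name are made up for illustration.

```python
def cluster_health(shards):
    """Return "green", "yellow", or "red" for a list of shard copies.

    Each entry is an (is_primary, is_active) pair.
    """
    if any(is_primary and not active for is_primary, active in shards):
        return "red"     # at least one primary is not working
    if any(not active for _, active in shards):
        return "yellow"  # primaries fine, some replica is not
    return "green"       # every primary and replica is working


all_ok = [(True, True), (False, True)]
replica_down = [(True, True), (False, False)]
primary_down = [(True, False), (False, True)]
```

Applied to a single index's shard copies instead of the whole cluster, the same rule gives the index-level status.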

Cluster expansion

When the cluster is expanded by adding a node, shards are distributed evenly across the nodes of the cluster so that indexing and search load are balanced. All of this happens automatically.
When a node in the cluster fails, ES handles the failure automatically:

  • When the master node fails, the cluster elects a new master node.
  • When a primary shard fails, a replica shard is promoted to primary.
The process of expanding a cluster:
1. With only one node: Node1 holds the three primary shards and there are no replica shards.
2. A second node is added, and the replica shards are assigned to Node2.
3. A third node is added, and the six shards of the index (three primaries, three replicas) are spread evenly across the three nodes of the cluster.
The allocation process ensures that a primary shard and its replica are never assigned to the same node, which avoids data loss from a single node failure.
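
The constraint that a primary and its replica never share a node can be sketched with a simple round-robin placement. This is not the allocator ES uses, and for two nodes it produces a different (though still valid) layout than the one described in step 2; it only illustrates the constraint and the even spread. The function name and labels are hypothetical.

```python
def allocate(nodes, num_primaries, num_replicas=1):
    """Place shard copies round-robin so that copies of the same shard
    always land on different nodes.

    Returns (assignment, unassigned). A copy stays unassigned when
    there are not enough nodes to keep it apart from its siblings.
    """
    assignment = {node: [] for node in nodes}
    unassigned = []
    node_count = len(nodes)
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            label = f"P{shard}" if copy == 0 else f"R{shard}"
            if copy < node_count:
                # Offsetting by the copy number keeps copies apart.
                target = nodes[(shard + copy) % node_count]
                assignment[target].append(label)
            else:
                unassigned.append(label)
    return assignment, unassigned


one_node, one_node_unassigned = allocate(["node1"], 3)
three_nodes, three_nodes_unassigned = allocate(["node1", "node2", "node3"], 3)
```

With one node the three replicas stay unassigned, matching step 1; with three nodes each node ends up with two shards and no node holds both copies of the same shard.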

Origin blog.csdn.net/weixin_40990818/article/details/104823076