Elasticsearch (ES) Index

Elasticsearch (ES) is an open-source, distributed search engine built on top of Apache Lucene and exposed through a RESTful web interface.
ES is also a distributed document database in which every field can be indexed and searched. It can scale out to hundreds of servers and handle petabytes of data.
It can store, search, and analyze large amounts of data in a very short time, and it usually serves as the core engine in complex search scenarios.
ES is designed from the start to be highly available and scalable. A system can be expanded by upgrading its hardware, which is called vertical scaling (Vertical Scale / Scaling Up).
Alternatively, it can be expanded by adding more servers, which is called horizontal scaling (Horizontal Scale / Scaling Out). Although ES can take advantage of more powerful hardware, vertical scaling has its limits. True scalability comes from horizontal scaling: adding more nodes to the cluster shares the load and increases reliability. ES is distributed by nature; it knows how to manage multiple nodes to provide scaling and high availability, which means applications do not need to change.
 
Gateway: the persistent storage for ES indexes. By default, ES first keeps the index in memory and, when memory is full, persists it to the Gateway. When the ES cluster is shut down or restarted, it reads the index data back from the Gateway. Supported gateways include the local file system, HDFS, and Amazon S3.
Distributed Lucene Directory: the directory made up of a series of Lucene index files. It is responsible for managing these index files, including reading and writing data and adding to and merging indexes.
River: represents a data source. It exists as an ES plug-in.
Mapping: a mapping is very similar to a data type in a statically typed language. For example, if we declare a variable of type int, that variable can afterwards only store int data. Likewise, if we declare a mapping field of type double, it can only store double data.
A mapping does more than tell ES which type each field is. It also tells ES how the data is indexed and whether the data is indexed at all.
Search Module: the search module, which supports common search operations.
Index Module: the index module, which supports common indexing operations.
Discovery: mainly responsible for master node discovery in the cluster. For example, when a node suddenly leaves or joins, shards are reallocated. This relies on a discovery mechanism.
The default discovery implementation is Zen, which supports unicast and multicast as well as point-to-point. Another form is the EC2 plug-in.
Scripting: scripting language support. Many languages are supported, such as mvel, js, and python; they are not covered in detail here.
Transport: represents how ES nodes interact internally and how clients interact with the cluster. Supported protocols include Thrift, Memcached, and HTTP.
RESTful Style API: the API is accessed programmatically in a RESTful way.
3rd plugins: third-party plug-ins.
Java (Netty): the development framework.
JMX: used for monitoring.
Use cases
1, ES as the primary back-end system of a site
For example, if you are setting up a blog system, the post data can be stored directly in ES, and ES can be used for retrieval and statistics. ES provides persistent storage, statistics, and many other data store features.
Note: like other NoSQL data stores, ES does not support transactions. If you need transactions, consider using another database as the primary store.
 
2, Adding ES to an existing system
Sometimes ES does not need to provide all data storage functions; you only want to use ES on top of an existing data store. For example, if a complex system is already running and you now want to add search to it, you can use this approach.
 
3, ES as the back end of an existing solution
Because ES is open source and provides a direct HTTP interface, a large ecosystem now supports it. For example, if you want to deploy a large-scale logging framework to store, search, and analyze massive numbers of events, existing tools can already write to and read from ES, so you may not need any development work; you only need to configure these tools.
 
Design structure
1, Logical design
Document
A document is the basic unit of information that can be indexed. It has several important properties:
It is self-contained. A document contains both fields and their values.
It can be hierarchical. A document may contain other documents: the value of a field can be simple, such as a string for the location field, or it can contain other fields and values, such as a location that holds both a city and a street address.
It has a flexible structure. A document does not rely on a predefined schema. Not all documents are required to have the same fields, and they are not restricted to a single schema.
{
  "name":"meeting",
  "location":"office",
  "organizer":"yanping"
}
{
  "name":"meeting",
  "location":{
    "name":"sheshouzuo",
    "date":"2019-6-28"
  },
  "members":["leio","shiyi"]
}
Type
A type is a logical container for documents, similar to how a table is a container for rows. Documents with different structures are best placed in different types.
Field
In ES, every document is stored as JSON. A document can be viewed as a set of fields.
Mapping
The definition of the fields in each type is called a mapping. For example, the name field is mapped to a String.
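As a rough illustration of what a mapping records, the sketch below derives a field-to-type table from a single document. It is only a toy: the function names infer_field_type and infer_mapping are hypothetical, and the type labels (string, long, double, and so on) are simplified stand-ins rather than the exact names ES assigns.

```python
def infer_field_type(value):
    """Return a simplified type label for one JSON value (illustrative only)."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Assumption for this sketch: a list takes the type of its first element.
        return infer_field_type(value[0]) if value else "unknown"
    raise TypeError(f"unsupported value: {value!r}")


def infer_mapping(document):
    """Build a field -> type table for the top-level fields of a document."""
    return {field: infer_field_type(value) for field, value in document.items()}


doc = {"name": "meeting", "location": "office", "organizer": "yanping"}
mapping = infer_mapping(doc)
print(mapping)
```

Once a field has been given a type in this table, later documents are expected to supply values of the same type for that field, which is the sense in which a mapping behaves like a declared variable type.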
Index
An index is a container for mapping types. An ES index is much like a database in the relational world: it is an independent collection of a large number of documents.
 
Structural comparison of a relational database and ES
 
 
2, Physical design
Node
A node is an instance of ES. After ES is started on a server, you have a node; if ES is started on another server, that is another node. You can even have multiple nodes on a single server by starting multiple ES processes. Multiple nodes can join the same cluster.
When an Elasticsearch node starts, it uses multicast (or unicast, if the user has changed the configuration) to look for other nodes in the cluster and connect to them. This process is shown below:
 
There are three main types of nodes. The first is client_node, which mainly distributes requests, similar to a router. The second is master_node, the master node: all add and delete operations and data sharding are carried out by the master node (Elasticsearch has no in-place update at the lowest level; the update exposed to users is actually a delete followed by a new insert). It can of course also serve search operations. The third is data_node; this type of node can only perform search operations. Which data_node a request is assigned to is decided by the client_node, and the data on a data_node is synchronized from the master_node.
Shards
An index may store an amount of data that exceeds the hardware limits of a single node. For example, an index of one billion documents occupying 1 TB of disk space may not fit on the disk of any single node, or a single node may be too slow to serve search requests against it alone.
 
To solve this problem, ES provides the ability to split an index into multiple pieces called shards. When you create an index, you can specify the number of shards you want. Each shard is itself a fully functional and independent "index" that can be placed on any node in the cluster.
Sharding is important for two reasons:
 
1, It allows you to horizontally split / scale your content volume.
2, It allows you to distribute and parallelize operations across shards (potentially on multiple nodes), thereby improving performance / throughput.
How a shard is distributed, and how its documents are aggregated back into search responses, is entirely managed by ES and is transparent to you as a user.
 
In a network / cloud environment, failure can happen at any time: a shard or node may go offline or disappear for whatever reason. In that case a failover mechanism is very useful and highly recommended. For this purpose, ES allows you to create one or more copies of a shard; these copies are called replica shards, or simply replicas.
Replication is important for two main reasons:
(1) It provides high availability in case a shard / node fails. For this reason, it is important to note that a replica shard is never placed on the same node as the original / primary shard it was copied from.
(2) It allows you to scale your search volume / throughput, because searches can run on all replicas in parallel.
In summary, each index can be split into multiple shards. An index can also be replicated zero times (meaning no replicas) or more. Once replicated, each index has primary shards (the original shards that were copied from) and replica shards (copies of the primary shards). The number of shards and replicas can be specified when the index is created. After the index is created, you can change the number of replicas dynamically at any time, but you cannot change the number of shards.
 
By default, each index in ES is allocated five primary shards and one replica, which means that if your cluster has at least two nodes, the index will have five primary shards and five replica shards (one complete replica), for a total of 10 shards per index. The shards of an index can all be stored on one host in the cluster or spread over several hosts, depending on the number of machines in your cluster. The exact placement of primary and replica shards is decided by ES's internal policy.
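The arithmetic above, and the reason the primary shard count is fixed at creation time, can be sketched in a few lines. The sketch assumes the commonly described routing rule, in which a document's shard is a hash of its routing key modulo the number of primary shards. The helper names total_shards and pick_shard are hypothetical, and CRC32 is only a stand-in for whatever hash function ES uses internally.

```python
import zlib


def total_shards(primary_shards, replicas):
    """Each replica is a full copy of every primary shard."""
    return primary_shards * (1 + replicas)


def pick_shard(routing_key, primary_shards):
    """Choose a primary shard for a document from its routing key.

    CRC32 is a stand-in hash for illustration, not the hash ES uses.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % primary_shards


print(total_shards(5, 1))  # 10

keys = ["doc-1", "doc-2", "doc-3", "doc-4"]
with_5 = [pick_shard(k, 5) for k in keys]
with_6 = [pick_shard(k, 6) for k in keys]
print(with_5)
print(with_6)
```

Because the shard number depends on the modulus, changing the number of primary shards would generally send existing documents to different shards, so previously indexed data could no longer be found where the routing rule expects it. Changing the number of replicas does not affect the modulus, which is consistent with replicas being adjustable at any time.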
3, The HEAD plug-in
elasticsearch-head is a graphical tool for operating and managing the cluster.
 
● node: a running instance of Elasticsearch. It discovers and joins a cluster using multicast or unicast.
● cluster: one or more nodes that share the same cluster name, one of which is the master node.
● index: analogous to a DB in a relational database; it is a logical namespace.
● alias: you can add zero or more aliases to an index. Accessing an index through an alias is the same as accessing it by its index name, but an alias also gives us the ability to switch indexes. For example, after rebuilding an index under the name customer_online_v2, to access the new index I only need to add the alias to the new index and remove it from the old one. No code changes are required.
● type: analogous to a Table in a relational database. One index can define multiple types, but in general only one type is used.
● mapping: analogous to the schema concept in a relational database; a mapping defines the field types of a type in an index. A mapping can be defined explicitly, or it can be generated automatically when a document is indexed: if there is a new field, Elasticsearch infers its type and adds it to the type's mapping.
● document: analogous to a row (record) in a relational database. A document is a JSON object in Elasticsearch and contains zero or more fields.
● field: analogous to a field in a relational database; each field has its own field type.
● shard: a shard is a Lucene instance. Elasticsearch is built on Lucene, and each shard is one Lucene instance that Elasticsearch manages automatically. As mentioned earlier, an index is a logical namespace, whereas a shard is a concrete physical concept: indexing and querying are actually carried out on specific shards. Shards are divided into primary shards and replica shards. When data is written, it is first written to the primary shard and then synchronized to the replica shards; for queries, primaries and replicas play the same role. There can be several replica shards or none. A replica shard has two purposes: first, disaster recovery, so that if the primary shard goes down, no data is lost and the cluster keeps working normally; second, performance, because both replica and primary shards can serve queries. The shard number and replica number can both be configured (as shown in the red box on the right of the figure above), but the shard number can only be set when the index is created and cannot be changed afterwards, whereas the replica number can be changed at any time. Because Elasticsearch encapsulates this part so well, when using Elasticsearch we generally only need to care about the index, not about shards.
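The alias switch described in the list above can be expressed as a single request body for the _aliases endpoint, which accepts a list of add and remove actions. The sketch below only builds and prints that body; it does not contact a cluster. The alias name customer_online and the old index name customer_online_v1 are assumptions made for illustration (only customer_online_v2 appears in the text), and build_alias_swap is a hypothetical helper.

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Build an _aliases request body that moves an alias to a new index."""
    return {
        "actions": [
            # Detach the alias from the index that is being retired.
            {"remove": {"index": old_index, "alias": alias}},
            # Attach the same alias to the rebuilt index.
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = build_alias_swap("customer_online", "customer_online_v1", "customer_online_v2")
print(json.dumps(body, indent=2))
```

Sending both actions in one request means clients that query through the alias never see a moment in which it points to no index, which is why application code does not need to change.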
 
Shard, node, and cluster make up the physical side of an Elasticsearch deployment, while field, type, and index make up the logical side. When using Elasticsearch we generally only need to care about the logical concepts, just as with MySQL we generally care about the DB name, tables, and schema, and not about how the DBA deploys and maintains multiple MySQL instances, masters and slaves, and so on.
How ES indexing works
(1) Traditional relational databases
A binary search tree has a lookup cost of O(log N), and inserting a new node does not require moving all existing nodes, so storing an index in a tree structure balances insert and query performance. On this basis, and taking into account the characteristics of disk reads (sequential read / random read), traditional relational databases use B-Tree / B+Tree data structures for their indexes.
(2) ES
ES uses an inverted index.
So what is an inverted index?
 
First, let us clarify some concepts with an example:
Suppose there is a user index with four fields: name, gender, age, and address. Drawn out, it looks roughly like this, the same as in a relational database:
 
Term (word): after a piece of text is processed by an analyzer, it is output as a sequence of words; each of these is called a Term.
Term Dictionary (word dictionary): as the name suggests, it holds the Terms and can be understood as the collection of all Terms.
Term Index (word index): an index built over the Terms so that a Term can be found quickly.
Posting List (inverted list): an inverted list records the list of all documents in which a word appears and the positions at which the word appears in those documents. Each record is called a Posting (inverted item). From the inverted list you can tell which documents contain a given word. (PS: a real posting list stores more than just document IDs; it also holds other information such as term frequency (the number of times the Term appears) and offsets. Each entry can be thought of as a tuple in Python or an object in Java.)
(PS: by analogy with the Modern Chinese Dictionary, a Term corresponds to a word, the Term Dictionary corresponds to the dictionary itself, and the Term Index corresponds to the dictionary's table of contents.)
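To make the definitions above concrete, here is a minimal sketch of a posting list whose entries carry more than a document ID. It assumes a toy analyzer that lowercases and splits on whitespace, and it uses token positions as a stand-in for offsets. The functions analyze and build_postings and the two sample texts are hypothetical.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_postings(docs):
    """Map each term to a list of (doc_id, term_frequency, positions) tuples."""
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(analyze(text)):
            positions[term][doc_id].append(pos)
    return {
        term: [(doc_id, len(p), p) for doc_id, p in sorted(by_doc.items())]
        for term, by_doc in positions.items()
    }


docs = {1: "quick brown fox", 2: "quick quick dog"}
postings = build_postings(docs)
term_dictionary = sorted(postings)  # the collection of all Terms

print(term_dictionary)    # ['brown', 'dog', 'fox', 'quick']
print(postings["quick"])  # [(1, 1, [0]), (2, 2, [0, 1])]
```

Each tuple plays the role of one Posting: the document ID says where the Term occurs, the frequency says how often, and the positions say where inside the document.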
We know that every document has an ID; if none is specified at insert time, Elasticsearch generates one automatically, so the ID field is not discussed further.
For the example above, Elasticsearch builds the following indexes:
name field:
 
age field:
 
gender field:
 
address field:
 
Elasticsearch builds a separate inverted index for each field. In the example above, "Joe Smith", "Beijing", and 22 are all Terms, and [1,3] is a Posting List. A posting list is an array that stores the IDs of all documents that match a given Term.
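A per-field inverted index of this kind can be sketched as a dictionary of dictionaries. The user records below are made-up illustration data, chosen so that some Terms have the Posting List [1, 3]; the function build_field_indexes is hypothetical, and each field value is treated as a single Term without any text analysis.

```python
from collections import defaultdict


def build_field_indexes(docs):
    """Build one inverted index per field: field -> term -> sorted doc IDs."""
    indexes = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for field, value in docs[doc_id].items():
            # The whole field value is used as one Term in this sketch.
            indexes[field][value].append(doc_id)
    return indexes


# Hypothetical user index with the four fields named in the text.
users = {
    1: {"name": "Joe Smith", "gender": "male", "age": 22, "address": "Beijing"},
    2: {"name": "Ann Lee", "gender": "female", "age": 25, "address": "Shanghai"},
    3: {"name": "Tom Wang", "gender": "male", "age": 22, "address": "Beijing"},
}

indexes = build_field_indexes(users)
print(indexes["age"][22])             # [1, 3]
print(indexes["address"]["Beijing"])  # [1, 3]
print(indexes["name"]["Joe Smith"])   # [1]
```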
Once you know the document ID, you can find the document quickly. But how do we quickly find the Term for a given keyword?
The answer, of course, is to build an index over the Terms, and the best-known kind of index is the B-Tree (MySQL being the classic example of B-Tree indexing).
The process of looking up a Term is roughly the same as looking up a record by ID in MyISAM.
In MyISAM, the index and the data are stored separately: the index gives the address of a record, and the address leads to the record itself.
In an inverted index, the Term Index gives the position of the Term in the Term Dictionary, which leads to the Posting List; with the inverted list, the documents can then be found by ID.
(PS: by analogy with MyISAM, the Term Index corresponds to the index file and the Term Dictionary corresponds to the data file.)
(PS: we split the lookup into three steps above, but the Term Index and Term Dictionary can be treated as a single step, finding the Term. The inverted index can therefore be understood as: find the inverted list for a word, then use the items in the inverted list to find the documents.)
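The three-step lookup described above (Term Index, then Term Dictionary, then Posting List) can be sketched as follows. This is a simplified model under stated assumptions: the Term Dictionary is a sorted list, and the Term Index is a sparse list holding the first Term of every block, searched with binary search. Lucene's real term index is more sophisticated; the class name InvertedIndex, the block_size parameter, and the sample terms are all hypothetical.

```python
import bisect


class InvertedIndex:
    def __init__(self, postings, block_size=2):
        # Term Dictionary: all Terms in sorted order.
        self.terms = sorted(postings)
        # Posting Lists, stored in the same order as the Terms.
        self.posting_lists = [postings[t] for t in self.terms]
        self.block_size = block_size
        # Term Index: the first Term of each block of the dictionary.
        self.block_heads = self.terms[::block_size]

    def lookup(self, term):
        # Step 1: use the Term Index to find the block that may hold the Term.
        block = bisect.bisect_right(self.block_heads, term) - 1
        if block < 0:
            return []
        start = block * self.block_size
        end = min(start + self.block_size, len(self.terms))
        # Step 2: scan that block of the Term Dictionary for the Term.
        for i in range(start, end):
            if self.terms[i] == term:
                # Step 3: return the Posting List (document IDs).
                return self.posting_lists[i]
        return []


postings = {
    "beijing": [1, 3],
    "guangzhou": [4],
    "shanghai": [2],
    "shenzhen": [5],
    "tianjin": [6],
}
index = InvertedIndex(postings)
print(index.lookup("beijing"))   # [1, 3]
print(index.lookup("shenzhen"))  # [5]
print(index.lookup("nanjing"))   # []
```

The point of the sparse Term Index is that it is much smaller than the Term Dictionary, so it can be searched cheaply before touching the larger structure, in the same way the MyISAM index file is consulted before the data file.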
For further understanding, the following two figures illustrate this process:
 https://blog.csdn.net/zhenwei1994/article/details/94013059
 

Origin www.cnblogs.com/bolang100/p/11819248.html