Full-Text Search: An Introduction to Elasticsearch

Overview

Elasticsearch (ES) is an open source search engine built on Lucene. It is stable, reliable and fast, it scales well, and it is designed specifically for distributed environments.

Features

  • Easy to install: no other dependencies (apart from a Java runtime); it is ready to use once downloaded, and changing just a few parameters is enough to build a cluster
  • JSON: input and output are both JSON, so no schema has to be defined up front, which is quick and convenient
  • RESTful: all basic operations (indexing, querying, even configuration) can be carried out over the HTTP interface
  • Distributed: from the outside, all nodes appear as peers (any node can serve as the entry point); when nodes are added, the load is rebalanced automatically
  • Multi-tenant: indices can be split by purpose, and several indices can be operated on at the same time

Clusters

A node is one ES process, and multiple nodes form a cluster. Usually each node runs on its own operating system instance; once the cluster parameters are configured, ES forms the cluster automatically (the node discovery mode can also be configured). Inside the cluster, ES elects a master node using its election algorithm (the current version, 1.2, is subject to the split-brain problem). From the outside, the cluster can be operated through any node, with no master/slave distinction: it appears peer-to-peer and decentralized, which is convenient for client programs, for example when reconnecting after a failure.
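Because any node can act as the entry point, a client can simply try the nodes in turn until one answers. The following is a minimal sketch of that idea, not how any official ES client works: the node URLs are hypothetical (9200 is the default ES HTTP port) and `fake_probe` is a stand-in for a real health check.

```python
from typing import Callable, Iterable, Optional


def first_reachable(nodes: Iterable[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first node URL for which probe() succeeds, or None."""
    for node in nodes:
        try:
            if probe(node):
                return node
        except OSError:
            # Treat connection errors as "node down" and try the next one.
            continue
    return None


# Hypothetical node addresses, used only for illustration.
NODES = ["http://es-58:9200", "http://es-59:9200", "http://es-60:9200"]


def fake_probe(url: str) -> bool:
    """Stand-in probe: pretend that es-58 refuses connections."""
    if "es-58" in url:
        raise ConnectionRefusedError(url)
    return True


print(first_reachable(NODES, fake_probe))  # http://es-59:9200
```

In a real client the probe would issue an HTTP request to the node; it is injected here so that the selection logic can be exercised without a running cluster.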

Index

"Index" has two meanings:

  • As a verb, it is the process of saving a document into ES; once a document has been indexed, we can find it by searching ES
  • As a noun, it is the place where documents are saved, equivalent to the concept of a "database" in a relational database
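To make the verb sense concrete, here is a small sketch that builds (but does not send) the HTTP request for indexing one document. The base URL, index name, type name and document are all hypothetical; the `/{index}/{type}/{id}` path layout matches the ES 1.x versions this article describes.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build a PUT request that would index `document` under index/type/id."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_index_request(
    "http://localhost:9200", "index1", "article", "1", {"title": "Hello ES"}
)
print(req.get_method(), req.full_url)
# PUT http://localhost:9200/index1/article/1

# Sending it requires a running node: urllib.request.urlopen(req)
```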

To make this easier to understand, we can map ES concepts to the relational database concepts we already know:

    ES         Index         Type         Document
    DB         Database      Table        Row

Shards

ES is a distributed system, so we should use it as a cluster from the start. When it stores an index, ES selects a suitable "primary shard" and stores the index data there (a shard can be thought of as a physical storage area). The number of primary shards is fixed: it has to be decided up front (the default is 5) and cannot be changed later.

Where there are primary shards there are also "secondary" shards, which ES calls "replica shards". Replica shards serve two main purposes:

  • High availability: if the node holding a shard goes down, the replica on another node takes over; once the node recovers, the data of that shard can be restored from the other nodes
  • Load balancing: ES automatically routes searches according to load, and replica shards share that load
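The placement idea behind both points can be sketched with a toy model: each primary shard lives on one node and its replica on a different node. This is only an illustration under simple round-robin assumptions, not the allocation algorithm ES actually uses, and the node names are hypothetical.

```python
def allocate(nodes, num_primaries):
    """Toy placement: primary i on node i % n, its replica on the next node."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate primary and replica")
    layout = {}
    for shard in range(num_primaries):
        primary = nodes[shard % len(nodes)]
        replica = nodes[(shard + 1) % len(nodes)]
        layout[shard] = {"primary": primary, "replica": replica}
    return layout


layout = allocate(["es-58", "es-59", "es-60"], 5)
for shard, placement in layout.items():
    print(shard, placement)
```

Because the replica is always placed on the node after the primary's node, losing any single node never removes both copies of a shard.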

An example

An example that sums up the above (read it together with the figure below):

  • 3 ES nodes (es-58/59/60) form a cluster
  • The cluster was set up with the default number of primary shards, 5: shard0 to shard4
  • Two indices, index1 and index2, have been added to this cluster
  • Both indices have had two documents "indexed" (saved) into them
  • The documents indexed into index1 were automatically stored by ES in shard 2; the primary shard is on node es-58 and the replica shard on node es-59
  • The documents indexed into index2 were automatically stored by ES in shard 2; the primary shard is on node es-59 and the replica shard on node es-58

[Figure: shards]

(The figure was obtained through one of the ES RESTful interfaces; the commonly used interfaces are introduced later.)

Multi-tenant

Put simply, multi-tenancy in ES means using its multi-index mechanism to serve several businesses at once, with each business using its own index (for a detailed definition of multi-tenancy, see here). As mentioned earlier, an index can be understood as a database in a relational database system, so multiple indices can be understood as multiple databases created in one database system for different businesses.

In practice, we can isolate data by giving each tenant its own index, and each index can have its own configuration parameters (so it can be tuned for a particular tenant). This is very useful in a typical multi-tenant scenario: for example, if one of our multi-tenant applications needs to provide search, it can index data per tenant in ES, so that each tenant searches only the content of its own index.
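An index-per-tenant layout can be sketched as a pair of helpers that map a tenant to its index and to the search path for that index. The `tenant-` prefix and the slug rules are a hypothetical naming scheme chosen for this example; the only ES fact relied on is that index names must be lowercase.

```python
import re


def tenant_index(tenant_id: str, prefix: str = "tenant-") -> str:
    """Derive a lowercase index name for a tenant (hypothetical naming scheme)."""
    slug = re.sub(r"[^a-z0-9_-]+", "-", tenant_id.lower()).strip("-")
    if not slug:
        raise ValueError("tenant id has no usable characters")
    return prefix + slug


def tenant_search_path(tenant_id: str) -> str:
    """Search path restricted to one tenant's index."""
    return f"/{tenant_index(tenant_id)}/_search"


print(tenant_search_path("Acme Corp"))  # /tenant-acme-corp/_search
```

Since every search path is derived from the tenant id, a request issued on behalf of one tenant only touches that tenant's index.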

RESTful

This feature is very convenient. Most importantly, the ES HTTP interface covers not only business operations (indexing / searching) but also configuration, and it can even shut down the ES cluster. Here are a few commonly used interfaces:

  • /_cat/nodes?v: check the status of the cluster nodes
  • /_cat/shards?v: view shard status
  • /{index}/{type}/_search: search

v stands for verbose and makes the output more readable (with a header row and aligned columns). _cat groups the monitoring-related APIs; /_cat?help lists all of them. {index} and {type} are the name of a specific index and of a specific type, and they are hierarchical. We can also search all types in all indices directly: /_search.
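The paths above can be assembled with two small helpers. This is only a sketch of the URL layout described in the text (ES 1.x style, with a type segment); the index and type names in the usage lines are hypothetical.

```python
def cat_path(name, verbose=True):
    """Path of a _cat monitoring API, e.g. /_cat/nodes?v."""
    return f"/_cat/{name}" + ("?v" if verbose else "")


def search_path(index=None, doc_type=None):
    """Search path for one type, one index, or everything."""
    if doc_type and not index:
        # Simplification for this sketch: a type is always given with its index.
        raise ValueError("a type can only be given together with an index")
    parts = [p for p in (index, doc_type) if p]
    return "/" + "/".join(parts + ["_search"])


print(cat_path("nodes"))                  # /_cat/nodes?v
print(cat_path("shards"))                 # /_cat/shards?v
print(search_path("index1", "article"))   # /index1/article/_search
print(search_path())                      # /_search
```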

Official Glossary

Finally, here is part of the official glossary, to consolidate the concepts above:

analysis

Analysis is the process of converting full text into terms. Depending on the analyzer used, the phrases FOO BAR, Foo-Bar and foo,bar will probably all be broken into the terms foo and bar. These terms are what is actually stored in the index. A full-text query (not a term query) for FoO:bAR will also be analyzed into the terms foo and bar, and will therefore match the terms stored in the index. It is this analysis process (at both index time and search time) that allows ES to perform full-text queries.
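The effect described above can be imitated with a toy analyzer that lowercases the text and splits it on anything that is not a letter or digit. This is a deliberately simplified stand-in, not one of the analyzers shipped with ES.

```python
import re


def analyze(text):
    """Toy analyzer: lowercase, then split on runs of non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


for phrase in ["FOO BAR", "Foo-Bar", "foo,bar", "FoO:bAR"]:
    print(phrase, "->", analyze(phrase))
# Every phrase yields ['foo', 'bar']
```

Because the same function is applied to the stored text and to the query text, the query FoO:bAR produces exactly the terms that were stored for FOO BAR, which is why the match succeeds.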

cluster

A cluster consists of one or more nodes that share the same cluster name. Each cluster automatically elects a single master node; if the master node fails, the cluster automatically elects a new master node to replace it.

document

A document is a piece of JSON text stored in ES; it can be thought of as a row in a relational database table. Each document is stored in an index and has a type and an id. A document is a JSON object (called a hash / hashmap / associative array in some languages) containing zero or more fields (key-value pairs). The original JSON text that was indexed is stored in the _source field, which is returned by default when a document is fetched or found by a search.

id

The id identifies a document; the index/type/id combination of a document must be unique. If no id is specified, one is generated automatically.

field

A document contains a number of fields, that is, key-value pairs. The value of a field can be a simple (scalar) value (e.g. a string, integer or date) or a nested structure such as an array or an object. A field is similar to a column of a table in a relational database. The mapping of each field has a field type (not to be confused with the document type), which describes the kind of value that can be stored in the field, such as integer, string or object. The mapping also lets us define how the value of a field is analyzed.

index

An index is similar to a database in a relational database. It has a mapping that can define multiple types. An index is a logical namespace that corresponds to one or more primary shards and can have zero or more replica shards.

mapping

A mapping is similar to a schema definition in a relational database. Each index has a mapping, which defines each type in the index as well as a number of index-wide settings. A mapping can be defined explicitly, or it is created automatically when a document is indexed.

node

A node is a running instance of ES that belongs to a cluster. For testing, several nodes can be started on the same server; in production there is generally one node per server. At startup, a node uses unicast (or multicast) to discover a cluster whose name matches the cluster name in its own configuration, and tries to join that cluster.

primary shard

Each document is stored in a single primary shard. When we index a document, it is indexed first on the primary shard and then on every replica of that primary shard. By default, an index has five primary shards. We can specify fewer or more primary shards to scale the number of documents the index can handle. Note that once the index has been created, the number of primary shards cannot be changed.
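Since the number of primary shards has to be chosen when the index is created, it is normally passed in the body of the index-creation request (a PUT to /{index}). The sketch below only builds that JSON body; `number_of_shards` and `number_of_replicas` are the ES setting names, while the helper itself and the values used are illustrative.

```python
import json


def index_settings(primaries=5, replicas=1):
    """Settings body for creating an index with the given shard counts."""
    if primaries < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("the replica count cannot be negative")
    return {
        "settings": {
            "number_of_shards": primaries,    # fixed once the index exists
            "number_of_replicas": replicas,   # can be changed later
        }
    }


body = json.dumps(index_settings(3, 1))
print(body)
# {"settings": {"number_of_shards": 3, "number_of_replicas": 1}}
```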

replica shard

Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and it serves two main purposes:

  1. Failover: when the primary shard fails, a replica shard is promoted to primary shard
  2. Better performance: get and search requests can be handled by either the primary shard or a replica shard. By default, each primary shard has one replica, and the number of replicas can be adjusted dynamically. A replica shard is never started on the same node as its primary shard
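The failover behaviour in point 1 can be sketched with a toy model in which a node failure removes that node's shard copies and promotes a surviving replica where the primary was lost. This is an illustration only, not ES's recovery logic; the layout structure and node names are hypothetical.

```python
def fail_node(layout, failed):
    """Return a new layout after `failed` goes down, promoting replicas."""
    result = {}
    for shard, copies in layout.items():
        primary = copies["primary"]
        # Replicas hosted on the failed node are lost.
        replicas = [node for node in copies["replicas"] if node != failed]
        if primary == failed:
            # Promote the first surviving replica, if there is one.
            primary = replicas.pop(0) if replicas else None
        result[shard] = {"primary": primary, "replicas": replicas}
    return result


layout = {
    0: {"primary": "es-58", "replicas": ["es-59"]},
    1: {"primary": "es-59", "replicas": ["es-60"]},
}
after = fail_node(layout, "es-58")
print(after[0])  # {'primary': 'es-59', 'replicas': []}
print(after[1])  # {'primary': 'es-59', 'replicas': ['es-60']}
```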

routing

When we index a document, it is stored on a single primary shard, and that shard is chosen by hashing the routing value. By default, the routing value is derived from the document id or, if the document specifies a parent document, from the id of the parent document (to ensure that child and parent documents are stored on the same shard). This value can be overridden by specifying a routing value at index time, or through a routing field in the mapping.
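The selection rule amounts to "hash the routing value, then take it modulo the number of primary shards". In the sketch below, `zlib.crc32` is only a stand-in hash chosen because it is deterministic and in the standard library; ES uses its own hash function, so the shard numbers computed here will not match a real cluster.

```python
import zlib


def routing_value(doc_id, parent_id=None, routing=None):
    """Explicit routing wins, then the parent id, then the document id."""
    if routing is not None:
        return routing
    return parent_id if parent_id is not None else doc_id


def pick_shard(routing, num_primaries):
    """Stand-in for shard selection: hash(routing) mod number of primaries."""
    return zlib.crc32(routing.encode("utf-8")) % num_primaries


parent_shard = pick_shard(routing_value("parent-1"), 5)
child_shard = pick_shard(routing_value("child-7", parent_id="parent-1"), 5)
print(parent_shard == child_shard)  # True
```

The modulo also shows why the number of primary shards cannot change after the index is created: with a different divisor, existing documents would no longer map to the shard that holds them.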

shard

A shard is a single Lucene instance. It is a low-level "unit of work" managed by ES. An index is a logical namespace that points to primary shards and replica shards. Apart from specifying the number of primary and replica shards an index should have, application code only needs to deal with the index and never has to interact with shards directly. Elasticsearch distributes shards across all nodes in the cluster, and moves them automatically when a node fails or a new node is added.

source field

By default, the JSON text that was indexed is stored in the _source field and is returned by get and search requests. This lets us access the source data directly in the returned results, without having to send another request based on the id. Note: the JSON string that was indexed is returned exactly as it was, regardless of whether it is valid JSON. The contents of this field say nothing about how the data was indexed.
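Reading the source documents out of a search response can be sketched as follows. The sample response is hand-written for this example and abbreviated; it follows the usual shape of an ES search response (hits.hits[]._source), and the index, type and field names are hypothetical.

```python
import json

# Hand-written, abbreviated sample of a search response.
SAMPLE_RESPONSE = """
{"took": 2, "timed_out": false,
 "hits": {"total": 1, "max_score": 1.0,
          "hits": [{"_index": "index1", "_type": "article", "_id": "1",
                    "_score": 1.0, "_source": {"title": "Hello ES"}}]}}
"""


def sources(response_text):
    """Return the _source object of every hit in a search response."""
    response = json.loads(response_text)
    return [hit["_source"] for hit in response["hits"]["hits"] if "_source" in hit]


print(sources(SAMPLE_RESPONSE))  # [{'title': 'Hello ES'}]
```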

term

A term is an exact value that is indexed in ES. The terms foo, Foo and FOO are not the same. Terms can be searched for with term queries.
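The difference between searching for an exact term and searching full text shows up in the request body. The sketch below builds both bodies using the `term` and `match` queries of the ES query DSL; the field name `title` and the helper functions are hypothetical.

```python
import json


def term_query(field, value):
    """Exact-value lookup: the value is not analyzed."""
    return {"query": {"term": {field: value}}}


def match_query(field, text):
    """Full-text lookup: the text is analyzed before matching."""
    return {"query": {"match": {field: text}}}


print(json.dumps(term_query("title", "foo")))
# {"query": {"term": {"title": "foo"}}}
print(json.dumps(match_query("title", "FoO bAR")))
# {"query": {"match": {"title": "FoO bAR"}}}
```

With a term query, foo and Foo are different values, so the caller has to supply the term exactly as it was stored in the index.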

text

Text (or full text) is ordinary unstructured text, such as this paragraph. By default, text is analyzed into terms, and those terms are what is stored in the index. For full-text search to work, text fields are analyzed into terms at index time, and the keywords of a query are analyzed into terms at search time; the search is carried out by checking whether the two sets of terms match.

type

A type is similar to a table in a relational database. Each type has a list of fields that can be specified for documents of that type. The mapping defines how each field of the document is analyzed.

Reference

   https://88250.b3log.org/full-text-search-elasticsearch

Origin blog.csdn.net/u013380694/article/details/93973912