Getting started with ES distributed search

1 Mainstream distributed search engines

1.1 Lucene

Lucene official website: http://lucene.apache.org

Lucene is a set of open source libraries for full-text retrieval and search, supported and provided by the Apache Software Foundation. Lucene provides a simple but powerful application programming interface for full-text indexing and search. Lucene is currently the most popular free Java information retrieval library.

1.2 Solr

Solr official website: https://lucene.apache.org/solr/

Solr (pronounced "solar") is the open source enterprise search platform of the Apache Lucene project. Its main features include full-text search, hit highlighting [1], faceted search, dynamic clustering, database integration, and handling of rich documents (such as Word and PDF). Solr is highly scalable and provides distributed search and index replication. Solr is one of the most popular enterprise search engines [2], and Solr 4 added NoSQL support. [3]

Solr is a standalone full-text search server written in Java that runs in a Servlet container (such as Apache Tomcat or Jetty). Solr uses the Lucene Java search library at its core for full-text indexing and search, and exposes REST-like HTTP/XML and JSON APIs. Solr's powerful external configuration makes it possible to adapt it to many types of applications without writing Java code. Solr also has a plug-in architecture to support more advanced customization.

Since the Apache Lucene and Apache Solr projects merged in 2010, both have been developed by the same Apache Software Foundation team. When referring to the technology or products, Lucene/Solr and Solr/Lucene mean the same thing.

1.3 Elasticsearch

Elasticsearch official website: https://www.elastic.co/cn/elasticsearch/

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multi-tenant full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and released as open source software under the Apache license. Official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy, Ruby and many other languages. [5] According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine, followed by Apache Solr, which is also based on Lucene. [6]

Elasticsearch is developed alongside Logstash, a data collection and log parsing engine, and Kibana, an analytics and visualization platform. The three products are designed as an integrated solution called the "Elastic Stack" (previously the "ELK stack").

Elasticsearch can be used to search all kinds of documents. It provides scalable, near real-time search and supports multi-tenancy. [5] "Elasticsearch is distributed, which means that indices can be divided into shards, and each shard can have zero or more replicas. Each node hosts one or more shards and acts as a coordinator to delegate operations to the correct shards. Rebalancing and routing are done automatically." [5] Related data is often stored in the same index, which consists of one or more primary shards and zero or more replica shards. Once an index has been created, the number of primary shards cannot be changed. [7]
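Why the primary shard count is fixed becomes clearer from how a document is routed to a shard. The following is a minimal sketch, assuming the simplest form of the routing rule, shard = hash(routing key) % number of primary shards. Elasticsearch uses its own hash function internally; the CRC32 hash, the function name route_to_shard, and the document ids below are stand-ins chosen only for illustration.

```python
import zlib


def route_to_shard(routing_key: str, num_primary_shards: int) -> int:
    """Pick a primary shard for a document (illustrative stand-in hash)."""
    # CRC32 is used only because it is deterministic and in the standard library.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# Hypothetical document ids, used as routing keys.
doc_ids = ["movie-1", "movie-2", "movie-3", "movie-4"]

# Placement with 5 primary shards versus 6 primary shards.
with_five = {doc_id: route_to_shard(doc_id, 5) for doc_id in doc_ids}
with_six = {doc_id: route_to_shard(doc_id, 6) for doc_id in doc_ids}

# Documents whose shard would change if the shard count changed.
moved = [doc_id for doc_id in doc_ids if with_five[doc_id] != with_six[doc_id]]

print(with_five)
print(with_six)
print("would move:", moved)
```

Because the shard number depends on the modulus, changing the number of primary shards would send lookups for already indexed documents to different shards, which is consistent with the count being fixed at index creation.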

Elasticsearch uses Lucene and tries to expose all of its features through JSON and Java APIs. It supports faceting and percolating [8], which is useful for sending notifications when a new document matches a registered query.
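Percolation reverses the usual search direction: queries are stored first, and each incoming document is checked against them. The sketch below is a toy in-memory model of that idea, not the Elasticsearch percolator API. The function percolate, the query ids, and the "all terms must appear" matching rule are assumptions made for illustration.

```python
def percolate(registered_queries: dict[str, set[str]], document_text: str) -> list[str]:
    """Return the ids of registered queries that match the document."""
    doc_terms = set(document_text.lower().split())
    return sorted(
        query_id
        for query_id, required_terms in registered_queries.items()
        if required_terms <= doc_terms  # every required term is present
    )


# Hypothetical registered queries: each one is a set of terms that must all appear.
registered_queries = {
    "alert-search": {"search", "engine"},
    "alert-php": {"php"},
}

matches = percolate(registered_queries, "How the search engine was born")
print(matches)  # ['alert-search']
```

Each returned id identifies a subscriber that could be notified about the new document.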

Another feature is called the "gateway", which handles the long-term persistence of the index; for example, after a server crash, the index can be recovered from the gateway. [9] Elasticsearch supports real-time GET requests, which makes it suitable as a NoSQL data store [10], but it lacks distributed transactions. [11]

2 ES core terminology

2.1 Basic terminology

Index

An index holds a collection of documents with similar data structures, for example a movie index. An index contains many documents and represents one class of similar or identical documents.

In simple terms, an index is comparable to a database in a relational database system.

Type

Each index can have one or more types. A type is a logical category within an index. For example, a movie index could be divided into several types: science fiction, comedy, fantasy, and so on. The fields of the documents under each type may differ.

In simple terms, a type is comparable to a table in a relational database.

Document

A document is the basic unit of information. A document is equivalent to one record of data and can be indexed. Documents are expressed in JSON format.

A document is comparable to a row in a relational database.

Field

A document consists of multiple fields. Fields with the same name in documents of different types must have the same data type.

Mapping

A mapping is similar to the table structure definition (schema) in a relational database.
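To make the terms concrete, here is a small sketch of a document and a mapping written as Python dictionaries, the way they would appear as JSON. The "movies" example, the field names, and the conforms helper are hypothetical. The helper is deliberately stricter than Elasticsearch itself, which may coerce values or extend the mapping dynamically.

```python
# A hypothetical document for a "movies" index, as it would be expressed in JSON.
document = {"title": "Inception", "year": 2010, "genre": "science fiction"}

# A hypothetical mapping: field name -> declared field type.
mapping = {
    "properties": {
        "title": {"type": "text"},
        "year": {"type": "integer"},
        "genre": {"type": "keyword"},
    }
}

# Python types accepted for each declared field type (illustrative subset).
PYTHON_TYPES = {"text": str, "keyword": str, "integer": int}


def conforms(doc: dict, mapping: dict) -> bool:
    """Check that every field in the document matches its declared type."""
    properties = mapping["properties"]
    for field, value in doc.items():
        if field not in properties:
            return False
        expected = PYTHON_TYPES[properties[field]["type"]]
        if not isinstance(value, expected):
            return False
    return True


print(conforms(document, mapping))                         # True
print(conforms({"title": "Up", "year": "2009"}, mapping))  # False
```

The second call fails because "year" is declared as an integer but the value is a string, which is the same kind of mismatch a relational schema would flag.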

2.2 Cluster related

Near Realtime

Near real-time has two meanings:

  • There is a small delay (about 1s) between the moment data is written and the moment it can be searched
  • Search and analysis in ES return results within seconds
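The first meaning can be pictured with a toy model: newly written documents sit in a buffer and only become searchable after a refresh. The class ToyIndex and its methods are hypothetical and exist only for illustration; in Elasticsearch the refresh runs periodically in the background, which accounts for the delay of about 1s mentioned above.

```python
class ToyIndex:
    """Toy model of near real-time search: writes become visible on refresh."""

    def __init__(self) -> None:
        self._buffer: list[str] = []      # written but not yet searchable
        self._searchable: list[str] = []  # visible to searches

    def write(self, text: str) -> None:
        self._buffer.append(text)

    def refresh(self) -> None:
        # Move buffered documents into the searchable set.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term: str) -> list[str]:
        return [text for text in self._searchable if term in text.lower().split()]


index = ToyIndex()
index.write("elasticsearch is the most popular search engine")
before = index.search("elasticsearch")
index.refresh()
after = index.search("elasticsearch")

print(before)  # []
print(after)   # ['elasticsearch is the most popular search engine']
```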

Cluster

A cluster contains multiple nodes. Which cluster a node belongs to is determined by configuration (the cluster name). For small and medium-sized applications, it is normal for a cluster to start with a single node.

Node

A node is a member of the cluster. Each node has a unique name, which is assigned randomly by default. By default, a node joins a cluster named elasticsearch. If you start a group of nodes with the default settings, they automatically form a cluster named elasticsearch. A single node can also form an elasticsearch cluster on its own.

Shard

A single machine cannot store a very large amount of data. ES can split the data of an index into multiple shards and distribute them across multiple servers. Sharding lets you scale horizontally, store more data, and spread search and analysis operations over multiple servers, which improves throughput and performance. Each shard is a Lucene index.

Replica

Any server may fail or go down at any time, and the shards it holds would then be lost. To guard against this, one or more replicas can be created for each shard. A replica takes over when its primary shard fails, so that data is not lost, and multiple replicas also improve the throughput and performance of search operations. The number of primary shards is set once at index creation and cannot be modified afterwards (the default is 5 in the versions described here; Elasticsearch 7.0 and later default to 1). The number of replicas per primary shard can be changed at any time (the default is 1). With these defaults, each index has 10 shards: 5 primary shards and 5 replica shards. The minimum high-availability configuration is 2 servers.

In other words, shards come in two kinds: primary shards and replica shards (backup copies). A primary shard is usually just called a shard, and a replica shard is usually called a replica.
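The shard arithmetic above can be written out as a short sketch. The function names are hypothetical. The node count rests on the rule that a replica is not placed on the same node as its primary or as another copy of the same shard, so every copy of a shard needs its own node.

```python
def total_shards(primaries: int, replicas_per_primary: int) -> int:
    """Total shard copies for one index: primaries plus their replicas."""
    return primaries * (1 + replicas_per_primary)


def min_nodes_for_full_allocation(replicas_per_primary: int) -> int:
    """Nodes needed so that every copy of a shard lives on a different node."""
    return 1 + replicas_per_primary


print(total_shards(5, 1))                # 10
print(min_nodes_for_full_allocation(1))  # 2
```

With 5 primary shards and 1 replica each, this gives the 10 shards and the 2-server minimum stated above.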

3 Inverted index

Source data

Document ID   Document content
1             elasticsearch is the most popular search engine
2             php is the best language in the world
3             How the search engine was born
3 How the search engine was born

Inverted index

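Term            Document IDs   Postings (DocId:TF:<Position>)
elasticsearch   1              1:1:<1>
popular         1              1:1:<2>
search engine   1,3            1:1:<3>,3:1:<1>
php             2              2:1:<1>
world           2              2:1:<2>
best            2              2:1:<3>
language        2              2:1:<4>
how             3              3:1:<2>
born            3              3:1:<3>

Note: the positions appear to follow the token order of the original (Chinese) sentences, so they do not match the word order of the English translations.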

Meaning of 1:1:<3>,3:1:<1>:

DocId   TF   Position
1       1    3
3       1    1

docId: Document ID

TF: Term frequency, the number of times the term appears in the given document

Position: The position of the term within the document. Together, the entries for a term form its posting list: every document that contains the term, and where the term occurs in each

The inverted index arises from the practical need to find records by the value of an attribute. Each entry in the index table holds an attribute value together with the addresses of all records that contain that value. Because the location of a record is determined from the attribute value, rather than the attribute value from the record, the index is called inverted.
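The structure described in this section can be built in a few lines. This is a minimal sketch, assuming the documents have already been split into terms with stop words removed; the token lists approximate the terms in the inverted index table of this section, and the function name build_inverted_index is hypothetical. Real analyzers in Lucene do considerably more work.

```python
from collections import defaultdict

# Pre-segmented documents (assumed tokenization, stop words already removed).
documents = {
    1: ["elasticsearch", "popular", "search engine"],
    2: ["php", "world", "best", "language"],
    3: ["search engine", "how", "born"],
}


def build_inverted_index(
    docs: dict[int, list[str]],
) -> dict[str, list[tuple[int, int, list[int]]]]:
    """Map each term to a posting list of (doc_id, term_frequency, positions)."""
    positions: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id in sorted(docs):
        # Positions are 1-based, as in the table.
        for position, term in enumerate(docs[doc_id], start=1):
            positions[term].setdefault(doc_id, []).append(position)
    return {
        term: [(doc_id, len(pos), pos) for doc_id, pos in per_doc.items()]
        for term, per_doc in positions.items()
    }


index = build_inverted_index(documents)
print(index["search engine"])  # [(1, 1, [3]), (3, 1, [1])]
```

Looking up a term is then a single dictionary access, and the result for "search engine" corresponds to the posting 1:1:<3>,3:1:<1> explained above.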


Origin blog.csdn.net/qq_15769939/article/details/114209338