Elasticsearch is the most popular distributed search engine today

Elasticsearch is the most popular distributed search engine today, used by GitHub, SalesforceIQ, Netflix and other companies for full-text retrieval and analysis applications. At Insight, we use many different features of Elasticsearch, such as:

  • Full Text Search 

For example, finding the most relevant Wikipedia articles for a search term.

  • Aggregations

For example, visualizing a histogram of bids for a search term on an ad network.

  • Geospatial API 

For example, matching passengers with the nearest driver on a ride-hailing platform.

Precisely because Elasticsearch is so popular and all around us, I decided to dig into it. In this article, I will share how Elasticsearch's storage model and CRUD operations work.

When I think about how distributed systems work, the pattern in my mind goes something like this:

[Figure: Anatomy of an Elasticsearch Cluster: Storage Models and Read and Write Operations]

Above the surface is the API; below it is the real engine, where all the magic happens. This article focuses on the part below the surface and covers:

  • Is Elasticsearch a master-slave architecture or a masterless architecture

  • What does Elasticsearch's storage model look like

  • How Elasticsearch performs write operations

  • How Elasticsearch performs read operations

  • How to define the relevance of search results

Before we dive into these concepts, let's familiarize ourselves with the related terminology.

1 Distinguish between Elasticsearch's index and Lucene's index

An index in Elasticsearch is a logical namespace (like a database) for organizing data. An Elasticsearch index has 1 or more shards (5 by default). A shard is a Lucene index that actually stores the data, and it is a search engine in itself. Each shard can have 0 or more replicas (1 by default). An Elasticsearch index also has "types" (like tables in a database), which logically partition the data in the index. All documents of a given type in an Elasticsearch index have the same properties (like the schema of a table).

Figure a shows an Elasticsearch index with 3 shards, each with 1 replica. These shards form an Elasticsearch index, and each shard is itself a Lucene index. Figure b shows the logical relationship between Elasticsearch indexes, shards, Lucene indexes, and documents.
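
To make the arithmetic behind Figure a concrete, here is a small sketch that writes the layout down as an index settings body, using the `number_of_shards` and `number_of_replicas` index settings, and counts the resulting shard copies. The helper name `total_shard_copies` is made up for this example, and nothing here talks to a real cluster.

```python
import json

# Index settings matching Figure a: 3 primary shards, 1 replica of each.
index_settings = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
    }
}


def total_shard_copies(settings: dict) -> int:
    """Count primaries plus replicas; every copy is its own Lucene index."""
    s = settings["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])


print(json.dumps(index_settings, indent=2))
print("shard copies in the cluster:", total_shard_copies(index_settings))
```

With 3 primaries and 1 replica each, the cluster ends up holding 3 primary and 3 replica shards.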

 

Analogy to relational database terminology:

Elasticsearch Index == Database
Types == Tables
Properties == Schema

Now that we're familiar with the terminology in the Elasticsearch world, let's take a look at the different roles nodes have.

2 Node Types

An Elasticsearch instance is a node, and a group of nodes forms a cluster. Nodes in an Elasticsearch cluster can be configured in 3 different roles:

  • Master node: Controls the Elasticsearch cluster and is responsible for cluster-wide operations such as creating or deleting an index, keeping track of the nodes in the cluster, and assigning shards to nodes. The master node processes the cluster state, broadcasts it to the other nodes, and receives their confirmation responses.

    Any node can be made master-eligible by setting the node.master property in the configuration file elasticsearch.yml to true (the default).

    For large production clusters, it is recommended to use dedicated master nodes that only control the cluster and do not handle any user requests.

  • Data node: Holds the data and the inverted index. By default, every node is a data node, because the node.data property in elasticsearch.yml is set to true. A dedicated master node should have its node.data property set to false.

  • Client node: If both node.master and node.data are set to false, the node is a client node, which acts as a load balancer and routes incoming requests to the other nodes in the cluster.

The node that a client connects to is called the coordinator node. The coordinator node routes client requests to the appropriate shard in the cluster. For read requests, the coordinator node selects a different shard copy each time in order to balance the load.
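
The way the two settings combine into roles can be summarized in a few lines. This is only a sketch of the rules described above: the function `node_role` and the role labels it returns are my own naming, and the dictionaries stand in for the relevant entries of elasticsearch.yml.

```python
def node_role(settings: dict) -> str:
    """Derive a node's role from node.master and node.data (both default to true)."""
    master = settings.get("node.master", True)
    data = settings.get("node.data", True)
    if master and data:
        return "master-eligible data node"  # the default configuration
    if master:
        return "dedicated master node"      # node.data: false
    if data:
        return "data-only node"             # node.master: false
    return "client node"                    # both false


for config in (
    {},
    {"node.data": False},
    {"node.master": False},
    {"node.master": False, "node.data": False},
):
    print(config, "->", node_role(config))
```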

Before we start looking at how CRUD requests sent to coordinator nodes are propagated across the cluster and executed by the engine, let's take a look at how data is stored internally in Elasticsearch to support low-latency serving of full-text search results.

Storage Model

Elasticsearch uses Apache Lucene, a full-text search library written in Java by Doug Cutting (the creator of Apache Hadoop). Internally, Lucene uses a data structure called an inverted index, which is designed to serve full-text search results with low latency. A document is the unit of data in Elasticsearch. An inverted index is built by tokenizing the terms in each document, creating a sorted list of all the unique terms, and associating each term with the list of documents in which it appears.

This is very similar to the index at the back of a book, where the words contained in the book are associated with the list of page numbers on which they appear. When we say a document is indexed, we are referring to the inverted index. Let's see what the inverted index looks like for the following two documents:

Document 1 (Doc 1): Insight Data Engineering Fellows Program
Document 2 (Doc 2): Insight Data Science Fellows Program

Term          Documents
data          Doc 1, Doc 2
engineering   Doc 1
fellows       Doc 1, Doc 2
insight       Doc 1, Doc 2
program       Doc 1, Doc 2
science       Doc 2

If we want to find the documents that contain the term "insight", we can scan the inverted index (in which the terms are sorted), find "insight", and return the IDs of the documents that contain it, which are Doc 1 and Doc 2 in our example.
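
The same index can be rebuilt in a few lines of Python. This is a minimal sketch of the idea only, assuming whitespace tokenization and lowercasing; the function name `build_inverted_index` is made up for the example, and Lucene's on-disk structures are far more elaborate.

```python
from collections import defaultdict

docs = {
    "Doc 1": "Insight Data Engineering Fellows Program",
    "Doc 2": "Insight Data Science Fellows Program",
}


def build_inverted_index(documents: dict) -> dict:
    """Map each unique term to the list of documents that contain it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # set() deduplicates terms that repeat within one document.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    # Keep the terms sorted, like the table above.
    return dict(sorted(index.items()))


inverted_index = build_inverted_index(docs)
for term, doc_ids in inverted_index.items():
    print(f"{term:<12} {', '.join(doc_ids)}")

# Looking up a term is now a single dictionary access.
print(inverted_index.get("insight", []))
```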

To improve searchability (for example, returning the same results for uppercase and lowercase words), documents are analyzed before they are indexed. Analysis consists of 2 parts:

  • Tokenizing sentences into individual words

  • Normalizing words to a standard form

By default, Elasticsearch uses the standard analyzer, which uses:

  • The standard tokenizer, which splits text on word boundaries

  • The lowercase token filter, which converts words to lowercase

Many other analyzers are available that are not covered here; please refer to the relevant documentation.

To get relevant results at query time, the query should be analyzed with the same analyzer that was used for indexing.

Note: The standard analyzer includes a stop word filter, but it is not enabled by default.
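
As a rough illustration of the two analysis steps, the sketch below tokenizes with a regular expression and then lowercases each token. This only approximates the standard analyzer: the real standard tokenizer follows Unicode text segmentation rules, and the function name `analyze` is an assumption made for this example.

```python
import re


def analyze(text: str) -> list:
    """Step 1: split into word tokens. Step 2: normalize each token to lowercase."""
    tokens = re.findall(r"\w+", text)       # crude stand-in for the standard tokenizer
    return [token.lower() for token in tokens]  # lowercase token filter


indexed_terms = analyze("Insight Data Engineering Fellows Program")
query_terms = analyze("INSIGHT")

print(indexed_terms)
# Because the query went through the same analysis, it matches the indexed term.
print(query_terms[0] in indexed_terms)
```

Running the query through the same function is what lets "INSIGHT" match a document that was indexed with "Insight".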

Now that the concept of inverted index is clear, let's start the study of CRUD operations. We start with the write operation.

Anatomy of a Write Operation

Create ((C)reate)

When we send a request to index a new document to the coordinator node, the following operations take place:

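  • Every node in the Elasticsearch cluster holds metadata about which shard lives on which node. The coordinator node uses the document ID (by default) to work out which shard the document should be routed to. Elasticsearch hashes the document ID with the MurmurHash3 function and takes the result modulo the number of primary shards; the outcome is the shard in which the document is indexed.

    shard = hash(document_id) % (num_of_primary_shards)

  • When the node that holds the shard receives the request from the coordinator node, it writes the request to the translog (covered in a later article in this series) and adds the document to the in-memory buffer. If the request succeeds on the primary shard, it is sent in parallel to the replicas of that shard. The client receives an acknowledgement only after the translog has been fsync'ed on the primary shard and all of its replicas.

  • The in-memory buffer is refreshed at a regular interval (1 second by default), and its contents are written to a new segment in the filesystem cache. This segment has not been fsync'ed yet, but it is open and its contents are searchable.

  • Every 30 minutes, or when the translog becomes too large, the translog is emptied and the filesystem cache is fsync'ed. In Elasticsearch this process is called a flush. During a flush, the in-memory buffer is cleared and its contents are written to a new segment. The fsync of the segments creates a new commit point and the contents are flushed to disk. The old translog is deleted and a new one is started.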
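
The routing formula from the first step can be tried out with a short sketch. Note the assumption: Python's standard library has no MurmurHash3, so `zlib.crc32` is used here purely as a stand-in hash, which means the shard numbers it produces will not match what a real cluster computes. The function name `route_to_shard` is also made up for the example.

```python
import zlib


def route_to_shard(document_id: str, num_primary_shards: int) -> int:
    """shard = hash(document_id) % num_of_primary_shards"""
    # Stand-in hash (assumption): Elasticsearch uses MurmurHash3, not CRC32.
    hashed = zlib.crc32(document_id.encode("utf-8"))
    return hashed % num_primary_shards


for doc_id in ("1", "2", "user-42"):
    print(doc_id, "-> shard", route_to_shard(doc_id, 5))
```

Because the result depends only on the document ID and the number of primary shards, any node can compute the target shard on its own and always arrives at the same answer for the same ID.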

The figure below shows the write request and its data flow.

[Figure: write request and data flow]

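Update ((U)pdate) and Delete ((D)elete)

Deletes and updates are also write operations. However, documents in Elasticsearch are immutable, so they cannot be deleted or modified in place to reflect a change. How, then, can a document be deleted or updated?

Every segment on disk has a corresponding .del file. When a delete request is sent, the document is not really deleted; it is marked as deleted in the .del file. The document may still match a query, but it is filtered out of the results. When segments are merged (covered in a later article in this series), documents marked as deleted in the .del file are not written to the new segment.

Now let's look at how updates work. When a new document is created, Elasticsearch assigns it a version number. When an update is performed, the old version of the document is marked as deleted in the .del file and the new version is indexed into a new segment. The old version may still match a query, but it is filtered out of the results.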
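
To tie the write path together, here is a toy in-memory model of a single shard. Everything in it is hypothetical (the class `ToyShard`, its fields, and its method names), it ignores replicas and fsync entirely, and it is only meant to show how immutable segments, delete markers, refresh, flush, and merge interact.

```python
class ToyShard:
    """Toy model of one shard: translog, in-memory buffer, immutable segments."""

    def __init__(self):
        self.translog = []   # operations recorded since the last flush
        self.buffer = []     # indexed documents that are not searchable yet
        self.segments = []   # each segment: {"docs": [...], "deleted": set()}
        self.versions = {}   # document ID -> latest version number

    def index(self, doc_id, body):
        """Create or update: mark older copies deleted, then buffer the new version."""
        version = self.versions.get(doc_id, 0) + 1
        self.versions[doc_id] = version
        self._mark_deleted(doc_id)
        self.translog.append(("index", doc_id, version))
        self.buffer.append({"id": doc_id, "version": version, "body": body})

    def delete(self, doc_id):
        self._mark_deleted(doc_id)
        self.translog.append(("delete", doc_id))

    def _mark_deleted(self, doc_id):
        # Drop any copy that never reached a segment.
        self.buffer = [doc for doc in self.buffer if doc["id"] != doc_id]
        # Plays the role of the .del file: segment contents are never edited.
        for segment in self.segments:
            for pos, doc in enumerate(segment["docs"]):
                if doc["id"] == doc_id:
                    segment["deleted"].add(pos)

    def refresh(self):
        """Write the buffer to a new segment, which makes it searchable."""
        if self.buffer:
            self.segments.append({"docs": self.buffer, "deleted": set()})
            self.buffer = []

    def flush(self):
        """Write out the buffer and start a new translog (fsync is not modeled)."""
        self.refresh()
        self.translog = []

    def search(self, word):
        """Match against every segment, then filter out documents marked deleted."""
        hits = []
        for segment in self.segments:
            for pos, doc in enumerate(segment["docs"]):
                matches = word in doc["body"].lower().split()
                if matches and pos not in segment["deleted"]:
                    hits.append((doc["id"], doc["version"]))
        return hits

    def merge(self):
        """Copy only live documents into a single new segment."""
        live = [doc for segment in self.segments
                for pos, doc in enumerate(segment["docs"])
                if pos not in segment["deleted"]]
        self.segments = [{"docs": live, "deleted": set()}] if live else []


shard = ToyShard()
shard.index("1", "Insight Data Engineering Fellows Program")
print(shard.search("insight"))   # nothing yet: the buffer has not been refreshed
shard.refresh()
print(shard.search("insight"))   # version 1 is now searchable
shard.index("1", "Insight Data Science Fellows Program")
shard.refresh()
print(shard.search("insight"))   # only version 2 survives the filter
print(len(shard.segments), "segments before merge")
shard.merge()
print(len(shard.segments), "segment after merge")
```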

Once documents have been indexed or updated, we can run search requests. Let's see how Elasticsearch handles them.

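Anatomy of a Read Operation ((R)ead)

A read operation consists of 2 phases:

  • Query phase

  • Fetch phase

Let's look at how each phase works.

Query phase

In this phase, the coordinator node routes the search request to all the shards (primaries or replicas) of the index. Each shard runs the search independently and builds a priority queue of its results, sorted by relevance score (covered in a later article in this series). All the shards return the IDs of the matching documents and their relevance scores to the coordinator node. The coordinator node then builds its own priority queue and sorts the results globally. Many documents may match, but by default each shard sends only its top 10 results to the coordinator node, which builds a priority queue from the results of all the shards and returns the top 10 as hits.

Fetch phase

Once the coordinator node has sorted all the results into a globally ordered list of documents, it requests the original documents from the shards that hold them. All the shards enrich the documents and return them to the coordinator node.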
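
The two phases can be sketched with plain dictionaries standing in for shards. The function names, the sample data, and the scoring (a raw count of the term) are all assumptions made for illustration; only the shape of the flow follows the description above.

```python
import heapq


def query_phase(shard_docs: dict, term: str, size: int = 10) -> list:
    """Runs on one shard: score the matches, return only the top (id, score) pairs."""
    scored = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)  # placeholder relevance score
        if score > 0:
            scored.append((doc_id, score))
    return heapq.nlargest(size, scored, key=lambda hit: hit[1])


def search(shards: list, term: str, size: int = 10) -> list:
    """Runs on the coordinator node: merge per-shard results, then fetch documents."""
    # Query phase: collect at most `size` candidates from every shard.
    candidates = []
    for shard_no, shard_docs in enumerate(shards):
        for doc_id, score in query_phase(shard_docs, term, size):
            candidates.append((score, shard_no, doc_id))
    top = heapq.nlargest(size, candidates, key=lambda c: c[0])

    # Fetch phase: ask only the shards that hold a winning document for its source.
    return [
        {"_id": doc_id, "_score": score, "_source": shards[shard_no][doc_id]}
        for score, shard_no, doc_id in top
    ]


shards = [
    {"a": "search search search engine", "b": "search engine"},
    {"c": "search search index", "d": "storage model"},
]

for hit in search(shards, "search", size=2):
    print(hit)
```

Only IDs and scores travel in the query phase; the document bodies are looked up afterwards, and only for the documents that made it into the global top results.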

The figure below shows the read request and its data flow.

[Figure: read request and data flow]

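As mentioned above, search results are sorted by relevance. Next, let's look at how relevance is defined.

Search Relevance

Relevance is determined by the score that Elasticsearch gives to each document in the search results. The default ranking algorithm is tf/idf (term frequency/inverse document frequency). Term frequency measures how many times a term appears in a document (higher frequency == higher relevance). Inverse document frequency measures how often the term appears across the whole index, as a percentage of the total number of documents in the index (higher frequency == lower relevance). The final score is a combination of the tf-idf score and other factors, such as term proximity (for phrase queries) and term similarity (for fuzzy queries).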
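
To get a feel for how the two factors pull in opposite directions, here is a textbook tf-idf calculation. It is deliberately not Lucene's exact scoring formula, which applies additional dampening and normalization; the third document and all function names are assumptions added for the example.

```python
import math

docs = {
    "Doc 1": "insight data engineering fellows program",
    "Doc 2": "insight data science fellows program",
    "Doc 3": "pipelines pipelines data",  # extra document, added for illustration
}


def tf(term: str, text: str) -> int:
    """Term frequency: how many times the term occurs in one document."""
    return text.split().count(term)


def idf(term: str, documents: dict) -> float:
    """Inverse document frequency: terms found in fewer documents score higher."""
    containing = sum(1 for text in documents.values() if term in text.split())
    return math.log(len(documents) / containing) if containing else 0.0


def tf_idf(term: str, doc_id: str, documents: dict) -> float:
    return tf(term, documents[doc_id]) * idf(term, documents)


for term in ("data", "insight", "engineering"):
    print(term, round(tf_idf(term, "Doc 1", docs), 3))
print("pipelines", round(tf_idf("pipelines", "Doc 3", docs), 3))
```

In this variant a term that occurs in every document, such as "data", contributes nothing to the score, while a term that is rare in the index and repeated within a document contributes the most.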

What's next?

These CRUD operations are backed by several internal data structures and techniques that are important for understanding how Elasticsearch works. In the following articles in this series, I will go through the concepts below and point out some of the pitfalls of using Elasticsearch.

  • Split-brain problem and preventive measures in Elasticsearch

  • Transaction log

  • Lucene's segments

  • Why deep pagination in search is dangerous

  • Difficulties and tradeoffs in computing search relevance

  • Concurrency control

  • Why Elasticsearch is near real-time

  • How to ensure read and write consistency
