Analysis of the principles behind the ES write and read data processes

1. es write data process

The client selects a node to send the request to, and this node is the coordinating node.

The coordinating node routes the document and forwards the request to the corresponding node (the one holding the primary shard).

The primary shard on that node handles the request and then synchronizes the data to the nodes holding the replica shards.

Once the coordinating node sees that the primary shard and all replica shards have completed the write, it returns the response to the client.
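By default, es routes a document with shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the doc id. As a minimal sketch of the write path from the client's side, assuming a local cluster, an index named articles, and the 7.x Java High Level REST Client:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        // The node we connect to acts as the coordinating node for this request.
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // Index one document; es routes it to a primary shard based on its id.
        IndexRequest request = new IndexRequest("articles")
                .id("1")
                .source("content", "java is so fun");
        IndexResponse response = client.index(request, RequestOptions.DEFAULT);
        System.out.println(response.getResult());

        client.close();
    }
}
```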

2. es read data process

You can query by doc id: es hashes the doc id to determine which shard the doc was assigned to when it was written, and queries that shard.

The client sends a request to any node, which becomes the coordinating node.

The coordinating node hashes the doc id and forwards the request to the corresponding node. At this point a round-robin algorithm picks one copy among the primary shard and all its replicas, to balance the load of read requests.

The node receiving the request returns the document to the coordinating node.

The coordinating node returns the document to the client.
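A minimal sketch of such a read by doc id, under the same assumptions (local cluster, articles index, 7.x Java High Level REST Client):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // GET by doc id: the coordinating node hashes the id to find the shard,
        // then the primary or one of its replicas serves the request.
        GetResponse response = client.get(new GetRequest("articles", "1"), RequestOptions.DEFAULT);
        if (response.isExists()) {
            System.out.println(response.getSourceAsString());
        }

        client.close();
    }
}
```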

3. es search data process

The most powerful feature of es is full-text search. Say you have three pieces of data:

java is so fun
java is so hard to learn
j2ee is so awesome

If you search for the keyword java, the documents containing java are found, and es gives you back: java is so fun and java is so hard to learn.

The client sends a request to a coordinating node.

The coordinating node forwards the search request to every shard; for each shard, either the primary or one of its replica shards can serve the request.

query phase: each shard returns its own search results (in fact, just doc ids) to the coordinating node, which merges, sorts, and paginates the data to produce the final result.

fetch phase: the coordinating node then pulls the actual document data from each node according to the doc ids and returns it to the client.

Write requests go to the primary shard and are then synchronized to all replica shards; read requests can be served by either the primary shard or a replica shard, chosen by the round-robin algorithm.
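A minimal full-text search sketch under the same assumptions (articles index with a content field, 7.x Java High Level REST Client):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // Full-text search for "java": the coordinating node fans the query out to
        // every shard (query phase), then fetches the matching docs by id (fetch phase).
        SearchRequest request = new SearchRequest("articles")
                .source(new SearchSourceBuilder()
                        .query(QueryBuilders.matchQuery("content", "java")));
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        for (SearchHit hit : response.getHits().getHits()) {
            System.out.println(hit.getSourceAsString());
        }

        client.close();
    }
}
```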

4. The underlying principle of writing data

The data is first written to the in-memory buffer; while it sits in the buffer it cannot be searched. At the same time, the operation is written to the translog log file.

When the buffer is almost full, or after a certain interval, the data in the memory buffer is refreshed into a new segment file. The data does not go straight into the segment file on disk; it first enters the os cache. This process is called refresh.

By default, every 1 second es writes the data in the buffer into a new segment file, so a new segment file is generated every second; it holds the data written to the buffer during the last 1 second.

If the buffer is empty at that moment, the refresh is of course skipped; if it contains data, the refresh runs once per second by default and flushes the data into a new segment file.

Disk files in the operating system go through something called the os cache, the operating system cache: before data is written to a disk file, it first enters this memory cache at the operating system level. As soon as the refresh operation flushes the buffer data into the os cache, that data can be searched.

Why is es called quasi-real-time? NRT stands for near real-time. The default is to refresh every 1 second, so es is quasi-real-time: written data only becomes visible after about 1 second. You can also trigger a refresh manually through the restful api or java api of es, flushing the buffer data into the os cache so that it can be searched immediately. Once the data has entered the os cache, the buffer is emptied, because it no longer needs to be kept there: the operation has already been recorded in the translog.
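For example, a manual refresh via the Java api of the same 7.x high-level client might look like this (index name and host are assumptions):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.refresh.RefreshRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class RefreshExample {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // Force a refresh: flush the memory buffer into the os cache so the newly
        // written data becomes searchable immediately instead of after ~1s.
        client.indices().refresh(new RefreshRequest("articles"), RequestOptions.DEFAULT);

        client.close();
    }
}
```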

Repeat the above steps: new data keeps entering the buffer and the translog, and the buffer data keeps being written into new segment files, one after another. After each refresh the buffer is emptied, while the translog is retained. As this goes on, the translog grows larger and larger, and when it reaches a certain size, the commit operation is triggered.

The first step of the commit operation is to refresh any data still in the buffer into the os cache and clear the buffer. Then a commit point is written to a disk file, identifying all the segment files that this commit point corresponds to, and all the data currently in the os cache is forcibly fsynced to the disk files. Finally, the existing translog log file is cleared and a new translog is started, and the commit is complete.

This commit operation is called flush. By default a flush runs automatically every 30 minutes, and it is also triggered when the translog grows too large. The flush operation corresponds to the whole commit process. We can also trigger a flush manually through the es api, fsyncing the data in the os cache to disk.
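A manual flush via the same Java client could look like this (same assumptions as above):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.flush.FlushRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class FlushExample {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // Force a flush (commit): fsync the os cache data to disk, write a commit
        // point, and start a fresh translog.
        client.indices().flush(new FlushRequest("articles"), RequestOptions.DEFAULT);

        client.close();
    }
}
```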

What is the translog log file for? Before the commit operation, data sits either in the buffer or in the os cache, and both of those are memory: if the machine dies, all of that in-memory data is lost. That is why the operations behind the data are written to a dedicated log file, the translog. If the machine crashes and restarts, es automatically reads the data back from the translog and restores it into the memory buffer and the os cache.

In fact, the translog itself is also written to the os cache first and is only fsynced to disk every 5 seconds by default, so by default up to 5 seconds of data may exist only in the os cache of the buffer or of the translog file, not on disk. If the machine crashes at that moment, those 5 seconds of data are lost. The upside is better performance: you lose at most 5 seconds of data. You can also configure the translog so that every write operation is fsynced directly to disk, but performance will be much worse.
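The 5-second behaviour described above corresponds to setting index.translog.durability to async (fsync every index.translog.sync_interval, 5s by default); a sketch of switching an index to fsync on every write request instead, with the same client (index name and host are assumptions):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class TranslogSettingsExample {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // "request" fsyncs the translog on every write (safer, slower);
        // "async" fsyncs only every index.translog.sync_interval (5s by default).
        UpdateSettingsRequest request = new UpdateSettingsRequest("articles")
                .settings(Settings.builder()
                        .put("index.translog.durability", "request")
                        .build());
        client.indices().putSettings(request, RequestOptions.DEFAULT);

        client.close();
    }
}
```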

At this point, even if the interviewer did not ask whether es can lose data, you can show off a little: es is quasi-real-time, so data becomes searchable about 1 second after it is written; and data loss is possible, because up to 5 seconds of data may sit only in the buffer, the translog os cache, and the segment file os cache rather than on disk, so if the machine goes down at that moment, those 5 seconds of data are lost.

5. Summary

The data is first written to the memory buffer. Every 1s it is refreshed into the os cache, and data in the os cache can be searched (which is why es is said to have a 1s delay between a write and it becoming searchable). The translog is fsynced to disk every 5s (so if the machine goes down and all in-memory data is lost, at most 5s of data is lost). When the translog gets large enough, or every 30 minutes by default, the commit (flush) operation is triggered and the data in the os cache is flushed to segment files on disk.

After the data is written to the segment file, the inverted index is established at the same time.

6. The underlying principle of deleting/updating data

If it is a delete operation, a .del file is generated at commit time, and the doc is marked as deleted in it. When searching, the .del file tells es whether a doc has been deleted.

If it is an update operation, the original doc is marked as deleted and a new doc is written.

Every time the buffer is refreshed, a new segment file is generated, so by default a new segment file appears every second and segment files keep piling up. Merge is therefore executed periodically: each merge combines multiple segment files into one, physically deletes the docs marked as deleted, and writes the new segment file to disk. A commit point is written to identify the new segment file, the new segment file is opened for searching, and the old segment files are deleted.
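Segment merging normally happens automatically in the background, but it can also be triggered explicitly through the force merge api; a sketch with the same Java client (index name, host, and the choice of a single target segment are assumptions):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.admin.indices.forcemerge.ForceMergeRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class MergeExample {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // Merge the index down to one segment, expunging docs marked as deleted.
        ForceMergeRequest request = new ForceMergeRequest("articles")
                .maxNumSegments(1);
        client.indices().forcemerge(request, RequestOptions.DEFAULT);

        client.close();
    }
}
```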

7. The underlying lucene

To put it simply, lucene is a jar package containing all kinds of packaged algorithm code for building inverted indexes. When we develop in Java, we can import the lucene jar and develop against the lucene api.

Through lucene we can index our data, and lucene organizes the index's data structures on the local disk for us.
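A minimal sketch of developing directly against the lucene api (the index directory path and field name are arbitrary choices); it writes one document into an on-disk index and searches it back:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneExample {
    public static void main(String[] args) throws Exception {
        // Lucene stores the inverted index as files in this directory.
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));

        // Index one document; the analyzer tokenizes the text into terms.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        Document doc = new Document();
        doc.add(new TextField("content", "java is so fun", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search the index for the term "java".
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(
                new QueryParser("content", new StandardAnalyzer()).parse("java"), 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("content"));
        }
        reader.close();
    }
}
```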

8. Inverted index

In a search engine, each document has a corresponding document ID, and the content of the document is represented as a collection of keywords. For example, document 1 has been tokenized and 20 keywords were extracted; for each keyword, the number of times it occurs in the document and its positions are recorded.

The inverted index is then the mapping from keywords to document IDs: each keyword corresponds to a series of documents, namely the documents in which that keyword appears.

Case:

Suppose we have the following documents:

DocId  Doc
1      The father of Google Maps jumps to Facebook
2      The father of Google Maps joins Facebook
3      Google Maps founder Russ leaves Google to join Facebook
4      The father of Google Maps jumps to Facebook in connection with the cancellation of the Wave project
5      Lars, the father of Google Maps, joins social networking site Facebook

After tokenizing the document, the following inverted index is obtained.

WordId  Word         DocIds
1       Google       1, 2, 3, 4, 5
2       map          1, 2, 3, 4, 5
3       father       1, 2, 4, 5
4       job hopping  1, 4
5       Facebook     1, 2, 3, 4, 5
6       join         2, 3, 5
7       founder      3
8       Russ         3, 5
9       leave        3
10      and          4
...     ...          ...

In addition, a practical inverted index can record more information, such as document frequency, i.e. how many documents in the collection contain a given word.

With the inverted index, the search engine can easily respond to user queries. For example, if a user queries Facebook, the search engine looks up the inverted index and reads out the documents that contain that word; those documents are the search results returned to the user.

There are two important details to note about inverted indexes:

All terms in the inverted index correspond to one or more documents;

The terms in the inverted index are sorted in ascending lexicographical order.

The above is just a simple example, and its order is not strictly lexicographically ascending.
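To make the idea concrete, here is a toy in-memory inverted index in Java; it only illustrates the keyword-to-doc-id mapping, not how lucene actually stores segments on disk:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexExample {
    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("java is so fun");
        docs.add("java is so hard to learn");
        docs.add("j2ee is so awesome");

        // Map each term to the sorted set of doc ids it appears in;
        // TreeMap keeps the terms in ascending lexicographical order.
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId + 1);
            }
        }

        // Query: look up a term and get back the doc ids that contain it.
        System.out.println(index.get("java"));  // [1, 2]
        System.out.println(index);
    }
}
```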

Reprinted: How did the interviewer check your understanding of the ES search engine? 

Origin blog.csdn.net/yangbindxj/article/details/123912023