Search engine interview preparation

1. Overview

The de facto standard in the industry is that distributed search engines are generally built on elasticsearch or solr, and in the past two years everyone has largely moved to the easier-to-use es. What might an interviewer ask about this topic?

(1) Can you talk about the principles of es's distributed architecture (how does es implement a distributed architecture)?

(2) What is the working principle of es writing data? What is the working principle of es querying data?

(3) How can es improve query performance when the data volume is large (billions of records)?

(4) What does your es production cluster's deployment architecture look like? Roughly how much data does each index hold? How many shards does each index have?

2. Architecture design of distributed search engine

1. Interview questions

Can you talk about the principles of es's distributed architecture (how does es implement a distributed architecture)?

2. The interviewer's psychological analysis

In search, Lucene is the most popular search library. A few years ago the industry generally asked: do you know lucene? Do you understand how an inverted index works? Those questions are out of date now, because many projects directly use the distributed search engine built on top of Lucene: elasticsearch, es for short.

By now, distributed search has basically become standard kit for java systems across most of the Internet industry, and the most popular option is es. A few years ago, before es took off, everyone generally used solr; in the past two years, most companies and projects have switched to es.

So Internet-company interviews will definitely get to distributed search engines, and will definitely talk about es. If you really don't know it, you really are behind.

If the interviewer opens with this topic, they will generally ask: can you introduce the distributed architecture design of es? It simply tests your basic grasp of distributed search engine architecture.

3. Analysis of interview questions

Elasticsearch is designed as a distributed search engine, with Lucene at the bottom. The core idea is to start multiple es process instances on multiple machines, which together form an es cluster.

The basic unit for storing data in es is the index. For example, to store some order data in es, you create an index in es, say order_idx, and write all order data into it. An index is roughly equivalent to a table in mysql. The hierarchy is: index -> type -> mapping -> document -> field.

An index is like a mysql table; a type has no direct mysql counterpart. An index can hold multiple types whose fields are mostly the same but differ slightly.

For example, take an order index containing order data; it's like a table you'd create in mysql. Some orders are for physical goods, like a piece of clothing or a pair of shoes; others are for virtual goods, like a game card or a phone-bill top-up. The two kinds of orders share most of their fields, but a few fields differ slightly.

Therefore, two types would be created in the order index, one for physical-goods orders and one for virtual-goods orders. Most of the fields of these two types are the same; a few differ.

In many cases an index has just one type, but if an index really does contain multiple types, you can think of the index as a category of tables, with each specific type representing a specific mysql table.

Each type has a mapping. If you think of a type as a concrete table, then the index represents a category containing multiple types of the same kind, and the mapping is that type's table-structure definition. When you create a table in mysql, you must define its structure: which fields it has and each field's type.

The mapping is the table-structure definition for the type: it defines the name of each field in the type, the field's data type, and the field's various configurations.

The data you actually write into a type in an index is called a document. A document corresponds to a row in a mysql table, and each document has multiple fields, each holding the value of one field of that document.

Then, an index you create can be split into multiple shards, each storing part of the data. Each shard's data is in turn backed up multiple times: every shard has a primary shard, responsible for writing data, and several replica shards. After the primary shard writes data, it synchronizes it to the replica shards.

With this replica scheme, every shard's data has multiple backups. If one machine goes down, it doesn't matter: other machines still hold copies of the data. That's high availability.
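
To make the index/shard/replica vocabulary concrete, here is a minimal sketch using the official elasticsearch Python client (a 7.x-style API is assumed, where mapping types have been removed; the index name and fields are illustrative, not from the original text):

```python
from elasticsearch import Elasticsearch

# Connect to any node of the cluster; that node will coordinate our requests.
es = Elasticsearch("http://localhost:9200")

# Create an index with 3 primary shards and 1 replica per primary.
es.indices.create(
    index="order_idx",
    body={
        "settings": {
            "number_of_shards": 3,    # data is split across 3 primary shards
            "number_of_replicas": 1,  # each primary gets 1 replica shard
        },
        "mappings": {                 # the "table structure" of the index
            "properties": {
                "order_code": {"type": "keyword"},
                "total_price": {"type": "double"},
            }
        },
    },
)
```

With these settings, one machine going down still leaves at least one copy of every shard alive elsewhere in the cluster.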

Multiple nodes in the es cluster automatically elect one node as the master node. The master node actually just handles some management tasks, such as maintaining index metadata and switching the roles of primary and replica shards.

If the master node goes down, a new node is elected as the master.

If a non-master node goes down, the master node transfers the primary-shard role held on the downed node to replica shards on other machines. Then, if you repair the downed machine and restart it, the master node arranges for the missing replica shards to be redistributed to it and synchronizes any data modified in the meantime, so the cluster returns to normal.

The above is essentially the most basic architecture design of elasticsearch as a distributed search engine.

3. Write and query workflows of a distributed search engine

1. Interview questions

What is the working principle of es writing data? What is the working principle of es querying data?

2. Psychological analysis of interviewers

The interviewer asks this to see whether you understand some of es's basic principles, because using es boils down to writing data and searching data. If you don't understand what es actually does when you issue a write or search request, then you really are. . .

If es is basically a black box to you, what else can you do? All you can do is call the es api to read and write data. And if something goes wrong and you know nothing about the internals, what can be expected of you?

3. Analysis of interview questions

es write data process

  • The client picks a node and sends it the request; that node acts as the coordinating node
  • The coordinating node routes the document and forwards the request to the node holding the corresponding primary shard
  • The primary shard on that node processes the request, then synchronizes the data to the replica shards
  • Once the coordinating node sees that the primary shard and all replica shards are done, it returns the response to the client (a minimal write is sketched below)
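
A minimal write sketch, reusing the `es` client created above (the doc id and fields are illustrative):

```python
# The client talks to one node (the coordinating node); es hashes the doc id
# to route the document to the right primary shard, which then syncs replicas.
resp = es.index(
    index="order_idx",
    id="order-1",  # optional: omit it and es generates a globally unique doc id
    body={"order_code": "test-order", "total_price": 5000},
)
print(resp["result"])  # "created" on first write, "updated" afterwards
```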

es read data process

Querying here means GETting a piece of data by id. When you write a document, es automatically assigns it a globally unique id, the doc id, and hashes that doc id to route the document to the corresponding primary shard. You can also specify the doc id manually, e.g. an order id or a user id.

When you query by doc id, es hashes the doc id to determine which shard it was assigned to at write time, and queries that shard.

  • The client sends a request to any node, which becomes the coordinating node
  • The coordinating node routes by doc id and forwards the request to the corresponding node. A round-robin algorithm picks one copy from among the primary shard and all its replicas, balancing the load of read requests.
  • The node receiving the request returns the document to the coordinating node
  • The coordinating node returns the document to the client (see the lookup sketch below)
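
A point lookup by doc id, sketched with the same client:

```python
# es hashes "order-1" to find the shard, then round-robins between that
# shard's primary and replicas to spread the read load.
doc = es.get(index="order_idx", id="order-1")
print(doc["_source"])  # {'order_code': 'test-order', 'total_price': 5000}
```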

es search data process

The most powerful thing about es is full-text search. Say you have three pieces of data:

java is really fun
java is so hard to learn
j2ee is awesome

You search on the keyword java, and the documents containing java are found. es returns:

java is really fun, java is so hard to learn

  • The client sends a request to a coordinating node
  • The coordinating node forwards the search request to every shard, hitting either the primary shard or a replica shard of each
  • query phase: each shard returns its own search results (really just some doc ids) to the coordinating node, which merges, sorts, and paginates them to produce the final result
  • fetch phase: the coordinating node then pulls the actual document data from each node by doc id and returns it to the client (a minimal search is sketched below)
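
A minimal full-text search sketch (the index name and the `content` text field are assumptions for illustration):

```python
# The coordinating node fans this query out to one copy of every shard
# (query phase), merges the returned doc ids, then fetches the documents
# (fetch phase) before answering us.
resp = es.search(
    index="article_idx",
    body={"query": {"match": {"content": "java"}}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["content"])
```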

The underlying principle of writing data

1) Data is first written to an in-memory buffer; while it sits in the buffer it cannot be searched. At the same time, the data is written to the translog log file.

2) When the buffer is almost full, or after a certain interval, the buffer data is refreshed into a new segment file. At this point the data does not go straight into the on-disk segment file but first into the os cache. This process is called refresh.

Every 1 second, es writes the buffered data into a new segment file, so a new segment file is generated each second, holding the data written to the buffer in the previous second.

But if the buffer happens to be empty, the refresh is of course skipped rather than creating an empty segment file every second. If the buffer does hold data, the refresh runs once per second by default, flushing the data into a new segment file.

At the operating-system level, disk files have something called the os cache (operating system cache): before data is written to a disk file, it first enters the os cache, a memory cache at the operating-system level.

As long as the buffer's data has been refreshed into the os cache, it can be searched.

Why is es called near real-time? NRT stands for near real-time. The default is to refresh every 1 second, so es is near real-time: written data can only be seen about 1 second later.

You can manually trigger a refresh through es's restful api or java api, i.e. manually flush the buffer data into the os cache so the data becomes searchable immediately, for example with the call below.
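
For example, with the Python client used earlier:

```python
# Force a refresh: buffered data is flushed into the os cache as a new
# segment and becomes searchable immediately instead of after ~1 second.
es.indices.refresh(index="order_idx")
```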

Once the data enters the os cache, the buffer is emptied: there is no need to keep it there, since the operations have already been persisted to disk in the translog.

3) As long as the data has entered the os cache, that segment file's data can be searched externally.

4) Steps 1 to 3 repeat: new data keeps entering the buffer and translog, and buffer data keeps being written into one new segment file after another. Each refresh empties the buffer but keeps the translog. As this goes on, the translog grows larger and larger, and when it reaches a certain size, a commit operation is triggered.

The buffer's data is fine: it is flushed into the os cache every 1 second and the buffer is then emptied, so the buffer stays small and never fills up the es process's memory.

Every time a piece of data is written to the buffer, a log entry is also written to the translog file, so the translog keeps growing; when it is large enough, the commit operation executes.

5) The first step of the commit operation is to refresh whatever is currently in the buffer into the os cache and clear the buffer.

6) A commit point is written to a disk file, identifying all the segment files that correspond to this commit point.

7) Forcibly fsync all current data in the os cache to the disk file.

What is the translog for? Before a commit executes, data sits either in the buffer or in the os cache; both are memory, and once the machine dies, everything in memory is lost.

Therefore, each operation on the data must also be written to a dedicated log file, the translog. If the machine goes down, then on restart es automatically reads the translog and replays the data back into the in-memory buffer and the os cache.

Commit operation: 1. write the commit point; 2. forcibly fsync the os cache data to disk; 3. clear the translog log file.

8) The existing translog is cleared, a fresh translog is started, and the commit is complete. By default, a commit runs automatically every 30 minutes, but it is also triggered when the translog grows too large. The whole commit process is called the flush operation. We can trigger a flush manually, which flushes all os cache data to the disk files.

In es terms this is not called a commit but a flush: the es flush operation corresponds to the whole commit process above. We can also trigger a flush manually through the es api, fsyncing the os cache data to disk, recording a commit point, and clearing the translog, e.g. with the call below.
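
For example:

```python
# Manually trigger a flush (the full commit process): fsync os cache
# segments to disk, record a commit point, and clear the translog.
es.indices.flush(index="order_idx")
```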

9) The translog itself is also written to the os cache first and, by default, flushed to disk every 5 seconds. So by default, up to 5 seconds of data may live only in memory, in the buffer or the os cache of the translog file; if the machine crashes at that moment, those 5 seconds of data are lost. This trades durability for performance: you lose at most 5 seconds of data. You can instead configure the translog so every write operation is fsynced directly to disk, but performance drops considerably.

Actually, if the interviewer hasn't asked whether es loses data, this is where you can show off a little. You say: es is, first of all, near real-time, data becomes searchable about 1 second after it is written; and second, you may lose data, up to 5 seconds' worth sitting in the buffer, the translog os cache, and the segment file os cache rather than on disk. If the machine goes down at that moment, those 5 seconds of data are lost.

If you cannot afford to lose data, you can set a parameter (check the official docs) so that every write goes into the buffer and is fsynced into the translog on disk at the same time. But this drops write performance and throughput by an order of magnitude: where you could write 2000 records per second, you may now manage only 200. The setting is sketched below.
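
The knob the text alludes to is the index.translog.durability setting; a sketch follows (note that recent es versions already default to "request", so verify against the docs for your version):

```python
# "request": fsync the translog before acknowledging every write - no data
# loss on crash, but much lower write throughput.
# "async": fsync every index.translog.sync_interval (5s by default) - faster,
# but up to that window of writes can be lost.
es.indices.put_settings(
    index="order_idx",
    body={"index.translog.durability": "request"},
)
```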

10) For a delete operation, a .del file is generated at commit time, in which the doc is marked as deleted. At search time, the .del file tells es the doc has been deleted, so it is filtered out.

11) For an update operation, the original doc is marked as deleted and a new piece of data is written.

12) Every buffer refresh produces a segment file, so by default there is one segment file per second, and segment files keep piling up. A merge is therefore executed periodically.

13) Each merge combines multiple segment files into one, physically deleting the docs marked as deleted along the way. The new segment file is written to disk, a commit point is written identifying all the new segment files, the new segment file is opened for searching, and the old segment files are deleted.

The es write path rests on 4 underlying core concepts: refresh, flush, translog, and merge.

When segment files accumulate to a certain level, es automatically triggers a merge, combining multiple segment files into one.
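
You can also trigger a merge by hand via the force merge API:

```python
# Collapse the index's segment files down to one, physically removing the
# docs that were only marked deleted in the .del files.
es.indices.forcemerge(index="order_idx", max_num_segments=1)
```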

4. How to optimize query performance of a distributed search engine at the billion-record scale

1. Interview questions

How can es improve query efficiency when the data volume is large (billions of records)?

2. The interviewer's psychological analysis

This question definitely tests whether you have actually worked with es. Why? Because, frankly, es performance is not as good as you might imagine. Often, when the data volume is large, especially at hundreds of millions of records, you may be dismayed to find a search running for 5 to 10 seconds, which is brutal. The first search may take 5-10 seconds; later ones get faster, maybe a few hundred milliseconds.

And you're left baffled: every user's first query is slow. Isn't that a terrible experience?

So if you've never worked with es, or have only played with a demo on your own, this question can easily stump you, which shows you haven't really put es to work.

3. Analysis of interview questions

To be honest, es performance optimization has no magic parameters (unlike mysql). What does that mean? Don't expect that casually tweaking some parameter will handle every slow-performance scenario. Some scenarios can be fixed by changing a parameter or adjusting the query syntax, but certainly not all of them.

  • The killer of performance optimization: the filesystem cache

The os cache is the operating system's cache. Data written to es is actually written to disk files, and the operating system automatically caches the contents of those disk files in the os cache.

es's search engine relies heavily on the underlying filesystem cache. If you give the filesystem cache more memory, ideally enough to hold all the index segment file data, then searches basically run from memory and performance will be very high.

The performance gap can be huge. From load testing: if a search has to go to disk, it generally takes on the order of seconds: 1 second, 5 seconds, even 10 seconds. If it runs from the filesystem cache, pure memory, performance is generally an order of magnitude higher: milliseconds, from a few to a few hundred.

For example, suppose an es node sits on a machine with 64G of total memory. If you give the es jvm heap 32G, at most 32G is left for the filesystem cache.

If you then have 300G of index data files on disk, can performance be good? The filesystem cache holds only 32G, a tenth of the data; the rest is on disk, so most search operations hit the disk, and performance is certainly poor.

In the final analysis, to make es perform well, in the best case your machines' memory can hold at least half of your total data.

For example, to store 1T of data in es, the combined filesystem cache memory across your machines should be at least 512G. With at least half of searches served from memory, performance is generally acceptable: a few seconds, say 2, 3, or 5.

In the truly best case, store only a small amount of data in es: just the indexes you actually search, sized to the memory reserved for the filesystem cache. If that is 100G, keep the data within 100G; then nearly all searches run from memory and performance is very high, generally under 1 second.

Now you may be thinking again, there is so much data, how can I control its size?

For example, you have rows with 30 fields: id, name, age, and so on. But your searches only use three of them: id, name, and age.

If you naively write every field of each row into es, then 70% of the data is never searched yet occupies filesystem cache space on the es machines. The bigger each document, the less data the filesystem cache can hold.

Write only the few fields used for retrieval into es, for example just id, name, and age, and store the other fields elsewhere, typically mysql. A commonly recommended architecture is es + hbase (hbase handles large data volumes faster than mysql).

Search es by name and age, getting back perhaps 20 doc ids; then query hbase by doc id for the complete row behind each one and return the results to the front end.

Ideally, the data written to es is at most, or only slightly above, the memory capacity of the es filesystem cache. Retrieval from es may then take 20ms, and fetching 20 rows from hbase by the ids es returned maybe another 30ms. Stuff 1T of data into es instead, and each query might take 5-10 seconds; done this way, each query might be around 50ms.

In short: elasticsearch holds a reduced data set, just the few key fields used for search, kept close in size to the filesystem cache of the es machines; all other data, not used for retrieval, goes into hbase or mysql. A sketch of this flow follows.
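
A sketch of that es + hbase flow (the index, fields, and fetch_full_rows are all illustrative; fetch_full_rows stands in for your hbase or mysql batch lookup and is not a real library call):

```python
def search_users(es, name, age):
    # es holds only the few searchable fields, so we only ask it for doc ids.
    resp = es.search(
        index="user_idx",
        body={
            "query": {
                "bool": {
                    "must": [
                        {"match": {"name": name}},
                        {"term": {"age": age}},
                    ]
                }
            },
            "_source": False,  # skip the source; the ids are all we need
        },
    )
    ids = [hit["_id"] for hit in resp["hits"]["hits"]]
    return fetch_full_rows(ids)  # hypothetical batch get from hbase/mysql
```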

  • Data warm-up

Suppose that even following the scheme above, each machine in the es cluster still holds double what its filesystem cache can fit: for example, 60g of data written to a machine whose filesystem cache is 30g, leaving 30g of data on disk.

Take Weibo, for example: you can have your own backend system pull the data of big V accounts, the ones lots of people view, in advance. Every so often, your backend searches this hot data itself, refreshing it into the filesystem cache. When users actually request the hot data later, it is searched straight from memory, which is very fast.

For e-commerce, you can take the most frequently viewed products, such as the iPhone X, as hot data and have a background program proactively query them every minute or so, flushing them into the filesystem cache.

For data you consider relatively hot and frequently accessed, it is best to build a dedicated cache-preheating subsystem: every so often, it accesses the hot data in advance so the data enters the filesystem cache, in the hope that the next real visit performs better. A minimal sketch follows.
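
A minimal preheating sketch (the keyword list, index, and field are assumptions; in practice this would run as a scheduled job):

```python
import time

HOT_KEYWORDS = ["iPhone X"]  # whatever your business considers hot

def preheat_loop(es, interval_seconds=60):
    # Re-run the hot queries periodically so their segment data keeps
    # getting pulled back into the filesystem cache.
    while True:
        for kw in HOT_KEYWORDS:
            es.search(
                index="product_idx",
                body={"query": {"match": {"title": kw}}},
            )
        time.sleep(interval_seconds)
```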

  • Hot and cold separation

Regarding es performance optimization and data splitting: the advice above to move the many unsearched fields into other storage is similar to the vertical split of mysql's sub-database sub-table scheme, discussed later.

es can also do something like mysql's horizontal split: write the large volume of rarely accessed, low-frequency data into one separate index, and the frequently accessed hot data into another.

It's best to write cold data into one index and hot data into another, so that once the hot data is warmed up, it stays in the filesystem os cache as much as possible and isn't flushed out by the cold data.

Look, suppose you have 6 machines and 2 indexes, one for cold data and one for hot data, each index with 3 shards. 3 machines host the hot-data index; the other 3 host the cold-data index.

In this case, most of the time you access the hot-data index, and hot data may be only about 10% of the total volume. That much data can live almost entirely in the filesystem cache, ensuring that hot-data access performance is very high.

Cold data lives in another index, on different machines from the hot-data index, so the two never touch. If someone accesses cold data, a large part of it will be on disk and performance will be poor, but only about 10% of accesses hit cold data while 90% hit hot data. A sketch of routing writes by hotness follows.
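
A minimal sketch of routing writes by hotness (index names and the is_hot predicate are illustrative; in production you would also pin each index's shards to its own machines via shard allocation settings):

```python
def write_order(es, doc, is_hot):
    # Hot and cold docs go to separate indexes, so cold reads can never
    # evict the hot index's segments from the filesystem cache.
    index = "order_hot_idx" if is_hot else "order_cold_idx"
    es.index(index=index, body=doc)
```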

  • Document model design

Look at the following two tables:

Order table:

id order_code total_price
1  test-order 5000

Order item table:

id order_id goods_id purchase_count price
1  1        1        2              2000
2  1        2        5              200

In mysql, a single join query returns both:

select * from order join order_item on order.id=order_item.order_id where order.id=1

1 test-order 5000 1 1 1 2 2000
1 test-order 5000 2 1 2 5 200

How do you handle this in es? Complicated relational queries with complicated query syntax should be avoided in es; once you use them, performance is generally poor. Instead, when writing to es, create two indexes: an order index and an orderItem index.

The order index contains id, order_code, total_price; the orderItem index, when written, already has the join completed: id, order_code, total_price, order_id, goods_id, purchase_count, price.

The java system writing into es completes the association first and writes the already-joined data directly into es, so at search time there is no need to use es search syntax to perform a join. A sketch follows.
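
A sketch of doing that join at write time (index and field names mirror the tables above and are illustrative; Python is used for consistency with the other sketches, though the text describes a java system):

```python
# Rows as the application would read them from mysql.
order = {"id": 1, "order_code": "test-order", "total_price": 5000}
items = [
    {"id": 1, "order_id": 1, "goods_id": 1, "purchase_count": 2, "price": 2000},
    {"id": 2, "order_id": 1, "goods_id": 2, "purchase_count": 5, "price": 200},
]

# One denormalized document per item: the join is already done, so no join
# syntax is ever needed at query time.
for item in items:
    doc = {
        "order_code": order["order_code"],
        "total_price": order["total_price"],
        **item,  # id, order_id, goods_id, purchase_count, price
    }
    es.index(index="order_item_idx", body=doc)
```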

Document model design is very important. Don't plan on performing all kinds of complicated operations at search time; es only supports so many operations well. Don't consider using es for things it doesn't handle well; if such an operation is needed, try to complete it at write time, when the document model is designed. In addition, avoid operations that are too complex, such as join, nested, and parent-child searches, wherever possible; their performance is very poor.

So how do you perform the many complicated operations you do need? Two ideas, for when a search/query must carry out some particularly complex, business-specific logic:

  • When writing data, design the model accordingly: add a few extra fields and write the pre-processed results into those fields
  • Wrap it in your own java program: let es do what es does well, search out the data, then do the rest in the java program. For example, on top of es, we use java to encapsulate some particularly complex operations

  • Paging performance optimization

Pagination in es is rather tricky. Why? For example, with 10 records per page, if you want page 100, the first 1000 records stored on each shard are all fetched to a coordinating node. With 5 shards that is 5000 records; the coordinating node then merges and processes them to extract the final 10 records of page 100.

It is distributed: to get the 10 records on page 100, you can't have each of the 5 shards return just 2 records and let the coordinating node merge them into 10. Each shard has to return its first 1000 records, which are then sorted, filtered, etc. according to your query, and paginated again to get page 100.

The deeper you page, the more data each shard returns and the longer the coordinating node takes to process it. Brutal. That's why, when paging with es, you find it gets slower the further back you go.

With es paging, the first few pages take tens of milliseconds, but by page 10 or a few dozen pages in, it can take 5 to 10 seconds to return a single page of data. How should this be handled?

1) Disallow deep paging (deep paging performance is miserable by default)

Simply don't let users page that deep; tell your pm that, by default, the deeper the page, the worse the performance.

2) Treat it like recommended items in an app, pulled down continuously page by page

Similar to Weibo's feed; you can use the scroll api for this (look up the details yourself).

Scroll generates a snapshot of all the data up front; each page turn then moves a cursor to fetch the next page. Performance is much higher than the page-by-page paging described above.

To solve this, consider scroll. The principle is to keep a snapshot of the data for a certain period, during which you keep sliding backward page by page, like continuously scrolling down a Weibo feed. Scroll keeps fetching the next page through the cursor, and its performance is very high, far better than es's normal paging. A minimal sketch follows.
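
A minimal scroll sketch (the index name is illustrative; sorting by _doc is the cheapest order for scrolling):

```python
# Open a snapshot and get a cursor; keep it alive 2 minutes between calls.
resp = es.search(
    index="order_idx",
    scroll="2m",
    body={"query": {"match_all": {}}, "size": 10, "sort": ["_doc"]},
)

while resp["hits"]["hits"]:
    for hit in resp["hits"]["hits"]:
        print(hit["_id"])
    # Move the cursor one page forward within the same snapshot.
    resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="2m")
```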

The only catch is that this fits scenarios like an endless pull-down feed and cannot jump to an arbitrary page. Also, the scroll keeps a data snapshot for a period of time, so you need to be sure users won't keep paging through it for hours.

No matter how many pages are turned, the performance is basically milliseconds.

Because the scroll api can only page forward one page at a time, you cannot jump around: first page 10, then page 120, then back to page 58. So nowadays many products simply don't let you jump to an arbitrary page; apps, and some websites, only let you scroll down and turn pages one by one.

5. Search engine deployment in the production environment

1. Interview questions

What does your es production cluster's deployment architecture look like? Roughly how much data does each index hold? How many shards does each index have?

2. The interviewer's psychological analysis

This question, like the later ones on redis, mysql sub-database sub-table, and other technologies, is guaranteed to come up in interviews: how did you deploy it in your production environment? Frankly, the question has no technical depth; it simply checks whether you have done this in a real production environment.

Some people have never done it in production, never deployed an es cluster on real machines, never actually operated one, and never imported tens of millions or even hundreds of millions of records into an es cluster, so they won't know the production details of such projects.

If you have only played with a demo and never touched a real es cluster, you might be stumped here, but don't be. Answer this question calmly and matter-of-factly, showing that you have indeed done this.

3. Analysis of interview questions

Actually this question is easy. If you have worked with es, you are bound to know the actual state of your production es cluster: how many machines are deployed? How many indexes? How much data per index? How many shards per index? You must know!

But if you haven't done it, don't fake it badly. It's just a few numbers; memorize them and state them briefly.

  • In our es production cluster, we deployed 5 machines, each 6-core with 64G of memory, so the cluster's total memory is 320G.
  • Our es cluster's daily incremental data is about 20 million records, roughly 500MB per day, which is about 600 million records and 15G per month. The system has been running for several months, and the total data in the es cluster is about 100G.
  • There are currently 5 indexes online (look at your own business to decide which data belongs in es), each holding about 20G of data, so at this volume we allocate 8 shards per index, 3 more than the default of 5.

Probably just say that.
