ES principles

Question 3: With a huge volume of data (billions of documents), how do you improve ES query performance?
How to improve ES search performance in a massive-data scenario like this is also something we learned from our own production environment.

(1) The killer of performance optimization: the filesystem cache (OS cache)
When you write data to ES, it is ultimately written to disk files, and the operating system automatically caches the contents of those disk files in the OS cache.
The ES search engine relies heavily on the underlying filesystem cache. If you give the filesystem cache more memory, ideally enough to hold all of the index segment files, then searches run almost entirely from memory and performance is very high.
The performance gap can be huge. In the stress tests we ran earlier, a search that has to go to disk is almost certainly in the seconds range: 1 second, 5 seconds, 10 seconds. A search served from the filesystem cache, i.e. pure memory, is generally an order of magnitude faster than disk: basically milliseconds, anywhere from a few milliseconds to a few hundred milliseconds.
Someone once asked me why his searches and aggregations, with the inverted index and forward index sitting in disk files, were taking ten-plus seconds.

A real case from a reader

For example: an ES cluster with three machines, each of which seems to have plenty of memory, 64 GB, so 64 * 3 = 192 GB in total.
Each machine gives the ES JVM heap 32 GB, which leaves only 32 GB per machine for the filesystem cache, so the whole cluster has 32 * 3 = 96 GB of filesystem cache.
I asked him: OK, and how much data are you writing into the ES cluster?
Say the index data files on disk across the three machines take up 1 TB in total; then the ES data volume is 1 TB, roughly 300 GB per machine.
Do you think that performance can possibly be good? The filesystem cache has only about 100 GB of memory, so only a tenth of the data fits in memory and the rest sits on disk. Most search operations then hit the disk, and performance is bound to be poor.
Their situation was exactly that: for testing they got three 64 GB physical machines, felt that was pretty good, and assumed they could hold 1 TB of data.

Ultimately, if you want good ES performance, in the best case your machines' memory can hold at least half of the total data volume.
For example, if you want to store 1 TB of data in ES in total, then the filesystem cache memory across your machines should add up to at least 512 GB. With at least half of the searches served from memory, performance is generally a few seconds: two, three, maybe five seconds.
In the best case, based on our own production experience, the strategy is to store in ES only a small amount of data: only the indexes you actually search on. If the memory left for the filesystem cache is 100 GB, then keep the data within 100 GB. That way almost all of your searches run from memory, and performance is very high, generally under 1 second.

For example, say you have one row of data:
id, name, age .... 30 fields in total.
But when you search, you only need to search by three fields: id, name, age.
If you naively write the whole row into ES, most of that data is never used for search; it just takes up filesystem cache space on the ES machines, and the larger each document is, the less data the filesystem cache can hold.
Write into ES only the few fields used for retrieval, for example just id, name and age, and keep the other fields in MySQL; we generally recommend an architecture like ES + HBase.
HBase is suited to online storage of massive data: it can absorb huge write volumes, as long as you don't run complex searches against it, only very simple queries by id or by range.

Search by name and age in ES; the result might be 20 doc ids. Then go to HBase with those ids to query the complete row for each doc id, pull them out, and return them to the front end.
The data you write to ES should preferably be less than or equal to, or only slightly larger than, the filesystem cache memory of the ES machines.
Then retrieval from ES might take 20 ms, and querying the 20 rows from HBase by the ids ES returned might take another 30 ms. The original way, putting the full 1 TB into ES, each query might take 5 to 10 seconds; this way performance is high and each query is around 50 ms.
In elasticsearch, keep the data volume down: store only the few key fields used for search, and try to keep it close to the size of the ES machines' filesystem cache; put the rest of the data, the fields not used for retrieval, into HBase or MySQL.
Some readers have asked me about this before, and I told them the same thing: only put into ES data that will actually be searched. For example, if you have a dataset with 100 fields and only 10 of them are searched on, put those 10 fields into ES and put the remaining 90 fields into MySQL, or HBase on Hadoop.
That way the data in ES is small, just the 10 fields, and it fits in memory; it is used only for search. The search returns some ids, and you then go to MySQL or HBase with those ids to look up the detailed rows.
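Below is a minimal sketch of that two-step lookup, assuming ES runs at localhost:9200 with a made-up user_idx index holding only id/name/age, and the full rows living in a made-up MySQL table user_detail (HBase would work the same way, a get by row key). It uses Java 11+'s built-in HttpClient plus plain JDBC; parsing the doc ids out of the JSON and the MySQL driver on the classpath are left as assumptions.

```java
// Sketch of the ES + detail-store pattern: search ES for ids only, then fetch
// the full rows from MySQL/HBase. Index name, table name and credentials are
// made-up placeholders for illustration.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class EsThenDetailLookup {
    public static void main(String[] args) throws Exception {
        // Step 1: query ES, which only holds the searchable fields (id, name, age).
        String query = "{\"query\":{\"bool\":{\"must\":["
            + "{\"match\":{\"name\":\"alice\"}},"
            + "{\"range\":{\"age\":{\"gte\":30}}}]}},"
            + "\"_source\":[\"id\"],\"size\":20}";
        HttpRequest searchReq = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/user_idx/_search"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(query))
            .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
            .send(searchReq, HttpResponse.BodyHandlers.ofString());
        System.out.println("ES hits (take the doc ids from this JSON): " + resp.body());

        // Step 2: fetch the remaining ~27 fields from the detail store by id.
        // Requires the MySQL JDBC driver on the classpath.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/appdb", "app", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT * FROM user_detail WHERE id = ?")) {
            ps.setLong(1, 42L); // one of the ids returned by ES
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println("full row for id " + rs.getLong("id"));
                }
            }
        }
    }
}
```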

(2) Data preheating
Even if you follow the scheme above, the amount of data written to each machine in the ES cluster may still be, say, double the filesystem cache: for example you write 60 GB of data to a machine whose filesystem cache is 30 GB, so 30 GB of data is still left on disk.
Take Weibo as an example: for big-V accounts, data that lots of people look at, you can build a system in your own backend that, every so often, searches that hot data itself and pushes it into the filesystem cache; later, when real users come to look at that hot data, it is served straight from memory, and it is fast.
Or e-commerce: for the products that are viewed most, say the iPhone 8, a background program can proactively query the hot data, say once a minute, to keep it flushed into the filesystem cache.
For data that you know is hot and is accessed often, it is best to build a dedicated cache-warming subsystem: every once in a while it accesses the hot data in advance so that the data ends up in the filesystem cache. The next time someone actually requests it, performance will certainly be better.
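As a sketch, such a warming subsystem can be as simple as a scheduled task that re-runs the known hot queries. Everything below is an assumption for illustration: ES at localhost:9200, a made-up products index, and a one-minute interval.

```java
// Minimal cache-warming sketch: periodically re-run hot queries so the segments
// backing them stay in the OS filesystem cache. The response body is ignored;
// only the side effect of touching the data matters.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HotDataWarmer {
    private static final HttpClient HTTP = HttpClient.newHttpClient();
    // Queries for data we already know is hot (e.g. best-selling products).
    private static final List<String> HOT_QUERIES = List.of(
        "{\"query\":{\"match\":{\"title\":\"iphone 8\"}},\"size\":50}"
    );

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(HotDataWarmer::warm, 0, 1, TimeUnit.MINUTES);
    }

    private static void warm() {
        for (String q : HOT_QUERIES) {
            try {
                HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/products/_search"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(q))
                    .build();
                HTTP.send(req, HttpResponse.BodyHandlers.discarding());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```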

(3) Hot/cold separation
Still on ES performance optimization and data splitting: storing the many fields that are not searched somewhere else, as described above, is similar to the vertical splitting of MySQL tables I covered earlier.
ES can also do something like MySQL's horizontal splitting: write the data that is accessed rarely, at very low frequency, into one separate index, and write the frequently accessed hot data into another index.
You'd best write the cold data into one index and the hot data into another index, so that after the hot data has been warmed up it stays in the filesystem (OS) cache as much as possible and is not washed out by the cold data.
Look: suppose you have six machines and two indexes, one for cold data and one for hot data, each index with 3 shards.
3 machines hold the hot index; the other 3 machines hold the cold index.
In that case, most of the time you are accessing the hot index, and the hot data is maybe 10% of the total; the volume is small and almost all of it stays in the filesystem cache, so hot-data access is guaranteed to be high-performance.
The cold data is in the other index, on different machines from the hot-index data, so the two don't interfere with each other. If someone accesses cold data, a lot of it will be on disk and performance may be poor; but only 10% of the traffic hits the cold data while 90% hits the hot data.
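A minimal sketch of the two-index split, assuming ES at localhost:9200 and made-up index names orders_hot / orders_cold. The index.routing.allocation.require.box_type setting is optional and only takes effect if the hot and cold nodes were started with a matching node.attr.box_type attribute; without it, the split is simply two separate indexes.

```java
// Sketch: create a hot index and a cold index, optionally pinning each one to
// nodes tagged with node.attr.box_type=hot / cold via allocation filtering.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HotColdIndexSetup {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        createIndex(http, "orders_hot", "hot");
        createIndex(http, "orders_cold", "cold");
    }

    private static void createIndex(HttpClient http, String name, String boxType) throws Exception {
        String settings = "{\"settings\":{"
            + "\"number_of_shards\":3,"
            + "\"number_of_replicas\":1,"
            + "\"index.routing.allocation.require.box_type\":\"" + boxType + "\"}}";
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/" + name))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(settings))
            .build();
        HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(name + " -> " + resp.body());
    }
}
```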

(4) Document model design
Many people ask me about this. Say in MySQL you have two tables.
Order table: id, order_code, total_price
1 | test order | 5000

Order item table: id, order_id, goods_id, purchase_count, price
1 | 1 | 1 | 2 | 2000
2 | 1 | 2 | 5 | 200

In MySQL you would run: SELECT * FROM order JOIN order_item ON order.id = order_item.order_id WHERE order.id = 1
1 | test order | 5000 | 1 | 1 | 1 | 2 | 2000
1 | test order | 5000 | 2 | 1 | 2 | 5 | 200
So how do you play this in ES? ES does have syntax for complex association queries, but try not to use it; once you do, performance is generally not great.
Instead, design the data model in ES.

When writing to ES, use two indexes: an order index and an orderItem index.
The order index contains id, order_code, total_price.
When the orderItem index is written, the join has already been done, so each document contains id, order_code, total_price plus order_id, goods_id, purchase_count, price.

Your Java system that writes to ES completes the association; the data written to ES is already joined, so at search time there is no need to use ES search syntax to perform the join.
Document model design is very important. Don't plan on doing all kinds of messy, complicated operations only at search time. ES supports only so many operations; don't even think about using ES for the ones it handles badly. If you really need that kind of operation, complete it as far as possible at write time, when designing the document model. Also, try to avoid very complex operations such as join, nested and parent-child searches; their performance is poor.
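Below is a minimal sketch of this write-time join, assuming ES at localhost:9200 and made-up index names order_idx / order_item_idx; it only illustrates the pattern of copying the order fields onto every order-item document before indexing, so no join is needed at query time.

```java
// Sketch: denormalize at write time. Each order-item document already carries
// the parent order's fields (order_code, total_price).
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DenormalizedOrderWriter {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // The parent order document.
        indexDoc("order_idx", "1",
            "{\"order_code\":\"test order\",\"total_price\":5000}");
        // Each order item: the "join" happens here, at write time.
        indexDoc("order_item_idx", "1",
            "{\"order_id\":1,\"order_code\":\"test order\",\"total_price\":5000,"
            + "\"goods_id\":1,\"purchase_count\":2,\"price\":2000}");
        indexDoc("order_item_idx", "2",
            "{\"order_id\":1,\"order_code\":\"test order\",\"total_price\":5000,"
            + "\"goods_id\":2,\"purchase_count\":5,\"price\":200}");
    }

    private static void indexDoc(String index, String id, String json) throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/" + index + "/_doc/" + id))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(json))
            .build();
        HttpResponse<String> resp = HTTP.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(index + "/" + id + " -> " + resp.statusCode());
    }
}
```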

Many people also ask me how to perform all kinds of messy, complicated operations.
Two ideas; for especially complicated, strongly business-specific operations at search/query time:
1) At write time, design the model well: add a few extra fields and write the pre-processed data into them.
2) Wrap it in your own Java program: let ES do what ES is good at, search the data out, and do the rest in the Java program. For example, as mentioned, on top of ES, wrap some particularly complex operations in your own Java code.

(5) Paging performance optimization
Paging in ES is a bit of a trap. Why? Here's an example. If you show 10 items per page and you now want to query page 100, the first 1000 items stored on every shard actually have to be fetched onto a coordinating node. If you have 5 shards, that's 5000 items; the coordinating node then merges and processes those 5000 items to finally get the 10 items of page 100.
It's distributed. To get the 10 items on page 100, you can't just fetch 2 items from each of the 5 shards and have the coordinating node merge them into 10, can you? Every shard has to return all of its first 1000 items; the coordinating node then sorts and filters them according to your request, and pages again at the end to get to page 100.
The deeper you page, the more data each shard returns and the longer the coordinating node takes to process it. Really painful. So when you page with ES, you find that the further back you go, the slower it gets.
We ran into this before as well. Using ES for paging, the first few pages take tens of milliseconds; by page 10 or a few dozen pages in, it basically takes 5 to 10 seconds to get a single page of data.

1) Don't allow deep paging (the default deep-paging performance is terrible)
Tell your product manager that the system simply doesn't let users page that deep; by default, the deeper the paging, the worse the performance.

2) Feeds that keep pulling down page after page, like in an app
Like Weibo: pulling down to load the feed, page after page. You can use the scroll API; look it up yourself.
Scroll generates a snapshot of all the data for you up front, and each page is then fetched by moving a cursor to the next batch, which performs far better than the from/size paging described above.
For this problem you can consider scroll. The principle is that scroll keeps a snapshot of the data for a certain window of time; as you keep scrolling backwards, like constantly pulling down to refresh while browsing Weibo, scroll uses a cursor to fetch the next batch each time, which is very fast, much better than regular ES paging.
The only constraint is that it suits pull-down, page-by-page scenarios like Weibo; you cannot jump to an arbitrary page. And since scroll holds a data snapshot for a period of time, you need to make sure users don't keep paging through it for hours on end.
No matter how many pages you scroll through, performance is basically milliseconds.
Because the scroll API only moves forward one page at a time, you cannot open page 10 first, then jump to page 120, then back to page 58; random jumps are impossible. So many products today don't let you page arbitrarily: in some apps and sites you can only pull down and load one page after another.
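A minimal scroll sketch, assuming ES at localhost:9200 and a made-up tweets index: the first request opens a scroll context, and each follow-up request only moves the cursor to the next batch instead of re-sorting everything. Parsing the _scroll_id out of the response is left to a JSON library.

```java
// Sketch of scroll-based "next page only" paging.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ScrollPager {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Open a scroll context kept alive for 1 minute, 10 docs per page.
        HttpRequest first = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/tweets/_search?scroll=1m"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(
                "{\"size\":10,\"query\":{\"match_all\":{}}}"))
            .build();
        String body = http.send(first, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println("first page: " + body);

        // The response contains a _scroll_id; a real program would parse it out.
        String scrollId = "PASTE_SCROLL_ID_FROM_RESPONSE";
        HttpRequest next = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_search/scroll"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(
                "{\"scroll\":\"1m\",\"scroll_id\":\"" + scrollId + "\"}"))
            .build();
        System.out.println("next page: "
            + http.send(next, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```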


Question 4: What does your production ES cluster deployment look like? Roughly how much data per index? Roughly how many shards per index?
If you have really worked with ES, you certainly know the actual state of your production ES cluster: how many machines it is deployed on, how many indexes there are, how much data each index holds, how many shards each index has. You would know for sure!
But if you genuinely haven't done it, don't fake it; here is a basic version you can describe briefly:
(1) Our production ES cluster is deployed on 5 machines, each with 6 cores and 64 GB of memory, so the cluster has 320 GB of memory in total.
(2) Our ES cluster's daily incremental data is roughly 20 million documents, about 500 MB per day; that's roughly 600 million documents and 15 GB per month. The system has been running for several months now, and the total data in the ES cluster is around 100 GB.
(3) There are currently 5 indexes in production (look at your own business and decide for yourself which data makes sense in ES), with roughly 20 GB of data per index; for that volume we give each index 8 shards, 3 more than the default of 5.

 

Question 1: Can you talk about the distributed architecture of ES (how does ES implement distribution)?
elasticsearch was designed as a distributed search engine, but the bottom layer is still lucene.
The core idea is to start multiple ES process instances on multiple machines, which form an ES cluster.
The basic unit for storing data in ES is the index. For example, if you want to store some order data in ES, you create an index in ES, say order_idx, and all the order data is written into that index. An index is roughly equivalent to a table in MySQL. index -> type -> mapping -> document -> field.
index: like a table in MySQL.
type: has no exact counterpart in MySQL. One index can have multiple types, and the fields of the types are mostly similar but with slight differences.
For example, say there is an order index, dedicated to order data, like a table you built in MySQL. Some orders are for physical goods, say a piece of clothing or a pair of shoes; some orders are for virtual goods, say game cards or prepaid top-ups. Most of the fields of these orders are the same, but a small number of fields differ slightly.
So in the order index you would create two types: one for physical-goods orders and one for virtual-goods orders; the two types share most of their fields and differ in a few.
In many cases an index has only one type, but if an index really does have multiple types, you can think of the index as a category of tables and each specific type as one concrete table in MySQL.
Each type has a mapping. If you think of a type as a concrete table, and the index as a category grouping several types of the same kind, then the mapping is the table-structure definition of that type. When you create a table in MySQL you certainly define its structure: which fields it has, and what type each field is.
The mapping defines the structure of the type: the name of each field, the field's data type, and the field's various configurations.
A piece of data you actually write into a type of an index is called a document. One document corresponds to one row in a MySQL table; each document has multiple fields, and each field holds the value of one field of that document.
An index can then be split into multiple shards, each shard storing part of the data.
Each shard's data actually has multiple copies: every shard has a primary shard, which is responsible for writes, plus several replica shards. After data is written to the primary shard, it is synchronized to the replica shards.
With this replica scheme, every shard's data has several backups. If one machine goes down, it doesn't matter: there are copies of the data on other machines. That's high availability.
Among the nodes of an ES cluster, one node is automatically elected as the master node. The master node actually does administrative work, such as maintaining index metadata and switching the primary/replica identity of shards, and so on.
If the master node goes down, another node is elected as the new master.
If a non-master node goes down, the master node transfers the primary-shard identity of the lost primary shards to replica shards on other machines. If you then repair the downed machine and restart it, the master node assigns the missing replica shards back to it and synchronizes the changes made in the meantime, so the cluster returns to normal.
That, in outline, is the basic architectural design of elasticsearch as a distributed search engine.
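As a sketch of the shard/replica setup described above, assuming ES 7.x+ (typeless mappings) at localhost:9200 and a made-up order_idx index: 3 primary shards, each with 1 replica, so losing one machine still leaves a complete copy of every shard.

```java
// Sketch: create an index with explicit primary shards and replicas.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateOrderIndex {
    public static void main(String[] args) throws Exception {
        String body = "{\"settings\":{\"number_of_shards\":3,\"number_of_replicas\":1},"
            + "\"mappings\":{\"properties\":{"
            + "\"order_code\":{\"type\":\"keyword\"},"
            + "\"total_price\":{\"type\":\"long\"}}}}";
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/order_idx"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
    }
}
```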


Question 2: How does ES write data? How does ES query data?
(1) The ES write path

1) The client picks a node and sends the request to it; that node is the coordinating node.
2) The coordinating node routes the document and forwards the request to the corresponding node (the one holding the primary shard).
3) The primary shard on that node actually processes the request, then synchronizes the data to the replica nodes.
4) Once the coordinating node sees that the primary node and all replica nodes are done, it returns the response to the client.

(2) The ES read path

For a query, i.e. GET-ing a piece of data: when a document is written, ES automatically assigns it a globally unique id, the doc id, and also hash-routes it to the corresponding primary shard based on that doc id. You can also specify the doc id manually, for example using an order id or a user id.

When you query by doc id, ES hashes the doc id, works out which shard that doc id was assigned to, and queries that shard.

1) The client sends the request to any node, which becomes the coordinating node.
2) The coordinating node routes the document and forwards the request to the corresponding node; at this point it uses a round-robin random polling algorithm, picking one of the primary shard and all its replicas at random, to load-balance read requests.
3) The node that receives the request returns the document to the coordinating node.
4) The coordinating node returns the document to the client.
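A minimal sketch of the write and read-by-id paths, assuming ES at localhost:9200 and the made-up order_idx index from before; the explicit id in the URL is the doc id that gets hashed to pick the shard.

```java
// Sketch: index a document with an explicit id, then GET it back by that id.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WriteThenGet {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Write: goes through the coordinating node to the primary shard, then replicas.
        HttpRequest write = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/order_idx/_doc/1"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(
                "{\"order_code\":\"test order\",\"total_price\":5000}"))
            .build();
        System.out.println(http.send(write, HttpResponse.BodyHandlers.ofString()).body());

        // Read by doc id: the hash of "1" routes the GET to the shard holding the doc
        // (the primary or one of its replicas).
        HttpRequest get = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/order_idx/_doc/1"))
            .GET()
            .build();
        System.out.println(http.send(get, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```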

(3) The ES search path

The most powerful thing ES does is full-text search. Say you have three pieces of data:

"java is really fun"
"java is so hard to learn"
"j2ee is especially awesome"

If you search for the keyword java, the documents containing java are found.

ES returns: "java is really fun" and "java is so hard to learn".

1) The client sends the request to a coordinating node.
2) The coordinating node forwards the search request to every shard of the index, to the primary shard or any of its replica shards.
3) Query phase: each shard runs the search locally and returns its results (really just some doc ids) to the coordinating node, which merges, sorts, pages and so on to produce the final result.
4) Fetch phase: the coordinating node then pulls the actual document data from the nodes by doc id and finally returns it to the client.
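A minimal sketch of such a full-text search, assuming ES at localhost:9200 and a made-up articles index with a text field named content; the coordinating node fans this query out in the query phase and pulls the matching docs in the fetch phase.

```java
// Sketch: a match query on a text field; the two documents containing "java"
// are expected back in the hits.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FullTextSearch {
    public static void main(String[] args) throws Exception {
        String query = "{\"query\":{\"match\":{\"content\":\"java\"}}}";
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/articles/_search"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(query))
            .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
    }
}
```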

(4) The underlying principle of search: the inverted index. The link below illustrates, with diagrams, the difference between a traditional database and an inverted index.
https://www.jianshu.com/p/3abaa0083bac


(5) The underlying principle of writing data
1) Data is first written into a buffer; while it is only in the buffer it cannot be searched; at the same time the data is appended to the translog log file.
2) When the buffer is nearly full, or after a certain time, the buffer's data is refreshed into a new segment file. At this point the data does not go straight into an on-disk segment file; it first goes into the OS cache. This process is the refresh.
Every second, ES writes the data in the buffer into a new segment file, so every second a new segment file is created, containing the data written to the buffer during the previous second.
Of course, if the buffer holds no data at that moment, no refresh happens; ES doesn't create an empty segment file every second. If the buffer does hold data, a refresh runs by default once per second, flushing it into a new segment file.
In the operating system, disk files have something called the OS cache, the operating-system cache: before data is written to a disk file it first enters the OS cache, an operating-system-level memory cache.
As soon as the data in the buffer has been refreshed into the OS cache, it becomes searchable.
Why is ES called near real-time? NRT: near real-time. The default refresh interval is one second, so ES is near real-time: data can only be seen one second after it is written.
Through ES's RESTful API or Java API you can also trigger a refresh manually, flushing the buffer into the OS cache so the data becomes searchable immediately.
Once the data is in the OS cache, the buffer is cleared; the buffer no longer needs to keep the data, because a copy has already been persisted to disk in the translog.
3) As soon as the data is in the OS cache, that segment file's data can be served to searches.
4) Repeat steps 1 to 3: new data keeps entering the buffer and the translog, and keeps being written into one new segment file after another; each refresh empties the buffer while the translog is kept. As this goes on, the translog gets bigger and bigger, and when it reaches a certain size, the commit operation is triggered.
The buffer data being flushed into the OS cache every second and the buffer being emptied each time is fine: the buffer stays small, so it can never fill up the ES process's memory.
Every time a piece of data goes into the buffer it is also appended to the translog log file, so the translog keeps growing; once the translog reaches a certain size, the commit operation runs.
5) The first step of the commit is to refresh whatever is currently in the buffer into the OS cache and empty the buffer.
6) A commit point is written to a disk file; it identifies all the segment files corresponding to this commit point.
7) All current data in the OS cache is forcibly fsync'ed to the disk files.
What is the translog for? Before the commit runs, data sits either in the buffer or in the OS cache; both are memory, and if the machine dies, all of that in-memory data is lost.
So the write operations have to be recorded in a dedicated log file, the translog. If the machine goes down, then on restart ES automatically reads the translog and restores the data into the in-memory buffer and OS cache.
commit: 1. write the commit point; 2. forcibly fsync the OS cache data down to disk; 3. clear the translog log file.
8) The existing translog is cleared, a new translog is started, and the commit is complete. By default a commit runs automatically every 30 minutes, but it is also triggered whenever the translog grows too large. The whole commit process is called the flush operation. You can also trigger flush manually; it flushes all OS cache data to the disk files.
ES doesn't call it the commit operation but the flush operation; the ES flush corresponds to the whole commit process. You can also use the ES API to flush manually: forcibly fsync the OS cache data to disk, record a commit point, and empty the translog.
9) The translog itself is actually written to the OS cache first and flushed to disk every 5 seconds by default. So by default, up to 5 seconds of data may live only in the buffer or in the translog's OS cache, and if the machine dies at that moment, those 5 seconds of data are lost. But performance is better this way, and you lose at most 5 seconds of data. The translog can also be configured to fsync directly to disk on every write operation, but performance becomes much worse.
Actually, even if the interviewer doesn't ask whether ES loses data, you can show off a bit here: you can say that ES is, first, near real-time, data becomes searchable one second after it is written; and second, it can lose data, up to 5 seconds of it, sitting in the buffer, the translog OS cache and the segment-file OS cache but not yet on disk; if the machine goes down at that point, those 5 seconds of data are gone.
If you absolutely must not lose data, you can set a parameter for that (check the official docs): every write then goes into the buffer and is simultaneously fsynced to the translog on disk, but write performance and throughput can drop by an order of magnitude; where you could write 2000 per second before, you might only manage 200 per second. (A settings sketch follows this list.)
10) For a delete operation, a .del file is generated at commit time in which the doc is marked as deleted; at search time, the .del file tells ES the doc has been deleted.
11) For an update operation, the original doc is marked as deleted and a new piece of data is written.
12) Every buffer refresh produces a segment file, so by default there is a new segment file every second; segment files pile up, so merges are performed periodically.
13) Each merge combines multiple segment files into one, physically removing the docs marked as deleted in the process; the new segment file is written to disk, a commit point is written identifying all the new segment files, the new segment file is opened for search, and the old segment files are deleted.
The ES write path has four core underlying concepts: refresh, flush, translog, merge.
When the number of segment files grows to a certain point, ES automatically triggers a merge, combining multiple segment files into one.
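As referenced in item 9 above, here is a sketch of the two index settings that control this behavior, assuming ES at localhost:9200 and the made-up order_idx index: index.refresh_interval drives the near-real-time refresh window, and index.translog.durability set to "request" fsyncs the translog on every write at the cost of throughput ("async" restores the default ~5-second flushing).

```java
// Sketch: adjust refresh interval and translog durability on an existing index.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DurabilitySettings {
    public static void main(String[] args) throws Exception {
        String settings = "{"
            + "\"index.refresh_interval\":\"1s\","
            + "\"index.translog.durability\":\"request\"}";
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/order_idx/_settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(settings))
            .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
    }
}
```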

 


Origin: www.cnblogs.com/muzinan110/p/11105731.html