Let Elasticsearch Fly: Performance Optimization in Practice

0. Preface

The ultimate goal of Elasticsearch performance optimization: a user experience that feels delightful.

On the definition of "delight": the well-known product strategist Liang Ning once said, "When a person's expectation is met, that state is pleasure; when it is not met, the person feels uncomfortable and begins to search; if what they are looking for arrives and gratification is instant, that feeling is delight."

For Elasticsearch, delight means: fast, accurate, and complete results!

On Elasticsearch performance optimization, Alibaba, Tencent, JD, Ctrip, Didi, 58.com, and others have published a great deal of in-depth practical work, all well worth referencing. This article offers one line of thought: a discussion of performance optimization aimed at making Elasticsearch feel fast.

1. Cluster planning and optimization practice

1.1 Plan the cluster based on the target data volume

In the early stage of a business, the common questions are: how many nodes does the cluster need, how much memory and CPU, and is SSD necessary?

The main point to consider: what is your target data volume? From the target data volume you can work backwards to the number of nodes.

1.2 Reserve buffer capacity

Note: Elasticsearch has three disk-usage warning watermarks: 85%, 90%, and 95%.

Each watermark calls for a different emergency response strategy.

Keep this in mind for disk capacity planning and hardware selection; staying below 85% usage is reasonable.

Of course, the thresholds can also be adjusted via configuration.
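The three watermarks map to cluster settings documented by Elasticsearch. A minimal sketch of the request body for adjusting them (the percentages shown are the defaults; any change should be deliberate):

```python
import json

# Transient cluster settings adjusting the three disk watermarks.
# 85% / 90% / 95% are Elasticsearch's defaults; values are illustrative.
watermark_settings = {
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
    }
}

# Serialized body for: PUT /_cluster/settings
body = json.dumps(watermark_settings, indent=2)
print(body)
```

Crossing the low watermark stops new shard allocation to the node; the high watermark triggers shard relocation away; the flood stage makes affected indices read-only.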

1.3 Do not co-locate ES cluster nodes with other services on the same machine

Unless memory is very large.

Example: on an ordinary server running ES + MySQL + Redis together, once business data grows large, out-of-memory and other problems are bound to appear.

1.4 Prefer SSD disks

The official Elasticsearch documentation naturally recommends SSDs, cost permitting; the choice has to fit the business scenario.

If the business has high requirements on write and retrieval throughput, SSDs are recommended.

In Alibaba's business scenarios, SSDs delivered roughly a fivefold speedup over mechanical disks.

Results will, of course, vary from one business scenario to another.

1.5 Configure memory sensibly

Official recommendation: heap size = min(32 GB, installed RAM / 2).

Medcl and Wood (well-known community experts) have both made it clear that pushing the heap all the way to 31-32 GB is unnecessary; their recommendation: 26 GB for hot-data nodes, 31 GB for cold-data nodes.

There is no hard requirement on total memory, but the more memory, the better the retrieval performance.

A reference point from experience: for 200 GB+ of incremental data per day, each server should have at least 64 GB of RAM.

Also leave sufficient memory outside the JVM heap (for the OS page cache), or OOM will be frequent.
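The official rule of thumb above can be sketched as a tiny helper (the 32 GB ceiling exists so the JVM keeps compressed object pointers enabled):

```python
def recommended_heap_gb(installed_ram_gb):
    """Official rule of thumb: heap = min(32 GB, RAM / 2).

    Staying at or below ~32 GB keeps JVM compressed object
    pointers (compressed oops) enabled; the other half of RAM
    is left to the OS page cache, which Lucene relies on.
    """
    return min(32, installed_ram_gb // 2)

# The 64 GB server from the experience note above:
print(recommended_heap_gb(64))
```

So a 64 GB machine lands exactly at the 32 GB heap ceiling, with the remaining 32 GB feeding the page cache.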

1.6 Do not skimp on CPU cores

The number of CPU cores determines ES thread pool sizing, which in turn affects both write and retrieval performance.

Recommendation: 16+ cores.

1.7 For ultra-large scenarios, consider cross-cluster search

Unless the business is on a very large scale, such as Didi's or Ctrip's PB+ scenarios, cross-cluster search is basically unnecessary.

1.8 An odd number of cluster nodes is not required

ES handles cluster communication internally; its distributed deployment mechanism is not based on ZooKeeper, so an odd node count is unnecessary.

However, discovery.zen.minimum_master_nodes must be set to (number of master-eligible nodes / 2) + 1 to effectively avoid split brain.
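The quorum formula is integer division plus one; a one-liner makes the arithmetic unambiguous:

```python
def minimum_master_nodes(master_eligible):
    """discovery.zen.minimum_master_nodes = (master-eligible nodes / 2) + 1.

    With this quorum, two disconnected halves of the cluster can
    never both elect a master, which prevents split brain.
    """
    return master_eligible // 2 + 1

# Typical small clusters:
for n in (3, 5, 7):
    print(n, "->", minimum_master_nodes(n))
```

Note that with 4 master-eligible nodes the quorum is 3, so an even node count sacrifices one node's worth of fault tolerance for nothing; this is why odd counts of *master-eligible* nodes are still the usual practice, even though ES itself does not require them.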

1.9 Optimize the assignment of node roles

Cluster of <= 3 nodes: recommend making every node master: true, data: true, so each node serves as master-eligible node, data node, and routing node at once. Cluster of > 3 nodes: as the business scenario requires, gradually split out dedicated master nodes and coordinating/routing nodes.

1.10 Separate hot and cold data

Store hot data on SSDs and ordinary historical data on mechanical disks, improving retrieval efficiency at the physical level.

2. Index optimization practice

MySQL and other relational databases need sharding across databases and tables; Elasticsearch indices likewise have to be planned carefully.

2.1 How many indices?

Recommendation: organize storage by business scenario.

Store different kinds of data in separate per-channel indices. Example: data collected from Zhihu goes into a Zhihu index; data collected from apps goes into an app index.

2.2 How large should shards be?

Recommendation: size by data volume.

Rule of thumb: each shard should not exceed 30 GB.

2.3 How many shards?

Recommendation: size by the number of cluster nodes; the number of shards should be >= the number of cluster nodes.

For a 5-node cluster, 5 shards is reasonable.

Note: the number of primary shards cannot be changed after index creation, short of a reindex operation.
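For the 5-node cluster above, a minimal sketch of the index-creation body (the index name is hypothetical):

```python
import json

# Settings for a 5-node cluster: 5 primary shards (>= node count),
# each expected to stay under the ~30 GB rule of thumb.
index_settings = {
    "settings": {
        "number_of_shards": 5,    # fixed after creation, barring a reindex
        "number_of_replicas": 1,  # can be changed at any time (see 2.4)
    }
}

# Serialized body for: PUT /my_index
print(json.dumps(index_settings, indent=2))
```

The asymmetry in the comments is the point of sections 2.3 and 2.4: shard count is a one-way door, replica count is not.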

2.4 How many replicas?

Unless the system has exceptionally high robustness requirements, such as a banking system, in which case two or more replicas may be considered,

one replica is sufficient.

Note: the number of replicas can be changed at any time via the settings API.

2.5 Do not create more than one type under an index

Even on a 5.x version, plan ahead for compatibility with future versions.

Recommendation: one index, one type. In 6.x use the default _doc; in 5.x simply standardize on doc.

2.6 Plan indices by date

As business volume grows, the ballooning data volume of a single index becomes a prominent problem.

Planning indices by date is the inevitable choice.

Benefit 1: historical data can be deleted in seconds, simply by dropping the old indices. Note: deleting within a single index requires delete_by_query + force_merge, which is slow and incomplete.

Benefit 2: hot and cold data are easy to manage separately; retrieving the last few days' data hits only the corresponding date-stamped indices, which is physically fast!

Operational reference: use index templates together with the rollover API.
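A sketch of a rollover request body, assuming an index template is already in place and a write alias named logs_write points at the current index (both names are hypothetical):

```python
import json

# Conditions under which the write alias rolls over to a fresh index.
rollover_body = {
    "conditions": {
        "max_age": "1d",     # roll daily, matching the by-date plan above
        "max_size": "30gb",  # or when the index nears the per-shard guideline
    }
}

# Serialized body for: POST /logs_write/_rollover
print(json.dumps(rollover_body, indent=2))
```

Writers keep targeting the alias; rollover swaps the concrete index underneath without any client change.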

2.7 Always use aliases

ES cannot rename an index the way MySQL renames a table. Aliases are the flexible alternative.
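A sketch of the atomic alias swap that substitutes for renaming (index and alias names here are hypothetical):

```python
import json

# Atomically repoint an alias from an old index to a new one;
# clients that query "logs_current" never notice the switch.
alias_actions = {
    "actions": [
        {"remove": {"index": "logs-old", "alias": "logs_current"}},
        {"add":    {"index": "logs-new", "alias": "logs_current"}},
    ]
}

# Serialized body for: POST /_aliases
print(json.dumps(alias_actions, indent=2))
```

Because both actions execute in one request, there is no window in which the alias points at nothing.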

3. Data model optimization practice

3.1 Do not use the default mapping

The default mapping comes from automatic field-type detection, under which string fields default to both text and keyword. If the business does not need analyzed full-text retrieval and only exact matching is required, set the field to keyword only.

Choose appropriate types for the business; this saves space and improves precision, for example when selecting among the floating-point types.
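A sketch of an explicit mapping illustrating these choices (7.x-style typeless mapping; field names are hypothetical, and ik_max_word assumes the ik plugin from section 3.3 is installed):

```python
import json

# Explicit mapping instead of dynamic defaults:
# - zipcode: exact match only, so keyword with no text sub-field
# - title:   analyzed full text via the ik analyzer
# - price:   scaled_float stores a long internally, saving space
mapping = {
    "mappings": {
        "properties": {
            "zipcode": {"type": "keyword"},
            "title":   {"type": "text", "analyzer": "ik_max_word"},
            "price":   {"type": "scaled_float", "scaling_factor": 100},
        }
    }
}

# Serialized body for: PUT /goods
print(json.dumps(mapping, indent=2))
```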

3.2 Field-type selection process for mappings

[The original post shows a flowchart for choosing field types here; the image is not reproduced.]

3.3 Choose the right analyzer

Common open-source Chinese analyzers include: the ik analyzer, the ansj analyzer, HanLP, jieba, and the Hailiang analyzer; search for "The most complete comparison of Elasticsearch analyzers and how to use them" to see them compared.

If you choose ik, ik_max_word is recommended, because its fine-grained tokenization essentially contains the coarse-grained results of ik_smart.

3.4 date, long, or keyword?

Decide by business need: if you must do timeline-based analysis, use the date type;

if timestamps only ever need to be returned or matched at second granularity, keyword is recommended.

4. Write optimization practice

4.1 Is second-level visibility required?

The essence of Elasticsearch's near-real-time search: newly written data becomes searchable after 1 s at the earliest.

If refresh_interval is set to 1 s, a large number of segments is inevitably generated, and retrieval performance suffers.

For non-real-time bulk-import scenarios, it can therefore be set to 30 s, or even -1.

4.2 Reduce replicas to improve write performance

Before a bulk write, set the number of replicas to 0;

after the write completes, restore the number of replicas to its original value.
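The tips from 4.1 and 4.2 combine into a pair of settings bodies bracketing a bulk load, a common pattern; here is a sketch (the restored values assume the index originally had 1 replica and a 30 s refresh):

```python
import json

# Before a large bulk load: drop replicas and pause refresh entirely.
before_load = {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}

# After the load: restore the original values (assumed here).
after_load = {"index": {"number_of_replicas": 1, "refresh_interval": "30s"}}

# Each is a serialized body for: PUT /my_index/_settings
for body in (before_load, after_load):
    print(json.dumps(body))
```

With 0 replicas every document is indexed once instead of twice, and with refresh paused no segments are published mid-load; restoring the settings afterwards rebuilds replicas in bulk.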

4.3 Write in batches, never one document at a time

Use the bulk API; tune the batch size in combination with the machine's queue sizes, the number of CPU cores, and the thread pool sizes.
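The bulk API takes newline-delimited JSON: an action line, then a source line per document, ending with a trailing newline. A minimal sketch (index name and documents are made up):

```python
import json

# Documents to index in one round trip.
docs = [
    {"_id": "1", "title": "elasticsearch performance"},
    {"_id": "2", "title": "bulk writing"},
]

# Build the NDJSON payload: one action line + one source line per doc.
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": "articles", "_id": doc["_id"]}}))
    lines.append(json.dumps({"title": doc["title"]}))
ndjson_body = "\n".join(lines) + "\n"  # bulk bodies must end with a newline

# Serialized body for: POST /_bulk  (Content-Type: application/x-ndjson)
print(ndjson_body)
```

Batch sizes of a few MB per request are a common starting point; measure against your own queue and thread pool limits rather than copying a number.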

4.4 Disable swap

On Linux, swap can be disabled temporarily by running:

sudo swapoff -a

(To disable it permanently, remove the swap entries from /etc/fstab.)

 

5. Retrieval and aggregation optimization practice

5.1 Disable wildcard fuzzy matching

Once data volume reaches TB+ scale, wildcard queries easily cause stalls, especially when combined across multiple fields, and can even crash cluster nodes.

The consequences can be disastrous.

Alternatives:

Option 1, for high-precision requirements: index with two analyzers, standard and ik combined, and retrieve with match_phrase.

Option 2, when precision requirements are lower: use ik analysis and query with match_phrase plus slop.
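Option 2 sketched as a query body (field name is hypothetical; the title field is assumed to be ik-analyzed as in section 3.1):

```python
import json

# match_phrase with slop: terms must appear near each other, tolerating
# up to `slop` position moves -- far more precise than a wildcard query,
# and unlike wildcard it uses the inverted index efficiently.
query = {
    "query": {
        "match_phrase": {
            "title": {
                "query": "性能优化",
                "slop": 2,
            }
        }
    }
}

# Serialized body for: GET /articles/_search
print(json.dumps(query, ensure_ascii=False, indent=2))
```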

5.2 Plain match is rarely appropriate

For Chinese text, plain match results are visibly imprecise. Large-scale business scenarios mostly use phrase matching, i.e. match_phrase.

Combined with well-maintained segmentation dictionaries and synonym lists, match_phrase makes results considerably more precise and avoids noisy data.

5.3 Where the business scenario allows, use filter heavily

For scenarios that do not need relevance scoring, the filter cache will undoubtedly make retrieval faster.

Example: filtering on a postal code.
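The postal-code example sketched as a bool query body (field names are hypothetical; zipcode is assumed to be a keyword field):

```python
import json

# Filter clauses skip scoring entirely and are cacheable,
# which is exactly what exact-match lookups want.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"zipcode": "100005"}},
                {"range": {"publish_time": {"gte": "now-7d/d"}}},
            ]
        }
    }
}

# Serialized body for: GET /articles/_search
print(json.dumps(query, indent=2))
```

If the same request also needs scored full-text clauses, put those under must and keep the exact conditions under filter.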

5.4 Control the fields returned in results

As in MySQL business development, a select * is almost never necessary.

Likewise in ES, returning the entire document via _source is not essential.

Control the returned fields through _source filtering and return only business-relevant fields.

Returning bulky fields in search results, such as full page text or an html_content page snapshot, is probably a design flaw in the business.

Obviously, a summary field should be written at index time, not produced by truncating the content at query time.
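A sketch of _source filtering for the scenario above (field names besides html_content are hypothetical):

```python
import json

# Return only business-relevant fields; never ship the bulky
# html_content snapshot back to the client on every search.
query = {
    "_source": {
        "includes": ["title", "summary", "publish_time"],
        "excludes": ["html_content"],
    },
    "query": {"match_phrase": {"title": "elasticsearch"}},
}

# Serialized body for: GET /articles/_search
print(json.dumps(query, indent=2))
```

The summary field appearing in includes is the pre-computed one written at index time, as recommended above.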

5.5 Deep paging and traversal

For paged queries, use from + size;

for traversal, use scroll;

for parallel traversal, use scroll + slice.

Choose among them according to the business use case.
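The parallel-traversal variant sketched as request bodies: each worker scrolls an independent slice of the same index (two workers assumed for illustration):

```python
import json

def slice_body(slice_id, max_slices):
    """Search body for one worker in a sliced scroll.

    Each worker gets a disjoint slice; together the slices
    cover the whole result set exactly once.
    """
    return {
        "slice": {"id": slice_id, "max": max_slices},
        "query": {"match_all": {}},
    }

# Bodies for two parallel workers, each issued as:
# GET /articles/_search?scroll=1m
bodies = [slice_body(i, 2) for i in range(2)]
for b in bodies:
    print(json.dumps(b))
```

from + size remains fine for shallow UI paging; scroll (sliced or not) is for exhaustive traversal, where deep from values would be prohibitively expensive.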

5.6 Set aggregation size sensibly

Aggregation results are approximate. Unless size is set large enough to cover every bucket, a terms aggregation returns the merged top-size entries from each shard's sorted values.

Business scenarios that require exact results must take note.

Avoid pulling the full aggregation result; taking the top N is entirely reasonable at the business level, because values beyond the sorted top are rarely meaningful.
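A top-N terms aggregation sketch (field name hypothetical): size bounds the merged result, while raising shard_size makes each shard's candidate list larger and the merged top-N more accurate, at extra cost.

```python
import json

# Top-10 channels; shard_size > size trades memory/CPU for accuracy
# in the per-shard candidate lists that get merged.
agg = {
    "size": 0,  # no hits needed, only buckets
    "aggs": {
        "top_channels": {
            "terms": {
                "field": "channel",
                "size": 10,
                "shard_size": 100,
            }
        }
    }
}

# Serialized body for: GET /articles/_search
print(json.dumps(agg, indent=2))
```

The response also reports doc_count_error_upper_bound, which quantifies how inaccurate the merged counts might be.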

5.7 Implement aggregation pagination sensibly

When displaying aggregation results, paginating the aggregated buckets inevitably comes up, and for performance reasons ES does not officially support paginated aggregations.

If pagination after aggregation is needed, it must be implemented by the application. Options include, but are not limited to:

Option 1: fetch the aggregation result and paginate it in application memory before returning it.

Option 2: collect the results with scroll and implement pagination with Redis.

6. Business-level optimization

Let Elasticsearch do what it is good at, and it is clearly best at retrieval based on the inverted index.

At the business level, users simply want to see results as fast as possible; they do not care about the pile of intermediate "field processing, formatting, and normalization" steps.

Suggestions for making Elasticsearch retrieval more efficient:

1) Do the preparatory work well

Move field extraction, sentiment analysis, classification/clustering, and relevance determination into the ETL stage, before data is written to ES;

2) Persuade the product manager

Faced with all kinds of exotic business scenarios, product managers may raise unreasonable requirements.

As engineers, we should appeal to both emotion and reason, explaining to product managers how search engines and Elasticsearch work, what they can do, and what truly cannot be done.

7. Summary

In real-world business development, companies generally want the horse to run fast without letting it graze.

For Elasticsearch development, when hardware resources are insufficient (CPU, memory, and disk all maxed out), there is almost no way to improve performance.

And if, beyond retrieval and aggregation, Elasticsearch is made to do N other tasks, related or not, the conclusion then drawn is "Elasticsearch is just slow."

Does a similar scenario come to mind for you?

Provide solid hardware resources, do all the preparatory work, and let Elasticsearch travel light; I believe your Elasticsearch will fly too!

Origin www.cnblogs.com/CQqf2019/p/11229964.html