Big Data Micro Classroom | New Features and Improvements in Elasticsearch 5.0

Elastic will release Elasticsearch v5.0, a major new version, this fall. This WeChat session introduces some of the new features and improvements coming in 5.0.

5.0? Oh my god, isn't the version number jumping a bit fast?

Well, let's talk about the reasons behind it first.

I believe everyone has heard of ELK, the acronym for Elasticsearch, Logstash, and Kibana. Elastic has now added a new open source member to the family: Beats.

 

Someone suggested calling it ELKB from now on?

And to leave room for future growth :) ELKBS? ELKBSU? .....

So instead, the product line is being named the Elastic Stack.

At the same time, the current version numbers are confusing because every product has its own: Elasticsearch and Logstash are at 2.3.4, Kibana is at 4.5.3, and Beats is at 1.2.3.

 

The version numbers are a mess. Which version of Kibana goes with which version of ES? Are there compatibility issues?

So the plan is to unify the version numbers of all these products at v5.0. Why 5.0? Because Kibana is already on 4.x, its next major version can only be 5.0, and the other products will jump along with it. The first official 5.0 release will ship this fall; the latest test build is 5.0 Alpha 4.

The teams are currently deep in development and testing, with new features and improvements landing every day. This sharing focuses on the main changes in Elasticsearch.

What's New in Elasticsearch 5.0

First, let's take a look at what new features have been introduced in 5.0.

Let's look at performance first.

The first is support for Lucene 6.x.

Elasticsearch 5.0 is the first release to integrate Lucene 6. Its most important new feature is Dimensional Points, multi-dimensional point fields; the related field types in ES, such as date, numeric, ip, and geospatial fields, get a large performance boost from it.

In short: disk space is cut in half, indexing time is cut in half, query performance improves by about 25%, and IPv6 is now supported.

Why is it fast? The underlying structure is the block k-d tree. The core idea is to encode numeric values into fixed-length byte arrays, sort the encoded values, and then recursively build a binary tree over them. The current implementation supports up to 8 dimensions and up to 16 bytes per dimension, which covers most scenarios.

Having said all that, the charts make it clearer.

In the chart, total index size jumps around late October 2015, when es enabled doc_values by default. Watch the red line: after the new data structure was introduced, the red index size is only about half of what it was.

With a smaller index, merge times drop as well, as the next chart shows:

The corresponding Java heap usage is likewise only half of what it was:

Indexing throughput also climbs sharply:

Of course, Lucene 6 contains many more optimizations and improvements than can be listed here.

Let's look at other optimizations in index performance.

ES 5.0 removes the contended lock in the internal engine that guarded against concurrent updates to the same document, yielding a 15%-20% indexing performance improvement (#18060).

The screenshots above come from ES's continuous nightly benchmarks: https://benchmarks.elastic.co/index.html

Another very large improvement concerns aggregations: Instant Aggregations.

Elasticsearch has long provided an aggregation cache at the shard level: if your data has not changed, ES can directly return the last cached result.

But there is one special scenario: the date histogram. The time conditions set in Kibana are almost always relative, such as from: now-30d to: now. Since now is a value that changes constantly, the query condition keeps changing too, and the cache never gets used.

After a year of extensive refactoring, queries can now be rewritten flexibly:

First, the `now` keyword is ultimately rewritten to a concrete value;

Second, each shard rewrites the query into a `match_all` or `match_none` query based on its own data range, so the query becomes effectively cacheable; only the few shards whose data actually changed need to be recomputed. This greatly improves query speed.
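For a concrete picture, here is a minimal sketch of the kind of Kibana-style request that now benefits from this rewrite; the index name logs and the @timestamp field are assumptions for illustration:

```
curl -XPOST "localhost:9200/logs/_search" -d '
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-30d", "lte": "now" } }
  },
  "aggs": {
    "per_day": {
      "date_histogram": { "field": "@timestamp", "interval": "day" }
    }
  }
}'
```

On shards whose data lies entirely inside the range, the range clause rewrites to `match_all`, so the cached aggregation result can be reused.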

Also look at the Scroll related.

Now a new scroll type has been added: Sliced Scroll.

Have you used the Scroll API? Slow, isn't it? With a large amount of data, traversing everything through Scroll used to be unbearably slow. Now the Scroll API supports concurrent traversal of the data.

Each Scroll request can be split into multiple Slice requests; think of them as slices. Each slice is independent and runs in parallel, so rebuilding or traversing with Scroll becomes many times faster.

Check out this demo
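The demo was shown as a screenshot; here is a minimal sketch of the two parallel slice requests, with the index name my_index assumed for illustration:

```
# slice 0 of 2 -- can run in parallel with slice 1
curl -XGET "localhost:9200/my_index/_search?scroll=1m" -d '
{ "slice": { "id": 0, "max": 2 }, "query": { "match_all": {} } }'

# slice 1 of 2
curl -XGET "localhost:9200/my_index/_search?scroll=1m" -d '
{ "slice": { "id": 1, "max": 2 }, "query": { "match_all": {} } }'
```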

You can see two scroll requests with ids 0 and 1 respectively; max is the total number of parallel slices. Each slice traverses and fetches its portion of the data independently.

Next, let's look at the work es has done on query optimization.

A new Profile API has been added.

https://www.elastic.co/guide/en/elasticsearch/reference/master/search-profile.html#_usage_3

As the saying goes, to get rich, first build roads; likewise, to tune, first monitor. Elasticsearch exposes stats at many levels to help you monitor and tune, but that is not always enough. In many cases a slow query is simply a badly written query. Anyone who has worked with SQL knows how useful a database's execution plan is for tuning: you can see whether a query hit an index and how long each step took. Elasticsearch now provides a Profile API for the same purpose: just set profile: true on a query, and the time spent in each component during execution is collected.
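A minimal sketch of how to turn it on; the index name my_index and the message field are assumptions for illustration:

```
curl -XGET "localhost:9200/my_index/_search" -d '
{
  "profile": true,
  "query": { "match": { "message": "search test" } }
}'
```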

It is clear at a glance how much time each subquery takes.

Both search and aggregation profiles are supported.

There is also a long-standing paging problem: deep paging. Because it requires a global sort over number_of_shards * (from + size) documents, it consumes a lot of memory. es used to have no limit here; some users paged several thousand pages deep and es simply overflowed and died. elasticsearch later added a limit: from + size cannot exceed 10,000. If you need to page deeply, scroll is the recommended way.

But scroll has several problems. First, there is no ordering: it reads straight through the underlying segments. Second, there is no real-time guarantee: scroll is stateful, and es keeps the scroll context around for a while before releasing it; if you modify the index data during the scroll, those changes are not visible through the scroll interface, so it is inflexible. Now there is a new Search After mechanism. Like scroll it is a cursor, but its principle is to sort documents by several fields and use the last document of the previous page as the starting point for fetching the next size documents. We generally recommend including the _uid field in the sort, since its value is a unique id.

Search After:

https://github.com/elastic/elasticsearch/blob/148f9af5857f287666aead37f249f204a870ab39/docs/reference/search/request/search-after.asciidoc

Let's take a look at a demo of Search After, and understand it more intuitively:
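The demo was a screenshot; a minimal sketch of such a request, with the index name my_index and the sort values purely illustrative:

```
curl -XGET "localhost:9200/my_index/_search" -d '
{
  "size": 10,
  "query": { "match_all": {} },
  "sort": [
    { "date": "asc" },
    { "_uid": "desc" }
  ],
  "search_after": [1463538857, "tweet#654323"]
}'
```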

In the demo above, the two values passed to search_after are the two sort values of the last document from the previous page.

It follows your sort definition: with three sort fields, you pass three values.

Take a look at the new features related to indexing and shard management.

Added a Shrink API

https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-shrink-index.html#_shrinking_an_index

As everyone knows, the number of shards in an elasticsearch index is fixed: it can only be set when the index is created and cannot be changed once data has been written. If you later discover you have too many or too few shards, the only way to change it used to be to rebuild the index.

Now, with the Shrink API, you can shrink the shard count down to a factor of the original. For example, an index with 15 shards can be shrunk to 5, 3, or 1. Imagine this scenario: during the ingestion phase, when write pressure is very high, create the index with enough shards to fully exploit parallel writes; once writing is done, shrink it down to fewer shards to improve query performance.

Here is an example of an API call
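The call was shown as a screenshot; a sketch of what it looks like, matching the description below:

```
curl -XPOST "localhost:9200/my_source_index/_shrink/my_target_index" -d '
{
  "settings": {
    "index.number_of_shards": 1,
    "index.codec": "best_compression"
  }
}'
```

(The source index must first be made read-only, with a copy of every shard on one node, before it can be shrunk.)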

The example above shrinks my_source_index into the one-shard my_target_index, with best compression enabled.

Some will surely ask: isn't that slow? No, it's very fast! Shrinking hard-links the index files at the operating-system level, which is why it can complete in milliseconds. Of course, Windows does not support hard links, so files have to be copied there, which can be quite slow.

Let's look at another new feature that is not only interesting but also very powerful.

A new Rollover API has been added.

https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-rollover-index.html#indices-rollover-index

This is extremely useful for log data. We usually split indexes by day (or finer, if the volume warrants). We used to set up a template and have application code generate the index names; anyone who has used logstash will remember index patterns like logstash-[YYYY-MM-DD]. Now es5.0 provides a simpler way: the Rollover API.

The API call method is as follows:
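The calls were shown as screenshots; a sketch matching the description below, with the names taken from the text:

```
# create the first index with a write alias
curl -XPUT "localhost:9200/logs-0001" -d '
{ "aliases": { "logs_write": {} } }'

# define the rollover conditions on the alias
curl -XPOST "localhost:9200/logs_write/_rollover" -d '
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 1000
  }
}'
```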

As you can see above, we first create an index named logs-0001 with the alias logs_write, then set a rollover rule on logs_write: once the index holds more than 1000 documents or is more than 7 days old, the alias automatically switches over to logs-0002. You can also pass index settings and mapping parameters, and es handles the rest for you automatically. This feature is extremely friendly to log-storage scenarios.

New: Reindex.

Also on the subject of index data: we often used to rebuild indexes, and with data sources scattered across all kinds of systems, rebuilding was very painful. That brings us to the newly added Reindex API. Reindex can rebuild data directly within the Elasticsearch cluster: when a mapping change or a settings change forces a rebuild, Reindex makes it easy to rebuild asynchronously, and it also supports migrating data across clusters.

For example, an index created on a daily basis can be periodically rebuilt and merged into a monthly index.

Of course, _source should be enabled in the index.

Let's look at a demo. During the rebuild you can also transform the data:
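The demo was a screenshot; a minimal sketch following the daily-to-monthly example above, with the index names and the added field purely illustrative:

```
curl -XPOST "localhost:9200/_reindex" -d '
{
  "source": { "index": "logs-2016.06.01" },
  "dest":   { "index": "logs-2016.06" },
  "script": {
    "lang": "painless",
    "inline": "ctx._source.archived = true"
  }
}'
```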

Now for the change most relevant to Java developers: the RestClient.

5.0 provides the first official Java REST client SDK. The old TransportClient was bound to a specific version, made cluster upgrades troublesome, and did not support calls across Java versions. The REST client decouples those dependencies, so there are no jar conflicts. It provides automatic discovery of cluster nodes, log handling, and automatic retry against other nodes when a node request fails, taking full advantage of Elasticsearch's high availability, and its performance is comparable (#19055).

Then let's look at other features:

Added a Wait for refresh function.

In simple terms, it is equivalent to providing a document-level Refresh: https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-refresh.html.

The index operation gains a refresh parameter. As we all know, elasticsearch's refresh interval controls data visibility in near real time: refresh too frequently and the overhead is large; refresh too rarely and data visibility lags. There has long been an index-level _refresh API, but it works on the whole index, and we do not recommend calling it frequently. So what if you modify one document and need the client to see it immediately?

In 5.0, the write APIs (Index, Bulk, Delete, and Update) support refresh control at the single-document level. There are two options: refresh=true creates a small segment and refreshes immediately, which guarantees visibility at some extra cost; refresh=wait_for simply waits for the next scheduled es refresh before returning.

Calling example:
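A sketch of both variants, with the index, type, and document ids assumed:

```
# force an immediate refresh of the small segment containing this doc
curl -XPUT "localhost:9200/test/doc/1?refresh=true" -d '{ "foo": "bar" }'

# return only after the next scheduled refresh makes the doc visible
curl -XPUT "localhost:9200/test/doc/2?refresh=wait_for" -d '{ "foo": "baz" }'
```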

New: Ingest Node

https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html

Another important feature is the Ingest Node. Previously, if you needed to process or transform data, you did it before indexing; logstash, for example, can parse raw logs into structured form. Now you can do that processing directly inside es. es currently ships a set of commonly used processors, such as convert and grok. To use them, first define a pipeline that describes the document-processing logic, then specify the pipeline name at index time, and the documents will be processed by that predefined pipeline.

Demo again:
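The demo was a screenshot; a sketch matching the description below:

```
# create a pipeline that sets the foo field to the value bar
curl -XPUT "localhost:9200/_ingest/pipeline/my-pipeline-id" -d '
{
  "description": "set foo to bar",
  "processors": [
    { "set": { "field": "foo", "value": "bar" } }
  ]
}'

# index a document through the pipeline
curl -XPUT "localhost:9200/my-index/my-type/1?pipeline=my-pipeline-id" -d '
{ "foo": "original" }'
```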


The example above first creates a processing pipeline named my-pipeline-id; the index operation that follows can then use this pipeline directly to operate on the foo field. Here it simply sets foo to the value bar.

Not too exciting so far, so let's look at another example. Say we have a raw log line like this:

{
  "message": "55.3.244.1 GET /index.html 15824 0.043"
}

A bit of googling turns up a Grok pattern that matches it (used in the pipeline below).

Then we can use Ingest to define a pipeline like this:
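The pipeline was shown as a screenshot; a sketch using the classic access-log Grok pattern, with the pipeline name assumed (the `patterns` parameter follows the 5.0 grok processor docs):

```
curl -XPUT "localhost:9200/_ingest/pipeline/parse-log" -d '
{
  "description": "structure raw access-log lines",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"]
      }
    }
  ]
}'
```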

So what does the document look like after passing through our pipeline? Let's fetch it and see:
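Roughly what the stored _source would look like (grok keeps the original field and, without explicit type conversion, captures values as strings):

```
{
  "message": "55.3.244.1 GET /index.html 15824 0.043",
  "client": "55.3.244.1",
  "method": "GET",
  "request": "/index.html",
  "bytes": "15824",
  "duration": "0.043"
}
```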

Clearly, the original message field has been split out into a much more structured object.

Now take a look at the scripting changes.

Added Painless Scripting

Remember the Groovy script vulnerabilities? With Groovy scripting enabled, misuse can open security holes. Why? Mainly because these general-purpose external script engines are too powerful and can do anything, which creates security risk. For both security and performance, a new scripting engine called Painless was developed. As the name suggests, it is simple, safe, and painless to use. Unlike Groovy with its sandbox mechanism, Painless uses a whitelist to restrict access to functions and fields, and it is optimized for es scenarios: it only operates on es data, making it more lightweight and several times faster. It supports Java-style static typing, its syntax is similar to Groovy, and it also supports Java lambda expressions.

Let's compare performance; see the figure below.

Groovy is the weak green one.

Let's see how to use it:

def first = input.doc.first_name.0;

def last  = input.doc.last_name.0;

return first + " " + last;

Looks much the same as before, right?

Or it can be written strongly typed (about 10x faster than the dynamic version above):

String first = (String)((List)((Map)input.get("doc")).get("first_name")).get(0);

String last  = (String)((List)((Map)input.get("doc")).get("last_name")).get(0);

return first + " " + last;

Scripts can be used in many places, such as custom scoring in search, or processing fields during updates.

Such as:
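A sketch of a scripted partial update in Painless, with the index, type, field, and id assumed:

```
curl -XPOST "localhost:9200/my-index/my-type/1/_update" -d '
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.counter += params.count",
    "params": { "count": 1 }
  }
}'
```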

Now let's take a look at the infrastructure changes.

New: Task Manager

This is the task scheduling and management mechanism introduced in 5.0, used to manage long-running tasks. For example, long-running reindex and update_by_query operations run on the TaskManager mechanism, and the tasks are manageable: you can cancel them at any time, task state is persisted, and failure recovery is supported.

Also new: Deprecation logging.

When you use ES, some APIs you call may be marked deprecated: they still work, because deprecated APIs are generally not removed immediately, giving you enough time to migrate, but they will be removed in a future version, so you need to know which ones you depend on and update your application code. Now there is a deprecation log: when you enable it, every call to a deprecated API is recorded, so you know exactly what to change next.

New: Cluster allocation explain API

"Who can give me a reason why shards cannot be allocated?" Now I have it. If you have encountered the problem that shards cannot be allocated normally before, but you don't know the reason, you can only try to manually route or restart the node, but you may not be able to. There are actually many reasons for the solution. The explain interface provided now is to tell you the reason why it cannot be allocated normally at present, so that it is convenient for you to solve it.

There is also a new addition to the data types: the half_float type.

https://www.elastic.co/guide/en/elasticsearch/reference/master/number.html

A half_float uses only 16 bits, which is enough for most metric and monitoring scenarios. The supported range runs from 2^-24 to 65504, and it occupies only half the storage space of a float.
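A mapping sketch, with the index, type, and field names assumed:

```
curl -XPUT "localhost:9200/metrics" -d '
{
  "mappings": {
    "doc": {
      "properties": {
        "cpu_load": { "type": "half_float" }
      }
    }
  }
}'
```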

New aggregation: Matrix Stats Aggregation (#18300).

Very useful in the financial field: it can compute covariance matrices, correlation-coefficient matrices, and similar statistics over multiple fields.

Another important feature: sequence numbers for index writes (#10708).

Everyone knows that es writes to the primary first and then replicates to the copies, and these requests are concurrent. Although version numbers can control conflicts, there was no way to guarantee the order of operations on the replicas. Now a sequence number is generated at write time, and a checkpoint is written locally to record how far each copy has progressed. When a replica recovers, it knows exactly where its data ends and can resume from that point, instead of doing a full file synchronization as before. The sequence numbers are persisted too, so replica state can be restored quickly after a restart. Think of all the useless full-replica copies and back-and-forth data shuffling this saves.

Other improvements in Elasticsearch 5.0

Let's take a look at the mapping improvements.

New field types Text and Keyword replace String

The old string type is split into two: text and keyword. A keyword field can only be matched exactly and suits data that should not be analyzed; it works very well for filtering and aggregations. A text field, on the other hand, is analyzed and used for full-text search. The benefit of splitting the types is that usage becomes simpler and clearer: previously you had to configure the analyzer and index options, often with custom tokenizers, and you could never tell from a field's name whether it was analyzed, which made things painful to use.
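A mapping sketch of the two new types, with the index, type, and field names assumed:

```
curl -XPUT "localhost:9200/my-index" -d '
{
  "mappings": {
    "doc": {
      "properties": {
        "title": { "type": "text"    },
        "tags":  { "type": "keyword" }
      }
    }
  }
}'
```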

In addition, the string type is still there for the time being and will be removed in 6.0.

There are also improvements to Index Settings

Elasticsearch has a great many settings, and many useless ones have been removed in this version. Ever gotten a setting wrong without noticing?

Now settings validation is strict and atomic: if any setting in an update fails validation, the entire update request fails; no more half-success, half-failure. Two other points worth noting (see the sketch after this list):

  1. A setting can be reset to its default value simply by setting it to `null`
  2. The get settings API has a new `?include_defaults` parameter that returns all settings together with their default values
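A sketch of both, with the index name assumed (the docs spell the parameter `include_defaults=true`):

```
# reset refresh_interval to its default
curl -XPUT "localhost:9200/my-index/_settings" -d '
{ "index.refresh_interval": null }'

# show every setting, including defaults
curl -XGET "localhost:9200/my-index/_settings?include_defaults=true"
```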

Improvements to cluster handling: Deleted Index Tombstones

In earlier es versions, if an old node still held data for an index that had since been deleted, that index would be added back to the cluster when the node rejoined. Feels a bit scary, doesn't it? Now es5.0 keeps records of the last 500 deleted indexes in the cluster state, so when such data is found for an index known to be deleted, it is cleaned up automatically instead of being re-added.

Improvements to documents: field names may contain dots again. Support for dots in field names was removed in 2.0; the problem has now been solved and they are supported once more.

es will consider the following two documents to have the same content:
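For example (values illustrative), these two documents are treated as equivalent:

```
{ "foo.bar": 10 }

{ "foo": { "bar": 10 } }
```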

And some other improvements

Cluster state changes are now acknowledged by all nodes.

When a shard replica fails and the primary marks it as failed, the primary now confirms this with the master node before returning.

Indexes now use UUIDs as their physical path names on disk, which has many advantages and avoids naming conflicts.

_timestamp and _ttl have been removed; handle them with an Ingest pipeline or on the client side instead.

ES can directly use HDFS for backup and restore (Snapshot/Restore) #15191 .

Delete-by-query and update-by-query are back in core. They used to be plugins and can now be used directly; both are built on the Reindex machinery.

HTTP requests support compression by default; of course, the HTTP client must advertise support in its request headers.

Creating an index no longer turns the cluster red, and the cluster no longer stalls because of it.

The BM25 scoring algorithm is now the default, with better results; it used to be TF/IDF.

Snapshots add UUID to resolve conflict #18156 .

Indexing request size is limited, so a large number of concurrent requests can no longer overwhelm ES (#16011).

The number of shards a single request may touch is limited, with a default of 1000 (#17396).

Site plugins have been removed, which means neither head nor bigdesk can be installed directly into es anymore; you can deploy them as standalone sites instead (they are just static files) or develop Kibana plugins (#16038).

Existing parent types can now have child types added (#17956).

This feature should be very useful for people using the parent-child feature.

Semicolons (;) are now supported as URL parameter separators, just like ampersands (&) (#18175).

For example the following example:

curl "http://localhost:9200/_cluster/health?level=indices;pretty=true"

Well, that looks like a lot, but the above is still only a portion of the features and improvements; es5.0 has done a great deal of work. I was going to cover bug fixes too, but there are too many and time is limited; many important bugs have also already been fixed in 2.x. See the following links for the detailed release notes:

https://www.elastic.co/guide/en/elasticsearch/reference/master/release-notes-5.0.0-alpha1-2x.html

https://www.elastic.co/guide/en/elasticsearch/reference/master/release-notes-5.0.0-alpha1.html

https://www.elastic.co/guide/en/elasticsearch/reference/master/release-notes-5.0.0-alpha2.html

https://www.elastic.co/guide/en/elasticsearch/reference/master/release-notes-5.0.0-alpha3.html

https://www.elastic.co/guide/en/elasticsearch/reference/master/release-notes-5.0.0-alpha4.html

Download and try the latest version: https://www.elastic.co/v5

Upgrade Wizard: https://github.com/elastic/elasticsearch-migration/blob/2.x/README.asciidoc

If you have es-related questions, please visit the Elastic Chinese community: http://elasticsearch.cn

For further discussion you can add me on WeChat, and you are also welcome to post on elasticsearch.cn. Thank you.

Q&A

Q1: Is it possible to use es as the secondary index of hbase?

A1: To be honest, there are relatively few cases of this, because the cost is high: it is hard to combine two distributed systems and still get adequate performance. It's not recommended.

Q2: When updating data in batches, a small number of updates fail.

A2: First look at why those few fail; the es response contains the specific error for each item. Invalid JSON, for example, will fail.

Q3: Does the ik plugin plan to support hot updates of synonyms and proper nouns? For scenarios where the dictionary changes frequently, is reindexing the only option?

A3: Synonyms are handled by a separate token filter that can be combined with ik. As for hot updates: reindexing really is required, because once the dictionary changes, the terms produced at index time differ from those at query time, and queries will miss.

Q4: Hello, teacher, a question: our product base data, product reviews, and favorites currently all live in mysql, but we want to move to es. Should we put the product base data in es and keep real-time data like favorites and reviews in mysql? Sorting needs to take a product's favorites and reviews into account, and paging is involved as well. How do we combine es and mysql data when sorting?

A4: It depends on the specific business scenario. If the updates are frequent but still within what es can tolerate and your business response targets allow, put the data directly in es and sort inside es. If the load is too great, keep it in external storage; there are many ways to combine that with es. Do favorites and reviews really need to be that real-time? Also, es's scoring mechanism is extensible: reading external data sources from a custom plugin during the scoring phase and doing mixed scoring is feasible too.

Q5: Can large agg queries be canceled now?

A5: Not yet.

Q6: Have you considered providing sql syntax query?

A6: There is no plan for now.

Q7: For machines with 128G of memory, the official advice used to be to run two es instances per machine. Is that still recommended?

A7: It really depends on the scenario. If the indexes on a single machine are large, it is advisable to leave a bit more memory to the operating system for caching; multiple instances can provide sufficient throughput.

Q8: Has the algorithm used to calculate unique count changed?

A8: Yes, it is called cardinality in elasticsearch. There is an article here: https://www.elastic.co/blog/count-elasticsearch.

Q9: In ES5, with 256G of memory per server, how much storage per server is appropriate? 24T? 48T? Or can it be more?

A9: It depends on the scenario; there are deployments with more than 48T.

Q10: Do you only need to install Elastic Stack once, or do you need to install different components like the original elk?

A10: The installation method is the same as before.

Q11: How do we deduplicate on a field in es? Concretely: we have an article index with 200 million documents, and every search returns lots of duplicate titles. We'd like to deduplicate by title at query time, i.e. the Field Collapsing feature. Officially there is a deduplication approach using a terms aggregation, but the results are not satisfying; many duplicates remain. Even strict title equality would be acceptable to us. We've also considered deduplicating via simhash: compute the simhash of each title, store it in the index, and compare similarity by simhash at search time, but that requires recomputing everything, and the data volume is too large. So we're hoping this can be achieved with some kind of trick, without touching the existing index.

A11: For direct deduplication at query time there is currently no better solution, but there are several alternatives. First, confirm whether duplicate titles are allowed at all in your scenario. If not, hash the title and use it as the document id at index time, so duplicates can never occur. If you also need to keep the original data, then at indexing time build a separate title-deduplicated index and join against it. Of course, it depends on the specifics of your business scenario.

Q12: With limited hardware, what's the strategy for cleaning up expired data?

A12: If your data structure is fixed, combine this with the 5.0 Rollover API: estimate the maximum number of indexes you can hold, then periodically check and delete old indexes.
