Elasticsearch is developing rapidly worldwide, and its features and use cases keep growing. At ElasticConference 2023 today, we learned about a series of exciting new features in the Elasticsearch 7.x and 8.x releases. This article walks through these features and how they are used, to help you understand and use Elasticsearch better.
1. New cluster balancing strategy
Strategy 1: Rebalance disk usage based on shard size. The system monitors disk usage on every node in the cluster. If a node's disk usage exceeds a preset threshold, the system automatically triggers shard relocation and moves some of that node's shards to nodes with lower usage. This shard-size-based rebalancing helps spread disk usage evenly across the cluster, which improves overall performance.
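To make the idea concrete, here is a minimal toy sketch of threshold-triggered shard relocation. It is not Elasticsearch's actual allocator: the function name `rebalance_by_disk`, the single `threshold` parameter, the uniform node capacity, and the greedy "largest shard to emptiest node" rule are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def rebalance_by_disk(
    nodes: Dict[str, Dict[str, int]],
    capacity_gb: int,
    threshold: float = 0.85,
) -> List[Tuple[str, str, str]]:
    """Move shards off nodes whose disk usage exceeds `threshold`.

    `nodes` maps node name -> {shard name: shard size in GB} and is modified
    in place. Returns the list of (shard, source node, target node) moves.
    """
    def used(node: str) -> int:
        return sum(nodes[node].values())

    moves: List[Tuple[str, str, str]] = []
    if len(nodes) < 2:
        return moves  # nowhere to move shards to

    # Visit the fullest nodes first.
    for source in sorted(nodes, key=used, reverse=True):
        while nodes[source] and used(source) / capacity_gb > threshold:
            # Hypothetical rule: relocate the largest shard on the node.
            shard = max(nodes[source], key=nodes[source].get)
            size = nodes[source][shard]
            # Pick the other node with the least disk in use.
            target = min((n for n in nodes if n != source), key=used)
            if (used(target) + size) / capacity_gb > threshold:
                break  # the move would only push the target over the threshold
            nodes[target][shard] = nodes[source].pop(shard)
            moves.append((shard, source, target))
    return moves


cluster = {
    "node-1": {"logs-0": 50, "logs-1": 30, "metrics-0": 15},
    "node-2": {"logs-2": 20},
    "node-3": {"metrics-1": 10},
}
print(rebalance_by_disk(cluster, capacity_gb=100))
```

In this example only node-1 is above the threshold, so a single shard is moved to the emptiest node and the loop stops as soon as node-1 drops back under it.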
Strategy 2: Rebalance indices by write load. The system also monitors the write (ingest) load on every node in the cluster. Based on that load, it automatically adjusts how index shards are distributed, so that heavily loaded nodes hold fewer shards and lightly loaded nodes hold more. This evens out the write load across the cluster while preserving system performance.
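The write-load idea can be sketched the same way. The snippet below is a simplified illustration, not the real allocator: it assumes each shard has a known write-load number and greedily places the heaviest shards first on whichever node currently carries the least load. The name `assign_by_write_load` and the load figures are made up for the example.

```python
import heapq
from typing import Dict, List


def assign_by_write_load(
    shard_loads: Dict[str, float], node_names: List[str]
) -> Dict[str, List[str]]:
    """Greedy placement: heaviest shard first, onto the least-loaded node."""
    # Min-heap of (total write load on node, node name).
    heap = [(0.0, name) for name in node_names]
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {name: [] for name in node_names}

    for shard, load in sorted(shard_loads.items(), key=lambda kv: kv[1], reverse=True):
        total, name = heapq.heappop(heap)  # node with the lowest load so far
        assignment[name].append(shard)
        heapq.heappush(heap, (total + load, name))
    return assignment


loads = {"a": 8.0, "b": 5.0, "c": 4.0, "d": 3.0}
print(assign_by_write_load(loads, ["n1", "n2"]))
```

Here the node holding the hottest shard ends up with fewer additional heavy shards, which is the behaviour the strategy describes: shard placement follows load rather than shard count alone.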
The new cluster balancing strategy has the following advantages. By balancing both disk usage and write load, it makes fuller use of each node's resources and improves overall performance. Balancing load across nodes also reduces the impact of a single node failure on the cluster and makes the system more stable.
In addition, the strategy adjusts resource allocation automatically according to actual demand, which avoids wasted resources and improves utilization. Automatic adjustment also lightens the workload of operations staff, reduces the risk that comes with manual intervention, and helps lower operating costs.
2. Kibana supports ARM architecture
3. Centralized collection platform and security features
The Elastic Stack introduces a centralized data collection platform, with a series of integrations and a unified management console.
For security scenarios, the Elastic Stack also provides EQL (Event Query Language) sequence queries, which suit cases that require matching an ordered sequence of events over time.
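As a rough illustration of sequence matching, the sketch below builds the request body for an EQL search. The helper `build_eql_sequence`, the field names, and the two event conditions are assumptions chosen for the example; only the general `sequence by ... with maxspan=... [ ... ] [ ... ]` shape comes from EQL itself.

```python
import json
from typing import Any, Dict, List


def build_eql_sequence(join_field: str, maxspan: str, steps: List[str]) -> Dict[str, Any]:
    """Build an EQL request body that matches `steps` in order.

    All steps must share the same `join_field` value and occur within `maxspan`.
    """
    stages = " ".join(f"[ {step} ]" for step in steps)
    return {"query": f"sequence by {join_field} with maxspan={maxspan} {stages}"}


# Hypothetical detection: a shell starts, then the same host opens an
# outbound connection on port 443 within five minutes.
body = build_eql_sequence(
    "host.id",
    "5m",
    [
        'process where process.name == "cmd.exe"',
        "network where destination.port == 443",
    ],
)
print(json.dumps(body, indent=2))
```

A body like this would be sent to the EQL search endpoint of an index or data stream; the index name is left out here because it depends on your own data.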
4. Storage-compute separation architecture and the new query language ESQL
The future direction of the Elastic Stack centers on a service-oriented architecture that separates storage from compute. In a cloud-native architecture, using object storage as the storage medium reduces the cost of moving data around and improves automatic scaling.
In addition, the Elastic Stack will introduce a new query language, ESQL (written ES|QL in Elastic's documentation), to make data processing more flexible and performant. ESQL chains commands together with pipes, so a single query can run several steps in sequence, such as transforming and filtering data.
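The piped style is easiest to see in a small example. The sketch below only assembles a query string and request body; the helper `build_esql`, the index pattern, and the field names are hypothetical, and the command syntax follows ES|QL as Elastic later documented it, so treat it as indicative rather than exact for any given version.

```python
import json
from typing import Any, Dict


def build_esql(source: str, *stages: str) -> Dict[str, Any]:
    """Join a FROM clause and the following processing stages with pipes."""
    return {"query": " | ".join([f"FROM {source}", *stages])}


# Hypothetical pipeline: filter server errors, count them per host,
# then keep the five hosts with the most errors.
body = build_esql(
    "logs-*",
    "WHERE http.response.status_code >= 500",
    "STATS errors = COUNT(*) BY host.name",
    "SORT errors DESC",
    "LIMIT 5",
)
print(json.dumps(body, indent=2))
```

Each stage consumes the output of the one before it, which is what lets filtering, aggregation, and sorting be expressed as one readable chain instead of nested query clauses.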
5. Full-stack observability solution
The Elastic Stack provides a full-stack observability solution covering logs, metrics, APM, RUM (real user monitoring), synthetic monitoring, universal profiling, and more. These capabilities help users understand and monitor the running state of their systems more comprehensively.
6. Security solutions
Elastic Stack also provides security solutions, including collecting security-related data, analyzing and detecting abnormal behavior, and automatic response. Elastic Security can provide a one-stop security solution, integrating SIEM, Endpoint Security and Threat Hunting functions on one platform to help enterprises achieve more efficient security protection.
7. Machine Learning Integration
Elasticsearch has integrated machine learning functions, which can be used for tasks such as anomaly detection and time series prediction. The new version of Elasticsearch will further optimize machine learning functions, improve model training and prediction performance, and provide more machine learning algorithms for users to choose from.
On this topic, I personally highly recommend the "GPT4 VS Elasticsearch" session given by Mr. Li Jie in the second part. It is very good and worth going back to more than once!
8. Geospatial Search and Visualization
Elasticsearch 7 and 8 series further enhance geospatial search and visualization capabilities. New features include support for GeoJSON data, optimizations for processing geospatial data, and more geospatial aggregation and visualization tools. These functions will help users to process and analyze geospatial data more conveniently.
9. Flexible computing resource scheduling and cost optimization
Elasticsearch introduces the elastic computing resource scheduling function, which can dynamically allocate computing resources according to actual business needs. In addition, the new version also provides cost optimization tools to help users evaluate and optimize the operating costs of Elasticsearch clusters.
10. More powerful API and client library support
Elasticsearch 7 and 8 series will provide more powerful API and client library support to meet the needs of various programming languages and platforms. This will make it easier for developers to integrate and use Elasticsearch functionality.
11. Optimization at the retrieval level
Regarding optimization at the retrieval level, Elasticsearch 7 and 8 series also have many significant improvements. Here are some key search optimization features:
11.1. Point In Time (PIT)
Point In Time (PIT) is a feature introduced in Elasticsearch 7.10. It lets users open a lightweight, fixed view of the index state as it was at a given moment and keep that view alive for a set time. Different search requests that use the same PIT see the same data, which avoids inconsistent results caused by index updates between requests, for example while paging through a result set.
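A typical use is deep pagination: open a PIT once, then page with `search_after` against it. The sketch below only builds the search bodies; the helper name `pit_search_body`, the `@timestamp` sort field, and the placeholder PIT id are assumptions for illustration. In practice the id comes from the response to opening a PIT on your index.

```python
import json
from typing import Any, Dict, List, Optional


def pit_search_body(
    pit_id: str,
    query: Dict[str, Any],
    size: int,
    search_after: Optional[List[Any]] = None,
    keep_alive: str = "1m",
) -> Dict[str, Any]:
    """Build one page of a PIT-based search.

    No index is named in the request: the PIT id already pins the target.
    """
    body: Dict[str, Any] = {
        "size": size,
        "query": query,
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        # A stable sort with a tiebreaker makes search_after reliable.
        "sort": [{"@timestamp": {"order": "asc"}}, {"_shard_doc": "asc"}],
    }
    if search_after is not None:
        # Sort values of the last hit from the previous page.
        body["search_after"] = search_after
    return body


first_page = pit_search_body("<pit-id>", {"match_all": {}}, size=100)
next_page = pit_search_body(
    "<pit-id>", {"match_all": {}}, size=100,
    search_after=["2023-05-01T00:00:00.000Z", 4294967298],
)
print(json.dumps(next_page, indent=2))
```

Because every page reads from the same point-in-time view, documents indexed or deleted while you are paging do not shift the results.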
11.2. Wildcard Field Types
The Wildcard field type is a new field type designed to support efficient wildcard and regular expression queries. It can help users execute complex queries containing wildcards and regular expressions faster and improve query performance.
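As a small sketch, the mapping and query for such a field could be built like this. The helper names and the `message` field are hypothetical; the `wildcard` field type and the `wildcard` query are the parts that come from Elasticsearch.

```python
import json
from typing import Any, Dict


def wildcard_mapping(field: str) -> Dict[str, Any]:
    """Index mapping that stores `field` using the wildcard field type."""
    return {"mappings": {"properties": {field: {"type": "wildcard"}}}}


def wildcard_query(field: str, pattern: str) -> Dict[str, Any]:
    """Search body that matches `pattern` (using * and ?) against `field`."""
    return {"query": {"wildcard": {field: {"value": pattern}}}}


print(json.dumps(wildcard_mapping("message")))
print(json.dumps(wildcard_query("message", "*connection*refused*")))
```

Patterns with a leading `*`, which are costly on ordinary keyword fields, are the case this field type is meant to handle well.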
11.3. Runtime Fields
Runtime fields are fields whose values are computed dynamically at query time instead of being stored at index time. Users do not need to compute and store these fields when indexing, which saves storage space and speeds up indexing, at the cost of extra work during each query. Runtime fields are defined with the Painless scripting language, so users can flexibly describe how a field's value is calculated.
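For example, a runtime field can be declared directly in a search request. The sketch below builds such a request body; the helper name and the `price` and `quantity` fields are made up for the example, while `runtime_mappings` and the Painless `emit(...)` call are the standard way to define a runtime field.

```python
import json
from typing import Any, Dict


def runtime_field_search(
    name: str, field_type: str, painless_source: str, query: Dict[str, Any]
) -> Dict[str, Any]:
    """Search body that defines one runtime field and returns it with each hit."""
    return {
        "runtime_mappings": {
            name: {"type": field_type, "script": {"source": painless_source}}
        },
        "query": query,
        "fields": [name],
    }


# Hypothetical field: order total computed from two indexed numeric fields.
body = runtime_field_search(
    "total_price",
    "double",
    "emit(doc['price'].value * doc['quantity'].value)",
    {"range": {"total_price": {"gte": 100}}},
)
print(json.dumps(body, indent=2))
```

The computed field can be queried, sorted, and aggregated on like a normal field, but nothing is written to the index, so the definition can be changed without reindexing.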
11.4. Searchable snapshots
Elasticsearch 7.x and 8.x support searchable snapshots, which let users search an index held in a snapshot by mounting it, instead of first restoring it in full onto the cluster. This is very useful for scenarios that need to query historical data or analyze how data has changed: older, rarely accessed data stays searchable while it lives in cheaper snapshot storage.
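Mounting is done with a single API call. The sketch below describes that call as plain data; the helper name `mount_request` and the repository, snapshot, and index names are hypothetical, and you would send the result with whatever HTTP client you already use.

```python
import json
from typing import Any, Dict, Optional


def mount_request(
    repository: str,
    snapshot: str,
    index: str,
    renamed_index: Optional[str] = None,
    storage: str = "full_copy",
) -> Dict[str, Any]:
    """Describe the request that mounts `index` from a snapshot as searchable."""
    body: Dict[str, Any] = {"index": index}
    if renamed_index:
        body["renamed_index"] = renamed_index  # mount under a different name
    return {
        "method": "POST",
        "path": f"/_snapshot/{repository}/{snapshot}/_mount",
        # storage is either "full_copy" or "shared_cache"
        "params": {"wait_for_completion": "true", "storage": storage},
        "body": body,
    }


req = mount_request(
    "my-repo", "snap-2023-01", "logs-2023-01",
    renamed_index="logs-2023-01-mounted", storage="shared_cache",
)
print(json.dumps(req, indent=2))
```

Once mounted, the index is searched like any other index; the main trade-off is query latency, since some data may have to be fetched from the snapshot repository.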
11.5. Enrich Pipeline
The enrich processor is an ingest pipeline processor that looks up related data and adds it to incoming documents at index time. This is similar to a lookup operation in a database: it combines related data into one document so that later searches and analysis do not need a join. Enrich policies support several match types, such as exact value matching (match), geospatial matching (geo_match), and range matching (range), to cover different scenarios.
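The two pieces involved are an enrich policy, which names the lookup data, and an ingest pipeline that applies it. The sketch below builds both bodies; the helper names and the `users` index, `email` field, and `users-policy` name are hypothetical. A policy also has to be executed once after it is created, before the processor can use it.

```python
import json
from typing import Any, Dict, List


def match_enrich_policy(
    source_index: str, match_field: str, enrich_fields: List[str]
) -> Dict[str, Any]:
    """Policy body: look up `match_field` in `source_index`, copy `enrich_fields`."""
    return {
        "match": {
            "indices": source_index,
            "match_field": match_field,
            "enrich_fields": enrich_fields,
        }
    }


def enrich_pipeline(policy_name: str, field: str, target_field: str) -> Dict[str, Any]:
    """Ingest pipeline body with a single enrich processor."""
    return {
        "processors": [
            {
                "enrich": {
                    "policy_name": policy_name,
                    "field": field,                # value in the incoming document
                    "target_field": target_field,  # where the looked-up data goes
                }
            }
        ]
    }


policy = match_enrich_policy("users", "email", ["first_name", "last_name", "city"])
pipeline = enrich_pipeline("users-policy", "email", "user")
print(json.dumps(policy, indent=2))
print(json.dumps(pipeline, indent=2))
```

With this setup, a document indexed through the pipeline with an `email` value gains a `user` object holding the matching profile fields.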
11.6. Search sorting optimization
The Block-Max WAND algorithm is an efficient top-k retrieval algorithm built on the inverted index. It is designed to quickly identify and skip documents that are not competitive, which improves query efficiency.
Block-Max WAND splits the postings into blocks and records, for each block, the maximum score that any document in it can reach. While collecting the highest-ranked hits, the algorithm compares each block's maximum with the lowest score among the hits found so far and skips every block that cannot beat it, without scoring the documents inside. This repeats until every block has been either scored or skipped. Because skipped documents are never counted, the total hit count is only a lower bound when this optimization is active.
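The skipping step can be shown with a heavily simplified toy. This is not Lucene's implementation: it assumes each document already has one final score and that every block carries a precomputed maximum, whereas the real algorithm works across several term posting lists at once. The function name and data are made up for the example.

```python
import heapq
from typing import List, Tuple

# A block is (precomputed max score in the block, [(doc_id, score), ...]).
Block = Tuple[float, List[Tuple[int, float]]]


def top_k_with_block_skipping(
    blocks: List[Block], k: int
) -> Tuple[List[Tuple[int, float]], int]:
    """Return the top-k (doc_id, score) pairs and the number of blocks skipped."""
    heap: List[Tuple[float, int]] = []  # min-heap of (score, doc_id), size <= k
    skipped = 0
    for block_max, postings in blocks:
        # Once k hits are collected, heap[0] holds the score to beat.
        if len(heap) == k and block_max <= heap[0][0]:
            skipped += 1  # no document in this block can enter the top k
            continue
        for doc_id, score in postings:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    ranked = sorted(heap, key=lambda item: (-item[0], item[1]))
    return [(doc_id, score) for score, doc_id in ranked], skipped


blocks = [
    (3.0, [(1, 1.0), (2, 3.0)]),
    (0.9, [(3, 0.5), (4, 0.9)]),
    (4.0, [(5, 4.0), (6, 0.2)]),
    (2.5, [(7, 2.5), (8, 2.0)]),
]
top, skipped = top_k_with_block_skipping(blocks, k=2)
print(top, skipped)
```

In this run two of the four blocks are never scored, because their maximum cannot beat the weakest hit already in the top two. The saving grows as the score threshold rises during the search.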
11.7. Match-only text
The match_only_text field type is a space-saving variant of the text field type. It suits scenarios that need full-text matching on text fields but do not depend on relevance scoring, such as word matching over unstructured or semi-structured data like logs or social media data. Because it does not score matches, keyword search in applications such as search engines and e-commerce platforms, where ranking matters, is usually better served by a regular text field. It is also not suited to exact matches or range queries; for those, use keyword or numeric fields with "term" or "range" queries.
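A minimal sketch of how this looks for a log index follows. The helper names and the `message` field are assumptions; `match_only_text` is the field type itself, and it is queried with the usual `match` query.

```python
import json
from typing import Any, Dict


def log_index_mapping() -> Dict[str, Any]:
    """Mapping for a hypothetical log index with a match_only_text message field."""
    return {
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "message": {"type": "match_only_text"},
            }
        }
    }


def message_search(text: str) -> Dict[str, Any]:
    """Full-text match on the message field, newest logs first."""
    return {
        "query": {"match": {"message": text}},
        # Hits are not ranked by relevance here, so sort explicitly.
        "sort": [{"@timestamp": {"order": "desc"}}],
    }


print(json.dumps(log_index_mapping(), indent=2))
print(json.dumps(message_search("connection timeout"), indent=2))
```

Sorting by timestamp fits the typical log use case, where the most recent matching lines matter more than a relevance ranking.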
Together, the retrieval-level optimizations above give Elasticsearch 7.x and 8.x significant improvements in query performance, data storage, real-time computation, and data processing, and provide users with more powerful and flexible search capabilities.
11.8. Doc-value-only fields
For some fields, Elasticsearch can keep only doc values and skip building the search index for the field. Doc values are an on-disk columnar storage format that lets Elasticsearch sort and aggregate efficiently. The benefits of keeping only doc values include:
Saving disk space: keeping only doc values reduces the disk space the index needs, because only the data actually required for queries and aggregations is stored.
Efficient aggregation and sorting: because doc values are stored by column, Elasticsearch can process the data efficiently during operations such as aggregation and sorting, which helps response times for those requests.
Reduced memory usage: doc values live on disk rather than in heap memory, so memory usage stays lower, especially during heavy aggregations.
Cache-friendly: because doc values are stored in columns, CPU cache lines are used more effectively when the data is read.
Note that keeping only doc values also limits some functionality. Queries on such a field have to scan doc values instead of using an index, so they are noticeably slower; this option fits fields that are rarely searched. Separately, if the document source (_source) field is also disabled to save more space, the original document content can no longer be retrieved and partial updates are not possible. These limitations should be weighed against the benefits above.
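In mapping terms, a doc-value-only field is simply a field with indexing turned off and doc values left on. The sketch below shows one way to write that; the helper `doc_value_only` and the field names are hypothetical, and the approach applies to types such as numeric, date, boolean, ip, and keyword fields.

```python
import json
from typing import Any, Dict


def doc_value_only(field_type: str) -> Dict[str, Any]:
    """Field definition that keeps doc values but builds no search index."""
    return {"type": field_type, "index": False, "doc_values": True}


mapping = {
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},             # indexed as usual
            "host": {"type": "keyword"},                # indexed as usual
            "response_bytes": doc_value_only("long"),   # aggregated, rarely searched
            "client_ip": doc_value_only("ip"),
        }
    }
}
print(json.dumps(mapping, indent=2))
```

Fields that are filtered on in most queries, like the timestamp and host here, stay indexed; the doc-value-only setting is kept for fields that are mainly aggregated or sorted.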
12. Summary
ElasticConference 2023 brings us many exciting new features for the Elasticsearch 7 and 8 series. These new capabilities will help increase data processing capabilities, reduce storage costs, enhance real-time computing flexibility, and improve security and observability. As a mature search and analysis engine, Elasticsearch is constantly being optimized and improved to bring users a better experience.
Note: This article is based on a presentation by Mr. Zhu Jie, official senior architect at Elastic.