A 10-minute quick start with Elasticsearch, a search and analysis engine for massive data

Author: jeremyshi, backend development engineer, Tencent TEG

1. Background

With the rapid development of information technologies such as the mobile Internet, the Internet of Things, and cloud computing, data volumes have exploded. Today we can easily find the information we want in massive data sets, and this is inseparable from search engine technology. Thanks to indexing, retrieval, and ranking mechanisms, we can build basic full-text search functions without understanding the complex information-retrieval principles behind them. Even at the scale of billions or tens of billions of records, searches can still return results within seconds. Practical concerns such as disaster tolerance, data security, scalability, and maintainability are also effectively addressed by Elasticsearch, which ranks first among open source search engines.

2. Introduction to Elasticsearch

Elasticsearch (ES) is an open source distributed search and analysis engine based on Lucene that can index and retrieve data in near real time. It is highly reliable and easy to use, has an active community, and is widely used in scenarios such as full-text search, log analysis, and monitoring. Thanks to its high scalability, a cluster can grow to hundreds of nodes and handle petabytes of data. Writes, queries, cluster management, and other operations are all exposed through a simple RESTful API. Beyond search, ES provides rich statistical analysis functions, and the official X-Pack extension covers further needs such as data encryption, alerting, and machine learning. Custom plug-ins, such as COS backup and QQ word segmentation, can satisfy more specific requirements. The following sections introduce the architecture and basic principles of ES.

2.1 Elasticsearch architecture and principle

Elasticsearch cluster

Basic concepts:

  • Cluster: a group of ES nodes deployed across multiple machines, used to handle larger data sets and achieve high availability.

  • Node: an ES process on a machine; nodes can be configured with different roles.

  • Master Node: a node eligible for master election. One such node is elected as the master and is responsible for cluster metadata management, such as index creation and nodes joining or leaving the cluster.

  • Data Node: a node responsible for storing index data.

  • Index: a logical collection of indexed data, analogous to a database in a relational system.

  • Shard: a subset of an index's data. By assigning shards to different nodes in the cluster, data scales horizontally, solving the problem of a single node's limited CPU, memory, and disk capacity.

  • Primary Shard: shards follow a primary-replica model; the primary shard receives index operations.

  • Replica Shard: a copy of a primary shard, used to improve query throughput and data reliability. When a primary shard fails, one of its replica shards is automatically promoted to be the new primary.

To make the ES data model easier to understand, it can be compared with the relational database MySQL:

Data model comparison

As the architecture diagram above shows, the ES architecture is quite concise. Node discovery is built in through the Zen discovery mechanism: when a node starts, it joins the cluster by contacting nodes on the cluster member list. One node acts as the master, managing cluster metadata and maintaining the distribution of shards across nodes. When a new node joins the cluster, the master automatically migrates some shards to it to balance the cluster load.

Node joins the cluster

Node failures are inevitable in a distributed cluster. The master node periodically probes the liveness of the other nodes; when a node fails, the master removes it from the cluster and automatically recovers the shards it held onto other nodes. When a primary shard fails, one of its replica shards is promoted to primary. The other nodes likewise probe the liveness of the master; when the master fails, a built-in Raft-like protocol elects a new one, and a minimum number of master-eligible candidates is required to avoid cluster split-brain.

Node leaves the cluster

Besides cluster management, reading and writing index data is the other important concern. ES adopts a peer-to-peer architecture: every node holds the full shard routing table, so any node can accept reads and writes. Suppose a write request is sent to node 1. By default, the hash of the document ID determines which primary shard the document is written to; assume here it is shard 0. After writing to primary shard P0, node 1 forwards the write request in parallel to the node holding replica shard R0. Once the replica node confirms a successful write, success is returned to the client, ensuring data safety. Before writing, ES also checks that a quorum of shard copies is available, to avoid data inconsistency caused by network partitions.

Write operation
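The default routing rule described above can be sketched in a few lines. This is a toy illustration, not the real implementation: Elasticsearch hashes the routing value (the document ID by default) with murmur3, while here Python's `hashlib.md5` stands in as the hash, so the shard numbers will differ from a real cluster even though the mechanism is the same.

```python
import hashlib

def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Pick the primary shard for a document.

    ES uses murmur3(_routing) % number_of_primary_shards; md5 is a
    stand-in hash here, so the mechanism matches but not the numbers.
    """
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % num_primary_shards

# The routing is deterministic, so every node computes the same shard
# and any node can act as the coordinating node for a write.
shard = route_to_shard("user-42", 3)
assert 0 <= shard < 3
```

Because the shard is a pure function of the ID and the primary shard count, the number of primary shards cannot be changed after index creation without reindexing.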

Queries use distributed search. For example, after a request arrives at node 3, it is forwarded to the nodes holding the primary or replica shards of the index. If both writes and queries carry routing field information, the request is sent to only some of the shards, avoiding a full shard scan. After these nodes finish the query, they return their results to the coordinating node, which merges the per-shard results and returns them to the client.

Query operation
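The scatter-gather step above can be sketched as follows. This is a minimal model of the coordinating node's merge, assuming each shard returns its local top hits as `(doc_id, score)` pairs; real ES responses carry much more state (sort values, shard metadata, pagination cursors).

```python
import heapq

def merge_shard_results(shard_hits, size):
    """Coordinating-node merge: each shard returns its local top hits
    as (doc_id, score); the coordinator merges them into a global
    top-`size` list ordered by descending score."""
    all_hits = [hit for hits in shard_hits for hit in hits]
    return heapq.nlargest(size, all_hits, key=lambda h: h[1])

shard0 = [("a", 3.2), ("b", 1.1)]   # local top hits from shard 0
shard1 = [("c", 2.7), ("d", 0.4)]   # local top hits from shard 1
top = merge_shard_results([shard0, shard1], size=2)
# top == [("a", 3.2), ("c", 2.7)]
```

This also shows why deep pagination is expensive: to produce the global page at offset N, every shard must return its local top N + page hits to the coordinator.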

2.2 Lucene principle

Having covered the basic principles of ES clusters, let us briefly introduce Lucene, the storage engine underlying ES. Lucene is a high-performance information retrieval library that provides the basic indexing and retrieval functions. On top of it, ES solves reliability and distributed cluster management, forming a productized full-text search system.

The core problem Lucene solves is full-text search. Unlike traditional retrieval methods, full-text search avoids scanning all content at query time. When data is written, the content of each document field is segmented into terms, forming a term dictionary and the inverted (postings) lists associated with it. At query time, the segmented query keywords are matched directly against the dictionary to obtain the lists of matching documents, yielding the result set quickly. Ranking rules then display the best-matching documents first.

Inverted index
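The dictionary-plus-postings idea can be shown with a toy index. This is a deliberately simplified stand-in for Lucene's structures: the "tokenizer" is a plain whitespace split, and real segments additionally store term frequencies, positions, and compressed encodings.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a term -> sorted postings list (doc ids) mapping.

    A toy stand-in for Lucene's term dictionary + postings lists.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenizer
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def search(index, query):
    """AND-match all query terms by intersecting their postings lists."""
    postings = [set(index.get(t, ())) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "quick brown fox", 2: "quick red fox", 3: "lazy dog"}
idx = build_inverted_index(docs)
# search(idx, "quick fox") -> [1, 2]
```

The key property is that query cost depends on the length of the matched postings lists, not on the total corpus size — which is why, as the text notes, scanning all content is avoided.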

To speed up indexing, Lucene adopts an LSM-tree-like structure and first buffers index data in memory. When memory usage grows high or a time threshold is reached, the in-memory data is written to disk as a segment file; a segment comprises multiple files, including the dictionary, inverted lists, and field data.

LSM Tree structure

To balance write performance against data safety — for example, to prevent data in the memory buffer from being lost on machine failure — ES appends to a transaction log (the translog) while writing to memory. The in-memory data periodically generates new segment files, which are written to the filesystem cache at low cost and can then be opened and read, enabling near-real-time search.

Write buffering and persistence
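The interplay of buffer, translog, and segments described above can be modeled in a few lines. This is a conceptual sketch only: real ES segments are on-disk Lucene files and the translog is fsynced to disk, whereas plain Python lists stand in for both here.

```python
class TinyIndexWriter:
    """Toy model of the write path: every write goes to an in-memory
    buffer *and* an append-only translog; a refresh turns the buffer
    into an immutable searchable segment."""

    def __init__(self):
        self.buffer = []      # in-memory indexing buffer
        self.translog = []    # append-only log (fsynced to disk in real ES)
        self.segments = []    # immutable, searchable segments

    def index(self, doc):
        self.translog.append(doc)   # durability first
        self.buffer.append(doc)

    def refresh(self):
        """Make buffered docs searchable as a new segment (near real time)."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def recover(self):
        """After a crash, unflushed docs are replayed from the translog."""
        return list(self.translog)

w = TinyIndexWriter()
w.index({"id": 1}); w.index({"id": 2})
w.refresh()
# w.segments now holds one segment with both docs; the translog still
# holds them too, so a crash before the segment is fsynced loses nothing.
```

The design choice this illustrates: the buffer trades a short visibility delay (until refresh) for fast batched writes, while the translog guarantees durability across that window.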

3. Elasticsearch application scenarios

Typical usage scenarios of ES include log analysis, timing analysis, full-text search, etc.

3.1 Real-time log analysis scenario

Logs are a ubiquitous data format in the Internet industry. Typical logs include operational logs used to locate business problems, such as slow logs and exception logs; business logs used to analyze user behavior, such as click and access logs; and audit logs for security analysis.

The Elastic ecosystem provides a complete log solution: with a simple deployment you can build a full real-time log analysis service. This fit for real-time log analysis is an important reason for ES's rapid growth in recent years. Logs typically become searchable about 10 seconds after they are generated, which compares very favorably with the tens of minutes or hours of traditional big-data solutions. Because the ES storage layer supports data structures such as inverted indexes and columnar storage, its very flexible search and analysis capabilities apply to log scenarios as well; even with trillions of log entries, interactive log searches respond in seconds. The basic log processing flow is: log collection -> data cleaning -> storage -> visual analysis. The Elastic Stack covers this full pipeline.

Log analysis link

In this pipeline:

  • Log collection: the lightweight collector FileBeat reads business log files in real time and ships the data to downstream components such as Logstash.

  • Text parsing: regular expressions and similar mechanisms convert raw log text into structured data. Either a standalone Logstash service or Elasticsearch's built-in lightweight processing module, the ingest pipeline, can perform this cleaning and conversion.

  • Data storage: the Elasticsearch search and analysis platform persists the data and provides full-text search and analysis capabilities.

  • Visual analysis: a rich graphical interface for searching and analyzing log data, such as the visualization component Kibana.
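The text-parsing stage above can be sketched with a grok-style regular expression. Both the log format and the field names below are illustrative assumptions, not a real Logstash or ingest pipeline configuration; the point is only how one raw line becomes the structured document that gets indexed.

```python
import re

# A grok-style pattern for a hypothetical access-log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)" (?P<status>\d{3})'
)

def parse_line(line):
    """Convert one raw log line into a structured document (the role
    Logstash filters or an ingest pipeline processor would play)."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None   # unparseable lines would go to a dead-letter path
    doc = m.groupdict()
    doc["status"] = int(doc["status"])   # type coercion, like a convert processor
    return doc

doc = parse_line('10.0.0.1 [2020-08-28T10:00:00Z] "GET /search" 200')
# doc == {"ip": "10.0.0.1", "ts": "2020-08-28T10:00:00Z",
#         "method": "GET", "path": "/search", "status": 200}
```

Once structured like this, fields such as `status` can be aggregated and filtered in ES rather than grepped as flat text.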

Visualization component Kibana

3.2 Time series analysis scenario

Time series data records the state changes of devices and systems in chronological order. Typical time series data includes traditional server monitoring metrics, application performance monitoring data, and sensor data from smart hardware and the industrial IoT. We began exploring time series analysis scenarios on ES as early as 2017. These scenarios feature high-concurrency writes, low query latency, and multi-dimensional analysis. Because ES offers cluster scaling, batch writes, routed reads and writes, and data sharding, our largest single online cluster has reached 600+ nodes with a write throughput of 10 million documents per second, while query latency for a single metric curve or timeline is kept around 10 ms. ES also provides flexible multi-dimensional statistical analysis, so monitoring views can be sliced flexibly by region and business module. In addition, ES supports columnar storage with high compression ratios and on-demand adjustment of the replica count, enabling lower storage costs. Finally, time series data is easily visualized with the Kibana components.

Time series data visualization panel

3.3 Search service scenario

Typical search service scenarios include product search on JD.com, Pinduoduo, and Mogujie; app search in application stores; and site search in forums and online documentation. In these scenarios users care about high performance, low latency, high reliability, and search quality. For example, a single service may need to sustain peaks of 100,000+ QPS with an average response time under 20 ms and latency spikes below 100 ms. Search scenarios usually demand four-nines (99.99%) availability and tolerance of a single data center failure; the Elasticsearch service on Tencent Cloud already supports multi-availability-zone disaster recovery and failure recovery within minutes. ES's efficient inverted index, custom scoring and sorting capabilities, and rich word-segmentation plug-ins satisfy full-text search requirements. In the open source full-text retrieval field, ES has ranked first in the DB-Engines search engine category for many years.

DB-Engines search engine ranking

4. Tencent Elasticsearch Service

There are many real-time log analysis, time series analysis, and full-text search scenarios both inside and outside the company. We have partnered with Elastic to provide a kernel-enhanced ES cloud service on Tencent Cloud, abbreviated CES. The kernel enhancements include the X-Pack commercial suite and our own kernel optimizations. While serving internal teams and public cloud customers, we have encountered many problems and challenges, such as ultra-large clusters, writes of tens of millions of documents per second, and the rich variety of usage scenarios on the cloud. The following sections introduce our kernel-level optimizations for availability, performance, and cost.

4.1 Availability optimization

Availability problems fall into three areas. First, the ES kernel lacks robustness, a problem common to distributed systems. For example, abnormal queries or overload can easily cause cluster avalanches, and cluster scalability is insufficient: once the number of shards in a cluster exceeds 100,000, metadata management becomes an obvious bottleneck. During cluster expansion, or when a node rejoins the cluster after a failure, data becomes unevenly distributed across nodes and across multiple disks. Second, in terms of disaster tolerance, services must recover quickly from data center network failures, data loss must be prevented under natural disasters, and data must be quickly restorable after operator errors; these are reliability and data security issues. Third, we have found ES system defects in operation, such as master node blockage, distributed deadlocks, and slow rolling restarts.

To address these problems: for system robustness, we use service throttling and tolerance of network faults to keep the service stable under abnormal conditions. By optimizing the cluster metadata management logic, we improved cluster scalability by an order of magnitude, supporting clusters with thousands of nodes and millions of shards. For cluster balance, we optimized shard balancing across nodes and across multiple disks to keep pressure even in large clusters.

For disaster recovery, we extended the ES plug-in mechanism to implement backup and restore, so ES data can be backed up to COS to ensure data safety. We built a management and control system that supports cross-availability-zone disaster recovery: users can deploy across as many availability zones as needed to tolerate the failure of a single data center. A recycle-bin mechanism ensures cluster data can be quickly restored after scenarios such as overdue payment or operator error. As for system defects, we fixed a series of bugs involving rolling restarts, master blockage, and distributed deadlocks; the rolling restart optimization alone speeds up node restarts by more than 5x. We worked with the Elastic team to fix the master blockage problem in the ES 6.x versions.

4.2 Performance optimization

On performance: time series scenarios represented by logs and monitoring have very high write-performance requirements, with write concurrency reaching 10 million documents per second. However, we found that ES write performance drops by more than half when writing with explicit primary keys, and that in stress tests the CPU could not be fully utilized. Search services, meanwhile, have very high query-performance requirements, typically 200,000 QPS with average response times under 20 ms, and must avoid query latency spikes caused by GC pauses or poor execution plans.

To solve these problems: on the write side, for the primary-key deduplication scenario we use the maximum and minimum key values recorded in each segment file to prune the lookup, accelerating primary-key deduplication and improving write performance by 45%; see LUCENE-8980 for details. For the CPU under-utilization found in stress tests, we optimized the lock granularity when ES refreshes the translog to avoid resource contention, improving performance by 20%; see ES-45765 / 47790 for details. We are also exploring vectorized execution to optimize write performance: by reducing branch mispredictions and instruction misses, we expect write performance to double.
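The min/max pruning idea behind that primary-key optimization can be sketched simply. This is a toy illustration of the LUCENE-8980 concept, not the actual patch: each segment records the min and max of its ID field, so a primary-key lookup only needs to search segments whose range could contain the key.

```python
def prune_segments(segments, doc_id):
    """Skip segments whose [min_id, max_id] range cannot contain doc_id.

    Toy version of min/max segment pruning: only the surviving segments
    need to be searched for the primary key, so deduplication on write
    touches far fewer segments."""
    return [s for s in segments if s["min_id"] <= doc_id <= s["max_id"]]

# Hypothetical segments with time-ordered, range-disjoint IDs.
segments = [
    {"name": "seg_0", "min_id": "a000", "max_id": "f999"},
    {"name": "seg_1", "min_id": "g000", "max_id": "p999"},
    {"name": "seg_2", "min_id": "q000", "max_id": "z999"},
]
hits = prune_segments(segments, "h123")
# only seg_1 needs to be searched
```

The speedup is largest when IDs are roughly ordered (e.g. time-based), because then most segments' ranges are disjoint from any given key, as in this example.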

On the query side, we optimized the segment merge policy to automatically trigger merging of inactive segment files, converging the segment count to reduce resource overhead and improve query performance. Query pruning based on the maximum and minimum values recorded in each segment improves query performance by 40%. A cost-based optimization (CBO) strategy avoids cache operations whose cache-building overhead is large, which had caused query latency spikes of 10x or more; see LUCENE-9002 for details. We also fixed performance problems in composite aggregation, implementing true pagination and improving sorted-aggregation scenarios by 3-7x. In addition, we are exploring new hardware, such as Intel's AEP, Optane, and QAT, to further optimize performance.

4.3 Cost optimization

Cost is mainly a matter of machine resource consumption in time series scenarios represented by logs and monitoring. Statistics from typical online log and time series workloads show that the cost ratio of disk, memory, and compute resources is close to 8:4:1: disk and memory are the main contradiction, with compute cost secondary. Such time series workloads also show a clear access pattern: the data is hot or cold by age, with recent data accessed far more than old data. For example, the last 7 days of data can account for more than 95% of accesses, while historical data is accessed rarely and usually only for statistics.

For disk cost, because the data has clear hot and cold characteristics, we adopt a hot-cold separation architecture with hybrid storage to balance cost and performance. Since historical data is usually accessed only for statistics, we use precomputed Rollups, similar to materialized views, to trade storage for query performance. Historical data that is not used at all can be backed up to cheaper storage such as COS. Other optimizations include a multi-disk strategy that balances data throughput against data disaster recovery, and lifecycle management that deletes expired data on schedule.

For memory cost, we found that on large-storage machine types in particular, memory runs out when only 20% of the storage has been used. To solve this, we use off-heap techniques to improve heap memory utilization, reduce GC overhead, and increase the amount of disk a single node can manage. We moved the FST, which accounts for a large share of heap memory, off-heap, storing only the off-heap object addresses on the heap to avoid copying data between heap and off-heap memory. Off-heap objects are reclaimed through Java's weak reference mechanism, further improving memory utilization. With these changes, 32 GB of heap can manage about 50 TB of disk, roughly 10x the native version, with the same performance and significantly reduced GC overhead.

Beyond kernel-level optimization, the management and control system at the platform layer supports cloud resource management and instance management for service hosting, making instance creation and specification changes quick and convenient. Service quality is guaranteed by the monitoring system and operational tools of the operations support platform, and an intelligent diagnosis platform under construction surfaces potential service problems, so that stable and reliable ES services can be provided internally and externally.

Within Tencent, we lead open source collaboration on ES to discover potential problems and jointly improve ES, so that different teams avoid repeating the same pitfalls. We also actively contribute good solutions back to the community and push ES forward together with Elastic and community ES enthusiasts. The team, led by Tencent's ES kernel R&D, has submitted more than 60 PRs so far, about 70% of which have been merged, and the company's ES open source collaboration PMC includes 6 ES/Lucene community contributors.

5. Postscript

Elasticsearch is widely used inside and outside Tencent for real-time log analysis, time series analysis, and full-text search. Single clusters have reached the scale of a thousand nodes with trillion-level throughput, and our kernel-enhanced ES provides highly reliable, low-cost, high-performance search and analysis services. Going forward, we will continue optimizing ES for availability, performance, and cost. For the cluster scalability problem, we aim to support millions of shards per cluster and second-level index creation. For storage cost, we are developing a storage-compute separation solution to further reduce cost and improve performance. For the high cost of use and maintenance, we will add multi-level partitioning, intelligent diagnosis, and similar features to improve ES's automation and fault self-healing, reducing users' operational burden. We will also further explore the possibilities of ES in multidimensional analysis and continue to provide more valuable search and analysis services in the big data field.

Learn more

Readers interested in ES kernel technology are welcome to scan the QR code below to discuss and exchange ideas with us, and everyone is welcome to try the Elasticsearch service on Tencent Cloud.

Origin blog.csdn.net/Tencent_TEG/article/details/108289003