Problems in managing and maintaining large ElasticSearch clusters, and the GProxy solution


Preface


The user search component and the log management platform are important parts of Getui's push service, and ElasticSearch (ES for short), an open-source distributed search engine, meets these requirements well. After many years of using and iterating on ES, Getui has accumulated rich experience, especially in how to manage clusters, keep them stable, and optimize their performance as the data volume keeps growing.


This article describes the evolution of Getui's ElasticSearch architecture in three parts: the challenges of large clusters, how GProxy supports multiple clusters, and how the system runs today.




Authors | Xiaoyao, R&D Director of Itweet Platform, and Nanyang, Senior Java Engineer



01

Overview of Getui's ES Service


Getui mainly uses ES to add, delete, modify, and query user information, and to store logs. It is one of the core services supporting push delivery, and its characteristics are:

● Fast data updates: user profiles and other information must be updated in real time

● Complex query conditions: searches must support combined conditions across multiple dimensions, such as intersections and unions

● Large query volumes: a single push task may need to query tens of millions of documents


There are three main reasons why we chose ES:

● Near real-time indexing: a newly written document can become searchable in as little as 1s

● Support for queries with complex conditions, which satisfies our various search requirements

● Distributed features that meet our requirements for horizontal scaling and high availability

In addition, the official ES team and the community are very active, there is a rich ecosystem around ES, and most problems we encounter can be solved well.




02

Challenges of large clusters

Evolution of Getui's clusters

[Figure: evolution of Getui's ES clusters]


The figure above shows the evolution of Getui's ES clusters. Getui adopted ES early, starting with version 0.20.x, when the cluster was still small. Later, in order to run scroll queries without returning the source, we separated the index from its source and upgraded to version 0.90.x. After the official 1.2.0 release added support for queries without source, we merged the clusters. Subsequently, in order to use some of the new official features, we upgraded to the 1.4.x and 1.5.x versions.


By this point the cluster was already very large, each version upgrade was inconvenient, and the subsequent new features were not very attractive to us, so we did not upgrade for a long time and skipped the official 2.x line. However, because of the cluster's scale, many hard-to-solve problems began to appear. The cluster needed to be split and upgraded, and the official version had moved on to 5.x, which does not support a direct upgrade from 1.x. We therefore considered using a data gateway to solve the difficulties of upgrading and restarting, and the cluster evolved into the final architecture shown above.


Problems with large clusters

Large clusters bring many problems. The following are some of the more difficult ones we encountered in practice.


● The first problem is that JVM memory tends to stay high.

ES memory usage consists of several parts: Segment Memory, the Filter Cache, the Field Data Cache, the Bulk Queue, the Indexing Buffer, and the Cluster State Buffer. Segment Memory grows along with the segment files, so in a large cluster this memory usage cannot be avoided.


The Filter Cache and Field Data Cache are not used in many of our scenarios, so we disable them through parameter configuration. The Bulk Queue and Indexing Buffer are relatively fixed in size; their memory does not keep growing, and no parameter tuning is needed. The Cluster State Buffer was the tricky part at the time. We set up the mapping by placing a default_mapping.json file in the config directory; this file matches every written document and is parsed for its format, after which ES caches a copy of the ParserContext in memory. As the number of types keeps increasing, this memory usage keeps growing until memory is exhausted; it stops growing only if no documents with new types are indexed.


When high memory usage occurred, we analyzed the heap dump, as shown in the figure below, and found that it was caused by the ParserContext. There was no good way to solve this completely at the time; the only temporary relief was to clear the ParserContext by restarting. However, as documents continued to be written, this memory usage would grow again.


[Figure: heap dump analysis showing ParserContext memory usage]

● The second problem is oversized shards and oversized segments.

By default, we use the docid as the routing value and hash documents onto different shards. Shard sizes are relatively uniform when the document count is small, but once the document count rises beyond a certain point, the size gap between shards grows. When the average shard size is 100 GB, the gap can reach up to 20 GB. At that point, many segments also reach the maximum segment size threshold, which makes segment merging very time-consuming.
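To make the routing scheme concrete: by default ES picks the target shard by hashing the routing value (the docid here) modulo the number of shards. The snippet below is a minimal, illustrative Go sketch of that idea; it uses FNV instead of the murmur3 hash ES actually uses, and it is not Getui's code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor mimics ES-style routing: shard = hash(routing) % numberOfShards.
// ES itself uses murmur3; FNV is used here only to keep the sketch dependency-free.
func shardFor(routing string, numberOfShards int) int {
	h := fnv.New32a()
	h.Write([]byte(routing))
	return int(h.Sum32() % uint32(numberOfShards))
}

func main() {
	for _, docID := range []string{"doc-1", "doc-2", "doc-3"} {
		fmt.Printf("%s -> shard %d\n", docID, shardFor(docID, 5))
	}
}
```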


● The third problem is high disk IO.

We often use scroll queries to retrieve documents. Large segment files reduce the efficiency of searching on disk. In addition, most of each machine's memory is occupied by the ES node's JVM, leaving little memory for the file system to cache segment file pages. As a result, scroll queries read directly from disk and saturate the IO. From our monitoring, the cluster's IO was essentially saturated at all times.

[Figure: cluster disk IO monitoring]


In addition, there are many hidden dangers.


The first is a scaling bottleneck. The preset number of shards in the cluster was originally sufficient, but after multiple expansions the number of instances equaled the number of shards, so adding more instances no longer caused the cluster to allocate shards to the new nodes. Second, the disk space of the original machines gradually became insufficient. The default ES disk watermark is 85%; once it is reached, shards start relocating unpredictably between nodes and the cluster is difficult to bring back to a stable state.


The cluster is also hard to adjust: restarting and recovery are very slow, and rebuilding the index for an upgrade is also very slow.


Finally, cluster robustness suffers. As the number of documents grows, the pressure increases and failures become more likely. When an instance fails, its load is redistributed to other nodes, and the business side notices the impact.





03

GProxy's solution


For the problems described above, we wanted a solution that provides the following capabilities:

1. Smooth cluster version upgrades that do not affect business usage during the upgrade

2. Splitting the large cluster into smaller clusters, so that business traffic is divided and the pressure on each cluster is reduced

3. Hot backup of cluster data across multiple IDCs, providing data-layer support for active-active deployment across regions


In the previous architecture, however, business services accessed the ES cluster directly and were tightly coupled to it, which made these requirements hard to achieve. We therefore chose a proxy-based architecture: by adding a middle proxy layer, the storage clusters are isolated from the business services, which supports more flexible operation and maintenance of the storage clusters.

[Figure: GProxy overall architecture]


The figure above shows the overall architecture of GProxy, which has three tiers:

● The top tier is the business layer, which only interacts with the proxy and discovers proxy instances through etcd

● In the middle is the GProxy layer, which provides request forwarding and cluster management

● At the bottom are multiple ES clusters


GProxy contains several components:

● etcd: a highly available store for metadata, including routing rules, cluster information, proxy service addresses, migration tasks, etc.

● SDK: extends the native ES SDK, using its sniffer mechanism to encapsulate functions such as proxy service discovery and circuit breaking

● Proxy: a lightweight proxy service that is easy to scale out; each instance registers its address in etcd after startup

● Dashboard: the management service for the whole system, providing a web interface that makes it easy for operations staff to manage and monitor the clusters

● Migrate service: provides data migration between different clusters


Service discovery and routing rules


With the above overall architecture, there are two more problems to be solved:

1. How business services discover the proxy, i.e., the service discovery problem

2. Which cluster the proxy should forward a request to, i.e., the routing problem



Service discovery

etcd is a highly available distributed key-value store that is accessed through an HTTP API and is easy to operate, so we chose it for service discovery and metadata storage.

[Figure: proxy service discovery via etcd]



The proxy is a stateless service. After startup and initialization, it registers its address in etcd. Through etcd's lease mechanism, the system can monitor whether each proxy is alive: when a proxy misbehaves and fails to renew its lease on time, etcd removes its registration so that it does not affect normal business requests.
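A minimal sketch of this register-with-lease pattern, using etcd's official Go client (go.etcd.io/etcd/client/v3), is shown below. The key prefix /gproxy/proxies, the address, and the 10-second TTL are illustrative assumptions rather than GProxy's actual values.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Grant a lease; if the proxy stops renewing it, etcd expires the key
	// and the proxy disappears from the service list automatically.
	lease, err := cli.Grant(context.Background(), 10) // 10s TTL (illustrative)
	if err != nil {
		log.Fatal(err)
	}

	// Register this proxy's address under the lease.
	_, err = cli.Put(context.Background(), "/gproxy/proxies/10.0.0.1:9200",
		"10.0.0.1:9200", clientv3.WithLease(lease.ID))
	if err != nil {
		log.Fatal(err)
	}

	// Keep the lease alive in the background for as long as the proxy is healthy.
	ch, err := cli.KeepAlive(context.Background(), lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for range ch {
		// Each message is a successful lease renewal; drain the channel.
	}
}
```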


The SDK provided by ElasticSearch exposes a sniffer interface through which it can obtain back-end addresses. We implemented this interface so that it periodically fetches the proxy list from etcd, watches for proxies going online or offline through etcd's watch mechanism, and updates the internal connection list in time. The business side can keep using the native SDK as before, with minimal changes: it only needs to inject our sniffer into the SDK.
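The client-side half of this pattern, fetching the current proxy list and then following changes with an etcd watch, might look roughly like the sketch below. It reuses the illustrative /gproxy/proxies prefix from the previous sketch and is not Getui's SDK code.

```go
package discovery

import (
	"context"
	"sync"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// ProxyList keeps an up-to-date set of proxy addresses read from etcd.
type ProxyList struct {
	mu    sync.RWMutex
	addrs map[string]string
}

// Addresses returns a snapshot of the currently known proxy addresses.
func (p *ProxyList) Addresses() []string {
	p.mu.RLock()
	defer p.mu.RUnlock()
	out := make([]string, 0, len(p.addrs))
	for _, a := range p.addrs {
		out = append(out, a)
	}
	return out
}

// Watch seeds the list with the current keys and then applies watch events.
func (p *ProxyList) Watch(ctx context.Context, cli *clientv3.Client, prefix string) error {
	p.mu.Lock()
	p.addrs = make(map[string]string)
	p.mu.Unlock()

	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
	if err != nil {
		return err
	}
	p.mu.Lock()
	for _, kv := range resp.Kvs {
		p.addrs[string(kv.Key)] = string(kv.Value)
	}
	p.mu.Unlock()

	// Follow changes after the snapshot revision so no update is missed.
	wch := cli.Watch(ctx, prefix, clientv3.WithPrefix(),
		clientv3.WithRev(resp.Header.Revision+1))
	for wresp := range wch {
		p.mu.Lock()
		for _, ev := range wresp.Events {
			switch ev.Type {
			case clientv3.EventTypePut: // proxy came online or refreshed
				p.addrs[string(ev.Kv.Key)] = string(ev.Kv.Value)
			case clientv3.EventTypeDelete: // lease expired, proxy removed
				delete(p.addrs, string(ev.Kv.Key))
			}
		}
		p.mu.Unlock()
	}
	return ctx.Err()
}
```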


Routing rules

In Getui's push business, the data needed to push to a given app can be regarded as a unit, so we route requests by app: all the data required for an app's pushes is stored in a single cluster.


The routing information is stored in etcd as appid -> clusterName mappings. If no mapping exists for an appid, the proxy assigns it to a default cluster.

When the proxy starts, it pulls the latest routing table and then watches etcd for routing table changes.
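A simplified sketch of the in-memory routing table implied by this description follows; the type and method names are invented for illustration, and the actual GProxy implementation may differ.

```go
package routing

import "sync"

// Table maps an appid to the name of the ES cluster that holds its data.
type Table struct {
	mu             sync.RWMutex
	appToCluster   map[string]string
	defaultCluster string
}

func NewTable(defaultCluster string) *Table {
	return &Table{
		appToCluster:   make(map[string]string),
		defaultCluster: defaultCluster,
	}
}

// Update replaces or adds a single appid -> clusterName mapping,
// e.g. when an etcd watch event reports a routing change.
func (t *Table) Update(appid, clusterName string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.appToCluster[appid] = clusterName
}

// ClusterFor returns the cluster for an appid, falling back to the
// default cluster when no explicit mapping exists.
func (t *Table) ClusterFor(appid string) string {
	t.mu.RLock()
	defer t.mu.RUnlock()
	if c, ok := t.appToCluster[appid]; ok {
		return c
	}
	return t.defaultCluster
}
```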

[Figure: app-based routing]


Routing relationships are changed through migration operations. The migration process is described below.


Migration process

Each app belongs to one cluster. When cluster load becomes unbalanced, the administrator can use the migration service to migrate data between clusters at the app level.


[Figure: migration process]

The migration process consists of two steps: data synchronization and routing rule modification. Data synchronization covers two kinds of data: the full data and the incremental data.

1. The full data is exported through ElasticSearch's scroll API (see the sketch after this list)

2. Because ElasticSearch does not provide a way to capture incremental data (the way MySQL's binlog protocol does), we use double writes in the proxy to capture it
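As a rough illustration of step 1, the sketch below pages through an index with the scroll API over plain HTTP (ES 6.x-style endpoints). The ES address, the index name my_index, and the batch size are placeholders, and a real migration would bulk-write each batch into the target cluster instead of merely counting documents.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type scrollResp struct {
	ScrollID string `json:"_scroll_id"`
	Hits     struct {
		Hits []json.RawMessage `json:"hits"`
	} `json:"hits"`
}

// post sends a JSON body and decodes the scroll-style response.
func post(url string, body any) (*scrollResp, error) {
	buf, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(buf))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out scrollResp
	return &out, json.NewDecoder(resp.Body).Decode(&out)
}

func main() {
	es := "http://127.0.0.1:9200" // illustrative address

	// Open the scroll with a match_all query, 1000 docs per batch.
	page, err := post(es+"/my_index/_search?scroll=1m",
		map[string]any{"size": 1000, "query": map[string]any{"match_all": map[string]any{}}})
	if err != nil {
		log.Fatal(err)
	}

	total := 0
	for len(page.Hits.Hits) > 0 {
		total += len(page.Hits.Hits)
		// In a real migration each hit would be bulk-indexed into the target cluster here.
		page, err = post(es+"/_search/scroll",
			map[string]any{"scroll": "1m", "scroll_id": page.ScrollID})
		if err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("exported documents:", total)
}
```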


The migration service is responsible for data synchronization and notifies the dashboard once it is complete; the dashboard then updates the routing relationship in etcd. The proxy picks up the new routing relationship through the watch mechanism and updates its internal routing table, after which new requests for that app are routed to the new cluster.



Multi-IDC data hot backup


Getui's push is an enterprise-level service and requires high availability. Getui operates multiple data centers (IDCs) to serve external traffic, and each app belongs to one data center. To handle data-center-level failures, we need multi-IDC hot backup of data, so that when a data center fails, customer requests can be routed to a healthy one without affecting customers' normal use.


[Figure: multi-IDC data hot backup]

Hot backup is done at the cluster level: each cluster's data is backed up to another data center. After receiving a write request, the proxy writes the incremental data to an MQ in real time according to the cluster's hot-standby configuration, and a consumer service in the other data center continuously consumes the incremental data from the MQ and writes it to the corresponding cluster. The dashboard service controls the status of all IDC hot backup tasks.
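The proxy-side double write described above could be organized roughly as in the sketch below. The Producer interface and the cluster-to-topic mapping are hypothetical stand-ins for whatever MQ client and hot-standby configuration GProxy actually uses.

```go
package hotbackup

import "context"

// Producer is a hypothetical MQ client interface (e.g. backed by Kafka).
type Producer interface {
	Publish(ctx context.Context, topic string, payload []byte) error
}

// Backup forwards write requests for hot-standby clusters to the MQ so the
// remote IDC's consumer can replay them into its own copy of the cluster.
type Backup struct {
	producer Producer
	// clusterTopic maps a cluster name to its replication topic; clusters
	// without an entry have no hot standby configured.
	clusterTopic map[string]string
}

// OnWrite is called by the proxy after a write has been accepted by the
// local cluster. MQ errors are returned to the caller for retry or alerting.
func (b *Backup) OnWrite(ctx context.Context, cluster string, rawRequest []byte) error {
	topic, ok := b.clusterTopic[cluster]
	if !ok {
		return nil // no hot standby for this cluster
	}
	return b.producer.Publish(ctx, topic, rawRequest)
}
```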


Performance

Introducing an intermediate layer inevitably brings some performance loss. We chose to develop GProxy in Go precisely to minimize that loss. The final performance results are as follows:

[Figure: performance comparison with and without the proxy]

As the figure shows, QPS drops by about 10%, and the average latency is approximately the sum of the ES call latency and the proxy's own latency. Although there is roughly a 10% performance degradation, it buys much more flexible operation and maintenance capabilities.


Current operation

After GProxy went online, the ES version upgrade (from 1.5 to 6.4) was completed successfully, and the original large cluster was split into multiple smaller clusters. The entire upgrade and split was transparent to the business side, and the lossless rollback provided by GProxy made the operation much safer (data migration must be handled very cautiously).


With GProxy's support, the DBAs' daily ES operation and maintenance tasks, such as parameter tuning and balancing load between clusters, have become much more convenient.





Conclusion

By developing GProxy in-house using Go, Getui successfully solved the problems of its large ElasticSearch cluster and now provides stable, reliable data storage services to the upper-level business. Getui will continue to refine its technology, keep exploring the fields of search and data storage, expand the application scenarios of ElasticSearch, and share its latest practices on keeping data storage highly available with developers.






Origin: blog.51cto.com/13031991/2540743