The evolution of an ES cluster serving 500 million order queries per day

In JD Daojia's order center system, order queries come in at very high volume, both from external merchants producing orders and from internal upstream and downstream systems, so order data is read far more often than it is written.

 

We store order data in MySQL, but serving such a large query volume from the database alone is clearly not viable, and MySQL handles some of the more complex queries poorly. The order center system therefore uses Elasticsearch to carry the bulk of the order query load.

 

 

As a powerful distributed search engine, Elasticsearch supports near-real-time storage and search of data, and it plays a major role in JD Daojia's order system. The order center's ES cluster currently stores one billion documents and serves an average of 500 million queries per day.

 

With the rapid growth of JD Daojia's business in recent years, the order center's ES deployment scheme has evolved continuously, arriving at today's pair of clusters that back each other up in real time, which has kept ES reads and writes stable. Below I will walk through this evolution and some of the pitfalls we encountered along the way.

 

The Evolution of the ES Cluster Architecture

 

1. Initial stage

 

In the beginning, the order center's ES was like a blank sheet of paper: there was essentially no deployment plan, and most settings were left at the cluster defaults. The whole cluster was deployed on the group's elastic cloud, with ES nodes scattered chaotically across machines. At the cluster level, a single ES cluster is also a single point of failure, which is clearly unacceptable for the order center business.

 

2. Cluster isolation stage

 

Like many other businesses, we initially ran the ES cluster co-located with other workloads. But because the order center ES stores live order data, a co-located workload could occasionally seize large amounts of system resources and take down the entire order center ES service.

 

Clearly, anything that threatens the stability of order queries cannot be tolerated. So we first moved the order center ES onto dedicated nodes within the elastic cloud, which improved things somewhat. But as the data kept growing, the elastic cloud configuration could no longer keep up with the ES cluster's needs, so for complete physical isolation we finally deployed the order center ES cluster on high-spec physical machines, which improved its performance.

 

3. Node and replica tuning stage

 

ES performance depends heavily on hardware resources. Even with the cluster deployed on its own physical machines, individual nodes did not have a whole machine to themselves, so nodes sharing a machine still competed for resources at runtime. To let each ES node use a machine's full resources, we deployed every node on its own dedicated physical machine.

 

But then the next question arose: what if a single node becomes a bottleneck? How should we optimize further?

 

The way ES queries work, when a request targets a given shard number and no shard preference (the preference parameter) is specified, the request is load-balanced across all copies of that shard. The cluster's default replica configuration is one primary and one replica per shard. To address the bottleneck, we expanded the replicas, going from one primary and one replica to one primary and two replicas, and added the corresponding physical machines.

 

Schematic diagram of order center ES cluster setup

 

As shown in the figure, the whole deployment uses a VIP to load-balance external requests:

 

The cluster holds one set of primary shards and two sets of replica shards (one primary, two replicas per shard), and requests forwarded from the gateway (coordinating) nodes are balanced by round-robin before reaching the data nodes. Adding a set of replicas, together with the extra machines, raised the cluster's throughput and thus the query performance of the whole cluster.
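The effect of the extra replica set can be sketched as round-robin selection over the copies of a shard (an illustration only, with hypothetical names, not the order center's actual code): with one primary and two replicas, each copy sees roughly one third of the read load for that shard.

```python
from itertools import cycle

class ShardGroup:
    """All copies of one shard number: 1 primary + N replicas."""
    def __init__(self, shard_no, copies):
        self.shard_no = shard_no
        self._rr = cycle(copies)  # round-robin over the copies, as the coordinating node does

    def pick_copy(self):
        return next(self._rr)

# One primary and two replicas for shard 0, as in the deployment above.
group = ShardGroup(0, ["primary", "replica-1", "replica-2"])
print([group.pick_copy() for _ in range(6)])
# each copy serves one third of the reads for this shard
```

Going from one replica to two thus cuts the per-copy read load by a third, at the cost of one more full copy of the data.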

 

The following figure shows the performance of the order center ES cluster at each stage, making clear the significant improvement after each round of optimization:

 

 

Of course, more shards and more replicas are not always better. At this stage we went further and explored how to choose an appropriate number of shards. The shard count can be understood by analogy with sharding databases and tables in MySQL, and order center ES queries currently fall into two main types: single-ID queries and paged queries.

 

 

The more shards there are, the further the cluster can scale horizontally, and the throughput of single-ID queries based on shard routing rises greatly, but the performance of aggregated paged queries falls. The fewer shards there are, the less horizontal scaling headroom the cluster has and the lower single-ID query throughput becomes, but paged query performance improves.
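This tradeoff can be made concrete with ES's documented routing rule, shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the document _id. A sketch (ES actually uses murmur3; crc32 here is just a deterministic stand-in):

```python
import zlib

NUM_PRIMARY_SHARDS = 6  # fixed when the index is created

def route(order_id: str) -> int:
    # ES: shard = hash(_routing) % number_of_primary_shards (murmur3 in reality)
    return zlib.crc32(order_id.encode()) % NUM_PRIMARY_SHARDS

# A single-ID query is routed to exactly one shard...
shards_for_single_id = {route("order-10086")}
# ...while a paged query with no routing value must fan out to every shard.
shards_for_paging = set(range(NUM_PRIMARY_SHARDS))

print(len(shards_for_single_id), len(shards_for_paging))
```

So adding shards spreads routed single-ID lookups over more machines, but every paged query pays a merge cost proportional to the shard count.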

 

To balance the shard count against our existing query workload, we ran many rounds of tuning and load testing, and finally chose the shard count that gave the best overall cluster performance.

 

4. Primary/standby cluster stage

 

By this point the order center's ES cluster had begun to take shape. But because the order center business has strict timeliness requirements, it also demands very stable ES queries: if any node in the cluster misbehaves, the query service suffers, which affects the whole order production flow. Such failures are clearly fatal, so our first idea for handling them was to add a standby cluster, so that when the primary cluster fails, query traffic can be downgraded to the standby in real time.

 

How should the standby cluster be built? How should data be kept in sync between primary and standby? What data should the standby cluster store?

 

Since ES offered no good primary/standby solution at the time, and to keep tighter control over ES writes, we built the primary/standby pair by double-writing in the business layer: whenever a business operation needs to write ES data, it writes the primary cluster synchronously and then the standby cluster asynchronously. Meanwhile, most ES query traffic targets orders from the last few days, and the order center database already has an archiving mechanism that moves orders closed more than a set number of days ago into a historical order database.

 

We therefore added standby-cluster document deletion to that archiving mechanism, so the newly built standby cluster holds the same set of orders as the order center's online database. We also used ZooKeeper to build a traffic switch into the query service, guaranteeing that query traffic can be downgraded to the standby cluster in real time. With this, the order center's primary/standby clusters were in place and the stability of the ES query service improved greatly.

 

 

5. Today: the real-time mutual-backup dual-cluster stage

 

During this period, the primary cluster was still on ES 1.7 while the current stable release had reached 6.x. The newer versions not only greatly improve performance but also add useful new features, so we upgraded the primary cluster directly from 1.7 to 6.x.

 

The cluster upgrade was tedious and lengthy: it had to be smooth and imperceptible, with no impact on online business. And because ES cannot migrate data across that many major versions (from 1.7 to 6.x), the indexes had to be rebuilt to upgrade the primary cluster; the detailed upgrade steps are not repeated here.

 

The primary cluster is inevitably unavailable during an upgrade, which the order center's ES query service cannot tolerate. So for the duration of the upgrade, the standby cluster temporarily acted as the primary and carried all online ES queries, ensuring the upgrade did not affect normal online service. At the same time, we re-planned the roles of the two clusters for the online business and re-divided the query traffic they carry.

 

The standby cluster stores only recent hot data, a data set much smaller than the primary's, roughly one-tenth of the primary cluster's document count. With so much less data, the standby cluster outperforms the primary at the same deployment scale.

 

In real online scenarios, most query traffic indeed targets hot data, so the standby cluster took over these hot-data queries and gradually evolved into the hot-data cluster. The former primary cluster, which stores the full data set, carries the remaining small share of traffic, mainly special scenarios that must search across all orders plus internal queries from the order center system, and it slowly evolved into the cold-data cluster.

 

At the same time, the standby cluster gained a one-click downgrade to the primary cluster. The two clusters are now equally important, and either can be downgraded to the other. The double-write strategy was optimized accordingly: given clusters A and B, the normal mode writes the primary (cluster A) synchronously and the standby (cluster B) asynchronously; when cluster A fails, writes go to cluster B (now primary) synchronously and to cluster A (now standby) asynchronously.
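The role-swapping double-write can be sketched as follows (a minimal illustration with made-up class names; the real standby write is asynchronous, and here it is merely best-effort):

```python
import logging

class FakeCluster:
    """Stand-in for an ES cluster, with only the behavior the sketch needs."""
    def __init__(self, healthy=True):
        self.healthy, self.docs = healthy, {}

    def index(self, doc_id, doc):
        if not self.healthy:
            raise ConnectionError("cluster down")
        self.docs[doc_id] = doc

class DualWriter:
    """Write the current primary synchronously, the standby best-effort.

    A failed primary write flips the roles, matching the strategy described
    above; in production the standby write would be asynchronous."""
    def __init__(self, cluster_a, cluster_b):
        self.primary, self.standby = cluster_a, cluster_b

    def write(self, doc_id, doc):
        try:
            self.primary.index(doc_id, doc)            # synchronous: must succeed
        except ConnectionError:
            logging.warning("primary failed, swapping roles")
            self.primary, self.standby = self.standby, self.primary
            self.primary.index(doc_id, doc)            # retry on the new primary
        try:
            self.standby.index(doc_id, doc)            # best-effort (async in production)
        except ConnectionError:
            logging.warning("standby write failed; repaired later by compensation")

a, b = FakeCluster(), FakeCluster()
writer = DualWriter(a, b)
writer.write("order-1", {"status": "paid"})   # lands in both clusters
```

If cluster A is down, the first write raises, the roles swap, and subsequent writes treat B as the synchronous primary, exactly the AB behavior described above.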

 

 

The Synchronization Scheme for ES Order Data

 

Synchronizing MySQL data to ES can be done in roughly two ways:

 

  • Solution 1: Listen to MySQL's binlog, parse it, and synchronize the data to the ES cluster.

  • Solution 2: Write data to the ES cluster directly through the ES API.

 

Given the business characteristics of the order system's ES service, order data must be highly real-time. Listening to the binlog is effectively asynchronous replication and can introduce significant delay. Solution 1 is otherwise essentially similar to Solution 2, but it introduces a new system and raises maintenance cost. The order center ES therefore writes order data directly through the ES API; this approach is simple and flexible and meets the order center's need to synchronize data to ES well.

 

Since ES order data is written from within the business code, retrying inline whenever a document create or update fails would inevitably hurt the response time of the normal business operation.

 

Therefore each business operation updates ES only once. If an error or exception occurs, a remedial task is inserted into the database, and a worker task scans these records in real time and re-updates the ES data from the database's order data. This compensation mechanism guarantees eventual consistency between the ES data and the database order data.
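The write-once-then-compensate flow can be sketched like this (a toy illustration; the list stands in for the remedial-task table, and the FlakyES class and function names are made up for the example):

```python
remedial_tasks = []   # stands in for the remedial-task table in the database

class FlakyES:
    """Stand-in ES client that fails its first few index calls."""
    def __init__(self, failures=1):
        self.failures, self.docs = failures, {}

    def index(self, doc_id, doc):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("es unavailable")
        self.docs[doc_id] = doc

def write_order_to_es(es, order):
    """Update ES exactly once; on failure, record a remedial task instead of retrying inline."""
    try:
        es.index(order["id"], order)
        return True
    except ConnectionError:
        remedial_tasks.append(order["id"])   # a cheap DB insert keeps the business call fast
        return False

def compensation_worker(es, order_db):
    """Scan remedial tasks and re-sync ES from the database, the source of truth."""
    while remedial_tasks:
        order_id = remedial_tasks.pop(0)
        es.index(order_id, order_db[order_id])   # eventual consistency with the DB

order_db = {"order-1": {"id": "order-1", "status": "paid"}}
es = FlakyES(failures=1)
write_order_to_es(es, order_db["order-1"])   # fails once, leaves a remedial task
compensation_worker(es, order_db)            # worker repairs ES from the DB
```

Note that the worker always rereads the order from the database rather than replaying the failed payload, so a stale retry can never overwrite newer data.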

 

Some Pitfalls We Encountered

 

1. Queries with high real-time requirements go to the DB

 

Readers familiar with ES's write path will know that new documents are first collected in the indexing buffer and then written to the filesystem cache, at which point they become searchable like any other segment.

 

By default, however, documents are refreshed from the indexing buffer to the filesystem cache only once per second (the refresh operation). This is why ES is described as near-real-time search rather than real-time: changes to a document are not visible to search immediately, but become visible within one second.

 

The order system's ES currently keeps the default refresh configuration, so business flows with strict real-time requirements on order data query the database directly to guarantee accuracy.
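For reference, the one-second default mentioned above corresponds to the index-level refresh_interval setting. A minimal sketch of the settings body one would send to the index settings API (the value shown is the ES default, not a setting specific to the order center):

```python
# Index settings governing near-real-time visibility; "1s" is the ES default.
index_settings = {
    "index": {
        "refresh_interval": "1s",   # newly indexed docs become searchable within ~1s
    }
}
print(index_settings["index"]["refresh_interval"])
```

Raising refresh_interval trades freshness for indexing throughput; since the order center keeps the default, truly real-time reads go to the DB instead.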

 

 

2. Avoid deep pagination queries

 

ES paged queries take the from and size parameters. For each query, every shard must build a priority queue of length from + size and send it back to the gateway (coordinating) node, which sorts across these priority queues to find the correct size documents.

 

Suppose an index has 6 primary shards, from is 10000, and size is 10: each shard must produce 10010 results, 60060 results are aggregated on the gateway node, and in the end only the 10 matching documents are returned.
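The arithmetic above generalizes directly; a small helper makes the cost of deep paging explicit:

```python
def deep_paging_cost(num_shards: int, from_: int, size: int):
    """Entries each shard must rank, and entries merged on the gateway node."""
    per_shard = from_ + size            # priority queue length per shard
    coordinated = per_shard * num_shards  # entries merged by the coordinating node
    return per_shard, coordinated

# the example above: 6 primary shards, from=10000, size=10
print(deep_paging_cost(6, 10000, 10))  # (10010, 60060), all to return 10 docs
```

The merge cost grows linearly in both from and the shard count, which is why deep pages get expensive even when each individual shard is healthy.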

 

As you can see, when from is large enough, even if no OOM occurs, the CPU and bandwidth cost affects the performance of the entire cluster. Deep pagination queries should therefore be avoided wherever possible.

 

3. FieldData and Doc Values

 

FieldData

 

Online queries occasionally timed out. By debugging the query statements we determined it was related to sorting. In ES 1.x, sorting uses the FieldData structure, which occupies JVM heap memory; since heap is limited, a threshold is set on the FieldData cache.

 

When cache space runs short, the least recently used (LRU) FieldData is evicted while new FieldData is loaded, and that loading consumes significant system resources and time. As a result, the response time of the affected query skyrockets and can even drag down the performance of the whole cluster. The solution to this class of problem is to use Doc Values.
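The cache behavior described above can be sketched with a toy OrderedDict-based LRU (illustrative only, not ES internals): every miss pays the expensive load, and a too-small capacity causes churn.

```python
from collections import OrderedDict

class FieldDataCache:
    """Toy LRU cache: evicting the least recently used entry frees room,
    but every miss pays the (expensive) load cost described above."""
    def __init__(self, capacity):
        self.capacity, self._data, self.loads = capacity, OrderedDict(), 0

    def get(self, field):
        if field in self._data:
            self._data.move_to_end(field)      # mark as most recently used
            return self._data[field]
        self.loads += 1                        # expensive: build FieldData from the index
        if len(self._data) >= self.capacity:
            self._data.popitem(last=False)     # evict the least recently used entry
        self._data[field] = f"fielddata:{field}"
        return self._data[field]

cache = FieldDataCache(capacity=2)
for field in ["created_at", "status", "created_at", "price"]:
    cache.get(field)
print(cache.loads)  # 3 expensive loads for 4 accesses: only one was a cache hit
```

Sorting on fields whose FieldData keeps getting evicted turns each such query into a reload, which is the timeout pattern we observed.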

 

Doc Values

 

Doc Values is a columnar data storage structure very similar to FieldData, except that it lives in the Lucene files on disk and therefore does not occupy the JVM heap. Over successive ES versions Doc Values has become more stable than FieldData, and it has been the default since 2.x.
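Doc Values is controlled per field in the index mapping; since 2.x it is on by default for keyword and numeric fields, and it can be disabled for fields that are never sorted or aggregated on to save disk. A mapping fragment (field names are hypothetical):

```python
# Mapping fragment: doc_values defaults to true since ES 2.x for such fields.
order_mapping = {
    "properties": {
        "order_status": {"type": "keyword", "doc_values": True},   # sorted/aggregated on
        "remark":       {"type": "keyword", "doc_values": False},  # never sorted: save disk
    }
}
print(order_mapping["properties"]["order_status"]["doc_values"])
```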

 

Summary

 

Rapid architectural iteration stems from rapid business growth. It is precisely because of Daojia's fast development in recent years that the order center's architecture has been continuously optimized and upgraded.

 

There is no best architecture, only the most suitable one. I believe that in a few years the order center's architecture will look different again, but greater throughput, better performance, and stronger stability will remain the order center system's eternal pursuit.


Origin blog.csdn.net/qq_35240226/article/details/108235933