ES + Redis + MySQL: This High-Availability Architecture Design Is Seriously Impressive

  • 1. Background

  • 2. ES High Availability Solution

  • 3. Membership Redis cache solution

  • 4. High-availability member master database solution

  • 5. Abnormal member relationship management

  • 6. Outlook: more refined flow control and downgrade strategies


1. Background

The membership system is a foundational system tied closely to the order-placing flow of every business line in the company. If the membership system fails, users cannot place orders, and the impact spans all business lines. Therefore, the membership system must guarantee high performance and high availability and provide stable, efficient basic services.

With the merger of Tongcheng and eLong, more and more systems need to connect the membership systems of multiple platforms, such as the Tongcheng APP, the eLong APP, the Tongcheng WeChat mini program, and the eLong WeChat mini program. Take cross-marketing in WeChat mini programs as an example: a user buys a train ticket and we want to send them a hotel red envelope, which requires querying that user's unified membership relationship. Because train tickets use the Tongcheng membership system while hotels use the eLong membership system, the red envelope can only be attached to the member account after the corresponding eLong card number is found. Besides this cross-marketing case, many other scenarios need to query the unified membership relationship, such as the order center, membership level, mileage, red envelopes, frequent traveler, real-name verification, and various marketing activities. As a result, the request volume of the membership system keeps growing and its concurrency keeps rising; during this year's May Day holiday, peak concurrency exceeded 20,000 TPS. Under the impact of such heavy traffic, how does the membership system achieve high performance and high availability? That is the focus of this article.

2. ES High Availability Solution

1. ES dual-center active-standby cluster architecture

After the integration of Tongcheng and eLong, the total number of members across all systems on the entire platform exceeds one billion. With such a large data volume, the query dimensions of the business lines are also quite varied: some query member information by mobile phone number, some by WeChat unionid, some by eLong card number, and so on. Given this data volume and so many query dimensions, we chose ES to store the unified membership relationships. The ES cluster is critical to the overall membership architecture, so how do we ensure its high availability?

First of all, we know that the ES cluster itself guarantees high availability, as shown in the following figure:

[Figure: ES cluster with primary and replica shards distributed across nodes]

When one node of the ES cluster goes down, the replica shards on other nodes are promoted to primary shards and continue to provide service. But even that is not enough. For example, the ES cluster is deployed in data center A, and data center A suddenly loses power: what then? Or a hardware failure takes down most machines in the ES cluster: what then? Or a very popular flash-sale event suddenly brings a huge wave of traffic that overwhelms the ES cluster: what then? Facing these situations, should we have the ops team rush to the data center to fix things? That is unrealistic, because the membership system directly affects the order-placing flow of every business line, so fault recovery must be very fast; manual intervention by ops simply takes too long, which is intolerable. So how do we make ES highly available? Our answer is the ES dual-center active-standby cluster architecture.

We have two data centers, data center A and data center B. The ES main cluster is deployed in data center A and the ES standby cluster in data center B. The membership system reads and writes only against the main cluster, and data is synchronized to the standby cluster through MQ. If the main cluster goes down, a unified configuration switch redirects the membership system's reads and writes to the standby cluster in data center B, so even if the main cluster fails, failover is completed within a short time and the membership system keeps running stably. Finally, after the main cluster recovers, we turn on a switch that replays the data written during the failure back to the main cluster; once the data is consistent, reads and writes are switched back to the main cluster.
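The article does not show the switching code, but the idea can be sketched roughly as follows: the application holds a client for each cluster and picks one based on a flag pulled from a configuration center. This is only a minimal illustration in Java; the class and configuration-key names (EsClusterRouter, member.es.active-cluster) and the endpoints are hypothetical, and the real system presumably also handles the MQ replication and replay logic, which is omitted here.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

/**
 * Hypothetical router that chooses between the ES main cluster (data center A)
 * and the ES standby cluster (data center B) based on a config-center flag.
 */
public class EsClusterRouter {

    private final RestHighLevelClient mainCluster =
            new RestHighLevelClient(RestClient.builder(new HttpHost("es-a.example.internal", 9200)));
    private final RestHighLevelClient standbyCluster =
            new RestHighLevelClient(RestClient.builder(new HttpHost("es-b.example.internal", 9200)));

    // In the real system this value would come from a configuration center and be
    // refreshed dynamically; here it is just a volatile field for illustration.
    private volatile String activeCluster = "main"; // config key: member.es.active-cluster

    /** Returns the client that reads and writes should currently go to. */
    public RestHighLevelClient current() {
        return "main".equals(activeCluster) ? mainCluster : standbyCluster;
    }

    /** Called by the config listener when operators flip the switch. */
    public void onConfigChange(String newValue) {
        this.activeCluster = newValue;
    }
}
```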

2. ES traffic isolation three-cluster architecture

With the dual-center active-standby ES clusters in place, it seemed there should be no major problems, but a frightening traffic spike last year changed our minds. During a holiday, one business line launched a marketing campaign in which a single user request called the membership system more than 10 times, so the membership system's TPS skyrocketed and almost blew up the ES cluster. The incident scared us and made us realize that we had to prioritize callers and apply finer-grained isolation, circuit breaking, degradation, and rate limiting. We first sorted all callers into two categories of requests. The first category is requests closely tied to the order-placing flow; these are very important and must be guaranteed with high priority. The second category is related to marketing activities; these requests are high-volume and high-TPS but do not affect the order-placing flow. Based on this, we built another ES cluster dedicated to handling high-TPS marketing and flash-sale requests, isolating it from the main ES cluster so that a traffic surge from some marketing activity can never affect users' ordering flow. As shown below:

[Figure: ES traffic-isolation three-cluster architecture]
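As a rough illustration of the isolation, the sketch below routes a request to the marketing ES cluster or the main ES cluster based on the category registered for the calling account. The CallerCategory and chooseCluster names and the example accounts are hypothetical; the article only states that marketing traffic goes to a dedicated cluster, so this is just one way such routing could look.

```java
import java.util.Map;

/** Hypothetical routing of callers onto the main or the marketing ES cluster. */
public class EsTrafficIsolation {

    public enum CallerCategory { ORDER_CORE, MARKETING }

    // In practice this mapping would live in a config center, keyed by calling account.
    private final Map<String, CallerCategory> callerRegistry = Map.of(
            "order-center", CallerCategory.ORDER_CORE,
            "flash-sale-activity", CallerCategory.MARKETING);

    /** Returns the logical cluster a caller's queries should be sent to. */
    public String chooseCluster(String callingAccount) {
        CallerCategory category =
                callerRegistry.getOrDefault(callingAccount, CallerCategory.ORDER_CORE);
        return category == CallerCategory.MARKETING ? "es-marketing-cluster" : "es-main-cluster";
    }
}
```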

3. ES cluster depth optimization and improvement

Having covered the high-availability dual-center active-standby ES architecture, let's dig into the optimization of the ES main cluster itself. For a while we were particularly miserable: every mealtime, the ES cluster started firing alerts, which left us nervous through every meal, afraid the cluster could not hold up and would drag the whole company down with it. Why did alerts fire exactly at mealtimes? Because traffic was heavy then: the number of ES threads soared, CPU usage shot up, query latency increased, and the slowness propagated to all callers, causing delays across a wider range. So how do we solve this? By digging into the ES cluster we found the following problems:

  • ES load was unbalanced and the hotspot problem was severe. The ES main cluster has dozens of nodes; some nodes hosted far too many shards while others hosted very few, so some servers were heavily loaded and fired frequent alerts at traffic peaks.

  • The ES thread pool was set too large, driving the CPU up. When sizing ES thread pools, the number of threads is generally set to the number of CPU cores on the server. Even if ES query pressure is high and more threads are needed, it is best not to exceed "number of cores * 3 / 2 + 1". With too many threads, the CPU wastes a lot of resources switching back and forth between thread contexts.

  • The shards were too large, around 100 GB each, which slowed queries down. An ES index should be given a reasonable number of shards, keeping a single shard within about 50 GB. An oversized shard slows queries, increases latency, and seriously hurts performance.

  • String fields were mapped as multi-fields with both text and keyword, doubling the storage. Member-information queries do not need relevance scoring; they can be answered with exact keyword matches, so the text field can be removed entirely, saving a large amount of storage and improving performance.

  • For ES queries, use filter context instead of query context. The query context computes a relevance score for every hit, which consumes extra CPU, while member-information queries do not need scoring, so this performance cost can be avoided entirely.

  • To save ES computing power, sort ES search results in the membership system's own JVM memory instead of in ES.

  • Add a routing key. An ES query normally fans out to all shards, aggregates the results once every shard has responded, and then returns them to the caller. If we already know which shards hold the data, we can avoid a large number of unnecessary shard requests and improve query performance. (A small sketch combining several of these points follows this list.)
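To make points like the keyword-only mapping, filter context, routing key, and in-application sorting concrete, here is a minimal sketch using the Elasticsearch High Level REST Client. The index name, field names, and client wiring are assumptions for illustration only; they are not taken from the article.

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class MemberEsQuery {

    /*
     * Assumed mapping (keyword only, no text sub-field), e.g.:
     *
     * PUT member_index
     * {
     *   "mappings": {
     *     "properties": {
     *       "mobile":  { "type": "keyword" },
     *       "unionid": { "type": "keyword" },
     *       "cardNo":  { "type": "keyword" }
     *     }
     *   }
     * }
     */

    private final RestHighLevelClient client;

    public MemberEsQuery(RestHighLevelClient client) {
        this.client = client;
    }

    public SearchResponse findByMobile(String mobile) throws Exception {
        SearchSourceBuilder source = new SearchSourceBuilder()
                // filter context: exact match, no relevance scoring, less CPU
                .query(QueryBuilders.boolQuery()
                        .filter(QueryBuilders.termQuery("mobile", mobile)));

        SearchRequest request = new SearchRequest("member_index")
                .source(source)
                // routing key: only the shard(s) holding this routing value are queried
                .routing(mobile);

        // Any sorting of the hits would be done afterwards in the application's JVM,
        // not pushed down to ES, to save ES computing power.
        return client.search(request, RequestOptions.DEFAULT);
    }
}
```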

After the above optimizations the results were remarkable: CPU usage of the ES cluster dropped sharply and query performance improved greatly. CPU usage of the ES cluster:

Interface response time of the membership system:

[Figure: membership system interface response time after optimization]

3. Membership Redis cache solution

For a long time the membership system did not use a cache, for two main reasons. First, the ES cluster described above performed very well, handling more than 30,000 requests per second with a 99th-percentile latency of about 5 milliseconds, which was enough to cope with all kinds of difficult scenarios. Second, some businesses require the member binding relationships to be consistent in real time, and the membership system is an old system that has evolved for more than 10 years into a distributed system composed of many interfaces and subsystems. If even one interface is not handled properly and fails to update the cache in time, dirty data appears, which leads to a series of problems: users cannot see their WeChat orders, APP and WeChat membership levels and mileage are not merged, WeChat and the APP cannot cross-market, and so on. So why introduce a cache after all? Because of this year's air-ticket blind-box campaign, whose instantaneous concurrency was extremely high. Although the membership system survived unscathed, it was a close call, so to be on the safe side we finally decided to implement a caching solution.

1. Solving Redis cache inconsistency caused by ES's near-real-time (roughly one-second) delay

While building the membership cache, we ran into a problem caused by ES that could make the cached data inconsistent. ES writes are near real-time: if you add a document to ES, it is not immediately searchable; by default it only becomes visible to queries after about one second (the refresh interval).

Why does ES's near-real-time mechanism make the Redis cache inconsistent? Concretely, suppose a user unbinds their APP account, so ES must be updated to remove the binding between the APP account and the WeChat account. Because ES updates are near real-time, the updated data only becomes queryable after about one second. Within that second, a request arrives to query the user's membership bindings. It first checks the Redis cache, finds nothing, then queries ES and gets a hit, but what it gets is the old, pre-update data. Finally, the request writes that stale result into the Redis cache and returns it. One second later, the user's membership data in ES is updated, but the Redis cache still holds the old data, so Redis and ES are now inconsistent.

How do we solve this? Our approach: when updating ES data, first acquire a 2-second Redis distributed lock on the member (to protect cache consistency) and then delete that member's cached data from Redis. If a query request arrives during this window, it first checks the distributed lock; finding the member ID locked means the ES update has not yet taken effect, so after querying the data it does not write the result back into the Redis cache and simply returns it. This avoids the cache inconsistency. As shown below:

[Figure: deleting the cache and locking the member ID while the ES update takes effect]

At first glance the solution above seems fine, but careful analysis shows the cache can still become inconsistent. For example, just before the update request acquires the distributed lock, a query request checks for the lock, finds none, and is therefore allowed to update the cache. But right before it writes the cache, its thread is blocked. Meanwhile the update request arrives, acquires the distributed lock, and deletes the cache. After the update request finishes, the blocked query thread wakes up, performs its cache write, and puts stale data into the cache. See the problem? The crux is that "delete cache" and "update cache" race with each other; as long as they are made mutually exclusive, the problem is solved. As shown below:

[Figure: making the cache delete and the cache update mutually exclusive under the same lock]
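The article does not include the lock code, but the scheme can be sketched as follows with Jedis: the write path locks the member ID for 2 seconds and deletes the cache; the read path only writes the cache back if it can confirm the member ID is not locked, and that check-and-write is itself guarded by the same lock so that "delete cache" and "update cache" cannot interleave. The key names and the MemberCacheGuard class are hypothetical; a production version would also need lock ownership tokens and safe release, which are omitted for brevity.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

/** Hypothetical sketch of the lock-protected cache scheme described above. */
public class MemberCacheGuard {

    private final Jedis jedis;

    public MemberCacheGuard(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Write path: lock the member for ~2s, delete the cached entry, then update ES. */
    public void onMemberUpdate(String memberId) throws InterruptedException {
        // Spin briefly until the 2-second lock is acquired, so the delete cannot
        // interleave with a concurrent cache write on the read path.
        while (!"OK".equals(jedis.set("member:lock:" + memberId, "1",
                SetParams.setParams().nx().px(2000)))) {
            Thread.sleep(10);
        }
        jedis.del("member:cache:" + memberId);
        // ... update ES here; the lock expires on its own after ~2s,
        // which covers ES's ~1s near-real-time refresh delay ...
    }

    /** Read path: only write the query result back to the cache if the member is not locked. */
    public void maybeCache(String memberId, String memberJson) {
        // Try to take the same lock; if it is already held, an ES update is in flight,
        // so the data we just read may be stale and must not be cached.
        String ok = jedis.set("member:lock:" + memberId, "1",
                SetParams.setParams().nx().px(2000));
        if ("OK".equals(ok)) {
            try {
                jedis.setex("member:cache:" + memberId, 1800, memberJson); // 30 min TTL
            } finally {
                jedis.del("member:lock:" + memberId);
            }
        }
        // If the lock could not be acquired, return the data without caching it.
    }
}
```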

After the caching solution went live, statistics show a cache hit rate of over 90%, which greatly relieves the pressure on ES and significantly improves the overall performance of the membership system.

2. Redis dual-center multi-cluster architecture

Next, let's look at how the high availability of the Redis cluster itself is guaranteed:

For Redis high availability we adopted a dual-center, multi-cluster model: one Redis cluster is deployed in data center A and another in data center B. Cache updates are double-written, and a write only returns success after the Redis clusters in both data centers have been written successfully. Cache reads go to the Redis cluster in the local data center to reduce latency. This way, even if data center A fails entirely, data center B can still provide complete membership service.
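A minimal sketch of the double-write and nearest-read idea, again using Jedis. The DualCenterRedis class, the way the local data center is detected, and the use of two plain connections are all assumptions; the real system likely uses Redis Cluster clients and asynchronous pipelines.

```java
import redis.clients.jedis.Jedis;

/** Hypothetical dual-center Redis access: double-write, read from the local data center. */
public class DualCenterRedis {

    private final Jedis redisA;     // Redis cluster in data center A
    private final Jedis redisB;     // Redis cluster in data center B
    private final Jedis localRedis; // whichever of the two is in the current data center

    public DualCenterRedis(Jedis redisA, Jedis redisB, boolean runningInDataCenterA) {
        this.redisA = redisA;
        this.redisB = redisB;
        this.localRedis = runningInDataCenterA ? redisA : redisB;
    }

    /** Succeeds only if both data centers were written successfully. */
    public boolean put(String key, String value, int ttlSeconds) {
        String resultA = redisA.setex(key, ttlSeconds, value);
        String resultB = redisB.setex(key, ttlSeconds, value);
        return "OK".equals(resultA) && "OK".equals(resultB);
    }

    /** Reads go to the local data center to keep latency low. */
    public String get(String key) {
        return localRedis.get(key);
    }
}
```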

4. High-availability member master database solution

As mentioned above, the binding relationships of members across all platforms live in ES, while member registration details live in a relational database. Initially the membership database was SQL Server, until one day a DBA came to us and said that the single SQL Server instance already stored more than one billion member records, the server had reached its physical limits and could not be scaled any further, and at the current growth rate the whole SQL Server database would collapse before long. Imagine that disaster scenario: if the membership database collapses, the membership system collapses; if the membership system collapses, every business line of the company collapses. The thought alone made us shudder, so we immediately started the database migration.

1. MySQL dual-center sharded cluster solution

After research, we chose a dual-center, sharded (sub-database, sub-table) MySQL cluster solution.

Members total more than one billion records, so we split the member master database into more than 1,000 shards, each holding on the order of a million rows, which is more than enough. The MySQL cluster uses a one-master, three-slave architecture: the master library sits in data center A and the slave libraries in data center B, with data replicated between the two data centers over a dedicated line with latency under 1 millisecond. The membership system reads and writes data through DBRoute: writes are routed to data center A where the master node lives, and reads are routed to the local data center for nearby access and lower network latency. The dual-center MySQL cluster architecture greatly improves availability; even if data center A collapses entirely, a slave in data center B can be promoted to master and continue to provide service.
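The article says DBRoute routes writes to the master in data center A and reads to the local data center, over 1,000-plus shards. A rough sketch of what such routing could look like follows; the shard count, naming scheme, and data-source lookup are assumptions for illustration, not the actual DBRoute implementation.

```java
/**
 * Hypothetical sketch of DBRoute-style routing: pick a shard from the member id,
 * send writes to the master in data center A and reads to the local replica.
 */
public class MemberDbRoute {

    private static final int SHARD_COUNT = 1024; // the article states 1,000+ shards

    /** Maps a member id onto one of the shards. */
    public int shardIndex(long memberId) {
        return (int) (Math.abs(memberId) % SHARD_COUNT);
    }

    /** Writes always go to the master instance in data center A. */
    public String dataSourceForWrite(long memberId) {
        return "dcA-master-shard-" + shardIndex(memberId);
    }

    /** Reads go to a replica in the local data center to reduce latency. */
    public String dataSourceForRead(long memberId, String localDataCenter) {
        return localDataCenter + "-replica-shard-" + shardIndex(memberId);
    }
}
```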

After the dual-center MySQL cluster was built we ran stress tests: it sustained more than 20,000 requests per second with an average latency under 10 milliseconds, meeting the performance target.

2. Smooth migration plan for the member master database

The next job was to switch the underlying storage of the membership system from SQL Server to MySQL. This was a very risky job, with the following main difficulties:

  • The membership system cannot stop for even a moment. Completing the switch from SQL Server to MySQL without downtime is like changing the wheels of a car at high speed.

  • The membership system is composed of many systems and interfaces. After more than 10 years of development, a large number of legacy interfaces remain for historical reasons, and the logic is intricate. All of these systems had to be sorted out one by one and the DAL-layer code rewritten without introducing any problems, or the result would be catastrophic.

  • The data migration had to be seamless: not only migrating the more than one billion existing records, but also synchronizing real-time writes to MySQL without gaps. Besides keeping the synchronization real-time, we also had to guarantee data correctness and consistency between SQL Server and MySQL.

Based on the above pain points, we designed a technical solution of "full synchronization, incremental synchronization, and real-time traffic grayscale switching".

First, to switch the data seamlessly, we adopted real-time double-writing. Because of complex business logic and technical differences between SQL Server and MySQL, the write to MySQL could fail during double-writing, and a failed write would leave SQL Server and MySQL inconsistent, which is absolutely not allowed. So the strategy during the trial run was: write to SQL Server as the primary store, then write to MySQL asynchronously through a thread pool; on failure, retry three times; if it still fails, log it and investigate the cause manually. Keep double-writing until it has run for a while with no double-write failures. With this strategy, the correctness and stability of the double writes can be ensured in most cases, and even if SQL Server and MySQL diverge during the trial run, MySQL can be fully rebuilt from SQL Server, because the double-write strategy guarantees that the SQL Server write succeeds first; in other words, SQL Server holds the most complete and correct data.
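A minimal sketch of the double-write idea under those assumptions: the SQL Server write stays synchronous on the main path, the MySQL write runs asynchronously in a thread pool and is retried up to three times before being logged for manual follow-up. The DAO interface and class names here are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.logging.Logger;

/** Hypothetical double-write: SQL Server synchronously, MySQL asynchronously with retries. */
public class MemberDoubleWriter {

    private static final Logger LOG = Logger.getLogger(MemberDoubleWriter.class.getName());

    public interface MemberDao { void save(Object member) throws Exception; }

    private final MemberDao sqlServerDao;
    private final MemberDao mysqlDao;
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public MemberDoubleWriter(MemberDao sqlServerDao, MemberDao mysqlDao) {
        this.sqlServerDao = sqlServerDao;
        this.mysqlDao = mysqlDao;
    }

    public void save(Object member) throws Exception {
        // 1. SQL Server remains the source of truth; if this fails, the request fails.
        sqlServerDao.save(member);

        // 2. MySQL is written asynchronously; failures are retried three times, then logged
        //    so the record can be repaired manually (or rebuilt in full from SQL Server).
        pool.submit(() -> {
            for (int attempt = 1; attempt <= 3; attempt++) {
                try {
                    mysqlDao.save(member);
                    return;
                } catch (Exception e) {
                    if (attempt == 3) {
                        LOG.severe("MySQL double-write failed after 3 attempts: " + e);
                    }
                }
            }
        });
    }
}
```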

With double-writing covered, let's look at how reads are gray-scaled. The overall idea is to shift traffic gradually through an A/B platform. At the start, 100% of reads go to SQL Server; then traffic is gradually cut over to MySQL, starting at 1%, and if nothing goes wrong the share is progressively increased until 100% of reads go to MySQL. During this gradual rollout a verification mechanism is required, and only when verification passes can the traffic share be increased further. How is that verification implemented? For each query request, an asynchronous thread compares whether the SQL Server and MySQL results are identical; if not, the difference is logged and the cause investigated manually, and only after the inconsistencies are fully resolved does the gray-scale rollout continue. As shown below:

[Figure: gray-scale read switching with asynchronous SQL Server / MySQL result comparison]
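Roughly, the read-side gray scale could look like the sketch below: a percentage decides which database serves the request, and an asynchronous task compares the two results and logs any mismatch. The percentage source, DAO interfaces, and comparison logic are placeholders, not the article's actual code.

```java
import java.util.Objects;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.logging.Logger;

/** Hypothetical gray-scale read path with asynchronous result verification. */
public class GrayScaleMemberReader {

    private static final Logger LOG = Logger.getLogger(GrayScaleMemberReader.class.getName());

    public interface MemberQuery { Object findById(long memberId) throws Exception; }

    private final MemberQuery sqlServerQuery;
    private final MemberQuery mysqlQuery;
    private final ExecutorService verifier = Executors.newFixedThreadPool(4);

    // Percentage of traffic reading MySQL; in reality pushed from an A/B / config platform.
    private volatile int mysqlPercent = 1;

    public GrayScaleMemberReader(MemberQuery sqlServerQuery, MemberQuery mysqlQuery) {
        this.sqlServerQuery = sqlServerQuery;
        this.mysqlQuery = mysqlQuery;
    }

    public Object findById(long memberId) throws Exception {
        boolean useMysql = ThreadLocalRandom.current().nextInt(100) < mysqlPercent;
        Object result = useMysql ? mysqlQuery.findById(memberId)
                                 : sqlServerQuery.findById(memberId);

        // Compare the two stores off the request path and log any mismatch for manual review.
        verifier.submit(() -> {
            try {
                Object fromSqlServer = sqlServerQuery.findById(memberId);
                Object fromMysql = mysqlQuery.findById(memberId);
                if (!Objects.equals(fromSqlServer, fromMysql)) {
                    LOG.warning("Member " + memberId + " differs between SQL Server and MySQL");
                }
            } catch (Exception e) {
                LOG.warning("Verification failed for member " + memberId + ": " + e);
            }
        });
        return result;
    }
}
```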

Therefore, the overall implementation process is as follows:

[Figure: overall migration process: full sync, double-write, incremental sync, gray-scale switch]

First, in the middle of a dark and windy night when traffic is at its lowest, complete the full data synchronization from SQL Server to MySQL. Then enable double-writing, so that any new registration is written to both databases in real time. Because some data still changed between the full synchronization and the moment double-writing was enabled, an incremental synchronization is run to backfill that window and prevent inconsistency. The rest of the time is spent watching the various logs: checking whether any double writes failed, whether the data comparisons match, and so on. This phase takes the longest and is the most error-prone; if a serious problem causes data inconsistency, everything starts over: rebuild MySQL in full from SQL Server and re-run the gray-scale rollout, until finally 100% of the traffic is gray-scaled onto MySQL. At that point the job is done: the gray-scale logic is taken offline, and all reads and writes go to the MySQL cluster.

3. MySQL and ES active-standby cluster solution

After reaching this point we felt the member master database should be fine, but a serious failure of the DAL component changed our minds. That failure was terrible: many applications across the company could not connect to their databases, and order creation plummeted. It made us realize that even with a healthy database, an abnormal DAL component can still bring the membership system down. So we built a heterogeneous copy of the member master database's data, double-writing it to ES as well, as follows:

If the DAL component fails or the MySQL database goes down, reads and writes can be switched to ES; once MySQL recovers, the data is synchronized back to MySQL, and finally reads and writes are switched back to the MySQL database. As shown below:

[Figure: switching member reads and writes between MySQL and ES when the DAL or database fails]

5. Abnormal member relationship management

The membership system must guarantee not only stability and high availability but also the accuracy and correctness of its data. For example, a distributed concurrency fault could bind one user's APP account to someone else's WeChat mini program account, which would have very bad consequences. First, once the two accounts are bound, the hotel, air ticket, and train ticket orders placed by the two users become visible to each other. Imagine a stranger being able to see your hotel bookings; wouldn't you complain? Beyond seeing other people's orders, they can also operate on them. For example, a user sees someone else's air ticket order in the APP's order center, decides it is not his, and cancels it. That leads to very serious customer complaints: as everyone knows, air ticket cancellation fees are quite high, so this not only disrupts the other user's travel but also causes significant financial loss.

For these abnormal member accounts, we carried out a detailed inventory, identified them with some very complex and brain-burning logic, deeply optimized and cleaned up the member interfaces, and plugged the related loopholes at the code-logic level, completing the governance of abnormal member accounts. As shown below:

[Figure: identification and governance of abnormal member account relationships]

6. Outlook: more refined flow control and downgrade strategies

No system can guarantee it will never have problems, so we must design for failure, that is, adopt more refined flow control and degradation strategies.

1. A more refined flow control strategy

Hotspot control. In scalping or fraudulent-ordering scenarios, the same member ID generates a large number of repeated requests, forming hotspot accounts. When such an account's access exceeds the configured threshold, a rate-limiting strategy is applied.

Flow-control rules per calling account. This strategy mainly guards against large traffic caused by bugs in a caller's code. For example, within a single user request, a caller loops and invokes the membership interface many times, multiplying the membership system's traffic. Therefore each calling account gets its own flow-control rules, and a rate-limiting policy kicks in when its threshold is exceeded.

Global flow-control rules. Our membership system can withstand more than 30,000 TPS of concurrent requests. If a terrifying wave of traffic arrives, say 100,000 TPS, rather than letting it crush the member databases and ES, it is better to fast-fail the traffic that exceeds what the membership system can bear, so that at least requests within the 30,000 TPS capacity are served normally and the membership system as a whole does not collapse.

[Figure: layered rate limiting: hotspot accounts, per-caller rules, and a global threshold]
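As a rough illustration of how such layered limits could be wired together, the sketch below uses Guava's RateLimiter for the per-caller and global limits and a simple counter for hotspot member IDs. The thresholds and class name are made up for the example; the article does not say which rate-limiting component is actually used.

```java
import com.google.common.util.concurrent.RateLimiter;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical layered flow control: hotspot member ids, per-caller limits, global limit. */
public class MemberFlowControl {

    // Global ceiling, roughly matching the stated ~30,000 TPS capacity.
    private final RateLimiter globalLimiter = RateLimiter.create(30_000);

    // Each calling account gets its own limiter (threshold would come from configuration).
    private final ConcurrentHashMap<String, RateLimiter> callerLimiters = new ConcurrentHashMap<>();

    // Naive hotspot counter per member id; a real system would use a sliding time window.
    private final ConcurrentHashMap<String, AtomicLong> memberHits = new ConcurrentHashMap<>();

    public boolean allow(String callingAccount, String memberId) {
        // 1. Hotspot control: reject member ids hammered far beyond a threshold.
        long hits = memberHits.computeIfAbsent(memberId, k -> new AtomicLong()).incrementAndGet();
        if (hits > 10_000) {
            return false;
        }
        // 2. Per-caller flow control.
        RateLimiter callerLimiter =
                callerLimiters.computeIfAbsent(callingAccount, k -> RateLimiter.create(2_000));
        if (!callerLimiter.tryAcquire()) {
            return false;
        }
        // 3. Global flow control: fast-fail anything above total capacity.
        return globalLimiter.tryAcquire();
    }
}
```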

2. A more refined downgrade strategy

Degradation based on average response time. The member interfaces also depend on other interfaces. When the average response time of calls to a dependent interface exceeds a threshold, the circuit enters a pending-degradation state; if the average response time of requests in the following second still exceeds the threshold, the circuit breaker trips automatically for the next time window.

Degradation based on exception count and exception ratio. When a dependent interface throws exceptions, if the number of exceptions within one minute exceeds a threshold, or the ratio of exceptions to throughput per second exceeds a threshold, the circuit enters the degraded state and the breaker stays open for the next time window.
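Rules like these are usually provided by a mature circuit-breaking component, but to make the response-time rule concrete, here is a bare-bones, self-contained sketch. The thresholds, window length, and class name are illustrative assumptions and not the actual mechanism used by the membership system.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Bare-bones sketch of an average-response-time circuit breaker. */
public class AvgRtCircuitBreaker {

    private final long rtThresholdMillis;
    private final long openWindowMillis;

    private final AtomicLong totalRt = new AtomicLong();
    private final AtomicLong totalCalls = new AtomicLong();
    private volatile long openUntil = 0;

    public AvgRtCircuitBreaker(long rtThresholdMillis, long openWindowMillis) {
        this.rtThresholdMillis = rtThresholdMillis;
        this.openWindowMillis = openWindowMillis;
    }

    /** Returns false while the breaker is open and the dependency should be skipped. */
    public boolean allowRequest() {
        return System.currentTimeMillis() >= openUntil;
    }

    /** Record each call's response time; trip the breaker if the average exceeds the threshold. */
    public void record(long responseTimeMillis) {
        long rtSum = totalRt.addAndGet(responseTimeMillis);
        long calls = totalCalls.incrementAndGet();
        if (calls >= 100 && rtSum / calls > rtThresholdMillis) {
            openUntil = System.currentTimeMillis() + openWindowMillis; // open for the next window
            totalRt.set(0);
            totalCalls.set(0);
        }
    }
}
```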

At the moment our biggest pain point is the management of member calling accounts. Inside the company, anyone who wants to call the member interfaces must apply for a calling account; we record the account's usage scenario and configure flow-control and degradation rules for it. In practice, however, the colleague who applied for the account may move to another department and still need to call the membership system; to save trouble, they reuse the old account instead of applying for a new one. That makes it impossible for us to know a calling account's actual usage scenarios, and therefore impossible to apply finer-grained flow control and degradation. So our next step is to go through all calling accounts one by one, a huge and tedious task, but one there is no way around, so we have to do it well.
