ByteDance second-round interview: how do you design a 10Wqps (100K QPS) membership system?

Say up front

In Nien's reader community (50+ groups), I often encounter a very high-frequency interview question that is nonetheless hard to answer well, along the lines of:

  • Tens of millions of data, how to do system architecture?

  • Billion-level data, how to do system architecture?

  • Tens of millions of traffic, how to do the system architecture?

  • Billion-level traffic, how to do system architecture?

  • How to structure a high-concurrency system?

Recently, a reader ran into this question again in a second-round interview at ByteDance.

In fact, Nien has long wanted to put together a textbook-style answer.

The textbook answer to "How do you optimize performance for tens of millions of records?" is actually hidden in an industry case.

Here is that case: the high-concurrency architecture of the Tongcheng-Elong membership system. Nien has restructured and reorganized this solution from the interview dimension, and it now serves as a reference answer in the V96 edition of the "Nien Java Interview Collection" PDF.

The following content is Nien's secondary analysis, based on his own architecture notes and the Nien architecture knowledge system.

For the PDFs of "Nien's Architecture Notes", "Nien's High Concurrency Trilogy", and "Nien's Java Interview Collection", please visit the official account [Technical Freedom Circle].

Business Scenarios of Tongcheng-Elong Membership System

Tongcheng-Elong is a company jointly established on December 29, 2017 by Tongcheng Network and eLong Travel Network under Tongcheng Travel Group. The new company integrates the two parties' strengths in transportation, hotels, and other resources to build a leading travel service platform.

On June 21, 2018, Tongcheng-Elong submitted a prospectus on the Hong Kong Stock Exchange, and the joint sponsors were Morgan Stanley, JPMorgan Chase, and CMB International.

On November 26, 2018, Tongcheng-Elong was officially listed on the Hong Kong Stock Exchange.

Tongcheng-Elong provides technological empowerment for the upstream and downstream of the travel industry chain such as airports, hotels, and destinations, and accelerates the process of digital intelligence in the industry. End-user applications include:

  • Tongcheng APP
  • eLong APP
  • Tongcheng WeChat Mini Program
  • eLong WeChat Mini Program

In the third quarter of 2021, the average monthly active users of Tongcheng-Elong reached 277 million, a year-on-year increase of 12.7%; the average monthly paying users reached 33.6 million, a year-on-year increase of 12.8%.

In the 12 months ended September 30, 2021, paying users of Tongcheng-Elong increased by 29.6% year-on-year to 196 million.

Inside Tongcheng-Elong, the membership system is a basic system .

This basic system is closely related to the main order process of all business lines of the company.

Inside the Tongcheng-Elong platform, if the membership system fails, users will not be able to place orders.

In other words, if the membership system fails, the scope of impact is not only the membership system itself, but all business lines of the company.

Therefore, the membership system must ensure high performance, high availability, and high concurrency, and provide stable and efficient basic services for large business platforms.

With the merger of Tongcheng and eLong, more and more systems need to interoperate with the membership systems of multiple platforms: the Tongcheng APP, eLong APP, Tongcheng WeChat mini program, and eLong WeChat mini program.

For example, in WeChat mini-program cross-marketing, when a user buys a train ticket and we want to send him a hotel red envelope, we need to query that user's unified membership relationship.

Because train tickets use the Tongcheng membership system while hotels use the eLong membership system, the red envelope can only be attached to the member account after the corresponding eLong membership card number is found.

In addition to cross-marketing scenarios, there are many, many business scenarios that need to query the unified membership relationship.

For example, order center, membership level, mileage, red envelope, frequent travel, real name, and various marketing activities, etc.

Therefore, the request volume of the membership system is getting larger and larger, and the concurrency is getting higher and higher.

During the May Day holiday in 2022, peak TPS even exceeded 20,000.

Under the impact of such a large amount of traffic, how does the membership system achieve high performance and high availability?

Heterogeneous storage architecture solution for member data

Inside Tongcheng-Elong, the membership system adopts a heterogeneous storage architecture that combines a MySQL cluster with an ES cluster, as shown in the figure below.

  1. Why use MySQL?
    MySQL is currently the most popular relational database: it supports transactions, provides high-performance B+ tree indexes, and has low data latency (data is queryable immediately after it is written).

    But it has two major disadvantages:

    • (1) It is not suited to full-text search. A search on a non-indexed field triggers a full-table scan, and performance is poor.

    • (2) At this data volume, MySQL requires sharding (sub-database, sub-table), which makes full-table joins impossible.

  2. Why use Elasticsearch?
    ES has inherent advantages in search:

    • (1) Its inverted index makes it a natural full-text search engine.

    • (2) It supports large tables, wide tables, and unstructured data, without the cross-table and cross-database joins a relational database needs for even a simple search.

  3. The benefits of MySQL + Elasticsearch

    Use MySQL as the system of record and ES as the search engine, ensuring both reliability and query performance.

MySQL member main library: high availability + high concurrency architecture

As mentioned above, the binding relationship data of all platform members exists in ES, while the registration details of members exist in relational databases.

At the earliest, the member database was SQL Server.

Until one day, the single SQL Server database had accumulated more than one billion member records; the server had reached its physical limit and could not be expanded any further.

At the natural growth rate, it would not be long before the entire SQL Server database collapsed.

Think about it, everyone, the SQL Server database collapsed, what kind of disaster scenario is that:

  • When the membership database collapses, the membership system collapses;

  • When the membership system collapsed, all business lines of the company collapsed.

The mere thought of it is chilling. The Tongcheng-Elong team immediately started migrating the DB.

MySQL dual-center sharded cluster solution

After research, the team chose the MySQL cluster solution with dual-center sub-database and sub-table.

MySQL table structure

The member tables hold more than one billion records in total, and the team split the member master database into more than 1,000 shards.

At the single-shard level, each shard holds about one million records, well under the tens-of-millions threshold, which leaves plenty of headroom.
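The article gives the shard counts but not the routing rule. A minimal sketch, assuming hash-mod routing over 1,024 shards and a hypothetical `member_NNNN` table-naming scheme:

```python
def route_shard(member_id: int, shard_count: int = 1024) -> str:
    """Map a member id to one of ~1,000 sharded tables via hash-mod.
    Table name and shard count are illustrative assumptions."""
    return f"member_{member_id % shard_count:04d}"
```

With roughly one billion members spread over roughly 1,000 shards, each table holds about a million rows, matching the sizing above.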

MySQL master-slave architecture

The entire MySQL cluster adopts the architecture of 1 master and 3 slaves.

The master library is placed in computer room A, and the slave library is placed in computer room B. Data is synchronized between the two computer rooms through a dedicated line, and the delay is within 1 millisecond.

MySQL traffic routing architecture

  • Write routing: write data is routed to computer room A where the master node is located.

  • Read routing: All read data is routed to the local computer room for nearby access, reducing network latency.

The membership system reads and writes data through DBRoute, a unified database-access SDK component that can redirect traffic according to a traffic switch.
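The two routing rules above can be sketched as follows (the endpoint names are assumptions; the real DBRoute is an internal SDK):

```python
def route_db(op: str, local_room: str) -> str:
    """Write traffic always goes to the master in computer room A;
    read traffic stays in the caller's local room for low latency."""
    if op == "write":
        return "mysql-master@roomA"
    return f"mysql-slave@{local_room}"
```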

The specific architecture diagram is shown in the following figure:

In this way, the dual-center MySQL cluster architecture greatly improves availability. Even if computer room A goes down entirely, the slaves in computer room B can be promoted to master and continue to provide service.

After the dual-center MySQL cluster was built, the team ran a stress test: concurrency exceeded 20,000 per second with average latency within 10 milliseconds, so performance met the standard.

Smooth migration plan for member master library

The next job was to switch the membership system's underlying storage from SQL Server to MySQL. This is very risky work, with the following main difficulties:

  • The membership system cannot be shut down for even a moment. Completing the switch from SQL Server to MySQL without downtime is like changing the wheels on a car at highway speed.
  • The membership system is composed of many subsystems and interfaces. After more than 10 years of development, a large number of legacy interfaces remain for historical reasons, and the logic is intricate. All of these systems must be sorted out one by one and the DAL-layer code rewritten without a single mistake; otherwise the result would be catastrophic.
  • The data migration must be seamless: not only must the more than one billion existing records be migrated, but newly generated data must also be synchronized to MySQL in real time. Beyond real-time synchronization, the correctness of the data and the consistency between SQL Server and MySQL must be guaranteed.

Based on the above pain points, the team designed a technical solution of " full synchronization, incremental synchronization, and real-time traffic grayscale switching ".

First of all, in order to ensure seamless switching of data, a real-time double-writing scheme is adopted.

Due to complex business logic and technical differences between SQL Server and MySQL, a double write to MySQL may fail, and once it fails, the SQL Server and MySQL data become inconsistent, which is not allowed.

Therefore, the strategy the team adopted was: during the trial run, write primarily to SQL Server, then write to MySQL asynchronously through a thread pool.

If a write fails, retry three times; if it still fails, log it and manually investigate the cause. Double writing continues until the system has run for a sustained period with no double-write failures.

With this strategy, the correctness and stability of the double writes can be ensured in most cases. Even if SQL Server and MySQL data become inconsistent during the trial run, MySQL can be fully rebuilt from SQL Server, because the double-write strategy guarantees that the SQL Server write always succeeds; in other words, the data in SQL Server is the most complete and correct.
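The retry-then-log strategy above can be sketched as follows (in the real system the MySQL write runs asynchronously on a thread pool; here it is inlined for clarity, and all names are illustrative):

```python
import logging

def double_write(record, write_sqlserver, write_mysql, retries=3):
    """Primary write to SQL Server must succeed (it remains the source of
    truth); the MySQL write is best-effort with 3 retries, and failures
    are logged for manual follow-up instead of failing the request."""
    write_sqlserver(record)
    for attempt in range(1, retries + 1):
        try:
            write_mysql(record)
            return True
        except Exception as exc:
            logging.warning("MySQL double-write attempt %d failed: %s", attempt, exc)
    logging.error("MySQL double-write gave up; record kept for manual repair: %s", record)
    return False
```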

As shown below:

Having covered double writing, let's look at how reads were grayscaled.

The overall idea is to grayscale the traffic gradually through an A/B platform. At the beginning, 100% of reads go to the SQL Server database; then traffic is gradually cut over to MySQL, starting at 1%. If there are no problems, the share is progressively increased until, finally, 100% of the traffic reads from MySQL.

In the process of gradually grayscale traffic, a verification mechanism is required. Only when the verification is ok can the traffic be further enlarged.

So how is this verification mechanism implemented?

The solution: within a query request, an asynchronous thread compares whether the SQL Server and MySQL results are consistent; if they are not, the difference is logged and the cause investigated manually.
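A sketch of the grayscale read plus verification (the percentage comes from the A/B platform; in production the comparison runs on an async thread, and all names here are assumptions):

```python
import random

def read_member(member_id, mysql_pct, query_sqlserver, query_mysql, report_diff):
    """Read both stores, report any mismatch for manual review, and
    serve the MySQL result for roughly mysql_pct% of requests."""
    old, new = query_sqlserver(member_id), query_mysql(member_id)
    if old != new:
        report_diff(member_id, old, new)
    return new if random.uniform(0, 100) < mysql_pct else old
```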

As shown below:

Therefore, the overall implementation process is as follows:

First, in the dead of night when traffic is at its lowest, complete the full data synchronization from the SQL Server to the MySQL database.

Then, enable double writing.

From this point on, any new user registration is written to both databases in real time.

However, between the full sync and the moment double writing was enabled, the two databases diverged, so an incremental synchronization is needed to backfill that window and prevent inconsistency.

The rest of the time is spent monitoring various logs to see if there is a problem with double writing, to see if the data comparison is consistent, and so on.

This period of time is the longest and most prone to problems.

If a problem is serious enough to cause data inconsistency, the process starts over:

rebuild the full MySQL database from SQL Server, then grayscale the traffic again, until finally 100% of the traffic is on MySQL. At that point the job is done: the grayscale logic is taken offline, and all reads and writes go to the MySQL cluster.

MySQL and ES active/standby cluster solution

After this step, the team felt the member master database should be fine, but a serious failure of the DAL component changed their thinking.

That failure was terrible: many applications across the company could not connect to the database, and the volume of newly created orders plummeted.

It made the team realize that even with a healthy database, an abnormality in the DAL component could still bring the membership system down.

Therefore, the team made the member master database's data source heterogeneous once more, double-writing the data to ES as well, as follows:

If the DAL component fails or the MySQL database goes down, reads and writes can be switched to ES.

After MySQL recovers, the data is synchronized back to MySQL, and reads and writes are finally switched back to the MySQL database.

As shown below:

ES High Availability Solution

ES dual-center active-standby cluster architecture

After the integration of Tongcheng and eLong, the total number of members across all platform systems exceeded one billion.

With such a large data volume, the business lines' query dimensions are also quite varied.

Some business lines query member information by mobile number, some by WeChat unionid, some by eLong card number, and so on.

Based on such a large amount of data and so many query dimensions, the team chose ES to store unified membership relationships.

ES clusters are very important in the entire membership system architecture, so how to ensure the high availability of ES?

First of all, the team knows that the ES cluster itself guarantees high availability, as shown in the following figure:

When one node of the ES cluster goes down, the Replica Shards corresponding to other nodes will be upgraded to Primary Shards to continue to provide services.

But even that is not enough.

For example, ES clusters are deployed in computer room A, and now computer room A suddenly loses power, what should I do?

For example, if the server hardware fails, most machines in the ES cluster are down, what should I do?

Or suddenly a very popular flash-sale event brings a huge wave of traffic that outright overwhelms the ES cluster. What then?

Faced with these situations, should the ops team rush to the computer room to fix things?

That is unrealistic, because the membership system directly affects the order-placing flow of every business line; fault recovery must be very fast, and waiting for manual intervention from ops would take far too long, which is absolutely intolerable.

How to do the high availability of ES?

The team's solution is the ES dual-center active-standby cluster architecture.

The team has two computer rooms, namely computer room A and computer room B.

The team deploys the ES main cluster in computer room A, and deploys the ES standby cluster in computer room B.

The reading and writing of the member system are all in the ES main cluster, and the data is synchronized to the ES standby cluster through MQ.

If the ES main cluster goes down, a unified configuration change switches the membership system's reads and writes to the ES standby cluster in computer room B, so failover is achieved in a short time and the membership system keeps running stably.

Finally, after the failed ES main cluster recovers, a switch is turned on to synchronize the data written during the failure back to the main cluster; once the data is consistent, reads and writes are switched back to the ES main cluster.
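The unified-configuration switch can be sketched like this (endpoint names are assumptions; the MQ replay of outage-window data is represented only by the precondition comment):

```python
class EsRouter:
    """Routes member reads/writes to the active ES cluster; flipping the
    switch is a configuration change, not a redeploy."""
    MAIN, STANDBY = "es-main@roomA", "es-standby@roomB"

    def __init__(self):
        self.active = self.MAIN

    def failover(self):
        # ES main cluster is down: serve from the standby in room B
        self.active = self.STANDBY

    def failback(self):
        # precondition: MQ has replayed the data written during the outage,
        # so main and standby are consistent again
        self.active = self.MAIN
```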

As shown below:

ES traffic isolation three-cluster architecture

The dual-center ES active-standby clusters seemed to leave no major gaps, but a frightening traffic shock last year made the team think again.

It was a holiday, and a business line launched a marketing campaign in which a single user request called the membership system more than 10 times, causing the membership system's TPS to skyrocket and nearly blowing up the ES cluster.

The incident scared the team, and made them realize that callers must be prioritized and more refined isolation, circuit-breaking, downgrade, and rate-limiting strategies implemented.

First, the team sorted out all the callers and divided them into two types of requests.

The first category is requests that are closely related to the main process of placing an order. These requests are very important and should be guaranteed with high priority.

The second category is related to marketing activities. This type of request has a characteristic. They have a large number of requests and a high TPS, but they do not affect the main process of placing an order.

Based on this, the team built a separate ES cluster dedicated to high-TPS marketing and flash-sale requests, isolating it from the ES main cluster so that a traffic spike from some marketing campaign cannot affect the user's main order-placing flow. As shown below:

ES cluster depth optimization and improvement

Having covered the high-availability architecture of the ES dual-center active-standby clusters, let's now dig into the optimization of the ES main cluster.

For a while the team was in real pain: every mealtime, the ES cluster started firing alerts, so everyone felt uneasy at every meal, afraid the ES cluster could not take the load and the whole company's business would suffer.

So why did the alerts fire right at mealtime?

Because traffic was relatively heavy at those times: the number of ES threads soared, CPU shot straight up, and query latency grew, and the slowdown propagated to every caller, causing delays across a wider range.

So how to solve this problem?

By digging into the ES cluster, the team discovered the following problems:

  • The ES load was unbalanced, with a serious hotspot problem. The ES main cluster has dozens of nodes; some nodes host too many shards while others host few, so some servers are heavily loaded and frequent alerts fire at traffic peaks.
  • The ES thread pool was set too large, causing CPU to soar. When setting the ES threadpool, the number of threads is generally set to the server's CPU core count. Even when ES query pressure is high and more threads are needed, it is best not to exceed "cpu cores * 3 / 2 + 1". If too many threads are configured, the CPU frequently switches between thread contexts and wastes a lot of CPU resources.
  • Shards were allocated too large, at 100 GB, which slowed queries. The number of shards for an ES index should be allocated reasonably, keeping each shard's size within 50 GB. An oversized shard slows queries, increases latency, and seriously hurts performance.
  • String fields were mapped as dual fields, both text and keyword, doubling storage. Member-information queries do not need relevance scoring; they can be queried directly by keyword, so the text field can be removed entirely, saving a large amount of storage and improving performance.
  • For ES queries, use filter context rather than query context. Query context computes a relevance score for each result, which costs CPU, but member-information queries do not need scoring, so this performance loss can be avoided entirely.
  • To save ES compute, sort ES search results in the membership system's JVM memory instead.
  • Add a routing key. An ES query normally fans out to all shards, aggregates their results, and returns them to the caller. If we already know which shard holds the data, a routing key eliminates a large number of unnecessary shard requests and improves query performance.
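Two of the items above are easy to show concretely: a filter-context query (no scoring) sent with a routing key, and the thread-count upper bound. The `card_no` field name is an assumption:

```python
# Filter context instead of query context: no relevance scoring is computed.
search_body = {
    "query": {"bool": {"filter": [{"term": {"card_no": "E1234567"}}]}}
}
# Routing key: send the request only to the shard holding this member,
# instead of fanning out to every shard.
search_params = {"routing": "E1234567"}

def max_search_threads(cpu_cores: int) -> int:
    """Upper bound from the tuning rule above: cpu cores * 3 / 2 + 1."""
    return cpu_cores * 3 // 2 + 1
```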

After the above optimizations, the results were very significant: the ES cluster's CPU usage dropped sharply and query performance improved greatly.

CPU usage of the ES cluster:

Interface latency of the membership system:

Membership Redis cache solution

For a long time, the membership system has not been cached for two main reasons:

  • First, as mentioned above, the ES cluster's performance is excellent: over 30,000 requests per second with a 99th-percentile latency of about 5 ms, enough to cope with various difficult scenarios.
  • Second, some businesses require the member binding relationship to be consistent in real time, and membership is an old system, developed for over 10 years, a distributed system composed of many interfaces and subsystems.

Therefore, as long as there is an interface that is not considered in place and the cache is not updated in time, it will lead to dirty data, which in turn will cause data inconsistency.

For example:

  • Users cannot see WeChat orders on the APP
  • The membership level and mileage of APP and WeChat are not merged
  • WeChat and APP cannot cross-market and so on.

So why introduce a cache now?

Because of this year's flight blind-box promotion, which brought extremely high instantaneous concurrency.

Although the membership system came through unscathed, it was a close call. To be safe, the team finally decided to implement a caching solution.

Solving Redis cache inconsistency caused by ES's near-real-time (≈1 s) delay

While building the membership caching solution, the team encountered a problem caused by ES that could lead to inconsistent cached data.

ES writes are near real-time: if you add a document to ES and query it immediately, you will not find it; you have to wait about 1 second before it becomes searchable.

As shown below:

Why does ES's near real-time mechanism cause Redis cache data to be inconsistent?

Specifically, suppose a user deregisters his APP account. ES then needs to be updated to delete the binding between the APP account and the WeChat account.

ES data updates are near real-time, meaning the updated data only becomes queryable after about 1 second.

And within this 1 second, a request arrives to query the user's member binding relationship. It checks the Redis cache first and finds nothing, then queries ES and gets a hit, but what it finds is the old, pre-update data.

Finally, the request updates the queried old data to the Redis cache and returns it.

In this way, after 1 second, the membership data of the user in ES is updated, but the data in the Redis cache is still old data, which leads to the inconsistency between the Redis cache and the ES data. As shown below:

Faced with this problem, how to solve it?

The team's idea: when updating ES data, take a 2-second Redis distributed lock on the member, then delete that member's cached data from Redis, to keep the cached data consistent.

If a query request arrives during this window, it first tries the distributed lock; finding the member ID locked means an ES update has not yet taken effect, so after querying, it does not update the Redis cache and simply returns the result. This avoids the cache-inconsistency problem.

As shown below:

At first glance the above solution seems fine, but careful analysis shows it can still produce inconsistent cached data.

For example: just before an update request takes the distributed lock, a query request checks the lock, finds none, and proceeds to update the cache.

But right before it writes the cache, its thread is blocked. Meanwhile the update request arrives, takes the distributed lock, and deletes the cache entry. After the update finishes, the query thread wakes up, writes the now-stale data into the cache, and the cache holds dirty data.

See it? The crux is the concurrency conflict between "delete cache" and "update cache"; as long as the two are mutually exclusive, the problem is solved.

As shown below:
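Putting the scheme together: a sketch where the ES update takes a short per-member lock and deletes the cache, while a cache-filling read must see the lock free before writing back. In production the lock would be a Redis `SET NX PX`; here it is an in-process dict, and all names are illustrative:

```python
import time

class MemberCache:
    def __init__(self):
        self.cache = {}   # stands in for Redis cached data
        self.locks = {}   # member_id -> lock expiry (stands in for SET NX PX)

    def on_es_update(self, member_id, ttl=2.0):
        # lock for the ES refresh window, then DELETE the cache entry
        self.locks[member_id] = time.monotonic() + ttl
        self.cache.pop(member_id, None)

    def read(self, member_id, query_es):
        if member_id in self.cache:
            return self.cache[member_id]
        value = query_es(member_id)
        # only write back if no update is in flight; a stale ES read
        # during the 1 s refresh window can then never repopulate the cache
        if self.locks.get(member_id, 0) <= time.monotonic():
            self.cache[member_id] = value
        return value
```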

After the caching solution went live, statistics showed a cache hit rate above 90%, greatly relieving the pressure on ES and substantially improving the membership system's overall performance.

Redis dual-center multi-cluster architecture

Next, the team looked at how to ensure the high availability of the Redis cluster.

As shown below:

Regarding the high availability of the Redis cluster, the team adopted a dual-center multi-cluster model.

Deploy a set of Redis clusters in computer room A and computer room B respectively.

When updating cached data, double-write: success is returned only if the writes to the Redis clusters in both computer rooms succeed.

When querying cached data, query nearby in the computer room to reduce delay.

In this way, even if computer room A fails as a whole, computer room B can still provide complete member services.
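A sketch of the dual-write / read-nearest rules (the client classes and the fall-through to the remote room on a local miss are assumptions):

```python
def write_cache(key, value, redis_room_a, redis_room_b):
    """Succeed only if both rooms' clusters accept the write, so either
    room alone can serve complete member data."""
    return bool(redis_room_a.set(key, value)) and bool(redis_room_b.set(key, value))

def read_cache(key, local, remote):
    """Prefer the local room's cluster to cut cross-room latency."""
    value = local.get(key)
    return value if value is not None else remote.get(key)
```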

Outlook: more refined traffic control and downgrade strategies

No system can guarantee 100% freedom from failure, so the team must design for failure, that is, adopt more refined flow-control and downgrade strategies.

More refined flow control strategy

  • Hotspot control. In scalping and fraud scenarios, the same member ID generates a large number of repeated requests, forming a hot account. When such an account's access rate exceeds the configured threshold, a rate-limiting policy kicks in.
  • Flow-control rules per calling account. This mainly guards against heavy traffic caused by caller bugs, for example a caller invoking the membership interface many times in a loop within a single user request, multiplying the membership system's traffic. So flow-control rules are set per calling account, and a rate limit is enforced once the threshold is exceeded.
  • Global flow-control rules. The membership system can withstand over 30,000 TPS of concurrent requests. If a terrifying burst arrives at, say, 100,000 TPS, rather than letting it take down the membership database and ES, it is better to fast-fail the traffic beyond the system's capacity; requests within 30,000 TPS still get normal responses, and the membership system as a whole does not collapse.
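The global rule can be sketched as a fixed-window fast-fail limiter (the 30,000 TPS threshold is the article's; everything else, including the injectable clock for testing, is illustrative):

```python
import time

class GlobalLimiter:
    """Admit at most max_tps requests per one-second window; reject the
    rest immediately instead of queueing, so a 100K-TPS spike cannot
    drag the whole membership system down."""
    def __init__(self, max_tps=30000):
        self.max_tps = max_tps
        self.window = None
        self.count = 0

    def allow(self, now=None) -> bool:
        now = int(time.monotonic() if now is None else now)
        if now != self.window:              # new one-second window
            self.window, self.count = now, 0
        if self.count >= self.max_tps:
            return False                    # fast fail
        self.count += 1
        return True
```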

More refined downgrading strategy

  • Downgrade based on average response time. The member interface also depends on other interfaces. When the average response time of calls to a dependency exceeds the threshold, it enters a quasi-degraded state; if the average response time of incoming requests in the next 1 second still exceeds the threshold, the circuit breaker trips automatically for the next time window.
  • Downgrade based on exception count and exception ratio. When a dependency of the member interface throws exceptions, if the number of exceptions within 1 minute exceeds the threshold, or the ratio of exceptions to per-second throughput exceeds the threshold, the interface enters the degraded state and the circuit breaker stays open for the next time window.
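A sketch of the average-response-time rule (window size, threshold, and tick-based cooldown are illustrative assumptions; a real implementation, such as Sentinel or Hystrix, tracks wall-clock windows):

```python
class RtBreaker:
    """Trips when the average response time of a mini-window of calls
    exceeds the threshold; while open, calls fail fast for `cooldown`
    permission checks."""
    def __init__(self, rt_threshold_ms=200, window=5, cooldown=3):
        self.rt_threshold_ms = rt_threshold_ms
        self.window, self.cooldown = window, cooldown
        self.samples, self.open_for = [], 0

    def record(self, rt_ms):
        self.samples.append(rt_ms)
        if len(self.samples) >= self.window:      # evaluate per mini-window
            avg = sum(self.samples) / len(self.samples)
            self.samples.clear()
            if avg > self.rt_threshold_ms:
                self.open_for = self.cooldown     # trip the breaker

    def allow(self) -> bool:
        if self.open_for > 0:
            self.open_for -= 1
            return False                          # fail fast / degrade
        return True
```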

At present, the team's biggest pain point is the management of member calling accounts. Within the company, anyone who wants to call the member interface must apply for a calling account; the team records the account's usage scenario and sets flow-control and downgrade rules for it.

But in practice, the colleague who applied for an account may transfer to another department and still call the membership system; to save effort, they reuse the old account instead of applying for a new one. This makes it impossible to tell the actual usage scenario behind a member account, and thus impossible to implement more refined flow-control and downgrade strategies. So next, the team will sort through all calling accounts one by one. It is a huge and tedious task, but there is no way around it, so it must be done well.

Chapter 33 Video: 10Wqps Basic User Platform Architecture and Practice

The content of this article is included in the project introduction of "Chapter 33 Video: 10Wqps Basic User Platform Architecture and Practice".

It also comes with matching resume templates to help you rebuild and upgrade your resume highlights, and ultimately land a big-factory offer, build architecture, and earn a high salary.

So, the above is the "textbook" answer.

With this solution in hand, let's return to the earlier interview questions:

  • Tens of millions of data, how to do system architecture?

  • Billion-level data, how to do system architecture?

  • Tens of millions of traffic, how to do the system architecture?

  • Billion-level traffic, how to do system architecture?

  • How to structure a high-concurrency system?

The above solution is the perfect answer, the "textbook" answer.

In the follow-up, Nien will give you more and more exciting answers based on industry cases.

Of course, if you encounter such problems, you can ask Nien for help.

recommended reading

"Boom: living off "bragging" at JD.com, with a monthly salary of 40K"

"Too fierce: living off "bragging" at SF Express, with a monthly salary of 30K"

"It exploded... JD.com asked 40 questions; passing means 50W+"

"Numb from questions... Ali asked 27 questions in one sitting; passing means 60W+"

"Baidu grilled madly for 3 hours, got the big-factory offer; this guy is ruthless!"

"Too ruthless: what interviewing for senior Java really takes"

"One hour of ByteDance grilling, the guy got the offer; so ruthless!"

"Accepted a Didi offer: from this guy's three experiences, what do you need to learn?"

For the PDFs of "Nien's Architecture Notes", "Nien's High Concurrency Trilogy", and "Nien's Java Interview Collection", please visit the official account below [Technical Freedom Circle] to get them ↓↓↓


Origin: blog.csdn.net/crazymakercircle/article/details/132343058