Yirendai System Architecture - Evolutionary Road under High Concurrency

Iterations of the Yirendai System

Version 1.0 - Simple Troubles

[Figure: 1.PNG]


Before this iteration, the Yirendai system was really just a front-office site, a back-office system, and a database, with the front end deployed on multiple machines. The software was layered in the most traditional way: a Controller layer, a Service layer, and a DAO layer. Obviously such a system is not well suited to the Internet, and some problems are unavoidable. First, once there are more than 10,000 users and thousands of them online at the same time, this kind of deployment hits bottlenecks in both the application servers and the database. Second, as the team grows, all developers work on the same system, which leads to serious code conflicts.
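
For reference, a minimal sketch of this classic three-layer layout in Java; all class and method names here are illustrative, not Yirendai's actual code:

```java
// Minimal sketch of the traditional Controller / Service / DAO layering described above.
// Names (Loan, LoanDao, LoanService, LoanController) are illustrative only.
public class Loan {
    public long id;
    public String borrower;
}

class LoanDao {                          // layer 3: data access, talks to the single DB
    Loan findById(long id) {
        // SELECT ... FROM loan WHERE id = ?  (JDBC/ORM call omitted in this sketch)
        return new Loan();
    }
}

class LoanService {                      // layer 2: business rules
    private final LoanDao dao = new LoanDao();
    Loan getLoan(long id) {
        return dao.findById(id);         // no caching, no sharding - every call hits the DB
    }
}

class LoanController {                   // layer 1: handles requests from the front end
    private final LoanService service = new LoanService();
    Loan handleGetLoan(long id) {
        return service.getLoan(id);
    }
}
```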

Version 1.5 - Trying a "Big Tonic"

In response to the above problems we made some changes, which I describe as "taking a big tonic". A tonic usually has one very obvious characteristic: it works immediately, but the side effects are also severe.


[Figure: 2.PNG]

First of all, we paid more attention to performance at the page layer of Yirendai: using browser caching, compressed transmission, optimizing pages with YSlow, adding a CDN at the link layer, serving static content, and even adding a reverse proxy, so that this layer can absorb about 80% of the traffic. A cache cluster was added between the application servers and the database layer, and it blocks roughly another 80% of what gets through. Finally, the system was split vertically into multiple systems by business. The database also changed: at first it was a single host with a single database, and it became a master-slave setup, even one master with multiple slaves. The system could now support more than a million users, with tens of thousands online at once. Even so, the limiting factor was still the database. Although the two optimized layers block about 95% of the traffic, the growth of the business still exceeded what the database could bear, so the database remained a big bottleneck.

The second problem is team division. Each team builds its own system, but everyone still uses the same database. Designing and modifying the database becomes very troublesome: every time I want to change something, I have to ask the other teams whether a change like this will affect them.

The third problem is also very thorny: with caching used so heavily, the timeliness and consistency of data become more and more of a problem.

Version 2.0 - Refinement: "Opening a Small Stove"

In order to solve the problems of version 1.5, we needed fine-grained optimization, which I call "opening a small stove". First, plan data ownership reasonably, optimize query efficiency, and shorten database transaction time. Second, divide the systems so that each system uses a fixed set of tables; what we did every day was have operations find the slowest SQL statements in production and optimize them. Third, work on the transactions themselves, shortening transaction time as much as possible.

Then we started to focus on code quality and execution efficiency, and began to pay attention to concurrency problems. Once the user base reaches this size, the users themselves will test the concurrency problems for us. For example, the same user logs in on two clients with the same account and withdraws cash on both at the same time; if the program does not handle this well, it is very likely that he withdraws the money twice.
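
One common guard for the double-withdrawal case above (a sketch, not necessarily what Yirendai did) is a conditional UPDATE that only succeeds while the balance is still sufficient; the account table and column names are assumptions:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class WithdrawalGuard {
    /**
     * Returns true only if this withdrawal actually took effect.
     * Two concurrent requests for the same user cannot both succeed,
     * because the second UPDATE sees the already-reduced balance.
     */
    public boolean withdraw(Connection conn, long userId, long amountCents) throws SQLException {
        String sql = "UPDATE account SET balance = balance - ? " +
                     "WHERE user_id = ? AND balance >= ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, amountCents);
            ps.setLong(2, userId);
            ps.setLong(3, amountCents);
            return ps.executeUpdate() == 1;   // 1 row updated = success, 0 = rejected
        }
    }
}
```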

Finally, it is necessary to distinguish between requests that need strong consistency and those that only need eventual consistency, and to use caching and read-write separation appropriately to solve these problems.

Version 2.0 solved a lot of performance problems, but new ones appeared. There were more and more systems, and the dependencies between them became complicated; it was easy to end up with circular calls where A calls B, B calls C, and C calls back to A. The second problem was the growing number of calls between systems, with upstream systems overwhelming downstream systems. The third was also very troublesome: with so many systems, finding problems in production became harder and harder - imagine many systems deployed across many machines, and having to locate an online problem by digging through their logs.

So on this basis we did a few things. One is rate limiting. Rate limiting is usually based on two numbers: the maximum number of active threads and the number of requests per second; the maximum-active-threads limit suits high-cost tasks, while the requests-per-second limit suits low-cost tasks. Second, I suggest unifying the return values between internal systems as much as possible during this period: the return value must record the status (business normal, business exception, program exception) and an error description. Third, an existing RPC framework can be reused, or development can continue on top of the original framework, to complete the rate-limiting work.
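
As a rough illustration of the two throttling points above - a cap on concurrently active threads for expensive tasks, and a cap on requests per second for cheap ones - here is a minimal sketch using only the JDK (the class name and limits are illustrative, and the per-second counter is a simplified fixed window):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

public class SimpleLimiter {
    private final Semaphore activeThreads;   // limit 1: max concurrently active threads
    private final long maxPerSecond;         // limit 2: max requests per second
    private final AtomicLong windowStart = new AtomicLong(System.currentTimeMillis() / 1000);
    private final AtomicLong windowCount = new AtomicLong(0);

    public SimpleLimiter(int maxActiveThreads, long maxPerSecond) {
        this.activeThreads = new Semaphore(maxActiveThreads);
        this.maxPerSecond = maxPerSecond;
    }

    /** For high-cost tasks: reject if too many are already running. */
    public boolean tryRunExpensive(Runnable task) {
        if (!activeThreads.tryAcquire()) return false;    // over the concurrency limit
        try { task.run(); } finally { activeThreads.release(); }
        return true;
    }

    /** For low-cost tasks: reject once this second's quota is used up (not perfectly precise under races, good enough as a sketch). */
    public boolean tryRunCheap(Runnable task) {
        long nowSec = System.currentTimeMillis() / 1000;
        if (windowStart.getAndSet(nowSec) != nowSec) {
            windowCount.set(0);                           // entered a new one-second window
        }
        if (windowCount.incrementAndGet() > maxPerSecond) return false;
        task.run();
        return true;
    }
}
```

For the unified return value between systems, a small wrapper holding a status of one of the three kinds mentioned above (business normal, business exception, program exception) plus an error description is usually enough.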


[Figure: 3.PNG]

Now let's talk about finding logs. The picture shows the deployment of the Yirendai log system. On the far left are our business systems; their logs are collected into a Kafka queue, then the logs in Kafka are indexed into an ES (Elasticsearch) cluster, and finally we use Kibana and a log query system we developed ourselves to view them. Once the logs are gathered into one place, they become much easier to search.

On the software side, Yirendai uniformly uses SLF4J + Logback to write logs, and then chains logs together across business systems: certain parameters are passed implicitly between all servers and clients, following the call chain step by step, implemented with AOP. What parameters need to be passed to chain the logs, or rather, what should go into each log entry? First the time, with millisecond precision; second the serial number, a unique value generated for each request. Then the user session, device number, caller time (the APP uses the phone's local time), local machine IP, client IP, the user's real IP, and the number of systems the request has crossed. If we find an error, we can take the serial number from the error log, and then use it on the log query platform to find every system on that request's path and each system's log for that request. With all of this, locating problems becomes very easy.
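
The chaining itself can be illustrated with SLF4J's MDC (Mapped Diagnostic Context); the sketch below is only an assumption about how such propagation is typically wired up - Yirendai's actual implementation uses AOP, and the header name here is made up:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

import java.util.UUID;

public class TraceIdExample {
    private static final Logger log = LoggerFactory.getLogger(TraceIdExample.class);
    // Hypothetical header name used to carry the serial number between systems.
    public static final String TRACE_HEADER = "X-Trace-Id";

    /** Called at the entry point of each request (e.g. a servlet filter or AOP advice). */
    public static void beginRequest(String incomingTraceId) {
        String traceId = (incomingTraceId != null) ? incomingTraceId
                                                   : UUID.randomUUID().toString();
        MDC.put("traceId", traceId);      // every log line on this thread now carries it
        log.info("request started");
    }

    /** Called when making a downstream call: forward the same id in TRACE_HEADER. */
    public static String outgoingTraceId() {
        return MDC.get("traceId");
    }

    /** Called when the request finishes, so the thread can be safely reused. */
    public static void endRequest() {
        log.info("request finished");
        MDC.clear();
    }
}
```

With a Logback pattern such as `%d{HH:mm:ss.SSS} [%X{traceId}] %-5level %logger - %msg%n`, every line written through SLF4J automatically includes the serial number, which is what makes the cross-system search on the log query platform possible.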

After reaching 2.0, Yirendai's website could basically support the scale of a medium-to-large site, and there would not be many performance problems in the short term, but we kept going and continued to improve the system.

Version 3.0 - Splitting into Services

Version 3.0 can be summed up as going service-oriented. In plain terms, it means splitting: vertical splitting by business, and, on top of the vertical split, horizontal splitting of the systems. So how do we do service-orientation?

First of all, when splitting the business, you can start with a coarse split into basic services and business services. Basic services are further divided into non-business basic services and business-related basic services. Non-business basic systems obviously have nothing to do with the other systems. Business-related basic services have only a weak relationship with the business; basically their relationship with the business systems is limited to primary keys and foreign keys.


[Figure: 5.PNG]

Yirendai naturally splits into two systems: one is the lending business, the other is the wealth management business. The lending business can be further divided into back office, web, partner channels, and so on. Under it there is a layer of basic-service systems that provide basic services and interfaces. The basic service is split into two parts: one is the intake side (incoming applications) of the basic service, and the other is the lending side. During the split we found another problem: the wealth management and lending businesses cannot be fully separated, namely the matching between the two and the loan assets. Business like this that cannot be separated can be pulled out into its own system that provides services to both sides.


[Figure: 666.jpg]

Splitting a system may look easy, but in practice there are many problems. I have summed up the splitting methods as follows:

First, appropriate redundancy: redundancy ensures the database can still perform join queries. Most refactoring is not building a new system but modifying an existing one; adding a little redundancy at this point avoids having to modify code.

Second, data replication, but the system that owns the data must be the one with the authority to modify it and to initiate replication. This is best suited to the global configuration mentioned above. For example, basically every company has a few tables recording the country's provinces, cities and counties. Every system uses them, but not every system needs to fetch them through an interface; each system can keep its own redundant copy of the data.

Third, a little trick for verifying a database split: it is not necessary to physically split into two databases in order to verify it. You can create two accounts on the same database, with each account's permissions pointing only to its side of the split tables, so the effect of the split can be verified directly through the accounts.

Fourth, plan the services in advance. Before splitting, determine whether a service is read-heavy or write-heavy, and whether it handles fast requests or slow requests; different kinds of services need to be deployed separately.

Finally, the same data cannot be controlled by more than one system, and the same system cannot be owned by more than one team.

Version 4.0 - Cloud Outlook

After accomplishing the above, version 3.0 is more or less done, but there is still a lot of work ahead for Yirendai. Version 4.0 raises questions such as: a cloud platform, plans for multi-site deployment, whether large tables need vertical splitting, moving off IOE, and whether to use Docker for rapid deployment. These are things we will consider later when we do 4.0 or 5.0.

Optimization of Yirendai's Wealth Management System

Estimating Traffic Reasonably - Strong Consistency and Eventual Consistency

The three interfaces in the figure are the home page, the listing page, and the detail page.


[Figure: 6.5.PNG]


Before doing any optimization, we must first estimate the traffic reasonably. There are two common methods.

Evaluation method 1: weekday PV divided by the length of the peak (hot) period;

Evaluation method 2: number of online users during the peak period × average number of operations per person ÷ length of the peak period. Taking the Yirendai wealth-management client as an example: assume there are N million people during the peak period, each performing M operations on average, and essentially all of the loan assets are snapped up within about R minutes; that works out to roughly N × M / R million operations per minute, i.e. N × M / (60 R) million per second (for instance, with illustrative values N = 1, M = 10 and R = 5, that is roughly 33,000 operations per second).

After this rough estimate, a more detailed one should be made, distinguishing which traffic requires strong consistency, which only requires eventual consistency, and how large each of the two flows is. Strong consistency means the requested data must be the most accurate data at that moment; such data cannot go through read-write separation or caching. Eventual consistency has lower timeliness requirements: as long as the final result is correct, it is fine.

Suppose the M operations include: registration, registration verification code, login, unlocking the gesture password, the home page, browsing the product list, and so on. Some operations among them - product remaining balance, order generation, payment SMS, payment itself - all require strong consistency.

The solution for eventual consistency is very simple: it can be handled by adding more machines. Where timeliness requirements are relatively high, database read-write separation can be used directly, or, if caching is used, the cache time can be shortened; where timeliness requirements are low, a longer cache time should be used.
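
As a small illustration of "short cache for high timeliness, long cache for low timeliness", here is a toy in-memory cache with a per-entry TTL; in production this role is played by the cache cluster (e.g. Redis or Memcached), and the TTL values in the usage comments are made up:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;
        Entry(V value, long expiresAtMillis) { this.value = value; this.expiresAtMillis = expiresAtMillis; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();

    /** Return the cached value if still fresh, otherwise load it and cache for ttlMillis. */
    public V get(K key, long ttlMillis, Supplier<V> loader) {
        long now = System.currentTimeMillis();
        Entry<V> e = map.get(key);
        if (e != null && e.expiresAtMillis > now) {
            return e.value;                               // served from cache, no DB hit
        }
        V value = loader.get();                           // e.g. read from a slave DB replica
        map.put(key, new Entry<>(value, now + ttlMillis));
        return value;
    }
}

// Usage (TTL values are illustrative):
//   cache.get("productList", 5_000, () -> readFromSlave());       // high timeliness: 5 seconds
//   cache.get("provinceList", 3_600_000, () -> readFromSlave());  // low timeliness: 1 hour
```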


[Figure: 7.PNG]


The usual way to handle strongly consistent traffic is locking. You can use database locks, distributed locks such as ZK (ZooKeeper), or simply use a queue, since broadly speaking a queue is also a kind of lock. With database locks you can generally support up to about 2,000 concurrent operations per second. There are two ways to handle concurrency with database locks. The first uses a transaction: open the transaction, lock the shared resource, update the shared resource, then query the shared resource again and judge the result; if the result is valid, continue, and if not, roll the transaction back. The second does not use a transaction: add a guard condition to the WHERE clause of the UPDATE statement; if the number of updated rows is 1 the update succeeded, and if it is 0 it failed, in which case you need to write code to roll the data back yourself.
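
A sketch of the first (transactional) method in plain JDBC, assuming a hypothetical product table with a remaining_amount column; the second method is essentially the same guarded-UPDATE pattern shown earlier for the withdrawal example:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TransactionalLockExample {

    /** Open a transaction, lock the row, update it, re-query, judge the result, commit or roll back. */
    public boolean buy(Connection conn, long productId, long amount) throws SQLException {
        conn.setAutoCommit(false);
        try {
            // 1. lock the shared resource
            try (PreparedStatement lock = conn.prepareStatement(
                    "SELECT remaining_amount FROM product WHERE id = ? FOR UPDATE")) {
                lock.setLong(1, productId);
                lock.executeQuery();
            }
            // 2. update the shared resource
            try (PreparedStatement upd = conn.prepareStatement(
                    "UPDATE product SET remaining_amount = remaining_amount - ? WHERE id = ?")) {
                upd.setLong(1, amount);
                upd.setLong(2, productId);
                upd.executeUpdate();
            }
            // 3. query again and judge the result
            long remaining;
            try (PreparedStatement check = conn.prepareStatement(
                    "SELECT remaining_amount FROM product WHERE id = ?")) {
                check.setLong(1, productId);
                try (ResultSet rs = check.executeQuery()) {
                    rs.next();
                    remaining = rs.getLong(1);
                }
            }
            if (remaining < 0) { conn.rollback(); return false; }   // invalid result: roll back
            conn.commit();
            return true;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```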

What if the traffic is still unbearable?

In fact, the system can already withstand a very large amount of traffic, but the business may keep growing - what do we do if it still can't cope?

The first principle is that no distributed algorithm handles this kind of contention well; the best way is to funnel the operations to a single point and process them as a queue.

Second, if the concurrency at that single point is too large, split the granularity of the lock in an appropriate way.
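
A sketch of what splitting lock granularity can look like: instead of one global lock (or one global queue) for everything, key the lock by product id so only requests for the same product contend with each other; all names here are illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public class PerProductLocks {
    // One lock per product id instead of a single global lock.
    private final ConcurrentHashMap<Long, ReentrantLock> locks = new ConcurrentHashMap<>();

    public <T> T withLock(long productId, Supplier<T> criticalSection) {
        ReentrantLock lock = locks.computeIfAbsent(productId, id -> new ReentrantLock());
        lock.lock();
        try {
            return criticalSection.get();   // e.g. check the remaining amount and place the order
        } finally {
            lock.unlock();
        }
    }
}
```

The same idea applies to queues: one queue per product instead of one global queue keeps the single-point, serialized processing described above while multiplying throughput by the number of independent keys.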

Third, degrade appropriately: service quality can be reduced a little without affecting normal use. Adjusting the requirements slightly and increasing the time users wait for a result also helps; if users wait twice as long, the system may be able to withstand twice the previous concurrency. This can be smoothed over in the interaction design so that users still have a good experience.

Finally, adjust the operations strategy appropriately to spread out the periods when users are most active.
