The 2021 edition of Alibaba's concurrent system architecture manual is released: no more worrying about being stumped in interviews

Basics

I have stepped on some of these pitfalls myself. A startup project I participated in adopted a service-oriented architecture from the very beginning. But with limited manpower and insufficient technical accumulation in the team at the time, we found during actual development that we could not keep such a complex architecture under control: problems were hard to pinpoint, overall system performance suffered, and even when the system went down it was difficult to trace the root cause. In the end we had to merge the services back into a simple monolithic architecture.

So I suggest that the evolution of a typical system should follow these ideas:

Start with the simplest design that satisfies the business needs and current traffic, using the technology stack the team is most familiar with.

As traffic grows and the business changes, fix the points in the architecture that become problems, such as single points of failure, obstacles to horizontal scaling, and components whose performance can no longer keep up. In this process, prefer components that are mature in the community and familiar to the team, and only build your own wheels when the community offers no suitable solution.

When patching and minor additions to the architecture can no longer meet the needs, consider major moves such as refactoring or rewriting to solve the existing problems.

Take Taobao as an example. When the business was going from 0 to 1, the priority was to stand up a system quickly, so one was simply purchased. Then, as traffic grew, Taobao carried out a series of technical transformations to improve its high-concurrency processing capability, such as migrating the database storage engine from MyISAM to InnoDB, sharding the database into multiple databases and tables, adding caches, and starting in-house middleware development. When those measures were no longer enough, it turned to large-scale refactoring of the overall architecture; the famous "Colorful Stone" project, for example, evolved Taobao's architecture from a monolith to a service-oriented one. It is through this step-by-step technical evolution that Taobao arrived at the architecture that now carries over 100 million QPS.


Increase the number of processing cores of the system


Database

Starting with this lecture, we officially enter the evolution chapters. I will take you through, one by one, the methods used to achieve these goals; each of them solves a specific problem that arises in high-concurrency system design. For example, in the 15th lecture I will cover the Bloom filter, a component used to deal with cache penetration, that is, to keep the cache hit rate as high as possible by stopping large numbers of requests for non-existent data from bypassing the cache and hitting the database.
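To give you a first taste of that component, here is a minimal Bloom filter sketch (an illustration of the data structure, not the manual's implementation): if it says a key is absent, the key is definitely absent, so requests for non-existent data can be rejected before they reach the cache or the database.

```java
import java.util.BitSet;

// Minimal Bloom filter: k bit positions per key over an m-bit array.
// If mightContain() returns false, the key is definitely absent, so the
// request can be rejected before it reaches the cache or the database.
public class BloomFilter {
    private final BitSet bits;
    private final int size;      // m: number of bits
    private final int hashCount; // k: number of hash functions

    public BloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit index from the key (simplified double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16);
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false; // definitely not present
            }
        }
        return true; // possibly present; false positives are allowed
    }
}
```

The trade-off is that mightContain() can return false positives but never false negatives, which is exactly what you want for filtering out queries for data that does not exist.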

Of course, simply explaining theory and solutions would be boring, so I will use a fictional system as the main thread running through the course: what problems does the system run into at each stage, what approaches do we adopt in response, and what technical points are involved along the way. By narrating this way, I hope to lead with problems drawn from cases, so that you understand how to solve each kind of problem when you actually encounter it. Throughout this process, I also hope you will think actively and apply what you learn to real projects.


How to split the database horizontally
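The manual goes through this in detail; as a minimal sketch of one common approach (the shard count and database naming below are made up for illustration), records are routed to a shard by hashing the partition key, for example the user ID.

```java
// Hash-based horizontal sharding: route each record to one of N database
// shards by hashing the partition key (here, the user ID).
public class ShardRouter {
    private static final int SHARD_COUNT = 16; // assumed number of shards

    // e.g. userId 12345 -> "order_db_9" (naming is illustrative only)
    public static String shardFor(long userId) {
        int shard = (int) Math.floorMod(userId, (long) SHARD_COUNT);
        return "order_db_" + shard;
    }
}
```

The catch, which the manual also discusses, is that once data is split this way, queries that do not carry the partition key can no longer be routed to a single shard.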


Cache

A cache is a component that stores data; its job is to make requests for that data return faster.
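As a toy illustration of the idea (a minimal in-process sketch, not a production cache), Java's LinkedHashMap can be turned into an LRU cache in a few lines; real systems would more likely use Redis or Memcached, but the principle is the same: keep hot data close and evict what is least recently used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal in-process LRU cache: a LinkedHashMap in access order evicts
// the least recently used entry once the capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true -> LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

A cache of 1,000 entries is then just new LruCache<Long, String>(1000); the eldest entry is dropped automatically on overflow.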

We often put the cache in memory, so some people equate memory with cache. That is a layman's view. As an industry insider, you should know that in some scenarios we may also use SSDs as a cache for cold data. For example, Pika, open-sourced by 360, stores data on SSD to get around Redis's capacity bottleneck.

In fact, any structure used to bridge the gap in data transfer speed between two kinds of hardware with a large speed difference can be called a cache. That being the case, we need to know what the latencies of common hardware components look like, so that we have a more intuitive sense of them when designing solutions.
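For reference (these are commonly cited rough figures, not the manual's exact table): accessing the CPU's L1 cache takes on the order of 1 ns, main memory about 100 ns, a random read from an SSD about 100 µs, a mechanical disk seek about 10 ms, and a network round trip within one data center about 0.5 ms. Each step down this hierarchy is orders of magnitude slower, and that gap is exactly what a cache exists to bridge.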


How to make the user's request reach the CDN node


Message queue

At the beginning of the course, I introduced the three goals of high-concurrency system design: performance, availability, and scalability. When it comes to improving performance, we have so far focused on query performance, and I spent considerable space on the distributed transformation of the database and on the principles and usage techniques of various caches. The reason is that most of the scenarios we encounter are read-heavy and write-light, especially in a system's early days.

For example, in the early days of a community product, only a small number of seed users produce content, while most users simply "watch" what others are saying. Overall traffic is relatively small at this point, and write traffic may be only one percent of the total; even if the overall QPS reaches 10,000 requests per second, writes are only about 100 per second. Spending effort optimizing write performance at this stage gives a poor cost-benefit ratio. However, as your business develops, you may run into scenarios with highly concurrent write requests.

A typical scenario: suppose your mall plans a flash-sale campaign that starts at 00:00 on the 5th and is limited to the first 200 buyers. As the sale approaches, the backend shows users frantically refreshing the app or the browser to make sure they can see the product as early as possible.

At this point you are still facing high read traffic, so what are the countermeasures?

Because users are querying a small set of products, this is hot query data, and you can use a caching strategy to block requests in the upper-level caches as much as possible. Anything that can be made static, such as the mall's images and videos, should be staticized so that it hits CDN node caches, reducing the query volume and bandwidth load on the web servers. Web servers such as Nginx can also read distributed cache nodes directly, keeping requests away from business servers such as Tomcat.

Of course, you can also add some rate-limiting strategies, such as discarding repeated requests from the same user, IP, or device within a short period of time.

With these several methods in place, you find that you can block requests as far away from the database as possible.

After the read pressure is somewhat relieved, the flash sale starts on time at 00:00, and users instantly send requests to the e-commerce system to create orders and deduct inventory. These writes are not covered by the cache and go straight to the database. Within one second, 10,000 database connections are opened simultaneously and the database is on the verge of collapse. Finding a solution that can handle such highly concurrent write requests becomes extremely urgent. At this point you think of the message queue.
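As a minimal sketch of the idea (assuming Kafka as the queue and the kafka-clients library; the broker address and topic name are made up for illustration): the web layer acknowledges the order quickly by appending it to the queue, and a downstream consumer drains the queue and writes to the database at a pace the database can sustain. This is the "peak shaving" role of a message queue.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Instead of writing the order to the database synchronously,
            // enqueue it; a downstream consumer persists orders at a
            // controlled pace (peak shaving).
            producer.send(new ProducerRecord<>("orders", "user-42", "order-created"));
        }
    }
}
```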


In Kafka, how the consumer's consumption progress (the offset) is stored differs between versions.
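To make the caption concrete: older Kafka versions kept consumer offsets in ZooKeeper, while newer versions store them in the internal __consumer_offsets topic. Either way, application code usually just commits progress through the client API; here is a minimal sketch (assuming the kafka-clients library, with broker address, group id, and topic name made up for illustration).

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "order-writers");
        props.put("enable.auto.commit", "false"); // commit offsets manually
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // write the order to the database here
                }
                consumer.commitSync(); // advance the consumption progress (offset)
            }
        }
    }
}
```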


Distributed service

Through the content of the previous chapters, you have optimized your vertical e-commerce system in terms of performance, availability, and scalability from the perspective of databases, caches, and message queues.

At present, the project is still deployed as a monolith: all functional modules of the e-commerce system, such as the order module, user module, payment module, and logistics module, are packaged into one large web project and deployed on an application server.

You vaguely feel that this deployment method may become a problem, so you Google around and find that when a system reaches a certain stage it tends to be split into microservices. You also read that Taobao's "Colorful Stone" project had a huge impact on the scalability of its overall architecture. All of this fascinates you.

But one question keeps lingering in your mind: what exactly should prompt us to split a monolithic architecture into a microservice architecture? Is it that once the system's overall QPS reaches 10,000, or perhaps 20,000, it must be split into microservices?


How to forward traffic to the Sidecar


Maintenance

Operation and maintenance occupies a large share of a project's life cycle; in importance it is almost on a par with development itself. In day-to-day operations, discovering and resolving problems promptly is every team's job. So at the start of your vertical e-commerce project, the operations team will have set up basic monitoring of machine metrics such as CPU, memory, disk, and network, hoping to find and handle problems in time. You thought everything was going smoothly, yet during operation you keep receiving user complaints, for reasons such as:

The master-slave replication delay of the database grew longer, breaking business features;

Interface response times grew longer, and users reported blank product pages;

A large number of errors occurred in the system, affecting users' normal use.

These problems should have been discovered and handled by you in time. But the reality is that you can only fix them passively after user feedback arrives. Your team then realizes that to discover and locate problems in the business system quickly, you need to build a complete server-side monitoring system. As the saying goes, "of ten thousand roads, monitoring is the first; if monitoring is not in place, the bosses shed two lines of tears." But in the process of building it, your team runs into trouble again:

First of all, which indicators should be monitored?

By what methods and channels can these indicators be collected?

Once the indicators are collected, how should they be processed and displayed?

These problems come one after another, and all of them bear on the stability and availability of the system. In this lesson I will take you through solving them and building a server-side monitoring system.
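As a minimal sketch of the collection step (a hand-rolled example, not the monitoring system the manual builds), a service can keep named counters and flush them periodically; in practice these indicators would be shipped to a monitoring backend rather than printed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// A tiny metrics registry: counters keyed by name, flushed every 10 seconds.
public class Metrics {
    private static final Map<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();

    public static void increment(String name) {
        COUNTERS.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    public static void startReporter() {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // In a real system, push these values to a monitoring backend
        // instead of printing them.
        scheduler.scheduleAtFixedRate(() -> COUNTERS.forEach((name, value) ->
                System.out.println(name + " = " + value.sumThenReset())),
                10, 10, TimeUnit.SECONDS);
    }
}
```

Business code then just calls Metrics.increment("http.requests") or Metrics.increment("http.errors") at the right points.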


Fixed window and sliding window algorithm
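The manual covers both algorithms; as a minimal sketch of the fixed-window variant (the limit and window length are made-up parameters), requests are counted within each time window and rejected once the quota is used up. The sliding-window algorithm refines this by counting over the most recent interval instead of calendar-aligned windows, which avoids letting a burst straddle a window boundary.

```java
// Fixed-window rate limiter: allow at most `limit` requests per window.
// Known weakness: a burst straddling the boundary of two windows can let
// through up to 2x the limit, which sliding windows are meant to fix.
public class FixedWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private long windowStart = System.currentTimeMillis();
    private int count = 0;

    public FixedWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {
            windowStart = now; // roll over to a new window
            count = 0;
        }
        return ++count <= limit;
    }
}
```

For example, new FixedWindowLimiter(100, 1000) permits at most 100 requests per second; callers that get false from tryAcquire() are rejected or asked to retry later.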


Hands-on practice

In the previous parts of the course, from the perspectives of databases, caches, message queues, and distributed services, I showed you how to ensure the high performance, high availability, and high scalability of a system in the face of high concurrency. Although the course contains many examples to help you understand the theory, there has not yet been one complete example that ties the knowledge together.

So, to put the knowledge we covered into practice, in this hands-on part I will use Weibo as the background and walk you through two complete cases of handling high concurrency and heavy traffic from a practical angle. I hope this gives you a more concrete, intuitive understanding and some ideas for when you implement similar systems. The first case is how to design a counting system that supports high concurrency and large storage volumes.


How to design a counting system that supports high concurrency
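The manual's full design is beyond a snippet, but as a minimal in-process sketch of the write path (a real Weibo-scale counter would live in sharded Redis-like storage, not in one JVM), java.util.concurrent.atomic.LongAdder keeps increments fast under heavy contention.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Per-object counters (e.g. likes per post). LongAdder spreads contended
// increments across internal cells, so it scales much better than a
// single AtomicLong under highly concurrent writes.
public class CounterService {
    private final Map<Long, LongAdder> likeCounts = new ConcurrentHashMap<>();

    public void incrementLikes(long postId) {
        likeCounts.computeIfAbsent(postId, id -> new LongAdder()).increment();
    }

    public long getLikes(long postId) {
        LongAdder counter = likeCounts.get(postId);
        return counter == null ? 0 : counter.sum();
    }
}
```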


Because there is a lot of content, I won't introduce it all one by one here.


Origin blog.csdn.net/m0_50180963/article/details/113994106