How does Taobao withstand the hundred-million-level concurrency of Double 11? You will understand after reading this

Preface

Double 11 is coming soon. Taking the design of Taobao's back-end architecture as an example, this article introduces the evolution of a server-side architecture from one hundred concurrent users to tens of millions, and lists the relevant technologies encountered at each stage, so that everyone gains an overall understanding of how architectures evolve.

The article concludes with some principles of architecture design.

Basic concepts

Before introducing the architecture, and to make sure every reader is on the same page, here are a few of the most basic concepts in architecture design.

1) What is distributed?

When multiple modules of a system are deployed on different servers, it can be called a distributed system. For example, Tomcat and the database deployed on different servers, or two Tomcats with the same function deployed on different servers.

2) What is high availability?

If, when some nodes in the system fail, other nodes can take over and continue to provide services, the system can be considered highly available.

3) What is a cluster?

When software for a particular domain is deployed on multiple servers and provides one class of services as a whole, that whole is called a cluster.

For example, Zookeeper's Master and Slave nodes are deployed on separate servers and together form a whole that provides centralized configuration services.

In a typical cluster, a client can connect to any node to obtain service, and when a node in the cluster goes offline, other nodes can automatically take over, which shows that the cluster is highly available.

4) What is load balancing?

If requests sent to the system are evenly distributed to multiple nodes in some way, so that each node handles the request load evenly, the system can be considered load balanced.

5) What are forward proxy and reverse proxy?

When the inside of the system needs to access the external network, requests are forwarded through a proxy server in a unified manner. From the perspective of the external network, it is the proxy server that initiates the access; the proxy server is then implementing a forward proxy.

When external requests enter the system, the proxy server forwards them to some server inside the system. To the external request, only the proxy server is visible; the proxy server is then implementing a reverse proxy.

To put it simply, a forward proxy is a proxy server accessing the external network on behalf of the inside of the system, while a reverse proxy is a proxy server forwarding external requests to the system's internal servers.
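As a concrete illustration, here is a minimal Nginx fragment acting as a reverse proxy (Nginx appears again later in this article; the address is a made-up example, not taken from the original text): external clients only ever talk to the proxy, which forwards their requests to an internal server.

```nginx
# Reverse proxy sketch: clients connect to this server, which
# forwards every request to an internal application server.
server {
    listen 80;
    location / {
        proxy_pass http://10.102.4.1:8080;  # internal Tomcat (example address)
    }
}
```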

Architecture evolution

The Age of Innocence: Stand-alone Architecture

Take Taobao as an example: in the early days of the website, both the number of applications and the number of users were small, and Tomcat and the database could be deployed on the same server.

When the browser initiates a request to Taobao, the DNS server (Domain Name System) first converts the domain name to the actual IP address 10.102.4.1, and the browser then accesses the Tomcat at that IP.

Architecture bottleneck: as the number of users grows, Tomcat and the database compete for resources, and single-machine performance becomes insufficient to support the business.

The first evolution: Tomcat and the database are deployed separately.
Tomcat and the database each monopolize their own server's resources, significantly improving their respective performance.

Architecture bottleneck: As the number of users grows, concurrent reading and writing of the database becomes a bottleneck.

The second evolution: the introduction of local cache and distributed cache
Add a local cache on the same server as Tomcat (or within the same JVM), and add a distributed cache externally, to cache popular product information or the HTML pages of popular products. Caching can intercept most requests before they ever read or write the database, greatly reducing database pressure.

The technologies involved include: using memcached as the local cache and Redis as the distributed cache, as well as issues such as cache consistency, cache penetration/breakdown, cache avalanche, and the concentrated expiration of hot data.
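As an illustration of the idea, here is a minimal cache-aside sketch in Java using Jedis as the Redis client (the class, the Redis address, and the TTL are assumptions for the example; a production version would also need to guard against the penetration, breakdown, and avalanche problems just mentioned):

```java
import java.util.concurrent.ConcurrentHashMap;
import redis.clients.jedis.Jedis;

// Cache-aside sketch: local cache first, then the distributed
// cache, and only on a double miss fall through to the database.
public class ProductCache {
    private final ConcurrentHashMap<String, String> localCache = new ConcurrentHashMap<>();
    private final Jedis redis = new Jedis("cache-host", 6379); // assumed address

    public String getProduct(String productId) {
        String v = localCache.get(productId);            // 1. local cache (fastest)
        if (v != null) return v;
        v = redis.get("product:" + productId);           // 2. distributed cache
        if (v == null) {
            v = loadFromDatabase(productId);             // 3. fall through to the DB
            redis.setex("product:" + productId, 300, v); // expire to limit staleness
        }
        localCache.put(productId, v);
        return v;
    }

    private String loadFromDatabase(String productId) {
        return "...";  // placeholder for the real database query
    }
}
```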

Architecture bottleneck: the cache absorbs most access requests, but as the number of users grows, concurrency pressure falls mainly on the single Tomcat, and responses gradually slow down.

The third evolution: the introduction of reverse proxy to achieve load balancing
Deploy Tomcat on multiple servers, and use reverse proxy software (Nginx) to distribute requests evenly across the Tomcats.

It is assumed here that a single Tomcat supports up to 100 concurrent requests and a single Nginx supports up to 50,000. In theory, by distributing requests across 500 Tomcats, Nginx can withstand 50,000 concurrent requests.

The technologies involved include: Nginx and HAProxy, both reverse proxy software working at layer 7 of the network stack and mainly supporting the HTTP protocol; session sharing and file upload/download issues are also involved.
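A sketch of what the Nginx side might look like at this stage (the addresses and pool name are invented): an upstream block lists the Tomcat instances, and Nginx spreads requests across them round-robin by default.

```nginx
# Load balancing sketch: requests arriving at Nginx are
# distributed round-robin across the Tomcat pool.
upstream tomcat_pool {
    server 10.102.4.11:8080;
    server 10.102.4.12:8080;
    server 10.102.4.13:8080;   # ...one entry per Tomcat instance
}

server {
    listen 80;
    location / {
        proxy_pass http://tomcat_pool;
    }
}
```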

Architecture bottleneck: the reverse proxy greatly increases the concurrency the application layer can support, but the increased concurrency also means more requests penetrating through to the database, and the single-machine database eventually becomes the bottleneck.

The fourth evolution: separation of database reads and writes

Split the database into a read library and a write library; there can be multiple read libraries. Data in the write library is synchronized to the read libraries through a synchronization mechanism. For scenarios that need to query data immediately after writing it, an extra copy can be written to the cache, and the latest data obtained from the cache.

The technologies involved include: Mycat, a database middleware through which read/write separation and sharding (splitting databases and tables) can be organized, with clients accessing the underlying databases through it. Data synchronization and data consistency issues are also involved.
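Middleware such as Mycat does this routing transparently; the core idea can nevertheless be sketched in a few lines of Java (the DataSource wiring is assumed, and this is only an illustration of the idea, not Mycat's implementation): writes go to the master, reads rotate across the replicas.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicInteger;
import javax.sql.DataSource;

// Read/write separation sketch: one write (master) library,
// several read (replica) libraries chosen round-robin.
public class ReadWriteRouter {
    private final DataSource writeDb;
    private final DataSource[] readDbs;
    private final AtomicInteger next = new AtomicInteger();

    public ReadWriteRouter(DataSource writeDb, DataSource[] readDbs) {
        this.writeDb = writeDb;
        this.readDbs = readDbs;
    }

    public Connection connectionFor(boolean isWrite) throws SQLException {
        if (isWrite) {
            return writeDb.getConnection();       // all writes hit the master
        }
        int i = Math.floorMod(next.getAndIncrement(), readDbs.length);
        return readDbs[i].getConnection();        // reads rotate across replicas
    }
}
```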

Architecture bottleneck: businesses keep growing, the traffic gap between different businesses is large, and different businesses compete directly for the same database, affecting each other's performance.

The fifth evolution: databases are split by business

Data for different businesses is stored in different databases, reducing resource competition between businesses. For businesses with heavy traffic, more servers can be deployed to support them.

At the same time, tables belonging to different businesses can no longer be joined directly for analysis and must be handled by other means; but that is not the focus of this article, and interested readers can look up solutions themselves.

Architecture bottleneck: as the number of users grows, the single-machine write library gradually reaches its performance limit.

The sixth evolution: split big tables into small tables

For example, review data can be hashed by product ID and routed to the corresponding table for storage;

for payment records, tables can be created by the hour, and each hourly table can be further split into small tables, with the user ID or record number used to route the data.

As long as the amount of data in each table operated on in real time is small enough, and requests can be distributed evenly across the small tables on multiple servers, the database can improve performance through horizontal scaling. The aforementioned Mycat also supports access control in the case of big tables split into small tables.
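The routing logic itself is simple; here is a sketch for the review example above (the table count and naming scheme are invented for illustration, and the count must stay fixed once data has been written):

```java
// Hash routing sketch: the product ID determines which physical
// review table a row lives in, keeping every table small.
public class ReviewTableRouter {
    private static final long TABLE_COUNT = 64;

    public static String tableFor(long productId) {
        long bucket = Math.floorMod(productId, TABLE_COUNT);
        return "review_" + bucket;   // e.g. review_0 .. review_63
    }
}
// Usage sketch: "SELECT * FROM " + ReviewTableRouter.tableFor(productId) + " WHERE ..."
```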

This approach significantly increases the difficulty of database operation and maintenance and places higher demands on DBAs. When the database is designed with this structure, it can already be called a distributed database.

But this is only a logical database as a whole; the different parts of the database are implemented separately by different components.

For example, the management of sharding and the distribution of requests are implemented by Mycat, SQL parsing is implemented by the single-machine databases, read/write separation may be implemented by gateways and message queues, the aggregation of query results may be implemented by the database interface layer, and so on.

This architecture is actually an implementation of the MPP (Massively Parallel Processing) architecture.

At present there are many MPP databases, both open source and commercial. Popular open-source ones include Greenplum, TiDB, Postgresql XC and HAWQ; commercial ones include Nanda General's GBase, Ruifan Technology's Snowball DB, Huawei's LibrA, and so on.

Different MPP databases have different focuses. For example, TiDB focuses more on distributed OLTP scenarios, and Greenplum focuses more on distributed OLAP scenarios.

These MPP databases basically provide SQL-standard support like Postgresql, Oracle and MySQL: a query is parsed into a distributed execution plan and distributed to every machine for parallel execution, and finally the database itself aggregates the results and returns them.

They also provide capabilities such as permission management, sharding, transactions and data replicas, and most can support clusters of more than 100 nodes, greatly reducing the cost of database operation and maintenance and allowing the database to scale horizontally.

Architecture bottleneck: both the database and Tomcat can now scale horizontally, and the supportable concurrency is greatly increased. As the number of users grows, the single-machine Nginx eventually becomes the bottleneck.

The seventh evolution: use LVS or F5 to load balance multiple Nginx instances

Since the bottleneck is now Nginx itself, load balancing across multiple Nginx instances cannot be achieved simply by adding another layer of Nginx.

The LVS and F5 in the figure are load balancing solutions working at layer 4 of the network. LVS is software that runs in the operating system's kernel and can forward TCP requests and higher-level protocols, so it supports a richer set of protocols and its performance is much higher than Nginx's; it can be assumed that a single-machine LVS can support forwarding several hundred thousand concurrent requests;

F5 is a load balancing hardware, similar to the capabilities provided by LVS, with higher performance than LVS, but it is expensive.

Since LVS is single-machine software, if the server hosting LVS goes down, the entire back-end system becomes inaccessible, so a standby node is required.

The keepalived software can be used to provide a virtual IP and bind it to multiple LVS servers; when the browser accesses the virtual IP, the router redirects it to the real LVS server.

When the main LVS server goes down, keepalived automatically updates the routing information and redirects the virtual IP to another healthy LVS server, achieving high availability for the LVS layer.
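A sketch of the keepalived side (the interface, router ID and VIP are placeholder values): the same block runs on both LVS machines, one as MASTER and one as BACKUP, and the virtual IP floats to whichever node is alive.

```conf
vrrp_instance VI_1 {
    state MASTER              # BACKUP on the standby LVS node
    interface eth0
    virtual_router_id 51
    priority 100              # give the standby node a lower priority
    virtual_ipaddress {
        10.102.4.200          # the virtual IP that browsers actually reach
    }
}
```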

Note that in the figure above, the arrows from the Nginx layer to the Tomcat layer do not mean that every Nginx forwards requests to every Tomcat.

In actual use, several Nginx instances may sit in front of one subset of the Tomcats, made highly available through keepalived, while other Nginx instances connect to another subset; in this way, the number of Tomcats that can be attached is multiplied.

Architecture bottleneck: since LVS is also single-machine, as concurrency grows to the hundreds of thousands the LVS server eventually reaches its limit. By this time the number of users has reached tens of millions or even hundreds of millions; users are spread across different regions at different distances from the server room, so access latency differs significantly.

The eighth evolution: implement load balancing across machine rooms through DNS round-robin

In the DNS server, one domain name can be configured to correspond to multiple IP addresses, with each IP address corresponding to a virtual IP in a different machine room.

When a user visits Taobao, the DNS server uses round-robin or another policy to choose an IP for that user to visit. This achieves load balancing across machine rooms.

At this point the system can scale horizontally at the machine-room level: concurrency at the tens-of-millions to hundreds-of-millions level can be handled by adding machine rooms, and the volume of concurrent requests at the system's entrance is no longer a problem.
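In DNS terms this is just several A records for one name (the domain and addresses below are invented): the DNS server returns them in rotation, so successive users land in different machine rooms.

```dns
; DNS round-robin sketch: one name, one virtual IP per machine room
www.example-shop.com.    IN  A    10.102.4.200   ; VIP of machine room 1
www.example-shop.com.    IN  A    10.103.4.200   ; VIP of machine room 2
```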

Architecture bottleneck: as the data grows richer and the business develops, the requirements for retrieval and analysis become more and more varied, and the database alone cannot satisfy such rich requirements.

The ninth evolution: introduce technologies such as NoSQL databases and search engines

When the data in the database reaches a certain scale, the database becomes unsuitable for complex queries and can often only satisfy ordinary query scenarios.

For statistical report scenarios, a result may never finish computing when the data volume is large, and running one complex query slows down all the others.

For scenarios such as full-text search and variable data structures, a relational database is inherently unsuitable. Therefore, suitable solutions need to be introduced for specific scenarios.

For example, massive file storage can be handled by the distributed file system HDFS, and key-value data by solutions such as HBase and Redis;

full-text search scenarios can be handled by search engines such as ElasticSearch, and multi-dimensional analysis scenarios by solutions such as Kylin or Druid.
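For instance, a full-text query that would be painful as a SQL LIKE scan is a natural fit for ElasticSearch (the host, index and field names below are invented):

```bash
# Full-text search sketch: ElasticSearch scores matches by relevance.
curl -X GET "http://es-host:9200/products/_search" \
  -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "title": "wireless headphones" } }
}'
```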

Of course, introducing more components also increases the system's complexity: the data held by different components needs to be synchronized, consistency issues need to be considered, and more operation and maintenance means are needed to manage these components.

Architecture bottleneck: the components introduced cover the rich requirements, and the business dimensions can expand greatly. As a result, a single application ends up containing too much business code, and business upgrades and iteration become difficult.

The tenth evolution: a large application is split into small applications

Application code is divided along business boundaries, making each application's responsibilities clearer so that applications can be upgraded and iterated independently. At this point some common configuration may be shared between applications, which can be handled by the distributed configuration center Zookeeper.
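A minimal sketch of an application reading a shared setting from Zookeeper (the connect string and znode path are assumptions): every small application reads the same znode instead of carrying its own copy of the configuration.

```java
import org.apache.zookeeper.ZooKeeper;

// Configuration center sketch: fetch one shared setting from Zookeeper.
public class ConfigReader {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 15000, event -> {});
        byte[] data = zk.getData("/config/common/db-url", false, null);
        System.out.println("db-url = " + new String(data, "UTF-8"));
        zk.close();
    }
}
```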

Architecture bottleneck: different applications share common modules. Managing them separately within each application means multiple copies of the same code, so that when a common function is upgraded, the code of every application has to be upgraded with it.

The eleventh evolution: reused functions are separated out as microservices

If functions such as user management, orders, payment and authentication exist in multiple applications, the code for these functions can be extracted into individual services to be managed separately.

Such services are the so-called microservices. Applications and services access the common services through HTTP, TCP or RPC requests, among other methods, and each individual service can be managed by a separate team.

In addition, service governance, rate limiting, circuit breaking, degradation and other capabilities can be implemented through frameworks such as Dubbo and Spring Cloud, improving the stability and availability of the services.
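The shape of such a service can be sketched as follows (the interface and names are invented; a framework such as Dubbo or Spring Cloud would supply registration, discovery and transport around it). Consumer applications hold only the interface and call it like a local object.

```java
// The shared contract, extracted from the applications that each
// used to carry their own copy of user lookups.
public interface UserService {
    String getUserName(long userId);
}

// Provider side: the team that owns user management implements the
// interface and registers it with the RPC framework (omitted here).
class UserServiceImpl implements UserService {
    public String getUserName(long userId) {
        return "user-" + userId;   // placeholder for the real lookup
    }
}
```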

Architecture bottleneck: different services have different interface access methods, and application code must adapt to multiple access methods in order to use them. In addition, applications access services and services access each other, so the call chain becomes very complicated and the logic becomes confusing.

The twelfth evolution: introduce an enterprise service bus (ESB) to shield differences between service interfaces

The ESB performs unified access-protocol conversion: applications uniformly access back-end services through the ESB, and services also call each other through the ESB, thereby reducing the coupling of the system.

This architecture, in which a single application is split into multiple applications, common services are extracted and managed separately, and an enterprise service bus is used to relieve the coupling between services, is the so-called SOA (service-oriented) architecture. It is easily confused with the microservice architecture because the two look very similar.

In my personal understanding, microservice architecture refers more to the idea of extracting common services from the system to be operated and managed separately, while SOA refers to an architectural idea of splitting services and unifying service interface access. The SOA architecture contains the idea of microservices.

Architecture bottleneck: as the business keeps developing, the number of applications and services keeps increasing, and their deployment becomes more complex. Deploying multiple services on the same server also requires solving conflicts between operating environments.

In addition, for scenarios such as big promotions that require dynamic scaling, where a service's capacity must be expanded horizontally, the operating environment has to be prepared and the service deployed on each newly added machine, making operation and maintenance very difficult.

The thirteenth evolution: introduce containerization technology to achieve operating-environment isolation and dynamic service management

Currently the most popular containerization technology is Docker, and the most popular container management service is Kubernetes (K8S). Applications/services can be packaged as Docker images, which K8S dynamically distributes and deploys.

A Docker image can be understood as a minimal operating system that can run your application/service: it contains the service's running code, with the operating environment set up according to actual needs.

After this entire "operating system" is packaged as an image, it can be distributed to whichever machines need to deploy the service, and the service can be launched simply by starting the Docker image, making deployment and operation simple.
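A sketch of packaging one such service (the base image, paths and port are assumptions): the image carries the JRE and the application together, so any machine with Docker can run it unchanged.

```dockerfile
FROM openjdk:8-jre
# Bundle the built service into the image
COPY target/order-service.jar /app/order-service.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "/app/order-service.jar"]
```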

Before a big promotion, some servers in the existing machine cluster can be set aside to launch additional Docker images and boost service capacity;

after the big promotion, the images can be shut down without affecting other services on those machines (whereas before containerization, running a service on a newly added machine required modifying the system configuration to suit the service, which would break the operating environment that other services on the machine depended on).

Architecture bottleneck: with containerization, the problem of dynamically scaling services is solved, but the machines still have to be managed by the company itself. Outside the big promotions, large amounts of machine resources must sit idle just to cope with the next one; operation and maintenance costs are extremely high and resource utilization is low.

The fourteenth evolution: the system is hosted on a cloud platform

The system can be deployed on a public cloud, using the cloud's massive machine resources to solve the problem of dynamic hardware resources.

During a big promotion, temporarily apply for more resources on the cloud platform and combine Docker and K8S to deploy services quickly; release the resources after the promotion ends. This is true pay-as-you-go, which greatly improves resource utilization and greatly reduces operation and maintenance costs.

A so-called cloud platform abstracts massive machine resources into a single resource whole through unified resource management.

On a cloud platform, hardware resources (such as CPU, memory, network, etc.) can be dynamically applied for on demand; a general operating system is provided on top, along with commonly used technical components (such as the Hadoop technology stack, MPP databases, etc.) for users to use, and even fully developed applications.

Users satisfy their needs (such as audio/video transcoding services, email services, personal blogs, etc.) without needing to care what technology is used inside the application.

The following concepts are involved in the cloud platform:

IaaS: Infrastructure as a Service. This corresponds to the level mentioned above at which machine resources are unified into a resource whole and hardware resources can be dynamically applied for;
PaaS: Platform as a Service. This corresponds to providing commonly used technical components to facilitate system development and maintenance;
SaaS: Software as a Service. This corresponds to providing fully developed applications or services, paid for according to functional or performance requirements.

So far: from handling high concurrent access to the service architecture and system implementation, each of the problems mentioned above has its own solutions.

But at the same time, you should also be aware that the introduction above intentionally omits practical problems such as cross-machine-room data synchronization and the implementation of distributed transactions; these will be discussed separately at a later time.

Summary of Architecture Design Experience

1) Does the adjustment of the architecture have to follow the evolution path described above?

No. The sequence of architecture evolution described above is just a single line of improvement aimed at one particular aspect.

In actual scenarios, several problems may need to be solved at the same time, or a different aspect may reach its bottleneck first; in that case, solve them in the order the actual problems dictate.

For example, in government-sector scenarios the concurrency may not be large but the business may be very rich; high concurrency is then not the key problem to solve, and the solutions for the rich requirements may be the first priority.

2) To what extent should the architecture be designed for the system to be implemented?

For a system implemented once with clear performance indicators, it is enough to design the architecture to support the system's performance targets, but interfaces for extending the architecture should be left in place in case they are needed.

For an evolving system, such as an e-commerce platform, it should be designed to meet the user volume and performance targets of the next stage, and the architecture should be upgraded iteratively as the business grows, to support higher concurrency and richer business.

3) What is the difference between server architecture and big data architecture?

The so-called "big data" is actually a general term for scene solutions such as massive data collection, cleaning and conversion, data storage, data analysis, and data services. Each scene includes a variety of optional technologies.

For example, data collection includes Flume, Sqoop and Kettle; data storage includes the distributed file systems HDFS and FastDFS and the NoSQL databases HBase and MongoDB; data analysis includes the Spark technology stack, machine learning algorithms, and so on.

In general, big data architecture is an architecture that integrates various big data components according to business needs. It generally provides distributed storage, distributed computing, multi-dimensional analysis, data warehouse, machine learning algorithms and other capabilities.

Server-side architecture refers more to the architecture at the level of application organization; the underlying capabilities are often provided by a big data architecture.

4) Are there any principles for architecture design?

  • N+1 design: every component in the system should be free of single points of failure;
  • Rollback design: ensure compatibility between versions so that there is a way to roll back when a system upgrade goes wrong;
  • Disable design: provide configuration that controls whether specific functions are available, so a faulty feature can be taken offline quickly when the system fails;
  • Monitoring design: the means of monitoring should be considered at the design stage;
  • Multi-active data center design: if the system requires extremely high availability, consider implementing active data centers in multiple locations, so that the system remains available even when one machine room loses power;
  • Adopt mature technology: newly developed or open-source technologies often have many hidden bugs; a failure without commercial support could be a disaster;
  • Resource isolation design: avoid a single business occupying all resources;
  • The architecture should scale horizontally: only when a system can scale horizontally can bottlenecks be effectively avoided;
  • Buy rather than build non-core functions: if non-core functions would consume a lot of R&D resources, consider buying mature products;
  • Use commercial hardware: commercial hardware can effectively reduce the probability of hardware failure;
  • Iterate quickly: the system should develop small functional modules quickly and go online for verification as soon as possible, finding problems early to greatly reduce the risk of system delivery;
  • Stateless design: service interfaces should be made stateless, so that access to the current interface does not depend on state from the previous access.

Easter eggs:

This is the end of this article. Friends who like it can help forward it and follow me. Thanks for your support!

I have collected and sorted out a lot of interview materials; friends in need can send me a private message.

I am also giving away materials on Spring source code analysis, Dubbo, Redis, Netty, Zookeeper, Spring Cloud and distributed data.

This knowledge is particularly suitable for:

1. Java programmers who plan to change jobs soon and need to interview, to check for gaps and shore up weaknesses as soon as possible;

2. Those who want to understand the latest recruitment and technical requirements of first-tier Internet companies, compare their own strengths and weaknesses against them, and evaluate their competitiveness in the current market;

3. Programmers who have not yet formed a systematic Java knowledge system and lack a clear direction and learning path for improvement;

4. Those who want to join a first-tier Internet company but lack confidence.

Original author: huashiou

Origin: blog.csdn.net/XingXing_Java/article/details/103109680