Alibaba 2: How many nodes do you deploy? How do you deploy for 10 million (1000W) concurrent users?

A few words up front

In the reader exchange group (50+) of Nien, the 40-year-old architect, some friends have recently earned interviews at first-tier Internet companies such as Alibaba, NetEase, Youzan, Xiyin (SHEIN), Baidu, and Didi, and ran into a few very important interview questions:

  • With 10 million (1000W) concurrent users, how many nodes need to be deployed?
  • How do you reason about the number of nodes? How do you estimate it and plan the deployment?

Nien's reminder: deployment architecture and node planning are core architecture knowledge and key production topics.

Therefore, Nien will give you a systematic review here, so that you can fully show your strong "technical muscles" and leave the interviewer "unable to help but drool".

This question and its reference answer have also been included in the V102 version of our "Nien Java Interview Guide" for later readers, to help everyone improve their 3-high architecture, design, and development skills.

For the PDFs of "Nien Architecture Notes", "Nien High Concurrency Trilogy" and "Nien Java Interview Guide", please go to the official account [Technical Freedom Circle] to obtain them.

Let's start by thinking about a question that also comes up often in interviews.

Suppose your company's product sells masks and the system normally supports 100,000 (10W) users.

In case of an emergency, such as an epidemic,

the number of users is expected to reach 10 million (1000W) within a month. If this task is given to you, what should you do?

How to analyze the problem of 10 million concurrent users

The question of how to support 10 million users is actually a fairly abstract question.

For developers, the goal needs to be quantified.

What does quantification mean? It means having clear performance indicator data to refer to when carrying out the key work.

For example, during peak hours, the system's transaction response time, number of concurrent users, query rate per second (QPS), success rate, etc.

The basic requirement for quantification is that all indicators must be clear.

Only in this way can the improvement and optimization of the entire architecture be effectively guided.

Therefore, if you face such a problem, you first need to find the core of the problem, which is to understand some quantifiable data indicators.

  • If you have relevant historical business transaction data, refer to it as much as possible: process the collected raw data (logs) and analyze the peak periods and the transaction behavior and scale during them, so that you clearly understand the details of the requirement.
  • If you have no data to refer to, you have to rely on experience. For example, you can borrow mature transaction models from similar industries (such as daily transaction activity in banking, or ticket sales and ticket checking in transportation), or simply start from the "80/20" rule and the "2/5/8" principle.
    When users get a response from the system within 2 seconds, they feel the system responds quickly;
    when they get a response within 2 to 5 seconds, they feel the response speed is acceptable;
    when they get a response within 5 to 8 seconds, they feel the system is slow but still tolerable;
    when there is still no response after more than 8 seconds, they feel the performance is extremely poor, or assume the system has stopped responding, and they leave the website or submit a second request.

While estimating key indicators such as response time, number of concurrent users, query rate per second (QPS), and success rate, you also need to pay attention to specific business functional requirements.

Each business function has its own unique characteristics. For example:

  • In some scenarios, there is no need to return clear execution results synchronously;
  • In some business scenarios, it is acceptable to return a prompt message such as "The system is busy, please wait!" to avoid large-scale system paralysis caused by excessive processing traffic.

Therefore, it is necessary to learn to balance the relationship between these indicators.

Service level agreement (SLA)

In most cases, it's best to set a priority order for these metrics and focus on only a few high-priority metric requirements whenever possible.

SLA : The abbreviation of Service-Level Agreement means service level agreement.

The SLA of a service is the formal commitment of the service provider to the service consumer and is a key item to measure the level of service capability.

The items defined in the service SLA must be measurable and have clear measurement methods.

The common SLA items, their meaning, how they are measured, and typical examples are listed below (for each item: whether it applies at the service level and at the interface level):

  • Request success rate: the percentage of requests the service answers successfully out of the total number of requests during the measurement period. Measurement: (number of successfully answered requests / total requests) * 100. Example: > 99%. Service level: yes; interface level: no.
  • Availability: the percentage of time the service is available during the measurement period, usually divided into three levels:
    1. 99.999%-99.9999%: the highest availability level; the cumulative unavailable time per year is 31.536 seconds to 5.256 minutes; unavailability of such services directly affects users (for example, login).
    2. 99.99%-99.999%: the cumulative unavailable time per year is 5.256 minutes to 52.56 minutes; unavailability affects user operations; typical of services that face users indirectly.
    3. 99.9%-99.99%: the cumulative unavailable time per year is 52.56 minutes to 8.76 hours; unavailability does not directly affect users.
    Measurement: (service online time / total measurement period) * 100. Example: level 1. Service level: yes; interface level: no.
  • Data consistency: after the service consumer writes data through the service interface and immediately reads it back through the interface, whether the written content can be read; three levels: 1. strong consistency, 2. weak consistency, 3. eventual consistency. Measurement: call the resource creation interface, then call the resource query interface and check whether the created data is returned. Example: eventual consistency. Service level: yes; interface level: no.
  • Throughput: the number of requests processed per second. For a service cluster, it is recommended to give a way to calculate the overall throughput, for example cluster throughput = single-instance throughput * number of instances; if that is hard to give, at least state a typical number of cluster instances. Measurement: count the requests the service handles per second. Example: 200. Service level: yes; interface level: optional.
  • TP50 request latency: the value below which 50% of request latencies fall during the service operation cycle. Measurement: percentile calculation. Example: 100 ms. Service level: yes; interface level: optional.
  • TP99.9 request latency: the value below which 99.9% of request latencies fall during the service operation cycle. Measurement: percentile calculation. Example: 200 ms. Service level: yes; interface level: optional.
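
To make the availability levels above concrete, here is a small illustrative Java sketch (not from the original article) that turns an availability target into the allowed downtime per year:

```java
public class AvailabilityBudget {
    // Convert an availability target (e.g. 0.9999) into allowed downtime per year.
    public static double allowedDowntimeMinutesPerYear(double availability) {
        double minutesPerYear = 365 * 24 * 60;          // 525,600 minutes
        return minutesPerYear * (1.0 - availability);   // unavailable fraction of the year
    }

    public static void main(String[] args) {
        System.out.printf("99.9%%   -> %.2f minutes/year%n", allowedDowntimeMinutesPerYear(0.999));   // ~525.6 min (8.76 h)
        System.out.printf("99.99%%  -> %.2f minutes/year%n", allowedDowntimeMinutesPerYear(0.9999));  // ~52.56 min
        System.out.printf("99.999%% -> %.2f minutes/year%n", allowedDowntimeMinutesPerYear(0.99999)); // ~5.256 min
    }
}
```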

1. Explanation of related concepts in concurrency

Before diving into the above issues, I would like to introduce you to some key evaluation metrics related to the system:

  • QPS
  • TPS
  • DAU
  • PV
  • UV

These key concepts have been covered by Nien in a dedicated article; please refer to the following for details:

What is the QPS of your system and how is it deployed? Assuming tens of millions of requests every day, how to deploy it?

2. Calculate the traffic of 10 million (1000W) users using the 80/20 rule

Let's go back to the original question: with 10 million (1000W) concurrent users, how many nodes need to be deployed?

Assuming we have no historical data to refer to, we can use the 80/20 rule to make an estimate.

  • Assume there are 10 million users and 20% of them visit the website every day; that gives about 2 million daily visitors.
  • Assume each user clicks 50 times on average; the total page views are PV = 100 million.
  • A day has 24 hours. By the 80/20 rule, most users are active during 24 * 0.2 ≈ 5 hours of the day, and most of the clicks (100 million * 80% ≈ 80 million PV) fall in that window. That means about 80 million requests arrive within 5 hours, i.e. roughly 4,500 requests per second (80,000,000 / (5 * 3,600)).
  • 4,500 is only an average. Traffic is not uniform across those 5 hours, and many users may arrive together (for a site like Taobao, daily traffic peaks concentrate around 14:00 and 21:00, with 21:00 the highest of the day). Empirically, peak traffic is 3 to 4 times the average; taking 4 times, the peak is about 18,000 requests per second. So the original problem of supporting 10 million users becomes a concrete one: the servers must sustain 18,000 requests per second (QPS = 18,000). The sketch after this list walks through the same arithmetic.
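
A minimal Java sketch of this estimate, using the same assumed numbers as the list above (20% daily actives, 50 clicks per user, an 80/20 busy window, and a 4x peak factor):

```java
public class TrafficEstimator {
    // Back-of-envelope estimate following the 80/20 rule described above.
    public static void main(String[] args) {
        long totalUsers    = 10_000_000L;            // 1000W users
        double dailyActive = totalUsers * 0.2;       // 20% visit every day -> 2,000,000
        double dailyPv     = dailyActive * 50;       // 50 clicks per user  -> 100,000,000 PV

        double activeHours = 5;                      // 24 * 0.2, rounded to 5 busy hours as in the text
        double busyPv      = dailyPv * 0.8;          // 80% of clicks in the busy window -> 80,000,000
        double avgQps      = busyPv / (activeHours * 3600);  // ~4,444 requests/second

        double peakFactor  = 4;                      // empirical: peak is 3-4x the average
        double peakQps     = avgQps * peakFactor;    // ~17,800, rounded to 18,000 in the text

        System.out.printf("average QPS ≈ %.0f, peak QPS ≈ %.0f%n", avgQps, peakQps);
    }
}
```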

3. Server pressure estimation

After roughly estimating the highest concurrency peak that the backend server needs to withstand, we need to conduct a stress test from the perspective of the entire system architecture, and then reasonably configure the number of servers and architecture.

First of all, we need to understand how much concurrency a server can withstand, so how to analyze it?

Since our application is deployed on Tomcat, we need to start with the performance of Tomcat.

The following is a diagram describing how Tomcat works. The diagram is explained as follows:

  • LimitLatch is the connection controller; it limits the number of connections Tomcat can handle simultaneously. In NIO/NIO2 mode the default is 10000; in APR/native mode the default is 8192.
  • Acceptor is an independent thread that calls socket.accept in the while loop of its run method to accept client connection requests. Whenever a new request arrives, accept returns a Channel object, which is handed to the Poller for processing.
    Poller is essentially a Selector and also runs as a thread. It maintains an array of Channels internally and keeps checking their data readiness in an infinite loop. Once a Channel becomes readable, it generates a SocketProcessor task and hands it to the Executor.
  • SocketProcessor implements the Runnable interface. When the thread pool executes a SocketProcessor task, it handles the current request through Http11Processor, which reads the Channel's data and builds a ServletRequest object.
  • Executor is the thread pool that runs SocketProcessor tasks. SocketProcessor's run method calls Http11Processor to read and parse the request data. Http11Processor is the encapsulation of the application-layer protocol: it calls the container to get the response and then writes the response back through the Channel.

From this figure we can know that there are four main factors that affect the number of Tomcat requests.

3.1 Tomcat influencing factor 1: Current server system resources

You may have encountered an exception similar to "Socket/File: Can't open so many files"; this is how the Linux limit on file handles shows up.

In the Linux operating system, each TCP connection occupies a file descriptor (fd). When the file descriptor exceeds the current limit of the Linux system, this error message will pop up.

We can use the following command to view the upper limit of the number of files that a process can open.

ulimit -a or ulimit -n

open files (-n) 1024 is the Linux limit on the number of file handles a single process may open (this also covers open sockets).

This is only a per-process (user-level) limit. There is also a system-wide limit; check it with:

cat /proc/sys/fs/file-max

file-max sets the total number of files that can be opened by all processes in the system.

In addition, a program can call setrlimit to set the limit for its own process. If we receive a large number of "too many open files" errors, we should consider increasing this value.

When encountering the above error, we can modify it in the following ways (limit on the number of open files for a single process)

vi /etc/security/limits.conf
  root soft nofile 65535
  root hard nofile 65535
  * soft nofile 65535
  * hard nofile 65535
  • * represents all users, and root represents the root user.
  • noproc is the maximum number of processes.
  • nofile is the maximum number of open files.
  • soft/hard: reaching the soft limit produces a warning, while the hard limit produces an error.

In addition, you also need to ensure that the limit on the number of open files at the process level is less than or equal to the total limit of the system. If not, then we need to modify the total limit of the system.

echo <new-limit> > /proc/sys/fs/file-max   # or set fs.file-max in /etc/sysctl.conf and run sysctl -p

The biggest overhead of TCP connections on system resources is memory.

Since a TCP connection requires both parties to receive and send data, a read buffer and a write buffer need to be set.

On Linux systems, the minimum size of these two buffers is 4096 bytes. Information can be obtained by viewing /proc/sys/net/ipv4/tcp_rmem and /proc/sys/net/ipv4/tcp_wmem.

Therefore, a TCP connection occupies at least 4096 + 4096 = 8 KB of memory. For a machine with 8 GB of memory, ignoring other limits, the maximum number of concurrent connections is roughly 8 GB / 8 KB = 8 * 1024 * 1024 KB / 8 KB, which is about 1 million.

This number is the theoretical maximum. In actual applications, due to the Linux kernel's restrictions on some resources and the impact of program business processing, it is difficult to reach 1 million connections with 8GB of memory.

Of course, we can increase the number of concurrencies by increasing memory.
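
As a quick illustration of that back-of-envelope limit (assuming only the minimal 4 KB read and write buffers per connection and nothing else):

```java
public class ConnectionCeiling {
    // Rough ceiling: available memory / per-connection buffer cost (read + write buffers).
    public static void main(String[] args) {
        long memoryBytes        = 8L * 1024 * 1024 * 1024;  // 8 GB of RAM
        long perConnectionBytes = 4096 + 4096;              // minimal tcp_rmem + tcp_wmem
        long ceiling            = memoryBytes / perConnectionBytes;
        System.out.println("theoretical max connections ≈ " + ceiling); // ≈ 1,048,576
    }
}
```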

3.2 Tomcat influencing factor 2: Configuration of the JVM that Tomcat depends on

We all know that Tomcat is a Java program running on the JVM.

Therefore, optimizing the JVM is also the key to improving Tomcat performance. Let's briefly introduce the basic situation of JVM, as shown in the figure below.

In the JVM, memory is divided into heap, program counter, local method stack, method area (metaspace) and virtual machine stack.

3.2.1 Heap space description

Heap memory is the largest area of JVM memory. Most objects and arrays are allocated here, and it is shared by all threads. The heap is divided into the young generation and the old generation, and the young generation is further divided into the Eden and Survivor areas, as shown in the figure below.

The ratio between the new generation and the old generation is 1:2, which means that the new generation occupies 1/3 of the heap space, and the old generation occupies 2/3.

In addition, in the new generation, the space allocation ratio is Eden:Survivor0:Survivor1=8:1:1.

For example, if the Eden area is 40 MB, then each of the two Survivor areas is 5 MB and the total young generation is 50 MB; the old generation is then 100 MB, so the total heap is 150 MB.

You can view the default parameters with java -XX:+PrintFlagsFinal -version

uintx InitialSurvivorRatio                      = 8
uintx NewRatio                                  = 2

InitialSurvivorRatio: The initial ratio of the new generation Eden/Survivor space

NewRatio: Memory ratio of Old area/Young area

The specific working mechanism of heap memory is as follows:

  • Most objects are allocated in the Eden area. When Eden fills up, a YGC (Young GC) is triggered: most objects are reclaimed, the surviving ones are copied to Survivor0, and Eden is cleared.
  • When the next YGC is triggered, the surviving objects in Eden + Survivor0 are copied to Survivor1, and Eden and Survivor0 are cleared.
  • On the following YGC, the objects in Eden + Survivor1 are copied to Survivor0. This cycle continues until an object's age reaches the threshold, at which point it is promoted to the old generation. (This design works because most objects in Eden die young.)
  • Objects that do not fit in the Survivor area go directly into the old generation.
  • When the old generation is full, a Full GC is triggered.

Note that a mark-sweep GC suspends the application threads while it executes (a stop-the-world pause), which is why GC frequency and duration matter for latency.

3.2.2 Program Counter

The program counter is used to record information such as the bytecode address executed by each thread. When a thread context switches, it is relied on to record the current execution position so that execution can continue from the last execution position when execution is resumed next time.

3.2.3 Method area

The method area is a logical concept. In version 1.8 of the HotSpot virtual machine, its specific implementation is the metaspace.

The method area is mainly used to store information related to classes that have been loaded by the virtual machine, including class meta information, runtime constant pool, and string constant pool. Class information also includes class version, fields, methods, interfaces, and parent class information.

The method area is similar to the heap space. It is a shared memory area, so the method area is shared by threads.

3.2.4 Native method stack and virtual machine stack

The Java virtual machine stack is a thread-private memory space. When a thread is created, a thread stack is allocated in the virtual machine to store information such as method local variables, operand stacks, and dynamic link methods. Every time a method is called, it will be accompanied by a push operation of the stack frame. When the method returns, it will be a pop operation of the stack frame.

The local method stack is similar to the virtual machine stack. The local method stack is used to manage the invocation of local methods, that is, native methods.

3.2.5 How to set JVM memory

After understanding the above basic knowledge, let's discuss how JVM memory should be set and what parameters can be used to set it.

In the JVM, the core parameters that need to be configured include:

  • -Xms, the initial Java heap size
  • -Xmx, the maximum Java heap size
  • -Xmn, the size of the young generation in the Java heap; what remains after deducting the young generation is the old generation. If the young generation is set too small, Minor GC will be triggered frequently, and frequent GC affects the stability of the system.
  • -XX:MetaspaceSize, the initial metaspace size, e.g. 128M
  • -XX:MaxMetaspaceSize, the maximum metaspace size, e.g. 256M (if these two parameters are not specified, the metaspace is resized dynamically at runtime). For a new system there is basically no way to calculate the metaspace requirement precisely; a few hundred megabytes is usually enough, because it mainly stores class information.
  • -Xss, the thread stack size; this rarely needs careful estimation, 512KB to 1M is enough, and the smaller the value, the more threads can be created.

The size of JVM memory is affected by the server configuration. For example, for a server with 2 cores and 4G of memory, the memory allocated to the JVM process is approximately 2G.

This is because the server also needs memory itself, and also needs to reserve memory for other processes. This 2G memory also needs to be allocated to stack memory, heap memory and metaspace, so the available heap memory is about 1G.

Then, the heap memory also needs to be divided into the new generation and the old generation.
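
Putting the rule of thumb above into a small sketch: the split ratios below (half the machine for the JVM process, half of that for the heap, young:old = 1:2) are illustrative assumptions, not fixed rules:

```java
public class JvmSizingSketch {
    // Illustrative sizing following the reasoning above: reserve roughly half of the
    // machine for the OS, other processes and non-heap JVM memory, then split the
    // heap 1:2 between the young and old generations (NewRatio = 2).
    public static void main(String[] args) {
        int machineMemMb = 4 * 1024;                 // 2-core / 4 GB server
        int jvmProcessMb = machineMemMb / 2;         // ~2 GB for the JVM process
        int heapMb       = jvmProcessMb / 2;         // ~1 GB usable heap (rest: metaspace, stacks, ...)
        int youngMb      = heapMb / 3;               // young : old = 1 : 2
        int oldMb        = heapMb - youngMb;

        System.out.printf("-Xms%dm -Xmx%dm -Xmn%dm  (old gen ≈ %dm)%n",
                heapMb, heapMb, youngMb, oldMb);
    }
}
```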

3.3 Tomcat influencing factor 3: Configuration of Tomcat itself

The core configuration of tomcat is as follows:

Apache Tomcat 8 Configuration Reference (8.0.53) - The HTTP Connector

The maximum number of request processing threads to be created by this Connector, which therefore determines the maximum number of simultaneous requests that can be handled. If not specified, this attribute is set to 200. If an executor is associated with this connector, this attribute is ignored as the connector will execute tasks using the executor rather than an internal thread pool. Note that if an executor is configured any value set for this attribute will be recorded correctly but it will be reported (e.g. via JMX) as -1 to make clear that it is not used.

server:
  tomcat:
    uri-encoding: UTF-8
    # maximum worker threads, default 200; for 4 cores / 8 GB memory the empirical value is 800
    # the OS pays a cost for switching between threads, so more is not always better
    max-threads: 1000
    # length of the wait queue, default 100
    accept-count: 1000
    max-connections: 20000
    # minimum spare worker threads, default 10; increase it a bit to absorb sudden traffic spikes
    min-spare-threads: 100
  • accept-count : This is the maximum waiting number. When the number of HTTP requests reaches the maximum number of threads of Tomcat, if a new HTTP request arrives, Tomcat will put the request in the waiting queue. This acceptCount refers to the maximum number of waits that can be accepted, and the default value is 100. If the waiting queue is also filled, new requests will be rejected by Tomcat (connection refused).
  • maxThreads : This is the maximum number of threads, every time an HTTP request arrives at the Web service, Tomcat will create a thread to process the request. maxThreads determines how many requests the web service container can handle simultaneously. The default value of maxThreads is 200, it is recommended to increase it. However, there is a cost to adding threads, and more threads not only bring more thread context switching costs, but also consume more memory. By default, the JVM will allocate a thread stack with a size of 1M when creating a new thread, so more threads mean more memory is required. The experience value of the thread number is: 1 core 2g memory is 200, the thread number experience value is 200; 4 core 8g memory, the thread number experience value is 800.
  • maxConnections : This is the maximum number of connections. This parameter specifies the maximum number of connections that Tomcat can accept at the same time. For Java's blocking BIO, the default value is the value of maxthreads; if a custom Executor is used in BIO mode, the default value will be the value of maxthreads in the executor. For Java's new NIO mode, the default value of maxConnections is 10000. For APR/native IO mode on Windows, maxConnections defaults to 8192.
    If it is set to -1, the maxconnections function is disabled, which means that the number of connections of the tomcat container is not limited. The relationship between maxConnections and accept-count is: when the number of connections reaches the maximum value maxConnections, the system will continue to receive connections, but will not exceed the value of acceptCount.
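
If you prefer to set these connector attributes in code rather than YAML, a hedged sketch for Spring Boot 2.x with embedded Tomcat might look like this (the values are the empirical figures quoted above, not defaults):

```java
import org.apache.catalina.connector.Connector;
import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
import org.springframework.boot.web.server.WebServerFactoryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class TomcatTuningConfig {

    // Sets the same connector attributes as the YAML above programmatically.
    @Bean
    public WebServerFactoryCustomizer<TomcatServletWebServerFactory> tomcatCustomizer() {
        return factory -> factory.addConnectorCustomizers((Connector connector) -> {
            connector.setProperty("maxThreads", "800");       // worker threads (4 cores / 8 GB)
            connector.setProperty("acceptCount", "1000");     // wait queue length
            connector.setProperty("maxConnections", "20000"); // simultaneous connections
            connector.setProperty("minSpareThreads", "100");  // threads kept warm for bursts
        });
    }
}
```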

3.4 Tomcat influencing factor 4: pressure brought by application

In our previous analysis, we learned that when NIOEndPoint receives the client's request connection, it will generate a SocketProcessor task and submit it to the thread pool for processing.

The run method in SocketProcessor will call the HttpProcessor component to parse the application layer protocol and generate a Request object.

Finally, call the Adapter's Service method to pass the request to the container.

The container is mainly responsible for processing internal requests, that is, after the current connector obtains information through the Socket, it will obtain a Servlet request, and the container is responsible for processing this Servlet request.

Tomcat uses the Mapper component to route the URL requested by the user to a specific Servlet; then Spring's DispatcherServlet intercepts the request and, based on Spring's own mapping, routes it to our specific Controller.

When the request reaches the Controller, it is the real start of the request for our business.

The Controller calls Service, and Service calls DAO. After completing the database operation, the request is returned to the client via the original route to complete an overall session.

Therefore, the business logic processing time in the Controller will have an impact on the concurrency performance of the entire container.

4. Estimating the number of servers

Let’s do some simple math:

Assume that the QPS of a single Tomcat node is 500. To support 18,000 QPS at peak, about 36 servers are needed (18,000 / 500); rounding up with some headroom, call it 40.

These 40 servers need to distribute requests through Nginx software load balancing.

Nginx performs very well; official figures state that it can serve static files at around 50,000 (5W) requests per second.

Since Nginx must not be a single point of failure, we can use LVS to load balance the Nginx instances. LVS (Linux Virtual Server) uses IP load balancing technology to achieve load balancing.

Through such a set of architectures, our current server can simultaneously undertake QPS=18000, but it is not enough. Let's go back to the two formulas mentioned earlier.

  • QPS=concurrency/average response time
  • Concurrency = QPS * average response time

Suppose our RT is 3 s; then server-side concurrency = 18,000 * 3 = 54,000, meaning 54,000 connections are open to the server side at the same time, and that is the number of simultaneous connections the servers must support.

If the RT is larger, it means there are more backlogged connections. These connections will occupy memory resources/CPU resources, etc., and may easily cause the system to crash.

At the same time, when the number of connections exceeds the threshold, subsequent requests cannot enter, and the user will get a request timeout result, which is not what we want to see. Therefore, we must shorten the value of RT.
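
The two formulas above can be turned into a tiny capacity sketch (the per-node QPS and the 3 s RT are the assumptions used in this section):

```java
public class CapacitySketch {
    public static void main(String[] args) {
        double peakQps      = 18_000;   // from the traffic estimate above
        double perNodeQps   = 500;      // assumed capacity of one Tomcat node
        double avgRtSeconds = 3;        // assumed average response time

        long nodes = (long) Math.ceil(peakQps / perNodeQps);        // 36 nodes, ~40 with headroom
        long concurrentRequests = (long) (peakQps * avgRtSeconds);  // concurrency = QPS * RT -> ~54,000 in flight

        System.out.println("nodes needed ≈ " + nodes);
        System.out.println("concurrent requests ≈ " + concurrentRequests);
    }
}
```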

5. How to reduce the value of RT?

Continuing to look at the picture above, a request needs to wait for the application in the Tomcat container to complete execution before it can be returned.

What operations will the request perform during execution?

  • Query database
  • Access disk data
  • Perform memory operations
  • Call remote service

These operations will consume time, and the client request needs to wait for these operations to complete before returning.

Therefore, the way to reduce response time is to optimize business logic processing.

5.1 Database optimization

When 18,000 requests enter the server and are received, business logic processing begins, which will inevitably involve database queries.

Each request performs at least one database query, and some require 3 to 5 or even more.

Assuming 3 queries per request, the database receives 54,000 requests per second.

Assuming a single database server supports 10,000 requests per second (many factors affect this: the amount of data in the tables, the performance of the database server, the complexity of the queries), 6 database servers are needed.

In addition, there are other optimization solutions at the database level.

  • The first is MySQL's maximum-connections setting. When traffic is too high you may hit MySQL: ERROR 1040: Too many connections, which means the connection quota is exhausted. If the server handles a large number of concurrent connection requests, it is advisable to raise this value to allow more parallel connections. However, the machine's capacity must be considered: every additional connection gets its own connection buffer and therefore consumes more memory, so the value should be tuned appropriately rather than raised blindly.
  • Introduce a caching component. When a table holds tens of millions or even hundreds of millions of rows, SQL optimization alone helps little, because queries over that much data inevitably involve heavy computation. Excessive read concurrency can instead be absorbed by a cache. Database traffic generally follows the 80/20 rule: of the 54,000 requests per second, roughly 43,200 are reads, and about 90% of those reads can be served from the cache (a minimal cache-aside sketch follows this list).

The reasons why putting data in the MySQL database into the Redis cache can improve performance are as follows:

  1. Redis stores data in Key-Value format, and its search time complexity is O(1) (constant order), while the underlying implementation of the MySQL engine is B+Tree, and its time complexity is O(logn) (logarithmic order). Therefore, Redis has a faster query speed than MySQL.
  2. MySQL data is stored in tables, and searching for data requires a global scan of the table or a lookup based on an index, which involves disk lookups. Redis doesn't need to be so complicated, because it looks up the data directly based on the location in memory.
  3. Redis is a single-threaded multiplexed IO, which avoids the overhead of thread switching and IO waiting, thereby improving the processor usage efficiency under multi-core processors.
  • Splitting databases and tables reduces the amount of data in a single table. With less data in a single table, query performance is naturally effectively improved.
  • Separate reads from writes to keep write transactions from hurting query performance. Writes themselves are expensive: a database write is disk IO and usually involves uniqueness checks, index building and index sorting, so its response time is often several or even dozens of times that of a read. Writes also need locks (table-level, row-level, and so on), and these are exclusive: while one session holds an exclusive lock, other sessions cannot read the data, which badly hurts read performance. For these reasons MySQL is usually deployed with read-write separation: the master handles writes and a small number of time-sensitive reads, while the slaves take most of the reads. This greatly improves the overall performance of the database.
  • sql+nosql heterogeneous storage. Different storage libraries are used for data with different characteristics, such as MongoDB (NoSQL documented storage), Redis (NoSQL Key-Value storage), HBase (NoSQL columnar storage), these databases are similar to Key-Value databases to some extent . Nosql has high scalability and is suitable for managing large amounts of unstructured data.

  • Client pooling technology reduces the performance loss of frequently creating database connections. Before each database operation, the connection is established first, then the database operation is performed, and finally the connection is released. This process involves network communication delays and the performance overhead of frequently creating and destroying connection objects. When the request volume is large, this performance loss will become very obvious. By using connection pooling technology, you can reuse already created connections and reduce this performance loss.
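
As promised above, a minimal cache-aside sketch. To stay self-contained it uses an in-process map in place of Redis and a placeholder for the DAO call; the pattern (check the cache, fall back to the database, then populate the cache) is the point:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Minimal cache-aside sketch. The cache here is an in-process map standing in
// for Redis, and loadFromDatabase() is a placeholder for the real DAO call.
public class ProductQueryService {

    private final ConcurrentMap<Long, String> cache = new ConcurrentHashMap<>();

    public String findProduct(long productId) {
        // 1. Try the cache first: most read traffic should be absorbed here.
        String cached = cache.get(productId);
        if (cached != null) {
            return cached;
        }
        // 2. Cache miss: fall back to the database, then populate the cache.
        String fromDb = loadFromDatabase(productId);
        if (fromDb != null) {
            cache.put(productId, fromDb);   // in Redis this would carry a TTL
        }
        return fromDb;
    }

    // Placeholder for the real database access (Service -> DAO in the text).
    private String loadFromDatabase(long productId) {
        return "product-" + productId;
    }
}
```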

5.2 Disk data access optimization

For disk operations, it mainly includes reading and writing. For example, in a trading system scenario, reconciliation files usually need to be parsed and written. Optimization methods for disk operations include:

  • Utilize disk cache and cache I/O to make full use of the system cache to reduce the number of actual I/Os.
  • Sequential reading and writing are used, and append writing is used instead of random writing to reduce addressing overhead and improve I/O writing speed.
  • Using SSD instead of HDD, the I/O efficiency of SSD is much higher than that of mechanical hard drive.
  • When frequently reading and writing the same disk space, mmap (memory mapping) can be used instead of read/write to reduce the number of memory copies.
  • In scenarios that require synchronous writes, merge write requests as much as possible instead of flushing every single request to disk; for example, use fsync() over a batch instead of opening the file with O_SYNC (see the sketch after this list).
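
A small sketch of the "sequential append plus batched flush" idea from the list above; the batch size is an arbitrary assumption:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Records are appended sequentially and fsync'd once per batch instead of once per write.
public class BatchedAppendWriter implements AutoCloseable {

    private static final int FLUSH_EVERY = 100;   // assumed batch size
    private final FileChannel channel;
    private int pendingWrites = 0;

    public BatchedAppendWriter(Path file) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    public void append(byte[] record) throws IOException {
        channel.write(ByteBuffer.wrap(record));   // sequential append, no seeking
        if (++pendingWrites >= FLUSH_EVERY) {
            channel.force(false);                 // one fsync for the whole batch
            pendingWrites = 0;
        }
    }

    @Override
    public void close() throws IOException {
        channel.force(false);
        channel.close();
    }
}
```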

5.3 Proper use of memory

Make full use of memory cache to store frequently accessed data and objects in memory to avoid repeated loading or reduce performance losses caused by database access.

5.4 Calling remote services

Remote service calls will affect I/O performance, mainly including:

  • Blocking of remote calls waiting for return results
    • Asynchronous communication
  • Network communication time
    • Intranet communication
    • Increase network bandwidth
  • Stability of remote service communications

5.5 Asynchronous architecture

In microservices, when processing takes a long time and the logic is complex, high concurrency may exhaust the service's threads so that no new threads can be created to handle requests.

For this situation, besides optimizing at the program level (database tuning, algorithm tuning, caching, and so on), you can also adjust the architecture: return a result to the client first so the user can continue with other operations, and then process the complex logic asynchronously on the server side.

This asynchronous processing method is suitable for scenarios where the client is not sensitive to the processing results and does not require real-time, such as mass email, mass message, etc.

Solutions for asynchronous design include:

  • Multithreading
  • Message Queuing (MQ)
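
A minimal sketch of the "respond first, process later" idea. It uses a plain thread pool for brevity; in production the background work would usually go through a message queue so it survives restarts:

```java
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// The caller gets an acknowledgement immediately; the heavy work runs on a background pool.
public class BulkEmailService {

    private final ExecutorService worker = Executors.newFixedThreadPool(4);

    public String submit(String campaignId) {
        String taskId = UUID.randomUUID().toString();
        worker.submit(() -> sendAllEmails(campaignId, taskId));  // slow work, done asynchronously
        return taskId;   // returned to the client right away
    }

    private void sendAllEmails(String campaignId, String taskId) {
        // placeholder for the long-running, non-time-sensitive processing (mass email, mass message, ...)
    }

    public void shutdown() {
        worker.shutdown();
    }
}
```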

6. Splitting of application services

In addition to the above methods, it is also necessary to split the business system into microservices for the following reasons:

  • Business development leads to an increase in the complexity of application programs, resulting in an increase in entropy.
  • Business systems have more and more functions, and more and more people are involved in development iterations. Maintaining a huge project is prone to problems.
  • It is difficult for a single application system to achieve horizontal expansion, and server resources are limited, which may cause all requests to be concentrated on a certain server node, resulting in excessive resource consumption and system instability.
  • Testing and deployment costs gradually increase.

The most important thing is that it is difficult to break through the performance bottleneck of a single application.

For example, to support 18,000 QPS, a single service node will definitely not be able to support it. Therefore, the advantage of service splitting is that multiple computers can be used to form a large-scale distributed computing network, and the entire business logic can be completed through network communication.

6.1 How to split services

Regarding how to split services, although it seems simple, you will encounter some boundary issues in actual operation.

For example, some data models apply to both module A and module B. How to draw the line? In addition, how should the granularity of service splitting be determined?

Usually, service splitting is carried out according to business, and the boundary division of microservices is guided according to domain-driven design (DDD).

Domain-driven design is a methodology that determines business boundaries and application boundaries by defining domain models to ensure the consistency of business models and code models .

Whether it is DDD or microservices, we need to follow the basic principles of software design: high cohesion and low coupling .

There should be high cohesion within services and low coupling between services.

In fact, a domain service corresponds to a set of functions, and these functions have certain commonalities.

For example, order services include functions such as creating orders, modifying orders, and querying order lists. The clearer the domain boundaries, the stronger the cohesion of functions, and the lower the coupling between services.

Service splitting also needs to be carried out based on the current technical team and company conditions.

For start-up teams, microservices should not be pursued excessively, so as not to cause the business logic to be too scattered and the technical architecture to be too complex. In addition, the infrastructure is not perfect, which may lead to extended delivery time and have a greater impact on the company's development. Therefore, when performing service splitting, the following factors also need to be considered:

  • Due to the market nature of the company's business field, if it is a market-sensitive project, the product should be launched first and then iterated and optimized.
  • The maturity of the development team and whether the team's technology can handle it.
  • Whether the basic capabilities are sufficient, such as DevOps, operation and maintenance, test automation and other basic capabilities. Whether the team has the ability to support the operation and maintenance complexity caused by the operation of a large number of service instances, and whether it can do a good job in service monitoring.
  • The execution efficiency of the testing team. If the testing team cannot support automated testing, automatic regression, stress testing and other means to improve testing efficiency, it will inevitably lead to a significant increase in testing workload and delay the project launch cycle.

For old system transformation, there may be more risks and problems involved. Before starting the transformation, the following steps need to be considered: the preparation stage before the split, the design of the split transformation plan, and the implementation of the split plan.

  • Before starting to decompose, you need a clear picture of the current overall architecture and the dependencies between modules. The preparation stage is mainly about understanding dependencies and interfaces; with that knowledge you know where to make the first cut, so that a complex monolith can quickly be turned into two smaller systems while minimizing the impact on the system's existing business. Avoid ending up with a distributed monolith: an application that contains many services which are tightly coupled to each other but must still be deployed together. If you cut blindly without in-depth analysis, you may sever an important dependency and cause a major (Class A) failure with disastrous consequences.
  • The focus of decomposition differs by stage, and each stage has its own core issues. The decomposition itself can be divided into three stages: separating the core business from the non-business parts, adjusting and redesigning the core business, and decomposing the inside of the core business. In the first stage, the core business is streamlined and the non-core parts are stripped off to shrink the system that has to be handled; in the second stage, the core business is redesigned according to microservice principles; in the third stage, the refactoring design of the core business is implemented. There are three methods of decomposition: code decomposition, deployment decomposition, and data decomposition.

In addition, each stage needs to focus on one or two specific goals. If there are too many goals, nothing may be achieved. For example, the microservice decomposition of a certain system sets the following goals:

  1. Performance indicators (throughput and latency): the throughput of core transactions is more than doubled (TPS: 1000 -> 10000), the latency of business A is halved (250 ms -> 125 ms), and the latency of business B is halved (70 ms -> 35 ms).
  2. Stability indicators (availability, failure recovery time): availability >= 99.99%, Class A failure recovery time <= 15 minutes, at most one failure per quarter.
  3. Quality indicators: complete product requirement documents, design documents, and deployment and operation documents; unit test coverage of the core transaction code above 90% with 100% automated coverage of test cases and scenarios; a sustainable performance-testing benchmark environment and a long-term, continuous performance optimization mechanism.
  4. Scalability indicators: complete a reasonable decomposition along the dimensions of code, deployment, runtime and data, so that every business and transaction module of the rebuilt core system, and its corresponding data storage, can be scaled at any time by adding machine resources.
  5. Maintainability indicators: establish comprehensive monitoring, especially real-time, full-link performance metrics covering all key businesses and states; shorten monitoring and alerting response times; work with the operations team on capacity planning and management; when problems occur, the system can be restarted or rolled back to the last working version within one minute (boot time <= 1 minute).
  6. Ease-of-use indicators: the new API produced by the refactoring is reasonable and simple, satisfies users at all levels, and customer satisfaction keeps rising.
  7. Business support indicators: for new business requirements, development efficiency is doubled and development resources and cycle time are halved, while quality is guaranteed.

Of course, don't expect to complete all the goals at once, and you can choose one or two high-priority goals for execution at each stage.

6.2 After splitting into microservices, how do you do service governance?

The microservice architecture first manifests itself as a distributed architecture. Secondly, we need to demonstrate and provide business service capabilities. Next, we need to consider various non-functional capabilities related to these business capabilities. These services scattered in different locations need to be managed uniformly while remaining transparent to service callers, which creates functional requirements for service registration and discovery.

Similarly, each service may be deployed as multiple instances on multiple machines, so we need routing and addressing plus load balancing to keep the system scalable. With so many externally exposed service interfaces, we also need a mechanism for unified access control and for applying non-business policies (such as permissions) at the access layer; this is the role of the service gateway. Moreover, as the business grows and specific operational campaigns run (flash sales, big promotions and so on), traffic may grow more than tenfold. At that point we have to consider system capacity and the strong and weak dependencies between services, and apply measures such as service degradation, circuit breaking, and system overload protection.

Due to the complexity brought by microservices, application configuration and business configuration are dispersed in various places. Therefore, the need for a distributed configuration center also arises.

Finally, once the system is deployed in a distributed fashion, all calls cross process boundaries, so we also need distributed tracing and online performance monitoring, so that we can see the internal state and metrics of the system at any time and analyze and intervene when needed.

6.3 Overall Architecture Diagram

Through comprehensive analysis from micro to macro, we can basically construct a complete architecture diagram.

  • Access layer, which is the portal for external requests to enter the internal system, and all requests must pass through the API gateway.
  • The application layer, also known as the aggregation layer, provides aggregation interfaces for related businesses and calls middle-end services for combination.
  • Atomic services include atomic technical services and atomic business services, which provide relevant interfaces according to business needs.

Atomic services provide reusable capabilities for the entire architecture.

For example, the comment service, as an atomic service, is required by the videos, articles, and communities of Bilibili. In order to improve reusability, the comment service can be an independent atomic service and cannot be tightly coupled with specific needs.

In this case, the comment service needs to provide a reusability capability that can adapt to different scenarios.

Similarly, functions such as file storage, data storage, push services, and authentication services will be precipitated into atomic services. Business developers can quickly build business applications by orchestrating, configuring, and combining them based on atomic services.

7. How to quantify and measure 3-high?


There is no exact definition of high concurrency, it mainly describes the situation of facing a large amount of traffic in a short period of time.

When an interviewer or your leader asks you how to design a system that can withstand tens of millions of requests, you can follow the steps below to analyze it.

  • First, you need to establish some quantifiable data indicators, such as query rate per second (QPS), daily active users (DAU), total number of users, transactions per second (TPS), and access peaks.
  • Then, based on this data, you start to design the architecture of the system.
  • Then implement it

7.1 Macroscopic indicators in high concurrency

A system that can meet high concurrency requirements does not simply pursue performance, but needs to meet at least three macro goals:

  • High performance , which is the embodiment of the parallel processing capability of the system. With limited hardware investment, improving performance means saving costs. At the same time, performance is also related to user experience. Whether the response time is 100 milliseconds or 1 second, the user's experience is completely different.
  • High availability , this is the time the system can provide services normally. Users will definitely choose the former between a system that is trouble-free and non-stop throughout the year and a system that often breaks down and goes down. In addition, if the system availability can only reach 90%, it will also have a significant impact on the business.
  • High scalability refers to the system's scalability, that is, whether it can complete expansion in a short time during peak traffic periods to more stably withstand peak traffic, such as Double 11 events, celebrity divorces and other hot events.

7.2 Micro indicators

Performance

Through performance indicators, we can measure current performance problems and use them as a basis for evaluating optimized performance. Usually, we use the interface response time over a period of time as a measurement criterion.

  1. Average response time : This is the most commonly used measurement, but it has the disadvantage of being insensitive to slow requests. For example, out of 10,000 requests, 9,900 are 1 millisecond and 100 are 100 milliseconds, then the average response time is 1.99 milliseconds. Although the average elapsed time has only increased by 0.99 milliseconds, the response time for 1% of requests has increased by a factor of 100.
  2. TP90, TP99 and other quantile values : This is an indicator that sorts the response time from small to large. TP90 represents the response time ranked in the 90th quantile. The larger the quantile value, the more sensitive it is to slow requests.
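
For illustration, a naive TP50/TP90/TP99 calculation over a batch of recorded latencies (real monitoring systems use streaming or histogram-based estimators instead of sorting everything):

```java
import java.util.Arrays;

// Sort the observed latencies and read the value at the given quantile position.
public class PercentileSketch {

    static long percentile(long[] latenciesMs, double quantile) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(quantile * sorted.length) - 1;  // e.g. 0.90 -> 90th percentile
        return sorted[Math.max(index, 0)];
    }

    public static void main(String[] args) {
        long[] latencies = {3, 5, 7, 9, 12, 15, 40, 80, 120, 900}; // made-up samples
        System.out.println("TP50 = " + percentile(latencies, 0.50) + " ms");
        System.out.println("TP90 = " + percentile(latencies, 0.90) + " ms");
        System.out.println("TP99 = " + percentile(latencies, 0.99) + " ms");
    }
}
```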

Availability metrics

High availability means the system has a strong ability to run without failure. Availability = available time / total system running time. We usually describe a system's availability with a number of nines.

For high-concurrency systems, the basic requirement is three or four nines. The reason is intuitive: if you can only achieve two nines, the system is down 1% of the time; for large companies with hundreds of billions in GMV or revenue per year, 1% downtime translates into billions of lost business.

Scalability index

In the face of burst traffic, we cannot temporarily modify the architecture, so adding machines to linearly increase the processing capacity of the system is the fastest way.

For business clusters or basic components, scalability = performance improvement ratio / machine addition ratio. The ideal scalability is: increase resources several times and improve performance several times. Generally speaking, the scalability should be maintained above 70%.

However, from the perspective of the overall architecture of a high-concurrency system, the goal of expansion is not just to design the service to be stateless, because when the traffic increases by 10 times, the business service can quickly expand 10 times, but the database may become a new bottleneck.

Stateful storage services like MySQL are usually technically difficult to expand. If the architecture is not planned in advance (vertical and horizontal splitting), it may involve the migration of a large amount of data.

Therefore, high scalability has to consider service clusters, middleware such as databases, caches and message queues, load balancing, bandwidth, third-party dependencies, and so on. Once concurrency reaches a certain level, any of these can become the scalability bottleneck.

7.3 Practice plan

Universal design approaches

Scale-up

Its goal is to improve the processing power of a single machine, and the solution includes:

  1. Improve the hardware performance of a single machine: improve it by increasing memory, CPU core number, storage capacity, or upgrading the disk to SSD.
  2. Improve the software performance of a single machine: use cache to reduce the number of IOs, and use concurrent or asynchronous methods to increase throughput.

Scale-out

Since there is always a limit to single-machine performance, it is ultimately necessary to introduce horizontal expansion and further improve concurrent processing capabilities through cluster deployment, including the following two directions:

  1. Build a layered architecture: This is the basis for horizontal expansion, because high-concurrency systems usually have complex businesses, and layered processing can simplify complex problems and make horizontal expansion easier.
  2. Each layer performs horizontal expansion: stateless horizontal expansion and stateful shard routing. Business clusters can usually be designed to be stateless, while databases and caches are often stateful. Therefore, partition keys need to be designed for storage sharding. Of course, read performance can also be improved through master-slave synchronization and read-write separation.

7.3.1 High-performance practice solution

  1. Distributed deployment, sharing the pressure of a single machine through load balancing.
  2. Multi-level caching, including using CDN, local cache, distributed cache, etc. for static data, as well as dealing with issues such as hot keys, cache penetration, cache concurrency, and data consistency in cache scenarios.
  3. Database and index optimization, and use of search engines to solve complex query problems.
  4. Consider using NoSQL databases, such as HBase, TiDB, etc., but the team needs to be familiar with these components and have strong operation and maintenance capabilities.
  5. Asynchronous processing, the secondary process is processed asynchronously through multi-threading, message queues, and even delayed tasks.
  6. For traffic control, consider whether the business allows traffic limiting (such as flash sale scenarios), including front-end traffic limiting, Nginx access layer traffic limiting, and server-side traffic limiting.
  7. Traffic peaks are cut and valleys are filled, and traffic is received through message queues.
  8. Concurrent processing, parallelizing serial logic through multi-threading.
  9. Pre-calculation, such as the red envelope grabbing scene, the red envelope amount can be calculated in advance and cached, and used directly when sending red envelopes.
  10. Cache preheating, preheating data into local cache or distributed cache in advance through asynchronous tasks.
  11. Reduce the number of IOs, such as batch reading and writing of databases and caches, batch interface support for RPC, or reducing RPC calls through redundant data.
  12. Reduce the data packet size during IO, including using lightweight communication protocols, appropriate data structures, removing redundant fields in interfaces, reducing cache key size, compressing cache value, etc.
  13. Optimize program logic, such as pre-positioning judgment logic that has a high probability of blocking the execution process, optimizing For loop calculation logic, or adopting more efficient algorithms.
  14. Use various pooling technologies, such as HTTP connection pools, thread pools (size them differently for CPU-intensive and IO-intensive work; a sizing sketch follows this list), and database and Redis connection pools.
  15. JVM optimization, including the size of the new generation and old generation, GC algorithm selection, etc., to reduce GC frequency and time consumption.
  16. Choose a lock strategy, use optimistic locking in scenarios where there are more reads and less writes, or consider reducing lock conflicts through segmented locks.
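
Relating to item 14 above, a small thread-pool sizing sketch; the formulas are common rules of thumb, and the queue size and rejection policy are assumptions:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sizing a pool differently for CPU-bound and IO-bound work.
public class PoolSizing {

    public static ThreadPoolExecutor cpuBoundPool() {
        int cores = Runtime.getRuntime().availableProcessors();
        int size = cores + 1;                        // CPU-bound: roughly one thread per core
        return newPool(size);
    }

    public static ThreadPoolExecutor ioBoundPool(double blockingRatio) {
        int cores = Runtime.getRuntime().availableProcessors();
        // IO-bound rule of thumb: threads ≈ cores * (1 + wait time / compute time)
        int size = (int) (cores * (1 + blockingRatio));
        return newPool(size);
    }

    private static ThreadPoolExecutor newPool(int size) {
        return new ThreadPoolExecutor(size, size, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1000),               // bounded queue to avoid OOM
                new ThreadPoolExecutor.CallerRunsPolicy());   // back-pressure when saturated
    }
}
```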

7.3.2 High availability practical solutions

  1. Node failover, Nginx and the service governance framework support failover of a failed node to another node.
  2. Failover of non-peer nodes, through heartbeat detection and master-slave switchover (such as Redis Sentinel mode or cluster mode, MySQL master-slave switchover, and so on).
  3. Set the timeout, retry strategy and idempotent design of the interface layer.
  4. Degrade processing, ensuring core services, sacrificing non-core services, and performing circuit breakers when necessary; or when there is a problem with the core link, there is an alternative link.
  5. Flow control directly rejects or returns an error code for requests that exceed the system's processing capabilities.
  6. The reliability guarantee of the message queue includes the retry mechanism on the producer side, the persistence of the message agent, and the ack mechanism on the consumer side, etc.
  7. Grayscale release supports small traffic deployment according to the machine dimension, observing system logs and business indicators, and then pushing the full volume after the operation is stable.
  8. Monitoring and alarming include basic CPU, memory, disk, network monitoring, as well as Web server, JVM, database, various middleware monitoring and business indicator monitoring.
  9. Disaster recovery drills, similar to the current "chaos engineering", use destructive methods on the system to observe whether local failures will cause availability problems.

The high-availability solution mainly considers three aspects: redundancy, trade-offs, and system operation and maintenance. It also needs to have a supporting duty mechanism and fault handling process. When online problems occur, they can be followed up and dealt with in a timely manner.

7.3.3 Highly scalable practical solutions

  1. A reasonable layered architecture, such as the most common layered architecture on the Internet, can further layer microservices in a more fine-grained manner according to the data access layer and business logic layer (but the performance needs to be evaluated, and there may be one more hop in the network) ).
  2. The storage layer is split vertically according to the business dimension and horizontally according to the data characteristic dimension (sub-database and sub-table).
  3. The most common way to split the business layer is based on business dimensions (such as commodity services, order services, etc. in e-commerce scenarios). It can also be split according to core interfaces and non-core interfaces, and it can also be split according to requests (such as To C and To B, APP and H5).

A few words at the end

Interview questions related to deployment architecture and node planning are very common interview questions.

If everyone can answer the above content fluently and thoroughly, the interviewer will basically be shocked and attracted by you.

In the end, the interviewer will like you so much that he "can't help but drool", and the offer is on its way.

During the learning process, if you have any questions, you can come and talk to Nien, a 40-year-old architect.

references

https://zhuanlan.zhihu.com/p/422165687

recommended reading

" NetEase side: Single node 2000Wtps, how does Kafka do it?" "

" Byte Side: What is the relationship between transaction compensation and transaction retry?" "

" NetEase side: 25Wqps high throughput writing Mysql, 100W data is written in 4 seconds, how to achieve it?" "

" How to structure billion-level short videos? " "

" Blow up, rely on "bragging" to get through JD.com, monthly salary 40K "

" It's so fierce, I rely on "bragging" to get through SF Express, and my monthly salary is 30K "

" It exploded...Jingdong asked for 40 questions on one side, and after passing it, it was 500,000+ "

" I'm so tired of asking questions... Ali asked 27 questions while asking for his life, and after passing it, it's 600,000+ "

" After 3 hours of crazy asking on Baidu, I got an offer from a big company. This guy is so cruel!" "

" Ele.me is too cruel: Face an advanced Java, how hard and cruel work it is "

" After an hour of crazy asking by Byte, the guy got the offer, it's so cruel!" "

" Accept Didi Offer: From three experiences as a young man, see what you need to learn?" "

"Nien Architecture Notes", "Nien High Concurrency Trilogy", "Nien Java Interview Guide" PDF, please go to the following official account [Technical Freedom Circle] to get ↓↓↓
