[Transfer] Analysis of system load capacity

http://www.rowkey.me/blog/2015/09/09/load-analysis/


Analysis of system load capacity

—This article was last updated on 2015.12.23 —

In the Internet age, high concurrency is a frequent topic. Whether for a website or a mobile app, the number of concurrent requests that can be handled at peak time is a key indicator of a system's performance. For example, Alibaba's Double Eleven has withstood peaks of hundreds of millions of requests and orders, which genuinely reflects Alibaba's engineering capability (of course, money is also a factor).

So, what is a system's load capacity? How is it measured? What factors affect it? And how can it be optimized?

1. Metrics

How do we measure the load capacity of a system? A common metric is requests per second, the number of requests that can be successfully processed each second. For example, you can set Tomcat's maxConnections arbitrarily high, but because of operating-system and hardware limits many requests will not be answered within an acceptable time, and those do not count as successful requests. The number of requests actually answered per second is what reflects the system's load capacity.

Typically, the number of requests per second increases as the number of concurrent users increases, but only up to a point: eventually the concurrent users start to overwhelm the server, and if concurrency keeps growing, requests per second begin to drop while response times rise. The critical point at which the concurrent users begin to overwhelm the server is therefore very important, and the number of concurrent users at that point can be regarded as the maximum load capacity of the current system.
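
As a rough way to locate this critical point, you can ramp up the concurrency of a load-testing tool and watch where throughput stops rising. The sketch below uses ApacheBench (ab) against a hypothetical URL; the request counts and concurrency levels are placeholders, not recommendations.

    # send 10,000 requests at increasing concurrency levels and compare "Requests per second"
    ab -n 10000 -c 100  http://example.com/index.html
    ab -n 10000 -c 500  http://example.com/index.html
    ab -n 10000 -c 1000 http://example.com/index.html

When the reported requests per second stop growing (or start to fall) while the mean response time keeps climbing, you have passed the critical point described above.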

2. Related factors

In general, the factors related to a system's concurrent capacity are as follows:

  • Bandwidth
  • Hardware configuration
  • System configuration
  • Application server configuration
  • Program logic
  • System architecture

Among these, bandwidth and hardware configuration are the decisive factors for load capacity, and they can only be improved by scaling out or upgrading. What we need to focus on is how to maximize the load capacity of the system given a fixed bandwidth and hardware configuration.

2.1 Bandwidth

Undoubtedly, bandwidth is a crucial factor in a system's load capacity. Like a water pipe, a thin pipe can only carry so much water at a time (the analogy is not perfect, but it conveys the idea). Bandwidth, measured in Mbps, describes how fast data can be transferred and thus sets a first bound on the system's load capacity.
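
As a rough back-of-the-envelope illustration (the numbers are invented): a 100 Mbps link can move at most about 12.5 MB of data per second, so if an average response is 50 KB, the link alone caps throughput at roughly

    100 Mbps ≈ 12.5 MB/s
    12.5 MB/s ÷ 50 KB per response ≈ 250 responses per second

no matter how fast the servers behind it are.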

2.2 Hardware configuration

The hardware of the server on which the system is deployed determines the system's maximum load capacity; it is the upper limit. In general, the following components play a key role:

  • CPU frequency / number of cores: the frequency determines how fast the CPU executes, and the number of cores affects how efficiently threads can be scheduled and resources allocated.
  • Memory size and speed: the more memory, the more data can be kept in RAM, which is naturally faster; memory clock rates have gone from hundreds of MHz to several thousand MHz, which determines how quickly data can be read and written.
  • Disk speed: traditional hard disks seek with a mechanical head, so their I/O is relatively slow; SSDs have no seek penalty and are much faster.

The architecture and optimization write-ups of many systems end up adding the same sentence: "using SSD storage solved these problems."

It can be seen that the hardware configuration is the most critical factor in determining the load capacity of a system.

2.3 System Configuration

Generally speaking, back-end systems today are deployed on Linux hosts. So, leaving the Windows family aside, the following Linux settings are generally related to a system's load capacity.

  • File descriptor limits: in Linux everything is a file, and each socket corresponds to a file descriptor, so the system-wide maximum number of open files and the per-process maximum together cap the number of sockets.
  • Process/thread limits: for multi-process servers such as Apache in prefork mode, load capacity is limited by the number of processes; for multi-threaded servers such as Tomcat it is limited by the number of threads.
  • TCP kernel parameters: network applications ultimately sit on TCP/IP, and several related Linux kernel settings also determine the system's load capacity.

2.3.1 Limit on the number of file descriptors

  • System-wide maximum number of open file descriptors: this value is stored in /proc/sys/fs/file-max and can be changed as follows:

      Temporarily:
          echo 1000000 > /proc/sys/fs/file-max
      Permanently: set in /etc/sysctl.conf
          fs.file-max = 1000000
    
  • Per-process maximum number of open file descriptors: the maximum number of files a single process may open. It can be viewed/changed with ulimit -n; to make the change permanent, edit the nofile entries in /etc/security/limits.conf.

The total number of file descriptors currently in use can be read from /proc/sys/fs/file-nr. In addition, the following constraints on file descriptor configuration should be kept in mind (a short sketch of how to inspect and raise these limits follows the list):

  • The number of open file descriptors for all processes cannot exceed /proc/sys/fs/file-max
  • The number of file descriptors opened by a single process cannot exceed the soft limit of nofile in user limit
  • The soft limit of nofile cannot exceed its hard limit
  • The hard limit of nofile cannot exceed /proc/sys/fs/nr_open
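
A minimal sketch of how these limits can be inspected and raised; the user name appuser and the value 65535 are only examples:

    # system-wide limits
    cat /proc/sys/fs/file-max    # maximum number of open file descriptors, system-wide
    cat /proc/sys/fs/file-nr     # allocated / free / maximum
    cat /proc/sys/fs/nr_open     # hard ceiling for a single process

    # per-process limits for the current shell
    ulimit -n                    # soft limit of nofile
    ulimit -Hn                   # hard limit of nofile

    # permanent per-user limits, in /etc/security/limits.conf
    appuser  soft  nofile  65535
    appuser  hard  nofile  65535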

2.3.2 Process/thread limit

  • Process limit: ulimit -u views/changes the maximum number of processes a single user may create; the nproc entry in /etc/security/limits.conf sets this limit permanently.
  • Thread limit

    • /proc/sys/kernel/threads-max gives the maximum number of threads the system can create.
    • The maximum number of threads in a single process is related to PTHREAD_THREADS_MAX; the limit can be seen in /usr/include/bits/local_lim.h, but changing it requires recompiling.
    • It is worth mentioning that the Linux 2.4 kernel implemented threads as LinuxThreads, i.e. lightweight processes: a manager thread is created first, and the number of threads is bounded by PTHREAD_THREADS_MAX. The Linux 2.6 kernel uses NPTL, an improved LWP implementation; the biggest difference is that all threads share the process's pid (tgid), and the number of threads is limited only by resources.
    • The number of threads is also restricted by the thread stack size: ulimit -s views/changes the stack size allocated to each new thread, so decreasing this value allows more threads to be created (see the sketch below).
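
A small sketch of the knobs mentioned above; the stack size shown is only an example:

    cat /proc/sys/kernel/threads-max   # system-wide maximum number of threads
    ulimit -u                          # maximum number of processes for the current user
    ulimit -s                          # per-thread stack size in KB
    ulimit -s 2048                     # example: shrink the stack to 2 MB for this shell

Roughly speaking, the number of threads a process can create is bounded by the memory available for stacks divided by the per-thread stack size, which is why lowering ulimit -s raises the ceiling.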

2.3.3 tcp kernel parameters

On a server with fixed CPU and memory resources, the ultimate goal is to squeeze the most performance out of it. To save cost, you can consider tuning the Linux kernel's TCP/IP parameters. If tuning the kernel parameters still cannot solve the load problem, the only remaining option is to upgrade the server; that is a hardware limit and there is no way around it.

netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

The command above shows the number of network connections in each state on the current system, for example:

LAST_ACK 13
SYN_RECV 468
ESTABLISHED 90
FIN_WAIT1 259
FIN_WAIT2 40
CLOSING 34
TIME_WAIT 28322

Here, the number of TIME_WAIT connections deserves attention: if it is too high, it ties up many connections and hurts the system's load capacity, so the parameters should be tuned to release TIME_WAIT connections as soon as possible.

TCP-related kernel parameters generally live in /etc/sysctl.conf. To release connections in the TIME_WAIT state sooner, the following settings can be applied:

  • net.ipv4.tcp_syncookies = 1 // enable SYN cookies: when the SYN backlog overflows, cookies are used to handle new connections, which protects against small-scale SYN flood attacks. The default is 0 (off);
  • net.ipv4.tcp_tw_reuse = 1 // allow sockets in TIME_WAIT to be reused for new TCP connections. The default is 0 (off);
  • net.ipv4.tcp_tw_recycle = 1 // enable fast recycling of TIME_WAIT sockets. The default is 0 (off);
  • net.ipv4.tcp_fin_timeout = 30 // shorten the time a connection may stay in FIN-WAIT-2 from the system default.

One caveat: when tcp_tw_recycle is enabled, the kernel checks TCP timestamps. Packets sent from mobile networks sometimes arrive with timestamps that jump backwards; such packets are treated as retransmissions belonging to a recycled TIME_WAIT connection rather than as new requests, so they are dropped without a reply, causing large-scale packet loss. In addition, when LVS is used with the NAT mechanism, enabling tcp_tw_recycle also causes anomalies; see http://www.pagefault.info/?p=416 . If you still need this option in such an environment, consider setting net.ipv4.tcp_timestamps = 0 so that packet timestamps are ignored.

In addition, load capacity can be further improved by tuning the TCP/IP port range and queue sizes, as follows (a consolidated sketch follows this list):

  • net.ipv4.tcp_keepalive_time = 1200 // how often TCP sends keepalive probes when keepalive is enabled. The default is 2 hours; change it to 20 minutes.
  • net.ipv4.ip_local_port_range = 10000 65000 // the local port range for outgoing connections. The default (32768 to 61000) is rather small; change it to 10000 to 65000. (Do not set the lower bound too low, or ports needed by normal services may be taken.)
  • net.ipv4.tcp_max_syn_backlog = 8192 // the length of the SYN queue. The default is 1024; raising it to 8192 allows more half-open connections to wait.
  • net.ipv4.tcp_max_tw_buckets = 5000 // the maximum number of TIME_WAIT sockets kept at the same time; beyond this number they are cleared immediately and a warning is printed. The default is 180000; change it to 5000. For Apache, Nginx and similar servers the parameters in the previous lines already reduce TIME_WAIT sockets considerably, but for Squid they do not help much; this parameter caps TIME_WAIT so that a Squid server is not dragged down by a huge number of them.
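
Putting the settings above together, a sketch of the corresponding /etc/sysctl.conf fragment and how to apply it; the values simply repeat the examples above and should be tuned for your own environment, keeping the tcp_tw_recycle caveat in mind:

    # /etc/sysctl.conf (fragment)
    net.ipv4.tcp_syncookies = 1
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_tw_recycle = 1          # risky behind NAT / mobile networks, see the note above
    net.ipv4.tcp_fin_timeout = 30
    net.ipv4.tcp_keepalive_time = 1200
    net.ipv4.ip_local_port_range = 10000 65000
    net.ipv4.tcp_max_syn_backlog = 8192
    net.ipv4.tcp_max_tw_buckets = 5000

    # apply the changes without rebooting
    sysctl -p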

2.4 Application Server Configuration

When it comes to application server configuration, it is necessary to mention several working modes of the application server, also called concurrency strategies.

  • multi-process: one process handles one request.
  • prefork: similar to multi-process, but a pool of processes is forked in advance for later use; essentially a process-pool concept.
  • worker: one thread handles one request. Compared with multi-process it consumes fewer resources, but the crash of one thread brings down the whole process, so stability is worse.
  • master/worker: non-blocking I/O, with only two kinds of processes, master and worker. The master creates and manages the worker processes, and each worker uses event-driven, multiplexed I/O to handle requests. Only one master process is needed, and the number of worker processes is set according to the number of CPU cores.

The first three are the approaches taken by traditional application servers such as Apache and Tomcat; the last one is the approach taken by nginx. Note the difference here between an application server and nginx acting as a reverse proxy (ignoring nginx+CGI as an application server for now): an application server has to run application logic and can be CPU-intensive, while a reverse proxy mainly does I/O and is an I/O-intensive application. The event-driven network model is well suited to I/O-intensive work but not to CPU-intensive work; for the latter, a multi-process/multi-threaded model is the better choice.

Of course, thanks to its event-driven I/O multiplexing model, nginx can support very high concurrency when acting as a reverse proxy. The Taobao Tengine team once published a test result of "up to 2 million concurrent requests on a machine with 24 GB of memory".

2.4.1 nginx / tengine

nginx is currently the most widely used reverse proxy software, and Tengine is an enhanced version of nginx open-sourced by Alibaba that implements some features of the commercial nginx edition, such as active health checks and session stickiness. For nginx configuration, a few points deserve attention:

  • the number of worker processes should match the number of CPU cores
  • keepalive_timeout should be set to a reasonable value
  • worker_rlimit_nofile, the per-worker maximum number of file descriptors, should be raised
  • upstream connections can use HTTP/1.1 keepalive

Typical configuration can be seen: https://github.com/superhj1987/awesome-config/blob/master/nginx/nginx.conf
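
A minimal sketch of an nginx.conf that reflects the four points above. It is not the configuration from the linked repository; the upstream name, addresses and numbers are placeholders:

    worker_processes     4;         # match the number of CPU cores
    worker_rlimit_nofile 65535;     # raise the per-worker file descriptor limit

    events {
        worker_connections 10240;
    }

    http {
        keepalive_timeout 30s;      # keep client keepalives reasonably short

        upstream backend {
            server 10.0.0.11:8080;
            server 10.0.0.12:8080;
            keepalive 32;           # pool of idle HTTP/1.1 connections to the upstreams
        }

        server {
            listen 80;
            location / {
                proxy_http_version 1.1;        # required for upstream keepalive
                proxy_set_header Connection "";
                proxy_pass http://backend;
            }
        }
    }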

2.4.2 tomcat

The key configuration of tomcat is generally divided into two blocks: jvm parameter configuration and connector parameter configuration.

  • JVM parameter configuration:

    • Minimum heap size: -Xms
    • Maximum heap size: -Xmx
    • Young generation size: -Xmn
    • Permanent generation size: -XX:PermSize
    • Maximum permanent generation size: -XX:MaxPermSize
    • Thread stack size: -Xss or -XX:ThreadStackSize

    One thing to note about the stack size: on Linux x64 the default ThreadStackSize is 1024 KB, and each Java thread's stack is created with the size this parameter specifies. If -Xss or -XX:ThreadStackSize is set to 0, the "system default" is used, and on Linux x64 the HotSpot VM also defines that default as 1 MB, so an ordinary Java thread ends up with a 1 MB stack either way. Also note the difference between the Java thread stack size and the operating-system stack size mentioned earlier (ulimit -s): the latter only affects the initial thread of a process, while threads created later with pthread_create can specify their own stack size. To control Java thread stack sizes precisely, the HotSpot VM deliberately does not use the process's initial (primordial) thread as a Java thread.

    Beyond that, choose which garbage collector to use and its collection strategy according to the business scenario, and add the relevant settings when GC logs need to be kept.

    A typical configuration can be seen at: https://github.com/superhj1987/awesome-config/blob/master/tomcat/java_opts.conf

  • Connector parameter configuration

    • protocol: three options: BIO, NIO and APR; APR is recommended for the best performance.
    • connectionTimeout: the connection timeout.
    • maxThreads: the maximum number of threads; in BIO mode this also limits the maximum number of connections.
    • minSpareThreads: the minimum number of spare (idle) threads.
    • acceptCount: the maximum number of requests that can be queued while all processing threads are busy.
    • maxConnections: in NIO or APR mode, the maximum number of connections is governed by this value.

    A typical configuration can be seen at: https://github.com/superhj1987/awesome-config/blob/master/tomcat/connector.conf

    In general, once a single process is running 500 threads its performance is already very poor. Tomcat's default maximum is 150 request-processing threads; when an application faces more than about 250 concurrent requests, clustering the application servers should be considered.

    Also, you cannot raise concurrency without limit simply by increasing maxThreads and maxConnections: the more threads there are, the more CPU time goes into thread scheduling and the more memory is consumed, which in turn hurts request handling. Constrained by the hardware, these values need to be set to something appropriate. A hedged sketch of typical JVM and Connector settings follows.
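
As an illustration only (not the settings from the linked files; the sizes and counts are placeholders for a Java 7-era Tomcat), the two groups of parameters might look like this:

    # JVM options, e.g. in bin/setenv.sh
    JAVA_OPTS="-Xms2g -Xmx2g -Xmn768m -XX:PermSize=256m -XX:MaxPermSize=256m -Xss256k"

    <!-- Connector in conf/server.xml; Http11AprProtocol needs the native APR library installed -->
    <Connector port="8080"
               protocol="org.apache.coyote.http11.Http11AprProtocol"
               connectionTimeout="20000"
               maxThreads="300"
               minSpareThreads="25"
               acceptCount="200"
               maxConnections="8192" />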

For Tomcat there is a recurring debate: is it better to run one Tomcat with a large heap, or a cluster of several Tomcats with small heaps? (This concerns 64-bit servers and Tomcat.)

In fact, it depends on the business scenario. A large-heap Tomcat usually has the following problems:

  • once a full GC happens, it takes a very long time;
  • once a GC problem occurs, the heap dump is too large to analyze.

Therefore, if you can guarantee that most objects are short-lived and the old generation is rarely collected, a large-heap Tomcat is workable; but in scalability and availability it cannot match a cluster of (relatively) small-heap Tomcats.

A cluster of small-heap Tomcats has the following advantages:

  • the number of Tomcat instances can be adjusted according to the system load, making the best use of resources;
  • it avoids a single point of failure.

2.4.3 Database

MySQL

MySQL is currently the most widely used relational database and supports complex queries, but its load capacity is only ordinary, and very often a system's bottleneck sits right at MySQL. Sometimes this is also a matter of SQL efficiency; for example, queries involving joins are generally not very efficient.

Factors that affect database performance generally include:

  • Hardware configuration: this needs no elaboration
  • Database settings: options such as max_connections affect how many connections the database will accept
  • Table design: use redundant columns to avoid joins; use indexes to speed up queries
  • Whether queries are written sensibly: this comes down to individual coding discipline. For example, to find records matching a condition, I have seen people fetch every record and compare them one by one
  • Storage engine choice: MyISAM and InnoDB suit different scenarios; neither is absolutely better

Setting those factors aside, once a single table grows to millions or even tens of millions of rows (it depends on the actual data), the MySQL database needs further optimization. A common approach is sharding:

  • Vertical sharding: splitting by columns
  • Horizontal sharding: splitting by rows

In addition, read/write splitting can improve database performance, especially for workloads where reads far outnumber writes. It is usually implemented with a master/slave setup, controlled either by application code or by a proxy layer in front. You can use MySQL Proxy with Lua scripts to implement proxy-based read/write splitting, or control it in the application by routing each SQL statement to the appropriate database, which is the approach my company currently uses. Because such a solution is tightly bound to the business, a general-purpose one is hard to come by; a relatively mature option is Alibaba's TDDL, but since it is not fully open-sourced and depends on other components, it is not recommended.

Nowadays many large companies have done their own secondary development on top of MySQL for sharding, master/slave splitting and distribution, forming their own distributed database systems, such as Alibaba's Cobar, NetEase's DDB and Qihoo 360's Atlas. Many large companies have also developed their own MySQL branches, the best known being InnoSQL, developed under the lead of Jiang Chengyao.

Redis

Of course, for data that is accessed very frequently under high concurrency, a relational database still cannot cope comfortably. That is when a cache database is needed to shield MySQL from direct access and keep it from collapsing.

Among cache databases, Redis is currently the most widely used (some even use Redis directly as their primary database). Redis is a single-threaded, in-memory database whose read/write performance far exceeds MySQL's. In most scenarios, master/slave replication with read/write splitting is enough (a minimal sketch follows the list below). However, such a setup lacks HA, which is unacceptable especially for distributed applications. The current options for Redis clustering are:

  • Redis Cluster: the official, decentralized solution. It is a rather "heavy" design and no longer has the "simple and dependable" feel of a single Redis instance. Production cases are still rare; reportedly Hunan TV ("Mango TV") has used it, with unknown results.
  • twemproxy: the proxy for Redis and memcached open-sourced by Twitter. It is fairly mature and widely deployed, but has some drawbacks, especially operationally: for example, it cannot scale out or in smoothly and is not operations-friendly.
  • Codis: the Redis proxy open-sourced by Wandoujia, compatible with twemproxy and improved in many ways. Open-sourced in November 2014 and written in Go and C, it is now used widely across Wandoujia's Redis workloads and is claimed to be nearly 100% faster than twemproxy. As far as I know, besides Wandoujia, Hulu also uses it. Its successor project, reborndb, is said to be even better.
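
A minimal sketch of the plain master/slave setup mentioned above; the IP address and port are placeholders, and for the Redis 2.x of that era the directive is slaveof:

    # redis.conf on each slave
    slaveof 10.0.0.1 6379
    slave-read-only yes     # slaves serve reads only; writes go to the master

The application (or a proxy in front of Redis) then sends all writes to the master and spreads reads across the slaves.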

2.5 System Architecture

The architectural aspects that affect performance generally include:

  • Load balancing
  • Synchronous or asynchronous processing
  • The 80/20 rule

2.5.1 Load balancing

Load balancing is a key technique on the server side. It comes in two kinds:

  • Hardware load balancing
  • Software load balancing

Hardware load balancers, represented by F5, undoubtedly offer the best performance, but with that performance comes a steep price, so most startups choose software load balancing instead.

Software load balancing can further be divided into layer-4 and layer-7 load balancing. The nginx reverse proxy described in the application server section above is a mature layer-7 solution, aimed mainly at HTTP (although recent releases also support layer-4 load balancing). For layer-4 load balancing, the most widely used option is LVS, a Linux load balancer developed by a team led by Dr. Zhang Wensong of Alibaba; it is essentially implemented inside the Linux kernel (the IPVS module, a relative of iptables/netfilter). It has three working modes:

  • NAT: rewrites the packet's destination IP; both inbound and outbound traffic must pass through the LVS node.
  • DR: rewrites the packet's MAC address; the LVS node and the real servers must be on the same VLAN.
  • IP TUNNEL: encapsulates the packet and rewrites the destination IP; the real servers must support IP tunneling, and the LVS node and real servers do not need to be on the same VLAN.

Each of the three modes has its pros and cons. Alibaba has also open-sourced a FULLNAT mode, which adds SNAT on top of NAT mode's original DNAT. A hedged ipvsadm sketch for DR mode follows.
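
For reference, a sketch of setting up DR mode with ipvsadm; the VIP and real-server addresses are placeholders, and the real servers would additionally need the VIP bound to a loopback interface with ARP replies for it suppressed:

    # on the LVS director: create a virtual service on the VIP with round-robin scheduling
    ipvsadm -A -t 10.0.0.100:80 -s rr

    # add two real servers in DR ("gatewaying") mode
    ipvsadm -a -t 10.0.0.100:80 -r 10.0.0.11:80 -g
    ipvsadm -a -t 10.0.0.100:80 -r 10.0.0.12:80 -g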

In addition, HAProxy is another commonly used load balancing software, but since I have used it only a little, it is not covered here.

2.5.2 Synchronous or asynchronous

For many parts of a system, the business has to choose between a synchronous and an asynchronous mechanism. For example, when a user shares a post, the share needs to be recorded. If you handle it synchronously (recording the action as part of the share request), the response time inevitably suffers. If instead you note that the user will not look at their share history right away, you can sacrifice a little timeliness: complete the share first and record the action asynchronously, which speeds up the response to the share request (at the possible cost of some transactional accuracy). In some business logic, once the user's real needs are well understood, certain properties can be traded away like this to meet them.

It is also worth mentioning that a business flow can often be split into several steps, some of which can run asynchronously and concurrently, greatly improving processing speed.

2.5.3 The 80/20 rule

For a typical system, 20% of the features generate 80% of the traffic. That is what the 80/20 rule means here, at least in my own wording. So when designing a system, the 80% of features that face little request pressure need no over-engineering, while the other 20% deserve design after design and review after review: load-balance them if you can, cache them if you can, distribute them if you can, and split their flows into asynchronous steps if you can.

Of course, this principle applies to many things in life.

3. General Architecture

A typical Java back-end application architecture is shown below: LVS + Nginx + Tomcat + MySQL/DDB + Redis/Codis

[Figure: web-arch]

The dashed part is the database layer, which uses a master/slave setup; it can also be replaced with a Redis cluster (Codis, etc.) and a MySQL cluster (Cobar, etc.).

If you reprint this article, please credit: http://superhj1987.github.com

Copyright notice: this is an original article by the author and may not be reproduced without the author's permission.

Parts of this article draw on material found online; after so long, the original authors can no longer be traced. If anything here infringes your rights, please contact [email protected]
