Five nginx load balancing strategies and their principles

Disclaimer: this is an original blogger article, licensed under the CC 4.0 BY-SA copyright agreement. When reproducing it, please attach the original source link and this statement.
Original link: https://blog.csdn.net/qq_35119422/article/details/81505732

The five allocation methods currently supported by nginx's upstream module


1. Round robin (default)
Each request is assigned to a different backend server in chronological order; if a backend server goes down, it is removed automatically.
upstream backserver {
    server 192.168.0.14;
    server 192.168.0.15;
}


2. Weighted (weight)
Access is distributed in proportion to each server's weight; used when backend server performance is uneven.
upstream backserver {
    server 192.168.0.14 weight=8;
    server 192.168.0.15 weight=10;
}


3. IP binding (ip_hash)
Each request is assigned according to the hash of the client IP, so every visitor consistently reaches the same backend server; this solves the session problem.
upstream backserver {
    ip_hash;
    server 192.168.0.14:88;
    server 192.168.0.15:80;
}


4. fair (third party)
Requests are assigned according to the backend server's response time; servers with shorter response times get priority.
upstream backserver {
    server server1;
    server server2;
    fair;
}


5. url_hash (third party)
Requests are assigned according to the hash of the requested URL, so each URL is directed to the same backend server; this is effective when the backend servers are caches.
upstream backserver {
    server squid1:3128;
    server squid2:3128;
    hash $request_uri;
    hash_method crc32;
}


In the server block that needs load balancing, add:

proxy_pass http://backserver/;

upstream backserver {
    ip_hash;
    server 127.0.0.1:9090 down;      # down marks the server as temporarily out of rotation
    server 127.0.0.1:8080 weight=2;  # weight defaults to 1; the larger the weight, the larger the share of load
    server 127.0.0.1:6060;
    server 127.0.0.1:7070 backup;    # requests go to a backup machine only when all the other machines are down or busy
}


max_fails: the number of allowed request failures, default 1. When this number is exceeded, the error defined by the proxy_next_upstream directive is returned.
fail_timeout: the pause time after max_fails failures.
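An illustrative sketch, in Python rather than nginx's C source, of the max_fails / fail_timeout bookkeeping just described: a server that fails max_fails times within fail_timeout is skipped until fail_timeout seconds have passed since its last failure. The class and method names are invented for illustration.

```python
import time

class PeerState:
    """Illustrative passive health-check state for one upstream server."""

    def __init__(self, max_fails=1, fail_timeout=10):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = 0
        self.checked = 0.0  # time of the most recent failure

    def report_failure(self, now=None):
        now = time.time() if now is None else now
        self.fails += 1
        self.checked = now

    def available(self, now=None):
        now = time.time() if now is None else now
        if self.fails < self.max_fails:
            return True
        # the pause is over: give the server another chance
        if now - self.checked > self.fail_timeout:
            self.fails = 0
            return True
        return False

peer = PeerState(max_fails=2, fail_timeout=10)
peer.report_failure(now=100.0)
print(peer.available(now=101.0))   # one failure so far: still available
peer.report_failure(now=102.0)
print(peer.available(now=105.0))   # two failures within fail_timeout: skipped
print(peer.available(now=113.0))   # fail_timeout elapsed: available again
```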

 

Insights:

1. Introduction



As website load keeps growing, load balancing (load balance) is no longer an unfamiliar topic. Load balancing distributes traffic across different service units, which both keeps the servers available and makes responses fast enough to give users a good experience.

nginx's first public version was released in 2004, and version 1.0 in 2011. It is known for high stability, rich functionality, and low resource consumption. In terms of server market share, nginx has the momentum to rival Apache. A feature that deserves special mention is its load balancing, which has become the main reason many companies choose nginx.

This article analyzes nginx's built-in and extension load balancing strategies from the source-code perspective, compares them against real industrial scenarios, and offers some reference for nginx users.
 

2. Source code analysis



nginx's load balancing strategies can be divided into two categories: built-in strategies and extension strategies.

The built-in strategies are weighted round robin and ip hash; by default these two are compiled into the nginx kernel, and you only need to specify their parameters in the nginx configuration. There are many extension strategies, such as fair, universal hash, and consistent hash, which are not compiled into the kernel by default.

Since version upgrades have not essentially changed nginx's load balancing code, the following analyzes each strategy from the source-code perspective using the stable version nginx 1.0.15 as an example.

2.1. WRR (weighted round robin)

The round-robin principle is very simple; let us first introduce its basic flow. The following is a flowchart of processing one request:



Two points in the figure deserve attention:

First, if weighted round-robin algorithms are divided into depth-first and breadth-first variants, nginx uses the depth-first algorithm: requests are first distributed to the highest-weight machine, and only when that machine's weight drops below another machine's do requests start going to the next highest-weight machine.

Second, when the backend machines are all down, nginx immediately resets all machines' flags to the initial state, to avoid all machines being stuck in the timeout state and bringing the whole front end to a halt.
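The depth-first behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not nginx's actual implementation: requests keep going to the server with the highest remaining weight, and all weights are restored to the configured values once they are used up.

```python
class DepthFirstWRR:
    """Simplified sketch of depth-first weighted round robin."""

    def __init__(self, servers):
        # servers: list of (name, weight); "weight" is the configured value,
        # "current" is the dynamically decreasing copy used for selection.
        self.servers = [{"name": n, "weight": w, "current": w} for n, w in servers]

    def pick(self):
        # restore the initial state once every server's current weight is used up
        if all(s["current"] <= 0 for s in self.servers):
            for s in self.servers:
                s["current"] = s["weight"]
        best = max(self.servers, key=lambda s: s["current"])
        best["current"] -= 1
        return best["name"]

wrr = DepthFirstWRR([("192.168.0.14", 2), ("192.168.0.15", 1)])
print([wrr.pick() for _ in range(6)])
```

With weights 2 and 1, the higher-weight server serves two requests in a row before the other gets one, which is the bursty, depth-first pattern described above.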

Now look at the source code. The nginx directory structure is very clear; weighted round robin lives in nginx-1.0.15/src/http/ngx_http_upstream_round_robin.[c|h]. I have added comments at the important or hard-to-understand places in the source. First, look at the important declarations in ngx_http_upstream_round_robin.h:



From the variable names one can roughly guess their purpose. The difference between current_weight and weight: the former is the sorting weight, which changes dynamically as requests are processed, while the latter is the configured value, used to restore the initial state.

Next, look at how the round robin is created. The code is as follows:



The tried variable needs some explanation: tried records which servers have already been tried during the current connection attempt. It is a bitmap. If the number of servers is less than 32, the state of all servers can be recorded in a single int; if there are more than 32 servers, memory must be requested from the memory pool to store it.
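The bitmap idea itself is simple; a minimal Python sketch (illustrative, not the nginx code) of marking and checking tried servers with a single integer:

```python
# With fewer than 32 servers, one integer records which ones were attempted:
# bit i is set when server i has already been tried.

def mark_tried(tried, i):
    return tried | (1 << i)

def was_tried(tried, i):
    return bool(tried & (1 << i))

tried = 0
tried = mark_tried(tried, 0)   # server 0 was tried and failed
tried = mark_tried(tried, 3)   # server 3 was tried and failed
print(was_tried(tried, 0), was_tried(tried, 1), was_tried(tried, 3))
```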

The bitmap operation code is as follows:



Finally, the actual strategy code. The logic is fairly simple, implemented in only about 30 lines of code. Let us look at it.



2.2. ip hash strategy

ip hash is nginx's other built-in load balancing strategy. Its flow is very similar to round robin, with some changes to the specific algorithm and strategy. As shown below:



The core implementation of the ip hash algorithm is in the following code:



As you can see, the hash value is related both to the IP and to the number of backend machines. Testing shows that the above algorithm can produce 1045 consecutive mutually distinct values, which is the algorithm's hard limit. nginx uses a protection mechanism: if no available machine has been found after 20 hashes, the algorithm degenerates into round robin.

So in essence, the ip hash algorithm is round robin in disguise. If two IPs happen to have the same initial hash value, requests from both will always land on the same server, which plants a rather deep imbalance problem.
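The selection loop described above can be sketched in Python. This is a hedged approximation modeled on nginx 1.0.x's ip hash module (the constants 113, 6271, and the seed 89 come from that source; the function name and the simple modulo indexing are illustrative): only the first three octets of the IPv4 address are hashed, and after 20 failed rehashes the caller would fall back to round robin.

```python
def ip_hash_pick(ip, servers, is_up, max_tries=20):
    # hash only the first three octets, so a whole /24 maps to one server
    octets = [int(o) for o in ip.split(".")[:3]]
    h = 89  # initial hash seed, as in the nginx source
    for _ in range(max_tries):
        for o in octets:
            h = (h * 113 + o) % 6271
        idx = h % len(servers)  # real nginx weights this by server weight
        if is_up(servers[idx]):
            return servers[idx]
        # otherwise rehash, seeded from the previous hash, and try again
    return None  # the caller would degrade to plain round robin here

servers = ["192.168.0.14", "192.168.0.15", "192.168.0.16"]
print(ip_hash_pick("10.0.5.7", servers, lambda s: True))
```

Because the same IP always produces the same hash sequence, two IPs with colliding hashes stay pinned to the same server, which is exactly the imbalance discussed above.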

2.3. Fair

The fair strategy is an extension strategy, not compiled into the nginx kernel by default. It judges the load on backend servers from their response times and routes new traffic to the machine with the lightest load.
This strategy is highly adaptive, but real network environments are often not so simple, so it should be used with caution.

2.4 Universal hash and consistent hash

Universal hash and consistent hash are both extension strategies. Universal hash can use an nginx built-in variable as the hash key, which is quite simple; consistent hash uses nginx's built-in consistent hash ring and supports memcache.
 

3. Comparison tests



Having covered the load balancing strategies above, we now run some tests, mainly comparing the balance, consistency, and fault tolerance of each strategy, analyzing the differences between them, and, based on the data, giving the scenarios each is suited for.

To test nginx's load balancing strategies comprehensively and objectively, we use two testing tools and run the tests under different scenarios, to reduce the influence of the environment on the results.

First, an introduction to the testing tools, the test network topology, and the basic test process.

3.1 Testing tools

3.1.1  easyABC

easyABC is a performance testing tool developed internally at Baidu. It is implemented on the epoll model, is simple and easy to pick up, can simulate GET/POST requests, can generate pressure in the tens of thousands of requests at the limit, and is widely used within the team.

Since the object under test is a reverse proxy server, stub servers need to be set up behind it. Here nginx serves as the stub web server, providing basic static file service.

3.1.2  polygraph

polygraph is a free performance testing tool that excels at testing caching services, proxies, switches, and the like. It has a standardized configuration language, PGL (Polygraph Language), which gives the software great flexibility. Its working principle is shown below:



polygraph provides a client side and a server side; the nginx under test sits between the two, and all network interaction among the three goes over HTTP, so only ip+port needs to be configured.

On the client side you can configure the number of virtual robots and the request rate of each robot; the robots send random static-file requests to the proxy server, and the server side responds with randomly sized static files generated according to the requested URL.

A main reason for choosing this testing software: it can generate random URLs to serve as keys for nginx's various hash strategies.
polygraph also provides log analysis tools with fairly rich functionality; interested readers can consult the appendix materials.

3.2. Test environment

The tests ran on five physical machines: the object under test was deployed alone on an 8-core machine, and the other four 4-core machines hosted easyABC, the webserver stubs, and polygraph respectively. As shown below:



3.3. Test plan

The key test metrics are:

Balance: whether requests are distributed evenly to the backends
Consistency: whether requests with the same key land on the same machine
Fault tolerance: whether the system keeps working normally when some backend machines go down

Guided by the metrics above, we test the following four scenarios with both easyABC and polygraph:

Scenario 1: all server_* instances serve normally;
Scenario 2: server_4 is down, the others are normal;
Scenario 3: server_3 and server_4 are down, the others are normal;
Scenario 4: all server_* instances have recovered and serve normally.

The four scenarios are executed in chronological order, each building on the previous one; the object under test requires no intervention, so as to simulate real conditions as closely as possible.

In addition, given the characteristics of the testing tools, the test pressure on easyABC is around 17,000 and on polygraph around 4,000. All tests ensured that the object under test worked normally and that no log entries above notice level (alert/error/warn) appeared; in each scenario the qps of each server_* was recorded for the final strategy analysis.
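One hypothetical way to turn the per-server qps records above into a single "balance" number is the coefficient of variation (standard deviation divided by mean): lower means more evenly spread load. The sketch below is purely illustrative, and the qps figures in it are fabricated, not from the tests in this article.

```python
from statistics import mean, pstdev

def balance_cv(qps_per_server):
    """Coefficient of variation of per-server qps; 0.0 means perfectly even."""
    m = mean(qps_per_server)
    return pstdev(qps_per_server) / m if m else 0.0

even_qps = [4210, 4195, 4203, 4188]    # fabricated example: nearly even load
skewed_qps = [9000, 1500, 3200, 3100]  # fabricated example: heavily skewed load

print(round(balance_cv(even_qps), 3))
print(round(balance_cv(skewed_qps), 3))
```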

3.4. Results

Comparing the results from the two testing tools shows they are completely consistent, so the influence of the testing tool can be ruled out. Table 1 and Figure 1 show the load under the round-robin strategy with the two tools.

As the chart shows, the round-robin strategy satisfies both balance and fault tolerance fairly well.





Table 2 and Figure 2 show the load under the fair strategy with the two tools. The fair strategy is extremely sensitive to the environment; even after ruling out interference from the testing tools, the results still jitter heavily.

Intuitively, this fails the balance criterion completely. From another angle, however, it is precisely this adaptivity that ensures full use of resources in complex network environments. Therefore, before applying it in industrial production, test it thoroughly in the specific environment.





The following charts cover the various hash strategies, which differ only in the hash key or the specific algorithm, so they are compared together. Practical testing revealed a problem shared by universal hash and consistent hash: when a backend machine goes down, the traffic that originally landed on that machine is lost. ip hash does not have this problem.

As the earlier source analysis of ip hash showed, when ip hash fails it degrades into the round-robin strategy, so no traffic is lost. In that sense, ip hash can be seen as an upgraded version of round robin.



Figure 5 shows the ip hash strategy. ip hash is an nginx built-in strategy and can be regarded as a special case of the previous two: with the source IP as the key.

Since the testing tools are not good at simulating requests from a massive number of IPs, we take a snapshot of the actual production situation for analysis. As shown below:



Figure 5: ip hash strategy

In the figure, the first third uses the round-robin strategy, the middle segment uses ip hash, and the last third uses round robin again. It is obvious that ip hash has a serious balance problem.

The reason is not hard to analyze: real networks contain a large number of nodes such as university egress router IPs and enterprise egress router IPs, which often carry hundreds or thousands of times the traffic of an ordinary user. Since ip hash splits traffic exactly by IP, the result above follows naturally.
 

4. Summary



Through the comparison tests, we verified each nginx load balancing strategy. The chart below compares the strategies in terms of balance, consistency, fault tolerance, and applicable scenarios:



We analyzed nginx's load balancing strategies from the perspectives of both source code and actual test data, and gave the scenarios each strategy suits. The analysis shows that no strategy is a silver bullet; which to choose for a concrete scenario depends to some extent on how familiar the user is with the strategies.

Hopefully the analysis and test data above are helpful, and may ever better load balancing strategies emerge to benefit more ops and development engineers.
 

References:

https://www.cnblogs.com/wpjamer/articles/6443332.html

https://www.cnblogs.com/andashu/p/6377323.html

 
