Upsync: Weibo's Open-Source Dynamic Traffic Management Solution for Nginx in Container Environments

Editor's note: High Availability Architecture shares and disseminates articles of significance in the architecture field. This article was shared by Yao Sifang in the High Availability Architecture group. When reproducing it, please credit the High Availability Architecture WeChat public account "ArchNotes".

Yao Sifang, senior technical expert at Sina Weibo and technical leader of the Weibo platform architecture group. He joined Sina Weibo in 2012 and has participated in several key projects, including the Weibo Feed architecture upgrade, the platform's move to services, and the hybrid cloud. He is currently responsible for the research and development of the platform's shared infrastructure. He previously presented "Sina Weibo High-Performance Architecture" at QCon, and his focus areas are high-performance architecture and service middleware.

Business background and problems encountered with Nginx

Nginx is widely used throughout the industry thanks to its exceptional performance and stability, and it handles Weibo's layer-7 traffic. Combined with Nginx's health-check module and reload mechanism, services could be upgraded, deployed, and scaled almost losslessly. At that stage, scaling was relatively infrequent and in most cases planned in advance.

Weibo's traffic has very pronounced peak characteristics. There are not only routine evening peaks, but also predictable extreme peaks such as New Year's Day, the Spring Festival Gala, and the "Red Packet Flying" campaign, as well as unpredictable spikes triggered by celebrity or social events such as #日见# and #我们#. The usual practice used to be buffer capacity plus degradation. When degradation is off the table (it hurts the user experience), a buffer that is too small cannot absorb the peak, while one that is big enough is unaffordable. Therefore, since 2014 we have used containerization to adjust the buffer dynamically, scaling it out and in on demand according to traffic, in order to save cost.

In this scenario, there is a large number of continuous scale-out/scale-in operations. The industry has two common solutions for changing Nginx backends: the DNS-based approach provided by Tengine, and backend service discovery based on consul-template. The table below briefly compares the two.
[Table: comparison of the DNS-based and consul-template solutions]

Based on DNS: this module, developed by the Tengine team, dynamically resolves the domain names configured in an upstream block. Operation is simple: just modify the list of servers attached to the DNS record (a configuration sketch follows the disadvantages below).

Disadvantages:

  • DNS is polled and re-resolved on a fixed interval (30 s by default). Too short an interval (say 1 s) puts pressure on the DNS server; too long an interval hurts timeliness.
  • A DNS record cannot carry too many servers: large responses get truncated (DNS uses UDP) and put pressure on bandwidth.
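
For illustration, a Tengine-style configuration might look like the minimal sketch below. The directive follows Tengine's ngx_http_upstream_dynamic documentation; `backend.example.com` is a placeholder, and a `resolver` must also be configured in the http block.

```nginx
upstream backend {
    # Tengine-only directive: re-resolve the domain at run time instead of
    # only at startup; on DNS failure, keep serving with stale addresses.
    dynamic_resolve fallback=stale fail_timeout=30s;

    # Scaling is done by editing the address list behind this DNS name.
    server backend.example.com;
}
```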

Based on consul-template plus consul: in this combination, consul serves as the database and consul-template is deployed on the Nginx server. consul-template periodically queries consul; when a value changes, it rewrites the local Nginx configuration files and issues a reload. Under heavy traffic, however, a reload hurts performance. A reload spawns new worker processes, so for a while old and new workers coexist, with the old workers repeatedly traversing their connection lists to check whether outstanding requests have finished before exiting. A reload also closes the long-lived connections between Nginx and both its clients and its backends, and the new workers must establish fresh ones.
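
For illustration, a minimal consul-template setup might look like the following. The service name "web", all paths, and the consul address are assumptions, and flag names vary slightly by consul-template version.

```nginx
# upstream.ctmpl — consul-template template; renders one server line
# per healthy instance of the (hypothetical) consul service "web"
upstream backend {
{{ range service "web" }}
    server {{ .Address }}:{{ .Port }};
{{ end }}
}
```

```sh
# Re-render the file and reload Nginx whenever the service list changes
consul-template -consul-addr "127.0.0.1:8500" \
    -template "upstream.ctmpl:/etc/nginx/conf.d/upstream.conf:nginx -s reload"
```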

Performance impact caused by reload:
[Chart: Nginx QPS during reload]

During a reload, Nginx's request-handling capacity drops (note: Nginx does not lose requests whose handshake has already succeeded).
[Chart: Nginx request latency during reload]

During a reload, request latency fluctuates, by as much as 50% or more (the range differs by business and its baseline latency).

The impact of each reload on Nginx's QPS and latency usually lasts 8 to 10 seconds. Considering that one scaling operation involves many changes, this is an unbearable burden for online business. Reloading Nginx must therefore be avoided.

Scheme design based on dynamic routing

In Nginx's design, each upstream maintains a static routing table holding each backend's IP, port, and other meta information. When a request arrives, the routing table is looked up according to the location, and a backend is chosen by the configured scheduling algorithm (round-robin, for example) to forward the request to. But this routing table is static: any change requires a reload, and as the figures above show, that significantly hurts the SLA.

The most intuitive idea is to update or rebuild the routing table dynamically after each backend change, avoiding the reload altogether. We therefore extended Nginx with a module, [nginx-upsync-module](https://github.com/weibocom/nginx-upsync-module), that dynamically updates and maintains the routing table. Broadly, there are two ways to keep the table current: push and pull.

push scheme

In this solution, Nginx exposes an HTTP API. To add or remove a server, you simply call the API against Nginx, which is simple and convenient (see the sketch after the drawbacks below).

The architecture diagram is as follows:
[Figure: push-based architecture]

The HTTP API is simple and convenient to operate and has good real-time behavior. The drawback: with multiple Nginx instances, it is hard to keep their routing tables consistent. If one registration fails, the service configurations diverge and fault tolerance becomes complex. In addition, a newly added Nginx server must synchronize its routing table from the other instances.
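
As a hypothetical illustration of the push style (the API path and parameters here are invented, not this module's interface), adding a backend might look like the following; the loop is exactly where the consistency problem comes from.

```sh
# Push-style change: every Nginx instance must be called one by one, and a
# failure on any single host leaves the routing tables inconsistent.
for host in nginx-01 nginx-02 nginx-03; do
    curl -X POST "http://$host:8081/upstream/test/add?server=10.0.0.5:8080"
done
```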

pull scheme

To address the routing-table consistency problem of the push scheme, a third-party component, consul, is introduced.

The architecture diagram is as follows:
[Figure: pull-based architecture]

All backend information in the routing table (including metadata) is stored in consul, and every Nginx instance pulls it from consul, updating its routing table whenever something changes. Consul solves the consistency problem; consul's wait (blocking-query) mechanism addresses timeliness; and incremental pulls keyed on consul's index (a version number) keep bandwidth usage low.

In consul, each key/value pair represents one backend server. Adding a pair means scaling out; removing one means scaling in. Adjusting meta information such as weight likewise achieves dynamic traffic adjustment.
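
Concretely, registering and deregistering a backend are just consul KV writes. A minimal sketch follows; the consul address, upstream name, and backend address are placeholders, and the JSON value shape follows the project README.

```sh
# Scale out: add one backend as a k/v pair carrying its meta information
curl -X PUT -d '{"weight":1, "max_fails":2, "fail_timeout":10}' \
    http://127.0.0.1:8500/v1/kv/upstreams/test/10.0.0.5:8080

# Scale in: delete the k/v pair for that backend
curl -X DELETE http://127.0.0.1:8500/v1/kv/upstreams/test/10.0.0.5:8080
```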

The following implementation is based on consul.

Scheme implementation based on dynamic routing

Based on the upsync (pull) approach, we developed the module nginx-upsync-module. It pulls the backend server list from consul and updates Nginx's routing information, and it does not depend on any third-party modules.
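
A minimal upstream configuration using the module might look like the sketch below. The directive names follow the project README; the consul address and file paths are placeholders.

```nginx
upstream test {
    # Pull this upstream's backend list from the consul KV store,
    # with a long response timeout and a short polling interval
    upsync 127.0.0.1:8500/v1/kv/upstreams/test upsync_timeout=6m upsync_interval=500ms upsync_type=consul strong_dependency=off;

    # Snapshot the pulled server list to a local file (see the
    # high-availability section below)
    upsync_dump_path /usr/local/nginx/conf/servers/servers_test.conf;

    include /usr/local/nginx/conf/servers/servers_test.conf;
}
```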

Routing table update method

Consul serves as Nginx's database through its KV store; each Nginx worker process independently pulls the configuration of each upstream and updates its own routing table.

The flow chart is as follows:
[Figure: routing table update flow]

Each worker process periodically pulls the corresponding upstream configuration from consul; the polling interval is configurable. Consul provides a time_wait (blocking-query) mechanism based on the value's version number: if consul finds that the value of the requested upstream has not changed, it holds the request for up to five minutes (by default). Any change to that upstream within those five minutes returns immediately, and Nginx updates the corresponding route at once. The pull interval can be tuned to the scenario, so the required timeliness can basically be achieved. After an upstream changes, besides updating the routing information cached in Nginx, the module also dumps the upstream's backend server list to a local file, keeping the local server information consistent with consul.
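
The wait mechanism is consul's standard blocking query: pass back the last index you saw plus a wait time, and consul holds the request until the value changes or the wait expires. A sketch with curl (addresses and the index value 42 are placeholders):

```sh
# First pull: returns the current list plus an X-Consul-Index header
curl -i "http://127.0.0.1:8500/v1/kv/upstreams/test?recurse"

# Follow-up pulls: consul holds the request (up to the wait time) until
# the index advances past the value passed back, then returns at once
curl "http://127.0.0.1:8500/v1/kv/upstreams/test?recurse&index=42&wait=5m"
```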

Besides registering/deregistering backend servers in consul, which updates Nginx's upstream routing, modifications to a backend server's attributes are also synchronized into the upstream routing. The attributes this module currently supports modifying are weight, max_fails, fail_timeout, and down. Modifying a server's weight dynamically adjusts its share of traffic. To take a server out of rotation temporarily, set its down attribute to 1 (dumping the down attribute to the local server list is not supported yet) and traffic to it stops; to restore traffic, set down back to 0.
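
Under the same assumptions as above (key layout and addresses are placeholders), an attribute change is just a PUT of a new value on the same key:

```sh
# Shift traffic: raise this backend's weight
curl -X PUT -d '{"weight":4, "max_fails":2, "fail_timeout":10}' \
    http://127.0.0.1:8500/v1/kv/upstreams/test/10.0.0.5:8080

# Drain it temporarily: down=1 stops new traffic; down=0 restores it
curl -X PUT -d '{"weight":4, "max_fails":2, "fail_timeout":10, "down":1}' \
    http://127.0.0.1:8500/v1/kv/upstreams/test/10.0.0.5:8080
```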

In addition, each worker process pulls and updates its own routing table. There are three reasons for this. First, it matches Nginx's process model: workers keep independent data and do not interfere with each other. Second, shared memory would have to be pre-allocated, limiting flexibility, and would require read/write locks, with a potential impact on performance. Third, with shared memory, coordinating across processes to pull the configuration adds complexity and makes pulling less reliable. For these reasons, each worker pulls for itself.

High availability

Nginx's backend list updates rely on consul, but not strongly. First, even if consul unexpectedly goes down, Nginx's service is unaffected: Nginx simply continues serving with the last updated backend list. Second, once consul is restored, Nginx's ongoing polling detects it, and any changes made to the backend list in consul are promptly applied to Nginx.

On the other hand, every update a worker process applies is also dumped to a local backend list. The purpose is to reduce the dependency on consul: Nginx can still be reloaded even when consul is unavailable. The Nginx startup flow is as follows:
[Figure: Nginx startup flow]

When Nginx starts, the master process first parses the local configuration file and performs a series of initializations, after which the worker processes initialize. During worker initialization, each worker pulls the configuration from consul and updates its upstream routing information. If the pull succeeds, the routes are updated directly; if it fails, the worker opens the configured dump file of the backend list, reads the previously dumped server information, updates the upstream routing, and then starts serving normally.
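
The dump file is plain Nginx server directives, which is why it can be `include`d directly at startup. A snapshot written under the assumptions above might contain (contents illustrative):

```nginx
# /usr/local/nginx/conf/servers/servers_test.conf — snapshot written by
# the module after each update; read back if consul is unreachable
server 10.0.0.5:8080 weight=1 max_fails=2 fail_timeout=10s;
server 10.0.0.6:8080 weight=2 max_fails=2 fail_timeout=10s;
```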

Every pull from consul is made with a connection timeout. Since consul by default holds a request for five minutes when there is no update, the response timeout should be configured to be greater than five minutes; if consul still has not responded after that, the request is treated as timed out.

Compatibility

Generally speaking, the module only updates the backend routing information of an upstream; it does not hook into other modules, does not affect their functionality, and does not interfere with Nginx 1.9.9's load-balancing algorithms such as least_conn and ip_hash.

In addition, the module naturally supports health-check modules: if a health-check module is compiled into Nginx, its interface is called as well whenever routes change, so the health-check module's routing table is kept up to date.

Performance Testing

The nginx-upsync-module potentially adds performance overhead, such as periodically sending requests to consul. Since the interval is relatively long and each request costs about as much as one client request to Nginx, the impact is limited. On that basis, we made a simple performance comparison, on identical hardware, between Nginx with and without the module.

The basic environment:
  • Hardware: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz, 12 cores
  • OS: CentOS 6.5
  • Worker processes: 8
  • Load-testing tool: wrk
  • Command: ./wrk -t8 -c100 -d5m --timeout 3s http://$ip:8888/proxy_test

Benchmark results:
[Table: benchmark results for Nginx with and without the upsync module]

Here, Nginx (official) is stock Nginx, with no reload executed during the test; Nginx (upsync) carries the upsync module and registers/deregisters one machine with consul every 10 s. The data show that the upsync module's impact on performance is limited enough to be ignored.

Applications

The module is already in use across Weibo's businesses. The charts below compare QPS and latency before and after adopting the module.
[Charts: QPS and latency before and after adopting the module]

The data show that during a reload, Nginx's request-handling capacity drops by about 10% and Nginx's own latency rises by 50% or more. With frequent scale-out/scale-in, the overhead of reload operations becomes even more pronounced.

During the 2016 New Year's Day peak, hundreds of scale-out/scale-in operations were carried out according to the traffic characteristics of different time periods, and the overall service SLA was unaffected throughout.

Nginx Plus, the official commercial version, supports DNS-based updates and a push-style API. Because of data-consistency and related issues encountered in practice, we instead extended Nginx with pull-based support built on consul.

The project is hosted at https://github.com/weibocom/nginx-upsync-module; the wiki and documentation are being improved.

Q & A

1. Is the registration of machine configuration information in consul automatically adjusted by the Weibo system according to traffic?
These are two separate questions. Registering backend information with the nodes during scaling is automatic and has been integrated into the deployment system. Separately, Weibo is developing an online capacity evaluation system; adjustment is currently semi-automatic.

2. Why didn't you consider ZooKeeper at the start? Would switching from periodic pulling to zk with long-lived connections make a difference?
The module has since added support for stores such as etcd and zk. consul was chosen at the start because the company already ran consul clusters and had the operations staff for them. From the module's point of view, zk, etcd, and consul are essentially the same.

3. Why not have the Nginx master pull and then distribute to each worker, instead of each worker pulling? The former would reduce network traffic and improve consistency among the workers within one Nginx.
Pulling in the master would require modifying core modules. A guiding principle when designing the module was to keep it as close to zero-dependency as possible. Overall, it is a trade-off.

4. Is the registration of machine configuration information in consul automatically adjusted by the Weibo system according to traffic?
This question is similar to question 1.

5. Based on what considerations did you choose consul for configuration management?
Similar to question 2.

6. Could you design a generic API so that Nginx can pull from any source, whether consul or some Java service? That feels more general.
The design already follows that idea: the module defines an upsync type internally, and different sources can be supported by implementing different types.

7. Also, have you solved the problem of routing information growing so large that it consumes too much memory? We currently use LRU eviction; was such a scenario considered in the design?
In Weibo's design, we usually keep only the current routing table plus one expired routing table; once all requests served by the expired table have completed, that memory is released.

8. What are Weibo's spare machines used for when traffic is low? If that batch of machines is also serving other businesses, what happens to those businesses when Weibo dynamically claims the machines?
We currently run a hybrid cloud, and the buffer pool is created on the public cloud; when traffic is low, the machines are simply released. Machines in our own data centers can usually run offline jobs during low-traffic periods; a strategy for co-locating online and offline workloads is under development.

9. In real tests with ab and wrk, frequent updates of the consul list fail under high pressure. How do you deal with this?
We have tested updates of thousands of machines per second without problems, which has covered most scaling demands. When we load-tested consul itself, a single-host consul deployment did fail, but that is unrelated to the module; the consul cluster's capacity and configuration need to be improved.

10. I'd like a basic understanding of consul. Is the routing information stored in memory? How is routing information for different services distinguished: only by key naming, or is a separate consul deployed per service?
Routing table information is stored in three places: 1) Nginx's memory, which serves routing directly; 2) consul, where each backend is a key of the form /$consul_path/$upstream/ip:port; 3) a snapshot file on the server where Nginx runs, guarding against consul and Nginx going down at the same time.
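
To illustrate the key-naming separation (the path and addresses are assumptions, consistent with the earlier examples), listing one upstream's backends is a single consul KV call:

```sh
# List every backend key registered under one upstream; different
# services are separated purely by key prefix
curl "http://127.0.0.1:8500/v1/kv/upstreams/test?keys"
# e.g. ["upstreams/test/10.0.0.5:8080","upstreams/test/10.0.0.6:8080"]
```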


Some articles previously shared by the Weibo technical team in High Availability Architecture:
  • A lightweight RPC framework that supports hundreds of billions of calls on Weibo: Motan
  • Weibo Docker-based hybrid cloud platform design and practice
  • Discussion on the deployment experience of Weibo "multiple lives in different places"
  • Weibo's Docker container-based hybrid cloud migration combat
  • MySQL optimization and operation and maintenance of 6 billion records in a single table
  • Troubleshooting methods for Weibo in large-scale and high-load systems

Postscript: Word has it that yesterday Sifang walked out of the boss's office with a dark face, sternly criticized for not having recruited anyone since the Spring Festival. The Lantern Festival is over and the new year has officially begun, so the editor suggests that friends interested in the topics above, upsync, Motan, Docker and hybrid cloud, MySQL optimization and operations at 6 billion rows in a single table, check the recruitment information below and hurl a résumé at Sifang to help him out.

The Weibo technical team is recruiting all kinds of technical talent, including C/C++, Java, big data, and operations positions. Engineers are equipped with MacBook Pros and large Dell displays, and have access to a rich store of internal development know-how and training documents; it is an excellent place to grow your technical skills. Scan the QR code for job details.

This article was planned by Li Qingfeng, edited by Wang Jie, and reviewed by Tim Yang. To discuss RPC and microservice architecture, follow the public account for a chance to join the group. When reproducing, please credit the High Availability Architecture WeChat public account "ArchNotes".

Origin: blog.51cto.com/14977574/2547937