Top conference paper | Exploration and practice of virtual network detection technology

Author: Lv Biao, person in charge of Alibaba Cloud Network Qitian

Cloud networks consist of both physical and virtual networks, and both affect network performance. Past studies have mainly addressed physical network detection; far fewer address virtual network detection. This article introduces Zoonet, an active detection system designed for large-scale multi-tenant virtual networks, and walks through its design background, challenges, technical architecture, and lessons from large-scale deployment.

Recently, the paper "Zoonet: A Proactive Telemetry System for Large-Scale Cloud Networks" by Alibaba Cloud Luoshen Cloud Network was accepted by the ACM CoNEXT 2022 conference. This year CoNEXT received a total of 151 submissions, of which 28 were accepted, an acceptance rate of only 18.5%. The paper presents leading research in the field of large-scale virtual network detection, and the Zoonet network detection system based on this research has been deployed in Alibaba Cloud's data centers worldwide.

To make the paper easier to understand, this article explains the design background, challenges, technical architecture, and large-scale deployment experience of the Zoonet system from a technical perspective.

1. Background introduction

Today's public cloud is infrastructure for society as a whole, serving millions of tenants simultaneously. The cloud network gives each tenant a reliable, isolated network environment, so that tenants can run their own applications and communicate without interfering with one another. The cloud network is composed of the underlying physical network and the virtual network on top of it. The physical network mainly provides basic connectivity, while the virtual network provides tenants with more advanced network services, such as network address space isolation, virtual routing and forwarding, multi-IP shared bandwidth, Internet access, tenant-level cross-regional networks, and tenant-level hybrid cloud networks. Since both the physical network and the virtual network affect the network performance of tenants, cloud vendors need to detect both layers to guarantee tenants' service level agreements (SLAs).

Through literature research, we found that previous work mainly addresses physical network detection, such as Pingmesh [SIGCOMM'15], Everflow [SIGCOMM'15], 007 [NSDI'18], dShark [NSDI'19], and NetBouncer [NSDI'19]. There are very few corresponding studies on virtual network detection. We found only one, VNET Pingmesh [IMC'18, short paper], that attempted it. However, it presented only a preliminary design and deployment, and did not discuss or solve the problems faced by large-scale virtual network detection, such as rapid virtual network updates, support for heterogeneous middleware (middleboxes), tenant-level public network and cross-regional network detection, and complete VM-to-VM detection link coverage.

Therefore, this paper proposes Zoonet, an active detection system designed for large-scale multi-tenant virtual networks. The system combines coverage detection of the entire cloud network with on-demand detection of abnormal links. Coverage detection provides high path coverage with limited bandwidth overhead: it prunes redundant paths and paths restricted by routing/ACL rules, and then uses the simplest end-to-end detection mode. On-demand detection locates network anomalies quickly by probing abnormal paths in hop-by-hop mode. In particular, Zoonet's design stands out in the following aspects:

  • Zoonet supports heterogeneous middleware. There are a large number of heterogeneous middleware in the cloud network, so Zoonet designed a set of general detection models that can help these middleware quickly adapt to Zoonet.

  • Zoonet keeps up with fast virtual network topology updates. To deal with frequent topology updates, Zoonet subscribes to topology change information and adopts a series of optimization strategies to improve update efficiency. Measurement noise caused by updates that have not yet been applied is filtered out by verifying results against the latest topology.

  • Zoonet discovers more problems in virtual network scenarios by continuously expanding the detection boundary. In addition to ordinary private network and cross-regional scenarios, Zoonet also supports the detection of public network, stateful middleware, virtual machine "last mile" and other scenarios.

Zoonet has been deployed on Alibaba Cloud for more than two years, covering dozens of regions. During deployment, Zoonet found many interesting cases, including virtual network protocol stack errors, virtual network congestion, virtual routing anomalies, physical network failures, and virtual machine "last mile" anomalies. Since many suspected network problems are not actually caused by the network, Zoonet also helps cloud vendors prove, as far as possible, that the network is not at fault.

2. Limitations of Physical Network Detection

It is intuitive to assume that the status of the virtual network can be inferred from the results of physical network detection. Below we analyze the limitations of physical network detection in detail and explain why virtual network detection is necessary.

2.1 Physical network detection cannot cover the virtual network protocol stack

As shown in the figure above, on the physical host, the physical network protocol stack and the virtual network protocol stack are implemented separately. The physical network protocol stack is based on kernel forwarding. In the early days of Alibaba Cloud, the virtual network protocol stack was implemented with the joint support of user space and kernel space. Today, for higher performance forwarding, the virtual networking stack is implemented entirely in user space, using a DPDK-based kernel bypass model. This way the virtual forwarding path is completely decoupled from the kernel space. Because physical network probes only send probes through the physical network stack, they cannot detect problems with the virtual network stack.

2.2 There is no accurate topology mapping between the virtual network and the physical network

The figure above shows an example of a topology map. A simple VM-to-VM path in a virtual network corresponds to four underlying ECMP paths in the physical network. In this example, even if physical network detection finds that a path is broken, it is difficult to deduce whether the virtual network has a problem, because traffic may bypass the broken underlying path (for example, flow 2). Worse, even if we notice that one flow is dropping packets in the virtual network (for example, flow 1), we cannot pinpoint the faulty physical path, because the four ECMP paths obscure the mapping between the virtual and physical topologies.

2.3 Physical network detection may bypass middleware

The figure above shows the physical network topology of Alibaba Cloud, which includes physical servers, switches, regional border gateways, and various middleware mounted under the load balancer. In such topologies, middleware is deployed as separate appliances, away from physical servers and switches. Forwarding traffic to this middleware requires the sender to additionally rewrite the packet's destination address. As a result, end-to-end probing methods initiated from hosts will bypass these "off-the-path" middleware.

2.4 Physical network detection cannot cover cross-regional networks or Internet boundaries based on tenant granularity

An Alibaba Cloud tenant can purchase multiple VPCs, which are distributed in different regions and connected through a cross-regional network. However, physical network detection within a region cannot cover regional border gateways and cross-regional networks at the granularity of tenants, and cannot provide cross-regional end-to-end troubleshooting for specific tenants. Similarly, there is a huge demand for traffic between the virtual machine and the Internet, so cloud vendors need to resolve tenants' doubts about whether network failures occur on the cloud or on the ISP side. Therefore, network probing should cover the boundary between the cloud and the Internet to remove this doubt. However, physical network probing cannot fulfill this requirement.

3. Virtual network detection challenges

Virtual network detection, especially large-scale virtual network detection like Alibaba Cloud, faces many unique challenges, and it is not feasible to copy the method of physical network detection. Next, we briefly introduce the specific virtual network detection challenges.

3.1 Realizing low-overhead detection on ultra-large-scale virtual networks

A large cloud region contains hundreds of thousands of servers in a physical network. However, a virtual network usually contains more virtual nodes. First, one physical server can be virtualized into hundreds of virtual machines (one virtual machine can further host up to 1024 containers if ENI trunking technology is used). Second, the virtual network provides services for a large number of tenants, for example, a region serves millions of tenants. Third, for large tenants, a single VPC can accommodate very large-scale virtual nodes (for example, greater than 500,000 virtual machines). According to our analysis, in such a large-scale virtual network, the overhead of traditional detection schemes will be three orders of magnitude higher than that of physical network detection.

3.2 Adapt to the rapid update of virtual network topology

The topology of the physical network is relatively stable. In our experience, Alibaba Cloud's physical network is updated a few thousand times per month. In contrast, the topology of a virtual network changes very quickly due to a large number of active tenants and a flexible management API. As shown in the figure above, tenant resource allocation and release trigger tens of thousands of virtual network topology updates per hour in a cloud region. For the detection system, frequent virtual network updates cause additional overhead: 1) real-time topology recalculation based on tenant configuration; 2) real-time calculation of end-to-end detection paths based on topology updates; 3) real-time distribution of detection paths from the controller, which brings massive bandwidth overhead.

3.3 Multi-service and multi-middleware coverage

For layer 2 and layer 3 devices in the physical network, packet forwarding is generally based on stateless forwarding rules, which makes bidirectional (outbound and inbound) detection straightforward. However, the virtual network contains a variety of stateful middleware, for which simple bidirectional detection does not work. Take session-based middleware: before the client sends the first packet of a session, the middleware holds no session state, so packets from the server cannot be forwarded correctly through it. For example, a virtual machine relies on SNAT (provided by a NAT gateway) to access the Internet, but the Internet cannot actively reach the virtual machine unless a SNAT session has been established. Furthermore, different middleware can have different implementations, and even middleware with the same function can be built on different platforms (e.g., bare metal, NFV, FPGA, ASIC). Therefore, the virtual network detection framework needs to adapt to the heterogeneity of middleware.

3.4 Tenant-insensitive VM-to-VM detection

For physical networks, end-to-end means host to host, while for virtual networks, end-to-end means virtual machine to virtual machine. For physical network detection, task generation and data collection can be implemented in an operating system process on the end host. The same method cannot be applied directly to tenants' VMs, because cloud vendors must protect user privacy and cannot intrude into tenants' virtual machines. Without probes embedded in the VM, however, the "last mile" of the end-to-end path is not easily covered. We discuss the use of ARP Ping to solve this problem below.

3.5 Distinguishing between virtual and physical network issues

When an abnormality occurs on the network, it is necessary to distinguish between virtual network problems and physical network problems to narrow down the scope of the fault. In a physical network, we can rely on traceroute for hop-by-hop measurement and diagnosis. However, when traceroute is extended to the virtual network, even if it reports packet loss at an exact hop in the virtual network, we still cannot confirm whether the cause is a virtual device or an underlying physical device, because a physical network domain exists between two adjacent virtual network devices. In this paper we address this problem through a specially designed Zoonet hop-by-hop mode.

4. Zoonet Solution

4.1 Overall Design

As shown in the figure above, Zoonet consists of a data plane and a control plane. The data plane receives detection tasks from the control plane, and then actively injects detection packets to detect virtual networks. The detection path of a detection task is from vSrc to vDst through multiple vBoxes. vSrc and vDst are virtual detection points, implemented by Zoonet-agent (a self-developed program running on the host machine, independent of the VM hypervisor). vBox refers to a series of virtual network middleware (such as hypervisors, load balancers, NAT gateways, Internet gateways, etc.). In Zoonet, special support for vBox can be used for fast location of abnormal paths. To cover Internet probes, Zoonet extends the probe boundary to ISPs by setting the probe point pDst at the cloud boundary. pDst can simulate the Internet to return the detection packet.

The control plane consists of three modules: the telemetry task planner, the telemetry topology analyzer, and the telemetry data analyzer. The task planner is responsible for planning both normalized detection tasks and on-demand diagnostic detection tasks. The topology analyzer is responsible for virtual network topology calculation and potential probe path analysis for all tenants. It must also update the topology and paths based on tenants' topology update operations. The data analyzer is responsible for collecting probe data from the data plane for in-depth analysis and sending alerts when anomalies are found. To eliminate invalid detection results caused by updates that have not yet been applied, the data analyzer reads the latest topology for consistency verification before an alarm is issued.
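The consistency check before alerting can be pictured with a minimal sketch. It assumes each anomaly records the nodes on its probe path and that the latest topology is available as a set of node identifiers; the function name and the dictionary layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def filter_stale_anomalies(anomalies: List[Dict], latest_nodes: Set[str]) -> List[Dict]:
    """Keep only anomalies whose whole probe path still exists in the latest topology.

    Assumption: an anomaly on a path that contains a released or changed node is
    measurement noise from an out-of-date task, so it should not raise an alarm.
    """
    return [a for a in anomalies if set(a["path"]) <= latest_nodes]


# Hypothetical data: vm-gone was released after the task was planned.
anomalies = [
    {"task": "t1", "path": ["vm-a", "nat-gw", "vm-b"]},
    {"task": "t2", "path": ["vm-a", "vm-gone"]},
]
latest = {"vm-a", "nat-gw", "vm-b"}
alerts = filter_stale_anomalies(anomalies, latest)
```

In this example only task t1 would go on to raise an alarm.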

4.2 Zoonet data plane

4.2.1 Detection tasks

The data plane of Zoonet receives and executes the detection tasks issued by the control plane. A detection task is defined as:

Task=Probing(vSrc,vDst,options,modes)

Here, vSrc and vDst are the source and destination of a detection task, whose path passes through multiple intermediate nodes (vBoxes). In most cases, vSrc and vDst refer to VMs, and vBox refers to virtual network middleware. To remain tenant-agnostic, Zoonet's end-to-end probing starts and ends at the agent on the VM's host (i.e., the Zoonet agent) rather than inside the VM. In Internet probing, vDst can also be the VIP of the Internet gateway or the physical probing device pDst placed near the Internet border. The options define how to encapsulate and send probe packets, such as interval, quantity, probe packet size, and protocol. The modes define how a vBox responds to probe packets. Through options and modes, the control plane can program how the data plane probes.
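As an illustration of the task definition, here is a minimal Python sketch of a task record. The field names, default values, and class names are assumptions made for this example and are not the paper's wire format; the mode abbreviations (OW/PP, EE/HH, NT/TR) follow the mode combinations named in section 4.2.2.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import product


class Direction(Enum):
    ONE_WAY = "OW"
    PING_PONG = "PP"


class Granularity(Enum):
    END_TO_END = "EE"
    HOP_BY_HOP = "HH"


class Boundary(Enum):
    NON_TRANSPARENT = "NT"
    TRANSPARENT = "TR"


@dataclass(frozen=True)
class Options:
    # How probe packets are encapsulated and sent (illustrative fields and defaults).
    interval_ms: int = 1000
    count: int = 10
    packet_size: int = 64
    protocol: str = "udp"


@dataclass(frozen=True)
class Modes:
    # How vBoxes and vDst respond to probe packets.
    direction: Direction = Direction.ONE_WAY
    granularity: Granularity = Granularity.END_TO_END
    boundary: Boundary = Boundary.NON_TRANSPARENT

    def label(self) -> str:
        return "+".join(m.value for m in (self.direction, self.granularity, self.boundary))


@dataclass(frozen=True)
class Task:
    v_src: str
    v_dst: str
    options: Options = field(default_factory=Options)
    modes: Modes = field(default_factory=Modes)


def all_mode_combinations():
    """Enumerate every combination of the three mode pairs."""
    return [Modes(d, g, b) for d, g, b in product(Direction, Granularity, Boundary)]


default_task = Task("vm-a", "vm-b")
snat_task = Task("vm-a", "nat-gw-vip", modes=Modes(direction=Direction.PING_PONG))
```

Keeping options and modes as separate immutable records mirrors the split in the definition: options describe how packets are sent, modes describe how the path responds.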

Zoonet contains the following three types of detection packets:

  • Request packet, a detection packet from vSrc to vDst;

  • Reply packet (response packet), a detection packet from vDst back to vSrc;

  • Report packet, a message carrying detection data. A vBox may send a report packet when it receives a request or reply packet.

At the same time, to simplify the detection logic on the vBox, the vSrc endpoint is responsible for all interactions with the control plane, including generating and injecting detection packets and calculating detection delay and packet loss. A vBox is only responsible for forwarding probe packets and replying with messages.

4.2.2 Detection mode

Simple end-to-end probing is not enough to cover all cloud probing scenarios:

1) The cloud network has a large number of stateful middleware, and the stateless detection can only monitor the virtual path from one direction;

2) Existing solutions only provide detection coverage within the region, and the Internet border is currently a blind spot for detection;

3) End-to-end detection can detect abnormal paths, but when a fault occurs, the exact fault point (that is, the device/link) cannot be located.

To meet these detection needs, Zoonet defines three pairs of atomic detection modes, as shown in the table above. The One-Way and PingPong modes select one-way or two-way detection; the Non-Transparent and Transparent modes indicate whether to extend the detection boundary from vDst to pDst; and the End-to-End and Hop-by-Hop modes indicate whether each vBox on the forwarding path participates in the probing process. The three pairs yield 8 combinations. Here are 4 common combinations and the cloud network detection scenarios they correspond to:

Example 1: Normalized detection, stateless middleware. In the normalized detection use case (pattern combination of OW+EE+NT), vSrc sends a request detection packet, which reaches vDst through the end-to-end path. vDst then sends a report message to vSrc. Please note that in the case of normalized detection, vBox only forwards detection packets. For network-wide virtual path coverage, normalized detection is enabled by default.

Example 2: Normalized detection, stateful middleware. In the stateful probing use case (mode combination PP+EE+NT), vDst sends a reply packet in addition to the report packet. Such use cases are mainly for probing stateful middleware such as SNAT. Note that PingPong is not equivalent to two separate unidirectional probes sent in opposite directions, because packets arriving from the reverse direction are discarded until the stateful middleware has established the session.

Example 3: Normalized detection, Internet boundary. In the Internet boundary detection use case (mode combination OW+EE+TR), we rely on Transparent mode and pDst for Internet boundary detection coverage. The probe packet is forwarded on to pDst after passing through vDst. We do not place pDst on the Internet (for example, in an ISP's computer room), mainly because a pDst on the Internet is beyond the control of cloud vendors: even if a path failure is detected, we cannot determine whether the anomaly occurred in the cloud or in the ISP network. In Zoonet, we place pDst in the public cloud near the Internet border.

Example 4: On-demand diagnostic detection. In the on-demand diagnosis use case (mode combination OW+HH+NT), unlike the previous use cases, the request packet sent from vSrc triggers each vBox on the forwarding path to reply with a report packet. This hop-by-hop approach is similar to traceroute, but with a cleaner and more versatile design. Specifically, each vBox sends the report packet twice, once in the ingress direction and once in the egress direction. This design can quickly distinguish between link failures and node failures. If the ingress and egress reports of the same vBox show different results, the failure is on that node. Otherwise, the link between vBoxes must have failed. Because a virtual link corresponds to a physical network domain, Zoonet's Hop-by-Hop mode can distinguish between physical network and virtual network problems. Hop-by-Hop is computationally intensive on both the vBox and the control plane. Therefore, it is best to use End-to-End for network coverage detection first, and then use Hop-by-Hop to locate anomalies.
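The node-versus-link reasoning in Hop-by-Hop mode can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes report packets themselves are never lost, that hops are listed in forwarding order, and that "a report was received" is the only signal; `HopReport` and `locate_fault` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HopReport:
    vbox: str
    ingress_seen: bool  # report sent when the probe entered the vBox
    egress_seen: bool   # report sent when the probe left the vBox


def locate_fault(v_src: str, hops: List[HopReport]) -> Optional[str]:
    """Walk the path in order and return the first failing node or link."""
    prev = v_src
    for hop in hops:
        if not hop.ingress_seen:
            # The probe left the previous hop but never arrived here:
            # the virtual link (a physical network domain) is suspect.
            return f"link {prev} -> {hop.vbox}"
        if not hop.egress_seen:
            # The probe arrived but never left: the vBox itself is suspect.
            return f"node {hop.vbox}"
        prev = hop.vbox
    return None  # every hop reported both directions


node_case = [
    HopReport("hypervisor-1", True, True),
    HopReport("nat-gw", True, False),
    HopReport("hypervisor-2", False, False),
]
link_case = [
    HopReport("hypervisor-1", True, True),
    HopReport("nat-gw", False, False),
]
```

A "link" verdict points at the physical network domain between two virtual devices, while a "node" verdict points at the virtual device itself.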

4.2.3 Other data plane support

  • Zoonet protocol: We have developed a set of virtual network detection protocols, and the packet format of this protocol has been disclosed in the paper.

  • Zoonet agent: self-developed software for sending and receiving probe packets. It is deployed on the VM host as an independent process pinned to a dedicated control CPU core, which effectively avoids interference with tenants and tenant VMs.

  • VM hypervisor: After identifying a Zoonet packet, it mainly encapsulates and decapsulates the tunnel header and sets some flag bits.

  • Middleware: Software middleware supports Zoonet better. For some programmable middleware, such as Tofino chips, due to their limited on-chip resources, Zoonet data packets are generally sent to the control plane for processing.

  • Last mile detection: The detection packet sent from the Zoonet agent cannot cover the small link from the VM to the hypervisor, which is easily overlooked. We use ARP detection to solve this problem. The reason why ARP detection is selected is because the ARP protocol is a very low-level and basic protocol, and general VMs support it by default.

4.3 Zoonet control plane

The control plane of Zoonet mainly solves the following three problems:

  • Huge measurement overhead incurred by probing tasks;

  • Problems caused by frequent topology updates;

  • The huge overhead of collecting and consuming probe data.

4.3.1 Hierarchical detection path planning

Before discussing how Zoonet calculates detection paths, let's look at the topology of the cloud network. The figure above shows an overview of one tenant's virtual network in Alibaba Cloud. VMs are attached to virtual switches (a virtual switch is a logical node, and the number of virtual machines it carries is theoretically unlimited). The virtual switches in a region are fully connected to each other. Multiple VPCs distributed in different regions can be connected through cross-regional links. In addition, a virtual machine can access the Internet through a public IP or SNAT. Next we discuss how Zoonet reduces detection overhead through a layered detection method.

  • Level 1: VMs under the same virtual switch are detected in pairs. If the VMs under each virtual switch probed each other in a fullmesh, the detection complexity would be O(n^2), where n is the number of virtual machines. In a virtual network, the number of virtual machines under a virtual switch is theoretically unlimited (we have observed thousands of virtual machines connected to the same virtual switch in actual deployment), so fullmesh detection does not scale. Pingmesh can perform fullmesh detection between servers under the same ToR because each physical ToR has a fixed number of ports, which keeps the detection complexity under control. To reduce the detection overhead, on each virtual switch we divide the VMs into two groups of equal size and probe VM pairs across the groups, which reduces the detection complexity from O(n^2) to O(n) (n/2 probe pairs). Additionally, we deliberately pair VMs that are placed on different servers, so that as many probes as possible cross physical servers.

  • Level 2: fullmesh detection between virtual switches. To fully cover a virtual network, a straightforward solution is to perform fullmesh probes on each end-to-end VM in the virtual network. However, such fullmesh probing overhead is too large. To reduce complexity, we use aggregate probing. Specifically, fullmesh probes can be initiated from aggregate-level virtual switches instead of leaf-node VMs.

  • Level 3: Cross-regional path pruning. Tenants will set routing/ACL configurations on the regional border gateways to limit cross-regional traffic. In this way, although the underlay network for cross-regional communication is fullmesh, the overlay network traffic will not take all paths. According to the characteristics of this overlay network, we perform cross-regional path pruning based on routing tables and ACL rules to further reduce the complexity of cross-regional detection.

  • Level 4: VM-to-VM top-N hotspot path detection. We found that cloud network traffic follows the 80/20 rule, that is, most traffic is carried by a small number of paths. Using this, we can further optimize the detection strategy. Specifically, we can regularly analyze the traffic logs and select the top N VM-to-VM hotspot paths as the key detection paths, which can cost-effectively cover most of the traffic. Top-N path coverage can be a good complement to the previous path planning strategy.
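Levels 1 and 4 above lend themselves to a short sketch. The pairing heuristic below (sort VMs by host, then pair position i with position i + n/2) is an assumption chosen to favor cross-server pairs; the source does not state how Zoonet forms its groups. The flow-log format and all function names are likewise hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def pair_vms(vm_to_host: Dict[str, str]) -> List[Tuple[str, str]]:
    """Level 1: split the VMs under one virtual switch into two halves and pair them.

    Sorting by host first means partners sit half a list apart, so they land on
    different servers unless one server holds more than half of the VMs.
    """
    vms = sorted(vm_to_host, key=lambda vm: (vm_to_host[vm], vm))
    half = len(vms) // 2
    pairs = [(vms[i], vms[i + half]) for i in range(half)]
    if len(vms) % 2 == 1 and half > 0:
        pairs.append((vms[-1], vms[0]))  # odd count: reuse one VM for the leftover
    return pairs


def full_mesh_pairs(n: int) -> int:
    """Number of undirected VM pairs a fullmesh would need to probe."""
    return n * (n - 1) // 2


def top_n_paths(flow_log: Iterable[Tuple[str, str, int]], n: int) -> List[Tuple[str, str]]:
    """Level 4: pick the n VM-to-VM paths that carry the most bytes."""
    traffic: Counter = Counter()
    for src, dst, nbytes in flow_log:
        traffic[(src, dst)] += nbytes
    return [path for path, _ in traffic.most_common(n)]


placement = {"vm1": "h1", "vm2": "h1", "vm3": "h2", "vm4": "h2"}
pairs = pair_vms(placement)
hot = top_n_paths([("a", "b", 100), ("a", "c", 10), ("a", "b", 50), ("c", "d", 70)], 2)
```

For 1,000 VMs under one virtual switch, a fullmesh needs 499,500 undirected pairs while pairing needs 500, which illustrates why Level 1 keeps overhead bounded.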

4.3.2 Frequent topology updates

The figure above shows the process of topology update. A tenant update operation will affect all aspects of the virtual network, such as:

  • Virtual network instance, such as VPC, virtual switch, VM, etc.;

  • Intra-regional, cross-regional, Internet/IDC routing;

  • Others such as ACL, Internet bandwidth of each tenant, etc.

When the virtual network configuration changes, the changes are pushed to the probing topology analyzer. Because updates are so frequent, it is impractical to respond to each update as it arrives. Therefore, Zoonet uses message queues to buffer recently arrived updates and reads them in batches at regular intervals. When processing updates, we apply the following strategies to improve efficiency and accuracy:

  • Strategy 1: Remove topology-independent updates. Some tenant changes will not affect the topology of the virtual network, such as tenants adjusting the Internet bandwidth. Such updates are removed from the message queue.

  • Strategy 2: Remove add-del update pairs. Within a batch of updates read from the message queue, if the same instance, route, or other configuration is added and then deleted, or deleted and then added, we identify the two updates and remove them together.

  • Strategy 3: Aggregate updates at the VPC granularity. Probing topology analyzers subscribe to topology updates from device controllers. Device controllers are implemented in a distributed manner for high availability, which can cause updates to arrive out of order. For example, a probing topology analyzer might receive an update to create a VM in a VPC, but the corresponding VPC creation event might not be read from the queue until the next round, which would result in an incorrect update. Our solution is to aggregate all updates read from the queue at VPC granularity.
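The three strategies can be sketched as a single batch-compaction step. The `Update` record, the set of topology-relevant kinds, and the function name are assumptions for illustration; the source does not describe the update message format, nor how a VPC group is handled when one of its events only arrives in a later round.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumption: which update kinds change the probing topology.
TOPOLOGY_KINDS = {"vpc", "vswitch", "vm", "route", "acl"}


@dataclass(frozen=True)
class Update:
    op: str           # "add" or "del"
    kind: str         # e.g. "vm", "route", "bandwidth"
    vpc_id: str
    resource_id: str


def compact_batch(batch: List[Update]) -> Dict[str, List[Update]]:
    # Strategy 1: drop updates that do not affect the topology.
    relevant = [u for u in batch if u.kind in TOPOLOGY_KINDS]

    # Strategy 2: an add and a del of the same resource in one batch cancel out.
    pending: Dict[Tuple[str, str, str], Update] = {}
    for u in relevant:
        key = (u.kind, u.vpc_id, u.resource_id)
        prev = pending.get(key)
        if prev is not None and prev.op != u.op:
            del pending[key]
        else:
            pending[key] = u

    # Strategy 3: aggregate what is left at VPC granularity, so that all changes
    # of one VPC are applied together regardless of arrival order.
    by_vpc: Dict[str, List[Update]] = defaultdict(list)
    for u in pending.values():
        by_vpc[u.vpc_id].append(u)
    return dict(by_vpc)


batch = [
    Update("add", "bandwidth", "vpc-1", "eip-1"),  # topology-independent
    Update("add", "vm", "vpc-1", "vm-1"),
    Update("del", "vm", "vpc-1", "vm-1"),          # cancels the add above
    Update("add", "vm", "vpc-2", "vm-9"),          # arrives before its VPC event
    Update("add", "vpc", "vpc-2", "vpc-2"),
]
grouped = compact_batch(batch)
```

After compaction, vpc-1 contributes no work at all, and the out-of-order VM and VPC events of vpc-2 end up in the same group.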

4.3.3 Detection data collection and consumption

Zoonet uses the distributed stream processing framework Flink to process cloud-scale probe data in real time. By deploying more computing units, Zoonet can easily scale its data analyzers to handle growing workloads. Initially, we used a single unified user-defined function (UDF) to process data for all detection modes. However, the computational complexity of different detection modes differs widely. After analyzing the calling frequency and computing overhead of each detection mode, we designed a dedicated UDF for the Hop-by-Hop mode and simpler Flink SQL for the other modes. Separating the computing logic in this way reduced the computing cost of Zoonet by 75%.

5. Problems found in online deployment

Zoonet has been deployed on a large scale in the Alibaba Cloud production system for more than two years, and during this period it has helped us discover and fix many network problems. Below we share the 307 exception cases that Zoonet found within one month, which we divide into 6 broad categories:

5.1 Virtual network protocol stack error

Since physical network detection does not pass through the virtual network protocol stack, Zoonet detects many virtual network protocol stack errors that cannot be found by physical network detection methods. These errors can be subdivided into the following three categories:

  • Error on end host. Cloud networks are highly dynamic, generating tens of thousands of topology updates per hour in a region. When topology updates occur, end hosts also need to update their configuration tables. In actual deployments, Zoonet has helped detect many configuration inconsistencies between the control and data planes of end hosts, most of which affect traffic forwarding of tenant VMs. For example, Zoonet once detected VM packet loss due to bandwidth misconfiguration. We could find such errors by comparing table entries on each end host, but because the number of entries is large, the comparison takes a very long time. Zoonet helps us discover such problems in a more timely manner.

  • Error on middleware. Bugs are inevitably introduced during middleware development. Although we test extensively before middleware goes live, some gray failures, especially those that only involve specific tenants, virtual machines, or traffic, take a long time to locate, and in many cases are never found. With the Hop-by-Hop mode, Zoonet greatly improves the probability of finding gray failures by running a massive number of tasks with different parameters. In the last month, Zoonet helped us discover 10 middleware misconfigurations caused by SDN control program errors. These errors are only triggered by specific tenants and are extremely difficult to locate using traditional methods. For example, Zoonet detected that a tenant's SNAT hash function was misconfigured, preventing the tenant from establishing new SNAT sessions.

  • Error during virtual network upgrade. The cloud network provides flexible network services to tenants, and it is upgraded frequently to keep adapting to new services. A small number of virtual appliances use cold upgrades, which leave us enough time to verify that the upgrade was successful. But most virtual appliances use hot upgrades, where temporary business interruption or upgrade failure is sometimes unavoidable. We want to catch such problems early and reduce the damage by quickly rolling back or bypassing the point of failure. During an upgrade, we use continuous Zoonet in-band monitoring to reflect the current network service quality and respond quickly once problems are found.

5.2 Virtual network congestion

Tenants build mutually isolated network environments on the public cloud and run their applications independently. But virtual network congestion can break this isolation by exhausting the underlying shared resources. Of all network congestion situations, CPU overload is the most common. Many virtual devices forward packets on the CPU, and because single-core performance is limited (for example, 1 Mpps), a large burst of traffic may exhaust the CPU and cause virtual network congestion. While such issues are easily detected by monitoring CPU utilization, accurately assessing the scope of impact (e.g., how many tenants/VMs/services are affected) is an open question. Zoonet solves this problem by matching abnormal detection tasks, together with their tenant information, to the congested components. For example, in one live case, when a middleware's CPU cores were overloaded, Zoonet detected exactly 157 related tasks experiencing packet loss (3% of the total number of tasks on that middleware).
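Matching detection tasks to a congested component can be sketched as a simple filter over task results. The record layout, the loss threshold, and the function name are assumptions for illustration; the source only states that abnormal tasks and tenant information are matched to congested components.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    tenant: str
    path: Tuple[str, ...]   # vBoxes traversed by the probe
    loss_rate: float        # fraction of probe packets lost, 0.0 to 1.0


def impact_of(component: str, results: Iterable[TaskResult],
              loss_threshold: float = 0.0) -> Tuple[List[str], Set[str], float]:
    """Return lossy task ids, affected tenants, and the share of lossy tasks
    among all tasks that traverse the given component."""
    through = [r for r in results if component in r.path]
    lossy = [r for r in through if r.loss_rate > loss_threshold]
    share = len(lossy) / len(through) if through else 0.0
    return [r.task_id for r in lossy], {r.tenant for r in lossy}, share


results = [
    TaskResult("t1", "tenant-a", ("hv-1", "nat-gw-3", "hv-2"), 0.20),
    TaskResult("t2", "tenant-b", ("hv-1", "nat-gw-3", "hv-4"), 0.00),
    TaskResult("t3", "tenant-c", ("hv-5", "slb-1", "hv-6"), 0.50),
    TaskResult("t4", "tenant-a", ("hv-7", "nat-gw-3", "hv-8"), 0.05),
]
task_ids, tenants, share = impact_of("nat-gw-3", results)
```

The share returned here plays the same role as the 3% figure in the live case: it expresses how much of the traffic through the overloaded component was actually affected.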

5.3 Virtual routing exception

The virtual network advertises several types of virtual IP (VIP) to attract traffic. Specifically, a middleware cluster advertises its VIP to the rest of the cloud, and the public cloud advertises public IPs to ISPs. Normally, these VIP advertisements are transparent to tenants. But sometimes abnormal route advertisements (such as route configuration errors) affect tenant traffic forwarding. In one case, an abnormal VIP advertisement by the Internet service middleware cluster caused silent packet loss for a small number of tenants' traffic. Zoonet helps identify such anomalies by detecting obvious packet loss across multiple Internet instances.

5.4 Virtual link failure/physical network failure

Zoonet can distinguish between virtual node failures and virtual link failures, and in actual deployment it has helped detect many virtual link failures. Because a virtual link is carried over multiple physical switches and links, these failures can also be regarded as physical network failures. Physical network probing can detect them, but it cannot correlate them with tenants/VMs/services. For example, if physical network probing detects a ToR switch down event, it cannot confirm whether the event affects packet forwarding for a tenant's business, because the traffic of the failed ToR may have been silently switched to the backup ToR. In this scenario, if Zoonet detects obvious packet loss on the corresponding virtual link, it can infer with much more confidence that the ToR down event has affected the tenant's packet forwarding.

5.5 VM "last mile" anomalies

There are two main kinds of problems in the "last mile" between the VM and the hypervisor: VM exceptions and virtual-queue exceptions (the virtual queue is the queue between the VM and the hypervisor). Such problems are usually caused by the underlying protocol stack or by hidden bugs, and when they occur they can affect thousands of end hosts. We use ARP Ping to cover the "last mile". In actual deployment, we found that ARP Ping latency is unstable and is affected by the CPU usage of both the VM and the host, so we rely mainly on the ARP Ping packet loss rate to detect problems. The graph above shows the ARP Ping latency and packet loss rate during a VM out-of-memory event. The ARP Ping latency fluctuates between roughly 100 and 300 microseconds, while the packet loss rate reaches 100% exactly at the point of the anomaly.
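The choice to judge "last mile" health by loss rate rather than latency can be illustrated with a small sliding-window detector. This is a sketch only: the class name, the window size, and the threshold are assumptions, and the actual sending of ARP probes is left out.

```python
from collections import deque


class LossRateDetector:
    """Flags a VM when ARP Ping loss over a sliding window gets too high.

    The window size and threshold are illustrative assumptions.
    Latency is deliberately ignored because it is too noisy to alert on.
    """

    def __init__(self, window=10, loss_threshold=0.5):
        self.results = deque(maxlen=window)   # True means a reply arrived
        self.loss_threshold = loss_threshold

    def record(self, reply_received):
        """Store the outcome of one probe; old outcomes fall out of the window."""
        self.results.append(bool(reply_received))

    def loss_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def is_abnormal(self):
        # Only judge once the window is full, to avoid noise at startup.
        full = len(self.results) == self.results.maxlen
        return full and self.loss_rate() >= self.loss_threshold


# Made-up probe outcomes: two replies, then the VM stops answering.
detector = LossRateDetector(window=5, loss_threshold=0.8)
for ok in [True, True, False, False, False, False, False]:
    detector.record(ok)
```

After the seven probes above, the window holds only lost probes, so the detector reports a 100% loss rate and flags the VM.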

5.6 Proof of innocence

Many suspected network problems reported by tenants are actually caused by the tenant's traffic exceeding the purchased bandwidth/session quota, by the tenant's own configuration errors, or by problems in the tenant's own applications. We have also seen suspected network issues that were eventually diagnosed as VM issues or ISP network issues. Although these cases are not network problems, they account for a large share of our troubleshooting work orders and consume a lot of R&D manpower. By actively probing virtual network paths at large scale, Zoonet improves the cloud network's ability to proactively prove its innocence. We keep expanding the detection boundary so that this proof of innocence covers a broader range of cases.

6. Experience sharing

Experience 1: There are many corner cases for virtual network topology updates. When we first designed Zoonet, we considered events that may change the virtual topology, such as VM shutdown, VM release, and VM migration. However, the complexity of cloud services leads to many corner cases that we had never thought of. One example is the migration of a VM between VPCs, which we initially considered impossible because a VM was assumed to be permanently bound to its VPC. Another case we had not considered is that when a virtual machine is found to be launching a network attack (or to be under attack), its virtual IP is blocked. The figure above shows the proportion of abnormal detection tasks (i.e., detection noise) caused by incomplete corner-case coverage. Zoonet was first deployed at small scale and then at large scale. Detection noise grew rapidly with deployment size as hidden bugs were exposed, and after a month of bug fixing it finally converged.

Experience 2: Multiple optimizations of the Zoonet control plane. Virtual network configurations were originally stored in a relational database. Before computing the topology, Zoonet needs to read multiple relational tables, such as the VPC resource table, the VPC routing table, the VM-server mapping table, and the VM-pubIP mapping table. However, relational databases have limited read and write speeds and scale poorly. We therefore developed a virtual network topology cache backed by a key-value in-memory database for acceleration. We further scaled the Zoonet control plane using techniques such as region-based database sharding.
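The caching idea can be sketched as a read-through cache that keeps one key-value shard per region. This is a minimal sketch under stated assumptions: plain Python dictionaries stand in for the in-memory database, and `TopologyCache`, `load_from_db`, and the record fields are hypothetical names rather than Zoonet's actual interfaces.

```python
class TopologyCache:
    """Region-sharded key-value cache in front of a slow relational store."""

    def __init__(self, load_from_db):
        # load_from_db(region, vm_id) -> dict; stands in for the multi-table read.
        self._load_from_db = load_from_db
        self._shards = {}   # region -> {vm_id: topology record}

    def get(self, region, vm_id):
        """Return the cached record, loading it from the database on a miss."""
        shard = self._shards.setdefault(region, {})
        if vm_id not in shard:
            shard[vm_id] = self._load_from_db(region, vm_id)
        return shard[vm_id]

    def invalidate(self, region, vm_id):
        """Drop a record after a topology change (e.g., VM migration)."""
        self._shards.get(region, {}).pop(vm_id, None)


db_reads = []


def fake_db_loader(region, vm_id):
    # Hypothetical loader that records every call to show the cache effect.
    db_reads.append((region, vm_id))
    return {"vm": vm_id, "server": "server-42", "vpc": "vpc-1"}


cache = TopologyCache(fake_db_loader)
cache.get("region-a", "vm-1")          # miss: reads the database
cache.get("region-a", "vm-1")          # hit: served from memory
reads_before_invalidate = len(db_reads)
cache.invalidate("region-a", "vm-1")
cache.get("region-a", "vm-1")          # miss again after invalidation
```

Keeping a separate shard per region means a topology change in one region never touches another region's entries, which is one reason region-based sharding is a natural fit here.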

Experience 3: Automated fault diagnosis through the diagnostic tree. Zoonet runs regular network detection and sends an alarm to network operators when it detects a network anomaly. Network operators are responsible for analyzing alarms and locating root causes. We were overwhelmed by an "alert storm" early in the deployment, so we started writing automated troubleshooting scripts. Gradually, the script collection grew into a diagnostic tree in which each leaf node contains a root cause. When an anomaly is found, the system first traverses the tree to find the matching leaf node and then notifies the network operator of the root cause. The benefits are huge: for example, the number of people on the VPC team working on network troubleshooting dropped from 11 to 1.5.
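A diagnostic tree of this kind can be sketched as a depth-first walk in which every node runs a check and every leaf carries a root cause. The node layout, the checks, and the alarm fields below are hypothetical examples, not the production tree.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DiagNode:
    """One node of the diagnostic tree; leaves carry a root cause."""
    name: str
    check: Callable[[dict], bool]          # does this branch apply to the alarm?
    root_cause: Optional[str] = None       # set on leaf nodes only
    children: List["DiagNode"] = field(default_factory=list)


def diagnose(node, alarm):
    """Depth-first walk; returns the first matching leaf's root cause, or None."""
    if not node.check(alarm):
        return None
    if not node.children:
        return node.root_cause
    for child in node.children:
        cause = diagnose(child, alarm)
        if cause is not None:
            return cause
    return None


# A tiny made-up tree with two leaves under a "packet loss" root.
tree = DiagNode("packet loss", lambda a: a["loss_rate"] > 0, children=[
    DiagNode("quota", lambda a: a["bandwidth_used"] > a["bandwidth_quota"],
             root_cause="tenant traffic exceeds purchased bandwidth"),
    DiagNode("cpu", lambda a: a["gateway_cpu"] > 0.9,
             root_cause="gateway CPU overload"),
])

alarm = {"loss_rate": 0.03, "bandwidth_used": 80,
         "bandwidth_quota": 100, "gateway_cpu": 0.97}
cause = diagnose(tree, alarm)
```

Each troubleshooting script becomes one `check` function, so adding a newly discovered root cause only means attaching a new leaf to the right branch.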

7. Summary

Alibaba Cloud Luoshen Cloud Network set out to solve two major problems in the actual operation of the cloud network: how to prove that a problem is not a network problem, and how to quickly discover and locate network problems that affect only a small scope. Zoonet adopts a real-time, always-on, lightweight detection method and relies on broad support for its detection protocols across many virtual network nodes to address both problems. The system was built by dozens of R&D engineers and has been fully verified on a large-scale cloud network. We hope that sharing Zoonet provides a useful reference for both industry and academia.

Further reading

Original Zoonet paper:

https://dl.acm.org/doi/pdf/10.1145/3555050.3569116
