The Grand View of Network Virtualization Technology

Network virtualization builds a virtual network whose topology differs from that of the underlying physical network. For example, a company with offices all over the world may want its internal network to behave as a single whole; that requires network virtualization technology.

Start with NAT


Suppose a machine in the Beijing office has IP 10.0.0.1 (an intranet IP that cannot be used on the Internet), a machine in the Shanghai office has IP 10.0.0.2, and they need to communicate over the Internet. The public (Internet) IP of the Beijing office is 1.1.1.1, and that of the Shanghai office is 2.2.2.2.

A simple approach: at the Beijing office's border router, rewrite the source IP of outgoing packets from 10.0.0.1 to 1.1.1.1 and the destination IP from 10.0.0.2 to 2.2.2.2; for incoming packets, rewrite the destination IP from 1.1.1.1 back to 10.0.0.1 and the source IP from 2.2.2.2 back to 10.0.0.2. The Shanghai office's border router performs a similar translation. Now 10.0.0.1 and 10.0.0.2 can communicate, completely unaware of the Internet and of the address translation in between. This is basic NAT (Network Address Translation).


There is a serious problem with this approach, however. Suppose a machine with intranet IP 10.0.0.3 is added to the Shanghai office. No matter what the Beijing office does, the Shanghai border router receives packets whose destination IP is 2.2.2.2. Should they go to 10.0.0.2 or 10.0.0.3? This kind of bug looks simple but is easy for designers to overlook: when designing a network topology or protocol, you must consider not only how packets go out but also how the replies come back in. With simple NAT, every additional intranet machine requires an additional public IP on the border router.

Public IPs are precious, so NAPT (Network Address and Port Translation) came into being; what Linux calls NAT is actually NAPT. Outgoing and incoming connections must be considered separately. For incoming connections, NAPT's basic assumption is that two machines sharing the same public IP do not provide the same service. For example, if 10.0.0.2 provides HTTP service and 10.0.0.3 provides HTTPS service, the Shanghai border router can be configured as: packets with destination IP 2.2.2.2 and destination port 80 (HTTP) go to 10.0.0.2; those with destination port 443 (HTTPS) go to 10.0.0.3. This is DNAT (Destination NAT).

For outgoing connections, things are a little more complicated. Suppose 10.0.0.2 initiates a connection to 10.0.0.1 with source port 20000 and destination port 80, and 10.0.0.3 also initiates a connection to 10.0.0.1, with source port 30000 and the same destination port 80. When a reply packet from the Beijing office arrives at the Shanghai border router, its source port is 80 and its destination port is 20000. If the border router keeps no connection state, it obviously cannot know where to forward the packet. In other words, the border router must maintain a table:

Source IP | Source Port | New Source Port | Destination IP | Destination Port
10.0.0.2  | 20000       | 20000           | 10.0.0.1       | 80
10.0.0.3  | 30000       | 30000           | 10.0.0.1       | 80

When a reply packet arrives, the router checks its source port (80) and destination port (20000), matches the first record, and knows the packet should go to 10.0.0.2. Why the "New Source Port" column? If 10.0.0.2 and 10.0.0.3 each initiate a TCP connection to the same destination IP and port using the same source port, the reply packets of the two connections would be indistinguishable. In that case the border router must assign different source ports, and the packet actually sent out carries the new source port. Network address translation for outgoing connections is called SNAT (Source NAT).
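The table-driven behavior above can be sketched in a few lines of Python. This is a minimal illustration with made-up names (`Napt`, `translate_out`, `translate_in`); real implementations such as Linux conntrack also track protocol, timeouts, and TCP state.

```python
# Minimal sketch of a NAPT (SNAT) connection-tracking table.

class Napt:
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 20000   # next "new source port" to hand out
        self.out = {}            # (src_ip, src_port, dst_ip, dst_port) -> new source port
        self.back = {}           # (remote_ip, remote_port, new_port) -> (src_ip, src_port)

    def translate_out(self, src_ip, src_port, dst_ip, dst_port):
        key = (src_ip, src_port, dst_ip, dst_port)
        if key not in self.out:
            new_port = self.next_port
            self.next_port += 1
            self.out[key] = new_port
            self.back[(dst_ip, dst_port, new_port)] = (src_ip, src_port)
        # the outgoing packet leaves with (public IP, new source port)
        return self.public_ip, self.out[key]

    def translate_in(self, src_ip, src_port, dst_port):
        # reply packet: look up by remote endpoint plus our translated port
        return self.back.get((src_ip, src_port, dst_port))
```

Even when two intranet hosts pick the same source port, the router hands out distinct new ports, so replies remain distinguishable.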

IP-in-IP Tunneling

NAPT requires that two machines sharing a public IP not provide the same service, which is often unacceptable: for example, we often need to SSH or remote-desktop into each individual machine. Tunneling technology came into being to solve this. The simplest Layer 3 tunneling technology is IP-in-IP.

(Figure: IP-in-IP encapsulation. The original IP packet is prefixed with new outer headers.)

As shown in the figure above, the black-on-white text is the original IP packet and the white-on-blue text is the added header, typically attached at the sender's border router. The added header consists of a Layer 2 (link layer) header followed by a Layer 3 (network layer) header, so the whole packet is a legal IP packet that can be routed across the Internet. When the receiver's border router sees the IP-in-IP flag in the outer header (IP protocol number 0x04, not shown in the figure), it knows this is an IP-in-IP tunnel; when it further sees that the Public DIP is itself, it knows it is time to decapsulate. After decapsulation, the original packet (Private SIP, Private DIP) is exposed and routed to the corresponding machine on the intranet.
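The encapsulation step is little more than prepending an outer IPv4 header whose protocol field is 4. A self-contained sketch (function names are ours; a real router also handles fragmentation, TTL policy, and options):

```python
import struct

def ipv4_header(src, dst, payload_len, proto):
    """Build a 20-byte IPv4 header; proto=4 marks IP-in-IP."""
    src_b = bytes(int(x) for x in src.split("."))
    dst_b = bytes(int(x) for x in dst.split("."))
    hdr = struct.pack("!BBHHHBBH4s4s",
                      0x45, 0,            # version 4, IHL 5; TOS
                      20 + payload_len,   # total length
                      0, 0,               # identification, flags/fragment offset
                      64, proto, 0,       # TTL, protocol, checksum placeholder
                      src_b, dst_b)
    # header checksum: one's-complement sum of the ten 16-bit words
    s = sum(struct.unpack("!10H", hdr))
    s = (s & 0xFFFF) + (s >> 16)
    s = (s & 0xFFFF) + (s >> 16)
    csum = ~s & 0xFFFF
    return hdr[:10] + struct.pack("!H", csum) + hdr[12:]

def ip_in_ip_encap(public_sip, public_dip, inner_packet):
    # the border router simply prepends an outer header with protocol 4
    return ipv4_header(public_sip, public_dip, len(inner_packet), 4) + inner_packet
```

Decapsulation is the mirror image: strip the first 20 bytes once the protocol field reads 4 and the destination is the router itself.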

IP-in-IP tunneling, however, is not enough:

  1. Suppose you use an IP-in-IP tunnel to build a LAN with the same network address and subnet mask at both ends. Without manually configured ARP tables on the clients, the clients (not the routers at the tunnel ends) cannot ping each other. Before sending a ping packet (ICMP echo request), the system must obtain the peer's MAC address via the ARP protocol in order to fill in the link-layer header correctly, but an IP-in-IP tunnel carries only IPv4 packets, not ARP packets (IPv4 and ARP are different Layer 3 protocols). So the clients must configure ARP tables manually, or the routers must answer ARP on their behalf, which complicates network configuration.
  2. A data center often hosts more than one customer. Suppose two customers each create a virtual network and both use the intranet IP 10.0.0.1. If they share the same Public IP, it is impossible to tell which customer an incoming IP-in-IP packet belongs to.
  3. Load balancing generally hashes the five-tuple of the packet header (source IP, destination IP, Layer 4 protocol, source port, destination port) and selects the target machine by the hash value, which ensures that packets of the same connection always go to the same machine. If an ordinary load-balancing device receives an IP-in-IP packet and does not understand the IP-in-IP protocol, it cannot parse the inner Layer 4 protocol and port numbers, and can only hash on the Public SIP and Public DIP. Since the Public DIP is generally the same, the source IP is the only variable, and hash uniformity is hard to guarantee.
The first problem shows that the encapsulated packets are not necessarily IP packets. The second shows that additional identifying information may need to be carried. The third shows that the added header is not necessarily an IP header. This is why network virtualization technologies bloom in many varieties instead of one size fitting all.
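The third problem can be demonstrated with a toy five-tuple hash (CRC32 here stands in for whatever hash a real device uses; the function name is ours). When the device cannot see the inner ports, every flow between the two public IPs collapses into one bucket:

```python
import zlib

def five_tuple_hash(sip, dip, proto, sport, dport, n_targets):
    """Toy load-balancer hash over the five-tuple, mapped to a target index."""
    key = f"{sip}|{dip}|{proto}|{sport}|{dport}".encode()
    return zlib.crc32(key) % n_targets

# IP-in-IP: ports are invisible, so every flow hashes identically
degraded = {five_tuple_hash("1.1.1.1", "2.2.2.2", 4, 0, 0, 8)}

# plain TCP: varying source ports spread flows across targets
spread = {five_tuple_hash("1.1.1.1", "2.2.2.2", 6, p, 80, 8)
          for p in range(20000, 20100)}
```

With only the outer IPs as input, all tunneled traffic between two sites lands on a single backend, which is exactly the uniformity problem described above.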

Classification of Network Virtualization Technology

To understand a network virtualization technology, look mainly at the format of the packets in the tunnel.

  • The outermost layer is the encapsulation layer. Since the packet must be transmitted on the physical network, it must be a legal Layer 2 packet, so the outermost header is always a MAC header. When we say the encapsulation layer is N, we mean that headers for layers 2 through N are added.
  • In the middle is an optional shim layer, which carries additional information and flags, such as a Tenant ID to identify different customers' virtual networks, and Entropy to improve hash uniformity.
  • The inner layer is the packet actually sent by the client; this layer determines what the virtual network looks like to the client. For example, the inner layer of an IP-in-IP tunnel is an IPv4 packet, so from the customer's point of view the virtual network is an IPv4 network that can run TCP, UDP, ICMP, or any other Layer 4 protocol. When we say the virtual network is Layer N, we mean that layers 2 through N-1 of the client's packets are not transmitted (though those layers may influence the encapsulation layer, i.e. which tunnel the packet enters). A lower-layer virtual network (closer to the physical layer) is not necessarily better, because lower-layer protocols are harder to optimize, as we will see later.

According to the format of the packets in the tunnel, the common network virtualization technologies (I count some tunneling technologies in this category) can be briefly classified as follows. (PPP and MAC below are Layer 2 protocols, IP is a Layer 3 protocol, and TCP and UDP are Layer 4 protocols.)

(Table: common tunneling protocols classified by encapsulation layer and payload layer.)

It can be seen that nearly every reasonable combination of encapsulation layer and payload has a corresponding protocol. So statements like "just add a Layer 2 header to GRE and you can..." are meaningless: change the encapsulation layer or the payload and you have a different protocol. The following sections take several protocols as examples to show why protocols at different layers exist, that is, what problems each solves.

GRE vs. IP-in-IP

The GRE (Generic Routing Encapsulation) protocol adds a middle (shim) layer to IP-in-IP, containing a 32-bit GRE Key (usable as a Tenant ID or as Entropy), an optional sequence number, and other information. The GRE Key solves the second problem of the IP-in-IP tunnel above, allowing different customers to share the same physical network and the same set of physical machines, which is important in data centers.
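The shim layer itself is tiny. A sketch of building a GRE header with the Key (and optional sequence number) present, following the RFC 2890 layout; the function name and constants are illustrative:

```python
import struct

GRE_KEY_PRESENT = 0x2000   # K bit in the 16-bit flags/version field
GRE_SEQ_PRESENT = 0x1000   # S bit
ETH_P_IP = 0x0800          # inner protocol type: IPv4

def gre_header(key, seq=None):
    """GRE shim header carrying a 32-bit Key, usable as a Tenant ID."""
    flags = GRE_KEY_PRESENT | (GRE_SEQ_PRESENT if seq is not None else 0)
    hdr = struct.pack("!HHI", flags, ETH_P_IP, key)
    if seq is not None:
        hdr += struct.pack("!I", seq)
    return hdr
```

On the wire this shim sits between the outer IP header (protocol 47, GRE) and the inner IP packet.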


NVGRE vs. GRE

The network virtualized by GRE is an IP network, which means IPv6 and ARP packets cannot be carried in a GRE tunnel. The IPv6 problem is easy to solve: just change the Protocol Type in the GRE header. The ARP problem is not so simple. An ARP request is a broadcast packet ("Who has 192.168.0.1? Tell 00:00:00:00:00:01"), which reflects an essential difference between Layer 2 and Layer 3 networks: Layer 2 networks support broadcast domains. A broadcast domain defines which hosts should receive a given broadcast packet; VLANs are a common way to implement broadcast domains.

Of course, IP also supports broadcast, but packets sent to a Layer 3 broadcast address (such as 192.168.0.255) are still delivered to the Layer 2 broadcast address (ff:ff:ff:ff:ff:ff), so Layer 3 broadcast is realized through the Layer 2 broadcast mechanism. Making ARP work inside a GRE tunnel is not impossible, but people generally do not do it.

In order to support all existing and possible future Layer 3 protocols, and to support the broadcast domain, the customer's virtual network needs to be a Layer 2 network. NVGRE and VXLAN are two of the most well-known Layer 2 network virtualization protocols.

Compared with GRE, NVGRE (Network Virtualization using GRE) makes only two essential changes:

  • The inner payload is a Layer 2 Ethernet frame rather than a Layer 3 IP packet. Note that the FCS (Frame Check Sequence) at the end of the inner Ethernet frame is removed, since the encapsulation layer already has a checksum and computing another would increase system load (if done by the CPU).
  • The GRE Key in the shim layer is split into two parts: the first 24 bits serve as the Tenant ID and the last 8 bits as Entropy.

Given NVGRE, why still use GRE? Historical and political reasons aside, the lower the layer of the virtual network, the harder it is to optimize.
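The Key split in the second change can be sketched in a few lines (the names VSID and FlowID follow NVGRE terminology; the functions themselves are illustrative):

```python
def nvgre_key(tenant_id, entropy):
    """Pack the 32-bit GRE Key as NVGRE does: 24-bit VSID (Tenant ID)
    in the high bits, 8-bit FlowID (Entropy) in the low bits."""
    assert tenant_id < (1 << 24) and entropy < (1 << 8)
    return (tenant_id << 8) | entropy

def nvgre_split(key):
    """Recover (tenant_id, entropy) from a packed NVGRE key."""
    return key >> 8, key & 0xFF
```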

  • If the virtual network is Layer 2, MAC addresses are generally scattered, so a forwarding rule must be inserted for each individual host, which is a problem at large scale. If the virtual network is Layer 3, IP addresses can be assigned according to the network topology so that hosts adjacent in the network are in the same subnet (this is exactly what the Internet does); the router then needs only prefix matching on network address and subnet mask, eliminating a large number of forwarding rules.

  • If the virtual network is Layer 2, packets such as ARP broadcasts are flooded to the entire virtual network, so a Layer 2 network (the LAN we usually speak of) generally cannot be too large. If the virtual network is Layer 3, this problem does not exist, because IP addresses are assigned hierarchically.
  • If the virtual network is Layer 2, switches rely on a spanning tree protocol to avoid loops. If the virtual network is Layer 3, multiple paths between routers can be fully exploited to increase bandwidth and redundancy. A typical data-center network topology is shown below (image source).

(Figure: Data-Center-Design)

If the virtual network's layer is even higher and the payload no longer includes a network layer, it generally cannot be called a "virtual network", but it still belongs to the category of tunneling. SOCKS5 is such a protocol, with TCP or UDP as payload. Its configuration is more flexible than IP-based tunnels: for example, you can send port 80 (HTTP) through one tunnel and port 443 (HTTPS) through another. The -D (dynamic forwarding) parameter of ssh turns ssh into a SOCKS5 proxy (-L does fixed local port forwarding). The disadvantage of SOCKS5 is that it cannot carry any Layer 3 protocol, such as ICMP (SOCKS4 does not even support UDP, which makes DNS troublesome to handle).
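To make the "Layer 4 payload" concrete, here is a minimal sketch of the SOCKS5 CONNECT request from RFC 1928, using the domain-name address type (the function name is ours). Note that the request names a host and a TCP port, not an IP packet, which is exactly why Layer 3 protocols like ICMP cannot pass through:

```python
import struct

def socks5_connect_request(host, port):
    """SOCKS5 CONNECT request (RFC 1928): VER=5, CMD=1 (CONNECT),
    RSV=0, ATYP=3 (domain name), then length-prefixed name and port."""
    name = host.encode()
    return (struct.pack("!BBBBB", 5, 1, 0, 3, len(name))
            + name
            + struct.pack("!H", port))
```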

VXLAN vs. NVGRE

Although NVGRE has an 8-bit Entropy field, a load-balancing device that does not understand the NVGRE protocol cannot see it: the device looks for the usual five-tuple of source IP, destination IP, Layer 4 protocol, source port, and destination port, but inside GRE there are no Layer 4 ports to parse, so the entropy never comes into play.

The solution of VXLAN (Virtual Extensible LAN) is this: besides the MAC and IP layers, the encapsulation layer adds a UDP layer, using the UDP source port as entropy and the UDP destination port as the VXLAN protocol identifier. A load-balancing device therefore does not need to understand VXLAN at all; it simply hashes the packet on the normal UDP five-tuple.


Above: the packet format after VXLAN encapsulation (image source)

The VXLAN shim layer is slightly simpler than GRE's: it still uses 24 bits as the Tenant ID, with no Entropy bits. The network device or OS virtualization layer that adds the header generally derives the encapsulation layer's UDP source port from the inner payload's source port. Since the OS initiating the connection chooses source ports sequentially or randomly, and the hash inside network devices is typically an XOR, the resulting hash uniformity is generally good.
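A sketch of the 8-byte VXLAN header and the outer UDP port choice, following the RFC 7348 layout (I flag set, 24-bit VNI, reserved bytes zero). The entropy function here is illustrative: real devices hash the inner frame's headers, CRC32 over the first bytes merely stands in for that:

```python
import struct, zlib

VXLAN_PORT = 4789   # IANA-assigned UDP destination port identifying VXLAN

def vxlan_header(vni):
    """8-byte VXLAN header: flags byte 0x08 (VNI valid), 3 reserved
    bytes, 24-bit VNI, 1 trailing reserved byte."""
    return struct.pack("!BBBB", 0x08, 0, 0, 0) + struct.pack("!I", vni << 8)

def outer_udp_ports(inner_frame):
    """Source port = entropy derived from the inner flow (sketch);
    destination port = the fixed VXLAN identifier."""
    sport = 49152 + (zlib.crc32(inner_frame[:64]) % 16384)  # ephemeral range
    return sport, VXLAN_PORT
```

Any UDP-aware load balancer now sees a well-formed five-tuple with a varying source port, so the hash spreads flows without the device knowing VXLAN exists.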

STT vs. VXLAN

STT (Stateless Transport Tunneling) is a network virtualization protocol proposed in 2012 and still in draft state. At first glance, STT merely replaces VXLAN's UDP with TCP. In fact, packets encapsulated by STT and by VXLAN look very different when captured on the wire.

Why does STT use TCP? STT only borrows the TCP shell; it does not use the TCP state machine at all, let alone acknowledgment, retransmission, or congestion control. What STT wants to borrow are the LSO (Large Send Offload) and LRO (Large Receive Offload) mechanisms of modern NICs. LSO lets the sender generate TCP packets up to 64KB (or even longer); the NIC hardware splits the TCP payload of the large packet and replicates the MAC, IP, and TCP headers to form small packets that can be sent at Layer 2 (e.g. 1518 bytes on Ethernet, or 9K bytes with Jumbo Frames enabled). LRO lets the receiver merge several small packets of the same TCP connection into one large packet before raising a NIC interrupt and handing it to the operating system.

As shown in the figure below, before the payload is sent, an STT Frame Header (shim layer) and MAC, IP, and TCP-like headers (encapsulation layer) are added. The NIC's LSO mechanism then splits the large TCP packet into small pieces, prepends a copy of the encapsulation headers to each piece, and sends them out.

(Figure: STT encapsulation and LSO segmentation.)
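The split-and-copy behavior of LSO can be sketched as follows; the 54-byte header length and the MSS value are illustrative assumptions, and only the segmentation itself is modeled (a real NIC also fixes up lengths, sequence numbers, and checksums per segment):

```python
def lso_segment(headers, big_payload, mss=1460):
    """LSO-style segmentation sketch: split a large payload into
    MSS-sized chunks and prepend a copy of the encapsulation headers
    to each chunk. For STT, only the first chunk's payload begins
    with the STT Frame Header."""
    return [headers + big_payload[i:i + mss]
            for i in range(0, len(big_payload), mss)]
```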

We know that user/kernel mode switches and NIC interrupts consume a lot of CPU time; the performance of network programs (such as firewalls and intrusion detection systems) is often measured in pps (packets per second) rather than bps (bits per second). So when transferring large amounts of data, larger packets mean a lighter load on the system.

The biggest problem with STT is that it is hard for network devices to enforce per-customer (Tenant ID) policies. As shown in the figure above, only the first small packet of a large packet carries the STT Frame Header; the subsequent small packets contain nothing that identifies the customer. If I want to limit a customer's traffic from the Hong Kong data center to the Chicago data center to at most 1 Gbps, intermediate devices cannot do it. With other network virtualization protocols, every packet carries information identifying the customer, so such a policy can be configured on the border router (provided, of course, that the border router understands the protocol).

Epilogue

This article took IP-in-IP, GRE, NVGRE, VXLAN, STT, and other protocols as examples to survey the flourishing landscape of network virtualization technologies. When learning a network virtualization technology, first figure out its encapsulation layer, the layer of the virtual network, and the information carried in the shim layer; compare it with similar protocols; and only then look at details such as flag bits, QoS, and encryption. When selecting a network virtualization technology, also weigh the degree of support in operating systems and network equipment, as well as the scale, topology, and traffic characteristics of the customer network.
