How to use the tun virtual NIC, and how not to

Tun devices are mostly used to implement network forwarding devices in user space, such as tunnel endpoints, routers and NAT gateways. But precisely because it simulates a forwarding device, its programming model is very different from the server programming model, in which the program is the origin and destination of the data.

Let's take a look at the simplest tun program model:
[Figure: the simplest tun program model, one thread polling the tun fd and the TCP/UDP socket fd]

The tun character device fd and the TCP/UDP socket fd are polled as equals: data read from one fd is processed and then written to the other.
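Something like the following is enough to express this model. It is a minimal sketch: the peer address 192.168.56.101 and port 4789 are made up for illustration, and error handling is mostly omitted.

	#include <fcntl.h>
	#include <poll.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <arpa/inet.h>
	#include <unistd.h>
	#include <linux/if.h>
	#include <linux/if_tun.h>

	static int tun_open(const char *name)
	{
		struct ifreq ifr;
		int fd = open("/dev/net/tun", O_RDWR);

		if (fd < 0)
			return -1;
		memset(&ifr, 0, sizeof(ifr));
		ifr.ifr_flags = IFF_TUN | IFF_NO_PI;	/* layer-3 tun, raw IP packets */
		strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
		if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
			close(fd);
			return -1;
		}
		return fd;
	}

	int main(void)
	{
		char buf[2048];
		struct sockaddr_in peer = {
			.sin_family = AF_INET,
			.sin_port = htons(4789),				/* hypothetical tunnel port */
			.sin_addr.s_addr = inet_addr("192.168.56.101"),	/* hypothetical peer */
		};
		int tfd = tun_open("tun0");
		int sfd = socket(AF_INET, SOCK_DGRAM, 0);
		struct pollfd pfd[2] = { { tfd, POLLIN, 0 }, { sfd, POLLIN, 0 } };
		ssize_t n;

		connect(sfd, (struct sockaddr *)&peer, sizeof(peer));
		while (poll(pfd, 2, -1) > 0) {
			if (pfd[0].revents & POLLIN) {	/* packet leaving via tun: encapsulate */
				n = read(tfd, buf, sizeof(buf));
				if (n > 0)
					send(sfd, buf, n, 0);
			}
			if (pfd[1].revents & POLLIN) {	/* packet arriving from the tunnel: inject */
				n = recv(sfd, buf, sizeof(buf), 0);
				if (n > 0)
					write(tfd, buf, n);
			}
		}
		return 0;
	}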

But this line of thinking is wrong. If you think the performance is too low, the tun device also has a multi-queue mode: you can open one tun virtual NIC as multiple file descriptors, as shown below:
[Figure: multi-queue tun, one virtual NIC opened as multiple file descriptors]
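Opening the extra queues is just a matter of the IFF_MULTI_QUEUE flag: every TUNSETIFF call with the same interface name attaches one more queue and hands back its fd. A minimal sketch, assuming the name tun0 and four queues:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/if.h>
	#include <linux/if_tun.h>

	#define NQUEUES 4	/* assumed queue count */

	int main(void)
	{
		struct ifreq ifr;
		int fds[NQUEUES];
		int i;

		memset(&ifr, 0, sizeof(ifr));
		/* layer-3 tun, raw IP packets, multiple queues */
		ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_MULTI_QUEUE;
		strncpy(ifr.ifr_name, "tun0", IFNAMSIZ - 1);

		for (i = 0; i < NQUEUES; i++) {
			fds[i] = open("/dev/net/tun", O_RDWR);
			if (fds[i] < 0 || ioctl(fds[i], TUNSETIFF, &ifr) < 0) {
				perror("tun queue");
				return 1;
			}
			printf("queue %d -> fd %d\n", i, fds[i]);
		}
		/* each fd can now be read and written by its own thread */
		pause();
		return 0;
	}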

But it is still wrong. What went wrong? Below I optimize step by step, starting from the monolithic structure.

The so-called monolithic structure asks: without considering concurrency, how can a single flow (a single 5-tuple) traverse the forwarding path simulated by tun in the most efficient way?

The biggest difference between processing data as a forwarding device and as an end device is:

  • A forwarding device must process the two directions of the same flow in different threads, to preserve full duplex.
  • An end device should preferably process the same connection in the same thread, to maximize cache utilization.

So we split the tun multi-queue into two directions, sending (TX) and receiving (RX), and process the two file descriptors separately. Accordingly, TCP/UDP reception from the network has to be handled in a separate thread:
[Figure: tun RX and TX queues handled by separate threads, network reception in its own thread]
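A rough sketch of this split, with one thread per direction. The three fds are assumed to have been opened as in the sketches above, and the framing of packets over the TCP byte stream is glossed over here:

	#include <pthread.h>
	#include <sys/socket.h>
	#include <unistd.h>

	/* assumed to be set up elsewhere: two queues of the tun NIC plus the tunnel socket */
	int tun_rx_fd, tun_tx_fd, tcp_fd;

	/* direction 1: packets read from the tun RX queue are sent into the tunnel */
	static void *tun_to_net(void *arg)
	{
		char buf[2048];
		ssize_t n;

		while ((n = read(tun_rx_fd, buf, sizeof(buf))) > 0)
			send(tcp_fd, buf, n, 0);	/* real code must frame packets on the stream */
		return NULL;
	}

	/* direction 2: data received from the tunnel is injected through the tun TX queue */
	static void *net_to_tun(void *arg)
	{
		char buf[2048];
		ssize_t n;

		while ((n = recv(tcp_fd, buf, sizeof(buf), 0)) > 0)
			write(tun_tx_fd, buf, n);
		return NULL;
	}

	int main(void)
	{
		pthread_t t[2];

		pthread_create(&t[0], NULL, tun_to_net, NULL);
		pthread_create(&t[1], NULL, net_to_tun, NULL);
		pthread_join(t[0], NULL);
		pthread_join(t[1], NULL);
		return 0;
	}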

However, there is still a problem: TCP sending and receiving are not independent.

Let me talk briefly about the TCP full-duplex problem.

TCP is designed as a bidirectional, full-duplex protocol. Does that mean that, on the same TCP connection, the bandwidth can be saturated in both directions at once?

Let's first look at a standard TCP pipe:
[Figure: a standard TCP pipe]

To reach the bandwidth limit in both directions, the ACKs must be piggybacked on the data stream flowing the opposite way, as in the following figure:
[Figure: ACKs piggybacked on the reverse data stream]

This is clearly allowed by the protocol specification, and it is a state that can in principle be reached. But TCP is actually used through a socket. TCP full duplex means the protocol itself can send and receive at the same time; through the socket, however, data only goes into the send buffer and comes out of the receive buffer, and a single thread cannot call send() and recv() at the same moment. Unless the send buffer always has enough data and the ACKs arrive smoothly, it is hard to keep the pipe full in both directions.

I tested it with the following program:

...
	// csd, buffer, rbuffer and threads[] are set up in the elided part above
	pthread_create(&threads[0], NULL, receiver, NULL);
	pthread_create(&threads[1], NULL, sender, NULL);

	sleep(10000);
	return 0;
}

// sending direction: keep pushing 1400-byte packets into the socket
void *sender(void *arg)
{
	while (1) {
		send(csd, buffer, 1400, 0);
	}
	return NULL;
}

// receiving direction: keep draining the same socket, non-blocking
void *receiver(void *arg)
{
	while (1) {
		recv(csd, rbuffer, 1400, MSG_DONTWAIT);
	}
	return NULL;
}

Compared with the one-way test, the bandwidth in each direction of the two-way test is roughly halved. The size 1400 is used here because the tun device handles whole network packets; it does not move data in and out of the socket buffers as a byte stream. Of course, you can also run a streaming test with netcat:

# one direction
nc -l -p 1234 >/dev/null
pv /dev/zero |nc -w 1 192.168.56.101 1234 >/dev/null
# both directions
nc -l -p 1234 </dev/zero >/dev/null
pv /dev/zero |nc -w 1 192.168.56.101 1234 >/dev/null

A UDP socket does not have this problem. For TCP, if you want to solve it, you can establish two independent connections, one responsible only for sending (TX) and the other only for receiving (RX), each paired with the corresponding tun character device fd:
[Figure: two TCP connections, one dedicated to TX and one to RX, paired with the tun fds]
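A minimal sketch of the two-connection idea. The server address and the ports 1234/1235 are assumptions, and the two fds would then be handed to the per-direction threads shown earlier:

	#include <arpa/inet.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	static int tcp_connect(const char *ip, int port)
	{
		struct sockaddr_in sa;
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		memset(&sa, 0, sizeof(sa));
		sa.sin_family = AF_INET;
		sa.sin_port = htons(port);
		sa.sin_addr.s_addr = inet_addr(ip);
		if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
			close(fd);
			return -1;
		}
		return fd;
	}

	int main(void)
	{
		int tx_sd = tcp_connect("192.168.56.101", 1234);	/* carries tun RX -> network */
		int rx_sd = tcp_connect("192.168.56.101", 1235);	/* carries network -> tun TX */

		/* hand tx_sd to the thread reading the tun RX queue, and rx_sd to the
		 * thread writing the tun TX queue, as in the per-direction sketch above */
		(void)tx_sd;
		(void)rx_sd;
		return 0;
	}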

For a single flow, this is the correct approach. If this is a tunnel server with many clients connecting, the buffering capacity of the single-threaded path also needs to be improved:
[Figure: tunnel server with buffer queues between the tun queues and the TCP sockets]

This amounts to building one buffer queue in front of all the tun RX queues, and another buffer queue in front of all the TCP RX sockets.
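Such a buffer queue can be as simple as a fixed-size ring protected by a mutex and two condition variables. A sketch with arbitrary sizes: the threads draining the tun RX queues call pktq_push(), and the threads owning the TCP sockets call pktq_pop() and send.

	#include <pthread.h>
	#include <string.h>

	#define QLEN	256
	#define PKTSZ	2048

	struct pktq {
		char		pkt[QLEN][PKTSZ];
		int		len[QLEN];
		int		head, tail, count;
		pthread_mutex_t	lock;
		pthread_cond_t	nonempty, nonfull;
	};

	struct pktq tunrx_q = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.nonempty = PTHREAD_COND_INITIALIZER,
		.nonfull = PTHREAD_COND_INITIALIZER,
	};

	/* producer side: called by the threads that read the tun RX queues */
	void pktq_push(struct pktq *q, const void *data, int len)
	{
		pthread_mutex_lock(&q->lock);
		while (q->count == QLEN)
			pthread_cond_wait(&q->nonfull, &q->lock);
		memcpy(q->pkt[q->tail], data, len);
		q->len[q->tail] = len;
		q->tail = (q->tail + 1) % QLEN;
		q->count++;
		pthread_cond_signal(&q->nonempty);
		pthread_mutex_unlock(&q->lock);
	}

	/* consumer side: called by the threads that own the TCP TX sockets */
	int pktq_pop(struct pktq *q, void *data)
	{
		int len;

		pthread_mutex_lock(&q->lock);
		while (q->count == 0)
			pthread_cond_wait(&q->nonempty, &q->lock);
		len = q->len[q->head];
		memcpy(data, q->pkt[q->head], len);
		q->head = (q->head + 1) % QLEN;
		q->count--;
		pthread_cond_signal(&q->nonfull);
		pthread_mutex_unlock(&q->lock);
		return len;
	}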

Next come the details of queue selection: which tun RX queue is chosen when the kernel xmits a packet, and which tun TX queue is chosen when the application writes the fd:
[Figure: queue selection on the kernel xmit path (tun RX) and on the application write path (tun TX)]

For the same flow, the same queue should always be selected. This relies on the steering mechanism of the multi-queue tun NIC, to which you attach an eBPF program that performs the queue selection. The mechanism is described in detail here:
https://lore.kernel.org/patchwork/patch/858162/
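As a sketch of what such a steering program can look like, assuming the interface added by that patch (a socket-filter-type eBPF program attached with the TUNSETSTEERINGEBPF ioctl, whose return value the kernel takes modulo the number of queues) and a libbpf-style build. The XOR hash and the fixed four queues are placeholders; a real program would parse the header properly and hash the full 5-tuple.

	// steering_bpf.c - compile with: clang -O2 -target bpf -c steering_bpf.c -o steering_bpf.o
	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	SEC("socket")
	int tun_steer(struct __sk_buff *skb)
	{
		__u32 saddr = 0, daddr = 0;

		/* on a layer-3 tun device the packet starts at the IP header, so the
		 * IPv4 source/destination addresses sit at offsets 12 and 16 */
		bpf_skb_load_bytes(skb, 12, &saddr, sizeof(saddr));
		bpf_skb_load_bytes(skb, 16, &daddr, sizeof(daddr));

		/* hash keyed on the addresses, so the same flow always lands on the same queue */
		return (saddr ^ daddr) & 0x3;
	}

	char _license[] SEC("license") = "GPL";

After loading the object, userspace attaches it with something like ioctl(tun_fd, TUNSETSTEERINGEBPF, &prog_fd).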

Although the arrangement above is already close to perfect, programmers steeped in the server programming framework find it hard to accept the idea of creating two TCP connections that send and receive independently, so the real-world compromise is:
[Figure: the compromise, a single TCP connection shared by both directions]

To cope with high concurrency, multithread the monolithic structure described above: start several threads, each running the single-flow logic, and you can build a high-performance, highly concurrent tun forwarding device.

Okay, the thread relationships are settled. Next, pin the CPUs (a small affinity sketch follows the list below):

  • The transfer from the physical NIC's RSS queue into the tun queue is pinned to the same CPU.
  • The thread processing a tun RX queue is pinned to the sibling (hyper-thread) CPU of the CPU that enqueued it.
  • The thread that takes packets received over TCP and writes them to the tun TX fd is pinned to the same CPU that handled the TCP receive.
  • The softirq that processes packets coming out of the tun TX queue runs on the sibling CPU of the tun TX write thread.
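The pinning itself is one call per thread with pthread_setaffinity_np(); which CPU is whose hyper-thread sibling has to be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list and is not shown in this sketch.

	#define _GNU_SOURCE
	#include <pthread.h>
	#include <sched.h>

	/* pin a thread to one specific CPU */
	int pin_thread(pthread_t tid, int cpu)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		return pthread_setaffinity_np(tid, sizeof(set), &set);
	}

	int main(void)
	{
		/* e.g. pin the current thread to CPU 0; each worker thread gets its own CPU */
		return pin_thread(pthread_self(), 0);
	}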

Of course, to use the buffers to smooth out production and consumption, any of the hops above from one file descriptor to another must be able to stop at the buffer and return, to be resumed later, so a relay model similar to Nginx's is a must.
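Concretely, that means every write must be able to stop at a full buffer and be resumed later instead of blocking the thread. A sketch of such a resumable write, with a made-up relay struct:

	#include <errno.h>
	#include <unistd.h>

	struct relay {
		char	buf[2048];	/* packet currently being forwarded */
		int	len;		/* total bytes to write */
		int	off;		/* bytes already written */
	};

	/* returns 1 when the packet has been fully written, 0 when it must be resumed
	 * after the destination fd becomes writable again (e.g. signalled by epoll) */
	int relay_flush(struct relay *r, int dst_fd)
	{
		while (r->off < r->len) {
			ssize_t n = write(dst_fd, r->buf + r->off, r->len - r->off);

			if (n < 0) {
				if (errno == EAGAIN || errno == EWOULDBLOCK)
					return 0;	/* buffer full: park here and come back later */
				return -1;		/* real error */
			}
			r->off += n;
		}
		return 1;
	}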

Next, let's look at another typical tun application: NAT64.

tayga is a NAT64 toy; its structure is as follows:
[Figure: tayga's single-threaded NAT64 forwarding structure]

This structure cares only about the data forwarding logic, which is very wrong: it serializes the forwarding of the two directions of a flow, and it also throws away parallel processing across multiple CPUs.

NAT64 needs at least the multi-queue feature of the tun NIC:
[Figure: NAT64 on a multi-queue tun NIC]

The following is the correct monolithic structure for NAT64:
[Figure: the correct NAT64 monolithic structure]

Multithread the NAT64 monolithic structure above and you get a high-performance NAT64 gateway. (Is that really so? Not necessarily; I will write a separate article about it later...)

I often say that the mindset of forwarding-logic programming is completely different from that of server programming. In short:

  • Server programming handles high concurrency in terms of connections.
  • Forwarding-logic programming handles high concurrency in terms of the payload flows carried by the connections.

If you look at forwarding logic with a server-programming mindset, several fds may carry the same payload flow, and how the different fds of one payload flow interact becomes crucial. One careless step and a flow that was full duplex degenerates into half duplex and gets distorted.

I also often say that forwarding logic is well suited to DPDK/XDP, but not to a server framework. DPDK works directly on the data-stream payload: the programmer knows what he is doing and how to do it. An application server, by contrast, deals with abstract file descriptors; the programmer can only read and write those descriptors, and deciding when to read and write requires lower-level knowledge rather than higher-level business logic.

A line we have all heard until our ears grow calluses is that the same CPU should handle the same connection, to guarantee the cache hit rate. For forwarding, however, this is simply wrong: the two directions of a logical flow should be handled on different CPUs, precisely so that the full-duplex nature of the data stream is not destroyed. Right...

Regrettably, the vast majority of open-source software that uses tun today does not process the data streams in the correct way. This probably comes from the gap between writing code and understanding the network; contrary to what most people believe, knowing the implementation details of a protocol stack in software or hardware is by no means the same as understanding the network. Nowadays anyone can write "proficient in data-plane forwarding" on a resume while not knowing how to use a tun device, which is a pity. Of course, what I say here has not attracted much attention and may not be widely accepted, but isn't that exactly confirmation from the negative side?

Does seeing a tun NIC mean you must build a multi-threaded, multi-queue setup? Far from it!

To prove this, let me take my own case and dissect my own stupidity, to illustrate the barrier between code and the network.


The leather shoes in Wenzhou, Zhejiang are wet, so they won’t get fat in the rain.


Origin blog.csdn.net/dog250/article/details/115019766