Can't miss! 5 pictures to help you understand the principles of container networking

Working with containers has always felt like magic. Containers are great for those who understand the underlying principles, but a nightmare for those who don't. Fortunately, we have already spent a lot of time studying container technology, and even revealed that containers are just isolated and restricted Linux processes, that running a container does not require an image, and that, conversely, building an image requires running some containers.

Now is the time to tackle container networking. Or more precisely, single-host container networking problems. This article will answer these questions:

  • How to virtualize network resources so that containers think they have an exclusive network?

  • How to let the containers coexist peacefully without interfering with each other and communicate with each other?

  • How can I access the outside world (e.g., the Internet) from inside the container?

  • How can a container on a machine be accessed from the outside world (e.g., port publishing)?

As will become clear, single-host container networking is nothing more than a simple combination of well-known Linux features:

  • Network Namespace (namespace)

  • Virtual Ethernet device (veth)

  • Virtual network switch (bridge)

  • IP Routing and Network Address Translation (NAT)

  • And no code is needed to make this networking magic happen...

Prerequisites

Any Linux distribution will do. The examples in this article were all executed in a Vagrant CentOS 8 virtual machine:

$ vagrant init centos/8
$ vagrant up
$ vagrant ssh

[vagrant@localhost ~]$ uname -a
Linux localhost.localdomain 4.18.0-147.3.1.el8_1.x86_64

For simplicity, this article does not rely on a full containerization solution (e.g., Docker or Podman). Instead, we focus on the basic concepts and use the simplest tools to achieve our learning goals.

Network namespaces isolate containers

What makes up the Linux networking stack? Obviously, a set of network devices. What else? Probably also a set of routing rules. And don't forget the netfilter hooks, including those defined by iptables rules.

We can quickly put together a simple script, inspect-net-stack.sh, to check all of this:

#!/usr/bin/env bash

echo "> Network devices"
ip link

echo -e "\n> Route table"
ip route

echo -e "\n> Iptables rules"
iptables --list-rules

Before running the script, let's modify the iptables rules by adding a custom chain, so the root namespace's rule set is easy to recognize later:

$ sudo iptables -N ROOT_NS

After that, running the script on the host machine produces the following output:

$ sudo ./inspect-net-stack.sh
> Network devices
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:e3:27:77 brd ff:ff:ff:ff:ff:ff

> Route table
default via 10.0.2.2 dev eth0 proto dhcp metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100

> Iptables rules
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N ROOT_NS

We are interested in these outputs because we want to ensure that each container we are about to create has its own separate network stack.

As you probably already know, one of the Linux namespaces used for container isolation is the network namespace. From man ip-netns: "A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices." For simplicity, this is the only namespace used in this article. Instead of creating completely isolated containers, we limit the scope to the network stack.

One way to create network namespaces is the ip tool, which is part of iproute2:

$ sudo ip netns add netns0
$ ip netns
netns0

How do we use the namespace we just created? A very handy command is nsenter: it enters one or more of the specified namespaces and then executes the given program:

$ sudo nsenter --net=/var/run/netns/netns0 bash
# the newly created bash process now lives in netns0
$ sudo ./inspect-net-stack.sh
> Network devices
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

> Route table

> Iptables rules
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT

From the output above, it is clear that the bash process is running in the netns0 namespace, which is a completely different network stack. There are no routing rules, no custom iptables chain, only a loopback network device.

 

[Figure 1: the root network namespace and the newly created, isolated netns0 namespace]

Use a virtual Ethernet device (veth) to connect the container to the host

An isolated network stack is not very useful if we cannot communicate with it. Fortunately, Linux provides a handy tool for this: the virtual Ethernet device. From man veth: "veth devices are virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to a physical network device in another namespace, but can also be used as standalone network devices."

Virtual Ethernet devices always come in interconnected pairs. Don't worry, it's easier to see with the creation command:

$ sudo ip link add veth0 type veth peer name ceth0

With this single command, we created a pair of interconnected virtual Ethernet devices. veth0 and ceth0 are the names we chose for them.

$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:e3:27:77 brd ff:ff:ff:ff:ff:ff
5: ceth0@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 66:2d:24:e3:49:3f brd ff:ff:ff:ff:ff:ff
6: veth0@ceth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 96:e8:de:1d:22:e0 brd ff:ff:ff:ff:ff:ff

Both veth0 and ceth0 are created in the host's network stack (also known as the root network namespace). To connect the netns0 namespace to the root namespace, we leave one device in the root namespace and move the other into netns0:

$ sudo ip link set ceth0 netns netns0

# list all devices; ceth0 has disappeared from the root stack
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:e3:27:77 brd ff:ff:ff:ff:ff:ff
6: veth0@if5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 96:e8:de:1d:22:e0 brd ff:ff:ff:ff:ff:ff link-netns netns0

Once the devices are brought up and assigned appropriate IP addresses, a packet generated on one device immediately appears on its peer, connecting the two namespaces. Starting from the root namespace:

$ sudo ip link set veth0 up
$ sudo ip addr add 172.18.0.11/16 dev veth0

Then in netns0:

$ sudo nsenter --net=/var/run/netns/netns0
$ ip link set lo up
$ ip link set ceth0 up
$ ip addr add 172.18.0.10/16 dev ceth0
$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
5: ceth0@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 66:2d:24:e3:49:3f brd ff:ff:ff:ff:ff:ff link-netnsid 0

 

[Figure 2: a veth pair (veth0 / ceth0) connecting the netns0 namespace to the root namespace]

Check connectivity:

# ping root's veth0 from inside netns0
$ ping -c 2 172.18.0.11
PING 172.18.0.11 (172.18.0.11) 56(84) bytes of data.
64 bytes from 172.18.0.11: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 172.18.0.11: icmp_seq=2 ttl=64 time=0.040 ms

--- 172.18.0.11 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 58ms
rtt min/avg/max/mdev = 0.038/0.039/0.040/0.001 ms

# leave netns0
$ exit

# ping ceth0 from the root namespace
$ ping -c 2 172.18.0.10
PING 172.18.0.10 (172.18.0.10) 56(84) bytes of data.
64 bytes from 172.18.0.10: icmp_seq=1 ttl=64 time=0.073 ms
64 bytes from 172.18.0.10: icmp_seq=2 ttl=64 time=0.046 ms

--- 172.18.0.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 3ms
rtt min/avg/max/mdev = 0.046/0.059/0.073/0.015 ms

At the same time, trying to reach any other address from the netns0 namespace does not succeed:

# in the root namespace
$ ip addr show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:e3:27:77 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0
       valid_lft 84057sec preferred_lft 84057sec
    inet6 fe80::5054:ff:fee3:2777/64 scope link
       valid_lft forever preferred_lft forever

# remember, the host IP here is 10.0.2.15
$ sudo nsenter --net=/var/run/netns/netns0

# try to ping the host's eth0
$ ping 10.0.2.15
connect: Network is unreachable

# try to reach the Internet
$ ping 8.8.8.8
connect: Network is unreachable

This is easy to understand: there is simply no route for such packets in netns0's routing table. The only entry there describes how to reach the 172.18.0.0/16 network:

# in the netns0 namespace:
$ ip route
172.18.0.0/16 dev ceth0 proto kernel scope link src 172.18.0.10

Linux has several ways of populating routing tables. One of them is to derive routes directly from network interfaces. Remember, the routing table in netns0 was empty right after the namespace was created. But then we added the ceth0 device and assigned it the IP address 172.18.0.10/16. Because we are not using a bare IP address but an address plus a subnet mask, the network stack can extract routing information from it: every packet destined for the 172.18.0.0/16 network is sent out through the ceth0 device, while all other packets are dropped. Similarly, the root namespace gained a new route:

# in the root namespace:
$ ip route
# ... irrelevant lines omitted ...
172.18.0.0/16 dev veth0 proto kernel scope link src 172.18.0.11

Here, the first question can be answered. We saw how to isolate, virtualize and connect the Linux network stack.

Use a virtual network switch (bridge) to connect containers

The driving force behind containerization is efficient resource sharing, so it is uncommon to run only one container on a machine. Instead, the goal is to run as many isolated processes as possible on a shared environment. So what happens if we put several containers on the same host using the veth scheme above? Let's try adding a second container.

# from the root namespace
$ sudo ip netns add netns1
$ sudo ip link add veth1 type veth peer name ceth1
$ sudo ip link set ceth1 netns netns1
$ sudo ip link set veth1 up
$ sudo ip addr add 172.18.0.21/16 dev veth1

$ sudo nsenter --net=/var/run/netns/netns1
$ ip link set lo up
$ ip link set ceth1 up
$ ip addr add 172.18.0.20/16 dev ceth1

Ouch! Something is wrong... netns1 has a problem. It cannot reach the root namespace, and it cannot be reached from the root namespace either. However, because both containers sit in the same IP subnet, 172.18.0.0/16, the host's veth1 is reachable from the netns0 container.

It takes a little time to find the cause, but it is clearly a routing problem. First check the routing table of the root namespace:

$ ip route
# ... irrelevant lines omitted ...
172.18.0.0/16 dev veth0 proto kernel scope link src 172.18.0.11
172.18.0.0/16 dev veth1 proto kernel scope link src 172.18.0.21

After adding the second veth pair, root's network stack learned the new route 172.18.0.0/16 dev veth1 proto kernel scope link src 172.18.0.21, but a route for that network already existed. When the second container tries to ping the veth1 device, the first route is selected, which breaks connectivity. If we delete the first route (sudo ip route delete 172.18.0.0/16 dev veth0 proto kernel scope link src 172.18.0.11) and recheck connectivity, the problem simply moves: netns1 can now connect, but netns0 no longer can.
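A minimal sketch of that experiment, run from the root namespace with the setup above still in place (expect the second command to succeed and the third to fail):

$ sudo ip route delete 172.18.0.0/16 dev veth0
$ sudo nsenter --net=/var/run/netns/netns1 ping -c 1 172.18.0.21   # now works
$ sudo nsenter --net=/var/run/netns/netns0 ping -c 1 172.18.0.11   # now fails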

[Figure 3: two veth pairs in the same 172.18.0.0/16 subnet, causing conflicting routes in the root namespace]

If we had chosen a different subnet for netns1, everything would work. However, multiple containers on the same subnet is a perfectly reasonable use case, so we need to adjust the veth scheme.

And don't forget about Linux bridges, yet another network virtualization technology! A Linux bridge behaves like a network switch: it forwards packets between the interfaces connected to it. And since it is a switch, it does this forwarding at L2 (i.e., Ethernet).

Let's try this tool. But first, we need to clear the existing setup, because some of the previous configuration is no longer needed. Delete the network namespaces and veth devices:

$ sudo ip netns delete netns0
$ sudo ip netns delete netns1
$ sudo ip link delete veth0
$ sudo ip link delete ceth0
$ sudo ip link delete veth1
$ sudo ip link delete ceth1

Quickly rebuild the two containers. Note that we do not assign any IP addresses to the new veth0 and veth1 devices:

$ sudo ip netns add netns0
$ sudo ip link add veth0 type veth peer name ceth0
$ sudo ip link set veth0 up
$ sudo ip link set ceth0 netns netns0

$ sudo nsenter --net=/var/run/netns/netns0
$ ip link set lo up
$ ip link set ceth0 up
$ ip addr add 172.18.0.10/16 dev ceth0
$ exit

$ sudo ip netns add netns1
$ sudo ip link add veth1 type veth peer name ceth1
$ sudo ip link set veth1 up
$ sudo ip link set ceth1 netns netns1

$ sudo nsenter --net=/var/run/netns/netns1
$ ip link set lo up
$ ip link set ceth1 up
$ ip addr add 172.18.0.20/16 dev ceth1
$ exit

Make sure there are no new routes on the host:

$ ip route
default via 10.0.2.2 dev eth0 proto dhcp metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100

Finally, create the bridge interface:

$ sudo ip link add br0 type bridge
$ sudo ip link set br0 up

Connect veth0 and veth1 to the bridge:

$ sudo ip link set veth0 master br0
$ sudo ip link set veth1 master br0
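To double-check that both ends really became bridge ports, we can list the interfaces enslaved to br0. This is an optional sanity check, not part of the original walkthrough; both commands are standard iproute2:

$ ip link show master br0
# veth0 and veth1 should be listed here
$ bridge link show
# shows the same ports together with their bridge state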

[Figure 4: veth0 and veth1 attached as ports of the br0 bridge, with ceth0/ceth1 inside the containers]

Check connectivity between the containers:

$ sudo nsenter --net=/var/run/netns/netns0
$ ping -c 2 172.18.0.20
PING 172.18.0.20 (172.18.0.20) 56(84) bytes of data.
64 bytes from 172.18.0.20: icmp_seq=1 ttl=64 time=0.259 ms
64 bytes from 172.18.0.20: icmp_seq=2 ttl=64 time=0.051 ms

--- 172.18.0.20 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 2ms
rtt min/avg/max/mdev = 0.051/0.155/0.259/0.104 ms

$ sudo nsenter --net=/var/run/netns/netns1
$ ping -c 2 172.18.0.10
PING 172.18.0.10 (172.18.0.10) 56(84) bytes of data.
64 bytes from 172.18.0.10: icmp_seq=1 ttl=64 time=0.037 ms
64 bytes from 172.18.0.10: icmp_seq=2 ttl=64 time=0.089 ms

--- 172.18.0.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 36ms
rtt min/avg/max/mdev = 0.037/0.063/0.089/0.026 ms

Very good! Everything works. With this new scheme, we don't need to configure veth0 and veth1 at all: only the two IP addresses on the ceth0 and ceth1 endpoints are required. And because both endpoints are attached to the same Ethernet segment (remember, they are connected to a virtual switch), they are connected at L2:

$ sudo nsenter --net=/var/run/netns/netns0
$ ip neigh
172.18.0.20 dev ceth0 lladdr 6e:9c:ae:02:60:de STALE
$ exit

$ sudo nsenter --net=/var/run/netns/netns1
$ ip neigh
172.18.0.10 dev ceth1 lladdr 66:f3:8c:75:09:29 STALE
$ exit
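We can also look at this from the switch's point of view: the bridge keeps a forwarding database of the MAC addresses it has learned on each port. An optional way to peek at it (standard iproute2 command):

$ bridge fdb show br br0
# expect the ceth0/ceth1 MAC addresses learned on the veth0/veth1 ports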

Great, we learned how to turn containers into neighbors, so that they don't interfere with each other, but they can still communicate.

Connect to the outside world (IP routing and address masquerading)

The containers can communicate with each other. But can they talk to the host, i.e., the root namespace?

$ sudo nsenter --net=/var/run/netns/netns0
$ ping 10.0.2.15   # eth0 address
connect: Network is unreachable

The problem is obvious: netns0 has no route for that destination:

$ ip route
172.18.0.0/16 dev ceth0 proto kernel scope link src 172.18.0.10

The root namespace cannot reach the containers either:

# first, leave netns0 with exit:
$ ping -c 2 172.18.0.10
PING 172.18.0.10 (172.18.0.10) 56(84) bytes of data.
From 213.51.1.123 icmp_seq=1 Destination Net Unreachable
From 213.51.1.123 icmp_seq=2 Destination Net Unreachable

--- 172.18.0.10 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 3ms

$ ping -c 2 172.18.0.20
PING 172.18.0.20 (172.18.0.20) 56(84) bytes of data.
From 213.51.1.123 icmp_seq=1 Destination Net Unreachable
From 213.51.1.123 icmp_seq=2 Destination Net Unreachable

--- 172.18.0.20 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 3ms

To establish connectivity between root and the container namespace, we need to assign an IP address to the bridge network interface:

$ sudo ip addr add 172.18.0.1/16 dev br0

Once the bridge interface has an IP address, there will be an additional route in the host's routing table:

$ ip route
# ... irrelevant lines omitted ...
172.18.0.0/16 dev br0 proto kernel scope link src 172.18.0.1

$ ping -c 2 172.18.0.10
PING 172.18.0.10 (172.18.0.10) 56(84) bytes of data.
64 bytes from 172.18.0.10: icmp_seq=1 ttl=64 time=0.036 ms
64 bytes from 172.18.0.10: icmp_seq=2 ttl=64 time=0.049 ms

--- 172.18.0.10 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 11ms
rtt min/avg/max/mdev = 0.036/0.042/0.049/0.009 ms

$ ping -c 2 172.18.0.20
PING 172.18.0.20 (172.18.0.20) 56(84) bytes of data.
64 bytes from 172.18.0.20: icmp_seq=1 ttl=64 time=0.059 ms
64 bytes from 172.18.0.20: icmp_seq=2 ttl=64 time=0.056 ms

--- 172.18.0.20 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 4ms
rtt min/avg/max/mdev = 0.056/0.057/0.059/0.007 ms

The containers can probably ping the bridge interface now, but they still cannot reach the host's eth0. For that, a default route needs to be added inside the containers:

$ sudo nsenter --net=/var/run/netns/netns0
$ ip route add default via 172.18.0.1
$ ping -c 2 10.0.2.15
PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data.
64 bytes from 10.0.2.15: icmp_seq=1 ttl=64 time=0.036 ms
64 bytes from 10.0.2.15: icmp_seq=2 ttl=64 time=0.053 ms

--- 10.0.2.15 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 14ms
rtt min/avg/max/mdev = 0.036/0.044/0.053/0.010 ms

# apply the same configuration to netns1
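For completeness, the "same configuration for netns1" mentioned in the comment above would look roughly like this (a sketch; leave netns0 first):

$ exit
$ sudo nsenter --net=/var/run/netns/netns1
$ ip route add default via 172.18.0.1
$ ping -c 2 10.0.2.15   # should now succeed as well
$ exit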

This change essentially turns the host into a router, with the bridge interface acting as the default gateway for the containers.

 

[Figure 5: the br0 bridge with IP 172.18.0.1 acting as the default gateway for both containers]

Great, we have attached the containers to the root namespace. Now let's try to connect them to the outside world. On Linux, packet forwarding (i.e., the routing functionality) is disabled by default, so we need to enable it first:

# in the root namespace
sudo bash -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'
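If you want to double-check that forwarding is really on, the same flag can be read back via sysctl:

$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1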

Check connectivity again:

$ sudo nsenter --net=/var/run/netns/netns0
$ ping 8.8.8.8
# hangs...

Still not working. What went wrong? Even if the container could send packets to the outside world, the target server would not be able to send packets back, because the container's IP address is private: routing rules for that specific IP are known only to the local network. Besides, many containers in the world share exactly the same private IP address, 172.18.0.10. The solution to this problem is called Network Address Translation (NAT). Before reaching the external network, packets sent by the container get their source IP address replaced with the host's external network address. The host also keeps track of all existing mappings and restores the previously replaced IP address before forwarding reply packets back to the container. It sounds complicated, but there is good news! The iptables module lets us do all of this with a single command:

$ sudo iptables -t nat -A POSTROUTING -s 172.18.0.0/16 ! -o br0 -j MASQUERADE

The command is fairly simple. It adds a new rule to the POSTROUTING chain of the nat table, asking it to masquerade all packets that originate from the 172.18.0.0/16 network but are not going out through the bridge interface.
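Before moving on, it can be reassuring to confirm that the rule landed in the nat table as expected (a quick check; your output may differ slightly):

$ sudo iptables -t nat -S POSTROUTING
-P POSTROUTING ACCEPT
-A POSTROUTING -s 172.18.0.0/16 ! -o br0 -j MASQUERADE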

Check connectivity:

$ sudo nsenter --net=/var/run/netns/netns0
$ ping -c 2 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=61 time=43.2 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=61 time=36.8 ms

--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 2ms
rtt min/avg/max/mdev = 36.815/40.008/43.202/3.199 ms

Be aware that the default policy we rely on here allows all traffic, which is quite dangerous in a real environment. The host's default iptables policy is ACCEPT:

$ sudo iptables -S
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT

Docker, by contrast, restricts all traffic by default and then enables routing only for known paths.

The following rules are generated by the Docker daemon on a CentOS 8 machine when a single container publishes port 5005:

$ sudo iptables -t filter --list-rules
-P INPUT ACCEPT
-P FORWARD DROP
-P OUTPUT ACCEPT
-N DOCKER
-N DOCKER-ISOLATION-STAGE-1
-N DOCKER-ISOLATION-STAGE-2
-N DOCKER-USER
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 5000 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN

$ sudo iptables -t nat --list-rules
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P POSTROUTING ACCEPT
-P OUTPUT ACCEPT
-N DOCKER
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 5000 -j MASQUERADE
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A DOCKER -i docker0 -j RETURN
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 5005 -j DNAT --to-destination 172.17.0.2:5000

$ sudo iptables -t mangle --list-rules
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT

$ sudo iptables -t raw --list-rules
-P PREROUTING ACCEPT
-P OUTPUT ACCEPT

Make the container accessible to the outside world (port publishing)

We all know it is possible to publish a container's ports on some (or all) of the host's interfaces. But what exactly does port publishing mean?

Assume a server is running inside the container:

$ sudo nsenter --net=/var/run/netns/netns0
$ python3 -m http.server --bind 172.18.0.10 5000

If we try to send an HTTP request to this server from the host, everything works fine (the root namespace is connected to all container interfaces, so of course it succeeds):

# from the root namespace
$ curl 172.18.0.10:5000
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
# ... irrelevant lines omitted ...

But if you want to access this server from the outside, which IP should you use? The only IP we know is the host's external interface address on eth0:

$ curl 10.0.2.15:5000
curl: (7) Failed to connect to 10.0.2.15 port 5000: Connection refused

So we need a way to forward all packets arriving at port 5000 on the host's eth0 to the destination 172.18.0.10:5000. iptables to the rescue again!

# external traffic
sudo iptables -t nat -A PREROUTING -d 10.0.2.15 -p tcp -m tcp --dport 5000 -j DNAT --to-destination 172.18.0.10:5000

# local traffic (it does not pass through the PREROUTING chain)
sudo iptables -t nat -A OUTPUT -d 10.0.2.15 -p tcp -m tcp --dport 5000 -j DNAT --to-destination 172.18.0.10:5000
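If the forwarding does not seem to work later, one optional way to debug is to watch the packet counters of these DNAT rules:

$ sudo iptables -t nat -L PREROUTING -n -v
# the pkts/bytes columns of the DNAT rule should grow with every external connection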

Additionally, iptables needs to be able to intercept traffic on the bridged network:

sudo modprobe br_netfilter
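The module also exposes sysctls that control whether bridged traffic is handed to iptables; if the test below fails, it is worth verifying that the following flag is 1 (it normally is once br_netfilter is loaded):

$ sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1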

Test it:

$ curl 10.0.2.15:5000
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
# ... irrelevant lines omitted ...

Understanding Docker network drivers

How can we apply this knowledge? For example, we can try to make sense of the Docker network modes [1].

Start with the --network host mode. Try comparing the output of ip link with that of sudo docker run -it --rm --network host alpine ip link: they are almost identical! In host mode, Docker simply does not use network namespace isolation; the container works in the root network namespace and shares the network stack with the host.
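A minimal sketch of that comparison, assuming Docker is installed on the host:

# interfaces as seen by the host
$ ip link
# interfaces as seen by a container in host mode -- expect (almost) the same list
$ sudo docker run -it --rm --network host alpine ip link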

The next mode is --network none. The output of sudo docker run -it --rm --network none alpine ip link shows only a loopback interface. This is very similar to the freshly created network namespace we saw earlier, before any veth devices were added.

Finally, there is the --network bridge (default) mode. This is exactly the scheme we built by hand above. You can play with the ip and iptables commands to observe the network stack from the host's and the container's points of view.
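For example, the host side of the default bridge network could be inspected roughly like this (a sketch; it assumes Docker with its standard docker0 bridge, and the container name probe is just a placeholder):

# start a throwaway container on the default bridge network
$ sudo docker run -d --rm --name probe alpine sleep 3600
# the container's veth peer shows up as a port of docker0
$ ip link show master docker0
# and the masquerading rule for 172.17.0.0/16 sits in the nat table
$ sudo iptables -t nat -S POSTROUTING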

Rootless containers and networking

A nice feature of the Podman container manager is its focus on rootless containers. However, as you have probably noticed, this article uses a lot of sudo: the network cannot be configured this way without root privileges. When running as root, Podman's networking scheme [2] is very similar to Docker's. For rootless containers, however, Podman relies on the slirp4netns [3] project:

Starting with Linux 3.8, unprivileged users can create network_namespaces(7) at the same time as user_namespaces(7). However, unprivileged network namespaces are not very useful, because creating a veth(4) between the host and the network namespace still requires root privileges

slirp4netns can connect a network namespace to the Internet in a completely unprivileged way, by connecting a TAP device inside the network namespace to a user-mode TCP/IP stack (slirp).

Rootless networking is quite limited: "Technically, the container itself does not have an IP address, because without root privileges network device association cannot be achieved. Moreover, pinging from a rootless container does not work, because it lacks the CAP_NET_RAW capability that the ping command requires." But it is still better than no connectivity at all.

Conclusion

The container networking scheme presented in this article is just one of the possible schemes (though probably the most widely used one). There are many other options, implemented by official or third-party plugins, but all of these solutions rely heavily on Linux network virtualization technologies [4]. That is why containerization can rightfully be considered a virtualization technology.

 
