Linux kernel protocol stack bind system call

Table of Contents

1. Bind overview

2. Port information management

2.1 Port information inet_bind_bucket

2.2 Bind port information hash table inet_bind_hashbucket

3. Bind kernel implementation (tcp)

3.1 sys_bind ()

3.2 inet_bind ()

3.3 inet_csk_get_port() (core)

3.3.1 Dynamic port range

3.3.2 Port and socket mapping inet_bind_hash

3.3.3 Port multiplexing


1. Bind overview

An application binds a socket to a local address with the bind() system call. The address consists of an L3 IP address and an L4 port. The application may specify only one of the two and leave the other for the kernel to select automatically. For the usage of the bind system call itself, see "Socket Programming: Bind Function Description".
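
As a rough user-space illustration of "the other is automatically selected by the kernel", the sketch below binds a TCP socket to 127.0.0.1 with port 0 and then reads back the port the kernel picked. The function name bind_any_port is made up for this example and the loopback address is just a convenient assumption.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a TCP socket to 127.0.0.1 with port 0 so that the kernel picks the
 * port. Returns the chosen port in host byte order, or -1 on error. */
int bind_any_port(void)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int port = -1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(0);	/* 0: let the kernel select a port */

	/* getsockname() reports the address and port actually bound */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
	    getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
		port = ntohs(addr.sin_port);

	close(fd);
	return port;
}
```

Calling bind_any_port() from a small main() and printing the result should show a port from the dynamic range discussed in section 3.3.1.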

We do not dwell on how the IP address is bound, since there is little to it: the kernel verifies that the address is valid and saves it in the relevant data structures. The focus here is on how the L4 port is bound.

2. Port information management

Generally speaking, a port cannot be allocated to two sockets at the same time (port reuse is a separate, more complicated topic and is not considered here). Doing so could break the uniqueness of the five-tuple and mix up the data being sent and received. TCP therefore has to keep track of the ports that have been allocated, so that during a bind it can quickly decide whether the bind can succeed.
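
The effect is easy to observe from user space. The following is a minimal sketch, assuming a Linux host with a loopback interface and no reuse options set on either socket; second_bind_errno is a name invented for this example.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind one socket to a kernel-chosen port on 127.0.0.1, then try to bind a
 * second socket to the same address and port without any reuse option.
 * Returns the errno of the second bind(), 0 if it unexpectedly succeeded,
 * or -1 if the setup itself failed. */
int second_bind_errno(void)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int result = -1;
	int fd1 = socket(AF_INET, SOCK_STREAM, 0);
	int fd2 = socket(AF_INET, SOCK_STREAM, 0);

	if (fd1 < 0 || fd2 < 0)
		goto out;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(0);

	if (bind(fd1, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
	    getsockname(fd1, (struct sockaddr *)&addr, &len) != 0)
		goto out;

	/* addr now holds the port the kernel assigned to fd1 */
	if (bind(fd2, (struct sockaddr *)&addr, sizeof(addr)) == 0)
		result = 0;
	else
		result = errno;
out:
	if (fd1 >= 0)
		close(fd1);
	if (fd2 >= 0)
		close(fd2);
	return result;
}
```

The second bind() is expected to be rejected with EADDRINUSE, which is the error inet_bind() returns when the transport protocol's get_port() fails.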

2.1 Port information inet_bind_bucket

The kernel defines struct inet_bind_bucket to represent a bound port, and each bound port corresponds to this structure. The structure is defined as follows:

struct inet_bind_bucket {
	struct net		*ib_net;
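	//Port number, in host byte order
	unsigned short		port;
	//Related to port reuse
	signed short		fastreuse;
	//Links inet_bind_bucket structures into a hash list
	struct hlist_node	node;
	//The sockets this port has been allocated to. A port may be reused by several
	//sockets, so a hash list is used here. Each element of the list is a socket
	//(struct sock, which for TCP is embedded in struct tcp_sock)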
	struct hlist_head	owners;
};

As you can see, the structure is fairly straightforward: it holds the port number and the list of TCBs (Transmission Control Blocks) that own the port.

2.2 Bind port information hash table inet_bind_hashbucket

The entire TCP layer uses a hash table to organize the bound port information, that is, the above struct inet_bind_bucket. The hash table is a global structure, and its occupied memory is allocated during the execution of the TCP protocol initialization function tcp_init().

path: net/ipv4/tcp_ipv4.c

struct inet_hashinfo __cacheline_aligned tcp_hashinfo;
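//inet_hashinfo is the collection of TCP-level hash tables; only the fields related to port management are listed below
struct inet_hashinfo {
	...
	//Points to the bound-port hash table; its memory is allocated in tcp_init()
	struct inet_bind_hashbucket	*bhash;
	//Number of buckets in the bhash table
	unsigned int			bhash_size;

	//Lock protecting the listening hash table (lhash)
	rwlock_t			lhash_lock ____cacheline_aligned;
	//Count of users currently accessing the listening hash table
	atomic_t			lhash_users;
	//Cache used to allocate struct inet_bind_bucket; it is also created in tcp_init()
	struct kmem_cache			*bind_bucket_cachep;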
};

As shown above, each bucket of the hash table is headed not by a struct inet_bind_bucket but by a struct inet_bind_hashbucket, defined below. Each head carries its own spin lock, which makes the locking finer-grained and improves the efficiency of the hash table.

struct inet_bind_hashbucket {
	spinlock_t		lock;
	struct hlist_head	chain;
};
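
To make the layout concrete, here is a simplified user-space model of the table: an array of bucket heads, each with a chain of per-port entries. It is only a sketch. The names bind_bucket, bind_hashbucket, bhashfn, bucket_find and bucket_create are invented for illustration, the table size of 8 is arbitrary, the per-bucket spin lock is omitted, and the owners list is reduced to a counter.

```c
#include <stdlib.h>

#define BHASH_SIZE 8	/* assumption: a tiny table, for illustration only */

/* Stand-in for struct inet_bind_bucket: one entry per bound port */
struct bind_bucket {
	unsigned short port;		/* host byte order */
	int owners;			/* stand-in for the owners list */
	struct bind_bucket *next;	/* stand-in for the hlist node */
};

/* Stand-in for struct inet_bind_hashbucket (the kernel also keeps a lock here) */
struct bind_hashbucket {
	struct bind_bucket *chain;
};

static struct bind_hashbucket bhash[BHASH_SIZE];

/* Hash by port number modulo the table size */
static unsigned int bhashfn(unsigned short port)
{
	return port % BHASH_SIZE;
}

/* Walk the chain of the matching bucket, looking for the port */
struct bind_bucket *bucket_find(unsigned short port)
{
	struct bind_bucket *tb;

	for (tb = bhash[bhashfn(port)].chain; tb; tb = tb->next)
		if (tb->port == port)
			return tb;
	return NULL;
}

/* Allocate an entry for the port and push it onto the head of its chain */
struct bind_bucket *bucket_create(unsigned short port)
{
	struct bind_hashbucket *head = &bhash[bhashfn(port)];
	struct bind_bucket *tb = calloc(1, sizeof(*tb));

	if (!tb)
		return NULL;
	tb->port = port;
	tb->next = head->chain;
	head->chain = tb;
	return tb;
}
```

With this table size, ports 80 and 88 hash to the same bucket and end up on the same chain, which is why a lookup has to compare port numbers while walking the chain.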

Finally, the hash table of bound-port information is organized as shown in the figure below:
[Figure: organization of the bound-port hash table]

3. Bind kernel implementation (tcp)

Here we walk through the bind process for a TCP socket, taking Linux kernel 2.6.38 as an example. The kernel call chain is as follows:

sys_bind
  inet_bind
    udp_v4_get_port     (UDP)
    inet_csk_get_port   (TCP)

3.1 sys_bind ()

As described in the function's comment block, sys_bind performs the following actions:

  1. Look up the socket that corresponds to the fd passed in from user space
  2. Copy the address information (ip_addr, port) from user space into the kernel
  3. Call inet_bind, the bind interface of the AF_INET protocol family

/*
 *	Bind a name to a socket. Nothing much to do here since it's
 *	the protocol's responsibility to handle the local address.
 *
 *	We move the socket address to kernel space before we call
 *	the protocol layer (having also checked the address is ok).
 */

asmlinkage long sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
{
	struct socket *sock;
	struct sockaddr_storage address;
	int err, fput_needed;
    
    //1. Look up the socket from the fd
	sock = sockfd_lookup_light(fd, &err, &fput_needed);
	if (sock) {
        //2. Copy the address from user space
		err = move_addr_to_kernel(umyaddr, addrlen, (struct sockaddr *)&address);
		if (err >= 0) {
			err = security_socket_bind(sock,
						   (struct sockaddr *)&address,
						   addrlen);

            //3. Call inet_bind, the bind interface of the inet protocol family
			if (!err)
				err = sock->ops->bind(sock,
						      (struct sockaddr *)
						      &address, addrlen);
		}
		fput_light(sock->file, fput_needed);
	}
	return err;
}

3.2 inet_bind ()

inet_bind() is the interface the AF_INET protocol family provides to handle the bind() system call. It does two things: it checks the parameters, and it calls the port-binding interface of the socket's transport protocol. In detail:

  1. If the transport protocol provides its own bind interface, use it directly; within IPv4, only raw sockets currently do
  2. Check the address length and the type of the IP address (local, broadcast, multicast, INADDR_ANY, and so on)
  3. Validate the port: whether the caller has permission to bind a port below 1024, and whether the socket is already bound
  4. Bind the port through the transport protocol's get_port interface; this is where UDP and TCP diverge

int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
	struct sockaddr_in *addr = (struct sockaddr_in *)uaddr;
	struct sock *sk = sock->sk;
	struct inet_sock *inet = inet_sk(sk);
	unsigned short snum;
	int chk_addr_ret;
	int err;

	//If the transport layer provides a bind() interface, use it to complete the bind;
	//within the IPv4 protocol family only RAW sockets implement it
	if (sk->sk_prot->bind) {
		err = sk->sk_prot->bind(sk, uaddr, addr_len);
		goto out;
	}
	//Check that the address structure is large enough to be an AF_INET address structure
	err = -EINVAL;
	if (addr_len < sizeof(struct sockaddr_in))
		goto out;

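	//Identify the type of the IP address specified by the application
	chk_addr_ret = inet_addr_type(&init_net, addr->sin_addr.s_addr);
	//Several new concepts are involved here, but the gist is to decide whether the application may bind to certain special IP addresses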
	/* Not specified by any standard per-se, however it breaks too
	 * many applications when removed.  It is unfortunate since
	 * allowing applications to make a non-local bind solves
	 * several problems with systems using dynamic addressing.
	 * (ie. your servers still start up even if your ISDN link
	 *  is temporarily down)
	 */
	err = -EADDRNOTAVAIL;
	if (!sysctl_ip_nonlocal_bind &&
	    !inet->freebind &&
	    addr->sin_addr.s_addr != htonl(INADDR_ANY) &&
	    chk_addr_ret != RTN_LOCAL &&
	    chk_addr_ret != RTN_MULTICAST &&
	    chk_addr_ret != RTN_BROADCAST)
		goto out;

	//The port to bind, as given in the system call argument; 0 means the kernel picks a port automatically
	snum = ntohs(addr->sin_port);
	err = -EACCES;
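	//If the application specified a port (non-zero) and the port number is below 1024,
	//check whether the caller has permission to bind these reserved ports; if not, the bind fails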
	if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
		goto out;

	/*      We keep a pair of addresses. rcv_saddr is the one
	 *      used by hash lookups, and saddr is used for transmit.
	 *
	 *      In the BSD API these are the same except where it
	 *      would be illegal to use them (multicast/broadcast) in
	 *      which case the sending device address is used.
	 */
	lock_sock(sk);

	/* Check these errors (active socket, double bind). */
	err = -EINVAL;
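	//If the TCB is not in the CLOSE state, or the TCB is already bound (after a bind the source
	//port is saved in inet->num, see below), the bind fails: the kernel does not allow bind() to be called twice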
	if (sk->sk_state != TCP_CLOSE || inet->num)
		goto out_release_sock;

	//Save the address the application asked for into the TCB. The kernel comment above describes how the two addresses differ
	inet->rcv_saddr = inet->saddr = addr->sin_addr.s_addr;
	if (chk_addr_ret == RTN_MULTICAST || chk_addr_ret == RTN_BROADCAST)
		inet->saddr = 0;  /* Use device */

	//Call the interface provided by the transport protocol to do the actual port binding:
	//inet_csk_get_port() for TCP, udp_v4_get_port() for UDP
	if (sk->sk_prot->get_port(sk, snum)) {
		//A non-zero return value means the bind failed; report that the address is in use
		inet->saddr = inet->rcv_saddr = 0;
		err = -EADDRINUSE;
		goto out_release_sock;
	}

	//Record the address-bound and port-bound flags in the TCB
	if (inet->rcv_saddr)
		sk->sk_userlocks |= SOCK_BINDADDR_LOCK;
	if (snum)
		sk->sk_userlocks |= SOCK_BINDPORT_LOCK;
	//Save the bound port, in network byte order, in inet->sport
	inet->sport = htons(inet->num);
	inet->daddr = 0;
	inet->dport = 0;
	//Reset the cached route
	sk_dst_reset(sk);
	err = 0;
out_release_sock:
	release_sock(sk);
out:
	return err;
}

Note: inet_bind() belongs to the binding processing at the AF_INET protocol family level, so the UDP binding will also execute this function.
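
The double-bind check in inet_bind() can be seen from user space as well. The following is a minimal sketch, assuming a Linux host with a loopback interface; rebind_errno is a name made up for this example.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Call bind() twice on the same TCP socket.
 * Returns the errno of the second bind(), 0 if it unexpectedly succeeded,
 * or -1 if the first bind() failed. */
int rebind_errno(void)
{
	struct sockaddr_in addr;
	int result = -1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(0);	/* kernel-chosen port */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
		/* The socket now has a local port, so a second bind() is refused */
		if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
			result = 0;
		else
			result = errno;
	}

	close(fd);
	return result;
}
```

The second call is expected to fail with EINVAL, matching the "active socket, double bind" check in the listing above.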

3.3 inet_csk_get_port() (core)

Port binding for TCP is done by inet_csk_get_port(). Before looking at its implementation, let us first spell out what the function has to do:

  1. Determine whether the port was specified by the user or has to be chosen by the kernel. If the user specified it, check whether the port is already bound (same IP address);
  2. If the application did not specify a port, first find an available port, which is then added to the port hash table;
  3. Once a port is chosen: if it has not been allocated yet, create the corresponding port information tb (a struct inet_bind_bucket) and initialize its fields; if it is already allocated, port reuse is taking place and the existing struct inet_bind_bucket is used as-is;
  4. With tb in hand, point the socket's icsk_bind_hash at tb and add the TCB to the owners list of the port information, which establishes the mapping between the TCB and the port information structure.

//snum is the port number the application passed to bind()
int inet_csk_get_port(struct sock *sk, unsigned short snum)
{
	struct inet_hashinfo *hashinfo = sk->sk_prot->hashinfo;
	struct inet_bind_hashbucket *head;
	struct hlist_node *node;
	struct inet_bind_bucket *tb;
	int ret;
	struct net *net = sk->sk_net;

	local_bh_disable();
	if (!snum) {
		//The application did not say which port to bind, so the kernel has to pick one
		int remaining, rover, low, high;
		//Get the port range [low,high] available for dynamic binding; it is given by two system parameters
		inet_get_local_port_range(&low, &high);
		
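		//The loop below looks for a free port number while keeping the search reasonably random,
		//so that dynamically allocated ports are spread as evenly as possible across bhash

		//Initialize remaining to the length of the port range. The loop tries to find a free
		//port within [low,high], so remaining is the maximum number of iterations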
		remaining = (high - low) + 1;
		//Pick a random starting point for the loop
		rover = net_random() % remaining + low;

		do {
			//Get the hash bucket head for this port; the hash is simply "port number % hash table size"
			head = &hashinfo->bhash[inet_bhashfn(rover, hashinfo->bhash_size)];
			spin_lock(&head->lock);
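			//Walk the chain; if the same port number is already on it (already bound), move on to the
			//next port number (jump to the next label). Dynamic binding therefore never reuses a bound port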
			inet_bind_bucket_for_each(tb, node, &head->chain)
				if (tb->ib_net == net && tb->port == rover)
					goto next;
			//Reaching here means rover is a free port; stop searching
			break;
		next:
			spin_unlock(&head->lock);
			//On reaching the upper bound of the dynamic range, wrap around to the lower bound
			if (++rover > high)
				rover = low;
		} while (--remaining > 0);

		/* Exhausted local port range during search?  It is not
		 * possible for us to be holding one of the bind hash
		 * locks if this test triggers, because if 'remaining'
		 * drops to zero, we broke out of the do/while loop at
		 * the top level, not from the 'break;' statement.
		 */
		ret = 1;
		//remaining <= 0 means no free port was found above
		if (remaining <= 0)
			goto fail;

		//Record the free port that was found in snum
		/* OK, here is the one we will use.  HEAD is
		 * non-NULL and we hold it's mutex.
		 */
		snum = rover;
	} else {
		//The application specified the port to bind; go straight to the matching hash chain
		head = &hashinfo->bhash[inet_bhashfn(snum, hashinfo->bhash_size)];
		spin_lock(&head->lock);
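		//Walk the hash chain:
		//1. If the port number is found, another socket has already bound it (same namespace and port), so jump to
		//		the tb_found label and check whether the port may be reused.
		//2. If the loop does not find the port number, the port the application asked for has not been bound yet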
		inet_bind_bucket_for_each(tb, node, &head->chain)
			if (tb->ib_net == net && tb->port == snum)
				goto tb_found;
	}
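	//There are two ways to get here:
	//1. Dynamic binding: a free port number has been found
	//2. The application specified a port: no socket has bound that port yet
	//In both cases we jump to tb_not_found to create the port information structure
	//struct inet_bind_bucket and add it to TCP's bhash table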
	tb = NULL;
	goto tb_not_found;
tb_found:
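	//Reaching here means the port is already bound by another socket, so we have to decide whether it may be reused.
	//The owners list is also checked for being non-empty so that this function can double as a port check:
	//it tells whether the current socket has already bound the given port, and if so returns success
	//directly. See "TCP: the listen() system call" for how this is used

	//What follows is the logic that decides whether the port can be reused: if it can, the bind succeeds, otherwise it fails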
	if (!hlist_empty(&tb->owners)) {
		if (sk->sk_reuse > 1)
			goto success;
		if (tb->fastreuse > 0 &&
		    sk->sk_reuse && sk->sk_state != TCP_LISTEN) {
			goto success;
		} else {
			ret = 1;
			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb))
				goto fail_unlock;
		}
	}
tb_not_found:
	ret = 1;
	//Create a new port information structure for snum and add it to bhash
	if (!tb && (tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep,
					net, head, snum)) == NULL)
		goto fail_unlock;
	//Set the reuse flag in struct inet_bind_bucket
	if (hlist_empty(&tb->owners)) {
		if (sk->sk_reuse && sk->sk_state != TCP_LISTEN)
			tb->fastreuse = 1;
		else
			tb->fastreuse = 0;
	}
	else if (tb->fastreuse &&  (!sk->sk_reuse || sk->sk_state == TCP_LISTEN))
		tb->fastreuse = 0;
success:
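	//Point the TCB's icsk_bind_hash member at the port information structure and add the TCB to the
	//owners list of the port information, linking the TCB and the port information to each other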
	if (!inet_csk(sk)->icsk_bind_hash)
		inet_bind_hash(sk, tb, snum);
	BUG_TRAP(inet_csk(sk)->icsk_bind_hash == tb);
	ret = 0;

fail_unlock:
	spin_unlock(&head->lock);
fail:
	local_bh_enable();
	return ret;
}
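
The dynamic port search at the top of the function can be isolated into a small user-space sketch. The bucket lookup is replaced by a plain array of "used" flags and the random start is passed in as a parameter so the behavior is reproducible; the name pick_port and both of these simplifications are assumptions of this example, not kernel code.

```c
#include <stdbool.h>

#define PORT_MAX 65536

/* Search [low, high] for a free port, starting at 'start' and wrapping
 * around at 'high', visiting each port at most once.
 * 'used[p]' stands in for "a bind bucket for port p already exists".
 * Returns the free port, or -1 if the whole range is exhausted. */
int pick_port(const bool used[PORT_MAX], int low, int high, int start)
{
	int remaining = (high - low) + 1;	/* maximum number of probes */
	int rover = start;			/* the kernel picks this at random */

	do {
		if (!used[rover])
			return rover;		/* free port found */
		/* Wrap around to the lower bound after the upper bound */
		if (++rover > high)
			rover = low;
	} while (--remaining > 0);

	return -1;				/* every port in the range is taken */
}
```

For example, with the range [100, 102], ports 101 and 102 in use and a start of 101, the search probes 101, then 102, then wraps around and returns 100.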

3.3.1 Dynamic port range

As shown above, a dynamically assigned port is taken from a range and must be a port that no socket has bound yet. The lower and upper bounds of the range can be set through /proc/sys/net/ipv4/ip_local_port_range.

/*
 * This array holds the first and last local port number.
 */
int sysctl_local_port_range[2] = { 32768, 61000 };
DEFINE_SEQLOCK(sysctl_port_range_lock);

void inet_get_local_port_range(int *low, int *high)
{
	unsigned seq;
	do {
		seq = read_seqbegin(&sysctl_port_range_lock);
		//The dynamically allocatable port range comes from the system parameter below; the default in the code is [32768, 61000]
		*low = sysctl_local_port_range[0];
		*high = sysctl_local_port_range[1];
	} while (read_seqretry(&sysctl_port_range_lock, seq));
}
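
The current range can also be read from user space. The helper below is a minimal sketch that parses the two numbers from the proc file named above; the function name read_local_port_range is invented for this example, and it assumes a Linux system with /proc mounted.

```c
#include <stdio.h>

/* Read the dynamic port range from /proc/sys/net/ipv4/ip_local_port_range.
 * The file holds two integers: the lower and the upper bound.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_local_port_range(int *low, int *high)
{
	FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d %d", low, high) == 2);
	fclose(f);
	return ok ? 0 : -1;
}
```

The values reported depend on the running kernel and on any local sysctl settings, so they may differ from the defaults compiled into the code shown here.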

3.3.2 Port and socket mapping inet_bind_hash

  1. Save the port number in inet_sk(sk)->num.
  2. Add the sock to the owners list of the port information.
  3. Save the port information in the sock's binding information, icsk_bind_hash.

void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
		    const unsigned short snum)
{
	inet_sk(sk)->num = snum;
	//Add the TCB to the owners list of the port information structure
	sk_add_bind_node(sk, &tb->owners);
	inet_csk(sk)->icsk_bind_hash = tb;
}

3.3.3 Port multiplexing

....

Origin blog.csdn.net/wangquan1992/article/details/108868646