NIO Bible: Penetrate the underlying principles of NIO, Selector, and Epoll at once

This PDF e-book is the ongoing work of the Nien architecture team, which keeps upgrading and iterating it. The goal is, through continuous upgrades and iterations, to build a deep, rock-solid body of high-performance technology for everyone.

Original: "Nine Yang Manual: Completely Understand the Core Principles of Operating System Select and Epoll"

Renamed: "NIO Bible: Penetrating the underlying principles of NIO, Selector, and Epoll at once"

Iteration 1: 2021.4

Iteration 2: 2022.4

Iteration 3: 2023.9

A few words first:

It is extremely hard to get an offer right now; many candidates cannot even get an interview call. In Nien's technical community (50+), many friends have obtained offers from major companies such as Didi, Toutiao, and JD on the strength of their special skill of "proficient in NIO + proficient in Netty", realizing their dream of joining a major company.

During the learning process, everyone found that NIO is hard and Netty is hard.

Here, Nien gives a penetrating introduction to the core principles of NIO from the perspective of an architect.

So Nien wrote this PDF e-book and will keep upgrading it:

(1) "NIO Bible: Penetrate the underlying principles of NIO, Selector, and Epoll at once" PDF

Of course, many other Bible PDFs by Nien are also very important and valuable.

(2) "K8S Study Bible" PDF

(3) "Docker Study Bible" PDF

They help everyone penetrate Docker + K8S and reach "Docker + K8S freedom", so that no one gets lost.

The V3 version of the PDF book "NIO Bible: Penetrate the Underlying Principles of NIO, Selector, and Epoll at Once" will continue to be iterated and upgraded. It is meant as a reference for later readers, to improve everyone's 3-high (high-concurrency, high-performance, high-availability) architecture, design, and development skills.

It helps people join major companies, do architecture work, and earn high salaries.

Article directory

V1 version background

Purpose of this Bible:

Together with the "High Concurrency Trilogy", help everyone open up the Ren and Du meridians of high concurrency (a martial-arts metaphor for a breakthrough in understanding)

To learn high concurrency, epoll is the foundation

The importance of epoll

epoll is an essential technology for high-performance network servers under Linux.

Java NIO, nginx, Redis, skynet, and most game servers all use this multiplexing technology.

The importance of epoll: a must-ask in interviews at major companies

When hiring server-side engineers, many major companies are likely to ask epoll-related questions.

For example: what is the difference between epoll and select?

What makes epoll so efficient?

The core killer weapon of high performance: asynchronization

As we said, the core killer weapon of high performance is asynchronization, or an asynchronous architecture.

NIO (non-blocking IO) is a high-performance IO architecture scheme, defined in contrast to BIO (blocking IO).

So, from the architectural dimension, NIO is an asynchronous architecture scheme.

The crux of the problem is that the IO model used underneath NIO is a synchronous IO model.

Here we run into a big knowledge conflict around NIO:

  • Is NIO asynchronous? Yes.
  • Is NIO synchronous IO? Yes.

Taken literally, isn't this an outright self-contradiction, utter nonsense that defies reason?

Yes. This is the big knowledge conflict, the big conceptual conflict, around NIO.

It is also a technical puzzle that has confused hundreds, even thousands, of readers in Nien's community.

So how do we untangle this technical puzzle?

In architecture we have a very important strategy: decoupling.

Here, Nien decouples the processing chain that NIO sits on, that is, the user's API call chain.

How do we decouple the user's API call chain?

Simply put, the user's API call chain can be decoupled into three layers, as shown in the figure below:

  • Application layer: asynchronization of the programming model
  • Framework layer: asynchronization of the IO threads
  • OS layer: asynchronization of the IO model

After decoupling, we can dissect it layer by layer and build an asynchronous architecture at each layer. This introduces an impressive-sounding concept: full-link asynchronization.

The three layers of full-link asynchronization

The final goal of full-link asynchronization is for every component to achieve asynchronization at three layers:

1. Asynchronization of the application layer: the programming model

With the arrival of the cloud-native era, low-level component programming is becoming more and more reactive and stream-oriented.

So application-layer development has introduced reactive programming.

Moving from imperative programming to reactive programming is the general trend in a great many scenarios, for example IO-intensive scenarios.

Note that reactive programming has a steep learning curve; you need to read a lot and practice a lot.

For this, please see Nien's e-book "Reactive Bible PDF".

2. Asynchronization of the framework layer: an asynchronous IO thread-model architecture

What is an asynchronous IO thread-model architecture?

Going from one IO thread handling only one request at a time, to one IO thread handling a large number of requests at a time: that is the asynchronous IO thread-model architecture.

Let's look at some classic components and see whether they are synchronous or asynchronous:

  • Tomcat: a synchronous thread model; while a thread is handling a request, the processing of that request is blocking
  • HTTPClient (client component): a synchronous thread model; while a thread is handling a request, the processing of that request is blocking
  • AsyncHttpClient: an asynchronous IO thread-model architecture; one IO thread can handle a large number of requests at a time
  • Netty (the underlying IO framework): its core is an asynchronous thread-model architecture; one IO thread can handle a large number of requests at a time

A very classic pattern of the asynchronous IO thread-model architecture is the IO Reactor thread model.

The IO Reactor pattern

Once you understand some usage patterns of BIO and NIO, the Reactor pattern naturally follows.

NIO is event-driven: there is a selector called Selector, which blocks to obtain the list of events of interest. Once the event list is obtained, a dispatcher can carry out the actual data operations.

The figure above is one of Doug Lea's slides on NIO; it shows the basic elements of the simplest Reactor model.

You can compare it with the NIO code above; there are four main elements:

  • Acceptor: handles client connections and binds the concrete event handlers
  • Event: the concrete event that occurs
  • Handler: the party that actually handles an event, for example read/write events
  • Reactor: dispatches concrete events to Handlers

We can refine the model above one step further; the following figure also comes from Doug Lea's slides.

It splits the Reactor into two parts, mainReactor and subReactor. The mainReactor is responsible for listening for and handling new connections, and then hands subsequent event processing to the subReactor. The subReactor's way of handling events also changes from blocking mode to multi-threaded processing, introducing a task-queue pattern.

These two thread models are extremely important.

You must know them inside out.

They are not expanded here; please see Nien's best-selling book "Java High Concurrency Core Programming Volume 1 (Enhanced Edition)".

This is an absolute key point in interviews.

Netty, the king of IO components, is as a whole an implementation of the Reactor thread model.

This too is core knowledge; it is not expanded here, please see Nien's best-selling book "Java High Concurrency Core Programming Volume 1 (Enhanced Edition)".

That covers the asynchronization of the thread model.

3. Asynchronization of the OS layer: the IO model

The biggest difficulty at present is the asynchronization of the IO model at the underlying operating-system layer.

Note a big issue here:

The IO model underneath Netty is usually select or epoll, which is synchronous IO, not asynchronous IO.

The five IO models are the basic knowledge of this article, and also very core knowledge; they are introduced in detail shortly.

Why does the IO model need asynchronization?

There is a big performance cost here: in synchronous IO, thread switching, polling of IO events, and the IO operations themselves all require system calls to complete.

Where does the cost of a system call go?

First of all, threads are a "very expensive" resource, mainly in the following ways:

  1. Creating and destroying threads is costly; both require heavyweight system calls.
  2. A thread itself occupies a fairly large amount of memory. A Java thread's stack is generally allocated at least 512 KB to 1 MB; if a system has more than a thousand threads, the JVM consumes over 1 GB of memory just for them.
  3. Thread switching is costly. When the operating system switches threads, it has to save the thread's context and then execute a system call. If too many threads switch frequently, the time spent switching may even exceed the time the threads spend running; the typical symptom is a very high system CPU sy value (above 20%), which can bring the system close to being unusable.

Among Linux performance metrics there are two indicators, us and sy, which are easy to see with the top command.

us is the CPU share of user processes, and sy is the CPU share consumed inside the kernel.

If a process switches between kernel mode and user mode very frequently, most of its efficiency is wasted on the switching itself. A single switch between kernel mode and user mode generally takes on the order of microseconds or more, which is very expensive.

CPU capacity is fixed: the less of it is wasted on useless work, the more efficiently the real business logic is processed.

Two aspects affect efficiency:

  1. The number of processes or threads, which causes excessive context switching.
    Processes are managed and scheduled by the kernel, and a process switch can only happen in kernel mode. So if your code switches threads, that necessarily involves a switch between user mode and kernel mode.
  2. The IO programming model, which causes excessive switching between user mode and kernel mode.
    For example, the synchronous blocking-wait model goes through data reception and soft-interrupt handling (kernel mode), then wakes up the user thread (user mode), and after processing goes back to waiting (kernel mode).

Note: a single switch between kernel mode and user mode generally takes on the order of microseconds or more, which is very expensive.
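To get a feel for this cost, here is a rough, hedged micro-benchmark sketch in C (assumptions: Linux and glibc; SYS_getpid is used only because it is a trivial system call). It measures the raw user/kernel mode transition of a system call; a full thread or process context switch, which also involves the scheduler and cache effects, costs considerably more.

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const long N = 1000000;                    // number of trivial system calls to time
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) {
        syscall(SYS_getpid);                   // forces one user-mode -> kernel-mode entry
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg per system call: %.0f ns\n", ns / N);
    return 0;
}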

The first goal of IO-model asynchronization: reduce the number of threads, and thereby reduce the CPU context-switch overhead caused by thread-switching system calls.

The second goal of IO-model asynchronization: reduce IO system calls, and thereby reduce the CPU context-switch overhead they bring.

Asynchronization at the IO-model layer

Asynchronization at the IO-model layer also evolved step by step; the evolution roughly involves the following models:

  • Blocking IO (BIO)
  • Non-blocking IO
  • IO multiplexing (NIO)
  • Signal-driven IO
  • Asynchronous IO (AIO)

As a warm-up, let's first take a quick look at two simple models: the blocking IO model and the non-blocking IO model.

1. Blocking IO model

As shown in the figure above, this is the typical BIO model: every time a connection arrives, after handling by the coordinator, a dedicated thread is started to take it over.

If there are 1000 connections, 1000 threads are needed. Thread resources are very expensive; besides occupying a lot of memory, they also consume a lot of CPU scheduling time. So with very many connections, BIO becomes very inefficient.

For a single blocking IO operation, it is no slower than NIO. But as the number of connections served grows, once overall server resource scheduling and utilization are taken into account, NIO shows a clear advantage; NIO is very well suited to high-concurrency scenarios.

2. Non-blocking IO model

In fact, when handling IO, most of the time is spent waiting.

For example, establishing a socket connection takes quite a long time; during that time the connection does not consume extra system resources, but it can only sit blocked in a thread. In this situation, system resources are not used effectively.

On Linux, Java NIO is implemented on top of epoll. epoll is a high-performance IO multiplexing facility that improves on some of the capabilities of tools such as select and poll. In network programming, questions about epoll concepts come up in almost every interview.

epoll's data structures are supported directly in the kernel. Through functions such as epoll_create and epoll_ctl, you can build the combinations of file descriptors (fd) and their events (event); a short sketch follows the two concepts below.

There are two fairly important concepts here:

  • fd: every connection and every file corresponds to a descriptor, somewhat like a port number; when the kernel locates these connections, it addresses them by fd
  • event: when the resource behind an fd has a state or data change, the epoll_item structure is updated. When there is no event change, epoll blocks and waits without consuming system resources; as soon as a new event arrives, epoll is activated and notifies the application of the event
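A minimal sketch of these calls in C (assumptions: Linux; listen_fd is an already-created listening socket; error handling is mostly omitted). epoll_create1 builds the kernel-side event structure, epoll_ctl registers an fd with the events of interest, and epoll_wait blocks until some registered fd is ready:

#include <sys/epoll.h>

int run_epoll(int listen_fd) {
    int epfd = epoll_create1(0);                      // create the kernel event structure
    if (epfd < 0) return -1;

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;                              // interested in readable events
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   // register the fd and its events

    struct epoll_event ready[64];
    for (;;) {
        int n = epoll_wait(epfd, ready, 64, -1);      // block until events arrive
        for (int i = 0; i < n; i++) {
            int fd = ready[i].data.fd;
            // accept() new connections on listen_fd, or read() data on other fds
            (void)fd;
        }
    }
}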

There is another interview question about epoll: compared with select, what improvements does epoll make?

The answer, directly:

  • epoll no longer needs to sweep the whole fd set the way select does, nor does it need to copy the fd set between user space and kernel space on every call
  • The complexity for the application to obtain ready fd events is O(1) with epoll, versus O(n) with select
  • select supports at most about 1024 fds (FD_SETSIZE), while epoll is limited only by the system's maximum number of open file descriptors
  • select detects ready events by scanning, while epoll uses notification (callbacks), which is more efficient

The five IO models are the basic knowledge of this article, and also very core, very important knowledge.

Next, through Chapter 2 of Nien's best-selling book " Java High Concurrency Core Programming Volume 1 Enhanced Edition ", I will give you a detailed introduction to the asynchronous evolution of the IO model layer.

"Java High Concurrency Core Programming Volume 1" Chapter 2: The underlying principles of high concurrency IO

The principle of this book is: start with the basics. The underlying principles of IO are the foundational knowledge hidden beneath Java programming; they are fundamentals every developer must master, and they are also required knowledge for passing interviews at large companies.

This chapter starts from the underlying principles of the operating system, and provides an in-depth analysis of the underlying principles of high-concurrency IO through pictures and texts, and introduces how to make the operating system support high concurrency through settings.

Section 2.1 Basic principles of IO reading and writing

In order to prevent user processes from directly operating the kernel and ensure kernel security, the operating system divides memory (virtual memory) into two parts, one is kernel space (Kernel-Space) and the other is user space (User-Space). In the Linux system, the kernel module runs in the kernel space, and the corresponding process is in the kernel state; while the user program runs in the user space, and the corresponding process is in the user state.

The core of the operating system is the kernel, which is independent of ordinary applications and has access to the protected kernel space as well as to the underlying hardware devices. Kernel space always resides in memory and is reserved for the operating system's kernel. Applications are not allowed to read or write directly in the kernel-space area, nor are they allowed to call functions defined in kernel code directly. Each application process has its own separate user space, and the corresponding process runs in user mode. A user-mode process cannot access data in kernel space, nor can it call kernel functions directly; therefore, to make a system call, the process must first be switched into kernel mode.

A kernel-mode process can execute arbitrary instructions and use all the resources of the system, while a user-mode process can only perform simple operations and cannot directly use system resources. Now the question arises: how does a user-mode process perform a system call? The answer is: the user-mode process must go through the system call interface (System Call) in order to issue instructions to the kernel and complete operations such as using system resources.

Note:

Unless otherwise stated, the kernel mentioned later in this book refers to the kernel of the operating system.

User programs perform IO reads and writes by relying on the underlying IO facilities, essentially the two fundamental system calls sys_read and sys_write. Although the names and forms of these two system calls may differ across operating systems, their basic function is the same.

The sys_read system call at the operating-system level does not read data directly from the physical device into the application's memory, and the sys_write system call does not write data directly to the physical device. Whether the upper-layer application calls the operating system's sys_read or its sys_write, buffers are involved. Specifically, through the sys_read system call the upper-layer application copies data from the kernel buffer into the application's process buffer; through the sys_write system call it copies data from the application's process buffer into the operating system's kernel buffer.

Simply put, an application's IO operations are not reads and writes at the physical-device level but copies between buffers. Neither sys_read nor sys_write is responsible for exchanging data between the kernel buffer and the physical device (disk, network card, and so on); that low-level exchange is done by the operating-system kernel. So the IO operations in an application, whether socket IO or file IO, belong to upper-layer application development, and their execution flows along the input and output dimensions are similar: both are data exchanges between the kernel buffer and the process buffer.
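A minimal sketch in C (assumption: fd is an already-connected socket descriptor; this is only to illustrate the buffer copies described above): read() copies bytes that the kernel has already placed in its buffer into our user-space array, and write() copies our user-space bytes back into the kernel buffer, which the kernel then sends.

#include <unistd.h>

ssize_t echo_once(int fd) {
    char buf[4096];                          /* the process (user) buffer    */
    ssize_t n = read(fd, buf, sizeof buf);   /* kernel buffer -> user buffer */
    if (n <= 0)
        return n;                            /* 0: peer closed, <0: error    */
    return write(fd, buf, (size_t)n);        /* user buffer -> kernel buffer */
}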

This involves user space versus kernel space and user mode versus kernel mode, another extremely tangled set of concepts.

They are likewise basic knowledge for this article, and very core and important.

They are not expanded here; please see Nien's 3-high architecture notes, "The Sunflower Book of High Performance".

2.1.1 Kernel buffer and process buffer

Why set up so many buffers, making the read/write process so cumbersome?

The purpose of the buffer is to reduce frequent physical exchanges with the device. There is a very large speed gap between the computer's external physical devices and its memory and CPU. Reading or writing an external device directly involves operating-system interrupts; when an interrupt occurs, the previous process data and state have to be saved, and after the interrupt ends they have to be restored. To reduce the time and performance loss caused by frequent interrupts of the underlying system, the kernel buffer was introduced.

With the kernel buffer, the operating system monitors it, waits for it to accumulate a certain amount of data, then performs interrupt handling for the IO device and executes the actual physical-device IO in a batch. This mechanism improves system performance. As for when the system interrupts (read interrupts and write interrupts) are executed, that is decided by the operating-system kernel; the application does not need to care.

When an upper-layer application uses the sys_read system call, it only copies data from the kernel buffer into the upper-layer application's buffer (the process buffer); when it uses the sys_write system call, it only copies data from the application's user buffer into the kernel buffer.

The number of kernel buffers and application buffers is also different. In Linux systems, the operating system kernel has only one kernel buffer. Each user program (process) has its own independent buffer, called the user buffer or process buffer. In most cases, the IO reading and writing programs of user programs in Linux systems do not perform actual IO operations, but directly exchange data between the user buffer and the kernel buffer.

2.1.2 Execution flow of the typical IO system calls sys_read and sys_write

The following is simple server-side C code for socket data transmission. It is simple in that the server accepts only one connection and then reads and writes socket data through the C read/write functions. The reference code is as follows:

#include "InitSock.h" 
#include <stdio.h> 
#include <iostream>
using namespace std;
CInitSock initSock;     // 初始化Winsock库 
 
int main() 
{
    
     
// 创建套节字 
//参数1用来指定套接字使用的地址格式,通常使用AF_INET
//参数2指定套接字的类型,SOCK_STREAM指的是TCP,SOCK_DGRAM指的是UDP
	SOCKET sListen = ::socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
sockaddr_in sin;  //创建IP地址: ip+端口
sin.sin_family = AF_INET; 
   sin.sin_port = htons(4567);  //1024 ~ 49151:普通用户注册的端口号
sin.sin_addr.S_un.S_addr = INADDR_ANY; 
// 绑定这个套接字到一个IP地址 
if(::bind(sListen, (LPSOCKADDR)&sin, sizeof(sin)) == SOCKET_ERROR) 
    {
    
     
        printf("Failed bind() \n"); 
        return 0; 
    } 
     
    //开始监听连接 
	//第二个参数2指的监听队列中允许保持的尚未处理的最大连接数
     if(::listen(sListen, 2) == SOCKET_ERROR) 
    {
    
     
        printf("Failed listen() \n"); 
        return 0; 
    } 
     
    // 接受客户的连接请求,注意,这里只是演示,只接收一个客户端,不接收更多客户端
    sockaddr_in remoteAddr;  
    int nAddrLen = sizeof(remoteAddr); 
    SOCKET sClient = 0; 
    char szText[] = " TCP Server Demo! \r\n"; 
    while(sClient==0) 
    {
    
     
      // 接受一个新连接 
		  //((SOCKADDR*)&remoteAddr)一个指向sockaddr_in结构的指针,用于获取对方地址
        sClient = ::accept(sListen, (SOCKADDR*)&remoteAddr, &nAddrLen); 
        if(sClient == INVALID_SOCKET) 
        {
    
     
            printf("Failed accept()"); 
        } 
        
        printf("接受到一个连接:%s \r\n", inet_ntoa(remoteAddr.sin_addr)); 
        break; 
    } 
 
    while(TRUE) 
    {
    
     
        // 向客户端发送数据 
       ::send(sClient, szText, strlen(szText), 0); 
         
        // 从客户端接收数据 
        char buff[256] ; 
        int nRecv = ::read(sClient, buff, 256, 0); 
        if(nRecv > 0) 
        {
    
     
            buff[nRecv] = '\0'; 
            printf(" 接收到数据:%s\n", buff); 
        } 
    } 
 
    // 关闭客户端的连接 
    ::closesocket(sClient);        
    // 关闭监听套节字 
    ::closesocket(sListen);  
    return 0; 
}

The read and write functions used by the user program can be understood as C library functions, used within the user program. Note that these library functions are not kernel code: reading and writing data in kernel space must be done by kernel code. So inside these library functions, system calls are further wrapped and invoked. Which system calls are involved? Different operating systems, or different versions of the same operating system, differ in the details, but you can roughly assume that the read library function used in a C program invokes the system call sys_read, and sys_read completes the data read in kernel space; the write library function used in a user C program invokes the system call sys_write, and sys_write completes the data write in kernel space.

The system calls sys_read&sys_write do not exchange data between the kernel buffer and the physical device. The sys_read call copies data from the kernel buffer to the application's user buffer, and the sys_write call copies data from the application's user buffer to the kernel buffer. The general flow of the two system calls is shown in Figure 2-1.

Figure 2-1 Execution process of the sys_read and sys_write system calls

Here we take the sys_read system call as an example and first look at the two stages of the complete input process:

  • The application waits for the data to be ready.
  • Copy data from kernel buffer to user buffer.

If sys_read is issued on a socket, the concrete processing of the two stages above is as follows:

  • In the first phase, the application waits for data to reach the network card through the network. When the waiting packet arrives, the data is copied to the kernel buffer by the operating system. This work is completed automatically by the operating system without the user program being aware of it.
  • In the second phase, the kernel copies data from the kernel buffer to the application's user buffer.

To be more specific, when the C client program and the server complete one socket request/response data exchange (including sys_read and sys_write), the full process is as follows (a minimal client-side sketch follows the list):

  • The client sends a request: the client C program copies the data into the kernel buffer through the sys_write system call, and Linux sends the request data in the kernel buffer out through the client's network card.
  • The server system receives the data: on the server, the request data is read by the server operating system, via the DMA hardware, from the receiving network card into the kernel buffer of the server machine.
  • The server C program obtains the data: the server C program uses the sys_read system call to copy the data from the Linux kernel buffer into the C user buffer.
  • Server-side business processing: the server completes, in its own user space, the business processing corresponding to the client's request.
  • The server returns data: after processing, the server C program builds the response data and writes it from the user buffer into the kernel buffer using the sys_write system call; the operating system is responsible for sending the data in the kernel buffer out.
  • The server system sends the data: the server's Linux system writes the data in the kernel buffer to the network card, and the network card sends the data to the target client through the underlying communication protocol.
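As a hedged complement to the server code above, here is a minimal client-side sketch in C (assumptions: a Linux client, and the server listening on 127.0.0.1:4567 as in the reference code; error handling is mostly omitted). write() corresponds to the "client sends a request" step and read() to receiving the response:

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(4567);
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);

    if (connect(fd, (struct sockaddr *)&srv, sizeof srv) < 0)
        return 1;

    const char *req = "hello\r\n";
    write(fd, req, strlen(req));                   /* user buffer -> kernel buffer, then NIC */

    char resp[256];
    ssize_t n = read(fd, resp, sizeof resp - 1);   /* NIC -> kernel buffer -> user buffer */
    if (n > 0) {
        resp[n] = '\0';
        printf("received: %s\n", resp);
    }
    close(fd);
    return 0;
}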

Note:

Since production Java high-concurrency applications basically all run on the Linux operating system, the operating system in the examples above is Linux.

Section 2.2 The five main IO models

Server-side high-concurrency IO programming often demands very high performance, and in general a high-performance IO model must be chosen. Also, for Java engineers, knowledge of IO models is must-have knowledge for passing interviews at large companies. This chapter starts from the most basic models and unveils the core principles of IO models.

Although there are five common IO models, they can be grouped into four broad categories:

  1. Synchronous blocking IO (Blocking IO)

First, a word on blocking versus non-blocking. Blocking IO means that control returns to user space to execute the user program's instructions only after the kernel IO operation has completely finished; the word "blocking" refers to the execution state of the user program (the process or thread that initiated the IO request). It is fair to say that the traditional IO models are all blocking IO models, and in Java, sockets created by default are all blocking IO sockets.

Second, a word on synchronous versus asynchronous. Simply put, synchronous and asynchronous can be seen as two ways of initiating an IO request. Synchronous IO means that user space (a process or thread) is the active initiator of the IO request and the kernel is the passive receiver. Asynchronous IO is the reverse: the kernel is the active initiator of the IO operation and user space is the passive receiver.

The so-called synchronous blocking IO refers to an IO operation initiated by the user space (or thread) actively and needs to wait for the kernel IO operation to be completely completed before returning to the user space. During the IO operation, the user process (or thread) that initiates the IO request is in a blocked state.

  2. Synchronous non-blocking NIO (Non-Blocking IO)

Non-blocking IO means that the user-space program does not need to wait for the kernel IO operation to finish completely; it can return to user space immediately and execute subsequent instructions. That is, the user process (or thread) that initiated the IO request is in a non-blocking state. At the same time, the kernel immediately returns an IO status value to the user.

What is the difference between blocking and non-blocking?

Blocking means that the user process (or thread) keeps waiting and cannot do anything else; non-blocking means that the user process (or thread) returns to its own space after getting the status value returned by the kernel and can do other things. In Java, non-blocking IO requires the socket to be set to NONBLOCK mode.

Note:

The NIO (synchronous non-blocking IO) model mentioned here is not the NIO (New IO) class library in Java programming.

So-called synchronous non-blocking NIO refers to an IO operation actively initiated by the user process that can return to user space immediately without waiting for the kernel IO operation to finish completely; during the IO operation, the user process (or thread) that initiated the IO request is in a non-blocking state.

  3. IO Multiplexing

In order to improve performance, the operating system introduces a new type of system call specifically used to query the readiness status of IO file descriptors (including socket connections). In Linux systems, the new system calls are the select/epoll system calls. Through such a system call, a user process (or thread) can monitor multiple file descriptors; once a descriptor is ready (usually meaning the kernel buffer is readable/writable), the kernel returns the readiness status of the file descriptors to the user process (or thread), and user space can then make the corresponding IO system calls based on that readiness status.

IO multiplexing (IO Multiplexing) is the basic IO model of the high-performance Reactor thread model. Of course, this model is an upgraded version based on the synchronous and non-blocking model.

  4. Signal-driven IO model

In the signal-driven IO model, user threads avoid blocking on IO event queries by registering callback functions for IO events with the kernel.

Specifically, the user process registers a callback function with the kernel in advance; when an event occurs, the kernel uses a signal (SIGIO) to notify the process to run the callback. Then the second stage of the IO operation, the execution stage, begins: the user thread continues to run, and inside the signal callback it invokes the IO read/write operations to perform the actual IO request.

Signal-driven IO can be viewed as a kind of asynchronous IO, roughly the system calling back a user function. However, the asynchrony of signal-driven IO is not complete. Why? Signal-driven IO is asynchronous only in the notification stage of the IO event; in the second stage, copying the data from the kernel buffer to the user buffer, the user process is blocked and synchronous.

  5. Asynchronous IO (Asynchronous IO)

Asynchronous IO means that the calling relationship between user space and kernel space is completely reversed. The user-space thread becomes the passive receiver, while kernel space becomes the active caller. In the asynchronous IO model, by the time the user thread receives the notification, the data has already been read by the kernel and placed into the user buffer; after the IO completes, the kernel notifies the user thread, which can use the data directly.

Asynchronous IO is similar to the typical callback pattern in Java: the user process (or thread) registers callback functions for various IO events with kernel space, and the kernel invokes them actively.

Asynchronous IO comes in two flavors: the not-fully-asynchronous signal-driven IO model and the fully asynchronous IO model.

Next, the five common IO models above are introduced in detail.

2.2.1 Synchronous blocking IO (Blocking IO)

By default, in a Java application process, all IO operations on socket connections are synchronous blocking IO (Blocking IO).

In the blocking IO model, from the moment the Java application initiates the IO system call until the system call returns, the Java process (or thread) that initiated the IO request is blocked. Only after a successful return can the application process start handling the data in the user-space buffer.

The concrete flow of synchronous blocking IO is shown in Figure 2-2.

Figure 2-2 Flow of synchronous blocking IO

For example, in Java, a sys_read system call issued on a socket roughly proceeds as follows:

(1) From the moment the Java IO read issues the sys_read system call, the user process (or thread) enters the blocked state.

(2) When the kernel receives the sys_read system call, it starts preparing the data. At first the data may not yet have reached the kernel buffer (for example, a complete socket packet has not yet been received), in which case the kernel has to wait.

(3) Once the complete data has arrived, the kernel copies it from the kernel buffer into the user buffer (memory in user space) and then returns the result (for example, the number of bytes copied into the user buffer).

(4) Only after the kernel returns does the user thread leave the blocked state and resume running.

The characteristic of blocking IO: during both stages of the kernel's IO execution, the user process (or thread) that initiated the IO request is blocked.

The advantage of blocking IO: application development is very simple; while blocked waiting for data, the user thread is suspended and consumes essentially no CPU.

The disadvantage of blocking IO: in general, each connection is given a dedicated thread, and one thread maintains the IO operations of one connection. With low concurrency this is not a problem.

However, in high-concurrency scenarios, a large number of threads are needed to maintain a large number of network connections, and the memory and thread-switching overhead becomes enormous. In high-concurrency scenarios, the blocking IO model performs very poorly and is basically unusable.

In short, blocking IO suffers from the c10k problem.

The so-called c10k problem asks: how can a server support 10k concurrent connections, that is, 10,000 concurrent connections (which is where the name c10k comes from)?

Because hardware costs have dropped dramatically and hardware technology keeps advancing, if one server can serve more clients at the same time, the cost of serving each client drops dramatically.

From this point of view, the c10k problem is very meaningful.

2.2.2 Synchronous non-blocking NIO (Non-Blocking IO)

On Linux, socket connections are in blocking mode by default; a socket can be switched into non-blocking mode (Non-Blocking). In the NIO model, once the application starts an IO system call, one of two things happens:

(1) If there is no data in the kernel buffer, the system call returns immediately with a failure result.

(2) If there is data in the kernel buffer, the system call blocks during the data copy until the data has been copied from the kernel buffer to the user buffer. When the copy completes, the system call returns success, and the user process (or thread) can start handling the data in the user-space buffer.

The flow of synchronous non-blocking IO is shown in Figure 2-3.

Figure 2-3 Flow of synchronous non-blocking IO

For example, a sys_read system call issued on a non-blocking socket proceeds as follows:

(1) While the kernel data is not yet ready, an IO request issued by the user thread returns immediately. So, to actually read the data, the user process (or thread) has to keep issuing IO system calls.

(2) After the kernel data arrives, the user process (or thread) issues the system call and blocks (note the blocked state of the user process here). The kernel starts copying the data from the kernel buffer to the user buffer and then returns the result (for example, the number of bytes copied into the user buffer).

(3) When the user process (or thread) reads data and there is none, the call returns immediately without blocking; user space has to try several times before it finally reads the data and can continue.

The characteristic of synchronous non-blocking IO: the application thread has to keep issuing IO system calls, polling whether the data is ready; if not, it keeps polling until the IO system call completes.

The advantage of synchronous non-blocking IO: each IO system call can return immediately while the kernel is still waiting for data; the user thread is not blocked and responsiveness is good.

The disadvantage of synchronous non-blocking IO: constantly polling the kernel consumes a lot of CPU time and is inefficient.

Overall, in high-concurrency scenarios, synchronous non-blocking IO also performs poorly and is basically unusable; ordinary web servers do not use this IO model, and it does not come up in real-world Java development either. But the model still has value: other IO models can use the non-blocking IO model as a building block to achieve high performance.
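A minimal polling sketch in C (assumption: fd is an already-connected socket; this only illustrates the model, not how production code should spin): the socket is switched into non-blocking mode with fcntl, and when no data is ready the read call fails immediately with EAGAIN/EWOULDBLOCK instead of blocking.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t nonblocking_read(int fd, char *buf, size_t len) {
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);      /* switch the socket to non-blocking mode */

    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;                            /* data copied (or peer closed) */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                           /* a real error */
        /* no data yet: the call returned immediately; do other work, then poll again */
    }
}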

Note:

Synchronous non-blocking IO can also be referred to as NIO, but it is not the NIO of Java programming. Although the English abbreviations are the same, they must not be confused. Java's NIO (New IO) class library does not belong to the NIO (Non-Blocking IO) model among the basic IO models, but to another model called the IO multiplexing model (IO Multiplexing).

2.2.3 IO multiplexing model (IO Multiplexing)

How to avoid the problem of polling and waiting in the synchronous non-blocking IO model? This is the IO multiplexing model.

In the IO multiplexing model, a new system call is introduced to query the readiness status of IO. In Linux systems, the corresponding system call is the select/epoll system call. Through this system call, a process can monitor multiple file descriptors (including socket connections). Once a descriptor is ready (usually the kernel buffer is readable/writable), the kernel can return the ready status to the application. Subsequently, the application makes corresponding IO system calls based on the ready status.

Currently, system calls that support IO multiplexing include select, epoll, etc. The select system call is supported on almost all operating systems and has good cross-platform features. epoll was proposed in the Linux 2.6 kernel and is an enhanced version of the select system call in Linux.

In the IO multiplexing model, through the select/epoll system call, a single application thread can continuously poll the readiness status of hundreds or thousands of socket connections; when one or more socket connections have IO ready, these ready states (ready events) are returned.

Give an example to illustrate the process of IO multiplexing model. Initiate a system call for the sys_read read operation of multiplexed IO. The process is as follows:

(1) Selector registration. In this mode, first, the target file descriptor (socket connection) that requires sys_read operation is registered in advance with the Linux select/epoll selector. The corresponding selector class in Java is the Selector class. Then, the polling process of the entire IO multiplexing model can be started.

(2) Polling of the ready status. Through the selector's query method, the IO readiness status of all the pre-registered target file descriptors (socket connections) is queried. Through the query system call, the kernel returns a list of ready sockets: whenever the data of any registered socket is ready, that is, the kernel buffer has data, the kernel adds that socket to the ready list and returns the ready event.

(3) After the user thread obtains the list of ready status, it initiates the sys_read system call based on the socket connection, and the user thread blocks. The kernel starts copying data from the kernel buffer to the user buffer.

(4) After the copy is completed, the kernel returns the result, and the user thread will be unblocked. The user thread will read the data and continue execution.

Note:

When the user process polls for IO ready events, it needs to call the select query method of the selector. The user process or thread that initiates the query is blocked. Of course, if a non-blocking overloaded version of the query method is used, the user process or thread that initiates the query will not be blocked, and the overloaded version will return immediately.

The sys_read system call process of the IO multiplexing model is shown in Figure 2-4.

Figure 2-4 sys_read system call process of the IO multiplexing model

Characteristics of the IO multiplexing model: IO of the IO multiplexing model involves two system calls, one is the system call for IO operations, and the other is the select/epoll ready query system call. The IO multiplexing model is built on the infrastructure of the operating system, that is, the kernel of the operating system must be able to provide multiplexed system calls select/epoll.

Similar to the NIO model, multiplexed IO also requires polling. The thread responsible for the select/epoll status query call needs to continuously perform select/epoll polling to find the socket connection that is ready for IO operations.

The IO multiplexing model is closely related to the synchronous non-blocking IO model. Specifically, every socket connection registered on the selector that can be queried is generally set to the synchronous non-blocking model. It's just that this is invisible to the user program.

Advantage of the IO multiplexing model: one selector query thread can handle tens of thousands of network connections at the same time, so the user program does not have to create a large number of threads or maintain them, which greatly reduces system overhead. Compared with the blocking IO model, where one thread maintains one connection, this is the biggest advantage of the IO multiplexing model.

As can be seen from the JDK source code, on Linux the Java NIO (New IO) components are implemented on top of the epoll system call (older JDKs used select). So what the Java NIO (New IO) components use is exactly the IO multiplexing model.

Disadvantage of the IO multiplexing model: in essence, the select/epoll system call is blocking and belongs to synchronous IO. After the read/write event is ready, the read or write still has to be carried out by a system call; in other words, this event-query process is blocking.

To remove the thread's blocking completely, the asynchronous IO model must be used.

2.2.4 Signal-driven IO model (SIGIO, Signal-Driven IO)

In the signal-driven IO model, the user thread avoids blocking on IO event queries by registering IO-event callback functions with the kernel.

Concretely, the user process registers a callback function with the kernel in advance; when an event occurs, the kernel uses the SIGIO signal to notify the process to run the callback. The user thread then continues to run, and inside the signal callback it invokes the IO read/write operations to perform the actual IO request.

The basic flow of signal-driven IO: through system calls, the user process registers with the kernel the owner process for the SIGIO signal and the callback function inside that process. After a kernel IO event occurs (for example, the data is in place in the kernel buffer), the user program is notified; the user process then copies the data into user space through the sys_read system call and executes its business logic.

In the signal-driven IO model, the kernel sends a SIGIO signal to the user process whenever an IO event occurs on the socket. It is therefore generally used for UDP transmission and is rarely used in TCP socket development, because SIGIO is generated far too frequently for TCP, and the SIGIO signal sent by the kernel does not tell the user process which IO event occurred.

On a UDP socket, however, SIGIO only needs to distinguish between the following two event types:

  1. A datagram has arrived on the socket
  2. An error has occurred on the socket

So when SIGIO arrives, the user process can easily decide what to do: if it is not an error, then a datagram has arrived.

For example, a signal-driven sys_read read operation proceeds as follows:

(1) Set the signal-handling callback function for the SIGIO signal.

(2) Set the owner process of the socket, so that when an IO event occurs on the socket, the system can deliver the SIGIO signal to the owner process, that is, the current process.

(3) Enable the socket's signal-driven IO mechanism, usually by using the F_SETFL operation of fcntl to enable the socket's O_NONBLOCK non-blocking flag and O_ASYNC asynchronous flag.

After these three steps, the user process has finished setting up the event callback handler. When an event occurs on the file descriptor, the SIGIO signal handler is triggered, and IO operations can then be performed on the target file descriptor. The three steps in detail:

Step 1: set the signal-handling callback function for SIGIO. On Linux this is done with sigaction(). Reference code:

// Register the callback function for the SIGIO signal
sigaction(SIGIO, &act, NULL); 

The sigaction function checks or modifies the handling action associated with the specified signal (it can do both at once). Its prototype is:

int sigaction(int signum, const struct sigaction *act,
                     struct sigaction *oldact);

Its parameters:

  1. signum specifies the type of signal to capture
  2. act specifies the new signal handling
  3. oldact outputs the previous signal handling (if it is not NULL)

This function is a basic Linux facility and is not provided specifically for signal-driven IO. In the signal-driven IO scenario, signum is the constant SIGIO.

Step 2: set the owner process of the socket, so that when an IO event occurs on the socket, the system can deliver the SIGIO signal to the owner process, that is, the current process. The owner is the process or process group that receives the notification signal when IO becomes possible on the file descriptor.

Setting the owner process for a file descriptor's IO events is done with the F_SETOWN operation of fcntl(). Reference code:

fcntl(fd,F_SETOWN,pid)

When pid is a positive integer, it is a process ID. When pid is a negative integer, its absolute value is a process-group ID.

Step 3: enable the socket's signal-driven IO mechanism, usually by using the F_SETFL operation of fcntl to enable the socket's O_NONBLOCK non-blocking flag and O_ASYNC asynchronous flag. Reference code:

int flags = fcntl(socket_fd, F_GETFL, 0);
    flags |= O_NONBLOCK;  // set non-blocking
    flags |= O_ASYNC;     // set asynchronous (signal-driven)
    fcntl(socket_fd, F_SETFL, flags );

This step is done with the F_SETFL operation of fcntl(); O_NONBLOCK is the non-blocking flag, and O_ASYNC is the signal-driven IO flag.

Reference code for developing a UDP application with signal-driven IO (C code):

int socket_fd = 0;

// Handler for the IO event
void do_sometime(int signal) {
    struct sockaddr_in cli_addr;
    int clilen = sizeof(cli_addr);
    int clifd = 0;

    char buffer[256] = {0};
    int len = recvfrom(socket_fd, buffer, 256, 0, (struct sockaddr *)&cli_addr,
                       (socklen_t *)&clilen);
    printf("Mes:%s", buffer);

    // Echo the data back
    sendto(socket_fd, buffer, len, 0, (struct sockaddr *)&cli_addr, clilen);
}

int main(int argc, char const *argv[]) {
    socket_fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sigaction act;
    act.sa_flags = 0;
    act.sa_handler = do_sometime;

    // Register the callback function for the SIGIO signal
    sigaction(SIGIO, &act, NULL);

    struct sockaddr_in servaddr;
    memset(&servaddr, 0, sizeof(servaddr));

    servaddr.sin_family = AF_INET;
    servaddr.sin_port = htons(8888);
    servaddr.sin_addr.s_addr = INADDR_ANY;

    // Step 2: set the owner of the file descriptor,
    // i.e. the process that will receive SIGIO for socket_fd
    fcntl(socket_fd, F_SETOWN, getpid());

    // Step 3: enable signal-driven IO on the socket
    int flags = fcntl(socket_fd, F_GETFL, 0);
    flags |= O_NONBLOCK;  // set non-blocking
    flags |= O_ASYNC;     // set asynchronous (signal-driven)
    fcntl(socket_fd, F_SETFL, flags );

    bind(socket_fd, (struct sockaddr *)&servaddr, sizeof(servaddr));
    while (1) sleep(1);  // loop forever
    close(socket_fd);
    return 0;
}

When an IO event occurs on the socket, the callback function is executed, and inside the callback the user process performs the data copy.

Advantage of signal-driven IO: the user process is not blocked while waiting for data, which improves its efficiency. Specifically, in the signal-driven IO model, the application uses a signal-driven socket and installs a signal handler, and the process keeps running without blocking.

Disadvantages of signal-driven IO:

  1. When a large number of IO events occur, the signal queue may overflow because they cannot be handled in time.
  2. Signal-driven IO is useful for UDP sockets, but for TCP it is nearly useless: the conditions that cause a SIGIO notification are so numerous that the cost of telling the IO events apart is too high.
  3. Signal-driven IO can be viewed as a kind of asynchronous IO, roughly the system calling back a user function. However, its asynchrony is not complete. Why? Signal-driven IO is asynchronous only in the IO-event notification stage; in the second stage, copying the data from the kernel buffer to the user buffer, the user process is blocked and synchronous.

For completely asynchronous IO, the fifth IO model is needed: the asynchronous IO model.

2.2.5 Asynchronous IO model (Asynchronous IO)

The asynchronous IO model (Asynchronous IO, abbreviated AIO). The basic flow of AIO: the user thread registers an IO operation with the kernel through a system call. After the entire IO operation (including data preparation and data copy) is complete, the kernel notifies the user program, and the user then performs the follow-up business operations.

In the asynchronous IO model, the user program does not need to block at any point during the kernel's entire data handling, including reading the data from the physical network device (the network card) into the kernel buffer and copying it from the kernel buffer into the user buffer.

The process of the asynchronous IO model is shown in Figure 2-5.

Figure 2-5 Process of the asynchronous IO model

For example, an asynchronous-IO sys_read system call proceeds as follows:

(1) When the user thread initiates the sys_read system call (which can be understood as registering a callback function), it can immediately start doing other things without blocking the user thread.

(2) The kernel begins the first stage of IO: preparing data. When the data is ready, the kernel copies the data from the kernel buffer to the user buffer.

(3) The kernel will send a signal (Signal) to the user thread, or call back the callback method registered by the user thread, telling the user thread that the sys_read system call has been completed and the data has been read into the user buffer.

(4) The user thread reads the data in the user buffer and completes subsequent business operations.

Characteristics of the asynchronous IO model: During the two stages of the kernel waiting for data and copying data, the user thread is not blocked. The user thread needs to receive the event that the kernel's IO operation is completed, or the user thread needs to register a callback function for the completion of the IO operation. Because of this, asynchronous IO is sometimes called signal-driven IO.

Disadvantage of the asynchronous IO model: the application only needs to register for and receive events, with all the remaining work left to the operating system; that is, the underlying kernel must provide the support.

Theoretically, asynchronous IO is truly asynchronous input and output, and its throughput is higher than the throughput of the IO multiplexing model. For now, true asynchronous IO is implemented through IOCP under Windows systems. Under the Linux system, the asynchronous IO model was only introduced in version 2.6, and JDK's support for it is currently incomplete, so asynchronous IO has no obvious advantage in performance.
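As a hedged illustration of the register-then-be-notified pattern, here is a minimal sketch using the POSIX AIO interface in C (assumptions: Linux/glibc, link with -lrt; note that glibc implements POSIX AIO with user-space helper threads, so this is not the kernel-native AIO or io_uring discussed later).

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

ssize_t aio_read_once(int fd, char *buf, size_t len) {
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0)                   /* submit the request; returns immediately */
        return -1;

    while (aio_error(&cb) == EINPROGRESS) {
        /* the calling thread is free to do other work here */
    }
    return aio_return(&cb);                  /* bytes read, as a blocking read would return */
}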

Most high-concurrency server-side programs are generally based on Linux systems. Therefore, the current development of such high-concurrency network applications mostly uses the IO multiplexing model. The famous Netty framework uses the IO multiplexing model instead of the asynchronous IO model.

2.2.6 Differences between synchronous, asynchronous, blocking and non-blocking

First of all, synchronous and asynchronous are aimed at the direction of the interaction process between the application (such as Java) and the kernel.

For synchronous IO operations, the initiator is the application and the receiver is the kernel.

In synchronous IO, the application process initiates an IO operation and blocks and waits, or polls whether the IO operation is completed.

For asynchronous IO operations, the application does its own thing after registering the callback function in advance. The IO is handed over to the kernel for processing. After the kernel completes the IO operation, the callback function of the process is started.

Blocking and non-blocking focus on the waiting state of the user process during the IO process: the former must block and wait for the IO operation, while the latter does not. Synchronous blocking IO, synchronous non-blocking IO, and multiplexed IO are all synchronous IO.

Asynchronous IO must be non-blocking, so there is no such thing as asynchronous blocking and asynchronous non-blocking. True asynchronous IO requires deep involvement of the kernel. The user process in asynchronous IO does not consider the execution of IO at all. The IO operation is mainly left to the kernel to complete, and it only waits for a completion signal.

2.3 Support millions of concurrent connections through reasonable configuration

The focus of this chapter is the underlying principle of high concurrent IO. The high-concurrency IO model has been introduced in a simple and easy-to-understand manner before. However, even if the most advanced model is adopted, there is no way to support millions of concurrent network connections without reasonable operating system configuration. In the production environment, everyone uses the Linux system. Therefore, unless otherwise specified in the subsequent text, the operating systems referred to are Linux systems.

Note:

In the Linux environment, everything is represented by a file. Devices are files, directories are files, and sockets are also files. The one and only interface for referring to the objects being handled is the file. When an application reads/writes a file, it first has to open it; the essence of opening is establishing a connection between the process and the file, and the handle's role is to uniquely identify this connection. Subsequent reads/writes of the file are expressed through this handle. Finally, closing the file is really releasing the handle, that is, breaking the connection between the process and the file.

The configuration involved here is the limit on the number of file handles in the Linux operating system. In a production environment Linux system, it is basically necessary to lift the limit on the number of file handles. The reason is that the Linux system default value is 1024, which means that a process can accept up to 1024 socket connections. This is not enough.

The principle of this book is: start with the basics.

File handle, also called file descriptor. In the Linux system, files can be divided into: ordinary files, directory files, link files and device files. File descriptor (File Descriptor) is an index created by the kernel in order to efficiently manage opened files. It is a non-negative integer (usually a small integer) used to refer to the opened file. All IO system calls, including socket read and write calls, are completed through file descriptors.

Under Linux, by calling the ulimit command, you can see the maximum number of file handles that a process can open. The specific method of using this command is:

ulimit -n

The ulimit command is a command used to display and modify some basic limits of the current user process. The -n option is used to reference or set the limit value of the current number of file handles. The Linux system default value is 1024.

Theoretically, 1024 file descriptors are enough for most applications (such as Apache and desktop applications). However, it is far from enough for some high-concurrency applications with a large user base. A high-concurrency application often faces hundreds of thousands, millions, or even hundreds of millions of concurrent connections like Tencent QQ.

What are the consequences if there are not enough file handles? When the number of file handles opened by a single process exceeds the upper limit of the system configuration, the error message "Socket/File: Can't open so many files" will be issued.

Therefore, for high-concurrency, high-load applications, this system parameter must be adjusted to suit scenarios that handle a large number of connections concurrently. The limit can be set with ulimit as follows:

ulimit  -n  1000000

In the above command, the larger the setting value of n, the larger the number of file handles that can be opened. It is recommended to execute this command as the root user.

There is a flaw in using the ulimit command. This command can only modify some basic restrictions of the current user environment and is only valid in the current user environment. In other words, the modification is effective while the current terminal tool is connected to the current shell; once the user session is disconnected, or the user exits Linux, its value will change back to the system default 1024. Moreover, after the system is restarted, the number of handles will return to the default value.

The ulimit command can only be used for temporary modification. If you want to permanently save the maximum number of file descriptors, you can edit the /etc/rc.local startup file and add the following content to the file:

ulimit -SHn 1000000

The example above adds the -S and -H options. -S sets the soft limit and -H the hard limit. The hard limit is the actual ceiling: at most 1,000,000, no more. The soft limit is the threshold at which the system issues a warning; beyond it, the kernel emits warnings.

With the ulimit command, an ordinary user can raise the soft limit up to the hard limit. Changing the hard limit requires root privileges.

To remove the Linux system's limit on the maximum number of open files once and for all, edit Linux's limits configuration file /etc/security/limits.conf and add the following:

* soft nofile 1000000
* hard nofile 1000000

soft nofile is the soft limit; hard nofile is the hard limit.

As a practical example, when installing and using ElasticSearch, the currently very popular distributed search engine, you basically have to modify this file to raise the maximum file-descriptor limit. Likewise, when running Netty in production, it is best to modify /etc/security/limits.conf to raise the file-descriptor limit.

Besides raising the per-process file-handle limit, the kernel's basic global file-handle limit also needs to be raised, by editing the /etc/sysctl.conf configuration file. Reference configuration:

fs.file-max = 2048000
fs.nr_open = 1024000

fs.file-max is the system-level upper limit on the number of file handles that can be opened, that is, the global handle limit. It limits the whole system and is not per user.

fs.nr_open specifies the limit on the number of file handles a single process can open; nofile is constrained by this parameter, and the nofile value cannot exceed the fs.nr_open value.

2.4 Chapter summary

The principle of this book is: start with the basics. This chapter embodies that principle thoroughly.

This chapter focuses on three topics: the two stages of low-level IO operations, the most basic IO models, and the operating system's low-level support for high concurrency.

These IO models basically cover the main IO processing models in use today. In theory, going from blocking IO to asynchronous IO, the later the model, the less the blocking and the better the efficiency. Of these IO models, all but the last belong to synchronous IO, because the actual IO operation still blocks the application thread.

Only the last asynchronous IO model is the real asynchronous IO model. Unfortunately, the underlying implementation of the current Linux operating system or JDK is not yet perfect. However, through excellent application layer frameworks such as Netty, server-side applications that can support high concurrency (such as millions of connections) can also be developed based on the IO multiplexing model.

Finally, I would like to emphasize that this chapter is a theoretical lesson, which is relatively abstract, but you must understand it. After understanding these theories, you will get twice the result with half the effort by studying the following chapters.

The importance of layered decoupling

In Nien's Crazy Maker Circle community (50+), people are often confused by IO models, the Reactor model, and synchrony versus asynchrony.

Drawing on decades of experience, Nien gives you a simple summary:

  • It must be layered, just like the WEB application architecture must be layered.
  • Threading model and IO model should be viewed separately and cannot be confused.

Many friends think that the underlying IO model of Reactor is NIO. Let's take a look at the Netty source code. Netty reactor supports various IO models, including BIO.

So, be sure to look at it in layers.

Nien divides the thread model and IO model into three layers: application layer, framework layer, and OS layer.

The details are shown in the figure below:

Netty's Reactor mode corresponds to the thread model, not the IO model.

At the IO model level, Tomcat also uses NIO; do not assume Tomcat uses BIO. Most HTTPClient client components also use NIO rather than the BIO model.

At the thread-model level, many HTTPClient components either do not use the Reactor model at all, or they do provide a reactive Reactor thread model but our business programs do not use it: our business code still goes through the API of their synchronous blocking thread model.

Let’s talk about blocking and non-blocking, synchronous and asynchronous

Next, let’s return to the concepts that are particularly confusing in IO: blocking and synchronization, non-blocking and asynchronous.

Note that synchronous io and asynchronous io are more discussed at the OS operating system layer.

Here we can summarize the entire process into two stages:

  • Data preparation stage: a network packet arrives at the network card and is copied into memory via DMA (a dedicated helper chip); then, after a hard interrupt and a soft interrupt, the kernel thread ksoftirqd processes it through the kernel protocol stack, and the data is finally delivered into the receive buffer of the kernel socket.
  • Data copy stage: once the data has arrived in the kernel socket's receive buffer, it lives in kernel space; it has to be copied into user space before the application can read it.

Difficulty 1: in the IO models, what is the difference between blocking and non-blocking?

Back to the IO models, and the difference between blocking and non-blocking.

The difference between blocking and non-blocking mainly shows up in the first stage: the data preparation stage.

Before discussing the difference, assume a business scenario:

In a data-reading scenario, the application issues the read system call.

At this point the thread switches from user mode to kernel mode, and the kernel tries to read network data from the kernel socket's receive buffer. There are two cases:

  • If the receive buffer has data, the in-memory data copy is performed.
  • If the receive buffer has no data, what then?

Handling approach one: blocking

If the kernel socket's receive buffer has no data at this point, the thread blocks and waits until the socket receive buffer has data.

Once data is available, it is then copied from kernel space to user space, and the read system call returns.

From the figure we can see: the characteristic of blocking is that it waits in both the first stage and the second stage.

Handling approach two: non-blocking (polling)

In the data preparation stage, if the receive buffer has no data, what then?

Handling approach two: non-blocking (polling):

Figure: non-blocking IO

From the figure above we can see: the characteristic of non-blocking is that it does not wait in the first stage, but it still waits in the second stage.

The difference between blocking and non-blocking

  • In the first stage, when there is no data in the socket receive buffer: in blocking mode, the IO thread keeps waiting and can do nothing else; note that thread resources are precious, and the more IO connections there are, the sooner thread resources are exhausted. In non-blocking mode, the IO thread does not wait: the system call returns immediately with the error flag EWOULDBLOCK. To actually read data, the IO thread in non-blocking mode has to keep polling; of course, it can also do other things first and poll again later.
  • In the second stage, when there is data in the socket receive buffer, blocking and non-blocking behave the same: the CPU data copy takes place, the data is copied from kernel space to user space, and then the system call returns.

Difficulty 2: in the IO models, what is the difference between synchronous and asynchronous?

The difference between synchronous and asynchronous mainly lies in the second stage: the data copy stage.

What is done in the second stage of data copying?

Data copy belongs to CPU copy, which mainly copies the data from the kernel space Socket receiving buffer to the byte array in user space, such as the array in the nio buffer. Applications can then manipulate this data.

The difference between synchronous and asynchronous is that the initiator of data copy is different.

After the data is ready and has reached the socket receive buffer, the user thread learns of the IO event (there is data to read) through polling.

Next, kernel code has to run to copy the data.

The key question is: who initiates this copy work?

  • io thread initiated
  • Kernel thread initiated

Processing method one: io thread initiates data copy

For synchronous IO operations, the initiator is the application and the receiver is the kernel.

In synchronous IO, the application process initiates an IO data copy operation, copying the data from the kernel space Socket receiving buffer to a byte array in user space, such as the array in the nio buffer, or polling whether the IO data copy is completed.

Note that this is at the operating-system level: the thread that completes this operation may be the IO thread running in kernel mode, or another kernel thread. In the operating system, a thread is just a task structure. Different operating systems, or different versions of the same operating system, handle user tasks and kernel tasks differently.

We do not have to worry here about which of the two it is:

  • either the IO thread (in kernel mode) ends up doing the CPU copy itself,
  • or the IO thread stays blocked while another kernel thread does the copy and then wakes the IO thread when it is finished.

As architects/developers at the application layer, we don’t have to worry about it here.

Here, it is assumed that the bottom layer uses the second solution, the io thread is blocked, another kernel thread copies, and then the io thread is awakened.

In any case, for synchronous IO, the io thread initiates the data copy and assumes the responsibility of the data copy initiator. This is beyond doubt.

so,

Select and epoll under Linux both initiate data replication by the io thread, so they are all synchronous IO.

kqueue on macOS also has the IO thread initiate the data copy, so it too is synchronous IO.

Processing method two: Kernel thread initiates data copy

Processing method two: The characteristic is that the kernel thread initiates the data copy, and the kernel thread performs the second phase of the data copy operation. This is the asynchronous mode.

When the kernel completes the data copy operation, it will call back the data to the user thread.

So in asynchronous mode, the key is callbacks.

The io thread only needs to register the callback processing function of the data and it will be OK.

In asynchronous mode, the data preparation phase and data copy phase are completed by the kernel and will not cause any blocking to the application program.

Among the currently popular operating systems, IOCP in Windows is truly asynchronous IO, and its implementation is also very mature.

The point is, Windows is rarely used as a server.

In Linux, which is commonly used as a server, the asynchronous IO mechanism is not mature enough, and the performance improvement compared with NIO is not obvious enough.

Linux kernel version 5.1 introduced a new asynchronous IO library io_uring by Facebook guru Jens Axboe, which improved some performance issues of the original Linux native AIO.

The performance of io_uring is much improved compared to Epoll and the previous native AIO, which will be introduced later in this Bible.
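As a hedged preview (the full treatment comes in the later io_uring chapter), here is a minimal sketch using the liburing helper library in C (assumptions: a recent Linux kernel with io_uring support, liburing installed, link with -luring). One read request is submitted to the kernel and its completion is harvested from the completion queue; the data copy is driven by the kernel, which is what makes the model asynchronous.

#include <liburing.h>
#include <sys/types.h>

ssize_t uring_read_once(int fd, char *buf, unsigned len) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)   // set up submission/completion queues
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, len, 0);   // describe the read request
    io_uring_submit(&ring);                     // hand it to the kernel

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);             // wait for the completion event
    ssize_t res = cqe->res;                     // bytes read, or -errno on failure
    io_uring_cqe_seen(&ring, cqe);              // mark the completion as consumed

    io_uring_queue_exit(&ring);
    return res;
}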

Asynchronous IO library io_uring, please refer to the following chapters for details: Overturn nio, the king of io_uring is here! !

The underlying principle of blocking IO (BIO)

Next, Nien will lead everyone to start with the simplest bio and penetrate the underlying nio step by step.

Data reading process: DMA writes the received data into the socket kernel buffer

First, let's look at the data reading process at the macro level.

As mentioned earlier, the macro-level data reading process is: DMA writes the data received by the network card into the socket kernel buffer.

Stage ①: the network card receives the data from the network cable

Stage ②: the DMA hardware circuit transfers the data

Stage ③: finally the data is written to a kernel-space memory address

This process involves hardware-related knowledge such as DMA transmission and IO channel selection.

BIO application layer network programming

This is the most basic network programming code.

The following is the pseudo code of the server side of the application layer. First, create a socket object, call bind, listen, accept in sequence, and finally call recv to receive data.

// Create a socket
int s = socket(AF_INET, SOCK_STREAM, 0);
// Bind
bind(s, ...)
// Listen
listen(s, ...)
// Accept a client connection
int c = accept(s, ...)
// Receive data from the client
recv(c, ...);
// Print the data
printf(...)

The interaction process between the server and the client in the application layer is roughly as follows:

In order to facilitate the subsequent introduction, let's look at several basic concepts in the operating system dimension.

From here on, you will know the importance of the operating system course.

When we were in college, we were studying operating systems in a daze, thinking that it was of little use. However, it turned out to be of great use, but we just didn't know it.

This is the fearlessness of the ignorant.

operating system work queue

In order to support multitasking, the operating system implements the process scheduling function and divides the process into several states such as "running" and "waiting".

  • The running state is the state in which the process has obtained the right to use the CPU and is executing code;
  • The waiting state is a blocking state. For example, when the above program runs to recv, the program will change from the running state to the waiting state, and then change back to the running state after receiving the data.

The operating system will execute processes in each running state in a time-sharing manner. Because of its high speed, it looks like it is executing multiple tasks at the same time.

The computer in the figure below is running three processes, A, B, and C. Process A runs the basic network program above. At the beginning, all three processes are referenced by the operating system's work queue and are in the running state, executed in a time-shared manner.

The underlying socket structure of the operating system

The socket structure, simply understood, is an extended structure of a file descriptor.

The socket structure has some basic attributes of the file descriptor, such as the file descriptor id. There are also some extended attributes, such as sending cache, receiving cache, and a lot more.

When process A executes the statement that creates a socket, the operating system creates a socket structure (an object-like kernel structure) managed by the file system.

This socket object contains

  • a send buffer
  • a receive buffer
  • a wait queue

The socket's wait queue is a very important structure: it points to all the processes that need to wait for events on this socket.

Diagram of the socket structure

Creation of the socket

After the server thread calls the accept system call, it blocks. When a client connects and the TCP three-way handshake completes, the kernel creates a corresponding socket as the kernel interface for communication between the server and the client.

From the Linux kernel's point of view everything is a file, and sockets are no exception. When the kernel creates a socket, it puts it into the list of files opened by the current process and manages it there.

So what do the kernel data structures that manage a process's list of open files look like?

struct task_struct is the kernel data structure that represents a process/thread; it contains all of the process's information.

Only the attributes related to file management are listed here.

All files opened within a process are organized and managed through an array, fd_array. The array index is the file descriptor we often talk about, and the array stores the corresponding file data structure, struct file.

Every time a file is opened, the kernel creates a struct file for it and finds a free slot in fd_array to assign to it; the corresponding array index is the file descriptor we use in user space.

For any process, by default file descriptor 0 is stdin (standard input), 1 is stdout (standard output), and 2 is stderr (standard error output).

As mentioned earlier, a socket is also a file descriptor. Roughly, a socket is structured as follows:

The file descriptor, which wraps the file's metadata, has struct file as its kernel data structure.

The file descriptor has a private_data pointer that points to the concrete socket structure.

The file_operations field in struct file defines the file's operation functions. Different file types have different file_operations; for the socket file type, file_operations points to socket_file_ops.

Read/write and other system calls that we issue on a socket from user space first call, inside the kernel, the functions in socket_file_ops pointed to by the socket's struct file.

static const struct file_operations socket_file_ops = {
 .owner = THIS_MODULE,
 .llseek = no_llseek,
 .read_iter = sock_read_iter,
 .write_iter = sock_write_iter,
 .poll =  sock_poll,
 .unlocked_ioctl = sock_ioctl,
#ifdef CONFIG_COMPAT
 .compat_ioctl = compat_sock_ioctl,
#endif
 .mmap =  sock_mmap,
 .release = sock_close,
 .fasync = sock_fasync,
 .sendpage = sock_sendpage,
 .splice_write = generic_splice_sendpage,
 .splice_read = sock_splice_read,
};

For example, when a write operation is issued on a socket, the first function called in the kernel is sock_write_iter, defined in socket_file_ops; a read operation on a socket correspondingly goes to sock_read_iter.

static ssize_t sock_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct file *file = iocb->ki_filp;
	struct socket *sock = file->private_data;
	struct msghdr msg = {.msg_iter = *from,
			     .msg_iocb = iocb};
	ssize_t res;
......
	res = sock_sendmsg(sock, &msg);
	*from = msg.msg_iter;
	return res;
}

The socket kernel structures

When writing a network program, we first create a socket and then bind and listen on it.

This socket is called the listening socket.

Note that the listening socket is not the socket used for data transfer. They are two kinds of socket: one is called the listening socket, the other the data-transfer socket.

When we call accept, the kernel creates a new socket based on the listening socket, dedicated to network communication with the client: this is the data-transfer socket.

After the transfer socket is created, the set of socket operation functions (inet_stream_ops) from the listening socket is assigned to the ops field of the new socket.

const struct proto_ops inet_stream_ops = {
    .bind = inet_bind,
    .connect = inet_stream_connect,
    .accept = inet_accept,
    .poll = tcp_poll,
    .listen = inet_listen,
    .sendmsg = inet_sendmsg,
    .recvmsg = inet_recvmsg,
    ......
}

The kernel then creates and initializes a struct file for the connected socket and assigns the socket file-operation set (socket_file_ops) to the f_ops pointer in struct file. Then the file pointer in struct socket is pointed at this newly allocated struct file.

The kernel maintains two queues:

  • one for connections that have completed the TCP three-way handshake and are in the established state; in the kernel this is icsk_accept_queue
  • one half-connection queue for connections that have not yet completed the three-way handshake and are in the syn_rcvd state

Then socket->ops->accept is called, which is actually inet_accept. This function checks whether icsk_accept_queue contains an already established connection; if so, it takes the ready-made struct sock directly from icsk_accept_queue and assigns that struct sock object to the sock pointer in struct socket.

struct sock is a very central kernel object inside struct socket. It defines the receive queue, send queue, wait queue, data-ready callback pointer, and the set of kernel protocol-stack operation functions mentioned when describing the packet receive/send flow.

Then, based on the protocol parameter of the sock_create system call issued when the socket was created (for TCP the value is SOCK_STREAM), the operation-method sets defined for TCP, inet_stream_ops and tcp_prot, are looked up and assigned to socket->ops and sock->sk_prot respectively.

The socket-facing operation interface is defined in the inet_stream_ops function set pointed to by the ops pointer of struct socket; it provides the interface upward, to the outside, to the application layer.

The operation interface between the socket and the kernel protocol stack is defined on the sk_prot pointer in struct sock, which here points to the tcp_prot set of protocol operation functions; this faces downward, inward, to the protocol stack.

struct proto tcp_prot = {
    .name      = "TCP",
    .owner      = THIS_MODULE,
    .close      = tcp_close,
    .connect    = tcp_v4_connect,
    .disconnect    = tcp_disconnect,
    .accept      = inet_csk_accept,
    .keepalive    = tcp_set_keepalive,
    .recvmsg    = tcp_recvmsg,
    .sendmsg    = tcp_sendmsg,
    .backlog_rcv    = tcp_v4_do_rcv,
    ......
}

A system IO call issued on a socket, as mentioned earlier, first goes through the file_operations set in the socket's struct file inside the kernel, then calls the inet_stream_ops socket operation functions pointed to by ops in struct socket, and finally reaches the tcp_prot kernel protocol-stack operation functions pointed to by sk_prot in struct sock.

Structure of a system IO call

  • The sk_data_ready function pointer in the struct sock object is set to sock_def_readable; the kernel calls this back when data on the socket becomes ready.
  • The wait queue sk_wq in struct sock stores the fds of the processes blocked in system IO calls, together with the corresponding callback functions. Remember this spot; it will come up again later.

Once the core kernel objects struct file, struct socket, and struct sock have been created, the last step is to put the struct file corresponding to the socket object into the process's open-file list fd_array. The accept system call then returns the socket's file descriptor fd to the user program.

Process A moves from the work queue to the socket's wait queue

In the blocking IO scenario, when the user process issues a system IO call such as read, it checks, in kernel mode, whether the corresponding socket receive buffer has data:

  • If the socket receive buffer has data, the data is copied to user space and the system call returns.
  • If the socket receive buffer has no data, the user process yields the CPU and enters the blocked state.
  • When data arrives in the receive buffer, the user process is woken up, goes from blocked to ready, and waits for CPU scheduling.

Here we focus on the case where the socket receive buffer has no data, so the user process yields the CPU and enters the blocked state.

When the user process yields the CPU and blocks, the operating system moves process A from the work queue to the socket's wait queue.

The process/thread blocks. (Note: inside the kernel, processes and threads are the same thing; both are task structures.)

  • First, when we issue the read system call on the socket in the user process, the user process switches from user mode to kernel mode.
  • In the process's struct task_struct, fd_array is found, and the struct file corresponding to the socket's file descriptor fd is located; the file operation functions in struct file (file_operations) are invoked, and the read system call corresponds to sock_read_iter.
  • In sock_read_iter, the struct socket pointed to by struct file is found and socket->ops->recvmsg is called; as we know, this calls inet_recvmsg defined in the inet_stream_ops set.
  • In inet_recvmsg, struct sock is found and sock->sk_prot->recvmsg is called; this is the tcp_recvmsg function defined in the tcp_prot set.

It is inside the kernel function tcp_recvmsg that the user process gets blocked. The concrete flow is as follows:

The flow above covers Linux kernel structures and the related flow; you do not need to dig into every detail.

In short, the process task enters the sk_wq wait queue and is set to the interruptible state (TASK_INTERRUPTIBLE).

sk_wait_event is called to yield the CPU, and the process goes to sleep.

The CPU schedules other processes; the current process has left the CPU's work queue.

Side note: a blocked thread does not consume CPU resources

In the figure above, since only processes B and C remain in the work queue, the scheduler runs these two processes in turn;

it will not run process A's code.

So process A is blocked: it does not execute any further code and does not consume CPU resources.

So what exactly is the CPU's work queue?

The CPU's work queue

This is operating-system material — go back and check your OS textbook.

The relationship between threads, the CPU and the work queue

This is operating-system material — go back and check your OS textbook.

The full path of receiving data in the kernel

When the socket receives data, the operating system moves the process on that socket's wait queue back onto the work queue; the process becomes runnable and continues executing its code.

Since the socket's receive buffer now contains data, recv can return the received data.

The full path of receiving data in the kernel:

  • the machine receives data sent by the peer (step ①)
  • the data is transferred from the NIC into memory (step ②)
  • the NIC then notifies the CPU via an interrupt that data has arrived, and the CPU runs the interrupt handler (step ③)
  • the interrupt handler first writes the network data into the receive buffer of the corresponding Socket (step ④)
  • and then wakes up process A (step ⑤), putting it back onto the work queue.

The wake-up process is shown in the figure below:

Question 1: how does the kernel know which socket the received network data belongs to?

A socket's packets carry (source IP, source port, protocol, destination IP, destination port).

In general, the destination IP and destination port are enough to identify which socket the received data belongs to.

What if the destination IP and destination port are the same?

When multiple clients have established connections with the same server, the kernel holds multiple sockets

and allocates multiple fds for them. Incoming data then cannot be matched to a socket by destination port alone; the source IP and source port are also needed to determine which socket it belongs to.

Question 2: how does the kernel monitor multiple sockets at the same time?

How does the kernel monitor multiple sockets at once? A dedicated thread polls all the sockets, and when some socket has data it notifies the user IO thread.

The classic solution today: I/O multiplexing.

As the Linux kernel code evolved, multiplexing was supported by three calls in turn:

  • SELECT
  • POLL
  • EPOLL

IO multiplexing

How can we handle more connections with as few threads as possible and improve performance?

A dedicated thread polls all the sockets, and when some socket has data it notifies the user IO thread.

The classic solution today: I/O multiplexing.

IO multiplexing involves two ideas:

  • "Multi-way": use as few threads as possible to handle as many connections as possible — the "ways" are the connections to be handled.
    BIO is blocking: in the blocking IO model, every connection needs a dedicated thread for the reads and writes on that connection,
    so BIO is "single-way".
  • "Multiplexing": use as few threads, and as little system overhead, as possible — what is reused is the thread.
    Multiplexing here means using limited resources, for example one thread or a fixed number of threads, to handle the read/write events on many connections.
    In other words, in the IO multiplexing model, many connections can share (reuse) one thread that handles the reads and writes on all of them.

How can a single thread handle the read/write events of many connections?

IO multiplexing performs one big decoupling: it separates IO events (the "ready to operate" state) from the IO operations themselves.

A dedicated system call is split out to query IO events (the readiness state),

while the remaining system calls — read, write and so on — perform the actual IO only when an event is present.

As the Linux kernel code evolved, multiplexing was supported by three system calls in turn:

  • the select system call
  • the poll system call
  • the epoll system call

The select system call

Both the Linux and Windows kernels provide a system call that collapses polling the IO events of up to 1024 file descriptors into a single call, with the polling done in kernel space.

In the code below, an array fds is first prepared that holds all the sockets to be monitored.

Then select is called. If none of the sockets in fds has data, select blocks until some socket receives data; select then returns and the process is woken up.

int fds[] = ...;   // the sockets to monitor

while (1) {
    // the process is put on the wait queue of every socket in fds;
    // when select returns, the process has been woken up
    int n = select(..., fds, ...);

    for (int i = 0; i < fds.count; i++) {
        if (FD_ISSET(fds[i], ...)) {
            // handle the data on fds[i]
        }
    }
}

The user can then iterate over the fds array and use FD_ISSET to find which sockets actually received data, and handle them.

The core steps when using select:

  • First prepare an array fds that holds all the sockets to be monitored.
  • Then call select. If none of the sockets in fds has data, select blocks until some socket receives data; select then returns and the process is woken up.
  • The user iterates over fds, uses FD_ISSET to find which sockets received data, and handles them.

Parameters of the select system call

The prototype of select is:

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
  • The first parameter nfds tells the kernel how far to scan: it is the highest-numbered file descriptor to be checked, plus one. select can handle at most 1024 descriptors, but scanning all 1024 slots every time would be wasteful when far fewer are actually in use, so the caller limits the scan range with this parameter.

The maximum number is 1024. If you need to modify this number, you need to recompile the Linux kernel source code . The default is 1024 for 32-bit machines and 2048 for 64-bit machines.

  • The 2nd, 3rd, and 4th parameters are readfds, writefds, and exceptfds respectively; references of the fd_set type are passed in.
  • fd_set *readset: the set of file descriptors to check for readable events.
  • fd_set *writeset: the set of file descriptors to check for writable events.
  • fd_set *exceptset: the set of file descriptors to check for exception events.

When select returns, the kernel has checked every socket fd; descriptors on which no event occurred have been removed from the corresponding set.

  • If no read event occurred, the fd is removed from the fd_set passed as the second parameter (readfds).
  • If no write event occurred, the fd is removed from the fd_set passed as the third parameter (writefds).
  • If no exception event occurred, the fd is removed from the fd_set passed as the fourth parameter (exceptfds).

Note that copies should be passed here:

Pass in copies of the actual readfds, writefds, and exceptfds rather than the original references, because the kernel modifies these sets in place, and the original record of which sockets were being monitored would otherwise be lost.

The last parameter of select, timeout, is the waiting time, which is divided into three scenarios:

  • Passing in 0 means non-blocking,
  • Passing in >0 means waiting for a certain period of time,
  • Passing in NULL means blocking until a socket is ready.

Functions for common operations on the BitMap structure

The earliest fd_set was an integer array.

FD_SETSIZE is defined as 1024. An integer occupies 4 bytes, i.e. 32 bits, so an array of 1024 / 32 = 32 integers was used to represent the set of 1024 file descriptors.

In later versions the fd_set file descriptor set was optimized: the integer array was replaced by a bitmap.

So the fd_set here went from a plain file-descriptor array to a BitMap structure.

After the kernel traverses the fd set and finds the IO-ready fds, it marks them in the BitMap (the bit corresponding to a ready fd is set to 1). When the kernel is done, the modified fd set is handed back to the user thread.

The user thread must then traverse the fd set again to find the IO-ready fds, and only then issue the real read or write calls.

Below are the APIs used when working with the fd set:

void FD_CLR(int fd, fd_set *set);   // clear one bit; fd is the bit index
int  FD_ISSET(int fd, fd_set *set); // test whether a bit is set; fd is the bit index
void FD_SET(int fd, fd_set *set);   // set one bit in the bitmap; fd is the bit index
void FD_ZERO(fd_set *set);          // clear every bit in the bitmap; usually used for initialization
  • FD_CLR() clears one bit of the bitmap (fd_set); it is used, for example, to remove an fd from the fd_set when a client exits abnormally.
  • FD_ISSET() tests whether a bit is set, i.e. whether an fd is in the fd_set.
  • FD_ZERO() clears every bit of the fd_set; it is generally used for initialization.
  • FD_SET() adds an fd to the fd_set; it is used when a new client connection joins.

Note: before every call to select, the file descriptor sets must be rebuilt with FD_ZERO and FD_SET, because the kernel modifies the sets in place.

The file descriptor "array" here is really a BitMap indexed by fd: a value of 1 at a given index means a read/write event occurred on that fd, 0 means nothing happened on it.
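A concrete sketch of that rule (plain C for illustration; the fds array, its count n and the maxfd bookkeeping are assumptions, not code from the book): the read set is rebuilt with FD_ZERO/FD_SET before every call, because select overwrites it so that only the ready bits remain.

#include <sys/select.h>
#include <sys/socket.h>

/* fds: the sockets to monitor; n: how many there are */
void select_loop(int *fds, int n)
{
    fd_set readset;
    char buf[4096];

    for (;;) {
        /* rebuild the set on every iteration: the kernel modified it last time */
        FD_ZERO(&readset);
        int maxfd = -1;
        for (int i = 0; i < n; i++) {
            FD_SET(fds[i], &readset);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }

        /* block until at least one fd is readable (NULL timeout) */
        select(maxfd + 1, &readset, NULL, NULL, NULL);

        /* only the ready fds still have their bit set */
        for (int i = 0; i < n; i++) {
            if (FD_ISSET(fds[i], &readset))
                recv(fds[i], buf, sizeof(buf), 0);
        }
    }
}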

A select example

Then select is called. If none of the sockets in fds has data, select blocks until some socket receives data; select then returns and the process is woken up.

int fds[] = ...;   // the sockets to monitor

while (1) {
    // the process is put on the wait queue of every socket in fds;
    // when select returns, the process has been woken up
    int n = select(..., fds, ...);

    for (int i = 0; i < fds.count; i++) {
        if (FD_ISSET(fds[i], ...)) {
            // handle the data on fds[i]
        }
    }
}

Suppose the program monitors the three sockets Sock1, Sock2 and Sock3 shown in the figure below.

After select is called, the operating system adds process A to the wait queue of each of the three sockets.

When any of the sockets receives data, the interrupt handler wakes the process.

The figure below shows the flow when sock2 receives data.

Waking the process means removing it from all of the wait queues it is on and putting it back onto the work queue, as shown in the figure below.

For process A, which called select:

  1. A sits on the wait queues of multiple sockets.
  2. When data is written to one of those sockets, A is woken up, removed from all the sockets' wait queues, and added to the kernel's work queue.
  3. At this point A does not know which socket received the data, so it has to scan all of them.
  4. After A has finished handling the data and calls select again, it must once more be added to the wait queue of every socket.

select execution process

select is a system call provided by the operating system kernel. It removes the overhead that the non-blocking IO model incurs by constantly issuing system IO calls from user space to poll the receive buffer of every connection — each of those calls being a switch between user space and kernel space.

The select system call hands the polling over to the kernel, which avoids the performance cost of continuously initiating the polling from user space.

  • When the user thread issues the select system call, it blocks on it; the thread switches from user mode to kernel mode, completing one context switch.
  • Through the select call, the user thread hands the kernel the array of file descriptor fds for the Sockets to be monitored; the fd array is copied from user space into kernel space.

Disadvantages of select:

Each call to select requires two steps: adding a process to the socket's waiting queue and blocking the process.

So each call pays for the overhead of two traversals of the socket list:

  • The first time: When a process joins the waiting queue of a socket, it needs to traverse all sockets.
  • The second time: When process A is awakened, it needs to be removed from the waiting queue of all sockets after awakening.

It is precisely because of the high overhead of traversal operations and efficiency considerations that the maximum number of monitors for select is specified. By default, only 1024 sockets can be monitored.

The other cost is the two copies of the fd set:

  • Each time select is called, the fds list needs to be passed to the kernel, which requires a copy, which has a certain overhead.
  • After each call to select is completed, the fd collection must be copied from the kernel space to the user space, which also has a certain overhead. This overhead will be very large when there are many fds;

Also: the user thread still has to traverse the file descriptor set to find the specific IO-ready Sockets.

Although polling was originally initiated in user space, it has been optimized to initiate polling in kernel space.

But select will not tell the user thread which Sockets the IO ready event occurred on. It only marks the IO ready Socket. The user thread still has to traverse the file descriptor collection to find the specific IO ready Socket.

The time complexity is still O(n).

In short, select cannot solve the C10K problem.

The performance cost of all these shortcomings of select grows linearly with the level of concurrency.

Clearly select cannot solve the C10K problem either; it is only suitable for workloads of roughly a thousand concurrent connections.

The so-called C10K problem is: how can a server support 10K concurrent connections, i.e. concurrent 10,000 connections (which is where the name C10K comes from)?

The poll system call

The arrival of poll

poll appeared in 1997 as a replacement for select; the biggest difference is that poll no longer limits the number of sockets.

The poll system call

Internally poll is implemented essentially the same way as select; the difference is the data structure used to organize the fd set, which removes poll's limit on the maximum number of file handles.

poll describes the fd set differently: it uses the pollfd structure instead of select's fd_set; everything else is much the same.

poll also manages multiple descriptors by polling and handles them according to their state, but it has no limit on the maximum number of descriptors.

int poll(struct pollfd *fds, unsigned int nfds, int timeout)

The pollfd structure

struct pollfd {
    int   fd;         /* file descriptor */
    short events;     /* events to monitor */
    short revents;    /* events that actually occurred; set by the kernel */
};

Member variables:
(1) fd: each pollfd structure names one file descriptor to be monitored; multiple structures can be passed so that poll() watches multiple descriptors.

(2) events: tells the operating system which events on fd to monitor (input, output, error); each event has several possible values.

(3) revents: the result events for the file descriptor; the kernel fills this field in when the call returns. Any event requested in events may be returned in revents.

The values of events & revents are as follows:

Event       Description                                                     Usable in events   Returned in revents
POLLIN      data can be read (normal and priority data)                     yes                yes
POLLOUT     data can be written (normal and priority data)                  yes                yes
POLLRDNORM  normal data can be read                                         yes                yes
POLLRDBAND  priority-band data can be read (not supported by Linux)         yes                yes
POLLPRI     high-priority data can be read, e.g. TCP out-of-band data       yes                yes
POLLWRNORM  normal data can be written                                      yes                yes
POLLWRBAND  priority-band data can be written                               yes                yes
POLLRDHUP   peer closed the TCP connection or shut down writing (GNU ext.)  yes                yes
POLLHUP     hang-up                                                         no                 yes
POLLERR     error                                                           no                 yes
POLLNVAL    file descriptor not open                                        no                 yes
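As a quick illustration of how fd, events and revents work together, here is a minimal sketch (the socks array and its count are assumed to be connected sockets you want to watch; this is illustration, not code from the book):

#include <poll.h>
#include <stdlib.h>
#include <sys/socket.h>

/* socks: connected sockets to monitor; n: how many */
void poll_loop(int *socks, int n)
{
    struct pollfd *pfds = calloc((size_t)n, sizeof(struct pollfd));
    char buf[4096];

    for (int i = 0; i < n; i++) {
        pfds[i].fd = socks[i];
        pfds[i].events = POLLIN;          /* we only ask for readability */
    }

    for (;;) {
        /* timeout -1: block until at least one fd has an event */
        poll(pfds, (nfds_t)n, -1);

        for (int i = 0; i < n; i++) {
            if (pfds[i].revents & POLLIN)  /* filled in by the kernel */
                recv(pfds[i].fd, buf, sizeof(buf), 0);
        }
    }
}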

Read events / socket read-ready

When is a socket readable?

  1. Scenario 1: the number of bytes in the socket receive buffer is greater than or equal to the receive-buffer low-water mark SO_RCVLOWAT.
    For TCP and UDP sockets this low-water mark defaults to 1, which means that by default a socket is readable as soon as there is any data in the buffer.
    The low-water mark can be changed with the SO_RCVLOWAT socket option (see setsockopt; a short sketch follows this list).
    When the descriptor is ready (readable), a read/recv on the socket will not block and will return a value greater than 0 (the amount of data read).
  2. Scenario 2: the read half of the connection is closed (a TCP connection that has received a FIN). A read on such a socket will not block and returns 0 (EOF).
  3. Scenario 3: the socket is a listening socket and the number of completed connections is non-zero. An accept on such a socket normally does not block.
  4. Scenario 4: there is a pending socket error. A read on such a socket will not block and returns -1 (an error), with errno set to the specific error condition. These pending errors can also be fetched and cleared by calling getsockopt with the SO_ERROR socket option.
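The sketch promised in scenario 1 (fd is assumed to be a connected TCP socket; values chosen for illustration): raising the receive low-water mark so that the socket only reports readable once at least 1024 bytes are buffered.

#include <sys/socket.h>

/* fd: a connected TCP socket (hypothetical) */
static void raise_rcv_lowat(int fd)
{
    int lowat = 1024;   /* report "readable" only once >= 1024 bytes are buffered */
    setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat));
}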

Write events / socket write-ready

When is a socket writable?

  1. Scenario 1: in the socket's kernel send buffer, the number of available bytes (the free space in the send buffer) is greater than or equal to the low-water mark SO_SNDLOWAT; a write will then not block and returns a value greater than 0.
    For TCP and UDP this low-water mark SO_SNDLOWAT defaults to 2048, while the default send buffer size is 8K, which means that a socket is normally writable as soon as the connection is established. The low-water mark can be set with the SO_SNDLOWAT socket option (see setsockopt).
    In this case, if the socket is set to non-blocking, a write operation (write, send, etc.) will not block and returns a positive value (the number of bytes accepted by the transport layer, i.e. the amount of data sent).
  2. Scenario 2: the write half of the connection is closed (a TCP connection that has actively sent a FIN).
    A write on such a socket raises the SIGPIPE signal, whose default action is to terminate the program, so network programs generally have to handle SIGPIPE themselves (a short sketch follows the summary table below).
  3. Scenario 3: a connection attempted with a non-blocking connect has been established, or the connect has failed — i.e. connect now has a result.
  4. Scenario 4: there is a pending socket error. A write on such a socket will not block and returns -1 (an error), with errno set to the specific error condition. These pending errors can also be fetched and cleared by calling getsockopt with the SO_ERROR socket option.
Condition                                            Readable?   Writable?   Exception?
There is data to read                                yes
The read half of the connection is closed            yes
A listening socket has completed connections ready   yes
A pending socket error                               yes         yes
There is space available for writing                             yes
The write half of the connection is closed                       yes
TCP out-of-band data                                                         yes
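Since scenario 2 of the write-ready list raises SIGPIPE, whose default action terminates the process, network programs typically install a handler or simply ignore the signal; a minimal sketch:

#include <signal.h>

static void ignore_sigpipe(void)
{
    /* a write to a connection whose write half is closed now fails with
     * errno == EPIPE instead of killing the process */
    signal(SIGPIPE, SIG_IGN);
}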

The nature of the poll() system call

The select() and poll() system calls are essentially the same

The mechanism of poll() is essentially the same as select(). Every time it is called, the fd collection needs to be copied from user mode to kernel mode.

Both manage multiple descriptors by polling and processing them according to the status of the descriptors.

The file descriptor set used by select is the fd_set, a BitMap with a fixed size of 1024; poll replaces it with an array of pollfd structures that has no fixed length, so there is no limit on the maximum number of descriptors (although it is still subject to the system's file descriptor limits).

poll only removes select's limit of 1024 monitored file descriptors; it does not improve performance.

There is not much difference in essence between poll and select.

  • It is also necessary to poll the file descriptor set in the kernel space and user space. The time complexity of finding the IO-ready Socket is still O(n).
  • It also requires an entire collection of file descriptors to be copied back and forth between user space and kernel space, regardless of whether the file descriptors are ready . Their overhead increases linearly as the number of file descriptors increases.
  • Select and poll need to transfer the entire new socket set to the kernel every time they add or delete a socket that needs to be monitored.

So poll is not suitable for high-concurrency scenarios either; it still cannot solve the C10K problem.

epoll underlying system call

Here comes the core point of the interview

Background: The underlying system call of epoll is the core knowledge point of the interview

To learn high concurrency, epoll is the foundation

This article continues to be updated, and I strive to use this article to give everyone a thorough introduction to epoll.

The importance of epoll

epoll is an essential technology for high-performance network servers under Linux.

Java NIO, nginx, redis, skynet and most game servers use this multiplexing technology.

Reasons why select/poll is inefficient

A select call fuses the two steps of "maintaining the wait queues (event registration)" and "blocking and waiting (event query)" into one; the two are tightly coupled.

Each call to select therefore does both things: it adds the process to every monitored socket's wait queue, and it blocks the process.

select/poll requires two socket list traversals:

  • The first time: Every time you call select, you need to pass the fds list to the kernel, which has a certain overhead. When a process joins the waiting queue of a socket, it needs to traverse all sockets.

  • The second time: When process A is awakened, it needs to be removed from the waiting queue of all sockets after awakening.

It is precisely because of the high overhead of traversal operations and efficiency considerations that the maximum number of monitors for select is specified. By default, only 1024 sockets can be monitored.

epoll decouples the two steps of "event registration" and "event query" and divides them into two.

How to optimize: epoll separates these two operations, first uses epoll_ctl to maintain the waiting queue, and then calls epoll_wait to block the process .

Obviously, efficiency can be improved.

Optimization measures for epoll:

epoll was invented many years after select and poll appeared. It is an enhanced version of select and poll.

epoll improves efficiency through the following measures.

  • Functional decoupling: decouple the two steps of "event registration" and "event query" and divide them into two
  • Space for time: The ready list rdlist is introduced to store file descriptors in which io events have occurred

Three methods of epoll

  • epoll_create: The kernel will create an eventpoll object (a dedicated file descriptor, which is the object represented by epfd in the program). The
    eventpoll object is also a member of the file system. Like socket, it will also have a waiting queue.
  • epoll_ctl: Event registration, add socket to be monitored.
    If you add monitoring of sock1, sock2 and sock3 through epoll_ctl, the kernel will add the three sockets to the eventpoll listening queue.
  • epoll_wait: Event query, blocking waiting.
    After process A runs the epoll_wait statement, process A will wait for the eventpoll waiting queue.

Optimization measure 1 of epoll: Functional decoupling

One of the reasons why select is inefficient is that the two steps of "maintaining the waiting queue" and "blocking the process" are combined into one.

In most application scenarios, the sockets that need to be monitored are relatively fixed and do not need to be modified every time.

epoll separates these two operations, first uses epoll_ctl to maintain the waiting queue, and then calls epoll_wait to block the process.

Obviously, there is no need to copy a large amount of data every time a query is made, and the efficiency can be improved.

epoll waiting list

Optimization measure 2 of epoll: ready list rdlist

Another reason why select is inefficient is that the program does not know which sockets receive data and can only traverse one by one.

If the kernel maintains a "ready list" rdlist, referencing the socket that received the data, it can avoid traversal.

In the following code, epoll_create is first used to create an epoll object epfd, and then the socket to be monitored is added to the dedicated waiting list of epfd through epoll_ctl. Finally, epoll_wait is called to wait for data and return the ready socket in the rdlist list.

int epfd = epoll_create(...);
epoll_ctl(epfd, ...);  // step 1: add every socket to be monitored to epfd

while (1) {
    int n = epoll_wait(...);   // step 2: block the process and wait for events
    for (each socket that received data) {
        // handle it
    }
}

Assume that process A and process B are running on the computer. At a certain moment, process A runs the epoll_wait statement.

The kernel will put process A into the waiting queue of eventpoll, blocking the process.

When the socket receives data, the interrupt program does two jobs:

  • On the one hand, modify rdlist

  • On the other hand, wake up eventpoll to wait for the processes in the queue, and process A enters the running state again.

Also because of the existence of rdlist, process A can know which sockets have changed.
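Putting the three calls together, here is a minimal, self-contained sketch (plain Linux C written for illustration, not the book's code; port 8080, the backlog and the buffer size are arbitrary choices) of a server that accepts TCP connections and echoes whatever it reads, driven entirely by epoll_wait and the ready list:

#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    int epfd = epoll_create1(0);                  /* create the eventpoll object */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);     /* register the listening socket */

    struct epoll_event events[MAX_EVENTS];
    char buf[4096];
    for (;;) {
        /* block until the rdlist is non-empty; only ready fds are returned */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == lfd) {                      /* new connection: accept and register it */
                int cfd = accept(lfd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &cev);
            } else {                              /* a data-transfer socket is readable */
                ssize_t len = read(fd, buf, sizeof(buf));
                if (len <= 0) {                   /* peer closed or error: drop the socket */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } else {
                    write(fd, buf, len);          /* echo the data back */
                }
            }
        }
    }
}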

Combined with the kernel source code, delve into the principles and source code of Epoll

A simple example using epoll

int main(){
    listen(lfd, ...);

    cfd1 = accept(...);
    cfd2 = accept(...);

    efd = epoll_create(...);

    epoll_ctl(efd, EPOLL_CTL_ADD, cfd1, ...);
    epoll_ctl(efd, EPOLL_CTL_ADD, cfd2, ...);

    epoll_wait(efd, ...);
}

The functions related to epoll are as follows:

  • epoll_create: Create an epoll object
  • epoll_ctl: Add the connections to be managed to the epoll object
  • epoll_wait: Waiting for IO events on the connections it manages

The core structure of epoll

The meanings of several members of the core structure of epoll are as follows:

  • wq: the wait-queue list. It holds the user processes waiting for events; when an IO event is ready, the kernel uses wq to find the user processes blocked on this epoll object.
  • rbr: a red-black tree. It manages all the socket connections being monitored. To support efficient lookup, insertion and deletion over huge numbers of connections, eventpoll uses a red-black tree internally; every socket connection added by the user process is managed through this tree.
  • rdlist: a linked list of ready descriptors. When connections become ready, the kernel puts them on the rdllist, so the application only has to look at this list to find the ready connections, instead of traversing the whole tree.

struct eventpoll source file:

// file:fs/eventpoll.c
struct eventpoll {

    // wait queue used by sys_epoll_wait
    wait_queue_head_t wq;

    // descriptors that become ready are placed here
    struct list_head rdllist;

    // every epoll object has a red-black tree
    struct rb_root rbr;

    ......
}

Deep into the bottom layer 1: the underlying principle of epoll_create, step 1 of using epoll

When a process calls the epoll_create method, the kernel creates an eventpoll object (epfd file descriptor)

epoll_create is a system call provided by the kernel to create epoll objects.

epoll_create, open an epoll file descriptor.

#include <sys/epoll.h>

nfd = epoll_create(max_size);

epoll_create() creates an epoll instance. The max_size parameter is a hint for the maximum number of descriptors to be monitored; since Linux 2.6.8 it is ignored, but it must be greater than zero.

nfd is the epoll handle: epoll_create() returns a file descriptor referring to the new epoll instance, and this descriptor is used in all subsequent epoll calls.

Every epoll handle created occupies an fd, so when it is no longer needed, the descriptor returned by epoll_create() should be closed with close(), otherwise fds may eventually be exhausted. When all file descriptors referring to a closed epoll instance have themselves been closed, the kernel destroys the instance and frees the associated resources for reuse.

Return value:

On success, these system calls return a non-negative file descriptor. On error, -1 is returned and errno is set to indicate the error.

errno values:

  • EINVAL: size is not positive.
  • EMFILE: the per-user limit on the number of epoll instances imposed by /proc/sys/fs/epoll/max_user_instances was reached.
  • ENFILE: the system-wide limit on the total number of open files was reached.
  • ENOMEM: there was not enough memory to create the kernel object.

When epoll_create is called in a user process, the kernel creates a struct eventpoll object for us,

together with an associated struct file; as with sockets, the struct file associated with this struct eventpoll object is placed into the process's open-file table fd_array for management.

The file_operations pointer of the struct file associated with the struct eventpoll object points to the eventpoll_fops operation set.

static const struct file_operations eventpoll_fops = {
     .release = ep_eventpoll_release,
     .poll    = ep_eventpoll_poll,
}

After the eventpoll structure has been allocated, the ep_alloc function does a little initialization work:

//file: fs/eventpoll.c
static int ep_alloc(struct eventpoll **pep)
{
    struct eventpoll *ep;

    // allocate the eventpoll memory
    ep = kzalloc(sizeof(*ep), GFP_KERNEL);

    // initialize the wait queue head
    init_waitqueue_head(&ep->wq);

    // initialize the ready list
    INIT_LIST_HEAD(&ep->rdllist);

    // initialize the red-black tree root
    ep->rbr = RB_ROOT;

    ......
}

Much like a socket, eventpoll also has a wait queue.

  • wait_queue_head_t wq: epoll's wait queue; it holds the user processes blocked on epoll. When IO becomes ready, epoll uses this queue to find those blocked processes/threads and wake them up, so that they can issue IO calls to read or write the data on the Socket.

Take care to distinguish this from the wait queue inside a Socket! What goes onto a Socket's wait queue are the threads/processes blocked by IO operations on that Socket.

  • struct list_head rdllist: The ready queue in epoll. The queue stores IO-ready Sockets. The awakened user process can directly read this queue to obtain IO-active Sockets. No need to iterate through the entire Socket collection again.

rdllist is what makes epoll more efficient than select/poll.

select and poll return all socket connections. If there are 100W connections, they all need to be returned here, and then traversed and checked.

epoll passes rdllist here and only returns ready sockets. User processes can directly perform IO operations.

  • struct rb_root rbr: Use red-black tree to manage all sockets. If it is a 100W connection, it is also capable of management here.

    Since the red-black tree is the best in terms of comprehensive performance such as search, insertion, and deletion, epoll uses a red-black tree internally to manage a large number of Socket connections.

Different from epoll, select uses an array to manage all socket connections, and poll uses a linked list to manage all socket connections.

Deep into the bottom layer 2: the underlying principle of epoll_ctl

Next, let’s look at step 2 of using Epoll, the underlying principle of epoll_ctl

What does the epoll_ctl system call do? It adds or removes the sockets you want to monitor; epoll_ctl operates on the instance created by epoll_create.

#include <sys/epoll.h>

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

This system call performs control operations on the epoll instance referenced by the file descriptor epfd. It requires operation op to be performed on the target file descriptor fd.

Valid values for the op parameter are (a short usage sketch follows this list):

  • EPOLL_CTL_ADD: Register the target file descriptor fd on the epoll instance referenced by the file descriptor epfd, and link the event event with the internal file to the fd.

  • EPOLL_CTL_MOD:Change the event event associated with the target file descriptor fd.

  • EPOLL_CTL_DEL: Remove (unregister) the target file descriptor fd from the epoll instance referenced by epfd. This event is ignored and can be NULL (but see error below).
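The sketch referred to above (epfd and sockfd are hypothetical descriptors created elsewhere with epoll_create and socket):

#include <sys/epoll.h>

static void manage_fd(int epfd, int sockfd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sockfd };

    /* EPOLL_CTL_ADD: start monitoring sockfd for read events */
    epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);

    /* EPOLL_CTL_MOD: change the monitored event set */
    ev.events = EPOLLIN | EPOLLOUT;
    epoll_ctl(epfd, EPOLL_CTL_MOD, sockfd, &ev);

    /* EPOLL_CTL_DEL: stop monitoring sockfd (the event argument may be NULL) */
    epoll_ctl(epfd, EPOLL_CTL_DEL, sockfd, NULL);
}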

eg: if you add monitoring of sock1, sock2 and sock3 through epoll_ctl,

the kernel adds eventpoll to the wait queues of these three sockets; concretely, it adds an ep_poll_callback callback entry to each socket's wait queue.

When each socket is registered with epoll_ctl, the kernel does the following three things:

  • 1. allocate a red-black-tree node object, an epitem;
  • 2. add a wait entry to the socket's wait queue, with ep_poll_callback as its callback function;
  • 3. insert the epitem into the epoll object's red-black tree.

After two sockets have been added via epoll_ctl, the relationships between these kernel data structures in the process look roughly like this:

Let's look in detail at how a socket is added to the epoll object, starting from the source of epoll_ctl.

// file:fs/eventpoll.c
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
        struct epoll_event __user *, event)
{
    struct eventpoll *ep;
    struct file *file, *tfile;

    // find the eventpoll kernel object from epfd
    file = fget(epfd);
    ep = file->private_data;

    // find the file kernel object from the socket fd
    tfile = fget(fd);

    switch (op) {
    case EPOLL_CTL_ADD:
        if (!epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_insert(ep, &epds, tfile, fd);
        } else
            error = -EEXIST;
        clear_tfile_check_list();
        break;
    }
}

In epoll_ctl, the eventpoll and socket kernel objects are first located from the fds passed in.

For the EPOLL_CTL_ADD operation, execution then reaches the ep_insert function.

All of the registration work is done inside this function.

//file: fs/eventpoll.c
static int ep_insert(struct eventpoll *ep,
                struct epoll_event *event,
                struct file *tfile, int fd)
{
    // 3.1 allocate and initialize the epitem
    // allocate an epi object
    struct epitem *epi;
    if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;

    // initialize the allocated epi
    // epi->ffd stores the fd number and the struct file address
    INIT_LIST_HEAD(&epi->pwqlist);
    epi->ep = ep;
    ep_set_ffd(&epi->ffd, tfile, fd);

    // 3.2 set up the socket's wait queue
    // define and initialize the ep_pqueue object
    struct ep_pqueue epq;
    epq.epi = epi;
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);

    // register the callback via ep_ptable_queue_proc;
    // the function actually injected is ep_poll_callback
    revents = ep_item_poll(epi, &epq.pt);

    ......
    // 3.3 insert the epi into the eventpoll object's red-black tree
    ep_rbtree_insert(ep, epi);
    ......
}

Allocating and initializing an epitem for the socket

For every socket, an epitem is allocated when epoll_ctl is called.

The main fields of the epitem structure are:

//file: fs/eventpoll.c
struct epitem {

    // red-black tree node
    struct rb_node rbn;

    // socket file descriptor information
    struct epoll_filefd ffd;

    // the eventpoll object this item belongs to
    struct eventpoll *ep;

    // wait queue (list of poll wait entries)
    struct list_head pwqlist;
}

epoll uses a red-black tree to manage this mass of socket connections, so struct epitem is a red-black-tree node.

First, a struct epitem — the data structure representing the Socket connection — is created inside the epoll kernel code.

The epitem gets some initialization: the line epi->ep = ep points its ep pointer at the eventpoll object,

and epitem->ffd is filled with the file and fd of the socket being added.

The ep_set_ffd helper used here is:

static inline void ep_set_ffd(struct epoll_filefd *ffd,
                        struct file *file, int fd)
{
    ffd->file = file;
    ffd->fd = fd;
}

ep_item_poll: inserting the event callback into the socket's wait queue

After the epitem is created and initialized, the second thing ep_insert does is set up the wait entry on the socket object's sk_wq wait queue,

and register the function ep_poll_callback (in fs/eventpoll.c) as the callback to run when data becomes ready.

Once struct epitem, the kernel structure representing the Socket connection, has been created, a wait entry wait_queue_t must be created on the Socket's wait queue sk_wq, and epoll's callback ep_poll_callback registered on it.

As soon as an IO event occurs on the Socket, this callback ep_poll_callback runs and adds the epitem to the rdlist ready queue.

The callback ep_poll_callback is the heart of epoll's synchronous IO event notification mechanism, and it is the fundamental performance difference from the kernel-side polling approach of select and poll.

Inserting the event callback into the socket's wait queue is done through the ep_item_poll method:

static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
{
    pt->_key = epi->event.events;

    return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
}

This calls into the socket's file->f_op->poll().

For a socket, this function is actually sock_poll.

/* No kernel lock held - perfect */
static unsigned int sock_poll(struct file *file, poll_table *wait)
{
    ...
    return sock->ops->poll(file, sock, wait);
}

For a socket, sock->ops->poll actually points to tcp_poll.

//file: net/ipv4/tcp.c
unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
    struct sock *sk = sock->sk;

    sock_poll_wait(file, sk_sleep(sk), wait);
}

Before the second argument is passed to sock_poll_wait, sk_sleep is called first.

This function fetches the wait-queue head wait_queue_head_t of the sock object; the wait entry will be inserted there in a moment.

Note carefully: the insertion target is the socket's wait queue, not the epoll object's wait queue.

Here is the source of sk_sleep:

//file: include/net/sock.h
static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
    BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
    return &rcu_dereference_raw(sk->sk_wq)->wait;
}

Then we actually enter sock_poll_wait.

static inline void sock_poll_wait(struct file *filp,
        wait_queue_head_t *wait_address, poll_table *p)
{
    poll_wait(filp, wait_address, p);
}

static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
    if (p && p->_qproc && wait_address)
        p->_qproc(filp, wait_address, p);
}

Here _qproc is a function pointer; it was set to the ep_ptable_queue_proc function in the earlier init_poll_funcptr call.

static int ep_insert(...)
{
    ...
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
    ...
}

//file: include/linux/poll.h
static inline void init_poll_funcptr(poll_table *pt,
    poll_queue_proc qproc)
{
    pt->_qproc = qproc;
    pt->_key   = ~0UL; /* all events enabled */
}

Inside the ep_ptable_queue_proc function,

  • Create a new waiting queue item eppoll_entry, and register its callback function as the ep_poll_callback function.
  • Then add this waiting item eppoll_entry to the socket's waiting queue.
//file: fs/eventpoll.c
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    struct eppoll_entry *pwq;
    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        // initialize the callback
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);

        // add ep_poll_callback to the socket's wait queue whead
        // (note: this is not epoll's wait queue)
        add_wait_queue(whead, &pwq->wait);
    }
}

Normally q->private points to the user process waiting on the queue.

Here the socket is managed by epoll and no specific process needs to be woken directly when the socket becomes ready, so q->private is set to NULL.

//file: include/linux/wait.h
static inline void init_waitqueue_func_entry(
    wait_queue_t *q, wait_queue_func_t func)
{
    q->flags = 0;
    q->private = NULL;

    // register ep_poll_callback on the wait_queue_t object;
    // q->func is invoked when data arrives
    q->func = func;
}

As shown above, only the callback function q->func of the wait queue entry is set, to ep_poll_callback.

We will see later, in the section "When the data comes, when the IO event occurs", that after the kernel softirq places received data into the socket's receive queue, it calls back through the registered ep_poll_callback function and then notifies the epoll object.

eppoll_entry structure

During the registration of a socket, a data structure called struct eppoll_entry appears. What is it for?

We know that the entries in the socket->sock->sk_wq wait queue have type wait_queue_t, and that the epoll callback ep_poll_callback must be registered on the wait queue of the socket represented by struct epitem.

That way, when data reaches the receive queue in the socket, the kernel calls back sk_data_ready to wake up whoever is waiting.

The sk_data_ready function pointer points to the sock_def_readable function, and inside sock_def_readable the callback registered on the wait entry — wait_queue_t->func, i.e. ep_poll_callback — is invoked.

ep_poll_callback needs to find the epitem, and put the IO-ready epitem onto the rdlist ready queue in epoll.

The problem is that the socket wait queue holds entries of type wait_queue_t, which by themselves cannot be associated with an epitem.

This is why struct eppoll_entry exists: its job is to associate the wait entry wait_queue_t on the Socket's wait queue with the epitem.

So eppoll_entry is a glue structure, a bridge structure:

struct eppoll_entry {

    // points to the associated epitem
    struct epitem *base;

    // the wait entry linked into the monitored socket's wait queue
    // (private = NULL, func = ep_poll_callback)
    wait_queue_t wait;

    // pointer to the head of the monitored socket's wait queue
    wait_queue_head_t *whead;
    .........
};

In this way, inside the ep_poll_callback callback, the eppoll_entry can be found from the wait entry wait in the Socket's wait queue via the container_of macro, and from there the epitem.

Note again that the private field of the wait_queue_t is NULL this time, because the Socket here is managed by epoll, and the processes blocked on the Socket are also woken by epoll.

The func registered on the wait entry wait_queue_t is ep_poll_callback rather than the earlier autoremove_wake_function; since the blocked process does not need autoremove_wake_function to wake it up, private is set to NULL here.

With select, by contrast, it is the blocked user process that is attached through wait_queue_t->private.
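In the kernel source this linkage is a one-liner built on container_of; a simplified sketch, close to the helper in fs/eventpoll.c, of how the callback steps from the wait entry back to the epitem:

/* given the wait_queue_t passed to the wakeup callback, step back to the
 * enclosing eppoll_entry and return the epitem it glues to */
static struct epitem *ep_item_from_wait(wait_queue_t *p)
{
    return container_of(p, struct eppoll_entry, wait)->base;
}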

The epitem corresponding to the socket is inserted into the red-black tree

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

When the data comes, when the IO event occurs

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Execute ep_poll_callback ready callback function

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Execute epoll ready notification

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Deep into the bottom layer 3: the underlying principle of epoll_wait, step 3 of using epoll

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Use of epoll_wait

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Create a wait queue item wait_queue_t

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Determine whether there are events on the ready queue that are ready

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Define wait events and associate them with the current process

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Add to waiting queue

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Give up the CPU and actively enter sleep state

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

epoll_wait is similar to select blocking principle

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The entire workflow of epoll

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Analysis of two wait queues involved in epoll()

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Linux epoll API used in the application layer:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

epoll_ctl

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The structure of struct epoll_event is as follows:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

epoll events event type

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

epoll_event structure data member variables:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Linux epoll API: epoll_wait

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

eventpoll data structure

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The huge advantages of epoll

To understand the huge advantages of epoll, you must first look at the shortcomings and advantages of select

Shortcomings and strengths of select:

select essentially works by setting and checking the data structure that holds the fd flag bits before taking the next step; time complexity: O(n).

Shortcoming 1: every use of select costs the kernel at least two traversals of the socket list

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Shortcoming 2: every call to select has to pass the fds list to the kernel, which involves copying the fds — another cost;

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Shortcoming 3: scanning the sockets is a linear scan; time complexity: O(n).
Strengths:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Shortcomings and strengths of poll:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

epoll's improvements:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The fundamental measures behind epoll's improvements:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The core strengths of epoll

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The shortcomings of epoll:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Meituan second-round interview: epoll performs so well — why?

Said up front

In the reader community (50+) of Nien, the 40-year-old architect, some friends have recently obtained interview opportunities at first-tier Internet companies such as Meituan, Pinduoduo, J&T Express, Youzan and Shein, and ran into several important interview questions:

  • Describe epoll's data structures
  • Describe how epoll is implemented
  • How does the protocol stack communicate with epoll?
  • How does epoll handle thread safety and locking?
  • Explain how ET and LT are implemented
  • ……

Here Nien gives a systematic, structured walkthrough, so that everyone can fully show off their "technical muscle" and leave the interviewer thoroughly impressed.

The question and its reference answer have also been added to the V88 edition of "Nien's Java Interview Guide" for later readers, to raise everyone's 3-high architecture, design and development level.

For the latest PDFs of "Nien's Architecture Notes", "Nien's High-Concurrency Trilogy" and "Nien's Java Interview Guide", get them from the official account "Technical Freedom Circle".

The details are in the following article:

Meituan second-round interview: epoll performs so well — why?

Said at the end

Linux questions are very common interview questions.

If you can answer the above fluently and in detail, the interviewer will basically be impressed and won over.

In the end the interviewer is hooked — and the offer follows.

If you run into any problems while studying, come and talk to Nien, the 40-year-old architect.

Fully understand: the high-speed ET mode and Netty's high-speed Selector

Background

The two trigger modes of epoll events

epoll has two trigger modes, EPOLLLT and EPOLLET

  • LT is the default mode
  • ET is the "high-speed" mode (a registration sketch follows this list).
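Choosing the mode happens per file descriptor at registration time; a minimal sketch (epfd and fd are hypothetical descriptors created elsewhere): leaving EPOLLET out gives the default level-triggered behaviour, OR-ing it in switches that fd to edge-triggered.

#include <sys/epoll.h>

static void add_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev;
    ev.data.fd = fd;
    /* EPOLLIN alone would give the default level-triggered (LT) mode;
     * OR-ing in EPOLLET switches this fd to edge-triggered (ET).
     * With ET the fd is normally made non-blocking and read until EAGAIN. */
    ev.events = EPOLLIN | EPOLLET;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}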

ET and LT are concepts borrowed from electronics (edge vs level triggering)

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

LT (level-triggered) behavior when epoll_wait runs

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Comparing ET and LT at epoll_wait time

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The difference between ET and LT

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The trigger mode of Java's Selector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Netty's Selector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

How to use Netty's own Linux epoll implementation

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

How to use Netty's own Linux epoll implementation, step 2:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

How to use Netty's own Linux epoll implementation — detailed address:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Why does Netty use NIO rather than AIO?

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

IO thread models

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Reactor

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Single Reactor, single thread

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Single Reactor, multiple threads

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Master-slave Reactor, multiple threads

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Proactor

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Reactor vs Proactor

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Netty's IO model

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Configuring single Reactor, single thread

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Configuring single Reactor, multiple threads

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Configuring master-slave Reactor, multiple threads

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Fully decoded: how IO events are translated between Java and native code

Background:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Java's IO event types

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Event constants in SelectionKey

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Events of the poll() system call

The pollfd structure

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

poll's events event types

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Socket read-ready conditions (read events):

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Socket write-ready conditions (write events):

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The essence of the poll() system call

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Net.java loads the platform-specific poll event values via JNI

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Event values in Net.c

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Translating NIO events to JNI events

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SocketChannelImpl#translateInterestOps

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SocketChannelImpl#translateReadyOps

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Events of the epoll() system call

The struct epoll_event structure is as follows:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

epoll's events event types

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The data member of the epoll_event structure:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Fully decoded: the core principles of Java NIO SelectionKey

At the lowest level, Java NIO here is built on the select system call,

but its principles are still well worth learning and analyzing in depth.

Background

SelectionKey is closely tied to IO events

Using selection keys

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Introduction to SelectionKey

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

A SelectionKey object represents the association between a Channel and a Selector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The SelectionKeyImpl inheritance hierarchy

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Registering events on a channel

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectionKey uses four constants to represent event types:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectionKey maintains two set members:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The SelectionKey#interestOps() method

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Apart from its constructors, SelectionKey has 13 methods:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Fields of AbstractSelectionKey

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Methods of AbstractSelectionKey

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectionKeyImpl

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Fields of SelectionKeyImpl

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The interest set interestOps can also be set directly via the interestOps() method

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The ready set readyOps is the set of operations that actually occurred on the channel

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Checking whether an event is registered

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Detecting which events are ready on a channel

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Accessing the events in the ready set

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Using bit masks to detect ready events

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectionKey#isAcceptable()

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Getting the set of ready events: SelectionKeyImpl#readyOps()

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Simple handling of the new-connection-ready event

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Source of SelectionKey#isAcceptable()

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Simple handling of the read-ready event

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Source of SelectionKey#isReadable()

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

How a selectionKey becomes invalid

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Checking whether a selectionKey is still valid

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The selectionKey attachment

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Attachment operation of selectionKey

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Add attachment to SelectionKey

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Add attachments when registering Channel

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Processor attachments in EchoServerReactor

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Processor attachments in EchoServerReactor

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

AbstractNioChannel.doRegister in Netty

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Example of using selectionKey

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectKey results after forcibly stopping Selector#select()

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Method to forcefully stop the Selector#select() operation

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Thoroughly understand: the core principles of Selector

Introduction to Selector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Creation of Selector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Register Channel to Selector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectionKey

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Selector#select() method introduction:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The selector maintains three set collections:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Call the Selector's selectedKeys() method to access the selected key collection

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Each Selector also maintains two views, publicKeys and publicSelectedKeys, for client use.

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Collections.unmodifiableSet & ungrowableSet

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Method to stop Selector#select() selection operation

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Different Selector implementation classes

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Different Selector implementation classes for different operating system platforms

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The most powerful secret: Selector.open() The underlying principle of selector opening

Use of Selector.open() method

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Selector.open();

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Create a provider

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

What is created here in the windows environment is WindowsSelectorProvider

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Then call WindowsSelectorProvider.openSelector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The implementation of SelectorProvider is related to the operating system

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Inheritance diagram of WindowsSelectorProvider

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectorImpl will initialize selectedKeys and keys

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

AbstractSelector will initialize the provider

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

WindowsSelectorImpl constructor

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

WindowsSelectorProvider.open()

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

WindowsSelectorImpl structure

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

FdMap structure

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SubSelector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

pollArray event polling queue in PollArrayWrapper

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

pollArray file descriptor poll event polling queue elements

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Members of PollArrayWrapper

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The values ​​of events&revents are as follows:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Constructor of PollArrayWrapper

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

NativeObject allocates direct memory

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Properties of SubSelector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SubSelector methods

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Call the JNI local poll0 method to query events

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The general query process of selector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The most powerful secret: Selector.register() The underlying principle of registration

Code for registering channel to selector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Channel registration diagram

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

General registration process

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectableChannel#register();

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The second parameter of the channel.register() method.

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Use the bitwise OR operator to connect multiple constants

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

AbstractSelectableChannel.register(selector, ops,att)

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

AbstractSelector.register

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

WindowsSelectorImpl#implRegister implements registration

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

addEntry of pollWrapper

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The element content of pollArray

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

ReviewAbstractSelector.register

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectionKeyImpl#interestOps(ops)

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectionKeyImpl.nioInterestOps

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

channel#translateAndSetInterestOps

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

in putEventOps of WindowsSelectorImpl

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Key point: Place the event of interest at the location specified by i in pollArray

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

WindowsSelectorImpl#implRegister implements registration

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Expand channelArray

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

pollWrapper.grow(newSize) expands PollArrayWrapper

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Management of auxiliary threads

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Multi-threading under WindowsSelectorImpl improves poll performance

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Performance issues with WindowsSelectorImpl

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Review: Overall Registration Process

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The most powerful secret: the underlying principle of Selector.select() event query

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The core process of the select method:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Select method calling process

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectorImpl#select

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectorImpl#lockAndDoSelect

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

WindowsSelectorImpl#doSelect

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Inner class subSelector#poll()

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

WindowsSelectorImpl#poll0() in C language

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The main functions of the select method of c

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Select method prototype on windows:

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SubSelector.processSelectedKeys

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SubSelector.processFDSet

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

ServerSocketChannelImpl.translateReadyOps

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Selector's select() only uses a single thread

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The most powerful secret: the underlying principle of Selector.wakeup() wake-up

How to wake up a process blocked by system call?

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

wakeup method

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The principle of WindowsSelector wake-up

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Implementation of WindowsSelector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

WindowsSelectorImpl loopback connection

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The sender and receiver of the wake-up message

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

this.pollWrapper.addWakeupSocket(this.wakeupSourceFd, 0);

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Members of PipeImpl

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

A simple pipe example

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien
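Since the worked example itself is omitted here, the following is a minimal sketch of what a simple java.nio.channels.Pipe demo typically looks like; the class name and message text are placeholders, not taken from the original text.

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.charset.StandardCharsets;

public class PipeSketch {
    public static void main(String[] args) throws Exception {
        // On Windows the JDK backs this with a loopback socket pair (PipeImpl).
        Pipe pipe = Pipe.open();

        // Writer side: the sink channel
        ByteBuffer out = ByteBuffer.wrap("hello pipe".getBytes(StandardCharsets.UTF_8));
        pipe.sink().write(out);

        // Reader side: the source channel
        ByteBuffer in = ByteBuffer.allocate(64);
        int n = pipe.source().read(in);
        in.flip();
        System.out.println("read " + n + " bytes: " + StandardCharsets.UTF_8.decode(in));

        pipe.sink().close();
        pipe.source().close();
    }
}
```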

PipeImpl#LoopbackConnector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The pipe creation process: Pipe.open() opens the local channel

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The pipe creation process: link verification of the two sockets

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Back to the WindowsSelectorImpl constructor

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Nagle algorithm

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Why is the Nagle algorithm in TCP disabled?

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien
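The detailed reasoning is in the PDF; for reference, the sketch below simply shows how Nagle's algorithm can be switched off on a SocketChannel via the standard TCP_NODELAY option, so that small writes (such as a one-byte wakeup message) go out immediately instead of being held back while waiting for the previous ACK. The class name is a placeholder.

```java
import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

public class NoDelaySketch {
    public static void main(String[] args) throws IOException {
        try (SocketChannel channel = SocketChannel.open()) {
            // TCP_NODELAY = true disables Nagle's algorithm on this socket.
            channel.setOption(StandardSocketOptions.TCP_NODELAY, true);
            System.out.println("TCP_NODELAY = "
                    + channel.getOption(StandardSocketOptions.TCP_NODELAY));
        }
    }
}
```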

Back to the WindowsSelectorImpl constructor

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The source end is placed in pollWrapper

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Review: The structure of pollArray

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Elements of pollArray

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Reviewing members of WindowsSelectorImpl

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Wakeup message sending

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Wakeup implementation of WindowsSelectorImpl

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

WindowsSelectorImpl#setWakeupSocket0

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

The event flag values of the poll function

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Key points of wakeup():

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

Closing the Selector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien

SelectorImpl#implCloseSelector

Due to word limit, omitted here

For complete content, please see Nien's "NIO Study Bible", pdf, get it from Nien
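The implementation details of implCloseSelector are in the PDF; the sketch below only demonstrates the documented, observable behaviour of Selector.close(): a thread blocked in select() is released as if wakeup() had been called, and later selection attempts throw ClosedSelectorException. The class name and sleep interval are placeholders.

```java
import java.io.IOException;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.Selector;

public class CloseSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        Selector selector = Selector.open();

        Thread ioThread = new Thread(() -> {
            try {
                selector.select();   // parks here; close() behaves like wakeup() for a blocked select()
                System.out.println("select() returned because the selector was closed");
                selector.select();   // any further selection attempt now fails
            } catch (ClosedSelectorException expected) {
                System.out.println("ClosedSelectorException, as documented");
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        ioThread.start();

        Thread.sleep(500);           // give the IO thread time to block in select()
        selector.close();            // deregisters channels and releases the selector's resources
        ioThread.join();
    }
}
```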

References

https://www.jianshu.com/p/336ade82bdb0

https://www.zhihu.com/question/48510028

https://blog.csdn.net/daaikuaichuan/article/details/83862311

https://zhuanlan.zhihu.com/p/93369069

http://gityuan.com/2019/01/06/linux-epoll/

https://blog.51cto.com/7666425/1261446

https://zhuanlan.zhihu.com/p/340666719

https://zhuanlan.zhihu.com/p/116901360

https://www.cnblogs.com/wanpengcoder/p/11749319.html

https://zhuanlan.zhihu.com/p/64746509

https://zhuanlan.zhihu.com/p/34280875

https://blog.csdn.net/mrpre/article/details/24670659

Recommended reading

" Ten billions of visits, how to design a cache architecture "

" Multi-level cache architecture design "

" Message Push Architecture Design "

" Alibaba 2: How many nodes do you deploy?" How to deploy 1000W concurrency?

" Meituan 2 Sides: Five Nines High Availability 99.999%. How to achieve it?"

" NetEase side: Single node 2000Wtps, how does Kafka do it?"

" Byte Side: What is the relationship between transaction compensation and transaction retry?"

" NetEase side: 25Wqps high throughput writing Mysql, 100W data is written in 4 seconds, how to achieve it?"

" How to structure billion-level short videos? "

" Blow up, rely on "bragging" to get through JD.com, monthly salary 40K "

" It's so fierce, I rely on "bragging" to get through SF Express, and my monthly salary is 30K "

" It exploded...Jingdong asked for 40 questions on one side, and after passing it, it was 500,000+ "

" I'm so tired of asking questions... Ali asked 27 questions while asking for his life, and after passing it, it's 600,000+ "

" After 3 hours of crazy asking on Baidu, I got an offer from a big company. This guy is so cruel!"

" Ele.me is too cruel: Face an advanced Java, how hard and cruel work it is "

" After an hour of crazy asking by Byte, the guy got the offer, it's so cruel!"

" Accept Didi Offer: From three experiences as a young man, see what you need to learn?"

"Nien Architecture Notes", "Nien High Concurrency Trilogy", "Nien Java Interview Guide" PDF, please go to the following official account [Technical Freedom Circle] to get ↓↓↓


Source: https://blog.csdn.net/crazymakercircle/article/details/133186797