Detailed explanation of Linux kernel and kernel optimization scheme

1. History of Linux

1. UNIX

The relationship between UNIX and Linux is an interesting topic. Among today's mainstream server-side operating systems, UNIX was born in the late 1960s, Windows in the mid-1980s, and Linux in the early 1990s. UNIX can fairly be called the "big brother" of operating systems; both Windows and Linux later borrowed from it.

The UNIX operating system was created by Ken Thompson and Dennis Ritchie. Part of its technical lineage can be traced back to the Multics project started in 1965, a joint effort of Bell Labs, the Massachusetts Institute of Technology, and General Electric whose goal was to develop an interactive, multi-programming, time-sharing operating system capable of replacing the batch-processing systems widely used at the time.

Note: A time-sharing operating system lets a computer serve multiple users at the same time, with terminal users connected to the computer issuing commands interactively. CPU time is divided into short segments called time slices; the operating system uses the time slice as its unit and serves each terminal user in turn, one time slice at a time.

Unfortunately, the Multics project grew so large and complex that its developers lost sight of what it was supposed to become, and it ended in failure.

Bell Labs researchers headed by Ken Thompson learned the lessons of the failure of the Multics project and realized the prototype of a time-sharing operating system in 1969, which was officially named UNIX in 1970.

Interestingly, Ken Thompson's original intention for developing UNIX was to run a computer game he wrote, Space Travel, which simulates the movement of celestial bodies in the solar system, with players piloting spaceships, viewing the scenery and trying to land on various planets and moons. He successively tested on multiple systems, but the results were not satisfactory, so he decided to develop his own operating system, and in this way, UNIX was born.

Starting in 1970, the UNIX system gradually became popular among programmers inside Bell Labs. In 1971-1972, Ken Thompson's colleague Dennis Ritchie invented the legendary C language, a high-level language well suited to writing system software. Its birth was an important milestone in the development of UNIX: from then on, assembly language was no longer the dominant language for operating system development.

By 1973, most of the UNIX source code had been rewritten in C, which laid the foundation for the system's portability (earlier operating systems were mostly written in assembly language and were therefore highly dependent on the hardware) and greatly improved the efficiency of system software development. It is fair to say that UNIX and C are twins, inseparably linked.

In the early 1970s, the computer world produced another great invention: the TCP/IP protocol suite, a network protocol developed after the US Department of Defense took over ARPAnet. The Department of Defense bundled TCP/IP with the UNIX system and the C language, and AT&T licensed the package to American universities on non-commercial terms, which kicked off the spread of UNIX, C, and TCP/IP. They went on to shape the fields of operating systems, programming languages, and network protocols respectively. For their outstanding contributions to computing, Ken Thompson and Dennis Ritchie received the Turing Award, the highest honor in computer science, in 1983.

Subsequently, various versions of UNIX systems appeared, such as Sun Solaris, FreeBSD, IBM AIX, HP-UX and so on.

2. Solaris and FreeBSD

Solaris is an important branch of the UNIX family. In addition to the SPARC platform, Solaris also runs on x86 CPUs. In the server market, Sun's hardware platform offered high availability and reliability, and Solaris was the dominant UNIX system there. Solaris x86 can be used in real production environments, and it may be used free of charge for study, research, or commercial purposes, subject to Sun's license terms.

FreeBSD originated from the UNIX versions developed at the University of California, Berkeley. It is developed and maintained by volunteers around the world and supports computer systems of different architectures to varying degrees. FreeBSD is released under the BSD license, which allows anyone to use and distribute it freely as long as the copyright and license notices are retained, and does not forbid redistributing FreeBSD code under another license, so commercial companies can freely integrate FreeBSD code into their products. Apple's OS X, for example, incorporates a substantial amount of FreeBSD code.

A considerable part of the user groups of FreeBSD and Linux overlap, the hardware environments supported by the two are also relatively consistent, and the software used is relatively similar. The biggest feature of FreeBSD is stability and efficiency, and it is a good choice as a server operating system; however, its hardware support is not as complete as that of Linux, so it is not suitable as a desktop system.

3. The birth of Linux

The Linux kernel was originally written by Linus Torvalds as a hobby while he was a student at the University of Helsinki. He found Minix, the mini UNIX-like operating system used for teaching, too limited, so he decided to develop an operating system of his own. The first version was released in September 1991 with only about 10,000 lines of code.

Linus Torvalds did not keep the copyright of the Linux source code, made the code public, and invited others to improve Linux together. Unlike Windows and other proprietary operating systems, Linux is open source and free for anyone to use.

It is estimated that only about 2% of today's Linux kernel code was written by Linus Torvalds himself, although he still owns the Linux kernel (the core part of the operating system) and retains the final say over which new code is selected and merged. The Linux everyone uses today is better described as the joint work of Linus Torvalds and the many Linux experts who joined the effort later.

Open source software follows a different model from commercial software. Literally, it means the source code is open: there are no hidden tricks to worry about, which encourages both software innovation and security.

 Linux is very popular among computer enthusiasts for two main reasons:

  1. It is open source software: users can obtain it and its source code free of charge, modify it as needed, use it at no cost, and redistribute it without restriction;
  2. It has all the features of UNIX, and anyone who uses UNIX operating system or wants to learn UNIX operating system can benefit from Linux.

In addition, open source is not actually the same thing as free of charge; it is a new way of making money from software. Today a great deal of software is open source, which has had a profound impact on the computer industry and the Internet.

2. Introduction to Linux kernel

1. The composition of the computer system

A computer system is a symbiosis of hardware and software, which are interdependent and inseparable.

The hardware of the computer, including peripherals, processors, memory, hard disks and other electronic devices, constitutes the engine of the computer, but without software to operate and control it, it cannot work by itself.

The software that completes this control work is called the operating system. The operating system is the system software that manages computer hardware and software resources, and is also the kernel and cornerstone of the computer system. The operating system needs to handle basic tasks such as managing and configuring memory, prioritizing the supply and demand of system resources, controlling input and output devices, operating the network, and managing the file system. The operating system also provides an interface for the user to interact with the system.

The composition of the operating system:

· Bootloader: responsible for the boot process of the device.

· Shell: a command interpreter (and scripting language) through which users control files, processes, and other programs.

· Kernel: the main component of the operating system; it manages memory, the CPU, and other resources.

· Desktop Environment: the environment with which users typically interact.

· Graphical server: the subsystem of the operating system that displays graphics on the screen.

· Applications: programs that perform different user tasks, such as word processors and spreadsheets.

· Daemons: providers of background services.

2. What is the Kernel?

The kernel is a key part of the operating system because it controls all the programs in the system. It acts as a bridge between applications and data processing at the hardware level with the help of inter-process communication and system calls.

When the device boots, the operating system is loaded into memory. At that point the kernel goes through an initialization process, takes charge of memory allocation, and stays resident in memory until the operating system shuts down. It creates the environment in which applications run, and takes care of low-level tasks such as task management, memory management, and disk management.

The kernel acts as a service provider, so a program can ask it to perform many tasks, such as requesting the use of a disk, a network card, or other hardware, and the kernel sets interrupts for the CPU to enable multitasking. It protects the computing environment by preventing a faulty program from interfering with the operation of other programs. It blocks unauthorized programs by denying them memory space and limits the amount of CPU time they consume.

In short:

1. From a technical point of view, the kernel is an intermediate layer between hardware and software. Its role is to pass requests from the application layer down to the hardware and, acting as the low-level driver layer, to address the various devices and components in the system.

2. From the application's point of view, the application has no direct connection to the hardware, only to the kernel, which is the lowest layer in the hierarchy the application knows about. In practice, the kernel abstracts away the hardware details.

3. The kernel is a resource manager. Responsible for allocating available shared resources (CPU time, disk space, network connection, etc.) to various system processes.

4. The kernel is like a library that provides a set of system-oriented services. To an application, invoking a system call looks just like calling an ordinary function.
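
A minimal user-space sketch of this point (assuming Linux with glibc; not taken from the original text): the program asks the kernel for the same service twice, once through the ordinary libc wrapper getpid() and once through the raw syscall(2) interface. In both cases the request reads like a normal function call.

```c
/* Sketch: invoking a kernel service directly via syscall(2).
 * getpid() is the ordinary libc wrapper around the same kernel entry point. */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    /* Call the kernel the "library" way (libc wrapper)... */
    pid_t via_wrapper = getpid();

    /* ...and the raw way: ask the kernel for the same service by number. */
    long via_syscall = syscall(SYS_getpid);

    printf("getpid() wrapper: %d, raw SYS_getpid: %ld\n",
           (int)via_wrapper, via_syscall);
    return 0;
}
```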

3. Classification of kernels

There are generally three categories of kernels:

1. Monolithic kernel: It contains a number of device drivers that create a communication interface between the hardware and software of a device.

It is the kernel widely used by operating systems. In a monolithic architecture, the kernel consists of various modules that can be dynamically loaded and unloaded. This architecture will extend the functionality of the OS and allow easy extension of the kernel.

With a monolithic architecture, kernel maintenance becomes easier, because the relevant module can be loaded and unloaded when a bug in that particular module needs to be fixed. This removes the tedious work of rebuilding and recompiling the entire kernel for small changes. In a monolithic kernel it is also easy to unload modules that are no longer used.

2. Micro kernel: It can only perform basic functions.

Microkernels were developed as an alternative to monolithic kernels, to address the problem of ever-growing kernel code that monolithic kernels could not contain. This architecture lets certain basic services (such as the protocol stack, device driver management, and file systems) run in user space. This enhances the functionality of the OS with minimal kernel code, improves security, and helps ensure stability.

It limits damage to affected areas by allowing the rest of the system to function without any interruption. In a microkernel architecture, all essential OS services are available to programs through inter-process communication (IPC). Microkernels allow direct interaction between device drivers and hardware.

3. Hybrid kernel: It combines various aspects of monolithic kernel and microkernel.

A hybrid kernel can decide what runs in user mode and what runs in supervisor mode. Typically, in a hybrid kernel, things like device drivers and file system I/O run in user mode, while server calls and IPC remain in supervisor mode.

4. Schools of kernel design

1. Microkernel. Only the most basic functions are implemented by a central kernel (the microkernel); Windows NT uses a microkernel-style architecture. In a microkernel architecture, the core of the operating system is a small kernel that implements a few basic services, such as creating and deleting processes, memory management, and interrupt management. Everything else, such as file systems and network protocols, runs in user space outside the microkernel. These functions are delegated to a number of independent processes that communicate with the central kernel through well-defined interfaces.

2. Macro (monolithic) kernel. All the kernel code, including its subsystems (memory management, file management, device drivers, and so on), is packed into a single image. All functions are carried out inside the kernel; in other words, the whole kernel is one very large program, and every function in the kernel can access every other part of it. Monolithic kernels nowadays support dynamic loading and unloading (trimming) of modules, and the Linux kernel is implemented on this basis.

An operating system using a microkernel has good scalability and the kernel is very small, but such an operating system is inefficient due to the cost of message passing between different layers. For a single-architecture operating system, all modules are integrated together, the speed and performance of the system are good, but the scalability and maintainability are relatively poor.

Logically, Linux may look as if it followed a microkernel structure, but it does not: Linux is a monolithic (single) kernel. This means that although Linux is divided into multiple subsystems that control the various components of the system (such as memory management and process management), all the subsystems are tightly integrated into a single kernel.

In contrast, a microkernel operating system provides a minimal set of functions, and all other operating system layers execute process-wise on top of the microkernel. Microkernel operating systems are generally less efficient due to message passing between layers, but such operating systems are very scalable.

Fundamentally, one of Linux's design philosophies is to break a large problem into small ones, with each small piece responsible for a single task. The Linux kernel can be extended through modules.

A module is a piece of code that runs in kernel space; it is in fact a kind of object file. It is not linked into the kernel image and cannot run on its own, but its code can be linked into the running system to execute as part of the kernel, or removed from the kernel, thereby dynamically extending the kernel's functionality.

This object code usually consists of a set of functions and data structures that implement a file system, a driver, or some other kernel-level functionality. The full name of the module mechanism is Loadable Kernel Module, or LKM; they are generally just called modules. Unlike the services that run as processes in user space on a microkernel operating system, a module is not executed as a process: like other statically linked kernel functions, it executes in kernel mode on behalf of the current process. Thanks to the module mechanism, the Linux kernel itself can be kept minimal, implementing only basic facilities such as the interface from modules to the kernel and the way the kernel manages all modules, while the system's extensibility is left to the modules.

Modules have kernel features that provide the benefits of a microkernel without the extra overhead.
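
As an illustration, here is the classic minimal loadable module, a sketch only and not taken from the original article. It builds against the kernel headers with the usual obj-m Kbuild rule, is inserted with insmod and removed with rmmod, and its messages appear in the kernel log (dmesg).

```c
/* hello_module.c - a minimal loadable kernel module (LKM) sketch. */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal example of dynamically extending the kernel");

static int __init hello_init(void)
{
    pr_info("hello_module: loaded into the running kernel\n");
    return 0;                 /* 0 means successful initialization */
}

static void __exit hello_exit(void)
{
    pr_info("hello_module: unloaded, the kernel shrinks again\n");
}

module_init(hello_init);      /* called on insmod/modprobe */
module_exit(hello_exit);      /* called on rmmod */
```

A one-line Kbuild makefile containing `obj-m += hello_module.o`, built with `make -C /lib/modules/$(uname -r)/build M=$PWD modules`, is the usual way to compile such a module out of tree.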

5. Kernel functions

The Linux kernel implements a number of important architectural properties. At a high level, the kernel is divided into subsystems.

Linux can also be seen as a whole because it integrates all these basic services into the kernel. This is different from the microkernel architecture, which provides some basic services such as communication, I/O, memory and process management, and more specific services are plugged into the microkernel layer.

The main tasks of the kernel are:

· Process management for application execution.

· Memory and I/O (input/output) management.

· System call control (core behavior of the kernel).

· Device management with device drivers.

· Provide a running environment for the application.

6. The core functions of the kernel

The main functions of the Linux kernel are: storage management, CPU and process management, the file system, device management and drivers, network communication, system initialization (boot), and system calls.

The main functions are as follows:

  • System memory management
  • Software program management
  • Hardware device management
  • File system management

1) System memory management

One of the main functions of the operating system kernel is memory management. The kernel manages not only the available physical memory on the server, but also creates and manages virtual memory (that is, memory that doesn't actually exist).

The kernel implements virtual memory through storage space on the hard disk, which is called swap space. The kernel constantly swaps the contents of virtual memory between swap space and actual physical memory. This makes the system think it has more memory available than physical memory.

Memory is divided into blocks called pages. The kernel places each memory page either in physical memory or in swap space, and maintains a page table indicating which pages are in physical memory and which have been swapped out to disk.
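
A small user-space sketch of the page concept (not from the original text; the _SC_PHYS_PAGES and _SC_AVPHYS_PAGES names are glibc extensions): it asks the kernel how big a page is and how many pages of physical memory it is managing.

```c
/* Sketch: asking the kernel about memory pages from user space. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page_size  = sysconf(_SC_PAGESIZE);      /* size of one memory page */
    long phys_pages = sysconf(_SC_PHYS_PAGES);    /* pages of physical memory */
    long avail      = sysconf(_SC_AVPHYS_PAGES);  /* currently available pages */

    printf("page size      : %ld bytes\n", page_size);
    printf("physical memory: %ld pages (%.1f MiB)\n",
           phys_pages, phys_pages * (double)page_size / (1024 * 1024));
    printf("available      : %ld pages\n", avail);
    return 0;
}
```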

2) Software program management

The Linux operating system refers to running programs as processes. Processes can run in the foreground, displaying output on the screen, or in the background, hiding behind the scenes. The kernel controls how the Linux system manages all the processes running on the system.

The kernel creates the first process (called the init process) to start all other processes on the system. When the kernel starts, it loads the init process into virtual memory. When the kernel starts any other process, it allocates a dedicated area in virtual memory for the new process to store the data and code used by that process.

Some Linux distributions use a table to manage processes to start automatically when the system is powered on. On Linux systems, this table is usually located in the special file /etc/inittab.

Other systems (such as the popular Ubuntu Linux distribution) use the /etc/init.d directory, where scripts that start or stop an application at boot are placed. These scripts are started through entries in the /etc/rcX.d directory, where X stands for the run level.

The init system of traditional Linux distributions is based on run levels. The run level determines which types of processes the init process starts, as defined in the /etc/inittab file or the /etc/rcX.d directory. SysV-style Linux systems define run levels 0 through 6.

  • At runlevel 1, only basic system processes and a console terminal process are started. We call it single-user mode. Single-user mode is typically used to perform urgent file system maintenance when there is a problem with the system. Obviously, in this mode, only one person (usually the system administrator) can log on to the system to manipulate data.
  • The standard startup runlevel is 3. At this run level, most applications, such as network support programs, start. Another common runlevel in Linux is 5. At this run level, the system starts the graphical X Window System, allowing users to log in to the system through a graphical desktop window.

You can use the ps command to view the processes currently running on a Linux system.
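
The sketch below (not from the original text) shows the basic pattern the kernel exposes for software program management: a parent process forks a child, the child replaces itself with the ps command mentioned above, and the parent waits for it, much as init waits for the processes it launches.

```c
/* Sketch of starting one process from another: fork() duplicates the caller,
 * exec() replaces the child with a new program, wait() collects its status. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t child = fork();               /* kernel creates a new process */
    if (child < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }
    if (child == 0) {
        /* Child: replace ourselves with the ps command to list processes. */
        execlp("ps", "ps", "-e", (char *)NULL);
        perror("execlp");               /* only reached if exec failed */
        _exit(EXIT_FAILURE);
    }
    /* Parent: wait for the child, just as init reaps its children. */
    int status = 0;
    waitpid(child, &status, 0);
    printf("child %d exited with status %d\n", (int)child, WEXITSTATUS(status));
    return 0;
}
```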

3) Hardware device management

Another responsibility of the kernel is to manage hardware devices. Any device that a Linux system needs to communicate with must have its driver code included in the kernel. The driver code acts as a middleman between applications and the hardware device, allowing data to be exchanged between the kernel and the device. There are two ways to insert device driver code into the Linux kernel:

  • Device driver code compiled into the kernel
  • Device driver modules that can be plugged into the kernel

Previously, the only way to insert device driver code was to recompile the kernel. Every time a new device is added to the system, the kernel code must be recompiled. As the Linux kernel supports more and more hardware devices, this process becomes increasingly inefficient. Fortunately, Linux developers have devised a better way to insert driver code into a running kernel.

The developers came up with the concept of kernel modules, which allow driver code to be inserted into a running kernel without recompiling it. Kernel modules can also be removed from the kernel when the device is no longer in use. This approach greatly simplifies and extends the use of hardware devices on Linux.

The Linux system treats hardware devices as special files called device files. There are 3 categories of device files:

  • Character device file: Refers to devices that process data one character at a time, such as most types of modems and
    terminals.
  • Block device file: Refers to a device that can process large blocks of data each time when processing data, such as a hard disk.
  • Network device file: refers to the device that uses data packets to send and receive data, including various network cards and a special loopback device.
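
To make the device-file categories concrete, here is a small user-space sketch (not from the original text; the /dev paths are examples and may differ per system) that asks the kernel, via stat(2), whether a path is a character device, a block device, or something else.

```c
/* Sketch: classifying a path as character device, block device, or neither. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

static void classify(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0) {
        perror(path);
        return;
    }
    if (S_ISCHR(st.st_mode))
        printf("%-12s character device (major %u)\n", path, major(st.st_rdev));
    else if (S_ISBLK(st.st_mode))
        printf("%-12s block device     (major %u)\n", path, major(st.st_rdev));
    else
        printf("%-12s not a device file\n", path);
}

int main(void)
{
    classify("/dev/tty");   /* usually a character device */
    classify("/dev/sda");   /* usually a block device, if present */
    classify("/etc/hosts"); /* a regular file, for comparison */
    return 0;
}
```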

4) File system management

Unlike some other operating systems, the Linux kernel can read and write data on the hard disk through many different types of file systems. In addition to its own native file systems, Linux can also read and write data in the file systems of other operating systems (such as Microsoft Windows). The kernel must be compiled with support for every file system it is expected to use; common examples include ext2, ext3, ext4, XFS, Btrfs, vfat, NTFS, and ISO-9660.

All hard drives accessed by a Linux server must be formatted with one of the file system types the kernel supports.
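
As a user-space sketch of this (not from the original text; statfs(2) is Linux-specific and f_type is a magic number like the constants in <linux/magic.h>), the program below reports which file system a given path lives on.

```c
/* Sketch: asking the kernel which file system a path lives on. */
#include <stdio.h>
#include <sys/vfs.h>

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : "/";
    struct statfs fs;

    if (statfs(path, &fs) != 0) {
        perror("statfs");
        return 1;
    }
    printf("%s: filesystem magic 0x%lx, block size %ld, free blocks %lu\n",
           path, (unsigned long)fs.f_type, (long)fs.f_bsize,
           (unsigned long)fs.f_bfree);
    return 0;
}
```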

The Linux kernel is efficient in its use of memory and CPU and is very stable over time. What is most remarkable, given this size and complexity, is its portability: Linux is compiled to run on a large number of processors and platforms with different architectural constraints and requirements. One example is that Linux can run both on processors with a memory management unit (MMU) and on processors without one; the uClinux port of the kernel provides the non-MMU support.

3. The overall architecture of the Linux kernel

1. Linux Kernel Architecture 

The UNIX/Linux system can be roughly abstracted into three levels, the bottom layer is the system kernel (Kernel); the middle layer is the Shell layer, that is, the command interpretation layer; the top layer is the application layer.

(1) Kernel layer

The kernel layer is the core and foundation of a UNIX/Linux system. It sits directly on the hardware platform, controls and manages all resources in the system (both hardware and software), and organizes the running of processes, thereby extending the capabilities of the hardware, improving resource utilization, and providing users with a convenient, efficient, safe, and reliable application environment.

(2) Shell layer

The Shell layer is the interface that interacts directly with the user. The user types a command line at the prompt, and the Shell interprets it and outputs the corresponding results or related information; for this reason the Shell is also called the command interpreter. Using the rich set of commands the system provides, many tasks can be completed quickly and easily.

(3) Application layer

The application layer provides a graphical environment based on the X Window protocol, which defines the functionality a graphical display system must provide.

The Linux kernel is only one part of the Linux operating system. Downward, it manages all the hardware devices of the system; upward, it provides interfaces to library routines (such as the C library) and other applications through system calls.

1) Kernel space:

The kernel space includes system calls, the kernel, and code related to the platform architecture. The kernel is in an elevated system state that includes a protected memory space and full access to the device hardware. This system state and memory space is collectively referred to as kernel space. Within kernel space, core access to hardware and system services is managed and provided as services to the rest of the system.

2) User space:

User space contains the user's applications and the C library. User space, or user land, is the code that runs outside the operating system kernel; it consists of the various applications, programs, and libraries that the operating system uses to interface with the kernel.

The user's applications are executed in user space, and they can access a portion of the computer's available resources through kernel system calls. By using the core services provided by the kernel, user-level applications such as games or office software can be created.

The kernel provides a set of interfaces for applications running in user mode to interact with the system. Also known as system calls, these interfaces allow applications to access hardware and other kernel resources. System calls not only provide an abstracted hardware level for applications, but also ensure the security and stability of the system.

Most applications do not use system calls directly; instead, they program against an application programming interface (API). Note that an API does not correspond one-to-one to system calls: APIs are provided to applications as part of a library, and each API is typically implemented through one or more system calls.
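
A minimal sketch of the distinction (not from the original text): printf() below is a C library API that buffers data in user space, while write(2) is the system call it ultimately relies on. Running the program under strace shows both paths ending in write system calls.

```c
/* Sketch: a library API call versus the system call underneath it. */
#include <stdio.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
    const char *msg = "direct write(2) system call\n";

    /* Library API: buffered, eventually flushed via one or more write(2)s. */
    printf("printf() via the C library API\n");
    fflush(stdout);

    /* System call: file descriptor 1 (stdout), straight to the kernel. */
    write(STDOUT_FILENO, msg, strlen(msg));
    return 0;
}
```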

2. Linux kernel architecture

In order to manage the above various resources and devices, the Linux kernel proposes the following architecture: 

According to the core functions of the kernel, the Linux kernel proposes 5 subsystems:

1. Process Scheduler, also known as process management, process scheduling. Responsible for managing CPU resources so that each process can access the CPU in the fairest possible way.

2. Memory Manager, memory management. Responsible for managing memory resources so that processes can safely share the machine's memory. Memory management also provides a virtual memory mechanism, which allows a process to use more memory than the system physically has: memory that is not in use is pushed out to external non-volatile storage via the file system and brought back into memory when it is needed again.

3. VFS (Virtual File System), the virtual file system. The Linux kernel abstracts external devices with different functions, such as disk-like devices (hard disks, NAND flash, NOR flash, etc.), input and output devices, and display devices, into a unified file operation interface (open, close, read, write, etc.). This is the embodiment of "everything is a file" in Linux (in fact Linux does not take this all the way, since the CPU, memory, network, and so on are not yet files; for a system where everything really is a file, see Bell Labs' "Plan 9").

4. Network, network subsystem. Responsible for managing the network equipment of the system and implementing a variety of network standards.

5. IPC (Inter-Process Communication), inter-process communication. IPC does not manage any hardware, it is mainly responsible for the communication between processes in the Linux system.

Process Scheduler

Process scheduling is the most important subsystem in the Linux kernel; it mainly provides controlled access to the CPU. CPU resources are limited, and many applications compete for them, so the process scheduling subsystem is needed to schedule and manage the CPU.

The process scheduling subsystem includes 4 sub-modules, whose functions are as follows:

  1. Scheduling Policy, the strategy for implementing process scheduling, which determines which (or which) processes will have the CPU.
  2. Architecture-specific Schedulers, the architecture-related part, is used to abstract the control of different CPUs into a unified interface. These controls are mainly used in suspend and resume processes, involving CPU register access, assembly instruction operations, and so on.
  3. Architecture-independent Scheduler, the architecture-independent part. It will communicate with the "Scheduling Policy module" to decide which process to execute next, and then resume the specified process through the "Architecture-specific Schedulers module".
  4. System Call Interface, the system call interface. The process scheduling subsystem opens up the interface that needs to be provided to the user space through the system call interface, and at the same time shields the details that do not need to be concerned by the user space program.
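
As a small user-space sketch of how these policies surface to programs (not from the original text; switching to a real-time policy such as SCHED_FIFO normally requires root, so this example only reads the current settings):

```c
/* Sketch: querying the scheduling policy and nice value the kernel has
 * assigned to the current process, then voluntarily yielding the CPU. */
#include <stdio.h>
#include <sched.h>
#include <sys/resource.h>

int main(void)
{
    int policy = sched_getscheduler(0);          /* 0 = calling process */
    int nice_value = getpriority(PRIO_PROCESS, 0);

    printf("policy: %s, nice: %d\n",
           policy == SCHED_OTHER ? "SCHED_OTHER (normal time-sharing)" :
           policy == SCHED_FIFO  ? "SCHED_FIFO (real-time FIFO)" :
           policy == SCHED_RR    ? "SCHED_RR (real-time round-robin)" :
                                   "other",
           nice_value);

    sched_yield();   /* hand the CPU back to the scheduler voluntarily */
    return 0;
}
```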

Memory Manager (MM)

Memory management is another critical subsystem in the Linux kernel; it mainly provides controlled access to memory resources. The Linux system establishes a mapping between physical memory and the memory used by processes (called virtual memory). This mapping is maintained per process, so different processes can use the same virtual addresses, and those identical virtual addresses can be mapped to different physical memory.

The memory management subsystem includes 3 sub-modules, whose functions are as follows:

  1. Architecture Specific Managers, architecture-related parts. Provides a virtual interface for accessing hardware memory.
  2. Architecture Independent Manager, the architecture independent part. Provides all memory management mechanisms, including: process-based memory mapping; virtual memory Swapping.
  3. System Call Interface, the system call interface. Through this interface, functions such as memory allocation, release, and file map are provided to user space programs and applications.
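
A brief user-space sketch of the mapping idea (not from the original text): mmap() asks the memory manager for a private anonymous region; the kernel hands out virtual addresses immediately and supplies physical pages lazily, on the first touch of each page.

```c
/* Sketch: requesting a private anonymous mapping from the memory manager. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t length = 4 * 1024 * 1024;            /* 4 MiB of virtual memory */

    char *region = mmap(NULL, length, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    strcpy(region, "backed by physical pages only when touched");
    printf("mapping at %p says: %s\n", (void *)region, region);

    munmap(region, length);                     /* return it to the kernel */
    return 0;
}
```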

Virtual Filesystem (VFS)

A file system in the traditional sense is a method of storing and organizing computer data. It presents the raw data blocks on disks and other devices in an easy-to-understand, human-friendly form (files and directories), making them easy to find and access. The essence of a file system is therefore "a method of storing and organizing data", and its outward manifestation is "reading data from a device and writing data to a device".

As computer technology has advanced, so have the methods of storing and organizing data, resulting in many types of file systems, such as FAT, FAT32, NTFS, EXT2, EXT3, and more. For compatibility, the operating system kernel must support multiple types of file systems in a uniform way, which is where the virtual file system (VFS) comes in. The job of the VFS is to manage the various file systems, hide their differences, and give user programs a single, uniform interface for accessing files.

We can read or write data from disks, hard drives, NAND Flash and other devices, so the original file systems were built on these devices. This concept can also be extended to other hardware devices, such as memory, display (LCD), keyboard, serial port and so on. Our access control to hardware devices can also be summarized as reading or writing data, so it can be accessed with a unified file operation interface. The Linux kernel does just that, abstracting away device filesystems, in-memory filesystems, and more, in addition to traditional disk filesystems. These logics are implemented by the VFS subsystem.

The VFS subsystem includes the following sub-modules, whose functions are as follows:

  1. Device Drivers, used to control all external devices and controllers. Because there is a huge number of mutually incompatible hardware devices (especially in embedded products), there is also a huge number of device drivers. Nearly half of the source code in the Linux kernel is device drivers, and most Linux low-level engineers (especially at domestic companies) spend their time writing or maintaining device drivers, leaving little time for the other parts, which are precisely the essence of the Linux kernel.
  2. Device Independent Interface. This module defines a unified way to describe hardware devices (the unified device model). All device drivers conform to this definition, which lowers development effort and presents a consistent interface to the layers above.
  3. Logical Systems, each file system corresponds to a Logical System (logical file system), which implements specific file system logic.
  4. System Independent Interface. This module represents hardware devices and logical file systems through a uniform interface (block devices and character devices), so that upper-layer software no longer has to care about the specific hardware form.
  5. System Call Interface, the system call interface, provides the user space with a unified interface for accessing the file system and hardware devices.
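
To show the unified interface in action, here is a brief sketch (not from the original text): the same open/read/close calls that work on a regular file also work on the character device /dev/urandom, because the VFS routes both through a common file-operations layer.

```c
/* Sketch of "everything is a file": reading from a device node with the
 * ordinary file interface. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    unsigned char buf[8];
    int fd = open("/dev/urandom", O_RDONLY);    /* a character device */
    if (fd < 0) {
        perror("open /dev/urandom");
        return 1;
    }
    if (read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror("read");
        close(fd);
        return 1;
    }
    close(fd);

    printf("8 random bytes from a device, read like any file:");
    for (size_t i = 0; i < sizeof(buf); i++)
        printf(" %02x", buf[i]);
    printf("\n");
    return 0;
}
```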

Network Subsystem (Net)

The network subsystem in the Linux kernel is mainly responsible for managing the system's network devices and implementing the various network protocol stacks, ultimately providing connectivity to other systems over the network. Within the Linux kernel, the network subsystem is almost self-contained. It includes 5 sub-modules, whose functions are as follows:

  1. Network Device Drivers, the drivers of network devices, are the same as the device drivers in the VFS subsystem.
  2. Device Independent Interface, which is the same as in the VFS subsystem.
  3. Network Protocols, which implements various network transmission protocols, such as IP, TCP, UDP, etc.
  4. Protocol Independent Interface, shields different hardware devices and network protocols, and provides interfaces (sockets) in the same format.
  5. System Call interface, the system call interface, provides user space with a unified interface for accessing network devices.
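
A minimal user-space sketch of the socket layer (not from the original text; the address 127.0.0.1 and port 7777 are placeholder values for a local test server): once connected, the same file-descriptor write interface used for files carries network data.

```c
/* Sketch: a tiny TCP client using the socket layer. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);     /* TCP socket */
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(7777);                  /* assumed test port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        perror("connect");                        /* no server listening? */
        close(fd);
        return 1;
    }

    const char *msg = "hello from the socket layer\n";
    write(fd, msg, strlen(msg));                  /* same write(2) as files */
    close(fd);
    return 0;
}
```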

IPC subsystem, please refer to: 

Linux process management and scheduled tasks and system backup and recovery - Wespten's Blog - CSDN Blog
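
Since IPC is only referenced above, here is a minimal user-space sketch of the simplest kernel-mediated IPC mechanism, a pipe between a parent and a child process (not from the original text): the kernel owns the buffer, and the two processes see only a pair of file descriptors.

```c
/* Sketch: passing a message between processes through a kernel pipe. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fds[2];                     /* fds[0] = read end, fds[1] = write end */
    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }

    pid_t child = fork();
    if (child == 0) {               /* child: write a message and exit */
        close(fds[0]);
        const char *msg = "message passed through the kernel";
        write(fds[1], msg, strlen(msg) + 1);
        close(fds[1]);
        _exit(0);
    }

    close(fds[1]);                  /* parent: read what the child sent */
    char buf[64];
    ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
    buf[n > 0 ? n : 0] = '\0';
    printf("parent received: %s\n", buf);
    close(fds[0]);
    waitpid(child, NULL, 0);
    return 0;
}
```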

3. Directory structure of Linux kernel source code

The Linux kernel source code consists of three main parts:

1. Kernel core code, including various subsystems and sub-modules, and other supporting subsystems, such as power management, Linux initialization, etc.

2. Other non-core codes, such as library files (because the Linux kernel is a self-contained kernel, that is, the kernel does not depend on any other software and can be compiled by itself), firmware collections, KVM (virtual machine technology), etc.

3. Compilation scripts, configuration files, help documents, copyright instructions and other auxiliary files

The following walk through the kernel directory tree uses the linux-3.14 kernel as an example:

1. Documentation:

Kernel documentation. Explanatory information about the kernel is kept as help text in this directory.

For example, linux-3.14-fs4412/Documentation/devicetree/bindings/interrupt-controller/interrupts.txt

This file describes in detail the cells of the interrupts property of a device tree node.

Just from the folder names, we can find the documentation we need.

2. arch:

arch is an abbreviation for architecture. All architecture-related code lives in this directory and in the include/asm-*/ directories. Each architecture supported by Linux has a corresponding subdirectory under arch, which is further broken down into subdirectories such as boot, mm, kernel, and so on:

 |--arm              arm and compatible architectures
    |--boot          boot loader and the code used to start the kernel on this hardware platform
       |--compressed     kernel decompression
       |--tools          programs that generate the compressed kernel image
    |--kernel        implementations of architecture-specific features such as semaphore handling and SMP
    |--lib           architecture-specific implementations of common functions such as strlen and memcpy
    |--mm            implementation of the architecture-specific memory manager

In addition to the kernel, lib, and mm subdirectories, most architectures also have a boot subdirectory containing the boot loader used to start the kernel on that hardware platform.

3. drivers:

Driver code, a driver is a piece of software that controls hardware. This directory is the largest directory in the kernel, and drivers for graphics cards, network cards, SCSI adapters, PCI buses, USB buses, and any other peripherals or buses supported by Linux can be found here.

4. fs:

The code for the virtual file system (VFS), and the code for each of the different file systems are in this directory. All file systems supported by Linux have a corresponding subdirectory under the fs directory. For example, the ext2 file system corresponds to the fs/ext2 directory.

A file system is the intermediary between a storage device and the processes that need to access it. Storage devices may be physically local, such as a hard disk or CD-ROM drive, which use the ext2/ext3 and isofs file systems, respectively.

There are also virtual file systems (such as proc), which behave as standard file systems but exist only in memory and occupy no disk space.

5. include:

This directory contains most of the header files in the kernel and is grouped in the following subdirectories. To modify the processor architecture simply edit the kernel's makefile and rerun the Linux kernel configuration program.

| include/asm-*/   One subdirectory per architecture, such as include/asm-alpha, include/asm-arm, etc. The files in each subdirectory define the preprocessor macros and inline functions needed to support the given architecture, most of them full or partial assembly-language implementations.

| include/linux    Platform-independent header files all live here. This directory is usually linked to /usr/include/linux (or all the files inside it are copied to the /usr/include/linux directory).

6. init:

Kernel initialization code. Includes main.c, the code that creates early user space, and other initialization code.

7. ipc:

IPC (Inter-Process Communication). It contains shared memory, semaphores, and other forms of IPC code.

8. kernel:

The core part of the kernel, including process scheduling (sched.c) and process creation and termination (fork.c and exit.c). The platform-dependent portion of this core code lives in the arch/*/kernel directories.

9. mm

This directory contains part of the memory management code that is independent of the architecture. Architecture-dependent memory management code is located under arch/*/mm.

10. net

    The core network code implements various common network protocols, such as TCP/IP, IPX, etc.

11. lib

    This directory contains the kernel's library code. It implements a generic subset of the standard C library, including functions for string and memory manipulation (strlen, memcpy, etc.) as well as the sprintf and atoi families. Unlike the code under arch/*/lib, the library code here is written in C and can be used directly in newly ported versions of the kernel; processor-specific library code lives under arch/*/lib.

12. block:

    Block device drivers, including the IDE driver (ide.c). A block device is a device that receives and sends data in blocks. The block layer code was originally split between the drivers and fs directories; since 2.6.15, the core block layer code has been pulled out into the top-level block directory. To trace the initialization of devices that can contain file systems, look at device_setup() in drivers/block/genhd.c. When mounting an NFS file system, not only the disk but also the network must be initialized. Block devices include IDE and SCSI devices.

13. firmware

The firmware directory contains code that lets the computer read and understand signals sent by devices. For example, a camera manages its own hardware, but the computer must understand the signals the camera sends to it. Linux uses the vicam firmware to understand the camera's communication; without that firmware, Linux would not know what to do with the information coming from the camera. The firmware also helps send messages from the Linux system to the device, so Linux can, for example, tell the camera to refocus or turn off.

14. usr:

Implements the initramfs packing (cpio, etc.). The code in this folder builds those archives after the kernel has been compiled.

15. security:

This directory contains code for different Linux security models. It is important to keep your computer safe from viruses and hackers. Otherwise, the Linux system may be damaged.

16. crypto:

The crypto API used by the kernel itself. It implements commonly used ciphers and hashing algorithms, as well as some compression and CRC checksum algorithms. For example, "sha1_generic.c" contains the code for the SHA-1 hash algorithm.

17. scripts:

There is no kernel code in this directory, just the script files used to configure the kernel. When running commands such as make menuconfig or make xconfig to configure the kernel, the user interacts with scripts located in this directory.

18. sound:

Sound card driver and other sound-related source code.

19. samples

Some examples of kernel programming.

20. virt

This folder contains virtualization code, which allows users to run multiple operating systems at once. With virtualization, the guest operating system runs just like any other application running on the Linux host.

21. tools

This folder contains tools for interacting with the kernel.

COPYING: License and authorization information. The Linux kernel is licensed under GPLv2. This license grants anyone the right to use, modify, distribute, and share the source and compiled code free of charge; anyone who redistributes it, with or without charging for it, must make the source code available under the same terms.

CREDITS : list of contributors

Kbuild : This is a script that sets some kernel settings. For example, this script sets an ARCH variable, which is the type of processor the developer wants to generate for the kernel to support.

Kconfig: This script will be used by developers when configuring the kernel

MAINTAINERS : This is a list of current maintainers, their email addresses, home pages, and specific parts or files of the kernel they are responsible for developing and maintaining. This is useful when a developer finds a problem in the kernel and wants to be able to report it to a maintainer who can deal with it.

Makefile : This script is the main file for compiling the kernel. This file passes compilation parameters and files and necessary information required for compilation to the compiler.

README : This document provides information for developers who want to know how to compile the kernel.

REPORTING-BUGS : This document provides information on how to report bugs.

The kernel code consists of files with the extension ".c" or ".h". The ".c" extension indicates that the kernel is written in C, and the ".h" files are header files, also written in C. Header files contain declarations and code that the ".c" files need; by importing existing code instead of rewriting it, they save programmers time. Otherwise the same code would be duplicated in many or all of the ".c" files, wasting effort and disk space. (Annotation: header files not only avoid repeated coding, the resulting code reuse also reduces the chance of errors.)

Summary of the overall architecture of the Linux kernel:

Linux Kernel Architecture:

(1) System call interface

The SCI layer provides some mechanisms to perform function calls from user space to the kernel. As discussed earlier, this interface is architecture-dependent, even within the same processor family. SCI is actually a very useful function call multiplexing and demultiplexing service. You can find the SCI implementation in ./linux/kernel and the architecture-dependent parts in ./linux/arch.

(2) Process management

The focus of process management is the execution of the process. In the kernel, these processes are called threads and represent separate processor virtualization (thread code, data, stack, and CPU registers). In user space, the term process is often used, but the Linux implementation does not distinguish between the two concepts (process and thread). The kernel provides an application programming interface (API) through SCI to create a new process (fork, exec, or Portable Operating System Interface [POSIX] functions), stop processes (kill, exit), and communicate and synchronize between them (signal or POSIX mechanism).

Process management also includes dealing with the need to share CPU among active processes. The kernel implements a new type of scheduling algorithm that operates in constant time regardless of how many threads are competing for the CPU. This algorithm is called an O(1) scheduler, and the name implies that it takes the same amount of time to schedule multiple threads as it does to schedule one thread. The O(1) scheduler can also support multiple processors (called symmetric multiprocessing or SMP). You can find the source code for process management in ./linux/kernel and the architecture-dependent source in ./linux/arch.

(3) Memory management

Another important resource managed by the kernel is memory. For efficiency, given the way hardware manages virtual memory, memory is managed in so-called pages (4KB on most architectures). Linux includes the means to manage the available memory, as well as the hardware mechanisms used for physical-to-virtual mappings. But memory management is about much more than 4KB buffers: Linux provides abstractions over 4KB pages, such as the slab allocator. This memory management scheme uses 4KB pages as a base, allocates structures from them, and tracks page usage, knowing which pages are full, which are partially used, and which are empty. This allows it to adjust memory use dynamically according to the needs of the system. To support multiple users of memory, the available memory can sometimes be exhausted; for this reason pages can be moved out of memory and onto disk. This process is called swapping, because pages are swapped from memory to disk. The source code for memory management can be found in ./linux/mm.

(4) Virtual file system

The Virtual File System (VFS) is a very useful aspect of the Linux kernel because it provides a common interface abstraction to the file system. VFS provides a swap layer between the SCI and the filesystems supported by the kernel.

File system hierarchy:

On top of VFS, is a generic API abstraction for functions such as open, close, read, and write. Underneath the VFS is the file system abstraction, which defines how the upper-level functions are implemented. They are plugins for a given filesystem (more than 50). The source code for the filesystem can be found in ./linux/fs. Below the file system layer is the buffer cache, which provides a general set of functions for the file system layer (agnostic to the specific file system). This caching layer optimizes access to physical devices by retaining data for a period of time (or reading it in advance so that it is available when needed). Below the buffer cache is the device driver, which implements an interface to a specific physical device.

(5) Network stack

The network stack is designed to follow a layered architecture that mimics the protocol itself. Recall that the Internet Protocol (IP) is the core network layer protocol underlying the transport protocol (often called the Transmission Control Protocol or TCP). Above TCP is the socket layer, which is called through SCI. The socket layer is the standard API of the network subsystem, which provides a user interface for various network protocols. From raw frame access to IP Protocol Data Units (PDUs) to TCP and User Datagram Protocol (UDP), the socket layer provides a standardized way to manage connections and move data between endpoints. The network source code in the kernel can be found in ./linux/net.

(6) Device driver

A great deal of the code in the Linux kernel is device drivers, which make particular hardware devices usable. The Linux source tree provides a drivers subdirectory, which is further divided by the kind of device supported, such as Bluetooth, I2C, serial, and so on. The device driver code can be found in ./linux/drivers.

(7) Architecture-dependent code

Although Linux is largely independent of the architecture it runs on, some elements must take the architecture into account in order to operate correctly and efficiently. The ./linux/arch subdirectory holds the architecture-dependent portion of the kernel source and contains a subdirectory per architecture (together these make up the BSP). For a typical desktop system this is the x86 directory. Each architecture subdirectory contains many further subdirectories, each focusing on a specific aspect of the kernel, such as booting, the core kernel, memory management, and so on. This architecture-dependent code lives in ./linux/arch.

As if the portability and efficiency of the Linux kernel were not enough, Linux provides some other features that do not fit the categories above. As a production operating system and open source software, Linux is a good platform for testing new protocols and protocol enhancements. Linux supports a large number of network protocols, including the usual TCP/IP, as well as extensions for high-speed networking (faster than 1 Gigabit Ethernet [GbE] and 10 GbE). Linux also supports protocols such as the Stream Control Transmission Protocol (SCTP), which offers many features beyond TCP as a successor transport-layer protocol.

Linux is also a dynamic kernel that supports the addition and removal of software components at runtime. Known as dynamically loadable kernel modules, they can be inserted at boot time when needed (for instance, when a particular device requires the module) or at any time afterwards.

One of the newer enhancements to Linux is the ability to act as an operating system for other operating systems (known as a hypervisor). This comes from a modification to the kernel called the Kernel-based Virtual Machine (KVM). It introduces a new interface to user space that allows other operating systems to run on top of a KVM-enabled kernel. Besides running other instances of Linux, Microsoft Windows can also be virtualized. The only requirement is that the underlying processor support the virtualization instructions.

4. The overall architecture design of the kernel

1. Kernel mechanism

Each major function is implemented as a separate kernel subsystem. If the subsystems need to communicate, a mechanism must be designed that lets them do so safely, reliably, and efficiently.

Linux has absorbed the design experience of microkernel in the step-by-step development. Although it is a single kernel, it has the characteristics of microkernel.

Linux uses a modular kernel design to gain some microkernel characteristics, but this modular design does not split the kernel into independent subsystems the way a microkernel does: the kernel consists of a core plus peripheral functional modules. In a microkernel, the subsystems all run independently and can work without relying on the other parts; Linux modules must depend on the core, but they can be loaded when they are needed and dynamically unloaded when they are not. Externally, a Linux module looks like a library file, but whereas a user-space shared library ends in .so, a kernel module ends in .ko (kernel object) and is invoked by the kernel.

Suppose every driver had to be built into the kernel: imagine compiling a kernel and installing it on a host, only to discover later that it cannot drive a new hardware device added afterwards. Since all hardware is driven by the kernel and the kernel does not contain that driver, both users and manufacturers would face the considerable trouble of recompiling the kernel.

The modular design can avoid this situation. Each manufacturer develops its own driver in a modular form, and only needs to develop its own driver program for a specific device.

One of the things the Linux kernel developers did was make kernel modules loadable and unloadable at runtime, which means features can be added to or removed from the kernel dynamically. This not only adds hardware support to the kernel; modules can also provide services such as low-level virtualization, and in some cases even the entire kernel can be replaced without restarting the computer.

We can compile these modules when they are needed; when a function is not needed, the module can be unloaded without affecting the operation of the core. Imagine being able to apply a Windows service pack without rebooting: that is one of the benefits of modularity.

1) CPU working mechanism

CPU working mode

Modern CPUs usually implement different working modes. Take ARM as an example: ARM implements 7 working modes. In different modes, the instructions that the CPU can execute or the registers that can be accessed are different:

(1) User mode usr

(2) System mode sys

(3) Management mode svc

(4) Fast interrupt fiq

(5) External interrupt irq

(6) Data abort abt

(7) Undefined instruction und

Take x86 as another example: x86 implements 4 privilege levels, Ring 0 through Ring 3. Ring 0 can execute privileged instructions and access I/O devices; Ring 3 is subject to many restrictions.

Therefore, from the perspective of CPU, in order to protect the security of the kernel, Linux divides the system into user space and kernel space.

User space and kernel space are two different states of program execution. We can complete the transfer from user space to kernel space through "system calls" and "hardware interrupts".

Applications run "on top of" the kernel only in a logical sense; physically they execute directly on the hardware. All application data lives in memory, and all data processing is done by the CPU, but neither can be used arbitrarily: both must be managed by the kernel.

But there is only one CPU. While an application is running, the kernel lies dormant, and the application occupies its own region of memory. As soon as the application wants to access other hardware resources, that is, to execute I/O instructions, it cannot do so directly, because the application cannot see the hardware. Applications are built on system calls: when an application needs a hardware resource, it issues a privileged request to the CPU. On receiving that request, the CPU wakes the kernel and executes a piece of kernel code (not the whole kernel program), returns the result to the application, and then the kernel code exits and the kernel goes dormant again.

During this time, the CPU switches from user mode to kernel mode, which is effectively a privileged mode.

All applications execute directly on the hardware and are only managed and supervised by the kernel when necessary, so the kernel can also be seen as a monitor: a monitoring program for resources and processes.

The kernel itself produces nothing; productive work is done by applications. We should therefore try to let the system run in application (user) mode as much as possible: the less time the kernel occupies, the better. The kernel spends most of its time on functions such as process switching and interrupt handling. Mode switching exists so that productive work can get done, but process switching by itself produces nothing, while interrupt handling can be considered part of production, because applications need to perform I/O.

The main purpose of the kernel is to manage the hardware. In Linux, every process is created from its parent process via fork(), which raises the question of who creates and manages all of these processes. That is the job of the big housekeeper program, init, which manages all processes in user space as a whole.

The kernel does not manage user space directly, so after the kernel starts, init must be started first in order to bring up user space; this is why the PID of init is always 1. init is itself created by a fork() in kernel space, a mechanism used specifically to bootstrap user-space processes. init is an ordinary application, an executable file located at /sbin/init.

CPU time

Because each process in memory behaves as if it monopolized the CPU, the kernel virtualizes the CPU and hands it out to processes. The CPU is virtualized at the kernel level by cutting its time into time slices, so computing power is distributed among processes over time; the CPU provides its computing power in units of time.

The greater the computing power that can be provided per unit time, the faster the CPU; otherwise the only option is to take longer. That is why we want a faster CPU to save time.

Computational characteristics of CPU

 I/O is the slowest device. Our CPU spends a lot of time waiting for I/O to complete. In order to avoid idle and meaningless waiting, when we need to wait, let the CPU run other processes or threads.

We should squeeze as much computing power out of the CPU as possible, because its computing power ticks away with the clock oscillator over time: it is running whether you use it or not.

If you let the CPU idle, it still consumes power, and over time that computing power is simply wasted, so it is reasonable to keep the CPU working at around 80-90% utilization, which means its capacity is being fully used. The CPU does not wear out from this; it is electrical equipment, and apart from drawing more power and generating more heat (which adequate cooling handles), heavy use does not damage it. If anything, electrical equipment deteriorates from not being used.

9 Synchronization Mechanisms in the Linux Kernel

Linux often uses hash tables to implement caches, which hold information that needs to be accessed quickly.

Once the operating system introduced the concept of a process and the process became the scheduling entity, the system gained the ability to execute multiple processes concurrently, but this also led to competition for, and sharing of, resources among the processes in the system.

In addition, due to the introduction of interrupts, exception mechanisms, and kernel preemption, these kernel execution paths (processes) run in an interleaved manner. If necessary synchronization measures are not taken for these interleaved kernel paths, key data structures will be accessed and modified in an interleaved way, leaving them in inconsistent states and eventually crashing the system. Therefore, to ensure that the system runs efficiently, stably and in an orderly way, Linux must adopt synchronization mechanisms.

In Linux, the code segment that accesses a shared resource is called the critical section, and whatever causes multiple processes to access the same shared resource is called a concurrency source.

The main sources of concurrency under Linux systems are:

Interrupt processing: for example, a process is interrupted while accessing a critical resource and the CPU enters the interrupt handler; if the interrupt handler also accesses that critical resource, then, although this is not concurrency in the strict sense, it still causes a race on the resource.

Kernel preemption: for example, while a process is accessing a critical resource, kernel preemption occurs and a higher-priority process starts running; if that process accesses the same critical resource, it causes concurrency between the two processes.

Concurrency of multiple processors: on a multiprocessor system there is true concurrency between processes. Each processor can independently schedule and run a process, so multiple processes really do run at the same time.

As mentioned above, it can be seen that the purpose of using the synchronization mechanism is to prevent multiple processes from concurrently accessing the same critical resource.

9 synchronization mechanisms:

1) Per CPU variable

The main form is an array of data structures, one element of the array for each CPU in the system.

Use case: Data should be logically independent

Usage Guidelines: Per-CPU variables should be accessed with preemption disabled in the kernel control path.
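As a rough sketch of how a per-CPU variable is used (the variable name my_hit_count is made up for illustration), each CPU increments only its own element, with preemption disabled for the duration of the access:

#include <linux/percpu.h>

/* One counter per CPU; no lock is needed because each CPU only
 * touches its own copy. */
static DEFINE_PER_CPU(unsigned long, my_hit_count);

static void count_hit(void)
{
    /* get_cpu_var() disables preemption so the task cannot migrate to
     * another CPU while updating its element; put_cpu_var() re-enables it. */
    get_cpu_var(my_hit_count)++;
    put_cpu_var(my_hit_count);
}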

2) Atomic operations

Principle: implemented by means of assembly instructions that perform the "read-modify-write" sequence atomically.
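A minimal sketch of atomic operations (the counter name is hypothetical): atomic_inc() performs the whole read-modify-write as one indivisible step, so no lock is needed for a simple counter.

#include <linux/atomic.h>

static atomic_t pkt_count = ATOMIC_INIT(0);

static void on_packet(void)
{
    atomic_inc(&pkt_count);          /* indivisible read-modify-write */
}

static int packets_seen(void)
{
    return atomic_read(&pkt_count);  /* atomic read of the counter */
}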

3) Memory barriers

Rationale: Use a memory barrier primitive to ensure that an operation preceding the primitive has completed before the operation following the primitive begins.
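A simplified publish/consume sketch of the classic barrier pairing (variable names are made up; real kernel code would normally also use READ_ONCE()/WRITE_ONCE()): the writer's smp_wmb() guarantees the data store is visible before the flag store, and the reader's smp_rmb() pairs with it.

#include <asm/barrier.h>

static int shared_data;
static int data_ready;

static void producer(void)
{
    shared_data = 42;
    smp_wmb();          /* data must be visible before the flag */
    data_ready = 1;
}

static int consumer(void)
{
    if (data_ready) {
        smp_rmb();      /* pairs with the writer's smp_wmb() */
        return shared_data;
    }
    return -1;
}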

4) Spin lock

Mainly used in multiprocessor environments.

Rationale: If a kernel control path finds that the requested spinlock is already "locked" by a kernel control path running on another CPU, it executes a loop instruction repeatedly until the lock is released.

Description: Spinlocks are generally used to protect critical sections that are prohibited from being preempted by the kernel.

On a single processor, spinlocks only disable or enable kernel preemption.
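A minimal spinlock sketch (lock and counter names are illustrative): a second CPU trying to enter bump_counter() spins until the first releases the lock; on a uniprocessor kernel the lock compiles down to disabling and re-enabling preemption.

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);
static unsigned long shared_counter;

static void bump_counter(void)
{
    spin_lock(&my_lock);      /* busy-waits if another CPU holds the lock */
    shared_counter++;         /* critical section: keep it short */
    spin_unlock(&my_lock);
}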

5) Sequence lock

A sequence lock is very similar to a spin lock, except that the writer has a higher priority than readers: the writer is allowed to proceed even while readers are reading.
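A sketch of seqlock usage under the assumption of a single writer (the names are made up): the writer never waits for readers, while a reader retries its read if the sequence count shows that a write happened in the middle.

#include <linux/seqlock.h>
#include <linux/types.h>

static DEFINE_SEQLOCK(my_seqlock);
static u64 my_time;

static void update_time(u64 now)
{
    write_seqlock(&my_seqlock);       /* writer proceeds immediately */
    my_time = now;
    write_sequnlock(&my_seqlock);
}

static u64 read_time(void)
{
    unsigned int seq;
    u64 t;

    do {
        seq = read_seqbegin(&my_seqlock);
        t = my_time;
    } while (read_seqretry(&my_seqlock, seq));   /* retry if a write raced */

    return t;
}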

6)RCU

Mainly used to protect data structures that are read by multiple CPUs.

Readers and writers are allowed to run at the same time, and RCU readers take no locks.

Usage restrictions:

1) RCU only protects data structures that are dynamically allocated and referenced by pointers

2) In a critical section protected by RCU, no kernel control path can sleep.

Principle:

When the writer wants to update the data, it makes a copy of the entire data structure referenced by the pointer and modifies that copy. When the modification is done, the writer atomically switches the pointer from the original data structure to the modified copy.
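A hedged sketch of that copy-update-switch pattern for a pointer-referenced structure (struct conf and the function names are invented for illustration; the writer must run in process context because synchronize_rcu() may sleep):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct conf {
    int value;
};

static struct conf __rcu *cur_conf;

/* Reader: takes no lock and must not sleep inside the critical section. */
static int read_conf(void)
{
    struct conf *c;
    int v = -1;

    rcu_read_lock();
    c = rcu_dereference(cur_conf);
    if (c)
        v = c->value;
    rcu_read_unlock();
    return v;
}

/* Writer: copy, modify, switch the pointer atomically, wait, free. */
static void update_conf(int value)
{
    struct conf *newc = kmalloc(sizeof(*newc), GFP_KERNEL);
    struct conf *oldc;

    if (!newc)
        return;
    newc->value = value;

    oldc = rcu_dereference_protected(cur_conf, 1);
    rcu_assign_pointer(cur_conf, newc);   /* atomic pointer switch */
    synchronize_rcu();                    /* wait for pre-existing readers */
    kfree(oldc);
}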

7) Semaphore:

Principle: When the kernel control path tries to acquire the busy resource protected by the kernel semaphore, the corresponding process is suspended; only when the resource is released, the process becomes runnable again.

Usage restrictions: Only functions that can sleep can acquire kernel semaphores;

Neither interrupt handlers nor deferrable functions can use kernel semaphores.
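A small sketch of kernel semaphore usage (the semaphore and function names are hypothetical): because down_interruptible() may sleep, this code may only be called from process context, never from an interrupt handler or deferrable function.

#include <linux/semaphore.h>
#include <linux/errno.h>

static struct semaphore cfg_sem;

static void cfg_init(void)
{
    sema_init(&cfg_sem, 1);              /* binary semaphore */
}

static int update_config(void)
{
    if (down_interruptible(&cfg_sem))    /* may sleep while waiting */
        return -ERESTARTSYS;

    /* ... touch the protected resource ... */

    up(&cfg_sem);
    return 0;
}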

8) Local interrupt disable

Principle: disabling local interrupts ensures that the kernel control path keeps running even if a hardware device raises an IRQ signal, thereby protecting the data structures that are also accessed by interrupt handlers.

Disadvantage: Disabling local interrupts does not limit concurrent access to shared data structures by interrupt handlers running on another CPU,

So in a multiprocessor environment, disabling local interrupts needs to be used in conjunction with spinlocks.
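A sketch of that combination (names invented): the process-context path disables local interrupts and takes a spinlock, so neither the local interrupt handler nor an interrupt handler running on another CPU can race with it.

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(dev_lock);
static int dev_state;

/* Process-context path: mask local interrupts and lock. */
static void update_from_process(int v)
{
    unsigned long flags;

    spin_lock_irqsave(&dev_lock, flags);
    dev_state = v;
    spin_unlock_irqrestore(&dev_lock, flags);
}

/* Interrupt handler (on any CPU) takes the same lock. */
static void update_from_irq(int v)
{
    spin_lock(&dev_lock);
    dev_state = v;
    spin_unlock(&dev_lock);
}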

9) Prohibition of local soft interrupt

method 1:

Since softirq starts running at the end of the hardware interrupt handler, the easiest way is to disable interrupts on that CPU.

Because no interrupt handler routine is activated, the softirq has no chance to run.

Method 2:

Softirqs can be activated or disabled on the local CPU by manipulating the softirq counter stored in the preempt_count field of the current thread_info descriptor, because the kernel sometimes only needs to disable softirqs without disabling interrupts.
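A minimal sketch of method 2 (the variable is illustrative): local_bh_disable() bumps the softirq counter in preempt_count so softirqs cannot run on this CPU, while hardware interrupts stay enabled.

#include <linux/bottom_half.h>

static int stats;   /* shared between process context and a softirq */

static void update_stats(void)
{
    local_bh_disable();   /* softirqs deferred on this CPU */
    stats++;
    local_bh_enable();    /* pending softirqs may run now */
}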

2) Memory mechanism

The memory mechanism of Linux includes address space, physical memory, memory mapping, paging mechanism and switching mechanism.

address space

One of the advantages of virtual memory is that each process thinks it has all the address space it needs. The size of virtual memory can be many times the size of the physical memory in the system. Each process in the system has its own virtual address space, which is completely independent of each other. A process running an application will not affect other processes, and applications are also protected from each other. The virtual address space is mapped to physical memory by the operating system. From the application's point of view, this address space is a linear flat address space; however, the kernel handles user virtual address space very differently.

The linear address space is divided into two parts: the user address space and the kernel address space. The user address space changes with every context switch, while the kernel address space always remains the same. How much space is allocated to user space and kernel space mainly depends on whether the system is a 32-bit or 64-bit architecture. For example, x86 is a 32-bit architecture and supports only 4GB of address space, of which 3GB is reserved for user space and 1GB is allocated to the kernel. The exact split is determined by the kernel configuration variable PAGE_OFFSET.

physical memory

To support multiple architectures, Linux uses an architecture-independent way to describe physical memory.

Physical memory can be organized into banks, each at a specific distance from the processor. This kind of memory layout has become very common as more and more machines adopt Non-Uniform Memory Access (NUMA) technology. The Linux virtual memory subsystem represents this arrangement as nodes. Each node is divided into a number of blocks called management zones, which represent address ranges within memory. There are three different management zones: ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM. For example, x86 has the following memory management zones:

ZONE_DMA The first 16MB of the memory address
ZONE_NORMAL 16MB~896MB
ZONE_HIGHMEM 896MB~the end of the memory address
Each management zone has its own purpose. Some older ISA devices are restricted in which addresses they can use for I/O operations, and ZONE_DMA exists to accommodate those restrictions.
ZONE_NORMAL is used for all kernel operations and allocations. It is extremely important for system performance.
ZONE_HIGHMEM is the rest of the memory in the system. It should be noted that ZONE_HIGHMEM cannot be used for kernel allocation and data structures, and can only be used to save user data.

memory map

In order to better understand the mapping mechanism of kernel memory, the following is an example of x86. As mentioned earlier, the kernel only has 1GB of virtual address space available, and the other 3GB is reserved for user space. The kernel maps physical memory in ZONE_DMA and ZONE_NORMAL directly into its address space. This means that the first 896MB of physical memory in the system is mapped into the kernel virtual address space, leaving only 128MB of virtual address space. This 128MB of virtual address space is used for operations such as vmalloc and kmap.

This mapping mechanism works well when physical memory is small (less than 1GB). However, today's servers support tens of gigabytes of memory. Intel introduced the Physical Address Extension (PAE) mechanism in its Pentium processors, which supports up to 64GB of physical memory, and the memory-mapping scheme described above made handling tens of gigabytes of physical memory a major source of problems for x86 Linux. The Linux kernel handles high memory (all memory above 896MB) as follows: when the kernel needs to address a page in high memory, it maps that page into a small virtual address window via the kmap operation, performs the operation on the page, and then unmaps it. The address space of a 64-bit architecture is huge, so such systems do not have this problem.

paging mechanism

Virtual memory can be implemented in a variety of ways, the most efficient of which is a hardware-based solution. The virtual address space is divided into fixed-size blocks of memory, called pages. Virtual memory accesses are translated into physical memory addresses through page tables. To support various architectures and page sizes, Linux uses a three-level paging mechanism. It provides the following three page table types:

Page Global Directory (PGD), Page Middle Directory (PMD), and Page Table (PTE).

Address translation provides a way to separate a process's virtual address space from its physical address space. Each virtual memory page can be marked as "present" or "absent" in main memory. If a process accesses a virtual address that is not present, the hardware generates a page fault, which the kernel handles by bringing the page into main memory. During this process, the system may need to replace an existing page to make room for the new one.
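To illustrate how a three-level translation splits a virtual address, here is a user-space sketch with purely illustrative bit widths (they do not correspond to any real architecture): the address is decomposed into PGD, PMD and PTE indices plus a page offset, and each index selects an entry in the corresponding table.

#include <stdint.h>
#include <stdio.h>

/* Illustrative layout only: 6-bit PGD index, 6-bit PMD index,
 * 8-bit PTE index and a 12-bit offset into a 4 KB page. */
#define PAGE_SHIFT 12
#define PTE_BITS   8
#define PMD_BITS   6

static void decompose(uint32_t vaddr)
{
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
    uint32_t pte = (vaddr >> PAGE_SHIFT) & ((1u << PTE_BITS) - 1);
    uint32_t pmd = (vaddr >> (PAGE_SHIFT + PTE_BITS)) & ((1u << PMD_BITS) - 1);
    uint32_t pgd = vaddr >> (PAGE_SHIFT + PTE_BITS + PMD_BITS);

    printf("pgd=%u pmd=%u pte=%u offset=%u\n", pgd, pmd, pte, offset);
}

int main(void)
{
    decompose(0xC0ABCDEFu);
    return 0;
}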

The replacement strategy is one of the most critical aspects of a paging system. The Linux 2.6 release fixed problems that earlier Linux releases had with page selection and replacement.

swap mechanism

Swapping is the process of moving an entire process into or out of secondary storage when main memory is running low. Because of the high overhead of such switching, many modern operating systems, including Linux, do not use this approach but rely on paging instead: in Linux, swapping is performed at the page level rather than the process level. The main advantage of swapping is that it expands the address space available to a process. When the kernel needs to free memory to make room for new pages, some less-used or unused pages may need to be discarded. Pages that have no backing on disk cannot simply be freed; they must be copied to the backing store (swap) and read back from it when they are needed again. The main disadvantage of the swap mechanism is that it is slow: disk reads and writes are usually very slow, so swapping should be avoided as much as possible.

3) Process mechanism

Processes, tasks and kernel threads

A task is just a "general description of the work that needs to be done", which can be a light-weight thread or a full-fledged process.

A thread is the lightest task instance. The cost of creating a thread in the kernel can be high or low, depending on the characteristics the thread needs to have. In the simplest case, a thread shares all resources with its parent thread, including code, data, and many internal data structures, with only a slight difference in distinguishing that thread from other threads.

A process in Linux is a "heavyweight" data structure. If necessary, multiple threads can run in a single process (and share some resources of that process). In Linux, a process is just a thread with full weight characteristics. Threads and processes are scheduled in the same way by the scheduler.

A kernel thread is a thread that always runs in kernel mode and has no user context. Kernel threads usually exist for a specific function and are easy to handle from within the kernel. They typically have the desired properties of being schedulable like any other process, and of providing other processes with a target thread that implements a given piece of functionality (for example, reachable by sending it a signal) when they need it.
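A small sketch of creating such a kernel thread (the thread name and work loop are hypothetical): kthread_run() starts a function that runs entirely in kernel mode and is scheduled like any other task.

#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
    while (!kthread_should_stop()) {
        /* ... periodic housekeeping would go here ... */
        msleep(1000);
    }
    return 0;
}

static int start_worker(void)
{
    worker = kthread_run(worker_fn, NULL, "my_worker");
    return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void stop_worker(void)
{
    kthread_stop(worker);   /* wakes the thread and waits for it to exit */
}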

Scheduling and context switching

Process scheduling is the science (some call it the art) of ensuring that each process gets a fair share of the CPU. There is always disagreement about the definition of "fairness" because schedulers often make choices based on information that is not obvious and visible.

It should be noted that many Linux users believe that a scheduler which is mostly correct all of the time is more important than one which is completely correct most of the time; in other words, a process that runs slowly is preferable to a process that stops running because of a policy decision or an error. Linux's current scheduler follows this principle.

When a process stops running and is replaced by another process, it is called a context switch. Typically, this operation is expensive, and kernel programmers and application programmers always try to minimize the number of context switches the system performs. A process can actively stop running because it is waiting for some event or resource, or passively give up because the system decides that the CPU should be allocated to another process. For the first case, the CPU may actually go into an idle state if there are no other processes waiting to execute. In the second case, the process is either replaced by another waiting process, or assigned a new runtime slice or time period to continue execution.

Even while a process is being scheduled and executed in an orderly fashion, it can be interrupted by other higher priority tasks. For example, if the disk is ready with data for a disk read, it sends a signal to the CPU and expects the CPU to fetch the data from the disk. The kernel must handle this situation in a timely manner, otherwise it will reduce the transfer rate of the disk. Signals, interrupts, and exceptions are different asynchronous events, but are similar in many ways, and they must all be handled quickly even if the CPU is already busy.

For example, a disk with data ready can raise an interrupt. The kernel calls the interrupt handler for that particular device, interrupting the currently running process and using many of its resources. When the interrupt handler finishes, the interrupted process resumes execution. This effectively steals CPU time from the running process, because current kernel versions only measure the time that has elapsed since the process got the CPU, ignoring the fact that interrupts consume some of that precious time.

Interrupt handlers are usually fast and compact, so they can be processed and cleared quickly to make room for subsequent data. But sometimes an interrupt requires more work than can reasonably be done in the handler in a short time. Interrupts also need a well-defined environment to do their work (remember, an interrupt borrows the resources of some random process). In such cases, the handler gathers enough information and defers the work to a bottom-half handler, which is scheduled to run at a later time. Although the bottom-half mechanism was widely used in earlier Linux versions, its use is discouraged in current kernels.

4) Linux platform driver mechanism

Compared with the traditional device_driver mechanism, an obvious advantage of the Linux platform driver mechanism is that the platform mechanism registers device resources with the kernel, which manages them uniformly; drivers then request and use these resources through the standard interfaces provided by platform_device. This improves the independence of drivers from resource management and gives better portability and safety. Below is a schematic diagram of the SPI driver hierarchy; the SPI bus in Linux can be understood as the bus derived from the SPI controller:

Like traditional drivers, the platform mechanism is also divided into three steps:

1. Bus registration stage:

During kernel startup initialization, kernel_init()→do_basic_setup()→driver_init()→platform_bus_init()→bus_register(&platform_bus_type) in main.c registers a platform bus (the virtual bus platform_bus).

2. Add equipment stage:

When a device is registered, platform_device_register()→platform_device_add()→(pdev→dev.bus = &platform_bus_type)→device_add() simply hangs the device on the virtual bus.

3. Driver registration stage:

platform_driver_register()→driver_register()→bus_add_driver()→driver_attach()→bus_for_each_dev() runs __driver_attach()→driver_probe_device() for each device hanging on the virtual platform bus and checks whether drv→bus→match() succeeds; at this point it calls platform_match()→strncmp(pdev→name, drv→name, BUS_ID_SIZE) through the function pointer. If the names match, really_probe() is called (which actually executes the corresponding device's platform_driver→probe(platform_device)) to start the real probing; if the probe succeeds, the device is bound to the driver.

As can be seen from the above, the platform mechanism finally calls the three key functions of bus_register(), device_add(), and driver_register().

Here are a few structures:

/* defined in include/linux/platform_device.h */
struct platform_device {
    const char *name;
    int id;
    struct device dev;
    u32 num_resources;
    struct resource *resource;
};

The platform_device structure describes a device in the platform framework. It contains the generic device structure struct device dev, the device's resources struct resource *resource, and the device's name const char *name. (Note that this name must be the same as platform_driver.driver->name later; the reason is explained below.)

The most important thing in this structure is the resource structure, which is why the platform mechanism is introduced.

/* defined in include/linux/ioport.h */
struct resource {
    resource_size_t start;
    resource_size_t end;
    const char *name;
    unsigned long flags;
    struct resource *parent, *sibling, *child;
};

The flags field indicates the type of the resource, and start and end give its start and end addresses. The platform_driver structure (defined in include/linux/platform_device.h) is as follows:

struct platform_driver 
{ 
int (*probe)(struct platform_device *); 
int (*remove)(struct platform_device *); 
void (*shutdown)(struct platform_device *); 
int (*suspend)(struct platform_device *, pm_message_t state); 
int (*suspend_late)(struct platform_device *, pm_message_t state); 
int (*resume_early)(struct platform_device *); 
int (*resume)(struct platform_device *); 
struct device_driver driver;
};

The platform_driver structure describes a driver in the platform framework. Besides a set of function pointers, it contains the generic device_driver structure.

Reasons for having the same name:

As mentioned above, when a driver is registered it calls bus_for_each_dev(), which runs __driver_attach()→driver_probe_device() for each device hanging on the virtual platform bus. In this function, dev and drv are first matched by calling the function pointed to by drv->bus->match. In platform_driver_register(), drv->driver.bus = &platform_bus_type, so drv->bus->match is platform_bus_type→match, which is the platform_match function. The function is as follows:

static int platform_match(struct device * dev, struct device_driver * drv) 
{ 
struct platform_device *pdev = container_of(dev, struct platform_device, dev);
return (strncmp(pdev->name, drv->name, BUS_ID_SIZE) == 0);
}

It simply compares the names of dev and drv. If they are the same, really_probe() is entered, and from there the driver's own probe function performs further matching. So dev→name and driver→drv→name must be set to the same string during initialization.

Different types of drivers have different match functions. The platform driver compares the names of dev and drv. Remember the match in the usb driver? It compares Product ID and Vendor ID.
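To tie the pieces together, here is a hedged sketch of a platform driver (the device name "mydev" is invented, and exact callback prototypes vary slightly between kernel versions): the driver name must match the platform device name for platform_match() to succeed, and probe() then picks up the resources registered with the device.

#include <linux/module.h>
#include <linux/platform_device.h>

static int mydev_probe(struct platform_device *pdev)
{
    struct resource *res = platform_get_resource(pdev, IORESOURCE_MEM, 0);

    if (!res)
        return -ENODEV;
    dev_info(&pdev->dev, "probed, registers at %pR\n", res);
    return 0;
}

static int mydev_remove(struct platform_device *pdev)
{
    return 0;
}

static struct platform_driver mydev_driver = {
    .probe  = mydev_probe,
    .remove = mydev_remove,
    .driver = {
        .name = "mydev",   /* must match the platform_device name */
    },
};

module_platform_driver(mydev_driver);
MODULE_LICENSE("GPL");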

The benefits of the Platform mechanism:

1. It provides a bus of type platform_bus_type, so that SoC devices which do not sit on a real bus can be attached to this virtual bus. In this way the bus-device-driver model can be applied universally.

2. It provides the platform_device and platform_driver data structures, which embed the traditional device and driver structures and add resource members, making it easier to integrate with newer bootloaders and kernels that pass device resources dynamically, such as Open Firmware.

2. Write your own operating system

Like other applications, the kernel is itself a program, except that this program operates the hardware directly. The kernel faces the hardware directly, calls hardware interfaces, and is developed against the instruction set provided by the hardware and CPU manufacturers. Developing applications on top of the kernel, via system calls or library calls, is much simpler.

To make it possible to write kernel-level code without working at too low a level, many built-in library files are available when the kernel is compiled.

The kernel is directly hardware-oriented, so it has great authority over the available resources, but it works in a limited address space. For Linux on a 32-bit system, within the linear address space the kernel considers only 1GB to be its own: although it ultimately controls all 4GB, it can use only 1GB for its own operation, while the remaining 3GB is left to applications (on Windows the split is 2GB each). The memory available when developing kernel code is therefore very limited, especially when writing drivers; we must understand that the available space is small, so the code needs to be efficient.

The architecture of the kernel is also very clear: from the hardware layer and hardware abstraction layer, through the kernel's basic modules (process scheduling, memory management, network protocol stack, etc.), up to the application layer, it is essentially the basic design of a system architecture that combines software and hardware. For example, IoT systems (from small embedded systems such as single-chip microcomputers and MCUs, to smart homes, smart communities and even smart cities) can use it as a reference for their access-end devices.

Linux originally ran on PCs, and the x86 processors it used were relatively powerful, with fairly complete instruction sets and modes. For example, the user mode and kernel mode we see here are not available on typical small embedded processors. Their advantage is that, by giving different permissions to code and data segments, the code and data in kernel mode (including hardware resources) are protected and must be accessed through something like a system call (syscall), which ensures the stability of the kernel.

The process of writing an operating system:

Imagine if you were required to write an operating system, what factors would you need to consider?

Process management: how to allocate CPU time slices according to the scheduling algorithm in a multitasking system.

Memory management: how to map virtual memory and physical memory, allocate and reclaim memory.

File system: how to organize the sectors of the hard disk into a file system to realize the operation of reading and writing files.

Device Management: How to address, access, read, and write device configuration information and data.

Process management

Different operating systems call them processes or tasks. The process data structure in an operating system contains many elements, and the processes are often linked together in a linked list.

Process-related content mainly includes: virtual address space, priority, life cycle (blocking, ready, running, etc.), occupied resources (such as semaphores, files, etc.).

At every system tick interrupt, the CPU checks the processes in the ready queue (traversing the process structures in the linked list). If, according to the scheduling algorithm, a new process should run, the kernel saves the state of the currently running process (including its stack information, etc.), suspends it, and selects a new process to run. This is process scheduling.

Differences in process priority are the basic basis for CPU scheduling. The ultimate goal of scheduling is to let high-priority tasks obtain CPU resources immediately (instant response) while low-priority tasks still get a fair share of the CPU. Because the process context has to be saved, switching itself has a cost, so the scheduling algorithm must also keep the switching frequency efficient.

Early Linux mainly used the time-slice round-robin algorithm (Round-Robin): the kernel selected a high-priority process from the ready queue, and each process ran for an equal amount of time. The algorithm is simple and intuitive, but it could still leave some low-priority processes unscheduled for a long time. To improve fairness, the Completely Fair Scheduler (CFS) was introduced after Linux 2.6.23.

The CPU can only run one program at any moment. When a user watches videos in the Youku app while typing and chatting in WeChat, Youku and WeChat are two different programs; why do they appear to run at the same time? The goal of CFS is to make all programs appear to run simultaneously at the same speed, as if on multiple parallel CPUs: with nr_running runnable processes, each runs concurrently at 1/nr_running of the CPU's speed. For example, if there are 2 runnable tasks, each runs concurrently at 50% of the CPU's physical capacity.

CFS introduces the concept of "virtual runtime", which is represented by p->se.vruntime (nanosec-unit), by which it records and measures the "CPU time" that a task should get. In an ideal scheduling situation, all tasks should have the same value of p->se.vruntime at all times (the ones mentioned above run at the same speed). Because each task is executed concurrently, no task will exceed the ideal CPU time it should occupy. The logic of CFS selecting tasks to run is based on the value of p->se.vruntime, which is very simple: it always picks the task with the smallest value of p->se.vruntime to run (the least scheduled task).

CFS uses a time-based red-black tree to schedule the execution of future tasks. All tasks are sorted by the p->se.vruntime keyword. CFS selects the leftmost task to execute from the tree. As the system runs, executed tasks are placed on the right side of the tree, progressively giving each task a chance to be the leftmost task, thereby gaining CPU resources for a determinable amount of time.

To sum up, CFS first runs a task; when the task is switched out (or when a tick interrupt occurs), the CPU time it used is added to p->se.vruntime. When its p->se.vruntime grows to the point where another task becomes the leftmost task of the red-black tree (a small granularity gap is kept between the new leftmost task and the old one, to prevent over-frequent switching from hurting performance), the new leftmost task is selected to run and the current task is preempted.
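As a greatly simplified illustration of that idea (this is not the kernel's code: the real scheduler keeps tasks in a red-black tree and weights the charge by priority), the scheduler always picks the task with the smallest vruntime and charges it for the time it ran:

#include <stddef.h>

struct toy_task {
    const char *name;
    unsigned long long vruntime;   /* "virtual" CPU time consumed, in ns */
};

/* Pick the task that has run the least so far (the "leftmost" one). */
static struct toy_task *pick_next(struct toy_task *tasks, size_t n)
{
    struct toy_task *leftmost = &tasks[0];
    size_t i;

    for (i = 1; i < n; i++)
        if (tasks[i].vruntime < leftmost->vruntime)
            leftmost = &tasks[i];
    return leftmost;
}

/* Charge the task for the CPU time it just used. */
static void account(struct toy_task *t, unsigned long long delta_ns)
{
    t->vruntime += delta_ns;   /* the more it ran, the later its next turn */
}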

CFS red-black tree

In general, the scheduler handles individual tasks and tries to give each task a fair amount of CPU time. At some point, it may be desirable to group tasks and give each group a fair share of CPU time. For example, the system may allocate an average CPU time to each user, and then allocate an average CPU time to each task for each user.

memory management

Memory is a storage device external to the CPU. The system needs to address a memory region, locate the corresponding memory cells, and read or write the data in them.

A memory region is addressed through pointers, and the word length of the CPU (32-bit or 64-bit) determines the largest addressable address space. The maximum address space on a 32-bit machine is 4GB; on a 64-bit machine it is theoretically 2^64 bytes.

The maximum address space has nothing to do with how much physical memory the system actually has, which is why it is called the virtual address space. To every process in the system, it appears as though it occupies this address space alone and cannot perceive the memory of other processes. In this way the operating system lets applications ignore other applications; each task behaves as if it were the only process running on the computer.

Linux divides the virtual address space into kernel space and user space. The virtual space of each user process ranges from 0 to TASK_SIZE; the area from TASK_SIZE to 2^32 or 2^64 is reserved for the kernel and cannot be accessed by user processes. TASK_SIZE is configurable; the default Linux configuration splits the space 3:1, with applications using 3GB and the kernel using 1GB. This division does not depend on the actual amount of RAM. On a 64-bit machine the virtual address space can be very large, but typically only 42 or 47 bits of it (2^42 or 2^47 bytes) are actually used.

virtual address space

In the vast majority of cases, the virtual address space is larger than the physical memory (RAM) available to the actual system, and the kernel and CPU must consider how to map the actual available physical memory into the virtual address space.

One way is to map virtual addresses to physical addresses through the Page Table. Virtual addresses are related to user & kernel addresses used by the process, and physical addresses are used to address the actual RAM used.

As shown in the figure below, the virtual address spaces of processes A and B are divided into equal-sized parts called pages. Physical memory is also divided into pages (page frames) of equal size.

Virtual and physical address space mapping

The first memory page of process A is mapped to the fourth page of physical memory (RAM); the first memory page of process B is mapped to the fifth page of physical memory. The fifth memory page of process A and the first memory page of process B are both mapped to the fifth page of physical memory (the kernel can decide which memory space is shared by different processes).

As the figure shows, not every page in the virtual address space is associated with a page frame. A page may be unused, its data may not have been loaded into physical memory yet (because it is not needed for the time being), or the physical page may have been swapped out to the hard disk and will be swapped back into memory when it is actually needed later.

The page table maps the virtual address space onto the physical address space. The simplest way to do this is to use an array that maps virtual pages one-to-one to physical pages, but such a table could consume a huge share of RAM by itself: assuming 4KB pages and a 4GB virtual address space, an array of about one million entries would be needed to hold the page table.

Because most areas of the virtual address space are not actually used, and those pages are never associated with page frames, introducing multilevel page tables greatly reduces the memory used by page tables while keeping lookups efficient. For a detailed description of multilevel tables, please refer to xxx.

Memory mapping is an important abstraction used in many places, in the kernel and in user applications. Mapping transfers data from a data source (which can also be a device's I/O region, for example) into the virtual memory space of a process. Operations on the mapped address space use the same methods as ordinary memory (directly reading and writing the addressed contents), and any change to that memory is automatically propagated back to the original data source. For example, by mapping the contents of a file into memory, you only need to read the memory to get the file's content, and writing to the memory modifies the file; the kernel ensures that any changes are automatically reflected in the file.
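A user-space sketch of file-backed memory mapping (the file name data.txt is hypothetical): after mmap(), reading the mapped memory returns the file's contents, and writes to the memory are reflected back into the file by the kernel.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.txt", O_RDWR);
    struct stat st;
    char *p;

    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    /* Map the whole file into this process's virtual address space. */
    p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    p[0] = '#';                /* writing the memory modifies the file */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}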

In addition, when implementing device drivers in the kernel, the input and output regions of peripherals (external devices) can be mapped into the virtual address space; reads and writes to these regions are redirected to the device by the system, which greatly simplifies driver implementation.

The kernel has to keep track of which physical pages have been allocated and which are still free, to avoid two processes using the same region of RAM. Allocating and freeing memory are very frequent tasks, so the kernel must do them as fast as possible. The kernel allocates only whole page frames; it leaves the task of dividing memory into smaller pieces to user space, where program libraries split the page frames received from the kernel into smaller regions and hand them out to the process.

virtual file system

Unix systems are built on some insightful ideas, a very important metaphor is:

Everything is a file.

That is, almost all resources of the system can be regarded as files. In order to support different local file systems, the kernel includes a layer of virtual file system (Virtual File System) between the user process and the file system implementation. Most of the functions provided by the kernel can be accessed through the file interface defined by VFS (Virtual File System). For example, kernel subsystems: character and block devices, pipes, network sockets, interactive input and output terminals, etc.

In addition, the device files used to operate character and block devices are real files in the /dev directory. When the read and write operations are performed, their content will be dynamically created by the corresponding device driver.

VFS system

In the virtual file system, inodes are used to represent files and directories (to the system, a directory is just a special kind of file). An inode's elements fall into two categories: metadata, which describes the state of the file, such as read and write permissions; and the data segment, which holds the file's contents.

Each inode has a unique identifying number, and the association between a filename and an inode is based on that number. Take the kernel looking up /usr/bin/emacs as an example of how inodes form the directory structure of the file system. The search starts from the root inode, i.e. the root directory '/', which is represented by an inode. Its data segment contains no ordinary data, only the entries stored in the root directory. These entries may represent files or other directories, and each entry consists of two parts: the inode number of the next item, and the file or directory name.

First the data area of the root inode is scanned until an entry named 'usr' is found, in order to look up the inode of the subdirectory usr. The associated inode is found via the 'usr' entry's inode number. The steps are repeated to find the entry named 'bin', and then, in the inode corresponding to 'bin', the entry named 'emacs' is looked up; the inode returned this time represents a file rather than a directory. The content of this last inode differs from the previous ones: the first three each represent a directory, containing the list of its subdirectories and files, while the inode associated with the emacs file holds the actual file content in its data segment.

Although the steps to find a file in VFS are the same as described above, there are some differences in the details. For example, because frequently opening files is a slow operation, introducing a cache can speed up lookups.

Find a file through the inode mechanism:

 

device driver

Communication with peripherals is generally referred to as input and output, or I/O for short. The kernel's I/O layer for a peripheral must handle three tasks. First, the hardware must be addressed in ways that differ by device type. Second, the kernel must give user applications and system tools ways to operate the different devices, using a uniform mechanism that keeps the programming effort as small as possible and lets applications work regardless of the underlying hardware. Third, user space needs to know what devices exist in the kernel.

The hierarchical relationship with peripherals is as follows:

Device Communication Hierarchy Diagram

Most external devices are connected to the CPU through a bus, and a system usually has not just one bus but a collection of buses. Many PC designs include two PCI buses connected by a bridge. Some buses, such as USB, cannot act as the main bus and have to pass their data to the processor through a system bus. The diagram below shows how the different buses are connected to the system.

 

System bus topology

The system interacts with peripherals mainly in the following ways:

I/O ports: when I/O port communication is used, the kernel sends data through an I/O controller; each receiving device has a unique port number, and the controller forwards the data to the hardware attached to the system. A separate address space, managed by the processor, holds all I/O port addresses.

The I/O address space is not always associated with normal system memory, which is often difficult to understand considering that ports can be mapped into memory.

There are different types of ports. Some are read-only, some are write-only, and in general they can operate in both directions, and data can be exchanged in both directions between the processor and peripherals.

In the IA-32 architecture, the port address space contains 2^16 distinct 8-bit ports, uniquely identified by numbers from 0x0000 to 0xFFFF. Each port either has a device assigned to it or is idle and unused; multiple peripherals cannot share a port. In many cases 8-bit data exchange is not enough, so two consecutive 8-bit ports can be combined into a single 16-bit port, and two consecutive 16-bit ports can be treated as a 32-bit port. The processor performs the input and output operations through assembly instructions.

Different processor types implement port operations differently, so the kernel must provide a suitable abstraction layer, with helpers such as outb (write a byte), outw (write a word) and inb (read a byte) that can be used to operate on ports.

I/O memory mapping: many devices need to be addressable like RAM. The processor therefore allows the I/O registers of a peripheral to be mapped into memory, so that the device can be operated like ordinary memory. Graphics cards use such a mechanism, for example, and PCI devices are often addressed through mapped I/O addresses.

To implement memory mapping, the I/O ports must first be mapped into normal system memory (using processor-specific functions). Because implementations vary widely between platforms, the kernel provides an abstraction layer to map and unmap I/O regions.
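A hedged sketch of memory-mapped I/O from a driver's point of view (the physical base address, size and register offset are invented; in real code they would come from a struct resource supplied by the bus): after ioremap() the registers are accessed like memory, but always through the readl()/writel() helpers.

#include <linux/io.h>
#include <linux/kernel.h>

#define DEV_PHYS_BASE  0xfe000000UL   /* hypothetical register block */
#define DEV_REG_SIZE   0x1000
#define REG_STATUS     0x04

static void __iomem *regs;

static int map_device(void)
{
    regs = ioremap(DEV_PHYS_BASE, DEV_REG_SIZE);
    if (!regs)
        return -ENOMEM;

    pr_info("status=%#x\n", readl(regs + REG_STATUS));
    return 0;
}

static void unmap_device(void)
{
    iounmap(regs);
}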

Besides how to access the peripheral, when will the system know if the peripheral has data to access? There are two main ways: polling and interrupts.

Polling periodically accesses the query device to see if it has data ready, and if so, fetches the data. This method requires the processor to continuously access the device even when the device has no data, wasting CPU time slices.

The other way is interrupts: when the peripheral has finished something, it actively notifies the CPU. Interrupts have the highest priority and will interrupt the CPU's current process. Each CPU provides interrupt lines (which can be shared by different devices), each interrupt is identified by a unique interrupt number, and for each interrupt in use the kernel provides a service routine (ISR, Interrupt Service Routine, the handler the CPU calls after the interrupt occurs). Interrupts themselves can also have priorities.

Interrupts suspend normal system work. The peripheral fires an interrupt when data is ready to be used by the kernel or indirectly by an application. Using interrupts ensures that the system only notifies the processor when a peripheral needs the processor to intervene, effectively improving efficiency.
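A sketch of registering such an interrupt handler (the IRQ number and names are hypothetical): the handler does only the minimum work and defers the rest; IRQF_SHARED lets several devices share one line, in which case dev must be a unique, non-NULL cookie.

#include <linux/interrupt.h>

#define MYDEV_IRQ 42   /* hypothetical interrupt number */

static irqreturn_t mydev_isr(int irq, void *dev_id)
{
    /* Acknowledge the device and defer heavy work to a bottom half. */
    return IRQ_HANDLED;
}

static int mydev_setup_irq(void *dev)
{
    return request_irq(MYDEV_IRQ, mydev_isr, IRQF_SHARED, "mydev", dev);
}

static void mydev_teardown_irq(void *dev)
{
    free_irq(MYDEV_IRQ, dev);
}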

Controlling devices via the bus: Not all devices are addressed and operated directly via I/O statements, but in many cases via a bus system.

Not all device types can be attached to every bus system; for example, a hard disk can be attached to a SCSI interface but a graphics card cannot (graphics cards attach to the PCI bus), while an IDE hard disk reaches the PCI bus indirectly through the IDE interface.

Bus types can be divided into system bus and expansion bus. Implementation differences in hardware are not important to the core, only how the bus and its attached peripherals are addressed. For a system bus, such as the PCI bus, I/O statements and memory maps are used to communicate with the bus, as well as with the devices it is attached to. The kernel also provides some commands for device drivers to call bus functions, such as accessing the list of available devices, and reading and writing configuration information in a uniform format.

Expansion buses such as USB, SCSI exchange data and commands with attached devices via a clearly defined bus protocol. The kernel communicates with the bus through I/O statements or memory maps, and the bus communicates with attached devices through platform-independent functions.

Communication with a bus-attached device does not necessarily need to be done through a driver in kernel space, but in some cases can also be done in user space. A prime example is the SCSI Writer, addressed by the cdrecord tool. This tool generates the required SCSI commands, sends the commands to the corresponding device through the SCSI bus with the help of the kernel, and processes and responds to the information generated or returned by the device.

Block devices (block) and character devices (character) differ significantly in 3 ways:

Data in a block device can be manipulated at any point, while a character device cannot.

Block device data transfers always use fixed-size blocks. The device driver always gets a complete block from the device even when only one byte is requested. In contrast, character devices are capable of returning a single byte.

Reading and writing to a block device uses the cache. For read operations, data is cached in memory and can be revisited when needed. In terms of write operations, it will also be cached to delay writing to the device. Using a cache is not reasonable for character devices (eg keyboards), each read request must be reliably interacted with the device.

The concepts of block and sector: a block is a byte sequence of a specified size used to hold data transferred between the kernel and a device, and the block size is configurable. A sector is of fixed size and is the smallest unit of data a device can transfer. A block is a contiguous run of sectors, so the block size is an integer multiple of the sector size.

network

Linux's network subsystem provides a solid foundation for the development of the Internet. The network model is based on ISO's OSI model, as shown in the right half of the figure below. However, in specific applications, the corresponding layers are often combined to simplify the model. The left half of the figure below is the TCP/IP reference model used by Linux. (Because there is a lot of information about the Linux network part, in this article only a brief introduction to the big level is given, and no explanation is given.)

The host-to-network layer (physical layer and data link layer) is responsible for transferring data from one computer to another. This layer handles the electrical and coding properties of the physical transmission medium and splits the data stream into frames of fixed size for transmission. If several computers share one transmission route, each network adapter (network card, etc.) must have a unique ID (the MAC address) to distinguish it. From the kernel's point of view, this layer is implemented by the network card's device driver.

The network layer of the OSI model is called the network layer in the TCP/IP model. The network layer enables the computers in the network to exchange data, and these computers are not necessarily directly connected.

If there is no direct connection physically, there is no direct data exchange. The task of the network layer is to find routes for communication between machines in the network.

network connected computer

The network layer is also responsible for splitting the packets to be transmitted into a specified size, because the maximum packet size supported by each computer along the transmission path may differ: the data stream is split into packets during transmission and reassembled at the receiving end.

The network layer assigns unique network addresses to computers on the network so that they can communicate with each other (as opposed to hardware MAC addresses, since networks are often made up of subnetworks). In the Internet, the network layer consists of IP networks, and there are V4 and V6 versions.

The task of the transport layer is to regulate the transfer of data between applications running on two connected computers, for example a client and a server program communicating over a TCP or UDP connection. Communicating applications are identified by port numbers: port 80 is used by a web server, so a browser client must send its requests to this port to get the required data, and the client also needs a unique port number of its own so that the web server can send replies back to it.

This layer is also responsible for providing a reliable connection (in the case of TCP) for data transmission.

The application layer in the TCP/IP model is included in the OSI model (session layer, presentation layer, application layer). When a communication connection is established between two applications, this layer is responsible for the actual content transmission. For example, the protocol and data transmitted between the web server and its client are different from those between the mail server and its client.

Most network protocols are defined in RFC (Request for Comments).

Network implementation layered model: The implementation of the network layer by the kernel is similar to the TCP/IP reference model. It is implemented by C code, and each layer can only communicate with its upper and lower layers. The advantage of this is that different protocols and transmission mechanisms can be combined. As shown below:

4. Detailed explanation of Linux kernel modules

The kernel isn't magic, but it's essential to any functioning computer. The Linux kernel differs from OS X and Windows in that it contains kernel-level drivers and makes many things work "out of the box".

What if Windows already had all available drivers installed and you just needed to enable the ones you need? This is essentially what kernel modules do for Linux. Kernel modules, also known as loadable kernel modules (LKMs), are essential for keeping the kernel working with all hardware without consuming all available memory.

Modules typically add functionality such as device support, filesystems, and system calls to the base kernel. The file extension of an LKM is .ko, and modules are usually stored under /lib/modules. Thanks to their nature, you can easily customize the kernel by choosing which modules to build or load at boot time with the menuconfig command, by editing the /boot/config file, or by loading and unloading modules dynamically with the modprobe command.

Third-party and closed-source modules are available in some distributions, such as Ubuntu, but may not be installed by default because their source code is not available. The developers of that software (nVidia, ATI, and so on) do not provide source code; instead they build their own modules and distribute the compiled .ko files. While these modules are free as in beer, they are not free as in speech, and are therefore not included in some distributions because the maintainers believe that providing non-free software "pollutes" the kernel.

Advantages of using modules:

1. Make the kernel more compact and flexible
2. When modifying the kernel, it is not necessary to recompile the entire kernel, which can save a lot of time and avoid manual errors. If you need to use a new module in the system, you only need to compile the corresponding module and then insert the module using a specific user space program.
3. Modules may not depend on a fixed hardware platform.
4. Once the module's object code is linked into the kernel, its function is exactly the same as the statically linked kernel object code. Therefore, no explicit message passing is required when calling a function of a module.

However, the introduction of kernel modules also brings certain problems:

1. Since the memory occupied by the kernel will not be swapped out, the modules linked into the kernel will bring certain performance and memory utilization losses to the entire system.
2. The module loaded into the kernel becomes a part of the kernel and can modify other parts of the kernel. Therefore, improper use of the module will cause the system to crash.
3. In order for a kernel module to access all kernel resources, the kernel must maintain a symbol table and modify the symbol table when modules are loaded and unloaded.
4. Modules will require the use of functions of other modules, so the kernel needs to maintain dependencies between modules.

Modules run in the same address space as the kernel, and module programming is kernel programming in a sense. But modules are not available everywhere in the kernel. Modules are generally used in device drivers, file systems, etc., but for extremely important places in the Linux kernel, such as process management and memory management, it is still difficult to achieve through modules, and usually the kernel must be modified directly.

In the Linux kernel source program, the functions that are often implemented by kernel modules include file systems, SCSI advanced drivers, most SCSI drivers, most CD-ROM drivers, Ethernet drivers and so on.
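For reference, a minimal loadable module looks roughly like this (a sketch along the lines of the helloworld example referred to later): once built into hello.ko it can be inserted with insmod and removed with rmmod.

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

static int __init hello_init(void)
{
    pr_info("hello: module loaded\n");
    return 0;                      /* a nonzero return makes insmod fail */
}

static void __exit hello_exit(void)
{
    pr_info("hello: module unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal loadable kernel module");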

1. Compile and install the Linux kernel

Components of the Linux kernel:

  • kernel: kernel core, usually bzImage, usually in the /boot directory

    vmlinuz-VERSION-RELEASE
  • kernel object: kernel object, generally placed in

    /lib/modules/VERSION-RELEASE/
  • Auxiliary file: ramdisk

    initrd-VERSION-RELEASE.img: used by CentOS 5 and earlier
    initramfs-VERSION-RELEASE.img: used from CentOS 6 onward

Check the kernel version:

uname -r
-r  show VERSION-RELEASE
-n  print the network node hostname
-a  print all information

kernel module commands

System calls are of course a viable way to insert kernel modules into the kernel, but they are too low-level. In the Linux environment there are two other ways to do this. One is slightly more automatic: modules can be loaded automatically when needed and unloaded when no longer needed. This method relies on the modprobe program.

The other is to use the insmod command to load a kernel module manually. In the earlier analysis of the helloworld example, we mentioned that the role of insmod is to insert the module into the kernel in object-code form. Note that only the superuser can use this command.

Most of the system calls provided by the Linux kernel module mechanism are used by the modutils programs. It can be said that the Linux kernel module mechanism together with modutils forms the module programming interface. modutils (modutils-x.y.z.tar.gz) can be obtained wherever the kernel source code is obtained; choose the highest patchlevel x.y.z equal to or less than the current kernel version. After installation, the /sbin directory will contain insmod, rmmod, ksyms, lsmod, modprobe and other utilities. Of course, by the time the Linux kernel is running, modutils is normally already installed.

lsmod command:

  • show kernel modules that have been loaded by the kernel
  • The displayed content comes from: /proc/modules file

In fact, the function of this program is to read the information in the file /proc/modules in the /proc file system. So this command is equivalent to cat /proc/modules. Its format is:

[root@centos8 ~]#lsmod 
Module                 Size Used by
uas                    28672  0
usb_storage            73728  1 uas
nls_utf8               16384  0
isofs                  45056  0 # shows: name, size, use count, and the modules that depend on it

ksyms command:

Displays information about kernel symbols and module symbol tables, and can read the /proc/kallsyms file.

modinfo command:

Function: display detailed information about kernel modules

Configuration file:

/etc/modprobe.conf, /etc/modprobe.d/*.conf
  • Displays detailed description information of the module
modinfo [ -k kernel ]  [ modulename|filename... ]

Common options:

-n: show only the path of the module file
-p: show the module parameters
-a: show the author
-d: show the description

Case:

lsmod |grep xfs 
modinfo  xfs

insmod command:

Takes a module file as its argument and does not automatically resolve dependent modules; a simple program to insert a module into the Linux kernel.

Syntax:

insmod [ filename ]  [ module options... ]

Case:

insmod `modinfo -n exportfs`

insmod `modinfo -n xfs`

insmod is actually a modutils module utility. When we use this command as a superuser, this program completes the following series of tasks:

1. Read the name of the module to be linked from the command line; this is usually an ELF object file with the ".ko" extension.
2. Determine the location of the file containing the module's object code, usually under a subdirectory of /lib/modules.
3. Calculate the memory size required to store the module code, the module name, and the module object.
4. Allocate a memory area in user space and copy into it the module object, the module name, and the module code relocated for the running kernel. The init field of the module object is set to the relocated address of the module's entry function, and the exit field to the relocated address of its exit function.
5. Call init_module() and pass it the address of the user-mode memory area created above. We have analyzed the implementation process in detail.
6. Release the user mode memory, and the whole process ends.

modprobe command:

  • Adding and Removing Modules in the Linux Kernel
modprobe [ -C config-file ] [ modulename ] [ module parameters... ]
modprobe [ -r ] modulename...

Common options:

-C: use the specified configuration file
-r: remove the module

usage: 

Load:   modprobe MODULE_NAME
Unload: modprobe -r MODULE_NAME    # the rmmod command also unloads modules

modprobe is a program provided by modutils that automatically inserts modules based on dependencies between modules. The module loading method of on-demand loading mentioned earlier will call this program to realize the function of on-demand loading. For example, if module A depends on module B, and module B is not loaded into the kernel, when the system requests to load module A, the modprobe program will automatically load module B into the kernel.

Similar to insmod, the modprobe program also links a module specified on the command line, but it can also recursively link other modules referenced by the specified module. In terms of implementation, modprobe just checks the module dependencies, and the real loading work is still implemented by insmod. So, how does it know the dependencies between modules? Simply put, modprobe learns about this dependency through another modutils program, depmod. And depmod finds all modules in the kernel and writes the dependencies between them into a file named modules.dep in the /lib/modules/2.6.15-1.2054_FC5 directory.

kmod command:

In earlier kernel versions, automatic module loading was implemented by a user-space process, kerneld. The kernel communicated with kerneld through IPC, sending it the information about the module to be loaded, and kerneld then called the modprobe program to load the module. In recent kernels another mechanism, kmod, is used instead. The biggest difference from kerneld is that kmod runs in kernel space and can invoke modprobe directly from there, which greatly simplifies the whole process.

depmod command:

A tool for generating the kernel module dependency file and the system symbol map files; it generates modules.dep and the map files.

rmmod command:

Uninstall module, a simple program to remove a module from the Linux kernel.

The rmmod program will remove the module that has been inserted into the kernel from the kernel, and rmmod will automatically run the exit function defined by the kernel module itself. Its format is:

rmmod xfs
rmmod exportfs

Of course, it is ultimately implemented via the delete_module() system call. 

compile the kernel

Compile and install the kernel preparation:

(1) Prepare the development environment; 

(2) Obtain the relevant information of the hardware device on the target host;

(3) Obtain relevant information about the function of the target host system, for example: the corresponding file system needs to be enabled;

(4) Obtain the kernel source code package, www.kernel.org;

Compile preparation

Information about the target host hardware device

CPU:

cat /proc/cpuinfo
x86info -a
lscpu

PCI devices: lspci -v , -vv:

[root@centos8 ~]#lspci
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:07.7 System peripheral: VMware Virtual Machine Communication Interface (rev 10)
00:0f.0 VGA compatible controller: VMware SVGA II Adapter
00:10.0 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 01)
00:11.0 PCI bridge: VMware PCI bridge (rev 02)
00:15.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.1 PCI bridge: VMware PCI Express Root Port (rev 01)

USB device: lsusb -v, -vv:

[root@centos8 ~]#dnf install usbutils -y
[root@centos8 ~]#lsusb
Bus 001 Device 004: ID 0951:1666 Kingston Technology DataTraveler 100 G3/G4/SE9 G2
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub
Bus 002 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
[root@centos8 ~]#lsmod |grep usb
usb_storage            73728  1 uas

lsblk block device

All hardware device information: hal-device: CentOS 6

Development environment related packages

gcc make ncurses-devel flex bison openssl-devel elfutils-libelf-devel
  • Download source files
  • Prepare the text configuration file .config (for example, copied from /boot/config-VERSION-RELEASE)
  • make menuconfig: configure kernel options
  • make [-j #] or do it in two steps:
    make -j # bzImage
    make -j # modules
  • Install modules: make modules_install
  • Install kernel related files: make install
    • Install bzImage as /boot/vmlinuz-VERSION-RELEASE
    • Generate initramfs file
    • Edit grub's configuration file

A worked case of compiling and installing the kernel:

[root@centos7 ~]#yum -y install gcc gcc-c++ make ncurses-devel flex bison openssl-devel elfutils-libelf-devel
[root@centos7 ~]#tar xvf linux-5.15.51.tar.xz -C /usr/local/src
[root@centos7 ~]#cd /usr/local/src
[root@centos7 src]#ls
linux-5.15.51
[root@centos7 src]#du -sh *
1.2G	linux-5.15.51
[root@centos7 src]#cd linux-5.15.51/
[root@centos7 linux-5.15.51]#ls
arch   COPYING  Documentation  include  Kbuild   lib          Makefile  README   security  usr
block  CREDITS  drivers        init     Kconfig  LICENSES     mm        samples  sound     virt
certs  crypto   fs             ipc      kernel   MAINTAINERS  net       scripts  tools
[root@centos7 linux-5.15.51]#cp /boot/config-3.10.0-1160.el7.x86_64 .config
[root@centos7 linux-5.15.51]#vim .config
# modify the following three lines
#CONFIG_MODULE_SIG=y    # comment out this line
CONFIG_SYSTEM_TRUSTED_KEYRING=""    # change this line
#CONFIG_DEBUG_INFO=y    # must be commented out since linux-5.8.5

# upgrade gcc; the required dependency packages can be downloaded from the Tsinghua mirror site
#https://mirrors.tuna.tsinghua.edu.cn/gnu/gcc/gcc-9.1.0/
#https://mirrors.tuna.tsinghua.edu.cn/gnu/gmp/
#https://mirrors.tuna.tsinghua.edu.cn/gnu/mpc/
#https://mirrors.tuna.tsinghua.edu.cn/gnu/mpfr/

[root@centos7 linux-5.15.51]#cd ..
[root@centos7 src]#tar xvf gcc-9.1.0.tar.gz
[root@centos7 src]#tar xvf gmp-6.1.2.tar.bz2 -C gcc-9.1.0/
[root@centos7 src]#cd gcc-9.1.0/
[root@centos7 gcc-9.1.0]#mv gmp-6.1.2 gmp
[root@centos7 gcc-9.1.0]#cd ..
[root@centos7 src]#tar xvf mpc-1.1.0.tar.gz -C gcc-9.1.0/
[root@centos7 src]#cd gcc-9.1.0/
[root@centos7 gcc-9.1.0]#mv mpc-1.1.0 mpc
[root@centos7 gcc-9.1.0]#cd ..
[root@centos7 src]#tar xvf mpfr-4.0.2.tar.gz -C gcc-9.1.0/
[root@centos7 src]#cd gcc-9.1.0/
[root@centos7 gcc-9.1.0]#mv mpfr-4.0.2 mpfr

# build and install gcc
[root@centos7 gcc-9.1.0]#./configure --prefix=/usr/local/ --enable-checking=release --disable-multilib --enable-languages=c,c++ --enable-bootstrap
[root@centos7 gcc-9.1.0]#make -j 2  # use more CPU cores if you have them, otherwise the build is very slow
[root@centos7 gcc-9.1.0]#make install

[root@centos7 gcc-9.1.0]#cd ..
[root@centos7 src]#cd linux-5.15.51/
[root@centos7 linux-5.15.51]#make help
[root@centos7 linux-5.15.51]#make menuconfig

Enter [General Settings] and press Enter:

Add the kernel version and press Enter:

Enter the custom kernel version and press Enter:

Press [Tab], select [Exit] to exit:

Select [File System] and press Enter: 

Select [NT File System] and press Enter: 

Select [NTFS file system], press [Spacebar], M means modular mode: 

Select [Support debugging and writing], press [Spacebar] to select, press [Tab], select [Exit], and press Enter to exit:

Press [Tab], select [Exit], and press Enter to exit:

Press [Tab], select [Exit], and press Enter to exit:

Save the configuration and press Enter: 

[root@centos7 linux-5.15.51]#grep -i ntfs .config
CONFIG_NTFS_FS=m
CONFIG_NTFS_DEBUG=y
CONFIG_NTFS_RW=y
# CONFIG_NTFS3_FS is not set

[root@centos7 linux-5.15.51]#make -j 2  # use more CPU cores if you have them, otherwise the build is very slow
[root@centos7 linux-5.15.51]#pwd
/usr/local/src/linux-5.15.51
[root@centos7 linux-5.15.51]#du -sh .
3.0G	.
[root@centos7 linux-5.15.51]#make modules_install
[root@centos7 linux-5.15.51]#ls /lib/modules
3.10.0-1160.el7.x86_64  5.15.51-150.el7.x86_64
[root@centos7 linux-5.15.51]#du -sh /lib/modules/*
45M	/lib/modules/3.10.0-1160.el7.x86_64
224M	/lib/modules/5.15.51-150.el7.x86_64
[root@centos7 linux-5.15.51]#make install
[root@centos7 linux-5.15.51]#ls /boot/
config-3.10.0-1160.el7.x86_64                            System.map
efi                                                      System.map-3.10.0-1160.el7.x86_64
grub                                                     System.map-5.15.51-150.el7.x86_64
grub2                                                    vmlinuz
initramfs-0-rescue-afe373e8a26e45c681032325645782c8.img  vmlinuz-0-rescue-afe373e8a26e45c681032325645782c8
initramfs-3.10.0-1160.el7.x86_64.img                     vmlinuz-3.10.0-1160.el7.x86_64
initramfs-5.15.51-150.el7.x86_64.img                     vmlinuz-5.15.51-150.el7.x86_64
symvers-3.10.0-1160.el7.x86_64.gz

Select the Linux5.15 kernel to start:

[root@centos7 linux-5.15.51]#reboot  
[root@centos7 ~]#uname -r
5.15.51-150.el7.x86_64

Kernel compilation instructions

Configure kernel options:

Support "update" mode for configuration, make help:

(a) make config: walk through every configurable kernel option on the command line
(b) make menuconfig: curses-based text menu interface
(c) make gconfig: GTK (GNOME)-based graphical interface
(d) make xconfig: Qt (KDE)-based graphical interface

Supports "fresh configuration" mode for configuration:

(a) make defconfig: start from the kernel's "default" configuration for the target platform
(b) make allyesconfig: answer "yes" to all options
(c) make allnoconfig: answer "no" to all options

compile the kernel

  • Full compilation:
make [-j #]
  • Compile part of the function of the kernel:
    (a) Only compile the relevant code in a subdirectory
cd /usr/src/linux
make dir/

(b) compile only one specific module

cd /usr/src/linux
make dir/file.ko

 Compile the driver for e1000 only:

make drivers/net/ethernet/intel/e1000/e1000.ko

cross compile the kernel

The target platform for compilation is not the same as the current platform:

make ARCH=arch_name

To get help using a specific target platform:

make ARCH=arch_name help

Recompilation requires prior cleanup operations:

make clean: removes most generated files but keeps the .config file, etc.
make mrproper: removes all generated files, the .config file and some backup files
make distclean: includes make mrproper and additionally removes patch leftovers and editor backup files

Uninstall the kernel:

1. Delete the unnecessary kernel source code in the /usr/src/linux/ directory;

2. Delete the unnecessary kernel library files in the /lib/modules/ directory;

3. Delete the kernel and kernel image files started in the /boot directory;

4. Update the grub configuration file to remove the unneeded kernel boot entries: grub2-mkconfig -o /boot/grub2/grub.cfg
On CentOS 8 you also need to delete /boot/loader/entries/5b85fc7444b240a992c42ce2a9f65db5-<new kernel version>.conf;

2. Implementation mechanism of Linux kernel module

Before diving into modules, it's worth reviewing the differences between kernel modules and our familiar applications.

The main point, we must be clear, kernel modules run in "kernel space", and applications run in "user space". Kernel space and user space are the two most basic concepts in operating systems. Maybe you don't know the difference between them, so let's review them together.

One of the functions of the operating system is to provide resource management for applications, so that all applications can use the hardware resources they need. However, the current norm is that hosts often have only one set of hardware resources; modern operating systems can take advantage of this set of hardware to support multi-user systems. In order to ensure that the kernel is not disturbed by the application program, the multi-user operating system implements authorized access to hardware resources, and the realization of this authorized access mechanism benefits from the realization of different operational protection levels inside the CPU. Take INTEL's CPU as an example. At any time, it always runs at one of the four privilege levels. If it needs to access the storage space of high privilege level, it must pass through a limited number of privilege gates. The Linux system is designed to take full advantage of this hardware feature, and it uses only two levels of protection (although the i386 series microprocessors provide a four-level mode).

In a Linux system, the kernel runs at the highest level. At this level, access to any device is possible. And applications run at the lowest level. At this level, the processor prohibits programs from direct access to hardware and unauthorized access to kernel space. Therefore, corresponding to the kernel program running at the highest level, the memory space where it is located is the kernel space. And corresponding to the application running at the lowest level, the memory space where it is located is the user space. Linux completes the conversion from user space to kernel space through system calls or interrupts. The kernel code that executes the system call runs in the context of the process, which completes the operation in the kernel space on behalf of the calling process, and can also access the data in the user address space of the process. But for interrupts, it does not exist in any process context, but is run by the kernel.

Well, now we can more specifically analyze the similarities and differences between kernel modules and applications. Let's look at Table 6-1.

A comparison of the way applications and kernel modules are programmed:

From this comparison, we see that the kernel module must tell the system "I'm coming" through the init_module() function and "I'm leaving" through the cleanup_module() function. This is the biggest feature of modules: they can be dynamically loaded and unloaded. insmod, which we will introduce in detail later, is the command in the modutils tool set that loads a module into the kernel. Because of the separate address spaces, kernel modules cannot freely use function libraries defined in user space, such as libc's printf(), the way applications can; modules can only use the more limited set of functions defined in kernel space, such as printk(). An application's source code can call functions it does not define itself, and the external references are simply resolved against the corresponding libraries at link time: printf() is declared in stdio.h and its linkable object code lives in libc. A kernel module, however, cannot use this print function and must use the printk() function defined in kernel space instead. printk() does not support floating-point output, and the amount of data it can output is limited by the kernel's available memory.

Another difficulty with kernel modules is that kernel failures are often fatal to the entire system or to the current process. During application development, segment faults do not cause any harm. We can use debuggers to easily track down to the wrong place. Therefore, special care must be taken in the process of kernel module programming.

Let's take a look at how the kernel module mechanism is implemented in detail.

kernel symbol table

First, let's understand the concept of the kernel symbol table. The kernel symbol table is a special table used to store those symbols and their corresponding addresses that all modules can access. Linking of modules is the process of inserting modules into the kernel. Any global symbols declared by a module become part of the kernel symbol table. Kernel modules obtain the addresses of symbols from kernel space according to the system symbol table to ensure correct operation in kernel space.

This is a public symbol table that we can read textually from the file /proc/kallsyms. The format for storing data in this file is as follows:

Memory Address Attribute Symbol Name [Module to which it belongs]

In module programming, you can use a symbol's name to look up its address in memory from this file and then access that memory directly to obtain kernel data. For symbols exported by a kernel module, the fourth column, "module to which it belongs", records the name of the module that exports the symbol; for symbols exported by the kernel itself, this column is empty.
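As a small illustration, the sketch below (user-space C, not part of the kernel or modutils sources; the symbol name "printk" is only an example) scans /proc/kallsyms for a symbol by name. Note that on recent kernels the addresses may be shown as 0 for unprivileged users.

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("/proc/kallsyms", "r");
    if (!fp) {
        perror("fopen");
        return 1;
    }

    unsigned long addr;
    char type;
    char name[256], mod[256];
    char line[512];

    /* each line: address  type  name  [module] */
    while (fgets(line, sizeof(line), fp)) {
        mod[0] = '\0';
        if (sscanf(line, "%lx %c %255s %255s", &addr, &type, name, mod) < 3)
            continue;
        if (strcmp(name, "printk") == 0) {
            printf("printk is at %lx (%s)\n", addr, mod[0] ? mod : "built-in");
            break;
        }
    }
    fclose(fp);
    return 0;
}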

The kernel symbol table is located in the _ksymtab part of the kernel code segment, and its start address and end address are specified by two symbols generated by the C compiler: __start___ksymtab and __stop___ksymtab.

module dependencies

The kernel symbol table records the symbols and corresponding addresses that all modules can access. After a kernel module is loaded, the symbols it declares will be recorded in this table, and these symbols may of course be referenced by other modules. This leads to the problem of module dependencies.

When a module A references a symbol exported by another module B, we say that module B is referenced by module A, or that module A is loaded on top of module B. If you want to link module A, you must link module B first. Otherwise, references to those symbols exported by module B cannot be linked into module A. This interrelationship between modules is called module dependencies.
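A hedged sketch of this relationship, using two hypothetical modules (the names mod_b.c, mod_a.c and b_add are made up for illustration): module B exports a symbol, module A uses it, so A depends on B.

/* ---- mod_b.c ---- */
#include <linux/module.h>
#include <linux/kernel.h>

int b_add(int x, int y)
{
    return x + y;
}
EXPORT_SYMBOL(b_add);              /* b_add enters the kernel symbol table */

static int __init b_init(void) { return 0; }
static void __exit b_exit(void) { }
module_init(b_init);
module_exit(b_exit);
MODULE_LICENSE("GPL");

/* ---- mod_a.c ---- */
#include <linux/module.h>
#include <linux/kernel.h>

extern int b_add(int x, int y);    /* resolved against B's exported symbol */

static int __init a_init(void)
{
    printk(KERN_INFO "b_add(2, 3) = %d\n", b_add(2, 3));
    return 0;
}
static void __exit a_exit(void) { }
module_init(a_init);
module_exit(a_exit);
MODULE_LICENSE("GPL");

If module A is loaded with modprobe, the modules.dep file generated by depmod ensures that B is loaded first; with plain insmod you have to load B yourself.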

Kernel code analysis

The source code of the kernel module mechanism was originally contributed by Richard Henderson; after 2002 it was rewritten by Rusty Russell, and newer Linux kernels adopt the latter implementation.

1) Data structure

The data structures related to modules are declared in include/linux/module.h. The first one to look at is struct module:

include/linux/module.h

232  struct module
233  {
234        enum module_state state;
235
236        /* Member of list of modules */
237        struct list_head list;
238
239        /* Unique handle for this module */
240        char name[MODULE_NAME_LEN];
241
242        /* Sysfs stuff. */
243        struct module_kobject mkobj;
244        struct module_param_attrs *param_attrs;
245        const char *version;
246        const char *srcversion;
247
248        /* Exported symbols */
249        const struct kernel_symbol *syms;
250        unsigned int num_syms;
251        const unsigned long *crcs;
252
253        /* GPL-only exported symbols. */
254        const struct kernel_symbol *gpl_syms;
255        unsigned int num_gpl_syms;
256        const unsigned long *gpl_crcs;
257
258        /* Exception table */
259        unsigned int num_exentries;
260        const struct exception_table_entry *extable;
261
262        /* Startup function. */
263        int (*init)(void);
264
265        /* If this is non-NULL, vfree after init() returns */
266        void *module_init;
267
268        /* Here is the actual code + data, vfree'd on unload. */
269        void *module_core;
270
271        /* Here are the sizes of the init and core sections */
272        unsigned long init_size, core_size;
273
274        /* The size of the executable code in each section.  */
275        unsigned long init_text_size, core_text_size;
276
277        /* Arch-specific module values */
278        struct mod_arch_specific arch;
279
280        /* Am I unsafe to unload? */
281        int unsafe;
282
283        /* Am I GPL-compatible */
284        int license_gplok;
285
286        /* Am I gpg signed */
287        int gpgsig_ok;
288
289  #ifdef CONFIG_MODULE_UNLOAD
290        /* Reference counts */
291        struct module_ref ref[NR_CPUS];
292
293        /* What modules depend on me? */
294        struct list_head modules_which_use_me;
295
296        /* Who is waiting for us to be unloaded */
297        struct task_struct *waiter;
298
299        /* Destruction function. */
300        void (*exit)(void);
301  #endif
302
303  #ifdef CONFIG_KALLSYMS
304        /* We keep the symbol and string tables for kallsyms. */
305        Elf_Sym *symtab;
306        unsigned long num_symtab;
307        char *strtab;
308
309        /* Section attributes */
310        struct module_sect_attrs *sect_attrs;
311  #endif
312
313        /* Per-cpu data. */
314        void *percpu;
315
316        /* The command line arguments (may be mangled).  People like
317          keeping pointers to this stuff */
318        char *args;
319  };

In the kernel, each kernel module information is described by such a module object. All module objects are linked together by a list. The first element of the linked list is established by static LIST_HEAD(modules), see line 65 of kernel/module.c. If you read the LIST_HEAD macro definition in include/linux/list.h, you will quickly understand that the modules variable is a structure of type struct list_head, and the next pointer and prev pointer inside the structure point to the modules themselves when initialized. Operations on the modules linked list are protected by module_mutex and modlist_lock.

Here are some descriptions of some important fields in the module structure:

234      state indicates the current state of the module; the possible values are:

MODULE_STATE_LIVE

MODULE_STATE_COMING

MODULE_STATE_GOING

240      the name array holds the name of the module object.

244      param_attrs points to the names, and attributes, of the parameters that can be passed to the module.

248-251  the symbol table that this module makes available to the kernel and to other modules; num_syms is the number of kernel symbols defined by the module, and syms points to the symbol table.

263, 300 init and exit are two function pointers: init is called when the module is initialized, and exit is called when the module is removed.
294      struct list_head modules_which_use_me points to a list of the modules that depend on the current module.

After this introduction to the module{} data structure you may still feel unsure about it, because it references many concepts and related data structures that have not been explained yet.

For example kernel_symbol{} (see include/linux/module.h):

struct kernel_symbol
{
       unsigned long value;
       const char *name;
};

This structure is used to hold kernel symbols in object code. When compiling, the compiler writes the kernel symbols defined in the module into the file, and reads the symbol information contained in it through this data structure when reading the file and loading the module.

value defines the entry address of the kernel symbol;

name points to the name of the kernel symbol;
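To connect the two, here is a simplified illustration (not the exact kernel macro, which also handles CRCs, symbol prefixes and GPL-only variants) of roughly how EXPORT_SYMBOL(foo) could drop a kernel_symbol entry into the __ksymtab section, where the module loader later finds it; foo is a made-up symbol used only for illustration:

int foo(void)
{
        return 42;
}

/* roughly what EXPORT_SYMBOL(foo) expanded to in 2.6-era kernels */
static const char __kstrtab_foo[]
        __attribute__((section("__ksymtab_strings"))) = "foo";

static const struct kernel_symbol __ksymtab_foo
        __attribute__((used, section("__ksymtab"))) = {
        (unsigned long)&foo,
        __kstrtab_foo,
};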

Implementation functions

Next, we have to study several important functions in the source code. As mentioned in the previous paragraph, when the operating system is initialized, static LIST_HEAD(modules) has established an empty linked list. After that, each time a kernel module is loaded, a module structure is created and linked to the modules list.

We know that, from the point of view of the operating system kernel, services are provided to users only through the single interface of system calls. So what about services related to kernel modules? Looking at arch/i386/kernel/syscall_table.S in the 2.6.15 kernel, a kernel module is loaded through the init_module system call and unloaded through the delete_module system call; there is no other way. With that established, reading the code becomes easier.

kernel/module.c:

1931 asmlinkage long
1932 sys_init_module(void __user *umod,
1933              unsigned long len,
1934              const char __user *uargs)
1935 {
1936       struct module *mod;
1937       int ret = 0;
1938
1939       /* Must have permission */
1940       if (!capable(CAP_SYS_MODULE))
1941             return -EPERM;
1942
1943       /* Only one module load at a time, please */
1944       if (down_interruptible(&module_mutex) != 0)
1945              return -EINTR;
1946
1947       /* Do all the hard work */
1948       mod = load_module(umod, len, uargs);
1949       if (IS_ERR(mod)) {
1950             up(&module_mutex);
1951             return PTR_ERR(mod);
1952       }
1953
1954       /* Now sew it into the lists.  They won't access us, since
1955         strong_try_module_get() will fail. */
1956       stop_machine_run(__link_module, mod, NR_CPUS);
1957
1958       /* Drop lock so they can recurse */
1959       up(&module_mutex);
1960
1961       down(&notify_mutex);
1962       notifier_call_chain(&module_notify_list, MODULE_STATE_COMING, mod);
1963       up(&notify_mutex);
1964
1965       /* Start the module */
1966       if (mod->init != NULL)
1967             ret = mod->init();
1968       if (ret < 0) {
1969             /* Init routine failed: abort.  Try to protect us from
1970               buggy refcounters. */
1971             mod->state = MODULE_STATE_GOING;
1972             synchronize_sched();
1973             if (mod->unsafe)
1974                   printk(KERN_ERR "%s: module is now stuck!\n",
1975                         mod->name);
1976             else {
1977                   module_put(mod);
1978                   down(&module_mutex);
1979                   free_module(mod);
1980                   up(&module_mutex);
1981             }
1982             return ret;
1983       }
1984
1985       /* Now it's a first class citizen! */
1986       down(&module_mutex);
1987       mod->state = MODULE_STATE_LIVE;
1988       /* Drop initial reference. */
1989       module_put(mod);
1990       module_free(mod, mod->module_init);
1991       mod->module_init = NULL;
1992       mod->init_size = 0;
1993       mod->init_text_size = 0;
1994       up(&module_mutex);
1995
1996       return 0;
1997 }

The function sys_init_module() is the implementation of the system call init_module(). The entry parameter umod points to the location of the kernel module image in user space; the image is stored in the ELF executable file format, its first part is an elf_ehdr structure, and its length is given by len. uargs points to the arguments from user space. The prototype of the init_module() system call is:

long sys_init_module(void *umod, unsigned long len, const char *uargs);

Explanation:

1940-1941  capable() is called to check that the caller has permission to load kernel modules.

1944-1945  even in a concurrent environment, at most one module may be in the process of being loaded at any time; this is guaranteed by down_interruptible(&module_mutex).

1948-1952  load_module() reads the specified kernel module into kernel space. This includes allocating kernel memory, assembling the global symbol table, filling in variables such as __ksymtab, __ksymtab_gpl and __param, checking the module version, copying the user parameters, making sure the modules list does not already contain this module, setting the module state to MODULE_STATE_COMING, recording the license information, and so on.

1956       the kernel module is inserted at the front of the modules list, i.e. modules now points to this module's module structure.

1966-1983  the module's initialization function, the entry function described in Table 6-1, is executed.

1987       the module state is set to MODULE_STATE_LIVE; from this point on the module has been loaded successfully.
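To see the user-space side of this path, here is a minimal sketch (illustrative only, not the modutils source; most error handling is omitted) that reads a .ko image into memory and hands it to sys_init_module() through the raw system call:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s module.ko [params]\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(argv[1]);
        return 1;
    }

    /* read the whole ELF image into user-space memory ... */
    void *image = malloc(st.st_size);
    if (read(fd, image, st.st_size) != st.st_size) {
        perror("read");
        return 1;
    }
    close(fd);

    /* ... and hand it to the kernel; sys_init_module() does the rest */
    if (syscall(SYS_init_module, image, (unsigned long)st.st_size,
                argc > 2 ? argv[2] : "") != 0) {
        perror("init_module");
        return 1;
    }
    free(image);
    return 0;
}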

/kernel/module.c: 

573  asmlinkage long
574  sys_delete_module(const char __user *name_user, unsigned int flags)
575  {
576        struct module *mod;
577        char name[MODULE_NAME_LEN];
578        int ret, forced = 0;
579
580        if (!capable(CAP_SYS_MODULE))
581              return -EPERM;
582
583        if (strncpy_from_user(name, name_user, MODULE_NAME_LEN-1) < 0)
584              return -EFAULT;
585        name[MODULE_NAME_LEN-1] = '\0';
586
587        if (down_interruptible(&module_mutex) != 0)
588               return -EINTR;
589
590        mod = find_module(name);
591        if (!mod) {
592              ret = -ENOENT;
593              goto out;
594        }
595
596        if (!list_empty(&mod->modules_which_use_me)) {
597              /* Other modules depend on us: get rid of them first. */
598              ret = -EWOULDBLOCK;
599              goto out;
600        }
601
602        /* Doing init or already dying? */
603        if (mod->state != MODULE_STATE_LIVE) {
604               /* FIXME: if (force), slam module count and wake up
605                 waiter --RR */
606              DEBUGP("%s already dying\n", mod->name);
607              ret = -EBUSY;
608              goto out;
609        }
610
611        /* If it has an init func, it must have an exit func to unload */
612        if ((mod->init != NULL && mod->exit == NULL)
613            || mod->unsafe) {
614                forced = try_force_unload(flags);
615                if (!forced) {
616                    /* This module can't be removed */
617                    ret = -EBUSY;
618                    goto out;
619              }
620        }
621
622        /* Set this up before setting mod->state */
623        mod->waiter = current;
624
625        /* Stop the machine so refcounts can't move and disable module. */
626        ret = try_stop_module(mod, flags, &forced);
627        if (ret != 0)
628             goto out;
629
630        /* Never wait if forced. */
631        if (!forced && module_refcount(mod) != 0)
632             wait_for_zero_refcount(mod);
633
634        /* Final destruction now noone is using it. */
635        if (mod->exit != NULL) {
636              up(&module_mutex);
637              mod->exit();
638              down(&module_mutex);
639        }
640        free_module(mod);
641
642  out:
643        up(&module_mutex);
644        return ret;
645  }

The function sys_delete_module() is the implementation of the system call delete_module(). The effect of calling this function is to delete a kernel module that has been loaded by the system. The entry parameter name_user is the name of the module to delete.

Explanation:

580-581  capable() is called to check that the caller has permission to operate on kernel modules.

583-585  the name of the module is copied in from user space.

590-594  the module is looked up in the modules list.

597-599  if other kernel modules depend on this module, it cannot be removed.

635-638  the module's exit function, the exit function described in Table 6-1, is executed.

640      the kernel space occupied by the module structure is freed.
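Correspondingly, what rmmod ultimately does can be sketched in a few lines of user-space C (illustrative only; O_NONBLOCK makes the call fail instead of waiting when the reference count is not zero):

#include <fcntl.h>      /* O_NONBLOCK */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s modulename\n", argv[0]);
        return 1;
    }
    /* O_NONBLOCK: fail instead of waiting if the module is still in use */
    if (syscall(SYS_delete_module, argv[1], O_NONBLOCK) != 0) {
        perror("delete_module");
        return 1;
    }
    return 0;
}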

That is as far as we will follow the source code here; there are a number of other functions in the kernel/module.c file as well.

Case study: why some process names displayed by the top command are wrapped in square brackets "[]"

When running the top/ps commands, we find in the COMMAND column that some process names are enclosed in [], for example:

  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
 1542   928 root     R     1064   2%   5% top
    1     0 root     S     1348   2%   0% /sbin/procd
  928     1 root     S     1060   2%   0% /bin/ash --login
  115     2 root     SW       0   0%   0% [kworker/u4:2]
    6     2 root     SW       0   0%   0% [kworker/u4:0]
    4     2 root     SW       0   0%   0% [kworker/0:0]
  697     2 root     SW       0   0%   0% [kworker/1:3]
  703     2 root     SW       0   0%   0% [kworker/0:3]
   15     2 root     SW       0   0%   0% [kworker/1:0]
   27     2 root     SW       0   0%   0% [kworker/1:1]

Application code logic analysis

keyword: COMMAND

After obtaining the busybox source code, let's try a quick brute-force search for the keyword:

[GMPY@12:22 busybox-1.27.2]$grep "COMMAND" -rnw *

It turns out that there are far too many matches:

applets/usage_pod.c:79: printf("=head1 COMMAND DESCRIPTIONS\n\n");
archival/cpio.c:100:      --rsh-command=COMMAND  Use remote COMMAND instead of rsh
docs/BusyBox.html:1655:<p>which [COMMAND]...</p>
docs/BusyBox.html:1657:<p>Locate a COMMAND</p>
docs/BusyBox.txt:93:COMMAND DESCRIPTIONS
docs/BusyBox.txt:112:        brctl COMMAND [BRIDGE [INTERFACE]]
docs/BusyBox.txt:612:    ip  ip [OPTIONS] address|route|link|neigh|rule [COMMAND]
docs/BusyBox.txt:614:        OPTIONS := -f[amily] inet|inet6|link | -o[neline] COMMAND := ip addr
docs/BusyBox.txt:1354:        which [COMMAND]...
docs/BusyBox.txt:1356:        Locate a COMMAND
......

Many of the matches are in non-source files, which is why there are so many. Can we restrict the search to C files only?

[GMPY@12:25 busybox-1.27.2]$find -name "*.c" -exec grep -Hn --color=auto "COMMAND" {} \;

This time there are only 71 matching lines. After a quick scan of the matched files, there is an interesting discovery:

......
./shell/ash.c:9707:         if (cmdentry.u.cmd == COMMANDCMD) {
./editors/vi.c:1109:    // get the COMMAND into cmd[]
./procps/lsof.c:31: * COMMAND    PID USER   FD   TYPE             DEVICE     SIZE       NODE NAME
./procps/top.c:626:     " COMMAND");
./procps/top.c:701:     /* PID PPID USER STAT VSZ %VSZ [%CPU] COMMAND */
./procps/top.c:841: strcpy(line_buf, HDR_STR " COMMAND");
./procps/top.c:854:     /* PID VSZ VSZRW RSS (SHR) DIRTY (SHR) COMMAND */
./procps/ps.c:441:  { 16                 , "comm"  ,"COMMAND",func_comm  ,PSSCAN_COMM    },
......

In busybox, each command lives in its own file, and the code is well structured. Let's go straight to line 626 of procps/top.c.

Function: display_process_list

Line 626 of procps/top.c belongs to the function display_process_list. Just look at the code logic:

static NOINLINE void display_process_list(int lines_rem, int scr_width)
{
    ......
    /* print the table header */
    printf(OPT_BATCH_MODE ? "%.*s" : "\033[7m%.*s\033[0m", scr_width,
        "  PID  PPID USER     STAT   VSZ %VSZ"
        IF_FEATURE_TOP_SMP_PROCESS(" CPU")
        IF_FEATURE_TOP_CPU_USAGE_PERCENTAGE(" %CPU")
        " COMMAND");
 
    ......
    /* iterate over the descriptor of each process */
    while (--lines_rem >= 0) {
        if (s->vsz >= 100000)
            sprintf(vsz_str_buf, "%6ldm", s->vsz/1024);
        else
            sprintf(vsz_str_buf, "%7lu", s->vsz);
        /* print everything in the row except COMMAND, e.g. PID, USER, STAT */
        col = snprintf(line_buf, scr_width,
                "\n" "%5u%6u %-8.8s %s%s" FMT
                IF_FEATURE_TOP_SMP_PROCESS(" %3d")
                IF_FEATURE_TOP_CPU_USAGE_PERCENTAGE(FMT)
                " ",
                s->pid, s->ppid, get_cached_username(s->uid),
                s->state, vsz_str_buf,
                SHOW_STAT(pmem)
                IF_FEATURE_TOP_SMP_PROCESS(, s->last_seen_on_cpu)
                IF_FEATURE_TOP_CPU_USAGE_PERCENTAGE(, SHOW_STAT(pcpu))
        );
        /* the key point: read cmdline */
        if ((int)(col + 1) < scr_width)
            read_cmdline(line_buf + col, scr_width - col, s->pid, s->comm);
        ......
    }
}

After removing irrelevant code, the function logic is clear

  1. All processes have been traversed in the code before this function and the description structure has been constructed
  2. Traverse the description structure in display_process_list and print the information in the specified order
  3. Get and print the process name through read_cmdline

We enter the function read_cmdline

Function: read_cmdline

void FAST_FUNC read_cmdline(char *buf, int col, unsigned pid, const char *comm)
{
    ......
    sprintf(filename, "/proc/%u/cmdline", pid);
    sz = open_read_close(filename, buf, col - 1);
    if (sz > 0) {
        ......
        while (sz >= 0) {
            if ((unsigned char)(buf[sz]) < ' ')
                buf[sz] = ' ';
            sz--;
        }
        ......
        if (strncmp(base, comm, comm_len) != 0) {
            ......
            snprintf(buf, col, "{%s}", comm);
            ......
    } else {
        snprintf(buf, col, "[%s]", comm ? comm : "?");
    }
}

After removing the extraneous code, we find:

  1. The process name is obtained from /proc/<PID>/cmdline
  2. If /proc/<PID>/cmdline is empty, comm is used instead, and in that case the name is enclosed in brackets []
  3. If the basename of cmdline is inconsistent with comm, the name is enclosed in {}

For the sake of readability, we will not analyze cmdline and comm any further here.

We focus on the question: under what circumstances is /proc/<PID>/cmdline empty?

Kernel code logic analysis

keyword: cmdline

/proc is the mount point of proc, a special filesystem, and cmdline is certainly one of its functions.

Assuming we are new to the kernel, all we can do at this point is to retrieve the keyword cmdline in the kernel proc source code.

[GMPY@09:54 proc]$cd fs/proc && grep "cmdline" -rnw *

Found two key matching files base.c and cmdline.c

array.c:11: * Pauline Middelink :  Made cmdline,envline only break at '\0's, to
base.c:224: /* Check if process spawned far enough to have cmdline. */
base.c:708: * May current process learn task's sched/cmdline info (for hide_pid_min=1)
base.c:2902:    REG("cmdline",    S_IRUGO, proc_pid_cmdline_ops),
base.c:3294:    REG("cmdline",   S_IRUGO, proc_pid_cmdline_ops),
cmdline.c:26:   proc_create("cmdline", 0, NULL, &cmdline_proc_fops);
Makefile:16:proc-y  += cmdline.o
vmcore.c:1158:   * If elfcorehdr= has been passed in cmdline or created in 2nd kernel,

The code logic of cmdline.c is very simple, and it is easy to see that it implements /proc/cmdline, which is not what we need.

Let's focus on base.c; the relevant code is:

REG("cmdline",   S_IRUGO, proc_pid_cmdline_ops),

Experience and intuition suggest that:

  1. cmdline: is the file name
  2. S_IRUGO: is the file permission
  3. proc_pid_cmdline_ops: is the operation structure corresponding to the file

Sure enough, jumping to proc_pid_cmdline_ops we find that it is defined as:

static const struct file_operations proc_pid_cmdline_ops = {
    .read   = proc_pid_cmdline_read,
    .llseek = generic_file_llseek,
};

Function: proc_pid_cmdline_read

static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
                size_t _count, loff_t *pos)
{
    ......
    /* get the virtual address space descriptor (mm) of the process */
    mm = get_task_mm(tsk);
    ......
    /* get the addresses of argv and of env */
    arg_start = mm->arg_start;
    arg_end = mm->arg_end;
    env_start = mm->env_start;
    env_end = mm->env_end;
    ......
    while (count > 0 && len > 0) {
        ......
        /* compute the address offset */
        p = arg_start + *pos;
        while (count > 0 && len > 0) {
            ......
            /* read the data from the process address space */
            nr_read = access_remote_vm(mm, p, page, _count, FOLL_ANON);
            ......
        }
    }
}

A newcomer may be confused at this point: how do we know what access_remote_vm is?

Very simple: jump to the access_remote_vm function and you will see that it carries the following comment:

/**
 * access_remote_vm - access another process' address space
 * @mm:         the mm_struct of the target address space
 * @addr:       start address to access
 * @buf:        source or destination buffer
 * @len:        number of bytes to transfer
 * @gup_flags:  flags modifying lookup behaviour
 *
 * The caller must hold a reference on @mm.
 */
int access_remote_vm(struct mm_struct *mm, unsigned long addr,
        void *buf, int len, unsigned int gup_flags)
{
    return __access_remote_vm(NULL, mm, addr, buf, len, gup_flags);
}

In the Linux kernel source code, many functions have very standardized function descriptions, parameter descriptions, precautions, etc. We must make full use of these resources to learn the code.

With that out of the way, let's get back to the topic.

From proc_pid_cmdline_read we find that reading /proc/<PID>/cmdline actually reads the address space data starting at arg_start. When that data is empty, nothing can be read, of course. So the question becomes: when is the address space region identified by arg_start empty?

keyword: arg_start

This is related to the address space and is certainly not confined to proc, so let's search for the keyword globally in the kernel source:

[GMPY@09:55 proc]$find -name "*.c" -exec grep --color=auto -Hnw "arg_start" {} \;

There are many matches; going through them one by one is unappealing, and the matched code alone does not point in a clear direction:

./mm/util.c:635:    unsigned long arg_start, arg_end, env_start, env_end;
......
./kernel/sys.c:1747:        offsetof(struct prctl_mm_map, arg_start),
......
./fs/exec.c:709:    mm->arg_start = bprm->p - stack_shift;
./fs/exec.c:722:    mm->arg_start = bprm->p;
......
./fs/binfmt_elf.c:301:  p = current->mm->arg_end = current->mm->arg_start;
./fs/binfmt_elf.c:1495: len = mm->arg_end - mm->arg_start;
./fs/binfmt_elf.c:1499:                (const char __user *)mm->arg_start, len))
......
./fs/proc/base.c:246:   len1 = arg_end - arg_start;
......

But the names of the matching files provide some inspiration:

/proc/<PID>/cmdline is a per-process attribute. From task_struct to mm_struct, these structures describe the process and its resources. So when is the mm_struct that holds arg_start modified? When the process is initialized!

It is further thought that creating a process in user space is nothing more than two steps:

  1. fork
  2. exec

At fork time, only a new task_struct is created, and the parent and child processes share one mm_struct. Only at exec time does the process get an independent mm_struct, so arg_start must be set during exec. And among the files that match arg_start there happens to be exec.c.

After checking setup_arg_pages, the function in fs/exec.c where the keyword appears, the key code is not found, so we go back to the matching file names and make a further association:

exec runs a new program, which really means loading the new program's binary file, and the files matching the keyword happen to include binfmt_elf.c!

Locating a problem is not only about understanding the code; making associations is sometimes just as effective.

Function: create_elf_tables

The function create_elf_tables in binfmt_elf.c matches the keyword arg_start. The function is quite long, so let's simplify it:

static int
create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
        unsigned long load_addr, unsigned long interp_load_addr)
{
    ......
    /* Populate argv and envp */
    p = current->mm->arg_end = current->mm->arg_start;
    while (argc-- > 0) {
        ......
        if (__put_user((elf_addr_t)p, argv++))
            return -EFAULT;
        ......
    }
    ......
    current->mm->arg_end = current->mm->env_start = p;
    while (envc-- > 0) {
        ......
        if (__put_user((elf_addr_t)p, envp++))
            return -EFAULT;
        ......
    }
    ......
}

In this function, argv and envp are stored in the address space starting at arg_start and env_start.

Next, let's trace back toward the source and find out who calls create_elf_tables.

First of all, create_elf_tables is declared static, which means its scope cannot extend beyond the file it lives in. Searching within the file, we find that its caller is:

static int load_elf_binary(struct linux_binprm *bprm)

This one is also static, so we keep searching for load_elf_binary in the same file and find the following code:

static struct linux_binfmt elf_format = {
    .module         = THIS_MODULE,
    .load_binary    = load_elf_binary,
    .load_shlib     = load_elf_library,
    .core_dump      = elf_core_dump,
    .min_coredump   = ELF_EXEC_PAGESIZE,
};
 
static int __init init_elf_binfmt(void)
{
    register_binfmt(&elf_format);
    return 0;
}
 
core_initcall(init_elf_binfmt);

At this point the code structure is very clear: the load_elf_binary function is assigned into a struct linux_binfmt and registered with the upper layer through register_binfmt, providing the upper layer with a callback.

Keyword: load_binary

Why lock onto the keyword load_binary? Since .load_binary = load_elf_binary, the upper layer must call it as XXX->load_binary(...), so we search for load_binary to locate where this callback is invoked.

[GMPY@09:55 proc]$ grep "\->load_binary" -rn *

Luckily, this callback is only called in fs/exec.c:


fs/exec.c:78:   if (WARN_ON(!fmt->load_binary))
fs/exec.c:1621:     retval = fmt->load_binary(bprm);

Line 1621 of fs/exec.c belongs to the function search_binary_handler. Unfortunately, the presence of EXPORT_SYMBOL(search_binary_handler); means this function is very likely called from many places, and continuing the forward analysis becomes difficult at this point. Why not try analyzing in the opposite direction?

When one road is blocked, look at the problem from a different angle and the answer may be right in front of you.

Since it is not easy to continue the analysis from search_binary_handler, let's see whether the execve system call can be followed, step by step, down to search_binary_handler.

keyword: exec

On Linux 4.9, a system call is generally defined as SYSCALL_DEFINE<number of arguments>(<function name>, ...), so let's search for this keyword globally and first determine where the system call is defined:

[GMPY@09:55 proc]$ grep "SYSCALL_DEFINE.*exec" -rn *

This points to fs/exec.c:

fs/exec.c:1905:SYSCALL_DEFINE3(execve,
fs/exec.c:1913:SYSCALL_DEFINE5(execveat,
fs/exec.c:1927:COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
fs/exec.c:1934:COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
kernel/kexec.c:187:SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
kernel/kexec.c:233:COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
kernel/kexec_file.c:256:SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,

Following the subsequent function calls is no longer tedious; the call chain is summarized as follows:

execve -> do_execveat -> do_execveat_common -> exec_binprm -> search_binary_handler

In the end, it arrives back at search_binary_handler.

After this analysis, we have established the assignment logic:

  1. When a new program is executed via execve, the mm_struct is initialized

  2. The argv and envp passed to execve are saved at the addresses identified by arg_start and env_start

  3. cat /proc/<PID>/cmdline reads the data from the virtual address arg_start

Therefore, any process created from user space through the execve system call will have a /proc/<PID>/cmdline. But we still have not answered the question: when is cmdline empty?
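This conclusion is easy to check from user space. The small program below (an illustrative sketch, not part of busybox) reads its own /proc/self/cmdline and prints the NUL-separated arguments one per line:

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("/proc/self/cmdline", "r");
    if (!fp) {
        perror("fopen");
        return 1;
    }

    char buf[4096];
    size_t n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';                   /* be safe even if the read was truncated */

    /* arguments are separated by '\0'; print each one on its own line */
    for (size_t i = 0; i < n; i += strlen(buf + i) + 1)
        printf("argv: %s\n", buf + i);
    return 0;
}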

We know that in Linux, processes can be divided into user space processes and kernel space processes. Since cmdline is not empty for user space processes, let's look at kernel processes.

Function: kthread_run

In kernel drivers we often create a kernel thread with kthread_run. Let's use this function as the entry point and analyze whether cmdline is assigned when a kernel process is created.

Starting directly from kthread_run and tracking the call chain, we find that the real work is done by the function __kthread_create_on_node:

kthread_run -> kthread_create -> kthread_create_on_node -> __kthread_create_on_node

Remove redundant code and focus on what the function does:

static struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
                void *data, int node, const char namefmt[], va_list args)
{
    /* store the attributes of the new thread in a kthread_create_info structure */
    struct kthread_create_info *create = kmalloc(sizeof(*create), GFP_KERNEL);
    create->threadfn = threadfn;
    create->data = data;
    create->node = node;
    create->done = &done;
    
    /* add the initialized create to the list and wake the kthreadd_task process to do the actual creation */
    list_add_tail(&create->list, &kthread_create_list);
    wake_up_process(kthreadd_task);
    /* wait for the creation to complete */
    wait_for_completion_killable(&done)
    
    ......
 
    task = create->result;
    if (!IS_ERR(task)) {
        ......
        /* after creation, set the thread name; the attribute set here is comm, which is different from cmdline */
        vsnprintf(name, sizeof(name), namefmt, args);
        set_task_comm(task, name);
        ......
    }
}

The analysis method is similar to the above and is not repeated here. In summary, the function does two things:

  1. Wakes up the kthreadd_task process to create the new thread

  2. Sets the attributes of the thread, which include comm but not cmdline

Recall the user space code analysis: if /proc/<PID>/cmdline is empty, comm is used instead, enclosed in [].

Therefore, /proc/<PID>/cmdline is empty for the kernel processes created with kthread_run/kthread_create.
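For completeness, here is a hedged kernel-module sketch of such a thread (the module and the thread name "demo-kthread" are made up for illustration); after insmod, ps/top should list it as [demo-kthread], because only comm is set:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *demo_task;

static int demo_fn(void *data)
{
    /* idle until the module is removed */
    while (!kthread_should_stop())
        msleep(1000);
    return 0;
}

static int __init demo_init(void)
{
    demo_task = kthread_run(demo_fn, NULL, "demo-kthread");
    return IS_ERR(demo_task) ? PTR_ERR(demo_task) : 0;
}

static void __exit demo_exit(void)
{
    kthread_stop(demo_task);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");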

This analysis mainly relied on the following techniques:

  1. Keyword search: from COMMAND in the top program to arg_start, load_binary and exec in the kernel source
  2. Function comments: the function description of access_remote_vm
  3. Association: relating the per-process attribute to how user space creates a process, and from there locating the code that handles arg_start
  4. Reverse thinking: since it is hard to deduce the callers of search_binary_handler going upward, trace the execve system call downward step by step until it reaches search_binary_handler

Based on this analysis, we draw the following conclusions:

  1. Processes created from user space are displayed by top/ps without [];

  2. Processes created in kernel space are displayed by top/ps with [];

How do module units interact with the system kernel?

The make file of the kernel module:

First, let's take a look at how the make file of the module program should be written. Since version 2.6, Linux's specifications for kernel modules have changed a lot. For example, the extension name of all modules is changed from ".o" to ".ko". For details, see Documentation/kbuild/makefiles.txt. To edit the Makefile for kernel modules, see Documentation/kbuild/modules.txt.

When we practiced "helloworld.ko", we used a simple Makefile:

TARGET = helloworld
KDIR = /usr/src/linux
PWD = $(shell pwd)

obj-m += $(TARGET).o

default:
       make -C $(KDIR) M=$(PWD) modules

$(KDIR) indicates the location of the top-level directory of the source code.

"obj-m += $(TARGET).o" tells kbuild that it wants to compile $(TARGET), which is helloworld, into a kernel module.

"M=$(PWD)" means that the generated module files will be in the current directory.

make file for multifile kernel module

Now, let's extend the question, how to compile a multi-file kernel module? Also taking "Hello, world" as an example, we need to do the following:

Among the source files, all but one should add the line #define __NO_VERSION__ (in this two-file example, stop.c does). This is because module.h normally contains the definition of the global variable kernel_version, which holds the kernel version the module was compiled for, and it must be defined only once. If you need version.h, you have to include it yourself, because module.h will not include version.h once __NO_VERSION__ is defined.

An example of a multi-file kernel module is given below.  

Start.c:

/* start.c
 *
 * "Hello, world" - kernel module version
 * This file contains only the module start-up routine.
 */

/* Necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h>   /* We are doing kernel work */
#include <linux/module.h>

/* Initialize the module */
int init_module()
{
  printk("Hello, world!\n");

  /* If we return a non-zero value, it means that
   * init_module failed and the kernel module
   * cannot be loaded */
  return 0;
}

stop.c:

/* stop.c
 * "Hello, world" - kernel module version
 * This file contains only the module shutdown routine.
 */

/* Necessary header files */
/* Standard in kernel modules */
#include <linux/kernel.h>   /* We are doing kernel work */
#define __NO_VERSION__
#include <linux/module.h>
#include <linux/version.h>   /* Not included by module.h, because of __NO_VERSION__ */

/* Cleanup - undo whatever init_module did */
void cleanup_module()
{
  printk("Bye!\n");
}

/* End */

This time, the helloworld kernel module contains two source files, "start.c" and "stop.c". Let's take a look at how to write a Makefile for a multi-file kernel module

Makefile:

TARGET = helloworld
 
KDIR = /usr/src/linux
 
PWD = $(shell pwd)
 
obj-m += $(TARGET).o
 
$(TARGET)-y := start.o stop.o
 
default:
 
    make -C $(KDIR) M=$(PWD) modules

Compared with the previous one, only one line is added:

$(TARGET)-y := start.o stop.o

Write a kernel module

Let's try to write a very simple module program, which can be implemented on version 2.6.15, and may need some adjustments for kernel versions lower than 2.4.

helloworld.c

#include <linux/module.h>        /* Needed by all modules */
#include <linux/kernel.h>        /* Needed for KERN_INFO */

int init_module(void)
{
    printk(KERN_INFO "Hello World!\n");
    return 0;
}

void cleanup_module(void)
{
    printk(KERN_INFO "Goodbye!\n");
}

MODULE_LICENSE("GPL");

Explanation:

1. The writing of any module program needs to include the header file linux/module.h, which contains the structure definition of the module and the version control of the module. The main data structures in the file will be described in detail later.
2. The function init_module() and the function cleanup_module( ) are the most basic and necessary functions in module programming. init_module() registers new functions provided by the module to the kernel; cleanup_module() is responsible for unregistering all functions registered by the module.
3. Note that we are using the printk() function here (do not habitually write it as printf), the printk() function is defined by the Linux kernel, and its function is similar to printf. The string KERN_INFO represents the priority of the message. One of the features of printk() is that it processes messages of different priorities differently.

Next, we need to compile and load this module. One more thing to note: Make sure you are a superuser now. Because only superuser can load and unload modules. Before compiling the kernel module, prepare a Makefile:

TARGET = helloworld
KDIR = /usr/src/linux
PWD = $(shell pwd)

obj-m += $(TARGET).o

default:
       make -C $(KDIR) M=$(PWD) modules

Then simply enter the command make:

#make

As a result, we get the file "helloworld.ko". Then execute the load command of the kernel module:

#insmod  helloworld.ko

Hello World!

At this time, the string "Hello World!" is generated, which is defined in init_module(). This shows that the helloworld module has been loaded into the kernel. We can use the lsmod command to see. What the lsmod command does is tell us information about all modules running in the kernel, including the name of the module, the size of the footprint, usage count, and the current state and dependencies.

root# lsmod

Module    Size    Used  by

helloworld  464    0   (unused)

Finally, we want to uninstall this module.

# rmmod helloworld

Goodbye!

Saw "Goodbye!" printed on the screen, which is defined in cleanup_module(). This shows that the helloworld module has been deleted. If we use lsmod to check again at this time, we will find that the helloworld module is no longer there.

Regarding the two commands insmod and rmmod, I can only tell you briefly that they are two utilities for inserting and removing modules from the kernel. The previously used insmod, rmmod and lsmod are all modutils module utilities.

We have successfully implemented a simplest module program on the machine.

3. Memory Management

The memory management subsystem is an important part of the operating system. Since the early days of computer development, there has been a need for memory that is larger than the physical capabilities of the system. To overcome this limitation, many strategies have been developed, the most successful of which is virtual memory. Virtual memory makes the system appear to have more memory than it actually does by sharing memory among competing processes.

Virtual memory does more than just make the computer appear to have more memory; the memory management subsystem also provides:

Large Address Spaces The operating system makes the system appear to have a larger amount of memory than it actually does. Virtual memory can be many times larger than the physical memory in the system.

Protection Each process in the system has its own virtual address space. These virtual address spaces are completely separated from each other, so a process running one application cannot affect another. In addition, the hardware virtual memory mechanism allows memory areas to be write-protected, which prevents code and data from being overwritten by misbehaving programs.

Memory Mapping Memory mapping is used to map images and data into the address space of a process. With memory mapping, the contents of the file are linked directly into the virtual address space of the process.

Fair Physical Memory Allocation The memory management subsystem allows each running process in the system a fair share of the system's physical memory.

Shared Virtual Memory While virtual memory allows processes to have separate (virtual) address spaces, sometimes you also need to share memory between processes. For example, there may be multiple processes in the system running the command interpreter bash. While it is possible to have a copy of bash in each process' virtual address space, it is better to have only one copy in physical memory, and all processes running bash share code. Dynamically linked libraries are another common example of multiple processes sharing executing code. Shared memory can also be used for inter-process communication (IPC) mechanisms, where two or more processes can exchange information through jointly owned memory. The Linux system supports the shared memory IPC mechanism of System V.

3.1 An Abstract Model of Virtual Memory

Before thinking about Linux's approach to supporting virtual memory, it's best to think about an abstract model so as not to get cluttered with too many details.

When a process executes a program, it reads instructions from memory and decodes them. Decoding an instruction may require reading or storing the contents of a specific location in memory, and then the process executes the instruction and moves to the next instruction in the program. A process accesses memory whether it is reading instructions or accessing data.

In a virtual memory system, all addresses are virtual addresses rather than physical addresses. The processor translates virtual addresses to physical addresses through a set of information held by the operating system.

To make this translation easier, virtual and physical memory are divided into suitably sized blocks called pages. Pages are all the same size (they could differ, but that would make the system much harder to manage). Linux uses 8K-byte pages on Alpha AXP systems and 4K-byte pages on Intel x86 systems. Each page is given a unique number: its page frame number (PFN). In this paged model, a virtual address consists of two parts: the virtual page number and an offset within the page. If the page size is 4K, then bits 11 to 0 of the virtual address hold the offset within the page and bits 12 and above hold the virtual page number. Every time the processor encounters a virtual address, it must extract the offset and the virtual page number, translate the virtual page number to a physical page and access the correct offset within that physical page. To do this, the processor uses page tables.

Figure 3.1 shows the virtual address spaces of two processes, Process X and Process Y, each with its own page table. These page tables map each process's virtual pages to physical pages of memory. The figure shows that process X's virtual page number 0 maps to physical page number 1, while process Y's virtual page number 1 maps to physical page number 4. In theory, each entry in the page table includes the following information:

Valid flag Indicates whether this entry in the page table is valid

The physical page number described by this page table entry

Access Control Information Describe how this page is used: Is it writable? Does it include executing code?

Page tables are accessed using virtual page numbers as offsets. Virtual page number 5 is the 6th element in the table (0 is the first element)

To translate a virtual address to a physical one, the processor must first find the virtual address's page number and the offset within the page. Because page sizes are powers of two, this can be done with simple masks and shifts. Looking again at Figure 3.1, and assuming a page size of 0x2000 (8192 decimal) and an address of 0x2194 in process Y's virtual address space, the processor translates the address into offset 0x194 within virtual page number 1.

The processor uses the virtual page number as an index to find its page table entry in the process's page table. If the entry is valid, the processor fetches the physical page number from the entry. If this entry is invalid, the process has accessed an area of its virtual memory that does not exist. In this case, the processor cannot interpret the address and must pass control to the operating system to handle it.

Exactly how the processor informs the operating system that a process has accessed a virtual address that cannot be translated is processor-specific. However it is delivered, this is known as a page fault: the operating system is told the faulting virtual address and the reason for the fault.

Assuming this is a valid page table entry, the processor takes the physical page number and multiplies it by the page size to get the base address of the page in physical memory. Finally, it adds the offset of the instruction or data item it needs.

Using the example above again, process Y's virtual page number 1 is mapped to physical page number 4, which starts at physical address 0x8000 (4 × 0x2000). Adding the offset of 0x194 gives a final physical address of 0x8194.
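
To make the arithmetic concrete, here is a small, self-contained C sketch that repeats the translation above. The page_table array is a stand-in for process Y's page table in Figure 3.1, and the 8K page size matches the Alpha AXP example:

#include <stdio.h>

/* Illustrative sketch only: redo the translation example from the text,
 * assuming an 8K (0x2000-byte) page size. The page_table array below is a
 * stand-in for process Y's page table in Figure 3.1. */
#define PAGE_SIZE  0x2000UL
#define PAGE_SHIFT 13            /* 2^13 = 0x2000 */

int main(void)
{
    unsigned long page_table[] = { 0, 4, 0, 0 };    /* virtual page 1 -> physical page 4 */

    unsigned long vaddr  = 0x2194;                  /* virtual address in process Y */
    unsigned long vpn    = vaddr >> PAGE_SHIFT;     /* virtual page number = 1      */
    unsigned long offset = vaddr & (PAGE_SIZE - 1); /* offset in page     = 0x194   */
    unsigned long pfn    = page_table[vpn];         /* physical page number = 4     */
    unsigned long paddr  = (pfn << PAGE_SHIFT) | offset;

    printf("VPN=%lu offset=0x%lx -> physical 0x%lx\n", vpn, offset, paddr);
    return 0;                    /* prints: VPN=1 offset=0x194 -> physical 0x8194 */
}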

By mapping virtual to physical addresses this way, virtual memory can be mapped into the system's physical pages in any order. For example, in Figure 3.1, process X's virtual page number 0 is mapped to physical page number 1, whereas virtual page number 7 is mapped to physical page number 0, even though it is higher in virtual memory than virtual page 0. This demonstrates an interesting by-product of virtual memory: the pages of virtual memory do not have to be present in physical memory in any particular order.

3.1.1 Demand Paging

Because physical memory is much smaller than virtual memory, the operating system must be careful not to use it inefficiently. One way to save physical memory is to load only the virtual pages that the executing program is actually using. For example, a database program may be running a query against its database. In this case not all of the database needs to be in memory, only the data records being examined; and if it is a lookup query, it makes little sense to load the code that handles adding new records. This technique of loading virtual pages only when they are accessed is known as demand paging.

When a process attempts to access a virtual address that is not currently in memory, the processor cannot find the page table entry for the referenced virtual page. For example: Process X in Figure 3.1 does not have an entry for virtual page 2 in its page table, so if process X tries to read from an address in virtual page 2, the processor cannot translate the address to a physical address. At this time, the processor notifies the operating system of a page fault.

If the faulting virtual address is invalid, the process has tried to access a virtual address it should not have, perhaps because of a program error such as writing to a random address in memory. In this case the operating system terminates it, protecting the other processes in the system.

If the faulting virtual address is valid but the page it refers to is not currently in memory, the operating system must bring the appropriate page into memory from the image on disk. Disk access takes a relatively long time, so the process must wait until the page has been fetched. If there are other processes that can run, the operating system selects one of them to run in the meantime. The fetched page is written into a free physical page and an entry for the virtual page is added to the process's page table. The process is then restarted at the machine instruction where the memory fault occurred. This time the processor can translate the virtual address to a physical one, so the process continues to run.

Linux uses demand paging technology to load the executable image into the virtual memory of the process. When a command executes, the file containing it is opened and its contents are mapped into the virtual memory of the process. This process is accomplished by modifying the data structures that describe the memory mapping of the process, also known as memory mapping. However, only the first part of the image is actually placed in physical memory. The rest of the image remains on disk. When the image executes, it generates a page fault, and Linux uses the process's memory map table to determine which part of the image needs to be loaded into memory for execution.

3.1.2 Swapping

If a process needs to put a virtual page into physical memory and there are no free physical pages, the operating system must discard another page in physical space to make room for that page.

If the page in physical memory that needs to be discarded comes from an image or data file on disk, and has not been written so it does not need to be stored, the page is discarded. If the process needs the page again, it can be loaded into memory again from the image or data file.

However, if the page has been changed, the operating system must preserve its contents for later access. This is also called a dirty page, and when it is discarded from physical memory, it is stored in a special file called a swap file. Because access to the swap file is slow compared to access to the processor and physical memory, the operating system must decide whether to write data pages to disk or keep them in memory for next access.

If the algorithm used to decide which pages to discard or swap out is not efficient, a condition known as thrashing occurs: pages are constantly written to disk and read back, and the operating system is too busy to do any real work. For example, in Figure 3.1, if physical page number 1 is being accessed frequently, it should not be swapped out to disk. The set of pages a process is currently using is called its working set, and an efficient swapping scheme keeps the working sets of all processes in physical memory.

Linux uses an LRU (Least Recently Used) page-aging technique to choose fairly which pages to discard from the system. Every page in the system has an age that changes as the page is accessed: the more a page is accessed, the younger it is; the less it is accessed, the older and more stale it becomes. Old pages are good candidates for swapping.

3.1.3 Shared Virtual Memory

Virtual memory enables multiple processes to easily share memory. All memory accesses are through page tables, and each process has its own page table. For two processes to share a physical memory page, the physical page number must appear in the page tables of both processes.

Figure 3.1 shows two processes sharing physical page number 4. The virtual page number is 4 for process X and 6 for process Y. This also shows an interesting aspect of shared pages: a shared physical page does not have to exist in the same place in the virtual memory space of the process that shares it.

3.1.4 Physical and Virtual Addressing Modes

For the operating system itself, running in virtual memory makes little sense. This would be a nightmare if the OS had to maintain its own page tables. Most multipurpose processors support both physical and virtual address modes. The physical addressing mode does not require page tables, and the processor does not need to do any address translation in this mode. The Linux kernel runs in physical address mode.

The Alpha AXP processor does not have a special physical addressing mode. Instead, it divides the memory space into several areas, two of which are designated as physically mapped addresses. The kernel's address space is known as the KSEG address space and includes all addresses upwards from 0xfffffc0000000000. In order to execute code linked into KSEG (by definition, kernel code) or access data there, the code must be executing in kernel mode. The Linux kernel on Alpha is linked to execute from address 0xfffffc0000310000.

3.1.5 Access Control

Page table entries also contain access control information. Since the processor is already using the page table entry to map a process's virtual address to a physical one, it can easily use the access control information to check that the process is not accessing memory in a way it should not.

There are many reasons to restrict access to areas of memory. Some memory, such as that containing executable code, is naturally read-only, and the operating system should prevent a process from writing over its own code. Conversely, pages containing data can be written to, but attempts to execute that memory as instructions should fail. Most processors have at least two modes of execution: kernel mode and user mode. Kernel code and kernel data structures should not be executed or accessed directly by a user unless the processor is running in kernel mode.

The access control information is placed in the PTE (page table entry) and is related to the specific processor. Figure 3.2 shows the PTE of Alpha AXP. The meaning of each bit is as follows:

V Valid: whether this PTE is valid.
FOE "Fault on Execute": whether the processor should report a page fault and pass control to the operating system whenever an attempt is made to execute instructions in this page.
FOW "Fault on Write": as above, but the page fault is raised on an attempt to write to this page.
FOR "Fault on Read": as above, but the page fault is raised on an attempt to read from this page.
ASM Address Space Match: used when the operating system wishes to clear only some of the entries from the translation buffer.
KRE Code running in kernel mode can read this page.
URE Code running in user mode can read this page.
GH Granularity hint: used when mapping an entire block with a single translation buffer entry rather than many.
KWE Code running in kernel mode can write this page.
UWE Code running in user mode can write this page.
Page frame number For a PTE with the V bit set, this field contains the physical page frame number for this PTE; for an invalid PTE, if this field is not zero, it holds information about where the page is in the swap file.

The following two bits are defined and used by Linux:

_PAGE_DIRTY If set, the page needs to be written out to the swap file.
_PAGE_ACCESSED Used by Linux to mark the page as having been accessed.

3.2 Caches

If you implement a system with the above theoretical model, it will work, but it will not be very efficient. The designers of both operating systems and processors strive to make systems more performant. Besides using faster processors, memory, etc., the best approach is to maintain a cache of useful information and data, which will make some operations faster.

Linux uses a series of cache-related memory management techniques:

Buffer Cache: The Buffer cache contains data buffers for block device drivers. These buffers are fixed in size (eg 512 bytes) and contain data read from or written to a block device. A block device is a device that can only be accessed by reading and writing fixed-size blocks of data. All hard disks are block devices. The block device uses the device identifier and the number of the data block to be accessed as an index to quickly locate the data block. Block devices can only be accessed through the buffer cache. If the data can be found in the buffer cache, then there is no need to read from a physical block device such as a hard disk, thus speeding up access.

See fs/buffer.c

Page Cache is used to speed up access to images and data on disk. It is used to cache the logical content of a file, one page at a time, and accessed via the file and offsets within the file. When a data page is read from disk into memory, it is cached in the page cache.

See mm/filemap.c

Swap Cache Only changed (or dirty) pages exist in the swap file. As long as they are not modified again after they are written to the swap file, the next time these pages need to be swapped out, they do not need to be written to the swap file, because the page is already in the swap file, and the page can be discarded directly. On a heavily swapped system, this saves many unnecessary and expensive disk operations.

See mm/swap_state.c mm/swapfile.c

Hardware Cache: A common example of a hardware cache is inside the processor: a cache of page table entries. Here the processor does not always read the page table directly; instead it caches translations for pages as it needs them. These are the Translation Look-aside Buffers (TLBs), which hold cached copies of page table entries from one or more processes in the system.

When a virtual address is referenced, the processor tries to find a matching TLB entry. If it finds one, it can translate the virtual address to a physical one directly and perform the operation on the data. If it cannot, it needs help from the operating system: it signals that a TLB miss has occurred. A system-specific mechanism delivers this exception to the operating system code that can fix it, and the operating system generates a new TLB entry for the address mapping. When the exception has been cleared, the processor tries to translate the virtual address again; this time it succeeds because there is now a valid entry for that address in the TLB.

The drawback of caches, hardware or otherwise, is that Linux must spend more time and space maintaining them, and if the caches become corrupted the system will crash.

3.3 Linux Page Tables

Linux assumes that page tables have three levels. Each page table contains the page frame number of the next level of page table. Figure 3.3 shows how a virtual address can be broken into a number of fields, each providing an offset into a particular page table. To translate a virtual address into a physical one, the processor takes the contents of each level's field, converts it into an offset into the physical page holding that level's page table, and reads the page frame number of the next level of page table. This is repeated three times until the page frame number of the physical page containing the virtual address is found. The final field in the virtual address, the byte offset, is then used to find the data within the page.

Every platform on which Linux runs must provide translation macros that let the core handle the page tables of a particular process. In this way, the kernel does not need to know the exact structure of page table entries or how they are organized. In this way, Linux successfully uses the same page table handler for Alpha and Intel x86 processors, where Alpha uses three-level page tables and Intel uses two-level page tables.

See include/asm/pgtable.h
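
As a rough illustration of the walk described above, here is a user-space toy model of a three-level lookup. The level widths and shift values are invented for the example and do not correspond to any particular processor; in the real kernel the equivalent steps are hidden behind the macros in include/asm/pgtable.h.

#include <stdio.h>

/* Toy model of a three-level page table walk. The shift values are
 * illustrative only; real kernels hide these details behind macros
 * such as pgd_offset/pmd_offset/pte_offset. */
#define PAGE_SHIFT 12                      /* assumed 4K pages */
#define PMD_SHIFT  (PAGE_SHIFT + 9)        /* 9 bits per level */
#define PGD_SHIFT  (PMD_SHIFT + 9)
#define MASK       0x1ff                   /* 9-bit index mask */

unsigned long walk(unsigned long ***pgd, unsigned long vaddr)
{
    unsigned long **pmd = pgd[(vaddr >> PGD_SHIFT) & MASK];   /* level 1 */
    unsigned long  *pte = pmd[(vaddr >> PMD_SHIFT) & MASK];   /* level 2 */
    unsigned long   pfn = pte[(vaddr >> PAGE_SHIFT) & MASK];  /* level 3 */
    return (pfn << PAGE_SHIFT) | (vaddr & ((1UL << PAGE_SHIFT) - 1));
}

int main(void)
{
    /* build a minimal table mapping virtual page 1 to physical page 4 */
    static unsigned long   pte[512];
    static unsigned long  *pmd[512];
    static unsigned long **tbl[512];
    pte[1] = 4; pmd[0] = pte; tbl[0] = pmd;

    printf("0x1234 -> 0x%lx\n", walk(tbl, 0x1234));   /* prints 0x4234 */
    return 0;
}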

3.4 Page Allocation and Deallocation

There is a large demand for physical pages in the system. For example, the operating system needs to allocate pages when a program image is loaded into memory. These pages need to be freed when the program finishes execution and unloads. In addition, in order to store core-related data structures such as the page table itself, physical pages are also required. This mechanism and data structure for allocating and reclaiming pages is perhaps the most important for maintaining the efficiency of the virtual memory subsystem.

All physical pages in the system are described using the mem_map data structure. This is a linked list of mem_map_t structures, initialized at startup. Each mem_map_t (confusingly this structure is also called a page structure) structure describes a physical page in the system. The important fields (at least for memory management) are:

see include/linux/mm.h

count The number of users on this page. If this page is shared by multiple processes, the counter is greater than 1.

Age Describes the age of this page. Used to decide whether this page can be discarded or swapped out.

map_nr The physical page frame number that this mem_map_t describes.

The page allocation code uses the free_area vector to find and free pages. The whole buffer-management scheme is supported by this mechanism, and so far as this code is concerned, the size of the pages and the physical paging mechanisms used by the processor are irrelevant.

Each element of free_area contains information about blocks of pages. The first element in the array describes single pages, the next blocks of 2 pages, the next blocks of 4 pages, and so on upwards in powers of two. The list element is used as a queue head and holds pointers to the page data structures in the mem_map array; free blocks of pages are queued here. map is a bitmap that keeps track of allocated groups of pages of this size: bit N of the bitmap is set if the Nth block of pages is free.

Figure 3.4 shows the free_area structure. Unit 0 has one free page (page number 0), and unit 2 has 2 free blocks of 4 pages, the first starting at page number 4 and the second starting at page number 56.

3.4.1 Page Allocation

See mm/page_alloc.c get_free_pages()

Linux uses the buddy algorithm to allocate and deallocate blocks of pages efficiently. The page allocation code attempts to allocate a block of one or more physical pages; blocks are allocated in powers of two, so blocks of 1, 2, 4 pages and so on can be requested. As long as there are enough free pages in the system to satisfy the request (nr_free_pages > min_free_pages), the allocation code searches the free_area for a block of the requested size. Each element of the free_area has a map of the allocated and free blocks of pages of its own size; for example, element 2 of the array has a map describing which blocks of 4 pages are free and which are allocated.

The algorithm first looks for a block of pages of the requested size, following the chain of free pages queued on the list element of the free_area data structure. If no block of the requested size is free, it looks for a block of the next size up (twice the requested size), and continues until it has either searched all of the free_area or found a free block. If the block found is larger than requested, it is broken down into blocks of the right size. Because all blocks are power-of-two numbers of pages, the splitting is simple: the block is just halved. The free blocks are queued on the appropriate queue, and the allocated block is returned to the caller.

For example, in Figure 3.4, if a block of 2 pages is requested, the first 4-page block (starting at page number 4) is split into two 2-page blocks. The first, starting at page number 4, is returned to the caller; the second, starting at page number 6, is queued as a free 2-page block on element 1 of the free_area array.
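
For kernel code, the interface to this allocator is a request for a block of 2^order pages. The sketch below shows the general shape of such a call; the exact prototypes and flags (here __get_free_pages and GFP_KERNEL) have varied between kernel versions, so treat it as illustrative rather than exact.

#include <linux/mm.h>
#include <linux/errno.h>

/* Sketch only: ask the buddy allocator for a block of 2^2 = 4 pages and
 * later give it back. Prototypes differ slightly across kernel versions. */
static unsigned long block;

static int grab_pages(void)
{
    block = __get_free_pages(GFP_KERNEL, 2);  /* request a 4-page block         */
    if (!block)
        return -ENOMEM;                       /* no block of that size was free */
    return 0;
}

static void drop_pages(void)
{
    free_pages(block, 2);                     /* return the 4-page block        */
}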

3.4.2 Page Deallocation

Allocating blocks of pages, and splitting large blocks into smaller ones, tends to fragment memory. The page deallocation code therefore recombines pages into larger blocks whenever it can. The power-of-two block sizes matter here, because they make it easy to combine blocks into larger blocks.

Whenever a block of pages is freed, the code checks whether the adjacent, or buddy, block of the same size is also free. If it is, the two are combined into a new free block of the next size up. Each time two blocks are combined, the code tries to combine the result into a yet larger block. In this way the blocks of free pages are kept as large as possible.

For example, in Figure 3.4, if page number 1 is freed, it is combined with the already free page number 0 and placed in the free 2-page block queue in unit 1 of the free_area.
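
The "buddy" of a block is easy to compute because block sizes are powers of two: flipping one bit of the starting page number gives the only block it can ever merge with. The toy function below shows the rule (an illustration of the idea, not kernel code):

#include <stdio.h>

/* Toy illustration of the buddy rule used when freeing a block: a block of
 * 2^order pages starting at page number pfn has exactly one buddy of the
 * same size, found by flipping the order-th bit of the page number. If the
 * buddy is also free, the two merge into one block of the next size up. */
static unsigned long buddy_of(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

int main(void)
{
    /* Freeing page 1 (order 0): its buddy is page 0, as in the example
     * above, so the two single pages merge into the 2-page block at page 0. */
    printf("buddy of page 1 (order 0) is page %lu\n", buddy_of(1, 0));
    /* A 4-page block starting at page 4 (order 2) has its buddy at page 0. */
    printf("buddy of page 4 (order 2) is page %lu\n", buddy_of(4, 2));
    return 0;
}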

3.5 Memory Mapping

When an image executes, the contents of the executing image must be placed in the virtual address space of the process. The same is true for any shared library that the execution image connects to. The executable file is not actually placed in physical memory, but is only attached to the virtual memory of the process. This way, whenever a running program references a part of the image, that part of the image is loaded into memory from the executable. This connection of the image to the virtual address space of the process is called a memory map.

The virtual memory of each process is represented by a mm_struct data structure. This includes information about the currently executing image (eg bash) and a pointer to a set of vm_area_struct structures. Each vm_area_struct data structure describes the start of the memory area, the process's access rights to the memory area, and operations on this memory. These operations are a set of routines that Linux uses to manage this virtual memory. For example, one of the virtual memory operations is the correct operation that must be performed when a process tries to access this virtual memory and finds (through a page fault) that the memory is not in physical memory. This operation is called a nopage operation. The nopage operation is used when Linux requests that the pages of the execution image be loaded into memory.

When an execution image is mapped into the virtual address space of the process, a set of vm_area_struct data structures are generated. Each vm_area_struct structure represents a part of the execution image: execution code, initialized data (variables), uninitialized data, and so on. Linux supports a standard set of virtual memory operations, and a correct set of virtual memory operations is associated with them when the vm_area_struct data structure is created.
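
The same mechanism is visible from user space through the mmap() system call: a file's contents are linked into the process's address space and only faulted in as they are touched. A minimal example follows (the file name is arbitrary and just an example):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* User-space view of memory mapping: mmap() links the contents of a file
 * into the process's virtual address space; pages are only brought into
 * physical memory as they are touched (demand paging). */
int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* map the whole file read-only into our virtual address space */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* the first access to p[] causes a page fault; the kernel brings the
     * page in from the file via the page cache */
    fwrite(p, 1, st.st_size, stdout);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}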

3.6 Demand Paging

As soon as the execution image is mapped into the virtual memory of the process, it can start running. Because only the first part of the image is actually placed in physical memory, the process will soon access an area of virtual memory that is not yet in physical memory. When a process accesses a virtual address that does not have a valid page table entry, the processor reports a page fault to Linux. The page fault describes the virtual address at which the fault occurred and the type of memory access.

Linux must find the vm_area_struct data structure (connected together with the Adelson-Velskii and Landis AVL tree structure) corresponding to the space area where the page fault occurred. If the vm_area_struct structure corresponding to this virtual address cannot be found, it means that the process has accessed an illegal virtual address. Linux will signal the process, sending a SIGSEGV signal, and if the process doesn't handle the signal, it will exit.

See handle_mm_fault() in mm/memory.c

Linux then checks the type of page fault and the type of access allowed for that virtual memory area. A memory error is also signaled if a process accesses memory in an illegal way, such as writing to an area it can only read.

Now that Linux has determined that the page fault is legal, it must deal with it. Linux must distinguish between pages that are in the swap file and those that are part of an executable image on disk; it does this using the page table entry for the faulting virtual address.

See do_no_page() in mm/memory.c

If the page table entry for the page is invalid but not empty, the page is in the swap file. For Alpha AXP page table entries this means the valid bit is not set but the PFN field is non-zero. In this case the PFN field holds information about where the page is in the swap file (and which swap file it is in). How pages in the swap file are handled is described later.

Not all vm_area_struct data structures have a full set of virtual memory operations, and even those that do may not have a nopage operation. By default, when there is no nopage operation, Linux allocates a new physical page and creates a valid page table entry for it. If this area of virtual memory does have a special nopage operation, Linux calls it.

The usual Linux nopage operation is used to memory map the execution image and use the page cache to load the requested image page into physical memory. Although the process's page table is updated after the requested page is brought into physical memory, the necessary hardware action may be required to update these entries, especially if the processor uses the TLB. Now that the page fault has been dealt with, it can be thrown aside and the process restarted at the instruction that caused the virtual memory access error.

See filemap_nopage() in mm/filemap.c

 

3.7 The Linux Page Cache

The role of Linux's page cache is to speed up access to files on disk. Memory-mapped files are read a page at a time, and these pages are stored in the page cache. Figure 3.6 shows that the page cache consists of page_hash_table, a vector of pointers to mem_map_t data structures. Each file in Linux is identified by a VFS inode data structure (described in Section 9); each VFS inode is unique and fully describes one and only one file. Pages are indexed in the page cache by the file's VFS inode and the offset into the file.

See linux/pagemap.h
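
A toy model of the lookup key may help: pages are found by hashing the identity of the file (its VFS inode) together with the page-aligned offset into the file. The hash below is invented for illustration and is not the kernel's actual _page_hashfn.

#include <stdio.h>

/* Toy model of the page-cache lookup key described above: pages are indexed
 * by (VFS inode, offset within the file). The hash is an invented example. */
#define HASH_SIZE 1024

static unsigned long page_hash(unsigned long inode_nr, unsigned long offset)
{
    /* mix the inode number with the page-aligned file offset */
    return (inode_nr ^ (offset >> 12)) & (HASH_SIZE - 1);
}

int main(void)
{
    /* two pages of the same file usually land in different buckets */
    printf("inode 42, offset 0x0000 -> bucket %lu\n", page_hash(42, 0x0000));
    printf("inode 42, offset 0x1000 -> bucket %lu\n", page_hash(42, 0x1000));
    return 0;
}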

Whenever a page is read from a memory-mapped file, for example when it needs to be brought back into memory during demand paging, it is read through the page cache. If the page is present in the cache, a pointer to its mem_map_t data structure is returned to the page-fault handling code. Otherwise the page must be brought into memory from the file system that holds the file: Linux allocates physical memory and reads the page from the disk file. If possible, Linux also initiates a read of the next page in the file. This single-page read-ahead means that if the process is reading the file sequentially, the next page of data will already be waiting in memory.

The page cache grows as program images are read and executed. Pages are removed from the cache when they are no longer needed, for example when an image is no longer in use by any process. As Linux uses memory, physical pages can start to run low; in that case Linux can shrink the page cache.

3.8 Swapping out and Discarding Pages

When physical memory becomes scarce, the Linux memory management subsystem must attempt to free physical pages. This task falls to the kernel swap daemon (kswapd). The kernel swap daemon is a special kind of process, a kernel thread: a process without virtual memory that runs in kernel mode in the physical address space. Its name is slightly misleading, because it does more than just swap pages out to the system's swap file; its job is to make sure there are enough free pages in the system to keep the memory management system operating efficiently.

The kernel swap daemon (kswapd) is started by the kernel's init process at startup and sits waiting for the kernel's swap timer to expire. Every time the timer expires, the swap daemon checks whether the number of free pages in the system is getting too low. It uses two variables, free_pages_high and free_pages_low, to decide whether to free some pages. As long as the number of free pages in the system stays above free_pages_high, the swap daemon does nothing; it sleeps again until the next time its timer expires. For the purposes of this check, it also takes into account the number of pages currently being written out to the swap file, counted in nr_async_pages: this is incremented each time a page is queued to be written to the swap file and decremented when the write completes. free_pages_low and free_pages_high are set at system startup and are related to the number of physical pages in the system. If the number of free pages falls below free_pages_high, or worse, below free_pages_low, the kernel swap daemon tries three ways to reduce the number of physical pages in use by the system:

See kswapd() in mm/vmscan.c

Reduce the size of buffer cache and page cache

Swap out System V shared memory pages

Swap and discard pages

If the number of free pages in the system falls below free_pages_low, the core swap process will try to free 6 pages before the next run. Otherwise try to free 3 pages. Each of the above methods is tried until enough pages are freed. The core swap process records the last method it used to free physical pages. Each time it runs it will first try the last successful method to free the page.

After freeing enough pages, the swap process sleeps again until its timer expires again. If the reason the kernel swap process frees pages is that the number of system free pages is less than free_pages_low, it only sleeps half the time it normally would. As long as the number of free pages is greater than free_pages_low, the swap process resumes checking at the original interval.
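
The decision the daemon makes each time its timer expires can be summarised by the toy function below. The threshold values and the treatment of nr_async_pages are assumptions made for the illustration; the real logic is in mm/vmscan.c.

#include <stdio.h>

/* Toy model of the kswapd decision described above: thresholds and page
 * counts are invented for illustration. */
static int free_pages_low  = 50;
static int free_pages_high = 100;

static int pages_to_free(int nr_free_pages, int nr_async_pages)
{
    int effective = nr_free_pages + nr_async_pages; /* count pages already on their way out */

    if (effective >= free_pages_high)
        return 0;                 /* plenty free: do nothing, sleep again */
    if (effective < free_pages_low)
        return 6;                 /* dangerously low: try to free 6 pages */
    return 3;                     /* merely low: try to free 3 pages      */
}

int main(void)
{
    printf("%d\n", pages_to_free(120, 0)); /* 0 */
    printf("%d\n", pages_to_free(80, 5));  /* 3 */
    printf("%d\n", pages_to_free(30, 2));  /* 6 */
    return 0;
}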

3.8.1 Reducing the size of the Page and Buffer Caches

Pages held in the page cache and buffers held in the buffer cache are good candidates for being freed into the free_area vector. The page cache, which contains pages of memory-mapped files, may be holding pages that are simply filling up system memory, and likewise the buffer cache, which holds data read from and written to physical devices, may contain buffers that are no longer needed. When the physical pages in the system start to run out, discarding pages from these caches is relatively easy, because it requires no writing to physical devices (unlike swapping pages out of memory). Discarding these pages has few harmful side effects other than making access to physical devices and memory-mapped files a little slower; moreover, if the discarding is done fairly, all processes suffer equally.

Every time the kernel swap daemon tries to shrink these caches, it examines a block of pages in the mem_map page vector to see whether any can be discarded from physical memory. The block of pages examined is larger if the kernel swap daemon is swapping heavily, that is, if the number of free pages in the system has fallen dangerously low. The blocks of pages are examined cyclically: a different block is examined on each attempt to shrink the memory map. This is known as the clock algorithm because, like the hands of a clock, the whole mem_map page vector is examined a few pages at a time.

See mm/filemap.c shrink_mmap()

Each page checked must be judged to be cached in the page cache or buffer cache. Note that discarding of shared pages is not considered at this time, and a page will not be in both caches at the same time. If the page is not in either buffer, the next page of the mem_map page vector table is checked.

Buffers are cached within pages (this makes buffer allocation and deallocation more efficient). If the page being examined has buffers in the buffer cache, the code that shrinks the memory map tries to free the buffers contained in that page. If all of the buffers in the page are freed, the page containing them is freed too. If the page being examined is in the Linux page cache, it is removed from the page cache and freed.

see fs/buffer.c free_buffer()

If this attempt frees enough pages, the core swap process will continue to wait until the next periodic wakeup. Because the freed pages do not belong to any process' virtual memory (just cached pages), there is no need to update the process's page table. If there are still not enough discarded cache pages, the swap process will try to swap out some shared pages.

3.8.2 Swapping Out System V Shared Memory Pages

Shared memory in System V is an inter-process communication mechanism that exchanges information by sharing virtual memory between two or more processes. How memory is shared between processes is discussed in detail in Section 5. For now, it is enough to say that each piece of System V shared memory is described by a shmid_ds data structure. It includes a pointer to the vm_area_struct linked list data structure for every process that shares this memory. The Vm_area_struct data structure describes the location of this shared memory in each process. Each vm_area_struct structure in this System V memory is linked together with vm_next_shared and vm_prev_shared pointers. Each shmid_ds data structure has a linked list of page table entries, each of which describes the correspondence between a shared virtual page and a physical page.

The kernel swap daemon also uses a clock algorithm when swapping out System V shared memory pages. Each time it runs, it remembers which page of which shared memory area it swapped out last, using two indices: the first into the array of shmid_ds data structures, the second into the list of page table entries for that shared memory area. This spreads the cost fairly across the System V shared memory areas.

See ipc/shm.c shm_swap()

Because the physical page number for a given virtual page of System V shared memory is held in the page tables of every process sharing that memory, the kernel swap daemon must modify all of those page tables to show that the page is no longer in memory but is now in the swap file. For each shared page being swapped out, the daemon finds the corresponding page table entry in each sharing process's page tables (by following the pointer from each vm_area_struct). If the entry for this shared memory page in a process's page table is valid, the daemon makes it invalid, marks it as a swapped-out page and decrements the page's use count by one. The format of a swapped-out System V shared page table entry contains an index into the array of shmid_ds data structures and an index into the page table entries for that shared memory area.

If all shared memory has been modified and the number of pages in use becomes 0, the shared page can be written to the swap file. The entry for this page in the page table pointed to by the shmid_ds data structure of this System V shared memory area will be replaced with the swapped page table entry. The swapped out page table entry is invalid but contains an index to the open swap file and the offset of this page within this file. This information is used to retrieve the page back into physical memory.

3.8.3 Swapping Out and Discarding Pages

The swap daemon looks at each process in the system in turn to see whether it is a candidate for swapping. Good candidates are processes that can be swapped (some cannot) and that have one or more pages that can be swapped out of memory or discarded. Pages are swapped from physical memory into the system's swap file only if the data in them cannot be obtained any other way.

See mm/vmscan.c swap_out()

A lot of the contents of an executable image come from the image's file and can easily be re-read from that file. For example, the executable instructions of an image are never modified by the image itself, so they never need to be written out to the swap file. Such pages are simply discarded; when the process references them again, they are brought back into memory from the executable image.

Once the process to be swapped is determined, the swapping process looks at all of its virtual memory regions, looking for regions that are not shared or locked. Linux does not swap out all the pages of the selected process that can be swapped out, but only removes a small number of pages. If a page is locked in memory, it cannot be swapped or discarded.

See mm/vmscan.c swap_out_vma(), which follows the vm_next pointers through the vm_area_struct structures held in the process's mm_struct.

The Linux swap algorithm uses page ageing. Each page has a counter (held in its mem_map_t data structure) that tells the kernel swap daemon whether the page is worth swapping out. Pages age when they are unused and are rejuvenated when they are accessed; the swap daemon only swaps out old pages. By default a page is given an age of 3 when it is first allocated. Each time it is accessed its age increases by 3, up to a maximum of 20, and each time the kernel swap daemon runs it ages the page, decrementing its age by 1. This default behaviour can be changed, which is why this information (and other swap-related information) is kept in the swap_control data structure.
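
The ageing rule is simple enough to model directly; the sketch below just restates the numbers from the paragraph above (initial age 3, +3 per access capped at 20, -1 per pass of the swap daemon):

#include <stdio.h>

/* Toy model of the page-age rule described above. A page whose age reaches
 * 0 becomes a candidate for swapping or discarding. */
#define INITIAL_AGE 3
#define MAX_AGE     20

struct page_age { int age; };

static void page_allocated(struct page_age *p) { p->age = INITIAL_AGE; }

static void page_touched(struct page_age *p)
{
    p->age += 3;
    if (p->age > MAX_AGE)
        p->age = MAX_AGE;
}

static void swap_daemon_pass(struct page_age *p)
{
    if (p->age > 0)
        p->age -= 1;
}

int main(void)
{
    struct page_age p;
    page_allocated(&p);               /* age = 3 */
    page_touched(&p);                 /* age = 6 */
    for (int i = 0; i < 6; i++)
        swap_daemon_pass(&p);         /* age = 0: candidate for swap-out */
    printf("final age = %d\n", p.age);
    return 0;
}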

When a page is old (age = 0), the swap daemon processes it further. Dirty pages can be swapped out; Linux marks such pages using an architecture-specific bit in the PTE that describes the page (see Figure 3.2). However, not all dirty pages need to be written to the swap file. Each virtual memory area of a process may have its own swap operation (pointed to by the vm_ops pointer in its vm_area_struct); if it does, the swap daemon uses it. Otherwise the swap daemon allocates a page in the swap file and writes the page out to it.

The page table entry for this page is replaced with an invalid entry, but contains information about the page in the swap file: the offset within the file where the page is located and the swap file used. Regardless of the swap, the original physical page is put back into the free_area for re-release. Clean (or not dirty) pages can be discarded and put back into the free_area for reuse.

If enough pages for the swappable process are swapped or discarded, the swap process sleeps again. The next time it wakes up it will consider the next process in the system. In this way, the swap process nibbles away each process's physical pages until the system is rebalanced. This is fairer than swapping out the entire process.

3.9 The Swap Cache

When swapping pages to the swap file, Linux avoids writing unnecessary pages. Sometimes a page may exist in both the swap file and physical memory. This happens when a page is swapped out of memory and then brought back into memory when the process wants to access it. As long as the page in memory has not been written to, the copy in the swap file continues to be valid.

Linux uses the swap cache to track these pages. The swap cache is a list of page table entries, one per physical page in the system. A swap cache entry for a swapped-out page describes which swap file the page is held in and where in that swap file it is. If a swap cache entry is non-zero, the corresponding page held in the swap file has not been modified. If the page is subsequently modified (written to), its entry is removed from the swap cache.

When Linux needs to swap a physical page to the swap file, it looks in the swap cache, and if there is a valid entry for the page, it doesn't need to write the page to the swap file. Because this page in memory has not been modified since the last time the swap file was read.

Entries in the swap cache are page table entries that were once swapped out. They are marked as invalid, but contain information that allows Linux to find the correct swap file and the correct pages in the swap file.

3.10 Swapping Page In

Dirty pages stored in the swap file may need to be accessed again. For example: when the application wants to write data to the virtual memory, and the physical page corresponding to this page is swapped to the swap file. Accessing virtual memory pages that are not in physical memory will cause a page fault. A page fault is a signal from the processor to inform the operating system that it cannot convert virtual memory to physical memory. Because the page table entry in virtual memory describing this page is marked as invalid after swapping out. The processor cannot handle the virtual to physical address translation, transferring control back to the operating system, telling it the virtual address of the error and the reason for the error. The format of this information and how the processor transfers control back to the operating system is processor type dependent. The processor-dependent page fault handling code must locate the data structure vm_area_struct that describes the virtual memory area containing the virtual address of the fault. It works by looking through the process' vm_area_struct data structure until it finds the one that contains the erroneous virtual address. This is very time-critical code, so a process's vm_area_struct data structure is arranged in a particular way to make this lookup take as little time as possible.

see arch/i386/mm/fault.c do_page_fault()

Having carried out the appropriate processor-specific actions and found a valid area of virtual memory containing the faulting virtual address, the page fault handling becomes generic and applicable to all the processors that Linux runs on. The generic page fault handling code looks up the page table entry for the faulting virtual address. If the entry it finds is for a swapped-out page, Linux must swap the page back into physical memory. The format of a page table entry for a swapped-out page is processor-specific, but all processors mark such pages as invalid and put the information needed to locate the page within the swap file into the page table entry. Linux uses this information to bring the page back into physical memory.

See mm/memory.c do_no_page()

At this point Linux knows the faulting virtual address and has a page table entry describing where the page has been swapped to. The vm_area_struct data structure may contain a pointer to a routine that swaps pages of this area of virtual memory back into physical memory: the swapin operation. If there is a swapin operation for this area of memory, Linux uses it. This is, in fact, why swapped-out System V shared memory needs special handling: the format of a swapped-out System V shared memory page differs from that of an ordinary swapped-out page. If there is no swapin operation, Linux assumes this is an ordinary page that needs no special handling: it allocates a free physical page and reads the swapped-out page back from the swap file. Information telling it where in the swap file (and which swap file) is taken from the invalid page table entry.

See mm/page_alloc.c swap_in()

If the access that caused the page fault was not a write access, the page is left in the swap cache and its page table entry is marked as unwritable. If the page is later written, another page fault occurs, at which time the page is marked as dirty and its entry is removed from the swap cache. If the page has not been modified and needs to be swapped out, Linux can avoid writing the page to the swap file because the page is already in the swap file.

If the access that caused the page to be brought back from the swap file was a write access, the page is removed from the swap cache and its page table entry is marked as both dirty and writable.

4. Processes

Processes carry out tasks within the operating system. A program is a passive entity: a set of machine code instructions and data stored as an executable image on disk. A process, by contrast, can be thought of as a computer program in action; it is a dynamic entity, constantly changing as the processor executes its machine code instructions. In addition to the program's instructions and data, a process also includes the program counter and all of the CPU's registers, as well as the process stacks holding temporary data such as routine parameters, return addresses and saved variables. The currently executing program, or process, includes all of the current activity in the microprocessor. Linux is a multi-process operating system. Processes are separate tasks, each with its own rights and responsibilities. If one process crashes, it should not cause another process in the system to crash. Each individual process runs in its own virtual address space and cannot interact with other processes except through secure, kernel-managed mechanisms.

During the lifetime of a process it uses many system resources. It uses the system's CPU to execute its instructions and the system's physical memory to store it and its data. It opens and uses files in the file system, and directly or indirectly uses the system's physical devices. Linux must keep track of the process itself and the system resources it uses in order to manage fairly that process and other processes in the system. If a process monopolizes most of the system's physical memory and CPU, it is unfair to other processes.

The most precious resource in the system is the CPU, of which there is usually only one. Linux is a multi-process operating system; its objective is to keep a process running on each CPU in the system at all times, making full use of the CPUs. If there are more processes than CPUs (and there usually are), the remaining processes must wait for a CPU to become free before they can run. Multi-processing is a simple idea: a process runs until it must wait, usually for some system resource, and can continue only once the resource becomes available. In a single-process system such as DOS, the CPU simply sits idle during the wait and the waiting time is wasted. In a multi-process system, many processes are kept in memory at the same time; whenever a process has to wait, the operating system takes the CPU away from it and gives it to a more deserving process. It is the scheduler that chooses the most appropriate process to run next, and Linux uses a number of scheduling strategies to ensure fairness.

Linux supports many different executable file formats, ELF is one of them, Java is another. Linux must manage these files transparently because processes use the system's shared libraries.

4.1 Linux Processes

In Linux, each process is represented by a task_struct data structure (the terms task and process are used interchangeably in Linux), which is used to manage the processes in the system. The task vector is an array of pointers to every task_struct data structure in the system. This means that the maximum number of processes in the system is limited by the size of the task vector, which by default has 512 entries. As processes are created, a new task_struct is allocated from system memory and added into the task vector. To make it easy to find, the currently running process is pointed to by the current pointer.

see include/linux/sched.h

In addition to normal processes, Linux also supports real-time processes. These processes have to react very quickly to external events (hence the term "real-time"), and the scheduler treats them differently from normal user processes. Although the task_struct data structure is large and complex, its fields can be divided into the following functional areas:

State As a process executes, it changes state according to its circumstances. Linux processes use the following states (SWAPPING is omitted here because it does not appear to be used):

Running The process is either running (it is the current process of the system) or ready to run (waiting to be assigned to one of the system's CPUs).

Waiting The process is waiting for an event or a resource. Linux distinguishes two types of waiting process: interruptible and uninterruptible. An interruptible waiting process can be interrupted by a signal, whereas an uninterruptible waiting process is waiting directly on a hardware condition and cannot be interrupted under any circumstances.

Stopped The process has been stopped, usually by receiving a signal. A process that is being debugged can be in the stopped state.

Zombie A terminated process that, for some reason, still has a task_struct entry in the task vector. It is what it sounds like: a dead process.

Scheduling Information The scheduler needs this information to fairly decide which of the processes in the system should run.

Identifiers Every process in the system has a process identifier. The process identifier is not an index into the task vector; it is simply a number. Each process also has user and group identifiers, which are used to control the process's access to the files and devices in the system.

Inter-Process Communication Linux supports traditional UNIX-IPC mechanisms, namely signals, pipes and semaphores, as well as System V's IPC mechanisms, namely shared memory, semaphores and message queues. The IPC mechanisms supported by Linux are described in Section 5.

Links In a Linux system, no process is completely independent of other processes. Every process in the system, except the initial process, has a parent process. The new process is not created, but copied, or cloned from the previous process. Each process' task_struct has pointers to its parent and sibling processes (processes with the same parent) and its child processes.

On Linux systems you can see the family relationships of running processes with the pstree command:

init(1)-+-crond(98)
|-emacs(387)
|-gpm(146)
|-inetd(110)
|-kerneld(18)
|-kflushd(2)
|-klogd(87)
|-kswapd(3)
|-login(160)---bash(192)---emacs(225)
|-lpd(121)
|-mingetty(161)
|-mingetty(162)
|-mingetty(163)
|-mingetty(164)
|-login(403)---bash(404)---pstree(594)
|-sendmail(134)
|-syslogd(78)
`-update(166)

In addition, all the process information in the system is also stored in a doubly linked list of task_struct data structure, and the root is the init process. This table allows Linux to find all the processes in the system. It needs this table to provide support for commands such as ps or kill.

Times and Timers Throughout a process's lifetime, the kernel keeps track of its creation time as well as the CPU time it consumes. Each clock tick, the kernel updates the amount of time, in jiffies, that the current process has spent in system mode and in user mode. Linux also supports process-specific interval timers: a process can use a system call to set up a timer that sends a signal to itself when the timer expires. Such timers can be one-shot or periodic (a small user-space illustration appears after this list).

File system Processes can open and close files as needed, and the process's task_struct holds pointers to a descriptor for each open file as well as pointers to two VFS inodes. Each VFS inode uniquely describes a file or directory within a file system and also provides a uniform interface to the underlying file systems (how file systems are supported under Linux is described in Section 9). The first inode is the root of the process (its home directory); the second is its current, or pwd, directory (pwd is taken from the Unix command pwd, print working directory). These two VFS inodes have their count fields incremented to show that one or more processes are referencing them. This is why you cannot delete a directory that some process has as its current working directory.

Virtual memory Most processes have some virtual memory (kernel threads and kernel daemons do not), and the Linux kernel must know how this virtual memory is mapped to the system's physical memory.

Processor Specific Context A process can be thought of as the sum total of the system's current state. Whenever a process runs, it uses the processor's registers, stacks and so on. When a process is suspended, all of this CPU-specific context must be saved into the process's task_struct; when the scheduler restarts the process, its context is restored from there.
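
The interval timers mentioned under "Times and Timers" above can be observed from user space with setitimer(): the process asks the kernel for a periodic timer and receives a SIGALRM signal each time it expires.

#include <stdio.h>
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

/* Small user-space illustration of periodic interval timers. */
static volatile sig_atomic_t ticks;

static void on_alarm(int sig)
{
    (void)sig;
    ticks++;
}

int main(void)
{
    struct itimerval tv = {
        .it_interval = { .tv_sec = 1, .tv_usec = 0 },  /* periodic: every second   */
        .it_value    = { .tv_sec = 1, .tv_usec = 0 },  /* first expiry in 1 second */
    };

    signal(SIGALRM, on_alarm);
    setitimer(ITIMER_REAL, &tv, NULL);

    while (ticks < 3)
        pause();                       /* sleep until a signal arrives */

    printf("received %d timer signals\n", (int)ticks);
    return 0;
}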

4.2 Identifiers

Linux, like all flavours of Unix, uses user and group identifiers to check access rights to the files and images in the system. All of the files in a Linux system have ownerships and permissions, and these permissions describe what access the system's users have to that file or directory. The basic permissions are read, write and execute, and they are assigned to three classes of user: the owner of the file, processes belonging to a particular group, and all of the other processes in the system. Each class of user can have different permissions; for example, a file could allow its owner to read and write it, its group to read it, and give all other processes in the system no access at all.

Linux uses groups to give permissions to a group of users to a file or directory, rather than to a single user or process on the system. For example, you can create a group for all users in a software project, so that only they can read and write the source code of the project. A process can belong to several groups (32 by default), and these groups are placed in the groups vector table in the task_struct structure of each process. As long as one of the groups to which a process belongs has access to a file, the process has the appropriate group permissions for the file.

There are 4 pairs of process and group identifiers in the task_struct of a process.

uid, gid The identifier of the user and of the group on whose behalf the process is running.

Effective uid and gid Some programs change the uid and gid of the executing process to their own, held in the VFS inode of the executable image; such programs are known as setuid programs. This is useful because it is a way of restricting access to services, particularly those that run on behalf of someone else, for example network daemons. The effective uid and gid are taken from the setuid program, while the uid and gid remain as they were. The kernel checks the effective uid and gid whenever it checks privileges.

File system uid and gid These are normally the same as the effective uid and gid and are used when checking access rights to the file system. They are needed for NFS-mounted file systems, where the user-mode NFS server needs to access files as though it were a particular process. In this case only the file system uid and gid are changed (not the effective uid and gid). This avoids a situation where malicious users could send kill signals to the NFS server; kill signals are delivered to processes with a particular effective uid and gid.

Saved uid and gid These are required by the POSIX standard and are used by programs that change the process's uid and gid via system calls. They hold the real uid and gid while the original uid and gid are changed.
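
The difference between the real and effective identifiers is easy to observe from user space. Run normally, the two match; run as a setuid program (say, a root-owned binary with the setuid bit set), geteuid() would report the file owner while getuid() still reports the invoking user.

#include <stdio.h>
#include <unistd.h>

/* Minimal illustration of the identifier pairs discussed above. */
int main(void)
{
    printf("real uid = %d, effective uid = %d\n", (int)getuid(), (int)geteuid());
    printf("real gid = %d, effective gid = %d\n", (int)getgid(), (int)getegid());
    return 0;
}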

4.3 Scheduling

All processes run partly in user mode and partly in system mode. How the underlying hardware supports these modes differs, but generally there is a secure mechanism for getting from user mode into system mode and back again. User mode has far fewer privileges than system mode. Each time a process makes a system call, it switches from user mode to system mode and continues executing there; at this point the kernel is executing on behalf of the process. In Linux, processes do not pre-empt the currently running process: they cannot stop it from running so that they can run instead. Each process relinquishes the CPU when it has to wait for some system event. For example, a process may have to wait for a character to be read from a file. This waiting happens inside a system call, in system mode: the process uses a library function to open and read the file, which in turn makes a system call to read bytes from the open file. In this case the waiting process is suspended and another, more deserving, process is chosen to run. Processes are always making system calls and so often need to wait. Even so, a process might use an unfair share of CPU time if it never had to wait, so Linux uses pre-emptive scheduling: each process is allowed to run for a small amount of time, 200 milliseconds, and when this time has elapsed another process is selected to run while the original process waits a little while until it can run again. This small amount of time is known as a time slice.

The scheduler must select the most deserving process out of all of the runnable processes in the system. A runnable process is one which is waiting only for a CPU to run on. Linux uses a reasonably simple priority-based scheduling algorithm to choose between the current processes in the system. When it has chosen a new process to run, it saves the state of the current process, the processor-specific registers and other context that must be preserved, into the process's task_struct data structure. It then restores the state of the new process to be run (again, processor-specific) and gives control of the system to that process. For the scheduler to allocate CPU time fairly among all of the runnable processes in the system, it keeps the following information in each process's task_struct:

See also kernel/sched.c schedule()

policy  The scheduling policy of the process. There are two types of Linux process: normal and real-time. Real-time processes have a higher priority than all other processes; if there is a real-time process ready to run, it always runs first. Real-time processes may use one of two policies: round robin and first in first out. Under round robin scheduling, each runnable real-time process runs in turn; under first-in-first-out scheduling, each runnable process runs in the order that it appears in the run queue and that order is never changed.

Priority The scheduling priority of the process. Also the amount of time (jiffies) it can use when it is allowed to run. You can change the priority of a process through system calls or the renice command.

Rt_priority Linux supports real-time processes. These processes have higher priority than other non-real-time processes in the system. This field allows the scheduler to assign a relative priority to each real-time process. The priority of a real-time process can be modified with a system call

Counter  The amount of time (in jiffies) that this process is allowed to run for this time. It is set to the process's priority when the process first runs and is decremented on each clock tick.

The scheduler is run from several places within the kernel. It is run after the current process has been put onto a wait queue, and it may also be run at the end of a system call, just before the process returns from system mode to user mode. Another reason for the scheduler to run is that the system clock has just set the current process's counter to zero. Each time the scheduler runs, it does the following:

See also kernel/sched.c schedule()

Kernel work  The scheduler runs the bottom half handlers and processes the system's queue of scheduled tasks. These lightweight kernel threads are described in detail in Section 11.

Current process  The current process must be dealt with before another process can be selected to run.

If the scheduling policy of the current process is round robin, it is placed at the end of the run queue.

If the task is interruptible and it received a signal the last time it was scheduled, its state changes to RUNNING

If the current process times out, its state becomes RUNNING

If the state of the current process is RUNNING, keep this state

Processes that are not RUNNING or INTERRUPTIBLE are removed from the run queue. This means that such processes are not considered when the scheduler looks for the most worthwhile processes to run.

Process selection  The scheduler looks through the processes on the run queue for the most deserving process to run. If there are any real-time processes (those with a real-time scheduling policy), they are weighted more heavily than ordinary processes. The weight of a normal process is its counter, but for a real-time process it is its counter plus 1000. This means that if there is any runnable real-time process in the system, it will always run before any normal runnable process. The current process, which has used up some of its time slice (its counter has been decremented), is at a disadvantage compared with other processes of the same priority in the system, as it should be. If several processes have the same priority, the one nearest the front of the run queue is chosen and the current process is put to the back of the run queue. In a balanced system with many processes of the same priority, the processes therefore run in turn; this is known as round robin scheduling. However, because processes have to wait for resources, their running order tends to get shuffled around. (A minimal sketch of this weighting appears after this walkthrough.)

Swap processes  If the most deserving process is not the current process, the current process must be suspended and the new one made to run. When a process runs, it uses the CPU's registers and the system's physical memory. Each time it calls a routine it passes its arguments in registers or on the stack, and it saves values such as the return address of the calling routine. So, when the scheduler runs, it runs in the context of the current process. It may be in a privileged mode, kernel mode, but it is still the currently running process. When that process is to be suspended, all of its machine state, including the program counter (PC) and all of the processor's registers, must be saved in the process's task_struct data structure. Then all of the machine state of the new process must be loaded. This operation is system-dependent; no two CPUs do it in quite the same way, but there is usually some hardware assistance.

This swapping of process context takes place at the end of the scheduler. The context saved for the previous process is therefore a snapshot of the hardware context of the system as it was at the end of the scheduler for that process. Equally, when the context of the new process is loaded, it too is a snapshot taken at the end of the scheduler, including the contents of that process's program counter and registers.

If the previous process or the new current process uses virtual memory, the system's page tables need to be updated. Again, this action is architecture-dependent. Alpha AXP processors, using TLT (Translation Look-aside Table) or cached page table entries, must clear cached page table entries belonging to the previous process.
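To make the process-selection weighting concrete, here is a minimal C sketch of the rule described above. The structure and function names (proc, weight, pick_next) are illustrative stand-ins only; they are not the kernel's actual data structures or code.

/* Illustrative sketch of the scheduler weighting described above.       */
struct proc {
    int counter;          /* remaining time slice, in jiffies            */
    int policy;           /* 0 = normal process, non-zero = real-time    */
    struct proc *next;    /* next process on the run queue               */
};

static int weight(const struct proc *p)
{
    if (p->policy)
        return p->counter + 1000;   /* real-time: counter plus 1000      */
    return p->counter;              /* normal: just the remaining counter */
}

/* Pick the runnable process with the highest weight, scanning the run
 * queue from front to back so that, on a tie, the process nearest the
 * front of the queue wins. */
static struct proc *pick_next(struct proc *runqueue)
{
    struct proc *best = 0;
    int best_w = -1;

    for (struct proc *p = runqueue; p; p = p->next) {
        int w = weight(p);
        if (w > best_w) {
            best_w = w;
            best = p;
        }
    }
    return best;
}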

4.3.1 Scheduling in Multiprocessor Systems

In the Linux world there are relatively few multi-CPU systems, but a lot of work has gone into making Linux an SMP (symmetric multiprocessing) operating system, that is, one able to balance the workload among the CPUs in the system. Nowhere is this load balancing more apparent than in the scheduler.

In a multiprocessor system, the desired situation is that all processors are busy running processes. Each process runs the scheduler independently until its current process runs out of time slice or has to wait for system resources. The first thing to be aware of in SMP systems is that there may be more than one idle process in the system. In a single-processor system, the idle process is the first task in the task vector table. In an SMP system, each CPU has an idle process, and you may have more than one idle CPU. In addition, each CPU has a current process, so the SMP system must record the current and idle processes of each processor.

In an SMP system, each process's task_struct contains the number of the processor it is currently running on (processor) and the number of the processor it last ran on (last_processor). There is no reason why a process should not run on a different CPU each time it is chosen to run, but Linux can use processor_mask to restrict a process to one or more CPUs: if bit N is set, the process may run on processor N. When the scheduler chooses a process to run, it does not consider processes whose corresponding bit in processor_mask is not set. The scheduler also gives a slight advantage to a process that last ran on the current processor, since there is often a performance cost in moving a process to a different processor.

4.4 Files

Figure 4.1 shows the two data structures used to describe information related to the file system in each process of the system. The first fs_struct contains the process's VFS inode and its umask. Umask is the default mode when new files are created and can be changed via system calls.

see include/linux/sched.h

The second data structure, files_struct, contains information about all of the files currently in use by the process. Programs read from standard input, write to standard output and write error messages to standard error. These may be files, terminal input/output or real devices, but from the program's point of view they are all treated as files. Every file has its own descriptor, and files_struct contains pointers to up to 256 file data structures, each one describing a file being used by this process. The f_mode field describes what mode the file has been opened in: read only, read and write, or write only. F_pos holds the position in the file where the next read or write operation will occur. F_inode points to the VFS inode describing the file, and f_ops is a pointer to a vector of routine addresses, one for each function that can be performed on the file, for example a function that writes data. This abstract interface is very powerful and allows Linux to support a wide variety of file types. As we shall see, pipes in Linux are also implemented with this mechanism.

Each time a file is opened, a free file pointer in files_struct is used to point to the new file structure. There are 3 file descriptors already open when the Linux process starts. These are standard input, standard output, and standard error, all inherited from the parent process that created them. Access to files is through standard system calls, which need to pass or return a file descriptor. These descriptors are indexes into the process's fd vector table, so the file descriptors for standard input, standard output, and standard error are 0, 1, and 2, respectively. All access to the file is accomplished using the file operation routines in the file data structure along with its VFS inode.
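As a small illustration of the inherited descriptors 0, 1 and 2, the following user-space C program writes directly to standard output and standard error by descriptor number. It is an example of the convention described above, not kernel code.

#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *out = "written to file descriptor 1 (standard output)\n";
    const char *err = "written to file descriptor 2 (standard error)\n";

    /* Descriptors 0, 1 and 2 were inherited from the parent process
       (typically the shell) that created this process. */
    write(1, out, strlen(out));
    write(2, err, strlen(err));
    return 0;
}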

4.5 Virtual Memory

The virtual memory of a process contains executable code and data from several sources. First there is the loaded program image, for example the ls command; this command, like all execution images, consists of executable code and data, and the image file contains all of the information needed to load them into the process's virtual memory. Secondly, a process can allocate (virtual) memory while it is running, for example to hold the contents of a file it is reading; this newly allocated virtual memory must be linked into the process's existing virtual memory before it can be used. Thirdly, Linux processes use libraries of common code, such as file handling routines. It does not make sense for each process to have its own copy of a library, so Linux uses shared libraries that can be used by several concurrently running processes. The code and data in these shared libraries must be linked into the virtual address space of this process and into the virtual address spaces of the other processes sharing the library.

At a given time, a process does not use all the code and data contained in its virtual memory. It may include code intended to be used in specific situations, such as initialization or handling specific events. It may just be using some of the routines in its shared library. It would be a waste to load all this code into physical memory and not use it. Multiply this waste by the number of processes in the system, and the operating efficiency of the system will be very low. Linux instead uses demand paging technology, where the virtual memory of a process is brought into physical memory only when the process tries to use it. Therefore, Linux does not directly load code and data into memory, but modifies the page table of the process to mark these virtual areas as existing but not in memory. When a process tries to access this code or data, the system hardware will generate a page fault, passing control to the Linux kernel for processing. Therefore, for each virtual memory region of the process address space, Linux needs to know where it came from and how to put it in memory so that it can handle these page faults.

The Linux kernel needs to manage all of these virtual memory areas, and the contents of each process's virtual memory are described by an mm_struct data structure pointed at from its task_struct. The process's mm_struct data structure also contains information about the loaded execution image and a pointer to the process's page tables. It contains pointers to a set of vm_area_struct data structures, each representing an area of virtual memory within the process.

This linked list is sorted in virtual memory order. Figure 4.2 shows the virtual memory distribution of a simple process and the core data structures that manage it. Because these virtual memory areas come from different sources, Linux abstracts the interface by vm_area_struct pointing to a set of virtual memory processing routines (via vm_ops). This way all virtual memory for a process can be handled in a consistent way, regardless of the underlying service that manages this memory. For example, there would be a generic routine that is called when a process tries to access non-existent memory, which is how page faults are handled.

When Linux creates new virtual memory areas for a process and handles references to virtual memory that is not in the system's physical memory, the process's list of vm_area_struct data structures is searched repeatedly. This means that the time it takes to find the correct vm_area_struct data structure matters for the performance of the system. To speed up this access, Linux also arranges the vm_area_struct data structures into an AVL (Adelson-Velskii and Landis) tree. The tree is arranged so that each vm_area_struct (or node) has a left and a right pointer to its neighbouring vm_area_struct structures: the left pointer points to a node with a lower starting virtual address and the right pointer points to a node with a higher starting virtual address. To find the correct node, Linux starts at the root of the tree and follows each node's left and right pointers until the correct vm_area_struct is found. Of course, nothing is for free: inserting a new vm_area_struct into this tree takes additional processing time.

When a process allocates virtual memory, Linux does not actually reserve physical memory for the process. Instead it describes the virtual memory with a new vm_area_struct data structure, which is linked into the process's list of virtual memory areas. When the process tries to write to this new virtual memory area, the system page faults: the processor attempts to decode the virtual address but, as there is no page table entry for that memory, it gives up and raises a page fault exception, which the Linux kernel handles. Linux checks whether the referenced virtual address is in the process's virtual address space; if it is, Linux creates the appropriate PTEs and allocates a physical page of memory for the process. The corresponding code or data may need to be brought in from the file system or the swap disk. The process is then restarted at the instruction that caused the page fault and, because the memory now physically exists, it can continue.
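The following is a schematic C sketch of the demand-paging decision just described: walk the process's list of memory areas and either build a page table entry or treat the access as invalid. The names used here (vm_area, find_area, handle_fault) are simplified stand-ins and not the kernel's real interfaces.

/* Schematic outline of the demand-paging path described above. */
struct vm_area {
    unsigned long vm_start, vm_end;   /* the address range this area covers */
    struct vm_area *vm_next;          /* list sorted by starting address     */
};

/* Find the area containing addr, or return 0 if the address is unmapped. */
static struct vm_area *find_area(struct vm_area *list, unsigned long addr)
{
    for (struct vm_area *vma = list; vma; vma = vma->vm_next)
        if (addr >= vma->vm_start && addr < vma->vm_end)
            return vma;
    return 0;
}

/* Called on a page fault.  A valid address leads to allocating a page and
 * building a page table entry; an invalid one would mean SIGSEGV for the
 * process.  Returns 0 on success, -1 for an invalid access. */
static int handle_fault(struct vm_area *list, unsigned long addr)
{
    struct vm_area *vma = find_area(list, addr);
    if (!vma)
        return -1;   /* outside the process's address space */
    /* Here the kernel would allocate a physical page, fill it from the
       file system or swap if necessary, create the page table entry and
       restart the faulting instruction. */
    return 0;
}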

4.6 Creating a Process

When the system starts up it is running in kernel mode and there is, at that point, only one process: the initial process. Like all other processes, the initial process has a machine state represented by stacks, registers and so on. This information is stored in the initial process's task_struct data structure as other processes in the system are created and run. At the end of system initialization, the initial process starts up a kernel thread (called init) and then sits in an idle loop, doing nothing. Whenever there is nothing else to do, the scheduler runs this idle process. The idle process's task_struct is the only one that is not dynamically allocated; it is statically defined when the kernel is built and, somewhat confusingly, it is called init_task.

The init kernel thread, or process, has process identifier 1 and is the system's first real process. It does some initial setting up of the system (such as opening the system console and mounting the root file system) and then executes the system initialization program. Depending on your system, this may be one of /etc/init, /bin/init or /sbin/init. The init program uses /etc/inittab as a script file to create new processes within the system. These new processes may themselves create new processes; for example, the getty process may create a login process when a user tries to log in. All of the processes in the system are descendants of the init kernel thread.

New processes are created by cloning old processes, or rather by cloning the current process. A new task is created by a system call (fork or clone), and the cloning happens in kernel mode within the kernel. At the end of the system call there is a new process waiting for the scheduler to choose it to run. One or more physical pages are allocated from the system's physical memory for the new task_struct data structure and for the cloned process's stacks (user and kernel). A process identifier is then created that is unique within the system's set of process identifiers; alternatively, the cloned process may keep its parent's process identifier. The new task_struct is entered into the task vector table, and the contents of the old (current) process's task_struct are copied into the cloned task_struct.

See kernel/fork.c do_fork()
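From user space, process cloning is reached through the fork() system call. The short program below is a conventional usage example of that interface; it simply creates a child and waits for it, and is not taken from the kernel sources.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();             /* clone the current process */

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: starts life as a copy of the parent, sharing pages
           copy-on-write until either side writes to them. */
        printf("child  pid=%d, parent=%d\n", (int)getpid(), (int)getppid());
        return 0;
    }
    /* Parent: fork() returned the new child's process identifier. */
    waitpid(pid, NULL, 0);
    printf("parent pid=%d created child %d\n", (int)getpid(), (int)pid);
    return 0;
}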

When cloning processes, Linux allows two processes to share resources instead of having separate copies. Including process files, signal handling and virtual memory. When sharing these resources, their corresponding count fields increase or decrease accordingly, so that Linux will not release these resources until both processes stop using them. For example, if the cloned process wants to share virtual memory, its task_struct will include a pointer to the mm_struct of the original process, and the count field of mm_struct is incremented to indicate the number of processes currently sharing it.

Cloning a process's virtual memory is rather tricky. A set of vm_area_struct data structures, the corresponding mm_struct data structure and the cloned process's page tables must be generated without copying the process's virtual memory. That would be a difficult and time-consuming task, because part of that virtual memory may be in physical memory while another part may be in the swap file. Instead, Linux uses a technique called "copy on write": virtual memory is only copied when one of the two processes tries to write to it. Any virtual memory that is not written to, even if it could be, can be shared between the two processes without any harm. Read-only memory, such as executable code, is always shared. To implement "copy on write", the page table entries of writable areas are marked read-only and the vm_area_struct data structures describing them are marked "copy on write". When one of the processes attempts to write to this virtual memory a page fault occurs; at that point Linux makes a copy of the memory and fixes up the page tables and virtual memory data structures of the two processes.

4.7 Times and Timers

The core tracks the CPU time of the process and some other time. Every clock cycle, the kernel updates the jiffies of the current process to represent the sum of the time spent in system and user mode.

In addition to these accounting timers, Linux also supports process-specific interval timers. A process can use these timers to signal itself when these timers expire. Three types of interval timers are supported:

See kernel/itimer.c

Real This timer uses real-time timing, and when the timer expires, a SIGALRM signal is sent to the process.

Virtual This timer only ticks when the process is running. When it expires, it sends a SIGVTALRM signal to the process.

Profile This timer ticks both when the process is running and when the system is executing on the process's behalf. When it expires, a SIGPROF signal is sent.

One or all of the interval timers may be running, and Linux keeps all of the necessary information in the process's task_struct data structure. These interval timers can be created, started and stopped, and their current values read, using system calls. The virtual and profile timers are handled in the same way: on every clock tick the current process's timers are decremented and, if they have expired, the appropriate signal is sent.

See kernel/sched.c do_it_virtual(), do_it_prof()
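From a program's point of view these interval timers are driven by the setitimer() system call and the signals listed above. The sketch below arms the real timer (ITIMER_REAL) and counts a few SIGALRM deliveries; it is an ordinary usage example and makes no claims about the kernel internals.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t ticks;

static void on_alarm(int sig)
{
    (void)sig;
    ticks++;                          /* SIGALRM delivered by the real timer */
}

int main(void)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    memset(&it, 0, sizeof(it));
    it.it_value.tv_sec = 1;           /* first expiry after one second  */
    it.it_interval.tv_sec = 1;        /* then every second after that   */
    setitimer(ITIMER_REAL, &it, NULL);

    while (ticks < 3)
        pause();                      /* sleep until a signal arrives   */
    printf("received %d SIGALRM signals\n", (int)ticks);
    return 0;
}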

The real-time interval timer is slightly different. The mechanism Linux uses for timers is described in Section 11. Each process has its own timer_list data structure which, when the real-time interval timer is running, is queued on the system's timer list. When it expires, the timer bottom half handler removes it from the queue and calls the interval timer handler. This generates the SIGALRM signal and restarts the interval timer, adding it back onto the system timer queue.

See also: kernel/itimer.c it_real_fn()

4.8 Executing Programs

In Linux, as in Unix, programs and commands are normally executed through a command interpreter. The command interpreter is a user process like any other and is called the shell (imagine a nut, with the kernel as the edible part in the middle and the shell surrounding it, providing an interface). There are many shells in Linux; the most commonly used are sh, bash and tcsh. Apart from a few built-in commands, such as cd and pwd, a command is an executable binary file. For each command entered, the shell searches the directories in the current process's search path (held in the PATH environment variable) for a file with a matching name. If the file is found, it is loaded and executed. The shell clones itself using the fork mechanism described above, and the child process then replaces the binary image it was executing (the shell) with the contents of the execution image file that was found. Normally the shell waits for the command to finish, that is, for the child process to exit. You can send a SIGSTOP signal to the child process by typing control-Z, which stops the child process and returns control to the shell. The shell command bg makes the shell send a SIGCONT signal to the child process, putting it into the background and running it again; it will continue to run until it finishes or needs input or output from the terminal.

The executable file can be in many formats or even a script file. Script files must be recognized and run with a suitable interpreter. For example /bin/sh interprets shell scripts. Executable object files contain executable code and data and enough other information that the operating system can load them into memory and execute them. The most commonly used object file type in Linux is ELF, and in theory Linux is flexible enough to handle almost any object file format.

Like filesystems, the binary formats that Linux can support are either built directly into the kernel when the kernel is linked or can be loaded as modules. The kernel keeps a list of supported binary formats (see Figure 4.3), and when trying to execute a file, each binary format is tried until it works. In general, the binaries supported by Linux are a.out and ELF. The executable does not need to be read completely into memory, but uses a technique called demand loading. Part of the execution image is loaded into memory when the process uses it, and unused images can be discarded from memory.

see fs/exec.c do_execve()

 

4.9 ELF

ELF (Executable and Linkable Format) object files, designed by Unix Systems Labs, are now the most commonly used format for Linux. Although there is a slight performance overhead compared to other object file formats such as ECOFF and a.out, ELF feels more flexible. ELF executable files include executable code (sometimes called text) and data. The table in the execution image describes how the program should be placed in the virtual memory of the process. Statically linked images are created with the linker (ld) or link editor, and a single image contains all the code and data needed to run the image. This image also describes the layout of the image in memory and the address in the image of the first part of the code to be executed.

Figure 4.4 shows the layout of a statically linked ELF executable image. It is a simple C program that prints "hello world" and then exits. The header describes it as an ELF image with two physical headers (e_phnum is 2), starting 52 bytes (e_phoff) from the start of the image file. The first physical header describes the executable code in the image: it goes at virtual address 0x8048000 and there are 65532 bytes of it. This is because the image is statically linked and so contains all of the library code for the printf() call that outputs "hello world". The entry point of the image, that is, the first instruction of the program, is not at the start of the image but at virtual address 0x8048090 (e_entry). The code starts immediately after the second physical header, which describes the program's data to be loaded into virtual memory at address 0x8059BB8. This data is both readable and writable. You will notice that the size of the data in the file is 2200 bytes (p_filesz) whereas its size in memory is 4248 bytes: the first 2200 bytes contain pre-initialized data, and the following 2048 bytes contain data that will be initialized by the executing code.

see include/linux/elf.h
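A quick way to see these header fields for yourself is to read the ELF header of an executable, as in the sketch below. It assumes a 64-bit ELF file on a Linux host (a 32-bit image would use Elf32_Ehdr instead) and only prints the entry point and the program header count and offset discussed above.

#include <elf.h>
#include <stdio.h>

/* Usage: ./readhdr /bin/ls */
int main(int argc, char **argv)
{
    Elf64_Ehdr ehdr;
    FILE *f;

    if (argc != 2 || !(f = fopen(argv[1], "rb"))) {
        fprintf(stderr, "usage: %s <elf-file>\n", argv[0]);
        return 1;
    }
    if (fread(&ehdr, sizeof(ehdr), 1, f) != 1) {
        fclose(f);
        return 1;
    }
    /* Check the ELF magic bytes before trusting the rest of the header. */
    if (ehdr.e_ident[EI_MAG0] != ELFMAG0 || ehdr.e_ident[EI_MAG1] != ELFMAG1) {
        fprintf(stderr, "not an ELF file\n");
        fclose(f);
        return 1;
    }
    printf("entry point     : %#lx\n", (unsigned long)ehdr.e_entry);
    printf("program headers : %u, starting at file offset %lu\n",
           (unsigned)ehdr.e_phnum, (unsigned long)ehdr.e_phoff);
    fclose(f);
    return 0;
}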

When Linux loads an ELF executable image into the virtual address space of the process, it does not actually load the image. Instead it sets up the virtual memory data structures, the process's vm_area_struct structures and its page tables. When the program runs and page faults occur, the program's code and data are brought into physical memory; unused parts of the program are never loaded into memory. Once the ELF binary format loader is satisfied that the image is a valid ELF executable image, it flushes the process's current executable image from its virtual memory. Because this process is a cloned image (as all processes are), the old image is that of the program the parent process was executing, for example the command interpreter shell bash. Flushing the old executable image discards the old virtual memory data structures and resets the process's page tables; it also clears any signal handlers that were set up and closes open files. At the end of this flush the process is ready to run the new executable image. Whatever the format of the executable image, the same information is set up in the process's mm_struct, including pointers to the start of the code and data in the image. These values are read from the physical headers of the ELF executable image, and the parts of the program they describe are mapped into the process's virtual address space. This is also when the process's vm_area_struct data structures are created and its page tables are modified. The mm_struct data structure also contains pointers to the parameters passed to the program and to the process's environment variables.

ELF Shared Libraries

Dynamically linked images, on the other hand, do not contain all of the code and data needed to run. Some of it is held in shared libraries that are linked into the image at run time, and the dynamic linker also uses the ELF shared library's tables when it links the library into the image. Linux uses several dynamic linkers, ld.so.1, libc.so.1 and ld-linux.so.1, all found in the /lib directory. The libraries contain commonly used code, such as language subroutines. Without dynamic linking, every program would need its own copy of these libraries, requiring much more disk space and virtual memory. With dynamic linking, the ELF image's tables include information on every library routine that is referenced; this information tells the dynamic linker how to locate each library routine and how to link it into the program's address space.

4.10 Script Files

Script files are executable files that require an interpreter to run. There are a large number of interpreters under Linux, such as wish, perl, and command interpreters such as tcsh. Linux uses the standard Unix convention of including the name of the interpreter on the first line of a script file. So a typical script file might start with:

#!/usr/bin/wish

The script file loader works out which interpreter the file uses. It tries to open the executable named on the first line of the script file; if it can be opened, a pointer to that file's VFS inode is obtained and the interpreter is executed to interpret the script file. The name of the script file becomes argument zero (the first argument), and all of the other arguments move up one place (the original first argument becomes the new second argument, and so on). Loading the interpreter is done in the same way as loading any other executable program in Linux: each binary format is tried in turn until one works. This means that, in theory, you could stack several interpreters and binary formats, which makes Linux's binary format handling very flexible.

See fs/binfmt_script.c do_load_script()

5. Interprocess Communication Mechanisms

Processes communicate with each other and with the kernel to coordinate their activities. Linux supports a number of mechanisms for inter-process communication (IPC). Signals and pipes are two of them, and Linux also supports the System V IPC mechanisms (named after the Unix release in which they first appeared).

5.1 Signals

Signals are one of the oldest inter-process communication methods used by Unix systems. They are used to signal asynchronous events to one or more processes. A signal can be generated from the keyboard or by an error condition, such as a process attempting to access a non-existent location in its virtual memory. The shell also uses signals to send job control signals to its child processes.

Some signals are generated by the core, and others can be generated by other privileged processes on the system. You can use the kill command (kill -l) to list your system's signal set, output on my Linux Intel system:

1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL

5) SIGTRAP 6) SIGIOT 7) SIGBUS 8) SIGFPE

9) SIGKILL 10) SIGUSR1 11) SIGSEGV 12) SIGUSR2

13) SIGPIPE 14) SIGALRM 15) SIGTERM 17) SIGCHLD

18) SIGCONT 19) SIGSTOP 20) SIGTSTP 21) SIGTTIN

22) SIGTTOU 23) SIGURG 24) SIGXCPU 25) SIGXFSZ

26) SIGVTALRM 27) SIGPROF 28) SIGWINCH 29) SIGIO

30) SIGPWR

The numbers are different on Alpha AXP Linux systems. A process can choose to ignore most of the signals that are generated, with two exceptions: SIGSTOP (which stops execution of the process) and SIGKILL (which causes the process to exit) cannot be ignored, although a process can otherwise choose how it handles signals. A process can block a signal and, if it does not block it, it can either handle the signal itself or let the kernel handle it. If the kernel handles it, the default action for that signal is performed. For example, the default action when a process receives SIGFPE (floating point exception) is to core dump and exit. Signals have no inherent priority: if two signals are generated for a process at the same time, they may be presented to the process, and handled, in any order. There is also no mechanism for handling multiple signals of the same kind; a process has no way of knowing whether it received 1 or 42 SIGCONT signals.

Linux implements the signal mechanism using information stored in each process's task_struct. The number of supported signals is limited by the word size of the processor: a processor with a 32-bit word can have 32 signals, while a 64-bit processor such as the Alpha AXP can have up to 64. The currently pending signals are held in the signal field, and the blocked field holds the mask of signals that are blocked. All signals can be blocked except SIGSTOP and SIGKILL. If a blocked signal is generated, it remains pending until it is unblocked. Linux also keeps information about how each process handles every possible signal. This information is held in an array of sigaction data structures, and each process's task_struct has a pointer to the corresponding array. Each entry contains either the address of the routine that handles the signal or a flag that tells Linux whether the process wishes to ignore the signal or let the kernel handle it. A process changes the default signal handling by making system calls that alter the sigaction for the appropriate signal and the blocked mask.
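The user-space side of this is the sigaction() system call, which fills in exactly the kind of handler information described above. The program below installs a handler for SIGINT and waits for it; it is an ordinary usage example, not a description of the kernel's own code.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal;

static void handler(int sig)
{
    got_signal = sig;               /* remember which signal arrived */
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;        /* address of our handling routine     */
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);       /* no extra signals blocked in handler */
    sigaction(SIGINT, &sa, NULL);   /* replace the default action for ^C   */

    printf("pid %d: press Control-C once to continue\n", (int)getpid());
    while (!got_signal)
        pause();
    printf("caught signal %d\n", (int)got_signal);
    return 0;
}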

Not every process in the system can send signals to every other process; only the kernel and the superuser can. Normal processes can only send signals to processes with the same uid and gid, or to processes in the same process group. Signals are generated by setting the appropriate bit in the task_struct's signal field. If the process has not blocked the signal and it is waiting but interruptible (its state is Interruptible), it is woken up by changing its state to Running and making sure it is in the run queue. In that way the scheduler will consider it a candidate to run the next time the system schedules. If default handling is required, Linux can optimize the handling of the signal: for example, if SIGWINCH (the X window changed focus) occurs and the default handler is being used, there is nothing to do.

Signals are not presented to the process immediately they are generated; they must wait until the process runs again. Every time a process exits from a system call, its signal and blocked fields are checked and, if there are any unblocked signals, they can now be delivered. This may seem very unreliable, but every process in the system is making system calls all the time, for example to write a character to the terminal. A process can choose to wait for signals if it wishes; it is then suspended in the Interruptible state until a signal is presented. The Linux signal-handling code checks the sigaction structure for every currently unblocked signal.

If a signal's handler is set to the default action, the kernel handles it. The default handling of the SIGSTOP signal changes the current process's state to Stopped and then runs the scheduler to select a new process to run. The default action of the SIGFPE signal core dumps the process and causes it to exit. Alternatively, a process can specify its own signal handler. This is a routine that is called whenever the signal is generated, and the sigaction structure holds the address of this routine. It is the kernel's job to call the process's signal-handling routine, and how this happens is processor-specific. All CPUs, however, must cope with the fact that the process may currently be running in kernel mode and is just about to return, in user mode, to the code that made the kernel or system call. The problem is solved by manipulating the stack and registers of the process: the process's program counter is set to the address of its signal-handling routine, and the routine's arguments are added to the call frame or passed in registers. When the process resumes, it appears as if the signal-handling routine had been called normally.

Linux is POSIX compliant, so a process can specify which signals are blocked when a particular signal-handling routine is called. This means changing the blocked mask during the call to the process's signal handler. When the signal handler finishes, the blocked mask must be restored to its original value. Linux therefore adds a call to a clean-up routine to the stack of the process that received the signal, and this routine restores the blocked mask to its original value. Linux also optimizes the case where several signal-handling routines need to be called: they are stacked so that each time one handler exits the next is called, and the clean-up routine is not called until the end.

5.2 Pipes

Ordinary Linux shells all allow redirection. For example:

$ ls | pr | lpr

This pipes the output of the ls command, which lists the directory's files, into the standard input of the pr command, which paginates it. Finally, the standard output of the pr command is piped into the standard input of the lpr command, which prints the result on the default printer. A pipe, then, is a unidirectional byte stream that connects the standard output of one process to the standard input of another. Neither process is aware of this redirection and behaves just as it normally would; it is the shell that sets up the temporary pipe between the processes. In Linux, a pipe is implemented using two file data structures that point at the same temporary VFS inode (which itself points at a physical page in memory). Figure 5.1 shows that each file data structure contains pointers to a different vector of file operation routines: one for writing to the pipe, the other for reading from it. This hides the difference from the ordinary system calls that read and write regular files. As the writing process writes to the pipe, bytes are copied into the shared data page; as the reading process reads from the pipe, bytes are copied from the shared page. Linux must synchronize access to the pipe, keeping the reader and the writer in step, and it does this using locks, wait queues and signals.

See include/linux/inode_fs.h
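In user space the same mechanism is reached through the pipe() system call, which returns the two file descriptors corresponding to the two file data structures mentioned above. A minimal writer/reader pair might look like this sketch:

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd[2];                      /* fd[0] = read end, fd[1] = write end */
    char buf[64];

    if (pipe(fd) < 0)
        return 1;

    if (fork() == 0) {              /* child: the writer */
        const char *msg = "hello through the pipe\n";
        close(fd[0]);
        write(fd[1], msg, strlen(msg));
        close(fd[1]);
        _exit(0);
    }

    /* parent: the reader */
    close(fd[1]);
    ssize_t n = read(fd[0], buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("read: %s", buf);
    }
    close(fd[0]);
    wait(NULL);
    return 0;
}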

When the writing process writes to the pipe, it uses the standard write library functions. The file descriptors these library functions pass are indices into the process's set of file data structures, each one representing an open file, in this case the open pipe. The Linux system call uses the write routine pointed at by the file data structure describing this pipe. That write routine uses the information held in the VFS inode representing the pipe to manage the write request. If there is enough room to write all of the bytes into the pipe, and so long as the pipe is not locked by its reader, Linux locks the pipe for the writer and copies the bytes from the process's address space into the shared data page. If the pipe is locked by the reader, or if there is not enough room, the current process is made to sleep on the pipe inode's wait queue and the scheduler is called so that another process can run. The sleeping process is interruptible, so it can receive signals, and it will be woken by the reader when there is enough room in the pipe for the data or when the lock is released. When the data has been written, the pipe's VFS inode lock is released and any reading processes sleeping on the inode's wait queue are woken up.

See fs/pipe.c pipe_write()

Reading data from a pipe is very similar to writing to it. Processes are allowed to do non-blocking reads (depending on the mode in which they opened the file or pipe); in that case, if there is no data to be read or if the pipe is locked, an error is returned and the process can continue to run. The alternative is to wait on the pipe inode's wait queue until the writing process has finished. When both processes have finished with the pipe, the pipe inode and the corresponding shared data page are discarded.

See fs/pipe.c pipe_read()

Linux also supports named pipes, also called FIFOs because pipes work on a first-in, first-out basis: the first data written into the pipe is the first data read out. Unlike pipes, FIFOs are not temporary objects; they are entities in the file system and can be created with the mkfifo command. A process can use a FIFO as long as it has the appropriate access rights. FIFOs are opened slightly differently from pipes. A pipe (its two file data structures, the VFS inode and the shared data page) is created in one go, whereas a FIFO already exists and is opened and closed by its users. Linux must handle readers opening the FIFO before the writer opens it, as well as readers reading before any writer has written data. That aside, FIFOs are handled almost exactly like pipes, and they use the same data structures and operations.

5.3 System V IPC mechanisms

Linux supports three mechanisms for inter-process communication that first appeared in Unix System V (1983): message queues, semaphores and shared memory. These System V IPC mechanisms share a common authentication method. Processes can only access these resources through system calls, passing a unique reference identifier to the kernel. Checks on access to System V IPC objects use access permissions, much like checks on file access. The access rights to a System V IPC object are set by the object's creator through system calls. Each mechanism uses the object's reference identifier as an index into a table of resources; it is not a direct index, and some manipulation is needed to generate it.

All of the Linux data structures representing System V IPC objects in the system include an ipc_perm data structure, which contains the user and group identifiers of the creating process, the access mode for the object (owner, group and other) and the IPC object's key. The key is used as a way of locating the reference identifier of a System V IPC object. Two types of key are supported: public and private. If the key is public, any process in the system, subject to the permission checks, can find the reference identifier of the corresponding System V IPC object. System V IPC objects can never be referenced by key; they must be referenced using their reference identifier.

See include/linux/ipc.h

5.4 Message Queues

A message queue allows one or more processes to write messages and one or more processes to read them. Linux maintains a msgque vector for the message queues in the system. Each of its elements points to a msqid_ds data structure that fully describes one message queue. When a message queue is created, a new msqid_ds data structure is allocated from system memory and inserted into the vector.

Each msqid_ds data structure includes an ipc_perm data structure and pointers to messages entering the queue. In addition, Linux retains the change time of the queue, such as the time of the last queue write, etc. The Msqid_ds queue also includes two waiting queues: one for writing to the message queue and one for reading.

See include/linux/msg.h

Each time a process attempts to write a message to the write queue, its effective user and group identifiers are compared against the schema of the queue's ipc_perm data structure. If the process can write to this queue, the message will be written from the process's address space to the msg data structure and placed at the end of the message queue. Each message carries a token of an application-specified type agreed upon between processes. However, because Linux limits the number and length of messages that can be written, there may be no room for messages. At this time, the process will be placed in the write waiting queue of the message queue, and then the scheduler will be called to select a new process to run. Wake up when one or more messages are read from this message queue.

Reading from the queue is a similar process. Again, the access rights of the process are checked. A reading process can choose either to read the first message in the queue regardless of its type or to select messages of a particular type. If no message matches, the reading process is added to the message queue's read wait queue and the scheduler is run. When a new message is written to the queue, the process is woken up and continues to run.
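At the system call level this corresponds to msgget(), msgsnd() and msgrcv(). The sketch below creates a private queue, writes one typed message and reads it back within a single process; a real application would normally use a shared key and separate reader and writer processes.

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct msg { long mtype; char mtext[64]; };

int main(void)
{
    /* IPC_PRIVATE keeps the example self-contained; cooperating processes
       would usually agree on a key instead. */
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    struct msg out = { 1, "queued message" }, in;

    if (qid < 0)
        return 1;

    msgsnd(qid, &out, strlen(out.mtext) + 1, 0);     /* write to the queue  */
    msgrcv(qid, &in, sizeof(in.mtext), 1, 0);        /* read type-1 message */
    printf("received: %s\n", in.mtext);

    msgctl(qid, IPC_RMID, NULL);                     /* remove the queue    */
    return 0;
}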

5.5 Semaphores

In its simplest form, a semaphore is a location in memory whose value can be tested and set by more than one process. The test and set operation is, as far as each participating process is concerned, uninterruptible, or atomic: once started it cannot be interrupted. The result of the test and set operation is the sum of the semaphore's current value and the set value, which can be positive or negative. Depending on the result of the test and set operation, one process may have to sleep until the semaphore's value is changed by another process. Semaphores can be used to implement critical regions, that is, areas of critical code in which only one process should be running at a time.

Say you have many cooperating processes reading and writing records in a single data file. You would want access to the file to be strictly coordinated. You could use a semaphore with an initial value of 1 and, around the file-operation code, put two semaphore operations: the first to test and decrement the semaphore's value, and the second to test and increment it. The first process to access the file attempts to decrement the semaphore's value and succeeds, the value becoming 0. That process can now go ahead and use the data file. If another process now needs to use the file and tries to decrement the semaphore's value, it fails because the result would be -1; that process is suspended until the first process has finished with the data file. When the first process finishes with the data file, it increments the semaphore's value back to 1. The waiting process is then woken up, and this time its attempt to decrement the semaphore succeeds.
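The file-coordination pattern just described can be written with the System V semaphore calls semget(), semctl() and semop(). The sketch below uses a private, single-semaphore set and the SEM_UNDO flag discussed later in this section; it is a minimal illustration rather than production code.

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* The caller must define union semun itself (see semctl(2)). */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

/* Apply a single operation (delta of -1 to enter, +1 to leave) to
   semaphore 0 of the set, with SEM_UNDO so the kernel can undo it if
   the process dies inside the critical region. */
static void sem_change(int semid, short delta)
{
    struct sembuf op = { 0, delta, SEM_UNDO };
    semop(semid, &op, 1);
}

int main(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    union semun arg;

    arg.val = 1;                            /* initial value: region free */
    if (semid < 0 || semctl(semid, 0, SETVAL, arg) < 0)
        return 1;

    sem_change(semid, -1);                  /* enter the critical region  */
    printf("only one process at a time runs this section\n");
    sem_change(semid, +1);                  /* leave it, waking any waiter */

    semctl(semid, 0, IPC_RMID);             /* remove the semaphore set   */
    return 0;
}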

Each System V IPC semaphore object actually describes an array of semaphores, which Linux represents with the semid_ds data structure. All of the semid_ds data structures in the system are pointed at by the semary vector of pointers. Each semaphore array has sem_nsems semaphores, each described by a sem data structure pointed at by sem_base. All of the processes that are allowed to manipulate the semaphore array of a System V IPC semaphore object can do so through system calls. A system call can specify several operations, each described by three inputs: the semaphore index, the operation value and a set of flags. The semaphore index is an index into the semaphore array, and the operation value is the value to be added to the current value of the semaphore. First Linux tests whether all of the operations would succeed. An operation succeeds only if the operation value added to the semaphore's current value would be greater than zero, or if both the operation value and the semaphore's current value are zero. If any of the semaphore operations would fail, Linux suspends the process, as long as the operation flags have not requested that the system call be non-blocking. If the process is to be suspended, Linux must save the state of the semaphore operations to be performed and put the current process on a wait queue. It does this by building a sem_queue data structure on the stack and filling it in. The new sem_queue data structure is placed at the end of the semaphore object's wait queue (using the sem_pending and sem_pending_last pointers). The current process is put on the wait queue of this sem_queue data structure (sleeper), and the scheduler is called to run another process.

See include/linux/sem.h

If all of the semaphore operations succeed, the current process does not need to be suspended, and Linux goes ahead and applies the operations to the appropriate members of the semaphore array. Linux must then check whether any sleeping or suspended processes can now apply their semaphore operations. It looks in turn at each member of the operations pending queue (sem_pending), testing whether its semaphore operations would succeed this time. If they would, it removes the sem_queue data structure from the pending list, applies the semaphore operations to the semaphore array and wakes the sleeping process, so that it can continue running the next time the scheduler runs. Linux keeps checking the pending queue from the start until no more processes can be woken by applying a semaphore operation.

There is a problem with semaphore operations: deadlock. This occurs when a process changes the value of a semaphore as it enters a critical region but then never leaves the critical region because it crashed or was killed. Linux guards against this by maintaining lists of adjustments to the semaphore arrays; the idea is that if these adjustments are applied, the semaphores are returned to the state they were in before the process's semaphore operations were applied. These adjustments are held in sem_undo data structures, queued both on the semid_ds data structure and on the task_struct data structure of the processes using these semaphores.

Each individual semaphore operation may request that an adjustment be maintained. Linux keeps a sem_undo data structure for each semaphore array that a process operates on; if the requesting process does not have one, one is created for it when it is needed. The new sem_undo data structure is queued both on the process's task_struct data structure and on the semaphore array's semid_ds data structure. As operations are applied to the semaphores in the array, the negation of the operation value is added to that semaphore's entry in the adjustment array of the process's sem_undo data structure. So, if the operation value is 2, then -2 is added to the adjustment entry for this semaphore.

When a process is deleted, such as exiting, Linux traverses its set of sem_undo data structures and implements adjustments to the semaphore array. If a semaphore is deleted, its sem_undo data structure remains in the process' task_struct queue, but the corresponding semaphore array identifier is marked as invalid. In this case, the code to clear the semaphore simply discards the sem_undo data structure.

5.6 Shared Memory

Shared memory allows one or more processes to communicate through memory that is simultaneously present in their virtual address space. The pages of this virtual memory are referenced by page table entries in the page table of each shared process. But there is no need to have the same address in the virtual memory of all processes. As with all System V IPC objects, access to shared memory areas is controlled by keys and checked for access rights. After the memory is shared, there is no more checking how the process uses this memory. They must rely on other mechanisms, such as System V semaphores, to synchronize access to memory.

Each newly created shared memory area is represented by a shmid_ds data structure. These data structures are kept in the shm_segs vector. The shmid_ds data structure describes how big the shared memory area is, how many processes are using it and how the shared memory is mapped into their address spaces. It is the creator of the shared memory who controls access to this memory and decides whether its key is public or private. If it has sufficient privileges, it can also lock the shared memory into physical memory.

See include/linux/shm.h

Every process wishing to share this memory must attach to virtual memory via a system call. This creates a new vm_area_struct data structure for the process that describes the shared memory. A process can choose where shared memory is located in its virtual address space or Linux can choose a sufficient free area.

The new vm_area_struct structure is placed on the list of vm_area_struct structures pointed at by the shmid_ds; they are linked together via vm_next_shared and vm_prev_shared. The virtual memory is not actually created during the attach; that happens when the first process tries to access it.
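In user space, the create/attach/detach sequence corresponds to shmget(), shmat(), shmdt() and shmctl(). The sketch below shares one page between a parent and its child; it uses wait() as a crude stand-in for the semaphore-based synchronization recommended above.

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* One page of shared memory; IPC_PRIVATE keeps the sketch self-contained. */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    char *mem = shmat(shmid, NULL, 0);    /* attach: let Linux pick the address */

    if (shmid < 0 || mem == (char *)-1)
        return 1;

    if (fork() == 0) {                    /* child writes into the region */
        strcpy(mem, "written by the child");
        shmdt(mem);
        _exit(0);
    }
    wait(NULL);                           /* crude synchronization for the demo */
    printf("parent read: %s\n", mem);

    shmdt(mem);                           /* detach from the region */
    shmctl(shmid, IPC_RMID, NULL);        /* release it once everyone detaches */
    return 0;
}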

A page fault occurs the first time a process accesses one of the pages of shared virtual memory. When Linux handles this page fault, it finds the vm_area_struct data structure that describes it. This contains pointers to such shared virtual memory handler routines. The shared memory page fault handling code looks in the shmid_ds page table entry list to see if there is an entry for this shared virtual memory page. If it doesn't exist, it allocates a physical page and creates a page table entry for it. This entry is not only entered into the page table of the current process, but also stored in this shmid_ds. This means that when the next process tries to access this memory and gets a page fault, the shared memory error handling code will also let this process use the newly created physical page. So, it is the first process to access a shared memory page that causes the page to be created, and other processes that access it subsequently cause the page to be added to their virtual address space.

When processes no longer need shared virtual memory, they are detached from it. This separation only affects the current process as long as other processes are still using this memory. Its vm_area_struct is removed from the shmid_ds data structure, and freed. The page table of the current process is also updated, invalidating its shared virtual memory area. When the last process sharing this memory is detached from it, the pages of the shared memory currently in physical memory are released, and the shmid_ds data structure of this shared memory is also released.

It is more complicated if the shared virtual memory is not locked in physical memory. In this case, pages of shared memory may be swapped to the system's swap disk when the system is using a lot of memory. How shared memory is swapped to and from physical memory is described in Section 3.

6、Peripheral Component Interconnect(PCI)

PCI, as its name implies, is a standard that describes how to connect peripheral components in a system in a structured and controlled manner. The standard PCI Local Bus specification describes how system components are electrically connected and how they behave. This section explores how the Linux kernel initializes the system's PCI bus and devices.

Figure 6.1 is a logical diagram of a PCI-based system. The PCI buses and the PCI-PCI bridges are the glue that holds the system components together. The CPU and the video device are connected to the primary PCI bus, PCI bus 0. A special PCI device, the PCI-PCI bridge, connects the primary bus to the secondary PCI bus, PCI bus 1. In PCI specification terminology, PCI bus 1 is described as downstream of the PCI-PCI bridge and PCI bus 0 is upstream of the bridge. Connected to the secondary PCI bus are the system's SCSI and Ethernet devices. Physically the bridge, the secondary PCI bus and both devices can be on the same PCI card. The PCI-ISA bridge in the system supports older, legacy ISA devices, and the figure shows a super I/O controller chip that controls the keyboard, mouse and floppy drive.

6.1 PCI Address Space

The CPU and the PCI devices need to access memory that they share. This memory lets device drivers control the PCI devices and pass information between them. Typically the shared memory contains the device's control and status registers, which are used to control the device and read its status. For example, a PCI SCSI device driver can read the status register of the SCSI device to find out whether it can write a block of information to the SCSI disk, or it can write to the control register to start the device running after it has been shut down.

The system memory used by the CPU could be used as this shared memory, but if it were, then every time a PCI device accessed memory the CPU would have to stall, waiting for the PCI device to finish. Access to memory is generally limited so that only one system component at a time is allowed to access it, and this would slow the system down. It is also not a good idea to allow the system's external devices to access main memory in an uncontrolled way; this could be very dangerous, because a malicious device could make the system very unstable.

External devices have their own memory spaces. The CPU can access these spaces, but the devices' access to system memory is strictly controlled and must go through DMA (Direct Memory Access) channels. ISA devices have access to two address spaces: ISA I/O (input/output) and ISA memory. PCI has three: PCI I/O, PCI memory and PCI configuration space. All of these address spaces are accessible to the CPU; the PCI I/O and PCI memory address spaces are used by device drivers, and the PCI configuration space is used by the PCI initialization code within the Linux kernel.

The Alpha AXP processor has no native access modes to address spaces other than the system address space. It requires the use of support chips to access other address spaces like the PCI configuration space. It uses an address space mapping scheme that steals a portion of the huge virtual address space and maps it to the PCI address space.

6.2 PCI Configuration Headers

Each PCI device in the system, including the PCI-PCI bridge, consists of a configuration data structure located in the PCI configuration address space. The PCI configuration header allows the system to identify and control the device. The exact location of this header in the PCI configuration address space depends on the PCI topology used by the device. For example, a PCI graphics card that is plugged into one PCI slot on a PC motherboard will have its configuration header in one location, while if it is inserted into another PCI slot, its header will appear in another location in the PCI configuration memory. But no matter where these PCI devices and bridges are located, the system can discover and configure them using the status and configuration registers in their configuration headers.

Typically, systems are designed so that each PCI slot's PCI configuration header is at an offset related to its slot number on the board. So, for example, the PCI configuration header for the first slot on the board might be at offset 0 and that for the second slot at offset 256 (all headers are the same length, 256 bytes), and so on. A system-specific hardware mechanism is defined so that the PCI configuration code can examine all possible PCI configuration headers on a given PCI bus, attempting to read a field in each header (usually the Vendor Identification field) and getting some kind of error for the empty slots; that is how it determines which devices are present and which are not. The PCI Local Bus specification describes one possible error indication: an attempt to read the Vendor Identification and Device Identification fields of an empty PCI slot returns 0xFFFFFFFF.
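The probing idea can be sketched as follows. The configuration-space access routine here is faked with a small table so that the example is self-contained; a real implementation would use the platform's system-specific hardware mechanism, and the slot contents shown (apart from the vendor identifications quoted in this section) are purely illustrative.

#include <stdint.h>
#include <stdio.h>

#define PCI_EMPTY_SLOT 0xFFFFFFFFu

/* Pretend configuration space for bus 0: the first 32-bit word of each
 * slot's header, device identification in the high 16 bits and vendor
 * identification in the low 16 bits.  Unset entries behave like empty
 * slots and read back as 0xFFFFFFFF. */
static uint32_t fake_vendor_device[32] = {
    [1] = 0x00091011u,   /* device 0x0009, vendor 0x1011 (Digital 21141) */
    [3] = 0x12378086u,   /* an Intel part, vendor 0x8086 (device invented) */
};

static uint32_t read_vendor_device(int bus, int slot)
{
    (void)bus;
    return fake_vendor_device[slot] ? fake_vendor_device[slot] : PCI_EMPTY_SLOT;
}

int main(void)
{
    for (int slot = 0; slot < 32; slot++) {
        uint32_t id = read_vendor_device(0, slot);
        if (id == PCI_EMPTY_SLOT)
            continue;                 /* empty slot: reads back 0xFFFFFFFF */
        printf("bus 0 slot %2d: vendor %#06x device %#06x\n",
               slot, (unsigned)(id & 0xFFFFu), (unsigned)(id >> 16));
    }
    return 0;
}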

 

Figure 6.2 shows the layout of the 256-byte PCI configuration header. It includes the following fields:

see include/linux/pci.h

Vendor Identification A unique number that identifies the vendor of this PCI device. Digital's PCI Vendor Identification is 0x1011 and Intel's is 0x8086.

Device Identification A unique number that describes the device itself. For example, Digital's 21141 Fast Ethernet device has a device identifier of 0x0009.

Status This field gives the status of the device, the meaning of its bits is specified by the PCI Local Bus specification.

Command The system controls the device by writing to this field, for example to allow the device to access PCI I/O memory.

Class Code identifies the type of device. There are standard classifications for each type of device: display, SCSI, and so on. The type code for SCSI is 0x0100.

Base Address Registers These registers are used to determine and assign the type, size, and location of PCI I/O and PCI memory that a device can use.

Interrupt Pin Four of the physical pins of the PCI card are used to deliver interrupts to the PCI bus; they are labeled A, B, C, and D in the standard. The Interrupt Pin field describes which of these pins this PCI device uses. Usually this is fixed in hardware, that is, the device uses the same interrupt pin every time the system boots. This information allows the interrupt handling subsystem to manage interrupts from the device.

Interrupt Line The Interrupt Line field in the PCI configuration header is used to transfer interrupt control between the PCI initialization code, device drivers, and Linux's interrupt handling subsystem. The number written here is meaningless to the device driver, but it allows the interrupt handler to correctly send an interrupt from the PCI device to the correct device driver's interrupt handling code in the Linux operating system. See Section 7 for how Linux handles interrupts.
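For reference, a minimal sketch of the standard configuration header offsets described above is shown below. The PCI_CFG_* names are illustrative only; the kernel's real definitions (PCI_VENDOR_ID, PCI_INTERRUPT_LINE and so on) live in the headers mentioned earlier.

/* Illustrative byte offsets within the 256-byte PCI configuration header. */
enum pci_config_offsets {
    PCI_CFG_VENDOR_ID      = 0x00,  /* 16 bits, e.g. 0x8086 for Intel       */
    PCI_CFG_DEVICE_ID      = 0x02,  /* 16 bits                              */
    PCI_CFG_COMMAND        = 0x04,  /* 16 bits, enables I/O/memory access   */
    PCI_CFG_STATUS         = 0x06,  /* 16 bits                              */
    PCI_CFG_CLASS_REVISION = 0x08,  /* class code in the upper 24 bits      */
    PCI_CFG_BASE_ADDRESS_0 = 0x10,  /* six BARs at 0x10 .. 0x24             */
    PCI_CFG_INTERRUPT_LINE = 0x3C,  /* 8 bits, filled in at initialization  */
    PCI_CFG_INTERRUPT_PIN  = 0x3D,  /* 8 bits, A=1 .. D=4, 0 = none         */
};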

6.3 PCI I/O and PCI Memory Addresses

These two address spaces are used by devices to communicate with their device drivers running in the Linux kernel on the CPU. For example, the DECchip 21141 Fast Ethernet device maps its internal registers into PCI I/O space, and its Linux device driver then controls the device by reading and writing those registers. Display drivers typically use large amounts of PCI memory space to hold display information.

Devices cannot access these address spaces until the PCI system has been set up and the device's access to them has been turned on using the Command field in its PCI configuration header. It should be noted that only the PCI configuration code reads and writes PCI configuration addresses; Linux device drivers only read and write PCI I/O and PCI memory addresses.

6.4 PCI-ISA Bridges

This bridge converts PCI I/O and PCI memory address space accesses into ISA I/O and ISA memory accesses, to support ISA devices. Most systems sold today include several ISA bus slots as well as several PCI bus slots. The need for this backward compatibility is diminishing, and PCI-only systems will eventually appear. The ISA address spaces (I/O and memory) of a system's ISA devices were fixed back in the early days of the Intel 8080-based PC; even a $5000 Alpha AXP-based system puts its ISA floppy controller at the same ISA I/O addresses as the first IBM PC. The PCI specification reserves the lower regions of the PCI I/O and PCI memory address spaces for ISA peripherals in the system, and a PCI-ISA bridge converts all PCI accesses to these regions into ISA accesses.

 

6.5 PCI-PCI Bridges

PCI-PCI bridges are special PCI devices that glue together the PCI buses in a system. A simple system has a single PCI bus, but there is an electrical limit on the number of PCI devices that a single PCI bus can support. Adding PCI-PCI bridges to create more PCI buses allows the system to support many more PCI devices. This is particularly important for high-performance servers. Linux fully supports the use of PCI-PCI bridges.

6.5.1 PCI-PCI Bridges: PCI I/O and PCI Memory Windows

The PCI-PCI bridge only passes downstream a subset of PCI I/O and PCI memory reads and writes. For example, in Figure 6.1, the PCI-PCI bridge will pass the read and write addresses from PCI bus 0 to bus 1 only if the read and write addresses belong to SCSI or Ethernet devices, and the rest are ignored. This filtering prevents unnecessary address information from traversing the system. To achieve this, PCI-PCI bridges must be programmed to set the base and limits of the PCI I/O and PCI memory address space accesses they must pass from the primary bus to the secondary bus. Once the PCI-PCI bridge in the system is set up, the PCI-PCI bridge is invisible as long as the Linux device driver only accesses PCI I/O and PCI memory space through these windows. This is an important feature that makes life easier for authors of PCI device drivers for Linux. But it also makes the PCI-PCI bridge under Linux somewhat tricky to configure, as we'll see shortly.

6.5.2 PCI-PCI Bridges: PCI Configuration Cycles and PCI Bus Numbering

So that the CPU's PCI initialization code can locate devices that are not on the main PCI bus, there must be a mechanism by which a bridge can decide whether or not to pass configuration cycles from its primary interface to its secondary interface. A cycle is an address as it appears on the PCI bus. The PCI specification defines two formats of PCI configuration address: Type 0 and Type 1, shown in Figure 6.3 and Figure 6.4 respectively. A Type 0 PCI configuration cycle does not contain a bus number and is interpreted by all PCI devices on that PCI bus as a PCI configuration address. Bits 31:11 of the configuration cycle are treated as a device select field. One way of designing a system is to have each bit select a different device: bit 11 might select the PCI device in slot 0, bit 12 the PCI device in slot 1, and so on. Another way is to write the device's slot number directly into bits 31:11. Which mechanism a system uses depends on the system's PCI memory controller.

A Type 1 PCI configuration cycle, which includes a PCI bus number, is ignored by all PCI devices except PCI-PCI bridges. Every PCI-PCI bridge that sees a Type 1 PCI configuration cycle may choose to pass it to the PCI buses downstream of itself. Whether a PCI-PCI bridge ignores the configuration cycle or passes it downstream depends on how the bridge has been configured. Each PCI-PCI bridge has a primary bus interface number and a secondary bus interface number; the primary bus interface is the one nearest the CPU and the secondary bus interface the one furthest from it. Each PCI-PCI bridge also has a subordinate bus number, which is the largest PCI bus number of all the buses bridged beyond its secondary interface, in other words the highest-numbered PCI bus downstream of the bridge. When a PCI-PCI bridge sees a Type 1 PCI configuration cycle, it does one of the following:

If the specified bus number is not between the bridge's secondary bus number and its subordinate bus number, it ignores the cycle.

If the specified bus number matches the bridge's secondary bus number, it converts the cycle into a Type 0 configuration command.

If the specified bus number is greater than the secondary bus number and less than or equal to the subordinate bus number, it is passed to the secondary bus interface unchanged.

So, if we wish to address device 1 on bus 3 in the topology of Figure 6.9, we must generate a configuration command of type 1 from the CPU. Bridge 1 passes it unchanged to Bus 1, Bridge 2 ignores it but Bridge 3 converts it into a configuration command of type 0 and sends it to Bus 3, causing Device 1 to respond to it.
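The three rules above can be summarized in a small sketch. The structure and field names here are illustrative, not the kernel's; the function simply returns which of the three actions a bridge would take for a given Type 1 cycle.

/* Illustrative decision a PCI-PCI bridge makes for a Type 1 configuration cycle. */
enum bridge_action { IGNORE, CONVERT_TO_TYPE0, PASS_UNCHANGED };

struct pci_pci_bridge {
    unsigned primary;      /* bus number on the upstream side            */
    unsigned secondary;    /* bus number immediately downstream          */
    unsigned subordinate;  /* highest bus number reachable downstream    */
};

static enum bridge_action route_type1_cycle(const struct pci_pci_bridge *b,
                                            unsigned target_bus)
{
    if (target_bus < b->secondary || target_bus > b->subordinate)
        return IGNORE;              /* the target bus is not behind this bridge */
    if (target_bus == b->secondary)
        return CONVERT_TO_TYPE0;    /* the target device is on our own secondary bus */
    return PASS_UNCHANGED;          /* the target bus is further downstream */
}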

It is the operating system's job to assign bus numbers during PCI configuration, but whatever numbering scheme is used, the following statement must be true for all of the PCI-PCI bridges in the system:

All PCI buses located behind a PCI-PCI bridge must be numbered between that bridge's secondary bus number and its subordinate bus number (inclusive).

If this rule is violated, the PCI-PCI bridges will not pass and translate Type 1 PCI configuration cycles correctly, and the system will fail to find and initialize the PCI devices in the system. To achieve this numbering scheme, Linux configures these special devices in a particular order. See Section 6.6.2 for a description of Linux's PCI bridge and bus numbering scheme, together with a worked example.

6.6 Linux PCI Initialization

The PCI initialization code in Linux is divided into three logical parts:

PCI Device Driver This pseudo device driver searches the PCI system starting at bus 0 and locates all of the PCI devices and bridges in the system. It builds a linked list of data structures describing the topology of the system. Additionally, it numbers all of the bridges that it finds.

See drivers/pci/pci.c and include/linux/pci.h

PCI BIOS This software layer provides the services described in the PCI BIOS ROM specification. Even though Alpha AXP systems have no BIOS services, there is equivalent code in the Linux kernel that provides the same functionality.

See arch/*/kernel/bios32.c

PCI Fixup System-specific fixup code that tidies up the system-specific loose ends at the end of PCI initialization.

See arch/*/kernel/bios32.c

6.6.1 Linux Kernel PCI Data Structures

When the Linux kernel initializes a PCI system it creates data structures that reflect the actual PCI topology of the system. Figure 6.5 shows the relationship between the data structures used to describe the PCI system illustrated in Figure 6.1.

Each PCI device (including the PCI-PCI bridges) is described by a pci_dev data structure, and each PCI bus by a pci_bus data structure. The result is a tree of PCI buses, each with child PCI devices attached to it. Since a PCI bus can only be reached through a PCI-PCI bridge (except for the primary PCI bus, bus 0), each pci_bus contains a pointer to the PCI device (the PCI-PCI bridge) that it is reached through. That PCI device is a child of the PCI bus's parent bus.

Not shown in Figure 6.5 is a pointer to all of the PCI devices in the system, pci_devices. The pci_dev data structures of all the PCI devices in the system are linked into this list, which the Linux kernel uses to find all of the PCI devices in the system quickly.

6.6.2 The PCI Device Driver

The PCI device driver is not really a device driver at all, but a function of the operating system called at system initialization time. The PCI initialization code must scan all of the PCI buses in the system looking for all PCI devices (including PCI-PCI bridge devices). It uses the PCI BIOS code to find out whether every possible slot on the PCI bus that it is currently scanning is occupied. If a PCI slot is occupied, it builds a pci_dev data structure describing the device and links it into the list of known PCI devices (pointed at by pci_devices).

See drivers/pci/pci.c scan_bus()

The PCI initialization code starts by scanning PCI bus 0. It tries to read the Vendor Identification and Device Identification fields of every possible PCI device in every possible PCI slot. When it finds an occupied slot it creates a pci_dev data structure describing the device. All of the pci_dev data structures created by the PCI initialization code (including those for PCI-PCI bridges) are linked into the pci_devices list.

If the device found is a PCI-PCI bridge, a pci_bus data structure is built and linked into the tree of pci_bus and pci_dev data structures pointed at by pci_root. The PCI initialization code can tell whether a PCI device is a PCI-PCI bridge because it has a class code of 0x060400. The Linux kernel then configures the PCI bus on the other (downstream) side of the PCI-PCI bridge that it has just found. If more PCI-PCI bridges are found, these too are configured. This process is known as a depthwise algorithm: the system's PCI topology is fully mapped depthwise before being searched breadthwise. Looking at Figure 6.1, Linux would configure PCI bus 1 with its Ethernet and SCSI devices before configuring the display device on PCI bus 0.

As Linux searches for downstream PCI buses it must also configure the secondary and subordinate bus numbers of the intervening PCI-PCI bridges. This is described in detail below:

Configuring PCI-PCI Bridges - Assigning PCI Bus Numbers

For a PCI-PCI bridge to pass PCI I/O, PCI memory, or PCI configuration address space reads and writes across it, it needs to know the following:

Primary Bus Number The bus number just upstream of the PCI-PCI bridge

Secondary Bus Number The bus number just downstream of the PCI-PCI bridge

Subordinate Bus Number The highest bus number of all buses that can be reached down from this bridge.

PCI I/O and PCI Memory Windows The base and size of the PCI I/O address space and PCI memory space window for all addresses down from this PCI-PCI bridge.

The problem is that at the moment you wish to configure any given PCI-PCI bridge, you do not know the number of buses behind it. You do not know whether there are further PCI-PCI bridges downstream, and even if you did, you do not know what numbers they will be assigned. The answer is to use a depthwise recursive algorithm: PCI-PCI bridges are numbered as they are found on each bus. For each PCI-PCI bridge found, its secondary bus is numbered, it is assigned a temporary subordinate bus number of 0xFF, and then all of the PCI-PCI bridges downstream of it are scanned and numbered. This sounds complicated, but the worked example below makes the process clearer.

PCI-PCI Bridge Numbering: Step 1 Referring to the topology in Figure 6.6, the first bridge the scan finds is Bridge1. The PCI bus downstream of Bridge1 is numbered 1, so Bridge1 is assigned a secondary bus number of 1 and a temporary subordinate bus number of 0xFF. This means that all Type 1 PCI configuration addresses specifying PCI bus 1 or higher will be passed across Bridge1 onto PCI bus 1; they are converted into Type 0 configuration cycles if their bus number is 1, but left untranslated for any other bus number. This is exactly what the Linux PCI initialization code needs in order to go and scan PCI bus 1.

PCI-PCI Bridge Numbering: Step 2 Linux uses a depthwise algorithm, so the initialization code goes on to scan PCI bus 1, where it finds PCI-PCI Bridge2. There are no further PCI-PCI bridges beyond Bridge2, so its subordinate bus number becomes 2, matching the number assigned to its secondary interface. Figure 6.7 shows how the buses and PCI-PCI bridges are numbered at this point.

PCI-PCI Bridge Numbering: Step 3 The PCI initialization code returns to scanning PCI bus 1 and finds another PCI-PCI bridge, Bridge3. Its primary bus interface is assigned 1, its secondary bus interface 3, and its subordinate bus number 0xFF. Figure 6.8 shows how the system is configured now. Type 1 PCI configuration cycles with a bus number of 1, 2, or 3 are correctly delivered to the appropriate PCI bus.

 

6.6.3 PCI BIOS Functions

PCI BIOS functions are a set of standard routines that are common across platforms; for example, they are the same for Intel and Alpha AXP systems. They give the CPU controlled access to all of the PCI address spaces, and only the Linux kernel and device drivers may use them.

See arch/*/kernel/bios32.c

6.6.4 PCI Fixup

The PCI fixup code for Alpha AXP systems does rather more than that for Intel systems (which basically does nothing). On Intel systems the system BIOS, which runs at boot time, has already fully configured the PCI system, leaving Linux little to do other than map out that configuration. For non-Intel systems further configuration needs to happen:

See arch/*/kernel/bios32.c

Allocate PCI I/O and PCI memory space for each device

PCI I/O and PCI memory address windows must be configured for each PCI-PCI bridge in the system

Generate an Interrupt Line value for each device; these control interrupt handling for the device

The following sections describe how this code works.

Finding Out How Much PCI I/O and PCI Memory Space a Device Needs


Each PCI device found is queried to find out how much PCI I/O and PCI memory address space it requires. To do this, each Base Address Register is written with all ones and then read back. The device returns zeros in the address bits it does not care about, effectively specifying the address space required.

There are two basic types of Base Address Register: the first indicates which address space the device's registers must live in, either PCI I/O or PCI memory space, and this is indicated by bit 0 of the register. Figure 6.10 shows the two forms of the Base Address Register, one for PCI memory and one for PCI I/O.

To find out how much address space a given Base Address Register requires, all ones are written to the register and it is read back. The device sets to zero the address bits it does not care about, effectively specifying the address space needed. This design implies that all address spaces used are powers of two and are naturally aligned.

For example, when the DECChip 21142 PCI Fast Ethernet device is initialized, it reports that it needs 0x100 bytes of space in either PCI I/O or PCI memory. The initialization code allocates this space for it; from then on, the 21142's control and status registers can be seen at those addresses.
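A brief sketch of that size calculation: after all ones have been written to a Base Address Register, the value read back has its flag bits cleared, and the size is the two's complement of what remains. The helper below only does the arithmetic; how the register is actually read and written is left out.

#include <stdint.h>

/* Given the value read back from a BAR after writing all ones,
 * work out how much address space the device is asking for. */
static uint32_t bar_size(uint32_t readback_after_all_ones)
{
    uint32_t v = readback_after_all_ones;
    uint32_t mask = (v & 1) ? (v & ~0x3u)    /* PCI I/O BAR: bits 1:0 are flags    */
                            : (v & ~0xFu);   /* PCI memory BAR: bits 3:0 are flags */
    return ~mask + 1;                        /* e.g. 0xFFFFFF00 -> 0x100 bytes     */
}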

Allocating PCI I/O and PCI Memory to PCI-PCI Bridges and Devices


Like all memory, PCI I/O and PCI memory space are limited, and some of it is quite scarce. The PCI fixup code for non-Intel systems (and the BIOS code for Intel systems) has to allocate each device the amount of memory it asks for in an efficient manner. The PCI I/O and PCI memory allocated to a device must be naturally aligned: for example, if a device asks for 0xB0 of PCI I/O space, it is aligned on an address that is a multiple of 0xB0. In addition, the PCI I/O and PCI memory bases for any given bridge must be aligned on 4K and 1M boundaries respectively. Given that the address spaces for downstream devices must lie within the windows of all of their upstream PCI-PCI bridges, allocating the address space efficiently is a somewhat difficult problem.

The algorithm that Linux uses relies on each device described by the bus/device tree built by the PCI device driver being allocated address space in ascending PCI I/O memory order. Again a recursive algorithm is used, walking the pci_bus and pci_dev data structures built by the PCI initialization code. Starting at the root PCI bus (pointed at by pci_root), the BIOS fixup code:

Align current global PCI I/O and memory bases on 4K and 1M boundaries, respectively

For each device on the current bus (in ascending order of the PCI I/O memory it needs):

- allocate its PCI I/O and/or PCI memory

- Move the global PCI I/O and memory bases on by the appropriate amounts

- Allow the device to use the given PCI I/O and PCI memory

Recursively allocate space for all of the buses downstream of the current bus; note that this changes the global PCI I/O and memory bases.

Align the current global PCI I/O and memory bases on 4K and 1M boundaries respectively; this gives the base and size of the PCI I/O and PCI memory windows that the current bus's PCI-PCI bridge needs.

Program the PCI-PCI bridge that links onto the current bus with its PCI I/O and PCI memory bases and limits.

Turn on bridging of PCI I/O and PCI memory accesses in the PCI-PCI bridge. This means that any PCI I/O or PCI memory address seen on the bridge's primary PCI bus that falls within its PCI I/O or PCI memory window will be bridged onto its secondary PCI bus.

Taking the PCI system in Figure 6.1 as an example, the PCI fixup code would set the system up as follows:

Align the PCI bases The starting PCI I/O base is 0x4000 and the PCI memory base is 0x100000. This allows the PCI-ISA bridge to translate all addresses below these into ISA accesses.

The Video Device asks for 0x200000 of PCI memory, and because it must be naturally aligned to the size requested, allocation starts at PCI memory address 0x200000. The PCI memory base is moved to 0x400000 and the PCI I/O base stays at 0x4000.

The PCI-PCI Bridge We now cross the PCI-PCI bridge and allocate PCI memory there. Note that we do not need to re-align the bases, as they are already correctly aligned.

The Ethernet Device asks for 0xB0 bytes of both PCI I/O and PCI memory space. It is allocated PCI I/O at 0x4000 and PCI memory at 0x400000. The PCI memory base is moved to 0x4000B0 and the PCI I/O base to 0x40B0.

The SCSI Device asks for 0x1000 of PCI memory, so it is allocated at 0x401000 after natural alignment. The PCI I/O base is still 0x40B0 and the PCI memory base is moved to 0x402000.

The PCI-PCI Bridge's PCI I/O and Memory Windows We now return to the bridge and set its PCI I/O window to between 0x4000 and 0x40B0, and its PCI memory window to between 0x400000 and 0x402000. This means that the PCI-PCI bridge will ignore PCI memory accesses for the display device but pass on accesses for the Ethernet or SCSI devices.
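The SCSI allocation above is a good example of the natural-alignment rule in action. The sketch below (not kernel code, just the arithmetic) rounds the running PCI memory base up to the 0x1000 boundary the SCSI device needs and reproduces the numbers quoted in the example.

#include <stdio.h>
#include <stdint.h>

/* Round base up to the next multiple of size (size must be a power of two). */
static uint32_t align_up(uint32_t base, uint32_t size)
{
    return (base + size - 1) & ~(size - 1);
}

int main(void)
{
    uint32_t mem_base = 0x4000B0;                 /* PCI memory base after the Ethernet device */
    uint32_t scsi = align_up(mem_base, 0x1000);   /* the SCSI device wants 0x1000 of PCI memory */

    printf("SCSI allocated at 0x%X, next PCI memory base 0x%X\n", scsi, scsi + 0x1000);
    /* prints: SCSI allocated at 0x401000, next PCI memory base 0x402000 */
    return 0;
}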

7. Interrupts and Interrupt Handling

Although the kernel has general mechanisms and interfaces for handling interrupts, most of the details of interrupt handling are architecture-dependent.

Linux uses a lot of different hardware for many different tasks. Display devices drive monitors, IDE devices drive disks, and so on. You can drive these devices synchronously, that is, you can issue a request to perform some operation (such as writing a block of memory to disk) and wait for the operation to complete. This approach, while working, is very inefficient, and the operating system spends a lot of time "busy doing nothing" as it waits for each operation to complete. A good, more efficient approach is to make the request and do something more useful, and then be interrupted by the device when the device completes the request. Under this scheme, there may be requests from many devices in the system at the same time.

There must be some hardware support for a device to be able to interrupt the CPU's current work. Most, if not all, general-purpose processors such as the Alpha AXP use a similar approach: some of the CPU's physical pins are wired so that simply changing the voltage (from +5V to -5V, for example) causes the CPU to stop what it is doing and start executing special interrupt handling code. One of these pins might be connected to an interval timer that raises an interrupt every thousandth of a second, and others might be connected to other devices in the system, such as a SCSI controller.

Systems typically use an interrupt controller to group the devices' interrupts together and route the signals to a single interrupt pin on the CPU. This saves the CPU's interrupt pins and also gives flexibility when designing systems. The interrupt controller has mask and status registers that control the interrupts: interrupts can be enabled or disabled by setting bits in the mask register, and the status register returns the interrupts currently pending in the system.

Some interrupts in a system may be hardwired; for example, the real-time clock's periodic timer may be permanently connected to pin 3 of the interrupt controller. What the other pins are connected to, however, may be determined by which controller card is plugged into a particular ISA or PCI slot. For example, pin 4 of the interrupt controller may be connected to PCI slot 0, which might contain an Ethernet card one day and a SCSI controller card the next. Every system has its own interrupt routing mechanisms, and the operating system must be flexible enough to cope with them.

Most modern general-purpose microprocessors handle interrupts in the same way. When a hardware interrupt occurs, the CPU stops the instruction it is running and jumps to a location in memory where it either contains interrupt handling code or an instruction that jumps to interrupt handling code. This code usually works in a special mode of the CPU: interrupt mode, in which other interrupts cannot normally be generated. There are exceptions here: some CPUs divide interrupts into levels, and higher-level interrupts can occur. This means that the first level of interrupt handlers must be written very carefully. Interrupt handlers usually have their own stack, which is used to store the execution state of the CPU (all general-purpose registers and context of the CPU) and handle interrupts. Some CPUs have a set of registers that exist only in interrupt mode, which interrupt handling code can use to store most of the contextual information it needs to save.

When the interrupt has been processed, the CPU's state is restored and the interrupt is dismissed; the CPU then continues doing whatever it was doing before the interrupt occurred. It is important that interrupt handling code be as efficient as possible and that the operating system does not block interrupts too often or for too long.

7.1 Programmable Interrupt Controllers

System designers are free to use whatever interrupt architecture they wish, but IBM PCs all use the Intel 82C59A-2 CMOS Programmable Interrupt Controller or its derivatives. This controller has been around since the very beginning of the PC; it is programmed via registers at well-known locations in the ISA address space, and even very modern support logic chipsets keep equivalent registers at the same locations in ISA memory. Non-Intel systems, such as Alpha AXP PCs, are free from these architectural constraints and often use different interrupt controllers.

Figure 7.1 shows two 8-bit controllers chained together, PIC1 and PIC2, each with a mask register and an interrupt status register. The mask registers live at addresses 0x21 and 0xA1, and the status registers at 0x20 and 0xA0. Setting a particular bit in a mask register masks (disables) that interrupt, and clearing it enables the interrupt; clearing bit 3, for example, enables interrupt 3. Unfortunately (and irritatingly), the interrupt mask registers are write-only: you cannot read back the value you wrote. This means that Linux must keep a local copy of what it has set the mask registers to; it modifies these saved masks in its interrupt enable and disable routines and writes the full mask out to the registers every time.

When an interrupt is signaled, the interrupt handling code reads the two interrupt status registers (ISRs). It treats the ISR at 0x20 as the bottom eight bits of a sixteen-bit interrupt register and the ISR at 0xA0 as the top eight bits. So an interrupt on bit 1 of the ISR at 0xA0 is treated as system interrupt 9. Bit 2 of PIC1 is not available because it is used to cascade interrupts from PIC2: any interrupt from PIC2 results in bit 2 of PIC1 being set.
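The need to keep a local copy of the write-only mask registers can be sketched as follows. This is illustrative code, not the kernel's i8259 driver: it uses the standard PC port numbers and the convention that a set bit in the mask register masks (disables) that interrupt line; in the kernel the port output primitive comes from <asm/io.h> rather than the glibc header used here.

#include <stdint.h>
#include <sys/io.h>          /* outb(); illustrative only, do not run against real hardware */

#define PIC1_MASK_PORT 0x21
#define PIC2_MASK_PORT 0xA1

static uint16_t cached_irq_mask = 0xFFFF;   /* bit set = IRQ masked (disabled) */

static void write_masks(void)
{
    outb(cached_irq_mask & 0xFF, PIC1_MASK_PORT);          /* IRQs 0-7  */
    outb((cached_irq_mask >> 8) & 0xFF, PIC2_MASK_PORT);   /* IRQs 8-15 */
}

static void enable_irq_line(unsigned irq)
{
    cached_irq_mask &= ~(1u << irq);   /* clear the bit to unmask the line */
    write_masks();
}

static void disable_irq_line(unsigned irq)
{
    cached_irq_mask |= (1u << irq);    /* set the bit to mask the line */
    write_masks();
}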

7.2 Initializing the Interrupt Handling Data Structures

The kernel's interrupt handling data structures are set up as device drivers request control of the system's interrupts. To do this, a device driver uses a set of Linux kernel services to request an interrupt, enable it, and disable it. The individual device drivers call these routines to register the addresses of their interrupt handling routines.

See arch/*/kernel/irq.c request_irq(), enable_irq() and disable_irq()
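A minimal sketch of a driver claiming an interrupt is shown below. It uses the present-day request_irq() interface rather than the 2.0-era routines named above (the argument list has changed between kernel versions), and the IRQ number, the names and the dev_id cookie are placeholders.

#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/module.h>

#define MYDEV_IRQ 11                /* placeholder; a PCI driver would read this from the
                                       Interrupt Line field of its configuration header */

static int mydev_cookie;            /* dev_id: must be unique when the line is shared */

static irqreturn_t mydev_interrupt(int irq, void *dev_id)
{
    /* A real handler would read the device's status register here and,
       if the device has nothing pending, return IRQ_NONE instead. */
    return IRQ_HANDLED;
}

static int __init mydev_init(void)
{
    /* IRQF_SHARED lets several PCI devices share one interrupt line. */
    return request_irq(MYDEV_IRQ, mydev_interrupt, IRQF_SHARED, "mydev", &mydev_cookie);
}

static void __exit mydev_exit(void)
{
    free_irq(MYDEV_IRQ, &mydev_cookie);   /* must pass the same dev_id used above */
}

module_init(mydev_init);
module_exit(mydev_exit);
MODULE_LICENSE("GPL");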

The PC architecture, for convenience's sake, has fixed some interrupts, so a driver simply requests its interrupt when it is initialized. The floppy device driver does exactly this: it always requests interrupt 6. There may also be cases where a device driver does not know which interrupt its device will use. This is not a problem for PCI device drivers, as they always know their interrupt number, but unfortunately there is no easy way for ISA device drivers to find their interrupt numbers; Linux solves this by allowing device drivers to probe for their interrupts.

First, the device driver does something to the device that causes it to interrupt; then all of the unassigned interrupts in the system are enabled. This means that the device's pending interrupt is now delivered via the programmable interrupt controller. Linux reads the interrupt status register and returns its contents to the device driver; a non-zero result means that one or more interrupts occurred during the probe. The driver now turns probing off, and all of the unassigned interrupts are disabled again. If the ISA device driver has successfully found its IRQ number, it can request control of it as normal.

see arch/*/kernel/irq.c irq_probe_*()

PCI systems are more dynamic than ISA systems. The interrupt an ISA device uses is often set with jumpers on the hardware and is therefore fixed as far as the device driver is concerned. By contrast, interrupts for PCI devices are allocated at boot time by the PCI BIOS or by the PCI subsystem during PCI initialization. Each PCI device may use one of four interrupt pins, A, B, C, or D. This is fixed when the device is built, and most devices default to interrupt pin A. The PCI interrupt lines A, B, C, and D of each PCI slot are routed to the interrupt controller, so pin A of slot 4 might go to pin 6 of the interrupt controller, pin B of slot 4 to pin 7 of the interrupt controller, and so on.

How PCI interrupts are routed is entirely system-dependent, and there must be some setup code that understands the PCI interrupt routing topology. On Intel PCs the system BIOS does this at boot time, but for systems without a BIOS (such as Alpha AXP systems) Linux does the setting up. The PCI setup code writes the interrupt controller pin number into each device's PCI configuration header: it determines the interrupt pin (or IRQ) number from its knowledge of the PCI interrupt routing topology, the device's PCI slot, and which PCI interrupt pin the device uses, and writes it into the Interrupt Line field of the configuration header, which is reserved for this purpose. When the device driver runs, it reads this information and uses it to request control of the interrupt from the Linux kernel.

See arch/alpha/kernel/bios32.c

Many PCI interrupt sources may be in use in the system, for example when PCI-PCI bridges are used, and the number of interrupt sources may exceed the number of pins on the system's programmable interrupt controllers. In this case PCI devices can share interrupts: one pin on the interrupt controller takes interrupts from more than one PCI device. Linux supports this by letting the first requester of an interrupt source declare whether it may be shared. Sharing an interrupt means that one entry in the irq_action vector table can point to several irqaction data structures. When a shared interrupt occurs, Linux calls all of the interrupt handlers for that source, so all device drivers that can share interrupts (which should be all PCI device drivers) must be prepared to be called when there is no interrupt of theirs to service.

7.3 Interrupt Handling

One of the principal tasks of the Linux interrupt handling subsystem is to route each interrupt to the right piece of interrupt handling code. This code must understand the interrupt topology of the system: if, for example, the floppy controller interrupts on pin 6 of the interrupt controller, it must recognize the interrupt as coming from the floppy and route it to the floppy device driver's interrupt handling code. Linux uses a set of pointers to data structures containing the addresses of the routines that handle the system's interrupts. These routines belong to the device drivers for the devices in the system, and it is the responsibility of each device driver to request the interrupt it wants when the driver is initialized. Figure 7.2 shows that irq_action is a vector table of pointers to irqaction data structures; each irqaction data structure contains information about the handler for an interrupt, including the address of the interrupt handling routine. The number of interrupts and how they are handled vary between architectures, so this part of the Linux interrupt handling code is architecture-specific; among other things, the size of the irq_action vector table varies with the number of interrupt sources.

When an interrupt occurs, Linux must first determine its source by reading the status register of the system's programmable interrupt controller. Then convert this source to an offset in the irq_action vector table. For example, an interrupt at pin 6 of the interrupt controller from the floppy controller will be transferred to the 7th pointer in the interrupt handler vector table. If an interrupt occurs without a corresponding interrupt handler, the Linux kernel will log an error, otherwise, it will call the interrupt handler in all the irqaction data structures of this interrupt source.

When the Linux kernel calls the device driver's interrupt handling routine, the routine must work out efficiently why it was interrupted and respond. To find the cause of the interrupt, the device driver reads the status register of the interrupting device: the device may be reporting an error or that a requested operation has completed. For example, the floppy controller might report that it has finished positioning the floppy's read head over the correct sector of the disk. Once the reason for the interrupt has been determined, the device driver may need to do more work; if so, the Linux kernel has mechanisms that allow that work to be postponed until later. This avoids the CPU spending too much time in interrupt mode.

8. Device Drivers

One of the purposes of an operating system is to hide the specifics of the system's hardware devices from the user. For example, a virtual file system presents a unified view of the mounted file system regardless of the underlying physical device. This section describes how the Linux kernel manages the physical devices in the system.

The CPU is not the only intelligent device in the system; every physical device is controlled by its own hardware controller. The keyboard, mouse and serial ports are controlled by a SuperIO chip, IDE disks by an IDE controller, SCSI disks by a SCSI controller, and so on. Each hardware controller has its own control and status registers (CSRs), and these differ from device to device: the CSRs of an Adaptec 2940 SCSI controller are completely different from those of an NCR 810 SCSI controller. The CSRs are used to start and stop the device, to initialize it and to diagnose problems with it. The code that manages these hardware controllers is not placed in every application; instead it lives in the Linux kernel. These pieces of software that handle or manage hardware controllers are called device drivers. Linux kernel device drivers are, essentially, a shared library of privileged, memory-resident, low-level hardware handling routines; it is their job to cope with the idiosyncrasies of the devices they manage.

A fundamental feature of UN*X is that it abstracts the handling of devices. All hardware devices look like regular files: they can be opened, closed, read and written using the same standard system calls that are used to manipulate files. Every device in the system is represented by a device special file; for example, the first IDE disk in the system is represented by /dev/hda. For block (disk) and character devices these device special files are created by the mknod command, and they describe the device using major and minor device numbers. Network devices are also represented by device special files, but they are created by Linux as it finds and initializes the network controllers in the system. All devices controlled by the same device driver have a common major device number, and the minor device numbers are used to distinguish between different devices and their controllers; for example, each partition of the primary IDE disk has a different minor device number. So /dev/hda2, the second partition of the primary IDE disk, has a major number of 3 and a minor number of 2. Linux maps the device special file passed in a system call (say, mounting a file system on a block device) to the device's device driver using the major device number and a number of system tables, for example the character device table, chrdevs.

See fs/devices.c
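From user space, the major and minor numbers of a device special file can be inspected directly; the short program below (an illustration, not part of the kernel) prints them for any path given on the command line, for example a /dev/sda1 or /dev/hda2 node.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() */

int main(int argc, char *argv[])
{
    const char *path = argc > 1 ? argv[1] : "/dev/sda1";
    struct stat st;

    if (stat(path, &st) != 0) {
        perror(path);
        return 1;
    }
    /* st_rdev holds the device numbers for character and block special files. */
    printf("%s: major %u, minor %u\n", path, major(st.st_rdev), minor(st.st_rdev));
    return 0;
}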

Linux supports three types of hardware device: character, block, and network. Character devices are read and written directly, without buffering; examples are the system's serial ports /dev/cua0 and /dev/cua1. Block devices can only be read and written in multiples of a block (usually 512 or 1024 bytes). Block devices are accessed via the buffer cache and may be randomly accessed, that is to say, any block can be read or written no matter where it is on the device. Block devices can be accessed via their device special files, but more commonly they are accessed through the file system. Only block devices can support mounted file systems. Network devices are accessed via the BSD socket interface, and the networking subsystem is described in Section 10.

Linux has many different device drivers (which is also one of the strengths of Linux) but they all have some general properties:

Kernel code Device drivers, like other code in the kernel, are part of the kernel, and a badly written driver can seriously damage the system, possibly corrupting the file systems and losing data.

Kernel interfaces A device driver must provide a standard interface to the Linux kernel or to the subsystem it is part of. For example, the terminal driver provides a file I/O interface to the Linux kernel, and a SCSI device driver provides a SCSI device interface to the SCSI subsystem, which in turn provides both file I/O and buffer cache interfaces to the kernel.

Kernel mechanisms and services Device drivers use standard core services such as memory allocation, interrupt forwarding, and wait queues to do their work

Loadable Most Linux device drivers can be loaded as kernel modules when they are needed and unloaded when they are no longer being used. This makes the kernel very adaptable and efficient with the system's resources.

Configurable Linux device drivers can be built into the core. Which devices are built into the core can be configured at core compile time.

Dynamic As the system boots, each device driver looks for the hardware devices that it manages as it is initialized. It does not matter if a device controlled by a particular device driver does not exist; in that case the device driver is simply redundant and causes no harm apart from occupying a little system memory.

8.1 Polling and Interrupts

Every time a command is given to the device, such as "move the read head to sector 42 of the floppy disk", the device driver can choose how it determines whether the command has ended. Device drivers can poll the device or use interrupts.

Polling a device usually means reading its status register repeatedly until the device's status changes to indicate that it has completed the request. Because a device driver is part of the kernel, it would be disastrous if it polled continuously: nothing else in the kernel would run until the device completed the request. Instead, polling device drivers use system timers to have the kernel call a routine in the driver at some later time; this timer routine checks the status of the command, and this is exactly how Linux's floppy driver works. Polling by means of timers is at best approximate; a much more efficient method is to use interrupts.

An interrupt-driven device driver is one where the hardware device it controls raises a hardware interrupt whenever it needs to be serviced. For example, an Ethernet device driver is interrupted whenever the device receives an Ethernet packet from the network. The Linux kernel needs to be able to deliver the interrupt from the hardware device to the correct device driver. This is achieved by the device driver registering its usage of the interrupt with the kernel: it registers the address of its interrupt handling routine and the interrupt number that it wishes to own. You can see which interrupts are being used by device drivers, and how many of each type of interrupt there have been, in /proc/interrupts:

0: 727432 timer
1: 20534 keyboard
2: 0 cascade
3: 79691 + serial
4: 28258 + serial
5: 1 sound blaster
11: 20868 + aic7xxx
13: 1 math error
14: 247 + ide0
15: 170 + ide1

Requests for interrupt resources are made at driver initialization time. Some interrupts in the system are fixed, a legacy of the IBM PC architecture: the floppy disk controller, for example, always uses interrupt 6. Other interrupts, such as those for PCI devices, are allocated dynamically at boot time. In that case the device driver must first discover the interrupt number (IRQ) of the device it controls before it can request ownership of that interrupt. For PCI interrupts, Linux supports standard PCI BIOS callbacks to determine information about the devices in the system, including their IRQ numbers.

How an interrupt itself is forwarded to the CPU is architecture-dependent. But on most architectures, interrupts are delivered in a special mode that stops other interrupts from occurring in the system. A device driver should do as little work as possible in its interrupt handling routine so that the Linux kernel can end the interrupt and return to where it left off. Device drivers that do a lot of work after receiving an interrupt can use the core bottom half handler or a task queue to queue the routine for later invocation.

8.2 Direct Memory Access (DMA)

Using an interrupt-driven device driver to transfer data to or from a hardware device works quite well when the amount of data is reasonably small. For example, a 9600 baud modem can transfer approximately one character every millisecond (1/1000 of a second). If the interrupt latency, the time between the hardware device raising the interrupt and the device driver's interrupt handler being called, is small (say 2 milliseconds), then the overall impact of the data transfer on the system is very small: the 9600 baud modem data transfer would only take about 0.002% of the CPU's processing time. For high-speed devices, however, such as hard disk controllers or Ethernet devices, the data transfer rate is much higher; a SCSI device can transfer up to 40 Mbytes of information per second.

Direct Memory Access, or DMA, was invented to solve this problem. A DMA controller allows devices to transfer data to or from the system's memory without the intervention of the processor. The PC's ISA DMA controller has 8 DMA channels, 7 of which are available to device drivers. Each DMA channel has a 16-bit address register and a 16-bit count register associated with it. To initiate a data transfer, the device driver sets up the DMA channel's address and count registers together with the direction of the transfer, read or write. When the transfer is complete, the device interrupts the PC. While the transfer is taking place, the CPU is free to do other things.

Device drivers must be careful when using DMA. First of all, the DMA controller knows nothing of virtual memory; it only has access to the physical memory in the system. Therefore the memory being transferred to or from must be a contiguous block of physical memory. This means that you cannot DMA directly into the virtual address space of a process; you can, however, lock the process's physical pages into memory while the DMA operation is taking place. Secondly, the DMA controller cannot access the whole of physical memory: the DMA channel's address register holds the first 16 bits of the DMA address and the next 8 bits come from the page register, so DMA requests are limited to the bottom 16 Mbytes of memory.
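Those two restrictions can be expressed as a simple check. This is an illustrative helper, not kernel code; it assumes an 8-bit ISA DMA channel, for which the 16-bit address register also means a single transfer cannot cross a 64 KB boundary (only the low 16 bits count up, the page register stays fixed).

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Can a physically contiguous buffer at phys_addr of len bytes be used
 * for an 8-bit ISA DMA transfer? */
static bool isa_dma_buffer_ok(uint32_t phys_addr, size_t len)
{
    if (len == 0 || phys_addr + len > 0x1000000u)          /* must sit below 16 MB   */
        return false;
    if ((phys_addr & 0xFFFF0000u) !=                        /* must not cross a 64 KB */
        ((phys_addr + len - 1) & 0xFFFF0000u))              /* page-register boundary */
        return false;
    return true;
}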

DMA channels are scarce resources, there are only 7 of them, and they cannot be shared between device drivers. Like interrupts, a device driver must be able to work out which DMA channel it can use. Also like interrupts, some devices have fixed DMA channels: the floppy drive, for example, always uses DMA channel 2. Sometimes a device's DMA channel can be set with jumpers (some Ethernet devices use this technique), and some more flexible devices can be told (via their CSRs) which DMA channel to use, in which case the device driver can simply pick a free DMA channel.

Linux tracks the usage of the DMA channels using a vector table of dma_chan data structures, one per DMA channel. The dma_chan data structure contains just two fields: a pointer to a string describing the owner of the DMA channel and a flag indicating whether the DMA channel is allocated or not. It is this vector table that is displayed when you cat /proc/dma.

8.3 Memory

Device drivers must be careful in their use of memory. As they are part of the Linux kernel they cannot use virtual memory. Each time a device driver runs, perhaps because an interrupt has been received or because a bottom half handler or task queue has been scheduled, the current process may change; a device driver cannot rely on any particular process running, even if it is doing work on its behalf. Like the rest of the kernel, device drivers use data structures to keep track of the devices they control. These data structures could be statically allocated in the code of the device driver, but that would make the kernel unnecessarily large and wasteful. Most device drivers therefore allocate kernel, non-paged memory to hold their data.

The Linux kernel provides kernel memory allocation and deallocation routines, and these are what device drivers use. Kernel memory is allocated in blocks that are powers of 2, for example 128 or 512 bytes, even if the device driver asks for less; the number of bytes requested is rounded up to the next block size. This makes kernel memory reclamation easier, as smaller free blocks can be combined into larger ones.

Linux has to do quite a lot of extra work when kernel memory is requested. If the total amount of free memory is too low, physical pages may need to be discarded or written out to the swap device. Normally Linux would suspend the requester, putting the process on a wait queue until enough physical memory is available. Not all device drivers (or indeed all Linux kernel code) want this to happen, so the kernel memory allocation routines can be asked to fail if the memory cannot be allocated immediately. If the device driver wants to allocate memory for DMA access, it also needs to say that the memory must be DMA-able, because it is the Linux kernel, not the device driver, that needs to understand what constitutes DMA-able memory in the system.
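A small sketch of these allocation choices, using the kernel's kmalloc() interface, is shown below: GFP_KERNEL may sleep until memory becomes free, GFP_ATOMIC fails rather than sleeping (suitable in interrupt context), and GFP_DMA asks for memory that the ISA DMA controller can reach. The sizes and the function itself are only illustrative.

#include <linux/slab.h>
#include <linux/gfp.h>

static void *alloc_examples(void)
{
    void *a = kmalloc(200, GFP_KERNEL);            /* rounded up to a larger block; may sleep   */
    void *b = kmalloc(512, GFP_ATOMIC);            /* never sleeps; may return NULL immediately */
    void *c = kmalloc(4096, GFP_KERNEL | GFP_DMA); /* memory suitable for ISA DMA               */

    kfree(b);
    kfree(c);
    return a;                                      /* the caller must kfree() this later */
}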

8.4 Interfacing Device Drivers with the Kernel

The Linux kernel must be able to interact with device drivers in standard ways. Each class of device driver (character, block, and network) provides a common interface that the kernel uses when it needs to ask for its services. These common interfaces mean that the kernel can treat often very different devices and their device drivers in exactly the same way; for example, SCSI and IDE disks behave very differently, but the Linux kernel uses the same interface to both of them.

Linux is very dynamic; every time the Linux kernel boots it may encounter different physical devices and thus need different device drivers. Linux allows you to include device drivers at kernel build time via its configuration scripts. When these drivers are initialized at boot time, they may not discover any hardware to control. Other drivers can be loaded as kernel modules when they are needed. To cope with this dynamic nature of device drivers, device drivers register themselves with the kernel as they are initialized. Linux maintains tables of registered device drivers as part of its interface with them; these tables include pointers to routines and information that support the interface with that class of device.

8.4.1 Character Devices

A character device, the simplest of Linux's devices, is accessed as a file. Applications use standard system calls to open it, read from it, write to it and close it, exactly as if the device were an ordinary file. This is true even of the modem used by the PPP daemon to connect a Linux system to the Internet. As a character device is initialized, its device driver registers itself with the Linux kernel by adding an entry, a device_struct data structure, to the chrdevs vector table. The device's major device identifier (for example 4 for tty devices) is used as an index into this vector table; the major device identifier for a device is fixed. Each entry in the chrdevs vector table, a device_struct data structure, contains two elements: a pointer to the name of the registered device driver and a pointer to a block of file operations. The file operations themselves reside within the device's character device driver, and each one handles a specific file operation such as open, read, write or close. The contents of /proc/devices for character devices are taken from the chrdevs vector table.

See include/linux/major.h
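The sketch below shows a character device driver registering itself and its file operations with the kernel. It uses the present-day register_chrdev() interface; the chrdevs/device_struct table described above is the 2.0-era form of the same idea. The major number 240 and all the "example" names are placeholders.

#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>

#define EXAMPLE_MAJOR 240   /* hypothetical major number chosen for this sketch */

static int example_open(struct inode *inode, struct file *file)
{
    return 0;
}

static ssize_t example_read(struct file *file, char __user *buf,
                            size_t count, loff_t *ppos)
{
    return 0;   /* a real driver would copy device data to user space here */
}

static const struct file_operations example_fops = {
    .owner = THIS_MODULE,
    .open  = example_open,
    .read  = example_read,
};

static int __init example_init(void)
{
    /* The kernel indexes its table of character drivers by major number. */
    return register_chrdev(EXAMPLE_MAJOR, "exampledev", &example_fops);
}

static void __exit example_exit(void)
{
    unregister_chrdev(EXAMPLE_MAJOR, "exampledev");
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");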

When a character special file representing a character device (for example /dev/cua0) is opened, the kernel must arrange for the correct character device driver's file operation routines to be called. Like an ordinary file or directory, each device special file is represented by a VFS inode. The VFS inode for a character special file (indeed for all device special files) contains the major and minor identifiers of the device. This VFS inode is created by the underlying file system (for example EXT2), from information in the real file system, when the device special file is looked up.

See fs/ext2/inode.c ext2_read_inode()

Each VFS inode has a set of file operations associated with it, and these differ depending on the file system object the inode represents. Whenever a VFS inode representing a character special file is created, its file operations are set to the default character device operations, which consist of just one operation: open. When an application opens the character special file, the generic open file operation uses the device's major identifier as an index into the chrdevs vector table to fetch the file operations block for this particular device. It also sets up the file data structure describing the character special file so that its file operations pointer points at those of the device driver. From then on, all of the application's file operations are mapped onto the character device's file operations.

See fs/devices.c chrdev_open() def_chr_fops

8.4.2 Block Devices

Block devices also support being accessed like files. The mechanism for providing the correct set of file operations for open block special files is very similar to that for character devices. Linux maintains registered block device files with the blkdevs vector table. Like the chrdevs vector table, it uses the major number of the device as an index. Its entries are also device_struct data structures. Unlike character devices, block devices are classified. SCSI is one category and IDE is another. Classes register with the Linux kernel and provide file operations to the kernel. A device driver for a block device class provides class-related interfaces to this class. For example, a SCSI device driver must provide an interface to the SCSI subsystem that can be used by the SCSI subsystem to provide file operations for such devices to the kernel

See fs/devices.c

Every block device driver must provide a common file operation interface and an interface to the buffer cache. Each block device driver populates its blk_dev_struct data structure in the blk_dev vector table. The index into this vector table is also the major number of the device. The blk_dev_struct data structure contains the address of a request routine and a pointer to a list of request data structures, each representing a request for the buffer cache to read or write a block of data to the device.

See drivers/block/ll_rw_blk.c include/linux/blkdev.h

Every time the buffer cache wishes to read or write a block of data to or from a registered device, it adds a request data structure to the device's blk_dev_struct. Figure 8.2 shows that each request has a pointer to one or more buffer_head data structures, each of which is a request to read or write a block of data. The buffer_head structures are locked (by the buffer cache), and there may be a process waiting on that blocking operation to complete. Each request structure is allocated from a static list, the all_requests list. If the request is being added to an empty request list, the driver's request function is called to start processing the request queue; otherwise the driver simply processes every request on the request queue.

Once the device driver has completed a request, it must remove each buffer_head structure from the request structure, mark them as up to date and unlock them. Unlocking a buffer_head wakes up any process that has been sleeping waiting for the blocking operation to complete. An example of this is resolving a file name: the EXT2 file system must read the block containing the next EXT2 directory entry from the block device holding the file system, and the process sleeps on the buffer_head that will contain the directory entry until the device driver wakes it. The request data structure is then marked as free so that it can be used in another block request.

8.5 Hard Disks

Hard disks store data on spinning platters, offering a more permanent way of storing data. To write data, a tiny head magnetizes a minute particle on the platter's surface; to read data, the head detects whether a particular particle is magnetized.

A disk drive consists of one or more platters, each made of finely polished glass or ceramic and coated with a fine layer of metal oxide. The platters are attached to a central spindle and spin at a constant speed, varying between 3,000 and 10,000 RPM (revolutions per minute) depending on the model. The disk's read/write heads are responsible for reading and writing data; there is a pair for each platter, one head for each side. The read/write heads do not physically touch the platter surface; instead they float on a very thin cushion of air (on the order of millionths of an inch). The read/write heads are moved across the surface of the platters by an arm, and all of the heads are attached together and move across the surfaces of the platters together.

The surface of each platter is divided into narrow concentric rings called tracks. Track 0 is the outermost track, and the highest numbered track is the one closest to the central spindle. A cylinder is the set of identically numbered tracks: all of the 5th tracks on every side of every platter together form the 5th cylinder. Because the number of cylinders is the same as the number of tracks, the size of a disk is often described in cylinders. Each track is divided into sectors. A sector is the smallest unit of data that can be read from or written to a hard disk, and it is also the disk's block size. The sector size is usually 512 bytes and is set when the disk is formatted, during manufacture.

A disk is usually described by its geometry: the number of cylinders, the number of heads, and the number of sectors. For example, Linux describes my IDE disk like this when booting:

hdb: Conner Peripherals 540MB - CFS540A, 516MB w/64kB Cache, CHS=1050/16/63

This means it has 1050 cylinders (tracks), 16 heads (8 platters) and 63 sectors per track. With a sector, or block, size of 512 bytes, this gives the disk a capacity of 529200K bytes. This does not exactly match the disk's stated capacity of 516 Mbytes, as some of the sectors are used to store the disk's partitioning information. Some disks can automatically find bad sectors and re-index the disk to work around them.
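The capacity arithmetic quoted above is easy to check: cylinders times heads times sectors per track gives the total number of sectors, and each sector holds 512 bytes.

#include <stdio.h>

int main(void)
{
    unsigned long cylinders = 1050, heads = 16, sectors = 63, bytes_per_sector = 512;
    unsigned long total = cylinders * heads * sectors * bytes_per_sector;

    printf("%lu bytes = %luK\n", total, total / 1024);
    /* prints: 541900800 bytes = 529200K, matching the capacity given in the text */
    return 0;
}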

Hard disks can be subdivided into partitions. A partition is a large group of sectors allocated for a specific purpose. Partitioning a disk allows the disk to be used for several operating systems or for multiple purposes. Most single-disk Linux systems consist of 3 partitions: one contains the DOS filesystem, another is the EXT2 filesystem, and the third is the swap partition. The partition of the hard disk is described by the partition table, and each entry describes the start and end position of the partition with the head, sector and cylinder number. For DOS disks formatted with fdisk, there can be 4 primary disk partitions. Not all 4 entries of the partition table must be used. Fdisk supports three types of partitions: primary, extended and logical. An extended partition is not a real partition, it can include any number of logical partitions. Extended partitions and logical partitions were invented to break through the limit of 4 primary partitions. Here is the output of fdisk for a disk that includes 2 primary partitions:

Disk /dev/sda: 64 heads, 32 sectors, 510 cylinders
Units = cylinders of 2048 * 512 bytes

   Device Boot   Begin    Start      End   Blocks   Id  System
/dev/sda1            1        1      478   489456   83  Linux native
/dev/sda2          479      479      510    32768   82  Linux swap

Expert command (m for help): p

Disk /dev/sda: 64 heads, 32 sectors, 510 cylinders

Nr AF  Hd Sec  Cyl  Hd Sec  Cyl    Start     Size ID
 1 00   1   1    0  63  32  477       32   978912 83
 2 00   0   1  478  63  32  509   978944    65536 82
 3 00   0   0    0   0   0    0        0        0 00
 4 00   0   0    0   0   0    0        0        0 00

It shows that the first partition starts at cylinder, or track, 0, head 1 and sector 1 and extends to cylinder 477, sector 32 and head 63. As there are 32 sectors per track and 64 read/write heads per cylinder, this partition is a whole number of cylinders in size. fdisk aligns partitions on cylinder boundaries by default. It starts at the outermost cylinder (0) and extends inwards, towards the spindle, for 478 cylinders. The second partition, the swap partition, starts at the next cylinder (478) and extends to the innermost cylinder of the disk.
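The sizes in the fdisk expert output above can be reproduced with a little arithmetic: each cylinder holds 64 heads times 32 sectors, and partition 1 loses the first track of cylinder 0 because it starts at head 1. The sketch below is illustrative only and assumes 512-byte sectors (so two sectors make one 1K block).

#include <stdio.h>

/* Reproduce the partition sizes shown by fdisk above. */
int main(void)
{
    const unsigned heads = 64, sectors = 32;
    const unsigned sectors_per_cyl = heads * sectors;        /* 2048 */

    unsigned long p1 = 478UL * sectors_per_cyl - sectors;    /* minus track 0 */
    unsigned long p2 = (510UL - 478UL) * sectors_per_cyl;    /* cylinders 478-509 */

    printf("partition 1: %lu sectors = %lu blocks\n", p1, p1 / 2);
    printf("partition 2: %lu sectors = %lu blocks\n", p2, p2 / 2);
    return 0;
}

This gives 978912 sectors (489456 blocks) and 65536 sectors (32768 blocks), the same numbers as the fdisk output.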

During initialization, Linux maps out the topology of the hard disks in the system. It finds out how many hard disks there are and of what type, and it also discovers how each disk has been partitioned. This is all represented by a list of gendisk data structures pointed at by the gendisk_head list pointer. As each disk subsystem, for example IDE, is initialized it generates gendisk data structures representing the disks that it finds. It does this at the same time as it registers its file operations and adds its entry into the blk_dev data structure. Each gendisk data structure has a unique major device number, the same as that of its block special devices. For example, the SCSI disk subsystem creates a single gendisk entry ("sd") with a major number of 8, the major number of all SCSI disk devices. Figure 8.3 shows two gendisk entries, the first for the SCSI disk subsystem and the second for an IDE disk controller; here it is ide0, the primary IDE controller.

Although the disk subsystems build their gendisk entries during initialization, Linux only uses them during partition checking. Instead, each disk subsystem maintains its own data structures that allow it to map device major and minor numbers to physical disk partitions. Whenever a block device is read from or written to, whether via the buffer cache or file operations, the kernel directs the operation to the appropriate device using the major and minor numbers it finds in the block special device file (e.g. /dev/sda2). It is each individual device driver or subsystem that maps the minor number to the real physical device.

8.5.1 IDE Disks

The most commonly used disks in Linux systems today are IDE (Integrated Drive Electronics) disks. IDE is a disk interface rather than an I/O bus like SCSI. Each IDE controller can support up to two disks, one the master and the other the slave; master and slave are usually set by jumpers on the disk. The first IDE controller in the system is known as the primary IDE controller, the next as the secondary controller, and so on. IDE can transfer data to and from the disk at 3.3 Mbytes/sec, and the maximum IDE disk size is 538 Mbytes. Extended IDE, or EIDE, raises the maximum disk size to 8.6 Gbytes and the data transfer rate to as much as 16.6 Mbytes/sec. IDE and EIDE disks are cheaper than SCSI disks, and most modern PCs have one or more on-board IDE controllers.

Linux names IDE disks in the order in which it finds their controllers. The master disk on the primary controller is /dev/hda and the slave disk is /dev/hdb; /dev/hdc is the master disk on the secondary IDE controller. The IDE subsystem registers IDE controllers, not disks, with Linux. The major identifier for the primary IDE controller is 3 and that of the secondary IDE controller is 22. This means that if a system has two IDE controllers there will be entries for the IDE subsystem at indices 3 and 22 in the blk_dev and blkdevs vector tables. The block special files for IDE disks reflect this numbering: disks /dev/hda and /dev/hdb, both connected to the primary IDE controller, have a major number of 3. All file or buffer cache operations on these block special files are directed to the IDE subsystem, as the kernel uses the major identifier as an index. When executing a request, the IDE subsystem must work out which IDE disk the request is for. To do this it uses the minor number from the device special file, information that allows it to direct the request to the correct partition of the correct disk. The device identifier of /dev/hdb, the slave IDE disk on the primary IDE controller, is (3, 64); the device identifier of its first partition (/dev/hdb1) is (3, 65).
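The following small sketch illustrates that (major, minor) numbering for the primary IDE controller. The constants and the decoding here are spelled out purely for illustration; the real driver keeps this information in its own tables.

#include <stdio.h>

/* Decode an IDE device identifier on the primary controller (major 3):
 * 64 minor numbers per disk, minor 0 within a disk meaning the whole disk. */
#define IDE0_MAJOR       3
#define MINORS_PER_DISK 64

static void describe(unsigned major, unsigned minor)
{
    unsigned unit = minor / MINORS_PER_DISK;   /* 0 = hda, 1 = hdb */
    unsigned part = minor % MINORS_PER_DISK;   /* 0 = whole disk   */

    printf("(%u,%u) -> /dev/hd%c", major, minor, 'a' + unit);
    if (part)
        printf("%u", part);
    printf("\n");
}

int main(void)
{
    describe(IDE0_MAJOR, 64);   /* /dev/hdb  */
    describe(IDE0_MAJOR, 65);   /* /dev/hdb1 */
    return 0;
}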

8.5.2 Initializing the IDE Subsystem

IDE disks have been around for most of the history of IBM PCs. During this period, the interfaces of these devices have changed. This makes the initialization process of the IDE subsystem more complicated than when it first appeared.

The maximum number of IDE controllers that Linux can support is 4. Each controller is represented by an ide_hwif_t data structure in the ide_hwifs vector table. Each ide_hwif_t data structure contains two ide_drive_t data structures, one for each of the possible master and slave IDE drives it may support. During initialization of the IDE subsystem, Linux first looks at the information about disks recorded in the system's CMOS memory. This is battery-backed memory that does not lose its contents when the PC is powered off; it is actually part of the system's real-time clock device, which runs whether your PC is on or not. The layout of the CMOS memory is set up by the system's BIOS, and it tells Linux which IDE controllers and drives have been found in the system. Linux retrieves the geometry of the discovered disks from the BIOS and uses that information to set up the ide_hwif_t data structures for the drives. Most modern PCs use PCI chipsets such as Intel's 82430 VX chipset, which includes a PCI EIDE controller. The IDE subsystem uses PCI BIOS callbacks to locate the PCI (E)IDE controllers in the system, and then calls the query routines for those chipsets.

Once an IDE interface or controller has been found, its ide_hwif_t is set up to reflect the controller and the disks attached to it. During operation, the IDE driver writes commands to IDE command registers that exist in the I/O memory space. The default I/O addresses for the primary IDE controller's control and status registers are 0x1F0-0x1F7; these addresses are a convention dating from the early IBM PCs. The IDE driver registers each controller with the Linux buffer cache and the VFS, adding it to the blk_dev and blkdevs vector tables respectively. The IDE driver also requests control of the appropriate interrupt. Again, these interrupts follow a convention: 14 for the primary IDE controller and 15 for the secondary. However, like all IDE details, these can be overridden with kernel command line options. The IDE driver also adds a gendisk entry to the gendisk list for each IDE controller found at boot time. This list is later used to examine the partition tables of all the hard disks found at boot time; the partition checking code understands that each IDE controller may control two IDE disks.

8.5.3 SCSI Disks

The SCSI (Small Computer System Interface) bus is an efficient peer-to-peer data bus that supports up to eight devices per bus, including one or more hosts. Each device must have a unique identifier, usually set by jumpers on the device. Data can be transferred synchronously or asynchronously between any two devices on the bus, in widths of up to 32 bits and at speeds as high as 40 Mbytes/sec. The SCSI bus transfers both data and state information between devices, and a single transaction between an initiator and a target can involve up to eight distinct phases. The current phase can be determined from five signals on the bus. The eight phases are listed below; a small sketch follows the list.

BUS FREE No device has control of the bus and no transaction is currently taking place.

ARBITRATION A SCSI device attempts to gain control of the SCSI bus by asserting its SCSI identifier on the address pins. The highest numbered SCSI identifier wins.

SELECTION A device has successfully arbitrated control of the SCSI bus and must now signal the SCSI target to which it wants to send commands. It declares the target's SCSI identifier on the address pins.

RESELECTION A SCSI device may disconnect while it is processing a request; the target may then reselect the initiator. Not all SCSI devices support this phase.

COMMAND Commands of 6, 10 or 12 bytes can be sent from the initiator to the target.

DATA IN, DATA OUT At this stage, data is transferred between the initiator and the target.

STATUS This phase is entered after all of the commands have completed; it allows the target to send a status byte to the initiator indicating success or failure.

MESSAGE IN, MESSAGE OUT Additional information passed between the initiator and the target.
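The sketch below simply walks through the phases that a simple read transaction might pass through, using the names from the list above. The enumeration and the chosen sequence are illustrative; they are not symbols from the kernel sources.

#include <stdio.h>

/* The SCSI bus phases listed above, and one plausible sequence for a
 * simple read from a target that does not disconnect. */
enum scsi_phase { BUS_FREE, ARBITRATION, SELECTION, RESELECTION, COMMAND,
                  DATA_IN, DATA_OUT, STATUS, MESSAGE_IN, MESSAGE_OUT };

static const char *phase_name[] = {
    "BUS FREE", "ARBITRATION", "SELECTION", "RESELECTION", "COMMAND",
    "DATA IN", "DATA OUT", "STATUS", "MESSAGE IN", "MESSAGE OUT"
};

int main(void)
{
    enum scsi_phase read_txn[] = { ARBITRATION, SELECTION, COMMAND,
                                   DATA_IN, STATUS, MESSAGE_IN, BUS_FREE };

    for (unsigned i = 0; i < sizeof(read_txn) / sizeof(read_txn[0]); i++)
        printf("%s\n", phase_name[read_txn[i]]);
    return 0;
}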

The Linux SCSI subsystem consists of two basic elements, each represented by a data structure:

Host A SCSI host is a physical piece of hardware, a SCSI controller. The NCR810 PCI SCSI controller is an example of a SCSI host. If a Linux system has more than one SCSI controller of the same type, each instance is represented by a separate SCSI host. This means that a SCSI device driver may control more than one instance of its controller. A SCSI host is almost always the initiator of SCSI commands.

Device SCSI devices are usually disks, but the SCSI standard supports several other types: tape, CD-ROM and generic SCSI devices. SCSI devices are usually the targets of SCSI commands. These devices must be treated differently; for removable media such as CD-ROMs or tapes, for example, Linux needs to detect whether the media has been removed. The different disk types have different major device numbers, allowing Linux to direct block device requests to the appropriate SCSI type.

Initializing the SCSI Subsystem

Initializing the SCSI subsystem is rather complex, reflecting the dynamic nature of the SCSI bus and devices. Linux initializes the SCSI subsystem at boot time: it looks for the SCSI controller (SCSI host) in the system, and probes every SCSI bus, looking for every device. These devices are then initialized so that the rest of the Linux kernel can access them through normal file and buffer cache block device operations. This initialization process has four stages:

First, Linux finds out which of the SCSI host adapters, or controllers, that were built into the kernel at kernel build time actually have hardware to control. Each built-in SCSI host has a Scsi_Host_Template entry in the builtin_scsi_hosts vector table. The Scsi_Host_Template data structure contains pointers to routines that carry out SCSI host specific actions, such as detecting which SCSI devices are attached to this SCSI host. These routines are called as the SCSI subsystem configures itself, and they are part of the SCSI device driver supporting this host type. For each detected SCSI controller that has real SCSI devices attached, its Scsi_Host_Template data structure is added to the scsi_hosts list of active SCSI hosts. Each instance of a detected host type is represented by a Scsi_Host data structure held in the scsi_hostlist list. For example, a system with two NCR810 PCI SCSI controllers would have two Scsi_Host entries in this list, one per controller. Each Scsi_Host points at the Scsi_Host_Template representing its device driver.

Now that every SCSI host has been found, the SCSI subsystem must find out which SCSI devices are attached to each host's bus. SCSI devices are numbered from 0 to 7, and each device number, or SCSI identifier, is unique on the SCSI bus to which it is attached. SCSI identifiers are usually set by jumpers on the device. The SCSI initialization code finds each SCSI device on a SCSI bus by sending it a TEST_UNIT_READY command. When a device responds, it is sent an INQUIRY command to complete its identification. This gives Linux the vendor's name and the device's model and revision names. SCSI commands are represented by a Scsi_Cmnd data structure, and they are passed to the device driver by calling the device driver routines held in this SCSI host's Scsi_Host_Template data structure. Every SCSI device that is found is represented by a Scsi_Device data structure, each of which points to its parent Scsi_Host. All of the Scsi_Device data structures are added to the scsi_devices list. Figure 8.4 shows how the main data structures relate to one another.

There are four SCSI device types: disk, tape, CD and generic. Each SCSI type is registered with the kernel separately, each with its own major block device identifier. However, they only register themselves when one or more devices of that SCSI device type have been found. Each SCSI type, for example SCSI disk, maintains its own device tables; it uses these tables to direct kernel block operations (file or buffer cache) to the correct device driver or SCSI host. Each SCSI type is represented by a Scsi_Device_Template data structure. This contains information about this type of SCSI device and the addresses of routines that perform various tasks. The SCSI subsystem uses these templates to call the SCSI type routines for each type of SCSI device. In other words, if the SCSI subsystem wishes to attach a SCSI disk device, it calls the SCSI disk type attach routine. The Scsi_Device_Template data structures are added to the scsi_devicelist list if one or more SCSI devices of that type have been detected.

The final stage of SCSI subsystem initialization is to call the completion functions of each registered Scsi_Device_Template. For the SCSI disk type this spins up all of the SCSI disks that were found and records their sizes. It also adds the gendisk data structure representing all SCSI disks to the linked list of disks, as shown in Figure 8.3.

Delivering Block Device Requests

Once Linux has initialized the SCSI subsystem, the SCSI devices may be used. Each valid SCSI device type registers itself with the kernel so that Linux can direct block device requests to it. These requests may be buffer cache requests via blk_dev or file operations via blkdevs. Take, for example, a SCSI disk that is partitioned into a number of EXT2 filesystem partitions: how do the kernel's buffer requests get directed to the right SCSI disk when its EXT2 partitions are mounted?

Each request to read or write a block of data to or from a SCSI disk partition results in a new request data structure being added to that SCSI disk's current_request list in the blk_dev vector table. If the request list is already being processed, the buffer cache need do nothing more; otherwise it must nudge the SCSI disk subsystem into processing its request queue. Each SCSI disk in the system is represented by a Scsi_Disk data structure. These are kept in the rscsi_disks vector table, which is indexed using part of the SCSI disk partition's minor device number. For example, /dev/sdb1 has a major number of 8 and a minor number of 17, which gives an index of 1. Each Scsi_Disk data structure contains a pointer to the Scsi_Device data structure representing this device, which in turn points at the Scsi_Host data structure that "owns" it. The request data structures from the buffer cache are translated into Scsi_Cmnd data structures describing the SCSI commands that need to be sent to the SCSI device, and these are queued onto the Scsi_Host data structure representing that device. Once the appropriate data blocks have been read or written, the requests are dealt with by the individual SCSI device driver.
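The indexing described above can be shown with a few lines of arithmetic. The constant of 16 minor numbers per SCSI disk is assumed here for the purposes of the sketch; the real subsystem keeps this mapping in its own tables.

#include <stdio.h>

/* Map a SCSI disk device identifier onto an rscsi_disks-style index:
 * each disk takes 16 minor numbers, with 0 within a disk meaning the
 * whole disk. */
#define SCSI_DISK_MAJOR  8
#define PARTS_PER_DISK  16

int main(void)
{
    unsigned minor = 17;                            /* /dev/sdb1 */
    unsigned disk_index = minor / PARTS_PER_DISK;   /* 1 -> second disk (sdb) */
    unsigned partition  = minor % PARTS_PER_DISK;   /* 1 -> first partition   */

    printf("major %d, minor %u -> disk index %u, partition %u\n",
           SCSI_DISK_MAJOR, minor, disk_index, partition);
    return 0;
}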

8.6 Network Devices

A network device is, so far as Linux's network subsystem is concerned, an entity that sends and receives packets of data. It is normally a physical device such as an ethernet card, but some network devices are software only, such as the loopback device, which is used for sending data to yourself. Each network device is represented by a device data structure. Network device drivers register the devices that they control with Linux during network initialization at kernel boot time. The device data structure contains information about the device and the addresses of functions that allow the various supported network protocols to use the device's services. Most of these functions are concerned with transmitting data using the network device. The device passes received data up to the appropriate protocol layer using standard network support mechanisms. All network data (packets) transmitted and received are represented by sk_buff data structures; these are flexible data structures that allow network protocol headers to be easily added and removed. How the network protocol layers use the network devices, and how they pass data back and forth using sk_buff data structures, is described in detail in Section 10. The focus here is on the device data structure and on how network devices are discovered and initialized.

See include/linux/netdevice.h

The device data structure includes information about network devices:

Name Unlike block and character devices, whose device special files are created with the mknod command, network device special files appear naturally when the system's network devices are discovered and initialized. Their names are standard, and each name indicates its device type. Multiple devices of the same type are numbered sequentially from 0 upwards. So ethernet devices are numbered /dev/eth0, /dev/eth1, /dev/eth2, and so on. Some common network devices are:

/dev/ethN Ethernet devices
/dev/slN SLIP devices
/dev/pppN PPP devices
/dev/lo Loopback device

Bus Information This is the information that the device driver needs to control the device. Irq is the interrupt used by the device. Base address is the address of the device's control and status registers in I/O memory. DMA channel is the DMA channel number used by this network device. All this information is set at boot time when the device is initialized.

Interface Flags These describe the characteristics and capabilities of this network device.

IFF_UP Interface is up and running

IFF_BROADCAST The broadcast address in the device is valid

IFF_DEBUG The device's debugging option is turned on

IFF_LOOPBACK This is a loopback device

IFF_POINTTOPOINT This is a point-to-point connection (SLIP and PPP)

IFF_NOTRAILERS No network trailers

IFF_RUNNING Resources have been allocated

IFF_NOARP Does not support the ARP protocol

IFF_PROMISC The device is in promiscuous receive mode; it will receive all packets regardless of who they are addressed to

IFF_ALLMULTI Receive all IP multicast frames

IFF_MULTICAST Can receive IP multicast frames

Protocol Information Each device describes how it can be used by the network protocol layer:

MTU The size of the largest packet that this network can transmit, not counting any link layer headers that the device adds. This maximum is used by the protocol layers, for example IP, to select suitable packet sizes to send.

Family The family indicates the protocol family that the device can support. The family for all Linux network devices is AF_INET, the Internet address family.

Type The hardware interface type describes the media that this network device is attached to. Linux network devices support many different types of media, including Ethernet, X.25, Token Ring, SLIP, PPP and Apple Localtalk.

Addresses The device data structure holds a number of addresses that are relevant to this network device, including its IP address.

Packet Queue This is the queue of sk_buff packets waiting to be transmitted by this network device.

Support Functions Each device provides a standard set of routines for the protocol layers to call as part of their interface to this device's link layer. These include setup and frame transmit routines as well as routines to add standard frame headers and collect statistics. These statistics can be seen using the ifconfig command. A condensed sketch of this kind of device structure follows.
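This is a much simplified, user-space sketch of the sort of information listed above; the real definition lives in include/linux/netdevice.h, and the field names, types and example values used here are only illustrative.

#include <stdio.h>

/* A condensed stand-in for the kernel's network device data structure. */
struct net_device_sketch {
    const char    *name;        /* e.g. "eth0"                          */
    unsigned int   irq;         /* interrupt used by the device         */
    unsigned long  base_addr;   /* I/O base of control/status registers */
    unsigned char  dma;         /* DMA channel, if any                  */
    unsigned int   flags;       /* IFF_UP, IFF_BROADCAST, ...           */
    unsigned int   mtu;         /* largest packet, no link layer header */
    unsigned short family;      /* AF_INET                              */
    unsigned short type;        /* hardware/media type                  */
    int          (*open)(struct net_device_sketch *);
    int          (*hard_start_xmit)(struct net_device_sketch *,
                                    const void *frame, unsigned int len);
};

int main(void)
{
    struct net_device_sketch eth0 = {
        .name = "eth0", .irq = 10, .base_addr = 0x300,
        .mtu = 1500, .family = 2 /* AF_INET */,
    };

    printf("%s: irq %u, io 0x%lx, mtu %u\n",
           eth0.name, eth0.irq, eth0.base_addr, eth0.mtu);
    return 0;
}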

8.6.1 Initializing Network Devices

Network device drivers, like other Linux device drivers, can be built into the Linux kernel. Each potential network device is represented by a device data structure within the list of network devices pointed at by the dev_base list pointer. The network layers call one of a number of network device service routines, whose addresses are held in the device data structure, whenever device specific work needs to be performed. Initially, though, each device data structure holds only the address of an initialization, or probe, routine.

There are two problems for network device drivers to solve. First, not all of the network device drivers built into the Linux kernel will have devices to control; second, the ethernet devices in the system are always called /dev/eth0, /dev/eth1 and so on, regardless of which underlying device driver controls them. The problem of "missing" network devices is easily solved. As each network device's initialization routine is called, it returns a status indicating whether or not it located an instance of the controller that it drives. If the driver could not find any devices, its entry in the device list pointed at by dev_base is removed. If the driver could find a device, it fills out the rest of the device data structure with information about the device and the addresses of the support functions within the network device driver.

The second problem, that of dynamically assigning ethernet devices to the standard /dev/ethN device special files, is solved more elegantly. There are eight standard entries in the device list: eth0, eth1 and so on up to eth7. The initialization routine is the same for all of them; it tries each ethernet device driver built into the kernel in turn until one finds a device. When the driver finds its ethernet device, it fills out the ethN device data structure, which it now owns. At this point the network driver also initializes the physical hardware that it is controlling and works out which IRQ it is using, which DMA channel (if any) and so on. A driver may find several instances of the network device that it controls, in which case it takes over several of the /dev/ethN device data structures. Once all eight standard /dev/ethN entries have been allocated, no more ethernet devices will be probed for.
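A toy model of that allocation scheme is sketched below; the driver names and the number of cards each one finds are invented purely for illustration.

#include <stdio.h>

/* Hand out the standard ethN names in order as drivers find hardware. */
#define MAX_ETH 8

int main(void)
{
    const char *drivers[]     = { "wd80x3", "ne2000", "3c509" };
    const int   cards_found[] = { 0, 2, 1 };   /* instances each driver finds */
    int next_eth = 0;

    for (int d = 0; d < 3 && next_eth < MAX_ETH; d++)
        for (int i = 0; i < cards_found[d] && next_eth < MAX_ETH; i++)
            printf("eth%d claimed by the %s driver\n", next_eth++, drivers[d]);

    return 0;
}

Here the ne2000 driver would claim eth0 and eth1 and the 3c509 driver eth2, which is the sort of ordering the real probe sequence produces.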

9. The File System

One of the most important features of Linux is its support for many different filesystems. This makes it very flexible and well able to coexist with many other operating systems. At the time of writing, Linux supports 15 filesystems: ext, ext2, xia, minix, umsdos, msdos, vfat, proc, smb, ncp, iso9660, sysv, hpfs, affs and ufs, and no doubt, over time, more will be added.

In Linux, as in Unix, the different filesystems that the system can use are not accessed by device identifiers (such as a drive number or a drive name) but are combined into a single hierarchical tree structure that represents the filesystem as one whole entity. Linux adds each new filesystem into this single filesystem tree as it is mounted. All filesystems, of whatever type, are mounted onto a directory, and the files of the mounted filesystem cover up the existing contents of that directory. This directory is known as the mount directory or mount point. When the filesystem is unmounted, the mount directory's own files become visible again.

When a disk is initialized (eg, with fdisk), a partition structure is used to divide the physical disk into a set of logical partitions. Each partition can hold a file system, such as an EXT2 file system. The file system organizes files into a logical tree structure on blocks of physical devices through directories, soft links, etc. Devices that can include file systems are block devices. The first partition of the first IDE disk drive in the system, the IDE disk partition /dev/hda1, is a block device. The Linux file system treats these block devices as simple linear combinations of blocks, and does not know or care about the size of the underlying physical disk. It is the task of every block device driver to map a read request for a specific block of the device into terms meaningful to the device: the track, sector, and cylinder where this block is stored on the hard disk. A filesystem should work the same way and have the same look and feel no matter what device it is stored on. Also, with Linux's filesystem, it doesn't matter (at least to the system user) whether these different filesystems are on different physical media under the control of different hardware controllers. The filesystem might not even be on the local system, it might be mounted remotely over a network connection. Consider the following example, where the root filesystem of a Linux system is on a SCSI disk.

A E boot etc lib opt tmp usr

C F cdrom fd proc root var sbin

D bin dev home mnt lost+found

Neither the user nor the program manipulating these files need to know that /C is actually a mounted VFAT filesystem on the system's first IDE disk. In this example (actually my Linux system at home), /E is the master IDE disk on the secondary IDE controller. It doesn't matter that the first IDE controller is a PCI controller, and the second is an ISA controller, which also controls the IDE CDROM. I can dial into my work network with a modem and PPP network protocol, and at this point, I can remotely mount the filesystem of my Alpha AXP Linux system to /mnt/remote.

Files in a filesystem are collections of data: the file holding the source for this section is an ASCII file called filesystems.tex. A filesystem holds not only the data of the files that it contains but also the structure of the filesystem. It holds all of the information that Linux users and processes see as files, directories, soft links, file protection information and so on. Moreover it must hold that information safely and securely; the basic integrity of the operating system depends on its filesystems. Nobody would use an operating system that randomly lost data and files (not knowingly, anyway, although I have been bitten by operating systems with more lawyers than Linux has developers).

Minix, Linux's first filesystem, is rather limited and lacking in performance. Its filenames cannot be longer than 14 characters (which is still better than 8.3 filenames) and the maximum filesystem size is 64 Mbytes. 64 Mbytes may sound large enough at first glance, but even modest databases need bigger files than that. The first filesystem designed specifically for Linux, the Extended File System or EXT, was introduced in April 1992 and cured a lot of the problems, but it was still felt to lack performance. So, in 1993, the Second Extended File System, or EXT2, was added. This filesystem is described in detail later in this section.

An important development took place when the EXT filesystem was added to Linux: the real filesystems were separated from the operating system and system services by an interface layer known as the Virtual File System, or VFS. VFS allows Linux to support many, often very different, filesystems, each of which presents a common software interface to the VFS. All of the details of each real filesystem are translated by software, so all filesystems appear identical to the rest of the Linux kernel and to programs running on the system. Linux's Virtual File System layer allows you to transparently mount many different filesystems at the same time.

The Linux Virtual File System is implemented so that access to its files is as fast and efficient as possible. It must also make sure that the files and their data are kept correctly. These two requirements can be at odds with each other. The Linux VFS caches information in memory from each filesystem as it is mounted and used. A lot of care must be taken to update the filesystem correctly, because the data in these caches changes as files and directories are created, written to and deleted. If you could see the filesystem's data structures within the running kernel, you would see data blocks being read and written by the filesystems, the data structures describing the files and directories being accessed being created and destroyed, and all the while the device drivers working away, fetching and storing data. The most important of these caches is the Buffer Cache, which is integrated into the way that the individual filesystems access their underlying block devices. The Buffer Cache not only caches data buffers, it also helps manage the asynchronous interface to the block device drivers.

9.1 The Second Extended File System (EXT2)

The EXT2 filesystem was devised (by Remy Card) as an extensible and powerful filesystem for Linux. It is the most successful filesystem, at least within the Linux community, and is the basis of all of the current Linux distributions. The EXT2 filesystem, like most filesystems, is built on the premise that the data held in files is kept in data blocks. These data blocks are all of the same length and, although that length can vary between different EXT2 filesystems, the block size of a particular EXT2 filesystem is set when it is created (using mke2fs). Every file's size is rounded up to an integral number of blocks; if the block size is 1024 bytes, a file of 1025 bytes will occupy two 1024-byte blocks. Unfortunately this means that on average you waste half a block per file. Generally in computing you trade off CPU usage against memory and disk space utilization; in this case Linux, along with most operating systems, accepts relatively inefficient use of disk space in order to reduce the workload on the CPU. Not all of the blocks in the filesystem hold data; some must be used to contain the information that describes the structure of the filesystem. EXT2 defines the filesystem topology by describing each file in the system with an inode data structure. An inode describes which blocks the data within a file occupies, as well as the access rights of the file, the file's modification times and the type of the file. Every file in the EXT2 filesystem is described by a single inode, and each inode is identified by a single unique number. The inodes of the filesystem are all kept together in inode tables. EXT2 directories are simply special files (themselves described by inodes) which contain pointers to the inodes of their directory entries.
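The rounding described above is easy to demonstrate; the sketch below assumes a 1024 byte block size, as in the example.

#include <stdio.h>

/* A 1025 byte file on a filesystem with 1024 byte blocks takes two blocks. */
int main(void)
{
    unsigned long block_size = 1024, file_size = 1025;
    unsigned long blocks = (file_size + block_size - 1) / block_size;

    printf("%lu byte file -> %lu block(s), %lu bytes wasted\n",
           file_size, blocks, blocks * block_size - file_size);
    return 0;
}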

Figure 9.1 shows an EXT2 filesystem occupying a series of blocks on a block-structured device. So far as each filesystem is concerned, a block device is just a series of blocks that can be read and written. A filesystem does not need to care where on the physical media a block should be put; that is the job of the device's driver. Whenever a filesystem needs to read information or data from the block device containing it, it requests that some whole number of blocks be read by the supporting device driver. The EXT2 filesystem divides the logical partition that it occupies into Block Groups. As well as holding real files and directories as information and data blocks, each group duplicates information that is critical to the consistency of the filesystem. This duplicated information is necessary should a disaster occur and the filesystem need recovering. The contents of each Block Group are described in more detail below.

9.1.1 The EXT2 Inode

In the EXT2 filesystem, the inode is the basic building block: every file and directory in the filesystem is described by one and only one inode. The EXT2 inodes for each Block Group are kept in the inode table together with a bitmap that allows the system to keep track of allocated and unallocated inodes. Figure 9.2 shows the format of an EXT2 inode; amongst other information, it contains the following fields (a condensed sketch follows the list):

See include/linux/ext2_fs_i.h

Mode This holds two pieces of information: what this inode describes and the permissions that users have to it. For EXT2, an inode can describe a file, a directory, a symbolic link, a block device, a character device or a FIFO.

Owner Information User and group identifiers for the data of this file or directory. This allows the file system to properly control file access permissions

Size The size of the file (bytes)

Timestamps The time this inode was created and the time it was last modified.

Datablocks Pointers to the blocks that contain the data this inode describes. The first 12 are pointers to the physical blocks containing the data, and the last three pointers contain increasing levels of indirection. For example, the doubly indirect block pointer points at a block of pointers to blocks of pointers to data blocks. This means that files of 12 data blocks or less in size are faster to access than larger files.
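The condensed sketch below gathers the fields listed above into one user-space structure and works out roughly how far each level of block pointer reaches. The real definitions are in include/linux/ext2_fs_i.h; the names and types here are simplified for illustration.

#include <stdio.h>

/* A simplified stand-in for the on-disk EXT2 inode. */
struct ext2_inode_sketch {
    unsigned short mode;                 /* what it describes + permissions */
    unsigned short uid, gid;             /* owner information               */
    unsigned int   size;                 /* size in bytes                   */
    unsigned int   atime, ctime, mtime;  /* timestamps                      */
    unsigned int   block[15];            /* 12 direct, then single, double
                                            and triple indirect pointers    */
};

int main(void)
{
    /* With 1024 byte blocks and 4 byte block numbers, each level of
     * indirection multiplies the reach by 256 pointers per block. */
    unsigned long block = 1024, ptrs = block / 4;

    printf("direct blocks:   %lu bytes\n", 12 * block);
    printf("single indirect: %lu more bytes\n", ptrs * block);
    printf("double indirect: %lu more bytes\n", ptrs * ptrs * block);
    return 0;
}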

You should note that EXT2 inodes can describe special device files. These are not real files; rather, they are handles that programs can use to access devices. All of the device files in /dev are there to allow programs to access Linux's devices. For example, the mount program takes the device file that it wishes to mount as an argument.

9.1.2 The EXT2 Superblock

The superblock contains a description of the basic size and shape of this filesystem. The information within it allows the filesystem manager to use and maintain the filesystem. Usually only the superblock in Block Group 0 is read when the filesystem is mounted, but each Block Group contains a duplicate copy in case the filesystem becomes corrupted. Amongst other information, it holds the fields below; a small sketch of the mount count check follows the list.

See include/linux/ext2_fs_sb.h

Magic Number This allows the mounting software to check that this really is the superblock of an EXT2 filesystem. For the current version of EXT2 this is 0xEF53.

Revision Level The major and minor revision levels allow the mounting code to determine whether or not this filesystem supports features that are only available in particular revisions of the filesystem. There are also feature compatibility fields, which help the mounting code to determine which new features can safely be used on this filesystem.

Mount Count and Maximum Mount Count Together these allow the system to determine whether the filesystem should be fully checked. The mount count is incremented each time the filesystem is mounted, and when it equals the maximum mount count the warning message "maximal mount count reached, running e2fsck is recommended" is displayed.

Block Group Number Stores the block group number of this superblock copy.

Block Size The size in bytes of the file system's blocks, eg 1024 bytes.

Blocks per Group The number of blocks in the group. Like the block size, this is determined when the file system is created.

Free Blocks The number of free blocks in the file system.

Free Inodes Free inodes in the filesystem.

First Inode This is the inode number of the first inode in the filesystem. The first inode in an EXT2 root filesystem would be the directory entry for the '/' directory.
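The mount count check mentioned above amounts to very little code; the sketch below uses made-up values and a trimmed-down structure purely to illustrate it.

#include <stdio.h>

/* A trimmed-down superblock with just the fields the check needs. */
struct ext2_sb_sketch {
    unsigned short magic;            /* 0xEF53 for EXT2 */
    unsigned short mount_count;
    unsigned short max_mount_count;
};

int main(void)
{
    struct ext2_sb_sketch sb = { 0xEF53, 19, 20 };

    if (sb.magic != 0xEF53) {
        printf("not an EXT2 filesystem\n");
        return 1;
    }
    sb.mount_count++;                /* this mount */
    if (sb.mount_count >= sb.max_mount_count)
        printf("maximal mount count reached, "
               "running e2fsck is recommended\n");
    return 0;
}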

 

9.1.3 The EXT2 Group Descriptor

Each Block Group has a data structure describing it. Like the superblock, the group descriptors for all of the Block Groups are duplicated in each Block Group in case of filesystem corruption. Each group descriptor contains the following information:

See include/linux/ext2_fs.h ext2_group_desc

Blocks Bitmap The block number of the block allocation bitmap of this block group, used in the block allocation and reclamation process

Inode Bitmap The block number of the inode bitmap for this block group. Used in inode allocation and recycling.

Inode Table The block number of the starting block of the inode table for this Block Group. Each inode is represented by the EXT2 inode data structure described above.

Free blocks count, Free inodes count, Used directory count

The group descriptors are arranged in sequence, and together they form the group descriptor table. Each block group includes a complete copy of the block group descriptor table and its superblock. Only the first copy (in block group 0) is actually used by the EXT2 filesystem. Other copies, like other copies of the superblock, are only used when the primary copy is corrupted.

9.1.4 EXT2 Directories

In the EXT2 file system, directories are special files used to create and store access paths to files in the file system. Figure 9.3 shows the layout of a directory entry in memory. A directory file is a list of directory entries, each directory entry containing the following information:

See include/linux/ext2_fs.h ext2_dir_entry

inode The inode for this directory entry. This is an index into the array of inodes held in the Block Group's inode table. In Figure 9.3, the directory entry for the file called file references inode i1.

Name length The length in bytes of this directory entry

Name The name of this directory entry

The first two entries in each directory are always the standard "." and "..", meaning "this directory" and "parent directory", respectively.
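A directory's data blocks are therefore just a sequence of entries like the ones sketched below. The real ext2_dir_entry is variable length (the name is not held in a fixed-size field); the fixed-size array and the inode numbers here are simplifications for illustration.

#include <stdio.h>

/* A simplified EXT2 directory entry. */
struct ext2_dir_entry_sketch {
    unsigned int   inode;      /* inode number of the entry             */
    unsigned short rec_len;    /* length of this entry in bytes         */
    unsigned short name_len;   /* length of the name                    */
    char           name[16];   /* the name (not NUL terminated on disk) */
};

int main(void)
{
    struct ext2_dir_entry_sketch dir[] = {
        { 42, 12, 1, "."    },   /* this directory   */
        {  2, 12, 2, ".."   },   /* parent directory */
        { 17, 12, 4, "file" },
    };

    for (unsigned i = 0; i < 3; i++)
        printf("%-4s -> inode %u\n", dir[i].name, dir[i].inode);
    return 0;
}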

9.1.5 Finding a File in an EXT2 File System

Linux filenames have the same format as all Unix filenames: a series of directory names separated by "/" and ending in the file's name. One example filename would be /home/rusling/.cshrc, where /home and /rusling are directory names and the file's name is .cshrc. Like all other Unix systems, Linux does not care about the format of the filename itself; it can be of any length and consist of any printable characters. To find the inode representing this file within an EXT2 filesystem, the system must resolve the filename one directory at a time until it reaches the file itself.

The first inode we need is the inode for the root of the filesystem, and we find its number in the filesystem's superblock. To read an EXT2 inode we must look for it in the inode table of the appropriate Block Group. If, for example, the root inode number is 42, then we need the 42nd inode from the inode table of Block Group 0. The root inode is for an EXT2 directory; in other words the mode of the root inode describes it as a directory and its data blocks contain EXT2 directory entries.

home is just one of these directory entries, and this directory entry gives us the number of the inode describing the /home directory. We have to read this directory (by first reading its inode and then reading the directory entries from the data blocks that its inode describes), looking for the rusling entry, which gives us the number of the inode describing the /home/rusling directory. Finally, we read the directory entries pointed at by the inode describing the /home/rusling directory to find the inode number of the .cshrc file, and from this we get the data blocks containing the information held in the file.
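The walk just described can be mimicked with a few lines of user-space code. The directory contents and inode numbers below are invented (the root is given inode 42 to match the example above); the point is only the one-component-at-a-time lookup.

#include <stdio.h>
#include <string.h>

/* Pretend directory contents: which name, in which directory inode,
 * maps to which inode. */
struct dirent_sketch { unsigned dir_inode; const char *name; unsigned inode; };

static const struct dirent_sketch table[] = {
    { 42, "home",    11 },   /* the root directory */
    { 11, "rusling", 37 },
    { 37, ".cshrc",  90 },
};

static unsigned lookup(unsigned dir_inode, const char *name)
{
    for (unsigned i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (table[i].dir_inode == dir_inode && !strcmp(table[i].name, name))
            return table[i].inode;
    return 0;                          /* not found */
}

int main(void)
{
    char path[] = "/home/rusling/.cshrc";
    unsigned inode = 42;               /* root inode, from the superblock */

    for (char *comp = strtok(path, "/"); comp; comp = strtok(NULL, "/")) {
        inode = lookup(inode, comp);
        printf("%-8s -> inode %u\n", comp, inode);
        if (!inode)
            return 1;
    }
    return 0;
}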

9.1.6 Changing the size of a File in an EXT2 File System

A common problem for filesystems is their tendency to fragment. The blocks that hold a file's data get spread all over the filesystem, and the more spread out the data blocks are, the less efficient sequential access to them becomes. The EXT2 filesystem tries to overcome this by allocating new blocks for a file that are physically close to its current data blocks, or at least in the same Block Group as them. Only when this fails does it allocate data blocks in another Block Group.

Whenever a process attempts to write data to a file, the Linux filesystem checks whether the data would go beyond the end of the file's last allocated block. If it would, a new data block must be allocated for the file. The process cannot run until this allocation is complete; it must wait for the filesystem to allocate the new data block and write the rest of the data to it before it can continue. The first thing the EXT2 block allocation routines do is lock the EXT2 superblock for this filesystem. Allocating and deallocating blocks changes fields within the superblock, and the Linux filesystem cannot allow more than one process to make such changes at the same time. If another process needs to allocate more data blocks, it will have to wait until this one has finished. Processes waiting for the superblock are suspended, unable to run, until control of the superblock is relinquished by its current user. Access to the superblock is granted on a first come, first served basis, and once a process has control of the superblock it keeps that control until it has finished. Having locked the superblock, the process checks that the filesystem has enough free blocks. If there are not enough free blocks, the attempt to allocate more will fail and the process will relinquish control of this filesystem's superblock.

If there are enough free blocks in the filesystem, the process tries to allocate one. If the EXT2 filesystem has been built to preallocate data blocks, we may be able to take one of those. The preallocated blocks do not actually exist; they are merely reserved within the allocated block bitmap. The VFS inode representing the file that we are trying to allocate a new data block for has two EXT2-specific fields, prealloc_block and prealloc_count, which are the block number of the first preallocated data block and the number of preallocated blocks respectively. If there are no preallocated blocks, or preallocation is disabled, the EXT2 filesystem must allocate a new block. It first looks to see whether the data block after the last data block in the file is free. Logically, this is the most efficient block to allocate, as it keeps sequential accesses fast. If this block is not free, it searches the following 64 blocks for an ideal data block. Such a block, although not ideal, is at least fairly close to the file's other data blocks and within the same Block Group.

See fs/ext2/balloc.c ext2_new_block()

If none of those blocks is free, the process starts looking in all of the other Block Groups in turn until it finds some free blocks. The block allocation code looks for a cluster of eight free data blocks somewhere in one of those Block Groups; if it cannot find eight together, it settles for fewer. If block preallocation is wanted, and enabled, it will update prealloc_block and prealloc_count accordingly.
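The search order just described (the block after the file's last block, then the following 64 blocks, then anywhere else) is the heart of ext2_new_block() in fs/ext2/balloc.c. The sketch below reproduces only that ordering over a single, tiny bitmap; preallocation, Block Groups and the eight-block clustering are left out, and all names and sizes are invented for illustration.

#include <stdio.h>
#include <string.h>

#define NBLOCKS 256

/* Return a free block, preferring the goal block, then the next 64,
 * then anything else. The bitmap uses one byte per block: non-zero
 * means allocated. */
static int find_block(const unsigned char *bitmap, unsigned goal)
{
    if (!bitmap[goal])
        return goal;                              /* block after the file */
    for (unsigned b = goal + 1; b < goal + 64 && b < NBLOCKS; b++)
        if (!bitmap[b])
            return b;                             /* close to the file    */
    for (unsigned b = 0; b < NBLOCKS; b++)
        if (!bitmap[b])
            return b;                             /* anywhere at all      */
    return -1;
}

int main(void)
{
    unsigned char bitmap[NBLOCKS];

    memset(bitmap, 1, sizeof(bitmap));            /* mostly full          */
    bitmap[130] = 0;                              /* one free block       */

    printf("allocated block %d\n", find_block(bitmap, 100));
    return 0;
}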

Wherever the free data block is found, the block allocation code updates the Block Group's block bitmap and allocates a data buffer in the buffer cache. That data buffer is uniquely identified by the device identifier of the filesystem's supporting device and the block number of the allocated block. The data in the buffer is zeroed and the buffer is marked "dirty" to show that its contents have not yet been written to the physical disk. Finally, the superblock itself is marked "dirty", to show that it has been changed, and it is unlocked. If there were any processes waiting for the superblock, the first one in the queue is allowed to run again and gains exclusive control of the superblock for its own file operations. The process's data is written to the new data block and, if that data block is filled, the entire procedure is repeated and further data blocks are allocated.

9.2 The Virtual File System (VFS)

Figure 9.4 shows the relationship between the Linux kernel's virtual filesystem and its real filesystem. The virtual file system must manage all the different file systems mounted at any one time. To this end it manages the data structures describing the entire file system (virtual) and the individual real, mounted file systems.

Quite confusingly, VFS also uses the terms superblock and inode to describe the system's files, much in the same way that the EXT2 filesystem uses superblocks and inodes. Like EXT2 inodes, VFS inodes describe files and directories in the system: the contents and topology of the virtual file system. From now on, to avoid confusion, I will use VFS inodes and VFS superblocks to distinguish them from EXT2 inodes and superblocks.

see fs/*

As each filesystem is initialized, it registers itself with the VFS. This happens as the operating system initializes itself at system boot time. The real filesystems are either built into the kernel itself or built as loadable modules. Filesystem modules are loaded as the system needs them; so, for example, if the VFAT filesystem is implemented as a kernel module, it is only loaded when a VFAT filesystem is mounted. When a block device based filesystem is mounted (and that includes the root filesystem), the VFS must read its superblock. Each filesystem type's superblock read routine must work out the filesystem's topology and map that information onto a VFS superblock data structure. The VFS keeps a list of the mounted filesystems in the system together with their VFS superblocks. Each VFS superblock contains information about the filesystem and pointers to routines that perform particular functions. For example, the superblock representing a mounted EXT2 filesystem contains a pointer to the EXT2-specific inode read routine. This EXT2 inode read routine, like all filesystem-specific inode read routines, fills out the fields of a VFS inode. Each VFS superblock contains a pointer to the first VFS inode of its filesystem; for the root filesystem this is the inode representing the "/" directory. This mapping of information is quite efficient for the EXT2 filesystem but rather less so for some other filesystems.

 

As the system's processes access directories and files, system routines are called that traverse the VFS inodes in the system. For example, typing ls for a directory or cat for a file makes the Virtual File System look up the VFS inodes that represent them. As every file and directory on the system is represented by a VFS inode, a number of inodes will be accessed repeatedly. These inodes are kept in the inode cache, which makes access to them quicker. If an inode is not in the inode cache, then a filesystem-specific routine must be called in order to read the appropriate inode. The act of reading the inode causes it to be put into the inode cache, and further accesses keep it there. Less used VFS inodes get removed from the cache.

see fs/inode.c

All of the Linux filesystems use a common buffer cache to cache data buffers from the underlying devices. This speeds up access to the physical devices holding the filesystems and so speeds up access to the filesystems themselves. The buffer cache is independent of the filesystems and is integrated into the mechanisms that the Linux kernel uses to allocate, read and write data buffers. Keeping the Linux filesystems independent of the underlying media and of the supporting device drivers has real benefits. All block-structured devices register themselves with the Linux kernel and present a uniform, block-based, usually asynchronous interface; even relatively complex block devices such as SCSI devices do this. As the real filesystems read data from the underlying physical disks, they cause the block device drivers to read physical blocks from the devices that they control. The buffer cache is integrated into this block device interface. As blocks are read by the filesystems, they are saved in the global buffer cache shared by all of the filesystems and the Linux kernel. Buffers within it are identified by their block number and a unique identifier for the device that read them. So, if the same data is needed often, it will be retrieved from the buffer cache rather than read from the disk, which would take rather longer. Some devices support read ahead, where data blocks are speculatively read in case they are needed later.

See fs/buffer.c

The VFS also maintains a directory lookup cache, so the inode of a frequently used directory can be found quickly. As an experiment, try listing a directory that you haven't listed recently. The first time you make a list, you will notice a brief pause, and the second time you make a list, the results will come out immediately. The directory cache itself does not store the inodes in the directory, which is the responsibility of the inode cache. The directory cache only stores the full names of directory items and their inode numbers.

See fs/dcache.c

9.2.1 The VFS Superblock

Each mounted filesystem is represented by a VFS superblock. Among other information, the VFS superblock includes:

See include/linux/fs.h

Device This is the device identifier for the block device that this filesystem occupies. For example, /dev/hda1, the first partition on the first IDE disk in the system, has a device identifier of 0x301.

Inode pointers The mounted inode pointer points to the first inode of the file system. The Covered inode pointer points to the inode of the directory into which the filesystem is mounted. For the root filesystem, there is no covered pointer in its VFS superblock.

Blocksize Filesystem block size in bytes, for example 1024 bytes.

Superblock operations A pointer to a set of superblock routines for this filesystem. Amongst other things, the VFS uses these routines to read and write inodes and superblocks.

File System type A pointer to the file_system_type data structure for this mounted file system

File System Specific a pointer to the information required by this file system

9.2.2 The VFS Inode

As in the EXT2 filesystem, every file, directory and so on in the VFS is represented by one and only one VFS inode. The information in each VFS inode is built from information held in the underlying filesystem by filesystem-specific routines. VFS inodes exist only in the kernel's memory and are kept in the VFS inode cache for as long as they are useful to the system. Amongst other information, VFS inodes contain the following fields:

See include/linux/fs.h

device The device identifier of the device holding this file (or other entity represented by this VFS inode).

Inode number The number of this inode, unique within this filesystem. The combination of device and inode number is unique within the whole Virtual File System.

Mode Like EXT2, this field describes what this VFS inode represents and the access rights to it.

User ids owner identifier

Times Created, modified and written

Block size The size in bytes of the block of this file, e.g. 1024 bytes

Inode operations Pointer to a set of routine addresses. These routines are related to the file system and perform operations on this inode, such as truncate the file represented by this inode

Count The number of system components currently using this VFS inode. Count 0 means that the inode is free and can be discarded or reused.

Lock This field is used to lock the VFS inode. For example when reading it from the filesystem

Dirty shows whether this VFS inode has been written to, and if so, the underlying filesystem needs to be updated.

File system specific information

9.2.3 Registering the File Systems

When you build the Linux kernel, you are asked whether you want each of the supported filesystems. When the kernel is built, the filesystem startup code contains calls to the initialization routines of all of the built-in filesystems. Linux filesystems may also be built as modules; in this case they may be demand-loaded as they are needed or loaded by hand using insmod. Whenever a filesystem module is loaded it registers itself with the kernel, and it unregisters itself when it is unloaded. Each filesystem's initialization routine registers itself with the Virtual File System and is represented by a file_system_type data structure, which contains the name of the filesystem and a pointer to its VFS superblock read routine. Figure 9.5 shows the file_system_type data structures being placed onto a list pointed at by the file_systems pointer. Each file_system_type data structure contains the following information; a small sketch of the registration scheme follows the list.

See fs/filesystems.c sys_setup()

see include/linux/fs.h file_system_type

Superblock read routine This routine is called by the VFS when an instance of this filesystem is mounted

File System name The name of this filesystem, for example ext2

Device needed Does this filesystem need a device to support it? Not all filesystems need a device to hold them; the /proc filesystem, for example, does not require a block device
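A bare-bones version of this registration is sketched below. The structure and function names are stand-ins for file_system_type and the real registration code, and the listing it prints deliberately resembles the /proc/filesystems output shown next.

#include <stdio.h>

/* A stand-in for file_system_type and the file_systems list. */
struct fs_type_sketch {
    const char *name;
    int         needs_dev;
    int       (*read_super)(const char *dev);
    struct fs_type_sketch *next;
};

static struct fs_type_sketch *file_systems;       /* head of the list */

static void register_fs(struct fs_type_sketch *fs)
{
    fs->next = file_systems;
    file_systems = fs;
}

static int ext2_read_super(const char *dev)
{
    printf("reading ext2 superblock from %s\n", dev);
    return 0;
}

int main(void)
{
    static struct fs_type_sketch proc = { "proc", 0, NULL, NULL };
    static struct fs_type_sketch ext2 = { "ext2", 1, ext2_read_super, NULL };

    register_fs(&proc);
    register_fs(&ext2);

    for (struct fs_type_sketch *fs = file_systems; fs; fs = fs->next)
        printf("%s%s\n", fs->needs_dev ? "" : "nodev ", fs->name);
    return 0;
}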

You can check /proc/filesystems to see which filesystems are registered, for example:

ext2

nodev proc

iso9660

9.2.4 Mounting a File System

When the superuser attempts to mount a filesystem, the Linux kernel must first validate the arguments passed in the system call. Although mount does some basic checking, it does not know which filesystems this kernel has been built to support or whether the proposed mount point actually exists. Consider the following mount command:

$ mount -t iso9660 -o ro /dev/cdrom /mnt/cdrom

The mount command passes three pieces of information to the core: the name of the filesystem, the physical block device that contains the filesystem, and where in the existing filesystem topology the new filesystem is to be mounted.

The first thing the Virtual File System must do is find the filesystem. To do this it looks through each file_system_type data structure in the list pointed at by file_systems, looking at all of the known filesystems. If it finds a matching name, it now knows that this filesystem type is supported by the kernel and it has the address of the filesystem-specific routine for reading this filesystem's superblock. If it cannot find a matching filesystem name, all is not lost provided the kernel was built to demand-load kernel modules (see Section 12): in that case the kernel will ask the kernel daemon to load the appropriate filesystem module before continuing.

see fs/super.c do_mount()

See fs/super.c get_fs_type()

Secondly, if the physical device passed by mount is not already mounted, the VFS inode of the directory that is to become the new filesystem's mount point must be found. This VFS inode may be in the inode cache, or it may have to be read from the block device supporting the filesystem that contains the mount point. Once the inode has been found, it is checked to see that it is a directory and that there is no other filesystem already mounted there; the same directory cannot be used as a mount point for more than one filesystem.

At this point the VFS mount code must allocate a VFS superblock and pass the mount information to this filesystem's superblock read routine. All of the system's VFS superblocks are kept in the super_blocks vector of super_block data structures, and one must be allocated for this mount. The superblock read routine must fill out the VFS superblock fields based on the information that it reads from the physical device. For the EXT2 filesystem this mapping, or translation, of information is quite easy: it simply reads the EXT2 superblock and fills out the VFS superblock from it. For other filesystems, such as the MS-DOS filesystem, it is not quite such an easy task. Whatever the filesystem, filling out the VFS superblock means that the information describing the filesystem must be read from the block device that supports it. If the block device cannot be read, or if it does not contain this type of filesystem, the mount command will fail.

Each mounted filesystem is described by a vfsmount data structure, see Figure 9.6. They are queued in a list pointed to by vfsmntlist. Another pointer, vfsmnttail, points to the last entry in the list, and the mru_vfsmnt pointer points to the most recently used filesystem. Each vfsmount structure includes the device number of the block device that stores the file system, the directory where the file system is mounted, and a pointer to the VFS superblock allocated when the file system was mounted. The VFS superblock points to the file_system_type data structure of this type of file system and the root inode of this file system. This inode resides in the VFS inode cache during this filesystem mount.

See fs/super.c add_vfsmnt()

9.2.5 Finding a File in the Virtual File System

To find the VFS inode of a file within the Virtual File System, the VFS must resolve the name a directory at a time, looking up the VFS inode of each intermediate directory in the name. Each directory lookup calls the filesystem-specific lookup routine whose address is held in the VFS inode representing the parent directory. This works because the root inode of each filesystem is always available via a pointer in that filesystem's VFS superblock. Every time an inode is looked up by the real filesystem, the directory cache is checked for that directory. If there is no entry in the directory cache, the real filesystem gets the VFS inode either from the underlying filesystem or from the inode cache.

9.2.6 Creating a File in the Virtual File System

9.2.7 Unmounting a File System

You might expect unmounting a filesystem to be simply the reverse of mounting it, but it is a little more involved. A filesystem cannot be unmounted if something in the system is using one of its files; you cannot, for example, unmount /mnt/cdrom if a process is using that directory or any of its children. If anything is using the filesystem to be unmounted, there will be VFS inodes from it in the VFS inode cache, and the unmount code checks for these by looking through the list of inodes for ones owned by the device that this filesystem occupies. If the mounted filesystem's VFS superblock is dirty, that is it has been modified, then it must be written back to the filesystem on disk. Once it has been written to disk, the memory occupied by the VFS superblock is returned to the kernel's free memory pool. Finally, the vfsmount data structure for this mount is removed from vfsmntlist and freed.

See fs/super.c do_umount()

See fs/super.c remove_vfsmnt()

9.2.8 The VFS Inode Cache

When traversing mounted filesystems, their VFS inodes are constantly being read, and sometimes written. The virtual filesystem maintains an inode cache, which is used to speed up access to all mounted filesystems. Each time a VFS inode is read from the inode cache, the system can save access to physical devices.

see fs/inode.c

The VFS inode cache is implemented as a hash table whose entries are pointers to lists of VFS inodes that share the same hash value. The hash value of an inode is calculated from its inode number and the device number of the underlying physical device containing the filesystem. Whenever the Virtual File System needs to access an inode, it first looks in the VFS inode cache. To look up an inode in the hash table, the system first calculates its hash value and then uses it as an index into the table. This gives a pointer to a list of inodes with the same hash value, and the system then reads each inode in turn until it finds one with both the same inode number and the same device identifier as the one it is looking for.
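In outline, the lookup works like the sketch below. The hash function, table size and structure layout are all invented for illustration; only the idea of hashing on (device, inode number) and walking the chain comes from the description above.

#include <stdio.h>

#define HASH_SIZE 1024

struct vfs_inode_sketch {
    unsigned dev, ino, count;
    struct vfs_inode_sketch *next;     /* chain of same-hash inodes */
};

static struct vfs_inode_sketch *hash_table[HASH_SIZE];

static unsigned hash(unsigned dev, unsigned ino)
{
    return (dev ^ ino) % HASH_SIZE;
}

/* Look the inode up in the cache; bump its count if it is there. */
static struct vfs_inode_sketch *cache_lookup(unsigned dev, unsigned ino)
{
    for (struct vfs_inode_sketch *i = hash_table[hash(dev, ino)]; i; i = i->next)
        if (i->dev == dev && i->ino == ino) {
            i->count++;
            return i;
        }
    return NULL;                       /* must be read from the filesystem */
}

int main(void)
{
    static struct vfs_inode_sketch root = { 0x301, 2, 1, NULL };

    hash_table[hash(root.dev, root.ino)] = &root;
    printf("inode 2: cache %s\n", cache_lookup(0x301, 2) ? "hit" : "miss");
    printf("inode 7: cache %s\n", cache_lookup(0x301, 7) ? "hit" : "miss");
    return 0;
}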

If the inode is found in the cache, its count is incremented to show that it has another user and filesystem access continues. Otherwise, a free VFS inode must be found so that the filesystem can read the inode into memory. VFS has a number of ways of getting a free inode. If the system is allowed to allocate more VFS inodes, it does so: it allocates kernel pages and breaks them up into new, free inodes which are put into the inode list. All of the system's VFS inodes are in a list pointed at by first_inode as well as in the inode hash table. If the system already has all of the inodes it is allowed, it must find an inode that can be reused. Good candidates are inodes with a count of zero: this means that the system is not currently using them. Really important VFS inodes, such as a filesystem's root inode, always have a count greater than zero and so are never candidates for reuse. Once a reuse candidate has been located, it is cleaned up. The VFS inode may be dirty, in which case it must be written back to the filesystem, or it may be locked, in which case the system must wait for it to be unlocked before continuing. The candidate VFS inode must be cleaned up before it can be reused.

Once a new VFS inode has been found, a filesystem-specific routine is called to fill it with information read from the underlying real filesystem. While it is being filled, the new VFS inode has a count of one and is locked, so that nothing else can access it until it contains valid information.

To get the VFS inode it actually needs, the filesystem may have to access several other inodes. This happens when reading a directory: only the inode of the final directory is needed, but the inodes of the intermediate directories must also be read. As the VFS inode cache is used and filled up, the less-used inodes are discarded and the more-used inodes remain in the cache.

9.2.9 The Directory Cache

To speed up access to frequently used directories, VFS maintains a cache of directory entries. When the real filesystem looks for directories, the details of those directories are added to the directory cache. The next time you look up the same directory, such as listing or opening a file in it, it will be found in the directory cache. Only short directory entries (up to 15 characters) are cached, but this is reasonable since shorter directory names are the most commonly used. For example: /usr/X11R6/bin is accessed very frequently when the X server starts.

See fs/dcache.c

The directory cache consists of a hash table, each entry of which points to a list of directory cache entries with the same hash value. The hash function uses the device number of the device holding the filesystem and the directory's name to calculate an offset, or index, into the hash table. This allows cached directory entries to be found quickly. A cache is of no use if lookups in it take too long, or fail to find entries at all.

To keep the cache valid and up to date, the VFS maintains lists of least recently used (LRU) directory cache entries. When a directory entry is first put into the cache, that is, when it is first looked up, it is added to the end of the first-level LRU list. If the cache is full, this displaces an entry at the front of that LRU list. When the directory entry is accessed again, it is moved to the end of the second LRU list; again, this may displace a cached directory entry at the front of the second-level LRU list. This displacement of entries from the front of the two lists is not a problem: entries are only at the front of a list because they have not been accessed recently; if they had been, they would be near the end. The entries in the second-level LRU list are safer than those in the first-level list, because these entries have not only been looked up but have also been repeatedly referenced.

9.3 The Buffer Cache

Mounted filesystems generate a large number of requests to their block devices to read and write data blocks. All block data read and write requests are given to the device drivers in the form of buffer_head data structures via standard kernel routine calls. These give all of the information that the device drivers need: the device identifier uniquely identifies the device, and the block number tells the driver which block to read. All block devices are viewed as linear collections of blocks of the same size. To speed up access to the physical block devices, Linux maintains a cache of block buffers. All of the block buffers in the system are kept in this buffer cache, even the new, unused buffers. This cache is shared between all of the physical block devices: at any one time there are many block buffers in the cache, belonging to any one of the system's block devices and often in different states. If valid data is available in the buffer cache, this saves the system an access to a physical device. Any block buffer that has been used to read data from, or write data to, a block device goes into the buffer cache. Over time it may be removed from the cache to make way for more deserving buffers, or it may remain in the cache because it is frequently accessed.

Block buffers within the cache are uniquely identified by the device identifier and the block number of the buffer. The buffer cache is composed of two functional parts. The first part is the lists of free block buffers, one list for each supported buffer size. The system's free block buffers are queued onto these lists when they are first created or when they are discarded. The currently supported buffer sizes are 512, 1024, 2048, 4096 and 8192 bytes. The second functional part is the cache itself. This is a hash table, a vector of pointers to chains of buffers that have the same hash index. The hash index is generated from the device identifier and the block number of the data block. Figure 9.7 shows the hash table together with a few entries. Block buffers are either in one of the free lists or in the buffer cache. When they are in the buffer cache, they are also queued onto LRU lists, one per buffer type, which the system uses when performing operations on buffers of a given type, for example writing buffers containing new data out to disk. A buffer's type reflects its state, and Linux currently supports the following types:

clean Unused, new buffers
locked Locked buffers, waiting to be written
dirty Dirty buffers. These contain new, valid data that will be written to disk, but have not yet been scheduled for writing
shared Shared buffers
unshared Buffers that were once shared but are now unshared

Whenever the filesystem needs to read a buffer from its underlying physical device, it tries to get a block from the buffer cache. If it cannot get a buffer from the buffer cache, it takes a clean buffer from the appropriately sized free list, and this new buffer goes into the buffer cache. If the buffer it needs is already in the buffer cache, it may or may not be up to date. If it's not up-to-date, or if it's a new block buffer, the filesystem must request the device driver to read the appropriate block from disk.
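
The decision just described can be sketched as follows: hash on the (device, block) pair, walk the matching chain, and fall back to the free list of the right size if nothing is found. The structure, names and hash function here are illustrative assumptions, not the kernel's buffer cache code.

#include <stddef.h>
#include <sys/types.h>

#define BUF_HASH_SIZE 613                   /* illustrative table size */

struct buffer_sketch {
    dev_t          b_dev;                   /* device this block lives on            */
    unsigned long  b_block;                 /* block number on that device           */
    size_t         b_size;                  /* 512, 1024, 2048, 4096 or 8192         */
    int            b_uptodate;              /* does the data match what is on disk?  */
    struct buffer_sketch *b_next;           /* next buffer on the same hash chain    */
};

static struct buffer_sketch *hash_table[BUF_HASH_SIZE];

static unsigned int buf_hash(dev_t dev, unsigned long block)
{
    return (unsigned int)((block ^ (unsigned long)dev) % BUF_HASH_SIZE);
}

/* Return the cached buffer for (dev, block), or NULL if the caller must
 * take a clean buffer from the free list of the right size and ask the
 * device driver to read the block from disk. */
struct buffer_sketch *find_buffer(dev_t dev, unsigned long block)
{
    struct buffer_sketch *bh;

    for (bh = hash_table[buf_hash(dev, block)]; bh != NULL; bh = bh->b_next)
        if (bh->b_dev == dev && bh->b_block == block)
            return bh;
    return NULL;
}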

Like all caches, the buffer cache must be maintained so that it operates efficiently and distributes cache entries fairly among the block devices that use the buffer cache. Linux uses the core daemon bdflush to perform a lot of cleanup on this buffer, but some other things are done automatically in the process of using the buffer.

9.3.1 The bdflush Kernel Daemon (the core daemon bdflush)

The core daemon bdflush is a simple kernel daemon that provides a dynamic response to the system having too many dirty buffers (buffers containing data that must at some point be written out to disk). It starts as a kernel thread at system boot time and, rather confusingly, calls itself "kflushd", which is the name you will see when you use ps to show the processes in the system. Mostly this daemon sleeps, waiting for the number of dirty buffers in the system to grow too large. As buffers are allocated and discarded, the number of dirty buffers in the system is checked and bdflush is woken. The default threshold is 60%, but bdflush is also woken if the system is badly in need of buffers. This value can be examined and set with the update command:

# update -d
bdflush version 1.4
0: 60 Max fraction of LRU list to examine for dirty blocks
1: 500 Max number of dirty blocks to write each time bdflush activated
2: 64 Num of clean buffers to be loaded onto free list by refill_freelist
3: 256 Dirty block threshold for activating bdflush in refill_freelist
4: 15 Percentage of cache to scan for free clusters
5: 3000 Time for data buffers to age before flushing
6: 500 Time for non-data (dir, bitmap, etc) buffers to age before flushing
7: 1884 Time buffer cache load average constant
8: 2 LAV ratio (used to determine threshold for buffer fratricide).

All of the dirty buffers are linked into the BUF_DIRTY LRU list whenever they are made dirty by having data written to them, and bdflush tries to write a reasonable number of them out to their disks. Again, this number can be examined and set with the update command; the default is 500 (see the example above).

9.3.2 The update Process

The update command is more than just a command, it is also a daemon. When run as superuser (during system initialization), it periodically flushes all of the older dirty buffers out to disk. It does this by calling a system service routine that does more or less the same thing as bdflush. Whenever a dirty buffer is generated, it is tagged with the system time at which it should be written out to its own disk. Every time update runs, it looks at all of the dirty buffers in the system, looking for ones whose flush time has expired. Every expired buffer is written out to disk.

See fs/buffer.c sys_bdflush()

9.3.3 The /proc File System

The /proc filesystem really shows the power of the Linux virtual filesystem. It does not actually exist (yet another of Linux's tricks): neither /proc nor its subdirectories and files actually exist. So why can you cat /proc/devices? The /proc filesystem, like a real filesystem, registers itself with the virtual filesystem, but when its files and directories are opened and the VFS makes calls requesting their inodes, the /proc filesystem creates those files and directories on the fly from information held within the kernel. For example, the kernel's /proc/devices file is generated from the kernel's data structures describing its devices.

The /proc filesystem represents a user-readable window into the kernel's internal workspace. Some Linux subsystems, such as the Linux kernel modules described in Section 12, create entries in the /proc filesystem.

9.3.4 Device Special Files

Linux, like all versions of Unix, presents its hardware devices as special files. For example, /dev/null is the null device. A device file does not use any data space in the filesystem; it is only an access point to the device driver. Both the EXT2 filesystem and the Linux VFS implement device files as a special type of inode. There are two types of device file: character and block special files. Within the kernel itself, the device drivers implement the basic file operations: you can open and close them, and so on. Character devices allow I/O operations in character mode, while block devices require that all I/O go through the buffer cache. When an I/O request is made on a device file, it is forwarded to the appropriate device driver within the system. Often this is not a real device driver but a pseudo-device driver for some subsystem, such as the SCSI device driver layer. Device files are referenced by a major device number, which identifies the device type, and a minor device number, which identifies a unit, or instance, of that major type. For example, the IDE disks on the first IDE controller in the system have a major number of 3, and the first partition of such an IDE disk has a minor number of 1, so ls -l /dev/hda1 outputs

$ brw-rw---- 1 root disk 3, 1 Nov 24 15:09 /dev/hda1

See all Linux major numbers in /include/linux/major.h

In the kernel, every device is uniquely described by a kdev_t data type. This type is two bytes long, one byte holding the minor device number and the other the major device number. The IDE device above is held as 0x0301 within the kernel. An EXT2 inode that represents a block or character device keeps the device's major and minor numbers in its first direct block pointer. When it is read by the VFS, the i_rdev field of the VFS inode data structure representing it is set to the correct device identifier.

See include/linux/kdev_t.h
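
The split of a two-byte device identifier such as 0x0301 into its major and minor parts can be demonstrated with a couple of macros. The macro names below are made up for the example; they simply mirror the one-byte-each convention described above.

#include <stdio.h>

/* Old-style 16-bit device identifier: major in the high byte,
 * minor in the low byte (illustrative, mirroring the convention
 * described above). */
#define SKETCH_MAJOR(dev)  (((dev) >> 8) & 0xff)
#define SKETCH_MINOR(dev)  ((dev) & 0xff)

int main(void)
{
    unsigned int dev = 0x0301;   /* /dev/hda1 from the example above */
    printf("major = %u, minor = %u\n", SKETCH_MAJOR(dev), SKETCH_MINOR(dev));
    /* prints: major = 3, minor = 1 */
    return 0;
}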

10. Networks

Linux and networking are almost synonymous. In fact Linux is a product of the Internet or WWW. Its developers and users use the web to exchange information, ideas, and code, and Linux itself is often used to support the networking needs of some organizations. This subsection describes how Linux supports the network protocols collectively known as TCP/IP.

The TCP/IP protocols were designed to support communication between the computers connected to ARPANET, an American research network funded by the US government. ARPANET pioneered networking concepts such as packet switching and protocol layering, where one protocol uses the services provided by another. ARPANET was retired in 1988, but its successors (NSF NET and the Internet) have grown even larger. What is now known as the World Wide Web grew out of ARPANET, and it is itself supported by the TCP/IP protocols. Unix was used extensively on ARPANET, and the first released networking version of Unix was 4.3BSD. Linux's network implementation is modelled on 4.3BSD: it supports BSD sockets (with some extensions) and the full range of TCP/IP networking. This programming interface was chosen because of its popularity and because it helps port applications between Linux and other Unix platforms.

10.1 An Overview of TCP/IP Networking

This section gives an overview of the main principles of TCP/IP networking. This is not an exhaustive description. For a more detailed description, read Reference Book 10 (Appendix).

In an IP network, every machine is assigned an IP address, a 32-bit number that uniquely identifies the machine. The WWW is a very large, and growing, IP network, and every machine connected to it is assigned a unique IP address. IP addresses are represented by four numbers separated by dots, for example 16.42.0.9. An IP address is actually divided into two parts: the network address and the host address. The sizes of these parts can vary (there are several classes of IP address); taking 16.42.0.9 as an example, the network address would be 16.42 and the host address 0.9. The host address can be further subdivided into a subnetwork address and a host address. Taking 16.42.0.9 again, the subnetwork address would be 16.42.0 and the host address 9. This subdivision of IP addresses allows organizations to divide up their own networks. For example, assuming 16.42 is the network address of ACME Computer Corporation, 16.42.0 could be subnet 0 and 16.42.1 subnet 1. These subnets could be in separate buildings, perhaps connected by leased telephone lines or even microwave links. IP addresses are assigned by the network administrator, and having IP subnetworks is a good way of distributing the administration of the network. Administrators of IP subnets are free to allocate IP addresses within their own subnets.

IP addresses, however, are hard to remember, whereas names are much easier: linux.acme.com is easier to remember than 16.42.0.9. Some mechanism is needed to convert network names into IP addresses. Names can be held statically in the /etc/hosts file, or Linux can be told to ask a distributed name server (DNS) to resolve them. In that case the local host must know the IP address of one or more DNS servers, which are specified in /etc/resolv.conf.
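
From an application's point of view this translation is done by the C library resolver, which consults /etc/hosts and the name servers listed in /etc/resolv.conf on the application's behalf. Below is a minimal sketch using the standard gethostbyname() call; linux.acme.com is the illustrative name from the text and will not actually resolve.

#include <stdio.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    /* The resolver checks /etc/hosts and the name servers listed in
     * /etc/resolv.conf. "linux.acme.com" is only an illustrative name. */
    struct hostent *he = gethostbyname("linux.acme.com");
    if (he == NULL) {
        herror("gethostbyname");
        return 1;
    }
    printf("%s -> %s\n", he->h_name,
           inet_ntoa(*(struct in_addr *)he->h_addr_list[0]));
    return 0;
}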

Whenever you connect to another machine, say when reading a web page, its IP address is used to exchange data with that machine. This data is contained in IP packets, each of which has an IP header containing the IP addresses of the source and destination machines, a checksum and other useful information. The checksum is derived from the data of the IP packet and allows the receiver to determine whether the packet was corrupted in transmission, perhaps by a noisy telephone line. The data transmitted by an application may have been broken down into smaller, more manageable pieces. The size of an IP data packet varies depending on the connection medium: Ethernet packets are generally bigger than PPP packets. The destination host must reassemble the data packets before handing the data to the receiving application. You can see this fragmentation and reassembly of data graphically if you access a web page containing a lot of graphical images via a moderately slow serial link.

Hosts connected to the same IP subnet can directly send IP packets to each other, while other IP packets must be sent through a special host (gateway). Gateways (or routers) are connected to more than one subnet, and they re-send IP packets received on one subnet to another. For example, if subnets 16.42.1.0 and 16.42.0.0 are connected through a gateway, then all packets sent from subnet 0 to subnet 1 must be sent to the gateway before they can be forwarded. The local host establishes a routing table so that it can send the IP packets to be forwarded to the correct machine. For each IP destination, there is an entry in the routing table that tells Linux which host to send the IP packet to before reaching the destination. These routing tables are dynamic and constantly change as applications use the network and the network topology changes.

IP is used by other, higher-level protocols to carry their data. The Transmission Control Protocol (TCP) is a reliable end-to-end protocol that uses IP to transmit and receive its own packets. Just as IP packets have their own header, TCP has its own header. TCP is a connection-oriented protocol: two networked applications are connected by a single virtual connection, even though there may be many subnetworks, gateways and routers between them. TCP reliably transmits and receives data between the two applications and guarantees that no data is lost or duplicated. When TCP transmits its packets using IP, the data contained within the IP packet is the TCP packet itself. The IP layer of each communicating host is responsible for transmitting and receiving IP packets. The User Datagram Protocol (UDP) also uses the IP layer to transport its packets, but unlike TCP, UDP is not a reliable protocol: it offers only a datagram service. This use of IP by other protocols means that when an IP packet is received, the receiving IP layer must know which upper-layer protocol to give the data contained in the packet to. For this purpose, the header of every IP packet has a byte containing a protocol identifier. When TCP asks the IP layer to transmit an IP packet, the packet's header states that it contains a TCP packet; the receiving IP layer uses this protocol identifier to decide which protocol to pass the received data up to, in this case the TCP layer. When applications communicate via TCP/IP, they must specify not only the target's IP address but also the port address of the target application. A port address uniquely identifies an application, and standard network applications use standard port addresses: web servers, for example, use port 80. These registered port addresses can be seen in /etc/services.
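
The protocol-identifier dispatch described above can be sketched with a stripped-down IPv4 header. The struct below is a simplification (the real header also carries the version/IHL, total length, fragmentation fields and options), and the field names are illustrative; the protocol numbers 6 (TCP), 17 (UDP) and 1 (ICMP) are the standard values.

#include <stdint.h>
#include <stdio.h>

/* Simplified, illustrative IPv4 header: real headers also carry
 * version/IHL, total length, fragmentation fields and options. */
struct ip_header_sketch {
    uint8_t  ttl;
    uint8_t  protocol;     /* 6 = TCP, 17 = UDP, 1 = ICMP, ...      */
    uint16_t checksum;     /* covers the header                     */
    uint32_t saddr;        /* source IP address                     */
    uint32_t daddr;        /* destination IP address                */
};

static void deliver(const struct ip_header_sketch *iph)
{
    switch (iph->protocol) {        /* the receiving IP layer's dispatch */
    case 6:  printf("hand payload to TCP\n");  break;
    case 17: printf("hand payload to UDP\n");  break;
    default: printf("unknown upper protocol %u\n", iph->protocol);
    }
}

int main(void)
{
    struct ip_header_sketch h = { 64, 6, 0, 0, 0 };
    deliver(&h);                    /* prints: hand payload to TCP */
    return 0;
}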

Protocol layering does not stop with TCP, UDP and IP. The IP protocol layer itself uses many different physical media to transport IP packets to other IP hosts, and these media may add their own protocol headers. Examples are the Ethernet layer, PPP and SLIP. An Ethernet network allows many hosts to be simultaneously connected to a single physical cable. Every transmitted Ethernet frame can be seen by all connected hosts, so every Ethernet device has a unique address: any Ethernet frame sent to that address is received by the addressed host and ignored by all the other hosts connected to the network. These unique addresses are built into each Ethernet device when it is manufactured, usually held in the SROM of the Ethernet card. An Ethernet address is 6 bytes long, for example 08-00-2b-00-49-4A. Some Ethernet addresses are reserved for multicast purposes, and Ethernet frames sent with these destination addresses are received by all hosts on the network. As Ethernet frames can carry many different protocols (as data), they, like IP packets, contain a protocol identifier in their headers. This allows the Ethernet layer to correctly receive IP packets and pass them on to the IP layer.

In order to send an IP packet over a multi-connection medium such as Ethernet, the IP layer must find the Ethernet address of the destination IP host. This is because IP addresses are simply an addressing concept, while the Ethernet devices themselves have their own physical addresses. IP addresses can be assigned and reassigned at will by the network administrator, but the network hardware responds only to Ethernet frames carrying its own physical address, or to special multicast addresses that all machines must receive. Linux uses the Address Resolution Protocol (ARP) to allow machines to translate IP addresses into real hardware addresses such as Ethernet addresses. A host wishing to know the hardware address associated with an IP address sends an ARP request packet containing the IP address that it wishes translated to a multicast address that all points on the network can receive. The target host with that IP address responds with an ARP reply that contains its physical hardware address. ARP is not just restricted to Ethernet devices; it can also resolve IP addresses for other physical media, for example FDDI. Devices that cannot ARP are marked so that Linux does not attempt to ARP for them. There is also the reverse function, Reverse ARP (RARP), which translates physical addresses into IP addresses. This is used by gateways, which respond to ARP requests on behalf of IP addresses in remote networks.

10.2 The Linux TCP/IP Networking Layers

Mirroring the network protocols themselves, Figure 10.2 shows Linux's implementation of the internet protocol address family as a series of connected layers of software. BSD sockets are supported by generic socket management software that is concerned only with BSD sockets. Supporting them is the INET socket layer, which manages the communication endpoints for the IP-based protocols TCP and UDP. UDP is a connectionless protocol, whereas TCP is a reliable end-to-end protocol. When UDP packets are transmitted, Linux neither knows nor cares whether they arrive safely at their destination. TCP packets are numbered, and both ends of a TCP connection make sure that transmitted data is received correctly. The IP layer contains the implementation of the Internet Protocol: this code adds the IP header to transmitted data and knows how to route incoming IP packets to either the TCP or UDP layer. Underneath the IP layer, supporting Linux's networking, are the network devices, for example PPP and Ethernet. Network devices do not always represent physical devices: some, like the loopback device, are purely software. Unlike standard Linux devices created with the mknod command, network devices appear only if the underlying software has found and initialized them. You will only see the device file /dev/eth0 once you have built a kernel with the appropriate Ethernet device driver in it. The ARP protocol sits between the IP layer and the protocols that need ARP to resolve addresses.

10.3 The BSD Socket Interface

This is a general interface that not only supports various forms of networking, but also an inter-process communication mechanism. A socket describes one end of a communication connection, and two communicating processes will each have a socket describing their own part of the communication connection between them. Sockets can be imagined as a special form of pipes, but unlike pipes, sockets have no limit to the amount of data they can hold. Linux supports several types of sockets, these classes are called address families. This is because each class has its own communication addressing method. Linux supports the following socket address families or domains:

UNIX Unix domain sockets
INET The Internet address family, supporting communication via the TCP/IP protocols
AX25 Amateur radio X25
IPX Novell IPX
APPLETALK Appletalk DDP
X25 X25

There are several types of socket, each representing the type of service supported on the connection. Not all address families support all types of service. The Linux BSD socket layer supports the following socket types:

Stream This kind of socket provides reliable, bidirectional, sequenced data streams, guaranteeing that data is not lost, corrupted or duplicated in transit. Stream sockets are supported by the TCP protocol of the INET address family.

Datagram This kind of socket also provides bidirectional data transfer but, unlike stream sockets, there is no guarantee that the messages will arrive. Even if they do arrive, there is no guarantee that they will arrive in order, or that they will not be duplicated or corrupted. This type of socket is supported by the UDP protocol of the Internet address family.

Raw This allows a process direct (hence the name "raw") access to the underlying protocols. It is, for example, possible to open a raw socket to an Ethernet device and observe the raw IP data traffic.

Reliable Delivered Messages This is much like a datagram but the data is guaranteed to arrive

Sequenced Packets are like stream sockets but the data packet size is fixed

Packet This is not a standard BSD socket type, it is a Linux-specific extension that allows processes to access packets directly at the device layer

Processes that communicate using sockets use a client-server model. The server provides the service, and the client consumes the service. An example of this is a web server that serves web pages and a web client (or browser) that reads those pages. A server using sockets first creates a socket and then binds a name to it. The format of this name is related to the socket's address family, which is the server's local address. The name or address of the Socket is specified using the sockaddr data structure. An INET socket is bound to an IP port address. The registered port number can be seen in /etc/services: for example, the port for the web server is 80. After binding an address to the socket, the server listens to incoming connection requests to the bound address. The initiator of the request, the client, creates a socket and executes a connection request on it, specifying the target address of the server. For an INET socket, the server's address is its IP address and its port address. These incoming requests have to go through a number of protocol layers, find their way, and then wait on the server's listening port. Once the server receives an incoming request, it can accept or reject it. To accept an incoming request, the server must create a new socket to accept it. Once a socket has been used to listen for incoming connection requests, it can no longer be used to support a connection. After the connection is established, both ends are free to send and receive data. Finally, when a connection is no longer needed, it can be closed. Care must be taken to ensure correct processing of datagrams being transmitted.
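
A minimal sketch of the server side of this sequence, using the BSD socket calls in the order described: create a socket, bind a name to it, listen, then accept. Error handling is trimmed and port 8080 is an arbitrary example value.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    /* Create an INET stream (TCP) socket. */
    int listener = socket(AF_INET, SOCK_STREAM, 0);

    /* Bind a name (local address) to it: any local interface, port 8080. */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);            /* arbitrary example port */
    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    /* Listen for incoming connection requests, then accept one.
     * accept() hands back a new socket for the connection; the
     * listening socket itself never carries the connection. */
    listen(listener, 5);
    int conn = accept(listener, NULL, NULL);
    if (conn >= 0) {
        const char msg[] = "hello\n";
        write(conn, msg, sizeof(msg) - 1);
        close(conn);
    }
    close(listener);
    return 0;
}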

Exactly what the operations on a BSD socket mean depends on its underlying address family. Setting up a TCP/IP connection is very different from setting up an amateur radio X.25 connection. Like the virtual filesystem, Linux abstracts the socket interface: the BSD socket layer used by applications is supported by software specific to each independent address family. When the kernel is initialized, the address families built into it register themselves with the BSD socket interface. Later, as applications create and use BSD sockets, an association is made between the BSD socket and its supporting address family. This association is made via cross-linked data structures and tables of address-family-specific support routines. For example, whenever an application creates a new socket, the BSD socket interface uses the socket creation routine specific to that address family.

When the kernel is configured, a number of address families and protocols are built into the protocols vector table. Each is represented by its name, for example "INET", and the address of its initialization routine. When the socket interface is initialized at startup, each protocol's initialization code is called. For the socket address families, this results in their registering a set of protocol operations: routines, each of which performs a particular operation specific to that address family. The registered protocol operations are kept in the pops vector table, a table of pointers to proto_ops data structures. The proto_ops data structure consists of the address family type and a set of pointers to socket operation routines specific to that address family. The pops vector table is indexed by the address family identifier, for example the Internet address family identifier (AF_INET, which is 2).

see include/linux/net.h

10.4 The INET Socket Layer

The INET socket layer supports the internet address family, which contains the TCP/IP protocols. As discussed above, these protocols are layered, each one using the services of another. Linux's TCP/IP code and data structures reflect this layering. Its interface with the BSD socket layer is through the set of internet-address-family socket operations that it registers with the BSD socket layer when the network is initialized. These are kept in the pops vector table along with the other registered address families. The BSD socket layer does its work by calling the INET layer's socket support routines from the registered proto_ops data structure. For example, a BSD socket create request whose address family is INET uses the underlying INET socket create function. The BSD socket layer passes the socket data structure representing the BSD socket to the INET layer in each of these operations. Rather than cluttering the BSD socket with TCP/IP-specific information, the INET socket layer uses its own data structure, the sock, which it links to the BSD socket data structure; Figure 10.3 shows this linkage. The data pointer in the BSD socket is used to link the sock data structure to the BSD socket data structure, which means that subsequent INET socket calls can easily retrieve the sock data structure. The sock data structure's protocol operations pointer is also set up at creation time, and it depends on the protocol requested. If TCP is requested, the sock data structure's protocol operations pointer will point to the set of TCP protocol operations needed for the TCP connection.

see include/net/sock.h

10.4.1 Creating a BSD Socket

The system call that creates a new socket passes in its address family identifier, the socket type and the protocol. First, the requested address family is used to look up a matching entry in the pops vector table. The address family may be implemented as a kernel module, in which case the kerneld daemon must load the module before processing can continue. A new socket data structure is then allocated to represent the BSD socket. Actually, the socket data structure is physically part of the VFS inode data structure, so allocating a socket really means allocating a VFS inode. This may seem strange unless you consider that sockets can be operated on in just the same way that ordinary files can. As all files are represented by a VFS inode data structure, in order to support file operations a BSD socket must also be represented by a VFS inode data structure.

The newly created BSD socket data structure contains a pointer to the socket routine associated with the address family, this pointer is set to the proto_ops data structure taken from the pops vector table. Its type is set to the requested socket type: one of SOCK_STREAM, SOCK_DGRAM, etc., and then the creation routine associated with the address family is called with the address stored in the proto_ops data structure.

A free file descriptor is then allocated from the current process's fd vector table, and the file data structure that it points at is initialized. This includes setting the file operations pointer to point to the set of BSD socket file operations supported by the BSD socket interface. Any future operations are directed to the socket interface, which in turn passes them to the supporting address family by calling that address family's operation routines.

10.4.2 Binding an Address to an INET BSD Socket

In order to be able to listen for incoming internet connection requests, each server must create an INET BSD socket and bind its address to it. The bind operation is mostly handled by the INET socket layer, with some support from the underlying TCP and UDP protocol layers. A socket that has had an address bound to it cannot be used for any other communication; its state must therefore be TCP_CLOSE. The sockaddr passed to the bind operation contains the IP address to bind to and, optionally, a port number. Normally the bound IP address is one of the addresses assigned to a network device supporting the INET address family, and the interface must be up and able to be used. You can see which network interfaces are currently active in the system using the ifconfig command. The IP address may also be the IP broadcast address, all 1s or all 0s; these are special addresses that mean "send to everybody". The IP address could also be set to any IP address if the machine is acting as a transparent proxy or firewall; however, only processes with superuser privileges may bind to any IP address. The bound IP address is saved in the recv_addr and saddr fields of the sock data structure, used respectively for hash lookups and as the sending IP address. The port number is optional; if it is not specified, a free one is requested from the supporting network. By convention, port numbers less than 1024 cannot be used by processes without superuser privileges. If the underlying network does allocate a port number, it always allocates one greater than 1024.

When the underlying network device receives packets, these packets must be forwarded to the correct INET and BSD sockets to be processed. To this end, UDP and TCP maintain hash tables, which are used to look up the addresses of incoming IP messages and forward them to the correct socket/sock pair. TCP is a connection-oriented protocol, so processing TCP packets contains more information than processing UDP packets.

UDP maintains a hash table of allocated UDP ports, udp_table. This consists of pointers to sock data structures, indexed by a hash function based on the port number. As the UDP hash table is much smaller than the number of permissible port numbers (udp_hash has only 128 entries, UDP_HTABLE_SIZE), some entries in the table point to a chain of sock data structures linked together by each sock's next pointer.

TCP is more complex, as it maintains several hash tables. However, TCP does not actually add the binding sock data structure into its hash tables during the bind operation; it merely checks that the requested port number is not currently being used. The sock data structure is added to TCP's hash tables during the listen operation.

10.4.3 Making a Connection to an INET BSD Socket

Once a socket is created, it can be used to establish outgoing connection requests if it is not listening for incoming connection requests. For connectionless protocols like UDP, this socket operation doesn't have to do much, but for connection-oriented protocols like TCP, it involves establishing a virtual circuit between two applications.

An outgoing connection can only be made on an INET BSD socket that is in the right state: one that does not already have a connection established and is not being used to listen for incoming connections. This means that the BSD socket data structure must be in the SS_UNCONNECTED state. The UDP protocol does not establish virtual connections between two applications; any messages sent are datagrams, which may or may not reach their destination. It does, however, support the connect operation on a BSD socket. A connect operation on a UDP INET BSD socket simply sets up the address of the remote application: its IP address and its IP port number. Additionally, it sets up a cache of the routing table entry, so that UDP datagrams sent on this BSD socket do not need to check the routing table database again (unless the route becomes invalid). The cached routing information is pointed at by the ip_route_cache pointer in the INET sock data structure. If no addressing information is given, messages sent on this BSD socket automatically use the cached routing and IP address information. UDP moves the sock's state to TCP_ESTABLISHED.
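
A sketch of such a connect on a UDP INET socket: connect() only records the remote address (letting the kernel cache a route), so a later send() needs no explicit destination. The address 16.42.0.9 is the document's example and port 7 is an arbitrary choice.

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);   /* UDP socket */

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(7);                          /* arbitrary example port */
    inet_pton(AF_INET, "16.42.0.9", &peer.sin_addr);   /* document's example address */

    /* For UDP this does not build a virtual circuit; it only fixes the
     * remote address (and lets the kernel cache the route), so plain
     * send() can be used instead of sendto(). */
    connect(s, (struct sockaddr *)&peer, sizeof(peer));
    send(s, "ping", 4, 0);

    close(s);
    return 0;
}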

For a connect operation on a TCP BSD socket, TCP must establish a TCP message containing the connection information and send it to the given IP target. This TCP message includes information about the connection: a unique starting message sequence number, the maximum size of messages that the originating host can manage, the window size for sending and receiving, and so on. In TCP, all messages are numbered, and the initial sequence number is used as the first message number. Linux chooses a reasonable random number to avoid malicious protocol attacks. Every message sent from one end of a TCP connection and successfully received by the other end is acknowledged, telling it that it arrived successfully and undamaged. Unacknowledged messages will be resent. The send and receive window size is the number of messages allowed before acknowledging. If the maximum message size supported by the recipient's network device is smaller, the connection uses the smallest of the two. An application performing an outgoing TCP connection request must now wait for a response from the target application to accept or reject the connection request. For the TCP sock expecting incoming messages, it is added to tcp_listening_hash so that incoming TCP messages can be directed to this sock data structure. TCP also starts timers so that outgoing connection requests will time out if the target application does not respond to the request.

10.4.4 Listening on an INET BSD Socket

Once a socket has a bound address, it can listen for incoming connection requests specifying the bound address. A network application can listen directly on a socket without binding an address, in which case the INET socket layer finds an unused port number (for this protocol) and automatically binds it to the socket. The socket's listen function puts the socket into the TCP_LISTEN state and performs the required network-related work while allowing incoming connections.

For UDP sockets, changing the socket's state is enough, but TCP now adds the socket's sock data structure into two hash tables, as it is now active: the tcp_bound_hash table and the tcp_listening_hash table. Both are indexed via a hash function based on the IP port number.

Whenever an incoming TCP connection request is received for an active listening socket, TCP builds a new sock data structure to represent it. This sock data structure will become the bottom half of the TCP connection when it is eventually accepted. TCP also clones the incoming sk_buff containing the connection request and queues it onto the receive_queue of the listening sock data structure. The cloned sk_buff contains a pointer to the newly created sock data structure.

10.4.5 Accepting Connection Requests

UDP does not support the concept of connections; accepting INET socket connection requests applies only to the TCP protocol. An accept operation on a listening socket causes a new socket data structure to be cloned from the original listening socket. The accept operation is then passed to the supporting protocol layer, in this case INET, to accept any incoming connection requests. The INET protocol layer will fail the accept operation if the underlying protocol, say UDP, does not support connections. Otherwise the accept request is passed through to the real protocol, in this case TCP. The accept operation can be either blocking or non-blocking. In the non-blocking case, if there are no incoming connections to accept, the accept operation fails and the newly created socket data structure is thrown away. In the blocking case, the network application performing the accept operation is added to a wait queue and suspended until a TCP connection request is received. Once a connection request has been received, the sk_buff containing the request is discarded and the sock data structure is returned to the INET socket layer, where it is linked to the new socket data structure created earlier. The file descriptor (fd) of the new socket is returned to the network application, and the application can then use that file descriptor in socket operations on the newly created INET BSD socket.

10.5 The IP Layer

10.5.1 Socket Buffers

One of the problems with the network protocol layers is that each layer needs to add protocol headers and trailers to the data as it is transmitted, and to remove them as it processes received data. This makes passing data buffers between the protocol layers difficult, as each layer needs to find where its particular protocol headers and trailers are. One solution would be to copy buffers at each layer, but that would be inefficient. Instead, Linux uses socket buffers, or sk_buffs, to pass data between the protocol layers and the network device drivers. Sk_buffs contain pointer and length fields that allow each protocol layer to manipulate the application data via standard functions or methods.

Figure 10.4 shows the sk_buff data structure: each sk_buff has its associated piece of data. Sk_buff has four data pointers for manipulating and managing socket buffer data.

See include/linux/skbuff.h

Head points to the beginning of the data area in memory. Determined when the sk_buff and its associated data block are allocated.

Data points to the current start of the protocol data so far. This pointer varies with the protocol layer that currently owns the sk_buff.

Tail points to the current end of the protocol data. Again, this pointer varies with the protocol layer you have.

End points to the end of the data area in memory. This is determined when this sk_buff is allocated.

There are also two length fields, len and truesize, which describe the length of the current protocol message and the total length of the data buffer, respectively. Sk_buff handling code provides standard mechanisms for adding and removing protocol headers and trailers from application data. This code safely manipulates the data, tail and len fields in sk_buff.

Push This moves the data pointer towards the start of the data area and increments the len field. It is used to add data or protocol headers to the front of the data being transmitted.

See include/linux/skbuff.h skb_push()

Pull This moves the data pointer away from the start, towards the end of the data area, and decrements the len field. It is used to remove data or protocol headers from the received data.

See include/linux/skbuff.h skb_pull()

Put This moves the tail pointer towards the end of the data area and increments the len field. It is used to add data or protocol information at the end of the data being transmitted.

See include/linux/skbuff.h skb_put()

Trim This moves the tail pointer towards the start of the data area and decrements the len field. It is used to remove data or protocol trailers from the received data.

See include/linux/skbuff.h skb_trim()
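
The pointer arithmetic behind the operations above can be mocked up in a few lines. This is an illustration of the idea only, not the kernel's sk_buff API: reserve some headroom, put the application data in, then push a protocol header onto the front of it.

#include <stdio.h>
#include <string.h>

/* Illustrative mock of the sk_buff pointer arithmetic, not the kernel API. */
struct skb_sketch {
    unsigned char *head;   /* start of the allocated data area     */
    unsigned char *data;   /* current start of the protocol data   */
    unsigned char *tail;   /* current end of the protocol data     */
    unsigned char *end;    /* end of the allocated data area       */
    size_t len;            /* tail - data                          */
};

static unsigned char *sketch_put(struct skb_sketch *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;   /* append room at the end */
    skb->tail += n;
    skb->len  += n;
    return old_tail;
}

static unsigned char *sketch_push(struct skb_sketch *skb, size_t n)
{
    skb->data -= n;                        /* prepend room at the front */
    skb->len  += n;
    return skb->data;
}

int main(void)
{
    unsigned char buf[128];
    struct skb_sketch skb = { buf, buf + 32, buf + 32, buf + 128, 0 };

    memcpy(sketch_put(&skb, 5), "hello", 5);      /* application data    */
    memcpy(sketch_push(&skb, 8), "TCP-HDR:", 8);  /* lower-layer header  */

    printf("len=%zu, buffer now holds \"%.13s\"\n", skb.len, (char *)skb.data);
    return 0;
}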

The sk_buff data structure also includes some pointers. Using these pointers, this data structure can be stored in the doubly linked list of sk_buff during processing. There are general sk_buff routines that add and remove sk_buffs from the head and tail of these lists.

10.5.2 Receiving IP Packets

Section 8 describes how the Linux network device drivers are built into the kernel and initialized. This produces a series of device data structures linked together in the dev_base list. Each device data structure describes its device and provides a set of callback routines that the network protocol layers call when they need the network driver to do work. Most of these functions are concerned with transmitting data and with the network device's addresses. When a network device receives packets from its network, it must convert the received data into sk_buff data structures. These received sk_buffs are added to the backlog queue by the network drivers as they are received. If the backlog queue grows too large, the received sk_buffs are discarded. The network bottom half is flagged as ready to run, as there is work to do.

See net/core/dev.c netif_rx()

When the network bottom half handler is run by the scheduler, it first processes any network packets waiting to be transmitted, and then processes the backlog queue of sk_buffs, determining which protocol layer each received packet should be passed to. When the Linux network layers were initialized, each protocol registered itself by adding a packet_type data structure onto either the ptype_all list or the ptype_base hash table. The packet_type data structure contains the protocol type, a pointer to a network device, a pointer to the protocol's receive data processing routine and, finally, a pointer to the next packet_type data structure in the list or hash chain. The ptype_all chain is used to snoop all packets received from any network device and is not normally used. The ptype_base hash table is hashed by protocol identifier and is used to decide which protocol should receive the incoming network packet. The network bottom half matches the protocol type of the incoming sk_buff against one or more packet_type entries in either table. The protocol may match more than one entry, for example when snooping all network traffic, in which case the sk_buff is cloned. The sk_buff is then passed to the matching protocol's handling routine.

See net/core/dev.c net_bh()

See net/ipv4/ip_input.c ip_recv()

10.5.3 Sending IP Packets

Packets are transmitted as applications exchange data; they may also be generated by the network protocols themselves to support established connections or to set up new ones. However the data is generated, an sk_buff is built to contain it, and various headers are added as it passes down through the protocol layers.

The sk_buff needs to be passed to a network device for transmission. First, though, a protocol, for example IP, needs to decide which network device to use. This depends on the best route for the packet. For computers connected by modem to a single network, say via the PPP protocol, the routing choice is easy: packets should either go to the local host via the loopback device or to the gateway at the other end of the PPP modem connection. For computers connected to an Ethernet, the choice is harder, as there are many computers connected to the network.

For every IP packet transmitted, IP uses the routing tables to resolve the route to the destination IP address. Each IP destination successfully looked up in the routing tables returns an rtable data structure describing the route to use. This includes the source IP address to use, the address of the network device data structure and, sometimes, a pre-built hardware header. The hardware header is network-device specific and contains the source and destination physical addresses and other media-specific information. If the network device is an Ethernet device, the hardware header would be as shown in Figure 10.1, and the source and destination addresses would be physical Ethernet addresses. The hardware header is cached along with the route, because it must be appended to every IP packet transmitted on this route and building it takes time. The hardware header may contain physical addresses that have to be resolved using the ARP protocol; in that case the outgoing packet is stalled until the address has been resolved. Once it has been resolved and the hardware header built, the header is cached so that future IP packets sent using this interface do not have to ARP again.

see include/net/route.h

10.5.4 Data Fragmentation

Every network device has a maximum packet size; it cannot transmit or receive a data packet larger than that. The IP protocol copes with this by breaking data up into smaller units, each fitting within the packet size that the network device can handle. The IP protocol header includes fragmentation fields containing a flag and the fragment offset.

When an IP packet is ready to be transmitted, IP finds the network device on which to send it; the device is found via the IP routing tables. Each device has a field describing its maximum transfer unit in bytes, the mtu field. If the device's mtu is smaller than the packet size of the IP packet waiting to be transmitted, the IP packet must be broken down into smaller, mtu-sized fragments. Each fragment is represented by an sk_buff; its IP header is marked to show that it is a fragment, together with the offset of this fragment within the original IP packet. The last packet is marked as the last IP fragment. If, during fragmentation, IP cannot allocate an sk_buff, the transmit fails.
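
The arithmetic is straightforward, as the sketch below shows for illustrative numbers (an Ethernet-sized MTU, a 20-byte IP header and a 4000-byte datagram). Real IP records each fragment's offset in 8-byte units, so every fragment except the last carries a multiple of 8 bytes of data.

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers: Ethernet MTU, 20-byte IP header, 4000-byte datagram. */
    const unsigned mtu = 1500, iphdr = 20, total = 4000;

    /* Payload per fragment rounded down to a multiple of 8 bytes, because
     * the IP header stores the fragment offset in 8-byte units. */
    unsigned per_frag = ((mtu - iphdr) / 8) * 8;        /* 1480 */

    for (unsigned off = 0; off < total; off += per_frag) {
        unsigned payload = (total - off < per_frag) ? (total - off) : per_frag;
        int more = (off + payload < total);             /* "more fragments" flag */
        printf("fragment: offset=%u bytes (field=%u), payload=%u, MF=%d\n",
               off, off / 8, payload, more);
    }
    return 0;
}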

Receiving IP fragments is more difficult than sending, because IP fragments may be received in any order, and they must all be received before reassembly. Every time an IP packet is received, it is checked whether it is an IP fragment. Upon receiving the first fragment of a message, IP builds a new ipq data structure and connects to the ipqueue list of IP fragments waiting to be assembled. When more IP fragments are received, the correct ipq data structure is found and a new ipfrag data structure is created to describe the fragment. Each ipq data structure uniquely describes a fragmented IP receive frame, including its source and destination IP addresses, the upper-layer protocol identifier, and the IP frame's identifier. When all the fragments are received, they are assembled together into a single sk_buff and passed to the next protocol layer for processing. Each ipq includes a timer that restarts each time a valid fragment is received. If the timer expires, the ipq data structure and its ipfrag are removed, and it is assumed that the message was lost in transit. The higher-level protocol is then responsible for retransmitting the message.

See net/ipv4/ip_input.c ip_rcv()

10.6 The Address Resolution Protocol (ARP)

The Address Resolution Protocol's job is to translate IP addresses into physical hardware addresses, such as Ethernet addresses. IP needs this translation only when it is about to pass data (in the form of an sk_buff) to a device driver for transmission. It performs various checks to see whether this device needs a hardware header and, if so, whether the hardware header for this packet needs to be rebuilt. Linux caches hardware headers to avoid frequent rebuilding. If the hardware header does need rebuilding, it calls the device-specific hardware header rebuilding routine. All Ethernet devices use the same generic header rebuilding routine, which in turn uses the ARP services to translate the destination IP address into a physical address.

See net/ipv4/ip_output.c ip_build_xmit()

See net/ethernet/eth.c rebuild_header()

The ARP protocol itself is very simple and consists of two message types: ARP request and ARP reply. The ARP request includes the IP address that needs to be translated, and the reply (hopefully) includes the translated IP address and the hardware address. ARP requests are broadcast to all hosts connected to the network, so for an Ethernet all machines connected to the Ethernet can see the ARP request. The machine with the IP address included in the request will respond to the ARP request with an ARP reply containing its own physical address.

The ARP protocol layer in Linux is built around a table of arp_table data structures, each describing a single IP-to-physical-address translation. These entries are created as IP addresses need to be translated and removed as they become stale over time. Each arp_table data structure has the following fields (a rough structural sketch follows the list):

Last used the time that this ARP entry was last used
Last updated the time that this ARP entry was last updated
Flags these describe the entry's state: whether it is complete, and so on
IP address the IP address that this entry describes
Hardware address the translated hardware address
Hardware header a pointer to a cached hardware header
Timer a timer_list entry used to time out ARP requests that get no response
Retries the number of times that this ARP request has been retried
Sk_buff queue a list of sk_buff entries waiting for this IP address to be resolved
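
Pictured as C, one entry looks roughly like the sketch below; the names and types are approximations of the fields listed above, not the kernel's arp_table definition.

#include <time.h>

struct sk_buff_sketch;                     /* packets waiting on this entry       */
struct timer_sketch;                       /* stands in for the timer_list entry  */

/* Rough shape of one ARP table entry; an approximation of the fields
 * listed above, not the kernel's definition. */
struct arp_entry_sketch {
    time_t                   last_used;    /* when the entry was last used        */
    time_t                   last_updated; /* when the entry was last updated     */
    unsigned int             flags;        /* state: complete, permanent, ...     */
    unsigned int             ip_addr;      /* the IP address being translated     */
    unsigned char            hw_addr[6];   /* the resolved hardware address       */
    unsigned char           *hw_header;    /* cached hardware header, if any      */
    struct timer_sketch     *timer;        /* times out unanswered ARP requests   */
    int                      retries;      /* times the ARP request was resent    */
    struct sk_buff_sketch   *queue;        /* sk_buffs waiting for resolution     */
    struct arp_entry_sketch *next;         /* next entry on the same chain        */
};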

The ARP table contains a table of pointers (the arp_tables vector table) linking the arp_table entries together. These entries are cached to speed up access to them. Each entry is looked up using the last two bytes of its IP address as an index into the table, and the chain of entries is followed until the correct entry is found. Linux also caches pre-built hardware headers from arp_table entries, in the form of the hh_cache data structure.

When an IP address translation is requested and there is no corresponding arp_table entry, ARP must send an ARP request message. It creates a new arp_table entry in the table and queues the sk_buff containing the network packet that needs the address translation onto the sk_buff queue of the new entry. It then sends out an ARP request and sets the ARP expiry timer running. If there is no response, ARP retries the request several times; if there is still no response, ARP removes the arp_table entry. Any sk_buff data structures queued waiting for that IP address to be translated are notified, and it is up to the protocol layer transmitting them to cope with the failure. UDP does not care about lost packets, but TCP will attempt to retransmit on an established TCP connection. If the owner of the IP address responds with its hardware address, the arp_table entry is marked as complete, any queued sk_buffs are removed from the queue, and transmission continues. The hardware address is written into the hardware header of each sk_buff.

The ARP protocol layer must also respond to ARP requests that specify its own IP address. It registers its protocol type (ETH_P_ARP), generating a packet_type data structure. This means that all ARP packets received by a network device are passed to it; this includes ARP requests as well as ARP replies. It generates an ARP reply using the hardware address held in the receiving device's device data structure.

Network topologies change over time, and IP addresses may be reassigned to different hardware addresses. For example, some dial-up services assign an IP address to each connection as it is made. In order that the ARP table contains up-to-date entries, ARP runs a periodic timer that looks through all of the arp_table entries to see which have timed out. It is very careful not to remove entries that contain one or more cached hardware headers, as removing these entries is dangerous: other data structures rely on them. Some arp_table entries are permanent and are marked so that they will not be deallocated. The ARP table cannot be allowed to grow too large: each arp_table entry consumes some kernel memory. Whenever a new entry needs to be allocated and the ARP table has reached its maximum size, the table is pruned by searching out the oldest entries and removing them.

10.7 IP Routing

The IP routing function determines where IP packets destined for a specific IP address should be sent. When transmitting IP packets, there are many options. Is the destination reachable? If so, which network device should be used to send it? Is there more than one network device that can be used to reach the destination and which one is best? The information maintained by the IP routing database can answer these questions. There are two databases, the most important being the Forwarding Information Database. This database is an exhaustive list of known IP destinations and their best routes. Another smaller, faster database, the route cache, is used to quickly find routes to IP targets. Like all caches, it must include only the most frequently accessed routes, and its content is derived from the forwarding information database.

Routes are added and deleted via IOCTL requests to the BSD socket interface. These requests are passed on to the specific protocol to process. The INET protocol layer only allows processes with superuser privileges to add and delete IP routes. These routes can be fixed, or they can be dynamic and change over time. Most systems use fixed routes unless they themselves are routers. Routers run routing protocols that constantly check on the available routes to all known IP destinations. Systems that are not routers are known as end systems. The routing protocols are implemented as daemons, for example GATED, and they also add and delete routes via the IOCTL of the BSD socket interface.

10.7.1 The Route Cache

Whenever an IP route is looked up, the route cache is first checked for a matching route. If there is no matching route in the route cache, the Forwarding Information Database is searched. If the route cannot be found there either, the transmission of the IP packet fails and the application is notified. If the route is in the Forwarding Information Database but not in the route cache, a new entry is created for it and added to the route cache. The route cache is a table (ip_rt_hash_table) of pointers to chains of rtable data structures. The index into the table is a hash function based on the two least significant bytes of the IP address; these are the bytes most likely to differ between destinations and so give the best spread of hash values. Each rtable entry contains the routing information: the destination IP address, the network device (device structure) used to reach that address, the largest message size that may be used, and so on. It also has a reference count, a usage count and a timestamp of the last time it was used (in jiffies). The reference count is incremented each time the route is used, showing the number of network connections using the route, and is decremented when an application stops using it. The usage counter is incremented each time the route is looked up and is used to order the rtable entries in their hash chain. The last-used timestamps of all the entries in the route cache are checked periodically to see whether an rtable entry is too old; if a route has not been used recently, it is discarded from the cache. Routes kept in the route cache are ordered so that the most frequently used entries are at the front of their hash chains, which makes them faster to find when routes are looked up.

See net/ipv4/route.c check_expire()
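
Here is a small sketch of the hashing idea described above: index a route cache by the two least significant bytes of the destination IP address, since these differ most between destinations. The table size and the names rt_hash and RT_HASH_SIZE are illustrative, not the kernel's ip_rt_hash_table.

/* Illustrative route-cache hash over the low two bytes of the address. */
#include <stdio.h>
#include <stdint.h>

#define RT_HASH_SIZE 256

static unsigned int rt_hash(uint32_t dst_ip)
{
    /* mix the two least significant bytes to spread entries across chains */
    return (dst_ip ^ (dst_ip >> 8)) & (RT_HASH_SIZE - 1);
}

int main(void)
{
    printf("bucket for 192.168.1.42: %u\n", rt_hash(0xc0a8012a));
    printf("bucket for 192.168.1.43: %u\n", rt_hash(0xc0a8012b));
    return 0;
}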

10.7.2 The Forwarding Information Database

The forwarding information database (shown in Figure 10.5) contains the routes available to the system, from an IP point of view, at that time. It is a very complicated data structure and, although it is reasonably efficiently arranged, it is not a quick database to consult. In particular, it would be very slow to look up the destination in this database for every IP packet transmitted. That is why the route cache exists: to speed up the transmission of IP packets whose best route is already known. The route cache is populated from this forwarding information database and represents its most frequently used entries.

Each IP subnet is represented by a fib_zone data structure, all of which are pointed at by the fib_zones hash table. The hash index is derived from the IP subnet mask. All of the routes to the same subnet are described by pairs of fib_node and fib_info data structures queued onto the fz_list of each fib_zone data structure. If the number of routes in a subnet grows large, a hash table is generated to make finding the fib_node data structures easier.

Several routes may exist to the same IP subnet, and these routes can go through one of several gateways. The IP routing layer does not allow more than one route to a subnet using the same gateway; in other words, if there are several routes to a subnet, each route is guaranteed to use a different gateway. Associated with each route is its metric, a measure of how advantageous the route is. A route's metric is, essentially, the number of subnets that it must hop through before reaching the destination subnet. The higher the metric, the worse the route.
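
As a small illustration of metric-based selection among several routes to one subnet, the sketch below picks the candidate with the lowest metric. The names fib_route and fib_best_route are illustrative, not the kernel's fib_node/fib_info layout.

/* Illustrative selection of the best (lowest metric) route to a subnet. */
#include <stdio.h>
#include <stdint.h>

struct fib_route {
    uint32_t gateway;   /* next-hop gateway for this subnet                */
    int      metric;    /* rough hop count to the destination subnet       */
};

static const struct fib_route *fib_best_route(const struct fib_route *r, int n)
{
    const struct fib_route *best = NULL;
    for (int i = 0; i < n; i++)
        if (!best || r[i].metric < best->metric)
            best = &r[i];
    return best;        /* a lower metric means a better route */
}

int main(void)
{
    struct fib_route routes[] = {
        { 0x0a000001, 3 },   /* via 10.0.0.1, 3 hops */
        { 0x0a000002, 1 },   /* via 10.0.0.2, 1 hop  */
    };
    printf("best metric: %d\n", fib_best_route(routes, 2)->metric);
    return 0;
}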

11. Kernel Mechanisms

11.1 Bottom Half Handling

There are often times in a kernel when you do not want to do work at that moment. A good example of this is during interrupt processing. When an interrupt is asserted, the processor stops what it is doing and the operating system delivers the interrupt to the appropriate device driver. Device drivers should not spend too much time handling interrupts, because during this time nothing else in the system can run. There is often some work that could just as well be done later. Linux's bottom half handlers were invented so that device drivers and other parts of the Linux kernel could queue work to be done later. Figure 11.1 shows the kernel data structures associated with bottom half handling. There can be up to 32 different bottom half handlers: bh_base is a vector of pointers to each of the kernel's bottom half handling routines, and bh_active and bh_mask have their bits set according to which handlers have been installed and are active. If bit N of bh_mask is set, then the Nth element of bh_base contains the address of a bottom half routine. If bit N of bh_active is set, then the Nth bottom half handler routine is called as soon as the scheduler deems it reasonable. These indices are statically defined: the timer bottom half handler has the highest priority (index 0), the console bottom half handler is next in priority (index 1), and so on. Typically a bottom half handler has a list of tasks associated with it. For example, the immediate bottom half handler works its way through the immediate task queue (tq_immediate), which contains tasks that need to be performed immediately.

See include/linux/interrupt.h

Some of the bottom half handlers of the core are device specific, but others are more general:

TIMER This handler is marked active each time the system's periodic timer interrupts and is used to drive the kernel's timer queue mechanisms

CONSOLE This handler is used to handle console messages

TQUEUE This handler is used to handle TTY messages

NET This handler is used to handle general network processing

IMMEDIATE Generic handler, used by some device drivers to queue work for later

When a device driver, or some other part of the kernel, needs to schedule work to be done later, it adds that work to the appropriate system queue, for example the timer queue, and then signals to the kernel that some bottom half handling needs to happen. It does this by setting the appropriate bit in bh_active. Bit 8 is set if the driver has queued something on the immediate queue and wishes the immediate bottom half handler to run and process it. The bh_active bitmask is checked at the end of each system call, just before control is returned to the calling process. If it has any bits set, the bottom half handler routines that are active are called: bit 0 is checked first, then bit 1 and so on up to bit 31. The bit in bh_active is cleared as each bottom half handler is called. bh_active is transient; it only has meaning between calls to the scheduler and is a way of avoiding calls to bottom half handlers when there is no work for them to do.

See kernel/softirq.c do_bottom_half()
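
The userspace sketch below mimics the bh_base/bh_mask/bh_active scheme described above: 32 handler slots, one bit per installed handler and one bit per activated handler, scanned from bit 0 upwards. It is purely illustrative; the helper names mark_bh and install_bh follow the description rather than the exact kernel code.

/* Illustrative sketch of bottom half bookkeeping and dispatch. */
#include <stdio.h>

static void (*bh_base[32])(void);
static unsigned long bh_mask;     /* which handlers are installed       */
static unsigned long bh_active;   /* which handlers have work to do     */

static void mark_bh(int nr) { bh_active |= 1UL << nr; }

static void install_bh(int nr, void (*fn)(void))
{
    bh_base[nr] = fn;
    bh_mask   |= 1UL << nr;
}

/* Roughly what do_bottom_half() does: call each active, installed
 * handler in priority order (bit 0 first) and clear its active bit. */
static void run_bottom_halves(void)
{
    unsigned long todo = bh_active & bh_mask;
    for (int nr = 0; nr < 32; nr++) {
        if (todo & (1UL << nr)) {
            bh_active &= ~(1UL << nr);
            bh_base[nr]();
        }
    }
}

static void timer_bh(void) { printf("timer bottom half ran\n"); }

int main(void)
{
    install_bh(0, timer_bh);   /* index 0: highest priority */
    mark_bh(0);
    run_bottom_halves();
    return 0;
}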

11.2 Task Queues

Task queues are the method the core uses to defer work to a later time. Linux has a general mechanism for queuing jobs and processing them at a later time. Task queues are often used with bottom half handlers: the timer task queue is processed while the timer bottom half handler is running. A task queue is a simple data structure, see Figure 11.2, consisting of a singly linked list of tq_struct data structures, each containing a pointer to a routine and a pointer to some data.

See include/linux/tqueue.h

The routine is called when the element of the task queue is processed, and it is passed a pointer to the data.

Anything in the core, such as device drivers, can create and use task queues, but there are three task queues that are created and managed by the core:

timer This queue is used to queue work that will be run as soon after the next system clock tick as possible. Each clock tick, this queue is checked to see whether it contains any entries and, if it does, the timer queue bottom half handler is marked active. The timer queue bottom half handler is processed, along with all the other bottom half handlers, the next time the scheduler runs. This queue should not be confused with system timers, which are a much more sophisticated mechanism

immediate This queue is also processed when the scheduler processes the active bottom half handlers. The immediate bottom half handler is lower in priority than the timer bottom half handler, so these tasks are run later.

scheduler This task queue is processed directly by the scheduler. It is used to support the other task queues in the system; in this case the task to be run is a routine that processes a task queue, for example one belonging to a device driver.

When a task queue is processed, the pointer to the first element in the queue is removed from the queue and replaced with a null pointer. In fact, this removal is an atomic operation, one that cannot be interrupted. Then each element in the queue has its handling routine called in turn. The elements in the queue are often statically allocated data, but there is no inherent mechanism for discarding allocated memory: the task queue processing routine simply moves on to the next element in the list. It is the job of the task itself to ensure that any allocated kernel memory is properly cleaned up.
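
The sketch below shows the shape of this mechanism: a singly linked list of elements, each holding a routine and a data pointer, detached as a whole and then walked. The field layout is simplified relative to the real tq_struct, and the single-threaded detachment stands in for the atomic removal described above.

/* Illustrative task queue: queue an element, then run the whole list. */
#include <stdio.h>
#include <stddef.h>

struct tq_struct {
    struct tq_struct *next;
    void (*routine)(void *);
    void *data;
};

static struct tq_struct *tq_immediate;    /* head of the queue */

static void queue_task(struct tq_struct *t, struct tq_struct **q)
{
    t->next = *q;
    *q = t;
}

/* Detach the whole list (atomically in the kernel), then call each
 * routine in turn; freeing any memory is the task's own business. */
static void run_task_queue(struct tq_struct **q)
{
    struct tq_struct *list = *q;
    *q = NULL;
    while (list) {
        struct tq_struct *t = list;
        list = t->next;
        t->routine(t->data);
    }
}

static void hello(void *data) { printf("deferred work: %s\n", (char *)data); }

int main(void)
{
    static struct tq_struct work = { NULL, hello, "flush buffers" };
    queue_task(&work, &tq_immediate);
    run_task_queue(&tq_immediate);
    return 0;
}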

11.3 Timers

An operating system needs to be able to schedule an activity for some time in the future, which requires a mechanism that allows activities to be scheduled to run at a reasonably precise time. Any microprocessor that wishes to support an operating system needs a programmable interval timer that periodically interrupts the processor. This periodic interrupt is the system clock tick, and it acts like a metronome, orchestrating the system's activities. Linux has a very simple view of time: it measures time in clock ticks since the system booted. All system times are based on this measurement, which is known as jiffies after the globally available variable of the same name.

Linux has two types of system timers; both queue routines to be called at some system time, but they differ slightly in their implementation. Figure 11.3 shows both mechanisms. The first, the old timer mechanism, has a static array of 32 pointers to timer_struct data structures and a mask of active timers, timer_active. Where the timers go in this timer table is statically defined (rather like the bottom half handler table bh_base). Entries are added to this table mostly at system initialization time. The second, newer, mechanism uses a linked list of timer_list data structures held in order of expiry time.

See include/linux/timer.h

Both methods use the time in jiffies as the expiry time, so a timer that wished to run in 5 seconds would have to convert 5 seconds into units of jiffies and add that to the current system time to get the system time (in jiffies) at which the timer should expire. Every system clock tick, the timer bottom half handler is marked active so that, the next time the scheduler runs, the timer queues are processed. The timer bottom half handler processes both types of system timer. For the old system timers, the timer_active bitmask is checked for bits that are set. If an active timer has expired (its expiry time is less than the current system jiffies), its timer routine is called and its active bit is cleared. For the new system timers, the entries in the linked list of timer_list data structures are checked. Every expired timer is removed from the list and its routine is called. The advantage of the new timer mechanism is that it can pass an argument to the timer routine.

See kernel/sched.c timer_bh() run_old_timers() run_timer_list()
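
The following is a simplified sketch of the newer mechanism: a timer_list-like entry with an expiry time in "jiffies", a routine and an argument for it. The field names follow the description above but the layout is simplified, and the list is scanned rather than kept sorted; the HZ value in the comment is the conventional 100 for Intel kernels of this era.

/* Illustrative new-style timers: add a timer, advance jiffies, expire it. */
#include <stdio.h>

struct timer_list {
    struct timer_list *next;
    unsigned long expires;              /* absolute expiry time, in jiffies  */
    unsigned long data;                 /* argument passed to the routine    */
    void (*function)(unsigned long);
};

static struct timer_list *timers;
static unsigned long jiffies;           /* incremented on every clock tick   */

static void add_timer(struct timer_list *t) { t->next = timers; timers = t; }

/* What the timer bottom half does: run and unlink every expired timer. */
static void run_timer_list(void)
{
    struct timer_list **p = &timers;
    while (*p) {
        struct timer_list *t = *p;
        if (t->expires <= jiffies) {
            *p = t->next;
            t->function(t->data);
        } else {
            p = &t->next;
        }
    }
}

static void beep(unsigned long data) { printf("timer fired, data=%lu\n", data); }

int main(void)
{
    /* with HZ = 100, 5 seconds is 500 jiffies */
    struct timer_list t = { NULL, jiffies + 500, 42, beep };
    add_timer(&t);
    jiffies += 500;          /* pretend 5 seconds of clock ticks have passed */
    run_timer_list();
    return 0;
}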

11.4 Wait Queues

Many times a process must wait for a system resource. For example, a process may need a VFS inode describing a directory in the filesystem, but this inode may not be in the buffer cache. At this point, the system must wait for the inode to be fetched from the physical medium containing the filesystem before continuing.

The Linux kernel uses a simple data structure, a wait queue (see Figure 11.4), containing a pointer to the process' task_struct and a pointer to the next element in the wait queue.

See include/linux/wait.h

Processes added to the end of a wait queue can be either interruptible or uninterruptible. Interruptible processes may be interrupted by events such as an expiring timer or a delivered signal while they wait on the wait queue. The waiting process's state reflects this, being either INTERRUPTIBLE or UNINTERRUPTIBLE. As the process cannot now continue to run, the scheduler is started and, when it selects a new process to run, the waiting process is suspended.

When the wait queue is processed, the state of every process on it is set to RUNNING. If a process had been removed from the run queue, it is put back onto the run queue. The next time the scheduler runs, the processes that were on the wait queue are now candidates to be run, because they are no longer waiting. When a process on a wait queue is scheduled, the first thing it does is remove itself from the wait queue. Wait queues can be used to synchronize access to system resources; Linux uses them in its implementation of semaphores.
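
A single-threaded sketch of this idea follows: each wait queue element points at a task and the next element, and waking the queue sets every task back to RUNNING so the scheduler will consider it again. The struct and function names mirror the description but are simplified and illustrative.

/* Illustrative wait queue: sleep a task, then wake the whole queue. */
#include <stdio.h>

enum task_state { RUNNING, INTERRUPTIBLE, UNINTERRUPTIBLE };

struct task { const char *name; enum task_state state; };

struct wait_queue {
    struct task *task;
    struct wait_queue *next;
};

static void add_wait_queue(struct wait_queue **q, struct wait_queue *w)
{
    w->next = *q;
    *q = w;
}

static void wake_up(struct wait_queue **q)
{
    for (struct wait_queue *w = *q; w; w = w->next)
        w->task->state = RUNNING;      /* candidates to run again */
    *q = NULL;
}

int main(void)
{
    struct task reader = { "reader", INTERRUPTIBLE };
    struct wait_queue entry = { &reader, NULL };
    struct wait_queue *inode_wait = NULL;

    add_wait_queue(&inode_wait, &entry);   /* sleep until the inode arrives */
    wake_up(&inode_wait);                  /* inode read from disk */
    printf("%s is %s\n", reader.name,
           reader.state == RUNNING ? "runnable" : "waiting");
    return 0;
}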

11.5 Buzz Locks

Buzz locks, better known as spin locks, are a primitive way of protecting a data structure or piece of code. They allow only one process at a time to be within a critical region of code. Linux uses them to restrict access to fields in data structures, using a single integer field as the lock. Each process wishing to enter the region attempts to change the lock's value from 0 to 1. If its current value is 1, the process tries again, spinning in a tight loop of code. The access to the memory location holding the lock must be atomic: reading its value, checking that it is 0 and then changing it to 1 cannot be interrupted by any other process. Most CPU architectures provide support for this via special instructions, but a buzz lock can also be implemented using uncached main memory.

When the owning process leaves the critical region of code, it decrements the buzz lock, returning its value to 0. Any processes spinning on the lock will now read it as 0; the first one to do so increments it to 1 and enters the critical region.
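
As a minimal sketch of this technique, the code below implements a spin lock in userspace with C11 atomics: it tries to change the lock word from 0 to 1 atomically and spins while the old value was already 1. The kernel achieves the same effect with architecture-specific instructions; the names buzz_lock/buzz_unlock are illustrative.

/* Illustrative buzz (spin) lock built on an atomic exchange. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock = 0;

static void buzz_lock(void)
{
    /* atomic_exchange returns the old value: 0 means we took the lock */
    while (atomic_exchange(&lock, 1) != 0)
        ;                      /* spin in a tight loop until the owner releases it */
}

static void buzz_unlock(void)
{
    atomic_store(&lock, 0);    /* back to 0: the next spinner wins */
}

int main(void)
{
    buzz_lock();
    printf("inside the critical region\n");
    buzz_unlock();
    return 0;
}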

11.6 Semaphores

Semaphores are used to protect critical regions of code or data structures. Remember that each access to a critical piece of data, such as a VFS inode describing a directory, is made by kernel code running on behalf of a process. It would be very dangerous to allow one process to alter a critical data structure that is being used by another process. One way to achieve this would be to use a buzz lock around the critical piece of code being accessed but, although this is the simplest approach, it would not give very good system performance. Instead, Linux uses semaphores to allow only one process at a time to access critical regions of code and data; all other processes wishing to access the resource are made to wait until it becomes free. The waiting processes are suspended while the other processes in the system continue to run as normal.

A Linux semaphore data structure includes the following information:

See include/asm/semaphore.h

count This field keeps track of the count of processes wishing to use this resource. A positive value means that the resource is available. A negative or zero value means that processes are waiting for it. An initial value of 1 means that one and only one process at a time may use this resource. When processes want the resource they decrement the count, and when they have finished with it they increment the count

waking The number of processes waiting for this resource, which is also the number of processes waiting to be woken up when the resource is free.

Wait queue When processes are waiting for this resource they are placed on this wait queue

Lock The buzz lock used when accessing the waking field

Assuming that the semaphore is initialized to 1, the first process to come along will see that the count is positive and decrement it by 1, making it 0. The process now "owns" the critical piece of code or resource protected by the semaphore. When the process leaves the critical region, it increments the semaphore's count. The optimal case is where no other processes are contending for ownership of the critical region; Linux's semaphore implementation works very efficiently for this, the most common, case.

If another process wishes to enter the critical region while it is already owned by a process, it too decrements the count. As the count is now -1, the process cannot enter the critical region; it must wait until the owning process exits. Linux makes the waiting process sleep until the owning process leaves the critical region and wakes it up. The waiting process adds itself to the semaphore's wait queue and sits in a loop checking the value of the waking field, calling the scheduler until waking is non-zero.

The owner of the critical region increments the semaphore's count and, if it is less than or equal to 0, then there are still processes sleeping, waiting for this resource. In the ideal case the semaphore's count would have been returned to its initial value of 1 and no further work would be necessary. The owning process increments the waking counter and wakes up the process sleeping on the semaphore's wait queue. When the waiting process wakes up, the waking counter is now 1 and it knows that it may enter the critical region. It decrements the waking counter, returning it to 0, and continues. All accesses to the waking field of the semaphore are protected by the buzz lock held in the semaphore's lock field.
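
The sketch below traces the count/waking bookkeeping just described in a much-simplified, single-threaded form. The real implementation sleeps on the wait queue and protects waking with the buzz lock; here the contended path is only narrated in comments, and only the uncontended path is exercised.

/* Illustrative semaphore bookkeeping: down() and up() on the count. */
#include <stdio.h>

struct semaphore {
    int count;     /* > 0: free; <= 0: owned, with waiters queued          */
    int waking;    /* how many sleepers have been told they may proceed    */
};

static void down(struct semaphore *s)
{
    s->count--;
    if (s->count < 0) {
        printf("resource busy: would sleep on the wait queue\n");
        /* in the kernel: add to the semaphore's wait queue, loop calling
         * the scheduler until waking is non-zero, then decrement waking */
    }
}

static void up(struct semaphore *s)
{
    s->count++;
    if (s->count <= 0) {
        s->waking++;         /* let one sleeper through */
        printf("waking a process sleeping on the semaphore\n");
    }
}

int main(void)
{
    struct semaphore sem = { 1, 0 };   /* initial count of 1: one owner at a time */
    down(&sem);                        /* uncontended: count goes from 1 to 0 */
    up(&sem);                          /* count back to 1, nothing to wake    */
    return 0;
}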

12. Modules

How does the Linux kernel dynamically load functions only when needed, e.g. the filesystem?

Linux is a monolithic kernel; that is, it is one single, large program in which all the functional components have access to all of its internal data structures and routines. The alternative is a micro-kernel structure, in which the functional pieces of the kernel are broken out into separate units with strict communication mechanisms between them. With a monolithic kernel, adding new components via the configuration process is rather time consuming: if, say, you wished to use a SCSI driver for an NCR 810 SCSI card and had not built it into the kernel, you would have to configure and then build a new kernel before you could use the NCR 810. There is an alternative: Linux allows components of the operating system to be dynamically loaded and unloaded as they are needed. Linux modules are lumps of code that can be dynamically linked into the kernel at any point after the system has booted. They can be unlinked from the kernel and removed when they are no longer needed. Most Linux kernel modules are device drivers, pseudo-device drivers such as network drivers, or filesystems.

You can either load and unload Linux kernel modules explicitly using the insmod and rmmod commands, or the kernel itself can ask the kernel daemon (kerneld) to load and unload the modules as they are needed. Dynamically loading code as it is needed is attractive because it keeps the kernel size to a minimum and makes the kernel very flexible. My current Intel kernel uses modules extensively and is only 406K in size. I use the VFAT filesystem only occasionally, so I build my Linux kernel to automatically load the VFAT filesystem module when I mount a VFAT partition. When I unmount the VFAT filesystem, the system detects that I no longer need the VFAT filesystem module and removes it from the system. Modules can also be useful for trying out new kernel code without having to rebuild and reboot the kernel every time. Nothing, though, comes for free: there is a slight performance and memory penalty associated with kernel modules. A loadable module must provide a little more code, and this code and the extra data structures take a little more memory. Modules also introduce a level of indirection that makes access to kernel resources slightly less efficient.

Once a Linux module has been loaded, it is as much a part of the kernel as any normal kernel code. It has the same rights and responsibilities as any kernel code; in other words, a Linux kernel module can crash the kernel just like any other kernel code or device driver can.

So that modules can use the kernel resources they need, they must be able to find them. Say a module needs to call kmalloc(), the kernel memory allocation routine. At the time the module is built, it does not know where in memory kmalloc() is, so when the module is loaded the kernel must fix up all of the module's references to kmalloc() before the module can work. The kernel keeps a list of all of its resources in the kernel symbol table, so that it can resolve a module's references to those resources as the module is loaded. Linux allows module stacking, where one module requires the services of another module. For example, the VFAT filesystem module requires the services of the FAT filesystem module, as the VFAT filesystem is more or less a set of extensions to the FAT filesystem. One module requiring services or resources from another module is very similar to a module requiring services and resources from the kernel itself, except that the required service is in another, previously loaded, module. As each module is loaded, the kernel modifies the kernel symbol table, adding to it all of the resources or symbols exported by the newly loaded module. This means that, when the next module is loaded, it has access to the services of the modules already loaded.

When an attempt is made to unload a module, the kernel needs to know that the module is unused, and it also needs some way of notifying the module that it is about to be unloaded. That way the module can free up any system resources that it has allocated, for example kernel memory or interrupts, before it is removed from the kernel. When the module is unloaded, the kernel removes any symbols that the module exported into the kernel symbol table.

Apart from the danger of badly written loadable modules corrupting the operating system, there is another hazard. What happens if you load a module built for an earlier or later kernel than the one you are currently running? This may cause a problem if, say, the module calls a kernel routine and supplies the wrong arguments. The kernel can optionally protect against this by rigorously version checking the module as it is loaded.

12.1 Loading a Module

A core module can be loaded in two ways. The first is to manually insert it into the core using the insmod command. A second, smarter way is to load the module when needed: this is called demand loading. When the kernel finds that a module is needed, such as when the user mounts a filesystem that is not in the kernel, the kernel asks the kernel daemon (kerneld) to try to load the appropriate module.

Kerneld and insmod, lsmod and rmmod are all in the modules package.

The core daemon is usually a normal user process with superuser privileges. When it starts (usually at system boot), it opens an IPC channel to the kernel. The kernel uses this connection to send messages to kerneld asking it to perform a number of tasks. Kerneld's main function is to load and unload core modules, but it can also perform other tasks, such as starting a PPP connection on a serial line when needed and closing it when not needed. Kerneld itself does not perform these tasks, it runs the necessary programs such as insmod to get the job done. Kerneld is just a proxy to the core, scheduling its work.

See include/linux/kerneld.h

Insmod must find the requested kernel module that it is to load. Demand-loaded kernel modules are normally kept in the /lib/modules/kernel-version directory. Kernel modules are linked object files just like other programs in the system, except that they are linked as relocatable images; that is, images that are not linked to run from a particular address. They can be a.out or elf format object files. Insmod makes a privileged system call to find the kernel's exported symbols. These are kept in pairs containing the symbol's name and its value, for example its address. The kernel's exported symbol table is held in the first module data structure in the list of modules maintained by the kernel, pointed at by the module_list pointer. Only specifically designated symbols are added into the table when the kernel is compiled and linked; not every symbol in the kernel is exported to its modules. An example symbol is "request_irq", the kernel routine that must be called when a driver wishes to take control of a particular system interrupt; in my current kernel it has the value 0x0010cd30. You can easily see the exported kernel symbols and their values by looking at /proc/ksyms or by using the ksyms utility. The ksyms utility can show you either all of the exported kernel symbols or only those symbols exported by loaded modules. Insmod reads the module into its virtual memory and fixes up its unresolved references to kernel routines and resources using the exported symbols from the kernel. This fixing up takes the form of patching the module image in memory: insmod physically writes the address of the symbol into the appropriate place in the module.

See kernel/module.c kernel_syms() include/linux/module.h

When insmod has fixed up the module's references to exported kernel symbols, it asks the kernel, again using a privileged system call, for enough space to hold the new module. The kernel allocates a new module data structure and enough kernel memory to hold the new module, and puts it at the end of the kernel's list of modules. The new module is marked UNINITIALIZED. Figure 12.1 shows the kernel's module list after the last two modules, FAT and VFAT, have been loaded into memory. Not shown in the diagram is the first module in the list, a pseudo-module that is only there to hold the kernel's exported symbol table. You can use the command lsmod to list all of the loaded kernel modules and their interdependencies. Lsmod simply reformats /proc/modules, which is built from the list of kernel module data structures. The memory that the kernel allocates for the module is mapped into the address space of the insmod process so that it can access it. Insmod copies the module into the allocated space and relocates it so that it will run from the kernel address it has been allocated. This must happen because a module cannot expect to be loaded at the same address twice, let alone at the same address in two different Linux systems. Again, this relocation involves patching the module image with the appropriate addresses.

See kernel/module.c create_module()

The new module also exports symbols to the kernel, and insmod builds a table of these exported symbols. Every kernel module must contain module initialization and module cleanup routines; these symbols are deliberately not exported, but insmod must know their addresses so that it can pass them to the kernel. All this done, insmod is now ready to initialize the module, and it makes a privileged system call passing the kernel the addresses of the module's initialization and cleanup routines.

See kernel/module.c sys_init_module()

When a new module is added to the kernel, the kernel must update its set of symbols and modify the modules that are being used by the new module. Modules that other modules depend on must maintain a list of references at the end of their symbol table, pointed at by their module data structure. Figure 12.1 shows that the VFAT filesystem module depends on the FAT filesystem module, so the FAT module contains a reference to the VFAT module; the reference was added when the VFAT module was loaded. The kernel calls the module's initialization routine and, if it is successful, carries on installing the module. The module's cleanup routine address is stored in its module data structure, and it will be called by the kernel when that module is unloaded. Finally, the module's state is set to RUNNING.
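
For illustration, here is a minimal sketch of what a loadable module looked like under the module interface of this era: init_module() and cleanup_module() are the entry points whose addresses insmod passes to the kernel. The "hello" name and the messages are invented for the example, and this sketch would have to be built against kernel headers of that period rather than compiled as an ordinary program (modern kernels use module_init()/module_exit() instead).

/* Illustrative old-style kernel module with its two entry points. */
#include <linux/module.h>
#include <linux/kernel.h>

int init_module(void)
{
    printk("hello: module loaded and initialized\n");
    return 0;               /* a non-zero return would make the load fail */
}

void cleanup_module(void)
{
    printk("hello: module about to be unloaded, freeing resources\n");
}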

12.2 Unloading a Module

As well as being removable with the rmmod command, demand-loaded modules that are no longer in use can be removed from the system by kerneld. Every time its idle timer expires, kerneld makes a system call requesting that all unused demand-loaded modules be removed from the system. The timer's value is set when you start kerneld; my kerneld checks every 180 seconds. So, if you mount an iso9660 CD ROM and your iso9660 filesystem is a loadable module, the iso9660 module will be removed from the kernel shortly after the CD ROM is unmounted.

A module cannot be unloaded so long as other components of the kernel depend on it. For example, if you have one or more VFAT filesystems mounted, you cannot unload the VFAT module. If you look at the output of lsmod, you will see that each module has a count associated with it. For example:

Module: #pages: Used by:
msdos 5 1
vfat 4 1 (autoclean)
fat 6 [vfat msdos] 2 (autoclean)

The count is the number of kernel entities that depend on this module. In the above example, both vfat and msdos depend on the fat module, so the fat module's count is 2. Both the vfat and msdos modules have one dependent each, a mounted filesystem. If I were to mount another VFAT filesystem, the vfat module's count would become 2. A module's count is held in the first longword of its image.

This field is slightly overloaded, as it also holds the AUTOCLEAN and VISITED flags. Both of these flags are used for demand-loaded modules. Such modules are marked AUTOCLEAN so that the system can recognize which ones it may automatically unload. The VISITED flag marks the module as in use by one or more system components; it is set whenever another component makes use of the module. Each time kerneld asks the system to remove unused demand-loaded modules, it looks through all of the modules in the system for likely candidates. It only looks at modules marked AUTOCLEAN and in the state RUNNING. If a candidate's VISITED flag is clear, it removes the module; otherwise it clears the VISITED flag and goes on to look at the next module in the system.

Assuming that a module can be unloaded, its cleanup routine is called to allow it to free up the kernel resources that it has allocated. The module data structure is marked DELETED and it is unlinked from the list of kernel modules. Any other modules that it depends on have their reference lists modified so that they no longer regard it as a dependent. All of the kernel memory that the module needed is deallocated.

See kernel/module.c delete_module()

13. The Linux Kernel Sources

Where in the Linux kernel sources should you start looking for particular kernel functions?

Looking at the kernel source code is a good way of gaining an in-depth understanding of the Linux operating system. This section gives an overview of the kernel sources: how they are arranged and where you should start looking for particular code.

Where to Get The Linux Kernel Sources

All of the major Linux distributions (Craftworks, Debian, Slackware, RedHat and so on) include the kernel sources. Usually the Linux kernel installed on your Linux system was built from those sources. By their very nature these sources tend to be a little out of date, so you may want to get the latest sources from one of the web sites mentioned in Appendix C. They are kept on ftp://ftp.cs.helsinki.fi and all of the other mirror web sites. The Helsinki web site is the most up to date, but sites such as MIT and Sunsite are never very far behind.

If you do not have access to the web, there are many CDROM vendors who offer snapshots of the world's major web sites at a very reasonable cost. Some even offer a subscription service with quarterly or monthly updates. Your local Linux user group is also a good source of sources.

The Linux kernel sources have a very simple numbering scheme. Any even-numbered kernel (for example 2.0.30) is a stable, released kernel, and any odd-numbered kernel (for example 2.1.42) is a development kernel. This book is based on the stable 2.0.30 source tree. Development kernels have all of the latest features and support for all of the latest devices; they may be unstable, which may not be what you want, but it is important that the Linux community tries out the latest kernels so that they are tested on behalf of the whole community. Remember that it is always worth backing up your system thoroughly if you do try out non-production kernels.

Changes to the core source program are distributed as patch files. The tool patch can apply a series of modifications to a series of source files. For example, if you have a 2.0.29 source tree and you want to move to 2.0.30, you can take the 2.0.30 patch files and apply those patches (edits) to the source tree:

$ cd /usr/src/linux
$ patch -p1 < patch-2.0.30

This saves you from having to copy the entire source tree, especially for slow serial connections. A good source of core patches (official and informal) is http://www.linuxhq.com

How The Kernel Sources Are Arranged

At the top of the source tree you will see some directories:

arch The arch subdirectory contains all architecture-related core code. It also has deeper subdirectories, each representing a supported architecture, such as i386 and alpha.

Include The include subdirectory contains most of the include files needed to build the kernel. It too has further subdirectories, one for every architecture supported. Include/asm is a soft link to the real include directory needed for this architecture, for example include/asm-i386. To change architectures you need to edit the kernel makefile and rerun the Linux kernel configuration program

Init This directory contains the initialization code for the core and is a good starting point for studying how the core works.

Mm This directory contains all memory management code. Architecture-related memory management code is located in arch/*/mm/, such as arch/i386/mm/fault.c

Drivers All device drivers of the system are in this directory. They are divided into device driver classes such as block.

Ipc This directory contains the core inter-process communication code

Modules This is simply a directory used to hold built modules

Fs All of the filesystem code. This is divided into subdirectories, one for each supported filesystem, such as vfat and ext2

Kernel The main kernel code. Again, the architecture-specific kernel code is in arch/*/kernel

Net core network code

Lib This directory houses the core library code. The library code related to the architecture is in arch/*/lib/

Scripts This directory contains scripts (such as awk and tk scripts) that configure the core

Where to Start Looking

Looking at a program as large and as complex as the Linux kernel can be rather daunting. It is rather like a huge ball of string with no end showing. Looking at one part of the kernel often leads to looking at several other related files, and before long you have forgotten what you were looking for. The next subsections give a hint as to where in the source tree the best place to look is for a given subject.

System Startup and Initialization

On an Intel system, the kernel starts when loadlin.exe or LILO loads the kernel into memory and passes control to it. See arch/i386/kernel/head.S for this part. head.S performs some architecture-dependent setup work and jumps to the main() routine in init/main.c.

Memory Management

Most of the code is in mm but architecture-related code is in arch/*/mm. The page fault handling code is in mm/memory.c, and the memory map and page cache code is in mm/filemap.c. The Buffer cache is implemented in mm/buffer.c, and the swap cache is in mm/swap_state.c and mm/swapfile.c.

Kernel

Most of the relatively general code is in the kernel, and the code related to the architecture is in arch/*/kernel. The scheduler is in kernel/sched.c, and the fork code is in kernel/fork.c. The bottom half processing code is in include/linux/interrupt.h. The task_struct data structure can be found in include/linux/sched.h

PCI

The PCI pseudo driver is in drivers/pci/pci.c, and the system-wide definition is in include/linux/pci.h. Each architecture has some special PCI BIOS code, the Alpha AXP's is located in arch/alpha/kernel/bios32.c

Interprocess Communication

All in the ipc directory. All System V IPC objects include the ipc_perm data structure, which can be found in include/linux/ipc.h. System V messages are implemented in ipc/msg.c, shared memory is in ipc/shm.c, and semaphores are in ipc/sem.c. Pipes are implemented in ipc/pipe.c.

Interrupt Handling

The kernel's interrupt handling code is almost all microprocessor (and often platform) specific. The Intel interrupt handling code is in arch/i386/kernel/irq.c and its definitions are in include/asm-i386/irq.h.

Device Drivers

Most of the Linux core source code lines are in its device drivers. All device driver source code for Linux is in drivers, but they are further categorized:

/block block device drivers such as ide (ide.c). If you want to see how all devices that may contain filesystems are initialized, you can look at device_setup() in drivers/block/genhd.c. It not only initializes the hard disk, but also initializes the network, because you need the network when you mount the nfs file system. Block devices include IDE-based and SCSI devices.

/char Here you can view character-based devices such as tty, serial port, etc.

/cdrom All CDROM code for Linux. Special CDROM devices (eg Soundblaster CDROM) can be found here. Note that the ide CD driver is ide-cd.c in drivers/block and the SCSI CD driver is in drivers/scsi/scsi.c

/pci PCI pseudo driver. This is a good place to observe how the PCI subsystem is mapped and initialized. The Alpha AXP PCI collation code is also worth looking at in arch/alpha/kernel/bios32.c

/scsi Here you can find not only drivers for all scsi devices supported by Linux, but also all SCSI codes

/net Here you can find network device drivers like DEC Chip 21040 PCI Ethernet driver in tulip.c

/sound the location of all sound card drivers

File Systems

The sources for the EXT2 filesystem are all in the fs/ext2/ subdirectory, with the data structure definitions in include/linux/ext2_fs.h, ext2_fs_i.h and ext2_fs_sb.h. The Virtual File System data structures are described in include/linux/fs.h and the code is in fs/. The buffer cache is implemented in fs/buffer.c, along with the update kernel daemon.

Network

The network code is placed in the net subdirectory, and most of the include files are in include/net. The BSD socket code is in net/socket.c, and the Ipv4 INET socket code is in net/ipv4/af_inet.c. The support code for common protocols (including sk_buff handling routines) is in net/core, and the TCP/IP network code is in net/ipv4. Network device drivers are in drivers/net

Modules

The kernel module code is partly in the kernel and partly in the modules package. The kernel code is all in kernel/modules.c; the data structures and the kernel daemon kerneld messages are in include/linux/module.h and include/linux/kerneld.h respectively. You may also want to look at the structure of an ELF object file in include/linux/elf.h.

Appendix A

Linux Data Structures

This appendix lists the main data structures used by Linux as described in this book. They have been lightly edited for accessibility on the page.

blk_dev_struct

The blk_dev_struct data structure is used to register block devices as available for use by the buffer cache. They are held in the blk_dev vector table.

see include/linux/blkdev.h

struct blk_dev_struct {
void (*request_fn)(void);
struct request * current_request;
struct request plug;
struct tq_struct plug_tq;
};

 

buffer_head

The buffer_head data structure stores information about a block buffer in the buffer cache.

See include/linux/fs.h

device

Each network device in the system is represented by a device data structure.

See include/linux/netdevice.h

device_struct

The device_struct data structure is used to register character and block devices (it holds the device's name and its possible file operations). Each valid member of the chrdevs and blkdevs vector tables represents a character or block device, respectively.

See fs/devices.c

struct device_struct {
const char * name;
struct file_operations * fops;
};

file

Each open file, socket, etc. is represented by a file data structure.

See include/linux/fs.h

files_struct

The files_struct data structure describes the files that a process has open.

see include/linux/sched.h

gendisk

The gendisk data structure holds information about a hard disk. It is used during initialization, when the disks are found and then probed for partitions.

See include/linux/genhd.h

inode

The VFS inode data structure stores information about a file or directory on disk.

See include/linux/fs.h

ipc_perm

The ipc_perm data structure describes the access permissions for a System V IPC object.

See include/linux/ipc.h

irqaction

The irqaction data structure describes the system's interrupt handlers.

See include/linux/interrupt.h

linux_binfmt

Every binary file format understood by Linux is represented by a linux_binfmt data structure.

See include/linux/binfmt.h

mem_map_t

The mem_map_t data structure (also called page) is used to store information about each physical memory page.

see include/linux/mm.h

mm_struct

The mm_struct data structure is used to describe the virtual memory of a task or process.

see include/linux/sched.h

pci_bus

Each PCI bus in the system is represented by a pci_bus data structure.

see include/linux/pci.h

pci_dev

Each PCI device in the system, including PCI-PCI and PCI-ISA bridge devices, is represented by a pci_dev data structure.

see include/linux/pci.h

request

request is used to make a request to a block device in the system. Requests are to read/write data blocks from/to the buffer cache.

see include/linux/blkdev.h

rtable

Each rtable data structure stores information about the route for sending packets to an IP host. The Rtable data structure is used in the IP route cache.

see include/net/route.h

semaphore

Semaphores are used to protect important data structures and areas of code.

See include/asm/semaphore.h

sk_buff

The sk_buff data structure describes network data as it moves between protocol layers.

See include/linux/skbuff.h

sock

Each sock data structure stores protocol-related information in a BSD socket. For example, for an INET socket, this data structure will hold all TCP/IP and UDP/IP related information.

see include/linux/net.h

socket

Each socket data structure stores information about a BSD socket. It doesn't stand on its own and is actually part of the VFS inode data structure.

see include/linux/net.h

task_struct

Each task_struct describes a task or process in the system.

see include/linux/sched.h

timer_list

The timer_list data structure is used to implement the real-time timer of the process.

See include/linux/timer.h

tq_struct

Each task queue (tq_struct) data structure holds information about work that has been queued. Usually this is a task needed by a device driver, but one that does not have to be done immediately.

See include/linux/tqueue.h

vm_area_struct

Each vm_area_struct data structure describes a virtual memory area of a process.

see include/linux/mm.h

Additional Supplements:

1. Hardware Basics

1) CPU

The CPU, or microprocessor, is the heart of any computer system. The microprocessor performs mathematical operations, logical operations, and reads and executes instructions from memory, thereby controlling the flow of data. In the early days of computer development, the various functional modules of microprocessors were composed of separate (and huge in size) units. This is also the origin of the term "central processing unit". Modern microprocessors combine these functional blocks on an integrated circuit made of a very small silicon wafer. In this book, the terms CPU, microprocessor, and processor are used interchangeably.

Microprocessors deal in binary data: data composed of ones and zeros. These ones and zeros correspond to electrical switches being either on or off. Just as the decimal number 42 means four 10s and 2 units, a binary number is a series of digits each of which represents a power of 2. Here, a power means the number of times that a number is multiplied by itself. The first power of 10 is 10, the second power of 10 is 10x10, the third power of 10 is 10x10x10, and so on. Binary 0001 is decimal 1, binary 0010 is decimal 2, binary 0011 is decimal 3, binary 0100 is decimal 4, and so on. So, 42 decimal is 101010 binary, or 2 + 8 + 32, that is 2^1 + 2^3 + 2^5. Rather than using binary to represent numbers in computer programs, another base, hexadecimal, is usually used. In this base, each digit represents a power of 16. As decimal digits only go from 0 to 9, the numbers 10 to 15 are represented by the letters A, B, C, D, E and F respectively. For example, hexadecimal E is decimal 14, and hexadecimal 2A is decimal 42 (2x16 + 10). Using the C programming language notation (as used throughout this book), hexadecimal numbers are prefixed with "0x": hexadecimal 2A is written as 0x2A.
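
The tiny C program below just checks the arithmetic in the paragraph above: 42 decimal, written 0x2A in the C hexadecimal notation, is the sum 32 + 8 + 2.

/* 42 decimal, 0x2A hexadecimal, 101010 binary: the same number. */
#include <stdio.h>

int main(void)
{
    int n = 0x2A;                                   /* 2*16 + 10 = 42 */
    printf("%d = 0x%X = %d+%d+%d\n", n, n, 32, 8, 2);
    return 0;
}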

The microprocessor can perform arithmetic operations such as addition, multiplication and division, as well as logical operations such as "is X greater than Y".

The processor's execution is governed by an external clock. This clock, the system clock, generates regular clock pulses for the processor and, at every clock pulse, the processor does some work. For example, a processor could execute one instruction every clock pulse. A processor's speed is described in terms of the system clock rate: a 100MHz processor receives 100,000,000 clock ticks every second. It is misleading to describe the power of a CPU by its clock rate, as different processors perform different amounts of work per clock tick. Nevertheless, all things being equal, a faster clock speed means a more capable processor. The instructions executed by the processor are very simple, for example: "read the contents of memory at location X into register Y". Registers are the microprocessor's internal storage, used for storing data and performing operations on it. The operations performed may cause the processor to stop what it is doing and jump to another instruction somewhere else in memory. It is these tiny building blocks that give the modern microprocessor almost limitless power, as it can execute millions or even billions of instructions every second.

Instructions have to be fetched from memory as they are executed. Instructions may themselves reference data within memory, and that data must be fetched from memory and saved back to it when appropriate.

The size, number and type of a microprocessor's internal registers depend entirely on its type. An Intel 80486 processor has a quite different register set to an Alpha AXP processor: for a start, the Intel's registers are 32 bits wide and the Alpha AXP's are 64 bits wide. In general, though, any given processor will have a number of general-purpose registers and a smaller number of dedicated registers. Most processors have dedicated registers for the following special purposes:

Program Counter (PC)

This register records the address of the next instruction to be executed. The contents of the PC are automatically incremented each time an instruction is fetched.

Stack Pointer (SP)

The processor must have access to large amounts of external read/write random access memory (RAM), which facilitates temporary storage of data. The stack is a way of easily saving and restoring temporary values in external memory. Usually, processors have special instructions that allow you to push values onto the stack and pop them off again later when needed. The stack works on a last in, first out (LIFO) basis. In other words, if you push two values, x and y, onto the stack and then pop a value off the stack, you get back the value of y.

Some processors have stacks that grow toward the top of memory, while others grow toward the bottom of memory. There are also some processors that can support both ways, for example: ARM.

Processor Status (PS)

Instructions may yield results. For example, "is the content of register X greater than the content of register Y?" may yield true or false as a result. The PS register holds this and other information about the current state of the processor. Most processors have at least two modes of operation, kernel mode and user mode, and the PS register holds the information needed to determine the current mode.

2) Memory

All systems have a hierarchical memory structure consisting of memory at different levels of speed and capacity.

The fastest memory is cache memory which is, as its name implies, used to temporarily hold, or cache, the contents of main memory. This memory is very fast but rather expensive, therefore most processors have a small amount of cache memory built onto the chip, with more cache memory on the system board. Some processors have one cache that holds both instructions and data, while others have two: one for instructions and the other for data. The Alpha AXP processor has two internal caches: one for data (the D-Cache) and one for instructions (the I-Cache). Its external cache (or B-Cache) mixes the two together.

Finally, there is main memory, which is very slow relative to the external cache memory; compared with the cache built into the CPU, main memory positively crawls.

Cache memory and main memory must be synchronized (coherent). In other words, if a word in main memory is held in one or more locations in cache memory, the system must ensure that the contents of cache memory and main memory are the same. Part of the work of synchronizing the caches is done by the hardware and the other part is done by the operating system. For some other major tasks of the system, hardware and software must also work closely together.

3) Buses

The various components of the system board are interconnected by a system of connections called a bus. The system bus is divided into three logical functions: address bus, data bus and control bus. The address bus specifies the memory location (address) of the data transfer, and the data bus holds the transferred data. The data bus is bidirectional, it allows the CPU to read and also allows the CPU to write. The control bus contains various signal lines used to send clock and control signals in the system. There are many different bus types, the ISA and PCI buses are the common ways systems use to connect peripherals.

4) Controllers and Peripherals

Peripherals refer to physical devices such as graphics cards or disks controlled by a control chip on the system board or system board add-in cards. The IDE controller chip controls IDE disks, while the SCSI controller chip controls SCSI disks. These controllers are connected to the CPU and to each other through different buses. Most systems manufactured today use the PCI or ISA bus to connect the major components of the system together. The controller itself is also a processor like the CPU, they can be regarded as the intelligent assistant of the CPU, and the CPU has the highest control of the system.

All controllers are different, but generally they have registers used to control them. Software running on the CPU must be able to read and write these control registers. One register may contain a status code describing the error, and another register may be used for control purposes, changing the mode of the controller. Each controller on a bus can be individually addressed by the CPU, so that software device drivers can read and write its registers to control it. An IDE cable is a good example, it gives you the ability to access each drive on the bus separately. Another good example is the PCI bus, which allows each device (such as a graphics card) to be accessed independently.

5) Address Spaces

The system bus connecting the CPU to main memory is separate from the bus connecting the CPU to the system's hardware peripherals. The address space in which the hardware peripherals exist is known as I/O space. I/O space may itself be further subdivided, but we will not worry about that for now. The CPU can access both system memory space and I/O space, whereas the controllers themselves can only access system memory indirectly, and then only with the help of the CPU. From the point of view of a device, say the floppy disk controller, it sees only the address space (ISA) in which its control registers live, not system memory. A CPU uses separate instructions for accessing memory and I/O space. For example, there might be an instruction that means "read a byte from I/O address 0x3f0 into register X". This is exactly how the CPU controls the system's hardware peripherals: by reading and writing the registers that they expose in I/O space. Where in I/O space the registers for the common peripherals (the IDE controller, serial ports, floppy disk controller and so on) live has been set by convention over the years as the PC architecture has developed. I/O space address 0x3f0 is the address of one of the serial port's (COM1) control registers.
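
As a hedged illustration of touching I/O space from a user program on an x86 Linux machine, the sketch below uses glibc's <sys/io.h> functions: ioperm() asks the kernel for access to the port (root only), and inb() reads a byte from it. The port number simply follows the text above; on other architectures, or without the right privileges, this will not work.

/* Illustrative byte read from an I/O port using glibc's inb()/ioperm(). */
#include <stdio.h>
#include <sys/io.h>

int main(void)
{
    unsigned short port = 0x3f0;

    if (ioperm(port, 1, 1) != 0) {      /* need root / CAP_SYS_RAWIO */
        perror("ioperm");
        return 1;
    }
    unsigned char value = inb(port);    /* "read a byte from I/O address 0x3f0" */
    printf("port 0x%x reads 0x%02x\n", (unsigned)port, value);
    ioperm(port, 1, 0);                 /* drop access again */
    return 0;
}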

Sometimes the controller needs to read large amounts of memory directly from system memory, or write large amounts of data directly to system memory. For example, write user data to the hard disk. In this case, a direct memory access (DMA) controller is used, allowing hardware devices to directly access system memory, but of course this access must be done under the strict control and supervision of the CPU. 

6) Timer (clock)

All operating systems need to know the time, and modern PCs include a special peripheral called a real-time clock (RTC). It provides two things: reliable dates and precise time intervals. The RTC has its own battery, so even if the PC isn't powered up, it's still running. This is also why the PC always "knows" the correct date and time. Interval timing allows the operating system to precisely schedule essential work.

7) The Alpha AXP Processor

The Alpha AXP architecture is a 64-bit, load/store RISC architecture designed with speed in mind. All registers are 64 bits long: 32 integer registers and 32 floating point registers. Integer register 31 and floating point register 31 are used for null operations: a read from them generates a zero value and a write to them has no effect. All instructions are 32 bits long, and memory operations are either reads or writes. The architecture allows different implementations, so long as each implementation follows the architecture.

There are no instructions that operate directly on values stored in memory: all data operations are performed between registers. So, if you want to increment a counter in memory, you must first read it into a register, modify it and then write it back. The instructions only interact with each other when one instruction writes to a register or memory location and another reads that register or memory location. One interesting feature of the Alpha AXP is that there are instructions that can generate flags, such as testing whether two integers are equal; the result is not stored in a processor status register but in a third register instead. This may seem strange at first, but removing the dependency on a status register makes it much easier to build a CPU that can issue multiple instructions every cycle. Instructions that use unrelated registers do not have to wait for each other to execute, as they would if there were a single status register. The lack of direct operations on memory and the large number of registers also help with issuing multiple instructions at the same time.

The Alpha AXP architecture uses a set of subroutines called the privileged architecture library code (PALcode). PALcode is specific to the operating system, to the CPU implementation of the Alpha AXP architecture and to the system hardware. These subroutines provide the operating system with primitives for context switching, interrupts, exceptions and memory management. They can be invoked by hardware or by CALL_PAL instructions. PALcode is written in standard Alpha AXP assembler with some implementation-specific extensions that provide direct access to low-level hardware, such as internal processor registers. PALcode runs in PALmode, a privileged mode that stops some system events happening and allows the PALcode complete control of the system's physical hardware.

2. Software Basics

A program is a combination of computer instructions for performing a specific task. Programs can be written in assembly language, a very low-level computer language, or in high-level machine-independent languages such as C. An operating system is a special program that allows users to run applications through it, such as spreadsheets and word processors.

2.1 Computer Languages

2.1.1. Assembly language

The instructions that the CPU reads and executes from memory are incomprehensible to humans. They are machine codes that tell the computer exactly what to do. For example, the hexadecimal number 0x89E5 is an Intel 80486 instruction to copy the contents of the register ESP to the register EBP. One of the first software tools in early computers was the assembler, which took human-readable source files and assembled them into machine code. Assembly language explicitly handles operations on registers and on data that are specific to a particular microprocessor. The assembly language of the Intel X86 microprocessor is completely different from the assembly language of the Alpha AXP microprocessor. The following Alpha AXP assembly code demonstrates the types of operations a program can perform:

ldr r16, (r15)     ; line 1
ldr r17, 4(r15)    ; line 2
beq r16, r17, 100  ; line 3
str r17, (r15)     ; line 4
100:               ; line 5

The first statement (line 1) loads register 16 from the address held in register 15. The next instruction loads register 17 from the next location in memory. Line 3 compares the contents of register 16 with those of register 17; if they are equal, the program branches to label 100; otherwise it continues with line 4, storing the contents of register 17 into memory. If the data in memory was already the same, there would be no need to store it. Writing assembly-level programs is tricky, tedious and error-prone. Very little of the Linux kernel is written in assembly language, and those parts that are use it only for efficiency and are specific to particular microprocessors.

2.1.2 The C Programming Language and Compiler

Writing large programs in assembly language is difficult, time-consuming, error-prone, and the resulting programs are not portable and are tied to a particular processor family. A better option is to use a machine-independent language such as C. C allows you to describe programs and data to be processed in logical algorithms. A special program called a compiler reads a C program and converts it into assembly language, which in turn produces machine-dependent code. A good compiler can generate assembly instructions that are close to the efficiency of a program written by a good assembly programmer. Most of the Linux kernel is written in C language. The following C snippet:

if (x != y)
x = y;

Performs exactly the same operation as the assembly code in the previous example. If the contents of variable x are not the same as the contents of variable y, the contents of variable y are copied to variable x. C code is composed of routines, each of which performs a task. Routines can return any number or data type supported by C. Large programs such as the Linux kernel are composed of many C language modules, each with its own routines and data structures. These C source code modules collectively constitute the processing code for logical functions such as the file system.

C supports many types of variables. A variable is a specific location in memory that can be referenced by a symbolic name. In the C snippet above, x and y refer to locations in memory. The programmer does not need to care about the exact location of the variable in memory, this is what the linker (described below) has to deal with. Some variables contain various data such as integers, floating point numbers, etc. and others contain pointers.

A pointer is a variable that contains the address of other data in memory. Suppose a variable x is located at memory address 0x80010000; you might have a pointer px that points to x. px might itself be located at address 0x80010030, and its value would be the address of the variable x, 0x80010000.
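To make this concrete, here is a minimal C sketch (the addresses printed will be whatever the system actually assigns, not the 0x80010000 used above):

#include <stdio.h>

int main(void)
{
    int x = 42;          /* x lives at some address chosen by the system */
    int *px = &x;        /* px holds the address of x, not its value */

    printf("value of x                : %d\n", x);
    printf("address of x (value of px): %p\n", (void *)px);
    printf("value reached through px  : %d\n", *px);   /* dereferencing px yields x */
    return 0;
}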

C allows you to group related variables into structures. For example:

struct {
    int i;
    char b;
} my_struct;

is a data structure called my_struct, consisting of two elements: an integer (32 bits) i and a character (8 bits of data) b.

2.1.3 Linkers

The linker links several object modules and library files together into a single complete program. An object module is the machine-code output of an assembler or compiler; it includes machine code, data, and linker information for use by the linker. For example, one object module might include all of the program's database functions, while another handles command-line arguments. The linker resolves the references between object modules, that is, it determines the actual location of routines and data referenced in one module but defined in another. The Linux kernel is a single large program linked from many object modules.

2.2 What is an Operating System

Without software, a computer is just a bunch of hot electronic components. If hardware is the heart of a computer, software is its soul. An operating system is a set of system programs that allow users to run applications. The operating system abstracts the hardware of the system and presents a virtual machine in front of users and applications. It is software that characterizes a computer system. Most PCs can run one or more operating systems, each of which looks and feels very different. Linux is made up of parts with different functions, and the overall combination of these parts makes up the Linux operating system. The most obvious part of Linux is the Kernel itself, but it's useless without a shell or libraries.

To understand what an operating system is, take a look at what happens when you type the simplest command:

$ls
Mail c images perl
Docs tcl
$

The $ here is the prompt output by the login shell (bash in this case): it means the shell is waiting for you, the user, to enter a command. Typing ls causes the keyboard driver to recognize the typed characters and pass them to the shell for processing. The shell looks for an executable image of that name, finds /bin/ls, and calls kernel services to load the ls executable into virtual memory and start executing it. The ls executable finds files by making system calls into the kernel's file subsystem. The filesystem may use cached filesystem information, read the file information from disk through the disk device driver, or read the details of remote files by exchanging information with a remote host through the network device driver (filesystems can be mounted remotely via the NFS network filesystem). Regardless of how the file information is obtained, ls outputs it, and it is displayed on the screen through the display driver.

The above process looks quite complicated, but it shows that even the simplest commands are the result of cooperation between various functional modules of the operating system, and only in this way can they provide you (the user) with a complete view of the system.

2.2.1 Memory management

With unlimited resources, such as memory, much of what an operating system has to do would be redundant. A fundamental trick of all operating systems is to make a small amount of physical memory behave as if there were considerably more. This apparently large memory is called virtual memory: software running on the system is led to believe it has a large amount of memory. The system divides memory into manageable pages and swaps these pages to the hard disk while the system runs. Application software does not notice, because the operating system also uses another technique: multiprocessing.

2.2.2 Processes

A process can be viewed as an executing program, and each process is an independent entity of a specific program that is running. If you look at your Linux system, you will see that there are many processes running. For example: typing ps on my system shows the following processes:

$ ps
PID TTY STAT TIME COMMAND
158 pRe 1 0:00 -bash
174 pRe 1 0:00 sh /usr/X11R6/bin/startx
175 pRe 1 0:00 xinit /usr/X11R6/lib/X11/xinit/xinitrc --
178 pRe 1 N 0:00 bowman
182 pRe 1 N 0:01 rxvt -geometry 120x35 -fg white -bg black
184 pRe 1 < 0:00 xclock -bg grey -geometry -1500-1500 -padding 0
185 pRe 1 < 0:00 xload -bg grey -geometry -0-0 -label xload
187 pp6 1 9:26 /bin/bash
202 pRe 1 N 0:00 rxvt -geometry 120x35 -fg white -bg black
203 ppc 2 0:00 /bin/bash
1796 pRe 1 N 0:00 rxvt -geometry 120x35 -fg white -bg black
1797 v06 1 0:00 /bin/bash
3056 pp6 3 < 0:02 emacs intro/introduction.tex
3270 pp6 3 0:00 ps
$

If my system had multiple CPUs, then each process could (at least in theory) run on a different CPU. Unfortunately there is only one, so the operating system again resorts to a trick: it runs each process in turn for a short period of time, called a time slice. This trick is called multiprocessing, or scheduling, and it fools each process into thinking it is the only one. Processes are protected from one another, so if one process crashes or malfunctions it does not affect the others. The operating system implements this protection by giving each process a separate address space; a process can access only its own address space.

2.2.3 Device Drivers

Device drivers make up a major part of the Linux kernel. Like the rest of the operating system, they run in a privileged environment, and if something goes wrong they can cause serious problems. Device drivers control the interaction between the operating system and the hardware devices it manages. For example, the filesystem writes data blocks to an IDE disk through the generic block device interface; the driver handles the details and the device-specific parts. A device driver is tied to the specific controller chip it drives, so if your system has an NCR810 SCSI controller, you need the NCR810 driver.

2.2.4 The Filesystems

As in Unix, the Linux system does not access individual filesystems by device identifiers (such as drive numbers or drive names); instead they are linked into a single tree structure. Each new filesystem that Linux mounts is mounted onto a specified mount directory, for example /mnt/cdrom, and is thereby merged into this single filesystem tree. An important feature of Linux is that it supports many different filesystems; this makes it very flexible and able to coexist well with other operating systems. The most commonly used filesystem for Linux is EXT2, which is supported by most Linux distributions.

The filesystem presents the files and directories stored on the system's hard disks to the user in an understandable and uniform form, so that the user does not have to consider the type of the filesystem or the characteristics of the underlying physical device. Linux transparently supports multiple filesystems (such as MS-DOS and EXT2) and integrates all mounted files and filesystems into one virtual filesystem. Users and processes therefore generally do not need to know what type of filesystem their files are on; they just use them.

Block device drivers mask the distinction between physical block device types (eg IDE and SCSI). For a file system, a physical device is a linear collection of data blocks. The block size of different devices may be different. For example, floppy drives are generally 512 bytes, while IDE devices are usually 1024 bytes. Again, these differences are masked for system users. The EXT2 filesystem looks the same no matter what device it uses.

2.3 Kernel Data Structures

The operating system must record a lot of information about the current state of the system. If something happens in the system, these data structures must change accordingly to reflect the current reality. For example, when a user logs in to the system, a new process needs to be created. The kernel must accordingly create the data structures representing this new process, linked to the data structures representing other processes in the system.

Such data structures are mostly in physical memory and can only be accessed by the core and its subsystems. Data structures include data and pointers (addresses of other data structures or routines). At first glance, the data structures used by the Linux kernel can be quite confusing. In fact, each data structure has a purpose, and although some data structures are used in multiple subsystems, they are actually much simpler than when you first see them.

The key to understanding the Linux kernel is to understand its data structures and the many functions the kernel uses to manipulate them. This book describes the Linux kernel in terms of its data structures: it discusses each kernel subsystem's algorithms, its methods of operation, and its use of the kernel's data structures.

2.3.1 Linked Lists

Linux uses a common software-engineering technique to link its data structures together: most of the time it uses linked lists. If each data structure describes a single instance of an object or event, such as a process or a network device, the kernel must be able to find all instances. In a linked list, a root pointer contains the address of the first data structure, or element, and each data structure in the list contains a pointer to the next element. The next pointer of the last element may be 0 or NULL, marking the end of the list. In a doubly linked list, each element contains not only a pointer to the next element but also a pointer to the previous one. A doubly linked list makes it easier to add or remove elements from the middle of the list, but it requires more memory accesses. This is a typical operating-system dilemma: memory accesses versus CPU cycles.
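As an illustration of the idea (not the kernel's actual <linux/list.h> API), a doubly linked list in C might look like this minimal sketch:

#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *prev, *next;
};

struct node *head;                /* root pointer: NULL means the list is empty */

static void insert_front(int value)
{
    struct node *n = malloc(sizeof(*n));
    n->value = value;
    n->prev = NULL;
    n->next = head;
    if (head)
        head->prev = n;           /* fix the back pointer of the old first element */
    head = n;
}

static void remove_node(struct node *n)
{
    if (n->prev)
        n->prev->next = n->next;
    else
        head = n->next;           /* n was the first element */
    if (n->next)
        n->next->prev = n->prev;
    free(n);
}

int main(void)
{
    insert_front(1);
    insert_front(2);
    insert_front(3);
    remove_node(head->next);      /* removing from the middle needs no traversal */
    for (struct node *p = head; p; p = p->next)
        printf("%d\n", p->value);
    return 0;
}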

2.3.2 Hash Tables

Linked lists are a convenient data structure, but traversing them can be inefficient: to find a particular element you may have to search the whole list. Linux uses another technique, hashing, to get around this limitation. A hash table is an array, or vector, of pointers. An array or vector is simply a set of items laid out sequentially in memory; a bookshelf could be said to be an array of books. Arrays are accessed by an index, which is an offset into the array. Returning to the bookshelf example, you could describe each book by its position on the shelf: say, book 5.

A hash table is an array of pointers to data structures whose index is derived from information in the data structures themselves. If you used a data structure to describe the population of a village, you could use age as the index. To find a particular person's data, you would use their age as an index into the population hash table and follow the pointer to the data structure holding their details. Unfortunately, many people in a village may have the same age, so the hash table entry points to a further linked list, each element of which describes one person of that age. Even so, searching these short lists is still faster than searching all of the data structures.
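Continuing the village example, a minimal C sketch of a hash table with chained entries (purely illustrative, not a kernel data structure) might look like this:

#include <stdio.h>
#include <string.h>

#define MAX_AGE 128

struct person {
    char name[32];
    int age;
    struct person *next;                /* chain of people with the same age */
};

static struct person *by_age[MAX_AGE];  /* the hash table: an array of pointers */

static void add_person(struct person *p)
{
    int idx = p->age % MAX_AGE;         /* the index is derived from the data itself */
    p->next = by_age[idx];
    by_age[idx] = p;
}

static struct person *find_person(const char *name, int age)
{
    for (struct person *p = by_age[age % MAX_AGE]; p; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;                   /* only the short chain is searched */
    return NULL;
}

int main(void)
{
    static struct person alice = { "alice", 30 }, bob = { "bob", 30 };
    add_person(&alice);
    add_person(&bob);
    printf("%s\n", find_person("bob", 30) ? "found bob" : "not found");
    return 0;
}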

Hash tables are used to speed up access to commonly used data structures, and in Linux they are often used to implement caches. A cache holds information that needs to be accessed quickly and is a subset of the full set of available information. Data structures are placed in a cache and kept there because the kernel accesses them frequently. Caches do have a drawback: they are more complex to use than a simple linked list or hash table. If the data structure can be found in the cache (this is known as a cache hit), all is well. If it cannot, the relevant data structure must be looked up and, if found, added to the cache. Adding a new data structure to the cache may require discarding an old cache entry, and Linux must decide which entry to discard, at the risk of discarding the very data structure it will need next.

2.3.3 Abstract Interfaces

The Linux kernel often abstracts its interfaces. An interface is a series of routines and data structures that work in a specific way. For example: All network device drivers must provide specific routines to deal with specific data structures. In the way of abstract interface, the general code layer can use the services (interfaces) provided by the underlying special code. For example, the network layer is generic, while it is supported by underlying device-specific code that conforms to a standard interface.

Usually these lower layers register with the higher layers at startup. This registration process is usually implemented by adding a data structure to the linked list. For example, each filesystem linked to the kernel is registered when the kernel starts (or if you use modules, the first time the filesystem is used). You can view the file /proc/filesystems to check which filesystems are registered. The data structures used for registration usually include pointers to functions. This is the address of a software function that performs a specific task. Using the file system registration example again, each data structure passed to the Linux kernel during file system registration includes the address of a routine associated with the specific file system, which must be called when the file system is mounted.
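As a rough illustration of this registration pattern (the names fs_type, register_fs and find_fs are invented for this sketch and are not the kernel's real API), the generic layer keeps a linked list of structures whose function pointers are filled in by the lower layer:

#include <stdio.h>
#include <string.h>

struct fs_type {
    const char *name;
    int (*mount)(const char *dev);    /* called when a filesystem of this type is mounted */
    struct fs_type *next;
};

static struct fs_type *fs_list;       /* head of the list of registered types */

static void register_fs(struct fs_type *fs)
{
    fs->next = fs_list;               /* insert at the head of the list */
    fs_list = fs;
}

static struct fs_type *find_fs(const char *name)
{
    for (struct fs_type *p = fs_list; p; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

static int ext2_mount(const char *dev)
{
    printf("mounting %s as ext2\n", dev);
    return 0;
}

static struct fs_type ext2_fs = { "ext2", ext2_mount, NULL };

int main(void)
{
    register_fs(&ext2_fs);                 /* done at startup or module load */
    struct fs_type *fs = find_fs("ext2");  /* the generic layer looks it up by name */
    if (fs)
        fs->mount("/dev/sda1");
    return 0;
}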

5. Linux Kernel Management

1. Virtual folder

1. Introduction to Virtual Folders

Virtual folders are folders whose contents are stored in memory rather than on the hard disk; /proc and /sys are virtual folders. Some of the files in them return a great deal of information when viewed, yet the size of the file itself is shown as 0 bytes. In addition, the time and date attributes of most of these special files are usually the current system time and date, since they are refreshed (held in RAM) continuously.

The /proc filesystem presents a view of the kernel's internal data structures to users. It can be used to inspect and modify certain internal kernel data structures, thereby changing the behaviour of the kernel.

The /proc file system provides an easy way to improve application and overall system performance by fine-tuning system resources. The /proc file system is a virtual file system that is dynamically created by the kernel to generate data. It is organized into directories, each of which corresponds to tunable options for a particular subsystem.

The /proc directory contains many subdirectories named with numbers, and these numbers represent the process number of the currently running process in the system, which contains multiple information files related to the corresponding process.

[root@rhel5 ~]# ll /proc
total 0
dr-xr-xr-x  5 root      root              0 Feb  8 17:08 1
dr-xr-xr-x  5 root      root              0 Feb  8 17:08 10
dr-xr-xr-x  5 root      root              0 Feb  8 17:08 11
dr-xr-xr-x  5 root      root              0 Feb  8 17:08 1156
dr-xr-xr-x  5 root      root              0 Feb  8 17:08 139
dr-xr-xr-x  5 root      root              0 Feb  8 17:08 140
dr-xr-xr-x  5 root      root              0 Feb  8 17:08 141
dr-xr-xr-x  5 root      root              0 Feb  8 17:09 1417
dr-xr-xr-x  5 root      root              0 Feb  8 17:09 1418

Listed above are some of the process-related directories in /proc; each directory contains files holding information about that process. Below are the files for a saslauthd process with PID 2674 running on the author's system (RHEL 5.3); some of these files exist for every process.

[root@rhel5 ~]# ll /proc/2674
total 0
dr-xr-xr-x 2 root root 0 Feb  8 17:15 attr
-r-------- 1 root root 0 Feb  8 17:14 auxv
-r--r--r-- 1 root root 0 Feb  8 17:09 cmdline
-rw-r--r-- 1 root root 0 Feb  8 17:14 coredump_filter
-r--r--r-- 1 root root 0 Feb  8 17:14 cpuset
lrwxrwxrwx 1 root root 0 Feb  8 17:14 cwd -> /var/run/saslauthd
-r-------- 1 root root 0 Feb  8 17:14 environ
lrwxrwxrwx 1 root root 0 Feb  8 17:09 exe -> /usr/sbin/saslauthd
dr-x------ 2 root root 0 Feb  8 17:15 fd
-r-------- 1 root root 0 Feb  8 17:14 limits
-rw-r--r-- 1 root root 0 Feb  8 17:14 loginuid
-r--r--r-- 1 root root 0 Feb  8 17:14 maps
-rw------- 1 root root 0 Feb  8 17:14 mem
-r--r--r-- 1 root root 0 Feb  8 17:14 mounts
-r-------- 1 root root 0 Feb  8 17:14 mountstats
-rw-r--r-- 1 root root 0 Feb  8 17:14 oom_adj
-r--r--r-- 1 root root 0 Feb  8 17:14 oom_score
lrwxrwxrwx 1 root root 0 Feb  8 17:14 root -> /
-r--r--r-- 1 root root 0 Feb  8 17:14 schedstat
-r-------- 1 root root 0 Feb  8 17:14 smaps
-r--r--r-- 1 root root 0 Feb  8 17:09 stat
-r--r--r-- 1 root root 0 Feb  8 17:14 statm
-r--r--r-- 1 root root 0 Feb  8 17:10 status
dr-xr-xr-x 3 root root 0 Feb  8 17:15 task
-r--r--r-- 1 root root 0 Feb  8 17:14 wchan

cmdline — the full command to start the current process, but this file in the zombie process directory contains no information;

[root@rhel5 ~]# more /proc/2674/cmdline 
/usr/sbin/saslauthd

environ — the list of environment variables of the current process; entries are separated by NUL characters, each in NAME=value form;

[root@rhel5 ~]# more /proc/2674/environ 
TERM=linuxauthd

cwd — a symbolic link to the directory where the current process is running;

exe — a symbolic link to the executable (full path) that started the current process, a copy of the current process can be started via /proc/N/exe;

fd — this is a directory containing a file descriptor for each file opened by the current process, which is a symbolic link to the actual file;

[root@rhel5 ~]# ll /proc/2674/fd
total 0
lrwx------ 1 root root 64 Feb  8 17:17 0 -> /dev/null
lrwx------ 1 root root 64 Feb  8 17:17 1 -> /dev/null
lrwx------ 1 root root 64 Feb  8 17:17 2 -> /dev/null
lrwx------ 1 root root 64 Feb  8 17:17 3 -> socket:[7990]
lrwx------ 1 root root 64 Feb  8 17:17 4 -> /var/run/saslauthd/saslauthd.pid
lrwx------ 1 root root 64 Feb  8 17:17 5 -> socket:[7991]
lrwx------ 1 root root 64 Feb  8 17:17 6 -> /var/run/saslauthd/mux.accept

limits — soft limits, hard limits and management units for each limited resource used by the current process; this file can only be read by the UID user who actually started the current process; (this feature is supported in kernel versions after 2.6.24);

maps — a list of mapped regions in memory and their access permissions for each executable and library file associated with the current process;

[root@rhel5 ~]# cat /proc/2674/maps 
00110000-00239000 r-xp 00000000 08:02 130647     /lib/libcrypto.so.0.9.8e
00239000-0024c000 rwxp 00129000 08:02 130647     /lib/libcrypto.so.0.9.8e
0024c000-00250000 rwxp 0024c000 00:00 0 
00250000-00252000 r-xp 00000000 08:02 130462     /lib/libdl-2.5.so
00252000-00253000 r-xp 00001000 08:02 130462     /lib/libdl-2.5.so

mem — the memory space occupied by the current process; it is accessed through system calls such as open, read and lseek and cannot be read directly with file-viewing commands;

root — a symbolic link to the root directory of the current process; on Unix and Linux systems, the chroot command is usually used to make each process run in a separate root directory;

stat — status information of the current process, including a data column formatted by the system, with poor readability, usually used by the ps command;

statm — status information about the memory occupied by the current process, usually expressed in "pages";

status — similar to the information provided by stat, but with better readability, as shown below, each line represents one attribute information; please refer to the man page of proc for details;

[root@rhel5 ~]# more /proc/2674/status 
Name:   saslauthd
State:  S (sleeping)
SleepAVG:       0%
Tgid:   2674
Pid:    2674
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 32
Groups:
VmPeak:     5576 kB
VmSize:     5572 kB
VmLck:         0 kB
VmHWM:       696 kB
VmRSS:       696 kB
…………

task — a directory containing information about each thread run by the current process. The files for each thread are stored in a subdirectory named after the thread ID (tid), and their contents are similar to those of the process's own directory; (this feature is supported since kernel version 2.6).

2. Introduction to common files in the /proc directory

/proc

proc is short for process. This directory holds process-related information: the system's process information and kernel state are exposed through proc, which is a virtual folder whose data reflects the current state held in memory;

/proc/apm

Advanced Power Management (APM) version information and battery-related status information, usually used by the apm command;

/proc/buddyinfo

Relevant information files for diagnosing memory fragmentation problems;

/proc/cmdline

The relevant parameter information passed to the kernel at startup, which is usually passed by boot management tools such as lilo or grub;

[root@rhel5 ~]# more /proc/cmdline 
ro root=/dev/VolGroup00/LogVol00 rhgb quiet

/proc/cpuinfo

A file with information about the processor; 

/proc/crypto

A list of cryptographic algorithms used by the installed kernels on the system and details of each algorithm;

[root@rhel5 ~]# more /proc/crypto 
name         : crc32c
driver       : crc32c-generic
module       : kernel
priority     : 0
type         : digest
blocksize    : 32
digestsize   : 4
…………

/proc/devices

Information about all block devices and character devices loaded by the system, including the major device number and device group (device type corresponding to the major device number) name; 

[root@rhel5 ~]# more /proc/devices 
Character devices:
  1 mem
  4 /dev/vc/0
  4 tty
  4 ttyS
  …………

Block devices:
  1 ramdisk
  2 fd
  8 sd
  …………

/proc/diskstats

List of disk I/O statistics for each disk device; (versions after kernel 2.5.69 support this function);

/proc/dma

A list of information about each ISA DMA channel in use and registered;

[root@rhel5 ~]# more /proc/dma
2: floppy
4: cascade

/proc/execdomains

A list of information about the execution domains currently supported by the kernel (each operating system's unique "personality"); 

[root@rhel5 ~]# more /proc/execdomains 
0-0     Linux                   [kernel]

/proc/fb

Frame buffer device list file, including the device number and related driver information of the frame buffer device; 

/proc/filesystems

A list of the filesystem types currently supported by the kernel; filesystems marked nodev do not require a block device. When a device is mounted without a filesystem type being specified, this file is normally used to determine the required filesystem type;

[root@rhel5 ~]# more /proc/filesystems 
nodev   sysfs
nodev   rootfs
nodev   proc
        iso9660
        ext3
…………
…………

/proc/interrupts

A list of interrupt counts for each IRQ on X86 or X86_64 architecture systems; on a multiprocessor platform, each CPU keeps its own count for each I/O device;

[root@rhel5 ~]# more /proc/interrupts 
           CPU0       
  0:    1305421    IO-APIC-edge  timer
  1:         61    IO-APIC-edge  i8042
185:       1068   IO-APIC-level  eth0
…………

/proc/iomem

The map of the system's physical memory, showing which address ranges are occupied by the memory (RAM or ROM) of each physical device;

[root@rhel5 ~]# more /proc/iomem 
00000000-0009f7ff : System RAM
0009f800-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
  …………

/proc/ioports

A list of input-output port range information that is currently in use and has been registered to communicate with physical devices; as shown below, the first column represents the registered I/O port range, followed by related devices;

[root@rhel5 ~]# less /proc/ioports 
0000-001f : dma1
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-006f : keyboard
…………

/proc/kallsyms

The symbol definitions exported by the kernel, used by module-management tools to dynamically link or bind loadable modules; (supported since kernel version 2.5.71); this file usually contains a large amount of information;

[root@rhel5 ~]# more /proc/kallsyms 
c04011f0 T _stext
c04011f0 t run_init_process
c04011f0 T stext
  …………

/proc/kcore

The physical memory used by the system, presented in ELF core-file format; its apparent size is the physical memory (RAM) in use plus 4 KB. This file is used to examine the current state of kernel data structures, so it is normally used by debugging tools such as GDB and cannot usefully be opened with file-viewing commands;

/proc/kmsg

This file carries the messages output by the kernel and is normally consumed by programs such as /sbin/klogd or /bin/dmesg; do not try to open this file with file-viewing commands;

/proc/loadavg

Holds load-average information about CPU and disk I/O. The first three columns are the load averages over the last 1, 5 and 15 minutes, similar to the output of the uptime command; the fourth column is two values separated by a slash, the former being the number of entities (processes and threads) currently being scheduled by the kernel and the latter the total number of kernel scheduling entities in the system; the fifth column is the PID of the process most recently created by the kernel;

[root@rhel5 ~]# more /proc/loadavg 
0.45 0.12 0.04 4/125 5549

[root@rhel5 ~]# uptime
06:00:54 up  1:06,  3 users,  load average: 0.45, 0.12, 0.04
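Because /proc files are plain text, applications can read them directly; here is a minimal C sketch that parses the /proc/loadavg fields described above:

#include <stdio.h>

int main(void)
{
    double load1, load5, load15;
    int running, total, last_pid;
    FILE *f = fopen("/proc/loadavg", "r");

    if (!f)
        return 1;
    if (fscanf(f, "%lf %lf %lf %d/%d %d",
               &load1, &load5, &load15, &running, &total, &last_pid) == 6)
        printf("1m=%.2f 5m=%.2f 15m=%.2f running=%d/%d last_pid=%d\n",
               load1, load5, load15, running, total, last_pid);
    fclose(f);
    return 0;
}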

/proc/locks

Holds information about files currently locked by the kernel, including kernel-internal debugging data; each lock occupies one line and has a unique number. The second column of each line indicates the lock class: POSIX denotes the newer type of file lock, created by the lockf/fcntl system calls, while FLOCK is the traditional UNIX file lock created by the flock system call. The third column usually takes one of two values: ADVISORY means other users may not lock the file but may still read it, and MANDATORY means other users may not access the file in any way while it is locked;

[root@rhel5 ~]# more /proc/locks 
1: POSIX  ADVISORY  WRITE 4904 fd:00:4325393 0 EOF
2: POSIX  ADVISORY  WRITE 4550 fd:00:2066539 0 EOF
3: FLOCK  ADVISORY  WRITE 4497 fd:00:2066533 0 EOF

/proc/mdstat

The current status of multiple-device (md) RAID arrays. On a machine that does not use RAID it looks like this:

[root@rhel5 ~]# less /proc/mdstat 
Personalities : 
unused devices: <none>

/proc/meminfo

Information about current memory utilisation on the system, often used by the free command; you can read this file directly with file-viewing commands. Its content is shown in two columns: the name of the statistic and its value;

[root@rhel5 ~]# less /proc/meminfo 
MemTotal:       515492 kB
MemFree:          8452 kB
Buffers:         19724 kB
Cached:         376400 kB
SwapCached:          4 kB
…………

Check the amount of free memory:

grep MemFree /proc/meminfo    

/proc/mounts

In kernels before 2.4.19, this file listed all filesystems currently mounted by the system. Kernel 2.4.19 introduced a separate mount namespace for each process, and this file accordingly became a symbolic link to /proc/self/mounts (the list of mount points in each process's own mount namespace); /proc/self is a special directory that is described later;

[root@rhel5 ~]# ll /proc |grep mounts
lrwxrwxrwx  1 root      root             11 Feb  8 06:43 mounts -> self/mounts

As shown below, the first column is the mounted device, the second the mount point in the current directory tree, the third the filesystem type, the fourth the mount attributes (ro or rw), and the fifth and sixth columns exist to match the dump and fsck fields of the /etc/mtab format;

[root@rhel5 ~]# more /proc/mounts 
rootfs / rootfs rw 0 0
/dev/root / ext3 rw,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
/proc/bus/usb /proc/bus/usb usbfs rw 0 0
…………

/proc/modules

A list of all modules currently loaded into the kernel, which can be viewed with the lsmod command or read directly. As shown below, the first column is the module name; the second, the memory occupied by the module; the third, the number of instances loaded; the fourth, the other modules this module depends on; the fifth, the load state of the module (Live: loaded; Loading: being loaded; Unloading: being unloaded); and the sixth, the offset of the module in kernel memory;

[root@rhel5 ~]# more /proc/modules 
autofs4 24517 2 - Live 0xe09f7000
hidp 23105 2 - Live 0xe0a06000
rfcomm 42457 0 - Live 0xe0ab3000
l2cap 29505 10 hidp,rfcomm, Live 0xe0aaa000
…………

/proc/partitions

Information such as the major device number (major) and the minor device number (minor) of each partition of the block device, including the number of blocks contained in each partition (as shown in the third column of the output below);

[root@rhel5 ~]# more /proc/partitions 
major minor  #blocks  name

   8     0   20971520 sda
   8     1     104391 sda1
   8     2    6907950 sda2
   8     3    5630782 sda3
   8     4          1 sda4
   8     5    3582463 sda5

/proc/pci

A list of all PCI devices found during kernel initialisation and their configuration, mostly IRQ-related information, which is not very readable; the "/sbin/lspci -vb" command gives more understandable output. Since the 2.6 kernel this file has been replaced by the /proc/bus/pci directory and the files under it;

/proc/slabinfo

Objects that are used frequently in the kernel (such as inodes and dentries) have their own caches, known as slab pools; the /proc/slabinfo file lists the slab information for these objects. For details, see the slabinfo manual page in the kernel documentation;

[root@rhel5 ~]# more /proc/slabinfo 
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <ac
tive_slabs> <num_slabs> <sharedavail>
rpc_buffers            8      8   2048    2    1 : tunables   24   12    8 : slabdata      4      4      0
rpc_tasks              8     20    192   20    1 : tunables  120   60    8 : slabdata      1      1      0
rpc_inode_cache        6      9    448    9    1 : tunables   54   27    8 : slabdata      1      1      0
…………
…………
…………

/sys

The /sys directory holds hardware-related information. It too is a virtual folder: not a folder on the real hard disk, but data held in memory;

Rescan for a newly added hard drive:

echo "- - -" > /sys/class/scsi_host/hostX/scan   # X is a digit, starting from 0

/proc/sys

Different from the "read-only" attribute of other files under /proc, the information in this directory file can be modified, and the operating characteristics of the kernel can be controlled through these configurations;

Beforehand, you can use the "ls -l" command to check whether a file is "writable". Write operations are usually done using a format similar to "echo DATA > /path/to/your/filename". It should be noted that even if the file is writable, it cannot generally be edited with an editor.

Query the maximum number of threads supported by the system:

cat /proc/sys/kernel/threads-max

/proc/sys/debug subdirectory

This directory is usually an empty directory; 

/proc/sys/dev subdirectory

A directory providing parameter files for special devices on the system; the files for different devices are stored in different subdirectories. For example, most systems have /proc/sys/dev/cdrom and, if RAID support was enabled when the kernel was compiled, /proc/sys/dev/raid, which hold the parameter files for the system's cdrom and raid devices;

/proc/stat

Tracks various statistics accumulated since the system was last booted; as shown below,
the values after "cpu" are times, counted in jiffies (1/100 of a second), spent in user mode, low-priority (nice) user mode, system mode, idle mode, I/O wait mode, and so on;
the "intr" line gives interrupt information: the first value is the total number of interrupts since the system was started, and each following number is the count of a particular interrupt since the system was started;
"ctxt" gives the number of CPU context switches since the system was started;
"btime" gives the time at which the system booted, in seconds since the epoch;
"processes" (total_forks) is the number of tasks created since the system was started;
"procs_running" is the number of tasks currently in the run queue;
"procs_blocked" is the number of tasks currently blocked;

[root@rhel5 ~]# more /proc/stat
cpu  2751 26 5771 266413 2555 99 411 0
cpu0 2751 26 5771 266413 2555 99 411 0
intr 2810179 2780489 67 0 3 3 0 5 0 1 0 0 0 1707 0 0 9620 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5504 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12781 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 427300
btime 1234084100
processes 3491
procs_running 1
procs_blocked 0

/proc/swaps

The swap areas on the current system and their space utilisation. If there are multiple swap areas, each appears on its own line in this file, and areas with a higher priority are used first. Below is the output on the author's system, which has a single swap partition;

[root@rhel5 ~]# more /proc/swaps 
Filename                                Type            Size    Used    Priority
/dev/sda8                               partition       642560  0       -1

/proc/uptime

The running time since the system was last started, as shown below, the first number represents the system running time, and the second number represents the system idle time, in seconds;
 

[root@rhel5 ~]# more /proc/uptime 
3809.86 3714.13

/proc/version

The version of the kernel running on the current system; on the author's RHEL 5.3 it also shows the gcc version installed on the system, as shown below;

[root@rhel5 ~]# more /proc/version 
Linux version 2.6.18-128.el5 ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Dec 17 11:42:39 EST 2008

/proc/vmstat

Various statistics about the system's virtual memory. The amount of information can be fairly large and varies from system to system, but the output is quite readable; below is a fragment from the author's machine (kernels since 2.6 support this file);

[root@rhel5 ~]# more /proc/vmstat 
nr_anon_pages 22270
nr_mapped 8542
nr_file_pages 47706
nr_slab 4720
nr_page_table_pages 897
nr_dirty 21
nr_writeback 0
…………

/proc/zoneinfo

The detailed information list of the memory zone (zone), the amount of information is large, the following is an output snippet:

[root@rhel5 ~]# more /proc/zoneinfo 
Node 0, zone      DMA
  pages free     1208
        min      28
        low      35
        high     42
        active   439
        inactive 1139
        scanned  0 (a: 7 i: 30)
        spanned  4096
        present  4096
    nr_anon_pages 192
    nr_mapped    141
    nr_file_pages 1385
    nr_slab      253
    nr_page_table_pages 2
    nr_dirty     523
    nr_writeback 0
    nr_unstable  0
    nr_bounce    0
        protection: (0, 0, 296, 296)
  pagesets
  all_unreclaimable: 0
  prev_priority:     12
  start_pfn:         0
…………

3. Create /proc directory subnode

The kernel module mechanism, and the /proc file system, are typical features of Linux systems. Can you take advantage of these facilities to create corresponding nodes in the /proc directory for special files, devices, public variables, etc.? The answer is of course yes.

There are many ways that kernel modules interact with outside of kernel space, and the /proc filesystem is one of the main ways.

Before going further, let us review some basics of the /proc filesystem. A filesystem is the way the operating system organizes files on a disk or other peripheral. Linux supports many types of filesystems: minix, ext, ext2, msdos, umsdos, vfat, proc, nfs, iso9660, hpfs, sysv, smb, ncpfs, and so on. Unlike the others, the /proc filesystem is a pseudo-filesystem: it is called that because it has no on-disk component; it exists only in memory and occupies no external storage space. Yet it behaves very much like a filesystem: it provides an interface for accessing kernel data in the form of files and can be operated on with all the usual file tools. For example, we can view the contents of a proc file with cat, more or other text tools. More importantly, users and applications can obtain system information through proc and can change some kernel parameters. Since system information, such as process state, changes dynamically, when a user or application reads a proc file, proc fetches the required information from the kernel on the fly and returns it. The /proc filesystem is normally mounted on the /proc directory.

How can the /proc filesystem reflect the state of a kernel module? Let's look at the slightly more complex example below.

proc_example.c

…………

int init_module()
{
        int rv = 0;

        /* create the directory */
        example_dir = proc_mkdir(MODULE_NAME, NULL);
        if (example_dir == NULL) {
                rv = -ENOMEM;
                goto out;
        }
        example_dir->owner = THIS_MODULE;

        /* quickly create the read-only file jiffies */
        jiffies_file = create_proc_read_entry("jiffies", 0444, example_dir,
                                              proc_read_jiffies, NULL);
        if (jiffies_file == NULL) {
                rv = -ENOMEM;
                goto no_jiffies;
        }
        jiffies_file->owner = THIS_MODULE;

        /* create the regular files foo and bar */
        foo_file = create_proc_entry("foo", 0644, example_dir);
        if (foo_file == NULL) {
                rv = -ENOMEM;
                goto no_foo;
        }
        strcpy(foo_data.name, "foo");
        strcpy(foo_data.value, "foo");
        foo_file->data = &foo_data;
        foo_file->read_proc = proc_read_foobar;
        foo_file->write_proc = proc_write_foobar;
        foo_file->owner = THIS_MODULE;

        bar_file = create_proc_entry("bar", 0644, example_dir);
        if (bar_file == NULL) {
                rv = -ENOMEM;
                goto no_bar;
        }
        strcpy(bar_data.name, "bar");
        strcpy(bar_data.value, "bar");
        bar_file->data = &bar_data;
        bar_file->read_proc = proc_read_foobar;
        bar_file->write_proc = proc_write_foobar;
        bar_file->owner = THIS_MODULE;

        /* create the device file tty */
        tty_device = proc_mknod("tty", S_IFCHR | 0666, example_dir, MKDEV(5, 0));
        if (tty_device == NULL) {
                rv = -ENOMEM;
                goto no_tty;
        }
        tty_device->owner = THIS_MODULE;

        /* create the symbolic link jiffies_too */
        symlink = proc_symlink("jiffies_too", example_dir, "jiffies");
        if (symlink == NULL) {
                rv = -ENOMEM;
                goto no_symlink;
        }
        symlink->owner = THIS_MODULE;

        /* everything was created successfully */
        printk(KERN_INFO "%s %s initialised\n",
               MODULE_NAME, MODULE_VERSION);
        return 0;

        /* error handling */
no_symlink: remove_proc_entry("tty", example_dir);
no_tty:     remove_proc_entry("bar", example_dir);
no_bar:     remove_proc_entry("foo", example_dir);
no_foo:     remove_proc_entry("jiffies", example_dir);
no_jiffies: remove_proc_entry(MODULE_NAME, NULL);
out:        return rv;
}

…………

The kernel module proc_example first creates its own subdirectory proc_example under /proc. It then creates three regular proc files (foo, bar, jiffies), one device file (tty) and one symbolic link (jiffies_too) in that directory. Specifically, foo and bar are two read-write files that share the functions proc_read_foobar and proc_write_foobar; jiffies is a read-only file that reports the current system time in jiffies; and jiffies_too is a symbolic link to the file jiffies.
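The example above shows only init_module(); for completeness, the read and write handlers it refers to might look roughly like the sketch below on a 2.6-era kernel that still provides create_proc_entry()/create_proc_read_entry(). The fb_data_t layout and the buffer sizes are assumptions inferred from the strcpy() calls above, not taken from the original source:

#include <linux/kernel.h>
#include <linux/proc_fs.h>
#include <linux/jiffies.h>
#include <linux/uaccess.h>

/* assumed layout of foo_data / bar_data, based on the strcpy() calls above */
typedef struct fb_data_t {
        char name[8];
        char value[16];
} fb_data_t;

/* read handler shared by foo and bar; data points at foo_data or bar_data */
static int proc_read_foobar(char *page, char **start, off_t off,
                            int count, int *eof, void *data)
{
        fb_data_t *fb = data;
        return sprintf(page, "%s = '%s'\n", fb->name, fb->value);
}

/* write handler shared by foo and bar; copies the user buffer into value */
static int proc_write_foobar(struct file *file, const char __user *buffer,
                             unsigned long count, void *data)
{
        fb_data_t *fb = data;
        unsigned long len = min(count, (unsigned long)(sizeof(fb->value) - 1));

        if (copy_from_user(fb->value, buffer, len))
                return -EFAULT;
        fb->value[len] = '\0';
        return len;
}

/* read handler for the jiffies file */
static int proc_read_jiffies(char *page, char **start, off_t off,
                             int count, int *eof, void *data)
{
        return sprintf(page, "jiffies = %lu\n", jiffies);
}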

2. Kernel management tools

1. sysctl management tool

Parameters modified with sysctl take effect immediately but only temporarily; to make them persistent, write them to a configuration file.

# configuration files
/run/sysctl.d/*.conf
/etc/sysctl.d/*.conf
/usr/local/lib/sysctl.d/*.conf
/usr/lib/sysctl.d/*.conf
/lib/sysctl.d/*.conf
/etc/sysctl.conf  # settings are mainly kept here; this is the file usually edited

Format:

Unlike the paths under /proc/sys, sysctl keys use dots (.) to separate path components, and the /proc/sys prefix is omitted, because this configuration maps directly onto the /proc/sys folder.

Common parameters:

-w   temporarily change the value of a specified parameter
-a   display all system parameters currently in effect
-p   load system parameters from the specified file

Example:

Prohibit pinging the machine:

[root@centos8 ~]#cat /etc/sysctl.d/test.conf
net.ipv4.icmp_echo_ignore_all=1
[root@centos8 ~]#sysctl -p /etc/sysctl.d/test.conf

To clear caches, write 1 (page cache), 2 (dentries and inodes) or 3 (both) to drop_caches:

echo 1|2|3 >/proc/sys/vm/drop_caches
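Whether you use the sysctl command or echo, the effect is simply a write to the corresponding file under /proc/sys. A minimal C sketch doing the same thing for the ping example above (it must run as root):

#include <stdio.h>

int main(void)
{
    /* same effect as: sysctl -w net.ipv4.icmp_echo_ignore_all=1 */
    FILE *f = fopen("/proc/sys/net/ipv4/icmp_echo_ignore_all", "w");

    if (!f) {
        perror("open");
        return 1;
    }
    fputs("1\n", f);
    fclose(f);
    return 0;
}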

2. ulimit limits system resources

ulimit limits certain system resources of the user, including the number of files that can be opened, the CPU time that can be used, the total amount of memory that can be used, etc.

Syntax:

 ulimit [-acdfHlmnpsStvw] [size] 

Options and parameters:

-H :  hard limit, a strict setting that cannot be exceeded
-S :  soft limit, a warning threshold; it may be exceeded, but a warning message is issued
-a :  with no option or argument, lists all current limits
-c :  when a program crashes, the system may write its in-memory state to a file, known as a core file; this sets the maximum core file size
-f :  the maximum size of files this shell may create (often set to 2 GB), in Kbytes
-d :  the maximum data segment size a program may use
-l :  the amount of memory that may be locked
-m :  the maximum resident set size, in kbytes
-n :  the maximum number of file descriptors that may be open at the same time
-p :  the maximum pipe buffer size, in kbytes
-s :  the maximum stack size, in kbytes
-v :  the maximum amount of virtual memory, in kbytes
-t :  the maximum CPU time that may be used (in seconds)
-u :  the maximum number of processes a single user may run

General simple settings:

ulimit -SHn 65535 
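Behind the scenes, ulimit maps onto the getrlimit()/setrlimit() system calls. A minimal C sketch that raises the open-file limit to the same 65535 (raising the hard limit requires root or CAP_SYS_RESOURCE):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)      /* what "ulimit -Sn" / "ulimit -Hn" report */
        printf("soft=%lu hard=%lu\n",
               (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

    rl.rlim_cur = 65535;                         /* soft limit */
    rl.rlim_max = 65535;                         /* hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)      /* equivalent of "ulimit -SHn 65535" */
        perror("setrlimit");
    return 0;
}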

To make it permanent:

[root@www ~]# vi /etc/security/limits.conf 
* soft nproc 65535 
* hard nproc 65535 
* soft nofile 409600 
* hard nofile 409600 

Explanation:

* means for all users

nproc is the maximum number of processes

nofile is the maximum number of open files

Case:

[root@www ~]# vi /etc/security/limits.conf 
# End of file 
*           soft  core   unlimited 
*           hard  core   unlimited 
*           soft  fsize  unlimited 
*           hard  fsize  unlimited 
*           soft  data   unlimited 
*           hard  data   unlimited 
*           soft  nproc  65535 
*           hard  nproc  65535 
*           soft  stack  unlimited 
*           hard  stack  unlimited 
*           soft  nofile  409600 
*           hard  nofile  409600 

cat /etc/security/limits.conf

cat /etc/security/limits.d/90-nproc.conf

Improper sysctl/ulimit settings can make the system respond very slowly even while the usual metrics above look perfectly normal.

Case: for convenience, the author configured all the sysctl/ulimit parameters for redis/elasticsearch/networking at once in the system initialisation script configure_server.py. As a result, the vm.max_map_count parameter that elasticsearch requires caused the redis server to respond slowly after running for a long time.

3. Linux system resource limits

Oracle runs on Linux and places certain requirements on resource limits.

limits.conf and sysctl.conf 

Installing Oracle cannot avoid setting parameters in these two files: sysctl.conf applies resource limits to the system as a whole, while limits.conf applies per-user resource limits. The latter relies on the PAM mechanism (Pluggable Authentication Modules), and its settings cannot exceed the operating system's own settings.

limits.conf syntax:

 username|@groupname type resource limit

username|@groupname: the user to be restricted. Prefix a group name with @ to distinguish it from a user name; the wildcard * can be used to limit all users.

Parameters:

  type:

    Can be soft, hard or -. soft is the value currently in effect on the system; hard is the maximum value that may be set. The soft limit cannot be higher than the hard limit. Using - sets both the soft and the hard value at once.

  resource:
  core - limits the size of core files
  data - maximum data size
  fsize - maximum file size
  memlock - maximum locked-in-memory address space
  nofile - maximum number of open files
  rss - maximum resident set size
  stack - maximum stack size
  cpu - maximum CPU time, in minutes
  nproc - maximum number of processes
  as - address space limit
  maxlogins - maximum number of logins allowed for this user

Query settings: ulimit command

Only valid for the current tty (terminal). The ulimit command itself distinguishes soft and hard settings: -H sets the hard limit and -S sets the soft limit.

Parameters:

-H set hard resource limits
-S set soft resource limits
-a show all current resource limits
-c size: set the maximum core file size, unit: blocks
-d size: set the maximum data segment size, unit: kbytes
-f size: set the maximum size of created files, unit: blocks
-l size: set the maximum amount of memory a process may lock, unit: kbytes
-m size: set the maximum resident set size, unit: kbytes
-n size: set the maximum number of file descriptors that may be open at once, unit: n
-p size: set the maximum pipe buffer size, unit: kbytes
-s size: set the maximum stack size, unit: kbytes
-t size: set the maximum amount of CPU time, unit: seconds
-v size: set the maximum amount of virtual memory, unit: kbytes

Note:

unlimited is a special value meaning no limit

sysctl.conf parameter description

Most kernel parameters live under /proc/sys and can be changed while the system is running, but such changes are lost after a reboot. /etc/sysctl.conf is an interface for changing a running Linux system: it contains advanced options for the TCP/IP stack and the virtual memory system, and kernel parameters modified there take effect permanently. In other words, the kernel files under /proc/sys correspond to the variables in the configuration file sysctl.conf.

Common configuration:

kernel.shmall=4294967296
vm.min_free_kbytes=262144
kernel.sem=4096 524288 4096 128
fs.file-max=6815744
net.ipv4.ip_local_port_range=9000 65500
net.core.rmem_default=262144
net.core.rmem_max=4194304
net.core.wmem_default=262144
net.core.wmem_max=1048576
fs.aio-max-nr=1048576
kernel.shmmni=4096
vm.nr_hugepages=8029

Explanation:

kernel.shmmax:
One of the most important kernel parameters; it defines the maximum size of a single shared memory segment. It should be set large enough to hold the entire SGA in one shared memory segment; setting it too low may force multiple shared memory segments to be created, which can degrade system performance. The slowdown occurs mainly at instance startup and when server processes are created: with multiple small shared memory segments, several virtual address segments have to be created at startup and every new process has to attach to each of them, which has a slight cost; at other times there is no impact.
Recommended values:
32-bit Linux: the maximum is 4 GB (4294967296 bytes) minus 1 byte, i.e. 4294967295. The recommended value is more than half of physical memory, so on a 32-bit system 4294967295 is generally used. 32-bit systems limit the SGA size anyway, so the SGA will always fit in a single shared memory segment.
64-bit Linux: the maximum is physical memory minus 1 byte; the recommended value is more than half of physical memory, and any value larger than SGA_MAX_SIZE is generally fine, so physical memory minus 1 byte can be used. For example, with 12 GB of physical memory, 12*1024*1024*1024-1 = 12884901887 can be used, and the SGA is sure to fit in a single shared memory segment.
kernel.shmall:
This parameter controls the total number of shared memory pages that may be used. The Linux shared memory page size is 4 KB, and shared memory segment sizes are integer multiples of the page size. If the maximum size of a shared memory segment is 16 GB, the number of shared memory pages needed is 16 GB / 4 KB = 4194304 pages, so on a 64-bit system with 16 GB of physical memory, kernel.shmall = 4194304 meets the requirement (almost twice the original setting of 2097152). shmmax can then be raised to 16 GB, and SGA_MAX_SIZE and SGA_TARGET can be set to 12 GB (or whatever maximum SGA size you want, e.g. anywhere from 2 GB to 14 GB; remember to leave room for the PGA, the OS and other memory users, and do not fill all 16 GB).
kernel.shmmni:
The maximum number of shared memory segments. The default of 4096 is normally more than enough.
fs.file-max:
Determines the maximum number of file handles allowed on the system; the file handle setting represents the number of files that may be open on the Linux system.
fs.aio-max-nr:
Limits the number of concurrent outstanding asynchronous I/O requests; it should be set to avoid I/O subsystem failures.
kernel.sem:
Taking kernel.sem = 250 32000 100 128 as an example:
       250 is semmsl, the maximum number of semaphores in one semaphore set.
       32000 is semmns, the maximum number of semaphores allowed system-wide.
       100 is semopm, the maximum number of operations a single semop() call may perform on a semaphore set.
       128 is semmni, the total number of semaphore sets in the system.
net.ipv4.ip_local_port_range:
    The range of IPv4 ports available to applications.
net.core.rmem_default:
The default socket receive buffer size.
net.core.rmem_max:
The maximum socket receive buffer size.
net.core.wmem_default:
The default socket send buffer size.
net.core.wmem_max: the maximum socket send buffer size.
vm.nr_hugepages=8029                 the number of huge pages
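A minimal C sketch of the shmmax/shmall arithmetic described above, deriving kernel.shmall (in pages) from a desired shmmax (in bytes) and the page size reported by sysconf(); the 16 GB figure is just the example used in the text:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned long long shmmax = 16ULL * 1024 * 1024 * 1024;   /* 16 GB, as in the example */
    long page_size = sysconf(_SC_PAGESIZE);                   /* 4096 bytes on most systems */

    printf("page size     : %ld bytes\n", page_size);
    printf("kernel.shmall : %llu pages\n",
           shmmax / (unsigned long long)page_size);           /* 4194304 for 16 GB / 4 KB */
    return 0;
}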

6. Linux Kernel Optimization

Kernel evaluation method:

Add initcall_debug to the startup parameters to get more kernel logs:

[ 3.750000] calling ov2640_i2c_driver_init+0x0/0x10 @ 1
[ 3.760000] initcall ov2640_i2c_driver_init+0x0/0x10 returned 0 after 544 usecs
[ 3.760000] calling at91sam9x5_video_init+0x0/0x14 @ 1
[ 3.760000] at91sam9x5-video f0030340.lcdheo1: video device registered @ 0xe0d3e340, irq = 24
[ 3.770000] initcall at91sam9x5_video_init+0x0/0x14 returned 0 after 10388 usecs
[ 3.770000] calling gspca_init+0x0/0x18 @ 1
[ 3.770000] gspca_main: v2.14.0 registered
[ 3.770000] initcall gspca_init+0x0/0x18 returned 0 after 3966 usecs
...

Alternatively, you can use scripts/bootgraph.pl to convert the dmesg information into a picture:

$ scripts/bootgraph.pl boot.log > boot.svg

 Next, find the most time-consuming aspects and optimize them.

1. Optimizing compiler

ARM vs Thumb2

Compare systems and applications compiled on the ARM or Thumb2 instruction set.

ARM: 3.79 MB for rootfs, 227 KB for ffmpeg.

Thumb2: 3.10 MB (-18 %), 183 KB (-19 %).

Performance: Thumb2 gives a slight improvement (less than about 5 %).

Although the performance has improved, I personally still choose the ARM instruction set.

musl vs uClibc

There are 3 kinds of C libraries to choose from in Buildroot: glibc, musl , uClibc, here we only compare the latter 2 kinds of smaller libraries.

musl: 680 KB (stats/lib directory).

uClibc: 570 KB (-16 %).

uClibc saves 110 KB and we choose uClibc.

2. Optimize the application

We can choose the functional components of FFmpeg through ./configure.

In addition, you can use the strace and perf commands to analyze and optimize FFmpeg's internal decoding code.

The result after optimization:

File system: Shrink from 16.11 MB to 3.54 MB (-78 %).

Loading and running time of the program: 150 ms shorter.

Overall startup time: 350 ms shorter.

The optimization in space is large, but the optimization in startup time is small, because Linux only loads the necessary parts of the program when it runs it.

3. Optimize Init and root file system

Ideas:

Use bootchartd to analyze system startup and trim unnecessary services.

Combine the startup scripts under /etc/init.d/ into one.

/proc and /sys are not mounted.

Trim BusyBox; the smaller the filesystem, the faster the kernel may be able to mount it.

Replace the Init program with our application.

Compile the application statically.

Trim out infrequently used files and find files that are not accessed for a long time:

$ find / -atime -1000 -type f

The result after optimization:

Filesystem: reduced from 3.54 MB to 2.33 MB (-34 %) after trimming BusyBox.

Boot time: basically unchanged, presumably because the filesystem itself is small enough.

4. Use initramfs as rootfs

Under normal circumstances, the Linux system first mounts an initramfs, which is small and resides in memory; the initramfs is then responsible for mounting the real root filesystem.

Once the Buildroot rootfs has been made very small, we can consider using it directly as the initramfs.

What's the benefit of this?

The initramfs can be spliced together with the kernel: the bootloader loads kernel + initramfs into memory, and the kernel no longer needs to access the disk.

The kernel no longer needs block/storage and filesystem related functions, the size will become smaller, and the loading time and initialization time will be reduced.

Note that initramfs compression needs to be turned off (CONFIG_INITRAMFS_COMPRESSION_NONE).

The result after optimization:

Even with CONFIG_BLOCK and CONFIG_MMC disabled, the total boot time is still 20ms longer. This may be because after Kernel + initramfs are put together, the kernel becomes much larger, and the kernel image needs to be decompressed, which increases the decompression time.

5. Cut out tracing

Disable Tracers related features in Kernel hacking.

Startup time: 550ms shorter.

Kernel size: Shrink by 217KB.

6. Cut some hardware functions that are not needed

omap8250_platform_driver_init() // (660 ms)
cpsw_driver_init()  // (112 ms)
am335x_child_init() // (82 ms)
...

7.  Preset loops per jiffy

At each boot, the kernel calibrates the value of the delay loop for the udelay() function.

This measures the value of loops per jiffy (lpj). We only need to start the kernel once and look for the lpj value in the log:

Calibrating delay loop... 996.14 BogoMIPS (lpj=4980736)

Then add lpj=4980736 to the boot parameters, and you get:

Calibrating delay loop (skipped) preset value.. 996.14 BogoMIPS (lpj=4980736)

About 82 ms shorter.

8.  Disable CONFIG_SMP

SMP initialization is slow. It is usually enabled in the default configuration, even for a single-core CPU.

If our platform is single core, SMP can be disabled.

After disabling it, the kernel shrinks by 188 KB (-4.6 %) and the startup time decreases by 126 ms.

9.  Disable log

Adding quiet to the startup parameters shortens the startup time by 577 ms.

With CONFIG_PRINTK and CONFIG_BUG disabled, the kernel shrinks by 118 KB (-5.8 %).

After disabling CONFIG_KALLSYMS, the kernel shrinks by 107 KB (-5.7 %).

In total, the startup time is reduced by 767 ms.

10.  Enable CONFIG_EMBEDDED and CONFIG_EXPERT

This makes the system calls leaner and the kernel less generic, but it's enough to keep your application running.

The kernel shrinks by 51 KB.

Startup time is reduced by 34 ms.

11.  Select SLAB memory allocators

You normally choose one of the three: SLAB, SLOB or SLUB.

SLAB: The default choice, the most versatile, the most traditional, and the most reliable.

SLOB: simpler, with less code and a smaller memory footprint, suited to embedded systems; after enabling it the kernel shrinks by 5 KB, but the startup time increases by 1.43 s!

SLUB: It is more suitable for large systems. After enabling, the startup time increases by 2 ms.

Therefore, we still use SLAB.

12.  Kernel compression optimization

The characteristics of different compression methods are as follows:

Measured effect:

gzip and lzo appear to perform best; the measured effect will depend on the performance of the CPU and the storage.

13. Kernel compilation parameters

Enable CONFIG_CC_OPTIMIZE_FOR_SIZE; this option compiles with gcc -Os instead of gcc -O2.

Note that this is just a test result on BeagleBone Black + Linux 5.1, there are differences between different platforms. 

14. Disable pseudo filesystems such as /proc

Application compatibility has to be considered here.

ffmpeg relies on /proc, so only some proc-related options can be turned off: CONFIG_PROC_SYSCTL, CONFIG_PROC_PAGE_MONITOR and CONFIG_CONFIGFS_FS; the startup time did not change.

Shutting down sysfs reduces startup time by 35 ms.

15.  Splicing DTB

Enable CONFIG_ARM_APPENDED_DTB:

$ cat arch/arm/boot/zImage arch/arm/boot/dts/am335x-boneblack-lcd4.dtb > zImage
$ setenv bootcmd 'fatload mmc 0:1 81000000 zImage; bootz 81000000'

Startup time is reduced by 26 ms.

16. Optimize Bootloader

Here we use the most effective solution: U-Boot Falcon mode.

Falcon mode executes only the first stage of U-Boot (the SPL), skips stage 2, and loads the kernel directly.

Startup time is reduced by 250 ms.

Kernel optimization summary:

At this point, the startup optimization is basically completed, and the final effect is as follows:

[0.000000 0.000000]
[0.000785 0.000785] U-Boot SPL 2019.01 (Oct 27 2019 - 08:04:06 +0100)
[0.057822 0.057822] Trying to boot from MMC1
[0.378878 0.321056] fdt_root: FDT_ERR_BADMAGIC
[0.775306 0.396428] Waiting for /dev/video0 to be ready...
[1.966367 1.191061] Starting ffmpeg
...
[2.412284 0.004277] First frame decoded

From power-on to the LCD displaying the first frame of video, the total time is 2.41 seconds.

The most effective steps were as follows:

Areas still worth optimizing:

The system spent 1.2 seconds waiting for the USB camera to enumerate; is there a way to speed that up?

Is it possible to turn off tty and terminal login?

Finally, there are some principles to follow when optimizing startup time:

Please don't optimize prematurely.

Start with the optimizations that have the least impact.

Optimize top-down: from the rootfs, to the kernel, to the bootloader.

Origin blog.csdn.net/qq_35029061/article/details/126210028