＜Linux kernel learning＞File system

Environment: Linux 0.11 / Linux 3.4.2

Reference Book: Complete Analysis of the Linux Kernel Based on the 0.11 Kernel - Zhao Jiong

1. The part that uses the file system in Linux

1.1 About the management program of high-speed buffer in Linux

1.2 The underlying common functions of the file system (reading and writing of the hard disk, allocation and release, etc., management of the directory node inode, and mapping of memory and disk

1.3 Operation module for reading and writing file data (vfs: relationship between virtual file system hardware driver and file system)

1.4 Implementation of the interface between the file system and other programs (fopen fclose, etc.)

Second, the basic concept of the file system

For a hard disk device, several disks are usually divided, and each disk stores a different file system. For example, the figure below divides a hard disk into 4 partitions, including 4 different file systems. Among them, the master boot sector stores the disk boot program and partition table information (indicating the type of each partition on the hard disk).

The file system used by the Linux0.11 kernel is the MINIX file system, and its distribution is shown in the figure below:

① Boot block: used to boot the device, usually the BIOS automatically reads the running data when it is powered on. The content of the disk boot block is empty for non-boot devices.

② Super fast: It is equivalent to the descriptor of the file system, defined as follows:

struct super_block {
    unsigned short s_ninodes;
    unsigned short s_nzones;
    unsigned short s_imap_blocks;
    unsigned short s_zmap_blocks;
    unsigned short s_firstdatazone;
    unsigned short s_log_zone_size;
    unsigned long s_max_size;
    unsigned short s_magic;
/* These are only in memory */
    struct buffer_head * s_imap[8];
    struct buffer_head * s_zmap[8];
    unsigned short s_dev;
    struct m_inode * s_isup;
    struct m_inode * s_imount;
    unsigned long s_time;
    struct task_struct * s_wait;
    unsigned char s_lock;
    unsigned char s_rd_only;
    unsigned char s_dirt;
};

s_ninode表示当前块设备的inode节点数。每个inode代表一个文件。
s_nzones表示当前块设备上以逻辑块为大小 （1KB）的逻辑块数。
s_imap_blocks与s_zmap_blocks分别表示当前块设备的inode节点位图和逻辑块位图所占用的逻辑块数。
    逻辑块位图用于描述当前设备每个磁盘块的使用情况。如果为0，表示对应的磁盘块是空闲的，可以分配使用。当一个磁盘块被分配占用后，对应的逻辑块位图的比特位被置1
    根据超级块数据结构中定义，s_zmap是一个数组，它占用了8块磁盘块大小，每个块大小 是1024字节，因此总共可以管理8192*8个比特位，每个比特位分别对应一个数据磁盘块，总计这8个磁盘块大小可以管理655356数据磁盘块。
    inode节点位图是用来标记iNode节点的使用情况。与逻辑块位图 类似，当创建一个文件时候，我们分配一个iNode数据结构，并且些iNode实例对应的inode位图数组对应的位需要被置1。

s_firstdatazones表示当前块设备上数据区开始位置占用的第一个逻辑块的块号。
s_max_size表示当前块设备上，以字节为单位的最大文件 的长度。
s_magic表示文件系统的魔数，标示了文件系统的类型。

④ i node bitmap: similar to the logic block bitmap.

③ Logical block bitmap: Each bit corresponds to the use of the logical block. If the corresponding logical block is used, the position of the logical block bitmap is 1.

⑤ i node: It is the bridge between the directory and the disk, and the attribute description of the file.

⑥ Logical block: A logical unit used to store data.

For the i-node definition in the file /include/linux/fs.h, as follows:

struct m_inode {
    unsigned short i_mode;    //文件的类型和属性
    unsigned short i_uid;    //宿主用户id
    unsigned long i_size;    //文件的大小
    unsigned long i_mtime;    //文件的修改时间
    unsigned char i_gid;    //用户组id
    unsigned char i_nlinks;    //硬链接数
    unsigned short i_zone[9];    //表示文件和磁盘的映射关系
//i_zone[6]如果你的文件大小只只用了7个逻辑块大小以内,那么这个数组每一个单源存储了一个逻辑块号
//i_zone[7]一次间接块号，如果占用的逻辑块大小大于7，小于512+7则占用一次逻辑块号
//i_zone[8]二次间接块号, 如果占用的逻辑块大小大于512 + 7，小于512 * 512 + 7则启动二次逻辑块号

/* these are in memory also */
    struct task_struct * i_wait;
    unsigned long i_atime;
    unsigned long i_ctime;
    unsigned short i_dev;
    unsigned short i_num;
    unsigned short i_count;
    unsigned char i_lock;
    unsigned char i_dirt;
    unsigned char i_pipe;
    unsigned char i_mount;
    unsigned char i_seek;
    unsigned char i_update;
};

i_mode is a variable of type us, which is used to save the type and attributes of the file. The specific definition is as follows:

1. High-speed buffer (buffer.c)

高速缓冲区是文件系统访问块设备数据的桥梁也是必经之道。如果内核对块设备进行数据读写，每次读写的I/O操作的时间与cpu自身的处理速度相比是非常慢的。因此内核会在内存中开辟一块高速缓冲区，并将其分成一个个与磁盘数据块大小相同的缓冲块进行管理。

在高速缓冲区中会存放着最近使用过的块设备中的数据块。当内核需要读取块设备上的数据块的时候，会先在高速缓冲区进行查找，如果相应数据已经在缓冲中，就无需从块设备上读。如果数据不在缓冲区中，就会发出读块设备的命令，将数据读到高速缓冲区中。当需要把数据写到块设备中时，内核会在高速缓冲区中申请一块空闲缓冲块临时存放这些数据，当执行设备数据同步命令的时候，才会真正的写入到块设备之中。

高速缓冲区的结构如下图所示：

对于linux 0.11内核，每个缓冲块的大小都是1024字节，与磁盘上的逻辑块大小相同。缓冲头结构的定义在文件/include/linux/fs.h中

struct buffer_head {
    char * b_data;            //指向缓冲块的指针
    unsigned long b_blocknr;    //块号
    unsigned short b_dev;        //数据源的设备号(0==free)
    unsigned char b_uptodate;    //表示数据是否被更新
    unsigned char b_dirt;        //0-未修改，1-已修改
    unsigned char b_count;        //使用该数据块的用户数
    unsigned char b_lock;        //缓冲区是否被锁定
    struct task_struct * b_wait;    //如果缓冲区被锁定，有新的进程想访问该数据块就会加入此链表
    struct buffer_head * b_prev;    //hash队列前一块
    struct buffer_head * b_next;    //hash队列后一块
    struct buffer_head * b_prev_free;    //空闲表上前一块
    struct buffer_head * b_next_free;    //空闲表上后一块
};

缓冲区的维护依靠的是一个hash列表和一个free_list空闲列表。在buffer.c程序中定义了一个有307个buffer_head指针的hash数组。hash数组的hash函数是由（设备号^逻辑块号）% 307得到。

在buffer_head中定义的b_prev和b_next指针就是用于散列在同一个hash表项时用于双向链接的。而b_prev_free和b_next_free指针就是用于维护一个所有缓冲块的双向链表。

某一时刻的hash表和free_list状态如下图所示：

内核程序在使用高速缓冲区的缓冲块时，需要指定设备号(dev)和访问设备数据的逻辑块号(block)，然后调用函数bread()、bread_page()和breada()操作，这三个函数的本质都是调用缓冲区搜索函数getblk()来搜索缓冲块中最为空闲的缓冲块。

getblk()函数代码如下：

//用来判断缓冲区的修改和锁定标志
#define BADNESS(bh) (((bh)->b_dirt<<1)+(bh)->b_lock)

//传入参数dev和block
struct buffer_head * getblk(int dev,int block)
{
    struct buffer_head * tmp, * bh;

repeat:
    if (bh = get_hash_table(dev,block))    //搜索哈希表，如果对应缓冲块已经存在在哈希表中，直接返回该指针。
        return bh;
    //tmp指向空闲队列
    tmp = free_list;
    do {
        //如果缓冲区的b_count不等于0表示在使用，直接跳过
        //即使一个缓冲区的b_count=0也不一定是未使用过。
        //比如一个进程执行breada()函数读取几个块，在执行完ll_rw_block()命令后就会释放b_count，但此时读取的操作还在进行，因此b_count=0,b_lock=1.
        if (tmp->b_count)
            continue;
        //找到一个BADNESS值最小的缓冲区
        if (!bh || BADNESS(tmp)<BADNESS(bh)) {
            bh = tmp;
            if (!BADNESS(tmp))
                break;
        }
/* and repeat until we find something good */
    } while ((tmp = tmp->b_next_free) != free_list);
    //如果所有的缓冲区都被使用，则进入睡眠状态等待唤醒。
    if (!bh) {
        sleep_on(&buffer_wait);
        goto repeat;
    }
    //此时已经找到了一个BADNESS最小的缓冲块，等待该缓冲区解锁
    wait_on_buffer(bh);
    //解锁完后如果由被占用则重复
    if (bh->b_count)
        goto repeat;
    //如果缓冲区已经被修改，先进行数据同步。
    while (bh->b_dirt) {
        sync_dev(bh->b_dev);
        wait_on_buffer(bh);
        if (bh->b_count)
            goto repeat;
    }
/* NOTE!! While we slept waiting for this block, somebody else might */
/* already have added "this" block to the cache. check it */
    //在本进程睡眠的过程中，其他进程有可能吧该数据块加入到哈希表中，因此需要先寻找判断。
    if (find_buffer(dev,block))
        goto repeat;
/* OK, FINALLY we know that this buffer is the only one of it's kind, */
/* and that it's unused (b_count=0), unlocked (b_lock=0), and clean */
    //最后得到的缓冲块b_count=0,b_dirt=0,b_uptodate=0,
    //因此需要占用该缓冲块
    bh->b_count=1;
    bh->b_dirt=0;
    bh->b_uptodate=0;
    //从哈希列表和空闲队列移除，指定新的设备号和块号，再插入
    remove_from_queues(bh);
    bh->b_dev=dev;
    bh->b_blocknr=block;
    insert_into_queues(bh);
    return bh;
}

2、文件系统底层操作函数(super.c,bitmap.c,truncate.c,inode.c和namei.c)

bitmap.c

free_block(int dev, int block)

释放设备dev上数据区的逻辑块block，复位逻辑块block对应的逻辑块位图比特位

new_block(int dev)

向设备申请一个逻辑块，该函数会在获取该其超级块，并在超级块中的逻辑位图寻找第一个值为0的bit,然后在对应逻辑块位图置位该bit。为该逻辑块取得在高速缓冲区取得一个对应缓冲块，并更新标志，最后返回逻辑块号。

free_inode(struct m_inode* inode)

释放指定inode节点

new_inode(int dev)

为指定设备创建一个Inode节点

truncate.c

truncate(struct m_inode* inode)

将节点对应的文件长度截为0，主要调用free_ind和free_dind释放一次和二次间接块。

inode.c

wait_on_inode(struct m_inode * inode)

等待一个inode节点空闲

lock_inode(struct m_inode * inode)

锁定一个inode节点

unlock_inode(struct m_inode * inode)

解锁一个inode节点，并唤醒等待队列

invalidate_inodes(int dev)

扫描inode表数组，如果是指定设备使用的i节点就释放

sync_inodes(void)

把i节点表上的所有节点写入到高速缓冲区等待同步

bmap

文件数据块映射到盘块的处理操作，该函数把指定的文件数据块block对应到设备上逻辑块上，并返回逻辑块号。如果创建标志置位，则在设备上对应逻辑块不存在时就申请新磁盘块，返回文件数据块block对应在设备上的逻辑块号（盘块号）。

iput、iget

iput 释放一个inode节点。

主要是对i_count引用次数进行操作，把i节点引用数值减1并且若是管道i节点，则唤醒等待进程若是块设备文件i节点则刷新设备若i节点的链接计数为0，则释放该i节点占用的所有磁盘逻辑块，并释放该节点

iget 获得一个inode节点。

从设备上读取指定节点号的i节点到内存i节点表中，并返回该i节点指针首先在位于高速缓冲区的i节点表中寻找，若找到指定节点号的i节点则再经过一些判断处理后返回i节点指针否则通过设备号和指定i节点号，从设备中读取的i节点信息，放入在i节点表中申请的空闲节点中，并返回该i节点指针

read_inode write_inode

找到指定的设备，通过设备号找到他的超级块，超级块计算要读写的块号，调用bread将其读写入高速缓冲区中，读的话将高速缓冲区的b_data读到内存，释放高速缓冲区。写的话将数据写到高速缓冲区的b_data并设置dirt置位，等待系统sys_sync进行写盘，释放高速缓冲区。

super.c

此文件中主要存放这对文件系统超级块的管理函数。

get_super()函数在指定了设备号的情况下，返回对应的超级块的指针

put_super()函数用于释放超级块，在调用函数umount()函数的会调用此函数

read_super()函数用于把指定设备的文件系统超级块读入缓冲区，并登记到超级块数组，最后返回该超级块的指针

sys_umount()系统调用用于卸载一个设备文件名的文件系统

sys_mount()用于往一个目录上挂载一个文件系统

mount_root()用于挂载根文件系统

namei.c

实现了根据目录名或文件名寻找对应i节点的函数namei()，以及关于目录的建立和删除、目录项的建立和删除等操作和系统调用

3、文件数据进行读写的操作模块

本部分是文件系统的第三个部分，主要包含对块设备，字符设备，管道文件，普通文件的读写函数（用作给系统调用提供读写接口）以及系统调用sys_read()和sys_write()。关系如下图所示：

block_dev.c

block_write(int dev, long *pos, char *buf, int count)

int block_write(int dev, long * pos, char * buf, int count)
{
    int block = *pos >> BLOCK_SIZE_BITS;
    int offset = *pos & (BLOCK_SIZE-1);
    int chars;
    int written = 0;
    struct buffer_head * bh;
    register char * p;

    while (count>0) {
        chars = BLOCK_SIZE - offset;
        if (chars > count)
            chars=count;
        if (chars == BLOCK_SIZE)
            bh = getblk(dev,block);
        else
            bh = breada(dev,block,block+1,block+2,-1);
        block++;
        if (!bh)
            return written?written:-EIO;
        p = offset + bh->b_data;
        offset = 0;
        *pos += chars;
        written += chars;
        count -= chars;
        while (chars-->0)
            *(p++) = get_fs_byte(buf++);
        bh->b_dirt = 1;
        brelse(bh);
    }
    return written;
}

数据块写函数，向指定设备dev的偏移量pos处，从buf缓冲区开始写入count字节的数据。返回成功写入的字节数。

写数据流程：

① 根据偏移量pos会计算出开始写的盘号block，和在第一个数据块中的偏移量offset。

② 针对要写入的字节数count开始循环进行写操作。

2.1首先计算当前数据块剩余可写入数据的大小chars。如果剩余大小chars大于需要写入的数据count，则chars = count。

2.2 如果剩余可写入数据大小为一块数据的内容，则直接申请1块高速缓冲块。

2.3 否则就需要读入将被写入的数据的数据块，并预读取下两块数据，然后将块号递增1，为循环写入做准备。

block_read(int dev, unsigned long *pos, char *buf, int count)

数据块读函数，从指定设备dev的pos处读取数据，和write函数流程类似。

file_dev.c

提供普通文件的读写函数，供系统调用read()和write()使用

file_read(struct m_inode * inode, struct file * filp, char * buf, int count)

int file_read(struct m_inode * inode, struct file * filp, char * buf, int count)
{
    int left,chars,nr;
    struct buffer_head * bh;

    if ((left=count)<=0)
        return 0;
    while (left) {
        if (nr = bmap(inode,(filp->f_pos)/BLOCK_SIZE)) {
            if (!(bh=bread(inode->i_dev,nr)))
                break;
        } else
            bh = NULL;
        nr = filp->f_pos % BLOCK_SIZE;
        chars = MIN( BLOCK_SIZE-nr , left );
        filp->f_pos += chars;
        left -= chars;
        if (bh) {
            char * p = nr + bh->b_data;
            while (chars-->0)
                put_fs_byte(*(p++),buf++);
            brelse(bh);
        } else {
            while (chars-->0)
                put_fs_byte(0,buf++);
        }
    }
    inode->i_atime = CURRENT_TIME;
    return (count-left)?(count-left):-ERROR;
}

file文件读数据流程：

① 若要读取的字节数count <= 0直接返回，否则用left保存要读取的字节数开始循环读取。

② 利用bmap()函数获取该文件的读写指针所在位置的逻辑块号nr，若nr!=0则读取该逻辑块，若nr=0表示数据块不存在，缓冲块bh指向空。

③ 计算读写指针在该逻辑块中的偏移量nr，将该数据块剩余的数据BLOCK_SIZE-nr和剩余需要读取字节数left进行比较，如果BLOCK_SIZE-nr > left表示该快是最后一个数据块，反之还需要读取下一个数据块，调整文件指针后移。

file_write(struct m_inode * inode, struct file * filp, char * buf, int count)

和读函数类似

pipe.c

管道文件是多进程通信进行数据交互的一种基本方式，在本文件中提供了read_pipe()，write_pipe()同时实现了系统调用sys_pipe()。

在创建管道的时候，程序会专门申请一个管道i节点，并为管道分配一页缓冲区（4kb），i节点的i_size字段保存着管道缓冲区地址,管道数据头指针存放在i_zone[0]字段中，管道数据尾指针存放在i_zone[1]指针中，如下图所示：

char_dev.c

该文件包含字符设备的访问函数。主要包含rw_ttyx()串口中端设备读写函数，rw_tty()控制台终端读写函数，rw_memory()内存设备读写函数。以及rw_char()字符设备读写接口函数。