Open file source code analysis and debugging

1. Implementation of open() in Linux kernel

Kernel version 5.4

(1) Preparation work

1. Data structures

Dentry structure: The directory entry (dentry) cache links a file path to its inode. It speeds up file lookups in the VFS.

vfsmount structure: Each file system mounted in the kernel directory tree has a corresponding vfsmount structure.

nameidata structure: A file path consists of directory entries at successive levels, so path lookup proceeds one directory entry at a time. The nameidata structure is the core data structure of this process. At each level it supplies the input to the lookup function and stores the result of that lookup, so its contents change continuously during the walk.

2. Basic principles

rcu mechanism:

RCU (Read-Copy-Update) is a synchronization mechanism in the Linux kernel. It can be seen as an improvement on rwlock for read-mostly data, although it cannot replace rwlock in every case. It suits situations with many readers and few writers, and it lets readers and writers operate at the same time.

**For readers, the RCU mechanism allows multiple readers to access the shared resource directly, without taking a lock. A writer can work at the same time as the readers because, while the readers read the original data, the writer modifies a copy of it.** The writer then publishes the modified copy in place of the original, so new readers see the new data. Once all readers that were still using the original have left their critical sections, the reclamation mechanism in RCU frees the original data.

Compared with rwlock, RCU is much more efficient when reads far outnumber writes: readers and writers can access the shared resource at the same time, and readers pay no locking overhead.

Because of how RCU works, a read-side critical section must not sleep. The writer waits for all pre-existing readers to leave their critical sections before the original data is reclaimed; a reader that blocks inside its critical section would either stall reclamation or let the original data be freed while still in use, depending on the RCU flavor.

Supplement: read-write lock (rwlock): writes are exclusive, reads are shared; the write lock has higher priority.

rcu-walk and ref-walk:

Path lookup in the kernel provides two modes: ref-walk and rcu-walk. The former is the traditional path lookup method in the kernel, while rcu-walk is a lookup mode based on the RCU mechanism. Since path lookup is a read-mostly workload, rcu-walk can perform it efficiently. However, rcu-walk is not a panacea: if the lookup needs to sleep, the mode must be switched from rcu-walk to ref-walk.

Supplement: ref-walk is simpler to reason about, but handling each directory entry along the path may require taking locks and references, and may sleep.

(2) Basic implementation

The source code of function do_sys_open() is as follows:

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
	struct open_flags op;
	int fd = build_open_flags(flags, mode, &op);	// build_open_flags() converts the user-space flags and mode into the corresponding kernel flags
	struct filename *tmp;

	if (fd)
		return fd;

	tmp = getname(filename);	// filename points to a user-space buffer (note the __user annotation), so getname() copies the file name into kernel space
	if (IS_ERR(tmp))
		return PTR_ERR(tmp);

	fd = get_unused_fd_flags(flags);	// allocate a file descriptor for the file about to be opened, i.e. find an unused slot in the current process's files array
	if (fd >= 0) {
		struct file *f = do_filp_open(dfd, tmp, &op);	// do_filp_open() creates the file structure for the file
		if (IS_ERR(f)) {		// on failure, put_unused_fd() returns the allocated fd to the system and fd is set to the error code carried by f; on success, fd_install() associates fd with the file
			put_unused_fd(fd);
			fd = PTR_ERR(f);
		} else {
			fsnotify_open(f);
			fd_install(fd, f);
		}
	}
	putname(tmp);		// release the path buffer allocated in the kernel
	return fd;
}

(3) Path search

The idea behind the kernel's file open operation is simple: the path passed from user space is looked up component by component; if the file exists, the kernel creates a file structure for it, associates that structure with the process's files array, and finally returns the array index to user space as the file descriptor.

The source code of function do_filp_open() is as follows:

/* do_filp_open() parses the file path and creates a new file structure */
struct file *do_filp_open(int dfd, struct filename *pathname,
		const struct open_flags *op)
{
	struct nameidata nd;	// nd is the working variable of the whole path lookup: it supplies the input for the current step and stores the result of that step
	int flags = op->lookup_flags;
	struct file *filp;

	set_nameidata(&nd, dfd, pathname);
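	// path_openat() may be called up to three times. For efficiency the kernel first tries the open in RCU mode (rcu-walk);
	// if that attempt fails with -ECHILD, it retries in normal mode (ref-walk).
	// The third call (after -ESTALE) is rarely needed; so far it is only likely to be used on NFS.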
	filp = path_openat(&nd, op, flags | LOOKUP_RCU);
	if (unlikely(filp == ERR_PTR(-ECHILD)))
		filp = path_openat(&nd, op, flags);
	if (unlikely(filp == ERR_PTR(-ESTALE)))
		filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
	restore_nameidata();
	return filp;
}

The source code of function path_openat() is as follows:

/* path_openat() lays out the basic steps of the whole path lookup */
static struct file *path_openat(struct nameidata *nd,
			const struct open_flags *op, unsigned flags)
{
	struct file *file;
	int error;

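	file = alloc_empty_file(op->open_flag, current_cred());		// alloc_empty_file() allocates a new file structure; before allocating, it checks the current process's privileges and the limit on the number of open files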
	if (IS_ERR(file))
		return file;

	if (unlikely(file->f_flags & __O_TMPFILE)) {
		error = do_tmpfile(nd, flags, op, file);
	} else if (unlikely(file->f_flags & O_PATH)) {
		error = do_o_path(nd, flags, file);
	} else {
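		// path_init() prepares the path walk: it decides where the walk starts, i.e. the root directory /, the current working directory (pwd), or a specified directory
		const char *s = path_init(nd, flags);
		// link_path_walk() resolves the path one component at a time and stores each result in nd;
		// based on the last component, do_last() then fills in the file structure
		while (!(error = link_path_walk(s, nd)) &&
			(error = do_last(nd, file, op)) > 0) {
			nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
			s = trailing_symlink(nd);		// the last component is a symbolic link: get its target and walk again
		}
		terminate_walk(nd);		// finish the walk, including dropping the RCU read lock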
	}
	if (likely(!error)) {
		if (likely(file->f_mode & FMODE_OPENED))
			return file;
		WARN_ON(1);
		error = -EINVAL;
	}
	fput(file);		// error path: drop the reference to the file structure
	if (error == -EOPENSTALE) {
		if (flags & LOOKUP_RCU)
			error = -ECHILD;
		else
			error = -ESTALE;
	}
	return ERR_PTR(error);
}

The source code of function path_init() is as follows:

/* path_init() sets the starting position of the path lookup, mainly by initializing nd */
static const char *path_init(struct nameidata *nd, unsigned flags)
{
	const char *s = nd->name->name;

	if (!*s)
		flags &= ~LOOKUP_RCU;
	if (flags & LOOKUP_RCU)
		rcu_read_lock();

	nd->last_type = LAST_ROOT; /* if there are only slashes... */
	nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
	nd->depth = 0;
	if (flags & LOOKUP_ROOT) {		// LOOKUP_ROOT is set when the caller, such as open_by_handle_at(), specifies a path to be used as the root
		struct dentry *root = nd->root.dentry;
		struct inode *inode = root->d_inode;
		if (*s && unlikely(!d_can_lookup(root)))
			return ERR_PTR(-ENOTDIR);
		nd->path = nd->root;
		nd->inode = inode;
		if (flags & LOOKUP_RCU) {
			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
			nd->root_seq = nd->seq;
			nd->m_seq = read_seqbegin(&mount_lock);
		} else {
			path_get(&nd->path);
		}
		return s;
	}

	nd->root.mnt = NULL;
	nd->path.mnt = NULL;
	nd->path.dentry = NULL;

	nd->m_seq = read_seqbegin(&mount_lock);
	if (*s == '/') {		// a name starting with / is an absolute path, so nd is set up with set_root(); otherwise the name is a relative path
		set_root(nd);
		if (likely(!nd_jump_root(nd)))
			return s;
		return ERR_PTR(-ECHILD);
	} else if (nd->dfd == AT_FDCWD) {		// dfd is AT_FDCWD: the relative path starts at the current working directory, so nd is set up from pwd
		if (flags & LOOKUP_RCU) {
			struct fs_struct *fs = current->fs;
			unsigned seq;

			do {
				seq = read_seqcount_begin(&fs->seq);
				nd->path = fs->pwd;
				nd->inode = nd->path.dentry->d_inode;
				nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
			} while (read_seqcount_retry(&fs->seq, seq));
		} else {
			get_fs_pwd(current->fs, &nd->path);
			nd->inode = nd->path.dentry->d_inode;
		}
		return s;
	} else {	// dfd is not AT_FDCWD: the relative path starts at a directory chosen by the user, so nd is set up from the file that dfd refers to
		struct fd f = fdget_raw(nd->dfd);
		struct dentry *dentry;

		if (!f.file)
			return ERR_PTR(-EBADF);

		dentry = f.file->f_path.dentry;

		if (*s && unlikely(!d_can_lookup(dentry))) {
			fdput(f);
			return ERR_PTR(-ENOTDIR);
		}

		nd->path = f.file->f_path;
		if (flags & LOOKUP_RCU) {
			nd->inode = nd->path.dentry->d_inode;
			nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
		} else {
			path_get(&nd->path);
			nd->inode = nd->path.dentry->d_inode;
		}
		fdput(f);
		return s;
	}
}

The source code of function link_path_walk() is as follows:

/* link_path_walk() walks the directory entries one level at a time */
static int link_path_walk(const char *name, struct nameidata *nd)
{
	int err;

	if (IS_ERR(name))
		return PTR_ERR(name);
	// Before entering the loop, skip the leading slashes of the path (an absolute path may start with several).
	while (*name=='/')
		name++;
	if (!*name)
		return 0;

	/* At this point we know we have a real path component. */
	for(;;) {
		u64 hash_len;
		int type;

		err = may_lookup(nd);
		if (err)
			return err;

		hash_len = hash_name(nd->path.dentry, name);

		type = LAST_NORM;
		if (name[0] == '.') switch (hashlen_len(hash_len)) {	// "." gives type LAST_DOT, ".." gives LAST_DOTDOT; anything else keeps the default LAST_NORM
			case 2:
				if (name[1] == '.') {
					type = LAST_DOTDOT;
					nd->flags |= LOOKUP_JUMPED;
				}
				break;
			case 1:
				type = LAST_DOT;
		}
		if (likely(type == LAST_NORM)) {
			struct dentry *parent = nd->path.dentry;
			nd->flags &= ~LOOKUP_JUMPED;
			if (unlikely(parent->d_flags & DCACHE_OP_HASH)) {
				struct qstr this = { { .hash_len = hash_len }, .name = name };	// this is a qstr holding the name and hash of the current path component; type records what kind of component it is
				err = parent->d_op->d_hash(parent, &this);
				if (err < 0)
					return err;
				hash_len = this.hash_len;
				name = this.name;
			}
		}

		nd->last.hash_len = hash_len;
		nd->last.name = name;
		nd->last_type = type;

		name += hashlen_len(hash_len);
		if (!*name)
			goto OK;
		// If the component is followed by several separators (for example /home///edsionte), skip the extra ones so that name ends up just past the last /.
		do {
			name++;
		} while (unlikely(*name == '/'));
		if (unlikely(!*name)) {
OK:
			/* pathname body, done */
			if (!nd->depth)
				return 0;
			name = nd->stack[nd->depth - 1].name;
			/* trailing symlink, done */
			if (!name)
				return 0;
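			// walk_component() processes the current component and updates nd; if the component is a symbolic link, it returns a positive value so that the link is followed below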
			/* last component of nested symlink */
			err = walk_component(nd, WALK_FOLLOW);
		} else {
			/* not the last component */
			err = walk_component(nd, WALK_FOLLOW | WALK_MORE);
		}
		if (err < 0)
			return err;

		if (err) {
			const char *s = get_link(nd);

			if (IS_ERR(s))
				return PTR_ERR(s);
			err = 0;
			if (unlikely(!s)) {
				/* jumped */
				put_link(nd);
			} else {
				nd->stack[nd->depth - 1].name = name;
				name = s;
				continue;
			}
		}
		if (unlikely(!d_can_lookup(nd->path.dentry))) {
			if (nd->flags & LOOKUP_RCU) {
				if (unlazy_walk(nd))
					return -ECHILD;
			}
			return -ENOTDIR;
		}
	}
}
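// The loop above walks the user-supplied path from start to end. At this point nd holds the information for the last component, but the kernel has not yet determined whether that component really exists; this is done in do_last().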

The source code of function walk_component() is as follows:

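// walk_component() processes the current component and updates nd; if the component is a symbolic link, it returns a positive value so that the caller follows the link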
static int walk_component(struct nameidata *nd, int flags)
{
	struct path path;
	struct inode *inode;	// inode holds the inode of the current component
	unsigned seq;
	int err;
	/*
	 * "." and ".." are special - ".." especially so because it has
	 * to be able to know about the current root directory and
	 * parent relationships.
	 */
	if (unlikely(nd->last_type != LAST_NORM)) {		// for LAST_DOT and LAST_DOTDOT, handle_dots() walks the current component
		err = handle_dots(nd, nd->last_type);
		if (!(flags & WALK_MORE) && nd->depth)
			put_link(nd);
		return err;
	}
	err = lookup_fast(nd, &path, &inode, &seq);		// lookup_fast() queries the dentry cache; on a miss, lookup_slow() descends to the file system layer to do the lookup
	if (unlikely(err <= 0)) {
		if (err < 0)
			return err;
		path.dentry = lookup_slow(&nd->last, nd->path.dentry,
					  nd->flags);
		if (IS_ERR(path.dentry))
			return PTR_ERR(path.dentry);

		path.mnt = nd->path.mnt;
		err = follow_managed(&path, nd);	// follow_managed() checks whether the dentry is a mount point and, if so, follows the mount
		if (unlikely(err < 0))
			return err;

		if (unlikely(d_is_negative(path.dentry))) {
			path_to_nameidata(&path, nd);	// negative dentry (no such file): nd is still updated through path_to_nameidata(), then -ENOENT is returned
			return -ENOENT;
		}

		seq = 0;	/* we are already out of RCU mode */
		inode = d_backing_inode(path.dentry);
	}

	return step_into(nd, &path, flags, inode, seq);
}

(4) Treatment of "." and ".."

When the directory entry is "." (LAST_DOT) or ".." (LAST_DOTDOT), walk_component() processes it through handle_dots().

If the current directory entry is ".", walk_component() simply steps over it and nd does not change, because the ordinary directory entries before "." have already updated nd.

If the current directory entry is "..", the entry to be walked is the parent of the directory reached in the previous step, so the kernel has to move up to the parent of the current directory.

The source code of function handle_dots() is as follows:

static inline int handle_dots(struct nameidata *nd, int type)
{
	if (type == LAST_DOTDOT) {
		if (!nd->root.mnt)
			set_root(nd);
		if (nd->flags & LOOKUP_RCU) {		// in rcu-walk mode take the follow_dotdot_rcu() path; otherwise take the follow_dotdot() path
			return follow_dotdot_rcu(nd);
		} else
			return follow_dotdot(nd);
	}
	return 0;
}

The source code of function follow_dotdot_rcu() is as follows:

// follow_dotdot_rcu() obtains the parent directory entry in rcu-walk mode. It returns 0 on success; -ECHILD means the lookup has to switch to ref-walk mode.
static int follow_dotdot_rcu(struct nameidata *nd)
{
	struct inode *inode = nd->inode;
	
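	// Loop to move up to the parent of the current directory entry. Normally the body runs once and exits; it only keeps looping when mount boundaries have to be crossed.
	while (1) {
		if (path_equal(&nd->path, &nd->root))		// the current entry is already the root directory: leave the loop
			break;
		if (nd->path.dentry != nd->path.mnt->mnt_root) {	// neither the root directory nor the root of a mounted file system: the common case, just take the parent of the current entry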
			struct dentry *old = nd->path.dentry;
			struct dentry *parent = old->d_parent;
			unsigned seq;

			inode = parent->d_inode;
			seq = read_seqcount_begin(&parent->d_seq);
			if (unlikely(read_seqcount_retry(&old->d_seq, nd->seq)))
				return -ECHILD;
			nd->path.dentry = parent;
			nd->seq = seq;
			if (unlikely(!path_connected(&nd->path)))
				return -ENOENT;
			break;
		} else {		
			struct mount *mnt = real_mount(nd->path.mnt);
			struct mount *mparent = mnt->mnt_parent;
			struct dentry *mountpoint = mnt->mnt_mountpoint;
			struct inode *inode2 = mountpoint->d_inode;
			unsigned seq = read_seqcount_begin(&mountpoint->d_seq);
			if (unlikely(read_seqretry(&mount_lock, nd->m_seq)))
				return -ECHILD;
			if (&mparent->mnt == nd->path.mnt)
				break;
			/* we know that mountpoint was pinned */
			nd->path.dentry = mountpoint;
			nd->path.mnt = &mparent->mnt;
			inode = inode2;
			nd->seq = seq;
		}
	}
	
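	// If the parent entry is a mount point, extra checks are needed: another file system may be mounted on it, and going up must then yield the content of the most recently mounted file system rather than the one underneath.
	while (unlikely(d_mountpoint(nd->path.dentry))) {
		struct mount *mounted;
		mounted = __lookup_mnt(nd->path.mnt, nd->path.dentry);		// __lookup_mnt() checks whether a file system is mounted on this entry; if not, the check ends, otherwise step into it and check again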
		if (unlikely(read_seqretry(&mount_lock, nd->m_seq)))
			return -ECHILD;
		if (!mounted)
			break;
		nd->path.mnt = &mounted->mnt;
		nd->path.dentry = mounted->mnt.mnt_root;
		inode = nd->path.dentry->d_inode;
		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
	}
	nd->inode = inode;		// update the inode in nd
	return 0;
}

(5) Processing of ordinary directory entries

The 5.4 kernel does not use do_lookup() to look up ordinary directory entries. Instead, it uses two functions, lookup_fast() and lookup_slow(). Each time, the kernel first tries lookup_fast(), which queries the dentry cache; if there is no hit, it falls back to lookup_slow(), which descends to the file system layer to perform the lookup.

static int lookup_fast(struct nameidata *nd,
		       struct path *path, struct inode **inode,
		       unsigned *seqp)
{
	......
	
	if (unlazy_walk(nd))
		return -ECHILD;
	......
}

If the current directory entry is a symbolic link, the kernel handles it differently: it calls unlazy_walk() to leave the RCU lookup mode.

(6) Processing of symbolic link directory entries

If the current directory entry is a symbolic link, the kernel may attempt to switch from rcu-walk to ref-walk through unlazy_walk(). Processing a symbolic link calls a hook function of the underlying file system (get_link() in the 5.4 kernel; older kernels called it follow_link()). This function may block, so the walk must switch to ref-walk mode, which allows blocking.

If the switch fails, -ECHILD is returned; back in do_filp_open(), the open operation is then retried in ref-walk mode. Otherwise, walk_component() returns 1 and its caller, link_path_walk(), goes on to process the symbolic link (the content of a link file is itself a path).

Symbolic links are handled in a somewhat special way, but the implementation still rests on link_path_walk(), which is the backbone of the whole path lookup framework. After link_path_walk() has processed a path, nd holds the information for the last directory entry, but it is not yet known whether the file (or directory) named by that entry exists. The caller of link_path_walk(), path_openat(), performs the final processing.

(7) Open operation analysis

**do_last():** In path_openat(), after link_path_walk() has walked the path to be opened, do_last() is entered to process the last directory entry in the path. The last directory entry may be of various types, such as "." or "..", a symbolic link, or "/".

(8) Mind map

do_filp_open(): parses the file path and creates a new file structure;

path_openat(): lays out the basic steps of the entire path lookup process;

path_init(): sets the starting position of the path lookup, mainly by initializing the nd variable;
link_path_walk(): walks the directory entries one level at a time;
do_last(): processes the last directory entry in the path;

walk_component(): processes the current directory entry and updates nd; if the entry is a symbolic link, it returns a positive value so that the caller follows the link;

handle_dots(): handles the "." (LAST_DOT) and ".." (LAST_DOTDOT) directory entries;
lookup_fast(): looks up ordinary directory entries in the dentry cache;
lookup_slow(): looks up ordinary directory entries through the file system;

follow_dotdot_rcu(): obtains the parent directory entry in rcu-walk mode; returns 0 on success, while -ECHILD means the lookup has to switch to ref-walk mode;
follow_dotdot(): obtains the parent directory entry in ref-walk mode;
unlazy_walk(): switches the current walk from rcu-walk to ref-walk;

get_link() (follow_link() in older kernels): hook function of the underlying file system; it may block;

2. Debugging the process of opening files

Add a printk statement to the source code to print out the relevant data structure field values:

Open file mode: The Linux system uses the form 0ABC to represent the file operation permissions:

0 represents octal;
A represents the permissions of the file owner;
B represents the permissions of the group user;
C represents the permissions of other users;

Among them, A, B, and C are all numbers from 0 to 7.

The meanings of each number from 0 to 7 are as follows (r: Read, w: Write, x: eXecute):

---     0   no read, no write, no execute
--x     1   execute only
-w-     2   write only
-wx     3   write and execute, no read
r--     4   read only
r-x     5   read and execute, no write
rw-     6   read and write, no execute
rwx     7   read, write and execute

The file.c code is as follows:

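#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/err.h>		// IS_ERR(), PTR_ERR()
#include <linux/sched.h>   		// task_struct, current
#include <linux/syscalls.h>
#include <linux/fs_struct.h>
#include <linux/fdtable.h>
#include <linux/fs.h>

MODULE_LICENSE("GPL");  // license

static int __init init_file(void)
{
	struct file *fp;

	// 0644: the owner may read and write; the group and other users may only read.
	// The mode is only used when O_CREAT is given, so it has no effect here.
	fp = filp_open("/home/zhang/code/file/test.txt", O_RDWR, 0644);
	if (IS_ERR(fp)) {	// filp_open() returns an ERR_PTR on failure, never NULL
		printk(KERN_ERR "filp_open failed: %ld\n", PTR_ERR(fp));
		return PTR_ERR(fp);
	}
	// files->count is an atomic_t and f_version is a u64, so they are read and printed accordingly
	printk(KERN_INFO "users:%d  umask:%d  count:%d  f_flags:%u  f_mode:%u  f_version:%llu  i_mode:%u\n",
	       current->fs->users, current->fs->umask,
	       atomic_read(&current->files->count),
	       fp->f_flags, (unsigned int)fp->f_mode,
	       (unsigned long long)fp->f_version,
	       (unsigned int)fp->f_inode->i_mode);
	filp_close(fp, NULL);	// drop the reference taken by filp_open()

	return 0;
}

static void __exit exit_file(void)    // exit function
{
	printk(KERN_INFO "Exiting...\n");
}

// Declare the entry and exit points; module_init()/module_exit() are provided by module.h
module_init(init_file);
module_exit(exit_file);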

Operation result:

(Figure: printk output of the module)

Origin blog.csdn.net/qq_58538265/article/details/133915819