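Mutual Exclusion in the Linux Kernel: Mutexes

Mutexes

Besides semaphores, the Linux kernel has a similar primitive called the mutex.

A semaphore whose count member is initialized to 1 can, through its down and up operations, behave much like a mutex. Why, then, does the kernel implement a separate mutex mechanism? The semantics of a mutex are simpler and lighter than those of a semaphore, and in tests with heavy lock contention a mutex runs faster and scales better than a semaphore.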

1. Mutex definition

struct mutex {
	/* 1: unlocked, 0: locked, negative: locked, possible waiters */
	atomic_t		count;
	spinlock_t		wait_lock;
	struct list_head	wait_list;
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_MUTEX_SPIN_ON_OWNER)
	struct task_struct	*owner;
#endif
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
	struct optimistic_spin_queue osq; /* Spinner MCS lock */
#endif

};

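count: an atomic counter. 1 means the lock is free, 0 means the lock is held, and a negative value means the lock is held and there may be waiters on the wait queue.
wait_lock: a spinlock that protects the wait_list sleep queue.
wait_list: the list of all tasks sleeping on this mutex. A task that fails to acquire the lock sleeps on this list.
owner: present only when CONFIG_DEBUG_MUTEXES or CONFIG_MUTEX_SPIN_ON_OWNER is enabled. It points to the task_struct of the lock holder.
osq: used to implement the MCS lock mechanism.

The mutex implements optimistic spinning, and it did so earlier than the read-write semaphore. The core idea is this: when the lock holder is found to be running inside the critical section and no higher-priority task needs to be scheduled, the current task assumes that the holder will leave the critical section and release the lock soon. Rather than going to sleep, it spins optimistically, which saves the cost of a sleep and a wakeup. To support this, the kernel implements an MCS mechanism that ensures only one task at a time spins waiting for the holder to release the lock.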

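2. MCS lock mechanism

The MCS lock is an optimized form of spinlock. Since Linux 2.6.25, spinlocks have used a queued (ticket-based) spinning algorithm to fix the unfairness of earlier spinlocks. On NUMA systems, however, such queued spinlocks still have a serious problem. Under heavy contention, every waiting thread spins on the same shared variable, and both acquire and release modify that variable. Cache coherence then invalidates the corresponding cache line on every spinning CPU, which causes severe cache-line bouncing.

The MCS algorithm solves this problem. Its core idea is that each lock requester spins only on a variable local to its own CPU, not on a global variable. The OSQ lock is one concrete implementation of the MCS mechanism.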
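The idea of spinning on a local variable can be illustrated with a minimal user-space sketch of a classic MCS lock, written with C11 atomics. This is not the kernel's OSQ code: it has no unqueue path, and every name in it (mcs_node, mcs_lock, mcs_enqueue, mcs_acquire, mcs_release) is made up for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One queue node per waiter; each waiter polls only its own 'locked' flag. */
struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_int locked;	/* set to 1 when the lock is handed to us */
};

/* The lock itself is just a pointer to the last node in the queue. */
struct mcs_lock {
	_Atomic(struct mcs_node *) tail;
};

/*
 * Append 'node' to the queue. Returns true if the queue was empty, in
 * which case the caller owns the lock immediately.
 */
bool mcs_enqueue(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	assert(lock != NULL && node != NULL);
	atomic_store(&node->next, NULL);
	atomic_store(&node->locked, 0);

	prev = atomic_exchange(&lock->tail, node);
	if (prev == NULL)
		return true;

	/* Link behind the previous tail; it will hand the lock to us. */
	atomic_store(&prev->next, node);
	return false;
}

void mcs_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
	if (mcs_enqueue(lock, node))
		return;
	/* Spin on our own node only, not on the shared lock word. */
	while (!atomic_load_explicit(&node->locked, memory_order_acquire))
		;
}

void mcs_release(struct mcs_lock *lock, struct mcs_node *node)
{
	struct mcs_node *next = atomic_load(&node->next);

	if (next == NULL) {
		struct mcs_node *expected = node;

		/* No successor visible: try to mark the queue empty. */
		if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
			return;
		/* A successor swapped the tail but has not linked itself yet. */
		while ((next = atomic_load(&node->next)) == NULL)
			;
	}
	/* Hand the lock over by writing to the successor's own flag. */
	atomic_store_explicit(&next->locked, 1, memory_order_release);
}
```

The only shared word that every contender touches is tail, and each touches it once, when it enqueues. All subsequent polling happens on the waiter's own node, which is what keeps the cache line from bouncing between CPUs.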

An MCS lock is essentially a spinlock built on a linked list. The OSQ lock implementation needs two data structures:

struct optimistic_spin_node {
	struct optimistic_spin_node *next, *prev;
	int locked; /* 1 if lock acquired */
	int cpu; /* encoded CPU # + 1 value */
};

struct optimistic_spin_queue {
	/*
	 * Stores an encoded value of the CPU # of the tail node in the queue.
	 * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
	 */
	atomic_t tail;
};

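Each MCS lock has one optimistic_spin_queue structure, which has a single member, tail, initialized to 0 (OSQ_UNLOCKED_VAL). struct optimistic_spin_node represents the node on the local CPU. Nodes form a doubly linked list through their next and prev pointers, locked records the lock state, and cpu holds the encoded CPU number, which identifies the CPU the node belongs to. optimistic_spin_node is defined as a per-CPU variable, so every CPU has exactly one node.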

static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
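The encoding of CPU numbers can be sketched in user space as follows. The sketch assumes the scheme hinted at by the struct comment ("encoded CPU # + 1 value"): CPU numbers start at 0, so 1 is added to keep 0 free as the "unlocked" value. The array demo_nodes stands in for the per-CPU osq_node variable, and all names ending in _demo or _DEMO are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_DEMO      4	/* hypothetical CPU count for this sketch */
#define OSQ_UNLOCKED_VAL  0	/* tail value meaning "queue is empty" */

struct spin_node_demo {
	struct spin_node_demo *next, *prev;
	int locked;
	int cpu;		/* encoded CPU number (CPU number + 1) */
};

/* Stand-in for the kernel's per-CPU osq_node variable. */
struct spin_node_demo demo_nodes[NR_CPUS_DEMO];

/* CPU numbers start at 0, so add 1 to keep 0 free for "unlocked". */
int encode_cpu_demo(int cpu_nr)
{
	assert(cpu_nr >= 0 && cpu_nr < NR_CPUS_DEMO);
	return cpu_nr + 1;
}

/* Map an encoded value back to that CPU's node. */
struct spin_node_demo *decode_cpu_demo(int encoded)
{
	assert(encoded > OSQ_UNLOCKED_VAL && encoded <= NR_CPUS_DEMO);
	return &demo_nodes[encoded - 1];
}
```

Storing a small integer in tail instead of a pointer lets the queue head fit in a single atomic_t, while the per-CPU array makes the mapping back to a node a simple index lookup.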

The MCS lock is initialized in osq_lock_init. Mutex initialization, for example, calls it:

void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
	.
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
	osq_lock_init(&lock->osq);
#endif
	.
}
static inline void osq_lock_init(struct optimistic_spin_queue *lock)
{
	atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
}

2.1 Acquiring the MCS lock

In the OSQ mechanism, osq_lock acquires the lock. The function is fairly long, so we look at it in several parts.

bool osq_lock(struct optimistic_spin_queue *lock)
{
	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
	struct optimistic_spin_node *prev, *next;
	int curr = encode_cpu(smp_processor_id());
	int old;

	node->locked = 0;
	node->next = NULL;
	node->cpu = curr;

	/* Store the current CPU's encoded number in lock->tail */
	old = atomic_xchg(&lock->tail, curr);
	if (old == OSQ_UNLOCKED_VAL)
		return true;

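this_cpu_ptr(&osq_node) fetches the optimistic_spin_node of the local CPU.

encode_cpu(smp_processor_id()) yields the encoded number of the current CPU.

The assignments to node->locked, node->next and node->cpu initialize the local node.

atomic_xchg atomically swaps the local CPU's encoded number into the global lock->tail and returns the previous value of lock->tail in old.

If old equals OSQ_UNLOCKED_VAL, the lock was not held and has now been acquired.

Next, the following part of the code: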

	/* Append the current node to the list */
	prev = decode_cpu(old);
	node->prev = prev;
	WRITE_ONCE(prev->next, node);

	/*
	 * Normally @prev is untouchable after the above store; because at that
	 * moment unlock can proceed and wipe the node element from stack.
	 *
	 * However, since our nodes are static per-cpu storage, we're
	 * guaranteed their existence -- this allows us to apply
	 * cmpxchg in an attempt to undo our queueing.
	 */
	/*
	 * The MCS lock is handed over through node->locked, so locked is
	 * the shared flag that each waiter polls on its own node.
	 */
	while (!READ_ONCE(node->locked)) {
		/*
		 * If we need to reschedule bail... so we can block.
		 */
		if (need_resched())
			goto unqueue;

		cpu_relax_lowlatency();
	}
	return true;

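decode_cpu(old) converts the old tail value back into a pointer to the node of the CPU that was previously at the tail. That node becomes prev.

The next two statements link the current node into the optimistic_spin_node list.

The while loop is where the waiting happens. The OSQ lock is handed over through locked: the previous holder passes the lock on by setting locked in its next node. The loop therefore polls node->locked. Once it reads 1, the previous node has released the lock, the current node now holds it, and the loop exits. If need_resched() reports during the loop that a higher-priority task needs to run, the current task stops waiting and jumps to unqueue.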

unqueue:
	/*
	 * Step - A  -- stabilize @prev
	 *
	 * Undo our @prev->next assignment; this will make @prev's
	 * unlock()/unqueue() wait for a next pointer since @lock points to us
	 * (or later).
	 */

	for (;;) {
		/*
		 * cmpxchg(ptr, old, new): if *ptr == old, store new into *ptr.
		 * It always returns the value *ptr held before the call.
		 */
		if (prev->next == node &&
		    cmpxchg(&prev->next, node, NULL) == node)
			break;

		/*
		 * We can only fail the cmpxchg() racing against an unlock(),
		 * in which case we should observe @node->locked becomming
		 * true.
		 */
		if (smp_load_acquire(&node->locked))
			return true;

		cpu_relax_lowlatency();

		/*
		 * Or we race against a concurrent unqueue()'s step-B, in which
		 * case its step-C will write us a new @node->prev pointer.
		 */
		prev = READ_ONCE(node->prev);
	}

	/*
	 * Step - B -- stabilize @next
	 *
	 * Similar to unlock(), wait for @node->next or move @lock from @node
	 * back to @prev.
	 */

	next = osq_wait_next(lock, node, prev);
	if (!next)
		return false;

	/*
	 * Step - C -- unlink
	 *
	 * @prev is stable because its still waiting for a new @prev->next
	 * pointer, @next is stable because our @node->next pointer is NULL and
	 * it will wait in Step-A.
	 */

	WRITE_ONCE(next->prev, prev);
	WRITE_ONCE(prev->next, next);

	return false;

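The unqueue code has three parts:

Step A: clear the next pointer of the predecessor prev_node.
Step B: clear the next pointer of the current node curr_node and determine its definite successor next_node.
Step C: point prev_node->next at next_node and next_node->prev at prev_node.

In the Step A loop, if prev->next currently equals node, cmpxchg() atomically re-checks that the predecessor's next pointer still points at the current node and, if so, sets prev->next to NULL. That completes Step A.

If the cmpxchg() fails, someone else modified the MCS list in the meantime. smp_load_acquire() then checks once more whether the current node has been handed the lock. The macro is defined as follows: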

#define smp_load_acquire(p)						\
({									\
	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
	compiletime_assert_atomic_type(*p);				\
	smp_mb();							\
	___p1;								\
})

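ACCESS_ONCE() uses a volatile access to force a fresh load of *p, and smp_mb() is a full memory barrier: memory accesses issued before it complete before any issued after it. If node->locked is 1 at this point, the lock has been handed to the current node, and osq_lock returns true.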

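Otherwise, the failed cmpxchg() means that the predecessor prev_node of the current node has changed, so READ_ONCE(node->prev) reloads the predecessor and the loop runs again.

Now the second part: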

	/*
	 * Step - B -- stabilize @next
	 *
	 * Similar to unlock(), wait for @node->next or move @lock from @node
	 * back to @prev.
	 */

	next = osq_wait_next(lock, node, prev);
	if (!next)
		return false;

The second part consists mainly of the osq_wait_next function, defined as follows:

static inline struct optimistic_spin_node *
osq_wait_next(struct optimistic_spin_queue *lock,
	      struct optimistic_spin_node *node,
	      struct optimistic_spin_node *prev)
{
	struct optimistic_spin_node *next = NULL;
	int curr = encode_cpu(smp_processor_id());
	int old;

	/*
	 * If there is a prev node in queue, then the 'old' value will be
	 * the prev node's CPU #, else it's set to OSQ_UNLOCKED_VAL since if
	 * we're currently last in queue, then the queue will then become empty.
	 */
	old = prev ? prev->cpu : OSQ_UNLOCKED_VAL;

	for (;;) {
		if (atomic_read(&lock->tail) == curr &&
		    atomic_cmpxchg(&lock->tail, curr, old) == curr) {
			break;
		}
		if (node->next) {
			next = xchg(&node->next, NULL);
			if (next)
				break;
		}

		cpu_relax_lowlatency();
	}

	return next;
}

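curr holds the encoded number of the current CPU.

old holds the encoded CPU number of the predecessor prev_node, or OSQ_UNLOCKED_VAL (0) if there is no predecessor.

The first if tests whether the current node curr_node is the last node in the MCS list, that is, whether lock->tail equals curr. If so, atomic_cmpxchg writes old into lock->tail and the function returns with next == NULL.

The second if handles the case where curr_node has a successor: xchg sets node->next to NULL and returns the previous value of next.

Now the third part: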

	/*
	 * Step - C -- unlink
	 *
	 * @prev is stable because its still waiting for a new @prev->next
	 * pointer, @next is stable because our @node->next pointer is NULL and
	 * it will wait in Step-A.
	 */

	WRITE_ONCE(next->prev, prev);
	WRITE_ONCE(prev->next, next);

The prev pointer of the successor next_node now points at the predecessor prev_node, and the next pointer of prev_node points at next_node. This removes the current node curr_node from the MCS list.
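Stripped of all concurrency, Step C is an ordinary doubly linked list removal. The following single-threaded sketch shows only the pointer rewiring. It uses plain stores where the kernel uses WRITE_ONCE(), and it relies on nothing like Steps A and B because no other thread exists here. The type and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

struct dl_node_demo {
	struct dl_node_demo *next, *prev;
};

/*
 * Remove 'node' from between 'prev' and 'next'. Single-threaded sketch:
 * the kernel relies on Steps A and B to keep both neighbours stable
 * while these two pointers are rewritten.
 */
void dl_unlink_demo(struct dl_node_demo *prev, struct dl_node_demo *node,
		    struct dl_node_demo *next)
{
	assert(prev != NULL && node != NULL && next != NULL);
	next->prev = prev;
	prev->next = next;
	/* Not part of the kernel code: clear the removed node for clarity. */
	node->next = NULL;
	node->prev = NULL;
}
```

In the concurrent version the hard part is not these two stores but making sure that neither neighbour is itself being unlinked or unlocked at the same moment, which is what the first two steps establish.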

The structure of the MCS lock is shown in the following figure:

[Figure: MCS lock structure (image unavailable)]

2.2 Releasing the MCS lock

void osq_unlock(struct optimistic_spin_queue *lock)
{
	struct optimistic_spin_node *node, *next;
	int curr = encode_cpu(smp_processor_id());

	/*
	 * Fast path for the uncontended case.
	 */
	if (likely(atomic_cmpxchg(&lock->tail, curr, OSQ_UNLOCKED_VAL) == curr))
		return;

	/*
	 * Second most likely case.
	 */
	node = this_cpu_ptr(&osq_node);
	next = xchg(&node->next, NULL);
	if (next) {
		WRITE_ONCE(next->locked, 1);
		return;
	}

	next = osq_wait_next(lock, node, NULL);
	if (next)
		WRITE_ONCE(next->locked, 1);
}

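Fast path: if lock->tail equals curr, nobody else is contending for the lock, so atomic_cmpxchg sets lock->tail to OSQ_UNLOCKED_VAL (0) and the lock is released.

Otherwise, this_cpu_ptr fetches the current CPU's optimistic_spin_node, and xchg sets node->next to NULL and returns the old next. If next is not NULL, WRITE_ONCE sets next->locked to 1, which hands the lock to that node. If next is NULL, osq_wait_next(lock, node, NULL) waits until a successor appears or the queue becomes empty, and a successor, if any, receives the lock in the same way.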

3. Mutex implementation

A mutex can be initialized in two ways:

statically, with DEFINE_MUTEX
dynamically in kernel code, with mutex_init()

#define __MUTEX_INITIALIZER(lockname) \
		{ .count = ATOMIC_INIT(1) \
		, .wait_lock = __SPIN_LOCK_UNLOCKED(lockname.wait_lock) \
		, .wait_list = LIST_HEAD_INIT(lockname.wait_list) \
		__DEBUG_MUTEX_INITIALIZER(lockname) \
		__DEP_MAP_MUTEX_INITIALIZER(lockname) }

#define DEFINE_MUTEX(mutexname) \
	struct mutex mutexname = __MUTEX_INITIALIZER(mutexname)
	
#define mutex_init(mutex)						\
do {									\
	static struct lock_class_key __key;				\
									\
	__mutex_init((mutex), #mutex, &__key);				\
} while (0)

void
__mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
{
	atomic_set(&lock->count, 1);
	spin_lock_init(&lock->wait_lock);
	INIT_LIST_HEAD(&lock->wait_list);
	mutex_clear_owner(lock);
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
	osq_lock_init(&lock->osq);
#endif

	debug_mutex_init(lock, name, key);
}

3.1 Acquiring the mutex

void __sched mutex_lock(struct mutex *lock)
{
	might_sleep();
	/*
	 * The locking fastpath is the 1->0 transition from
	 * 'unlocked' into 'locked' state.
	 */
	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
	mutex_set_owner(lock);
}

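The fast path succeeds when count, after being atomically decremented by 1, equals 0. If the result is negative, the lock is already held and the slow path __mutex_lock_slowpath() is taken. Once the lock has been acquired, mutex_set_owner(lock) points lock->owner at the task_struct of the current task. The slow path is implemented by the following function: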

__visible void __sched
__mutex_lock_slowpath(atomic_t *lock_count)
{
	struct mutex *lock = container_of(lock_count, struct mutex, count);

	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0,
			    NULL, _RET_IP_, NULL, 0);
}
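The fast-path rule for mutex_lock described above can be modelled in user space with a single atomic counter. This is only a sketch of the count transitions under the stated convention (1 unlocked, 0 locked, negative locked with possible waiters). It does not model the slow path, and count_demo and count_demo_trylock_fast are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* count: 1 = unlocked, 0 = locked, negative = locked, possible waiters */
struct count_demo {
	atomic_int count;
};

/*
 * Lock fast path: atomically decrement count. A result of 0 means the
 * 1 -> 0 transition happened and the caller owns the lock. A negative
 * result means the lock was already held; the caller would have to
 * take the slow path, which is not modelled here.
 */
bool count_demo_trylock_fast(struct count_demo *m)
{
	int new_count;

	assert(m != NULL);
	/* atomic_fetch_sub returns the old value, so the new value is old - 1. */
	new_count = atomic_fetch_sub(&m->count, 1) - 1;
	return !(new_count < 0);
}
```

A failed attempt still leaves its mark: count goes negative, which is how the unlocker later learns that it cannot stay on its own fast path.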

__mutex_lock_common is fairly long, so we look at it in several parts:

static __always_inline int __sched
__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
		    struct lockdep_map *nest_lock, unsigned long ip,
		    struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
{
	struct task_struct *task = current;
	struct mutex_waiter waiter;
	unsigned long flags;
	int ret;

	preempt_disable();
	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);

	if (mutex_optimistic_spin(lock, ww_ctx, use_ww_ctx)) {
		/* got the lock, yay! */
		preempt_enable();
		return 0;
	}

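preempt_disable() turns off kernel preemption. mutex_optimistic_spin() then implements the optimistic spinning mechanism. That function is long, so a simplified excerpt is shown here: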

static bool mutex_optimistic_spin(struct mutex *lock,
				  struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
{
	struct task_struct *task = current;

	if (!mutex_can_spin_on_owner(lock))
		goto done;

	if (!osq_lock(&lock->osq))
		goto done;

	while (true) {
		struct task_struct *owner;
          
		owner = READ_ONCE(lock->owner);
		if (owner && !mutex_spin_on_owner(lock, owner))
			break;
		if (mutex_try_to_acquire(lock)) {
			mutex_set_owner(lock);
			osq_unlock(&lock->osq);
			return true;
		}
		if (!owner && (need_resched() || rt_task(task)))
			break;

		cpu_relax_lowlatency();
	}

	osq_unlock(&lock->osq);
done:
	if (need_resched()) {
		__set_current_state(TASK_RUNNING);
		schedule_preempt_disabled();
	}

	return false;
}

In mutex_optimistic_spin:

mutex_can_spin_on_owner decides whether the current task may spin on this mutex.

static inline int mutex_can_spin_on_owner(struct mutex *lock)
{
	struct task_struct *owner;
	int retval = 1;

	if (need_resched())
		return 0;

	rcu_read_lock();
	owner = READ_ONCE(lock->owner);
	if (owner)
		retval = owner->on_cpu;
	rcu_read_unlock();
	
	return retval;
}

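While a task holds the mutex, lock->owner points to that task's task_struct. task_struct->on_cpu being 1 means the lock holder is currently running, in other words executing inside the critical section, because lock->owner is reset to NULL as soon as the holder releases the lock. rcu_read_lock() and rcu_read_unlock() form an RCU read-side critical section around the dereference of owner. Without it, the task_struct could be freed (for example when the task is killed) while owner->on_cpu is being read. Inside the read-side critical section the structure is guaranteed not to be freed.

Back in mutex_optimistic_spin: if mutex_can_spin_on_owner returns 0, either the current task needs to be rescheduled or the lock holder is not running. The conditions for spinning are not met, and the code jumps straight to done.

osq_lock then takes an OSQ lock for protection. The OSQ lock is an optimized form of spinlock. Why take an MCS lock here? The task is about to spin until the mutex is released, and it does not want others joining in, because many simultaneous spinners would cause severe CPU cache-line bouncing. All tasks waiting optimistically for the mutex are placed on the OSQ lock's queue, and only the one at the head of the queue may spin on the mutex.

The while loop spins and repeatedly checks whether the holder has released the lock:

READ_ONCE(lock->owner) fetches the task_struct of the current lock holder.

mutex_spin_on_owner spins until the holder releases the lock. How is this spinning implemented?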

static noinline
bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
	bool ret = true;

	rcu_read_lock();
	while (lock->owner == owner) {
		barrier();
		if (!owner->on_cpu || need_resched()) {
			ret = false;
			break;
		}

		cpu_relax_lowlatency();
	}
	rcu_read_unlock();

	return ret;
}

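In mutex_lock, once the lock has been acquired, lock->owner is set to the task's task_struct. When the lock is released, lock->owner is set to NULL. mutex_spin_on_owner keeps testing lock->owner == owner and leaves the loop as soon as the test is false. It also leaves the loop, returning false, when the lock holder is no longer running on a CPU or when a higher-priority task needs to be scheduled.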
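The three ways out of the spin loop can be made explicit by writing one loop iteration as a pure function. This is a simplified model, not kernel code: task_demo replaces task_struct, the resched parameter stands in for need_resched(), and the enum and function names are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task_demo {
	bool on_cpu;		/* true while the task is running on a CPU */
};

enum spin_verdict {
	SPIN_AGAIN,		/* owner unchanged and running: keep spinning */
	SPIN_OWNER_CHANGED,	/* lock released or handed over: try to take it */
	SPIN_GIVE_UP,		/* owner off-CPU or we must reschedule: go sleep */
};

/*
 * One iteration of the spin-on-owner loop, as a pure function.
 * 'current_owner' is what lock->owner holds now, 'seen_owner' is the
 * owner sampled before spinning, 'resched' stands in for need_resched().
 */
enum spin_verdict spin_step_demo(const struct task_demo *current_owner,
				 const struct task_demo *seen_owner,
				 bool resched)
{
	assert(seen_owner != NULL);
	if (current_owner != seen_owner)
		return SPIN_OWNER_CHANGED;
	if (!seen_owner->on_cpu || resched)
		return SPIN_GIVE_UP;
	return SPIN_AGAIN;
}
```

In terms of the kernel function, SPIN_OWNER_CHANGED corresponds to leaving the loop with ret still true, and SPIN_GIVE_UP corresponds to setting ret to false before the break.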

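Back in mutex_optimistic_spin: once mutex_spin_on_owner returns, its result decides whether spinning should continue. mutex_try_to_acquire then tries to take the lock, and the function returns true on success. The check !owner && (need_resched() || rt_task(task)) ends the spin when there is no owner and either the current task needs to be rescheduled or it is a real-time task.

Back in __mutex_lock_common: if mutex_optimistic_spin returns true, the lock has been acquired, so preemption is re-enabled and the function returns. If spinning fails, __mutex_lock_common continues as follows: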

	spin_lock_mutex(&lock->wait_lock, flags);

	if (!mutex_is_locked(lock) && (atomic_xchg(&lock->count, 0) == 1))
		goto skip_wait;

	list_add_tail(&waiter.list, &lock->wait_list);
	waiter.task = task;

	lock_contended(&lock->dep_map, ip);

	for (;;) {
		if (atomic_read(&lock->count) >= 0 &&
		    (atomic_xchg(&lock->count, -1) == 1))
			break;

		if (unlikely(signal_pending_state(state, task))) {
			ret = -EINTR;
			goto err;
		}

		__set_task_state(task, state);
		spin_unlock_mutex(&lock->wait_lock, flags);
		schedule_preempt_disabled();
		spin_lock_mutex(&lock->wait_lock, flags);
	}
	__set_task_state(task, TASK_RUNNING);

	mutex_remove_waiter(lock, &waiter, current_thread_info());
	/* set it to 0 if there are no waiters left: */
	if (likely(list_empty(&lock->wait_list)))
		atomic_set(&lock->count, 0);

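The first if makes one more attempt to take the lock. With luck it succeeds, and the sleep/wakeup slow path is avoided.

list_add_tail appends waiter to the mutex's wait queue wait_list, which gives first-in, first-out order.

In the for loop, each iteration first tries to take the lock. If that fails, schedule_preempt_disabled() gives up the CPU and the task goes to sleep.

Once the lock has been acquired and the loop exits, __set_task_state(task, TASK_RUNNING) marks the current task as runnable.

mutex_remove_waiter removes waiter from the wait queue.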
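The acquisition attempt inside the for loop is worth isolating, because the exchange of -1 does two jobs at once. The sketch below models it with a C11 atomic_int. It covers only the count manipulation, not the wait list or the sleeping, and slowpath_try_demo and slowpath_settle_demo are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * One acquisition attempt inside the slow-path loop. Exchanging in -1
 * both tries to take the lock and marks it as "locked, possible
 * waiters", so that the eventual unlock takes its slow path and wakes
 * someone. The attempt succeeds only if the old value was 1 (unlocked).
 */
bool slowpath_try_demo(atomic_int *count)
{
	assert(count != NULL);
	return atomic_load(count) >= 0 && atomic_exchange(count, -1) == 1;
}

/* After acquiring: if nobody else is queued, drop back to plain "locked". */
void slowpath_settle_demo(atomic_int *count, bool wait_list_empty)
{
	assert(count != NULL);
	if (wait_list_empty)
		atomic_store(count, 0);
}
```

The initial read of count avoids a needless atomic exchange when the value is already negative, since in that state the exchange could not succeed and would only dirty the cache line.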

3.2 Releasing the mutex

A mutex is released by the following function:

void __sched mutex_unlock(struct mutex *lock)
{
	mutex_clear_owner(lock);
	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
}

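mutex_clear_owner clears lock->owner. Like locking, unlocking has a fast path and a slow path. On the fast path, if count is greater than 0 after being atomically incremented by 1, nobody is on the wait queue and the unlock is complete. Otherwise __mutex_unlock_slowpath takes the slow path: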

__visible void
__mutex_unlock_slowpath(atomic_t *lock_count)
{
	struct mutex *lock = container_of(lock_count, struct mutex, count);

	__mutex_unlock_common_slowpath(lock, 1);
}


/*
 * Release the lock, slowpath:
 */
static inline void
__mutex_unlock_common_slowpath(struct mutex *lock, int nested)
{
	unsigned long flags;
	if (__mutex_slowpath_needs_to_unlock())
		atomic_set(&lock->count, 1);

	spin_lock_mutex(&lock->wait_lock, flags);
	mutex_release(&lock->dep_map, nested, _RET_IP_);

	if (!list_empty(&lock->wait_list)) {
		/* get the first entry from the wait-list: */
		struct mutex_waiter *waiter =
				list_entry(lock->wait_list.next,
					   struct mutex_waiter, list);

		wake_up_process(waiter->task);
	}

	spin_unlock_mutex(&lock->wait_lock, flags);
}

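For performance reasons, this function releases the lock first, so that a task that is spinning can take it at once. Then, if the wait queue is not empty, wake_up_process wakes the first waiter.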
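The order of operations on release (fast path, then "unlock first, wake second") can be sketched in user space as follows. The sketch assumes that the slow path always resets count to 1, which corresponds to __mutex_slowpath_needs_to_unlock() returning true. The wait list is reduced to a small array of made-up task ids, and all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS_DEMO 8	/* arbitrary capacity for this sketch */

struct unlock_demo {
	atomic_int count;		/* 1 unlocked, 0 locked, <0 waiters */
	int waiters[MAX_WAITERS_DEMO];	/* FIFO of hypothetical task ids */
	size_t nr_waiters;
};

/*
 * Returns the id of the task that would be woken, or -1 if the fast
 * path was enough (or the wait list turned out to be empty).
 */
int unlock_demo_release(struct unlock_demo *m)
{
	assert(m != NULL);
	/* Fast path: count + 1 > 0 means nobody flagged themselves as waiting. */
	if (atomic_fetch_add(&m->count, 1) + 1 > 0)
		return -1;

	/* Slow path: release the lock first so a spinner can grab it ... */
	atomic_store(&m->count, 1);
	/* ... then wake the head of the wait list without dequeuing it. */
	return m->nr_waiters > 0 ? m->waiters[0] : -1;
}
```

The woken task is not removed from the list here, which mirrors the walkthrough above: the waiter dequeues itself with mutex_remove_waiter after it has actually obtained the lock.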

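4. Summary

The implementation details above show why the mutex is considerably more efficient than the semaphore.

The mutex was the first to implement optimistic spinning.
The mutex tries to take the lock before going to sleep.
The mutex uses an MCS lock to avoid the cache-line bouncing caused by several CPUs contending for the lock.

Because the mutex is simple and efficient, the rules for using it are stricter than those for the semaphore. Observe the following constraints:

Only one thread can hold a mutex at any time.
Only the lock holder may unlock. A mutex must not be acquired in one task and released in another.
Recursive locking and unlocking are not allowed.
A task must not exit while holding a mutex.
Acquiring a mutex may sleep, so it must not be used in hard or soft interrupt context.

In practice, how do you choose among spin_lock, semaphores and mutexes?

In interrupt context, choose spin_lock without hesitation. If the critical section sleeps, or calls kernel APIs that may sleep, avoid spin_lock. Between a semaphore and a mutex, use a mutex whenever its constraints can be met, and a semaphore otherwise.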