Linux kernel analysis notes: timers and time management

This time I will mainly talk about time-related topics. We are all familiar with the subject, so I will get straight into it.

      We must first understand two concepts: the system timer and the dynamic timer. Periodic events are driven by the system timer, a programmable hardware chip that can generate interrupts at a fixed frequency. That interrupt is the timer interrupt, and its interrupt handler is responsible for updating the system time and for running tasks that need to execute periodically. The system timer and the clock interrupt handler are the core of the Linux kernel's time management mechanism. A dynamic timer, by contrast, is a tool for deferring the execution of code; the kernel can create and destroy dynamic timers on the fly.

       The kernel needs the hardware's help to measure and manage time: the hardware provides a system timer that the kernel uses to gauge the passing of time, acting as the kernel's electronic time source. The system timer triggers a clock interrupt at a fixed, programmable frequency called the tick rate. When a clock interrupt occurs, the kernel handles it through a special interrupt handler. The tick rate is defined as a static preprocessor constant, HZ, and the hardware is programmed with this value when the system boots. Different architectures use different HZ values, defined in asm/param.h; the tick period is 1/HZ seconds. Note that HZ is not set in stone: it can be changed when configuring and building the kernel. Strictly speaking, a fixed-frequency clock interrupt is not even necessary for an operating system; the kernel can instead program timer events dynamically. But I won't go into that here.

       In the Linux kernel there is a variable named jiffies (defined in linux/jiffies.h) that records the total number of ticks since the system booted. The kernel initializes it to 0 at startup, and the clock interrupt handler increments it on every tick thereafter. Since there are HZ clock interrupts per second, jiffies grows by HZ each second, and the system uptime in seconds equals jiffies/HZ. Like any variable stored on a computer, jiffies has a finite size; when it grows past its upper limit it wraps around to 0. The wraparound looks trivial, but it actually creates real trouble in our code, for example in boundary-condition checks. Fortunately, the kernel provides four macros, defined in linux/jiffies.h, to compare tick counts safely across a wraparound:
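       These are their 2.6-era definitions, simplified here to omit the typecheck() that the real header wraps around them:

#define time_after(unknown, known)     ((long)(known) - (long)(unknown) < 0)
#define time_before(unknown, known)    ((long)(unknown) - (long)(known) < 0)
#define time_after_eq(unknown, known)  ((long)(unknown) - (long)(known) >= 0)
#define time_before_eq(unknown, known) ((long)(known) - (long)(unknown) >= 0)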

      Explanation: the unknown parameter is usually jiffies, and the known parameter is the value it is compared against.
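      For example (a hypothetical snippet; timeout is an illustrative name, not from the original text), a safe timeout check looks like this:

unsigned long timeout = jiffies + HZ / 2;   /* half a second from now */

/* ... do some work ... */

/* wrong: misjudges the deadline once jiffies wraps around */
if (jiffies >= timeout) {
    /* timed out */
}

/* correct: time_after() handles the wraparound */
if (time_after(jiffies, timeout)) {
    /* timed out */
}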

       If the HZ value in the kernel is changed, some user-space programs will produce abnormal results, because the kernel exports this value to user space as ticks per second. That interface has been stable for a long time, and applications have gradually come to depend on one particular value of HZ. Changing HZ in the kernel therefore breaks the constant that user space relies on, since user space knows nothing about the new HZ. To solve this, the kernel must scale every exported tick count. It defines USER_HZ to represent the HZ value that user space sees, and the macro jiffies_to_clock_t() converts a tick count expressed in HZ into one expressed in USER_HZ. The form of the macro depends on whether HZ is an integer multiple of USER_HZ. When it is, the macro is quite simple:

#define jiffies_to_clock_t(x) ((x) / (HZ / USER_HZ))

       If it is not an integer multiple, the macro has to use a more complicated algorithm. Similarly, for 64-bit jiffies values the kernel uses the function jiffies_64_to_clock_t() to convert from HZ units to USER_HZ units.
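       For instance (a small sketch; start_jiffies is an illustrative variable, not from the original text), exporting an elapsed interval to user space might look like:

unsigned long elapsed = jiffies - start_jiffies;    /* interval in kernel ticks (HZ)  */
clock_t user_ticks = jiffies_to_clock_t(elapsed);   /* same interval in USER_HZ ticks */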

       The architecture provides two devices for timekeeping: the system timer and the real-time clock. The system timer provides the periodic interrupt mechanism. The real-time clock (RTC) is a device that stores the system time persistently; even after the system is powered off, it keeps time on the power supplied by a small battery on the motherboard. When the system boots, the kernel reads the RTC to initialize the wall time, which is stored in the xtime variable. So the main job of the real-time clock is to initialize xtime at startup.

       With this conceptual foundation, let's analyze the clock interrupt handler. It is divided into two parts: an architecture-dependent part and an architecture-independent part. The architecture-dependent routine is registered in the kernel as the interrupt handler for the system timer, so it runs whenever a clock interrupt occurs. Its work is as follows:

1. Obtain the xtime_lock lock to protect access to jiffies_64 and the wall time, xtime.
2. Acknowledge or reset the system clock as required.
3. Periodically update the real-time clock from the wall time.
4. Call the architecture-independent timer routine, do_timer().

The interrupt service routine does most of its work through the architecture-independent do_timer() routine, which:
1. Increments the jiffies_64 variable.
2. Updates resource-usage statistics, such as the system time and user time consumed by the current process.
3. Runs any expired dynamic timers.
4. Executes the scheduler_tick() function.
5. Updates the wall time, stored in the xtime variable.
6. Calculates the load average.

       do_timer() itself looks very simple, because its main job is to drive the framework above and leave the specifics to other functions:

void do_timer(struct pt_regs *regs)
{
    jiffies_64++;
    update_process_times(user_mode(regs));
    update_times();
}

       The user_mode() macro examines the state of the processor registers, regs: it returns 1 if the clock interrupt occurred in user space and 0 if it occurred in kernel mode. The update_process_times() function then charges the tick to either user time or system time, depending on where the interrupt occurred:

void update_process_times(int user_tick)
{
    struct task_struct *p = current;
    int cpu = smp_processor_id();
    int system = user_tick ^ 1;

    update_one_process(p, user_tick, system, cpu);
    run_local_timers();
    scheduler_tick(user_tick, system);
}

       The job of update_one_process() is to update the process's time accounting. Its implementation is quite detailed, but note that thanks to the XOR above, whenever one of user_tick and system is 1, the other must be 0. update_one_process() can therefore add user_tick and system straight onto the corresponding per-process counters, without needing a conditional branch:

p->utime += user;
p->stime += system;

       Each statement above increments the appropriate counter by one while leaving the other unchanged. As you may have noticed, this means that when the kernel accounts process time, it classifies the tick according to the processor mode at the moment the interrupt occurred, and charges the entire previous tick to the process. In reality the process may have entered and exited kernel mode many times during that tick, and it may not even have been the only process running during that tick, but there is no better option. Next, the run_local_timers() function raises a softirq to handle all expired timers. Finally, scheduler_tick() decrements the current process's timeslice and sets the need_resched flag when necessary; on SMP machines it also balances the per-processor run queues. When update_process_times() returns, do_timer() goes on to call update_times() to update the wall time:

void update_times(void)
{
    unsigned long ticks;

    /* new ticks accumulated since the last update */
    ticks = jiffies - wall_jiffies;
    if (ticks) {
        wall_jiffies += ticks;
        update_wall_time(ticks);
    }
    last_time_offset = 0;
    calc_load(ticks);
}

       Here ticks records the number of new ticks since the last update. Normally ticks should obviously be 1, but clock interrupts can be missed, and ticks can be lost with them; this happens when interrupts are disabled for a long time (it is not common and is usually a bug). wall_jiffies is then incremented by ticks, so at this point wall_jiffies equals the jiffies value of the most recent wall-time update. Then update_wall_time() is called to update xtime, and finally calc_load() runs. Once do_timer() finishes, control returns to the architecture-dependent interrupt handler, which performs its remaining work, releases the xtime_lock lock, and exits. All of this happens every 1/HZ of a second.

       The wall time just mentioned is what we usually call the actual time. It refers to the variable xtime, defined with struct timespec (in kernel/timer.c), as follows:

struct timespec {
    time_t tv_sec;   /* seconds elapsed since 00:00:00 January 1, 1970 (UTC), known as the epoch */
    long tv_nsec;    /* nanoseconds elapsed since the last full second */
};

Reading and writing the xtime variable requires the xtime_lock lock, which is a sequential lock (seqlock). I won't cover kernel locking in general here; just make sure to take and release the lock appropriately. Back to user space: the main interface for obtaining the wall time from user space is gettimeofday(), whose counterpart in the kernel is the sys_gettimeofday() system call:

asmlinkage long sys_gettimeofday(struct timeval __user *tv, struct timezone __user *tz)
{
         if (likely(tv != NULL)) {
                 struct timeval ktv;
                 do_gettimeofday(&ktv);
                 if (copy_to_user(tv, &ktv, sizeof(ktv)))
                         return -EFAULT;
         }
         if (unlikely(tz != NULL)) {
                 if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
                         return -EFAULT;
         }
         return 0;
}

       Analyzing the function above, the interesting work centers on tv. When tv is non-NULL, the function calls do_gettimeofday(), whose main job is a retry loop that reads xtime. If the tz parameter is non-NULL, the function copies the system time zone (stored in sys_tz) back to the user. If copying the wall time or the time zone to user space fails, the function returns -EFAULT; on success it returns 0. The kernel also provides the time() system call, but it has been almost completely superseded by gettimeofday(). The C library offers further wall-time-related calls such as ftime() and ctime(). The settimeofday() system call sets the current time and requires the CAP_SYS_TIME capability. Apart from updating xtime, the kernel does not use xtime as frequently as user-space programs do, with one notable exception: filesystem code uses xtime when recording access timestamps.
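       To make that retry loop concrete, here is a simplified sketch (assuming the 2.6-era seqlock API; not a verbatim copy of do_gettimeofday()) of how a reader safely samples xtime:

unsigned long seq;
time_t sec;
long nsec;

/* re-read xtime whenever a writer updated it mid-read */
do {
    seq  = read_seqbegin(&xtime_lock);
    sec  = xtime.tv_sec;
    nsec = xtime.tv_nsec;
} while (read_seqretry(&xtime_lock, seq));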

      So much for the hardware clock; now for a new topic: timers (also called dynamic timers or kernel timers). Timers are not periodic; they are destroyed after they expire, which is why they are called dynamic. A timer is represented by struct timer_list, defined in linux/timer.h, as follows:

struct timer_list {
         struct list_head entry;
         unsigned long expires;
         spinlock_t lock;
         unsigned long magic;
         void (*function)(unsigned long);
         unsigned long data;
         struct tvec_t_base_s *base;
};

      The kernel provides a group of timer-related interfaces to simplify timer management. They are all declared in linux/timer.h, and most of them are implemented in kernel/timer.c. With these interfaces, what we have to do is straightforward:

1. Create the timer: struct timer_list my_timer;

2. Initialize the timer: init_timer(&my_timer);

3. Set up the timer as needed:

            my_timer.expires = jiffies + delay;

            my_timer.data = 0;

            my_timer.function = my_function;

4. Activate the timer: add_timer(&my_timer);

      After these steps the timer is up and running. Generally, a timer runs shortly after it expires, but it may be deferred until the next clock tick, so timers cannot be used for hard real-time work. To change an already active timer, call mod_timer(&my_timer, jiffies + new_delay). mod_timer() also works on timers that have been initialized but not yet activated; if the timer is inactive, mod_timer() activates it. It returns 0 if the timer was inactive and 1 if it was active, but in either case, once mod_timer() returns, the timer is activated and set to the new expiry value. To deactivate a timer before it expires, use del_timer(&my_timer). Note that on multiprocessor machines the timer handler may already be running on another processor, in which case you must wait for any handler possibly running elsewhere to exit before deleting the timer; del_timer_sync() performs exactly this synchronized deletion. It takes the same argument as del_timer(), only it cannot be used in interrupt context. Timer handlers run asynchronously with respect to the current code, so race conditions are possible and deserve special care; in that sense del_timer_sync() is safer than del_timer(). A complete sketch of the whole sequence follows.
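      Putting the steps together (a minimal sketch against the 2.6-era API; my_timer_handler and the five-second delay are illustrative choices, not from the original text):

#include <linux/timer.h>
#include <linux/jiffies.h>

/* runs in softirq (bottom-half) context when the timer expires */
static void my_timer_handler(unsigned long data)
{
    printk(KERN_INFO "timer fired, data=%lu\n", data);
}

static struct timer_list my_timer;

static void start_my_timer(void)
{
    init_timer(&my_timer);
    my_timer.expires  = jiffies + 5 * HZ;    /* fire in about five seconds */
    my_timer.data     = 0;                   /* argument passed to handler */
    my_timer.function = my_timer_handler;
    add_timer(&my_timer);
}

static void stop_my_timer(void)
{
    /* safe on SMP: waits for a handler running on another CPU to finish */
    del_timer_sync(&my_timer);
}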

      The kernel runs timers after the clock interrupt fires; timers execute as a softirq, in bottom-half context. Specifically, the clock interrupt handler calls update_process_times(), which in turn calls run_local_timers():

void run_local_timers(void)
{
    raise_softirq(TIMER_SOFTIRQ);
}

      This raises the TIMER_SOFTIRQ softirq, which runs all expired timers on the current processor. All timers are organized in linked lists, but a single flat list would clearly hurt performance, since every tick would require a sequential scan and reshuffle. Instead, the kernel partitions timers into five groups by expiry time, and a timer moves down from group to group as its expiry draws near. This approach greatly reduces the cost of searching for expired timers.
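      For a rough picture of that grouping, here is a simplified sketch of the 2.6-era tvec layout from kernel/timer.c (the sizes are the historical defaults, TVR_SIZE = 256 and TVN_SIZE = 64; the layout is condensed for illustration, not copied verbatim):

struct tvec_base_sketch {
    struct list_head tv1[256];   /* timers expiring within the next 256 ticks   */
    struct list_head tv2[64];    /* expiring within the next 256 * 64 ticks     */
    struct list_head tv3[64];    /* expiring within the next 256 * 64^2 ticks   */
    struct list_head tv4[64];    /* expiring within the next 256 * 64^3 ticks   */
    struct list_head tv5[64];    /* everything farther in the future            */
};

As ticks pass, timers cascade from tv5 down toward tv1, so on each tick only the current slot of tv1 has to be scanned.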

Next topic: besides timers and bottom halves, kernel code (especially driver code) has several other ways to delay execution. Let me summarize them:

1. Busy waiting (also called busy looping): usually the least desirable method, because the processor spins uselessly and cannot do anything else. It should only be used when the desired delay is a whole number of ticks or when precision hardly matters. The implementation is simple enough: spin in a loop until the desired number of clock ticks has elapsed. For example:

unsigned long delay = jiffies + 10;   /* ten ticks */
while (time_before(jiffies, delay))
    ;   /* spin */

      The drawback is obvious. A better approach lets the kernel reschedule and run other tasks while your code waits, like this:

unsigned long delay = jiffies + 10;   /* ten ticks */
while (time_before(jiffies, delay))
    cond_resched();

      The cond_resched() function schedules another program to run, but only after the need_resched flag has been set; in other words, only when the system has a more important task waiting to run. Because this method invokes the scheduler, it cannot be used in interrupt context, only in process context. In fact, all of these delay methods should be used from process context, since interrupt handlers ought to execute as quickly as possible. Also, delayed execution of any kind should never happen while holding a lock or with interrupts disabled.

      As for delays that are very short (shorter than a clock tick) and must also be precise, these usually come up when synchronizing with hardware, that is, briefly waiting for some operation to complete, often in under 1 ms. A jiffies-based delay like the one in the previous example cannot work there. Instead, you can use two functions defined in linux/delay.h that do not rely on jiffies at all; they handle delays at microsecond and millisecond granularity, as shown below:

void udelay(unsigned long usecs);
void mdelay(unsigned long msecs);

      The former relies on looping a known number of iterations to achieve the delay, and mdelay() is in turn implemented via udelay(). Because the kernel knows how many loop iterations the processor can execute in a second, udelay() merely scales that number by the fraction of a second the requested delay represents to decide how many iterations to run. udelay() should only be used for very short delays, because on fast machines a long delay can cause an overflow; experience says never to use it for delays longer than 1 ms. Both functions are really just busy waits, so unless they are genuinely necessary, don't use them.
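      Typical usage would look like this (the particular values are hypothetical, e.g. a settle time from a device datasheet, not from the original text):

udelay(150);   /* busy-wait 150 microseconds */
mdelay(2);     /* busy-wait 2 milliseconds, built on udelay() */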

      What I said above sounds a little scary, so what should we do? A more ideal way to delay execution is the schedule_timeout() function. It puts the task that needs the delay to sleep until at least the specified time has elapsed, then lets it run again. There is no guarantee the sleep lasts exactly the specified delay; it is only as close to it as possible. When the specified time expires, the kernel wakes the delayed task and puts it back on the run queue, as follows:

set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(s*HZ);

      The only parameter is the relative delay in jiffies; in the example above, the task is placed on the interruptible sleep queue and sleeps for s seconds. Before calling schedule_timeout(), the task must be set to the TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state, otherwise it will not sleep. Because this function invokes the scheduler, the calling code must be able to sleep: it must be in process context and must not hold any locks. The function's implementation is quite simple; have a look at the source if you're curious. The next step happens when the timer expires and the process_timeout() function is called:

void process_timeout(unsigned long data)
{
    wake_up_process((task_t *)data);
}

      This function puts the task in the TASK_RUNNING state and places it back on the run queue. When the task is rescheduled, it resumes where it left off before sleeping (right after the call to schedule()). If the task is woken early (say, by a signal), the timer is destroyed and schedule_timeout() returns the remaining time.

      Finally, in the section on process scheduling, we said that code in process context can put itself on a wait queue in order to wait for a specific event to occur. A task on a wait queue, however, may be waiting both for a specific event to arrive and for a specific time to expire, whichever comes first. In that case the code can simply use the schedule_timeout() function instead of schedule(), so the task is also woken when the specified time runs out. Of course, the code then needs to check why it was woken: by the event, by the expired delay, or by a signal, and act accordingly.
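      Here is a hedged sketch of that pattern with the 2.6-era wait-queue API (my_queue, my_event_occurred, and the five-second timeout are illustrative names, not from the original text):

DECLARE_WAITQUEUE(wait, current);

add_wait_queue(&my_queue, &wait);
set_current_state(TASK_INTERRUPTIBLE);

if (!my_event_occurred) {
    long remaining = schedule_timeout(5 * HZ);

    if (signal_pending(current)) {
        /* woken early by a signal */
    } else if (remaining == 0) {
        /* the timeout expired before the event arrived */
    } else {
        /* the event arrived with 'remaining' jiffies to spare */
    }
}

set_current_state(TASK_RUNNING);
remove_wait_queue(&my_queue, &wait);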
