A Common Misunderstanding of the Linux Load Average

The uptime and top commands both display the load average. From left to right, the three numbers denote the 1-minute, 5-minute, and 15-minute load averages:

$ uptime
10:16:25 up 3 days, 19:23, 2 users, load average: 0.00, 0.01, 0.05

Load average is a concept inherited from UNIX. Although different systems use different formulas, they all measure the same thing: the number of processes currently using a CPU plus the number of runnable processes waiting for a CPU; in a word, the number of runnable processes. The load average can therefore serve as an indicator of a CPU bottleneck: if it is consistently greater than the number of CPUs, the CPUs may be insufficient.

However, that is not how it works on Linux!

On Linux, the load average counts not only the processes using a CPU and those waiting for a CPU, but also processes in uninterruptible sleep (the D state). A process usually enters uninterruptible sleep while waiting on an I/O device or the network. The Linux designers' reasoning was that uninterruptible sleeps should all be very short-lived and the process will soon resume running, so such processes are effectively runnable. Even so, however brief it may be, an uninterruptible sleep is still a sleep; and in the real world uninterruptible sleeps are not necessarily short at all. Many processes in uninterruptible sleep, or long uninterruptible sleeps, usually mean an I/O device has hit a bottleneck. As everyone knows, a sleeping process does not need the CPU: even if every CPU is idle, a sleeping process cannot run, so the number of sleeping processes is clearly not a suitable measure of CPU load. By counting uninterruptible-sleep processes directly into the load average, Linux subverts the original meaning of the metric. As a result, on a Linux system the load average largely loses its usefulness, because you no longer know what it means: when you see a high load average, you cannot tell whether there are too many runnable processes or too many processes in uninterruptible sleep, and therefore you cannot determine whether the CPUs are insufficient or an I/O device is the bottleneck.

Reference: https://en.wikipedia.org/wiki/Load_(computing)
"Most UNIX systems count only processes in the running (on CPU) or runnable (waiting for CPU) states. However, Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system."

Source:

RHEL6
kernel/sched.c:

static void calc_load_account_active(struct rq *this_rq)
{
    long nr_active, delta;

    nr_active = this_rq->nr_running;
    nr_active += (long) this_rq->nr_uninterruptible;

    if (nr_active != this_rq->calc_load_active) {
        delta = nr_active - this_rq->calc_load_active;
        this_rq->calc_load_active = nr_active;
        atomic_long_add(delta, &calc_load_tasks);
    }
}

RHEL7
kernel/sched/core.c:

static long calc_load_fold_active(struct rq *this_rq)
{
    long nr_active, delta = 0;

    nr_active = this_rq->nr_running;
    nr_active += (long) this_rq->nr_uninterruptible;

    if (nr_active != this_rq->calc_load_active) {
        delta = nr_active - this_rq->calc_load_active;
        this_rq->calc_load_active = nr_active;
    }

    return delta;
}
RHEL7
kernel/sched/core.c:

/*
 * Global load-average calculations
 *
 * We take a distributed and async approach to calculating the global load-avg
 * in order to minimize overhead.
 *
 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.
 *
 * Once every LOAD_FREQ:
 *
 *   nr_active = 0;
 *   for_each_possible_cpu(cpu)
 *       nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
 *
 *   avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)
 *
 * Due to a number of reasons the above turns in the mess below:
 *
 *  - for_each_possible_cpu() is prohibitively expensive on machines with
 *    serious number of cpus, therefore we need to take a distributed approach
 *    to calculating nr_active.
 *
 *        \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
 *                      = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
 *
 *    So assuming nr_active := 0 when we start out -- true per definition, we
 *    can simply take per-cpu deltas and fold those into a global accumulate
 *    to obtain the same result. See calc_load_fold_active().
 *
 *    Furthermore, in order to avoid synchronizing all per-cpu delta folding
 *    across the machine, we assume 10 ticks is sufficient time for every
 *    cpu to have completed this task.
 *
 *    This places an upper-bound on the IRQ-off latency of the machine. Then
 *    again, being late doesn't loose the delta, just wrecks the sample.
 *
 *  - cpu_rq()->nr_uninterruptible isn't accurately tracked per-cpu because
 *    this would add another cross-cpu cacheline miss and atomic operation
 *    to the wakeup path. Instead we increment on whatever cpu the task ran
 *    when it went into uninterruptible state and decrement on whatever cpu
 *    did the wakeup. This means that only the sum of nr_uninterruptible over
 *    all cpus yields the correct result.
 *
 *  This covers the NO_HZ=n code, for extra head-aches, see the comment below.
 */


Origin blog.51cto.com/youling87/2485715