Understanding the TSC clock in one article: implementation on x86_64/arm64 and programming use

Linux (16): Time Stamp Counter

Author: Once Day Date: May 30, 2023

1. Overview

DPDK (Data Plane Development Kit) is a set of libraries and drivers for fast packet processing. The TSC (Time Stamp Counter) is a high-precision counter used to measure time on a CPU core. It is a 64-bit register, each CPU core has its own, and it increments every clock cycle (on newer processors, at a constant rate that does not depend on the current core frequency).

The TSC on different CPU cores may or may not tick at the same rate and hold the same value. It depends on several factors:

  1. Synchronized TSC: Modern processors synchronize the TSC, i.e. they start it on all cores at the same time and increment it at the same rate, so on these processors TSC readings should agree across cores. To find out what a given processor guarantees, check its specification document, or query the relevant fields of the CPUID instruction.

  2. Dynamic frequency adjustment (such as Intel SpeedStep, AMD Cool'n'Quiet, etc.): Some processors adjust the CPU frequency according to the load. On processors whose TSC follows the core clock, this makes the TSC advance at different rates on different cores, since each core may run at a different frequency. To work around this, set the processor to run at a fixed frequency, or use a timer that is not affected by dynamic frequency adjustment, such as HPET (High Precision Event Timer).

  3. Multiprocessor systems (such as servers with multiple physical CPUs): In systems with multiple physical processors, each processor has its own TSC, so cores on different processors can see different TSC values. This can be handled in software (for example by keeping a measurement on one core, or by measuring and compensating the offset between processors), or by using a timer better suited to multiprocessor systems, such as HPET.

In short, the TSC on different CPU cores may or may not agree. Whether it does depends on the processor architecture, on dynamic frequency adjustment, and on whether the system has multiple processors. Where precise timing is required and the TSC cannot be trusted, another timer such as HPET can be used to avoid these problems.

2. DPDK's two timestamp timers (TSC, HPET)

In DPDK, the Time Stamp Counter (TSC) and the High Precision Event Timer (HPET) are two ways of measuring time. Their relationship can be summarized as follows:

  1. Time measurement accuracy and performance

    • TSC: The TSC is a high-precision counter that is read with a single instruction, so reads are fast and have low latency. Where performance matters, DPDK prefers the TSC.
    • HPET: HPET is also a high-precision timer, but reading it is slower and has higher latency than reading the TSC. On the other hand, HPET is more stable on multiprocessor systems and under dynamic frequency adjustment, so it can be the better choice in those cases.
  2. Selection of TSC and HPET

    • DPDK uses the TSC as its default timer source because of its higher performance. HPET is optional: in the EAL sources it appears to be a build-time option that the application initializes with rte_eal_hpet_init(), rather than something DPDK falls back to automatically. It is worth considering when the TSC may be unstable, for example on multiprocessor systems or under dynamic frequency adjustment.
  3. Timer API

    • DPDK provides general-purpose timer APIs (for example rte_get_timer_cycles() and rte_get_timer_hz()) that abstract the underlying timer implementation (TSC or HPET), so that applications can measure time and schedule work without caring which timer is underneath.

In summary, TSC and HPET are the two time sources available in DPDK. Which one to use depends mainly on performance requirements and on how stable the TSC is on the target platform. Applications normally go through the general timer API and do not need to deal with the two timers directly.

3. Detailed summary of the Time Stamp Counter (TSC)

(1) Advantages:

  • High precision: The TSC register on each CPU core advances at (or close to) the CPU clock rate, so it provides very fine-grained time measurement.
  • Low latency: Compared with other timers (such as HPET), the TSC is faster to read and has lower latency.
  • Wide support: The vast majority of modern processors have a TSC, making it a general-purpose timing solution.

(2) Disadvantages:

  • Synchronization issues: On multi-core processors and multiprocessor systems, the TSCs on different CPU cores may not be perfectly synchronized, which makes readings taken on different cores inconsistent.
  • Dynamic frequency adjustment: On processors without a constant-rate TSC, dynamic adjustment of the CPU frequency (Intel SpeedStep, AMD Cool'n'Quiet, etc.) changes the rate at which the TSC advances, which affects accuracy.
  • Virtualized environments: In a virtual machine, the behavior of the TSC may be influenced by the hypervisor, resulting in inaccurate time measurements.

(3) How to use:

  • Read the TSC: The current TSC value is read by executing the RDTSC (Read Time-Stamp Counter) instruction. In C/C++, this can be done with inline assembly or with a built-in function provided by the compiler, such as __rdtsc().
  • Calculate the time difference: Reading the TSC at two points in the program gives the number of TSC cycles between them. Dividing that number by the TSC frequency (in Hz) gives the elapsed time (in seconds).
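The conversion described in the last bullet can be sketched in a few lines of C. This is only an illustration: the function name tsc_delta_to_ns is made up for this example, and the counter frequency tsc_hz is assumed to be known already (from the kernel, CPUID, or a calibration run).

```c
#include <stdint.h>

/*
 * Convert the difference between two counter readings into nanoseconds.
 * tsc_hz is the counter frequency in Hz; the caller is assumed to have
 * obtained it elsewhere. The name of this helper is hypothetical.
 */
static uint64_t tsc_delta_to_ns(uint64_t start, uint64_t end, uint64_t tsc_hz)
{
    uint64_t cycles = end - start;      /* unsigned subtraction tolerates one wrap */
    uint64_t secs   = cycles / tsc_hz;  /* whole seconds */
    uint64_t rem    = cycles % tsc_hz;  /* leftover cycles, always < tsc_hz */

    /* rem * 1e9 fits in 64 bits as long as tsc_hz is below roughly 18 GHz */
    return secs * 1000000000ULL + rem * 1000000000ULL / tsc_hz;
}
```

Splitting the value into whole seconds and a remainder avoids the overflow that a plain `cycles * 1000000000 / tsc_hz` would hit after a few seconds of counting.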

(4) Matters needing attention:

  • Ensure synchronization: Before relying on the TSC, check that your processor provides a synchronized TSC. Query the relevant fields of the CPUID instruction or consult the processor specification document.
  • Consider dynamic frequency scaling: When using the TSC on a processor that is affected by dynamic frequency scaling, either set the processor to run at a fixed frequency, or use a timer that is not affected by it (such as HPET).
  • Multiprocessor systems: On multiprocessor systems, use a timer suited to them (such as HPET), or use software techniques to compensate for TSC desynchronization between processors.
  • Out-of-order execution: rdtsc is not a serializing instruction, so it has to be paired with cpuid or lfence to make sure that the instructions being measured have actually finished executing when the counter is read. Later CPUs provide the rdtscp instruction, which waits for all earlier instructions to execute before reading the counter and can therefore stand in for cpuid + rdtsc at the end of a measured region. The execution time of cpuid itself fluctuates, whereas rdtscp is more stable.
  • Multi-core systems: Newer CPUs support the Invariant TSC feature, and on them the TSC seen by each core is normally consistent. Without it, the measurement code must not be migrated to another core while it runs.
  • Interference: Timing measurements are easily disturbed (thread scheduling, preemption, interrupts, virtualization, etc.), so keep the measured instruction sequence as short as possible and repeat the measurement several times.
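The points about out-of-order execution and repeated measurement can be combined into a small harness. The sketch below is one possible arrangement, not a reference implementation: it fences around rdtsc at the start, uses rdtscp at the end, and keeps the minimum over several runs. The names cycles_begin, cycles_end, min_cycles and sample_work are invented for this example, and the aarch64 branch (which counts system-counter ticks rather than CPU cycles) is added only so that the sketch builds on both architectures discussed in this article.

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>

/* Start of a measured region: fence, read, fence. */
static inline uint64_t cycles_begin(void)
{
    _mm_lfence();                /* earlier instructions have completed */
    uint64_t t = __rdtsc();
    _mm_lfence();                /* measured code cannot start before the read */
    return t;
}

/* End of a measured region: rdtscp waits for earlier instructions. */
static inline uint64_t cycles_end(void)
{
    unsigned int aux;            /* receives IA32_TSC_AUX, unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();                /* later instructions cannot move above the read */
    return t;
}
#elif defined(__aarch64__)
/* On arm64 this reads the generic counter, i.e. ticks, not CPU cycles. */
static inline uint64_t cycles_begin(void)
{
    uint64_t t;
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(t) : : "memory");
    return t;
}

static inline uint64_t cycles_end(void)
{
    return cycles_begin();
}
#else
#error "this sketch only covers x86_64 and aarch64"
#endif

typedef void (*work_fn)(void);

/* Run `work` `runs` times and return the smallest observed cost.
 * Returns UINT64_MAX when runs is 0. */
static uint64_t min_cycles(work_fn work, unsigned int runs)
{
    uint64_t best = UINT64_MAX;

    for (unsigned int i = 0; i < runs; i++) {
        uint64_t start = cycles_begin();
        work();
        uint64_t cost = cycles_end() - start;
        if (cost < best)
            best = cost;
    }
    return best;
}

/* Hypothetical workload used only to exercise the harness. */
static void sample_work(void)
{
    volatile unsigned int sink = 0;

    for (unsigned int i = 0; i < 1000; i++)
        sink += i;
}
```

Calling `min_cycles(sample_work, 100)` gives a lower bound on the cost of the workload. Taking the minimum rather than the mean is a design choice that discards runs disturbed by interrupts or preemption; pinning the thread to one core is still advisable on machines without a synchronized TSC.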

(5) Frequently asked questions:

  • TSC out of sync: On multi-core processors and multiprocessor systems, the TSCs on different CPU cores may not be perfectly synchronized. Try to correct for this in software, or use another timer (such as HPET).
  • Dynamic frequency adjustment: Dynamic adjustment of the CPU frequency may make the TSC advance at varying rates. Set the processor to run at a fixed frequency, or use a timer (such as HPET) that is not affected by it.
  • Virtualized environments: The TSC in a virtual machine may be affected by the hypervisor. In a virtualized environment it is recommended to use a virtualization-friendly clock provided by the hypervisor (such as kvm-clock in KVM).

4. History of TSC development

The TSC on the earliest CPUs had several drawbacks:

  • Its rate followed the CPU frequency, and in some deep C-states it stopped counting altogether.
  • Under the SMP architecture the cores were not synchronized: the TSC value on one core differed from the value on the others, and the tick rate could differ as well.

Later, Intel enhanced it (the enhancements show up in the CPU feature flags):

  • constant_tsc: The TSC ticks at a fixed frequency that is independent of the current CPU frequency.
  • nonstop_tsc: The TSC keeps counting after the CPU enters a C-state.

The combination of these two features is called invariant TSC: the TSC ticks at an ideal constant frequency, which is what one expects of a clock.
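On Linux these two flags appear in the "flags" line of /proc/cpuinfo. The helper below is a minimal sketch of checking for them in such a line; the function names are made up for this example, and reading the file itself (for instance with fgets) is left to the caller. Matching whole words matters here, because "tsc" is also a substring of "constant_tsc" and "rdtscp".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if `flag` appears as a whole word in a "flags : ..." line. */
static bool flags_line_has(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = line;

    if (n == 0)
        return false;

    while ((p = strstr(p, flag)) != NULL) {
        bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        bool ends   = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';

        if (starts && ends)
            return true;
        p += 1;                  /* keep searching after a partial match */
    }
    return false;
}

/* Invariant TSC in the sense used above: constant_tsc plus nonstop_tsc. */
static bool flags_line_invariant_tsc(const char *line)
{
    return flags_line_has(line, "constant_tsc") &&
           flags_line_has(line, "nonstop_tsc");
}
```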

Whether the TSCs are out of sync under the SMP architecture is judged by the kernel:

  • When the Linux kernel boots, it checks whether the TSCs are synchronized and tries to calibrate the TSCs on multiple cores so that they run at the same frequency from the same starting value.
  • Adjusting the TSC by writing to an MSR register requires CPU support. Currently, only Intel CPUs are assumed by default to be synchronized across cores.

The frequency of the TSC can be obtained in the following ways:

  • Calculated from register values returned by CPUID (newer CPUs only).
  • Calculated from the value of an MSR register; which register to read depends on the CPU model.
  • Read from the tsc_khz symbol exported by the kernel.

The following command reads the kernel symbol:

bpftrace -e 'BEGIN { printf("%u\n", *kaddr("tsc_khz")); exit(); }'

In addition, the TSC frequency that the kernel computes and adjusts is not necessarily identical to the value derived from the hardware registers, because the kernel calibrates it.
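For the CPUID method, the arithmetic is simple once the three values of leaf 0x15 are available: EAX holds the denominator and EBX the numerator of the TSC/crystal ratio, and ECX holds the crystal frequency in Hz. The sketch below shows only that arithmetic; the function name is invented here, and the numbers in the usage note are illustrative values, not measurements.

```c
#include <stdint.h>

/*
 * TSC frequency from the three values of CPUID leaf 0x15.
 * Returns 0 when any of them is zero, which means the leaf does not
 * describe the TSC frequency on this CPU.
 */
static uint64_t tsc_hz_from_cpuid15(uint32_t denominator,
                                    uint32_t numerator,
                                    uint32_t crystal_hz)
{
    if (denominator == 0 || numerator == 0 || crystal_hz == 0)
        return 0;

    /* Multiply first, in 64 bits, so the ratio is not truncated. */
    return (uint64_t)crystal_hz * numerator / denominator;
}
```

For example, a 24 MHz crystal with a ratio of 250/2 gives `tsc_hz_from_cpuid15(2, 250, 24000000)`, i.e. 3,000,000,000 Hz. Multiplying before dividing keeps ratios such as 249/2 from being rounded down to 124.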

5. In-depth analysis of TSC multi-core clock

Are the results of rdtsc synchronized between the cores of one processor, and between cores of different processors? If they are not, timing results taken on different cores cannot be compared with each other.

Intel's official manual is not explicit on this point. It says:

The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX[8].
The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.

This only says that the TSC increments at a nominal rate in any (power) state of the CPU. It does not explicitly state that the TSC stays in step across multiple cores, let alone across multiple processors.
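The CPUID bit named in the quotation can be queried from user space. The following is a minimal sketch using the __get_cpuid() helper from the GCC/Clang <cpuid.h> header; the two function names are made up for this example, and on non-x86 builds the check simply reports false.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Bit 8 of EDX from CPUID leaf 0x80000007 is the invariant TSC flag. */
static bool edx_reports_invariant_tsc(uint32_t edx)
{
    return ((edx >> 8) & 1u) != 0;
}

/* True when the running CPU reports CPUID.80000007H:EDX[8]. */
static bool cpu_has_invariant_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_reports_invariant_tsc(edx);
#else
    return false;                /* not an x86 CPU: no TSC to speak of */
#endif
}
```

As the text notes, a true result only says that the rate is constant; it is not by itself a statement about synchronization between cores or sockets.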

When the Linux kernel boots, its handling of the TSC (Time Stamp Counter) clock can be divided into the following steps:

  1. Detect TSC features: The kernel first uses the CPUID instruction to check whether the CPU has a TSC. If it does, the kernel goes on to check other TSC-related features, such as:

    • Invariant TSC: the TSC runs at a constant rate in all power states, unaffected by CPU frequency changes and power-management events.
    • Constant TSC: the TSC rate does not follow CPU frequency changes, but the counter may still stop in deep C-states unless nonstop_tsc is also present.
  2. Calibrate the TSC: To convert TSC cycles into real time, the kernel needs to know the TSC frequency. During boot, the kernel calibrates the TSC against a reference timer: it counts TSC increments over a known time interval and derives the frequency from them.

  3. Select a clock source: The Linux kernel supports several clock sources, such as the TSC, HPET (High Precision Event Timer) and the ACPI Power Management Timer. At boot, the kernel picks the best one based on the accuracy and performance of the clock sources available. If the TSC is invariant or constant and has shown good accuracy, the kernel may make it the default clock source.

  4. Synchronize TSCs on multiprocessor systems: On multiprocessor systems, the kernel needs the TSCs on all processors to agree. It uses specific checks and synchronization algorithms (such as measuring clock skew) to try to keep the TSC values on different processors consistent. This synchronization is not always perfect, so care is needed when using the TSC on multiprocessor systems.

  5. Initialize the scheduling clock: During boot the kernel also initializes the scheduling clock, which the scheduler uses to decide when to run processes and threads. If the TSC is selected as the default clock source, the kernel uses it to initialize and maintain the scheduling clock.
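The outcome of step 3 can be inspected from user space: on Linux the kernel exposes the selected clock source through sysfs. The sketch below reads it; the helper name read_first_line and the macro CLOCKSOURCE_PATH are made up for this example, and the helper returns -1 when the file is missing, as it may be inside a container or on another operating system.

```c
#include <stdio.h>
#include <string.h>

/* sysfs file holding the name of the clock source currently in use. */
#define CLOCKSOURCE_PATH \
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"

/*
 * Read the first line of `path` into `buf` (at most len - 1 characters)
 * and strip the trailing newline. Returns 0 on success, -1 on failure.
 */
static int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

A caller would pass CLOCKSOURCE_PATH and print the result; typical contents are "tsc", "hpet" or "kvm-clock". The neighbouring file available_clocksource lists the alternatives the kernel knows about.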

The kernel code that checks whether the TSC is synchronized is as follows (x86 architecture):

/*
 * Make an educated guess if the TSC is trustworthy and synchronized
 * over all CPUs.
 */
int unsynchronized_tsc(void)
{
	if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_unstable)
		return 1;

#ifdef CONFIG_SMP
	if (apic_is_clustered_box())
		return 1;
#endif

	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
		return 0;

	if (tsc_clocksource_reliable)
		return 0;
	/*
	 * Intel systems are normally all synchronized.
	 * Exceptions must mark TSC as unstable:
	 */
	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) {
		/* assume multi socket systems are not synchronized: */
		if (num_possible_cpus() > 1)
			return 1;
	}

	return 0;
}

The following can be read from this code:

  • If cpuinfo shows the constant_tsc flag (and the TSC has not been marked unstable), the kernel assumes the TSC is synchronized, both between cores of the same CPU and between cores of different CPUs.
  • On an Intel CPU without the constant_tsc flag, the function still returns 0: as the comment says, Intel systems are normally all synchronized, and exceptions must mark the TSC as unstable elsewhere. This is an educated guess by the kernel rather than a guarantee, so on multi-socket machines it is safer not to rely on it.
  • The comment "assume multi socket systems are not synchronized" sits in the non-Intel branch: on a non-Intel system with more than one possible CPU, and with neither constant_tsc nor tsc_clocksource_reliable set, the TSC is treated as unsynchronized.

Non-Intel x86 platforms are a different story. Apart from the constant_tsc and tsc_clocksource_reliable cases above, the current Linux kernel treats all non-Intel SMP systems as having unsynchronized TSCs. See the unsynchronized_tsc code in tsc.c. LKML also has the AMD documents.

6. "TSC" clock for ARM architecture

The ARM64 architecture (also known as the ARMv8-A architecture) provides a component called the System Counter, a monotonically increasing counter used for precise time measurement and scheduling on ARM64 systems. It plays an important role in the architecture because it gives the operating system and applications a stable and reliable time base.

The system counter has the following key features:

  1. 64-bit monotonically incrementing counter: The system counter is a register up to 64 bits wide that increments at a fixed frequency. It keeps counting regardless of system events such as power-management events or processor sleep states.

  2. Global synchronization: The system counter is shared by all cores and processors in the system. In a multiprocessor system, no additional synchronization mechanism is needed to keep the counter values consistent.

  3. Architecture-based access: The ARM64 architecture defines a set of system registers, read with the MRS instruction, through which the operating system and applications can access the system counter directly. These registers include:

    • CNTFRQ_EL0: holds the frequency of the system counter, used to convert the counter value into real time.
    • CNTPCT_EL0: holds the current physical count of the system counter.
    • CNTVCT_EL0: holds the current virtual count (the physical count minus a virtual offset); this is the register read by the code later in this article.
  4. Exception level access control: The Exception Level (EL) mechanism of the ARM64 architecture lets the operating system control access to the system counter. For example, an operating system can allow user-level applications (running at EL0) to read the counter, or restrict access to the kernel level (EL1 or higher).

  5. Virtualization support: The ARM64 architecture also supports the system counter in virtualized environments. The hypervisor can configure a virtual counter for each virtual machine, so that guests get timing functions similar to those of the physical counter.

The Linux kernel sources use this counter on the ARM64 architecture to implement a "TSC clock":

u64 rdtsc(void)
{
	u64 val;

	/*
	 * According to ARM DDI 0487F.c, from Armv8.0 to Armv8.5 inclusive, the
	 * system counter is at least 56 bits wide; from Armv8.6, the counter
	 * must be 64 bits wide.  So the system counter could be less than 64
	 * bits wide and it is attributed with the flag 'cap_user_time_short'
	 * is true.
	 */
	asm volatile("mrs %0, cntvct_el0" : "=r" (val));

	return val;
}
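The comment in this function mentions that the counter may be narrower than 64 bits. When a difference is computed between two raw readings of such a counter, masking to the counter width keeps the result correct across a wrap. The helper below is a small sketch of that idea; counter_delta is a made-up name, and the width would have to come from knowledge of the platform (56 bits is the minimum quoted in the comment above). In practice the counter takes a very long time to wrap, so this is mostly a matter of correctness.

```c
#include <stdint.h>

/*
 * Difference between two readings of a counter that is `width_bits`
 * wide and wraps to zero. Correct as long as the counter wrapped at
 * most once between the two readings.
 */
static uint64_t counter_delta(uint64_t start, uint64_t end,
                              unsigned int width_bits)
{
    /* Shifting a 64-bit value by 64 is undefined, so handle it apart. */
    uint64_t mask = (width_bits >= 64)
                        ? UINT64_MAX
                        : ((UINT64_C(1) << width_bits) - 1);

    return (end - start) & mask;
}
```

With a 56-bit counter, a reading of 0x10 taken shortly after a reading of 0x00FFFFFFFFFFFFF0 yields a delta of 0x20 ticks instead of a huge bogus value.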

The frequency of the system counter generally does not exceed 100 MHz, so its resolution generally does not reach the CPU-cycle level.

Therefore, you can also use PMCCNTR_EL0 from the PMU registers (the kernel must enable access to it); reading this register tells you how many cycles the current CPU has run.

7. Actual Code Demonstration

7.1 x86_64: DPDK code for obtaining the TSC frequency and cycle count

The easiest way is to get the frequency from the dmesg messages:

onceday->~:# dmesg  |grep tsc
[    0.000001] tsc: Detected 2995.199 MHz processor

The frequency is also related to the BogoMIPS value shown by lscpu: the TSC frequency in MHz is one half of it:

BogoMIPS:            5990.39

Secondly, it can also be obtained through code. The following code comes from dpdk:

/* SPDX-License-Identifier: BSD-3-Clause
 * Copyright(c) 2017 Intel Corporation
 */
#include <stdio.h>
#include <stdint.h>

#include <fcntl.h>
#include <unistd.h>
#include <cpuid.h>

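/* Extract the CPU model number from CPUID.1:EAX (family/model/stepping). */
static unsigned int rte_cpu_get_model(uint32_t fam_mod_step)
{
    uint32_t family, model, ext_model;

    family = (fam_mod_step >> 8) & 0xf;
    model  = (fam_mod_step >> 4) & 0xf;

    /* Families 6 and 15 extend the model with the extended-model field. */
    if (family == 6 || family == 15) {
        ext_model = (fam_mod_step >> 16) & 0xf;
        model += (ext_model << 4);
    }

    return model;
}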

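/* Read one MSR of CPU 0 through the msr driver (needs root and /dev/cpu/0/msr). */
static int32_t rdmsr(int msr, uint64_t *val)
{
    int fd;
    int ret;

    fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0)
        return fd;

    /* The MSR number is used as the file offset. */
    ret = pread(fd, val, sizeof(uint64_t), msr);

    close(fd);

    return ret;
}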

static uint32_t check_model_wsm_nhm(uint8_t model)
{
    switch (model) {
    /* Westmere */
    case 0x25:
    case 0x2C:
    case 0x2F:
    /* Nehalem */
    case 0x1E:
    case 0x1F:
    case 0x1A:
    case 0x2E:
        return 1;
    }

    return 0;
}

static uint32_t check_model_gdm_dnv(uint8_t model)
{
    switch (model) {
    /* Goldmont */
    case 0x5C:
    /* Denverton */
    case 0x5F:
        return 1;
    }

    return 0;
}

uint64_t get_tsc_freq_arch(void)
{
    uint64_t tsc_hz = 0;
    uint32_t a, b, c, d, maxleaf;
    uint8_t  mult, model;
    int32_t  ret;

    /*
     * Time Stamp Counter and Nominal Core Crystal Clock
     * Information Leaf
     */
    maxleaf = __get_cpuid_max(0, NULL);
    printf("maxleaf: %u\n", maxleaf);
    if (maxleaf >= 0x15) {
        __cpuid(0x15, a, b, c, d);

        /*
         * EAX : ratio denominator, EBX : ratio numerator, ECX : Crystal Hz.
         * Multiply before dividing so the ratio is not truncated, and
         * skip the leaf when the denominator is zero.
         */
        if (a && b && c)
            return (uint64_t)c * b / a;
    }

    __cpuid(0x1, a, b, c, d);
    model = rte_cpu_get_model(a);
    printf("model: %d\n", model);

    /* Bus clock in MHz: 133 on Nehalem/Westmere, 100 on later models. */
    if (check_model_wsm_nhm(model))
        mult = 133;
    else if ((c & bit_AVX) || check_model_gdm_dnv(model))
        mult = 100;
    else
        return 0;

    printf("mult: %d\n", mult);

    /* MSR 0xCE: bits 15:8 hold the ratio applied to the bus clock. */
    ret = rdmsr(0xCE, &tsc_hz);
    if (ret < 0)
        return 0;

    return ((tsc_hz >> 8) & 0xff) * mult * 1E6;
}

/** C extension macro for environments lacking C11 features. */
#if !defined(__STDC_VERSION__) || __STDC_VERSION__ < 201112L
#define RTE_STD_C11 __extension__
#else
#define RTE_STD_C11
#endif

uint64_t rte_rdtsc(void)
{
    union {
        uint64_t tsc_64;

        // RTE_STD_C11

        struct {
            uint32_t lo_32;
            uint32_t hi_32;
        };
    } tsc;

#ifdef RTE_LIBRTE_EAL_VMWARE_TSC_MAP_SUPPORT
    if (unlikely(rte_cycles_vmware_tsc_map)) {
        /* ecx = 0x10000 corresponds to the physical TSC for VMware */
        asm volatile("rdpmc" : "=a"(tsc.lo_32), "=d"(tsc.hi_32) : "c"(0x10000));
        return tsc.tsc_64;
    }
#endif

    /* rdtsc returns the low 32 bits in EAX and the high 32 bits in EDX. */
    asm volatile("rdtsc" : "=a"(tsc.lo_32), "=d"(tsc.hi_32));
    return tsc.tsc_64;
}

uint64_t rte_get_tsc_cycles(void)
{
    return rte_rdtsc();
}

/**
 * Macro to align a value to the multiple of given value. The resultant
 * value will be of the same type as the first parameter and will be no lower
 * than the first parameter.
 */
#define RTE_ALIGN_MUL_CEIL(v, mul) \
    ((((v) + (typeof(v))(mul)-1) / ((typeof(v))(mul))) * (typeof(v))(mul))

/**
 * Macro to align a value to the multiple of given value. The resultant
 * value will be of the same type as the first parameter and will be no higher
 * than the first parameter.
 */
#define RTE_ALIGN_MUL_FLOOR(v, mul) (((v) / ((typeof(v))(mul))) * (typeof(v))(mul))

/**
 * Macro to align value to the nearest multiple of the given value.
 * The resultant value might be greater than or less than the first parameter
 * whichever difference is the lowest.
 */
#define RTE_ALIGN_MUL_NEAR(v, mul)                     \
    ({                                                 \
        typeof(v) ceil  = RTE_ALIGN_MUL_CEIL(v, mul);  \
        typeof(v) floor = RTE_ALIGN_MUL_FLOOR(v, mul); \
        (ceil - (v)) > ((v)-floor) ? floor : ceil;     \
    })

#define CYC_PER_10MHZ 1E7

int main(void)
{
    uint64_t start, end, mhz;
    uint64_t tsc_hz = get_tsc_freq_arch();

    printf("tsc_hz: %lu\n", tsc_hz);

    /* Fall back to a manual estimate: count TSC cycles across a sleep. */
    start = rte_get_tsc_cycles();
    sleep(1);
    end = rte_get_tsc_cycles();
    printf("start_clock: %lu\n", start);
    printf("end_clock: %lu\n", end);
    printf("diff_clock: %lu\n", end - start);

    /* Round to the nearest 10 MHz. 1E7 ~ 10 MHz */
    mhz = (end - start);
    mhz = RTE_ALIGN_MUL_NEAR(mhz, CYC_PER_10MHZ);
    printf("mhz: %lu Mhz\n", (uint64_t)(mhz/1E6));
    return 0;
}

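After compiling and running, the output is as follows. The frequency could not be obtained from CPUID or the MSR (tsc_hz is 0), so only the manual measurement is available. The diff_clock shown corresponds to about ten seconds of counting at roughly 2390 MHz, so this run was apparently made with a longer sleep than the sleep(1) in the listing: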

ubuntu->performance:$ ./tsc.out 
maxleaf: 13
model: 94
mult: 100
tsc_hz: 0
start_clock: 30482663960076192
end_clock: 30482687903183604
diff_clock: 23943107412
mhz: 2390 Mhz

7.2 ARM64: obtaining the "TSC" (System Counter)

On ARM64 it is more convenient to obtain the counter value and its frequency: both are read directly from system registers with inline assembly.

/** Read generic counter frequency */
static uint64_t __rte_arm64_cntfrq(void)
{
    uint64_t freq;

    asm volatile("mrs %0, cntfrq_el0" : "=r"(freq));
    return freq;
}

/** Read generic counter */
static uint64_t __rte_arm64_cntvct(void)
{
    uint64_t tsc;

    asm volatile("mrs %0, cntvct_el0" : "=r"(tsc));
    return tsc;
}

Origin blog.csdn.net/Once_day/article/details/130959295