Top-down methodology

The Top-down methodology is a software performance analysis technique proposed by Intel.

The PMU of an x86 processor generally provides 8 PMCs. 4 of them are fixed PMCs whose monitored events cannot be changed; the events monitored by the other 4 PMCs are configurable.

Motivation for the Top-down methodology

Through the PMCs, software engineers can obtain the number of occurrences of various events inside the processor (instructions retired, branch mispredictions, cache misses, etc.). Based on these events, they can infer the processor's behavior.

First, software engineers can obtain the number of clock cycles taken by program execution through the PMCs. On x86, the cycle count is one of the fixed PMCs.

Next, they can obtain the execution counts of various types of instructions (memory access, branch, integer arithmetic, floating-point arithmetic, vector arithmetic, etc.) through the PMCs.

Finally, they can obtain the number of events that may cause the pipeline to stall. A pipeline stall means the processor cannot supply instructions to the execution units. Typical events that cause pipeline stalls include cache misses, TLB misses, branch mispredictions, insufficient issue ports, and insufficient out-of-order resources.

To improve performance, modern processors rely on superscalar, out-of-order, and speculative execution. As a result, isolated statistics can no longer reflect the overall performance bottleneck of a program. Specifically:

  • Superscalar error. In an in-order, single-issue processor, one stall event stalls the entire pipeline. In a superscalar processor, however, multiple instructions can be issued or processed at the same time, so one stalled instruction does not mean the entire pipeline is stalled. Simply counting the events that may cause stalls therefore no longer reflects the actual pipeline stalls.
  • Events that may cause pipeline stalls overlap. For example, the front end and back end of the processor may stall at the same time, in which case the effects of the two stall events overlap. Similarly, stalls caused by register dependencies and memory-access dependencies can overlap.
  • The predefined event set misses some pipeline stalls. Obtaining the number of stall cycles by accumulating the product of each stall event and its stall cycles requires a predefined set of stall events. This predefined set may miss rare or unexpected stalls, introducing errors.
  • Speculative execution. Speculatively executed instructions are not necessarily instructions that the program actually needs. Even if the pipeline does not stall, executing mis-speculated instructions still wastes processor performance.

The keywords of the Top-down methodology are "pie chart" and "hierarchical". Top-down first looks for a quantifiable observation point in the microarchitecture and determines a classification unit (clock cycle, slot, bandwidth, etc.).

The selected units are then classified, and the proportion of each category forms a "pie chart" (as shown in Figure 2), for example whether each clock cycle is idle or used. Each resulting category can then be refined further, or a new observation point and unit can be chosen and classified in turn, thus forming a hierarchical structure (a decision tree).

In Intel's decision tree, the classifications of the first layer (L1) and the second layer (L2) are relatively macroscopic and can be regarded as architecture-level indicators. The L1 and L2 metrics use the same unit and the same observation point. Therefore, ignoring measurement error, the L1 metrics should sum to 1; the L2 metrics should sum to 1; and the L2 metrics belonging to one category should sum to no more than that category's L1 metric. L1 and L2 can therefore be analyzed together. In Figure 2, the L1 and L2 metrics are drawn in the same pie chart.

Starting from the third layer (L3), metrics show clear microarchitectural characteristics. L3 metrics do not use a unified unit and observation point: L3 metrics under the same L2 category use the same unit and observation point, L3 metrics under different L2 categories generally use different ones, and an L3 metric does not necessarily inherit the unit and observation point of its parent L2 category. L3 metrics therefore cannot be compared across the whole tree and can only be analyzed within a given L2 category. The same holds for all layers below L3.

The first layer of Top-down methodology

The processor pipeline is generally divided into two parts: the front end (frontend) and the back end (backend). In the front end instructions flow in order, while in the back end instructions flow concurrently and out of order. In the first layer of the Top-down methodology, the observation point is set at the split between the front end and the back end of the pipeline. Figure 4 shows the microarchitecture block diagram of Skylake. In the figure, a dotted line marks the boundary between the in-order part and the out-of-order (OOO) part, i.e. from Allocate/Rename/Retire (generally the RAT and ROB) to the Scheduler (generally the RS). Before this boundary the instruction stream is in order; after it the instruction stream becomes concurrent and out of order.

However, the Top-down observation point is not placed directly on the dotted line but at position 4 in Figure 4, i.e. where instructions pass from the decode unit to the register renaming unit (RAT). This position is also called dispatch. The reason is that the logic of the RAT and ROB is affected by the out-of-order engine. As shown in the figure, there are three arrows on the left side of the Allocate/Rename/Retire block, coming from the Load buffer, Store buffer and Reorder buffer respectively. These are important resources that the out-of-order execution engine must maintain and depend on, so pipeline stalls due to a shortage of these resources should be attributed to the out-of-order engine.

The Top-down methodology sets the first-layer observation point at the dispatch position. Before this position, instructions flow through the pipeline in program order; after this position, the flow of instructions is affected by the out-of-order execution engine, for example by insufficient out-of-order buffer resources, unsatisfied register dependencies, and unresolved address dependencies.

In a superscalar processor, multiple instructions can be dispatched in parallel at this position, so using the clock cycle as the classification unit would reintroduce the superscalar error described above. The Top-down methodology therefore selects the slot as the classification unit. A slot represents the opportunity to dispatch one instruction in one cycle, and the total number of slots is the product of the dispatch width and the number of cycles.
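
For a 4-wide machine such as Skylake, for example, the total number of slots over a measurement interval is commonly computed from the fixed cycle counter as:

Total_Slots = 4 * CPU_CLK_UNHALTED.THREAD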

The classification of slots is shown in Figure 5. According to whether a slot is occupied by an instruction, slots fall into two categories. Slots occupied by instructions are further divided into retiring and bad-speculation, according to whether the instruction retires normally or turns out to be mis-speculated. Slots not occupied by instructions are further divided into frontend-bound and backend-bound, according to the cause of the idleness: if the backend cannot accept the instructions provided by the frontend, the idle slot is classified as backend-bound; if the frontend cannot supply instructions, the idle slot is classified as frontend-bound.

Intel stipulates that when front-end and back-end stalls occur at the same time, the slot is counted as a back-end stall, because optimizing back-end stalls is more important on existing processors.
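
As an illustration of how the four level-1 categories are derived from raw counters, below is a minimal Python sketch of the classic formulas published by Intel, assuming a 4-wide core and Skylake-class event names (newer cores expose dedicated TOPDOWN.* events, and pmu-tools uses its own metric database rather than this code):

    # Minimal sketch of the Top-down level-1 breakdown (classic 4-wide formulas).
    # Arguments are raw counts of Skylake-class events.
    def topdown_l1(cycles, uops_issued, uops_retired_slots,
                   idq_uops_not_delivered, recovery_cycles, width=4):
        slots = width * cycles                       # total dispatch opportunities
        frontend_bound = idq_uops_not_delivered / slots
        bad_speculation = (uops_issued - uops_retired_slots
                           + width * recovery_cycles) / slots
        retiring = uops_retired_slots / slots
        backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
        return {"Frontend_Bound": frontend_bound,
                "Bad_Speculation": bad_speculation,
                "Retiring": retiring,
                "Backend_Bound": backend_bound}

The four fractions sum to 1, matching the level-1 pie chart described above.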

Impact of Top-down on PMC Design

Considering that the measurement time is much longer than the depth of the processor pipeline, the workaround adopted here is to use the number of retired instructions directly to represent the slots occupied by normally retired instructions, ignoring the error caused by the misalignment of pipeline stages.

For example, to analyze the utilization of the execution units, the clock cycle can be taken as the classification unit and each cycle classified by the number of instructions issued to the execution units in that cycle: 0, 1, 2, ..., n instructions.

To analyze the bandwidth of a memory unit, the bandwidth can be taken as the classification unit and classified by the reason the bandwidth is consumed: Load, Store, Refill, and Evict.

Top-down tool: pmu-tools

Intel's Top-down tool is called pmu-tools. pmu-tools can only be used on Linux (because it calls the Linux perf tool). It is written in Python 3 and requires no additional installation steps: after downloading the code it can be executed directly. Intel's performance analysis suite VTune also integrates the Top-down method.

The Top-down tool uses perf to obtain low-level information from the physical machine. Perf is a powerful performance analysis tool for the Linux platform. Its most commonly used features include:

  • perf record: reports the execution time of each function of the program, so as to locate the program's hot spots.
  • perf stat: reads the values of the hardware PMCs during the program's execution.

The Top-down tool uses perf stat to obtain PMC statistics. The tool generates a list of PMCs to monitor according to the target CPU, and then uses perf stat to read the values of these PMCs.
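
For reference, reading a few counters manually with perf stat looks like the following (the events here are generic perf aliases used purely as an illustration, not the event list that toplev actually generates):

perf stat -e cycles,instructions,branch-misses,cache-misses ./a.out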

The number of PMCs used by Top-down is very large, but the hardware provides only a few PMC interfaces and can count only 4 PMCs at a time. To deal with this, perf stat divides the execution time into segments and monitors the PMCs in rotation: it first monitors the first 4 PMCs in the list, then switches to the next 4 after a period of time. This approach is called multiplexing.

After the program finishes running, each PMC count is scaled according to how long the PMC was actually monitored relative to the total execution time: final_count = raw_count * time_enabled / time_running.
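
For example (with illustrative numbers): if a counter was actually running on the hardware for only 25% of the time it was enabled, a raw count of 1,000,000 is reported as 1,000,000 * 1 / 0.25 = 4,000,000.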

toplev uses multiplexing by default. Therefore, when using the Top-down tool for analysis, the execution time of the program should not be too short; otherwise not all PMCs can be counted.

If the test program really is that short, multiple runs can be used instead: the program under test is run several times, measuring only a few PMCs each time. This method is called no-multiplex and is enabled via the --no-multiplex command-line option.

metrics

The metrics that the Top-down methodology measures and analyzes are defined in pmu-tools. These metrics are computed from the raw hardware performance counters to obtain more intuitive and targeted indicators. For example, the metric Metric_L1MPKI represents the number of L1 cache misses per thousand instructions, calculated as follows:

Metric_L1MPKI = 1000 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY

In the formula, MEM_LOAD_RETIRED.L1_MISS and INST_RETIRED.ANY are both hardware PMCs; they represent the number of retired (non-speculative) load instructions that missed the L1 cache and the number of all retired (non-speculative) instructions, respectively.
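
As an illustrative calculation, a run that retires 1,000,000 instructions, 5,000 of whose retired loads miss the L1 cache, gives Metric_L1MPKI = 1000 * 5000 / 1000000 = 5.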

The metrics used in pmu-tools can be divided into two parts: decision-tree metrics and microarchitecture metrics. The decision-tree part is used to locate performance bottlenecks, such as Frontend-Bound, Backend-Bound, Retiring and Bad-Speculation.

To make the metrics easier to interpret, Intel provides a threshold for each metric. When a measured metric exceeds its threshold, pmu-tools highlights it as a hint. In the decision-tree part, engineers can follow these hints layer by layer to locate the bottleneck. After the decision-tree localization is complete, the cause of the performance bottleneck can be analyzed further through the microarchitecture metrics. The microarchitecture metrics have no constraint relationship with each other; software engineers must judge, based on the meaning of each metric, whether its value points to a performance defect.

Metric database

There are three versions of the metric definitions in pmu-tools, called V1.0, V2.0 and V4.5 in the code. Which version a newly added processor uses does not depend purely on the chronological order of processor releases but on which architecture line the processor continues. Each version contains definitions for different microarchitectures. As the microarchitectures evolve, new metrics are added to the definitions, or the formulas of existing metrics are adjusted.

V1.0 is a minimal version, applied to the Silvermont and Knights Landing microarchitectures. It provides only a 3-level decision tree (10 metrics) and 6 general metrics.

  • "slm" stands for Silvermont and Airmont microarchitecture, which belongs to the Atom series of small-core processors.
  • "knl" stands for Knights Landing, the processor architecture in the second-generation PHI Compute Card, which uses 72 Silvermont cores.

V2.0 is a simplified version, applied to the Elkhart Lake and Gracemont microarchitectures. It provides a 4-level decision tree (38 metrics in total) and 45 general metrics.

  • "ehl" stands for the Elkhart Lake microarchitecture, also known as Tremont, which also belongs to the architecture of the Atom series.
  • In V2.0, "adl" stands for the Gracemont and Enhanced Gracemont microarchitectures, used as the efficiency-core (small-core) microarchitectures of Alder Lake and Raptor Lake respectively. Gracemont is the successor of Tremont.

V4.5 is the current main version, applied to the following microarchitectures:

  • Core series: All architectures starting from the 2nd generation Core. Currently supported up to Raptor Lake.
    • "snb" stands for Sandy Bridge microarchitecture, used in the second generation of Core processors.
    • "ivb" stands for Ivy Bridge microarchitecture, which is used in the third generation of Core processors.
    • "hsw" stands for Haswell microarchitecture, used in 4th generation Core processors.
    • "bdw" stands for Broadwell microarchitecture, used in 5th generation Core processors.
    • "skl" stands for Skylake, KabyLake, CoffeeLake, Whiskey Lake, Amber Lake, and Comet Lake microarchitectures for sixth, seventh, eighth, and ninth-generation Core processors.
    • "icl" stands for Ice Lake and Rocket Lake microarchitecture, which is used for the 10th and 11th generation Core processors.
    • "tgl" stands for Tiger Lake microarchitecture, which is used in the 11th generation Core processors. Tiger Lake and Ice Lake share the same top-down design.
    • In V4.5, "adl" stands for Golden Cove and Raptor Cove microarchitectures, which are Alder Lake/Raptor Lake performance core (big core) microarchitectures respectively. For the 12th and 13th generation Core processors.
  • Xeon series:
    • "jkt" stands for Sandy Bridge microarchitecture, used in Xeon E3-1200 series, Xeon E5-2400/1400 series, Xeon E5-4600/2600/1600 series, Xeon E7-8800/4800/2800 series.
    • "ivt" stands for Ivy Bridge microarchitecture, used in Xeon E5-2400/1400 v2 series, Xeon E5-4600/2600/1600 v2 series, E7-8800/4800/2800 v2 series.
    • "hsx" stands for Haswell microarchitecture, used in Xeon E3-1200 v3 series, Xeon E5-2600/1600 v3 series.
    • "bdx" stands for Broadwell microarchitecture, used in Xeon D-1500 series and Xeon E5 v4 series.
    • "skx" stands for Skylake microarchitecture, used in Xeon Scalable processors and Xeon E3-1500m v5 series.
    • "clx" stands for Cascade Lake and Copper Lake microarchitecture, used in 2nd generation Xeon Scalable processors.
    • "icx" stands for Ice Lake microarchitecture, used in 3rd generation Xeon Scalable processors.
    • "spr" stands for Sapphire Rapids microarchitecture, used in 4th generation Xeon Scalable processors.

V4.5 provides a 4-level decision tree (109 metrics in total) and 118 general metrics. Compared with V2.0, the V4.5 decision tree differs markedly at the 3rd and 4th levels. In addition, V4.5 follows two evolution routes, one for the Core microarchitectures and one for the Xeon microarchitectures; compared with the corresponding Core version, the Xeon route mainly strengthens the analysis of the uncore part.

tool use

toplev in the tool's source tree is the executable entry point. Execute toplev --help to get the tool's command-line help.

The basic format of toplev is

toplev [options] command

Here, command specifies the command line of the program under test. Since toplev needs access to the PMCs, sudo privileges are usually required.
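
For example, measuring a test program with default settings (assuming toplev is on the PATH and using the matmul_fma binary that appears in the CSV example later in this article):

sudo toplev ./matmul_fma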

When no options are provided, the output of pmu-tools is as follows:

Figure 7 pmutool output example

The output of pmu-tools includes the following parts:

  • Hardware information (processor model, microarchitecture code, frequency), such as "# 4.5-full-perf on Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz [clx/skylake]".
  • Metrics that need attention. When no parameters are provided, the tool measures only the first level of the decision tree and outputs only the metrics worth attention (metrics the tool considers unimportant are not printed).
    • The information of each metric includes: processor core number (such as S1-C0 and S1-C0-T1), metric group (FE or BE), metric name, metric unit (Slots), and value.
    • Each metric also reports the fraction of execution time spent measuring it; for example [33.2%] indicates that measuring this metric took about 1/3 of the execution time.
    • If the metric exceeds the preset threshold, it will be marked with an arrow, for example, backend_bound is marked with an arrow.
  • Hints for further analysis, such as "Run toplev --describe Backend_Bound^ to get more information on bottleneck".

Common options for pmu-tools include:

  • -v: Output all measured metrics.
  • -m: In addition to measuring decision trees, microarchitecture metrics are also provided.
  • -l1 / -l2 / -l3 / -l4 / --all: Specifies down to which level the decision tree is measured. -l4 means measuring all metrics from L1 to L4; --all means measuring the entire decision tree.
  • -x <sep>, -o <file>: Output the measurement results to the specified file in CSV format; <sep> is the CSV separator, typically a comma (-x,).
  • --core <core>: Specifies the measured CPU. Usually used with the taskset command to bind the test task to a certain processor.
  • --no-desc: The output does not print the description of the metric, which can optimize the screen display.
  • --no-multiplex: Run the measurement multiple times instead of multiplexing the PMCs. Suitable for workloads with very short execution times.
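
Putting several of these options together, an invocation for the matmul_fma workload might look like the following (the flag spellings follow the option list above; the core identifier S0-C0 and the binding to CPU 0 via taskset are illustrative):

sudo toplev -l3 -m --no-desc --core S0-C0 -x, -o matmul_fma.csv taskset -c 0 ./matmul_fma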

When the above options are provided, the output of the tool is as follows:

Figure 8 pmutool output example

At this point, the test results are saved in the matmul_fma.csv file:

Figure 9 pmutool output csv file
