Let the CPU black box no longer be black: TMA, a top-down method for analyzing CPU microarchitecture performance bottlenecks. What & Why / How / Frontend / Speculation / Retiring (unfinished)

 What & Why

Applying the TMA method is straightforward: run it once on the target machine and you get a result like the one shown in the figure above.

TMA divides CPU activity into four categories, and each category's share can be read, roughly, as its share of pipeline resource consumption. In the best case, Retiring (retirement) is 100% and the other categories are 0%; in other words, the other three categories are what make a CPU inefficient.

When reading TMA results, focus only on the branch with the highest proportion at the first level, then trace downward level by level. For example, if the first level shows Backend Bound dominating, you need only look at its Memory subcategory, and in the end you may find that the CPU's inefficiency comes mainly from the path to DRAM.
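As a toy illustration of this drill-down rule (a sketch with made-up numbers, not output from any real run):

level1 = {"Frontend_Bound": 0.10, "Bad_Speculation": 0.05,
          "Backend_Bound": 0.55, "Retiring": 0.30}   # hypothetical result
worst = max(level1, key=level1.get)
print(worst)  # "Backend_Bound": expand only this branch next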

Industry benchmarks: SPEC CPU, for example.

Latency: delay.

For multi-threaded optimization, the most feared problem is locks.

TMA is not at the same level as traditional hotspot analysis methods. TMA analyzes software performance at the microarchitecture level.

 How

A CPU is divided internally into a front end and a back end. The front end is mainly responsible for instruction fetch, decode, and related operations, which happen in program order; the back end receives uops from the front end, executes them out of order, and retires them in order.

If a CPU has a performance bottleneck, it may appear in the front end or the back end, or it may be caused by branch misprediction.

The first level of TMA has four categories: Frontend Bound, Backend Bound, Bad Speculation, and Retiring. Retiring represents the proportion of the pipeline executing in the ideal state.

Pipeline Slots: consider a 4-wide issue CPU, i.e., the front end sends up to four uops to the back end each cycle, and the back end can likewise accept 4 uops per cycle. Each such per-cycle uop position in the pipeline is called a pipeline slot.

In the picture above, the pipeline slots are only 50% filled, i.e., 50% of them are bubbles, so Retiring can be at most 50% (if there is any Bad Speculation, Retiring will be below 50%).

In this way, we can roughly understand the first level of TMA as partitioning the pipeline slots to see how many of them are put to full use.
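A quick worked example of this slot accounting (all numbers assumed):

width, cycles = 4, 10        # a 4-wide machine observed for 10 cycles
total_slots = width * cycles # 40 slots available in total
filled = 20                  # slots actually occupied by uops
print(filled / total_slots)  # 0.5: at most 50% of slots can retire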

How to further divide unused pipeline slots?

TMA's idea: first look at the junction between the front end and the back end, i.e., the issue stage. In each cycle, each pipeline slot is in one of two states: a uop was issued, or no uop was issued.

If no uop was issued, either the front end or the back end is at fault; checking whether the back end is stalled tells you which, so the slot can be attributed accordingly.

For uops that do issue successfully, some may later fail to retire because of branch mispredictions; that work was wasted.

If no uop is issued and there is no back-end stall, the slot is attributed to the front end (Frontend Bound).

TMA does not need to inspect the CPU's state every cycle. It only needs to run the complete test program and read a handful of specific performance counter values to perform the analysis.

Why do a few aggregate statistics suffice to carry out the first-level division?

Total Slots: how many pipeline slots there are in total, obtained by multiplying the program's execution cycles by the CPU's per-cycle issue width.

Slots Retired: the number of uops successfully retired. This counter exists in most CPUs; it simply accumulates, every cycle, the number of uops retired that cycle.

Slots Issued: how many uops were successfully issued in total; this too is implemented in most CPUs.

Fetch Bubbles: the number of pipeline slots the front end fails to fill while the back end is not stalled.

Recovery Bubbles: the number of slots in which no uop is issued while the back end is not stalled, because the front end is busy with recovery (from a misprediction or machine clear).

The sum of Fetch Bubbles and Recovery Bubbles is the front-end bubbles: how many pipeline slots the front end wastes while the back end is not stalled.

By designing corresponding logic circuits in the CPU and adding performance monitoring units (PMUs), the statistics needed in the formulas above are easy to obtain.

Fetch Bubbles divided by the total number of pipeline slots is the fraction of slots wasted by the front end.

Subtracting the total number of retired uops from the total number of issued uops gives the uops that issued but never retired, i.e., wasted work; add the bubbles produced during recovery, and divide the sum by the total number of slots: that is the fraction of slots wasted by Bad Speculation.
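Putting the counters together, here is a minimal sketch of the level-1 computation; the inputs are assumed raw counter values, not reads from any particular PMU:

PIPELINE_WIDTH = 4  # the 4-wide issue assumed throughout this article

def tma_level1(cycles, slots_retired, slots_issued,
               fetch_bubbles, recovery_bubbles):
    total_slots = PIPELINE_WIDTH * cycles
    frontend_bound = fetch_bubbles / total_slots
    bad_speculation = (slots_issued - slots_retired
                       + recovery_bubbles) / total_slots
    retiring = slots_retired / total_slots
    # Whatever remains is attributed to back-end stalls.
    backend_bound = 1 - frontend_bound - bad_speculation - retiring
    return frontend_bound, bad_speculation, retiring, backend_bound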

uops: micro-operations, the low-level hardware operations. The CPU front end fetches program code expressed as architectural instructions and decodes each instruction into one or more uops.

Pipeline Slot: the hardware resources required to process one uop.

Pipeline Width: the number of pipeline slots per cycle.

Top-down metrics measured in pipeline slots, such as Frontend Bound and Backend Bound, indicate the percentage of pipeline slots stalled for the corresponding reason (front-end problems, back-end problems, and so on).

On the front-end/back-end split: everything after the micro-op (uop) queue is generally called the back end, and the bandwidth out of the micro-op queue is the CPU's issue width.

Some mechanisms used by modern out-of-order execution CPUs:

pipelining

superscalar execution

out-of-order (OOO) execution

speculation

multi-level caches

memory prefetching and disambiguation

vector operations

Frontend

Frontend Bound can be divided into two subcategories:

Latency, the proportion of slots in which no uops are sent to the back end because front-end latency is too high;

Bandwidth, the proportion of slots in which the front end cannot make full use of the 4-wide issue bandwidth, owing to insufficient decode capability.

The formulas below come from the skl_client_ratios.py file in TMAM 3.5, which holds Skylake's TMA ratio coefficients and formulas.

Descriptions of the PMC events used in the formulas can be found in Intel's Intel® 64 and IA-32 Architectures Software Developer's Manual, specifically CHAPTER 19 PERFORMANCE MONITORING EVENTS in Volume 3 (3A, 3B, 3C & 3D): System Programming Guide.

Frontend Bound

Formula: self.val = EV("IDQ_UOPS_NOT_DELIVERED.CORE", 1) / SLOTS(self, EV, 1)

In this formula, the EV function evaluates a specific PMU event, and SLOTS is another function; its definition is shown below. The Frontend Bound value is the value of the IDQ_UOPS_NOT_DELIVERED.CORE event divided by SLOTS. SLOTS is the total number of pipeline slots over the entire run, computed by multiplying the CPU's running cycles by the pipeline width.

def SLOTS(self, EV, level):
    return Pipeline_Width * CORE_CLKS(self, EV, level)

What does the IDQ_UOPS_NOT_DELIVERED.CORE event mean? Intel's manual defines it as: Count issue pipeline slots where no uop was delivered from the front end to the back end when there is no back-end stall. Simply put, it is the number of uop slots the front end fails to fill while the back end is not stalled, i.e., slots wasted for front-end reasons. Dividing this value by the total SLOTS gives the Frontend Bound proportion.

The Pipeline_Width used in the SLOTS formula is defined as a constant, 4, because Skylake's pipeline width is 4; it must be adjusted for other CPU architectures.
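For intuition, a hedged sketch of how such a formula evaluates. Here EV is any callable that returns a raw event count; the event names come from the formula above, and the counts are invented for the example (SMT off, so SLOTS reduces to Pipeline_Width * CLKS):

Pipeline_Width = 4

def SLOTS(EV):
    return Pipeline_Width * EV("CPU_CLK_UNHALTED.THREAD")  # SMT-off case

def Frontend_Bound(EV):
    return EV("IDQ_UOPS_NOT_DELIVERED.CORE") / SLOTS(EV)

counts = {"CPU_CLK_UNHALTED.THREAD": 1_000_000_000,   # invented numbers
          "IDQ_UOPS_NOT_DELIVERED.CORE": 800_000_000}
print(Frontend_Bound(counts.__getitem__))             # 0.2, i.e. 20%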

The CORE_CLKS function's formula is more involved: because modern CPUs have many mechanisms, core clocks must be computed differently in different situations.

def CORE_CLKS(self, EV, level):
    return (((EV("CPU_CLK_UNHALTED.THREAD", level) / 2) *
             (1 + EV("CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE", level) /
                  EV("CPU_CLK_UNHALTED.REF_XCLK", level)))
            if ebs_mode else
            (EV("CPU_CLK_UNHALTED.THREAD_ANY", level) / 2)
            if smt_enabled else
            CLKS(self, EV, level))

def CLKS(self, EV, level):
    return EV("CPU_CLK_UNHALTED.THREAD", level)

CORE_CLKS takes one of three values depending on ebs_mode and smt_enabled. Since ebs_mode is always False here, only smt_enabled needs to be considered for now.

SMT: Simultaneous Multithreading

Intel calls it HT (Hyper-Threading).

In Intel and AMD CPUs, each core has at most two hardware threads; that is, one core can run two threads simultaneously. When smt_enabled is True, CORE_CLKS is CPU_CLK_UNHALTED.THREAD_ANY divided by 2.

The CPU_CLK_UNHALTED.THREAD_ANY event means: Core cycles when at least one thread on the physical core is not in halt state, i.e., the number of cycles in which the core is executing any thread. The division by 2 is because, under hyper-threading, the two threads share the 4-wide issue bandwidth; if the two threads split it statically, each thread's slot count is 2 * CLKS.

This also shows that TMA can analyze hyper-threaded CPUs: with hyper-threading enabled, each logical core can be thought of as having half the original pipeline width, so its SLOTS count is halved as well.
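To make the halving concrete, a tiny sketch with assumed counter values:

THREAD_ANY = 2_000_000_000              # CPU_CLK_UNHALTED.THREAD_ANY (assumed)
core_clks = THREAD_ANY / 2              # per-logical-core clock estimate
slots_per_logical_core = 4 * core_clks  # the 4-wide pipeline is shared by 2 threads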

Frontend_Latency

Formula: self.val = Pipeline_Width * EV("IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE", 2) / SLOTS(self, EV, 2)

The IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE event is described as: Counts, on the per-thread basis, cycles when no uops are delivered to the Resource Allocation Table (RAT); that is, the number of cycles in which a thread issues no uops at all. Multiplying this cycle count by the pipeline width gives the number of slots wasted by excessive front-end latency; dividing by the total number of slots gives the corresponding proportion.
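Continuing the sketch from the Frontend Bound section (same EV convention), the latency formula scales a cycle-level count back up to slots by multiplying by the pipeline width:

def Frontend_Latency(EV):
    stalled_cycles = EV("IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE")
    return Pipeline_Width * stalled_cycles / SLOTS(EV)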

Therefore, when applying TMAM, we should avoid enabling SMT where possible; if SMT is enabled, try not to use both logical cores of one physical core at the same time. This is a small weakness of TMAM.

Six subcategories can produce Frontend Latency: ICache_Misses, ITLB_Misses, Branch_Resteers, DSB_Switches, LCP, and MS_Switches.

ICache_Misses:

First, ICache_Misses. When an instruction cache miss occurs, the front end has no instructions to decode, which inevitably means no corresponding uops are sent to the back end, creating a Frontend Latency bottleneck.

self.val = (EV("ICACHE_16B.IFDATA_STALL", 3) + 2 * EV("ICACHE_16B.IFDATA_STALL:c1:e1", 3)) / CLKS(self, EV, 3)

This can be understood simply as: the number of pipeline stall cycles caused by L1 ICache misses, divided by the total number of cycles, gives the L1 ICache miss proportion. The ICACHE_16B.IFDATA_STALL event is described as: Cycles where a code line fetch is stalled due to an L1 instruction cache miss; the legacy decode pipeline works at a 16-byte granularity. (The :c1:e1 suffix applies cmask=1 with edge detect, counting distinct stall episodes rather than stall cycles, so the 2* term apparently adds a fixed two-cycle penalty per miss episode.)

ITLB_Misses:

ITLB_Misses follows the same idea as ICache_Misses and is likewise computed at the cycle level. Why not continue at the slot level? The slot level is indeed more precise, but many low-level events are only available at the cycle level, and computing the number of wasted slots from them is difficult or impossible. In such cases, statistical analysis is done at the cycle level, which is more convenient and concise.

Formula: self.val = EV("ICACHE_64B.IFTAG_STALL", 3) / CLKS(self, EV, 3)

The ICACHE_64B.IFTAG_STALL event means: Cycles where a code fetch is stalled due to L1 instruction cache tag miss. A cache tag miss here implies the corresponding translation was not found in the ITLB, i.e., a TLB miss; the cycles lost to such tag misses, divided by the total number of cycles, give the corresponding proportion.

Branch_Resteers:

This counts the cycles consumed while the front end recovers from a mispredicted branch, as a ratio of the total number of cycles.

It may double-count with other miss categories.

self.val = (EV("INT_MISC.CLEAR_RESTEER_CYCLES", 3) + BAClear_Cost * EV("BACLEARS.ANY", 3)) / CLKS(self, EV, 3)

The INT_MISC.CLEAR_RESTEER_CYCLES event in the formula is interpreted as: Cycles the issue-stage is waiting for front-end to fetch from resteered path following branch misprediction or machine clear events; that is, the number of cycles the front end spends waiting on a resteer.

A resteer is the recovery process after a branch misprediction.

BACLEARS.ANY: the number of times the front end detects a branch prediction error and is re-steered (a branch address clear).

BAClear_Cost: a constant, the estimated number of cycles wasted per such front-end misprediction.

Branch Resteers can be further expanded into three subcategories: Branch Mispredictions, Machine Clears, and new branch address clears.

DSB_Switches:

This category counts the proportion of cycles in which the front end stalls while switching from the DSB pipeline to the MITE pipeline.

DSB stands for Decoded Stream Buffer; it stores uops that have already been decoded and can be understood as a uop ICache. The DSB was introduced in Intel's Sandy Bridge architecture. It corresponds to AMD's Op Cache, which, as the name suggests, is a cache for uops.

MITE stands for Micro-instruction Translation Engine; it is essentially the traditional decode pipeline from before the DSB was introduced.

When the front end switches from the DSB to the MITE pipeline, there is a penalty of some cycles during which no uops reach the micro-op queue. This category computes that portion of cycles as a fraction of the total.

self.val = EV("DSB2MITE_SWITCHES.PENALTY_CYCLES", 3) / CLKS(self, EV, 3)

The DSB2MITE_SWITCHES.PENALTY_CYCLES event means: Cycles of delay due to Decode Stream Buffer to MITE switches, i.e., the cycles wasted by switching; dividing this value by the total number of cycles gives the corresponding proportion.

LCP: Length Changing Prefixes

If an instruction being decoded has a length-changing prefix, decoding stalls for several cycles, roughly three. Appropriate compiler flags, or simply the compiler's defaults, can usually avoid LCP problems.

self.val = EV("ILD_STALL.LCP", 3) / CLKS(self, EV, 3)

The ILD_STALL.LCP event means: Stalls caused by changing prefix length of the instruction; the number of cycles lost to LCP is divided by the total number of cycles.

MS_Switches:

This category expresses the proportion of cycles wasted switching from the DSB or MITE pipeline to the MS pipeline.

MS stands for Microcode Sequencer, corresponding to the MSROM block in Intel's architecture diagram and the Microcode ROM in AMD's.

self.val = MS_Switches_Cost * EV("IDQ.MS_SWITCHES", 3) / CLKS(self, EV, 3)

The IDQ.MS_SWITCHES event counts the number of switches; its description is: Number of switches from DSB (Decode Stream Buffer) or MITE (legacy decode pipeline) to the Microcode Sequencer. Multiplying the switch count by the per-switch cycle cost MS_Switches_Cost (a constant, 2) gives the number of stall cycles caused by MS switches; dividing that by the total number of cycles gives the MS_Switches proportion.
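In the same sketch style as before (EV as above; the constant 2 is the MS_Switches_Cost the text mentions):

MS_Switches_Cost = 2  # assumed average penalty cycles per switch

def MS_Switches(EV):
    # estimated stall cycles from DSB/MITE-to-MS switches, over all cycles
    return MS_Switches_Cost * EV("IDQ.MS_SWITCHES") / EV("CPU_CLK_UNHALTED.THREAD")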

This is also why the TMA methodology emphasizes that comparing values across different levels and different categories is meaningless. Only when the Frontend Latency category is the main bottleneck do we expand it, and then we only compare the proportions of its six subcategories, without considering the other branches.

Frontend Bandwidth

The proportion of slots lost because the front end does not make full use of its bandwidth.

self.val = self.Frontend_Bound.compute(EV) - self.Frontend_Latency.compute(EV)

That is, simply subtract the value of its sibling Frontend_Latency from the value of the parent node Frontend_Bound. Since the two subcategories must sum to the total number of slots wasted by the front end, subtraction is all that is needed.
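In sketch form, the invariant that sibling nodes partition their parent makes this a plain subtraction (values assumed):

frontend_bound = 0.30    # parent node (assumed)
frontend_latency = 0.22  # first child (assumed)
frontend_bandwidth = frontend_bound - frontend_latency  # 0.08, the remainder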

For Bandwidth, TMA defines three subcategories: MITE, DSB, and LSD. These correspond to three kinds of front-end pipelines, and in each the number of uops issued per cycle may fall short of 4.

MITE (Micro-instruction Translation Engine)

Because the front end's decode pipeline can be inefficient, MITE's bandwidth may fall short of ideal.

Formula:

self.val = (EV(" IDQ.ALL_MITE_CYCLES_ANY_UOPS ", 3) - EV(" IDQ.ALL_MITE_CYCLES_4_UOPS ", 3)) / CORE_CLKS(self, EV, 3)

IDQ.ALL_MITE_CYCLES_ANY_UOPS counts the cycles in which MITE delivers uops to the IDQ (Instruction Decode Queue); its description is: Counts cycles MITE is delivered at least one uop.

IDQ.ALL_MITE_CYCLES_4_UOPS counts the cycles in which MITE delivers 4 uops to the IDQ, i.e., Counts cycles MITE is delivered four uops.

The difference between the two is the number of cycles in which MITE's bandwidth is less than 4; dividing this difference by the total number of cycles gives the proportion. Equivalently, one could sum the cycles in which MITE delivers 1, 2, or 3 uops to the IDQ to get the cycles with bandwidth below 4, and divide that by the total cycles for the same result.

DSB (Decoded Stream Buffer): the decoded-uop buffer

The DSB may also be inefficient; the specific cause may be that the DSB's cache structure is not well utilized, or that bank conflicts occur during reads. These causes keep the DSB from using its full bandwidth. The formula is:

self.val = (EV("IDQ.ALL_DSB_CYCLES_ANY_UOPS", 3) - EV("IDQ.ALL_DSB_CYCLES_4_UOPS", 3)) / CORE_CLKS(self, EV, 3)

The two events in the formula are interpreted exactly as in the MITE formula, except that the observed object is the DSB pipeline. Their difference is the number of low-efficiency DSB cycles; dividing by the total number of cycles gives the corresponding proportion.

However, as Intel's microarchitectures have advanced, the MITE and DSB bandwidths have grown too, and the maximum is clearly no longer 4. Taking the Skylake architecture diagram as an example again: the diagram shows the DSB's maximum bandwidth is now 6, and MITE's is 5. So it would be more reasonable for the formula to take the cycles in which MITE/DSB delivers uops and subtract the cycles delivering 4 or more uops, or alternatively to use the sum of the cycles delivering 1, 2, or 3 uops as the numerator over the total cycles, instead of the current subtraction.
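A sketch of that suggested alternative, assuming a hypothetical per-cycle delivery histogram (delivered[n] = cycles in which exactly n uops were delivered); no single Skylake event exposes this directly, so treat it purely as an illustration:

def underused_bandwidth_ratio(delivered, core_clks):
    # cycles delivering 1 to 3 uops waste part of the pipeline width
    wasted = delivered.get(1, 0) + delivered.get(2, 0) + delivered.get(3, 0)
    return wasted / core_clks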

Fusion: a mechanism in Intel CPUs that lets the front end fuse multiple uops into a single complex uop, which is split apart again when the back end executes it.

LSD: Loop Stream Detector

The LSD detects and stores looping uop sequences. When a loop's uop sequence is no larger than the LSD's capacity, it can be held in the LSD; the corresponding uops can then be fetched from the LSD over and over, with no front-end decoding needed.

self.val = (EV("LSD.CYCLES_ACTIVE", 3) - EV("LSD.CYCLES_4_UOPS", 3)) / CORE_CLKS(self, EV, 3)

The calculation follows the same pattern as its sibling nodes. The LSD.CYCLES_ACTIVE event is: Cycles with at least one uop delivered by the LSD and none from the decoder; the LSD.CYCLES_4_UOPS event is: Cycles with 4 uops delivered by the LSD and none from the decoder. Their difference is the number of cycles in which the LSD's bandwidth falls short, and dividing this value by the total number of cycles gives the corresponding proportion.

The correct analysis method is to look at the first level, then the corresponding second level, then the corresponding third level. There is no need to compare like-named categories across different branches; only compare proportions within the same branch.


Source: blog.csdn.net/m0_54437879/article/details/131689069