Let the CPU black box no longer be black - [TMA_Top-down CPU architecture performance bottleneck analysis method] Speculation

Bad Speculation covers the performance impact of wrong branch predictions and of the CPU's machine clear mechanism. Treating Bad Speculation as a top-level category is a distinguishing feature of TMAM. If this category's value exceeds a certain threshold, it should be investigated and resolved first.

The Bad Speculation category represents slots wasted in two ways:

1/ Slots wasted by wrong speculation: uops on the wrong path were issued but never retired, so the slots they occupied are wasted;

2/ Slots wasted during recovery: while the frontend issue pipeline recovers from the wrong branch to the correct one, the pipeline stalls and no uops are issued to the backend. The slots lost in this process also belong to this category.

Bad Speculation

The calculation formula for this category is:

(EV("UOPS_ISSUED.ANY", 1) - EV("UOPS_RETIRED.RETIRE_SLOTS", 1) + Pipeline_Width * Recovery_Cycles(self, EV, 1)) / SLOTS(self, EV, 1)

The denominator is the total number of slots, and the numerator is the number of slots that belong to this category.

The numerator has two parts. The first, (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS), is the difference between the two event counts and gives the number of slots that were issued but never retired.

The second, Pipeline_Width * Recovery_Cycles, gives the number of slots wasted while the issue pipeline is blocked during recovery.

The calculation formula of the Recovery_Cycles function:

def Recovery_Cycles(self, EV, level):
    return (EV("INT_MISC.RECOVERY_CYCLES_ANY", level) / 2) if smt_enabled else EV("INT_MISC.RECOVERY_CYCLES", level)

Recovery_Cycles has to take into account whether SMT (Simultaneous Multi-Threading) is enabled.

The INT_MISC.RECOVERY_CYCLES_ANY event counts the cycles in which the resource allocator is stalled while recovering from a branch misprediction or a Machine Clear event. If HT (Hyper-Threading) is turned on, two threads run on the same core, so the value has to be split evenly between them. Counting RECOVERY_CYCLES separately for each thread would be more accurate, but it would be considerably harder to implement and cost more resources, so an approximation is used here: the count is simply divided by 2.
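The top-level formula can be tried out on plain numbers. The following is a minimal sketch, not the pmu-tools implementation: the counter values are made up, the helper names and parameters are hypothetical, the pipeline width is assumed to be 4, and the slot total is assumed to be the pipeline width times the core clock cycles.

```python
PIPELINE_WIDTH = 4  # assumption: 4 issue slots per cycle


def recovery_cycles(recovery_cycles_any, recovery_cycles_thread, smt_enabled):
    """Mirror Recovery_Cycles: halve the core-wide count when SMT is on."""
    if smt_enabled:
        return recovery_cycles_any / 2
    return recovery_cycles_thread


def bad_speculation(uops_issued, uops_retired_slots, recovery, core_clks):
    """Fraction of all issue slots lost to bad speculation."""
    slots = PIPELINE_WIDTH * core_clks  # assumed definition of SLOTS
    # Part 1: issued but never retired. Part 2: slots lost while recovering.
    wasted = (uops_issued - uops_retired_slots) + PIPELINE_WIDTH * recovery
    return wasted / slots


# Made-up counter values, for illustration only.
rec = recovery_cycles(recovery_cycles_any=40_000,
                      recovery_cycles_thread=25_000,
                      smt_enabled=True)
frac = bad_speculation(uops_issued=1_200_000,
                       uops_retired_slots=1_000_000,
                       recovery=rec,
                       core_clks=500_000)
print(f"recovery cycles: {rec:.0f}, bad speculation: {frac:.1%}")
```

With these numbers, 200,000 slots are lost to uops that never retired and 80,000 to recovery, out of 2,000,000 slots in total.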

Bad Speculation is divided into two sub-categories: Branch Mispredicts and Machine Clears. The first represents the proportion of slots wasted because of branch mispredictions; the second represents the proportion of slots wasted because certain conditions triggered a Machine Clear. The two sub-categories are explained separately below.

Branch Mispredicts:

When the CPU encounters a branch instruction, the branch predictor guesses the branch outcome so that the pipeline does not have to stall. Although current prediction strategies reach accuracy rates above 90%, mispredictions are inevitable. When one occurs, the instructions fetched on the wrong path must be discarded and instructions must be fetched again from the correct path. This sub-category is calculated as follows:

Mispred_Clears_Fraction(self, EV, 2) * Bad_Speculation

The idea is to multiply the proportion coefficient of mispredict clears by the value of the parent node to obtain this sub-category's share. The Mispred_Clears_Fraction function itself is defined as follows:

def Mispred_Clears_Fraction(self, EV, level):
    return EV("BR_MISP_RETIRED.ALL_BRANCHES", level) / (EV("BR_MISP_RETIRED.ALL_BRANCHES", level) + EV("MACHINE_CLEARS.COUNT", level))

BR_MISP_RETIRED.ALL_BRANCHES can be understood as the number of mispredicted branches: the counter is incremented by one each time a misprediction occurs, regardless of how many instructions it affects or how many cycles it wastes.

MACHINE_CLEARS.COUNT counts how many times a Machine Clear occurs: the counter is incremented by one each time a clear is triggered. In short, the coefficient is the ratio of the number of mispredictions to the combined number of mispredictions and Machine Clears.

Multiplying this coefficient by the parent node's value is therefore not very accurate. A misprediction and a Machine Clear do not waste the same number of slots, and counting occurrences without weighing their actual impact will inevitably lead to large deviations in some extreme cases. This is probably a limitation of the available counters: in the absence of better-suited ones, it is a compromise.
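To see how far the count-based coefficient can drift, here is a small sketch with made-up numbers. The per-event slot costs (20 slots per misprediction, 200 per Machine Clear) are purely hypothetical and are not something the PMU reports; `slot_weighted_fraction` is an illustrative helper, not part of pmu-tools.

```python
def mispred_clears_fraction(br_misp, machine_clears):
    """Count-based coefficient, as in Mispred_Clears_Fraction."""
    return br_misp / (br_misp + machine_clears)


def slot_weighted_fraction(br_misp, machine_clears,
                           slots_per_misp, slots_per_clear):
    """Hypothetical coefficient that weighs each event by the slots it wastes."""
    misp_slots = br_misp * slots_per_misp
    clear_slots = machine_clears * slots_per_clear
    return misp_slots / (misp_slots + clear_slots)


# Made-up event counts and per-event costs, for illustration only.
by_count = mispred_clears_fraction(br_misp=900, machine_clears=100)
by_slots = slot_weighted_fraction(br_misp=900, machine_clears=100,
                                  slots_per_misp=20, slots_per_clear=200)
print(f"count-based: {by_count:.2f}, slot-weighted: {by_slots:.2f}")
```

In this invented scenario the count-based coefficient attributes 90% of Bad Speculation to mispredictions, while a slot-weighted view would attribute less than half.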

Machine Clears:

When the CPU detects certain conditions, it triggers a Machine Clear, which flushes the instructions in the pipeline so that execution stays correct. Examples include memory ordering violations, self-modifying code, and loads from illegal address ranges. Because its sibling node Branch Mispredicts already has an explicit formula, the value of Machine Clears is obtained simply by subtracting the sibling's value from the parent's value. The formula is:

self.val = Bad_Speculation - Branch_Mispredicts
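The relationship between the parent and its two children can be sketched as follows. This is an illustration on made-up values, not the pmu-tools code: `split_bad_speculation` is a hypothetical helper, and returning zeros when neither event occurred is a choice made for this sketch only.

```python
def split_bad_speculation(bad_spec, br_misp, machine_clears):
    """Split the parent value into (Branch_Mispredicts, Machine_Clears)."""
    total_events = br_misp + machine_clears
    if total_events == 0:
        # Sketch-only choice: nothing to attribute when no event was counted.
        return 0.0, 0.0
    fraction = br_misp / total_events
    branch_mispredicts = fraction * bad_spec
    # Machine Clears is whatever remains of the parent.
    machine_clears_share = bad_spec - branch_mispredicts
    return branch_mispredicts, machine_clears_share


# Made-up inputs: 14% of slots lost, 900 mispredictions, 100 machine clears.
bm, mc = split_bad_speculation(bad_spec=0.14, br_misp=900, machine_clears=100)
print(f"Branch Mispredicts: {bm:.3f}, Machine Clears: {mc:.3f}")
```

Because Machine Clears is computed as a remainder, the two children always add up to the parent value.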

Since TMAM does not expand Bad Speculation any further, this concludes the explanation of the category. Although Bad Speculation in PMU-Tools has only two child nodes, more counters could be added to analyze the causes of Bad Speculation more precisely. For example, count the uops that were issued but never retired because of mispredictions and because of Machine Clears, recorded as A and B; then count the cycles of pipeline stall caused by mispredictions and by Machine Clears, recorded as C and D.

Then (A+C*4) / (A+B+C*4+D*4), where the factor 4 plays the role of Pipeline_Width, gives a more exact proportion coefficient for mispredictions, and A / (A+C*4) and C*4 / (A+C*4) give the next-level nodes under Branch Mispredicts. This is only an introduction. I hope you will not feel limited to the specific formulas for Intel CPUs or the specific implementation of the performance counters: we can design a TMAM that suits our own needs according to our own ideas.
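The proposed refinement could look like the sketch below. The inputs A, B, C and D are the hypothetical counters described above; no existing PMU event is implied, and the values, the `refined_breakdown` name and the dictionary keys are invented for illustration.

```python
PIPELINE_WIDTH = 4  # the factor 4 in the formulas above


def refined_breakdown(a, b, c, d, width=PIPELINE_WIDTH):
    """Slot-based split of Bad Speculation from hypothetical counters.

    a, b: unretired uops caused by mispredictions / machine clears
    c, d: stall cycles caused by mispredictions / machine clears
    """
    misp_slots = a + c * width
    clear_slots = b + d * width
    total = misp_slots + clear_slots
    return {
        "mispredict_fraction": misp_slots / total,
        "machine_clear_fraction": clear_slots / total,
        # Next-level nodes under Branch Mispredicts.
        "misp_unretired_share": a / misp_slots,
        "misp_recovery_share": c * width / misp_slots,
    }


# Made-up counter values, for illustration only.
result = refined_breakdown(a=150_000, b=50_000, c=15_000, d=5_000)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Unlike the count-based coefficient, this split is expressed in slots throughout, so each event is weighted by what it actually cost.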

Origin blog.csdn.net/m0_54437879/article/details/131702024