Optimizing if-else using the CPU's branch prediction model

Modern CPUs and compilers are quite intelligent: they try their best to keep every component of the CPU busy and avoid leaving any part of the chip idle.

My previous article covered SIMD optimization, which exploits the width of the CPU's vector registers: a single instruction processes (register width / data type size) elements in parallel. For example, a 128-bit SIMD register holds four 32-bit floats, so the addition of a four-dimensional float vector can be completed in one instruction, making full use of the available data width.

Accelerating Computing Using SIMD Instructions - MAX Blog - CSDN Blog

Below is another optimization technique that can "squeeze" more performance out of the machine.

Branch prediction

Branch prediction is an advanced processing technique, present since the Pentium (P5) generation, that addresses the pipeline stalls caused by branch instructions (if-then-else). The CPU guesses which way a branch will go, which speeds up execution.

First, a test program

Write a program that generates 32768 random numbers in the range 0 to 255, then walks the array and adds each element to a sum if it is greater than or equal to 128.

Repeat this process 100,000 times and measure the total time.

First, the unsorted array:

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

#define ARRAYSIZE 32768

int main()
{
    // Create the array
    int data[ARRAYSIZE];
    // Fill it with random numbers in [0, 255]
    for (int i = 0; i < ARRAYSIZE; ++i)
    {
        data[i] = std::rand() % 256;
    }
    // Sorting; we will compare timings with and without it
    //std::sort(data, data + ARRAYSIZE);

    // Record the start time
    clock_t start_time = clock();

    // Accumulator for the sum
    long long sum = 0;

    // Run the main computation 100000 times to reduce measurement error
    for (int i = 0; i < 100000; ++i)
    {
        // Main computation: accumulate the numbers >= 128,
        // i.e. the upper half of the value range
        for (int j = 0; j < ARRAYSIZE; ++j)
        {
            if (data[j] >= 128)
            {
                sum += data[j];
            }
        }
    }

    // Record the end time
    clock_t end_time = clock();

    // Compute the time spent in the accumulation loop
    double ElapsedTime = static_cast<double>(end_time - start_time) / CLOCKS_PER_SEC;

    // Print the elapsed time
    std::cout << "ElapsedTime:" << ElapsedTime << std::endl;
}

Note that the sorting call is commented out:

 //std::sort(data, data + ARRAYSIZE);

Result:

With sorting enabled:

std::sort(data, data + ARRAYSIZE);

Result:

The performance gap between the unsorted and the sorted array is more than threefold.

The same test in Java:

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        int ArraySize = 32768;
        // Create the array
        int[] data = new int[ArraySize];
        // Fill it with random numbers in [0, 255]
        Random rnd = new Random(0);
        for (int i = 0; i < ArraySize; ++i)
            data[i] = rnd.nextInt(256);

        // Sorting; we will compare timings with and without it
        //Arrays.sort(data);

        // Record the start time
        long start_time = System.nanoTime();

        // Accumulator for the sum
        long sum = 0;

        // Run the main computation 100000 times to reduce measurement error
        for (int i = 0; i < 100000; ++i)
        {
            // Main computation: accumulate the numbers >= 128
            for (int j = 0; j < ArraySize; ++j)
            {
                if (data[j] >= 128)
                    sum += data[j];
            }
        }

        // Record the end time
        long end_time = System.nanoTime();

        // Print the elapsed time in seconds
        System.out.println("ElapsedTime:" + (end_time - start_time) / 1000000000.0);
    }
}

With and without the Arrays.sort(data) call:

Arrays.sort(data);

The results are:

Without sorting

With sorting enabled

The performance gap is again more than threefold.

Interestingly, the Java version runs faster than the C++ one here; the JIT may be applying additional optimizations. That is worth investigating separately.

Consider the condition data[j] >= 128.

Each time it is evaluated, the CPU has to wait for the result before it knows what to execute next. Could it instead predict the outcome, continue to the next step immediately, and check the prediction while that step is already underway? That is exactly what the hardware does, and the following sections explain why.

About the CPU pipeline

CPU pipelining decomposes each instruction into several steps and overlaps those steps across different instructions, so several instructions are in flight at once and the program runs faster. Each step is handled by its own dedicated circuitry; when an instruction finishes a step it moves on to the next one, and the freed-up stage starts processing the following instruction.

A classic pipeline has the following four stages:

  • Instruction fetch (Fetch)
  • Instruction decode (Decode)
  • Execute
  • Write-back

Each instruction passes through these four stages in turn, so if we had to wait for one instruction to finish completely before starting the next, three of the four stages would sit idle at any moment.

Can the next instruction be loaded as soon as the first one moves on to its second stage? Yes, and the pipeline then fills up as shown in the figure below:

A single pipeline then works on four instructions at the same time.

However, for a conditional jump instruction, the branch direction is not known until the instruction reaches the execute stage.

So every conditional branch can cause a pipeline bubble, reducing pipeline utilization.

Branch prediction

Given this problem, the CPU can predict the branch outcome in advance, keep the pipeline running, and verify the prediction later.

If the prediction is correct, execution simply continues.

If the prediction is wrong, the pipeline is flushed and instructions are fetched and decoded again from the correct path.

Now consider the inner loop again:

        // Main computation: accumulate the numbers >= 128
        for (int j = 0; j < ARRAYSIZE; ++j)
        {
            if (data[j] >= 128)
            {
                sum += data[j];
            }
        }

Label the two branch outcomes as follows:

T = branch taken
N = branch not taken

If the array is not sorted, the branch outcomes look like this:

data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, 133, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T,   N  ...

Completely unpredictable.

If the array is sorted, the outcomes become:



data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

All false at the front, all true at the back.

Obviously predictable.

The conditional jump can then be predicted correctly almost every time, keeping the pipeline full, and the program runs faster.

How can we avoid branch mispredictions and improve the program's efficiency?

Here are two options:

Use a lookup table:

If sorting the array in advance is too expensive, a lookup table can remove the branch instead.

int lookup[256];
for (int i = 0; i < 256; ++i) 
{
    lookup[i] = (i >= 128) ? i : 0;
}

Build a table that maps each non-matching value to 0 and each matching value to itself, then index it during the accumulation:

for (unsigned i = 0; i < ARRAYSIZE; ++i) 
{
    sum += lookup[data[i]];
}

The accumulation now contains no branch at all: non-matching values simply contribute 0 to the sum.

Use bit operations to eliminate the branch:

Compute data[i] - 128 and inspect its sign.

(data[i] - 128) >> 31 shifts the sign bit down 31 positions (an arithmetic shift on a 32-bit int): values less than 128 yield -1, values greater than or equal to 128 yield 0.

In binary, -1 is all ones (0xFFFFFFFF) and 0 is all zeros (0x00000000).

Invert the result with ~:

For a matching value, ANDing with all ones leaves the original number.

For a non-matching value, ANDing with all zeros gives 0.

for (unsigned i = 0; i < ARRAYSIZE; ++i) 
{
    int t = (data[i] - 128) >> 31;
    sum += ~t & data[i]; 
}

Branch predictor classification

Branch prediction comes in two flavors: static prediction done at compile time and dynamic prediction done by the hardware at run time.

The simplest static method always predicts the same direction, giving an average hit rate of about 50%. A more accurate approach uses statistics from previous runs to predict whether a branch will be taken.

Dynamic branch prediction is implemented by modern processors. The simplest dynamic strategies use a branch prediction buffer or a branch history table.

Two classes of dynamic predictors are introduced below.

1-bit dynamic prediction 

Predict that the branch will do whatever it did last time: if it was taken on the previous execution, predict taken this time.

This works very well on the sorted array above: the prediction is wrong only at the very start and at the midpoint transition, and there are no other mispredictions.

2-bit dynamic prediction 

This predictor is also known as a bimodal predictor or a saturating counter.

It has four states: 00, 01, 10, 11.

In states 00 and 01 the branch is predicted taken.

In states 10 and 11 the branch is predicted not taken.

When the branch is actually taken, the state moves one step toward 00, saturating there (00 stays 00).

When the branch is actually not taken, the state moves one step toward 11, saturating there (11 stays 11).

Starting from a strong state (00 or 11), the branch must go the other way twice in a row before the prediction flips.

This suits branches that are mostly stable with occasional one-off deviations.

Two-level adaptive predictor:

Put simply, it can learn periodic branch patterns.

It has two parts: a branch history register (BHR) in front, feeding a table of saturating counters behind it.

Consider the repeating outcome sequence 001001001001...

With a 2-bit history register: whenever the recent history is 00, the saturating counter indexed by 00 has learned that the next outcome is almost certainly 1, so the predictor immediately predicts 1. That is the general principle: each distinct history pattern gets its own counter.

Below is a 4-bit version:

The branch history table is larger; otherwise the principle is the same.

With an n-bit BHR, the predictor can accurately track any branch pattern whose repetition period is at most n. If a pattern's period exceeds n, prediction accuracy suffers.

Of course, there are other branch predictors; they are not covered here.


Origin blog.csdn.net/qq_36653924/article/details/129817664