Modern CPUs and compilers are sophisticated: they try their best to keep every part of the processor busy and avoid leaving any unit idle.
I covered SIMD optimization in a previous article. SIMD exploits the width of the CPU's vector registers: in a single instruction, the CPU performs (register width / element size) operations in parallel. A 128-bit SSE register, for example, holds four 32-bit floats, so the accumulation of a four-component vector can be completed in one go, making full use of the data path width.
Accelerating Computing Using SIMD Instructions - MAX Blog - CSDN Blog
The following introduces another optimization technique that "squeezes" more performance out of the machine:
Branch prediction
Branch prediction is a technique, introduced in hardware around the Pentium (P5) generation, for handling conditional branch instructions (if-then-else) that would otherwise stall the pipeline. By guessing which way a branch will go, the CPU can keep the pipeline full and run faster.
First, look at a program.
It generates 32768 random numbers in the range 0 to 255, walks through them in turn, and accumulates each value greater than or equal to 128.
This pass is repeated 100,000 times and the total time is measured.
First the unsorted array:
#include <algorithm>
#include <ctime>
#include <iostream>
#define ARRAYSIZE 32768
int main()
{
// Create the array
int data[ARRAYSIZE];
// Fill it with random numbers in [0, 255]
for (int i = 0; i < ARRAYSIZE; ++i)
{
data[i] = std::rand() % 256;
}
// Sorting step; we will compare timings with and without it
//std::sort(data, data + ARRAYSIZE);
// Record the start time
clock_t start_time = clock();
// Accumulate in a loop
long long sum = 0;
// Run the main computation 100,000 times to reduce measurement error
for (int i = 0; i < 100000; ++i)
{
// Main computation: accumulate the values >= 128
for (int j = 0; j < ARRAYSIZE; ++j)
{
if (data[j] >= 128)
{
sum += data[j];
}
}
}
// Record the end time
clock_t end_time = clock();
// Compute the time spent in the accumulation loop
double ElapsedTime = static_cast<double>(end_time - start_time) / CLOCKS_PER_SEC;
// Print the elapsed time
std::cout << "ElapsedTime:" << ElapsedTime << std::endl;
}
Note that the sort is commented out:
//std::sort(data, data + ARRAYSIZE);
Result:
Now enable the sort:
std::sort(data, data + ARRAYSIZE);
Result:
As you can see, the sorted array runs more than three times faster than the unsorted one.
The same program in Java:
import java.util.Arrays;
import java.util.Random;
public class Main
{
    public static void main(String[] args)
    {
        int ArraySize = 32768;
        // Create the array
        int data[] = new int[ArraySize];
        // Fill it with random numbers; note nextInt() % 256 can be negative,
        // but negative values simply fail the >= 128 test
        Random rnd = new Random(0);
        for (int i = 0; i < ArraySize; ++i)
            data[i] = rnd.nextInt() % 256;
        // Sorting step; we will compare timings with and without it
        //Arrays.sort(data);
        // Record the start time
        long start_time = System.nanoTime();
        // Accumulate in a loop
        long sum = 0;
        // Run the main computation 100,000 times to reduce measurement error
        for (int i = 0; i < 100000; ++i)
        {
            // Main computation: accumulate the values >= 128
            for (int j = 0; j < ArraySize; ++j)
            {
                if (data[j] >= 128)
                    sum += data[j];
            }
        }
        // Record the end time
        long end_time = System.nanoTime();
        // Compute the elapsed time
        System.out.println("ElapsedTime:" + (end_time - start_time) / 1000000000.0);
    }
}
With and without the Arrays.sort(data) call:
Arrays.sort(data);
Results:
Without sorting:
With sorting enabled:
Again the gap is more than three times.
Interestingly, in this test the Java version also ran faster than the C++ one; the JIT may be applying further optimizations. I will dig into the reason later.
Consider the test data[j] >= 128.
Each time the CPU reaches this comparison it would, naively, have to pause and wait for the result before continuing. Could it instead predict the outcome, proceed to the next step immediately, and verify the prediction while that step is already underway? That is in fact what happens, and the following sections explain why.
About the CPU pipeline
CPU pipelining decomposes each instruction into several steps and overlaps the steps of different instructions, so that several instructions are processed in parallel and the program runs faster. Each step is handled by its own dedicated circuit; as soon as an instruction finishes one step it moves on to the next, and the stage it just left starts processing the following instruction.
A classic CPU pipeline has the following four stages:
- Instruction fetch (Fetch)
- Instruction decode (Decode)
- Execute
- Write-back
Each instruction passes through the four stages in turn, so if we had to wait for one instruction to finish completely before starting the next, three of the four stages would sit idle at any moment.
Can we instead load the next instruction as soon as the first one moves to its second stage? The answer is yes, and the pipeline then looks like the figure below:
a single pipeline now has four instructions in flight at the same time.
However, for a conditional jump instruction, the direction of the branch is not known until the instruction reaches the execute stage.
So every conditional test can cause a pipeline bubble, reducing pipeline utilization.
Branch prediction
Given the problem above, the CPU can predict the outcome in advance, keep the pipeline running, and check afterwards whether the prediction was correct:
if the prediction was correct, execution simply continues;
if the prediction was wrong, the pipeline is flushed and instructions are fetched and decoded again.
Now consider the inner loop of the program again:
// Main computation: accumulate the values >= 128
for (int j = 0; j < ARRAYSIZE; ++j)
{
if (data[j] >= 128)
{
sum += data[j];
}
}
Label the two branch directions as follows:
T = branch taken
N = branch not taken
If the array is unsorted, the branch outcomes look like this:
data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118, 14, 150, 177, 182, 133, ...
branch = T, T, N, T, T, T, T, N, T, N, N, T, T, T, N ...
which is essentially unpredictable.
If the array is sorted, the outcomes become:
data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N N N N N ... N N T T T ... T T T ...
all not-taken at the front and all taken at the back,
which is trivially predictable.
With the branch outcome predictable, the conditional jump no longer stalls the pipeline; pipeline occupancy rises and the program runs faster.
How can we eliminate the branch (and thus mispredictions) to improve performance?
Two options are offered here.
Use a lookup table:
If sorting the array in advance is too expensive, a lookup table does the trick:
int lookup[256];
for (int i = 0; i < 256; ++i)
{
lookup[i] = (i >= 128) ? i : 0;
}
This builds a table mapping each value below 128 to 0 and each value >= 128 to itself.
for (unsigned i = 0; i < ARRAYSIZE; ++i)
{
sum += lookup[data[i]];
}
The accumulation loop then contains no comparison at all: values that fail the test simply contribute 0, so the branch disappears.
Use bit tricks to eliminate the branch:
data[i] - 128 is negative exactly when data[i] < 128.
(data[i] - 128) >> 31 shifts the sign bit down: it yields -1 when data[i] < 128 and 0 when data[i] >= 128 (assuming 32-bit int and an arithmetic right shift, which mainstream compilers provide).
As a 32-bit pattern, -1 is 0xFFFFFFFF and 0 is 0x00000000.
Invert it with ~:
for a qualifying value, ~t is 0xFFFFFFFF, and ANDing yields the original number;
for a non-qualifying value, ~t is 0x00000000, and ANDing yields 0.
for (unsigned i = 0; i < ARRAYSIZE; ++i)
{
int t = (data[i] - 128) >> 31; // -1 if data[i] < 128, else 0
sum += ~t & data[i];           // data[i] if data[i] >= 128, else 0
}
Branch prediction classification
Branch prediction comes in two flavors: static prediction decided at compile time, and dynamic prediction done in hardware at run time.
The simplest static scheme always predicts one direction (say, always taken), giving an average hit rate of about 50%; a more accurate static approach uses profiling statistics from previous runs to guess whether each branch will be taken.
Dynamic branch prediction is what modern processors implement. The simplest dynamic structures are the branch prediction buffer (BPB) and the branch history table (BHT).
The following predictor designs are introduced below.
1-bit dynamic prediction
Predict that a branch will go the same way it went last time: if it was taken last time, predict taken this time.
This works very well for the sorted array above: the prediction changes only at the start and at the single N-to-T transition in the middle, and every other iteration is predicted correctly.
2-bit dynamic prediction
This predictor is also known as a bimodal predictor or saturating-counter predictor.
It has four states: 00, 01, 10, 11.
In states 00 and 01 the branch is predicted not taken;
in states 10 and 11 it is predicted taken.
When the branch is actually taken, the counter increments, saturating at 11;
when it is not taken, the counter decrements, saturating at 00.
A branch must go the same direction twice in a row to move the counter out of a strong state and flip the prediction.
This makes it well suited to stable patterns with occasional one-off deviations.
Two-level adaptive predictor:
In short, it can learn periodic branch patterns.
It has two parts: a branch history register (BHR) in front, feeding a table of saturating counters behind it.
Consider the outcome sequence 001001001001...
With a 2-bit history register, whenever the recent history is 00, the saturating counter selected by that history is consulted, and it quickly learns that the next outcome is almost always 1.
Based on the state of that counter, the predictor judges the next outcome to be 1 with high confidence; that is the general principle.
Below is a version with a 4-bit history register:
the branch history table is larger, but the principle is otherwise the same.
An n-bit BHR can accurately track any branch pattern whose repetition period is at most n; if a branch's pattern period exceeds n, prediction accuracy suffers.
There are of course other branch predictor designs; they are beyond the scope of this article.