Conditional judgment statements and branch prediction

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This work was created by Li Zhaolong. Please credit the author and indicate the source when reprinting.

Introduction

The impetus for this article was the question in [1]. Although I had learned about branch prediction before, I did not expect its impact on performance to be so large under extreme conditions. I originally wanted to describe the question and its answer in detail from my own understanding, but many articles have already done exactly that, so this article will only discuss it briefly.

Problem Description

First, let's look at a classic piece of code:

#include <algorithm>
#include <cstdlib>   // std::rand
#include <ctime>
#include <iostream>

int main() {
    const unsigned ARRAY_SIZE = 50000;
    int data[ARRAY_SIZE];
    const unsigned DATA_STRIDE = 256;

    for (unsigned c = 0; c < ARRAY_SIZE; ++c) data[c] = std::rand() % DATA_STRIDE;

    std::sort(data, data + ARRAY_SIZE);

    {
        // timed section
        clock_t start = clock();
        long long sum = 0;

        for (unsigned i = 0; i < 100000; ++i) {
            for (unsigned c = 0; c < ARRAY_SIZE; ++c) {
                if (data[c] >= 128) sum += data[c];
            }
        }

        double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

        std::cout << elapsedTime << "\n";
        std::cout << "sum = " << sum << "\n";
    }
    return 0;
}

Let's compare what happens with the std::sort line present versus commented out. Common sense says the sort should have no effect other than making the program slower, since it adds an O(N log N) step on top of the O(N) loop. Here are the results:
[Figure: timing results. The upper result is with the sort; the lower one is without it.]

Surprising, isn't it? There is a performance gap of more than three times, completely contrary to expectations. The answer is branch prediction.

When we studied microcomputer principles, we learned that the BIU (bus interface unit) actually prefetches instructions, because for the CPU an instruction has to go through the following steps from beginning to end:

  1. Fetch
  2. Decode
  3. Execute
  4. Write-back

Execution is obviously only one of these stages, so pipelining is clearly a better way to squeeze the CPU: while one instruction executes, bus resources are not wasted, and there is no need to wait for the next fetch after the current instruction finishes. A problem arises, however, at a conditional judgment statement. Depending on whether the condition is true or false, a jump may occur, and before the condition is evaluated we do not know which branch will be taken. There are two obvious approaches. One is to wait synchronously, so no wrong instruction is ever fetched, but this is very slow. The other is to pick one branch based on some heuristic and load it into the instruction queue first; if the prediction succeeds, all is well, and if it fails, the pipeline buffer is flushed and execution rolls back to the branch point to re-fetch instructions. For specific branch prediction strategies, refer to [5] and [3].

Wiki has the following description of branch prediction [3]:

  • Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch that is guessed to be the most likely is then fetched and speculatively executed. If it is later detected that the guess was wrong, then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.
  • The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. As a result, making a pipeline longer increases the need for a more advanced branch predictor.

So we can draw a simple conclusion from this demo: in hot loops, keep an eye on how your conditional logic interacts with branch prediction.


Of course, the strategy pattern from design patterns can avoid the clutter of long if-else chains, but the common implementation of the strategy pattern is basically table lookup [6]: a hash table is built, and different parameter values dispatch to different code blocks. The definition of the strategy pattern is as follows:

  • Define a family of algorithms, encapsulate each one, and make them interchangeable. Strategy lets the algorithm vary independently from clients that use it.

As we can see, the point of the strategy pattern is not simply to make the code faster, but to loosely couple the caller from the code provider, and even to satisfy the open-closed principle.
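The table-lookup dispatch mentioned above can be sketched as follows. This is a minimal illustration, not code from any cited source; the strategy names and the `apply_strategy` function are hypothetical:

```cpp
#include <functional>
#include <string>
#include <unordered_map>

// Table-driven "strategy" dispatch: each branch of a former if-else chain
// becomes an entry in a hash table mapping a key to a callable.
int apply_strategy(const std::string& name, int x) {
    static const std::unordered_map<std::string, std::function<int(int)>> table = {
        {"double", [](int v) { return v * 2; }},
        {"square", [](int v) { return v * v; }},
        {"negate", [](int v) { return -v; }},
    };
    auto it = table.find(name);
    return it != table.end() ? it->second(x) : x;  // unknown key: identity
}
```

Note that the lookup itself still involves branches inside the hash table; the win here is decoupling, not branch elimination.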

Of course, what we need more often is readability and maintainability. Performance optimization comes very late, and I think scenarios that require instruction-level optimization are rare, mostly storage and kernel code.

Optimization

The code above is admittedly extreme, but can we optimize it? Yes: we can replace the conditional branch with bit operations. The CPU then needs no branch prediction at all, and the dramatic performance swings seen above disappear.

The strategy comes from [2]:

|x| >> 31 = 0        // a non-negative number shifted right by 31 is always 0
~(|x| >> 31) = -1    // 0 inverted is -1

-|x| >> 31 = -1      // a negative number shifted right by 31 is always 0xffffffff = -1
~(-|x| >> 31) = 0    // -1 inverted is 0

-1 = 0xffffffff      // for a 32-bit int
-1 & x = x           // ANDing any value with an all-ones mask leaves it unchanged
int t = (data[c] - 128) >> 31;  // statement 1
sum += ~t & data[c];            // statement 2

Even more clever is this:

int t = -((data[c] >= 128));  // generate the mask
sum += ~t & data[c];          // bitwise AND

In effect, when data[c] is at least 128 the mask ~t is 0xffffffff, so the AND leaves the value unchanged; otherwise the mask is 0 and the result is 0. Tests of the form "greater than some threshold" are quite common, so this trick applies in many places; it is a useful programming skill. You can find many more bit-manipulation tricks in [7].
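As a sanity check, the branchy loop and the branchless rewrite can be placed side by side. The function names here are mine, a minimal sketch of the trick just described (note that right-shifting a negative int is arithmetic on mainstream compilers, and only guaranteed by the standard since C++20):

```cpp
// Branchy reference: add only the values that are at least 128.
long long sum_branchy(const int* data, unsigned n) {
    long long sum = 0;
    for (unsigned c = 0; c < n; ++c)
        if (data[c] >= 128) sum += data[c];
    return sum;
}

// Branchless: (data[c] - 128) >> 31 is 0 when data[c] >= 128 and -1
// otherwise (arithmetic shift of a negative 32-bit int), so ~t is an
// all-ones mask exactly when the value should be added.
long long sum_branchless(const int* data, unsigned n) {
    long long sum = 0;
    for (unsigned c = 0; c < n; ++c) {
        int t = (data[c] - 128) >> 31;
        sum += ~t & data[c];
    }
    return sum;
}
```

Both functions compute the same sum; only the branchless one runs at the same speed on sorted and unsorted input.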

Inspiration for our code

What I actually want to talk about are two GCC built-in functions. Both are related to branch prediction, and one of them can probably be used in our own code.

__builtin_speculation_safe_value

The role of this function is described in the GNU manual as follows:

  • This built-in function can be used to help mitigate against unsafe speculative execution. type may be any integral type or any pointer type.

The following examples are given in the manual:

int array[500];
int f (unsigned untrusted_index)
{
  if (untrusted_index < 500)
    return array[untrusted_index];
  return 0;
}

In daily coding this looks perfectly fine, but the code is actually slightly dangerous, and the problem lies in branch prediction.

If the function is called many times with values less than 500, and then called with an out-of-range value, the CPU will still speculatively execute the array access until it determines the prediction was wrong (at which point it cancels all the incorrect operations). However, depending on how the result is used, traces may be left in the cache, and those traces can reveal the contents of out-of-bounds memory locations. This can be avoided with __builtin_speculation_safe_value:

int array[500];
int f (unsigned untrusted_index)
{
  if (untrusted_index < 500)
    return array[__builtin_speculation_safe_value (untrusted_index)];
  return 0;
}

After changing the code to the form above, safety is guaranteed. The manual describes two possible behaviors of the built-in here:

  1. Execution may stall until the conditional branch has been completely resolved;
  2. Speculative execution may be allowed to continue, but if the index is out of range, 0 is used instead of untrusted_index.

The manual also notes that it may be unsafe to access any memory location at all during misspeculation. In that case the code can be rewritten as:

int array[500];
int f (unsigned untrusted_index)
{
  if (untrusted_index < 500)
    return *__builtin_speculation_safe_value (&array[untrusted_index], NULL);
  return 0;
}

This case is more confusing. I think it addresses the second behavior above, for when substituting index 0 would itself be unsafe: if the access to array[untrusted_index] is speculative, the built-in yields NULL instead of the address, so the misspeculated load never touches memory at all.

__builtin_expect

A simple and blunt function, described in the manual as follows:

  • You may use __builtin_expect to provide the compiler with branch prediction information.

Of course, you can use this if you are sure about your program's branch frequencies, but as the manual quips:

as programmers are notoriously bad at predicting how their programs actually perform.

Still, it is very useful in many scenarios. A classic example: logging needs gettid(). That is a system call, which is obviously expensive, but the value does not change during a thread's lifetime, so caching it is a perfectly normal and reasonable operation. We can write it like this (the code comes from adl's adlserver):

//currentTheread.h
extern __thread int t_cachedTid;
void cacheTid();
inline int tid() {
  if (__builtin_expect(t_cachedTid == 0, 0)) {
    cacheTid();
  }
  return t_cachedTid;
}

//currentTheread.cpp
__thread int CurrentThread::t_cachedTid = 0;

void CurrentThread::cacheTid() {
  if (t_cachedTid == 0) {
    t_cachedTid = adl::gettid();
  }
}

The code is very easy to understand, so I won't explain it.
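A common companion idiom wraps __builtin_expect in likely()/unlikely() macros, as the Linux kernel does. Below is a minimal sketch; the macro names and the checked_div example are my own illustration, not part of GCC or the adlserver code:

```cpp
// Conventional wrappers around __builtin_expect. The double negation (!!)
// normalizes any truthy expression to exactly 0 or 1 before hinting.
#if defined(__GNUC__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
// On compilers without the built-in, the hint simply disappears.
#define LIKELY(x) (x)
#define UNLIKELY(x) (x)
#endif

// Toy example: the error path is expected to be rare, so hint it as such.
int checked_div(int a, int b) {
    if (UNLIKELY(b == 0)) return 0;  // rare error path
    return a / b;                    // hot path
}
```

The hint only influences which branch the compiler lays out as the fall-through path; it changes nothing about correctness.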

To sum up

Happy New Year, everyone! May you collect plenty of New Year's Eve money, enjoy better health and great moods in the new year, have good luck every day and delicious food at every meal, with gold appearing at home and banknotes growing on the walls.

References:

  1. "What 's wrong with the if-else branch in the code?" In addition to maintainability, does it have any impact on the efficiency of program operation?
  2. "In- depth understanding of CPU branch prediction (Branch Prediction) model "
  3. wiki Branch predictor
  4. GNU manual https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
  5. " Branch Prediction "
  6. " The Beauty of Geek Time Design Pattern "
  7. Bit Twiddling Hacks

Origin blog.csdn.net/weixin_43705457/article/details/113797121