"C++ Performance Optimization Guide" reading notes

foreword

Inside every excellent large piece of software is an excellent small program.

text

current reading progress

P79

Chapter 1 Optimization Overview

  1. Knuth: We should forget about small efficiencies; premature optimization is the root of all evil.
    Comprehension: a small performance optimization may cost a great deal of time for little gain. Optimization is truly necessary only when the code is executed frequently and its time cost is significant.

  2. It takes the same amount of time to write efficient code as to write inefficient code, so why would anyone go out of their way to write inefficient code?

  3. The performance of only 10% of the code in the program is important.

  4. Common sense may be the worst enemy of performance improvement.

  5. Performance offenders: function calls, memory allocation, loops.

  6. Remember to turn on the optimization options of the compiler.

  7. It is good C++ coding style to write small member functions to access member variables.

  8. A few important tips to improve program performance:

  1. Precomputation: move calculations from runtime to link time, compile time, or design time.
  2. Lazy evaluation: if a result will not be used yet, postpone the calculation until the result is really needed.
  3. Caching: save the results of expensive computations and reuse them.
  1. Reducing calls to the memory manager is a very effective means.

  2. Reduce memory allocation and copying.

  3. Modern compilers already do a good job of local optimizations; it is not necessary to replace every i++ in the code with ++i.

  4. Improve the concurrency of the program, and make good use of tools for synchronizing concurrent threads so that they can share data.

  5. Make good use of the compiler: remember to turn on the optimization options of the compiler.
    https://www.qt.io/zh-cn/blog/2018/12/03/modern-qt-development-top-10-tools-using If you have time, the tools listed here are worth learning as a personal extension.

  6. It is good C++ coding style to write many small functions to access member variables of various classes.

  7. For code optimization, learning and using the optimal algorithm for searching and sorting is the way to go.

Chapter 2 Computer Behavior Affecting Optimization

  1. C++11 provides the std::atomic<> feature, which allows memory to behave as if it were a simple linear storage of bytes for a brief period of time.

  2. Accessing a data structure with consecutive addresses (such as an array or vector) is faster than accessing one whose nodes are linked by pointers, because consecutively addressed data makes much better use of the cache (and needs no extra storage for the pointers themselves).

  3. Virtual memory just creates the illusion of having sufficient physical memory.

  4. Von Neumann Bottleneck: The interface to main memory is the bottleneck that limits execution speed.

  5. An unaligned memory access takes about twice as long as an aligned one, because it touches two words instead of one.

  6. Reading a byte that is not in the cache will cause many adjacent bytes to be cached as well.

  7. A jump instruction or a jump-to-subroutine instruction changes the execution address to a new value, but the new execution address is not known until some time after the jump instruction executes, which stalls the pipeline.

  8. What happens when a thread switch occurs? The processor registers of the thread about to be suspended are saved, and the previously saved registers of the thread about to be resumed are loaded.

  9. What happens when a program (process) switch occurs?

All dirty cache lines must be flushed to physical memory.
All processor registers must be saved.
The virtual-to-physical address mapping (the page table) must be saved and switched.

  1. Accessing shared data between threads is much slower than accessing non-shared data.

  2. The overhead of a system call is hundreds of times that of a function call in a single-threaded program.

  3. The microprocessor's memory control logic may choose to delay writing to memory to optimize memory bus usage.

  4. Calculations are faster than decisions.

Chapter 3 Measurement Performance

  1. Galileo: Measure the measurable and make the unmeasurable measurable.

  2. 90/10 rule: A program spends 90% of its time executing 10% of its code.

  3. If the code being optimized does not account for a significant share of the program's overall running time, then even a very successful optimization of it is not worthwhile.

  4. When measuring performance, ask yourself: why is this code hot?

  5. During testing, record the running time after each modification (even with pen and paper). Comparing multiple runs makes it clear which modifications were effective, and keeping the record helps with the overall comparison later.

  6. Optimization efforts are dominated by two numbers: the performance baseline measurement before optimization and the performance target value. Measuring performance benchmarks is important not only to gauge the success of each individual improvement, but also to explain the cost of optimization to other stakeholders.

  7. List of performance test items:

a. Startup time: the time elapsed from when the user presses the Enter key until the program enters its main processing loop. In practice, this is mainly the time from the start of main() to entering the main loop.
b. Exit time: the time elapsed from when the user clicks the close icon or enters the exit command until the program has actually exited completely. Developers can usually measure this as the time from when the main window receives the close command to when main() returns. It matters because the time needed to restart a service or long-running program equals its exit time plus its startup time.
c. Response time: the average or worst-case time to execute a command. Below 0.1 seconds: the user feels in direct control. 0.1 to 1 second: the user feels in control of the command. 1 to 10 seconds: the computer feels in control. Above 10 seconds: time for a coffee break.
d. Throughput: the average number of operations the system performs per unit of time under a given test load.

  1. What is a profiler?
    A profiler is a program that generates statistics about the execution time of another program. A profiler can output a report containing the execution frequency of each statement or function, and the cumulative execution time of each function.

  2. Measuring runtime is an effective way to test hypotheses about how to reduce the performance overhead of a particular function.

  3. Accessing memory is far more expensive than other instructions.

  4. The string concatenation operator is expensive because a temporary string is created for the concatenation, which calls the memory manager many times to allocate memory for the temporary string.

  5. Copy-on-write: Also known as implicit sharing, the copying of internal resources is postponed until the first write, and when it is modified, the copying operation is performed.

  6. Use compound assignment operations to avoid temporary strings. An expression like result = result + s[i]; creates a temporary string for each concatenation; when the string is long this is even worse, and performance drops sharply. If result += s[i]; is used instead, all the calls to the memory manager that allocated a temporary string object to hold the concatenation result are removed. In short, compound assignment reduces calls to the memory manager.

  7. Reduce memory reallocation by reserving storage space. Calling reserve() removes repeated reallocations of the string, and also improves the cache locality of the data being read, so we get further gains from it.

  8. When a string is passed as a parameter, passing it by value usually makes the function copy the string, which wastes some efficiency. So we change the parameter from std::string s to std::string const& s, giving the function a constant reference as the parameter; the string cannot be modified through it and is not copied.
    In fact, this change alone brings no performance improvement. The reason is that a reference is implemented as a pointer, so each time the string is used inside the function it must be dereferenced; these extra dereferences may be enough to cause a performance drop. This leads to the next point.

  9. The solution is to use iterators . String iterators are simple pointers to character buffers. Using an iterator saves two dereference operations.
    Another advantage of this method is that the value of s.end() used to control the for loop is cached when the loop is initialized, which saves some overhead: the length is not recomputed on every iteration.

  10. Eliminate copying of return values. When a function returns a value, C++ may call the copy constructor to copy the result back into the calling context.

  11. When the program has extremely high performance requirements, the C++ standard library string need not be used; C-style string functions can be used instead. Note: except in some extremely restrictive embedded environments, it is no problem to declare a worst-case buffer of 1000 or even 10000 characters on the stack.
    This version is about six times faster than the previous one, but you must manage the temporary storage space yourself, which is error-prone.

  12. Use a better algorithm and reduce the number of for loops.

  13. Use the functions that come with C++ to optimize with a better algorithm (measured time: 0.65).

  14. Instead of creating a new string, modify the value of the argument string in place and return it as the result (measured time: 0.81).

  15. string_view: Contains an unowned pointer to a string and a value representing the length of the string.

  16. If a team feels that they need to make some changes to the strings they use, it is best to define a project-wide typedef at the beginning of the design:

typedef std::string MyString;
  1. Generally speaking, if string is used as the return value, it is best not to convert the return value back, that is, convert it to char* again, because a char* -> string conversion has most likely already been done inside the function. Converting back afterwards is just a waste of time.

for example:

std::string MyClass::Name() const
{
    return "MyString";
}

In the example above, a char* -> string conversion has in fact already been done: the literal "MyString" is converted to a std::string when it is returned.

Generally speaking, for a function like this it is better to make the return value a const char* and convert it to a string only where needed, avoiding meaningless back-and-forth conversions.

By the way, converting a string to a C-style string: const char* p = str.c_str(); (note that c_str() returns a const pointer into the string's own buffer).

  1. Treating strings as objects rather than values reduces the frequency of memory allocation and copying.
  2. Returning the function's result as a reference to the calling method via an output parameter reuses the actual parameter's storage, which may be more efficient than allocating new storage.

Chapter 5 - Optimization Algorithms

  1. Binary search is a commonly used algorithm with O(log n) time.

  2. When multiple linear-time algorithms are combined carelessly, their total time overhead may become O(n²).


  4. Optimization mode:

1. Precomputation: Remove calculations from hot parts by performing calculations before hot code.

2. Lazy evaluation: remove computations from certain code paths by performing them only when they really need to be performed.

3. Batch processing: multiple elements are calculated together at a time, instead of only one element at a time.

4. Caching: reduce the amount of computation by saving and reusing the results of expensive computations.

5. Specialization: reduce computation by removing unused generality.

6. Increase throughput: reduce loop-processing overhead by handling a large batch of data at a time.

  1. Two-phase construction: when an instance is constructed statically, the information needed to initialize the object is often not yet available. Deferring initialization until enough data is available means the object can always be initialized efficiently, in one step.

  2. Copy-on-write: When an object is copied, its dynamic member variables are not copied, but the dynamic variables are shared by two instances. Only when one of the instances wants to modify the variable will the actual copy be performed.

  3. A few examples of batch processing (goal: collect multiple jobs, process them together):

1. Buffered output: accumulate output until the buffer is full or the program encounters an end-of-line or end-of-file character.

2. The best way to convert an unsorted array to a heap is to build the entire heap at once, with only O(n) overhead.

3. Multi-threaded task queue

4. Saving or updating in the background is an example of using batch processing.

  1. A few examples of caches:

1. string will cache the length of the string and will not calculate it every time it is needed.

2. The thread pool caches those threads that are expensive to create.

3. Dynamic programming algorithm.

  1. Double-checking: first eliminate some possibilities with an inexpensive check, then, only if necessary, test the remaining possibilities with an expensive check.

Chapter 6 - Optimizing Dynamically Allocated Variables

  1. Removing calls to the memory manager from loop processing or frequently called functions can significantly improve performance.
  2. When adding trace printing to a function, before printing at its beginning and end, consider whether the function runs as a continuously running thread. If it does, do not print inside the thread's loop; printing at the start and end of the thread, together with the time, is fine. If the time inside the thread is hard to measure, don't print it.

Summary

To be updated! Welcome to bookmark this article~

reference


Origin blog.csdn.net/qq_43211060/article/details/124306630