[Reposted] It turns out that the CPU has done so much to optimize program performance


This article covers memory barriers and CPU caches, so that we can understand what the CPU does to optimize program performance.

First look at the CPU cache:

CPU cache

The CPU cache exists to improve program performance. Processor designers have made many internal architectural adjustments, and the cache is one of them. Everyone knows that because the hard disk is slow, data can be loaded into memory to speed up access; the CPU has a similar mechanism. Wherever possible, the cost of accessing main memory is replaced by an access to the CPU cache, which is many times faster than main memory. This is the mechanism most processors use today: the cache improves performance.

Multilevel cache

A CPU's cache is divided into three levels, so a multi-core CPU has multiple caches. Let's look first at the level 1 cache (L1 Cache):

L1 Cache is the CPU's first-level cache, split into a data cache and an instruction cache. On a typical server CPU, the L1 cache is small and fast, usually on the order of 32-64 KB per core.

Because the L1 cache capacity is limited, a second, larger high-speed memory was added to speed up the CPU further: the L2 cache.

Because the capacity of L1 and L2 is still limited, a third-level cache was introduced, and L3 is now built into the processor. Its role is to further reduce memory latency when computing on large amounts of data. A processor with a larger L3 cache benefits from more effective file-system caching and shorter message and processor queue lengths. Usually multiple cores share one L3 cache.

When the CPU reads data, it looks first in the L1 Cache, then the L2 Cache, then the L3 Cache, then main memory, and finally external storage (the hard disk).

In the CPU cache hierarchy, the closer a cache level is to the CPU core, the smaller its capacity and the faster it is. The CPU cache consists of cache lines; the cache line is the smallest unit in the CPU cache. A cache line is usually 64 bytes (always a power of two; it varies from 32 to 64 bytes on different machines), and each cache line maps to a block of addresses in main memory.
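
The effect of cache lines can be seen with a simple experiment (a sketch; the class and method names are illustrative, and exact timings depend on the machine). Traversing a 2D array row by row touches consecutive bytes that share a cache line, while traversing column by column jumps a whole row ahead on every access, touching a different cache line almost every time. Both orders compute the same sum, but the row-major version is typically several times faster:

```java
public class CacheLineDemo {

    // Row-major traversal: consecutive elements share a cache line,
    // so most accesses hit the cache.
    static long sumRowMajor(int[][] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[i].length; j++)
                sum += a[i][j];
        return sum;
    }

    // Column-major traversal: each access jumps a full row ahead,
    // landing on a different cache line almost every time.
    static long sumColMajor(int[][] a) {
        long sum = 0;
        for (int j = 0; j < a[0].length; j++)
            for (int i = 0; i < a.length; i++)
                sum += a[i][j];
        return sum;
    }

    public static void main(String[] args) {
        int n = 2048;
        int[][] a = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] = 1;

        long t1 = System.nanoTime();
        long s1 = sumRowMajor(a);
        long t2 = System.nanoTime();
        long s2 = sumColMajor(a);
        long t3 = System.nanoTime();

        System.out.println("row-major:    sum=" + s1 + " in " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("column-major: sum=" + s2 + " in " + (t3 - t2) / 1_000_000 + " ms");
    }
}
```

The sums are identical; only the memory-access pattern differs, which is exactly the cache-line effect described above.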

After multiple CPUs read and cache the same data and then perform different computations on it, which CPU's result ends up written back to main memory? This is decided by the cache coherence protocol:

Cache coherence protocol

For this cache write-back scenario, CPU manufacturers agreed on a common protocol: the MESI protocol. It gives each cache line a status field with the following four states:

  • Modified: the cache line has been modified (a "dirty" line); its content differs from main memory and exists only in this cache;
  • Exclusive: the cache line's content matches main memory, and it does not appear in any other cache;
  • Shared: the cache line's content matches main memory, and it also appears in other caches;
  • Invalid: the cache line's content is invalid (an empty line).
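
The four states and the main transitions between them can be sketched as a small state machine (a simplified model, not a cycle-accurate one; the class name and the `sharedOnBus` parameter are illustrative). Each cache line reacts both to its own processor's reads and writes and to the bus traffic of other processors:

```java
public class MesiLine {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    State state = State.INVALID;

    // Local read: an INVALID line is fetched from memory; the bus
    // tells us whether another cache already holds a copy.
    State localRead(boolean sharedOnBus) {
        if (state == State.INVALID)
            state = sharedOnBus ? State.SHARED : State.EXCLUSIVE;
        return state; // M, E, S: a read hit leaves the state unchanged
    }

    // Local write: the line becomes MODIFIED, and an invalidate
    // message is broadcast so other caches drop their copies.
    State localWrite() {
        state = State.MODIFIED;
        return state;
    }

    // Another processor reads this address: a MODIFIED line is
    // written back first, then both copies become SHARED.
    State remoteRead() {
        if (state == State.MODIFIED || state == State.EXCLUSIVE)
            state = State.SHARED;
        return state;
    }

    // Another processor writes this address: our copy is now stale.
    State remoteWrite() {
        state = State.INVALID;
        return state;
    }

    public static void main(String[] args) {
        MesiLine line = new MesiLine();
        System.out.println(line.localRead(false)); // EXCLUSIVE
        System.out.println(line.localWrite());     // MODIFIED
        System.out.println(line.remoteRead());     // SHARED
        System.out.println(line.remoteWrite());    // INVALID
    }
}
```

The `remoteWrite` transition is the notification mechanism described next: writing on one CPU invalidates every other CPU's copy of the line.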

On a multi-processor system, when a single CPU changes data in its cache, it must notify the other CPUs. In other words, each CPU must both control its own reads and writes and monitor the notifications sent by other CPUs, to guarantee eventual consistency.

Instruction reordering at runtime

In addition to the cache, the CPU's other performance optimization is instruction reordering at runtime. Consider the following example:

Take the code x = 10; y = z;. In program order, this writes 10 to x, reads the value of z, and then writes that value to y. In practice, the CPU may execute it differently: read z first, write its value to y, and only then write 10 to x. Why does this change happen?

Because when the CPU goes to write the cache and finds the cache region occupied by another CPU (for example, in the L3 cache), it may execute the subsequent read instruction first to improve processing performance.

Instruction reordering is not arbitrary; it must obey as-if-serial semantics: no matter how the compiler and processor reorder instructions to improve parallelism, the result of a single-threaded program must not change. The compiler, runtime, and processor must all comply with as-if-serial semantics, which also means the compiler and processor never reorder operations that have a data dependency.
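
The x = 10; y = z; example above can be written out both ways to see why the reordering is legal in a single thread, and why a data dependency pins the order (a sketch; the variable names follow the example in the text):

```java
public class AsIfSerialDemo {
    public static void main(String[] args) {
        int z = 42;

        // Program order: write x first, then read z into y.
        int x = 10;
        int y = z;

        // A legal reordering: x = 10 and y = z touch different
        // variables with no data dependency, so the processor may
        // perform the read of z before the write of x.
        int y2 = z;
        int x2 = 10;

        // As-if-serial: the single-threaded result is identical.
        System.out.println(x == x2 && y == y2); // true

        // With a data dependency the order is fixed:
        // b = a * 2 reads a, so it cannot move before a = z + 1.
        int a = z + 1; // 43
        int b = a * 2; // 86
        System.out.println(b); // 86
    }
}
```

In one thread nobody can tell the two orders apart; the trouble described next only appears when a second thread observes x and y in between.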

This leaves the following two problems:

  1. The problem under CPU caching:

The data in the caches and the data in main memory are not synchronized in real time, nor are the caches of different CPUs (or CPU cores) synchronized with each other, so each CPU may see a different value at the same memory address.

  2. The problem under CPU reordering optimization:

Although as-if-serial semantics are obeyed, they only guarantee a correct result when a single CPU executes the code by itself. With multiple cores and multiple threads, the instruction stream carries no cross-thread cause-and-effect information, so instructions may execute out of order and produce an incorrect program result.

To solve these two problems, we need to talk about the memory barrier:

Memory barrier

The processor provides two memory barrier instructions to solve the two problems above:

Store Memory Barrier: a store barrier inserted after an instruction forces the latest data in the write cache to be flushed to main memory, making it visible to other threads. Because the write to main memory is explicit, the CPU will not reorder instructions across it for performance reasons.

Load Memory Barrier: a load barrier inserted before an instruction invalidates the data in the cache and forces a fresh load from main memory. Forcing reads from main memory keeps the CPU cache consistent with main memory and avoids the consistency problems caused by caching.

Java has similar mechanisms, such as synchronized and volatile, which are built on the memory barrier principle.
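
A minimal sketch of volatile acting as a pair of barriers (class and field names are illustrative): the writer's volatile store publishes the plain write that precedes it, and the reader's volatile load guarantees it sees that write afterward. This is the happens-before rule of the Java Memory Model, so the reader below always prints 42:

```java
public class VolatileBarrierDemo {
    static int data = 0;                   // plain field
    static volatile boolean ready = false; // volatile flag

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            data = 42;    // plain write
            ready = true; // volatile write: like a store barrier,
                          // it publishes data to other threads
        });
        Thread reader = new Thread(() -> {
            while (!ready) { /* volatile read: like a load barrier */ }
            // Happens-before: once ready is true, data == 42 is visible.
            System.out.println("data = " + data);
        });
        reader.start();
        writer.start();
        reader.join();
        writer.join();
    }
}
```

Without volatile on ready, the reader could spin forever on a stale cached flag, or see ready == true while data is still 0, which is exactly the cache and reordering problem the barriers exist to prevent.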

Summary

This article introduced the optimizations the CPU makes to improve program performance: caching and runtime instruction reordering, and finally the memory barrier knowledge that keeps them correct.

Reference http://dwz.win/7ps


Origin www.cnblogs.com/jinanxiaolaohu/p/12683635.html