Why should programmers learn about the CPU?


Foreword

Hello everyone, I am Xiao Peng.

In the previous article, we covered the computer's von Neumann architecture and its five major components: the controller, the arithmetic unit, memory, input devices, and output devices. In modern computer systems, the CPU is the core of the whole machine, and it mainly comprises the controller and the arithmetic unit.

In follow-up articles, we will start from a basic understanding of the CPU and gradually connect it with the execution system, the storage system, and the I/O system. Stay tuned.




Mind map:


1. Understand the CPU central processing unit

1.1 What is a CPU?

The central processing unit (Central Processing Unit, CPU), also known as the main processor, is the core of the entire computer and one of its most expensive components.

From a hardware perspective, the CPU is a very-large-scale integrated circuit made up of transistors.

From a functional perspective, the CPU consists of four parts: the clock, registers, the control unit, and the arithmetic logic unit.

  • 1. Clock (Clock): emits the clock signal that paces the CPU; it may also sit outside the CPU;
  • 2. Register (Register): temporarily stores instructions or data, and sits at the top of the memory-hierarchy pyramid. Registers help bridge the speed gap between the CPU and main memory, reduce the number of memory accesses, and improve CPU throughput;
  • 3. Control Unit (Control Unit): directs the execution of program instructions, fetching instructions and data from main memory into registers, and writing results computed by the arithmetic unit back to main memory;
  • 4. Arithmetic Logic Unit (ALU): executes the instructions fetched by the control unit, performing arithmetic and logic operations.

Von Neumann architecture

—— Picture quoted from Wikipedia

1.2 Why study CPU?

Most programmers spend their days working with high-level languages such as Java and C++ and never deal with the CPU directly. So why spend so much time learning about it? I see the following reasons:

  • Reason 1 - Mastering CPU principles helps you write higher-performance programs: understanding how the CPU works helps you design faster algorithms and code. For example, avoiding false sharing and improving cache hit rates both require some understanding of the CPU cache mechanism;

  • Reason 2 - Accumulating reusable design ideas: the CPU is the most complex module in the entire computer system and a commanding height of contemporary computer science. The solutions used inside the CPU can inspire answers to similar problems you meet later. For example, CPU cache eviction policies closely resemble application-level memory cache eviction policies;

  • Reason 3 - The CPU is the foundation of the knowledge system: when we reason about or solve a problem, we often need deeper background knowledge to explain it, and the CPU sits at the bottom of that stack. For example, memory visibility in the storage system, IO_WAIT in the execution system, and thread pool design all require some understanding of how the CPU executes instructions.
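The cache-eviction analogy in Reason 2 can be made concrete. The sketch below is a minimal application-level LRU cache built on java.util.LinkedHashMap's access-order mode; it illustrates the least-recently-used eviction idea that many CPU caches approximate, and is not a model of any real CPU cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: the same least-recently-used eviction idea that
// CPU caches approximate, applied at the application level.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        // accessOrder = true: iteration order follows access recency
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once capacity is exceeded
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<Integer, String> cache = new LruCache<>(2);
        cache.put(1, "a");
        cache.put(2, "b");
        cache.get(1);      // touch key 1: it becomes most recently used
        cache.put(3, "c"); // evicts key 2, the least recently used
        System.out.println(cache.containsKey(1)); // true
        System.out.println(cache.containsKey(2)); // false
    }
}
```

The same pattern appears in Android's `LruCache` and in many in-memory caching libraries, which is exactly the kind of idea transfer Reason 2 describes.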

CPU


1.3 General-purpose processors and special-purpose processors

Early computer systems had only one general-purpose processor, which handled every computing task. Later, people found that some computing tasks could be split off and given a purpose-built chip microarchitecture that is far more efficient than a general-purpose processor. The most typical special-purpose processor is the GPU (graphics processing unit).

A processor dedicated to certain computing tasks is a special-purpose processor. Why are special-purpose processors faster at specific problems? I see three explanations:

  • 1. Optimal architecture: a special-purpose processor handles only a few kinds of work, so its chip architecture can be optimized for those specific tasks. A general-purpose processor can only aim for a globally good architecture, which is not necessarily optimal for any particular task;
  • 2. Hardware acceleration: work that would take many instructions can be implemented directly in hardware, saving many instruction cycles compared with a CPU executing instructions one by one;
  • 3. Lower cost: the computation a special-purpose processor performs is fixed, so it does not need features such as pipeline control or out-of-order execution, and it achieves the same computing performance at a lower cost.

Modern computer architecture pairs a general-purpose processor with multiple special-purpose processors. This design, which uses different computing units for different types of computing tasks, is called heterogeneous computing (Heterogeneous Computing).

Multi-processor architecture


2. Instruction Set Architecture ISA

2.1 What is an instruction set architecture?

The machine language a CPU can understand consists of instructions (Instruction Code), and the set of all instructions a CPU can understand is its instruction set (Instruction Set).

To ensure compatibility between chips, manufacturers do not design a new instruction set for every new chip; instead, they promote the instruction set as a standard specification. This specification is the instruction set architecture (Instruction Set Architecture, ISA).

Relative to the instruction set architecture, the hardware circuit design that implements the instruction set is the microarchitecture (Micro Architecture). In software terms, the ISA is the CPU's functional interface, defining its standard specification, while the microarchitecture is the CPU's implementation, defining its concrete circuit design. One instruction set can be implemented by multiple different microarchitectures.
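The interface/implementation analogy above can be sketched in code. The names MicroArchA and MicroArchB below are purely hypothetical; the point is only that two different implementations can honor the same instruction contract.

```java
// Illustrative analogy only: the ISA as a Java interface, with two
// hypothetical "microarchitectures" implementing the same contract.
interface InstructionSetArchitecture {
    int add(int a, int b); // one "instruction" in the contract
}

class MicroArchA implements InstructionSetArchitecture {
    // A straightforward internal realization
    public int add(int a, int b) { return a + b; }
}

class MicroArchB implements InstructionSetArchitecture {
    // A deliberately different internal realization, same observable result
    public int add(int a, int b) { return Math.addExact(a, b); }
}

public class IsaDemo {
    public static void main(String[] args) {
        // Software written against the ISA runs on either microarchitecture
        for (InstructionSetArchitecture cpu : new InstructionSetArchitecture[]{
                new MicroArchA(), new MicroArchB()}) {
            System.out.println(cpu.add(2, 3)); // 5 on both
        }
    }
}
```

Just as callers of the interface never see which class they got, compiled programs never see which microarchitecture executes their instructions.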

2.2 Two mainstream instruction set architectures

Because the CPU sits at the lowest and most central layer of the entire computer system, a break in CPU compatibility would mean previously developed applications, and even operating systems, could no longer run on the new CPU, which would be fatal to a chip manufacturer's ecosystem. Instruction set architectures are therefore quite stable, and manufacturers are very cautious about adding or removing instructions in an ISA.

At present, only two ISAs hold meaningful market share, and they represent the two directions of development, complex and reduced:

  • x86 architecture: the complex instruction set architecture introduced by Intel in the 1970s;
  • ARM architecture: the reduced instruction set architecture introduced by ARM in the 1980s. The familiar Apple M1, Huawei Kirin, and Qualcomm Snapdragon chips are all ARM-based (in fact, ARM does not manufacture chips itself; it operates on a technology-licensing model).

2.3 Complex instruction set and reduced instruction set

During the development of CPU instruction sets, two types emerged:

  • Complex instruction set (Complex Instruction Set Computer, CISC): emphasizes that a single instruction can perform several basic operations at once, so a small number of instructions can accomplish a large amount of work, with higher execution efficiency;
  • Reduced instruction set (Reduced Instruction Set Computer, RISC): emphasizes that a single instruction performs only one or a few basic operations, with no overlap or redundancy between instructions, so the same work requires more instructions.

In early computer systems, instruction sets were generally simple, with no distinction between complex and reduced. As application software grew richer, the application layer pushed chip architects to introduce more powerful instructions to simplify programming and improve performance. For example, some audio/video-oriented instructions can encode or decode multiple pieces of data within a single instruction.

At the time this was indeed a good choice. The reason is that the speed gap between the CPU and main memory was too large, and implementing program functionality with fewer instructions (higher instruction density) reduces the number of memory accesses. Thanks to this, the advantages of complex instruction sets over reduced instruction sets were almost across the board:

  • Advantage 1: programs occupy less memory and disk space;
  • Advantage 2: less bandwidth is needed to fetch instructions from memory or disk, improving the transfer efficiency of the bus system;
  • Advantage 3: the CPU L1 cache can hold more instructions, improving the cache hit rate. Since multiple threads on a modern computer share the L1 cache, fewer instructions are even more beneficial to the hit rate;
  • Advantage 4: the CPU L2 cache can hold more data, which also improves the cache hit rate for programs that operate on large amounts of data.

However, these advantages come at a price:

  • Disadvantage 1 - Complicated processor design: the more complex the instructions, the more complex the circuitry needed to decode them, and the higher the cost in performance and power consumption;
  • Disadvantage 2 - Overlapping instruction functionality: many added instructions overlap in function, violating the orthogonality principle of instruction set design. Many of the added complex instructions see very low usage, yet the processor pays a disproportionate design cost for them;
  • Disadvantage 3 - Non-uniform instruction length: variable-length instructions do allow Huffman-style encoding to raise instruction density further (short encodings for frequent instructions, longer encodings for infrequent ones), but instructions of different lengths also take different times to execute, which makes a pipelined structure hard to implement.

Therefore, by the 1980s the reduced instruction set (RISC) gradually surfaced. Today most low-power and mobile systems adopt RISC architectures, for example Android devices, Apple Silicon Macs, and ARM-based Microsoft Surface models.

Compared with complex instruction sets, reduced instruction sets put more emphasis on "orthogonality": a single instruction performs only one or a few basic operations, with no repeated or redundant functionality between instructions. Moreover, every instruction has the same length, which makes a pipelined structure much easier to implement.

A common misconception online is that the reduced instruction set shrinks the size of the instruction set. That is wrong; the accurate statement is that it reduces the complexity of the instructions.

To sum up: with its higher instruction density, the complex instruction set is overall better in terms of performance (memory/disk usage, CPU cache hit rate, TLB miss rate), while the reduced instruction set sacrifices instruction density for a simpler processor architecture, balancing performance against power consumption.

Instruction set type      | CISC        | RISC
Number of instructions    | many        | relatively few
Instruction length        | variable    | fixed
Instruction functionality | overlapping | orthogonal
Examples                  | x86         | ARM, MIPS

3. CPU performance indicators

3.1 Execute system parameters

  • 1. Clock frequency (Frequency/Clock Rate): the CPU contains an oscillator crystal (Oscillator Crystal) that sends a signal to the control unit at a fixed rate; this signal frequency is the CPU's clock frequency. Clock frequency is the CPU's most important parameter: the higher the frequency, the more instructions the computer can complete per unit time. The clock frequency is not fixed; at runtime the CPU can run at reduced frequency, full frequency, or even overclocked, but the higher the operating frequency, the higher the power consumption;

  • 2. Clock cycle (Clock Cycle): the flip side of the clock frequency, i.e., the interval between the crystal oscillator's signals: clock cycle = 1 / clock frequency;

  • 3. Front-side bus (FSB) frequency: the clock frequency the motherboard provides to the CPU. In early computers the CPU clock frequency equaled the FSB frequency, but as CPU frequencies climbed, other devices could not keep up, so today the two are no longer equal;

  • 4. Program execution time:

    • 4.1 Elapsed time (Wall Clock Time / Elapsed Time): the time that passes from program start to program end;

    • 4.2 CPU time (CPU Time): the time the CPU actually spends executing the program, counting only the time slices the program receives (user time + system time). Because the CPU switches among multiple tasks, a program's CPU time is usually less than its elapsed time;

    • 4.3 User time (User Time): the CPU time the program spends executing in user mode;

    • 4.4 System time (Sys Time): the CPU time the program spends executing in kernel mode.
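The clock cycle formula and the wall-time/CPU-time distinction above can be sketched as follows. The 2 GHz frequency is a hypothetical example, and the CPU-time reading assumes the JVM supports ThreadMXBean thread timing (most desktop JVMs do).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ClockCycle {
    // clock cycle = 1 / clock frequency, expressed here in nanoseconds
    static double periodNanos(double frequencyHz) {
        return 1.0e9 / frequencyHz;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical 2 GHz CPU: each cycle lasts 0.5 ns
        System.out.println(periodNanos(2.0e9));

        // Wall time vs CPU time: sleeping advances wall-clock time but
        // consumes almost no CPU time, because the thread holds no time slice.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();
        long cpuStart = mx.getCurrentThreadCpuTime();
        Thread.sleep(100);
        long wallUsed = System.nanoTime() - wallStart;
        long cpuUsed = mx.getCurrentThreadCpuTime() - cpuStart;
        System.out.println(wallUsed > cpuUsed); // wall time exceeds CPU time here
    }
}
```

The same distinction is what the Unix `time` command reports as `real` versus `user` + `sys`.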

3.2 Storage system parameters

  • Word length (Word): the basic unit of data the CPU can process at once. A CPU's bit width refers to its word length: a 32-bit CPU has a 32-bit word, and a 64-bit CPU has a 64-bit word;

  • Address bus width (Address Bus Width): the address bus carries address signals, and its width determines the CPU's addressing capability, i.e., how much memory space it can access at most. For example, a 32-bit address bus can address 4 GB of space;

  • Data bus width (Data Bus Width): the data bus carries data signals, and its width determines how much information the CPU can transfer at a time.
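The 4 GB figure for a 32-bit address bus follows from a one-line computation: an n-bit bus can name 2^n distinct byte addresses. A small worked example:

```java
public class AddressSpace {
    // An n-bit address bus can name 2^n distinct byte addresses
    static long addressableBytes(int busWidthBits) {
        return 1L << busWidthBits;
    }

    public static void main(String[] args) {
        long space = addressableBytes(32);
        System.out.println(space);                            // 4294967296
        System.out.println(space == 4L * 1024 * 1024 * 1024); // true: exactly 4 GiB
    }
}
```

This is also why 32-bit operating systems cannot make use of more than 4 GB of physical memory without extensions such as PAE.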

A few other capacity units to distinguish:

  • Byte: the basic unit of computer data storage; even a single bit must be stored in a full byte;

  • Block: the basic unit in which the CPU cache manages data, also called a CPU cache line;

  • Segment/Page: the basic units in which the operating system manages virtual memory.

Related article: What does a computer's memory pyramid look like?


4. Factors Affecting CPU Performance

As the core component of the computer, the CPU will inevitably keep moving toward higher performance. When looking at the CPU, we also need to keep the whole picture in view:

  • 1. Improving CPU performance is not the CPU's job alone: a computer is a complex system of many components, and discussing a part in isolation from the whole is meaningless;
  • 2. Balance performance against power consumption: generally, the higher a CPU's computing performance, the greater its power consumption. We must weigh the two together; discussing performance without power consumption is meaningless.

4.1 Increase CPU frequency

Raising the clock frequency has the most direct impact on CPU performance, and for decades the main direction of CPU development was precisely how to raise it.

In recent years, however, clock-frequency growth has hit a bottleneck. To get a faster clock, you must either run the CPU at full frequency or overclock it, or upgrade the manufacturing process to squeeze more transistors into the same area. Both approaches increase CPU power consumption, causing battery life and heat dissipation problems; until those are solved, the frequency bottleneck cannot be broken.

The clock-frequency bottleneck

—— Picture quoted from Wikipedia

4.2 Multi-core parallel execution

Since single-core CPU performance has hit a bottleneck, what if we put 2, 4, or even more cores on the CPU chip at once: wouldn't the chip's overall performance multiply accordingly?

The ideal is appealing, but in reality performance does not always scale linearly with core count. With few cores, adding parallelism yields a roughly linear speedup, but beyond a certain point the gain approaches a limit: increasing parallelism has a bottleneck of its own.

Why? Because no matter how parallel a program is, there is ultimately a result-aggregation step, and that step cannot run in parallel; it must run serially. For example, when we use Java's Fork/Join framework to decompose a big task into subtasks executed in parallel, we still have to merge the subtasks' results serially at the end.
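The Fork/Join decomposition just mentioned can be sketched with java.util.concurrent.RecursiveTask: the subtasks run in parallel, while the final combination of results is serial. The threshold and array contents below are arbitrary illustration choices.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sum an array by splitting it into subtasks that run in parallel;
// merging the two halves' results at the end is inherently serial.
public class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1000; // below this, compute directly
    private final long[] data;
    private final int lo, hi;

    ForkJoinSum(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) / 2;
        ForkJoinSum left = new ForkJoinSum(data, lo, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, hi);
        left.fork();                     // run the left half in parallel
        long rightSum = right.compute(); // compute the right half in this thread
        return left.join() + rightSum;   // serial merge of the two results
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        long sum = ForkJoinPool.commonPool().invoke(new ForkJoinSum(data, 0, data.length));
        System.out.println(sum); // 50005000 = 10000 * 10001 / 2
    }
}
```

However many worker threads the pool has, the `left.join() + rightSum` additions form the serial component that Amdahl's Law says limits the overall speedup.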

This conclusion is captured by an empirical law, Amdahl's Law, which describes the speedup gained from parallelizing a program across processors. Call the serial part the serial component W_s and the parallel part the parallel component W_p. It is the serial component that limits the achievable speedup: the larger the serial component, the lower the limit.

  • With p processors, the execution time after parallelization is W_s + W_p / p
  • The speedup is (W_s + W_p) / (W_s + W_p / p); as the degree of parallelism p tends to infinity, the speedup approaches the limit (W_s + W_p) / W_s

Effect of the degree of parallelism and the parallel component on speedup

—— Image from Wikipedia

Explanation: take the green curve as an example: with a parallel component of 95% and a serial component of 5%, the speedup limit is 20×.
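The formulas above can be turned into a small calculator. Plugging in the green curve's values (95% parallel, 5% serial) reproduces the 20× limit; the chosen parallelism degrees are arbitrary examples.

```java
public class Amdahl {
    // Amdahl's Law: speedup = (Ws + Wp) / (Ws + Wp / p)
    static double speedup(double serial, double parallel, double p) {
        return (serial + parallel) / (serial + parallel / p);
    }

    public static void main(String[] args) {
        double serial = 0.05, parallel = 0.95; // the green curve's values
        System.out.println(speedup(serial, parallel, 2));    // well below 2x
        System.out.println(speedup(serial, parallel, 1024)); // approaching the limit
        // Limit as p -> infinity: (Ws + Wp) / Ws = 1 / 0.05, i.e. 20x
        System.out.println((serial + parallel) / serial);
    }
}
```

Note how doubling the processor count from 2 already fails to double the speedup: the fixed 5% serial component dominates as p grows.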

4.3 Instruction Reordering

Increasing the core count is the most direct way to raise parallelism, but it is not the only way.

To improve parallelism, modern CPUs reorder program instructions to some extent, under the constraint of preserving single-threaded data dependencies. And it is not only the CPU: from source code to instruction execution, there are three levels of reordering:

  • 1. Compiler reordering: for example, hoisting an operation that is computed repeatedly inside a loop so it executes once before the loop;
  • 2. Processor reordering: for example, instruction-level parallelism overlaps the execution of multiple instructions, and branch prediction executes branch instructions ahead of time, parking the results in a hardware buffer (the Reorder Buffer); once the program actually takes the branch, the cached results are used directly;
  • 3. Memory system reordering: for example, the store buffer and invalidation queue mechanisms create visibility problems, which, viewed from memory's perspective, are also a form of instruction reordering.

Types of instruction reordering

Related article: 12 diagrams to understand CPU cache coherence and the MESI protocol

4.4 SoC chip - on-chip and off-chip dual-bus structure

As chip integration technology advanced, the five major components of the von Neumann architecture (arithmetic unit, controller, memory, and input/output device interfaces) could be integrated onto a single chip, forming something close to a complete computer system. Such a chip is called a System on Chip (SoC). An SoC gathers the components previously spread across the motherboard onto one chip, so information transfer between components over the bus is more efficient.

Related Article: Diagramming the Internal Computer Highway - The Bus System


5. Summary

Today we briefly covered the basic concepts of the CPU; many topics were only scratched at the surface. In subsequent articles, we will tie the CPU together from three perspectives: the execution system, the storage system, and the I/O system. Stay tuned.



Origin blog.csdn.net/pengxurui/article/details/128125176