Computer Organization Principles in Plain Language: Superscalar and VLIW, how to push CPU throughput above 1? (Lecture 26)

First, introduction

We are now more than halfway through the column. Over the past 20-odd lectures, much of what I have covered revolves around how to improve CPU performance. Let's first go back to Lecture 4. I wonder whether you still remember this formula:

Program CPU execution time = Number of instructions × CPI × Clock Cycle Time

This formula contains a metric called CPI (cycles per instruction). Its reciprocal is IPC (Instructions Per Clock), the number of instructions that can be executed within one clock cycle, which represents the throughput of the CPU. So how high can this metric go on the pipelined CPU architecture that we have optimized again and again over the previous lectures?

The answer is that, in the best case, IPC can only reach 1. No matter what pipeline-level optimizations we apply, even out-of-order execution of instructions, the CPU still fetches only one instruction per clock cycle.

This means that no matter how well instruction execution is optimized, at most one instruction completes per clock cycle, so CPI can be no lower than 1 and IPC no higher than 1. However, the Intel and ARM CPUs we use today can generally reach an IPC of 2 or more. How do they do it?
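To make the formula concrete, here is a minimal Python sketch. The instruction count and clock rate are made-up numbers, and `cpu_time` and `ipc` are hypothetical helper names chosen for illustration only.

```python
def cpu_time(instructions: int, cpi: float, cycle_time_ns: float) -> float:
    """Program CPU time in nanoseconds: instructions x CPI x clock cycle time."""
    return instructions * cpi * cycle_time_ns


def ipc(cpi: float) -> float:
    """IPC is the reciprocal of CPI."""
    return 1.0 / cpi


# Assumed workload: one million instructions on a 2 GHz clock (0.5 ns per cycle).
n = 1_000_000
cycle_ns = 0.5

best_scalar = cpu_time(n, cpi=1.0, cycle_time_ns=cycle_ns)  # IPC = 1
dual_issue = cpu_time(n, cpi=0.5, cycle_time_ns=cycle_ns)   # IPC = 2

print(best_scalar)  # 500000.0
print(dual_issue)   # 250000.0
print(ipc(0.5))     # 2.0
```

With the instruction count and the clock unchanged, halving CPI (doubling IPC) halves the execution time, which is why raising IPC above 1 is worth the effort.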

Today, let's take a look at the "black technology" that modern CPUs use.

Second, multiple issue and superscalar: executing two instructions at the same time

1, integer and floating-point circuits are separate inside the CPU

When we talked about CPU hardware earlier, we abstracted all arithmetic and logic operations into a "black box" called the ALU. You should also remember from Lectures 13 to 16, which covered adders, multipliers, and even floating-point computation, that integer and floating-point calculations actually differ quite a lot. In practice, the circuits for integer and floating-point computation are separate inside the CPU.

Up to and including the 80386, Intel's CPUs had no dedicated floating-point circuitry, and floating-point calculations were emulated in software. So in the 80386 era, Intel offered a separate 387 chip for the 386, dedicated to floating-point operations. The on-chip floating-point unit arrived with the next generation: the 486DX integrated it, while the 486SX shipped without a usable one. (The 386SX and 386DX differed in external bus width, not in floating-point hardware.)

2, how to achieve parallelism?

The Intel CPUs we use today work the same way. Although floating-point computation has become part of the CPU, not all calculations are done inside a single ALU; in reality, there are multiple ALUs. This is why, when we discussed out-of-order execution in Lecture 24, you saw that the execute stage of instructions is actually carried out in parallel by many functional units (FUs).

However, in out-of-order execution, the instruction fetch (IF) and instruction decode (ID) stages are not performed in parallel.

Since the execute stage can run in parallel, why not fetch and decode as well? And if we want them to be parallel, how do we do it?

In fact, fetch and decode can be made parallel in the same way, by adding hardware. We can fetch multiple instructions from memory at once, dispatch them to multiple parallel instruction decoders, and then hand the decoded instructions to different functional units for processing.
In this way, more than one instruction can complete in a single clock cycle, and IPC can be greater than 1.
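The effect of fetching and decoding several instructions per cycle can be sketched with an idealized model. This is only a toy calculation that assumes no stalls, no dependencies, and a fixed pipeline depth; `ideal_cycles` and `ideal_ipc` are hypothetical names, and real CPUs fall well short of these numbers.

```python
import math


def ideal_cycles(n_instructions: int, depth: int, width: int) -> int:
    """Cycles needed by an idealized pipeline that never stalls.

    The first group of up to `width` instructions takes `depth` cycles to
    pass through the pipeline; each later group finishes one cycle after
    the group before it.
    """
    groups = math.ceil(n_instructions / width)
    return depth + groups - 1


def ideal_ipc(n_instructions: int, depth: int, width: int) -> float:
    """Instructions completed per cycle under the same idealized model."""
    return n_instructions / ideal_cycles(n_instructions, depth, width)


# Assumed workload: 1000 instructions on a 5-stage pipeline.
for width in (1, 2, 4):
    print(width, f"{ideal_ipc(1000, depth=5, width=width):.3f}")
# 1 0.996
# 2 1.984
# 4 3.937
```

With an issue width of 1, IPC approaches 1 but never exceeds it. Only when several instructions enter the pipeline per cycle does the upper bound move above 1.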

This kind of CPU design is called multiple issue (Multiple Issue) and superscalar (Superscalar).
What is multiple issue? The term sounds abstract, but it simply means that, at the same moment, we issue (Issue) multiple instructions to different decoders or to subsequent pipeline stages.

Inside a superscalar CPU, there are many pipelines running in parallel, not just one. The word "superscalar" refers to the fact that, previously, we could perform only one scalar (Scalar) operation per clock cycle. With multiple issue, we can go beyond this limit and perform several computations at the same time.

You can see this in the schematic I drew of a superscalar pipeline design. Look closely and you will notice something interesting: the pipeline length differs from one functional unit to another. Different functional units do indeed have pipelines of different lengths. When we talk about a 14-stage pipeline, we usually mean the length of the pipeline for integer instructions. For floating-point operations, the actual pipeline is longer.

Third, Intel's failed product: Itanium's VLIW design

1, out-of-order execution and superscalar techniques are troublesome to implement in actual hardware

Both the out-of-order execution covered a few lectures ago and the superscalar technique covered just now are quite troublesome to implement in actual hardware. This is because, in out-of-order and superscalar designs, we rely on the CPU itself to resolve conflicts between instructions. These are the hazards we discussed in the previous few lectures.

Before executing instructions, the CPU must determine whether there are dependencies between them. If a dependency exists, the instructions cannot be issued to the execute stage together. For this reason, a superscalar CPU with the multiple-issue capability described above is also called a dynamic multiple-issue processor. Detecting these dependencies makes the CPU's circuitry more complex.
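The kind of check the hardware has to make before issuing two instructions in the same cycle can be sketched in a few lines of Python. This is a deliberately conservative software model under assumed conventions (each instruction writes one register and reads a few); `Instr` and `can_issue_together` are hypothetical names, and real issue logic is implemented in circuitry and is far more elaborate.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Instr:
    dest: str              # register written by the instruction
    srcs: Tuple[str, ...]  # registers read by the instruction


def can_issue_together(first: Instr, second: Instr) -> bool:
    """Return True if `second` may issue in the same cycle as `first`.

    Conservative model: any register conflict blocks dual issue.
    """
    raw = first.dest in second.srcs   # second reads what first writes
    waw = first.dest == second.dest   # both write the same register
    war = second.dest in first.srcs   # second overwrites what first reads
    return not (raw or waw or war)


add = Instr(dest="r1", srcs=("r2", "r3"))   # r1 = r2 + r3
mul = Instr(dest="r4", srcs=("r1", "r5"))   # r4 = r1 * r5, reads r1
sub = Instr(dest="r6", srcs=("r7", "r8"))   # r6 = r7 - r8, independent

print(can_issue_together(add, mul))  # False
print(can_issue_together(add, sub))  # True
```

In hardware, comparisons like these run for every pair of candidate instructions in every cycle, which is one reason the circuitry grows quickly as the issue width increases.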

2, can we move the work of analyzing and resolving dependencies into software?

So computer scientists and engineers came up with a bold idea. Instead of analyzing and resolving dependencies in hardware, could we do that work in software?

If you remember, I also said in Lecture 4 that the key to reducing CPU execution time is to break down this formula:

Program CPU execution time = Number of instructions × CPI × Clock Cycle Time

At that time, we said that, within this formula, the number of instructions is something we can optimize by improving the compiler. Next, let's look at a very bold CPU design idea called VLIW (Very Long Instruction Word). This design wants the compiler to optimize not only the number of instructions, but also CPI directly.

3, a famous "epic" failure

Around this design, Intel produced a famous "epic" failure: the well-known Itanium processor with its IA-64 architecture. This time, though, the responsibility for the failure is shared with another company, often regarded as the origin of Silicon Valley: HP.
It is called an "epic" failure because of the name HP originally gave the architecture, Explicitly Parallel Instruction Computing, whose abbreviation, EPIC, happens to mean "epic".

As it happens, the Itanium processor, like the Pentium 4 I introduced earlier, was a failure in the market. After 12 years of design and development, the first-generation Itanium sold only a few thousand units. The second-generation Itanium struggled on for 16 years from 2002, until Intel finally gave up in 2018 and withdrew it from the market. Since then, there have been no more "epic" servers in the world.

4, how does the very long instruction word design of the Itanium processor actually work?

So let's look at how the very long instruction word design of the Itanium processor actually works.

In out-of-order and superscalar CPU architectures, dependencies between instructions are detected by hardware circuits inside the CPU. In a very long instruction word architecture, this work is handed over to the compiler.

Starting from Lecture 5, I showed you many side-by-side comparisons of C code, the corresponding assembly, and machine code. During this translation, the compiler is in fact able to know the data dependencies between earlier and later instructions. So we can let the compiler reorder code that has no dependencies, and then pack several consecutive instructions that can run in parallel into one instruction bundle. An Itanium CPU packs 3 instructions into one bundle.
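The reordering and packing step can be illustrated with a small greedy packer. This is a toy sketch, not how a real Itanium compiler works: the instruction format, the names `conflicts` and `pack_bundles`, and the assumption that every result is ready by the next bundle are all simplifications made for illustration.

```python
from typing import List, Tuple

# (name, destination register, source registers): a hypothetical toy format
Instr = Tuple[str, str, Tuple[str, ...]]


def conflicts(earlier: Instr, later: Instr) -> bool:
    """True if `later` must stay behind `earlier` (RAW, WAW or WAR)."""
    _, e_dest, e_srcs = earlier
    _, l_dest, l_srcs = later
    return e_dest in l_srcs or e_dest == l_dest or l_dest in e_srcs


def pack_bundles(program: List[Instr], width: int = 3) -> List[List[str]]:
    """Greedily pack instructions into bundles of `width` slots."""
    pending = list(program)  # unscheduled instructions, in program order
    bundles = []
    while pending:
        chosen = []  # indices into `pending` picked for this bundle
        for i, cand in enumerate(pending):
            if len(chosen) == width:
                break
            # cand may move up only if it conflicts with nothing before it
            if not any(conflicts(prev, cand) for prev in pending[:i]):
                chosen.append(i)
        names = [pending[i][0] for i in chosen]
        bundles.append(names + ["nop"] * (width - len(names)))
        pending = [ins for i, ins in enumerate(pending) if i not in chosen]
    return bundles


program = [
    ("i1", "r1", ("r2", "r3")),  # r1 = r2 + r3
    ("i2", "r4", ("r1", "r5")),  # r4 = r1 * r5, needs i1
    ("i3", "r6", ("r7", "r8")),  # r6 = r7 + r8, independent
    ("i4", "r9", ("r4", "r6")),  # r9 = r4 - r6, needs i2 and i3
]
for bundle in pack_bundles(program):
    print(bundle)
# ['i1', 'i3', 'nop']
# ['i2', 'nop', 'nop']
# ['i4', 'nop', 'nop']
```

Here `i3` is moved ahead of `i2` because it depends on nothing before it, and slots that cannot be filled are padded with `nop`.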

At run time, the CPU no longer fetches one instruction at a time but one instruction bundle. It then decodes the whole bundle and executes the 3 decoded instructions in parallel. As you can see, a CPU with a very long instruction word architecture is still pipelined. In other words, a group (Group) of instructions still passes through multiple clock cycles. Likewise, the next group does not wait until the previous group has finished executing; its fetch begins while the previous group is in the decode stage.

One point worth noting is that, in a very long instruction word design, pipeline stalls are largely arranged by the compiler. Short of stopping the entire processor pipeline, a VLIW CPU cannot pause for a clock cycle to wait for an earlier operation that an instruction depends on. The compiler must insert NOP operations at the appropriate places, so that pipeline stalls are already arranged at the software level, directly in the compiled machine code.
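How a compiler might place NOPs can be sketched as follows. This is a simplified single-slot model with invented latencies (a load whose result is usable 3 cycles after issue, an add after 1); the name `insert_nops` and the instruction format are hypothetical, and a real VLIW compiler schedules whole bundles rather than single slots.

```python
from typing import Dict, List, Tuple

# (name, operation, destination register, source registers): toy format
Instr = Tuple[str, str, str, Tuple[str, ...]]


def insert_nops(program: List[Instr], latency: Dict[str, int]) -> List[str]:
    """Emit one slot per cycle, padding with NOPs until operands are ready."""
    ready_at: Dict[str, int] = {}  # register -> first cycle it can be read
    schedule: List[str] = []       # index in this list is the issue cycle
    for name, op, dest, srcs in program:
        # Earliest cycle at which every source register is available.
        earliest = max((ready_at.get(s, 0) for s in srcs), default=0)
        while len(schedule) < earliest:
            schedule.append("nop")  # the stall is baked into the code
        ready_at[dest] = len(schedule) + latency[op]
        schedule.append(name)
    return schedule


program = [
    ("load r1", "load", "r1", ()),
    ("add r2", "add", "r2", ("r1",)),  # must wait for the load
    ("add r3", "add", "r3", ("r2",)),  # result of the previous add is ready
]
print(insert_nops(program, {"load": 3, "add": 1}))
# ['load r1', 'nop', 'nop', 'add r2', 'add r3']
```

The hardware simply executes the slots in order. The waiting that a dynamically scheduled CPU would handle with stall logic is already encoded as the two `nop` slots.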

5, Itanium failed for many reasons; an important one is compatibility

Although the idea behind Itanium was elegant, and Intel once hoped that Itanium would become the next-generation replacement for the x86 architecture, Itanium ultimately failed after nearly 30 years of back and forth. In 2018, Intel announced that the Itanium 9500 would be discontinued in 2021.
Itanium failed for many reasons, and an important one is compatibility.

On the one hand, the instruction sets of Itanium and x86 are different. This means that none of the existing x86 programs can run on Itanium directly; they all have to be recompiled by a compiler.

On the other hand, the VLIW architecture means that if Itanium wants to increase its degree of parallelism, it has to increase the number of instructions in a bundle, say from 3 to 6. Once that happens, programs have to be recompiled even though the CPU is still a VLIW Itanium with the same instruction set. The original compiler resolved dependencies for bundles of 3 instructions, whereas now each bundle holds 6. The compiler has to recompile the program, reordering instructions and placing NOP operations again to satisfy the new constraints. We may even need to rewrite the compiler before programs can run on the new CPU.

So Itanium became a CPU that was easy to make neither forward compatible nor backward compatible, and its failure is not surprising. We can see that an advanced technical idea still has to stand up to concrete, practical tests in industry. Whether it is the forward and backward compatibility of the instruction set or the future extensibility of the CPU, practical factors deserve more consideration at design time.

Fourth, summary and extension

In this lecture, I took you through a new challenge in CPU performance: pushing the CPU's throughput, that is, its IPC, above 1.

I first introduced superscalar (Superscalar) design. A superscalar CPU runs instructions in parallel not only in the execute stage but also in the fetch and decode stages. With superscalar techniques, the CPU you use can reach an IPC greater than 1.

Among Intel's x86 CPUs, superscalar techniques were first introduced in the Pentium era, lifting overall CPU performance to a new level, and they have remained in use ever since. Like the pipelining techniques you saw earlier, superscalar design relies on hardware to detect dependencies between instructions and to resolve hazards. That makes the CPU's circuitry more complex.

Because of this complexity, HP and Intel jointly launched the famous Itanium processor. In this design, the compiler analyzes the dependencies between instructions directly. As a result, the compiled code that the hardware receives already contains instructions reordered into a suitable sequence. Instructions that can be executed in parallel are packed together to form an instruction bundle. When the Itanium processor fetches and decodes, it no longer handles a single instruction but a whole bundle. In the execute stage, all instructions in the bundle can be executed in parallel.

VLIW looks more disruptive at the technical level: it does not merely change the hardware but combines hardware with the compiler at the software level to increase the CPU's instruction throughput. Ultimately, however, VLIW was not accepted by the market or the industry.

The Itanium processor jointly developed by HP and Intel was ill-fated. Development began in 1989, and the first-generation Itanium processor was not released until 2001. Yet after those 12 years of development, the first-generation Itanium ultimately sold only a few thousand units. The Itanium 2 processor released in 2002 could not save its fate either. Finally, in 2018, Intel announced that Itanium would leave the market. Since then, there has been no large-scale commercial processor with a VLIW architecture on the market.

Origin www.cnblogs.com/luoahong/p/11441329.html