Multiscalar Processors: Paper Translation

Pushing myself to read papers.
Abstract
Multiscalar processors use a new, aggressive implementation paradigm for extracting large quantities of instruction level parallelism from ordinary high level language programs. A single program is divided into a collection of tasks by a combination of software and hardware. The tasks are distributed to a number of parallel processing units which reside within a processor complex. Each of these units fetches and executes instructions belonging to its assigned task. The appearance of a single logical register file is maintained with a copy in each parallel processing unit. Register results are dynamically routed among the many parallel processing units with the help of compiler-generated masks. Memory accesses may occur speculatively without knowledge of preceding loads or stores. Addresses are disambiguated dynamically, many in parallel, and processing waits only for true data dependences.
This paper presents the philosophy of the multiscalar paradigm, the structure of multiscalar programs, and the hardware architecture of a multiscalar processor. The paper also discusses performance issues in the multiscalar model, and compares the multiscalar paradigm with other paradigms. Experimental results evaluating the performance of a sample of multiscalar organizations are also presented.
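Abstract (translated)
Multiscalar processors use a new, aggressive implementation approach to extract large amounts of instruction-level parallelism from ordinary high-level-language programs. A single program is split into a set of tasks by a combination of software and hardware. The tasks are distributed to several parallel processing units inside a processor complex, and each unit fetches and executes the instructions of its assigned task. Each processing unit holds its own copy of the register file, so the program still sees a single logical register file. Register results are routed dynamically among the processing units using masks generated by the compiler. Memory accesses may be issued speculatively, without knowledge of preceding loads or stores; addresses are disambiguated dynamically, many in parallel, and execution waits only for true data dependences.
The paper presents the philosophy of the multiscalar paradigm, the structure of multiscalar programs, and the hardware architecture of a multiscalar processor. It also discusses performance issues in the multiscalar model, compares the multiscalar paradigm with other paradigms, and reports experimental results for a sample of multiscalar organizations.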

Introduction
The basic paradigm of sequencing through a program, i.e., the fetch-execute cycle using a program counter, has been with us for about 50 years. A consequence of this sequencing paradigm is that programs are written with the tacit assumption that instructions will be executed in the same order as they appear in the program. To achieve high performance, however, modern processors attempt to execute multiple instructions simultaneously, and in some cases in a different order than the original program sequence. This reordering may be done in the compiler, in the hardware at execution time, or both. Superscalar and VLIW processors belong to this class of architectures that exploit instruction level parallelism (ILP).
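1 Introduction (translated)
The basic way of sequencing through a program, that is, the fetch-execute cycle driven by a program counter, has been in use for about 50 years. One consequence is that programs are written on the tacit assumption that instructions execute in the order in which they appear in the program. To reach high performance, however, modern processors try to execute several instructions at once, sometimes in an order different from the original program sequence. The reordering may be done by the compiler, by the hardware at run time, or by both. Superscalar and VLIW processors belong to this class of architectures that exploit instruction-level parallelism (ILP).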

ILP processors and compilers typically convert the total ordering of instructions as they appear in the original program into a partial ordering determined by dependences on data and control. Control dependences (which appear as conditional branches) present a major obstacle to highly parallel execution because these dependences must be resolved before all subsequent instructions are known to be valid.

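ILP processors and compilers typically turn the total order of instructions in the original program into a partial order determined by data and control dependences. Control dependences, which show up as conditional branches, are a major obstacle to highly parallel execution: until a branch is resolved, it is not known whether the instructions after it are valid.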
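
To make the move from a total order to a partial order concrete, here is a minimal sketch. It assumes a made-up four-instruction program and a hypothetical helper `true_dependences`, and it tracks only read-after-write data dependences through registers; control dependences and other hazards are left out on purpose.

```python
# Hypothetical three-address instructions: (destination, sources).
program = [
    ("r1", ["r2", "r3"]),  # 0: r1 = r2 + r3
    ("r4", ["r1"]),        # 1: r4 = r1 * 2   (reads the result of 0)
    ("r5", ["r6"]),        # 2: r5 = r6 + 1   (independent of 0 and 1)
    ("r7", ["r4", "r5"]),  # 3: r7 = r4 + r5  (reads the results of 1 and 2)
]

def true_dependences(prog):
    """Map each instruction index to the earlier instructions whose results it reads."""
    last_writer = {}  # register name -> index of the most recent writer
    deps = {}
    for i, (dest, srcs) in enumerate(prog):
        deps[i] = {last_writer[s] for s in srcs if s in last_writer}
        last_writer[dest] = i
    return deps

deps = true_dependences(program)
print(deps)
```

In this toy program, instructions 0 and 2 depend on nothing earlier, so the partial order would let them execute in the same cycle even though the program text lists them one after the other.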

Focusing on control dependences, one can represent a static program as a control flow graph (CFG), where basic blocks are nodes, and arcs represent flow of control from one basic block to another. Dynamic program execution can be viewed as walking through the program CFG, generating a dynamic sequence of basic blocks which have to be executed for a particular run of the program.

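Focusing on control dependences, a static program can be drawn as a control flow graph (CFG) whose nodes are basic blocks and whose arcs show the flow of control from one basic block to another. Executing the program then amounts to walking through the CFG, which yields the dynamic sequence of basic blocks that one particular run has to execute.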
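
As a rough illustration of such a walk, the sketch below models a CFG as a dictionary from basic-block names to successor lists and replays one run. The block names, the `walk` helper, and the list of branch outcomes are all hypothetical and exist only for this example.

```python
# Hypothetical CFG: each basic block maps to its possible successors.
CFG = {
    "entry": ["loop"],
    "loop": ["body", "exit"],  # conditional branch: stay in the loop or leave
    "body": ["loop"],
    "exit": [],
}

def walk(cfg, start, choices):
    """Return the dynamic sequence of basic blocks for one run.

    `choices` gives, in order, the successor index taken at every block
    that has more than one successor (i.e. at every conditional branch).
    """
    choices = iter(choices)
    trace = [start]
    block = start
    while cfg[block]:
        succs = cfg[block]
        idx = next(choices) if len(succs) > 1 else 0
        block = succs[idx]
        trace.append(block)
    return trace

# Two loop iterations, then exit.
trace = walk(CFG, "entry", [0, 0, 1])
print(trace)
```

The static graph has four nodes, while this particular run visits seven blocks; a different list of branch outcomes would give a different dynamic sequence over the same graph.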

To achieve high performance, an ILP processor must attempt to walk through the CFG with a high level of parallelism. Branch prediction with speculative execution is one commonly-used technique for raising the level of parallelism that can be achieved during the walk. The primary constraint on any parallel walk, however, is that it must preserve the sequential semantics assumed in the program.

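To achieve high performance, an ILP processor must try to walk the CFG with a high degree of parallelism. Branch prediction combined with speculative execution is a common technique for raising the parallelism achieved during the walk. The main constraint on any parallel walk, however, is that it must preserve the sequential semantics the program assumes.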

In the multiscalar model of execution, the CFG is partitioned into portions called tasks. A multiscalar processor walks through the CFG speculatively, taking task-sized steps, without pausing to inspect any of the instructions within a task. A task is assigned to one of a collection of processing units for execution by passing the initial program counter of the task to the processing unit. Multiple tasks then execute in parallel on the processing units, resulting in an aggregate execution rate of multiple instructions per cycle.

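In the multiscalar execution model, the CFG is partitioned into portions called tasks. A multiscalar processor walks the CFG speculatively in task-sized steps, without stopping to inspect the instructions inside a task. A task is assigned to one of the processing units by passing the task's initial program counter to that unit. Several tasks then execute in parallel on the processing units, giving an aggregate execution rate of multiple instructions per cycle.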
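
The assignment step can be sketched as follows. This is only an illustration under stated assumptions: the round-robin order, the `dispatch` helper, and the task addresses are invented for the example, and the sketch ignores whether a unit has finished its previous task.

```python
from collections import deque

def dispatch(task_pcs, num_units):
    """Assign tasks to processing units in round-robin order (an assumption).

    Each task is identified only by its initial program counter, as in the
    text. Returns (unit, pc) pairs in dispatch order.
    """
    units = deque(range(num_units))
    assignment = []
    for pc in task_pcs:
        unit = units[0]   # next unit in line
        units.rotate(-1)  # move it to the back of the queue
        assignment.append((unit, pc))
    return assignment

# Hypothetical sequence of predicted task entry points.
predicted_tasks = [0x100, 0x140, 0x1A0, 0x100, 0x140]
assignment = dispatch(predicted_tasks, num_units=4)
for unit, pc in assignment:
    print(f"unit {unit} <- task at pc {pc:#x}")
```

The point of the sketch is that the sequencer hands over nothing but a starting address; fetching and executing the task's instructions is left entirely to the unit that receives it.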

At this level, the concept sounds simple, however, the key to making it work is the proper resolution of inter-task data dependences. In particular, data that is passed between instructions via registers and memory must be routed correctly by the hardware. Furthermore, it is in this area of inter-task data communication that the multiscalar approach differs significantly from more traditional multiprocessing methods.

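At this level the concept sounds simple, but the key to making it work is resolving inter-task data dependences properly. In particular, data passed between instructions through registers and memory must be routed correctly by the hardware. It is also in this area of inter-task data communication that the multiscalar approach differs significantly from more traditional multiprocessing methods.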

This paper describes the multiscalar approach to exploiting fine-grain parallelism (or instruction-level parallelism or ILP). Section 2 provides an overview of the multiscalar paradigm. A breakdown of the distribution of the available processing unit cycles in multiscalar execution follows in Section 3. In Section 4, we compare multiscalar with other ILP paradigms. A performance evaluation of potential configurations of a multiscalar processor is given in Section 5. In Section 6, we summarize this work and offer concluding remarks.

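This paper describes the multiscalar approach to exploiting fine-grain (instruction-level) parallelism. Section 2 gives an overview of the multiscalar paradigm. Section 3 breaks down how the available processing-unit cycles are spent during multiscalar execution. Section 4 compares multiscalar with other ILP paradigms, Section 5 evaluates the performance of potential multiscalar configurations, and Section 6 summarizes the work and offers concluding remarks.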

A multiscalar program must provide the means to support a fast walk (through the CFG) that distributes tasks en masse to processing units. Below, we describe three distinct types of information maintained within a machine-level multiscalar program to facilitate this end: (i) the actual code for the tasks which comprises the work, (ii) the details of the structure of the CFG, and (iii) the communication characteristics of individual tasks.

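A multiscalar program must provide a way to walk the CFG quickly so that tasks can be distributed en masse to the processing units. Below we describe three distinct kinds of information kept in a machine-level multiscalar program for this purpose:
1. the actual code of the tasks, which makes up the work
2. the details of the structure of the CFG
3. the communication characteristics of individual tasks
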
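
One way to picture these three kinds of information is a per-task record like the one below. This is a sketch, not the paper's encoding: the `TaskDescriptor` name, its fields, and the choice of a bit mask over register numbers (loosely following the "compiler-generated masks" mentioned in the abstract) are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TaskDescriptor:
    start_pc: int                                    # (i) where the task's code begins
    successors: list = field(default_factory=list)   # (ii) entry points of possible next tasks
    create_mask: int = 0                             # (iii) bit r set => task may write register r

def may_write(task, reg):
    """True if the task's mask says it may produce a value for register `reg`."""
    return bool((task.create_mask >> reg) & 1)

# Hypothetical task that may write registers 4 and 20 and has two successors.
task = TaskDescriptor(
    start_pc=0x100,
    successors=[0x140, 0x1A0],
    create_mask=(1 << 4) | (1 << 20),
)
print(may_write(task, 4), may_write(task, 5))
```

With a record of this shape, a later task could tell which registers it may have to wait for from an earlier task without inspecting that task's instructions, which is the kind of fast walk the paragraph above asks for.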
The specification of the code for each task is routine. A task is specified as a set of instructions, in the same fashion as a program fragment for a sequential machine. Although the instruction set architecture (ISA) in which the code is represented affects the design of each individual processing unit, it has little influence on the rest of the design of a multiscalar processor. Hence, the instruction set used to specify the task is of secondary importance. (The significance of this fact is that an existing ISA may be used without a major overhaul.)

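Specifying the code of each task is routine. A task is given as a set of instructions, just like a program fragment for a sequential machine. The instruction set architecture (ISA) in which the code is expressed affects the design of each individual processing unit, but it has little influence on the rest of a multiscalar processor's design, so the instruction set used to specify a task is of secondary importance. (This matters because an existing ISA can be used without a major overhaul.)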
