Computer Architecture-Final Review

Chapter 1 Fundamentals of Quantitative Design and Analysis

| 1.2.6 Degree of Parallelism and Classification of Parallel Architectures

There are two main types of parallelism in applications:

  • Data-level parallelism: parallelism that arises from operating on many data items at the same time
  • Task-level parallelism: parallelism that arises from creating tasks of work that can operate independently and largely in parallel

Based on the parallelism present in the instruction stream and the data stream, all computers can be placed into one of four categories (Flynn's taxonomy):

  • SISD (Single Instruction stream, Single Data stream): a single processor, which can still exploit instruction-level parallelism
  • SIMD (Single Instruction stream, Multiple Data streams): the same instruction is executed by multiple processors on different data streams, exploiting data-level parallelism
  • MISD (Multiple Instruction streams, Single Data stream): no commercial machine of this type exists to date
  • MIMD (Multiple Instruction streams, Multiple Data streams): each processor fetches its own instructions and operates on its own data, targeting task-level parallelism

| 1.3 Computer Architecture

See Appendix A

| 1.4.1 Bandwidth trumps latency

Bandwidth has improved much faster than latency.

| 1.9 Quantitative principles of computer design

Three basic principles of computer design:

  • Take advantage of parallelism
  • Principle of locality
  • Focus on the common case: Amdahl's Law

$$\text{Speedup} = \frac{\text{Execution time}_{\text{original}}}{\text{Execution time}_{\text{enhanced}}}$$

$$\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{original}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$

$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{original}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$
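As a quick worked example (hypothetical numbers): if an enhancement applies to 40% of the original execution time and speeds that fraction up by a factor of 10, then

$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.4) + \frac{0.4}{10}} = \frac{1}{0.64} \approx 1.56$$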

Chapter 3 Instruction Level Parallelism

| 3.1 Instruction-level parallelism

Modern processors use pipelining to exploit instruction-level parallelism (ILP) and improve performance. The CPI of a pipelined processor is:

Pipeline CPI = Ideal pipeline CPI + Structural hazard stalls + Data hazard stalls + Control hazard stalls

where each stall term is measured in stall cycles per instruction.
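As a quick worked example (hypothetical numbers): with an ideal pipeline CPI of 1 and average stalls per instruction of 0.1 (structural), 0.3 (data), and 0.2 (control),

Pipeline CPI = 1 + 0.1 + 0.3 + 0.2 = 1.6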

The main ILP techniques are:

| Technique | Which CPI component it reduces | Hardware/software based |
| --- | --- | --- |
| Forwarding | Data hazard stalls | Hardware |
| Basic dynamic scheduling (scoreboarding) | Data hazard stalls | Hardware |
| Dynamic scheduling with renaming (Tomasulo) | Data hazard stalls | Hardware |
| Delayed branches and basic branch prediction | Control hazard stalls | Hardware |
| Dynamic branch prediction | Control hazard stalls | Hardware |
| Loop unrolling | Control hazard stalls | Software |
| Compiler pipeline scheduling | Data hazard stalls | Software |

The most common hazard in a pipeline is the data hazard. Data hazards are caused by dependences between instructions. There are three types of dependences:

  • An operand of the current instruction is the result of an earlier instruction: a true data dependence, which may lead to a read-after-write (RAW) hazard.
  • The current instruction writes a location that an earlier instruction reads: an antidependence, which may lead to a write-after-read (WAR) hazard.
  • The current instruction and an earlier instruction write to the same location: an output dependence, which may lead to a write-after-write (WAW) hazard.

A dependence is only a relationship between instructions and does not necessarily cause a hazard in the pipeline; whether it does depends on the specific pipeline implementation.

| 3.2 Loop unrolling

Loop unrolling is a technique for increasing instruction-level parallelism. When applying loop unrolling, the following decisions and transformations must be made:

  • Make sure the loop iterations are independent, so that unrolling is useful
  • Use different registers to avoid the constraints caused by reusing the same register for unrelated operations (i.e. use multiple variables)
  • Remove the now-redundant test and branch instructions
  • Before exchanging the positions of load and store instructions from different iterations during scheduling, check that they are independent and do not reference the same address
  • Schedule the code

| Loop unrolling - example

Consider the following code snippet:

for(i=999;i>=0;i--){
    x[i] = x[i] + s;
}

The corresponding RISC-V 64-bit instructions are as follows, assuming that x[i] and s are both double-precision floating-point numbers:

Loop:
	fld f0,0(x1)	 # x1 holds the address of x[999]
	fadd.d f4,f0,f2  # f2 holds s
	fsd f4,0(x1)
	addi x1,x1,-8	 # double-precision array, so i-- corresponds to subtracting 8 from the address
	bne x1,x2,Loop

For the RISC-V pipeline, assume that the floating point operation latency is as follows and the integer operation latency is 0:

[Figure: assumed latencies between dependent floating-point instructions]

This yields the following pipeline execution sequence:

Loop:	fld	f0,0(x1)
		stall
		fadd.d f4,f0,f2
		stall
		stall
		fsd f4,0(x1)
		addi x1,x1,-8
		bne x1,x2,Loop

The above instruction sequence takes 8 cycles per iteration. Scheduling can remove one stall (note that the store offset becomes 8(x1), because the addi now executes before the store):

Loop:	fld	f0,0(x1)
		addi x1,x1,-8
		fadd.d f4,f0,f2
		stall
		stall
		fsd f4,8(x1)
		bne x1,x2,Loop

Because the iterations of this loop are independent, the loop can be unrolled four times. The first three branch instructions can then be deleted, i-- becomes i = i - 4, and different registers are used to avoid name constraints. The new instruction sequence is:

Loop:	fld	f0,0(x1)
		fadd.d f4,f0,f2
		fsd f4,0(x1)
		
		fld	f6,-8(x1)
		fadd.d f8,f6,f2
		fsd f8,-8(x1)
		
		fld	f10,-16(x1)
		fadd.d f12,f10,f2
		fsd f12,-16(x1)
		
		fld	f14,-24(x1)
		fadd.d f16,f14,f2
		fsd f16,-24(x1)
		
		addi x1,x1,-32
		bne x1,x2,Loop

There is still a stall between each load and the dependent FP operation, and between each FP operation and the dependent store, so this instruction sequence needs to be scheduled. Because different registers are used, there are no stalls between consecutive loads or between consecutive FP operations, so the loads, the operations, and the stores can each be grouped together:

Loop:	fld	f0,0(x1)
		fld	f6,-8(x1)
		fld	f10,-16(x1)
		fld	f14,-24(x1)
		
		fadd.d f4,f0,f2
		fadd.d f8,f6,f2
		fadd.d f12,f10,f2
		fadd.d f16,f14,f2
		
		fsd f4,0(x1)
		fsd f8,-8(x1)
		fsd f12,-16(x1)
		fsd f16,-24(x1)

		addi x1,x1,-32
		bne x1,x2,Loop	

Now there are no stalls anywhere in the sequence, and these four iterations complete in 14 cycles, an average of 3.5 cycles per iteration, which is clearly faster than without loop unrolling.
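For reference, here is a source-level sketch of the same transformation in C (a simple illustration, assuming as above that the 1000 iterations divide evenly by the unroll factor of 4):

/* Original loop */
for (i = 999; i >= 0; i--) {
    x[i] = x[i] + s;
}

/* Unrolled four times: the iterations are independent, the three intermediate
   tests and branches disappear, and i-- becomes i -= 4 */
for (i = 999; i >= 0; i -= 4) {
    x[i]     = x[i]     + s;
    x[i - 1] = x[i - 1] + s;
    x[i - 2] = x[i - 2] + s;
    x[i - 3] = x[i - 3] + s;
}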

| 3.4-3.5 Dynamic Scheduling & Tomasulo Algorithm

Dynamic scheduling is more flexible: it allows instructions to execute in a different order and reduces stalls. Having the hardware rearrange instruction execution allows code to run efficiently on different pipelines and can handle dependences that are not known at compile time (for example, those involving memory references), as well as cache misses.

The scoreboard algorithm introduced in Appendix C is a dynamic scheduling algorithm that allows instructions to execute out of order. The scoreboard checks for the various data hazards and structural hazards separately at each stage of an instruction, and avoids hazards by stalling.

The Tomasulo algorithm uses an idea similar to the scoreboard: it records the execution status of instructions and checks for hazards. The difference is that Tomasulo uses register renaming to eliminate the two false data dependences, WAR and WAW, and with them the corresponding data hazards. Moreover, in the Tomasulo algorithm each functional unit has multiple buffers, so instructions that need the same functional unit can wait at a reservation station while the unit is occupied, reducing the stalls caused by structural hazards.

Implementation structure of Tomasulo algorithm

The implementation structure of the Tomasulo algorithm is as follows:

[Figure: implementation structure of the Tomasulo algorithm]

  • FP OP Queue: the instruction queue; instructions are issued from here
  • Reservation Stations: hold the information of issued instructions and buffer operand data
  • Address Unit: the address calculation unit; the effective address of a load/store is calculated before the memory access is performed
  • Memory Unit: the memory unit
  • CDB: the Common Data Bus, which broadcasts results directly to the register file and the reservation stations

Reservation stations, register result status table, instruction status table

In the Tomasulo algorithm, an instruction passes through three stages: issue, execute, and write result. The instruction status table is as follows:

[Figure: Tomasulo instruction status table]

  • Issue: the Tomasulo algorithm issues instructions in order. An instruction can be issued only if there is a free entry in the reservation station for its functional unit. Once issued, the instruction occupies that entry and updates the reservation station and the register result status table. At issue time, any operands that are already available are read into the reservation station. When the register result status table is updated, it always keeps the information of the most recent instruction that will write each register; only that instruction will actually write the destination register, and the results of earlier instructions are not written to it. Instructions that need the value produced by an earlier instruction obtain the data directly by listening on the CDB.
  • Execute: once the source operands are ready, execution starts and the functional unit is occupied.
  • Write result: the result is passed directly over the CDB to the register file and to every reservation station; the register file is updated according to the register result status table, and the instruction's entries in the reservation station and register result status table are cleared.

In the scoreboard, each functional unit can hold only one instruction. The Tomasulo algorithm instead configures a set of buffers (reservation stations) for each functional unit, so several instructions can be buffered for the same unit; when the unit is busy, later instructions wait in the reservation station. A reservation station records directly any operand values that are already available; for operands that are not yet ready it captures the broadcast data as soon as the producing instruction completes, and the data source is identified by a reservation station number rather than a register number, which is what implements register renaming.

[Figure 2: Reservation Station]

Like the scoreboard, the Tomasulo algorithm also records the register result status, i.e. the data source for each register's pending update; the recorded source is always the most recently issued instruction that writes that register.
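The bookkeeping described above can be summarized with a minimal C sketch (field names follow the common textbook convention; this is an illustration of the data structures, not a complete implementation):

#include <stdbool.h>

/* One reservation station entry (one buffer slot of a functional unit). */
typedef struct {
    bool   busy;     /* entry in use */
    int    op;       /* operation to perform (e.g. FADD, FMUL) */
    double Vj, Vk;   /* operand values, valid only when Qj/Qk are 0 */
    int    Qj, Qk;   /* number of the reservation station that will produce the
                        operand; 0 means the value is already in Vj/Vk */
    long   addr;     /* effective address, used by load/store entries */
} RSEntry;

/* Register result status: for each register, the reservation station (if any)
   whose result will be written to it; 0 means the register already holds the value. */
typedef struct {
    int Qi;
} RegStatus;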

Disadvantages of Tomasulo's algorithm

  • Complex to implement
  • Requires high-speed CDB
  • Performance limited by CDB

| Filling in the Tomasulo algorithm tables

When filling in the Tomasulo tables, note that operand values that are already available are written directly into the reservation station. If an operand is written back in the current cycle, an instruction issued in that same cycle can read it directly into the reservation station (from the CDB), rather than having to wait for the next cycle as in the scoreboard. In the cycle in which an instruction writes back (the table for that cycle), the register result status table should also record the corresponding value. As with the scoreboard, in the write-back cycle of an instruction, its reservation station entry can be cleared when filling in the table.

Chapter 5 Thread-Level Parallelism

| 5.4 Directory-based and Snooping Coherence Protocols

Multicore processors may have both shared and private levels of cache. If multiple processors share data in memory, each may cache that shared data in its own private cache, so different processors can end up holding different values for the same data. This is the cache coherence problem.

Cache coherence is maintained with a coherence protocol. There are two classes of coherence protocols:

  • Directory based: the sharing state of each block of physical memory is kept in one location, called the directory
  • Snooping: every cache that has a copy of a physical memory block also tracks the sharing state of that block. All caches are reachable over some broadcast medium (typically a bus), and all cache controllers snoop on that medium.

Both protocols discussed here are write-invalidate protocols: when a processor performs a write, it invalidates all other copies.

Snooping coherence protocols are typically implemented with a finite-state controller that changes the state of selected cache blocks and uses the bus to access or invalidate data. A signal sent on the bus invalidates the copies held by other CPUs, and bus signals also inform other CPUs that a read miss or write miss has occurred.

In a directory coherence protocol, each node keeps a directory that stores the state of every cacheable block in its local memory, including which caches hold copies of the block, whether it needs to be updated, and so on. Each memory block has a corresponding directory entry, consisting of an access state and a bit vector; the bit vector records which processors hold a copy of the block. The directory protocol does not broadcast invalidations to all processors: using the bit vector, it notifies only the relevant CPUs that the block has been invalidated or must be updated. The node that owns the memory containing the block acts as the home node, communicating with the processors that need to be involved and carrying out the write-invalidation notifications and write-back operations.
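A minimal C sketch of a directory entry as described above (the three states and the presence bit vector follow the usual textbook formulation; the processor count is an assumption for illustration):

#include <stdint.h>

#define NUM_PROCESSORS 64   /* assumed machine size */

typedef enum {
    UNCACHED,   /* no cache holds a copy */
    SHARED,     /* one or more caches hold a clean copy */
    MODIFIED    /* exactly one cache holds the (dirty) copy */
} DirState;

/* One directory entry per memory block, kept at the block's home node. */
typedef struct {
    DirState state;      /* access state of the block */
    uint64_t presence;   /* bit i set => processor i holds a copy */
} DirEntry;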

Comparison of two methods

The snooping approach is bus based and implements write invalidation by broadcasting signals. Its advantage is that no additional storage is needed to maintain coherence information; its disadvantage is poor scalability: the more processors there are, the greater the pressure on bus communication.

The directory approach uses a directory to maintain coherence information, which adds storage overhead. The coherence information for a block is kept in one place, but the directory itself can be distributed across the nodes, so the approach scales. The biggest advantage of the directory approach is that it can be implemented in a distributed system without a bus, and is therefore scalable.

Appendix A

| A.2 Classification of instruction set architecture

Classified by the type of internal storage used for operands, there are three classes of instruction set architecture:

  • stack architecture
  • accumulator architecture
  • General purpose register architecture: operands are registers or memory locations
    • Register-memory architecture: an operand may be a memory address; programs are smaller, but instructions vary in complexity and in the number of clock cycles they take to complete.
    • Load-store architecture (register-register architecture): only load and store instructions access memory; instructions take similar numbers of clock cycles, which makes pipelining easier, but programs are larger.

| A.3-A.6 Characteristics of instruction set architecture

  • Interpreting addresses & addressing modes: byte order can be big-endian or little-endian, and data accesses are usually aligned. Addressing modes include register, immediate, displacement, direct addressing, and others. An architecture should support at least the most commonly used modes: register indirect, displacement, and immediate addressing.
  • Operand type and size
  • Operations in the instruction set: at least arithmetic and logical operations, data transfer, control operations, and system operations are generally supported
  • Control flow instructions: conditional branch, jump, procedure call, procedure return
  • Instruction set encoding: allow as many registers and addressing modes as possible while keeping instruction length under control, so that instructions can be processed easily by the pipeline
    • Variable-length encoding: allows all addressing modes for all operations; code size matters more than performance
    • Fixed-length encoding: works best when there are few addressing modes and operands; the operation and addressing mode are folded into the opcode; performance matters more than code size
    • Hybrid encoding

| RISC-V architecture

Taking RV64G as an example, these characteristics are illustrated below for a concrete instruction set architecture.

  • Registers: RV64G has 32 64-bit general-purpose registers, 32 64-bit floating-point registers, and some special-purpose registers.

  • Operands: RV64G supports byte, halfword, word, doubleword and single/double precision floating point.

  • Addressing mode: The only data addressing modes are immediate addressing and offset addressing. When the offset is 0, register indirect addressing is implemented. RV64G uses 64-bit addresses and little-endian storage. Memory accesses do not need to be aligned, but using unaligned memory accesses will be very slow.

  • Instruction set encoding: fixed-length encoding. The corresponding assembly format is OP rd rs1 rs2.

[Figure: RISC-V instruction formats]

  • Operations: loads and stores, ALU operations, branches and jumps, floating-point operations.
  • Control flow instructions: branch and jump instructions.

Appendix B

| B.2 Evaluate cache performance

cache performance

Memory stall time is often used to evaluate cache performance:

Memory stall cycles = Number of misses × Miss penalty = IC (instruction count) × (Memory accesses / Instruction) × Miss rate × Miss penalty

Another metric is to use average memory access time:

Average memory access time = Hit time + Miss rate × Miss penalty

When computing the speedup provided by a cache, compute the CPI with and without the cache and take their ratio. The CPI formula here considers only memory stalls, and without a cache the miss rate is effectively 100%.
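A quick worked example (hypothetical numbers): with a hit time of 1 cycle, a miss rate of 5%, and a miss penalty of 100 cycles,

Average memory access time = 1 + 0.05 × 100 = 6 cycles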

Four issues to consider when caching

  • Cache organization: fully associative / set associative / direct mapped
  • Cache lookup: split the physical address into TAG | INDEX | BLOCK OFFSET (see the sketch after this list)
  • Cache replacement policy: random / LRU (usually only pseudo-LRU is implemented) / FIFO
  • Write policies:
    • Write hit: write-through / write-back. Write-back is faster, requires only writes to the cache, and uses less memory bandwidth. Write-through is easier to implement, and the next level of storage always has an up-to-date copy, which simplifies data consistency.
    • Write miss: write-allocate / no-write-allocate. Write-back caches usually use write-allocate, hoping that subsequent writes will hit in the cache, while write-through caches often use no-write-allocate.
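A minimal C sketch of the TAG | INDEX | BLOCK OFFSET split mentioned above (the cache geometry, 64-byte blocks and 256 sets, is an assumption for illustration):

#include <stdint.h>

#define BLOCK_SIZE  64    /* bytes per block -> 6 offset bits (assumed) */
#define NUM_SETS    256   /* sets in the cache -> 8 index bits (assumed) */
#define OFFSET_BITS 6     /* log2(BLOCK_SIZE) */
#define INDEX_BITS  8     /* log2(NUM_SETS) */

static inline uint64_t block_offset(uint64_t addr) {
    return addr & (BLOCK_SIZE - 1);
}

static inline uint64_t set_index(uint64_t addr) {
    return (addr >> OFFSET_BITS) & (NUM_SETS - 1);
}

static inline uint64_t tag_bits(uint64_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}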

| B.3 Six basic cache optimizations

The six basic cache optimizations:

  • Larger block size to reduce miss rate: larger blocks reduce compulsory (cold) misses by exploiting spatial locality, but blocks that are too large increase the miss penalty and increase the other two kinds of misses. The best block size depends on the bandwidth and latency of the lower-level memory: for high-bandwidth, high-latency memory, large blocks add little to the miss penalty and are encouraged; for low-latency, low-bandwidth memory, smaller blocks are preferred.
  • Larger caches to reduce miss rate: this reduces capacity misses, but the drawbacks are a potentially longer hit time and higher cost and power consumption.
  • Higher associativity to reduce miss rate: the rule of thumb is that eight-way set associativity is about as effective as full associativity. However, increasing associativity increases the hit time.
  • Multilevel caches to reduce miss penalty: multilevel caches keep the first level fast while expanding overall cache capacity. To measure misses in a multilevel cache, both the local miss rate (the miss rate of that cache level, relative to the accesses reaching it) and the global miss rate (misses of that level relative to all memory accesses) are used; a worked example follows this list. The speed of the first-level cache affects the processor clock rate, while the speed of the second-level cache only affects the first-level miss penalty. The second-level cache sees a higher (local) miss rate, so the focus there is on reducing misses, using higher associativity and larger blocks.
  • Giving read misses priority over write misses, to reduce miss penalty: for read-after-write problems, if the write is still sitting in the write buffer, a read miss can first check the contents of the write buffer.
  • Avoiding address translation when indexing the cache, to reduce hit time: virtual addresses could be used in the cache to shorten hit time, but to preserve address-space protection (among other reasons), a common solution is to index the cache with part of the page offset while still matching tags with the physical address, so that address translation proceeds in parallel with reading the cache using the index.
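A quick worked example (hypothetical numbers) for the local and global miss rates mentioned above: if the L1 miss rate is 4% and half of those L1 misses also miss in L2, then the L2 local miss rate is 50%, while the L2 global miss rate is 0.04 × 0.5 = 2% of all memory accesses.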

Appendix C

| C.1 Basic implementation of pipelining

Instruction execution is divided into 5 stages, and a new instruction is started every clock cycle to pipeline instructions. This requires the following measures:

  • Use separate instruction cache and data cache to avoid memory access conflicts.
  • Registers are written during the first half of the clock cycle and read during the second half.
  • The program counter is incremented and the branch target address is calculated.
  • Pipeline registers are introduced between consecutive pipeline stages to pass data along, so that instructions in different stages do not interfere with one another.

Pipelining improves CPU instruction throughput, but it does not shorten the execution time of a single instruction, because the clock period must be at least as long as the slowest pipeline stage and the pipeline registers introduce additional latency.

| C.2 Obstacles to pipelining - pipeline hazards

Pipelined instruction execution is mainly hindered by pipeline hazards. There are three kinds, with the corresponding solutions:

  • Structural hazards: hazards caused by resource conflicts.
    • Stall
    • Add hardware (must weigh whether it is worth it)
    • Instruction reordering
  • Data hazards: pipeline hazards caused by data dependences.
    • Forwarding
    • Stall
    • Instruction reordering
  • Control hazards: hazards caused by branch instructions.
    • Basic schemes: predict taken / predict not taken, delayed branches, static branch prediction
    • Dynamic branch prediction: 2-bit branch predictor (see the sketch after this list)

| C.3 Implementation of Pipelining

MIPS moves the branch target calculation and the branch condition test into the ID stage, to reduce the stalls caused by control hazards.

[Figure: MIPS pipeline with the branch resolved in the ID stage]

If the branch is resolved in the ID stage, some data hazards cause extra stalls (the branch's operands must be ready a stage earlier). RISC-V therefore resolves the branch in the EX stage. Without branch prediction this would waste two instruction slots, equivalent to two stalls. RISC-V uses dynamic branch prediction and does not use delay slots, because delayed branches are not always fillable; the jump is taken according to the prediction in the ID stage, and the prediction is verified in the EX stage when the branch is resolved.

[Figure: RISC-V pipeline with the branch resolved in the EX stage]

| C.7 Basic Dynamic Scheduling-Scoreboard

In a dynamically scheduled pipeline, instructions issue in order and execute out of order. This is achieved with a scoreboard. The scoreboard is fully responsible for instruction issue and execution, including all hazard detection. Out-of-order execution introduces WAR and WAW hazards that do not exist in the original in-order pipeline; these are detected and handled by the scoreboard. Every instruction enters the scoreboard and is recorded there; the scoreboard determines when the instruction can read its operands and begin execution, and it also controls when the instruction may write its result to the destination register. Each functional unit has an entry (a recorded slot) in the scoreboard.

There are four steps for instructions to complete in the pipeline:

  • Issue: if a functional unit for the instruction is free and no other active instruction has the same destination register, the scoreboard issues the instruction to the functional unit. If there is a WAW hazard or a structural hazard, instruction issue stalls.
  • Read operands: the scoreboard monitors the availability of the source operands. A source operand is available if no earlier issued active instruction is going to write it. This step resolves RAW hazards.
  • Execute: the functional unit begins execution after receiving the operands. When the result is ready, it notifies the scoreboard that execution has completed.
  • Write result: once the scoreboard knows the functional unit has completed execution, it checks for WAR hazards and, if necessary, stalls the completing instruction.

The scoreboard has three sections:

  • Instruction status: Indicates which step of the four steps the instruction is in
  • Functional unit status: indicates the state of each functional unit; Fj and Fk are the source register numbers, Fi is the destination register, Qj and Qk are the functional units that will produce the source operands, and Rj and Rk indicate whether each operand is ready
  • Register result status: indicates which functional unit will write each register; the field is cleared after the write completes
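These three parts can be summarized with a minimal C sketch (field names follow the usual textbook convention; this is an illustration of the bookkeeping only):

#include <stdbool.h>

/* One functional unit status entry in the scoreboard. */
typedef struct {
    bool busy;      /* unit is busy */
    int  op;        /* operation being performed */
    int  Fi;        /* destination register number */
    int  Fj, Fk;    /* source register numbers */
    int  Qj, Qk;    /* functional units producing Fj, Fk (-1 if none) */
    bool Rj, Rk;    /* source operand ready and not yet read */
} FUStatus;

/* Register result status: for each register, the functional unit (if any)
   that will write it; -1 means no pending write. */
int reg_result[32];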

| Filling in the scoreboard tables

When filling in the tables: during the read-operands stage, Rj and Rk are still in the ready-to-read (YES) state; once the instruction enters the execute stage they are set to already-read. When the instruction enters write-back, its row in the functional unit status table and the corresponding entry in the register result status table are cleared (but the next instruction that needs this functional unit cannot be issued in the same cycle, even though the table entry is now empty); at the same time, the Qj or Qk of any instruction waiting for this unit's result is cleared, and its Rj or Rk turns to YES, indicating the operand is ready to be read, but it cannot actually be read until the next cycle. Note that the table labeled clk i shows the state at the end of cycle i, so you cannot clear the functional unit status in an instruction's write-back cycle and issue the next instruction that uses that unit in the same cycle.

As shown in the example below, the first LD writes back in clk4; at that point the Integer unit entry is empty, but the next LD cannot be issued in clk4, because the first LD's write-back and the clearing of the entry both happen during that cycle.


In addition, an instruction must check for WAR hazards before writing its result, as shown in the following illustration:

[Figure: scoreboard example - checking for a WAR hazard before writing the result]

Origin blog.csdn.net/Aaron503/article/details/131083374