Preface
Multiple choice
- Be careful with absolute options
- If you can't decide, look for the traps hidden in the options.
Assembly
- When filling in blanks, mind the order of operations and add parentheses wherever they are needed
- When filling in machine code, remember the little-endian byte order
- Pay attention to the instruction suffix and the register width!
- When filling in a stack frame, pay attention to the memory address scale
processor
- Pipeline fill (warm-up) cycles
- Potential data hazards across loop iterations
- When counting cycles, treat the first/last loop iteration separately
- Check whether the condition is Cnd or !Cnd
optimization
- Watch for memory aliasing that creates a critical path
- Data dependency graph
Cache
- Pay attention to the order; the default is advanced first
- Read the questions very carefully!
- Find a large enough space and lay out all the data neatly: convert to hexadecimal first, then to binary
Chp2 data
1 integer
Positive overflow wraps around to a negative number, and negative overflow wraps around to a positive number
assert(0x80000000 > 0); // a hex literal is tried as int, then unsigned, then long; here it becomes unsigned
int x = 0x80000000;
assert(x < 0);
Classic example: INT_MIN == -INT_MIN
2 Floating point
Space allocation
type | sign bits | exp bits | normalized exponent range | frac bits |
---|---|---|---|---|
float | 1 | 8 | -126~127 | 23 |
double | 1 | 11 | -1022~1023 | 52 |
Special value
- NaN
float nan1 = u2f(0xffc00000u); // u2f reinterprets a bit pattern as a float
float nan2 = u2f(0xffc00001u);
printf("%d %d\n", nan1==nan1, nan1!=nan1);
printf("%d %d\n", nan1==nan2, nan1!=nan2);
result:
0 1
0 1
Simply put: NaN != NaN always holds
- Denormalized
- INF
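The three special cases above are told apart by the exponent and fraction fields. The following is a minimal sketch, assuming `float` is IEEE 754 single precision; `u2f` is written here with `memcpy` as one possible definition (the notes do not show theirs), and `classify` is a hypothetical helper name.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes IEEE 754 binary32). */
static float u2f(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);   /* well-defined, unlike a pointer cast */
    return f;
}

/* Classify a bit pattern by its exponent and fraction fields. */
static const char *classify(uint32_t bits) {
    uint32_t exp  = (bits >> 23) & 0xffu;   /* 8-bit exponent field  */
    uint32_t frac = bits & 0x7fffffu;       /* 23-bit fraction field */
    if (exp == 0xffu) return frac ? "NaN" : "INF";
    if (exp == 0)     return frac ? "denormalized" : "zero";
    return "normalized";
}
```

For instance, `classify(0xffc00000u)` reports NaN because the exponent is all ones and the fraction is nonzero, while `0x7f800000u` (fraction zero) is INF.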
3 float and int
- float to int: truncates toward zero
- int/double to float: may round
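A small sketch of the two rules above, assuming 32-bit `int`, IEEE 754 `float`, and the default round-to-nearest-even mode. The function names are made up for the example.

```c
/* float -> int: the fractional part is discarded (truncation toward zero). */
static int float_to_int(float f) {
    return (int)f;
}

/* int -> float: a float has only 24 significant bits, so large values round. */
static float int_to_float(int i) {
    volatile float f = (float)i;   /* volatile forces a real 32-bit store */
    return f;
}
```

With these, `float_to_int(-2.9f)` gives -2 rather than -3, and `int_to_float(16777217)` (that is, 2^24 + 1) comes back as 16777216.0f because the value sits exactly between two representable floats and the tie goes to the even one.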
4 About TMIN
cout << (-2147483648 > 0) << endl; // 0
cout << (0x80000000 > 0) << endl; // 1
Explanation:
Candidate type order for a hexadecimal literal: int -> unsigned -> long; for a decimal literal: int -> long
So the first one (2147483648 with a unary minus applied) is treated as long, and the second one is treated as unsigned
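The literal types can be observed with `sizeof`, which looks only at the type of its operand. This is a sketch assuming C99 or later and a 32-bit `int`; the helper names are hypothetical.

```c
#include <stddef.h>

/* 0x80000000 does not fit in int, so a hex literal becomes unsigned int. */
static size_t size_of_hex_tmin(void) { return sizeof(0x80000000); }

/* 2147483648 does not fit in int; a decimal literal skips unsigned and
   becomes long (or long long where long is 32 bits), then gets negated. */
static size_t size_of_dec_tmin(void) { return sizeof(-2147483648); }

static int hex_is_positive(void) { return 0x80000000 > 0; }   /* unsigned compare */
static int dec_is_positive(void) { return -2147483648 > 0; }  /* signed 64-bit compare */
```

Under those assumptions the hex literal occupies 4 bytes and the decimal expression 8 bytes, which matches the two `cout` results above.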
5 Big-endian and little-endian
Big-endian: the most significant byte comes first (at the lowest address)
Little-endian: the least significant byte comes first (at the lowest address)
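To see which byte lands at the lowest address, one can copy an integer into a byte array. A minimal sketch; `bytes_of` and `is_little_endian` are names chosen for the example.

```c
#include <stdint.h>
#include <string.h>

/* Copy the in-memory bytes of x into out[0..3], lowest address first. */
static void bytes_of(uint32_t x, unsigned char out[4]) {
    memcpy(out, &x, 4);
}

/* Returns 1 on a little-endian host, 0 otherwise. */
static int is_little_endian(void) {
    unsigned char b[4];
    bytes_of(0x01234567u, b);
    return b[0] == 0x67;   /* least significant byte stored first */
}
```

On a little-endian machine such as x86-64, `0x01234567` is laid out as `67 45 23 01`, which is also the order in which immediates appear in machine code.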
6 other
Pay attention to operator precedence
In a type conversion, the size changes first, then the signed/unsigned conversion is applied
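The ordering rule above can be checked by comparing both orders explicitly. A sketch assuming 16-bit `short` and 32-bit `int`; the function names are hypothetical.

```c
/* What C does for (unsigned)sx: sign-extend to int first, then reinterpret. */
static unsigned widen_then_reinterpret(short sx) {
    return (unsigned)sx;               /* same as (unsigned)(int)sx */
}

/* The opposite order, shown only for contrast. */
static unsigned reinterpret_then_widen(short sx) {
    return (unsigned)(unsigned short)sx;
}
```

For `sx = -12345` the first gives 4294954951 (sign extension happened before the unsigned conversion) and the second gives 53191.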
Chp3 Assembly
1 Basic
Format | Value |
---|---|
$Imm | Imm |
Imm | M[Imm] |
(r1,r2) | M[R[r1]+R[r2]] |
Imm(r1,r2) | M[Imm+R[r1]+R[r2]] |
(,r1,s) | M[s*R[r1]] |
Imm(,r1,s) | M[Imm+s*R[r1]] |
The scale factor s must be 1, 2, 4, or 8
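All the memory forms in the table are special cases of Imm + R[base] + s*R[index], with missing parts treated as zero. A minimal sketch; `effective_address` is a hypothetical helper, not an API.

```c
#include <assert.h>
#include <stdint.h>

/* Address computed for the general form Imm(base,index,scale).
   Pass 0 for a missing base or index and 1 for a missing scale. */
static uint64_t effective_address(int64_t imm, uint64_t base,
                                  uint64_t index, unsigned scale) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return (uint64_t)imm + base + index * scale;
}
```

For example, with an index register holding 0x100, the operand `0xfc(,r1,4)` refers to address 0xfc + 4*0x100 = 0x4fc.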
Registers
%r12~%r15, like ==%rbx and %rbp==, are callee-saved
%r8 and %r9 hold the 5th and 6th arguments, respectively
suffix
qlwb
lea can only be used with q
Operands
The operand of a unary arithmetic/logical operation can be in memory
Either operand of a binary arithmetic/logical operation can be in memory (but not both at once); the first operand can also be an immediate
The first operand of the MOVS/MOVZ and CMOV instructions can be in memory; the second operand must be a register
The operands of CMOV instructions cannot be single bytes; their length is given by the register name, not by an instruction suffix (compare with what the suffixes of movb and movl mean!)
Condition code
instruction | Condition code |
---|---|
leaq | No condition code |
inc/dec | Set ZF and OF, not CF |
Logical operations | Set CF and OF to 0 |
Shift operations | Set CF to the last bit shifted out and OF to 0 |
instruction | condition | Remarks |
---|---|---|
setl | OF^SF | |
setb | CF | |
sets | SF | If it is negative |
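How the set conditions follow from the flags can be modeled in C. This is a sketch of the flags produced by a 32-bit compare, written for illustration only; `flags_t` and `cmp_flags` are hypothetical names, and the model covers only the `cmpl b, a` case (which computes a - b).

```c
#include <stdint.h>

typedef struct { int zf, sf, cf, of; } flags_t;

/* Flags as set by `cmpl b, a` (AT&T operand order), i.e. by a - b. */
static flags_t cmp_flags(int32_t a, int32_t b) {
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t t = ua - ub;                        /* wraps modulo 2^32 */
    flags_t f;
    f.zf = (t == 0);
    f.sf = (int)(t >> 31);                       /* sign bit of the result */
    f.cf = (ua < ub);                            /* unsigned borrow */
    f.of = (int)(((ua ^ ub) & (ua ^ t)) >> 31);  /* signed overflow */
    return f;
}

static int setl(flags_t f) { return f.sf ^ f.of; }  /* signed a < b   */
static int setb(flags_t f) { return f.cf; }         /* unsigned a < b */
static int sets(flags_t f) { return f.sf; }         /* result negative */
```

The XOR with OF is what makes `setl` correct when the subtraction overflows: comparing INT32_MIN with 1 leaves SF = 0, but OF = 1, so `setl` still reports "less".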
2 Rare assembly instructions
Data transfer related
instruction | effect | Remarks |
---|---|---|
cltq | Sign-extend %eax into %rax | convert long to quad |
movabsq | The source is an immediate that can be a full 64 bits; the destination must be a register | An ordinary movq immediate is limited to 32 bits |
Arithmetic related
Shift operations: k can be held in ==%cl== (a single byte); only its low m bits are used, where 2^m is the operand width in bits
sar: with only one operand, k is 1
instruction | effect | Remarks |
---|---|---|
(i)mulq S | rdx:rax ← S * rax | i means signed; otherwise unsigned |
cqto | rdx:rax ← sign-extend(rax) | convert quad to oct; use cqto before idivq, and clear rdx before divq |
(i)divq S | rax ← rdx:rax / S; rdx ← rdx:rax % S | Note: S is the divisor |
Jump related
jmp *%rax
jmp *(%rax)
Process related
leave
Is equivalent to:
movq %rbp, %rsp
popq %rbp
Followed by ret
3 Logic
See review materials P6~7
Pay attention to switch: jmp *JUMP_LIST(,index,8)
The ==*== is important! It marks an indirect jump through the table entry
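The jump-table idea can be imitated in portable C with an array of function pointers. This is only an analogy (calls instead of jumps), and the case values 100 to 103 and the case bodies are made up for the example.

```c
/* Hypothetical case bodies. */
static long case_100(long x)     { return x * 13; }
static long case_102(long x)     { return x + 10; }
static long case_103(long x)     { return x + 11; }
static long case_default(long x) { (void)x; return 0; }

typedef long (*handler_t)(long);

/* One pointer-sized entry per case value; the gap at 101 points at default. */
static const handler_t jump_table[4] = {
    case_100, case_default, case_102, case_103
};

static long switch_eg(long x, long n) {
    unsigned long index = (unsigned long)n - 100;  /* like subq $100, n      */
    if (index > 3)                                 /* like cmpq $3 + ja      */
        return case_default(x);
    return jump_table[index](x);                   /* like jmp *TABLE(,i,8)  */
}
```

The single unsigned comparison handles both n < 100 and n > 103, since a negative difference wraps to a huge unsigned value.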
4 process
Stack frame layout
- (Saved %rbp)
- Saved registers (callee-saved)
- Local variables (not necessarily 8-byte aligned; they follow the same alignment rules as a struct)
- Arguments 7 and beyond (all 8-byte aligned; argument k is at %rsp+8*(k-6))
- Return address RA
call and jmp generally use PC-relative encoding
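With PC-relative encoding, the bytes in the instruction hold a signed displacement (stored little-endian) that is added to the address of the next instruction. A sketch with hypothetical helper names; the addresses in the note below are invented, and the cast from `uint32_t` to `int32_t` assumes the usual two's-complement behavior.

```c
#include <stdint.h>

/* Assemble a signed 32-bit displacement from 4 little-endian bytes. */
static int32_t rel32_from_bytes(const unsigned char b[4]) {
    uint32_t u = (uint32_t)b[0]
               | ((uint32_t)b[1] << 8)
               | ((uint32_t)b[2] << 16)
               | ((uint32_t)b[3] << 24);
    return (int32_t)u;   /* assumes two's-complement wraparound */
}

/* Target = address of the NEXT instruction + signed displacement. */
static uint64_t branch_target(uint64_t insn_addr, unsigned insn_len,
                              int32_t rel) {
    return insn_addr + insn_len + (uint64_t)(int64_t)rel;
}
```

For a 5-byte call at 0x400560 whose displacement bytes are `f4 ff ff ff`, the displacement is -12 and the target is 0x400565 - 12 = 0x400559.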
5 floating point
%xmm0: return value
%xmm0~%xmm7: the first 8 floating-point arguments
Chp4 Y86 processor
No CF condition code
No %r15 register
call only supports absolute addressing, not PC-relative addressing
push/pop %rsp behavior: pushq %rsp pushes the original (pre-decrement) value; popq %rsp leaves the value read from memory in %rsp
Logic gate
The rounded gate is AND, the pointed one is OR (exactly the reverse of the letter shapes)
CISC vs RISC
Aspect | CISC | RISC |
---|---|---|
Instruction latency | Varies | Uniformly short |
Encoding length | Variable length | Fixed length (4 bytes) |
Memory addressing | Diverse | Only base-plus-offset addressing |
Memory access | Arithmetic/logical operations can access memory | load/store architecture |
Arithmetic/logical operands | Can be in memory | Registers only |
Level of abstraction | Abstract | Implementation details are visible |
Condition codes | Yes | No |
Procedure calls | Stack-intensive | Register-intensive |
Examples | | |
Concepts
Latency: the time it takes to process one instruction from start to finish
Throughput: the number of instructions processed per unit time (unit: GIPS, i.e. instructions per ns)
$throughput = \dfrac{1}{\text{maximum module delay} + \text{register delay (ps)}} \times 1000$
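The throughput formula above translates directly into code. A minimal sketch with made-up function names; delays are in picoseconds, and the latency helper assumes every stage takes one full clock period.

```c
/* Throughput in GIPS: one instruction completes per clock period (ps). */
static double throughput_gips(double max_module_ps, double reg_ps) {
    return 1000.0 / (max_module_ps + reg_ps);
}

/* Latency in ps: an instruction passes through every stage,
   and each stage lasts one clock period. */
static double latency_ps(int stages, double max_module_ps, double reg_ps) {
    return stages * (max_module_ps + reg_ps);
}
```

As an illustration, splitting 300 ps of logic into three 100 ps stages with a 20 ps register raises throughput from 1000/320 = 3.125 GIPS to 1000/120, about 8.33 GIPS, while latency grows from 320 ps to 360 ps.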
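Other
- The pipeline has to be filled before it runs at full rate; a 5-stage pipeline takes 4 cycles to fill
- Loops can introduce potential data hazards
- When counting cycles, treat the first/last loop iteration separately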
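The fill cost can be folded into a cycle count like this. A sketch under the assumption of an ideal pipeline that issues one instruction per cycle, with stall or bubble cycles counted by hand and passed in; `total_cycles` is a hypothetical name.

```c
/* Cycles to finish n instructions on a k-stage pipeline:
   (k - 1) cycles to fill, then one completion per cycle, plus stalls. */
static long total_cycles(long n_instructions, int stages, long stall_cycles) {
    if (n_instructions <= 0)
        return 0;
    return (stages - 1) + n_instructions + stall_cycles;
}
```

So 100 instructions on a 5-stage pipeline need 104 cycles without stalls; stalls that occur only in the first or last loop iteration are added once rather than per iteration.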
Chp5 Optimization
A larger loop-unrolling factor is not always better; consider capacity misses (registers count too)
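For reference, a 2x2 unrolling (two elements per iteration, two accumulators) looks roughly like the sketch below; the function name is made up. Each extra accumulator occupies a register, which is why a very large unrolling factor can start spilling to memory.

```c
#include <stddef.h>

/* Sum an array with 2x2 unrolling: two independent accumulators
   shorten the dependency chain on each one. */
static long sum_unrolled_2x2(const long *a, size_t n) {
    long acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {   /* two elements per iteration */
        acc0 += a[i];
        acc1 += a[i + 1];
    }
    for (; i < n; i++)            /* leftover element, if n is odd */
        acc0 += a[i];
    return acc0 + acc1;
}
```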
Chp6 Cache
RAM
type | Transistors per bit | Access time | Cost | Application | Sensitive |
---|---|---|---|---|---|
SRAM | 6 | x1 | x1000 | Cache | No |
DRAM | 1 | x10 | x1 | Main memory | Yes |
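Conventional DRAM
Supercell: made up of $\omega$ cells
A DRAM chip has $d = r \times c$ supercells
To access DRAM contents, a RAS request is sent first and the DRAM copies the corresponding row into a buffer; then a CAS request is sent and the corresponding $\omega$ bits of data are copied out. RAS and CAS share the same pins; sending the address in two steps reduces the number of pins on the chip
In total, $\omega + \max(\log_2 r, \log_2 c)$ pins are needed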
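The pin count above can be computed as follows. A sketch with hypothetical helper names, rounding the logarithms up in case r or c is not a power of two.

```c
#include <assert.h>

/* Smallest b with 2^b >= n, i.e. ceil(log2(n)) for n >= 1. */
static unsigned ceil_log2(unsigned long n) {
    assert(n >= 1);
    unsigned b = 0;
    while ((1UL << b) < n)
        b++;
    return b;
}

/* w data pins plus shared row/column address pins. */
static unsigned dram_pins(unsigned w, unsigned long r, unsigned long c) {
    unsigned br = ceil_log2(r), bc = ceil_log2(c);
    return w + (br > bc ? br : bc);
}
```

A chip with 128 supercells of 8 bits arranged as 16 rows by 8 columns needs 8 + max(4, 3) = 12 pins, compared with 8 + 7 = 15 if the full address were sent at once.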
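Enhanced DRAM
Name | Features |
---|---|
FPM | Accesses to the same row can be read directly from the buffer; a single RAS request is enough |
EDO | An enhanced FPM |
SDRAM | Faster than asynchronous DRAM |
DDR SDRAM | Twice the speed of SDRAM |
VRAM | Optimized for graphics systems |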
ROM
type | Erase/write cycles | Application |
---|---|---|
PROM | 1 | |
EPROM | 1000 | |
EEPROM | 10^5 | Flash memory, SSD |
Firmware: programs stored in ROM, e.g. the BIOS and drivers
BUS
Bus transactions: read transactions and write transactions
Buses: system bus, memory bus
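DISK
Note the difference between the units GB and GiB
Platter -> surface -> track -> sector
Cylinders: their number equals the number of tracks per surface
Computing disk capacity: remember that each platter has two surfaces
Computing access time (ms): $T_{access} = T_{avg\ seek} + \dfrac{60 \times 1000}{RPM} \times \left(\dfrac{1}{2} + \dfrac{1}{\text{average sectors per track}}\right)$
Seek time and rotational latency are roughly equal, so 2 × seek time is a reasonable estimate of the access time
The disk controller maintains the mapping between the logical disk and the physical disk
Concept: memory-mapped I/O
Concept: DMA (direct memory access)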
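The capacity and access-time rules from this DISK section can be written out as below. A sketch with made-up function names; the disk parameters in the note are illustrative values, not taken from the notes.

```c
/* Capacity in bytes; each platter contributes two surfaces. */
static unsigned long long disk_capacity_bytes(unsigned bytes_per_sector,
                                              unsigned sectors_per_track,
                                              unsigned tracks_per_surface,
                                              unsigned platters) {
    return (unsigned long long)bytes_per_sector * sectors_per_track
           * tracks_per_surface * 2ULL * platters;
}

/* Average access time in ms: seek + half a rotation + one sector transfer. */
static double disk_access_ms(double avg_seek_ms, double rpm,
                             double avg_sectors_per_track) {
    double ms_per_rotation = 60.0 * 1000.0 / rpm;
    return avg_seek_ms
           + ms_per_rotation * (0.5 + 1.0 / avg_sectors_per_track);
}
```

For a 7200 RPM disk with a 9 ms average seek and 400 sectors per track this gives 9 + 8.333 * 0.5025 = 13.1875 ms, almost all of it seek and rotation. Capacity comes out in bytes, so divide by 10^9 for GB or 2^30 for GiB.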
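SSD
Reads are faster than writes
Data is read and written in units of pages
A page can be written only after it has been erased
Why writes are slow
- Erasing is slow, on the order of 1 ms (reads are on the order of 50 us)
- If the block already contains data, that data must be copied first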
Cache
One way contains many lines
The different kinds of cache misses
- Cold miss / compulsory miss
- Conflict miss
- Capacity miss
Computing cache size
$cachesize = datasize + (validbitsize + tagsize) \times blocknumber$
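The size formula above, together with the address split into tag, set index, and block offset, can be sketched as follows. The names (`cache_layout_t`, `cache_layout`, and so on) are hypothetical, and the sketch assumes the number of sets S and the block size B are powers of two and that each line has a single valid bit.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { unsigned s, b, t; uint64_t total_bits; } cache_layout_t;

/* log2 of a power of two. */
static unsigned log2_exact(uint64_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    unsigned k = 0;
    while ((n >> k) != 1)
        k++;
    return k;
}

/* m: address bits, S: sets, E: lines per set, B: block size in bytes. */
static cache_layout_t cache_layout(unsigned m, uint64_t S, uint64_t E,
                                   uint64_t B) {
    cache_layout_t c;
    c.s = log2_exact(S);                        /* set index bits    */
    c.b = log2_exact(B);                        /* block offset bits */
    c.t = m - c.s - c.b;                        /* tag bits          */
    c.total_bits = S * E * (8 * B + 1 + c.t);   /* data + valid + tag */
    return c;
}

static uint64_t set_index(uint64_t addr, cache_layout_t c) {
    return (addr >> c.b) & ((1ULL << c.s) - 1);
}

static uint64_t tag_of(uint64_t addr, cache_layout_t c) {
    return addr >> (c.s + c.b);
}
```

For a direct-mapped cache with 32-bit addresses, 256 sets, and 4-byte blocks, this yields s = 8, b = 2, t = 22, and 256 * (32 + 1 + 22) = 14080 bits in total, of which only 8192 are data.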
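Effect of cache parameters
parameter change | Hit time | Hit rate | Miss penalty | Fraction of useful data |
---|---|---|---|---|
Larger cache | Increases (understand why) | Increases (fewer capacity misses) | - | - |
Larger block | - | Larger blocks improve spatial locality; fewer lines hurt temporal locality | Increases (copy cost) | Increases |
Higher associativity | Increases (understand why) | Fewer conflict misses/thrashing; may amplify the effect of capacity misses | Increases (cost of choosing a victim line) | Decreases (tag bits get longer) |
The lower a level sits in the memory hierarchy, the less it can tolerate misses; it is worth giving up a little hit time, so a higher associativity is chosen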
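Effect of write policies
- Write-through & no-write-allocate
- Write-back & write-allocate
- Write-back reduces bus traffic at the cost of extra complexity (dirty bit)
- Write-back is used more at the lower levels, because they cannot tolerate repeated misses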