Contents:
- pipelining basics: data path implications, introducing hazards, performance of pipelines, describing the basic 5-stage RISC pipeline
- hazards
- how the simple 5-stage pipeline is actually implemented, focusing on control and how hazards are dealt with
- interaction between pipelining and various aspects of instruction set design (including: exceptions and their interaction with pipelining)
- how the 5-stage pipeline can be extended to handle longer-running floating-point instructions
- puts these concepts together in a case study of a deeply pipelined processor, the MIPS R4000/4400, including both the 8-stage integer pipeline and the floating-point pipeline
- concept of dynamic scheduling and the use of scoreboards to implement dynamic scheduling
一、Introduction
- Pipelining:
- multiple instructions are overlapped in execution
- parallelism among the actions needed to execute an instruction
- pipe stage / pipe segment: each step in the pipeline completes a part of an instruction. Different steps are completing different parts of different instructions in parallel.
- throughput of an instruction pipeline: determined by how often an instruction exits the pipeline
- RISC architectures: dramatic simplifications in the implementation of pipelining
- all operations on data apply to data in registers and typically change the entire register (32 or 64 bits per register)
- the only operations that affect memory are load and store operations (memory->register / register->memory). Load and store operations that load or store less than a full register (e.g., a byte, 16 bits, or 32 bits) are often available
- the instruction formats are few in number, with all instructions typically being one size.
- A simple Implementation of a RISC Instruction Set (without pipelining)
- every instruction takes at most 5 clock cycles
- IF: instruction fetch cycle
- ID: instruction decode/register fetch cycle
- EX: execution/effective address cycle
- MEM: memory access
- WB: write-back cycle
- requires several temporary registers that are not part of the architecture
- only focus on the integer subset of a RISC architecture (load-store word, branch, integer ALU operations)
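A minimal sketch of the unpipelined execution model, assuming a toy dictionary-based instruction encoding of my own (not from the text): each instruction finishes all five steps before the next one starts.

```python
# Unpipelined execution sketch: one instruction completes IF, ID, EX, MEM, WB
# before the next instruction begins. Instruction format is an illustrative
# assumption, e.g. {"op": "add", "rd": 1, "rs1": 2, "rs2": 3}.
def execute_one(state, pc):
    # IF: instruction fetch
    inst = state["imem"][pc]
    npc = pc + 4
    # ID: decode / register fetch
    a = state["regs"][inst.get("rs1", 0)]
    b = state["regs"][inst.get("rs2", 0)]
    imm = inst.get("imm", 0)
    # EX: execution / effective address calculation
    if inst["op"] in ("load", "store"):
        alu_out = a + imm                    # effective address
    elif inst["op"] == "beq":
        alu_out = pc + imm                   # PC-relative branch target
        if a == b:
            npc = alu_out
    else:
        alu_out = a + b                      # integer ALU op (add shown)
    # MEM: memory access
    lmd = None
    if inst["op"] == "load":
        lmd = state["dmem"][alu_out]
    elif inst["op"] == "store":
        state["dmem"][alu_out] = b
    # WB: write-back
    if inst["op"] == "load":
        state["regs"][inst["rd"]] = lmd
    elif inst["op"] not in ("store", "beq"):
        state["regs"][inst["rd"]] = alu_out
    return npc
```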
- The Classic 5-stage Pipeline for a RISC Processor
- Each of the clock cycles from previous section becomes a pipe stage – a cycle in the pipeline
- Points to note about the RISC pipeline:
- separate instruction and data memories
- perform register write in the first half of the clock cycle, and read in the second half
- pipeline registers: ensure that instructions in different stages of the pipeline do not interfere with one another. This separation is done by introducing pipeline registers between successive stages of the pipeline, so that at the end of a clock cycle all the results from a given stage are stored into a register that is used as the input to the next stage on the next clock cycle. The pipeline registers also play the key role of carrying intermediate results from one stage to another where the source and destination may not be directly adjacent
- Basic Performance Issues in Pipelining
Imbalance among the pipe stages reduces performance because the clock can run no faster than the time needed for the slowest pipeline stage
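A compact way to state this (a sketch of the usual relations; the notation is mine, not from the notes): the pipeline clock can be no shorter than the slowest stage plus the pipeline-register overhead, so the speedup over the unpipelined machine is roughly the unpipelined instruction time divided by that clock.

```latex
% T_{clk,pipe}: pipelined clock period; T_{stage,i}: latency of stage i;
% T_{ovh}: pipeline-register (latch) overhead per stage.
\[
T_{\mathrm{clk,pipe}} = \max_i \, T_{\mathrm{stage},i} + T_{\mathrm{ovh}},
\qquad
\text{Speedup} \approx \frac{T_{\mathrm{instr,unpipelined}}}{T_{\mathrm{clk,pipe}}}
\]
```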
二、The Major Hurdle of Pipelining: Pipeline Hazards
- Performance of Pipelines with Stalls: the key formulas are written out below
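The formulas referred to above are the standard ones (written here for reference; they assume the unpipelined machine has neither pipeline overhead nor stalls):

```latex
\[
\text{Speedup} \;=\;
\frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock cycle}_{\text{unpipelined}}}
     {\text{CPI}_{\text{pipelined}} \times \text{Clock cycle}_{\text{pipelined}}}
\]
\[
\text{CPI}_{\text{pipelined}} \;=\; \text{Ideal CPI} \;+\; \text{Pipeline stall clock cycles per instruction}
\]
% If the ideal CPI is 1, the clock cycles are equal, and the unpipelined CPI
% equals the pipeline depth:
\[
\text{Speedup} \;=\;
\frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
\]
```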
- Data hazards: RAW, WAR, WAW
- RAW hazards:
In the example above, only the “or” (register read occurs in the second half of the cycle) and “xor” (register read occurs in clock cycle 6) instructions operate properly.
- Minimizing data hazard stalls by forwarding (also called bypassing or short-circuiting): the idea is to pass the result directly to the functional unit that needs it instead of waiting for it to be written back. Concretely:
- The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs
- If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file
As a further example, we would need to forward the values of the ALU output and memory unit output from the pipeline registers to the ALU and data memory inputs
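A minimal sketch of the two points above, assuming a toy representation of the EX/MEM and MEM/WB pipeline registers (field names are illustrative, not from the text): both results are fed to the ALU-input mux, and control prefers the most recent producer of the source register.

```python
# ALU-input forwarding sketch: results sitting in EX/MEM and MEM/WB are always
# fed back to the ALU-input muxes; control selects the forwarded value when the
# producing instruction writes the register the ALU needs.
def alu_operand(src_reg, regfile, ex_mem, mem_wb):
    # ex_mem / mem_wb describe the instructions in those pipeline registers,
    # e.g. {"writes_reg": True, "rd": 5, "value": 42}
    if src_reg != 0 and ex_mem["writes_reg"] and ex_mem["rd"] == src_reg:
        return ex_mem["value"]       # forward from the instruction one ahead
    if src_reg != 0 and mem_wb["writes_reg"] and mem_wb["rd"] == src_reg:
        return mem_wb["value"]       # forward from the instruction two ahead
    return regfile[src_reg]          # no hazard: use the register-file value
```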
- Data Hazards Requiring Stalls: not all data hazards can be resolved by bypassing. When an instruction uses the result of an immediately preceding load, the data hazard from using the result of the load instruction cannot be completely eliminated with simple hardware
In this case we need to add hardware, called a pipeline interlock, to preserve the correct execution pattern. This pipeline interlock introduces a stall or bubble. The CPI for the stalled instruction increases by the length of the stall (1 clock cycle in this case). The corresponding figure shows the pipeline before the stall on top and after the stall below.
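A minimal sketch of the interlock check, using the same toy pipeline-register representation as above (field names are illustrative): if the instruction in ID/EX is a load whose destination matches a source of the instruction being decoded, the pipeline stalls for one cycle.

```python
# Load-use interlock sketch: detect a load in ID/EX whose destination register
# is a source of the instruction currently sitting in IF/ID.
def must_stall(id_ex, if_id):
    if not id_ex.get("is_load"):
        return False
    rd = id_ex.get("rd", 0)
    return rd != 0 and rd in (if_id.get("rs1"), if_id.get("rs2"))

# When must_stall(...) is true, control holds PC and IF/ID for one cycle and
# injects a bubble (no-op) into ID/EX; one cycle later the dependent
# instruction can get the loaded value by forwarding from MEM/WB.
```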
- control hazards
- Concept: if a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken, or untaken. If instruction i is a taken branch, then the PC is usually not changed until the end of ID, after the completion of the address calculation and comparison
- The simplest approach: redo the fetch of the instruction following a branch. The first IF cycle is essentially a stall, because it never performs useful work. But if the branch is untaken, then the repetition of the IF stage is unnecessary because the correct instruction was indeed fetched
- Reducing Pipeline Branch Penalties: four approaches
- Simplest (the approach above): freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known
- predicted-untaken scheme: treat every branch as not taken, simply allowing the hardware to continue as if the branch were not executed. If the branch is taken, we need to turn the fetched instruction into a no-op and restart the fetch at the target address. The figure in the text shows both situations
- predicted-taken scheme: treat every branch as taken (in this 5-stage pipeline the target is not known any earlier than the branch outcome, so the scheme gives no advantage here)
- delayed branch: define a branch delay slot; the instruction in the delay slot is always executed, whether or not the branch is taken, i.e. the execution order is:
(1) branch instruction
(2) delay slot
(3) subsequent instructions
What should go in the delay slot? Ideally an instruction that must be executed whether or not the branch is taken, and preferably one that has nothing to do with the branch
- Performance of branch schemes: formula (written out below)
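The formula referred to above is the standard one, assuming an ideal CPI of 1 and that branches are the only source of stalls:

```latex
\[
\text{Pipeline speedup} \;=\;
\frac{\text{Pipeline depth}}
     {1 \;+\; \text{Branch frequency} \times \text{Branch penalty}}
\]
```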
- Reducing the Cost of Branches Through Prediction
- Static Branch Prediction
- Dynamic Branch Prediction and Branch-Prediction Buffers
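A minimal sketch of a branch-prediction buffer built from 2-bit saturating counters, indexed by the low-order bits of the branch address (the table size, the initial counter value, and the class/method names are my assumptions for illustration):

```python
# Branch-prediction buffer sketch: a small table of 2-bit saturating counters
# indexed by the low-order bits of the branch instruction's address.
# Counter values 0-1 predict not taken; 2-3 predict taken.
class BranchPredictionBuffer:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries        # start at "weakly not taken"

    def _index(self, pc):
        return (pc >> 2) % self.entries      # drop the byte offset, keep low bits

    def predict_taken(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Because the counters saturate, a steadily taken loop branch mispredicts only at loop exit, avoiding the double misprediction a 1-bit predictor would suffer (once on exit, once on re-entry).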
三、How is Pipelining Implemented?
- A Simple Implementation of RISC V (unpipelined)
- A Basic Pipeline for RISC V
Because every pipe stage is active on every clock cycle, all operations in a pipe stage must complete in 1 clock cycle and any combination of operations must be able to occur at once. Furthermore, pipelining the data path requires that values passed from one pipe stage to the next must be placed in registers.
Any instruction is active in exactly one stage of the pipeline at a time; therefore, any actions taken on behalf of an instruction occur between a pair of pipeline registers. Thus, we can also look at the activities of the pipeline by examining what has to happen on any pipeline stage depending on the instruction type
The work done in each stage (explaining the figure in the text):
- IF: 1) fetch the current instruction from memory into IF/ID.IR, the pipeline register between IF and ID; 2) compute the next PC value
- ID: 1) read the operands rs1 and rs2 from the register file; 2) pass NPC and IR on to the next stage; 3) sign-extend the immediate field of the instruction
- EX:
  - ALU instruction: 1) pass the current instruction IR on to the next stage; 2) perform the ALU operation
  - Load/Store: 1) pass the current instruction IR on to the next stage; 2) the ALU adds the base register value and the immediate to form the effective address; 3) for stores: pass the data to be written (B) on to the next stage
  - Branch: 1) compute the branch target address; 2) evaluate the branch condition and store the result in the EX/MEM.cond register
- MEM:
  - ALU instruction: 1) keep passing the instruction along; 2) pass the ALU result along
  - Load/Store: 1) keep passing the instruction along; 2) for loads: read memory at the effective address computed in the previous stage and put the data in MEM/WB.LMD; for stores: write the data B to that memory address
- WB:
  - ALU instruction: 1) write the ALU result back to the destination register (using the rd field carried with the instruction)
  - Load only: 1) write the loaded data (now in MEM/WB.LMD) back to the destination register (using the rd field carried with the instruction)
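A minimal sketch of one clock cycle written as pipeline-register transfers; only integer register-register ALU ops are shown, and the state layout (a dict of pipeline registers) is an illustrative assumption. The register names (IF/ID, ID/EX, EX/MEM, MEM/WB) follow the notes. All new pipeline-register values are computed from the old ones and committed together at the end, modelling the clock edge; WB writes the register file before ID reads it, modelling the write-in-first-half / read-in-second-half convention.

```python
# One cycle of the 5-stage pipeline as register transfers (ALU reg-reg ops only).
def step(s):
    new = {}
    # WB: write the ALU result back to rd (done first: write, then ID reads)
    mem_wb = s["MEM/WB"]
    if mem_wb["ir"] is not None:
        s["regs"][mem_wb["ir"]["rd"]] = mem_wb["alu_out"]
    # MEM: an ALU op just passes IR and the ALU result along
    ex_mem = s["EX/MEM"]
    new["MEM/WB"] = {"ir": ex_mem["ir"], "alu_out": ex_mem["alu_out"]}
    # EX: perform the ALU operation on the operands read in ID
    id_ex = s["ID/EX"]
    alu_out = None if id_ex["ir"] is None else id_ex["a"] + id_ex["b"]
    new["EX/MEM"] = {"ir": id_ex["ir"], "alu_out": alu_out}
    # ID: read rs1/rs2 from the register file, pass NPC and IR along
    ir = s["IF/ID"]["ir"]
    new["ID/EX"] = {
        "ir": ir,
        "a": s["regs"][ir["rs1"]] if ir else 0,
        "b": s["regs"][ir["rs2"]] if ir else 0,
        "npc": s["IF/ID"]["npc"],
    }
    # IF: fetch the instruction at PC into IF/ID.IR and compute the next PC
    new["IF/ID"] = {"ir": s["imem"].get(s["pc"]), "npc": s["pc"] + 4}
    s["pc"] += 4
    s.update(new)      # commit all pipeline registers at the clock edge
```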
- Implementing the Control for the RISC V Pipeline
The figure in the text illustrates the forwarding mechanism, used mainly to resolve data hazards: it lets an instruction's result be used by later instructions before that instruction has completed its WB stage
- Dealing with branches in the pipeline
- Deciding whether a branch is taken: determined in the EX stage by comparing the values in two registers
- The branch target address must also be computed
- Add an extra adder: compute both possible PC values during the ID stage, so that the correct PC can be selected by the end of the EX cycle
The figure in the text shows a pipelined data path with this adder in ID and the evaluation of the branch condition in EX. This pipeline incurs a two-cycle penalty on branches.
Original arrangement:
The current instruction sits at PC. As it executes, PC+4 is carried all the way to the EX stage (and is also fed back to PC). Depending on whether the branch is taken, a select signal is sent back to the mux in the IF stage. If the branch is taken, the value of PC+4 is added to the offset determined by the branch and the sum is written into PC, so that the next instruction fetched from PC is the one at the branch target.
With the adder added in the ID stage:
The current instruction sits at PC. As it executes, PC+4 is passed to the adder in the ID stage (and also fed back to PC); the adder adds the branch offset to it and sends the taken-branch target back to the mux in the IF stage. When the instruction later reaches EX, the branch condition is evaluated and that decision is sent to the same mux, which then determines whether to overwrite PC with the target.
In other words, with the extra adder, both possible next-PC values are already computed by the end of ID, and PC can be updated correctly at the end of EX. A sketch of this next-PC selection follows.
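A minimal sketch of that next-PC selection (the argument layout is an assumption): the target computed by the ID adder is carried along with the branch, and when the branch reaches EX the evaluated condition selects between that target and the sequential PC+4.

```python
# IF-stage next-PC mux with the target adder in ID and the condition in EX.
def next_pc(pc, branch_in_ex):
    # branch_in_ex is None, or {"cond": bool, "target": int}, where "target"
    # was computed by the ID-stage adder one cycle earlier and carried along.
    if branch_in_ex is not None and branch_in_ex["cond"]:
        return branch_in_ex["target"]    # taken branch resolved in EX
    return pc + 4                        # default: sequential fetch
```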
The deeper the pipeline, the worse the branch penalty in clock cycles, and the more critical that branches be accurately predicted
四、What Makes Pipelining Hard to Implement?
- Dealing with Exceptions
- Instruction Set Complications
五、Extending the RISC V Integer Pipeline to Handle Multicycle Operations
- Hazards and Forwarding in Longer Latency Pipelines
- Maintaining Precise Exceptions
- Performance of a Simple RISC V FP Pipeline
六、Putting It All Together: The MIPS R4000 Pipeline
七、Cross-Cutting Issues
- compiler/static scheduling: when there is a data dependence, the compiler can attempt to schedule instructions to avoid the hazard (previously the choices were forwarding or stalling, where no new instructions are fetched or issued until the dependence is cleared)
- Dynamic Scheduled Pipelines (with a scoreboard): hardware rearranges the instruction execution to reduce the stalls
- the scoreboarding algorithm
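A minimal sketch of the bookkeeping a scoreboard maintains, following the classic tables (functional-unit status with Fi/Fj/Fk/Qj/Qk/Rj/Rk, plus register result status); only the issue-stage checks are shown, and the class/field names are my assumptions:

```python
# Scoreboard bookkeeping sketch: functional-unit status + register result
# status, with the issue-stage hazard checks (structural and WAW).
class Scoreboard:
    def __init__(self, unit_names):
        self.units = {
            name: {"busy": False, "op": None, "fi": None, "fj": None, "fk": None,
                   "qj": None, "qk": None, "rj": True, "rk": True}
            for name in unit_names
        }
        self.reg_result = {}                 # register -> unit that will write it

    def can_issue(self, unit, dest):
        # stall issue on a structural hazard (unit busy) or a WAW hazard
        return not self.units[unit]["busy"] and dest not in self.reg_result

    def issue(self, unit, op, dest, src1, src2):
        assert self.can_issue(unit, dest)
        u = self.units[unit]
        u.update(busy=True, op=op, fi=dest, fj=src1, fk=src2,
                 qj=self.reg_result.get(src1),   # unit producing src1, if any
                 qk=self.reg_result.get(src2))   # unit producing src2, if any
        u["rj"] = u["qj"] is None                # operand ready iff no pending producer
        u["rk"] = u["qk"] is None
        self.reg_result[dest] = unit
```

Read operands, execute, and write result would update these same tables: read operands waits until rj and rk are both true, and write result waits for WAR hazards to clear before freeing the unit and clearing its reg_result entry.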