Computer Organization Chapter #4
Part2
Ahmed Hashim
96610606
Parallel Processing
In this part we implement RISC-V in pipelined hardware.
There are 3 ways to execute an instruction:
① Single Cycle (also called sequential)
② Multi Cycle
③ Pipeline (also called parallel)
Contact me: 1
Eng. Ahmed Hashim
| 96610606
1) Single Cycle
Assume we have 3 instructions
lw $s0, 12($s1)
add $s0, $s1, $s2
beq $s2, $t1, Done
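[Timing diagram: lw, add, and beq executed one after another]
Notes
1. In single cycle, CPI is always 1.
2. Cycle time = total of IF + ID + EX + MEM + WB, for any instruction.
3. Total time for the given code (IC = 3):
   Time = IC × CPI × Cycle time
        = 3 × 1 × Cycle time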
2) Multi Cycle
Assume we have 3 instructions
lw $s0, 12($s1)
add $s0, $s1, $s2
beq $s2, $t1, Done
[Timing diagram: lw, add, and beq, each taking only the cycles it needs]
Notes
1. In multi cycle, CPI differs between instruction types:
   Instruction   IF   ID   EX   MEM   WB
   R-type        ✓    ✓    ✓          ✓
   lw            ✓    ✓    ✓    ✓     ✓
   sw            ✓    ✓    ✓    ✓
   beq           ✓    ✓    ✓
   j             ✓
   so we have to calculate the average CPI.
2. Cycle time = max(IF, ID, EX, MEM, WB)
3. Total time for the given code (IC = 3):
   Time = IC × CPI × Cycle time
        = 3 × average CPI × Cycle time
3) Pipeline
Assume we have 3 instructions
lw $s0, 12($s1)
add $s0, $s1, $s2
beq $s2, $t1, Done
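[Timing diagram: lw, add, and beq overlapped, each starting one cycle after the previous one]
Notes
1. In a pipeline, the first instruction takes as many cycles as there are stages; every other instruction adds 1 cycle.
2. Cycle time = max(IF, ID, EX, MEM, WB)
3. Total time for the given code (IC = 3):
   Time = [1 × CPI(first) + (IC − 1) × 1] × Cycle time
        = [1 × 5 + 2 × 1] × Cycle time
        = 7 × Cycle time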
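The three timing formulas can be compared side by side with a minimal Python sketch. The stage latencies used here are hypothetical values chosen only for illustration (they do not come from these notes), and the multi-cycle cycle counts assume lw = 5, R-type = 4, and beq = 3 cycles.

```python
def single_cycle_time(stage_ps, ic):
    """Single cycle: CPI = 1, cycle time = sum of all stage latencies."""
    return ic * 1 * sum(stage_ps)


def multi_cycle_time(stage_ps, cycles_per_instruction):
    """Multi cycle: cycle time = slowest stage, each instruction uses its own cycle count."""
    return sum(cycles_per_instruction) * max(stage_ps)


def pipeline_time(stage_ps, ic):
    """Pipeline: first instruction takes len(stages) cycles, each later one adds 1 cycle."""
    return (len(stage_ps) + (ic - 1)) * max(stage_ps)


# Hypothetical IF, ID, EX, MEM, WB latencies in picoseconds (illustration only).
stages = [200, 100, 200, 200, 100]

print(single_cycle_time(stages, 3))          # 3 x 1 x 800 ps
print(multi_cycle_time(stages, [5, 4, 3]))   # lw, add, beq -> 12 cycles x 200 ps
print(pipeline_time(stages, 3))              # 7 cycles x 200 ps
```

With these assumed latencies the pipeline needs 7 cycles for the 3 instructions, matching the "7 × Cycle time" result in the notes.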
Problem #1 (4.16 in textbook)
In this exercise, we examine how pipelining affects the clock cycle time of the processor.
Problems in this exercise assume that individual stages of the data-path have the
following latencies:
Also, assume that instructions executed by the processor are broken down as follows:
1) What is the clock cycle time in a pipelined and non-pipelined processor?
2-a) What is the total latency of an LW instruction in a pipelined and non-pipelined
processor?
2-b) What is the total latency if we have code of 100 instructions in a pipelined and
non-pipelined processor?
3) If we can split one stage of the pipelined datapath into two new stages, each with
half the latency of the original stage, which stage would you split and what is the
new clock cycle time of the processor?
4,5) What is the utilization of the data memory, read-register port, write-register port
of the “Registers” unit?
6) Assuming a multi-cycle organization is used, find the clock cycle time and the execution
time for 100 instructions.
Problem #2
Assume a new processor composed of only 4 stages: s1 = 100 ps, s2 = 200 ps, s3 = 300 ps,
and s4 = 400 ps.
When pipelining, we need intermediate registers that take 20 ps.
a) Calculate the CPI, cycle time, and total latency for 100 instructions without
pipelining.
b) Repeat a) with pipelining.
c) Find the throughput and the speedup.
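One way to work through the arithmetic is sketched below in Python. It assumes that the non-pipelined processor is single cycle (CPI = 1, no register overhead) and that the pipelined cycle time is the slowest stage plus the 20 ps register delay; both are interpretations of the problem statement, not something the notes state explicitly.

```python
stages_ps = [100, 200, 300, 400]   # s1..s4 from the problem statement
register_ps = 20                   # intermediate (pipeline) register delay
n = 100                            # number of instructions

# a) Non-pipelined, assumed single cycle: one long cycle per instruction.
np_cycle = sum(stages_ps)
np_total = n * 1 * np_cycle

# b) Pipelined: cycle = slowest stage + register delay (assumption).
p_cycle = max(stages_ps) + register_ps
p_cycles = len(stages_ps) + (n - 1)      # fill the pipeline, then 1 per cycle
p_total = p_cycles * p_cycle

# c) Throughput (instructions per ps) and speedup.
throughput = n / p_total
speedup = np_total / p_total

print(np_cycle, np_total)                # 1000 100000
print(p_cycle, p_cycles, p_total)        # 420 103 43260
print(round(speedup, 2))                 # 2.31
```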
Problem #3
The instruction set of a certain machine is divided into 4 instruction types (A, B, C, and
D), and the execution of an instruction can be divided into 4 stages (S1, S2, S3,
and S4). Given the table:
S1 (50 ps) S2 (70 ps) S3 (50 ps) S4 (80 ps)
A(30%) √ √
B(20%) √ √ √
C(20%) √ √ √ √
D(30%) √ √
1. If a single-cycle organization is used, what will the clock cycle be?
2. If a multi-cycle organization is used, what is the average CPI?
3. Complete the following:
                          Multi-Cycle   Pipelined
   Cycle time (ps)        80 ps         80 ps
   Number of Cycles (A)   2             4
   Number of Cycles (B)   3             4
   Number of Cycles (C)   4             4
4. If stage S4 is divided into 2 stages, each requiring 40 ps, complete the table:
                          Multi-Cycle   Pipelined
   Cycle time (ps)        70 ps         70 ps
   Number of Cycles (A)   2             5
   Number of Cycles (B)   3             5
   Number of Cycles (C)   5             5
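Parts 1 and 2 of this problem can be checked with a short Python sketch. The number of stages used by each instruction type is taken from the count of check marks in the given table (A = 2, B = 3, C = 4, D = 2); the variable names are illustrative only.

```python
stage_ps = {"S1": 50, "S2": 70, "S3": 50, "S4": 80}

# instruction type -> (fraction of the mix, number of stages it uses)
mix = {"A": (0.30, 2), "B": (0.20, 3), "C": (0.20, 4), "D": (0.30, 2)}

# 1. Single cycle: one cycle must cover every stage.
single_cycle_clock = sum(stage_ps.values())

# Multi cycle: the clock is set by the slowest stage.
multi_cycle_clock = max(stage_ps.values())

# 2. Average CPI = sum of (fraction x cycles) over all instruction types.
average_cpi = sum(fraction * cycles for fraction, cycles in mix.values())

print(single_cycle_clock)        # 250
print(multi_cycle_clock)         # 80
print(round(average_cpi, 2))     # 2.6
```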
Pipelining introduces problems, called hazards:
1) Structural Hazards
the same component is needed by multiple instructions at the same time
2) Data Hazards
data is read before the new value has been written
3) Control Hazards
occur only with branches (beq)
① Structural hazards
[Pipeline diagram over clock cycles 1 to 8, marking the memory accesses and the register file accesses]
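Problem 1
In clock cycle 4, the 1st and 4th instructions both need to access memory,
BUT the 1st instruction accesses DMem (lw, sw) and the 4th instruction accesses IMem.
RISC-V uses two separate memories, so there is no conflict.
Problem 2
In clock cycle 5, the 1st and 4th instructions both need to access the register file (read & write):
the 4th reads during ID, the 1st writes during WB.
To solve this:
write in the 1st half of the clock cycle
read in the 2nd half of the clock cycle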
② Data hazards
read data before being updated
Assume given code
add $s0, $s1, $s2
sub $s2, $s1, $s0
add saves its result in $s0
sub reads its data from $s0
sub reads the old value of $s0 in CC 3,
before add writes it in CC 5
There are 3 different solutions for data hazards:
1) Rewrite Code
reorder the instructions to put distance between the dependent lines
2) Stall/Freeze or add NOP
do nothing until the data has been saved, then read it
3) Forwarding
use the result as soon as it becomes available
a) Rewrite Code
Assume given code
add $s0, $s1, $s2    # writes $s0
sub $s2, $s1, $s0    # reads $s0
lw $t2, 0($t1)
sw $t1, 0($t3)
Can rewriting the code solve the data hazard?
Yes: the 2 independent instructions (lw and sw) can be moved between add and sub:
add $s0, $s1, $s2
lw $t2, 0($t1)
sw $t1, 0($t3)
sub $s2, $s1, $s0
b) Stall/Freeze/Bubbles or NOP
Assume given code
add $s0, $s1, $s2    # writes $s0
sub $s2, $s1, $s0    # reads $s0
How to solve the problem using stalls?
[Pipeline diagram: sub is held back until add has written $s0]
Another way is to insert NOPs:
add
NOP
NOP
sub
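The NOP insertion above can be automated with a small Python sketch. It assumes a 5-stage pipeline without forwarding in which the register file is written in the first half of a cycle and read in the second half, so a dependent instruction must sit at least 3 slots after its producer. The function name `count_nops` and the `(destination, sources)` tuple format are illustrative choices, not part of the notes.

```python
def count_nops(program, gap=2):
    """Count the NOPs needed so no instruction reads a register
    written by one of the `gap` slots just before it.

    program: list of (destination_register_or_None, [source_registers]).
    """
    slots = []   # destination of every issued slot; None for NOPs and stores
    nops = 0
    for dest, sources in program:
        needed = 0
        for back in range(1, gap + 1):
            if len(slots) >= back:
                producer = slots[-back]
                if producer is not None and producer in sources:
                    # producer is `back` slots away; push it to gap + 1 slots away
                    needed = max(needed, gap - back + 1)
        slots.extend([None] * needed)   # the inserted NOPs
        nops += needed
        slots.append(dest)
    return nops


# add $s0, $s1, $s2 followed by sub $s2, $s1, $s0
example = [("$s0", ["$s1", "$s2"]), ("$s2", ["$s1", "$s0"])]
nops = count_nops(example)
print(nops)                              # 2
print(5 + (len(example) + nops - 1))     # total cycles: 8
```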
c) Forwarding
an alternative to waiting
There are 4 types of forwarding:
1) R-type then R-type: EX to EX
2) R-type then sw: EX to MEM
3) lw then R-type: MEM to EX
4) lw then sw: MEM to MEM
1) R-type then R-type
Assume given code
add $s0, $s1, $s2    # writes $s0
sub $s2, $s1, $s0    # reads $s0
How to solve the problem using forwarding?
[Pipeline diagram over clock cycles 1 to 6: add, then sub]
The result for $s0 is calculated by EX in CC 3;
we need it in EX in CC 4.
2) R-type then sw
Assume given code
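add $s0, $s1, $s2    # writes $s0
sw $s0, 12($s1)      # reads $s0
How to solve the problem using forwarding?
[Pipeline diagram over clock cycles 1 to 6: add, then sw]
The result can't be forwarded directly from CC 3 to CC 5.
[Second pipeline diagram over clock cycles 1 to 6: add, then sw]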
3) lw then R-type
Assume given code
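lw $s0, 12($s1)      # writes $s0
addi $s1, $s0, 10    # reads $s0
(the same applies when the second instruction is lw $s1, 12($s0))
How to solve the problem using forwarding?
[Pipeline diagram over clock cycles 1 to 6: lw, then addi]
Can't forward: the loaded value is ready only at the end of CC 4 (MEM), but addi needs it at the start of CC 4 (EX).
Need 1 stall:
[Pipeline diagram over clock cycles 1 to 7: lw, NOP, addi]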
4) lw then sw
Assume given code
lw $s0, 12($s1)
sw $s0, 16($s2)
How to solve problem using forwarding?
[Pipeline diagram over clock cycles 1 to 6: lw, then sw]
Note:
Cases 1, 2, and 4: no stall needed.
Only case 3 needs 1 stall when using forwarding.
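The four forwarding cases can be summarized by comparing the clock cycle in which the producer's value becomes ready with the cycle in which the next instruction needs it. The Python sketch below assumes back-to-back instructions in a 5-stage pipeline, with the producer fetched in CC 1, and assumes that sw uses the forwarded value as its store data (needed in MEM). The table names `READY_CC` and `NEED_CC` are illustrative.

```python
# Clock cycle at whose END the producer's result is available.
READY_CC = {"R-type": 3, "lw": 4}      # EX ends in CC 3, MEM ends in CC 4

# Clock cycle at whose START the following instruction needs the value.
NEED_CC = {"R-type": 4, "sw": 5}       # EX in CC 4, MEM in CC 5


def stalls_with_forwarding(producer, consumer):
    """Stall cycles needed between two adjacent dependent instructions."""
    return max(0, READY_CC[producer] + 1 - NEED_CC[consumer])


for producer, consumer in [("R-type", "R-type"), ("R-type", "sw"),
                           ("lw", "R-type"), ("lw", "sw")]:
    print(producer, "then", consumer, "->", stalls_with_forwarding(producer, consumer))
```

Only the lw then R-type pair yields 1 stall, which agrees with the note above.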
③ Control hazards
occur only with branch instructions (beq)
Assume given code
beq $s0, $s1, Exit
add $s2, $s1, $s0
sub $s4, $t2, $t1
mul $t2,$s1, $s7
...
...
Exit: or $s1, $s2, $s3
beq compares rs and rt (is rs − rt = 0?) in the EX stage;
only then do we know which instruction comes next.
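To handle control hazards without waiting: predict the branch outcome.
Same previous example, predict NOT taken:
the pipeline keeps fetching add, sub, and mul after beq.
- If the prediction is correct, execution simply continues.
- If the prediction is incorrect, the instructions fetched after beq are flushed
  and fetching resumes at the branch target (or).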
Problem #1
Given MIPS code
add $t3, $t1, $t1
lw $t2, 60($t1)
lw $t1, 40($t2)
slt $t1, $t1, $t2
sw $t1, 20($t2)
a) Determine the dependencies.
b) If we use only stalls, draw the pipeline diagram and find the number of cycles.
c) If we use full forwarding, draw the pipeline diagram and find the number of cycles and the CPI.
Problem #2
Given MIPS code
or r1,r2,r3
or r2,r1,r4
or r1,r1,r2
Also, assume the following cycle times for each of the options related to forwarding:
No Forwarding   Full Forwarding   ALU-ALU Forwarding Only
250 ps          300 ps            290 ps
1) Indicate dependences and their type.
2) Assume there is no forwarding in this pipelined processor. Indicate hazards and
add nop instructions to eliminate them.
3) Assume there is full forwarding. Indicate hazards and add NOP instructions to
eliminate them.
4) What is the speedup achieved by adding full forwarding to a pipeline that had no
forwarding?
Problem #3 (4.10 in textbook)
Given MIPS code
sw r16,12(r6)
lw r16,8(r6)
beq r5,r4,Label // Assume r5 = r4
add r5,r1,r4
slt r5,r15,r4
Label:
1) For this problem, assume that all branches are perfectly predicted (this eliminates all
control hazards). If we have only one memory (for both instructions and data),
what is the total number of clock cycles? Can you remove this structural hazard by adding
NOPs, as is done for data hazards? Why?
We have only 1 memory.
[Pipeline diagram: sw, lw, beq, add, slt; while sw and lw are in MEM the memory is busy, so IF can't be performed in those cycles]
Total = 11 cycles
We can't use NOPs, as a NOP is also considered an instruction (it must be fetched from the same memory);
we can only stall (freeze) the pipeline.
2) Assuming stall-on-branch and no delay slots, what speedup is achieved on this code
if branch outcomes are determined in the ID stage, relative to the execution
if branch outcomes are determined in the EX stage?
Branch at ID
[Pipeline diagram: sw, lw, beq, add, slt]
Total = 10 cycles
Branch at EX
[Pipeline diagram: sw, lw, beq, add, slt]
Total = 11 cycles
Speedup = 11 / 10 = 1.1
3) Assuming branch prediction is used, what is the number of clock cycles if the branch
is predicted not taken?
[Pipeline diagram: sw, lw, beq, add, slt, with add and slt flushed]
The branch must be taken, as r4 = r5,
so add and slt must be flushed.
Problem #4
Given RISC-V code
add $t1, $t1, $t2
sub $t3, $t1, $t2
and $t4, $t4, $t1
lw $t5, 80($t2)
addi $t5, $t5, 10
1) Assume no forwarding. Insert NOPs to resolve the data hazards and find the number of clock cycles.
2) Can reordering the code solve the problem or reduce the number of stalls?
Problem #5
For the problems in this exercise, assume that there are no pipeline stalls and that the
breakdown of executed instructions is as follows:
add    addi   not    beq    lw     sw
20%    20%    0%     25%    25%    10%
1) In what fraction of all cycles is the data memory used?
2) In what fraction of all cycles is the input of the sign-extend circuit needed? What is
this circuit doing in cycles in which its input is not needed?
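Because there are no stalls, every stage handles one instruction per cycle, so the fraction of cycles in which a unit is used equals the fraction of instructions that use it. A minimal Python sketch of that reasoning follows; which instructions need the sign-extended immediate (addi, beq, lw, sw) is stated here as an assumption about the standard datapath.

```python
# Instruction mix from the problem statement, in percent.
mix = {"add": 20, "addi": 20, "not": 0, "beq": 25, "lw": 25, "sw": 10}

# 1) Data memory is accessed only by loads and stores.
data_memory_pct = mix["lw"] + mix["sw"]

# 2) The sign-extend input is needed by every instruction with an immediate
#    (assumed: addi, beq, lw, sw).
sign_extend_pct = mix["addi"] + mix["beq"] + mix["lw"] + mix["sw"]

print(data_memory_pct)   # 35
print(sign_extend_pct)   # 80
```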
Branch Prediction
1-bit predictor states: 1 (Taken), 0 (Not Taken)
2-bit predictor states: 11 and 10 (predict Taken), 00 and 01 (predict Not Taken)
Find the prediction accuracy if:
a) a 1-bit predictor is initially at Taken (1).
b) a 2-bit predictor is initially at Not Taken (01).
Given the actual branch outcomes:
NT, T, NT, NT, T, T, T, NT
Solution
1-bit: Accuracy = __/8
Location     T(1)
Actual       NT   T   NT   NT   T   T   T   NT
Prediction
2-bits: Accuracy = __/8
Location     01
Actual       NT   T   NT   NT   T   T   T   NT
Prediction
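The two tables can be filled in by stepping a predictor through the outcome sequence. The Python sketch below is one possible model: the 1-bit predictor simply remembers the last outcome, and the 2-bit predictor is assumed to be a saturating counter (00 and 01 predict not taken, 10 and 11 predict taken, count up on taken, down on not taken). If the state diagram used in class has different transitions, the 2-bit result may differ.

```python
def one_bit_correct(outcomes, state):
    """1-bit predictor: predict the last outcome. Returns the number of correct predictions."""
    correct = 0
    for actual in outcomes:
        correct += (state == actual)
        state = actual                      # remember the latest outcome
    return correct


def two_bit_correct(outcomes, counter):
    """2-bit saturating counter (assumed scheme): states 2 and 3 predict taken."""
    correct = 0
    for actual in outcomes:
        predicted_taken = counter >= 2
        correct += (predicted_taken == actual)
        counter = min(3, counter + 1) if actual else max(0, counter - 1)
    return correct


# True = taken (T), False = not taken (NT)
outcomes = [False, True, False, False, True, True, True, False]

print(one_bit_correct(outcomes, True))      # starts at Taken (1)
print(two_bit_correct(outcomes, 0b01))      # starts at Not Taken (01)
```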
Branch Prediction
① Static Prediction
- Always predict that the branch will not be taken, then wait for the result
- Always predict that the branch will be taken, then wait for the result
There are two cases:
‣ The prediction is correct: continue executing in the expected order
‣ The prediction is incorrect: flush the fetched instructions and go to the correct
instruction
② Dynamic Prediction
1-bit or 2-bit predictors
Problem
This exercise examines the accuracy of various branch predictors for the following
repeating pattern (e.g., in a loop) of branch outcomes: T, NT, T, T, NT
1) What is the accuracy of always-taken and always-not-taken predictors for this
sequence of branch outcomes?
2) What is the accuracy of the two-bit predictor for the first 4 branches in this pattern,
assuming that the predictor starts off in the bottom left state of the 2-bit predictor?
3) What is the accuracy of the two-bit predictor if this pattern is repeated forever?
4) Design a predictor that would achieve perfect accuracy if this pattern is repeated
forever. Your predictor should be a sequential circuit with one output that provides a
prediction (1 for taken, 0 for not taken) and no inputs other than the clock and the
control signal that indicates that the instruction is a conditional branch.
5) What is the accuracy of your predictor from 4) if it is given a repeating pattern that is
the exact opposite of this one?
6) Repeat 4), but now your predictor should be able to eventually (after a warm-up
period during which it can make wrong predictions) start perfectly predicting both
this pattern and its opposite. Your predictor should have an input that tells it what
the real outcome was. Hint: this input lets your predictor determine which of the
two repeating patterns it is given.
Problem (Nested Loops)
Consider the following code fragment: (R0 stores #0)
DADDI R3, R0, #5
DADDI R1, R0, #0
L2: DADDI R1, R1, #1
DADDI R2, R1, #0
L1: DSUBI R2, R2, #1
BGTZ R2, L1 (Branch 1)
DSUB R4, R3, R1
BGTZ R4, L2 (Branch 2)
“BGTZ Rx, L” branches to L if and only if the value in Rx is greater than zero. Branch 1 is
executed 15 times. Branch 2 is executed 5 times. Assume that 1-bit branch predictors
are used. When the above code starts to execute, both predictors contain the value N
(not taken). Use the following tables to record the prediction and action of each branch.
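To check the filled-in tables, the branch outcomes can be generated by mimicking the loop structure and then fed to a 1-bit predictor. The Python sketch below assumes that the instruction at L1 decrements R2 by 1 and that each branch has its own 1-bit predictor starting at N; the function names are illustrative.

```python
def branch_outcomes(n=5):
    """Replay the nested loops and record taken (True) / not taken (False) per branch."""
    branch1, branch2 = [], []
    for r1 in range(1, n + 1):          # L2: R1 = R1 + 1
        r2 = r1                         # R2 = R1
        while True:
            r2 -= 1                     # L1: R2 = R2 - 1 (assumed decrement)
            taken = r2 > 0              # BGTZ R2, L1
            branch1.append(taken)
            if not taken:
                break
        r4 = n - r1                     # R4 = R3 - R1
        branch2.append(r4 > 0)          # BGTZ R4, L2
    return branch1, branch2


def one_bit_mispredictions(outcomes, state=False):
    """1-bit predictor starting at N (False); returns the number of wrong predictions."""
    misses = 0
    for actual in outcomes:
        misses += (state != actual)
        state = actual
    return misses


b1, b2 = branch_outcomes()
print(len(b1), len(b2))                                         # 15 5
print(one_bit_mispredictions(b1), one_bit_mispredictions(b2))   # 8 2
```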
Intermediate Registers
1) IF/ID
PC (32 bits)
Needed in EX step to calculate new PC in Branch
IR (32 bits)
Needed in ID step to be decoded in opcode, rd, rs1, rs2, imm, address
2) ID/EX
PC (32 bits)
D[rs1] (64 bits)
Needed in EX step to calculate ALUResult
D[rs2] (64 bits)
Needed in EX step to calculate ALUResult
Needed in MEM step to save data in Memory
imm64 (64 bits)
Needed in EX step to calculate ALUResult, and Branch address
func (4 bits)
Needed in EX step to determine operation in ALU (R type)
rd (5 bits)
Needed in WB step to determine write register
3) EX/MEM
PC+imm*2 (32 bits)
Needed in MEM step to save new PC
ALUResult (64 bits)
Needed in MEM step to identify Memory address
Needed in WB step to be written in write data in RFile
Z flag (1 bit)
Needed in MEM step to select next PC
D[rs2] (64 bits)
rd (5 bits)
4) MEM/WB
Read Data (64 bits)
Needed in WB step to be written in RFile
ALUResult (64 bits)
Needed in WB step to be written in RFile
rd (5 bits)
Forwarding Unit
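ForwardA (first ALU operand):
00: value of rs1 from ID/EX (register file)
10: forwarded from EX/MEM
01: forwarded from MEM/WB
ForwardB (second ALU operand):
00: value of rs2 from ID/EX (register file)
10: forwarded from EX/MEM
01: forwarded from MEM/WB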
1) EX hazard
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs1))
ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs2))
ForwardB = 10
add $s0, $s1, $s2    F D E M W
sub $s3, $s0, $s2      F D E M W
[the add result is forwarded from EX/MEM to the EX stage of sub]
2) MEM hazard
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs1))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs1))
ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs2))
and (MEM/WB.RegisterRd = ID/EX.RegisterRs2))
ForwardB = 01
[Pipeline diagram: add $s0, $s1, $s2 followed by sub and slt; the add result is forwarded from MEM/WB to the EX stage of an instruction two positions later]
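The EX hazard and MEM hazard conditions can be combined into one small function in which the EX/MEM check takes priority over the MEM/WB check, which is what the "and not (EX hazard)" term expresses. The Python sketch below models the pipeline registers as plain dictionaries; the field names `RegWrite`, `Rd`, `Rs1`, and `Rs2` mirror the conditions above, while the function name and data layout are illustrative assumptions.

```python
def forwarding_unit(ex_mem, mem_wb, id_ex):
    """Return (ForwardA, ForwardB) as 2-bit values: 0b00, 0b10 or 0b01."""

    def select(rs):
        ex_hazard = (ex_mem["RegWrite"] and ex_mem["Rd"] != 0
                     and ex_mem["Rd"] == rs)
        mem_hazard = (mem_wb["RegWrite"] and mem_wb["Rd"] != 0
                      and mem_wb["Rd"] == rs)
        if ex_hazard:
            return 0b10      # take the operand from EX/MEM (most recent value)
        if mem_hazard:
            return 0b01      # take the operand from MEM/WB
        return 0b00          # no hazard: use the register file value

    return select(id_ex["Rs1"]), select(id_ex["Rs2"])


# Both older instructions write register 8; the newest value (EX/MEM) wins.
ex_mem = {"RegWrite": True, "Rd": 8}
mem_wb = {"RegWrite": True, "Rd": 8}
id_ex = {"Rs1": 8, "Rs2": 9}
print(forwarding_unit(ex_mem, mem_wb, id_ex))   # (2, 0)
```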
Quiz CH#4.2
1. Pipelining provides faster execution time for an instruction
a. True
b. False
2. The pipeline system performance is dependent only on the number of stages in the system, i.e. the
more pipeline stages the better is performance
a. True
b. False
3. Pipelining provides instruction level parallelism
a. True
b. False
4. Given a 50-instruction program in which each instruction takes 1.5 ns to execute. The same program is
run on a 4-stage pipeline system, where the slowest stage delay is 375 ps. Then the pipeline is
faster than the single-cycle system by a factor of
a. 3.77
b. 3.55
c. 3.95
d. None of the above
5. A 3 stage pipeline system is faster than a 5 stage system for the same program
a. True
b. False
6. Given 2 machines, M1 and M2, where M1 has 5 stages and M2 has 3 stages, with the following timings:
M1 takes 2 ns per stage, and M2 takes 3 ns per stage. If a 100 instructions program is fed through
both systems then
a. M2 is faster than M1 by (5x2/3x3) ns
b. M1 is faster than M2 by (5x2/3x3) ns
c. M2 is faster than M1 by 98 ns
d. M1 is faster than M2 by 98 ns
7. The pipeline system is usually governed by the stage with the slowest delay
a. True
b. False
8. ‘Structural hazard’ is already solved in RISC-V
a. True b. False
9. Pipelining is an implementation technique in which multiple instructions are overlapped in execution.
a. True b. False
10. In pipelined implementations, the latency (time for each instruction) decreases.
a. True b. False
11. A ‘load-use data hazard’ is a form of hazard which requires two stall cycles even if forwarding is
implemented.
a. True b. False
12. ‘Branch taken’ is a branch where the branch condition is satisfied and the Program Counter (PC)
becomes the branch target.
a. True b. False
13. ‘Forwarding’ is synonym for ‘ByPassing’
a. True b. False
14. ‘Branch taken’ is a branch where the branch condition is satisfied and the PC becomes the target
a. True b. False
15. The ‘branch hazard’ is the opposite of the ‘control hazard’
a. True b. False
16. For the following sequence
lw $2, 100($1)
add $1, $2, $3
We have
a. RAW and WAR hazards
b. RAW hazards
c. WAR hazards
d. No hazards
17. Some cases of hazards occur because the data needed is not yet available in the system
a. True b. False
18. In order to stall the pipeline
a. Make the content of all the control signals in EX/MEM = 0
b. Make the content of all the control signals in ID/EX = 0
c. Make the ALU input = 0
d. None of the mentioned
19. In multi pipeline diagram we can view the state of all the stages in the pipeline at once
a. True b. False
20. Hazards can occur in all units of RISC-V computer
a. True b. False
21. For a 6 stages pipeline system, and a program with 30 instructions that has 10% load, and if each load
takes 1 stall then the total number of Clock cycles needed to execute the program is
a. 33
b. 38
c. 62
d. None of the above
22. Without any forwarding and prediction, the following segment of a program will need:
lw r1, 100(r2)
add r3, r1, r2
You may assume that a read and a write to the register file can occur in the same cycle
a. 3 stalls
b. 2 stalls
c. 1 stall
d. None of them
23. The following code will create a critical hazard
sw $3, 200($4)
lw $3, 100($2)
a. True b. False
24. The following code will create a hazard that cannot be resolved without any penalty
lw $3, 200($4)
sw $3, 100($2)
a. True b. False
25. The following code will create a hazard that cannot be resolved without any penalty
lw $3, 200($4)
add $4, $2, $3
a. True b. False
26. A technique of resolving data hazards without cost
a. Forwarding
b. Inserting NOP
c. Rearranging code
d. All of the above
27. The latency of an instruction when executed in a pipelined processor compared to single cycle
processor
a. Maybe less in the pipelined processor
b. Will never change
c. Maybe more in the pipelined processor
d. Depends on the number of stages in the pipeline
28. How many clock cycles are needed to execute a RISC-V program with 4 instructions on a 5-stage pipeline,
assuming that there are no hazards
a. 5
b. 8
c. 9
d. 20
29. The following program is executed on the 5-stage RISC-V pipeline CPU
I1: OR $t3, $t4, $t2
I2: AND $t1, $t2, $t2
I3: SW $t5, 33($t1)
I4: SW $t3, 44($t1)
If the cycle time without forwarding is 200 ps, while with forwarding it is 250 ps, how much speedup do we
gain when using forwarding?
a. 0.75 b. 1
c. 1.25 d. None of the above
30. Prediction is useful in pipelining to resolve
a. Data Hazards
b. Control Hazards
c. Structural Hazards
d. None of the above
31. For the sequence:
lw $3, 200($4)
add $4, $2, $3
How many NOP instructions should be inserted to solve data hazard, if using full forwarding
a. 1
b. 2
c. 3
d. None of the above
32. Consider the following RISC-V code
LOOP: slt $t2, $0, $t1
beq $t2, $0, Done
addi $t1, $t1, -1
j LOOP
Done:
If $t1 is initialized to 9, what is the branch prediction accuracy if a not-taken policy is used?
a. 10% b. 90%
c. 100% d. None of them
33. A pipeline hazard that cannot be resolved without introducing extra hardware to the processor
a. Data b. Structural
c. Control d. All of them
34. A branch history table is used by
a. Static branch predictor
b. Taken branch predictor
c. Dynamic branch predictor
d. Not-Taken branch predictor