PIPELINE
Q:
For an n-stage pipeline implementation of some computation, the maximum speedup that
can be obtained is upper bounded by:
a. 2n
b. n
c. 2^n
d. None of the above
Correct answer is (b).
The maximum speedup that can be obtained in a pipeline is upper bounded by the
number of stages.
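As a rough illustration of this bound, the sketch below assumes n equal-delay stages, no register overhead and no stalls (idealisations that the question does not state); the function name is hypothetical. Under those assumptions N tasks take n + N - 1 cycles instead of nN, so the speedup nN / (n + N - 1) approaches n but never reaches it.

```python
def ideal_speedup(n_stages: int, n_tasks: int) -> float:
    """Speedup of an ideal n-stage pipeline over an unpipelined unit.

    Assumes equal stage delays, zero latch overhead and no stalls.
    """
    unpipelined_cycles = n_stages * n_tasks
    # Fill the pipe once, then one result completes per cycle.
    pipelined_cycles = n_stages + n_tasks - 1
    return unpipelined_cycles / pipelined_cycles


for tasks in (1, 10, 1000, 1_000_000):
    print(tasks, round(ideal_speedup(5, tasks), 3))
```

For a 5-stage pipeline the printed values climb toward 5 as the task count grows, but stay below it.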
Q: Consider the following processors, where the inter-stage pipeline registers are assumed
to be of zero latency, and the stage delays are specified in nanoseconds. Which of the
following pipelines will have the highest clock frequency?
a. 4-stage pipeline with stage delays 1, 2, 2 and 1
b. 4-stage pipeline with stage delays 1, 1.5, 1.5, and 1.5
c. 5-stage pipeline with stage delays 0.5, 1, 1, 0.6 and 1
d. 5-stage pipeline with stage delays 0.5, 0.5, 0.3, 1 and 1.1
Correct answer is (c).
The maximum clock frequency is limited by the slowest pipeline stage. Option (c) has
the smallest worst-case stage delay, namely 1 ns. For (a) it is 2 ns; for (b) it is
1.5 ns; and for (d) it is 1.1 ns.
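The comparison can be checked with a short sketch. The dictionary and function names are illustrative only; the delays are copied from the four options, and the latches are taken as zero-latency, as the question states.

```python
# Stage delays in nanoseconds, copied from options (a)-(d).
options = {
    "a": [1, 2, 2, 1],
    "b": [1, 1.5, 1.5, 1.5],
    "c": [0.5, 1, 1, 0.6, 1],
    "d": [0.5, 0.5, 0.3, 1, 1.1],
}


def max_clock_ghz(stage_delays_ns):
    """Clock period equals the slowest stage, since latches add no delay here."""
    return 1.0 / max(stage_delays_ns)  # 1 / ns = GHz


best = max(options, key=lambda name: max_clock_ghz(options[name]))
print(best, max_clock_ghz(options[best]))
```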
Q:
The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 picoseconds. The first
stage is replaced with a functionally equivalent design involving two stages with
respective delays 600 and 350 picoseconds. The throughput of the pipeline increases by
………… percent.
Correct answer is 33.3%.
Pipeline 1 (4 stages): To process n data, time = (3 + n) * 800 ps
Throughput = 1/800 per ps (approx., for large n)
Pipeline 2 (5 stages): To process n data, time = (4 + n) * 600 ps
Throughput = 1/600 per ps (approx., for large n)
% improvement = (1/600 – 1/800) / (1/800) * 100 = 33.3
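The same arithmetic as a minimal sketch, assuming steady-state throughput is set only by the slowest stage (the fill time is ignored, as in the approximation above). The function name is hypothetical.

```python
def steady_state_throughput(stage_delays_ps):
    """Results per picosecond once the pipeline is full."""
    return 1.0 / max(stage_delays_ps)


old = [800, 500, 400, 300]
new = [600, 350, 500, 400, 300]  # first stage replaced by two stages

gain_percent = (steady_state_throughput(new) / steady_state_throughput(old) - 1) * 100
print(round(gain_percent, 1))
```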
Q:
What are the drawbacks of implementing multicycle operations in a single clock cycle by
slowing down the clock?
a. The pipeline control becomes more complex.
b. Causes severe degradation of performance, as all other operations are also slowed
down.
c. Additional types of data hazards can show up.
d. None of the above.
Correct answer is (b).
Simply slowing down the clock will never result in (a) or (c). However, performance
will degrade because all operations depend on the clock.
Q. The following may occur when multicycle operations are allowed in the execution unit
(EX stage):
a. A later instruction may finish earlier.
b. Two or more instructions may try to write into a register simultaneously in WB stage.
c. RAW hazards resulting in several stall cycles can arise.
d. All of the above.
Correct answer is (d).
All of (a), (b) and (c) can happen for multicycle operations in the EX stage.
Q:
The Fetch, Decode, Execute, Memory and Write Back stages of a pipelined processor have
the latencies 200ps, 140ps, 160ps, 190ps and 100ps respectively. Assume that when
pipelining, each pipeline stage costs 10ps extra for the registers between pipeline stages. If
you could split one of the pipeline stages into 2 equal halves, what is the new latency (in ps)
for an instruction? (rounded to one decimal point)
Q:
Suppose that an unpipelined processor has a cycle time of 25 ns, and that its data path is
made up of modules with latencies of 2, 3, 4, 7, 3, 2 and 4 ns (in that order). In pipelining this
processor, it is not possible to rearrange the order of the modules (for example, putting the
register read stage before the instruction decode stage) or to divide a module into multiple
pipeline stages (for complexity reasons). Given pipeline latches with 1 ns latency, if the
processor is divided into the fewest number of stages that allow it to achieve the minimum
latency from part 1, what is the latency of the pipeline?
(a). no latency
(b). 35 ns latency
(c). 40 ns latency
(d). 56 ns latency
Solution:
The question says “if the processor is divided into the fewest number of stages”.
Also, we cannot change the order of the modules, so we can only combine consecutive modules
such that the maximum stage latency stays at 7 ns (the 7 ns module cannot be divided, so no
lower maximum is possible). One possible combination could be:
2, (3 + 4), 7, 3, (2 + 4) = 2, 7, 7, 3, 6
k = 5 and max(2, 7, 7, 3, 6) = 7
latch latency = 1 ns
Therefore, cycle time = 7 + 1 = 8 ns and pipeline latency = k * cycle time = 5 * 8 = 40 ns,
which is option (c).
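The grouping step can be sketched as a greedy merge of consecutive modules, capped at the latency of the slowest single module. This is an illustrative sketch with hypothetical names, not the solution's own procedure; it may produce a different grouping from the one listed above, but it yields the same number of stages.

```python
def fewest_stages(module_ns, cap_ns):
    """Greedily merge consecutive modules while a stage stays within cap_ns."""
    stages = []
    for latency in module_ns:
        if stages and stages[-1] + latency <= cap_ns:
            stages[-1] += latency   # extend the current stage
        else:
            stages.append(latency)  # start a new stage
    return stages


modules = [2, 3, 4, 7, 3, 2, 4]     # ns, in datapath order
latch_ns = 1

stages = fewest_stages(modules, cap_ns=max(modules))
cycle_ns = max(stages) + latch_ns   # slowest stage plus one latch
latency_ns = len(stages) * cycle_ns
print(stages, cycle_ns, latency_ns)
```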
Q:
Consider a 5-stage pipeline - IF (Instruction Fetch), ID (Instruction Decode and register
read), EX (Execute), MEM (Memory) and WB (Write Back). All register reads take place in
the second phase of a clock cycle and all register writes occur in the first phase. Consider
the execution of the following instruction sequence:
• I1 : R1 <- R2 + R3
• I2 : R3 <- R1 - R2
• I4 : R2 <- R3 * R1
If the number of RAW (Read after write) hazards is denoted by A, WAR (Write after read)
hazards by B, and WAW (Write after write) hazards by C, then A+B+C :
Q:
Consider an instruction pipeline with five stages without any branch prediction: Fetch
Instruction(FI), Decode Instruction(DI), Fetch Operand(FO), Execute instruction(EI) and
Write Operand(WO). The stage delays for FI, DI, FO, EI and WO are 4ns, 5ns, 12 ns, 7 ns
and 6ns respectively. There are intermediate storage buffers after each stage and the delay
of each buffer is 1ns. A program consisting of 12 instructions I1, I2, I3, …, I12 is executed in this
pipelined processor. Instruction I4 is the only branch instruction and its branch target is I10. If
the branch is taken during the execution of this program, the time (in ns) needed to complete
the program is ________
Q:
Register renaming is able to overcome which of the data hazards?
A. RAW
B. WAW
C. WAR
D. RAR
Solution: BC
WAW
WAR
Q:
T1 is the time taken by the first instruction to complete in a non-pipelined system, and
T2 is the time taken by the first instruction to complete in a pipelined system with
inter-stage buffer registers. What is the relationship between T1 and T2?
a. T1<T2
b. T1>=T2
c. T1=T2
d. T1>T2
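Solution: a
T1<T2, because in the pipelined system the first instruction also passes through the
inter-stage buffer registers, each of which adds delay.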
Q:
An instruction pipeline has a single functional unit to perform
arithmetic operations. It consists of 4 stages to implement three instructions (ADD,
MUL, SUB). All stages, except the execution stage, take 1 clock, while the
execution stage for ADD and SUB takes 2 clocks each and for MUL it takes 3
clocks. If all the instructions are executed in the above order, how many clocks
are required to complete these 3 instructions?
a. 7
b. 8
c. 9
d. 10
Solution: d. 10
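One way to arrive at 10 is to track when the single execution unit becomes free. The sketch below is an assumption-laden model rather than the official solution: it supposes two 1-clock stages before execution and one after it (the question does not name the stages), in-order issue, and one instruction fetched per clock. The function name is hypothetical.

```python
def total_clocks(ex_clocks, pre_stages=2, post_stages=1):
    """Clocks to finish all instructions with one shared execution unit.

    ex_clocks: execution time of each instruction, in program order.
    Every non-execution stage is assumed to take exactly 1 clock.
    """
    ex_free = 0   # clock count after which the execution unit is idle
    finish = 0
    for i, ex in enumerate(ex_clocks):
        ready = i + pre_stages        # clocks elapsed before instruction i can reach EX
        start = max(ready, ex_free)   # wait if the unit is still busy
        ex_free = start + ex
        finish = ex_free + post_stages
    return finish


print(total_clocks([2, 3, 2]))  # ADD, MUL, SUB
```

In this model ADD finishes execution after clock 4, MUL after clock 7 and SUB after clock 9, so the last write-back completes in clock 10.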
Q:
Given a non-pipelined architecture running at 1GHz, that takes 5 cycles
to finish an instruction. You want to make it pipelined with 5 stages. The
increase in hardware forces you to run the machine at 800MHz. The only
stalls are caused by memory and branch instructions. 25% of the total
instructions are memory instructions and a stall of 70 cycles happens in 2%
of the memory instructions. 20% of the total instructions are branch
instructions and a stall of 2 cycles happens in 10% of the branch
instructions. What is the speedup that can be achieved with pipelining as
compared to the non-pipelined design?
Answer:
2.87
Twp = 5*(1/10^9)
Tp = (.25(.98*1 + .02*71) + .20(.90*1 + .1*3) + .55*1) * (1/(800*10^6))
Then Speed up = 5*(1/10^9) / ((.25(.98*1 + .02*71) + .20(.90*1 + .1*3) + .55*1) * (1/(800*10^6)))
After solving,
speed up = 4 / 1.39
= 2.877697842
= 2.87 (truncated to two decimal places).
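The calculation can be reproduced step by step. This is a minimal sketch with illustrative variable names; a stalled instruction is counted as 1 cycle plus its stall cycles, matching the 71 and 3 used above.

```python
f_unpipelined_hz = 1e9
f_pipelined_hz = 800e6
cycles_per_instr_unpipelined = 5

# Average cycles per instruction in the pipeline, including stall cycles.
cpi_memory = 0.98 * 1 + 0.02 * (1 + 70)
cpi_branch = 0.90 * 1 + 0.10 * (1 + 2)
cpi_pipelined = 0.25 * cpi_memory + 0.20 * cpi_branch + 0.55 * 1

t_unpipelined = cycles_per_instr_unpipelined / f_unpipelined_hz  # seconds per instruction
t_pipelined = cpi_pipelined / f_pipelined_hz
speedup = t_unpipelined / t_pipelined
print(round(cpi_pipelined, 2), speedup)
```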
Q. Consider an instruction pipeline with four stages with the stage delays 5 nsec, 6 nsec, 11
nsec, and 8 nsec respectively. The delay of an inter-stage register stage of the pipeline is
1 nsec. What is the approximate speedup of the pipeline in the steady state under ideal
conditions as compared to the corresponding non-pipelined implementation?
a. 4.0
b. 2.5
c. 1.1
d. 3.0
Correct answer is (b).
Time taken to execute N instructions in the non-pipelined implementation will be (5 + 6
+ 11 + 8)N = 30N
Clock period for the pipelined implementation = max{5, 6, 11, 8} + 1 = 12. Time taken
for the pipelined implementation = (3 + N) * 12 = 12N (approx., for large N). Speedup = 30N /
12N = 2.5
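To see how the approximation behaves, the sketch below evaluates the exact ratio for a small and a large instruction count. The names are illustrative; the formula is the one used in the solution above.

```python
stage_ns = [5, 6, 11, 8]
register_ns = 1


def speedup(n_instructions):
    """Exact speedup for n instructions, including the pipeline fill time."""
    unpipelined = sum(stage_ns) * n_instructions
    clock = max(stage_ns) + register_ns
    pipelined = (len(stage_ns) - 1 + n_instructions) * clock
    return unpipelined / pipelined


print(speedup(10), speedup(1_000_000))
```

For small N the fill time keeps the speedup noticeably below 2.5; it approaches 2.5 as N grows.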
Q. Consider an instruction pipeline with five stages without any
branch prediction: Instruction Fetch (IF), Instruction Decode (ID), Operand
Fetch (OF), Execute (EX) and Operand Write (OW). The stage delays for IF,
ID, OF, EX and OW are 5 nsec, 7 nsec, 10 nsec, 8 nsec and 6 nsec,
respectively.
There are intermediate storage buffers after each stage and the delay of each
buffer is 1 nsec. A program consisting of 12 instructions I1, I2, …, I12 is
executed in the pipelined processor. Instruction I4 is the only branch instruction
and its branch target is I9. If the branch is taken during the execution of this
program, the time needed to complete the program is:
a. 132 nsec
b. 154 nsec
c. 176 nsec
d. 328 nsec
Q. In a pipeline, what measures can be taken to reduce the impact of
data hazards?
a. Splitting the memory into separate Instruction and Data memories.
b. Implement data forwarding in the datapath.
c. Allow split register write and read during the two halves of the same clock
cycle.
d. Replicate the register bank.
Correct answers are (b) and (c).
Option (a) reduces the impact of structural hazards. Option (d) will also not help in
mitigating data hazards.
Data forwarding and split register access can reduce the number of stall cycles.
Q. In a pipeline, which of the following scenarios of data dependency will always
result in a pipeline stall due to data hazard without any instruction scheduling?
a. An ADD instruction followed by a SUB instruction.
b. A STORE instruction followed by a LOAD instruction
c. A LOAD instruction followed by an ADD instruction.
d. None of the above.
Correct answer is (c).
Only a LOAD followed by an immediate use will result in a mandatory stall in the
pipeline.
Q. Instruction scheduling can be used to eliminate data and control hazards by:
a. Scheduling the execution of an instruction only if there is no hazard.
b. Allowing the compiler to move instructions around to fill the LOAD/BRANCH
delay slot(s) with meaningful instructions.
c. Using a special hardware to check for hazard and issue instructions only when
possible.
d. None of the above.
Correct answer is (b).
Instruction scheduling is a compiler technique where instructions are moved around
keeping dependencies in mind so as to reduce the wasted cycles due to stalls.
Q. Consider a pipeline with ideal CPI of 1. Assume that 30% of all instructions
executed are branch, out of which 80% are taken branches. The pipeline
speedup for predict taken and delayed branch approaches to reduce branch
penalties will be:
a. 4.10 and 4.45
b. 3.25 and 4.35
c. 3.67 and 4.25
d. 3.85 and 4.35
Correct answer is (d).
For predict taken, branch penalty = 1
Speedup = 5 / (1 + 0.30 x 1) = 3.85
For delayed branch, branch penalty = 0.5
Speedup = 5 / (1 + 0.30 x 0.5) = 4.35
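The two figures follow from the same formula, speedup = pipeline depth / effective CPI. In the sketch below the depth of 5 and the penalties of 1 and 0.5 cycles are taken from the solution above (the question itself does not state them), and the function name is hypothetical.

```python
def pipeline_speedup(depth, branch_fraction, branch_penalty, ideal_cpi=1.0):
    """Speedup over an unpipelined design = depth / effective CPI."""
    return depth / (ideal_cpi + branch_fraction * branch_penalty)


predict_taken = pipeline_speedup(depth=5, branch_fraction=0.30, branch_penalty=1)
delayed_branch = pipeline_speedup(depth=5, branch_fraction=0.30, branch_penalty=0.5)
print(round(predict_taken, 2), round(delayed_branch, 2))
```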
Q:
The design team for a simple, single-issue processor is choosing between a pipelined
or non-pipelined implementation. Here are some design parameters for the two possibilities:
(a) For a program with 20% ALU instructions, 10% control instructions and 75%
memory instructions, which design will be faster? Give a quantitative CPI average
for each case.
(b) For a program with 80% ALU instructions, 10% control instructions and 10%
memory instructions, which design will be faster? Give a quantitative CPI average
for each case.
Q:
Match the following:
A. Branch Prediction
B. Instruction Scheduling
C. Delay Slots
D. Increasing functional units
E. Caches
I. Data hazard
II. Structural
III. Control
Solution:
A. III
B. II & III
C. III
D. II
E. I
Structural, data and control hazards typically require a processor pipeline to stall.
(c) delay slots
It addresses control hazards. It helps to avoid a stall that would result from branch
target identification during the decode stage by scheduling the execution of some
other instruction which has to execute anyway, irrespective of the branch condition.
(e) caches
It addresses data hazards. In particular, caches help to reduce memory latency and
hence reduce the load-use latency, which in turn reduces the stall duration and
improves execution time (by maintaining pipeline steady state).
Q:
Solution: CD
Q:
Using the code below, count the number of all of the dependence types (RAW, WAR,
WAW).
I0: A = B + C;
I1: C = A - B;
I2: D = A + C;
I3: A = B * C * D;
I4: C = F / D;
I5: F = A ^ G;
I6: G = F + D;
RAW        WAR        WAW
I0 -> I1   I0 -> I1   I0 -> I3
I0 -> I2   I1 -> I3   I1 -> I4
I1 -> I2   I2 -> I4
I3 -> I5   I3 -> I4
I2 -> I3   I4 -> I5
I1 -> I3   I5 -> I6
I2 -> I4
I3 -> I5
I5 -> I6
Q:
Given four instructions, how many unique comparisons (between register sources
and destinations) are necessary to find all of the RAW, WAR, and WAW
dependences. Answer for the case of four instructions, and then derive a general
equation for N instructions. Assume that all instructions have one register destination
and two register sources.
The first summand is for RAW comparisons, the second summand is for WAR
comparisons and the last summand is for WAW comparisons.
Q:
Which of the following are reasons why throughput will not improve as
pipeline depth is increased indefinitely?
A. Pipelining has a fixed (or relatively fixed) absolute overhead per stage which
results from latch overhead and clock/data skew.
B. Increasing the pipeline depth lengthens hazard penalties, increasing the CPI.
C. The latency of a pipeline stage can be driven to zero.
D. Increasing the depth of the pipeline between the fetch and execute stage
decreases the branch misprediction penalty.
Solution:
Pipelining has a fixed (or relatively fixed) absolute overhead per stage which results
from latch overhead and clock/data skew. This means that the latency of a pipeline
stage cannot be driven to zero. Second, increasing the pipeline depth lengthens
hazard penalties, increasing the CPI. For instance, increasing the depth of the pipeline
between the fetch and execute stage increases the branch misprediction penalty.
Q:
Consider a machine with a 5-stage pipeline with a cycle time of 10ns. Assume that you
are executing a program where a fraction, f, of all instructions immediately follow a load
upon which they are dependent.
(a) With forwarding enabled what is the total execution time for N instructions, in terms of f ?
Q:
solution:
For a non-pipelined system:
(Tnp)=130ns
For a pipelined system:
Total number of stages
(k)=5
Total number of instructions/tasks
(n)=1000
Total time required to perform a single task in pipelined processor
(Tp)=30ns
Speedup = (n*Tnp)/((k+(n-1))*Tp)
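Plugging the listed values into the speedup formula gives the result directly. This sketch assumes Tp is the per-stage (clock) time of the pipeline, so that n tasks take (k + n - 1) * Tp; the variable names are illustrative.

```python
t_np_ns = 130   # time per task without pipelining
k = 5           # pipeline stages
n = 1000        # tasks
t_p_ns = 30     # assumed per-stage (clock) time in the pipeline

speedup = (n * t_np_ns) / ((k + (n - 1)) * t_p_ns)
print(round(speedup, 3))
```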