PIPELINE

Uploaded by

vidhi
Q:

For an n-stage pipeline implementation of some computation, the maximum speedup that
can be obtained is upper bounded by:
a. 2n
b. n
c. 2n
d. None of the above
Correct answer is (b).
The maximum speedup that can be obtained in a pipeline is upper bounded by the
number of stages.
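
The bound can be checked numerically. The sketch below assumes the usual idealised model (not spelled out in the original): every stage takes one cycle, there are no stalls, and the non-pipelined unit needs `stages` cycles per task. The function name is illustrative.

```python
def pipeline_speedup(stages: int, tasks: int) -> float:
    """Ideal speedup of a pipeline with `stages` stages over a non-pipelined unit."""
    unpipelined_cycles = stages * tasks      # each task occupies the whole unit
    pipelined_cycles = stages + tasks - 1    # fill once, then one task per cycle
    return unpipelined_cycles / pipelined_cycles


# The speedup grows with the number of tasks but never reaches the stage count.
for n in (1, 10, 1000, 1_000_000):
    print(n, pipeline_speedup(5, n))
```

For a 5-stage pipeline the value starts at 1 for a single task and approaches, but stays below, 5.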

Q:Consider the following processors, where the inter-stage pipeline registers are assumed
to be of zero latency, and the stage delays are specified in nanoseconds. Which of the
following pipelines will have the highest clock frequency?
a. 4-stage pipeline with stage delays 1, 2, 2 and 1
b. 4-stage pipeline with stage delays 1, 1.5, 1.5, and 1.5
c. 5-stage pipeline with stage delays 0.5, 1, 1, 0.6 and 1
d. 5-stage pipeline with stage delays 0.5, 0.5, 0.3, 1 and 1.1
Correct answer is (c).
Maximum clock frequency is limited by the slowest pipeline stage. The slowest stage
delay is smallest in option (c), namely 1 ns. For (a) it is 2 ns, for (b) it is 1.5 ns,
and for (d) it is 1.1 ns.
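
A small sketch of the same comparison, assuming zero-latency pipeline registers as the question states. The dictionary keys and the function name are illustrative only.

```python
# Stage delays in nanoseconds for each option in the question.
options = {
    "a": [1, 2, 2, 1],
    "b": [1, 1.5, 1.5, 1.5],
    "c": [0.5, 1, 1, 0.6, 1],
    "d": [0.5, 0.5, 0.3, 1, 1.1],
}


def max_clock_ghz(stage_delays_ns):
    """The clock period cannot be shorter than the slowest stage."""
    return 1.0 / max(stage_delays_ns)


best = max(options, key=lambda name: max_clock_ghz(options[name]))
for name, delays in options.items():
    print(name, max_clock_ghz(delays))
print("highest clock frequency:", best)
```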

Q:
The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 picoseconds. The first
stage is replaced with a functionally equivalent design involving two stages with
respective delays 600 and 350 picoseconds. The throughput of the pipeline increases by
………… percent.
Correct answer is 33.3%.
Pipeline 1 (4 stages): time to process n data items = (3 + n) × 800 ps
Throughput ≈ 1/800 items per ps (for large n)
Pipeline 2 (5 stages): time to process n data items = (4 + n) × 600 ps
Throughput ≈ 1/600 items per ps (for large n)
% improvement = (1/600 - 1/800) / (1/800) × 100 = 33.3
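
The same arithmetic as a short sketch, assuming steady-state throughput is set only by the slowest stage (fill time ignored). Variable names are illustrative.

```python
def throughput(stage_delays_ps):
    """Steady-state items per picosecond: one item per slowest-stage delay."""
    return 1.0 / max(stage_delays_ps)


old_pipeline = [800, 500, 400, 300]
new_pipeline = [600, 350, 500, 400, 300]   # first stage replaced by two stages

gain_percent = (throughput(new_pipeline) - throughput(old_pipeline)) / throughput(old_pipeline) * 100
print(round(gain_percent, 1))
```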

Q:
What are the drawbacks of implementing multicycle operations in a single clock cycle by
slowing down the clock?
a. The pipeline control becomes more complex.
b. Causes severe degradation of performance, as all other operations are also slowed
down.
c. Additional types of data hazards can show up.
d. None of the above.
Correct answer is (b).
Simply slowing down the clock will never result in (a) or (c). However, performance
will degrade because all operations depend on the clock.
Q. The following may occur when multicycle operations are allowed in the execution unit
(EX stage):
a. A later instruction may finish earlier.
b. Two or more instructions may try to write into a register simultaneously in WB stage.
c. RAW hazards resulting in several stall cycles can arise.
d. All of the above.
Correct answer is (d).
All of (a), (b) and (c) can happen when multicycle operations are allowed in the EX stage.

Q:
The Fetch, Decode, Execute, Memory and Write Back stages of a pipelined processor have
the latencies 200ps, 140ps, 160ps, 190ps and 100ps respectively. Assume that when
pipelining, each pipeline stage costs 10ps extra for the registers between pipeline stages. If
you could split one of the pipeline stages into 2 equal halves, what is the new latency (in ps)
for an instruction? (rounded to one decimal point)
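
The source gives no worked answer for this question. The sketch below shows one way to approach it, under assumptions that are not in the original: the slowest stage is the one chosen for splitting, the clock period is the slowest stage delay plus the 10 ps register cost, and instruction latency is the number of stages times the clock period.

```python
def instruction_latency_ps(stage_delays, register_overhead=10):
    """Latency of one instruction: every stage takes one full clock period."""
    cycle = max(stage_delays) + register_overhead
    return len(stage_delays) * cycle


def split_slowest(stage_delays):
    """Return a new stage list with the slowest stage split into two equal halves."""
    i = stage_delays.index(max(stage_delays))
    half = stage_delays[i] / 2
    return stage_delays[:i] + [half, half] + stage_delays[i + 1:]


stages = [200, 140, 160, 190, 100]   # Fetch, Decode, Execute, Memory, Write Back
before = instruction_latency_ps(stages)
after = instruction_latency_ps(split_slowest(stages))
print(before, after)
```

Under these assumptions the five-stage design has a 210 ps clock (1050 ps latency), and splitting Fetch gives six stages with a 200 ps clock (1200 ps latency).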

Q:
Suppose that an unpipelined processor has a cycle time of 25 ns, and that its datapath is
made up of modules with latencies of 2, 3, 4, 7, 3, 2 and 4 ns (in that order). In pipelining
this processor, it is not possible to rearrange the order of the modules (for example, putting
the register read stage before the instruction decode stage) or to divide a module into
multiple pipeline stages (for complexity reasons). Pipeline latches have 1 ns latency. If the
processor is divided into the fewest number of stages that allows us to achieve the minimum
latency from part 1, what is the latency of the pipeline?
(a). no latency
(b). 35 ns latency
(c). 40 ns latency
(d). 56 ns latency

Solution:

In the question it is “if the processor is divided into the fewest number of stages”

Also, we cannot change the order of the stages, so we can only combine consecutive stages
such that maximum stage latency should be 7ns (because it is already highest and we want
lowest latency possible). One possible combination could be:

2, (3 + 4), 7, 3, (2 + 4) = 2, 7, 7, 3, 6

k = 5 and max(2, 7, 7, 3, 6) = 7 ns
latch latency = 1 ns

Therefore, the latency of the pipeline is 5 × (7 + 1) = 40 ns, i.e. option (c).
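
The grouping step can be sketched with a greedy pass that merges consecutive modules while the merged delay stays within the 7 ns bottleneck. This is an illustrative sketch, not the method of the original solution; it produces a different grouping from the one shown above, but the same stage count and the same bottleneck.

```python
def group_modules(latencies, max_stage):
    """Greedily merge consecutive modules without exceeding max_stage per stage."""
    stages = []
    current = 0
    for latency in latencies:
        # Close the current stage if adding this module would exceed the limit.
        if current and current + latency > max_stage:
            stages.append(current)
            current = 0
        current += latency
    stages.append(current)
    return stages


modules = [2, 3, 4, 7, 3, 2, 4]            # ns, fixed order
stages = group_modules(modules, max(modules))
latch_ns = 1
latency_ns = len(stages) * (max(stages) + latch_ns)
print(stages, latency_ns)
```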

Q:

A non-pipelined processor X has a clock frequency of 2.5 GHz and an average CPI of 3.
Processor Y, an improved version of X, is designed with a 5-stage linear instruction
pipeline. However, due to latch delay and clock skew, the clock rate of Y is only 2 GHz. If a
program consisting of one million instructions is executed on both processors, the
speedup of processor Y compared to X is:
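
The source gives no worked answer here. The sketch below shows one way to set up the calculation, assuming (this is not stated in the question) that Y has no stalls, so it needs k + n - 1 cycles for n instructions, while X needs n × CPI cycles.

```python
instructions = 1_000_000

# Processor X: non-pipelined, CPI = 3, 2.5 GHz.
time_x = instructions * 3 / 2.5e9

# Processor Y: 5-stage pipeline at 2 GHz, assumed stall-free.
stages = 5
time_y = (stages + instructions - 1) / 2.0e9

speedup = time_x / time_y
print(round(speedup, 2))
```

Under the stall-free assumption the ratio comes out at about 2.4.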

Q:
Consider a 5-stage pipeline: IF (Instruction Fetch), ID (Instruction Decode and register
read), EX (Execute), MEM (Memory) and WB (Write Back). All register reads take place in
the second phase of a clock cycle and all register writes occur in the first phase. Consider
the execution of the following instruction sequence:

• I1: R1 <- R2 + R3

• I2: R3 <- R1 - R2

• I3: M[R1 + 1000] <- R1

• I4: R2 <- R3 * R1

If the number of RAW (Read after write) hazards is denoted by A, WAR (Write after read)
hazards by B, and WAW (Write after write) hazards by C, then A + B + C is:



Q:
Consider an instruction pipeline with five stages without any branch prediction: Fetch
Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execute Instruction (EI) and
Write Operand (WO). The stage delays for FI, DI, FO, EI and WO are 4 ns, 5 ns, 12 ns, 7 ns
and 6 ns respectively. There are intermediate storage buffers after each stage and the delay
of each buffer is 1 ns. A program consisting of 12 instructions I1, I2, I3, ..., I12 is executed
in this pipelined processor. Instruction I4 is the only branch instruction and its branch
target is I10. If the branch is taken during the execution of this program, the time (in ns)
needed to complete the program is ________

Q:
Register renaming is able to overcome which of the data hazards?
A. RAW
B. WAW
C. WAR
D. RAR

Solution: B and C (WAW and WAR)

Q:
T1 is the time taken by the first instruction to complete in a non-pipelined system, and
T2 is the time taken by the first instruction to complete in a pipelined system with
inter-stage buffer registers. What is the relationship between T1 and T2?
a. T1 < T2
b. T1 >= T2
c. T1 = T2
d. T1 > T2

Solution: (a) T1 < T2
The inter-stage buffer registers add their delay to the path of the first instruction.

Q:
An instruction pipeline has a single functional unit to perform arithmetic
operations. It consists of 4 stages to implement three instructions (ADD,
MUL, SUB). All stages, except the execution stage, take 1 clock, while the
execution stage for ADD and SUB takes 2 clocks each and for MUL it takes 3
clocks. If all the instructions are executed in the above order, how many clocks
are required to complete these 3 instructions?
a. 7
b. 8
c. 9
d. 10

Solution: d. 10
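
The count of 10 can be reproduced with a small timing sketch. It assumes four in-order stages (named here fetch, decode, execute, write-back for illustration; the question does not name them), no buffering between stages, and that an instruction stays in its stage until the next stage is free.

```python
def total_clocks(ex_clocks):
    """Clocks to finish all instructions; ex_clocks lists each one's EX duration."""
    stage_free = [0, 0, 0, 0]        # time at which each stage becomes available
    finish = 0
    for ex in ex_clocks:
        durations = [1, 1, ex, 1]    # fetch, decode, execute, write-back
        enter = []
        ready = 0                    # time this instruction can move forward
        for stage, duration in enumerate(durations):
            start = max(ready, stage_free[stage])
            enter.append(start)
            ready = start + duration
        finish = ready
        # A stage is released when its occupant enters the following stage.
        stage_free = enter[1:] + [finish]
    return finish


print(total_clocks([2, 3, 2]))       # ADD, MUL, SUB in program order
```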

Q:
Given a non-pipelined architecture running at 1 GHz that takes 5 cycles
to finish an instruction. You want to make it pipelined with 5 stages. The
increase in hardware forces you to run the machine at 800 MHz. The only
stalls are caused by memory and branch instructions. 25% of the total
instructions are memory instructions and a stall of 70 cycles happens in 2%
of the memory instructions. 20% of the total instructions are branch
instructions and a stall of 2 cycles happens in 10% of the branch
instructions. What is the speedup that can be achieved with pipelining as
compared to the non-pipelined design?

Answer:
2.87

Speedup = Twp (time per instruction without pipeline) / Tp (time per instruction with pipeline)

Twp = 5 × (1 / 10^9) s = 5 ns

Tp = (0.25 × (0.98 × 1 + 0.02 × 71) + 0.20 × (0.90 × 1 + 0.10 × 3) + 0.55 × 1) × (1 / (800 × 10^6)) s
   = 1.39 × (1 / (800 × 10^6)) s

Then Speedup = 5 × (1 / 10^9) / (1.39 × (1 / (800 × 10^6)))

After solving,

Speedup = 4 / 1.39

= 2.877697842

= 2.87 (truncated to two decimal places).
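
The calculation above can be sketched as follows. Variable names are illustrative; the stall cycles are added on top of the one base cycle per instruction, as in the solution.

```python
# Average cycles per instruction in the pipelined design.
memory_cpi = 0.98 * 1 + 0.02 * (1 + 70)    # 2% of memory instructions stall 70 cycles
branch_cpi = 0.90 * 1 + 0.10 * (1 + 2)     # 10% of branches stall 2 cycles
other_cpi = 1.0
pipelined_cpi = 0.25 * memory_cpi + 0.20 * branch_cpi + 0.55 * other_cpi

time_non_pipelined = 5 / 1e9               # 5 cycles at 1 GHz, in seconds
time_pipelined = pipelined_cpi / 800e6     # average cycles at 800 MHz, in seconds

speedup = time_non_pipelined / time_pipelined
print(pipelined_cpi, speedup)
```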

Q. Consider an instruction pipeline with four stages with the stage delays 5 nsec, 6 nsec, 11
nsec, and 8 nsec respectively. The delay of an inter-stage register stage of the pipeline is
1 nsec. What is the approximate speedup of the pipeline in the steady state under ideal
conditions as compared to the corresponding non-pipelined implementation?
a. 4.0
b. 2.5
c. 1.1
d. 3.0
Correct answer is (b).
Time taken to execute N instructions in non-pipelined implementation will be (5 + 6
+ 11 + 8)N = 30N
Clock period for pipelined implementation = max{5, 6, 11, 8} + 1 = 12. Time taken
for the pipelined implementation = (3 + N) × 12 ≈ 12N for large N. Speedup = 30N /
12N = 2.5
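
A brief sketch of the steady-state ratio used above, assuming the non-pipelined design pays no register delay. The function name is illustrative.

```python
def steady_state_speedup(stage_delays_ns, register_delay_ns):
    """Non-pipelined time per instruction divided by the pipelined clock period."""
    non_pipelined = sum(stage_delays_ns)
    clock = max(stage_delays_ns) + register_delay_ns
    return non_pipelined / clock


print(steady_state_speedup([5, 6, 11, 8], 1))
```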
Q. Consider an instruction pipeline with five stages without any
branch prediction: Instruction Fetch (IF), Instruction Decode (ID), Operand
Fetch (OF), Execute (EX) and Operand Write (OW). The stage delays for IF,
ID, OF, EX and OW are 5 nsec, 7 nsec, 10 nsec, 8 nsec and 6 nsec,
respectively.
There are intermediate storage buffers after each stage and the delay of each
buffer is 1 nsec. A program consisting of 12 instructions I1, I2, …, I12 is
executed in the pipelined processor. Instruction I4 is the only branch instruction
and its branch target is I9. If the branch is taken during the execution of this
program, the time needed to complete the program is:
a. 132 nsec
b. 154nsec
c. 176 nsec
d. 328 nsec

Correct answer is (b).


Minimum clock period = max{5, 7, 10, 8, 6} + 1 = 11 nsec

Cycle:  1   2   3   4   5   6   7   8   9   10  11  12  13  14
I1:     IF  ID  OF  EX  OW
I2:         IF  ID  OF  EX  OW
I3:             IF  ID  OF  EX  OW
I4:                 IF  ID  OF  EX  OW
I5-I8:  not executed (the branch is taken)
I9:                             IF  ID  OF  EX  OW
I10:                                IF  ID  OF  EX  OW
I11:                                    IF  ID  OF  EX  OW
I12:                                        IF  ID  OF  EX  OW

A total of 14 clock cycles is needed, i.e. 14 x 11 = 154 nsec.


Q. Consider a RISC machine where each instruction is 4 bytes long. Conditional
and unconditional branch instructions use PC-relative addressing mode with
Offset specified in bytes to the target location of the branch instruction. Also, the
Offset is always with respect to the address of the next instruction in the program
sequence. Consider the following instruction sequence:
Instruction i: ADD R2,R3,R4
Instruction i+1: SUB R5,R6,R7
Instruction i+2: SEQ R1,R9,R10
Instruction i+3: BEQZ R1,Offset
If the target of the branch instruction is i, the decimal value of Offset will
be …………………
Correct answer is -16.
Assume that instruction “i” starts from memory address X.
Address of instruction i+1 = X + 4
Address of instruction i+2 = X + 8
Address of instruction i+3 = X + 12
Address of instruction i+4 = X + 16
So, Offset = X – (X + 16) = -16
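
The same address arithmetic as a short sketch. Instruction positions are counted from instruction i (index 0), and the function name is illustrative.

```python
INSTRUCTION_BYTES = 4


def branch_offset(branch_index, target_index):
    """PC-relative offset in bytes, measured from the instruction after the branch."""
    next_pc = (branch_index + 1) * INSTRUCTION_BYTES
    target = target_index * INSTRUCTION_BYTES
    return target - next_pc


# BEQZ is instruction i+3 (index 3) and the target is instruction i (index 0).
print(branch_offset(3, 0))
```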
Q. A 5-stage pipelined processor has the stages: Instruction Fetch
(IF), Instruction Decode (ID), Operand Fetch (OF), Execute (EX) and
Write Operand (WO). The IF, ID, OF, and WO stages take 1 clock cycle each
for any instruction. The EX stage takes 1 clock cycle for ADD and
SUB instructions, 3 clock cycles for MUL instruction, and 6 clock cycles for
DIV instruction. Operand forwarding is used in the pipeline (for
data dependency, OF stage of the dependent instruction can be executed only
after the previous instruction completes EX). What is the number of clock cycles
needed to execute the following sequence of instructions?
MUL R2,R10,R1
DIV R5,R3,R4
ADD R2,R5,R2
SUB R5,R2,R6
a. 13
b. 17
c. 15
d. 19
Correct answer is (c).

Cycle:          1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
MUL R2,R10,R1:  IF  ID  OF  EX  EX  EX  WO
DIV R5,R3,R4:       IF  ID  OF  EX  EX  EX  EX  EX  EX  WO
ADD R2,R5,R2:           IF  ID  -   -   -   -   -   -   OF  EX  WO
SUB R5,R2,R6:               IF  -   -   -   -   -   -   ID  -   OF  EX  WO

Number of clock cycles = 15.

Q. In pipeline, what are the measures that can be taken to reduce the impact of
data hazards?
a. Splitting the memory into separate Instruction and Data memories.
b. Implement data forwarding in the datapath.
c. Allow split register write and read during the two halves of the same clock
cycle.
d. Replicate the register bank.
Correct answers are (b) and (c).
Option (a) reduces the impact of structural hazard. Option (d) will also not help in
mitigating data hazards.
Data forwarding and split register access can reduce the number of stall cycles.
Q. In a pipeline, which of the following scenarios of data dependency will always
result in a pipeline stall due to data hazard without any instruction scheduling?
a. An ADD instruction followed by a SUB instruction.
b. A STORE instruction followed by a LOAD instruction
c. A LOAD instruction followed by an ADD instruction.
d. None of the above.
Correct answer is (c).
Only a LOAD followed by an immediate use will result in a mandatory stall in the
pipeline.
Q. Instruction scheduling can be used to eliminate data and control hazards by:
a. Scheduling the execution of an instruction only if there is no hazard.
b. Allowing the compiler to move instructions around to fill the LOAD/BRANCH
delay slot(s) with meaningful instructions.
c. Using a special hardware to check for hazard and issue instructions only when
possible.
d. None of the above.
Correct answer is (b).
Instruction scheduling is a compiler technique where instructions are moved around
keeping dependencies in mind so as to reduce the wasted cycles due to stalls.
Q. Consider a pipeline with ideal CPI of 1. Assume that 30% of all instructions
executed are branch, out of which 80% are taken branches. The pipeline
speedup for predict taken and delayed branch approaches to reduce branch
penalties will be:
a. 4.10 and 4.45
b. 3.25 and 4.35
c. 3.67 and 4.25
d. 3.85 and 4.35
Correct answer is (d).
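Speedup = pipeline depth / (1 + branch frequency x branch penalty), where the
solution takes the pipeline depth to be 5.
For predict taken, branch penalty = 1
Speedup = 5 / (1 + 0.30 x 1) = 3.85
For delayed branch, branch penalty = 0.5
Speedup = 5 / (1 + 0.30 x 0.5) = 4.35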
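
A short sketch of the formula used in this solution. The pipeline depth of 5 and the two penalty values are taken from the worked answer, not from the question text; the function name is illustrative.

```python
def branch_speedup(depth, branch_fraction, branch_penalty):
    """Pipeline speedup when branches add stall cycles to an ideal CPI of 1."""
    return depth / (1 + branch_fraction * branch_penalty)


predict_taken = branch_speedup(5, 0.30, 1.0)
delayed_branch = branch_speedup(5, 0.30, 0.5)
print(round(predict_taken, 2), round(delayed_branch, 2))
```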

Q:

The design team for a simple, single-issue processor is choosing between a pipelined
or non-pipelined implementation. Here are some design parameters for the two possibilities:

Parameter                       Pipelined Version    Non-Pipelined Version
Clock Rate                      500 MHz              350 MHz
CPI for ALU instructions        1                    1
CPI for Control instructions    2                    1
CPI for Memory instructions     2.7                  1

(a) For a program with 20% ALU instructions, 10% control instructions and 70%
memory instructions, which design will be faster? Give a quantitative CPI average
for each case.

Average CPI for Pipelined Version = (0.2*1 + 0.1*2 + 0.7*2.7) = 2.29
Average CPI for Non-Pipelined Version = (0.2*1 + 0.1*1 + 0.7*1) = 1.0
CPU execution time per instruction for Pipelined version = 2.29/(500 MHz) = 4.58 ns
CPU execution time per instruction for Non-Pipelined version = 1.0/(350 MHz) = 2.86 ns
The non-pipelined version is faster.

(b) For a program with 80% ALU instructions, 10% control instructions and 10%
memory instructions, which design will be faster? Give a quantitative CPI average
for each case.

Average CPI for Pipelined Version = (0.8*1 + 0.1*2 + 0.1*2.7) = 1.27
Average CPI for Non-Pipelined Version = (0.8*1 + 0.1*1 + 0.1*1) = 1.0
CPU execution time per instruction for Pipelined version = 1.27/(500 MHz) = 2.54 ns
CPU execution time per instruction for Non-Pipelined version = 1.0/(350 MHz) = 2.86 ns
The pipelined version is faster.
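
Both parts can be checked with a short sketch. The dictionary layout and function names are illustrative; part (a) uses the 70% memory share that the worked CPI calculation uses.

```python
PIPELINED_CPI = {"alu": 1.0, "control": 2.0, "memory": 2.7}
NON_PIPELINED_CPI = {"alu": 1.0, "control": 1.0, "memory": 1.0}


def average_cpi(mix, cpi):
    """Weighted CPI for an instruction mix given as fractions per class."""
    return sum(mix[kind] * cpi[kind] for kind in mix)


def time_per_instruction_ns(avg_cpi, clock_mhz):
    return avg_cpi / clock_mhz * 1000.0


mix_a = {"alu": 0.2, "control": 0.1, "memory": 0.7}
mix_b = {"alu": 0.8, "control": 0.1, "memory": 0.1}

for label, mix in (("a", mix_a), ("b", mix_b)):
    pipelined = time_per_instruction_ns(average_cpi(mix, PIPELINED_CPI), 500)
    non_pipelined = time_per_instruction_ns(average_cpi(mix, NON_PIPELINED_CPI), 350)
    winner = "pipelined" if pipelined < non_pipelined else "non-pipelined"
    print(label, round(pipelined, 2), round(non_pipelined, 2), winner)
```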

Q:
Match the following:
A. Branch Prediction
B. Instruction Scheduling
C. Delay Slots
D. Increasing functional units
E. Caches

I. Data hazard
II. Structural hazard
III. Control hazard

Solution:
A. III
B. I & II
C. III
D. II
E. I

Structural, data and control hazards typically require a processor pipeline to stall.

(a) Branch Prediction

It addresses control hazards by guessing the outcome of a branch instruction and
then speculatively executing the instructions on one side of the branch to keep the
pipeline moving. Predictions can be made in hardware or in software by the
compiler.

(b) Instruction Scheduling

It addresses structural hazards and data hazards. It addresses data hazards by
moving instructions that are not dependent on an instruction, say A, before
some instructions that depend on A, thus avoiding the stall that would otherwise have
occurred. It addresses structural hazards by making sure that instructions using
functional units with a limited number of instances are scheduled far apart
from each other, so that no unnecessary stall arises. It can be done
in hardware (superscalar processor) or statically by the compiler.

(c) Delay slots

It addresses control hazards. It helps to avoid the stall that would result from branch
target identification during the decode stage by scheduling the execution of some
other instruction that has to execute anyway, irrespective of the branch condition.

(d) Increasing availability of functional units (ALUs, adders etc.)

It helps to avoid structural hazards. It is possible to run multiple instructions of the
same type at the same time if we have replicated functional units.

(e) Caches

It addresses data hazards. In particular, caches help to reduce memory latency and
hence reduce the load-use latency, which in turn reduces the stall duration and
improves execution time (by maintaining pipeline steady state).

Q:

Which of the following statements are incorrect?

A. An instruction A is said to be dependent on an instruction B if A's execution is
determined by some condition computed by B
B. An instruction A is said to be dependent on an instruction B if A uses some data
value that is produced by B
C. Only data dependencies cause hazards
D. Dependencies always cause hazards

Solution: C and D

An instruction A is said to be dependent on an instruction B if A's execution is
determined by some condition computed by B or if A uses some data value that is
produced by B. A hazard is a situation which prevents the pipelined execution of a
program and causes a stall. Hazards are usually a consequence of having data
dependencies between instructions, but it is possible for hazards to manifest on an
architecture even though intrinsically there are no dependences between
instructions, due to limitations in the number of resources (registers/functional units).

Q:
Using the code below, count the number of all of the dependence types (RAW, WAR,
WAW).

I0: A = B + C;
I1: C = A - B;
I2: D = A + C;
I3: A = B * C * D;
I4: C = F / D;
I5: F = A ˆ G;
I6: G = F + D;

Solution: RAW = 9, WAR = 6, WAW = 2

RAW Dependence          WAR Dependence          WAW Dependence
From Instr  To Instr    From Instr  To Instr    From Instr  To Instr
I0          I1          I0          I1          I0          I3
I0          I2          I1          I3          I1          I4
I1          I2          I2          I4
I3          I5          I3          I4
I2          I3          I4          I5
I1          I3          I5          I6
I2          I4
I2          I6
I5          I6

Q:

Given four instructions, how many unique comparisons (between register sources
and destinations) are necessary to find all of the RAW, WAR, and WAW
dependences. Answer for the case of four instructions, and then derive a general
equation for N instructions. Assume that all instructions have one register destination
and two register sources.

For four instructions, the number of unique comparisons:

(2(3) + 2(2) + 2(1)) + (2(3) + 2(2) + 2(1)) + (3 + 2 + 1) = 30

The first summand is for RAW comparisons, the second summand is for WAR
comparisons and the last summand is for WAW comparisons.

The general equation for N instructions = (5*(N-1)*N)/2
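
The count can be verified by brute force. The sketch below enumerates every ordered pair of instructions and tallies the comparisons each dependence type needs, under the stated assumption of one destination and two sources per instruction. The function name is illustrative.

```python
from itertools import combinations


def count_comparisons(n, sources=2, dests=1):
    """Register comparisons needed to find all RAW, WAR and WAW dependences."""
    total = 0
    for _earlier, _later in combinations(range(n), 2):
        total += dests * sources   # RAW: earlier destination vs. later sources
        total += sources * dests   # WAR: earlier sources vs. later destination
        total += dests * dests     # WAW: earlier destination vs. later destination
    return total


print(count_comparisons(4))
```

For every N the brute-force count matches 5*N*(N-1)/2, since each of the N*(N-1)/2 instruction pairs costs 2 + 2 + 1 comparisons.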

Q:

Which of the following are reasons why throughput will not keep improving as the
pipeline depth is increased indefinitely?
A. Pipelining has a fixed (or relatively fixed) absolute overhead per stage which
results from latch overhead and clock/data skew.
B. Increasing the pipeline depth lengthens hazard penalties, increasing the CPI.
C. The latency of a pipeline stage can be driven to zero.
D. Increasing the depth of the pipeline between the fetch and execute stage
decreases the branch misprediction penalty.

Solution: A and B

Pipelining has a fixed (or relatively fixed) absolute overhead per stage which results
from latch overhead and clock/data skew. This means that the latency of a pipeline
stage cannot be driven to zero. Second, increasing the pipeline depth lengthens
hazard penalties, increasing the CPI. For instance, increasing the depth of the pipeline
between the fetch and execute stage increases the branch misprediction penalty.

Q:
Consider a machine with a 5-stage pipeline with a cycle time of 10ns. Assume that you
are executing a program where a fraction, f, of all instructions immediately follow a load
upon which they are dependent.

(a) With forwarding enabled what is the total execution time for N instructions, in terms of f ?

When the pipeline is filled,

(1 - f)*N instructions take 1 cycle
f*N instructions take 2 cycles (including 1 cycle for the load-use stall)
Total cycles = (1 - f)*N + 2*f*N + 4 (the number of cycles to fill the pipeline)
Total time = 10 * (N*(1 + f) + 4) ns
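
The formula as a small function, for trying out different values of f. The parameter names and the sample values are illustrative and not part of the original question.

```python
def total_time_ns(n, f, cycle_ns=10, fill_cycles=4):
    """Execution time with forwarding: one extra stall cycle per load-use pair."""
    cycles = (1 - f) * n + 2 * f * n + fill_cycles
    return cycle_ns * cycles


# Example: 1000 instructions, 20% of which immediately follow a load they depend on.
print(total_time_ns(1000, 0.2))
```

With f = 0 the result reduces to the ideal 10 * (N + 4) ns, and each load-use pair adds one more 10 ns cycle.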

Q:

A non-pipelined system takes 130 ns to process an instruction. A program of 1000
instructions is executed on the non-pipelined system. The same program is then processed
on a processor with a 5-segment pipeline with a clock cycle of 30 ns/stage.

Determine the speedup ratio of the pipeline.

Solution:
For the non-pipelined system:
• Total number of instructions/tasks (n) = 1000
• Time required to perform a single task in the non-pipelined processor (Tnp) = 130 ns
For the pipelined system:
• Total number of stages (k) = 5
• Total number of instructions/tasks (n) = 1000
• Clock cycle time per stage in the pipelined processor (Tp) = 30 ns
Speedup = (n × Tnp) / ((k + (n - 1)) × Tp) = (1000 × 130) / (1004 × 30)

Speedup = 4.316 is the answer.
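
The speedup ratio can be sketched as follows, using the values given in the question. The function and parameter names are illustrative.

```python
def speedup_ratio(n, t_non_pipelined_ns, k, t_stage_ns):
    """Non-pipelined time for n tasks over pipelined time with k stages."""
    non_pipelined = n * t_non_pipelined_ns
    pipelined = (k + (n - 1)) * t_stage_ns
    return non_pipelined / pipelined


print(round(speedup_ratio(1000, 130, 5, 30), 3))
```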
